YouTube Analytics Pipeline for NBA Video Sentiment

YouTube Data API - PostgreSQL - SQL - Python - scikit-learn - React

Built an end-to-end data pipeline that extracts YouTube channel, video, and comment data, loads raw API payloads into PostgreSQL, transforms them into clean analytical tables, trains sentiment models, and presents the results in a dashboard-ready format.

Tracked video

#3 KNICKS at #2 SPURS | NBA FINALS GAME 1 HIGHLIGHTS | June 3, 2026

Video ID: AL0AGaoJCWY

YouTube Data API - PostgreSQL - SVM - Lexicon Baseline

Pipeline Results

Comment analysis output from the selected NBA video

Comments 5,480

5,412 usable after filtering

Comment Likes 87,805

2,702 replies in threads

Best Model Linear SVM

193 manual labels used

Avg Length 78.71 chars

68 low-information comments

Supervised Sentiment

Linear SVM prediction output

positive: 3,328 comments (60.7%)
negative: 1,823 comments (33.3%)
neutral: 261 comments (4.8%)
unclassified: 68 comments (1.2%)
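
As a quick consistency check, the class counts above can be reconciled with the headline totals. The sketch below recomputes the shares from the raw counts; it assumes (the page does not say so explicitly) that the 68 unclassified comments are the same 68 low-information comments removed by filtering.

```python
# Class counts as reported in the Linear SVM prediction output.
counts = {"positive": 3328, "negative": 1823, "neutral": 261, "unclassified": 68}

total = sum(counts.values())

# Share of each class as a percentage of all collected comments.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

# Assumption: "usable" means every comment that received a sentiment label.
usable = total - counts["unclassified"]

print(total, usable, shares)
```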

Lexicon Baseline

Rule-based comparison

Model Comparison

Best macro F1: Linear SVM

Model                Type        Accuracy  Macro F1  Weighted F1
Linear SVM           Supervised  69.2%     48.9%     67.7%
Logistic Regression  Supervised  64.1%     45.5%     63.0%
Naive Bayes          Supervised  66.7%     41.4%     60.0%
Lexicon Baseline     Rule-based  41.4%     38.3%     47.4%

Extraction

Collected channel metadata, video details, and paginated comments from the YouTube Data API.
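
The pagination step can be sketched roughly as follows. This is a minimal illustration, not the project's actual extractor: it assumes comments come from the `commentThreads.list` endpoint, whose responses carry an `items` array and, while more pages remain, a `nextPageToken`. The `fetch_page` callable is hypothetical and stands in for the HTTP request so the loop can run without network access.

```python
from typing import Callable, Iterator, Optional


def iter_comment_threads(
    fetch_page: Callable[[Optional[str]], dict],
    max_pages: int = 1000,
) -> Iterator[dict]:
    """Yield every item from a paginated commentThreads-style response.

    fetch_page is a hypothetical callable: it takes a page token (None for
    the first page) and returns the decoded JSON response as a dict.
    """
    token: Optional[str] = None
    for _ in range(max_pages):  # hard cap so a bad token cannot loop forever
        page = fetch_page(token)
        yield from page.get("items", [])
        token = page.get("nextPageToken")
        if not token:  # the last page carries no nextPageToken
            return


# Stand-in for the API: three fake pages keyed by page token.
_FAKE_PAGES = {
    None: {"items": [{"id": "a"}, {"id": "b"}], "nextPageToken": "p2"},
    "p2": {"items": [{"id": "c"}], "nextPageToken": "p3"},
    "p3": {"items": [{"id": "d"}]},
}

thread_ids = [item["id"] for item in iter_comment_threads(_FAKE_PAGES.__getitem__)]
print(thread_ids)
```

In a real run, `fetch_page` would presumably wrap a `commentThreads.list` request with `part=snippet`, the `videoId`, and the incoming `pageToken`. Injecting it as a callable keeps the loop testable offline.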

Transformation

Loaded raw JSON payloads into PostgreSQL and transformed them into cleaned staging and mart tables.
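
To illustrate the raw-to-staging step, the sketch below flattens one raw `commentThread` item into a single row. The nested field names follow the YouTube Data API resource layout. The output column names (`comment_id`, `char_length`, and so on) are assumptions, since the actual staging schema is not shown here. In the pipeline this mapping may well be done in SQL over a JSON column rather than in Python.

```python
import json
from typing import Any


def flatten_thread(thread: dict[str, Any]) -> dict[str, Any]:
    """Map one raw commentThread item to a flat staging row."""
    top = thread["snippet"]["topLevelComment"]
    s = top["snippet"]
    # textOriginal is not always returned, so fall back to textDisplay.
    text = (s.get("textOriginal") or s.get("textDisplay") or "").strip()
    return {
        "comment_id": top["id"],
        "video_id": thread["snippet"]["videoId"],
        "comment_text": text,
        "like_count": int(s.get("likeCount", 0)),
        "reply_count": int(thread["snippet"].get("totalReplyCount", 0)),
        "published_at": s.get("publishedAt"),
        "char_length": len(text),
    }


# Made-up payload in the shape of a commentThread item (values are illustrative).
raw = json.loads("""
{
  "id": "thread-1",
  "snippet": {
    "videoId": "AL0AGaoJCWY",
    "totalReplyCount": 3,
    "topLevelComment": {
      "id": "comment-1",
      "snippet": {
        "textDisplay": "  What a game!  ",
        "likeCount": 12,
        "publishedAt": "2026-06-04T02:15:00Z"
      }
    }
  }
}
""")

row = flatten_thread(raw)
print(row)
```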

Evaluation

Compared supervised models against a lexicon baseline to understand model behavior and label quality.
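
The gap between macro F1 and weighted F1 in the comparison table is easier to read with a small worked example. The sketch below computes both from scratch on made-up labels (not the project's data). Macro F1 averages the per-class scores equally, while weighted F1 weights each class by its support, so a rarely predicted minority class pulls macro F1 down much harder. scikit-learn's `f1_score` with `average="macro"` or `average="weighted"` should give the same values.

```python
from collections import Counter


def f1_by_class(y_true, y_pred):
    """Return the F1 score of each label seen in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_and_weighted_f1(y_true, y_pred):
    per_class = f1_by_class(y_true, y_pred)
    support = Counter(y_true)  # number of true examples per class
    macro = sum(per_class.values()) / len(per_class)
    weighted = sum(per_class[c] * support[c] for c in per_class) / len(y_true)
    return macro, weighted


# Toy labels: the single neutral example is misclassified as positive.
y_true = ["pos"] * 6 + ["neg"] * 3 + ["neu"]
y_pred = ["pos"] * 6 + ["neg", "neg", "pos"] + ["pos"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f} macro_f1={macro:.3f} weighted_f1={weighted:.3f}")
```

In this toy case, accuracy and weighted F1 stay high while macro F1 drops sharply because the neutral class scores zero. The table shows the same pattern.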

Sentiment Quality Notes

  • Linear SVM currently leads on every reported metric: accuracy (69.2%), macro F1 (48.9%), and weighted F1 (67.7%).
  • Lexicon labeling is useful as a baseline, but it over-predicts neutral comments.
  • Neutral remains the hardest class, likely because the manual sample of 193 labels contains fewer neutral examples than positive or negative ones.
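
One plausible reason a lexicon baseline over-predicts neutral is that any comment with no lexicon hits falls through to the neutral label. The sketch below shows that behavior with a hypothetical toy lexicon and made-up comments. The project's actual word list and scoring rule are not shown here, so treat this as an illustration of the general technique only.

```python
import re

# Hypothetical toy lexicon; the project's actual word list is not shown.
POSITIVE = {"great", "love", "clutch", "amazing", "goat"}
NEGATIVE = {"bad", "trash", "choke", "terrible", "rigged"}


def lexicon_label(text: str) -> str:
    """Label a comment by counting positive and negative lexicon hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # no hits, or hits that cancel out


# Made-up comments for illustration.
examples = [
    "That fourth quarter was clutch",
    "Refs were trash all night",
    "Spurs defense was unreal, Knicks had no answer",
]
labels = [lexicon_label(c) for c in examples]
print(list(zip(examples, labels)))
```

The third comment reads as clearly opinionated to a person, but none of its words are in the lexicon, so the rule assigns neutral. Slang-heavy sports comments are likely to hit this fallback often.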