YouTube Analytics Pipeline for NBA Video Sentiment
YouTube Data API - PostgreSQL - SQL - Python - scikit-learn - React
Built an end-to-end data pipeline that extracts YouTube channel, video, and comment data, loads raw API payloads into PostgreSQL, transforms them into clean analytical tables, trains sentiment models, and presents the results in a dashboard-ready format.
#3 KNICKS at #2 SPURS | NBA FINALS GAME 1 HIGHLIGHTS | June 3, 2026
Video ID: AL0AGaoJCWY
Pipeline Results
Comment analysis output from the selected NBA video
5,412 usable after filtering
2,702 replies in threads
193 manual labels used
68 low-information comments
[Panels: Supervised Sentiment, Lexicon Baseline]
Model Comparison
Best macro F1: Linear SVM
| Model | Type | Accuracy | Macro F1 | Weighted F1 |
|---|---|---|---|---|
| Linear SVM | Supervised | 69.2% | 48.9% | 67.7% |
| Logistic Regression | Supervised | 64.1% | 45.5% | 63.0% |
| Naive Bayes | Supervised | 66.7% | 41.4% | 60.0% |
| Lexicon Baseline | Rule-based | 41.4% | 38.3% | 47.4% |
Extraction
Collected channel metadata, video details, and paginated comments from the YouTube Data API.
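The pagination step can be sketched as follows. This is a minimal illustration, not the project's actual extractor: `build_params` and `iter_comment_threads` are hypothetical names, and the HTTP call is injected as a `fetch_page` function so the loop can be exercised without network access. The endpoint, the `pageToken` parameter, and the `nextPageToken` and `items` response fields are part of the YouTube Data API v3 `commentThreads.list` method.

```python
from typing import Callable, Iterator, Optional

# Public endpoint for the commentThreads.list method (YouTube Data API v3).
API_URL = "https://www.googleapis.com/youtube/v3/commentThreads"


def build_params(video_id: str, api_key: str,
                 page_token: Optional[str] = None) -> dict:
    """Build the query parameters for one page of comment threads."""
    params = {
        "part": "snippet,replies",
        "videoId": video_id,
        "maxResults": 100,          # largest page size the method accepts
        "textFormat": "plainText",
        "key": api_key,
    }
    if page_token:
        params["pageToken"] = page_token
    return params


def iter_comment_threads(
    fetch_page: Callable[[Optional[str]], dict],
) -> Iterator[dict]:
    """Yield every thread, following nextPageToken until it is absent.

    fetch_page is a hypothetical callable: it receives a page token
    (None for the first page) and returns the decoded JSON response.
    """
    page_token = None
    while True:
        page = fetch_page(page_token)
        yield from page.get("items", [])
        page_token = page.get("nextPageToken")
        if not page_token:
            break
```

In a real run, `fetch_page` would wrap an HTTP GET against `API_URL` with `build_params("AL0AGaoJCWY", api_key, token)` and return the parsed JSON body. Keeping the HTTP client outside the loop also makes it straightforward to store each raw page before any parsing happens.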
Transformation
Loaded raw JSON payloads into PostgreSQL and transformed them into cleaned staging and mart tables.
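As an illustration of the raw-to-staging step, the sketch below flattens one raw `commentThreads` item into one row per comment, with replies pointing back to their parent. The output column names (`comment_id`, `parent_id`, `is_reply`, and so on) are assumptions made for this example, not the project's actual staging schema, and the same reshaping could equally be done in SQL over a JSON column. The input field names follow the YouTube Data API response structure.

```python
from typing import Iterator, Optional


def _to_row(comment: dict, parent_id: Optional[str]) -> dict:
    """Map one API comment resource to a flat staging row (hypothetical columns)."""
    snippet = comment["snippet"]
    return {
        "comment_id": comment["id"],
        "parent_id": parent_id,
        "is_reply": parent_id is not None,
        # Collapse runs of whitespace and newlines into single spaces.
        "text": " ".join(snippet.get("textDisplay", "").split()),
        "like_count": int(snippet.get("likeCount", 0)),
        "published_at": snippet.get("publishedAt"),
    }


def flatten_thread(thread: dict) -> Iterator[dict]:
    """Yield the top-level comment first, then any replies in the payload."""
    top = thread["snippet"]["topLevelComment"]
    yield _to_row(top, parent_id=None)
    for reply in thread.get("replies", {}).get("comments", []):
        # Replies carry parentId; fall back to the thread's top-level id.
        parent = reply["snippet"].get("parentId", top["id"])
        yield _to_row(reply, parent_id=parent)
```

The resulting rows can be bulk-inserted into a staging table, with counts such as replies in threads derived downstream from `is_reply`.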
Evaluation
Compared supervised models against a lexicon baseline to understand model behavior and label quality.
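A comparison of this kind might look like the sketch below. It assumes TF-IDF features and default hyperparameters, neither of which the write-up specifies, and the toy comments and labels are made up purely so the example runs. `compare_models` is a hypothetical helper. The three metrics match the columns of the comparison table, and a lexicon baseline's predictions could be scored with the same `accuracy_score` and `f1_score` calls.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy data for illustration only; not the project's labeled sample.
TRAIN_TEXTS = [
    "what a great win", "amazing defense tonight", "love this team",
    "terrible officiating", "worst game ever", "awful shooting",
    "game starts at eight", "who is playing tonight", "final score was close",
]
TRAIN_LABELS = ["positive"] * 3 + ["negative"] * 3 + ["neutral"] * 3
TEST_TEXTS = ["great defense", "awful officiating", "who is playing"]
TEST_LABELS = ["positive", "negative", "neutral"]


def compare_models(train_texts, train_labels, test_texts, test_labels):
    """Fit each classifier on TF-IDF features and score it on the test split."""
    models = {
        "Linear SVM": LinearSVC(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Naive Bayes": MultinomialNB(),
    }
    results = {}
    for name, clf in models.items():
        pipe = make_pipeline(TfidfVectorizer(), clf)
        pipe.fit(train_texts, train_labels)
        pred = pipe.predict(test_texts)
        results[name] = {
            "accuracy": accuracy_score(test_labels, pred),
            "macro_f1": f1_score(test_labels, pred, average="macro",
                                 zero_division=0),
            "weighted_f1": f1_score(test_labels, pred, average="weighted",
                                    zero_division=0),
        }
    return results


results = compare_models(TRAIN_TEXTS, TRAIN_LABELS, TEST_TEXTS, TEST_LABELS)
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary fitted on training text only, so the test split stays unseen until scoring.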
Sentiment Quality Notes
- Linear SVM currently leads on both macro F1 (48.9%) and overall accuracy (69.2%).
- Lexicon labeling is useful as a baseline, but it over-predicts neutral comments.
- Neutral remains the hardest class, likely because the manual sample has fewer neutral labels than the other classes.
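The gap between accuracy and macro F1 in the comparison table is what one would expect when a small class is predicted poorly. The sketch below works through the arithmetic on a made-up ten-comment example with a single neutral label. The label names and counts are assumptions for illustration, and `f1_report` is a hypothetical helper that computes the same quantities as scikit-learn's `f1_score` with `average="macro"` and `average="weighted"`.

```python
from collections import Counter


def f1_report(y_true, y_pred):
    """Return accuracy, per-class F1, macro F1, and support-weighted F1."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    per_class = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        per_class[label] = 2 * tp / denom if denom else 0.0
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "per_class": per_class,
        # Macro F1: every class counts equally, however rare it is.
        "macro_f1": sum(per_class.values()) / len(labels),
        # Weighted F1: each class counts in proportion to its true frequency.
        "weighted_f1": sum(per_class[l] * support[l] for l in labels) / len(pairs),
    }


# Made-up example: 6 positive, 3 negative, 1 neutral comment.
y_true = ["positive"] * 6 + ["negative"] * 3 + ["neutral"]
# The model misses the lone neutral comment and one negative comment.
y_pred = ["positive"] * 6 + ["negative", "negative", "positive"] + ["positive"]

report = f1_report(y_true, y_pred)
```

Here eight of ten predictions are correct, yet the neutral class scores an F1 of zero. That pulls macro F1 well below both accuracy and weighted F1, the same ordering seen for the supervised models in the table.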