YouTube Analytics Pipeline for NBA Video Sentiment

YouTube Data API - PostgreSQL - SQL - Python - scikit-learn - React

Built an end-to-end data pipeline that extracts YouTube channel, video, and comment data, loads raw API payloads into PostgreSQL, transforms them into clean analytical tables, trains sentiment models, and presents the results in a dashboard-ready format.


Tracked video

#3 KNICKS at #2 SPURS | NBA FINALS GAME 1 HIGHLIGHTS | June 3, 2026

Video ID: AL0AGaoJCWY

YouTube Data API - PostgreSQL - SVM - Lexicon Baseline

Pipeline Results

Comment analysis output from the selected NBA video

Comments 5,480

5,412 usable after filtering

Comment Likes 87,805

2,702 replies in threads

Best Model Linear SVM

193 manual labels used

Avg Length 78.71 chars

68 low-information comments

Supervised Sentiment

Linear SVM prediction output

positive: 3,328 comments (60.7%)

negative: 1,823 comments (33.3%)

neutral: 261 comments (4.8%)

unclassified: 68 comments (1.2%)
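The percentages above are shares of all 5,480 comments, so the 68 unclassified comments are part of the denominator (the four counts sum to 5,480, not to the 5,412 usable comments). A minimal sketch of that arithmetic follows; the function name `label_shares` is illustrative and not taken from the project code.

```python
# Counts copied from the Linear SVM prediction output above.
counts = {
    "positive": 3328,
    "negative": 1823,
    "neutral": 261,
    "unclassified": 68,
}


def label_shares(label_counts):
    """Return each label's share of all comments as a percentage (1 decimal)."""
    total = sum(label_counts.values())
    return {label: round(100 * n / total, 1) for label, n in label_counts.items()}


shares = label_shares(counts)
for label, pct in shares.items():
    print(f"{label}: {counts[label]:,} comments ({pct}%)")
```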

Lexicon Baseline

Rule-based comparison

Sentiment Label	Comment Count
neutral	3,530
positive	1,557
negative	393

Model Comparison

Best macro F1: Linear SVM

Model	Type	Accuracy	Macro F1	Weighted F1
Linear SVM	Supervised	69.2%	48.9%	67.7%
Logistic Regression	Supervised	64.1%	45.5%	63.0%
Naive Bayes	Supervised	66.7%	41.4%	60.0%
Lexicon Baseline	Rule-based	41.4%	38.3%	47.4%
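For the supervised rows, macro F1 sits well below accuracy and weighted F1. That pattern usually appears when a small class is predicted poorly, because macro F1 gives every class equal weight regardless of its size. The sketch below computes the three metrics in plain Python on a tiny invented label set (not the project's data) to show the effect. The project itself lists scikit-learn, whose `f1_score` with `average="macro"` or `average="weighted"` computes the same quantities; the helper names here are illustrative.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Per-class F1 = 2*TP / (2*TP + FP + FN), for every class present in y_true."""
    scores = {}
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by each class's number of true examples."""
    scores = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


# Invented toy data: 6 positive, 3 negative, 1 neutral; the neutral one is missed.
y_true = ["positive"] * 6 + ["negative"] * 3 + ["neutral"]
y_pred = ["positive"] * 6 + ["negative", "negative", "positive"] + ["positive"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f} macro_f1={macro:.3f} weighted_f1={weighted:.3f}")
```

In this toy case the single missed neutral comment costs one tenth of the accuracy but pulls a full third of the macro average down to zero.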

Extraction

Collected channel metadata, video details, and paginated comments from the YouTube Data API.
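Paginated comment collection can be sketched as a loop that follows `nextPageToken` until the API stops returning one. This is a minimal illustration using the `commentThreads` endpoint of the YouTube Data API v3 and only the standard library; it is not the project's actual extractor. The function names and the injectable `fetch` parameter are assumptions added so the loop can be exercised without network access, and error handling and quota management are left out.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.googleapis.com/youtube/v3/commentThreads"


def fetch_page(video_id, api_key, page_token=None):
    """Request one page (up to 100 threads) of top-level comments for a video."""
    params = {
        "part": "snippet,replies",
        "videoId": video_id,
        "maxResults": 100,
        "textFormat": "plainText",
        "key": api_key,
    }
    if page_token:
        params["pageToken"] = page_token
    url = f"{API_URL}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def iter_comment_threads(video_id, api_key, fetch=fetch_page):
    """Yield raw comment thread payloads, following nextPageToken across pages."""
    page_token = None
    while True:
        payload = fetch(video_id, api_key, page_token)
        yield from payload.get("items", [])
        page_token = payload.get("nextPageToken")
        if not page_token:
            break
```

Calling `list(iter_comment_threads("AL0AGaoJCWY", api_key))` with a valid key would collect every thread for the tracked video. Keeping the payloads unmodified at this stage matches the approach of loading raw API responses first and cleaning them later.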

Transformation

Loaded raw JSON payloads into PostgreSQL and transformed them into cleaned staging and mart tables.
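The shape of the raw-to-staging step can be illustrated by flattening one raw comment thread payload into a single row. The project performs this transformation inside PostgreSQL; the Python sketch below only shows the mapping. The input field names follow the YouTube Data API `commentThread` resource, while the output column names and the sample payload are hypothetical.

```python
def flatten_thread(thread):
    """Turn one raw commentThread payload into a flat staging row."""
    thread_snippet = thread["snippet"]
    top_comment = thread_snippet["topLevelComment"]
    comment = top_comment["snippet"]
    # Prefer the original text; fall back to the display text if it is missing.
    text = (comment.get("textOriginal") or comment.get("textDisplay") or "").strip()
    return {
        "comment_id": top_comment["id"],
        "video_id": thread_snippet["videoId"],
        "comment_text": text,
        "like_count": int(comment.get("likeCount", 0)),
        "reply_count": int(thread_snippet.get("totalReplyCount", 0)),
        "published_at": comment.get("publishedAt"),
        "char_length": len(text),
    }


# Invented sample payload, trimmed to the fields used above.
sample_thread = {
    "id": "T1",
    "snippet": {
        "videoId": "AL0AGaoJCWY",
        "totalReplyCount": 2,
        "topLevelComment": {
            "id": "C1",
            "snippet": {
                "textOriginal": "  What a game!  ",
                "likeCount": 15,
                "publishedAt": "2026-06-04T02:10:00Z",
            },
        },
    },
}

row = flatten_thread(sample_thread)
print(row)
```

Rows in this shape are enough to derive summary figures such as total comment likes, reply counts, and average comment length.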

Evaluation

Compared supervised models against a lexicon baseline to understand model behavior and label quality.

Sentiment Quality Notes

Linear SVM currently leads on macro F1 and overall accuracy.
Lexicon labeling is useful as a baseline, but it over-predicts neutral comments.
Neutral remains the hardest class because the manual sample has relatively few neutral labels.
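One plausible reason a lexicon baseline over-predicts neutral: any comment with no lexicon hits, or with positive and negative hits that cancel out, scores zero and falls into the neutral bucket. The sketch below is a generic word-count lexicon scorer, not the project's actual baseline; the word lists and example comments are invented for illustration.

```python
import re

# Tiny illustrative lexicons; a real baseline would use a much larger word list.
POSITIVE = {"great", "love", "amazing", "clutch", "win"}
NEGATIVE = {"bad", "trash", "terrible", "rigged", "choke"}


def lexicon_label(text):
    """Label a comment by counting positive and negative lexicon hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # zero hits and cancelling hits both land here


examples = [
    "That fourth quarter was amazing",
    "Refs were awful all night",       # negative to a reader, but no lexicon hit
    "Great pass but terrible defense",  # one hit each way cancels to zero
]
labels = [lexicon_label(text) for text in examples]
for text, label in zip(examples, labels):
    print(f"{label}: {text}")
```

Slang, sarcasm, and player-specific language in sports comments rarely appear in a general-purpose lexicon, so a large share of opinionated comments can end up at zero.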