CSCI-620 Big Data Project

Reddit Comments
May 2015 Analysis

A comprehensive data engineering pipeline analyzing 50+ million Reddit comments using PostgreSQL, MongoDB, and association rule mining.

50+ million comments analyzed · ~30 GB dataset · 3 project phases


Reddit Comments Dataset

One of the largest publicly available social media datasets

Data Source

Kaggle Reddit Comments May 2015


Time Period

May 2015 (Full Month)

Format

SQLite Database (~30GB)

Dataset Fields

author TEXT
subreddit TEXT
body TEXT
score INT
ups / downs INT
gilded INT
created_utc TIMESTAMP
controversiality INT
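
As a quick sanity check, these fields can be inspected with Python's built-in sqlite3 module. A minimal sketch; the table name May2015 follows the Kaggle export's convention and is an assumption worth verifying:

import sqlite3

# Open the Kaggle SQLite export; table name May2015 is an assumption
conn = sqlite3.connect("database.sqlite")
cur = conn.cursor()

# List the columns and their declared types
cur.execute("PRAGMA table_info(May2015)")
for _cid, name, col_type, *_rest in cur.fetchall():
    print(name, col_type)

# Peek at a few rows without pulling the ~30GB file into memory
cur.execute("SELECT author, subreddit, score, gilded FROM May2015 LIMIT 5")
print(cur.fetchall())
conn.close()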
Phase 01: PostgreSQL

Relational Data Model

Normalized schema design following 3NF principles

Schema Design

Six normalized tables linked by foreign key relationships that maintain referential integrity across the dataset.

  • Users - Author profiles & flair
  • Subreddit - Community metadata
  • Post - Post-level data
  • Post_Link - Post references
  • Comment - All comments with scores
  • Moderation - Moderation actions
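
As an illustration of the 3NF layout, two of these six tables might be declared as below. This is a sketch only, assuming psycopg2; any column not among the dataset fields above is an assumption:

import psycopg2

# Illustrative DDL for two of the six tables; exact columns are assumptions
DDL = """
CREATE TABLE IF NOT EXISTS users (
    user_id SERIAL PRIMARY KEY,
    author  TEXT UNIQUE NOT NULL,
    flair   TEXT
);
CREATE TABLE IF NOT EXISTS comment (
    comment_id  TEXT PRIMARY KEY,
    user_id     INT REFERENCES users(user_id),  -- FK enforces referential integrity
    body        TEXT,
    score       INT,
    gilded      INT,
    created_utc TIMESTAMP
);
"""

conn = psycopg2.connect(host="localhost", port=5432, user="postgres",
                        password="yourpass", dbname="redditdb")
with conn, conn.cursor() as cur:  # commits on success
    cur.execute(DDL)
conn.close()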

Key Features

  • Automatic Kaggle dataset download
  • Streaming SQLite ingestion
  • Batch processing (10K records)
  • Progress tracking & error handling
  • Foreign key constraint validation
  • Sample mode for testing
Load Data Command
python load_data.py \
    --input database.sqlite \
    --host localhost \
    --port 5432 \
    --user postgres \
    --password yourpass \
    --dbname redditdb
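
Internally, the streaming SQLite ingestion with 10K-record batches could follow a loop like this. A sketch assuming psycopg2, a May2015 source table, and a hypothetical staging_comments target; the actual load_data.py may differ:

import sqlite3
import psycopg2
from psycopg2.extras import execute_values

BATCH = 10_000  # matches the project's 10K-record batching

src = sqlite3.connect("database.sqlite")
dst = psycopg2.connect(host="localhost", port=5432, user="postgres",
                       password="yourpass", dbname="redditdb")

cur = src.execute("SELECT author, subreddit, body, score, gilded FROM May2015")
with dst, dst.cursor() as out:
    while True:
        rows = cur.fetchmany(BATCH)  # stream: never hold the ~30GB in memory
        if not rows:
            break
        # staging_comments is a hypothetical table name
        execute_values(out,
                       "INSERT INTO staging_comments "
                       "(author, subreddit, body, score, gilded) VALUES %s",
                       rows)
src.close()
dst.close()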
Relational Schema Diagram
Phase 02: MongoDB

Document-Oriented Model

Hybrid embedding strategy optimized for read performance

Document Model Diagram

Collections

  • users - Author profiles
  • subreddits - Community data
  • posts - Hybrid embedded documents
  • comments - Full comment set for analytics
  • moderation - Moderation signals

Hybrid Design

The top N comments are embedded in each post document for fast reads, while all comments are also stored in a separate collection for analytics. This balances read performance against MongoDB's 16 MB document size limit.
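
A minimal pymongo sketch of that hybrid write path, where EMBED_CAP mirrors the loader's --embed-cap option; collection and field names here are assumptions:

from pymongo import MongoClient

EMBED_CAP = 200  # mirrors --embed-cap below

client = MongoClient("mongodb://localhost:27017/")
db = client["reddit_may2015"]

def store_post(post, comments):
    # Embed only the top-scoring comments: fast reads, bounded document size
    top = sorted(comments, key=lambda c: c["score"], reverse=True)[:EMBED_CAP]
    post["top_comments"] = top
    db.posts.insert_one(post)
    # Keep the full comment set in its own collection for analytics
    if comments:
        db.comments.insert_many(comments)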

MongoDB Loader
python load_to_mongo.py \
    --input database.sqlite \
    --mongo_uri "mongodb://localhost:27017/" \
    --dbname reddit_may2015 \
    --chunksize 50000 \
    --embed-cap 200 \
    --reset
Phase 03: Data Mining

Association Rule Mining

Pattern discovery using the Apriori algorithm

Data Cleaning

Remove deleted content, fix inconsistencies, validate timestamps

Transaction Creation

Transform comments into itemsets with subreddit, score, flags

Apriori Mining

Find frequent itemsets meeting support threshold

Rule Generation

Generate IF-THEN rules with confidence metrics

Mining Parameters

min_support 0.01 (1%)

Minimum frequency threshold for itemsets

min_confidence 0.5 (50%)

Minimum probability for rules
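
With MLxtend (part of the tech stack), these two thresholds plug in directly. A toy sketch using hand-made transactions shaped like the project's itemsets:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Toy transactions shaped like the project's itemsets
transactions = [
    ["subreddit:AskReddit", "high_score", "gilded"],
    ["subreddit:AskReddit", "high_score"],
    ["subreddit:nfl", "low_score"],
    ["subreddit:nfl", "low_score", "edited"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# min_support and min_confidence match the parameters above
itemsets = apriori(onehot, min_support=0.01, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])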

Transaction Structure

subreddit:AskReddit
high_score
gilded
edited

Each comment becomes a transaction with subreddit, score category, and status flags
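
Converting one cleaned comment into such a transaction might look like the sketch below; the score buckets and flag names are assumptions:

def comment_to_transaction(c):
    # Build the itemset: subreddit, score category, and status flags
    items = ["subreddit:" + c["subreddit"]]
    if c["score"] >= 100:        # illustrative threshold
        items.append("high_score")
    elif c["score"] < 1:
        items.append("low_score")
    if c["gilded"] > 0:
        items.append("gilded")
    if c.get("edited"):
        items.append("edited")
    return items

print(comment_to_transaction(
    {"subreddit": "AskReddit", "score": 250, "gilded": 1, "edited": True}))
# ['subreddit:AskReddit', 'high_score', 'gilded', 'edited']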

Data Cleaning Rules

  • Drop missing authors/bodies
  • Remove [deleted]/[removed] placeholders
  • Filter invalid timestamps
  • Fix score = ups - downs
  • Nullify symbol-only flair
  • Validate foreign key references
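
A pandas sketch applying several of these rules; column names follow the dataset fields, and treating created_utc as epoch seconds is an assumption about the raw data:

import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Drop missing authors/bodies and [deleted]/[removed] placeholders
    df = df.dropna(subset=["author", "body"])
    df = df[~df["body"].isin(["[deleted]", "[removed]"])]
    df = df[df["author"] != "[deleted]"]

    # Filter timestamps outside May 2015
    ts = pd.to_datetime(df["created_utc"], unit="s", errors="coerce")
    df = df[(ts >= "2015-05-01") & (ts < "2015-06-01")]

    # Enforce score = ups - downs
    return df.assign(score=df["ups"] - df["downs"])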

Benchmark Results

Query performance improvements with optimized indexing

Query                                Before Index    After Index    Speedup
AskReddit latest 50 posts            2.34s           0.12s          19.5x
Top 20 subreddits by avg score       5.67s           0.89s          6.4x
Top 20 authors by post count         3.21s           0.45s          7.1x
Gilded but not archived posts        1.89s           0.08s          23.6x
Posts by authors containing 'cat'    4.56s           0.67s          6.8x
Avg comments per post (top 10)       8.23s           1.34s          6.1x

Avg Speedup: 11.6x · Max Speedup: 23.6x · Indexes Created: 6
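
The six indexes themselves are not listed on this page, but indexes of roughly this shape would support the queries above. A sketch only; the table name, column names, and the pg_trgm choice for the 'cat' search are all assumptions:

import psycopg2

# Illustrative indexes only; the project's actual six are not shown here
INDEXES = """
CREATE INDEX IF NOT EXISTS idx_post_subreddit ON post (subreddit);
CREATE INDEX IF NOT EXISTS idx_post_author    ON post (author);
CREATE INDEX IF NOT EXISTS idx_post_score     ON post (score);
-- Partial index suits the 'gilded but not archived' query
CREATE INDEX IF NOT EXISTS idx_post_gilded
    ON post (gilded) WHERE NOT archived;
-- Trigram index accelerates LIKE '%cat%' author searches
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX IF NOT EXISTS idx_post_author_trgm
    ON post USING gin (author gin_trgm_ops);
"""

conn = psycopg2.connect(host="localhost", port=5432, user="postgres",
                        password="yourpass", dbname="redditdb")
with conn, conn.cursor() as cur:
    cur.execute(INDEXES)
conn.close()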

Sample Association Rules Discovered

IF subreddit:nfl THEN low_score
    Support: 7.09% | Confidence: 92.32% | Lift: 1.16
IF gilded THEN very_high_score
    Support: 2.34% | Confidence: 78.5% | Lift: 3.42

Tech Stack

Python
PostgreSQL
MongoDB
Pandas
MLxtend
SQLite