A comprehensive data engineering pipeline analyzing 50+ million Reddit comments using PostgreSQL, MongoDB, and association rule mining.
One of the largest publicly available social media datasets
May 2015 (Full Month)
SQLite Database (~30GB)
Normalized schema design following 3NF principles
6 normalized tables with proper foreign key relationships maintaining referential integrity across the dataset.
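To make the 3NF layout concrete, here is a minimal sketch of two of the tables and their foreign key link. The table and column names (subreddits, comments, etc.) are illustrative assumptions, not the exact schema created by load_data.py.

# Illustrative only: the real table and column names come from load_data.py's schema;
# "subreddits" and "comments" here are assumptions for the sketch.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS subreddits (
    subreddit_id TEXT PRIMARY KEY,
    name         TEXT NOT NULL UNIQUE
);

CREATE TABLE IF NOT EXISTS comments (
    comment_id   TEXT PRIMARY KEY,
    subreddit_id TEXT NOT NULL REFERENCES subreddits (subreddit_id),
    author       TEXT,
    body         TEXT,
    score        INTEGER,
    created_utc  BIGINT   -- epoch seconds
);
"""

with psycopg2.connect(host="localhost", port=5432, user="postgres",
                      password="yourpass", dbname="redditdb") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)  # the REFERENCES clause is what enforces referential integrity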
python load_data.py \
--input database.sqlite \
--host localhost \
--port 5432 \
--user postgres \
--password yourpass \
--dbname redditdb
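The script's internals aren't reproduced here; the sketch below shows the kind of chunked SQLite-to-PostgreSQL transfer such a loader might perform. The use of pandas and SQLAlchemy, the source table name May2015, and the staging table raw_comments are all assumptions for illustration.

# Sketch of a chunked SQLite -> PostgreSQL transfer; load_data.py's actual implementation may differ.
import sqlite3
import pandas as pd
from sqlalchemy import create_engine

src = sqlite3.connect("database.sqlite")
dst = create_engine("postgresql+psycopg2://postgres:yourpass@localhost:5432/redditdb")

for chunk in pd.read_sql_query("SELECT * FROM May2015", src, chunksize=50_000):
    # Append each chunk into a staging table; splitting the data into the
    # 6-table normalized schema would happen in a later SQL step.
    chunk.to_sql("raw_comments", dst, if_exists="append", index=False)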
Hybrid embedding strategy optimized for read performance
Top N comments are embedded in posts for fast reads, while all comments are stored separately for analytics. This balances MongoDB's 16MB document limit against read performance.
python load_to_mongo.py \
--input database.sqlite \
--mongo_uri "mongodb://localhost:27017/" \
--dbname reddit_may2015 \
--chunksize 50000 \
--embed-cap 200 \
--reset
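With --embed-cap 200 as above, the hybrid layout described earlier might produce documents shaped roughly like this sketch. Collection and field names (posts, comments, top_comments, post_id) are assumptions, not the exact output of load_to_mongo.py.

# Illustrative document shapes for the hybrid layout; field names are assumptions.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017/")
db = client["reddit_may2015"]

post = {
    "_id": "t3_abc123",
    "subreddit": "askscience",
    "title": "example post",
    "top_comments": [                       # capped at --embed-cap entries (200 above)
        {"comment_id": "t1_x1", "score": 512, "body": "example comment"},
    ],
}
comment = {                                 # every comment also lands in its own collection
    "_id": "t1_x1",
    "post_id": "t3_abc123",
    "subreddit": "askscience",
    "score": 512,
    "body": "example comment",
}

db.posts.insert_one(post)
db.comments.insert_one(comment)
db.comments.create_index([("post_id", ASCENDING)])  # fast per-post analytics lookups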
Pattern discovery using the Apriori algorithm
Remove deleted content, fix inconsistencies, validate timestamps
Transform comments into itemsets with subreddit, score, flags
Find frequent itemsets meeting support threshold
Generate IF-THEN rules with confidence metrics
Support: minimum frequency threshold for itemsets
Confidence: minimum probability for rules
Each comment becomes a transaction with subreddit, score category, and status flags
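A minimal sketch of that encoding plus the mining and rule-generation steps, using mlxtend's apriori and association_rules; the library choice, the item labels, and the thresholds are assumptions for illustration.

# Sketch: encode comments as one-hot transactions, then mine frequent itemsets and rules.
# Item labels ("sub=...", "score=high", "gilded", "edited") are illustrative assumptions.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

def to_items(comment):
    # One transaction per comment: subreddit, score category, and status flags.
    items = [f"sub={comment['subreddit']}"]
    items.append("score=high" if comment["score"] >= 10 else "score=low")
    if comment.get("gilded"):
        items.append("gilded")
    if comment.get("edited"):
        items.append("edited")
    return items

comments = [
    {"subreddit": "askscience", "score": 42, "gilded": 1, "edited": 0},
    {"subreddit": "funny",      "score": 3,  "gilded": 0, "edited": 1},
    {"subreddit": "askscience", "score": 15, "gilded": 0, "edited": 0},
]

transactions = [to_items(c) for c in comments]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

frequent = apriori(onehot, min_support=0.3, use_colnames=True)           # support threshold
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])    # IF-THEN rules

One-hot encoding into a boolean DataFrame is what lets a single apriori pass apply the support threshold before rules are filtered by confidence.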
Query performance improvements with optimized indexing
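The exact indexes aren't listed here; below is a hedged example of the kind of indexes that typically drive such gains for per-subreddit and top-score queries. Table, column, and collection names are assumptions carried over from the earlier sketches.

# Sketch of indexes that commonly back these query patterns; names are assumptions.
import psycopg2
from pymongo import MongoClient, ASCENDING, DESCENDING

# PostgreSQL: index the columns used for joins and per-subreddit filtering.
with psycopg2.connect(host="localhost", port=5432, user="postgres",
                      password="yourpass", dbname="redditdb") as conn:
    with conn.cursor() as cur:
        cur.execute("CREATE INDEX IF NOT EXISTS idx_comments_subreddit ON comments (subreddit_id);")
        cur.execute("CREATE INDEX IF NOT EXISTS idx_comments_score ON comments (score DESC);")

# MongoDB: compound index for "top comments in a subreddit" style queries.
db = MongoClient("mongodb://localhost:27017/")["reddit_may2015"]
db.comments.create_index([("subreddit", ASCENDING), ("score", DESCENDING)])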