Michaela Jackson

        

      A Large-Scale Data Analysis





Abstract:
What makes a book polarizing? This project explores the most disliked titles on Goodreads by analyzing over 15 million records of metadata and reader reviews. Using a two-phase approach—first with structured numerical data and then full-text review analysis—I examine the patterns behind one-star ratings, from genre bias to mismatched expectations and recurring complaint themes. The result is a rich combination of data storytelling, text mining, and visual interpretation that surfaces what readers love to hate—and why.

Purpose: 
Negative reviews offer a rich, if often blunt, form of feedback. By analyzing one-star ratings, we can surface common patterns of disappointment, identify mismatches between expectations and experience, and uncover genre-specific pitfalls. This project focuses on one-star reviews specifically because they tend to be emotionally charged and detailed, making them a valuable source for thematic and linguistic analysis.

Methods:
I used two large public datasets from Hugging Face and UCSD, combining structured metadata with over 15 million full-text reviews. In Phase 1, I cleaned and transformed the metadata, extracting star-rating distributions, exploding genre tags, and visualizing genre-specific dissatisfaction patterns. In Phase 2, I processed the large review files in chunks, applying natural-language cleaning, shelf-tag mapping, and sentiment analysis. I visualized linguistic trends, extracted common complaint themes using rule-based keyword detection, and analyzed contradictions between review sentiment and book-description sentiment. These methods laid the groundwork for future clustering and supervised modeling of review tone.
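The chunked Phase 2 approach can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the file path, the `rating` field name, and the chunk size are assumptions standing in for the real gzipped JSON-lines review dumps.

```python
import gzip
import json
from itertools import islice

def iter_one_star_reviews(path, chunk_size=10_000):
    """Stream a gzipped JSON-lines review file and yield lists of
    one-star reviews, one chunk at a time, so the full 15M-row
    file never has to fit in memory at once.

    NOTE: the 'rating' field name is an assumption about the
    review schema; adjust it to match the actual dump.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        records = (json.loads(line) for line in fh)
        while True:
            chunk = list(islice(records, chunk_size))
            if not chunk:
                break
            # Keep only one-star reviews from this chunk.
            yield [r for r in chunk if r.get("rating") == 1]
```

Each yielded chunk can then be passed through text cleaning and sentiment scoring before the next chunk is read, keeping memory use flat regardless of file size.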

Research Questions:
  • What themes consistently emerge in negative reviews?
  • How can natural language processing surface deeper insights about reader sentiment?
  • How do reader expectations and genre categories influence dissatisfaction?


Skills Developed:
This project strengthened my ability to work with large-scale datasets, clean and structure complex nested fields, and design efficient pipelines for exploratory and NLP-based analysis. I applied genre classification logic, implemented a rule-based complaint theme matcher, and used text preprocessing, sentiment analysis, and phrase frequency modeling to extract patterns from reader reviews. I also refined my ability to visualize sentiment contradictions and storytelling trends across formats.

Tools Used:
Python (pandas, seaborn, matplotlib, scikit-learn, TextBlob, KeyBERT, gzip, ast, collections.Counter), Jupyter Notebook, Hugging Face datasets, UCSD Goodreads Dataset, GitHub

Learning Outcomes:
This project demonstrates technical fluency in data wrangling and sentiment analysis at scale. It reflects the ability to identify and model cultural patterns in user-generated data using both structured metadata and unstructured review text. I translated messy shelf tags into meaningful genre labels, explored the language of critique through visualization, and designed the groundwork for a future classifier that predicts complaint themes. This work combines cultural insight with rigorous text-based modeling to interrogate how online communities assign literary value.
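Translating messy shelf tags into genre labels can be sketched as a normalization table. The mapping below is a tiny hypothetical example; real Goodreads shelves are far messier and include non-genre tags like "to-read" that must be dropped.

```python
# Hypothetical shelf-tag mapping; real shelves need many more aliases.
SHELF_TO_GENRE = {
    "ya": "young-adult",
    "young-adult": "young-adult",
    "sci-fi": "science-fiction",
    "science-fiction": "science-fiction",
    "memoir": "memoir",
    "memoirs": "memoir",
}

def map_shelves(shelves):
    """Collapse raw user shelf tags into a sorted list of canonical
    genre labels, silently dropping tags with no genre mapping
    (e.g. 'to-read' or 'favorites')."""
    return sorted({SHELF_TO_GENRE[s.lower()] for s in shelves
                   if s.lower() in SHELF_TO_GENRE})
```

A dictionary lookup keeps the mapping auditable and easy to extend as new shelf variants appear in the data.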





View the developing project code on GitHub