IMDb Sentiment Analyzer

Python
NLP
Machine Learning
Predictive Modeling
Web App
A web app that performs real-time NLP sentiment analysis on movie reviews, demonstrating a full workflow from data preparation to model tuning and deployment.
Author

Christopher Hynes

Published

July 12, 2025

Live Application | View the Notebook | GitHub Repository

Project Overview

This project demonstrates an end-to-end data science workflow for real-time sentiment analysis of movie reviews. It covers data sourcing and preparation, model training and tuning, and deployment as an interactive web application. The goal is an automated tool that instantly classifies the sentiment of user-generated text, serving as a proof of concept for business intelligence use cases.


Key Features

  • Real-Time Classification: Instantly classify any movie review as Positive or Negative.
  • Probabilistic Confidence: Displays the model’s confidence score for each prediction.
  • Interactive & User-Friendly UI: A clean and simple interface built with Streamlit for ease of use.
  • Optimized Performance: Powered by a tuned Logistic Regression model that reaches ~90.8% accuracy on the held-out test set.
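
The confidence score shown with each prediction can be sketched as the highest class probability returned by the model. The helper name classify_review, the toy training data, and the default model settings below are illustrative assumptions, not the application's actual code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def classify_review(model, review):
    """Return (label, confidence) for one review.

    Confidence is the probability of the predicted class.
    Hypothetical helper, not taken from the project source.
    """
    probabilities = model.predict_proba([review])[0]
    best = probabilities.argmax()
    return model.classes_[best], float(probabilities[best])


# Tiny toy training set so the sketch runs on its own (assumption:
# the real app loads a model trained on the full IMDb dataset).
train_texts = [
    "A wonderful, heartfelt film with great acting.",
    "Brilliant story and a fantastic cast.",
    "I loved every minute of this movie.",
    "Dull, boring, and far too long.",
    "Terrible script and awful acting.",
    "I hated this movie, a complete waste of time.",
]
train_labels = ["Positive"] * 3 + ["Negative"] * 3

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(train_texts, train_labels)

label, confidence = classify_review(model, "A fantastic movie with a great cast.")
print(f"{label} ({confidence:.1%} confidence)")
```

In a Streamlit front end, the text from an input widget such as st.text_area could be passed to a helper like this, with the returned label and confidence rendered on the page.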

Methodology

The project follows a standard data science workflow, documented in the accompanying Jupyter Notebook.

a. Data Preprocessing

  • The “Large Movie Review Dataset” from IMDb, containing 50,000 pre-labeled reviews, was loaded and structured.
  • Sentiment labels were mapped to human-readable classes (“Negative”/“Positive”).
  • The data was split into training (80%) and testing (20%) sets for unbiased evaluation.
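
The label mapping and 80/20 split described above might look like the following minimal sketch. The toy reviews, the 0/1 label encoding, the stratified split, and the random_state value are all assumptions made for illustration; the notebook may differ on each.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the 50,000 IMDb reviews: (text, integer label) pairs.
# Assumption: 0 = negative, 1 = positive.
raw_reviews = [
    ("A moving, beautifully shot film.", 1),
    ("Sharp writing and a great cast.", 1),
    ("One of the best movies this year.", 1),
    ("Funny, warm, and well paced.", 1),
    ("I would happily watch it again.", 1),
    ("Flat characters and a dull plot.", 0),
    ("A tedious, overlong mess.", 0),
    ("The acting was painfully bad.", 0),
    ("I walked out halfway through.", 0),
    ("Not worth the ticket price.", 0),
]

# Map integer labels to human-readable classes.
LABEL_NAMES = {0: "Negative", 1: "Positive"}
texts = [text for text, _ in raw_reviews]
labels = [LABEL_NAMES[label] for _, label in raw_reviews]

# 80% training, 20% testing. Stratifying keeps the class balance
# the same in both sets (an assumption, not stated in the write-up).
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

print(len(X_train), "training reviews,", len(X_test), "test reviews")
```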

b. Feature Engineering and Model Training

  • Feature Extraction: Text reviews were vectorized with a TfidfVectorizer, which weights each term by how often it appears in a review while down-weighting terms that are common across all reviews and therefore carry little signal.
  • Model Selection: A Logistic Regression classifier was chosen for its strong balance of high performance and interpretability.
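
As a rough sketch of how these two steps fit together, the vectorizer and classifier can be chained in a scikit-learn Pipeline so that raw text goes in and a label comes out. The step names "tfidf" and "clf", the toy data, and the max_iter setting are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data standing in for the IMDb training split.
train_texts = [
    "An excellent film with a gripping story.",
    "Great performances and a lovely soundtrack.",
    "Truly enjoyable from start to finish.",
    "A boring film with a weak story.",
    "Awful performances and a messy script.",
    "Truly painful from start to finish.",
]
train_labels = ["Positive"] * 3 + ["Negative"] * 3

# Step 1 turns each review into a TF-IDF weighted term vector;
# step 2 fits a linear classifier on those vectors.
sentiment_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
sentiment_pipeline.fit(train_texts, train_labels)

# The fitted pipeline accepts raw strings directly.
prediction = sentiment_pipeline.predict(["A gripping and enjoyable film."])[0]
print(prediction)

# Number of distinct terms the vectorizer learned from the toy data.
vocabulary_size = len(sentiment_pipeline.named_steps["tfidf"].vocabulary_)
print(vocabulary_size, "terms in the vocabulary")
```

Keeping both steps in one Pipeline means the vectorizer is fit only on training data and the same transformation is applied automatically at prediction time.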

c. Hyperparameter Tuning

  • A GridSearchCV search over a Pipeline was used to systematically select, by cross-validation, the best-scoring hyperparameters for both the vectorizer (ngram_range, min_df) and the model’s inverse regularization strength (C), which controls overfitting.
  • The final, tuned model was evaluated on the held-out test set, achieving an accuracy of ~90.8%, which suggests it generalizes well to unseen movie reviews.
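
A minimal sketch of such a search is shown below. The parameter names (ngram_range, min_df, C) come from the description above, but the candidate values, the 3-fold setting, the step names, and the toy data are assumptions chosen so the example runs quickly; they are not the grid used in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data; every review shares a few words so min_df=2 leaves terms.
texts = [
    "the movie was wonderful and moving",
    "the movie was funny and charming",
    "the movie was brilliant and fresh",
    "the movie was great and touching",
    "the movie was lovely and clever",
    "the movie was superb and warm",
    "the movie was dull and slow",
    "the movie was awful and loud",
    "the movie was boring and long",
    "the movie was terrible and cheap",
    "the movie was weak and messy",
    "the movie was bad and lazy",
]
labels = ["Positive"] * 6 + ["Negative"] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# "<step>__<param>" routes each value to the named pipeline step.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
    "tfidf__min_df": [1, 2],                 # drop terms seen in fewer docs
    "clf__C": [0.1, 1.0, 10.0],              # smaller C = stronger regularization
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

After fitting, search.best_estimator_ holds the pipeline refit on all the training data with the winning parameters, and that refit model is the one to score on the held-out test set.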

Tools and Libraries

  • Data Science: Pandas, Scikit-learn (TfidfVectorizer, LogisticRegression, Pipeline, GridSearchCV)
  • Web Application: Streamlit
  • Development: Python 3.11, Jupyter Notebook