IMDb Sentiment Analyzer

Python
NLP
Machine Learning
Predictive Modeling
Web App
A web app that performs real-time NLP sentiment analysis on movie reviews, demonstrating a full workflow from data preparation to model tuning and deployment.
Author

Christopher Hynes

Published

July 12, 2025

Project Overview

This project demonstrates an end-to-end workflow for real-time sentiment analysis on movie reviews. It covers data preparation, model tuning, and deployment as an interactive web application. The result is a practical NLP classification pipeline for user-generated text.


Key Features

  • Real-Time Classification: Instantly classify any movie review as Positive or Negative.
  • Probabilistic Confidence: Displays the model’s confidence score for each prediction.
  • Interactive UI: A clean, simple interface built with Streamlit.
  • Optimized Performance: Powered by a tuned Logistic Regression model that reaches ~90.8% accuracy on the held-out test set.
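As a rough illustration of the classification and confidence features, the sketch below derives a label and a confidence score from a scikit-learn model's predict_proba output. The helper name classify_review and the four toy training reviews are assumptions made for this example and do not come from the project code. The Streamlit wiring is left out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def classify_review(model, text):
    """Return (label, confidence) for one review (hypothetical helper)."""
    probabilities = model.predict_proba([text])[0]
    best = probabilities.argmax()
    # Confidence is the predicted probability of the winning class.
    return model.classes_[best], float(probabilities[best])


# Toy model standing in for the tuned model; the real one is trained on IMDb data.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(
    [
        "loved it, a wonderful film",
        "brilliant and moving",
        "hated it, a terrible film",
        "dull and tedious",
    ],
    ["Positive", "Positive", "Negative", "Negative"],
)

label, confidence = classify_review(model, "a wonderful and moving film")
print(f"{label} ({confidence:.1%} confidence)")
```

In a Streamlit app, the same helper could be called on the text a user submits, with the label and confidence rendered in the page.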

Methodology

The project follows a standard data science workflow, documented in the accompanying Jupyter Notebook.

a. Data Preprocessing

  • The “Large Movie Review Dataset” from IMDb, containing 50,000 pre-labeled reviews, was loaded and structured.
  • Sentiment labels were mapped to human-readable classes (“Negative”/“Positive”).
  • The data was split into training (80%) and testing (20%) sets for unbiased evaluation.
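The preprocessing steps above can be sketched as follows. The column names, the 0/1 label encoding, the stratified split, and the random seed are assumptions for illustration. A ten-row stand-in replaces the 50,000-review dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in for the IMDb dataset; column names are assumed.
reviews = pd.DataFrame({
    "review": [f"sample review text {i}" for i in range(10)],
    "label": [0, 1] * 5,  # assumed encoding: 0 = negative, 1 = positive
})

# Map numeric labels to human-readable classes.
reviews["sentiment"] = reviews["label"].map({0: "Negative", 1: "Positive"})

# 80/20 split; stratifying keeps the class balance the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    reviews["review"],
    reviews["sentiment"],
    test_size=0.2,
    stratify=reviews["sentiment"],
    random_state=42,
)

print(len(X_train), len(X_test))
```

Fixing random_state makes the split reproducible, so the reported test accuracy can be recomputed on the same held-out reviews.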

b. Feature Engineering and Model Training

  • Feature Extraction: Text reviews were vectorized with a TfidfVectorizer, which weights each term by its frequency within a review while down-weighting terms that appear across many reviews and therefore carry little signal.
  • Model Selection: A Logistic Regression classifier was chosen for its balance of accuracy and interpretability.

c. Hyperparameter Tuning

  • GridSearchCV was run over a Pipeline to search jointly over the vectorizer's hyperparameters (ngram_range, min_df) and the model's inverse regularization strength (C), where smaller values of C regularize more strongly and limit overfitting.
  • The final, tuned model was evaluated on the held-out test set, achieving an accuracy of ~90.8%, indicating it generalizes well to new, unseen movie reviews.
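A minimal sketch of this kind of joint search is shown below. The grid values, the 3-fold cross-validation, the step names tfidf and clf, and the toy reviews are all assumptions for illustration. The notebook's actual grid and results are not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy training data standing in for the IMDb training split.
texts = [
    "a wonderful heartfelt film", "great acting and a great story",
    "loved every minute of it", "brilliant and moving",
    "an excellent fun movie", "truly wonderful performances",
    "a dull boring film", "terrible acting and a weak story",
    "hated every minute of it", "awful and tedious",
    "a boring bad movie", "truly terrible performances",
]
labels = ["Positive"] * 6 + ["Negative"] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid; keys use the "<step>__<parameter>" naming convention.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__min_df": [1, 2],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```

Putting the vectorizer inside the Pipeline means it is refitted on each training fold, so vocabulary and IDF statistics never leak from the validation fold into training.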

Tech Stack

  • Data Science: pandas, scikit-learn (TfidfVectorizer, LogisticRegression, Pipeline, GridSearchCV)
  • Web Application: Streamlit
  • Development: Python 3.11, Jupyter Notebook