IMDb Sentiment Analyzer
Python
NLP
Machine Learning
Predictive Modeling
Web App
A web app that performs real-time NLP sentiment analysis on movie reviews, demonstrating a full workflow from data preparation to model tuning and deployment.
Project Overview
This project trains a TF-IDF and Logistic Regression classifier on 50,000 labeled IMDb reviews, tunes it with a grid search, and serves it through a Streamlit app that classifies reviews as they are entered. The result is a practical NLP classification pipeline for user-generated text.
Key Features
- Real-Time Classification: Instantly classify any movie review as Positive or Negative.
- Probabilistic Confidence: Displays the model’s confidence score for each prediction.
- Interactive UI: A clean, simple interface built with Streamlit.
- Optimized Performance: Powered by a Logistic Regression model with grid-searched hyperparameters, achieving ~90.8% accuracy on the held-out test set.
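A minimal sketch of how a label and a confidence score can be derived from a fitted scikit-learn classifier. The helper name `classify_review` and the four-review toy model are assumptions for illustration; the app's actual code may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def classify_review(model, text):
    """Return (label, confidence) for one review. Hypothetical helper."""
    # predict_proba returns one row per input with one column per class.
    probabilities = model.predict_proba([text])[0]
    best = probabilities.argmax()
    return model.classes_[best], float(probabilities[best])


# Toy stand-in for the trained model (illustrative data, not the IMDb set).
toy_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
toy_model.fit(
    [
        "a wonderful, moving film",
        "great acting and a great story",
        "boring and far too long",
        "terrible plot and awful acting",
    ],
    ["Positive", "Positive", "Negative", "Negative"],
)

label, confidence = classify_review(toy_model, "a wonderful story")
print(f"{label} ({confidence:.1%} confidence)")
```

In a binary setting the reported confidence is the larger of the two class probabilities, so it always lies between 50% and 100%.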
Methodology
The project follows a standard data science workflow, documented in the accompanying Jupyter Notebook.
a. Data Preprocessing
- The “Large Movie Review Dataset” from IMDb, containing 50,000 pre-labeled reviews, was loaded and structured.
- Sentiment labels were mapped to human-readable classes (“Negative”/“Positive”).
- The data was split into training (80%) and testing (20%) sets for unbiased evaluation.
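The preprocessing steps above can be sketched as follows. The column names, the 0/1 label encoding, the stratified split, and the `random_state` value are assumptions, and the ten-row DataFrame only stands in for the full 50,000-review dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny illustrative stand-in for the dataset (assumed columns: review, label).
reviews = pd.DataFrame({
    "review": [
        "loved every minute", "a masterpiece", "great cast", "very moving",
        "funny and smart", "dull and slow", "a total mess", "awful script",
        "waste of time", "painfully bad",
    ],
    "label": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})

# Map numeric labels to human-readable classes.
reviews["sentiment"] = reviews["label"].map({0: "Negative", 1: "Positive"})

# 80/20 split; stratifying keeps the class balance equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    reviews["review"],
    reviews["sentiment"],
    test_size=0.2,
    stratify=reviews["sentiment"],
    random_state=42,
)
print(len(X_train), len(X_test))
```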
b. Feature Engineering and Model Training
- Feature Extraction: Text reviews were vectorized using a TfidfVectorizer, which represents text based on word frequency while down-weighting common, non-informative words.
- Model Selection: A Logistic Regression classifier was chosen for its strong balance of high performance and interpretability.
c. Hyperparameter Tuning
- A GridSearchCV search over the full pipeline was used to systematically find the best hyperparameters for both the vectorizer (ngram_range, min_df) and the model’s inverse regularization strength (C), which controls overfitting.
- The final, tuned model was evaluated on the held-out test set, achieving an accuracy of ~90.8%, indicating it generalizes well to new, unseen movie reviews.
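A minimal sketch of such a search, assuming pipeline step names "tfidf" and "clf". The grid values, the 3-fold cross-validation, and the toy reviews are illustrative choices, not the values used in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Illustrative data; shared words keep the vocabulary non-empty for min_df=2.
texts = [
    "the movie was great", "the movie was wonderful", "the movie was fun",
    "the movie was great fun", "the acting was wonderful", "the acting was great",
    "the movie was bad", "the movie was boring", "the movie was awful",
    "the movie was bad and boring", "the acting was awful", "the acting was bad",
]
labels = ["Positive"] * 6 + ["Negative"] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# "<step>__<parameter>" routes each value to the matching pipeline step.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
    "tfidf__min_df": [1, 2],                 # minimum document frequency
    "clf__C": [0.1, 1.0, 10.0],              # smaller C = stronger regularization
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_, round(search.best_score_, 3))
```

Tuning the vectorizer and the classifier inside one Pipeline means the TF-IDF vocabulary is refit on each training fold, so the validation folds do not leak into feature extraction.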
Tech Stack
- Data Science: pandas, scikit-learn (TfidfVectorizer, LogisticRegression, Pipeline, GridSearchCV)
- Web Application: Streamlit
- Development: Python 3.11, Jupyter Notebook