GreyNoise Customer Churn Predictor
Live Application | View the Notebook | GitHub Repository
Project Overview
This project is a Python-based machine learning application designed to predict customer churn at GreyNoise Intelligence. It serves as a modernized, deployable version of an initial analysis I conducted during an internship using R. The primary goal is to provide an interactive tool to assess churn risk in real-time.
To protect proprietary information, the public-facing version of this project, including the live app and the code repository, uses a synthetically generated dataset.
Model Performance
The logistic regression model was validated using a Stratified 5-Fold Cross-Validation strategy. The metrics below reflect the performance on the original, proprietary dataset.
Metric | Score |
---|---|
Mean Accuracy | 0.79 |
Mean ROC AUC | 0.70 |
Mean PR-AUC | 0.54 |
An ROC AUC score of 0.70 demonstrates a good capability to distinguish between customers who will churn and those who will not. The PR-AUC score of 0.54 is particularly important for this imbalanced dataset, indicating that the model is substantially better at identifying churners than a random model.
Key Predictive Factors
Analysis of the model’s coefficients reveals the most significant factors influencing churn predictions.
Top Factors Increasing Churn Risk
- Geographic Region: The model identified that customers from certain geographic regions had a significantly higher propensity to churn.
- Unknown Account History: When a customer’s prior account signup status was unknown, they were flagged as a high churn risk.
- Specific Industry Segments: Customers within certain specialized industries showed a higher tendency to churn.
Top Factors Decreasing Churn Risk
- Acquisition Source: Customers who came to GreyNoise via direct traffic were the least likely to churn.
- Existing Account History: Knowing a customer had a previous free account was a strong signal of loyalty and a significantly lower churn risk.
- Annual Recurring Revenue (ARR): As a customer’s ARR increased, their likelihood of churn decreased substantially.
Tech Stack & Methodology
- Modeling: Python 3.11, scikit-learn, pandas, numpy, joblib
- Web App: Streamlit
- Validation: Stratified K-Fold Cross-Validation
- Source Code: The
gn_churn.ipynb
notebook contains the complete code for data processing and modeling.