GreyNoise Customer Churn Predictor
Project Overview
This project is a Python-based machine learning application designed to predict customer churn at GreyNoise Intelligence. It serves as a modernized, deployable version of an initial analysis I conducted in R during an internship. The primary goal is to provide an interactive tool to assess churn risk in real time.
To protect proprietary information, the public-facing version of this project, including the live app and the code repository, uses a synthetically generated dataset.
Model Performance
The logistic regression model was validated using a stratified 5-fold cross-validation strategy. The metrics below reflect performance on the original, proprietary dataset.
| Metric | Score |
|---|---|
| Mean Accuracy | 0.79 |
| Mean ROC AUC | 0.70 |
| Mean PR-AUC | 0.54 |
An ROC AUC score of 0.70 (where 0.50 corresponds to random ranking) indicates useful separation between customers who churn and those who do not. The PR-AUC score of 0.54 is especially relevant for this imbalanced dataset: a random classifier's PR-AUC is approximately equal to the churn rate, and the model performs well above that baseline.
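The validation procedure described above can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification`, not the project's actual notebook code: the feature count, class balance, scaling step, and the use of scikit-learn's `average_precision` scorer as the PR-AUC summary are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (roughly 25% positives); not the real dataset.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    n_clusters_per_class=1, weights=[0.75, 0.25], random_state=42,
)

# Assumed model setup: standardize features, then fit logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the churn rate roughly equal in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_validate(
    model, X, y, cv=cv,
    scoring={
        "accuracy": "accuracy",
        "roc_auc": "roc_auc",
        "pr_auc": "average_precision",  # assumed PR-AUC summary
    },
)

mean_scores = {
    name: scores[f"test_{name}"].mean()
    for name in ("accuracy", "roc_auc", "pr_auc")
}

# A random ranker's PR-AUC is approximately the positive-class rate.
pr_baseline = y.mean()

for name, value in mean_scores.items():
    print(f"Mean {name}: {value:.2f}")
print(f"PR-AUC random baseline (churn rate): {pr_baseline:.2f}")
```

Printing the baseline next to the mean PR-AUC makes the comparison explicit, which matters more on imbalanced data than accuracy alone.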
Key Predictive Factors
Analysis of the model’s coefficients reveals the factors with the largest influence on churn predictions.
Top Factors Increasing Churn Risk
- Geographic Region: The model identified that customers from certain geographic regions had a significantly higher propensity to churn.
- Unknown Account History: When a customer’s prior account signup status was unknown, they were flagged as a high churn risk.
- Specific Industry Segments: Customers within certain specialized industries showed a higher tendency to churn.
Top Factors Decreasing Churn Risk
- Acquisition Source: Customers who came to GreyNoise via direct traffic were the least likely to churn.
- Existing Account History: Knowing a customer had a previous free account was a strong signal of loyalty and a significantly lower churn risk.
- Annual Recurring Revenue (ARR): As a customer’s ARR increased, their likelihood of churn decreased substantially.
Tech Stack & Methodology
- Modeling: Python 3.11, scikit-learn, pandas, numpy, joblib
- Web App: Streamlit
- Validation: Stratified K-Fold Cross-Validation
- Source Code: The `notebook.ipynb` notebook contains the complete code for data processing and modeling.
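Since the stack lists joblib alongside a Streamlit app, a plausible hand-off between the two is to persist the fitted model and reload it for scoring. The following is a minimal sketch under that assumption; the file name `churn_model.joblib`, the single-feature training data, and the example input are hypothetical and do not come from the repository.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic training set: one feature, lower values mean higher churn.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 1))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=200) < 0).astype(int)

model = LogisticRegression().fit(X_train, y_train)

# Save the fitted model and load it back, as a serving app might at startup.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "churn_model.joblib"  # hypothetical file name
    joblib.dump(model, model_path)
    loaded_model = joblib.load(model_path)

# predict_proba returns [P(no churn), P(churn)] for each input row.
new_customer = np.array([[-1.0]])
churn_probability = loaded_model.predict_proba(new_customer)[0, 1]
print(f"Estimated churn probability: {churn_probability:.2f}")
```

In a Streamlit app, the same `predict_proba` call would typically run on values collected from input widgets, with the resulting probability shown to the user.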