1. Setup and Data Loading

Note: The original dataset is proprietary, so I generate a synthetic dataset that reflects its structure for demonstration.

import pandas as pd
import numpy as np

# set random seed
np.random.seed(42)

# set dataset size
num_rows = 100

# categorical data
industry_options = ['Software', 'IT Services', 'Government (US)', 'Government (International)', 'Financial Services', 'Insurance', 'Telecommunication Services', 'Professional Services', 'Energy']
country_options = ['United States', 'United Kingdom', 'India', 'Canada', 'Australia', 'Germany']
source_options = ['Direct Traffic', 'Offline Sources', 'Email Marketing', 'Organic Search', 'Referrals']
relationship_options = ['Active Customer', 'Former Customer', 'Active Customer; Active Integration', 'Active OEM Partner', 'Former Customer; Active OEM Partner']
company_name_options = ['Company A', 'Company B', 'Company C', 'Company D', 'Company E']

# generate data for columns
data_dict = {
    'Record ID': np.random.randint(10**6, 10**7, size=num_rows),
    'Company name': np.random.choice(company_name_options, num_rows),
    'Industry': np.random.choice(industry_options, num_rows),
    'Country/Region': np.random.choice(country_options, num_rows),
    'Relationship Type': np.random.choice(relationship_options, num_rows),
    'Latest Source': np.random.choice(source_options, num_rows)
}

# generate random dates between 2019 and 2023
start_date = pd.to_datetime('2019-01-01')
end_date_range = pd.to_datetime('2023-07-14')
random_days = np.random.randint(0, (end_date_range - start_date).days, size=num_rows)
commence_dates = start_date + pd.to_timedelta(random_days, unit='d')

# engagement ends 1 to 5 years after commence
end_dates = commence_dates + pd.to_timedelta(np.random.randint(365, 365 * 5, size=num_rows), unit='d')

# signup date can be before/after commence or missing
signup_deltas = pd.to_timedelta(np.random.randint(-365, 365, size=num_rows), unit='d')
signup_dates = commence_dates + signup_deltas

# convert to numpy array to insert NaT values
signup_dates_array = signup_dates.to_numpy()

# make ~15% of signup dates missing
missing_mask = np.random.choice([True, False], size=num_rows, p=[0.15, 0.85])
signup_dates_array[missing_mask] = np.datetime64('NaT')

data_dict['Engagement Commence'] = commence_dates
data_dict['Engagement End'] = end_dates
data_dict['First GreyNoise Account Signup'] = signup_dates_array

# generate ARR as a string with commas
arr_values = np.random.randint(5000, 250000, num_rows)
data_dict['ARR'] = [f"{x:,}" for x in arr_values]

# create final dataframe
data = pd.DataFrame(data_dict)

# convert date columns to 'YYYY-MM-DD' strings to match the original raw data, keeping missing dates as None
for col in ['Engagement Commence', 'Engagement End', 'First GreyNoise Account Signup']:
    dates = pd.to_datetime(data[col])
    data[col] = [d.strftime('%Y-%m-%d') if pd.notna(d) else None for d in dates]

2. Exploratory Data Analysis

2.1 Initial Inspection

data.head()
Record ID Company name Industry Country/Region Relationship Type Latest Source Engagement Commence Engagement End First GreyNoise Account Signup ARR
0 7423388 Company C Telecommunication Services United Kingdom Active Customer; Active Integration Direct Traffic 2021-03-25 2024-05-03 None 108,677
1 7550634 Company A Government (International) India Active OEM Partner Direct Traffic 2022-03-18 2026-01-27 2021-03-22 72,779
2 5304572 Company B Energy Germany Former Customer Direct Traffic 2019-03-21 2022-12-11 2019-06-29 24,029
3 3234489 Company B Software United Kingdom Active Customer; Active Integration Organic Search 2021-06-04 2024-04-10 2022-01-28 64,201
4 8204212 Company D Professional Services Germany Former Customer; Active OEM Partner Direct Traffic 2022-05-21 2025-05-03 2023-04-02 37,265
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 10 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   Record ID                       100 non-null    int64 
 1   Company name                    100 non-null    object
 2   Industry                        100 non-null    object
 3   Country/Region                  100 non-null    object
 4   Relationship Type               100 non-null    object
 5   Latest Source                   100 non-null    object
 6   Engagement Commence             100 non-null    object
 7   Engagement End                  100 non-null    object
 8   First GreyNoise Account Signup  88 non-null     object
 9   ARR                             100 non-null    object
dtypes: int64(1), object(9)
memory usage: 7.9+ KB

Note: In the original dataset, all columns were parsed as strings. I convert them to their appropriate data types in Section 3.1.

data.isna().sum()
Record ID 0
Company name 0
Industry 0
Country/Region 0
Relationship Type 0
Latest Source 0
Engagement Commence 0
Engagement End 0
First GreyNoise Account Signup 12
ARR 0

There are 12 entries missing a ‘First GreyNoise Account Signup’ date.

3. Data Preprocessing and Feature Engineering

3.1 Initial Cleaning and Type Conversion

I drop the columns ‘Record ID’ and ‘Company name’, which contain bookkeeping information and are not useful for modeling.

data = data.drop(columns=['Record ID', 'Company name'])

Since all variables were parsed as strings, I convert them to:

  • Date: ‘Engagement Commence’, ‘Engagement End’, ‘First GreyNoise Account Signup’
  • Numeric: ‘ARR’

date_columns = ['Engagement Commence', 'Engagement End', 'First GreyNoise Account Signup']
numeric_columns = ['ARR']

# convert columns to datetime
for col in date_columns:
    data[col] = pd.to_datetime(data[col], errors='coerce')

# remove commas from 'ARR'
data['ARR'] = data['ARR'].str.replace(',', '')
# convert columns to numeric
for col in numeric_columns:
    data[col] = pd.to_numeric(data[col], errors='coerce')

# verify changes
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 8 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   Industry                        100 non-null    object        
 1   Country/Region                  100 non-null    object        
 2   Relationship Type               100 non-null    object        
 3   Latest Source                   100 non-null    object        
 4   Engagement Commence             100 non-null    datetime64[ns]
 5   Engagement End                  100 non-null    datetime64[ns]
 6   First GreyNoise Account Signup  88 non-null     datetime64[ns]
 7   ARR                             100 non-null    int64         
dtypes: datetime64[ns](3), int64(1), object(4)
memory usage: 6.4+ KB

3.2 Filtering

Contracts with commence dates before September 2019 were incomplete in the original dataset, so I drop them.

data = data[data['Engagement Commence'] >= '2019-09-01']

I only include companies with engagement commencement on or before July 14, 2022, ensuring each had at least one full year to potentially churn by the analysis date (July 14, 2023).

data = data[data['Engagement Commence'] <= '2022-07-14']
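
As a quick sanity check (not part of the original notebook), the remaining engagement window and row count can be verified after filtering:

# confirm only the intended commencement window remains
print(data['Engagement Commence'].min(), data['Engagement Commence'].max())
print(f"{len(data)} rows remain after filtering")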

3.3 Feature Creation

I create the variable ‘Had Prior Account’ to indicate whether a customer signed up for a free GreyNoise account before the start of their paid engagement. This variable has three possible values:

  • 1: the customer signed up for a free account before their engagement commenced
  • 0: the customer signed up on or after the engagement commencement
  • -1: no signup date was recorded

conditions = [
    data['First GreyNoise Account Signup'] < data['Engagement Commence'],   # signed up before commence
    data['First GreyNoise Account Signup'] >= data['Engagement Commence'],  # signed up on or after commence
    data['First GreyNoise Account Signup'].isnull()                         # signup date missing
]
outcomes = [1, 0, -1]

# create column based on conditions
data['Had Prior Account'] = np.select(conditions, outcomes, default=-1)
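
A quick look at the distribution of the new indicator helps confirm the mapping behaves as expected (a minimal check; output not shown):

# 1 = prior free account, 0 = signed up on/after commencement, -1 = no signup recorded
print(data['Had Prior Account'].value_counts())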

‘Relationship Type’ contains various customer statuses. Any value that includes ‘Former Customer’ indicates a churned customer. I map these values to churn = 1, and all other values to churn = 0.

data['Churned'] = data['Relationship Type'].str.contains('Former Customer', na=False).astype(int)
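
It is also worth checking the class balance of the new target here, since it motivates the stratified cross-validation used later (a minimal check; output not shown):

# proportion of churned (1) vs. retained (0) customers
print(data['Churned'].value_counts(normalize=True))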

3.4 Handling Missing Values

First, I drop the columns that are no longer needed now that the target and features have been derived from them.

data = data.drop(columns=['Relationship Type', 'Engagement Commence', 'Engagement End', 'First GreyNoise Account Signup'])
data.head()
Industry Country/Region Latest Source ARR Had Prior Account Churned
0 Telecommunication Services United Kingdom Direct Traffic 108677 -1 0
1 Government (International) India Direct Traffic 72779 1 0
3 Software United Kingdom Organic Search 64201 0 0
4 Professional Services Germany Direct Traffic 37265 0 1
6 IT Services India Email Marketing 248322 0 1

I check again for missing values across the dataset and drop any rows that still contain them. Given their low count, this should have a minimal effect on the overall dataset.

data.isna().sum()
data.dropna(inplace=True)

4. Model Training and Validation

from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# 1. define features and target
X = data.drop('Churned', axis=1)
y = data['Churned']

# 2. define preprocessor
# define column types
categorical_features = ['Industry', 'Country/Region', 'Latest Source', 'Had Prior Account']
numeric_features = ['ARR']

# create preprocessor with ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
        ('num', StandardScaler(), numeric_features)
    ])

# create full pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])


# 3. estimate performance with stratified k-fold cv
# I use n = 5 splits
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# metrics to calculate
scoring_metrics = ['accuracy', 'roc_auc', 'average_precision']

# cross_validate with the StratifiedKFold object
cv_results = cross_validate(pipeline, X, y, cv=skf, scoring=scoring_metrics)


# print estimates
print("--- Model Performance Estimate (Stratified 5-Fold CV) ---")
print(f"Mean Accuracy: {np.mean(cv_results['test_accuracy']):.2f}")
print(f"Mean ROC AUC: {np.mean(cv_results['test_roc_auc']):.2f}")


# 4. train final model on ALL data
final_model = pipeline.fit(X, y)
--- Model Performance Estimate (Stratified 5-Fold CV) ---
Mean Accuracy: 0.51
Mean ROC AUC: 0.57
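
The scoring list also computes average precision, which is informative when churned customers are a minority class, though it is not reported in the output above. It could be printed from the same cv_results (a small addition, not run here):

print(f"Mean Average Precision: {np.mean(cv_results['test_average_precision']):.2f}")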

5. Model Interpretation

# get the trained classifier from the fitted pipeline
# (final_model is the same pipeline object fitted on all data above)
classifier = final_model.named_steps['classifier']

# get the fitted preprocessor and the transformed feature names
preprocessor = final_model.named_steps['preprocessor']
feature_names = preprocessor.get_feature_names_out()

# create df
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': classifier.coef_[0]
})

# sort by the absolute value to see most impactful features
importance_df['Abs_Coefficient'] = importance_df['Coefficient'].abs()
importance_df = importance_df.sort_values(by='Abs_Coefficient', ascending=False)

importance_df
Feature Coefficient Abs_Coefficient
1 cat__Industry_Financial Services 1.425638 1.425638
19 cat__Latest Source_Referrals -1.014383 1.014383
10 cat__Country/Region_Canada 0.899791 0.899791
18 cat__Latest Source_Organic Search 0.627315 0.627315
3 cat__Industry_Government (US) -0.580932 0.580932
21 cat__Had Prior Account_0 0.574032 0.574032
15 cat__Latest Source_Direct Traffic 0.562287 0.562287
22 cat__Had Prior Account_1 -0.550713 0.550713
2 cat__Industry_Government (International) -0.526404 0.526404
13 cat__Country/Region_United Kingdom -0.517882 0.517882
6 cat__Industry_Professional Services -0.508398 0.508398
12 cat__Country/Region_India 0.414341 0.414341
11 cat__Country/Region_Germany -0.408978 0.408978
5 cat__Industry_Insurance 0.316264 0.316264
7 cat__Industry_Software -0.261435 0.261435
9 cat__Country/Region_Australia -0.236989 0.236989
16 cat__Latest Source_Email Marketing -0.221798 0.221798
14 cat__Country/Region_United States -0.145851 0.145851
23 num__ARR 0.108085 0.108085
8 cat__Industry_Telecommunication Services 0.107558 0.107558
17 cat__Latest Source_Offline Sources 0.051013 0.051013
0 cat__Industry_Energy 0.031259 0.031259
20 cat__Had Prior Account_-1 -0.018886 0.018886
4 cat__Industry_IT Services 0.000883 0.000883
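
Because logistic regression coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to communicate (a minimal sketch using the importance_df built above):

# values > 1 increase the odds of churn, values < 1 decrease them
importance_df['Odds Ratio'] = np.exp(importance_df['Coefficient'])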

6. Final Model Export

import joblib

joblib.dump(final_model, 'gn_churn_model.joblib')
['gn_churn_model.joblib']
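
To use the exported artifact elsewhere, the pipeline can be reloaded and applied to new rows that carry the same raw feature columns. A minimal sketch with purely illustrative values (column names must match the training features exactly):

loaded_model = joblib.load('gn_churn_model.joblib')

# hypothetical new account; values are illustrative only
new_account = pd.DataFrame([{
    'Industry': 'Software',
    'Country/Region': 'United States',
    'Latest Source': 'Organic Search',
    'ARR': 50000,
    'Had Prior Account': 1
}])

# predicted churn probability for the positive class
print(loaded_model.predict_proba(new_account)[:, 1])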