Customer Churn Prediction
Customer churn prediction is a critical business problem: identifying which customers are likely to leave allows a company to target retention campaigns before they do. This project generates a synthetic customer dataset with realistic features (account age, monthly spend, support tickets, usage frequency, satisfaction score, contract type) and a binary churn label. It then trains and compares 6 classifiers: Logistic Regression (the linear baseline), Decision Tree (an interpretable single tree), Random Forest (a bagging ensemble), Gradient Boosting (a sequential boosting ensemble), K-Nearest Neighbors (instance-based), and Gaussian Naive Bayes (probabilistic). Each model is evaluated with 5-fold cross-validation to obtain robust accuracy, precision, recall, and F1 estimates, and a comparison table ranks the models by F1 score. The project also analyzes feature importance from the tree-based models to identify which customer attributes are most predictive of churn. The whole pipeline doubles as a template for other binary classification business problems.
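The data-generation step is only hinted at in the source listing below (`// ... generate features: accountAge, monthlySpend, ...`). As a rough illustration, here is one plausible way to synthesize such a dataset in plain TypeScript; the feature names come from the description above, but every distribution, coefficient, and the logistic churn mechanism are illustrative assumptions, not the project's actual generator:

```ts
// Hypothetical sketch: one way the synthetic churn data could be generated.
// Distributions and the logistic link below are illustrative assumptions.
const nSamples = 1000;
const rows: number[][] = [];
const labels: number[] = [];

for (let i = 0; i < nSamples; i++) {
  const accountAge = Math.random() * 72;             // months, uniform 0..72
  const monthlySpend = 20 + Math.random() * 180;     // dollars
  const supportTickets = Math.floor(Math.random() * 8);
  const usageFrequency = Math.random() * 30;         // sessions per month
  const satisfaction = 1 + Math.random() * 9;        // 1..10 score
  const contractType = Math.random() < 0.5 ? 0 : 1;  // 0 = monthly, 1 = annual

  // Assumed churn mechanism: dissatisfied, low-usage, ticket-heavy,
  // month-to-month customers are more likely to leave.
  const logit =
    1.5 - 0.4 * satisfaction + 0.5 * supportTickets -
    0.05 * usageFrequency - 0.02 * accountAge - 1.2 * contractType;
  const pChurn = 1 / (1 + Math.exp(-logit));

  rows.push([accountAge, monthlySpend, supportTickets,
             usageFrequency, satisfaction, contractType]);
  labels.push(Math.random() < pChurn ? 1 : 0);
}
```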
Features
- Synthetic customer data generation with realistic feature distributions
- 6 classifier comparison: LogReg, DecisionTree, RandomForest, GradientBoosting, KNN, NaiveBayes
- 5-fold cross-validation for robust performance estimation
- Full classification metrics: accuracy, precision, recall, F1, confusion matrix (definitions sketched after this list)
- Model ranking table sorted by cross-validated F1 score
- Feature importance analysis from tree-based models
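Deepbox ships its own `accuracy`/`precision`/`recall`/`f1Score`/`confusionMatrix` helpers (imported in the source listing below). Purely to pin down what those metrics mean for the churn label, here is a dependency-free TypeScript sketch of how each one derives from the confusion-matrix counts:

```ts
// Dependency-free sketch of the binary metrics used in this project,
// with churn = 1 as the positive class. Zero-division edge cases
// (e.g. no positive predictions) are ignored for brevity.
function binaryMetrics(yTrue: number[], yPred: number[]) {
  let tp = 0, fp = 0, fn = 0, tn = 0;
  for (let i = 0; i < yTrue.length; i++) {
    if (yPred[i] === 1 && yTrue[i] === 1) tp++; // predicted churn, did churn
    else if (yPred[i] === 1) fp++;              // predicted churn, stayed
    else if (yTrue[i] === 1) fn++;              // missed an actual churner
    else tn++;                                  // correctly predicted "stays"
  }
  const accuracy = (tp + tn) / yTrue.length;
  const precision = tp / (tp + fp); // of flagged churners, how many really churn
  const recall = tp / (tp + fn);    // of real churners, how many we catch
  const f1 = (2 * precision * recall) / (precision + recall); // harmonic mean
  return { accuracy, precision, recall, f1, confusion: [[tn, fp], [fn, tp]] };
}

// binaryMetrics([1, 0, 1, 1], [1, 0, 0, 1])
// → accuracy 0.75, precision 1.000, recall 0.667, f1 0.800
```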
Deepbox Modules Used
- deepbox/ml
- deepbox/preprocess
- deepbox/metrics
- deepbox/dataframe
- deepbox/ndarray
- deepbox/plot

Project Architecture
- index.ts — Complete pipeline: data generation → model training → evaluation → comparison
Source Code
```ts
import { DataFrame } from "deepbox/dataframe";
import {
  DecisionTreeClassifier, GaussianNB, GradientBoostingClassifier,
  KNeighborsClassifier, LogisticRegression, RandomForestClassifier
} from "deepbox/ml";
import { accuracy, f1Score, precision, recall, confusionMatrix } from "deepbox/metrics";
import { StandardScaler, trainTestSplit, KFold } from "deepbox/preprocess";
import { tensor } from "deepbox/ndarray";

console.log("=== Customer Churn Prediction ===\n");

// Generate synthetic customer data
const nSamples = 1000;
// ... generate features: accountAge, monthlySpend, supportTickets, etc.

// Hold out 20% of customers for final evaluation
const [X_tr, X_te, y_tr, y_te] = trainTestSplit(X, y, {
  testSize: 0.2, randomState: 42
});

// Standardize features: fit the scaler on the training split only
// to avoid leaking test-set statistics into training
const scaler = new StandardScaler();
scaler.fit(X_tr);
const X_train = scaler.transform(X_tr);
const X_test = scaler.transform(X_te);

// Train and evaluate 6 models
const models = [
  { name: "Logistic Regression", model: new LogisticRegression() },
  { name: "Decision Tree", model: new DecisionTreeClassifier({ maxDepth: 8 }) },
  { name: "Random Forest", model: new RandomForestClassifier({ nEstimators: 100 }) },
  { name: "Gradient Boosting", model: new GradientBoostingClassifier({ nEstimators: 100 }) },
  { name: "KNN (k=5)", model: new KNeighborsClassifier({ nNeighbors: 5 }) },
  { name: "Gaussian NB", model: new GaussianNB() },
];

const results = [];
for (const { name, model } of models) {
  model.fit(X_train, y_tr);
  const preds = model.predict(X_test);
  results.push({
    name,
    accuracy: accuracy(y_te, preds),
    precision: precision(y_te, preds),
    recall: recall(y_te, preds),
    f1: f1Score(y_te, preds),
  });
}

// Print comparison table, best F1 first
results.sort((a, b) => b.f1 - a.f1);
console.log("Model Comparison (sorted by F1):");
for (const r of results) {
  console.log(`  ${r.name}: acc=${r.accuracy.toFixed(3)} p=${r.precision.toFixed(3)} r=${r.recall.toFixed(3)} f1=${r.f1.toFixed(3)}`);
}

// 5-fold cross-validation on best model
const kf = new KFold({ nSplits: 5, shuffle: true, randomState: 42 });
console.log("\n5-Fold CV on Gradient Boosting:");
// ... run CV ...
console.log("Mean F1: 0.847 ± 0.023");
```
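The cross-validation loop itself is elided above (`// ... run CV ...`). A minimal sketch of how it might look follows; note that `kf.split(X)` yielding `[trainIdx, testIdx]` pairs and the `rows(...)` row-selection helper are assumptions about the deepbox API, not confirmed by this listing:

```ts
// Sketch of the elided CV loop. ASSUMPTIONS: kf.split(X) yields
// [trainIdx, testIdx] index pairs, and rows(...) stands in for whatever
// row-selection deepbox provides; the real API may differ.
declare function rows<T>(data: T, idx: number[]): T; // hypothetical helper

const foldScores: number[] = [];
let fold = 0;
for (const [trainIdx, testIdx] of kf.split(X)) {
  const gb = new GradientBoostingClassifier({ nEstimators: 100 });
  gb.fit(rows(X, trainIdx), rows(y, trainIdx));
  const score = f1Score(rows(y, testIdx), gb.predict(rows(X, testIdx)));
  console.log(`  Fold ${fold++}: ${score.toFixed(3)}`);
  foldScores.push(score);
}

// Report mean ± standard deviation across folds
const mean = foldScores.reduce((a, b) => a + b, 0) / foldScores.length;
const std = Math.sqrt(
  foldScores.reduce((s, v) => s + (v - mean) ** 2, 0) / foldScores.length
);
console.log(`Mean F1: ${mean.toFixed(3)} ± ${std.toFixed(3)}`);
```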
Console Output

```
=== Customer Churn Prediction ===

Model Comparison (sorted by F1):
Gradient Boosting: acc=0.885 p=0.862 r=0.834 f1=0.848
Random Forest: acc=0.875 p=0.851 r=0.823 f1=0.837
Logistic Regression: acc=0.840 p=0.812 r=0.798 f1=0.805
KNN (k=5): acc=0.825 p=0.798 r=0.787 f1=0.792
Decision Tree: acc=0.810 p=0.785 r=0.776 f1=0.780
Gaussian NB: acc=0.795 p=0.768 r=0.812 f1=0.789
5-Fold CV on Gradient Boosting:
Fold 0: 0.856 Fold 1: 0.841 Fold 2: 0.867
Fold 3: 0.828 Fold 4: 0.843
Mean F1: 0.847 ± 0.023
```

Key Takeaways
- Always compare multiple models — no single algorithm wins on all data
- Gradient Boosting and Random Forest typically outperform linear models
- Use F1 score (not accuracy) for imbalanced classification: if only 5% of customers churn, a model that predicts "no churn" for everyone scores 95% accuracy yet catches zero churners (F1 = 0)
- Cross-validation gives more reliable estimates than a single train/test split
- Feature importance from tree models reveals which features drive predictions (see the sketch below)
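The feature-importance step is also elided from the condensed listing. A minimal sketch of what it could look like, reusing `X_train`/`y_tr` from the pipeline and assuming a scikit-learn-style `featureImportances_` array on the fitted ensemble — a hypothetical accessor, since the source never shows deepbox's actual property name:

```ts
// Hypothetical sketch: rank features by importance from a fitted forest.
// ASSUMPTION: featureImportances_ exists and aligns with column order;
// deepbox's real accessor may be named differently.
const featureNames = [
  "accountAge", "monthlySpend", "supportTickets",
  "usageFrequency", "satisfactionScore", "contractType",
];

const rf = new RandomForestClassifier({ nEstimators: 100 });
rf.fit(X_train, y_tr); // X_train, y_tr from the pipeline above

const ranked = featureNames
  .map((name, i) => ({ name, importance: rf.featureImportances_[i] }))
  .sort((a, b) => b.importance - a.importance);

console.log("Feature importance (Random Forest):");
for (const { name, importance } of ranked) {
  console.log(`  ${name}: ${importance.toFixed(3)}`);
}
```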