This project predicts customer churn for a telecommunications company using machine learning. Starting from a customer dataset, it applies data preprocessing, exploratory data analysis (EDA), and several classification models to determine whether a customer is likely to churn.
Customer churn prediction is a critical problem for businesses, as retaining customers is more cost-effective than acquiring new ones. This project leverages various machine learning algorithms such as Decision Trees, Random Forest, and XGBoost to predict customer churn based on demographic and account information.
The workflow covers:
- Data Preprocessing: Handling missing values, encoding categorical variables, and addressing class imbalance.
- Exploratory Data Analysis (EDA): Visualizing and analyzing patterns and correlations in the data.
- Model Training: Implementing various machine learning algorithms and optimizing them for better performance.
- Model Evaluation: Using accuracy, confusion matrix, and classification report to evaluate model performance.
- Deployment: Saving the trained model and label encoders so they can later be reused for prediction.
The dataset used is the Telco Customer Churn dataset (WA_Fn-UseC_-Telco-Customer-Churn.csv), containing the following columns:
- Demographics: Gender, SeniorCitizen, Partner, Dependents
- Account Info: Tenure, MonthlyCharges, TotalCharges
- Churn: Target variable indicating whether the customer has churned (Yes/No)
Dataset Source: The dataset is publicly available on Kaggle.
- Data Preprocessing: pandas, numpy, SMOTE (imbalanced-learn), LabelEncoder (sklearn.preprocessing)
- Data Visualization: matplotlib, seaborn
- Machine Learning: scikit-learn (RandomForestClassifier, DecisionTreeClassifier), xgboost (XGBClassifier)
- Model Persistence: pickle
- Model Evaluation: Accuracy, Precision, Recall, F1-Score
- Data Loading: Loading the raw dataset into a Pandas DataFrame and inspecting its contents.
- Data Preprocessing: Cleaning and preparing the data, including handling missing values, encoding categorical features, and balancing the dataset using SMOTE (a minimal sketch follows this list).
- Exploratory Data Analysis (EDA): Visualizing distributions and relationships between variables to gain insights into the data.
- Model Training & Evaluation: Training various machine learning models, evaluating their performance using cross-validation, and selecting the best model (Random Forest).
- Model Deployment: Saving the trained model and the label encoders using pickle for future use.
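As a rough illustration of the preprocessing step referenced above, here is a minimal sketch. It assumes the standard Telco column names (`customerID`, `TotalCharges`, `Churn`); the exact steps in the notebook may differ.

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("data/WA_Fn-UseC_-Telco-Customer-Churn.csv")

# TotalCharges is read as a string and contains blanks for brand-new
# customers; coerce to numeric and fill the resulting NaNs with 0.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce").fillna(0)

# Label-encode every object-typed column except the identifier,
# keeping the fitted encoders so they can be pickled for inference.
encoders = {}
for col in df.select_dtypes(include="object").columns:
    if col == "customerID":
        continue
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

X = df.drop(columns=["customerID", "Churn"])
y = df["Churn"]

# Oversample the minority (Churn) class with SMOTE to balance the dataset.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
```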
/customer-churn-prediction
├── data/ # Dataset file(s)
│ └── WA_Fn-UseC_-Telco-Customer-Churn.csv
│
├── models/ # Saved models and encoders
│ ├── customer_churn_model.pkl
│ └── encoders.pkl
│
├── notebooks/ # Jupyter Notebook(s) with analysis and modeling
│ └── main.ipynb
│
└── README.md # Project documentation
- Clone the repository:
git clone <repository-url>
cd customer-churn-prediction
- Install dependencies:
pip install -r requirements.txt
- Run the Jupyter Notebook:
jupyter notebook notebooks/main.ipynb
The two models, Random Forest and XGBoost, were evaluated on the customer churn dataset. Given the same input data, they produced different predictions because of their distinct learning mechanisms.
- Random Forest:
  - Prediction: No Churn
  - Prediction probability: `[[0.83 0.17]]` (83% likelihood of No Churn)
- XGBoost:
  - Prediction: Churn
  - Prediction probability: `[[0.35321957 0.64678043]]` (64.7% likelihood of Churn)
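These outputs correspond to `predict` / `predict_proba` calls on the saved artifacts. A minimal inference sketch follows; the feature values are hypothetical, and the pickle filenames are taken from the project structure above.

```python
import pickle
import pandas as pd

# Load the model and label encoders saved during training.
with open("models/customer_churn_model.pkl", "rb") as f:
    model = pickle.load(f)
with open("models/encoders.pkl", "rb") as f:
    encoders = pickle.load(f)

# One hypothetical customer, using the Telco feature columns.
new_customer = pd.DataFrame([{
    "gender": "Female", "SeniorCitizen": 0, "Partner": "Yes", "Dependents": "No",
    "tenure": 12, "PhoneService": "Yes", "MultipleLines": "No",
    "InternetService": "Fiber optic", "OnlineSecurity": "No", "OnlineBackup": "No",
    "DeviceProtection": "No", "TechSupport": "No", "StreamingTV": "Yes",
    "StreamingMovies": "No", "Contract": "Month-to-month",
    "PaperlessBilling": "Yes", "PaymentMethod": "Electronic check",
    "MonthlyCharges": 70.35, "TotalCharges": 845.5,
}])

# Apply the saved label encoders to the categorical columns.
for col, enc in encoders.items():
    if col in new_customer.columns:
        new_customer[col] = enc.transform(new_customer[col])

print(model.predict(new_customer))        # e.g. [0] -> No Churn
print(model.predict_proba(new_customer))  # e.g. [[0.83 0.17]]
```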
Despite receiving identical input data, the two models reached different conclusions because of inherent differences in how they learn and make predictions:
- Different Models, Different Results:
  - Random Forest is a robust model that averages over multiple decision trees, making it less sensitive to noise but sometimes too conservative.
  - XGBoost is a gradient boosting model that focuses on correcting previous errors, which makes it more sensitive to specific patterns and often stronger on complex datasets.
- Why the Difference?:
  - Random Forest uses an ensemble of decision trees and makes decisions by majority voting.
  - XGBoost builds trees sequentially, with each tree correcting the previous one, which can lead to more aggressive predictions, especially for minority classes (e.g., Churn).
  - While Random Forest predicted No Churn, XGBoost predicted Churn with a higher probability. This suggests that XGBoost is more sensitive to the patterns associated with Churn, whereas Random Forest is more balanced.
- Hyperparameter Tuning 🔧:
  - Implement GridSearchCV or RandomizedSearchCV to find the best set of hyperparameters for both models (Random Forest and XGBoost). Hyperparameters like `max_depth`, `n_estimators`, `learning_rate`, and `subsample` can significantly impact model performance; see the sketch below.
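A minimal GridSearchCV sketch, assuming the `X_res`, `y_res` arrays from the preprocessing sketch above; the grid values are illustrative, not tuned results.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; expand ranges as compute budget allows.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",  # favour the minority (Churn) class over raw accuracy
    cv=5,
    n_jobs=-1,
)
search.fit(X_res, y_res)
print(search.best_params_, search.best_score_)
```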
- Model Selection 🔍:
  - Test additional models such as Logistic Regression, Support Vector Machines, or K-Nearest Neighbors. Evaluate each model's performance using metrics like accuracy, precision, recall, and F1-score to choose the best model for churn prediction; see the sketch below.
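A comparison loop along these lines could work, again assuming `X_res` and `y_res` from the preprocessing sketch:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "KNeighbors": KNeighborsClassifier(),
}
for name, clf in candidates.items():
    # F1 highlights minority-class performance better than accuracy alone.
    scores = cross_val_score(clf, X_res, y_res, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```

Note that SVC and KNeighbors are distance-based and usually benefit from feature scaling (e.g., a StandardScaler inside a Pipeline).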
- Downsampling ⚖️:
  - Try downsampling the majority class (No Churn) to balance the dataset, which may improve the model's performance on the minority class (Churn) by reducing the impact of class imbalance; see the sketch below.
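A simple pandas-based downsampling sketch, assuming `df` is the encoded DataFrame from the preprocessing step with `Churn` mapped to 0/1:

```python
import pandas as pd

majority = df[df["Churn"] == 0]  # No Churn
minority = df[df["Churn"] == 1]  # Churn

# Sample the majority class down to the minority class size, then shuffle.
majority_down = majority.sample(n=len(minority), random_state=42)
balanced = pd.concat([majority_down, minority]).sample(frac=1, random_state=42)
```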
- Address Overfitting 🛠️:
  - Try various techniques to mitigate overfitting (see the sketch below):
    - Pruning decision trees in Random Forest.
    - Using early stopping in XGBoost to halt training once model performance stops improving.
    - Applying regularization in XGBoost via the `lambda` and `alpha` parameters.
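An early-stopping and regularization sketch for XGBoost; the API shown is for xgboost ≥ 1.6 (where `early_stopping_rounds` moved into the constructor), it assumes `X_res` and `y_res` from above, and the parameter values are illustrative.

```python
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Hold out a validation set to monitor for early stopping.
X_train, X_val, y_train, y_val = train_test_split(
    X_res, y_res, test_size=0.2, stratify=y_res, random_state=42
)

model = XGBClassifier(
    n_estimators=1000,         # upper bound; early stopping trims it
    learning_rate=0.05,
    reg_lambda=1.0,            # L2 regularization (the "lambda" parameter)
    reg_alpha=0.1,             # L1 regularization (the "alpha" parameter)
    early_stopping_rounds=20,  # stop once validation logloss plateaus
    eval_metric="logloss",
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(model.best_iteration)
```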
- Stratified K-Fold Cross Validation 🔄:
  - Implement Stratified K-Fold Cross Validation to ensure that each fold has the same proportion of Churn and No Churn cases, which is especially important for imbalanced datasets.
  - This provides a more reliable estimate of model performance across different data splits; see the sketch below.
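A minimal sketch, run here on the original (unbalanced) `X`, `y` from the preprocessing step, since stratification matters most before any resampling:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Each fold keeps the original Churn / No Churn class proportions.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(random_state=42), X, y, cv=skf, scoring="f1"
)
print(f"F1 per fold: {scores}; mean: {scores.mean():.3f}")
```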