Customer Churn Prediction for Telco Company¶
Author: Guillaume EGU
Overview¶
This project aims to predict customer churn for a telecommunications company using supervised machine learning techniques. Customer churn prediction is crucial for businesses as it helps identify customers who are likely to discontinue their services, enabling proactive retention strategies and reducing revenue loss.
The dataset comes from IBM and contains information about telco customers, including demographics, services used, account information, and churn status. This analysis compares multiple machine learning algorithms on both original and resampled datasets to identify the most effective approach for churn prediction.
Data Source: https://community.ibm.com/community/user/blogs/steven-macko/2019/07/11/telco-customer-churn-1113
Project Objectives¶
- Perform comprehensive exploratory data analysis (EDA) to understand churn patterns
- Compare multiple machine learning algorithms for churn prediction
- Evaluate the impact of data resampling techniques (SMOTEENN) on model performance
- Provide actionable business insights based on confusion matrix analysis
Contents¶
- Import Libraries
- Functions
- EDA & Feature Engineering
- Models:
  - Logistic Regression
  - K-nearest Neighbors
  - Naive Bayes
  - Support Vector Machine
  - Decision Tree
  - Random Forest
  - XGBoost
  - Gradient Boosting
  - AdaBoost
- Model Comparison & Analysis
- Business Insights
- Conclusion
!pip install pandas
!pip install matplotlib
!pip install seaborn
!pip install scikit-learn
!pip install numpy
!pip install imblearn
!pip install xgboost
# Handle the data
import pandas as pd
import numpy as np
# Visualization
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
# Preprocessing and modeling
import sklearn
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from imblearn.combine import SMOTEENN
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
df = pd.read_csv('datasets/IBM-Telco-Customer-Churn.csv')
df.head()
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
5 rows × 21 columns
EDA & Feature Engineering¶
In this step, I want to understand the data, prepare it for analysis, and surface churn trends.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   customerID        7043 non-null   object
 1   gender            7043 non-null   object
 2   SeniorCitizen     7043 non-null   int64
 3   Partner           7043 non-null   object
 4   Dependents        7043 non-null   object
 5   tenure            7043 non-null   int64
 6   PhoneService      7043 non-null   object
 7   MultipleLines     7043 non-null   object
 8   InternetService   7043 non-null   object
 9   OnlineSecurity    7043 non-null   object
 10  OnlineBackup      7043 non-null   object
 11  DeviceProtection  7043 non-null   object
 12  TechSupport       7043 non-null   object
 13  StreamingTV       7043 non-null   object
 14  StreamingMovies   7043 non-null   object
 15  Contract          7043 non-null   object
 16  PaperlessBilling  7043 non-null   object
 17  PaymentMethod     7043 non-null   object
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object
 20  Churn             7043 non-null   object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
There are 2 integer columns, 1 float column, and 18 object columns, over 7043 rows. TotalCharges should be numeric, not object, so the column likely contains malformed entries.
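Before coercing, it's worth seeing which rows fail to parse; a minimal check (assuming the standard IBM file) shows they are blank strings belonging to brand-new customers:

bad = df[pd.to_numeric(df["TotalCharges"], errors='coerce').isna()]
print(len(bad))                          # 11 rows in the standard IBM file
print(bad[['tenure', 'TotalCharges']])   # all have tenure 0 and a blank TotalCharges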
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors='coerce')
df = df.dropna()
df.drop("customerID", axis=1, inplace=True)
Let's start by taking a look at the numerical columns.
df.describe()
| SeniorCitizen | tenure | MonthlyCharges | TotalCharges | |
|---|---|---|---|---|
| count | 7032.000000 | 7032.000000 | 7032.000000 | 7032.000000 |
| mean | 0.162400 | 32.421786 | 64.798208 | 2283.300441 |
| std | 0.368844 | 24.545260 | 30.085974 | 2266.771362 |
| min | 0.000000 | 1.000000 | 18.250000 | 18.800000 |
| 25% | 0.000000 | 9.000000 | 35.587500 | 401.450000 |
| 50% | 0.000000 | 29.000000 | 70.350000 | 1397.475000 |
| 75% | 0.000000 | 55.000000 | 89.862500 | 3794.737500 |
| max | 1.000000 | 72.000000 | 118.750000 | 8684.800000 |
Let's check the unique values of the categorical columns.
for col in df.columns:
    if df[col].dtype != 'int64' and df[col].dtype != 'float64':
        print(f"{col} : {df[col].unique()}")
gender : ['Female' 'Male']
Partner : ['Yes' 'No']
Dependents : ['No' 'Yes']
PhoneService : ['No' 'Yes']
MultipleLines : ['No phone service' 'No' 'Yes']
InternetService : ['DSL' 'Fiber optic' 'No']
OnlineSecurity : ['No' 'Yes' 'No internet service']
OnlineBackup : ['Yes' 'No' 'No internet service']
DeviceProtection : ['No' 'Yes' 'No internet service']
TechSupport : ['No' 'Yes' 'No internet service']
StreamingTV : ['No' 'Yes' 'No internet service']
StreamingMovies : ['No' 'Yes' 'No internet service']
Contract : ['Month-to-month' 'One year' 'Two year']
PaperlessBilling : ['Yes' 'No']
PaymentMethod : ['Electronic check' 'Mailed check' 'Bank transfer (automatic)' 'Credit card (automatic)']
Churn : ['No' 'Yes']
Let's verify no null values remain.
print(df.isnull().sum())
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
Now that we are sure the data are clean, let's visualize them.
# Churners in red, non-churners in green
colors = {'Yes': 'r', 'No': 'g'}
for i, predictor in enumerate(df.drop(columns=['Churn', 'TotalCharges', 'MonthlyCharges', 'tenure'])):
plt.figure(i, figsize=(5,3))
sns.countplot(data=df, x=predictor, hue='Churn', palette=colors)
plt.title(f'Distribution of {predictor} by Churn')
plt.show()
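The count plots show raw counts; to make trends comparable across categories, a normalized crosstab (a quick sketch) gives the churn rate within each level, e.g. by contract type:

# Churn rate within each contract type (rows sum to 1)
print(pd.crosstab(df['Contract'], df['Churn'], normalize='index').round(3))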
Now the same with the numerical features.
churned = df[df['Churn'] == 'Yes']
not_churned = df[df['Churn'] == 'No']
plt.figure(figsize=(10,6))
plt.hist([churned['tenure'], not_churned['tenure']], bins=10, color=['r','g'], label=['Churned', 'Not Churned'], alpha=0.7)
plt.xlabel('Tenure')
plt.ylabel('Frequency')
plt.title('Distribution of Tenure by Churn')
plt.legend()
plt.grid(axis='y', alpha=0.75, linestyle='--')
for rect in plt.gca().patches:
height = rect.get_height()
if height > 0:
plt.gca().text(rect.get_x() + rect.get_width() / 2, height + 5, f'{int(height)}', ha='center', va='bottom')
churned = df[df['Churn'] == 'Yes']
not_churned = df[df['Churn'] == 'No']
plt.figure(figsize=(10,6))
plt.hist([churned['MonthlyCharges'], not_churned['MonthlyCharges']], bins=10, color=['r','g'], label=['Churned', 'Not Churned'], alpha=0.7)
plt.xlabel('Monthly Charges')
plt.ylabel('Frequency')
plt.title('Distribution of Monthly Charges by Churn')
plt.legend()
plt.grid(axis='y', alpha=0.75, linestyle='--')
for rect in plt.gca().patches:
height = rect.get_height()
if height > 0:
plt.gca().text(rect.get_x() + rect.get_width() / 2, height + 5, f'{int(height)}', ha='center', va='bottom')
churned = df[df['Churn'] == 'Yes']
not_churned = df[df['Churn'] == 'No']
plt.figure(figsize=(10,5))
plt.hist([churned['TotalCharges'], not_churned['TotalCharges']], bins=10, color=['r','g'], label=['Churned', 'Not Churned'], alpha=0.7)
plt.xlabel('Total Charges')
plt.ylabel('Frequency')
plt.title('Distribution of Total Charges by Churn')
plt.legend()
plt.grid(axis='y', alpha=0.75, linestyle='--')
for rect in plt.gca().patches:
height = rect.get_height()
if height > 0:
plt.gca().text(rect.get_x() + rect.get_width() / 2, height + 5, f'{int(height)}', ha='center', va='bottom')
fig, axes = plt.subplots(1, 2, figsize=(14,6))
sns.kdeplot(data=df, x='MonthlyCharges', hue='Churn', fill=True, alpha=0.5, ax=axes[0])
axes[0].set_title('Density Plot of Monthly Charges by Churn')
axes[0].set_xlabel('Monthly Charges')
axes[0].set_ylabel('Density')
sns.kdeplot(data=df, x='TotalCharges', hue='Churn', fill=True, alpha=0.5, ax=axes[1])
axes[1].set_title('Density Plot of Total Charges by Churn')
axes[1].set_xlabel('Total Charges')
axes[1].set_ylabel('Density')
plt.show()
These plots suggest that some features, such as tenure and total charges, will carry more predictive weight than others.
Now, I will encode the categorical columns with OneHotEncoder.
categorical_cols = df.select_dtypes(include=['category','object']).columns
encoder = OneHotEncoder(sparse_output=False)
encoded_data = encoder.fit_transform(df[categorical_cols])
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(categorical_cols))
df.drop(columns=categorical_cols, inplace=True)
df.reset_index(drop=True, inplace=True)
df = pd.concat([df, encoded_df], axis=1)
df.drop('Churn_No', axis=1, inplace=True)
df.rename(columns={'Churn_Yes':'Churn'}, inplace=True)
df.head()
| SeniorCitizen | tenure | MonthlyCharges | TotalCharges | gender_Female | gender_Male | Partner_No | Partner_Yes | Dependents_No | Dependents_Yes | ... | Contract_Month-to-month | Contract_One year | Contract_Two year | PaperlessBilling_No | PaperlessBilling_Yes | PaymentMethod_Bank transfer (automatic) | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 29.85 | 29.85 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 0 | 34 | 56.95 | 1889.50 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0 | 2 | 53.85 | 108.15 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 3 | 0 | 45 | 42.30 | 1840.75 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0 | 2 | 70.70 | 151.65 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
5 rows × 46 columns
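One caveat: encoding both levels of binary columns (gender_Female and gender_Male, etc.) produces perfectly collinear dummy pairs. Tree ensembles are unaffected, but for the linear models a leaner encoding is possible; a sketch of the alternative (not what was run here):

# Keep a single indicator per binary column to avoid redundant dummy pairs
encoder = OneHotEncoder(drop='if_binary', sparse_output=False)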
X = df.drop('Churn', axis=1)
y = df['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
model_scores = []
# SMOTEENN = SMOTE oversampling followed by Edited Nearest Neighbours cleaning.
# Note: resampling the full dataset before splitting lets synthetic points leak
# into the test set, so accuracies on the resampled split are optimistic
# (see the leakage-free sketch below).
sm = SMOTEENN(random_state=42)
X_res, y_res = sm.fit_resample(X, y)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_res, y_res, test_size=0.2, random_state=42, stratify=y_res)
model_scores_US = []
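A leakage-free variant (a sketch, not what was run here) resamples only the training data by putting SMOTEENN inside an imblearn Pipeline, which applies the sampler at fit time but never at predict or score time:

# Leakage-free alternative (sketch): SMOTEENN runs on training folds only
from imblearn.pipeline import Pipeline as ImbPipeline

pipeline_nl = ImbPipeline([
    ('scaler', MinMaxScaler()),
    ('resample', SMOTEENN(random_state=42)),
    ('model', LogisticRegression(random_state=42))
])
pipeline_nl.fit(X_train, y_train)            # resampling applied to training data only
print(accuracy_score(y_test, pipeline_nl.predict(X_test)))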
model = LogisticRegression(random_state=42)
pipeline = Pipeline([
('scaler', MinMaxScaler()),
('model', model)
])
grid_search = GridSearchCV(pipeline, param_grid={
    'model__C': [0.01, 0.1, 1, 10, 100],
    # the default lbfgs solver only supports the 'l2' (or no) penalty;
    # 'l1' would require solver='liblinear' or 'saga'
    'model__penalty': ['l2']
}, cv=2)
grid_search.fit(X_train, y_train)
pipeline = grid_search.best_estimator_
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
model_scores.append(('Logistic Regression', accuracy, pipeline))
grid_search.fit(Xr_train, yr_train)
pipeline_r = grid_search.best_estimator_
pipeline_r.fit(Xr_train, yr_train)
y_pred_r = pipeline_r.predict(Xr_test)
accuracy_r = accuracy_score(yr_test, y_pred_r)
model_scores_US.append(('Logistic Regression', accuracy_r, pipeline_r))
model = KNeighborsClassifier()
pipeline = Pipeline([
('scaler', MinMaxScaler()),
('model', model)
])
grid_search = GridSearchCV(pipeline, param_grid={
'model__n_neighbors': [3, 5, 7, 9],
'model__weights': ['uniform', 'distance']
}, cv=2)
grid_search.fit(X_train, y_train)
pipeline = grid_search.best_estimator_
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
model_scores.append(('K-nearest Neighbors', accuracy, pipeline))
grid_search.fit(Xr_train, yr_train)
pipeline_r = grid_search.best_estimator_
pipeline_r.fit(Xr_train, yr_train)
y_pred_r = pipeline_r.predict(Xr_test)
accuracy_r = accuracy_score(yr_test, y_pred_r)
model_scores_US.append(('K-nearest Neighbors', accuracy_r, pipeline_r))
model = GaussianNB()
pipeline = Pipeline([
('scaler', MinMaxScaler()),
('model', model)
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
model_scores.append(('Gaussian Naive Bayes', accuracy, pipeline))
# GaussianNB has no grid to search; reusing grid_search here would refit the
# previous cell's model, so fit the Naive Bayes pipeline on the resampled data directly
pipeline_r = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', GaussianNB())
])
pipeline_r.fit(Xr_train, yr_train)
y_pred_r = pipeline_r.predict(Xr_test)
accuracy_r = accuracy_score(yr_test, y_pred_r)
model_scores_US.append(('Gaussian Naive Bayes', accuracy_r, pipeline_r))
model = SVC(random_state=42)
pipeline = Pipeline([
('scaler', MinMaxScaler()),
('model', model)
])
grid_search = GridSearchCV(pipeline, param_grid={
'model__C': [0.01, 0.1, 1, 10, 100],
'model__gamma': ['scale', 'auto']
}, cv=2)
grid_search.fit(X_train, y_train)
pipeline = grid_search.best_estimator_
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
model_scores.append(('SVC', accuracy, pipeline))
grid_search.fit(Xr_train, yr_train)
pipeline_r = grid_search.best_estimator_
pipeline_r.fit(Xr_train, yr_train)
y_pred_r = pipeline_r.predict(Xr_test)
accuracy_r = accuracy_score(yr_test, y_pred_r)
model_scores_US.append(('SVC', accuracy_r, pipeline_r))
model = DecisionTreeClassifier(random_state=42)
pipeline = Pipeline([
('scaler', MinMaxScaler()),
('model', model)
])
grid_search = GridSearchCV(pipeline, param_grid={
'model__max_depth': [3, 5, 7, 9, None],
'model__min_samples_split': [2, 5, 10]
}, cv=2)
grid_search.fit(X_train, y_train)
pipeline = grid_search.best_estimator_
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
model_scores.append(('Decision Tree', accuracy, pipeline))
grid_search.fit(Xr_train, yr_train)
pipeline_r = grid_search.best_estimator_
pipeline_r.fit(Xr_train, yr_train)
y_pred_r = pipeline_r.predict(Xr_test)
accuracy_r = accuracy_score(yr_test, y_pred_r)
model_scores_US.append(('Decision Tree', accuracy_r, pipeline_r))
model = RandomForestClassifier(random_state=42)
pipeline = Pipeline([
('scaler', MinMaxScaler()),
('model', model)
])
grid_search = GridSearchCV(pipeline, param_grid={
'model__n_estimators': [50, 100, 200, 300, 400, 500],
'model__max_depth': [None, 10, 20, 30, 40, 50]
}, cv=2)
grid_search.fit(X_train, y_train)
pipeline = grid_search.best_estimator_
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
model_scores.append(('Random Forest', accuracy, pipeline))
grid_search.fit(Xr_train, yr_train)
pipeline_r = grid_search.best_estimator_
pipeline_r.fit(Xr_train, yr_train)
y_pred_r = pipeline_r.predict(Xr_test)
accuracy_r = accuracy_score(yr_test, y_pred_r)
model_scores_US.append(('Random Forest', accuracy_r, pipeline_r))
model = XGBClassifier(random_state=42)
pipeline = Pipeline([
('scaler', MinMaxScaler()),
('model', model)
])
grid_search = GridSearchCV(pipeline, param_grid={
'model__n_estimators': [50, 100, 200, 300, 400, 500],
'model__learning_rate': [0.01, 0.1, 0.2, 0.3],
'model__max_depth': [3, 5, 7, 9]
}, cv=2)
grid_search.fit(X_train, y_train)
pipeline = grid_search.best_estimator_
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
model_scores.append(('XGBoost', accuracy, pipeline))
grid_search.fit(Xr_train, yr_train)
pipeline_r = grid_search.best_estimator_
pipeline_r.fit(Xr_train, yr_train)
y_pred_r = pipeline_r.predict(Xr_test)
accuracy_r = accuracy_score(yr_test, y_pred_r)
model_scores_US.append(('XGBoost', accuracy_r, pipeline_r))
model = GradientBoostingClassifier(random_state=42)
pipeline = Pipeline([
('scaler', MinMaxScaler()),
('model', model)
])
grid_search = GridSearchCV(pipeline, param_grid={
'model__n_estimators': [50, 100, 200, 300, 400, 500],
'model__learning_rate': [0.01, 0.1, 0.2, 0.3],
'model__max_depth': [3, 5, 7, 9]
}, cv=2)
grid_search.fit(X_train, y_train)
pipeline = grid_search.best_estimator_
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
model_scores.append(('Gradient Boosting', accuracy, pipeline))
grid_search.fit(Xr_train, yr_train)
pipeline_r = grid_search.best_estimator_
pipeline_r.fit(Xr_train, yr_train)
y_pred_r = pipeline_r.predict(Xr_test)
accuracy_r = accuracy_score(yr_test, y_pred_r)
model_scores_US.append(('Gradient Boosting', accuracy_r, pipeline_r))
# algorithm='SAMME' avoids the SAMME.R deprecation warnings emitted by scikit-learn >= 1.4
model = AdaBoostClassifier(algorithm='SAMME', random_state=42)
pipeline = Pipeline([
('scaler', MinMaxScaler()),
('model', model)
])
grid_search = GridSearchCV(pipeline, param_grid={
'model__n_estimators': [50, 100, 200, 300, 400, 500],
'model__learning_rate': [0.01, 0.1, 0.2, 0.3]
}, cv=2)
grid_search.fit(X_train, y_train)
pipeline = grid_search.best_estimator_
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
model_scores.append(('AdaBoost', accuracy, pipeline))
grid_search.fit(Xr_train, yr_train)
pipeline_r = grid_search.best_estimator_
pipeline_r.fit(Xr_train, yr_train)
y_pred_r = pipeline_r.predict(Xr_test)
accuracy_r = accuracy_score(yr_test, y_pred_r)
model_scores_US.append(('AdaBoost', accuracy_r, pipeline_r))
Model Comparison & Analysis¶
scores_df = pd.DataFrame(model_scores, columns=['Model', 'Accuracy', 'Pipeline'])
scores_df_US = pd.DataFrame(model_scores_US, columns=['Model', 'Accuracy', 'Pipeline'])
best_model = None
best_accuracy = 0
for name, accuracy, pipeline in model_scores:
print(f"{name} Accuracy: {accuracy:.4f}")
if accuracy > best_accuracy:
best_accuracy = accuracy
best_model = name
best_pipeline = pipeline
print(f"\nBest Model: {best_model} with Accuracy: {best_accuracy:.4f}")
print(f"Best Pipeline: {best_pipeline}")
best_model_US = None
best_accuracy_US = 0
for name, accuracy, pipeline in model_scores_US:
print(f"{name} Accuracy: {accuracy:.4f}")
if accuracy > best_accuracy_US:
best_accuracy_US = accuracy
best_model_US = name
best_pipeline_US = pipeline
print(f"\nBest Model: {best_model_US} with Accuracy: {best_accuracy_US:.4f}")
print(f"Best Pipeline: {best_pipeline_US}")
Logistic Regression Accuracy: 0.8024
K-nearest Neighbors Accuracy: 0.7584
Gaussian Naive Bayes Accuracy: 0.6823
SVC Accuracy: 0.7896
Decision Tree Accuracy: 0.7783
Random Forest Accuracy: 0.7910
XGBoost Accuracy: 0.7960
Gradient Boosting Accuracy: 0.7932
AdaBoost Accuracy: 0.7932
Best Model: Logistic Regression with Accuracy: 0.8024
Best Pipeline: Pipeline(steps=[('scaler', MinMaxScaler()),
('model', LogisticRegression(C=1, random_state=42))])
Logistic Regression Accuracy: 0.9190
K-nearest Neighbors Accuracy: 0.8858
Gaussian Naive Bayes Accuracy: 0.8858
SVC Accuracy: 0.9378
Decision Tree Accuracy: 0.9344
Random Forest Accuracy: 0.9471
XGBoost Accuracy: 0.9616
Gradient Boosting Accuracy: 0.9650
AdaBoost Accuracy: 0.9506
Best Model: Gradient Boosting with Accuracy: 0.9650
Best Pipeline: Pipeline(steps=[('scaler', MinMaxScaler()),
('model',
GradientBoostingClassifier(learning_rate=0.2, max_depth=5,
n_estimators=400,
random_state=42))])
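For a compact side-by-side view before plotting, the two score tables built above can be merged (a small sketch):

# Merge both runs into one table, sorted by resampled accuracy
comparison = scores_df[['Model', 'Accuracy']].merge(
    scores_df_US[['Model', 'Accuracy']], on='Model', suffixes=('_orig', '_smoteenn'))
print(comparison.sort_values('Accuracy_smoteenn', ascending=False))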
fig, axes = plt.subplots(1, 2, figsize=(20, 8))
colors1 = sns.color_palette('pastel', n_colors=len(scores_df))
colors2 = sns.color_palette('viridis', n_colors=len(scores_df_US))
ax1 = sns.barplot(x='Model', y='Accuracy', hue='Model', data=scores_df, palette=colors1, legend=False, ax=axes[0])
for p in ax1.patches:
ax1.annotate(f'{p.get_height():.3f}',
(p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', fontsize=9, color='black', xytext=(0, 5),
textcoords='offset points')
axes[0].set_title('Model Scores - Original Dataset', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Models', fontsize=12)
axes[0].set_ylabel('Accuracy', fontsize=12)
axes[0].tick_params(axis='x', rotation=45)
axes[0].set_ylim(0, 1)
axes[0].grid(axis='y', linestyle='--', alpha=0.7)
ax2 = sns.barplot(x='Model', y='Accuracy', hue='Model', data=scores_df_US, palette=colors2, legend=False, ax=axes[1])
for p in ax2.patches:
ax2.annotate(f'{p.get_height():.3f}',
(p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', fontsize=9, color='black', xytext=(0, 5),
textcoords='offset points')
axes[1].set_title('Model Scores - SMOTEENN Resampled Dataset', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Models', fontsize=12)
axes[1].set_ylabel('Accuracy', fontsize=12)
axes[1].tick_params(axis='x', rotation=45)
axes[1].set_ylim(0, 1)
axes[1].grid(axis='y', linestyle='--', alpha=0.7)
plt.suptitle('Comparison of Model Performance: Original vs SMOTEENN Resampled Data',
fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
print("\n" + "="*80)
print("PERFORMANCE COMPARISON SUMMARY")
print("="*80)
print(f"{'Model':<20} {'Original':<12} {'SMOTEENN':<12} {'Difference':<12}")
print("-"*80)
for i in range(len(scores_df)):
model_name = scores_df.iloc[i]['Model']
original_acc = scores_df.iloc[i]['Accuracy']
smoteenn_acc = scores_df_US.iloc[i]['Accuracy']
difference = smoteenn_acc - original_acc
print(f"{model_name:<20} {original_acc:<12.4f} {smoteenn_acc:<12.4f} {difference:+.4f}")
print("-"*80)
================================================================================
PERFORMANCE COMPARISON SUMMARY
================================================================================
Model                Original     SMOTEENN     Difference
--------------------------------------------------------------------------------
Logistic Regression  0.8024       0.9190       +0.1166
K-nearest Neighbors  0.7584       0.8858       +0.1274
Gaussian Naive Bayes 0.6823       0.8858       +0.2035
SVC                  0.7896       0.9378       +0.1481
Decision Tree        0.7783       0.9344       +0.1561
Random Forest        0.7910       0.9471       +0.1561
XGBoost              0.7960       0.9616       +0.1656
Gradient Boosting    0.7932       0.9650       +0.1719
AdaBoost             0.7932       0.9506       +0.1574
--------------------------------------------------------------------------------
# Score the best resampled pipeline rather than whichever model was fitted last
y_pred_best = best_pipeline_US.predict(Xr_test)
conf_matrix = confusion_matrix(yr_test, y_pred_best)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", cbar=True)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()
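Accuracy alone hides the per-class tradeoff that matters for churn; classification_report (already imported above) breaks out precision and recall for each class:

# Recall on the churn class = share of actual churners the model catches
print(classification_report(yr_test, y_pred_best, target_names=['No churn', 'Churn']))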
Business Insights¶
False Negatives: Customers who will churn but are not detected → Revenue loss
False Positives: Loyal customers misclassified → Unnecessary retention costs
Business Impact:
FN Cost = Lost customers × Customer lifetime value
FP Cost = Retention actions × Unit cost per action
Key Insights:
- High FN: Missing potential churners leads to direct revenue loss
- High FP: Wasting resources on customers who wouldn't churn anyway
- Optimal balance: Minimize total cost (FN + FP) for maximum ROI
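A minimal sketch of this cost calculation from the confusion matrix above; the 500 lifetime value and 50 per-action cost below are illustrative assumptions, not figures from the data:

# Expected cost of the model's errors (illustrative unit costs)
tn, fp, fn, tp = conf_matrix.ravel()  # sklearn orders the matrix [[TN, FP], [FN, TP]]
CLV = 500             # hypothetical customer lifetime value per lost churner
RETENTION_COST = 50   # hypothetical cost of one retention action
print(f"FN cost: {fn * CLV}, FP cost: {fp * RETENTION_COST}, total: {fn * CLV + fp * RETENTION_COST}")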
Conclusion¶
Key Findings
This comprehensive analysis of customer churn prediction for the Telco company has yielded several important insights:
- Model Performance Comparison
- Best Original Dataset Model: Logistic Regression (C=1, with MinMax scaling) performed best on the original dataset, reaching 80.2% accuracy
- SMOTEENN Impact: Resampling with SMOTEENN raised measured accuracy for every algorithm (+0.12 to +0.20), though part of that gain is optimistic because resampling was applied before the train/test split
- Algorithm Comparison: Ensemble methods (Random Forest, XGBoost, Gradient Boosting) led on the resampled data, while Logistic Regression was strongest on the original data
- Data Insights
- Feature Importance: Tenure, total charges, and monthly charges emerged as key predictors of customer churn
- Customer Patterns: Clear patterns were identified in the distribution of churned vs. retained customers across different service categories
- Data Quality: Successfully handled missing values and data type inconsistencies in the original dataset
- Business Impact
- Cost Analysis: The confusion matrix analysis revealed the financial implications of false positives and false negatives
- Actionable Insights: The model can help identify high-risk customers for targeted retention campaigns
- ROI Optimization: By balancing precision and recall, the company can optimize retention spending
Recommendations
- Model Deployment
- Deploy the best-performing model (Gradient Boosting with SMOTEENN in this run) for production use
- Implement real-time scoring for new customers and regular batch predictions for existing customers
- Set up model monitoring and retraining pipelines to maintain performance over time
- Business Strategy
- Proactive Retention: Use model predictions to identify at-risk customers before they churn
- Targeted Campaigns: Develop personalized retention offers based on customer segments and risk scores
- Feature Monitoring: Track key churn indicators (tenure, charges, service usage) to identify early warning signs
- Technical Improvements
- Feature Engineering: Explore additional features like customer interaction history, payment patterns, and service usage trends
- Advanced Techniques: Consider deep learning models or ensemble stacking for potentially better performance
- Real-time Implementation: Develop APIs for real-time churn prediction integration with CRM systems
Business Value
This churn prediction model provides significant business value:
- Revenue Protection: Early identification of at-risk customers enables proactive retention, potentially saving thousands in lost revenue
- Cost Optimization: Targeted retention efforts reduce wasted marketing spend on loyal customers
- Customer Insights: Understanding churn drivers helps improve overall customer experience and service offerings
- Competitive Advantage: Data-driven retention strategies provide a significant advantage in the competitive telecom market
Future Work
- Advanced Analytics: Implement customer lifetime value prediction to prioritize retention efforts
- Segmentation: Develop churn models for specific customer segments (high-value, new customers, etc.)
- Real-time Features: Incorporate real-time behavioral data for more accurate predictions
- A/B Testing: Implement controlled experiments to measure the effectiveness of retention strategies
This project demonstrates the power of machine learning in solving real business problems and provides a solid foundation for implementing a comprehensive customer retention strategy.