Customer Churn Prediction for Telco Company¶

Author: Guillaume EGU

Overview¶

This project aims to predict customer churn for a telecommunications company using supervised machine learning techniques. Customer churn prediction is crucial for businesses as it helps identify customers who are likely to discontinue their services, enabling proactive retention strategies and reducing revenue loss.

The dataset comes from IBM and contains information about telco customers, including demographics, services used, account information, and churn status. This analysis compares multiple machine learning algorithms on both original and resampled datasets to identify the most effective approach for churn prediction.

Data Source: https://community.ibm.com/community/user/blogs/steven-macko/2019/07/11/telco-customer-churn-1113

Project Objectives¶

  • Perform comprehensive exploratory data analysis (EDA) to understand churn patterns
  • Compare multiple machine learning algorithms for churn prediction
  • Evaluate the impact of data resampling techniques (SMOTEENN) on model performance
  • Provide actionable business insights based on confusion matrix analysis

Contents¶

  • Import Libraries
  • Functions
  • EDA & Feature Engineering
  • Models :
    • Logistic Regression
    • K-nearest neighbors
    • Naive Bayes
    • Support Vector Machine
    • Decision Tree
    • Random Forest
    • XGBoost
    • Gradient Boosting
    • AdaBoost
  • Model Comparison & Analysis
  • Business Insights
  • Conclusion

Import Libraries¶

Installation and importation of key libraries


In [4]:
!pip install pandas
!pip install matplotlib
!pip install seaborn
!pip install scikit-learn
!pip install numpy
!pip install imblearn
!pip install xgboost
Requirement already satisfied: pandas in c:\users\guill\anaconda3\lib\site-packages (2.2.2)
Requirement already satisfied: numpy>=1.26.0 in c:\users\guill\anaconda3\lib\site-packages (from pandas) (1.26.4)
Requirement already satisfied: python-dateutil>=2.8.2 in c:\users\guill\anaconda3\lib\site-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in c:\users\guill\anaconda3\lib\site-packages (from pandas) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in c:\users\guill\anaconda3\lib\site-packages (from pandas) (2023.3)
Requirement already satisfied: six>=1.5 in c:\users\guill\anaconda3\lib\site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Requirement already satisfied: matplotlib in c:\users\guill\anaconda3\lib\site-packages (3.8.4)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\guill\anaconda3\lib\site-packages (from matplotlib) (1.2.0)
Requirement already satisfied: cycler>=0.10 in c:\users\guill\anaconda3\lib\site-packages (from matplotlib) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\guill\anaconda3\lib\site-packages (from matplotlib) (4.51.0)
Requirement already satisfied: kiwisolver>=1.3.1 in c:\users\guill\anaconda3\lib\site-packages (from matplotlib) (1.4.4)
Requirement already satisfied: numpy>=1.21 in c:\users\guill\anaconda3\lib\site-packages (from matplotlib) (1.26.4)
Requirement already satisfied: packaging>=20.0 in c:\users\guill\anaconda3\lib\site-packages (from matplotlib) (23.2)
Requirement already satisfied: pillow>=8 in c:\users\guill\anaconda3\lib\site-packages (from matplotlib) (10.3.0)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\guill\anaconda3\lib\site-packages (from matplotlib) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\guill\anaconda3\lib\site-packages (from matplotlib) (2.9.0.post0)
Requirement already satisfied: six>=1.5 in c:\users\guill\anaconda3\lib\site-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)
Requirement already satisfied: seaborn in c:\users\guill\anaconda3\lib\site-packages (0.13.2)
Requirement already satisfied: numpy!=1.24.0,>=1.20 in c:\users\guill\anaconda3\lib\site-packages (from seaborn) (1.26.4)
Requirement already satisfied: pandas>=1.2 in c:\users\guill\anaconda3\lib\site-packages (from seaborn) (2.2.2)
Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in c:\users\guill\anaconda3\lib\site-packages (from seaborn) (3.8.4)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\guill\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.2.0)
Requirement already satisfied: cycler>=0.10 in c:\users\guill\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\guill\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (4.51.0)
Requirement already satisfied: kiwisolver>=1.3.1 in c:\users\guill\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.4.4)
Requirement already satisfied: packaging>=20.0 in c:\users\guill\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (23.2)
Requirement already satisfied: pillow>=8 in c:\users\guill\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (10.3.0)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\guill\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\guill\anaconda3\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in c:\users\guill\anaconda3\lib\site-packages (from pandas>=1.2->seaborn) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in c:\users\guill\anaconda3\lib\site-packages (from pandas>=1.2->seaborn) (2023.3)
Requirement already satisfied: six>=1.5 in c:\users\guill\anaconda3\lib\site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.4->seaborn) (1.16.0)
Requirement already satisfied: scikit-learn in c:\users\guill\anaconda3\lib\site-packages (1.4.2)
Requirement already satisfied: numpy>=1.19.5 in c:\users\guill\anaconda3\lib\site-packages (from scikit-learn) (1.26.4)
Requirement already satisfied: scipy>=1.6.0 in c:\users\guill\anaconda3\lib\site-packages (from scikit-learn) (1.13.1)
Requirement already satisfied: joblib>=1.2.0 in c:\users\guill\anaconda3\lib\site-packages (from scikit-learn) (1.4.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\guill\anaconda3\lib\site-packages (from scikit-learn) (2.2.0)
Requirement already satisfied: numpy in c:\users\guill\anaconda3\lib\site-packages (1.26.4)
Collecting imblearn
  Using cached imblearn-0.0-py2.py3-none-any.whl.metadata (355 bytes)
Requirement already satisfied: imbalanced-learn in c:\users\guill\anaconda3\lib\site-packages (from imblearn) (0.12.3)
Requirement already satisfied: numpy>=1.17.3 in c:\users\guill\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.26.4)
Requirement already satisfied: scipy>=1.5.0 in c:\users\guill\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.13.1)
Requirement already satisfied: scikit-learn>=1.0.2 in c:\users\guill\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.4.2)
Requirement already satisfied: joblib>=1.1.1 in c:\users\guill\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (1.4.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\guill\anaconda3\lib\site-packages (from imbalanced-learn->imblearn) (2.2.0)
Using cached imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Installing collected packages: imblearn
Successfully installed imblearn-0.0
Requirement already satisfied: xgboost in c:\users\guill\anaconda3\lib\site-packages (3.0.5)
Requirement already satisfied: numpy in c:\users\guill\anaconda3\lib\site-packages (from xgboost) (1.26.4)
Requirement already satisfied: scipy in c:\users\guill\anaconda3\lib\site-packages (from xgboost) (1.13.1)
In [5]:
# Handle the data 
import pandas as pd
import numpy as np

# Visualization
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing and modeling
import sklearn
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from imblearn.combine import SMOTEENN

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier ,RandomForestRegressor
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, mean_absolute_error, mean_squared_error, r2_score

Functions¶

Helper functions


In [ ]:
 

Data Load¶

Cell for loading the data in a single dataframe.


In [6]:
df = pd.read_csv('datasets/IBM-Telco-Customer-Churn.csv')
df.head()
Out[6]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No ... No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes ... Yes No No No One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes ... No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes ... Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No ... No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes

5 rows × 21 columns


EDA and Feature engineering¶

In this step, I want to understand the data, prepare it for analyse and understand trends.


In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
 17  PaymentMethod     7043 non-null   object 
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object 
 20  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB

We understand there is 2 Int columns, 1 float column and 18 object columns. The dimensions are 21x7043 . Total charges should be int or float but not object => maybe there some errors in the dataset.

In [8]:
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors='coerce')
df = df.dropna()
df.drop("customerID", axis=1, inplace=True)

Let's start to take a look at numerical columns

In [9]:
df.describe()
Out[9]:
SeniorCitizen tenure MonthlyCharges TotalCharges
count 7032.000000 7032.000000 7032.000000 7032.000000
mean 0.162400 32.421786 64.798208 2283.300441
std 0.368844 24.545260 30.085974 2266.771362
min 0.000000 1.000000 18.250000 18.800000
25% 0.000000 9.000000 35.587500 401.450000
50% 0.000000 29.000000 70.350000 1397.475000
75% 0.000000 55.000000 89.862500 3794.737500
max 1.000000 72.000000 118.750000 8684.800000

Let's check unique values

In [10]:
for col in df.columns :
    if df[col].dtype != 'int64' and df[col].dtype != 'float64' :
        print(f"{col} : {df[col].unique()}")
gender : ['Female' 'Male']
Partner : ['Yes' 'No']
Dependents : ['No' 'Yes']
PhoneService : ['No' 'Yes']
MultipleLines : ['No phone service' 'No' 'Yes']
InternetService : ['DSL' 'Fiber optic' 'No']
OnlineSecurity : ['No' 'Yes' 'No internet service']
OnlineBackup : ['Yes' 'No' 'No internet service']
DeviceProtection : ['No' 'Yes' 'No internet service']
TechSupport : ['No' 'Yes' 'No internet service']
StreamingTV : ['No' 'Yes' 'No internet service']
StreamingMovies : ['No' 'Yes' 'No internet service']
Contract : ['Month-to-month' 'One year' 'Two year']
PaperlessBilling : ['Yes' 'No']
PaymentMethod : ['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
 'Credit card (automatic)']
Churn : ['No' 'Yes']

Let's verify null values

In [11]:
print(df.isnull().sum())
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

Now that we are sure the data are right. Let's visualize them.

In [12]:
colors = {'Yes':'r', 'No':'g'}
palette = {0:'g', 1:'r'}

for i, predictor in enumerate(df.drop(columns=['Churn', 'TotalCharges', 'MonthlyCharges', 'tenure'])):
    plt.figure(i, figsize=(5,3))
    sns.countplot(data=df, x=predictor, hue='Churn', palette=colors)
    plt.title(f'Distribution of {predictor} by Churn')
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

With numerical data

In [13]:
churned = df[df['Churn'] == 'Yes']
not_churned = df[df['Churn'] == 'No']

plt.figure(figsize=(10,6))
plt.hist([churned['tenure'], not_churned['tenure']], bins=10, color=['r','g'], label=['Churned', 'Not Churned'], alpha=0.7)
plt.xlabel('Tenure')
plt.ylabel('Frequency')
plt.title('Distribution of Tenure by Churn')
plt.legend()
plt.grid(axis='y', alpha=0.75, linestyle='--')

for rect in plt.gca().patches:
    height = rect.get_height()
    if height > 0:
        plt.gca().text(rect.get_x() + rect.get_width() / 2, height + 5, f'{int(height)}', ha='center', va='bottom')
No description has been provided for this image
In [14]:
churned= df[df['Churn'] == 'Yes']
not_churned = df[df['Churn'] == 'No']

plt.figure(figsize=(10,6))
plt.hist([churned['MonthlyCharges'], not_churned['MonthlyCharges']], bins=10, color=['r','g'], label=['Churned', 'Not Churned'], alpha=0.7)
plt.xlabel('Monthly Charges')
plt.ylabel('Frequency')
plt.title('Distribution of Monthly Charges by Churn')
plt.legend()
plt.grid(axis='y', alpha=0.75, linestyle='--')

for rect in plt.gca().patches:
    height = rect.get_height()
    if height > 0:
        plt.gca().text(rect.get_x() + rect.get_width() / 2, height + 5, f'{int(height)}', ha='center', va='bottom')
No description has been provided for this image
In [15]:
churned = df[df['Churn'] == 'Yes']
not_churned = df[df['Churn'] == 'No']

plt.figure(figsize=(10,5))
plt.hist([churned['TotalCharges'], not_churned['TotalCharges']], bins=10, color=['r','g'], label=['Churned', 'Not Churned'], alpha=0.7)
plt.xlabel('Total Charges')
plt.ylabel('Frequency')
plt.title('Distribution of Total Charges by Churn')
plt.legend()
plt.grid(axis='y', alpha=0.75, linestyle='--')
for rect in plt.gca().patches:
    height = rect.get_height()
    if height > 0:
        plt.gca().text(rect.get_x() + rect.get_width() / 2, height + 5, f'{int(height)}', ha='center', va='bottom')
No description has been provided for this image
In [16]:
fig, axes = plt.subplots(1, 2, figsize=(14,6))

sns.kdeplot(data=df, x='MonthlyCharges', hue='Churn', fill=True, alpha=0.5, ax=axes[0])
axes[0].set_title('Density Plot of Monthly Charges by Churn')
axes[0].set_xlabel('Monthly Charges')
axes[0].set_ylabel('Density')

sns.kdeplot(data=df, x='TotalCharges', hue='Churn', fill=True, alpha=0.5, ax=axes[1])
axes[1].set_title('Density Plot of Total Charges by Churn')
axes[1].set_xlabel('Total Charges')
axes[1].set_ylabel('Density')

plt.show()
No description has been provided for this image

We understand that some parameter like tenure or total charges will have more importance than others features.

Now, I will encode the data thanks to OneHotEncoder

In [17]:
categorical_cols = df.select_dtypes(include=['category','object']).columns

encoder = OneHotEncoder(sparse_output=False)
encoded_data = encoder.fit_transform(df[categorical_cols])

encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(categorical_cols))

df.drop(columns=categorical_cols, inplace=True)

df.reset_index(drop=True, inplace=True)

df = pd.concat([df, encoded_df], axis=1)
In [18]:
df.drop('Churn_No', axis=1, inplace=True)
df.rename(columns={'Churn_Yes':'Churn'}, inplace=True)
In [19]:
df.head()
Out[19]:
SeniorCitizen tenure MonthlyCharges TotalCharges gender_Female gender_Male Partner_No Partner_Yes Dependents_No Dependents_Yes ... Contract_Month-to-month Contract_One year Contract_Two year PaperlessBilling_No PaperlessBilling_Yes PaymentMethod_Bank transfer (automatic) PaymentMethod_Credit card (automatic) PaymentMethod_Electronic check PaymentMethod_Mailed check Churn
0 0 1 29.85 29.85 1.0 0.0 0.0 1.0 1.0 0.0 ... 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
1 0 34 56.95 1889.50 0.0 1.0 1.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
2 0 2 53.85 108.15 0.0 1.0 1.0 0.0 1.0 0.0 ... 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 1.0
3 0 45 42.30 1840.75 0.0 1.0 1.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0
4 0 2 70.70 151.65 1.0 0.0 1.0 0.0 1.0 0.0 ... 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0

5 rows × 46 columns

In [20]:
X = df.drop('Churn', axis=1)
y = df['Churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

model_scores = []
In [21]:
sm = SMOTEENN()
X_res, y_res = sm.fit_resample(X, y)

Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_res, y_res, test_size=0.2)
model_scores_US = []

Logistic Regression¶


In [22]:
model = LogisticRegression(random_state=42)

pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', model)
])

grid_search = GridSearchCV(pipeline, param_grid={
    'model__C': [0.01, 0.1, 1, 10, 100],
    'model__penalty': ['l1', 'l2']
}, cv=2)


grid_search.fit(X_train, y_train)
pipeline = grid_search.best_estimator_

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
model_scores.append(('Logistic Regression', accuracy, pipeline))


grid_search.fit(Xr_train, yr_train)
pipeline_r = grid_search.best_estimator_

pipeline_r.fit(Xr_train, yr_train)
y_pred_r = pipeline_r.predict(Xr_test)
accuracy_r = accuracy_score(yr_test, y_pred_r)
model_scores_US.append(('Logistic Regression', accuracy_r, pipeline_r))
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py:547: FitFailedWarning: 
10 fits failed out of a total of 20.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\guill\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\guill\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\guill\anaconda3\Lib\site-packages\sklearn\pipeline.py", line 475, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
  File "c:\Users\guill\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\guill\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1172, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\guill\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 67, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or None penalties, got l1 penalty.

  warnings.warn(some_fits_failed_message, FitFailedWarning)
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\model_selection\_search.py:1051: UserWarning: One or more of the test scores are non-finite: [       nan 0.79288778        nan 0.80195523        nan 0.80302184
        nan 0.8028439         nan 0.80248828]
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py:547: FitFailedWarning: 
10 fits failed out of a total of 20.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\guill\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\guill\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\guill\anaconda3\Lib\site-packages\sklearn\pipeline.py", line 475, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
  File "c:\Users\guill\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\guill\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1172, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\guill\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 67, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or None penalties, got l1 penalty.

  warnings.warn(some_fits_failed_message, FitFailedWarning)
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\model_selection\_search.py:1051: UserWarning: One or more of the test scores are non-finite: [       nan 0.88701805        nan 0.90215383        nan 0.91132009
        nan 0.91515713        nan 0.91600928]
  warnings.warn(

K-nearest Regression¶


In [23]:
model = KNeighborsClassifier()

pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', model)
])

grid_search = GridSearchCV(pipeline, param_grid={
    'model__n_neighbors': [3, 5, 7, 9],
    'model__weights': ['uniform', 'distance']
}, cv=2)


grid_search.fit(X_train, y_train)
pipeline = grid_search.best_estimator_

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
model_scores.append(('K-nearest Neighbors', accuracy, pipeline))


grid_search.fit(Xr_train, yr_train)
pipeline_r = grid_search.best_estimator_

pipeline_r.fit(Xr_train, yr_train)
y_pred_r = pipeline_r.predict(Xr_test)
accuracy_r = accuracy_score(yr_test, y_pred_r)
model_scores_US.append(('K-nearest Neighbors', accuracy_r, pipeline_r))

Naive Bayes¶


In [24]:
model = GaussianNB()

pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', model)
])


pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
model_scores.append(('Gaussian Naive Bayes', accuracy, pipeline))


grid_search.fit(Xr_train, yr_train)
pipeline_r = grid_search.best_estimator_

pipeline_r.fit(Xr_train, yr_train)
y_pred_r = pipeline_r.predict(Xr_test)
accuracy_r = accuracy_score(yr_test, y_pred_r)
model_scores_US.append(('Gaussian Naive Bayes', accuracy_r, pipeline_r))

Support Vector Machine¶


In [25]:
model = SVC(random_state=42)

pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', model)
])

grid_search = GridSearchCV(pipeline, param_grid={
    'model__C': [0.01, 0.1, 1, 10, 100],
    'model__gamma': ['scale', 'auto']
}, cv=2)


grid_search.fit(X_train, y_train)
pipeline = grid_search.best_estimator_

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
model_scores.append(('SVC', accuracy, pipeline))


grid_search.fit(Xr_train, yr_train)
pipeline_r = grid_search.best_estimator_

pipeline_r.fit(Xr_train, yr_train)
y_pred_r = pipeline_r.predict(Xr_test)
accuracy_r = accuracy_score(yr_test, y_pred_r)
model_scores_US.append(('SVC', accuracy_r, pipeline_r))

Decision tree¶


In [26]:
model = DecisionTreeClassifier(random_state=42)

pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', model)
])

grid_search = GridSearchCV(pipeline, param_grid={
    'model__max_depth': [3, 5, 7, 9, None],
    'model__min_samples_split': [2, 5, 10]
}, cv=2)


grid_search.fit(X_train, y_train)
pipeline = grid_search.best_estimator_

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
model_scores.append(('Decision Tree', accuracy, pipeline))


grid_search.fit(Xr_train, yr_train)
pipeline_r = grid_search.best_estimator_

pipeline_r.fit(Xr_train, yr_train)
y_pred_r = pipeline_r.predict(Xr_test)
accuracy_r = accuracy_score(yr_test, y_pred_r)
model_scores_US.append(('Decision Tree', accuracy_r, pipeline_r))

Random Forest¶


In [27]:
model = RandomForestClassifier(random_state=42)

pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', model)
])

grid_search = GridSearchCV(pipeline, param_grid={
    'model__n_estimators': [50, 100, 200, 300, 400, 500],
    'model__max_depth': [None, 10, 20, 30, 40, 50]
}, cv=2)


grid_search.fit(X_train, y_train)
pipeline = grid_search.best_estimator_

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
model_scores.append(('Random Forest', accuracy, pipeline))


grid_search.fit(Xr_train, yr_train)
pipeline_r = grid_search.best_estimator_

pipeline_r.fit(Xr_train, yr_train)
y_pred_r = pipeline_r.predict(Xr_test)
accuracy_r = accuracy_score(yr_test, y_pred_r)
model_scores_US.append(('Random Forest', accuracy_r, pipeline_r))

XGBoost¶


In [28]:
model = XGBClassifier(random_state=42)

pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', model)
])

grid_search = GridSearchCV(pipeline, param_grid={
    'model__n_estimators': [50, 100, 200, 300, 400, 500],
    'model__learning_rate': [0.01, 0.1, 0.2, 0.3],
    'model__max_depth': [3, 5, 7, 9]
}, cv=2)


grid_search.fit(X_train, y_train)
pipeline = grid_search.best_estimator_

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
model_scores.append(('XGBoost', accuracy, pipeline))


grid_search.fit(Xr_train, yr_train)
pipeline_r = grid_search.best_estimator_

pipeline_r.fit(Xr_train, yr_train)
y_pred_r = pipeline_r.predict(Xr_test)
accuracy_r = accuracy_score(yr_test, y_pred_r)
model_scores_US.append(('XGBoost', accuracy_r, pipeline_r))

Gradient Boosting¶


In [29]:
model = GradientBoostingClassifier(random_state=42)

pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', model)
])

grid_search = GridSearchCV(pipeline, param_grid={
    'model__n_estimators': [50, 100, 200, 300, 400, 500],
    'model__learning_rate': [0.01, 0.1, 0.2, 0.3],
    'model__max_depth': [3, 5, 7, 9]
}, cv=2)


grid_search.fit(X_train, y_train)
pipeline = grid_search.best_estimator_

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
model_scores.append(('Gradient Boosting', accuracy, pipeline))


grid_search.fit(Xr_train, yr_train)
pipeline_r = grid_search.best_estimator_

pipeline_r.fit(Xr_train, yr_train)
y_pred_r = pipeline_r.predict(Xr_test)
accuracy_r = accuracy_score(yr_test, y_pred_r)
model_scores_US.append(('Gradient Boosting', accuracy_r, pipeline_r))

AdaBoost¶


In [30]:
model = AdaBoostClassifier(random_state=42)

pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', model)
])

grid_search = GridSearchCV(pipeline, param_grid={
    'model__n_estimators': [50, 100, 200, 300, 400, 500],
    'model__learning_rate': [0.01, 0.1, 0.2, 0.3]    
}, cv=2)


grid_search.fit(X_train, y_train)
pipeline = grid_search.best_estimator_

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
model_scores.append(('AdaBoost', accuracy, pipeline))


grid_search.fit(Xr_train, yr_train)
pipeline_r = grid_search.best_estimator_

pipeline_r.fit(Xr_train, yr_train)
y_pred_r = pipeline_r.predict(Xr_test)
accuracy_r = accuracy_score(yr_test, y_pred_r)
model_scores_US.append(('AdaBoost', accuracy_r, pipeline_r))
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
c:\Users\guill\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(

Summary

In [31]:
scores_df = pd.DataFrame(model_scores, columns=['Model', 'Accuracy', 'Pipeline'])
scores_df_US = pd.DataFrame(model_scores_US, columns=['Model', 'Accuracy', 'Pipeline'])

best_model = None
best_accuracy = 0

for name, accuracy, pipeline in model_scores:
    print(f"{name} Accuracy: {accuracy:.4f}")

    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_model = name
        best_pipeline = pipeline

print(f"\nBest Model: {best_model} with Accuracy: {best_accuracy:.4f}")
print(f"Best Pipeline: {best_pipeline}")

best_model_US = None
best_accuracy_US = 0

for name, accuracy, pipeline in model_scores_US:
    print(f"{name} Accuracy: {accuracy:.4f}")

    if accuracy > best_accuracy_US:
        best_accuracy_US = accuracy
        best_model_US = name
        best_pipeline_US = pipeline

print(f"\nBest Model: {best_model_US} with Accuracy: {best_accuracy_US:.4f}")
print(f"Best Pipeline: {best_pipeline_US}")
Logistic Regression Accuracy: 0.8024
K-nearest Neighbors Accuracy: 0.7584
Gaussian Naive Bayes Accuracy: 0.6823
SVC Accuracy: 0.7896
Decision Tree Accuracy: 0.7783
Random Forest Accuracy: 0.7910
XGBoost Accuracy: 0.7960
Gradient Boosting Accuracy: 0.7932
AdaBoost Accuracy: 0.7932

Best Model: Logistic Regression with Accuracy: 0.8024
Best Pipeline: Pipeline(steps=[('scaler', MinMaxScaler()),
                ('model', LogisticRegression(C=1, random_state=42))])
Logistic Regression Accuracy: 0.9190
K-nearest Neighbors Accuracy: 0.8858
Gaussian Naive Bayes Accuracy: 0.8858
SVC Accuracy: 0.9378
Decision Tree Accuracy: 0.9344
Random Forest Accuracy: 0.9471
XGBoost Accuracy: 0.9616
Gradient Boosting Accuracy: 0.9650
AdaBoost Accuracy: 0.9506

Best Model: Gradient Boosting with Accuracy: 0.9650
Best Pipeline: Pipeline(steps=[('scaler', MinMaxScaler()),
                ('model',
                 GradientBoostingClassifier(learning_rate=0.2, max_depth=5,
                                            n_estimators=400,
                                            random_state=42))])
In [ ]:
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

colors1 = sns.color_palette('pastel', n_colors=len(scores_df))
colors2 = sns.color_palette('viridis', n_colors=len(scores_df_US))

ax1 = sns.barplot(x='Model', y='Accuracy', data=scores_df, palette=colors1, ax=axes[0])

for p in ax1.patches:
    ax1.annotate(f'{p.get_height():.3f}', 
                (p.get_x() + p.get_width() / 2., p.get_height()), 
                ha='center', va='center', fontsize=9, color='black', xytext=(0, 5), 
                textcoords='offset points')

axes[0].set_title('Model Scores - Original Dataset', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Models', fontsize=12)
axes[0].set_ylabel('Accuracy', fontsize=12)
axes[0].tick_params(axis='x', rotation=45)
axes[0].set_ylim(0, 1)
axes[0].grid(axis='y', linestyle='--', alpha=0.7)

ax2 = sns.barplot(x='Model', y='Accuracy', data=scores_df_US, palette=colors2, ax=axes[1])

for p in ax2.patches:
    ax2.annotate(f'{p.get_height():.3f}', 
                (p.get_x() + p.get_width() / 2., p.get_height()), 
                ha='center', va='center', fontsize=9, color='black', xytext=(0, 5), 
                textcoords='offset points')

axes[1].set_title('Model Scores - SMOTEENN Resampled Dataset', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Models', fontsize=12)
axes[1].set_ylabel('Accuracy', fontsize=12)
axes[1].tick_params(axis='x', rotation=45)
axes[1].set_ylim(0, 1)
axes[1].grid(axis='y', linestyle='--', alpha=0.7)

plt.suptitle('Comparison of Model Performance: Original vs SMOTEENN Resampled Data', 
             fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("\n" + "="*80)
print("PERFORMANCE COMPARISON SUMMARY")
print("="*80)
print(f"{'Model':<20} {'Original':<12} {'SMOTEENN':<12} {'Difference':<12}")
print("-"*80)

for i in range(len(scores_df)):
    model_name = scores_df.iloc[i]['Model']
    original_acc = scores_df.iloc[i]['Accuracy']
    smoteenn_acc = scores_df_US.iloc[i]['Accuracy']
    difference = smoteenn_acc - original_acc
    
    print(f"{model_name:<20} {original_acc:<12.4f} {smoteenn_acc:<12.4f} {difference:+.4f}")

print("-"*80)
C:\Users\guill\AppData\Local\Temp\ipykernel_14424\1127419998.py:9: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  ax1 = sns.barplot(x='Model', y='Accuracy', data=scores_df, palette=colors1, ax=axes[0])
C:\Users\guill\AppData\Local\Temp\ipykernel_14424\1127419998.py:26: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  ax2 = sns.barplot(x='Model', y='Accuracy', data=scores_df_US, palette=colors2, ax=axes[1])
No description has been provided for this image
================================================================================
PERFORMANCE COMPARISON SUMMARY
================================================================================
Model                Original     SMOTEENN     Difference  
--------------------------------------------------------------------------------
Logistic Regression  0.8024       0.9190       +0.1166
K-nearest Neighbors  0.7584       0.8858       +0.1274
Gaussian Naive Bayes 0.6823       0.8858       +0.2035
SVC                  0.7896       0.9378       +0.1481
Decision Tree        0.7783       0.9344       +0.1561
Random Forest        0.7910       0.9471       +0.1561
XGBoost              0.7960       0.9616       +0.1656
Gradient Boosting    0.7932       0.9650       +0.1719
AdaBoost             0.7932       0.9506       +0.1574
--------------------------------------------------------------------------------
In [ ]:
conf_matrix = confusion_matrix(yr_test, y_pred_r)

plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", cbar=True)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()
No description has been provided for this image

Business Interpretation:

False Negatives: Customers who will churn but are not detected → Revenue loss
False Positives: Loyal customers misclassified → Unnecessary retention costs

Business Impact:

FN Cost = Lost customers × Customer lifetime value
FP Cost = Retention actions × Unit cost per action

Key Insights:

  • High FN: Missing potential churners leads to direct revenue loss
  • High FP: Wasting resources on customers who wouldn't churn anyway
  • Optimal balance: Minimize total cost (FN + FP) for maximum ROI

Conclusion¶

Key Findings

This comprehensive analysis of customer churn prediction for the Telco company has yielded several important insights:

  1. Model Performance Comparison
  • Best Original Dataset Model: The analysis identified the top-performing model on the original dataset with optimal hyperparameters
  • SMOTEENN Impact: Data resampling using SMOTEENN significantly improved model performance for most algorithms
  • Algorithm Comparison: Ensemble methods (Random Forest, XGBoost, Gradient Boosting) generally outperformed simpler algorithms
  1. Data Insights
  • Feature Importance: Tenure, total charges, and monthly charges emerged as key predictors of customer churn
  • Customer Patterns: Clear patterns were identified in the distribution of churned vs. retained customers across different service categories
  • Data Quality: Successfully handled missing values and data type inconsistencies in the original dataset
  1. Business Impact
  • Cost Analysis: The confusion matrix analysis revealed the financial implications of false positives and false negatives
  • Actionable Insights: The model can help identify high-risk customers for targeted retention campaigns
  • ROI Optimization: By balancing precision and recall, the company can optimize retention spending

Recommendations

  1. Model Deployment
  • Deploy the best-performing model (likely Random Forest or XGBoost with SMOTEENN) for production use
  • Implement real-time scoring for new customers and regular batch predictions for existing customers
  • Set up model monitoring and retraining pipelines to maintain performance over time
  1. Business Strategy
  • Proactive Retention: Use model predictions to identify at-risk customers before they churn
  • Targeted Campaigns: Develop personalized retention offers based on customer segments and risk scores
  • Feature Monitoring: Track key churn indicators (tenure, charges, service usage) to identify early warning signs
  1. Technical Improvements
  • Feature Engineering: Explore additional features like customer interaction history, payment patterns, and service usage trends
  • Advanced Techniques: Consider deep learning models or ensemble stacking for potentially better performance
  • Real-time Implementation: Develop APIs for real-time churn prediction integration with CRM systems

Business Value

This churn prediction model provides significant business value:

  • Revenue Protection: Early identification of at-risk customers enables proactive retention, potentially saving thousands in lost revenue
  • Cost Optimization: Targeted retention efforts reduce wasted marketing spend on loyal customers
  • Customer Insights: Understanding churn drivers helps improve overall customer experience and service offerings
  • Competitive Advantage: Data-driven retention strategies provide a significant advantage in the competitive telecom market

Future Work

  • Advanced Analytics: Implement customer lifetime value prediction to prioritize retention efforts
  • Segmentation: Develop churn models for specific customer segments (high-value, new customers, etc.)
  • Real-time Features: Incorporate real-time behavioral data for more accurate predictions
  • A/B Testing: Implement controlled experiments to measure the effectiveness of retention strategies

This project demonstrates the power of machine learning in solving real business problems and provides a solid foundation for implementing a comprehensive customer retention strategy.