
Salifort Motors - ML Modelling - Logistic Regression ¶

Document Information ¶

Document Title Salifort Motors - ML Modelling - Logistic Regression
Author Rod Slater
Version 1.0
Created 01-11-2023
Modified 16-11-2023

Client Details¶

Client Name Salifort Motors
Client Contact Mr HR Team
Client Email hr@salifortmotors.it
Client Project HR Team Data-Driven Solutions from Machine Learning Models

Document Overview¶

ML Modelling using Logistic Regression for HR data provided by Salifort Motors. This notebook details the Logistic Regression modelling process and performance comparisons.

Table of contents¶

  • Salifort Motors - ML Modelling - Logistic Regression
    • Document Information
    • Client Details
    • Document Overview
  • Logistic Regression:
    • Notes
    • Initialise Notebook
      • Import Packages
      • Initialise Notebook Options
      • Set Pandas Options
    • Define Functions
      • make_results(model_name: str, model_object: object, X_var, y_var, y_pred_var)
      • display_results() - Function to retrieve scores from Results.csv and display them
      • classification_report_summary(name: str, y_var, y_pred_var)
  • Construct Models - Logistic Regression
    • Construct Model - Logistic Regression 1
      • Import data - model_lr1
        • Code flags:
      • Prepare data for modelling - model_lr1
      • Isolate Outcome and Feature Variables - model_lr1
      • Train Test split data - model_lr1
      • Apply StandardScaler to feature variables - model_lr1
      • Instantiate Model - model_lr1
      • Fit Data for model_lr1 - Train data
      • Make Predictions - model_lr1 - Test data
      • Classification Report - model_lr1 - Test data
      • Prepare and Save Results - model_lr1 - Test data
    • Performance Results & Comparisons - model_lr1 - Test data
      • Display Results - model_lr1 - Test Data
      • Make Predictions - model_lr1 - Train data
      • Classification Report - model_lr1 - Train data
      • Prepare and Save Results - model_lr1 - Train data
    • Performance Results & Comparisons - model_lr1 - Train data
      • Display Results - model_lr1 - Train data
    • Construct Model - Logistic Regression 2
      • Import data - model_lr2
        • Code flags:
      • Prepare data for modelling - model_lr2
      • Isolate Outcome and Feature Variables - model_lr2
      • Train Test split data - model_lr2
      • Apply StandardScaler to feature variables - model_lr2
      • Instantiate Model - model_lr2
      • Fit Data for model_lr2 - Train data
      • Make Predictions - model_lr2 - Test data
      • Classification Report - model_lr2 - Test data
      • Prepare and Save Results - model_lr2 - Test data
    • Performance Results & Comparisons - model_lr2 - Test data
      • Display Results - model_lr2 - Test Data
      • Make Predictions - model_lr2 - Train data
      • Classification Report - model_lr2 - Train data
      • Prepare and Save Results - model_lr2 - Train data
    • Performance Results & Comparisons - model_lr2 - Train data
      • Display Results - model_lr2 - Train data
  • Review Results
    • Dataset comparison - Logistic Regression
    • Model performance
      • Confusion Matrix Heatmaps - model_lr1 & model_lr2
      • Feature Importance
    • Precision / Recall Curve
    • ROC Curve Plot
    • Conclusions
    • Conclusion, dataset comparison - Logistic Regression.

Logistic Regression:¶

For the purposes of self-education, I'll model this on two datasets: first with all features, and second with the dept features removed.

Notes¶

This model runs very quickly, so there is no real need to pickle it.

Initialise Notebook¶

Import Packages¶

In [88]:
# Import packages

# Data manipulation
import numpy as np
import pandas as pd

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Set Options
pd.set_option('display.max_columns', None)

# Data modelling Imports
from xgboost import XGBClassifier, XGBRegressor, plot_importance

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# For metrics and helpful functions
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn import metrics
from sklearn.metrics import roc_auc_score, roc_curve, precision_recall_fscore_support, precision_recall_curve, auc
from sklearn.tree import plot_tree
from sklearn.preprocessing import StandardScaler

import statsmodels.api as sm

from datetime import datetime as dt

import json

# For saving models
import pickle

Initialise Notebook Options¶

Set Pandas Options¶

In [89]:
# set Pandas Display Options
pd.options.display.float_format = '{:.2f}'.format
pd.set_option('display.precision', 2)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 120)
In [90]:
# Source folder for cleaned data (absolute path kept for reference; the relative path below overrides it)
#load_path = "/home/hass/Documents/Learning/Salifort-Motors-Capstone-Project/00-data_cleaned/"
load_path = "./00-data_cleaned/"
# Destination for pickle-saved models
save_path = "/home/hass/Documents/Learning/Salifort-Motors-Capstone-Project/04-pickle-ML-models/"
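
save_path is defined for consistency with the other model notebooks but is not used here (as noted above, this model trains quickly and is not pickled). For reference, a minimal sketch of how a fitted model could be saved and reloaded, assuming a fitted model object such as model_lr1:

# Sketch only (not executed in this notebook): persist a fitted model
with open(save_path + "model_lr1.pickle", "wb") as f:
    pickle.dump(model_lr1, f)

# ...and load it back later
with open(save_path + "model_lr1.pickle", "rb") as f:
    model_lr1 = pickle.load(f)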

Define Functions¶

make_results(model_name: str, model_object: object, X_var, y_var, y_pred_var)¶

In [91]:
# Get results from Logistic Regression to store in comparison table
def make_results(model_name: str, model_object: object, X_var, y_var, y_pred_var):
    '''
    Returns a pandas df with the precision, recall, F1, accuracy, and AUC
    scores for the supplied model and predictions.

    In: 
        model_name (str):     How you want your model to be named in the output table
        model_object:         The fitted model object
        X_var:                numpy array of X data (used for predict_proba when computing AUC)
        y_var:                numpy array of true y values
        y_pred_var:           numpy array of predicted y values

    Out: pandas df containing precision, recall, F1, accuracy, and AUC scores of the model
    '''

    # Build a classification report dict from the true and predicted values
    report = classification_report(y_var, y_pred_var,  output_dict=True)

    # Calculate per-class precision, recall, and F1 score
    predict_precision, predict_recall, predict_f1_score, _ = precision_recall_fscore_support(y_var, y_pred_var, average=None)
    f1_true_class = predict_f1_score[1]   # Index 1 corresponds to the "True" (leave) class
    f1_false_class = predict_f1_score[0]  # Index 0 corresponds to the "False" (stay) class


    # Extract weighted-average precision, recall, and F1, plus accuracy, from the report
    f1 = report['weighted avg']['f1-score']
    recall = report['weighted avg']['recall']
    precision = report['weighted avg']['precision']
    accuracy = report['accuracy']
    auc = roc_auc_score(y_var, model_object.predict_proba(X_var)[:,1])
    # Create table of results
    table = pd.DataFrame({'Model': model_name,
                          'Precision': precision,
                          'Recall': recall,
                          'F1': f1,
                          'Accuracy': accuracy,
                          'AUC': auc,
                          'Predict Leave': f1_true_class,
                          'Predict Stay' : f1_false_class
                                          
                        },
                        index=[0]
                       )
  
    return table

display_results() - Function to retrieve scores from Results.csv and display them¶

In [92]:
def display_results():
    '''
    Load Results.csv containing stored test scores; return the scores for display

    In: 
        none

    Out: pandas df of Results.csv containing precision, recall, f1, accuracy, and AUC scores of the models
    '''
    model_results = pd.read_csv("Results.csv")
    # Drop the unnamed index column written by to_csv(index=True)
    model_results.drop(columns=['Unnamed: 0'], inplace=True)
    model_results = model_results.sort_values(by='AUC', ascending=False)
    
    return model_results

classification_report_summary(name: str, y_var, y_pred_var)¶

In [93]:
def classification_report_summary(name: str, y_var, y_pred_var):
    '''
    Gather stats from predictions

    In: 
        name (str)  : Test data name for the report header, e.g. TEST or TRAIN
        y_var       : true y values
        y_pred_var  : predicted y values

    Out: Display of precision, recall, f1, accuracy, and AUC scores of the models, Weighted Average and Prediction f1 score for true/false
    '''
    targetnames = ['Predicted would not leave', 'Predicted would leave']

  
    
    print("\nClassification Report : ", name)
    print(classification_report(y_var, y_pred_var, target_names=targetnames))
    
    print("Recall        : {:.4%}".format(recall_score(y_var, y_pred_var)))
    print("f1_score      : {:.4%}".format(f1_score(y_var, y_pred_var)))
    print("Precision     : {:.4%}".format(precision_score(y_var, y_pred_var)))
    print("Accuracy      : {:.4%}".format(accuracy_score(y_var, y_pred_var)))
    
    report = classification_report(y_var, y_pred_var,  output_dict=True)
    
    print()
    print('\u2500' * 35) 
    print("Weighted Average")
    print('\u2500' * 35) 
    
    print("Recall        : {:.4%}".format(report['weighted avg']['recall']))
    print("f1 Score      : {:.4%}".format(report['weighted avg']['f1-score']))
    print("Precision     : {:.4%}".format(report['weighted avg']['precision']))
    print("Support       : {:.0f}".format(report['weighted avg']['support']))
    
      # Calculate precision, recall, and F1 score for the "True" class
    predict_precision, predict_recall, predict_f1_score, _ = precision_recall_fscore_support(y_var, y_pred_var, average=None)
    f1_true_class = predict_f1_score[1]  # Index 1 corresponds to the "True" class
    f1_false_class = predict_f1_score[0]  # Index 0 corresponds to the "False" class
    
    print()
    print('\u2500' * 35) 
    print("Prediction F1 score")
    print('\u2500' * 35) 
    print("Predict Leave : {:.4%}".format(f1_true_class))
    print("Support Stay  : {:.4%}".format(f1_false_class))

Construct Models - Logistic Regression¶

  • Feature Engineered
  • Outliers Removed
  • All Features

Construct Model - Logistic Regression 1¶

This model includes all of the features retained after Feature Engineering was completed, 17 features in total.

Import data - model_lr1¶

Two datasets are used for model performance comparison:

  • salifort_data_FE.csv is the full dataset, feature engineered with salary encoded to ordinal, avg_mnth_hrs binary encoded to overworked, and dept encoded with dummies
  • salifort_data_FE_focus.csv is the same data with the dummy-encoded dept fields removed.

Dept appears to have low correlation across the dataset, and I'm curious how much the models are influenced by low-correlation features. It turns out that low-correlation features have little impact on model performance, which is no surprise really! A quick check is sketched below.
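
As a sanity check, the correlation of each dummy-encoded dept column with the outcome can be computed directly. A minimal sketch, assuming df1 already holds the dummy-encoded dataset produced below:

# Correlation of each dept dummy with the outcome variable 'left'
dept_cols = [c for c in df1.columns if c.startswith('dept_')]
print(df1[dept_cols].astype(float).corrwith(df1['left']).sort_values())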

Code flags:¶

rerun = flag identifying the first run of the model comparisons. 0 = first run, a NEW Results.csv file is created with headers / 1 = continuation, results are appended to the existing file.

dataset = text indicating which dataset is being used; added to the model description when saving to Results.csv.

model_prefix = text indicating which ML model is being used; added to the model description when saving to Results.csv.

In [94]:
# Preview the available cleaned datasets; each read overwrites df1 (the working dataset is loaded in the next cell)
df1 = pd.read_csv(load_path + "data_cleaned_NoOl_NoFE_AllFeat.csv", index_col = False) 

print("\ndata_cleaned_NoOl_NoFE_AllFeat.csv")
print(df1.shape)
print(df1.head(1))

df1 = pd.read_csv(load_path + "data_cleaned_Ol_NoFE_AllFeat.csv", index_col = False) # includes outliers
print("\ndata_cleaned_Ol_NoFE_AllFeat.csv")
print(df1.shape)
print(df1.head(1))

df1 = pd.read_csv(load_path + "data_cleaned_NoOl_FE_AllFeat.csv", index_col = False) 
print("\ndata_cleaned_NoOl_FE_AllFeat.csv")
print(df1.shape)
print(df1.head(1))
df1 = pd.read_csv(load_path + "data_cleaned_NoOl_FE_NoDept.csv", index_col = False) 
print("\ndata_cleaned_NoOl_FE_NoDept.csv")
print(df1.shape)
print(df1.head(1))

# model_prefix : Str = prefix for results.csv added to dataset
model_prefix      = 'lr1'

# dataset : Str = dataset name for results.csv
dataset           = 'ALLFeat'

# rerun : int 1 = append to the Results.csv file with no headers / 0 = Write new file with headers
rerun             = 0

print("df1 - Feature engineering on salary, avg_mnth_hrs, dept, outliers removed\n")

# Display dataframe columns
print(df1.shape)
data_cleaned_NoOl_NoFE_AllFeat.csv
(11167, 10)
   satisfaction  last_eval  number_project  avg_mnth_hrs  tenure  accident  left  promotion   dept salary
0          0.38       0.53            2.00        157.00    3.00      0.00  1.00       0.00  sales    low

data_cleaned_Ol_NoFE_AllFeat.csv
(11991, 10)
   satisfaction  last_eval  number_project  avg_mnth_hrs  tenure  accident  left  promotion   dept salary
0          0.38       0.53               2           157       3         0     1          0  sales    low

data_cleaned_NoOl_FE_AllFeat.csv
(11167, 18)
   satisfaction  last_eval  number_project  tenure  left  promotion  salary  dept_accounting  dept_hr  dept_it  \
0          0.38       0.53            2.00    3.00  1.00       0.00       0            False    False    False   

   dept_management  dept_marketing  dept_product_mng  dept_randd  dept_sales  dept_support  dept_technical  overworked  
0            False           False             False       False        True         False           False           0  

data_cleaned_NoOl_FE_NoDept.csv
(11167, 8)
   satisfaction  last_eval  number_project  overworked  tenure  left  promotion  salary
0          0.38       0.53            2.00           0    3.00  1.00       0.00       0
df1 - Feature engineering on salary, avg_mnth_hrs, dept, outliers removed

(11167, 8)
In [95]:
# Load cleaned dataset into a dataframe
print("Started // Last Run =", dt.now().strftime("%Y-%m-%d %H:%M:%S"),"\n")

# data_cleaned_NoOl_NoFE_AllFeat.csv = cleaned data, duplicates and outliers removed, no feature engineering (salary and dept are encoded in the next cell)
df1 = pd.read_csv(load_path + "data_cleaned_NoOl_NoFE_AllFeat.csv", index_col = False) 
Started // Last Run = 2023-12-04 12:00:15 

In [96]:
# A little feature engineering: ordinal-encode salary, one-hot encode dept

df1['salary'] = (
    df1['salary'].astype('category')
    .cat.set_categories(['low', 'medium', 'high'])
    .cat.codes
)

# One Hot Encode dept
df1 = pd.get_dummies(df1, columns = ['dept'])
df1.dtypes
Out[96]:
satisfaction        float64
last_eval           float64
number_project      float64
avg_mnth_hrs        float64
tenure              float64
accident            float64
left                float64
promotion           float64
salary                 int8
dept_accounting        bool
dept_hr                bool
dept_it                bool
dept_management        bool
dept_marketing         bool
dept_product_mng       bool
dept_randd             bool
dept_sales             bool
dept_support           bool
dept_technical         bool
dtype: object

Prepare data for modelling - model_lr1¶

In [97]:
model_data = df1.copy() # copy df to df used for modelling

Isolate Outcome and Feature Variables - model_lr1¶

In [98]:
# Logistic Regression model

# Save X and Y data into variables
Y = model_data['left'] # Isolate the outcome variable
X = model_data.copy()
X = X.drop('left', axis = 1) # Isolate the feature variables, drop the outcome variable left
#X = sm.add_constant(X)

Train Test split data - model_lr1¶

In [99]:
# Split Test/Train Data
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.25, stratify=Y, random_state=42)

Apply StandardScalar to feature variables - model_lr1¶

Standardize features by removing the mean and scaling to unit variance. Scaling is generally needed for scale-sensitive estimators such as this regularised logistic regression; tree-based models do not require it.

The standard score of a sample x is calculated as:

$$ z = \frac{(x - u)}{s}$$

where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform.

Standardisation of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
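
As a quick illustration (my own check, not part of the modelling pipeline, using the np and StandardScaler imports above), the scaler's transform reproduces the z formula, using the mean and standard deviation learned from the data it was fitted on:

# Sketch: verify StandardScaler matches (x - u) / s computed manually
sample = np.array([[1.0], [2.0], [3.0], [4.0]])
sc = StandardScaler().fit(sample)                  # learns u = 2.5, s ~ 1.118
manual = (sample - sample.mean()) / sample.std()   # numpy's default std (ddof=0) matches sklearn
print(np.allclose(sc.transform(sample), manual))   # True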

In [100]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Instantiate Model - model_lr1¶

In [101]:
# Instantiate the Logistic Regression model
model_lr1 = LogisticRegression()

Fit Data for model_lr1 - Train data¶

In [102]:
model_lr1.fit(X_train, y_train) # Training data for model_lr1
Out[102]:
LogisticRegression()

Make Predictions - model_lr1 - Test data¶

In [103]:
# Make Predictions using Logistic Regression
y_pred_test = model_lr1.predict(X_test)

y_pred_test
Out[103]:
array([0., 0., 0., ..., 0., 0., 0.])

Classification Report - model_lr1 - Test data¶

In [104]:
classification_report_summary(model_prefix+' - '+dataset+' - '+'TEST', y_test, y_pred_test)
Classification Report :  lr1 - ALLFeat - TEST
                           precision    recall  f1-score   support

Predicted would not leave       0.86      0.93      0.90      2321
    Predicted would leave       0.45      0.27      0.34       471

                 accuracy                           0.82      2792
                macro avg       0.66      0.60      0.62      2792
             weighted avg       0.79      0.82      0.80      2792

Recall        : 27.3885%
f1_score      : 34.1270%
Precision     : 45.2632%
Accuracy      : 82.1633%

───────────────────────────────────
Weighted Average
───────────────────────────────────
Recall        : 82.1633%
f1 Score      : 80.3127%
Precision     : 79.4256%
Support       : 2792

───────────────────────────────────
Prediction F1 score
───────────────────────────────────
Predict Leave : 34.1270%
Predict Stay  : 89.6852%

Prepare and Save Results - model_lr1 - Test data¶

In [105]:
result_table = make_results(model_prefix+' - '+dataset+' - '+'test',model_lr1, X_test, y_test, y_pred_test)

# If the model is being run for the first time, create a new Results.csv file
if rerun == 0:

    # First save to Results.csv, no mode set, write headers, write new file
    result_table.to_csv("Results.csv", index=True, header=True) 

# If the model is being RERUN with a new data, APPEND to existing Results.csv
elif rerun == 1 : 
    
    # APPEND save to Results.csv, don't write headers, APPEND new file
    result_table.to_csv("Results.csv", index=True, mode='a', header=False) 

Performance Results & Comparisons - model_lr1 - Test data¶

Display Results - model_lr1 - Test Data¶

In [106]:
print(result_table)
print()
display_results()
                  Model  Precision  Recall   F1  Accuracy  AUC  Predict Leave  Predict Stay
0  lr1 - ALLFeat - test       0.79    0.82 0.80      0.82 0.88           0.34          0.90

Out[106]:
                  Model  Precision  Recall   F1  Accuracy  AUC  Predict Leave  Predict Stay
0  lr1 - ALLFeat - test       0.79    0.82 0.80      0.82 0.88           0.34          0.90

Make Predictions - model_lr1 - Train data¶

Use the already fitted model_lr1 to predict on the training data as a check for over- or under-fitting.

In [107]:
# Make Predictions using Logistic Regression

y_pred_train = model_lr1.predict(X_train) # worth a look!

Classification Report - model_lr1 - Train data¶

In [108]:
classification_report_summary(model_prefix+' - '+dataset+' - '+'TRAIN', y_train, y_pred_train)
Classification Report :  lr1 - ALLFeat - TRAIN
                           precision    recall  f1-score   support

Predicted would not leave       0.86      0.94      0.90      6964
    Predicted would leave       0.47      0.25      0.33      1411

                 accuracy                           0.83      8375
                macro avg       0.67      0.60      0.61      8375
             weighted avg       0.80      0.83      0.80      8375

Recall        : 25.1595%
f1_score      : 32.8400%
Precision     : 47.2703%
Accuracy      : 82.6627%

───────────────────────────────────
Weighted Average
───────────────────────────────────
Recall        : 82.6627%
f1 Score      : 80.4086%
Precision     : 79.5988%
Support       : 8375

───────────────────────────────────
Prediction F1 score
───────────────────────────────────
Predict Leave : 32.8400%
Predict Stay  : 90.0466%

Prepare and Save Results - model_lr1 - Train data¶

In [109]:
result_table = make_results(model_prefix+' - '+dataset+' - '+'train',model_lr1, X_train, y_train, y_pred_train)
result_table.to_csv("Results.csv", index=True, mode='a', header=False) # Append to existing Results.csv file, mode = 'a', no headers

Performance Results & Comparisons - model_lr1 - Train data¶

Display Results - model_lr1 - Train data¶

In [110]:
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 120)
pd.options.display.float_format = '{:.3f}'.format
print(result_table)
print()
display_results()
                   Model  Precision  Recall    F1  Accuracy   AUC  Predict Leave  Predict Stay
0  lr1 - ALLFeat - train      0.796   0.827 0.804     0.827 0.892          0.328         0.900

Out[110]:
                   Model  Precision  Recall    F1  Accuracy   AUC  Predict Leave  Predict Stay
1  lr1 - ALLFeat - train      0.796   0.827 0.804     0.827 0.892          0.328         0.900
0   lr1 - ALLFeat - test      0.794   0.822 0.803     0.822 0.882          0.341         0.897

Construct Model - Logistic Regression 2¶

This model excludes the department features.

Import data - model_lr2¶

The same two datasets described for model_lr1 apply; for this model, rather than loading a separate file, the dept dummy columns are simply dropped from df1 (see the cell below).

Code flags:¶

rerun, dataset, and model_prefix behave as described for model_lr1. Here rerun = 1, so results are appended to the existing Results.csv rather than overwriting it.

In [111]:
df1.columns
Out[111]:
Index(['satisfaction', 'last_eval', 'number_project', 'avg_mnth_hrs', 'tenure', 'accident', 'left', 'promotion',
       'salary', 'dept_accounting', 'dept_hr', 'dept_it', 'dept_management', 'dept_marketing', 'dept_product_mng',
       'dept_randd', 'dept_sales', 'dept_support', 'dept_technical'],
      dtype='object')
In [112]:
#df2 = pd.read_csv(load_path + "data_cleaned_NoOl_FE_NoDept.csv", index_col = False) # Feature engineering on salary, avg_mnth_hrs, outliers removed, departments removed

df2 = df1[['satisfaction', 'last_eval', 'number_project', 'avg_mnth_hrs', 'tenure', 'accident', 'left', 'promotion',
           'salary']]

# prefix for results.csv added to dataset
model_prefix         = 'lr2'

# dataset name for results.csv
dataset              = 'NOdept'

# 1 = append to the Results.csv file with no headers / 0 = Write new file with headers
rerun                = 1

print("df2 - Feature engineering on salary, avg_mnth_hrs, dept REMOVED, outliers removed\n")
# Display dataframe columns
print(df2.info())
df2 - Feature engineering on salary, avg_mnth_hrs, dept REMOVED, outliers removed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11167 entries, 0 to 11166
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   satisfaction    11167 non-null  float64
 1   last_eval       11167 non-null  float64
 2   number_project  11167 non-null  float64
 3   avg_mnth_hrs    11167 non-null  float64
 4   tenure          11167 non-null  float64
 5   accident        11167 non-null  float64
 6   left            11167 non-null  float64
 7   promotion       11167 non-null  float64
 8   salary          11167 non-null  int8   
dtypes: float64(8), int8(1)
memory usage: 709.0 KB
None

Prepare data for modelling - model_lr2¶

In [113]:
model_data2 = df2.copy() # copy df to df used for modelling

Isolate Outcome and Feature Variables - model_lr2¶

In [114]:
# Logistic Regression model

# Save X and Y data into variables
Y2 = model_data2['left'] # Isolate the outcome variable
X2 = model_data2.copy()
X2 = X2.drop('left', axis = 1) # Isolate the feature variables, drop the outcome variable left
#X = sm.add_constant(X)

Train Test split data - model_lr2¶

In [115]:
# Split Test/Train Data (stratify on the matching outcome variable Y2; Y2 is identical to Y here, so the split is unchanged)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, Y2, test_size=0.25, stratify=Y2, random_state=42)

Apply StandardScalar to feature variables - model_lr2¶

Standardize features by removing the mean and scaling to unit variance, as for model_lr1.

In [116]:
scaler = StandardScaler()
X_train2 = scaler.fit_transform(X_train2)
X_test2 = scaler.transform(X_test2)

Instantiate Model - model_lr2¶

In [117]:
# Instantiate the Logistic Regression model
model_lr2 = LogisticRegression()

Fit Data for model_lr2 - Train data¶

In [118]:
model_lr2.fit(X_train2, y_train2) # Training data for model_lr2
Out[118]:
LogisticRegression()

Make Predictions - model_lr2 - Test data¶

In [119]:
# Make Predictions using Logistic Regression
y_pred_test2 = model_lr2.predict(X_test2)

Classification Report - model_lr2 - Test data¶

In [120]:
classification_report_summary(model_prefix+' - '+dataset+' - '+'TEST', y_test2, y_pred_test2)
Classification Report :  lr2 - NOdept - TEST
                           precision    recall  f1-score   support

Predicted would not leave       0.86      0.93      0.90      2321
    Predicted would leave       0.46      0.27      0.34       471

                 accuracy                           0.82      2792
                macro avg       0.66      0.60      0.62      2792
             weighted avg       0.80      0.82      0.80      2792

Recall        : 27.3885%
f1_score      : 34.2629%
Precision     : 45.7447%
Accuracy      : 82.2708%

───────────────────────────────────
Weighted Average
───────────────────────────────────
Recall        : 82.2708%
f1 Score      : 80.3926%
Precision     : 79.5204%
Support       : 2792

───────────────────────────────────
Prediction F1 score
───────────────────────────────────
Predict Leave : 34.2629%
Predict Stay  : 89.7537%

Prepare and Save Results - model_lr2 - Test data¶

In [121]:
result_table = make_results(model_prefix+' - '+dataset+' - '+'test',model_lr2, X_test2, y_test2, y_pred_test2)

# If the model is being run for the first time, create a new Results.csv file
if rerun == 0:

    # First save to Results.csv, no mode set, write headers, write new file
    result_table.to_csv("Results.csv", index=True, header=True) 

# If the model is being RERUN with a new data, APPEND to existing Results.csv
elif rerun == 1 : 
    
    # APPEND save to Results.csv, don't write headers, APPEND new file
    result_table.to_csv("Results.csv", index=True, mode='a', header=False) 

Performance Results & Comparisons - model_lr2 - Test data¶

Display Results - model_lr2 - Test Data¶

In [122]:
print(result_table)
print()
display_results()
                 Model  Precision  Recall    F1  Accuracy   AUC  Predict Leave  Predict Stay
0  lr2 - NOdept - test      0.795   0.823 0.804     0.823 0.882          0.343         0.898

Out[122]:
                   Model  Precision  Recall    F1  Accuracy   AUC  Predict Leave  Predict Stay
1  lr1 - ALLFeat - train      0.796   0.827 0.804     0.827 0.892          0.328         0.900
0   lr1 - ALLFeat - test      0.794   0.822 0.803     0.822 0.882          0.341         0.897
2    lr2 - NOdept - test      0.795   0.823 0.804     0.823 0.882          0.343         0.898

Make Predictions - model_lr2 - Train data¶

Use the already fitted model_lr2 to predict on the training data as a check for over- or under-fitting.

In [123]:
# Make Predictions using Logistic Regression

y_pred_train2 = model_lr2.predict(X_train2) # worth a look!

Classification Report - model_lr2 - Train data¶

In [124]:
classification_report_summary(model_prefix+' - '+dataset+' - '+'TRAIN', y_train2, y_pred_train2)
Classification Report :  lr2 - NOdept - TRAIN
                           precision    recall  f1-score   support

Predicted would not leave       0.86      0.94      0.90      6964
    Predicted would leave       0.46      0.24      0.32      1411

                 accuracy                           0.82      8375
                macro avg       0.66      0.59      0.61      8375
             weighted avg       0.79      0.82      0.80      8375

Recall        : 24.2381%
f1_score      : 31.7992%
Precision     : 46.2162%
Accuracy      : 82.4836%

───────────────────────────────────
Weighted Average
───────────────────────────────────
Recall        : 82.4836%
f1 Score      : 80.1540%
Precision     : 79.2962%
Support       : 8375

───────────────────────────────────
Prediction F1 score
───────────────────────────────────
Predict Leave : 31.7992%
Predict Stay  : 89.9514%

Prepare and Save Results - model_lr2 - Train data¶

In [125]:
result_table = make_results(model_prefix+' - '+dataset+' - '+'train',model_lr2, X_train2, y_train2, y_pred_train2)
result_table.to_csv("Results.csv", index=True, mode='a', header=False) # Append to existing Results.csv file, mode = 'a', no headers

Performance Results & Comparisons - model_lr2 - Train data¶

Display Results - model_lr2 - Train data¶

In [126]:
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 120)
pd.options.display.float_format = '{:.4f}'.format
print(result_table)
print()
display_results()
                  Model  Precision  Recall     F1  Accuracy    AUC  Predict Leave  Predict Stay
0  lr2 - NOdept - train     0.7930  0.8248 0.8015    0.8248 0.8919         0.3180        0.8995

Out[126]:
                   Model  Precision  Recall     F1  Accuracy    AUC  Predict Leave  Predict Stay
1  lr1 - ALLFeat - train     0.7960  0.8266 0.8041    0.8266 0.8923         0.3284        0.9005
3   lr2 - NOdept - train     0.7930  0.8248 0.8015    0.8248 0.8919         0.3180        0.8995
0   lr1 - ALLFeat - test     0.7943  0.8216 0.8031    0.8216 0.8819         0.3413        0.8969
2    lr2 - NOdept - test     0.7952  0.8227 0.8039    0.8227 0.8819         0.3426        0.8975

Review Results¶

Dataset comparison - Logistic Regression¶

There is little difference in performance between the two datasets:

  • ALLFeat = complete feature-engineered dataset
  • NoDept = dataset with the dummy-encoded dept columns removed

Model performance¶

In [127]:
# Set parameters for following cells

model1 = model_lr1
model2 = model_lr2


model_name1 = "model_lr1"
model_name2 = "model_lr2"

# model_best_model1 = N/A not grid searched
# model_best_model2 = N/A not grid searched
In [128]:
display_results()
Out[128]:
                   Model  Precision  Recall     F1  Accuracy    AUC  Predict Leave  Predict Stay
1  lr1 - ALLFeat - train     0.7960  0.8266 0.8041    0.8266 0.8923         0.3284        0.9005
3   lr2 - NOdept - train     0.7930  0.8248 0.8015    0.8248 0.8919         0.3180        0.8995
0   lr1 - ALLFeat - test     0.7943  0.8216 0.8031    0.8216 0.8819         0.3413        0.8969
2    lr2 - NOdept - test     0.7952  0.8227 0.8039    0.8227 0.8819         0.3426        0.8975

Confusion Matrix Heatmaps - model_lr1 & model_lr2¶

In [129]:
# Prepare confusion matrix for lr1 test
cm_test1 = metrics.confusion_matrix(y_test, y_pred_test)
#cm_test1_percent = cm_test1 / cm_test1.sum() * 100

# Prepare confusion matrix for lr2 test
cm_test2 = metrics.confusion_matrix(y_test2, y_pred_test2)
#cm_test2_percent = cm_test2 / cm_test2.sum() * 100

#cm = confusion_matrix(y_test, y_pred_test, labels=model_lr1.classes_)

# Plot confusion matrix
#disp = ConfusionMatrixDisplay(confusion_matrix=cm,
#                             display_labels=model_lr1.classes_)
#disp.plot(values_format='');

fig, ax = plt.subplots(2, 2, figsize=(10,8))

# Calculate percentages for lr1 test (normalise each row by its true-class total)
sum_by_true_class = np.sum(cm_test1, axis=1)
percentage_matrix = cm_test1 / sum_by_true_class[:, np.newaxis]
model_name = "lr1 - AllFeat - Test"
# Create a figure and plot the percentage confusion matrix as a heatmap

sns.heatmap(percentage_matrix, annot=True, fmt=".2%", cmap="Blues", ax = ax[0,0]) #, xticklabels=model_lr1.classes_, yticklabels=model_lr1.classes_)
ax[0,0].set_title('{} Confusion Matrix (Percentage)'.format(model_name))
ax[0,0].set_ylabel('True label')
ax[0,0].set_xlabel('Predicted label')
ax[0,0].text(0.3, 0.25, '(TN)\nTrue Stay', color='white')
ax[0,0].text(1.3, 0.25, '(FP) type 1\n False Leave', color='black')
ax[0,0].text(0.3, 1.25, '(FN) type 2\n False Stay', color='white')
ax[0,0].text(1.3, 1.25, '(TP)\nTrue Leave', color='black')



# Plot the COUNT confusion matrix as a heatmap for lr1 test
sns.heatmap(cm_test1, annot=True, fmt=".0f", cmap="Blues", ax = ax[1,0])#, xticklabels=class_labels, yticklabels=class_labels)
ax[1,0].set_title('{} Confusion Matrix (Count)'.format(model_name))
ax[1,0].set_ylabel('True label')
ax[1,0].set_xlabel('Predicted label')
ax[1,0].text(0.3, 0.25, '(TN)\nTrue Stay', color='white')
ax[1,0].text(1.3, 0.25, '(FP) type 1\n False Leave', color='black')
ax[1,0].text(0.3, 1.25, '(FN) type 2\n False Stay', color='black')
ax[1,0].text(1.3, 1.25, '(TP)\nTrue Leave', color='black')



# Calculate percentages for lr2 test
sum_by_true_class = np.sum(cm_test2, axis=1)
percentage_matrix = cm_test2 / sum_by_true_class[:, np.newaxis]
model_name = "lr2 - NoDept - Test"

sns.heatmap(percentage_matrix, annot=True, fmt=".2%", cmap="Blues", ax = ax[0,1])#, xticklabels=class_labels, yticklabels=class_labels)
ax[0,1].set_title('{} Confusion Matrix (Percentage)'.format(model_name))
ax[0,1].set_ylabel('True label')
ax[0,1].set_xlabel('Predicted label')
ax[0,1].text(0.3, 0.25, '(TN)\nTrue Stay', color='white')
ax[0,1].text(1.3, 0.25, '(FP) type 1\n False Leave', color='black')
ax[0,1].text(0.3, 1.25, '(FN) type 2\n False Stay', color='white')
ax[0,1].text(1.3, 1.25, '(TP)\nTrue Leave', color='black')


# Plot the COUNT confusion matrix as a heatmap for lr2 test
sns.heatmap(cm_test2, annot=True, fmt=".0f", cmap="Blues", ax = ax[1,1])#, xticklabels=class_labels, yticklabels=class_labels)
ax[1,1].set_title('{}  Confusion Matrix (Count)'.format(model_name))
ax[1,1].set_ylabel('True label')
ax[1,1].set_xlabel('Predicted label')
ax[1,1].text(0.3, 0.25, '(TN)\nTrue Stay', color='white')
ax[1,1].text(1.3, 0.25, '(FP) type 1\n False Leave', color='black')
ax[1,1].text(0.3, 1.25, '(FN) type 2\n False Stay', color='black')
ax[1,1].text(1.3, 1.25, '(TP)\nTrue Leave', color='black')


plt.tight_layout()
plt.show()
[Figure: confusion matrix heatmaps (percentage and count) for model_lr1 - AllFeat and model_lr2 - NoDept, test data]

Feature Importance¶

In [130]:
# Get Feature Importance function, returns a dataframe of features 

def get_feature_importance(model, feature_names, model_name):
    ''' Return a dataframe of features sorted by "importance". For logistic regression
        this is the raw coefficient on the standardised features; the sign gives the
        direction of the effect. '''
    feature_importance = model.coef_[0]
    sorted_indices = feature_importance.argsort()[::-1]
    sorted_feature_names = [feature_names[i] for i in sorted_indices]
    sorted_importance = feature_importance[sorted_indices]

    feature_importance_df = pd.DataFrame({
        'Feature': sorted_feature_names,
        'Importance': sorted_importance
    })

    return feature_importance_df

# Get feature importance for the first model
feature_importance_1 = get_feature_importance(model_lr1, X.columns, "LogisticalRegression model_lr1 - All Feat")

# Get feature importance for the second model
feature_importance_2 = get_feature_importance(model_lr2, X2.columns, "LogisticalRegression model_lr2 - NoDept")

merged_df = pd.merge(feature_importance_1, feature_importance_2, on='Feature', how='left', suffixes = (' model_lr1 - AllFeat', ' model_lr2 - NoDept'))

# Print the merged DataFrame
print(merged_df)


# Plot side-by-side bar plots
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.barh(feature_importance_1['Feature'], feature_importance_1['Importance'], color='skyblue')
plt.title('Feature Importance - model_lr1 - AllFeat')
plt.xlabel('Importance')
plt.ylabel('Feature')

plt.subplot(1, 2, 2)
plt.barh(feature_importance_2['Feature'], feature_importance_2['Importance'], color='salmon')
plt.title('Feature Importance - model_lr2 - NoDept')
plt.xlabel('Importance')
plt.ylabel('Feature')

plt.tight_layout()
plt.show()
             Feature  Importance model_lr1 - AllFeat  Importance model_lr2 - NoDept
0             tenure                         0.9972                        0.9950
1       avg_mnth_hrs                         0.1764                        0.1736
2         dept_sales                         0.0364                           NaN
3       dept_support                         0.0304                           NaN
4     dept_technical                         0.0289                           NaN
5            dept_hr                         0.0115                           NaN
6          last_eval                        -0.0081                       -0.0096
7   dept_product_mng                        -0.0115                           NaN
8            dept_it                        -0.0122                           NaN
9     dept_marketing                        -0.0126                           NaN
10   dept_management                        -0.0281                           NaN
11   dept_accounting                        -0.0535                           NaN
12        dept_randd                        -0.0648                           NaN
13         promotion                        -0.1910                       -0.1970
14            salary                        -0.3332                       -0.3367
15          accident                        -0.5551                       -0.5574
16    number_project                        -0.5710                       -0.5675
17      satisfaction                        -1.0622                       -1.0588
[Figure: side-by-side horizontal bar charts of feature importance for model_lr1 - AllFeat and model_lr2 - NoDept]
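
Because these "importances" are logistic-regression coefficients on standardised features, exponentiating them gives an odds ratio per one-standard-deviation increase in each feature, which can be easier to read. A short sketch, assuming feature_importance_1 from the cell above:

# Odds ratio per 1-SD increase in each (standardised) feature
fi = feature_importance_1.copy()
fi['Odds Ratio'] = np.exp(fi['Importance'])
print(fi)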

Precision / Recall Curve¶

In [136]:
y_prob1 = model_lr1.predict_proba(X_test)[:, 1]
y_prob2 = model_lr2.predict_proba(X_test2)[:, 1]

precision1, recall1, _ = precision_recall_curve(y_test, y_prob1)
precision2, recall2, _ = precision_recall_curve(y_test2, y_prob2)

# Compute area under the curve (AUC)
auc_score1 = auc(recall1, precision1)
auc_score2 = auc(recall2, precision2)

#plt.subplot(1, 2, 1)
plt.figure(figsize=(12, 5))
plt.plot(recall1, precision1, color='blue', label=f'{model_name1} - AUC = {auc_score1:.2f}')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title(f'Precision-Recall Curve - {model_name1}')
plt.legend(loc='best')
plt.axhline(y=0.5, color='red', linestyle='--', label='Random Guess')
plt.axhline(y=1, color='green', linestyle='--', label='Perfect')
plt.axvline(x=1, color='green', linestyle='--', label='Perfect')
plt.text(0.2, 0.52, 'Random Guess', color='red')
plt.text(0.4, 0.8, 'Better', color='black')
plt.text(0.4, 0.4, 'Worse', color='black')
plt.text(0.8, 1.01, 'Perfect', color='green')
#plt.savefig('plot-prc-curve1.png')
plt.show()

#plt.subplot(1, 2, 2)
plt.figure(figsize=(12, 5))
plt.plot(recall2, precision2, color='blue', label=f'{model_name2} - AUC = {auc_score2:.2f}')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title(f'Precision-Recall Curve - {model_name2}')
plt.legend(loc='best')
plt.axhline(y=0.5, color='red', linestyle='--', label='Random Guess')
plt.axhline(y=1, color='green', linestyle='--', label='Perfect')
plt.axvline(x=1, color='green', linestyle='--', label='Perfect')
plt.text(0.2, 0.52, 'Random Guess', color='red')
plt.text(0.4, 0.8, 'Better', color='black')
plt.text(0.4, 0.4, 'Worse', color='black')
plt.text(0.8, 1.01, 'Perfect', color='green')

#plt.tight_layout()
plt.savefig('plot-prc-curve2.png')  # save before show(); show() finalises the figure, so saving afterwards writes an empty canvas
plt.show()
[Figures: precision-recall curves for model_lr1 and model_lr2, with Random Guess and Perfect reference lines]

ROC Curve Plot¶

In [135]:
# Compute ROC curve
fpr1, tpr1, _ = roc_curve(y_test, y_prob1)
fpr2, tpr2, _ = roc_curve(y_test2, y_prob2)

# Compute area under the curve (AUC)
roc_auc1 = auc(fpr1, tpr1)
roc_auc2 = auc(fpr2, tpr2)


#plt.subplot(1, 2, 1)

plt.figure(figsize=(12, 5))
# Plot ROC curve

plt.plot(fpr1, tpr1, color='darkorange', lw=2, label=f'{model_name1} - AUC = {roc_auc1:.2f}')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title(f'Receiver Operating Characteristic Curve (ROC)  - {model_name1}')
plt.legend(loc='best')
plt.axhline(y=1, color='green', linestyle='--', label='Perfect')
plt.axvline(x=0, color='green', linestyle='--', label='Perfect')
plt.text(0.4, 0.6, 'Better', color='black')
plt.text(0.4, 0.3, 'Worse', color='black')
plt.text(0.01, 1.01, 'Perfect', color='green')
plt.show()

#plt.subplot(1, 2,2)
plt.figure(figsize=(12, 5))
# Plot ROC curve
plt.plot(fpr2, tpr2, color='darkorange', lw=2, label=f'{model_name2} AUC = {roc_auc2:.2f}')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title(f'Receiver Operating Characteristic (ROC) Curve - {model_name2}')
plt.legend(loc='best')
plt.axhline(y=1, color='green', linestyle='--', label='Perfect')
plt.axvline(x=0, color='green', linestyle='--', label='Perfect')
plt.text(0.4, 0.6, 'Better', color='black')
plt.text(0.4, 0.3, 'Worse', color='black')
plt.text(0.01, 1.01, 'Perfect', color='green')
plt.show()
[Figures: ROC curves for model_lr1 and model_lr2, with the random-guess diagonal for reference]
In [137]:
# Save model plot data

arrays = [['precision1',model_name1],
         ['recall1', model_name1],
         ['precision2',model_name2],
         ['recall2', model_name2],
         ['fpr1', model_name1],
         ['tpr1', model_name1],
         ['fpr2', model_name2],
         ['tpr2', model_name2],
        ]

variables = [['auc_score1', model_name1],
             ['auc_score2', model_name2],
             ['roc_auc1', model_name1],
             ['roc_auc2', model_name2]
             ]

# Save plot data scalars (PR AUC and ROC AUC) as single JSON values (written with a .csv extension for consistency with the array files)
for var_name, model in variables:
    #print(var_name, model)
    var = globals()[var_name]
    with open(f'99-documentation-project/08-plot_data/{model}-{var_name}.csv', 'w') as file:
        json.dump(var, file)

# Save plot data arrays (recall, precision, fpr, tpr)
for array_name, model in arrays:
    #print(array_name, model)
    var = globals()[array_name]
    df = pd.DataFrame({array_name: var})
    df.to_csv(f'99-documentation-project/08-plot_data/{model}-{array_name}.csv', index=False, header=False)
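
For reference, a minimal sketch of how a downstream notebook might read these files back, assuming the same paths and names:

# Sketch: reload a saved scalar (JSON) and a saved array (CSV)
with open(f'99-documentation-project/08-plot_data/{model_name1}-roc_auc1.csv') as file:
    roc_auc1_loaded = json.load(file)
fpr1_loaded = pd.read_csv(f'99-documentation-project/08-plot_data/{model_name1}-fpr1.csv', header=None)[0].values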

Conclusions¶

The precision-recall AUC is very low, making predictions of staff leaving unreliable: the model would be great at predicting who is going to stay, but that's not what the client is looking for.

While the ROC AUC at 0.88 looks OK, I think alternative models may perform better. One lever worth trying before switching models is the decision threshold, sketched below.
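
As a possible follow-up (a suggestion, not run here): LogisticRegression.predict uses a default 0.5 probability threshold, and lowering it trades precision for recall on the leaver class. A minimal sketch, assuming model_lr1, X_test, and y_test are still in scope (the threshold value is illustrative):

# Sketch: lower the decision threshold to catch more leavers
threshold = 0.3
y_prob = model_lr1.predict_proba(X_test)[:, 1]
y_pred_low = (y_prob >= threshold).astype(int)
print(classification_report(y_test, y_pred_low, target_names=['stay', 'leave']))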

Conclusion, dataset comparison - Logistic Regression.¶

There is almost no difference in performance between the two datasets (AllFeat vs NoDept) when modelled with Logistic Regression: the dept features that showed low correlation in the correlation matrix carried little weight in the final models, which differ little on the key metrics. However, the current Logistic Regression models are better at predicting who will stay than at the actual goal of predicting who will leave.