Set up a review session with Leadership and HR to compare data pre-
and post-project
Provide a machine learning model the HR team can use to identify “at
risk” employees.
Data Exploration and Cleaning
Resources Required
Stakeholder input
Stakeholder Analysis
Project Timeline (Gantt Chart)
Dataset (HR_capstone_dataset.csv)
Python Jupyter Notebooks
Python Libraries
Operations
Python Libraries
Data Import & Manipulation
Pandas, NumPy
Data Modelling
sklearn.linear_model - Logistic Regression
sklearn.tree - Decision Tree
sklearn.ensemble - Random Forest
XGBoost
Modelling support and Metrics
sklearn.model_selection - GridSearchCV
sklearn.metrics - model scoring
sklearn.tree.plot_tree - Decision Tree visualisations
Visualisations
matplotlib seaborn
ML Model Save/Load
pickle
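The save/load step with pickle could be sketched as below; the model object here is a placeholder dictionary standing in for a fitted estimator, and an in-memory buffer is used where the real workflow would write a `.pkl` file.

```python
import io
import pickle

# Placeholder standing in for a fitted sklearn/XGBoost model object
model = {"type": "demo", "coef": [0.1, 0.2]}

# Serialize; in practice the buffer would be open("model.pkl", "wb")
buf = io.BytesIO()
pickle.dump(model, buf)

# Deserialize; in practice open("model.pkl", "rb")
buf.seek(0)
restored = pickle.load(buf)
```

The same two calls (`pickle.dump` / `pickle.load`) work unchanged against a file handle for the demonstration notebook.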
Review the supplied data and apply cleaning operations to prepare it
for further analysis and modelling
Deduplication
Remove missing or incomplete data
Rename column headings for consistency
Review Outliers and consider removing for relevant ML models
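The cleaning operations above could be sketched in pandas as follows; the frame and column names are illustrative stand-ins, not the real dataset contents, and the IQR rule is one assumed way to review outliers.

```python
import pandas as pd

# Tiny illustrative frame standing in for HR_capstone_dataset.csv
df0 = pd.DataFrame({
    "satisfaction_level": [0.38, 0.38, None, 0.72],
    "Average_Monthly_Hours": [157, 157, 262, 223],
})

df = (
    df0
    .drop_duplicates()           # deduplication
    .dropna()                    # remove missing / incomplete rows
    .rename(columns=str.lower)   # consistent column headings
)

# Review outliers with the 1.5*IQR rule before deciding whether to drop
q1, q3 = df["average_monthly_hours"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["average_monthly_hours"] < q1 - 1.5 * iqr) |
              (df["average_monthly_hours"] > q3 + 1.5 * iqr)]
```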
Dataset Preparation:
Because this is a capstone project rather than a live engagement, I will
prepare two datasets and evaluate how each performs across the four
models in scope, as I'm curious how the two approaches compare.
Prepare a dataset with minimal Feature Engineering and a dataset
with comprehensive Feature Engineering
last_evaluation
Score of employee's last performance review [0–1]
number_project
Number of projects employee contributes to
average_monthly_hours
Average number of hours employee worked per month
time_spend_company
How long the employee has been with the company (years)
Work_accident
Whether or not the employee experienced an accident while at work
left
Whether or not the employee left the company
promotion_last_5years
Whether or not the employee was promoted in the last 5 years
Department
The employee's department
salary
The employee's salary (low, medium, high)
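The initial load-and-inspect step could look like the sketch below; the inline CSV text is a tiny illustrative stand-in for HR_capstone_dataset.csv so the snippet is self-contained.

```python
import io

import pandas as pd

# Two illustrative rows using the column names listed above
csv_text = """number_project,average_monthly_hours,time_spend_company,Work_accident,left,promotion_last_5years,Department,salary
2,157,3,0,1,0,sales,low
5,262,6,0,1,0,sales,medium
"""

# In practice: df = pd.read_csv("HR_capstone_dataset.csv")
df = pd.read_csv(io.StringIO(csv_text))

df.info()                # dtypes and non-null counts
summary = df.describe()  # distribution overview for numeric fields
```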
Exploratory Data Analysis
Pairplot to identify key correlations
Correlation plot & heatmap to identify key correlations and summary
statistics
Histogram plot for variable distribution
EDA Data Cleaning & Feature Engineering
De-duplicate
Remove missing values
Check / correct any categorical value misspellings
Correct any misspellings of column names; convert column names to
snake_case
Removal of outliers if required by a specific ML Model
As required, feature-engineer categorical features into the relevant
encoding (dummy/one-hot, ordinal) and binarise continuous variables
against a threshold.
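The encoding steps above could be sketched as follows; the salary ordering and the 175-hour threshold are assumptions made for illustration.

```python
import pandas as pd

# Illustrative frame with one nominal, one ordinal, one continuous field
df = pd.DataFrame({
    "department": ["sales", "technical", "sales"],
    "salary": ["low", "medium", "high"],
    "average_monthly_hours": [157, 262, 223],
})

# One-hot (dummy) encoding for nominal categories
df = pd.get_dummies(df, columns=["department"], prefix="dept")

# Ordinal encoding for salary, which has a natural low < medium < high order
df["salary"] = df["salary"].map({"low": 0, "medium": 1, "high": 2})

# Binarise a continuous variable against a threshold (assumed overwork flag)
df["overworked"] = (df["average_monthly_hours"] > 175).astype(int)
```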
EDA
Analysis and visualisations of key data features vs
left
Variable comparison with left:
Visualisation of the feature (histogram, heatmap, box plot, violin
plot)
Detailed Analysis of variable
Further ad hoc visualisations and analysis will be produced driven by
the
variable comparison observations
Deliverable: EDA Analysis
Summary
Client Deliverables
Write-up documenting initial findings from the EDA
Write-up of suggestions to the HR team and Team Managers
ML model performance analysis and comparisons
ML Model Demonstration and prediction demonstration notebook
Project Deliverables
Internal Analysis and RFC to analyst team members
Construct
Build, Train/Test
Build, train, and test various machine learning models:
Logistic Regression
Decision Tree
XGBoost
Random Forest
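The build/train/test loop across the models could be sketched as below. A synthetic dataset stands in for the prepared HR data, hyperparameters are defaults rather than tuned values, and XGBoost is shown commented out since it shares the same fit/score API when installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one of the prepared HR datasets
X, y = make_classification(n_samples=300, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
    # "xgboost": xgboost.XGBClassifier(),  # same fit/score API if installed
}

# Fit each model on the training split and score on the held-out split
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
```

In the real project this loop would also run under `GridSearchCV` for hyperparameter tuning, repeated once per prepared dataset.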
Disclaimer: in reality I wouldn't carry out this level of development, but I want to compare model performance with feature engineering vs without, out of curiosity.
Cleaned datasets:
data_cleaned_NoOl_FE_AllFeat - Cleaned, no outliers,
feature engineered, all fields included (AllFeat)
data_cleaned_NoOl_FE_NoDept - Cleaned, no outliers,
feature engineered, departments removed (NoDept)
data_cleaned_NoOl_NoFE_AllFeat - Cleaned, no outliers, NOT
feature engineered, all fields included (AllFeat)
data_cleaned_Ol_NoFE_AllFeat - Cleaned, outliers retained, NOT
feature engineered, all fields included (AllFeat)
Apply them across the datasets to create a comparison table of
results that will support the final model recommendation and the
selection of the model promoted to the demonstration build.
Conclusion and next steps
From the results of the model development and testing, two models
will be selected and applied to the development of the interactive and
live demonstrations
Execute
Interpret model
Evaluate model performance using metrics
Prepare results, visualizations, and actionable steps to share with
stakeholders
Conclusion
Project Close
Under instruction from the client the following operations will be
carried out:
Data Erasure - all supplied data will be backed up to the client's
cloud
All local copies of the data will then be deleted for data security
Agree dates with the client for follow-up review meetings with HR and
Team Leaders at six and twelve months