[PYTHON] [1 copy per day] Predict employee attrition [Daily_Coding_001]

Introduction

- This article is a memo-style post by a beginner self-studying Python, machine learning, and the like.
- The approach is extremely simple: "study by coding along with code that interests you."
- Constructive comments are very welcome (and please LGTM & stock if you like the post).

Theme: IBM HR Analytics Employee Attrition & Performance

- The theme this time is **IBM HR Analytics Employee Attrition & Performance**. According to the description on Kaggle, the task is to look for the **"reasons employees leave"**.
- This time I coded along while watching the following YouTube video.

Link: Predict Employee Attrition Using Machine Learning & Python

The data comes from Kaggle.

Link: IBM HR Analytics Employee Attrition & Performance

As in the YouTube video, the analysis was done on Google Colaboratory (what a convenient age we live in).

Step 1: Load the data and check its contents

Now, let's get into it.

1.1: Import library

#Loading the library
import numpy as np
import pandas as pd
import seaborn as sns

Load the basic libraries first. I expect to keep adding more libraries as the need arises.

Next, for loading the data: upload the CSV file downloaded from the Kaggle site to Google Colab.

1.2: Uploading files to Google Colab

#Data upload
from google.colab import files
uploaded = files.upload()

Running this lets you bring locally stored files into Google Colab. I usually upload files to Google Drive and mount the Drive to read them, so this way is quicker and easier.
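
For reference, here is a minimal sketch of the Google Drive approach mentioned above (the mount point is the standard one, but the file path is hypothetical and depends on where you save the CSV):

#Alternative: mount Google Drive and read the file from there
from google.colab import drive
drive.mount('/content/drive')

#Hypothetical path: adjust to wherever the CSV sits in your Drive
df = pd.read_csv('/content/drive/MyDrive/WA_Fn-UseC_-HR-Employee-Attrition.csv')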

1.3: Loading with pandas

I will read the uploaded data.

#Data reading
df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')

#Data confirmation
df.head(7)

Familiar code so far. From here on, we check the contents of the data.

Checking the contents of the data

The following snippets were actually run in separate cells, but I've grouped them together here.

#Check the number of rows / columns in the data frame
df.shape

#Check the data type of the contents of each column
df.dtypes

#Confirmation of missing values
df.isna().sum()
df.isnull().values.any()

#Confirmation of basic statistics
df.describe()

#Check the balance of the target variable: leavers vs. current employees
df['Attrition'].value_counts() #Figure 1

#Visualize leavers vs. current employees
sns.countplot(x='Attrition', data=df)

#Visualize leavers vs. current employees by age
import matplotlib.pyplot as plt
plt.subplots(figsize=(12,4))
sns.countplot(x='Age', hue='Attrition', data=df, palette='colorblind') #Figure 2

【Figure 1】 Attrition1.png

【Figure 2】 Attrition2.png

Up to this point it is the usual data inspection. First of all, I think it is important to get a firm grasp of what is in the data.

1.4: Checking the unique values of object-type columns

Next, for the columns whose data type we found to be object, check their unique values.

for column in df.columns:
  if df[column].dtype == object:
    print(str(column) + ':' + str(df[column].unique()))
    print(df[column].value_counts())
    print('___________________________________________')
  1. Line 1: Loop over each column in a for loop
  2. Line 2: Check whether the current column is of object type
  3. Line 3: Print the column name together with its unique values
  4. Line 4: Print the count of each unique value
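
As an aside, the same check can be written a little more compactly with pandas' select_dtypes; this is my own equivalent sketch, not from the video:

#Equivalent loop, restricted to object-type columns up front
for column in df.select_dtypes(include='object').columns:
    print(str(column) + ':' + str(df[column].unique()))
    print(df[column].value_counts())
    print('___________________________________________')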

1.5: Drop unnecessary columns

Remove the columns that are meaningless for prediction with .drop().

df = df.drop('Over18', axis=1)
df = df.drop('EmployeeNumber', axis=1)
df = df.drop('StandardHours', axis=1)
df = df.drop('EmployeeCount', axis=1)

This one is self-explanatory: remove from df anything that cannot explain why someone leaves.
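
Incidentally, Over18, StandardHours, and EmployeeCount hold the same value for every employee in this dataset, and EmployeeNumber is just an ID. If you want to find such constant columns programmatically rather than by eye, here is a minimal sketch (run it before the drops above):

#Columns with a single unique value carry no information for prediction
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
print(constant_cols)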

1.6: Checking the correlation between columns

I think this is also a familiar step. Check the correlation between the columns and visualize it as a heatmap.

#On newer pandas, the remaining object columns must be excluded explicitly
df.corr(numeric_only=True)

plt.figure(figsize=(14, 14))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt='.0%')

This time, the following two arguments are specified when creating the heatmap.

| Item | Description |
|:--|:--|
| annot | If set to True, the value is written into each cell. |
| fmt | The string format used for the cell annotations when annot=True or when annotation data is specified. |

Reference: Create a heatmap with Seaborn

1.7: Labeling categorical (non-numeric) data with sklearn

from sklearn.preprocessing import LabelEncoder

for column in df.columns:
    if df[column].dtype != object: #skip columns that are already numeric
        continue
    df[column] = LabelEncoder().fit_transform(df[column])

Here, sklearn's LabelEncoder is used to replace the object-type data with numbers (the character labels are converted to discrete values (0, 1, ...) before being passed to the classifier).
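
To see concretely what LabelEncoder does, here is a tiny standalone example of my own (not from the video). Classes are numbered in sorted order, so for the Attrition column 'No' becomes 0 and 'Yes' becomes 1:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
print(le.fit_transform(['Yes', 'No', 'No', 'Yes'])) #-> [1 0 0 1]
print(le.classes_) #-> ['No' 'Yes']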

After the replacement, reorder the df columns to make the analysis easier. Moving Age to the end leaves Attrition as the first column, which makes it simple to split off the target variable below.

#Duplicate Age to new column
df['Age_Years'] = df['Age']

#Drop the Age column
df = df.drop('Age', axis=1)
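
A quick check (my addition) that the reordering did what we want, namely that the target column now comes first:

#Attrition should now be the first column, with Age_Years at the end
print(df.columns[0]) #-> 'Attrition'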

Step 2: Analyze with sklearn

Now for the main event (though it goes without saying that the preprocessing is the important part).

#Split df into explanatory variables and the explained (target) variable
X = df.iloc[:, 1:df.shape[1]].values
Y = df.iloc[:, 0].values

#Split into training data and test data (test size: 25%)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=0)

#Classification by random forest
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
forest.fit(X_train, Y_train)

Let's walk through this from the top.

- `iloc[]` is used to separate the explanatory variables from the explained variable.
- The training data and test data are split with sklearn's `train_test_split`. The arguments of `train_test_split` are as follows.

| Item | Description |
|:--|:--|
| arrays | The data to split: NumPy arrays, multiple lists of the same length, matrices, or pandas DataFrames. |
| test_size | A float or an integer. As a float, the proportion of test data, between 0.0 and 1.0. As an integer, the number of records to include in the test data. If omitted or None, it is set to the complement of train_size; if train_size is also unset, the default of 0.25 is used. |
| train_size | A float or an integer. As a float, the proportion of training data, between 0.0 and 1.0. As an integer, the number of records to include in the training data. If omitted or None, it is the size left after subtracting test_size from the whole dataset. |
| random_state | An integer or a RandomState instance used to seed random number generation. If omitted, NumPy's np.random is used. |

(See: [Create training and test data with scikit-learn](https://pythondatascience.plavox.info/scikit-learn/%e3%83%88%e3%83%ac%e3%83%bc%e3%83%8b%e3%83%b3%e3%82%b0%e3%83%87%e3%83%bc%e3%82%bf%e3%81%a8%e3%83%86%e3%82%b9%e3%83%88%e3%83%87%e3%83%bc%e3%82%bf))

- Classify using Random Forest. The arguments here are:

- `n_estimators`: the number of trees (default is 100)
- `criterion`: `gini` or `entropy` (default is `gini`)

After that, the model is trained with forest.fit(X_train, Y_train).
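
As a quick sanity check (my addition, not in the video), you can confirm the 75/25 split came out as expected; this dataset has 1,470 rows and, after the preprocessing above, 30 explanatory variables:

#Confirm the split sizes
print(X_train.shape, X_test.shape) #expected: (1102, 30) (368, 30)
print(Y_train.shape, Y_test.shape) #expected: (1102,) (368,)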

Now let's look at the accuracy. Note that score below is computed on the training data.

forest.score(X_train, Y_train)

After this, we use `confusion_matrix` (the confusion matrix) to calculate the accuracy on the test data.

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(Y_test, forest.predict(X_test)) #cm: confusion_matrix

TN = cm[0][0] #true negatives: stayed, predicted to stay
TP = cm[1][1] #true positives: left, predicted to leave
FN = cm[1][0] #false negatives: left, predicted to stay
FP = cm[0][1] #false positives: stayed, predicted to leave

print(cm)
print('Model Testing Accuracy = {}'.format( (TP + TN) / (TP + TN + FN + FP)))
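
As a cross-check (my addition), sklearn's accuracy_score should return exactly the same number as the manual calculation above:

from sklearn.metrics import accuracy_score
print(accuracy_score(Y_test, forest.predict(X_test))) #matches (TP + TN) / (TP + TN + FN + FP)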

That's it: simple as it is, this was a code-along of binary classification using sklearn.

In closing

The content was not that difficult, but I realized there are still parts I don't fully understand, so I want to keep studying.

That's all.
