[PYTHON] Machine learning / data preprocessing

1. First of all

This is a memorandum summarizing preprocessing steps that are often used in machine learning, written in Jupyter Notebook format. The dataset is Kaggle's home loan dataset.

2. Change display settings

The default display width of Jupyter Notebook is rather narrow, and when you display many columns or rows the middle ones are truncated. It is therefore convenient to change the display settings first, if necessary.

# Widen the notebook display area
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# Change the maximum number of displayed columns and rows
import pandas as pd
pd.options.display.max_columns = 50  # Maximum number of columns displayed (default 20)
pd.options.display.max_rows = 100  # Maximum number of rows displayed (default 60)

3. Reading training data

This time, the training data and the test data are read and processed separately. First, read the training data.

train = pd.read_csv('./train.csv')
train.head()

Next, separate the data into features and target.

# Split into features and target
X_train = train.iloc[ : , :-1]  # All rows, all columns except the last
y_train = train.iloc[ : , -1]  # All rows, last column only
X_train = X_train.drop('Loan_ID', axis=1)  # Drop the Loan_ID column
y_train = y_train.map( {'Y':1, 'N':0} )  # Map Y to 1 and N to 0
X_train.head()

To check the features easily each time, define a function check(df).

# Feature check
def check(df):
    col_list = df.columns.values  # Get column names
    row = []
    for col in col_list:
        tmp = ( col,  # Column name
                df[col].dtypes,  # Data type
                df[col].isnull().sum(),  # Number of missing values
                df[col].count(),  # Number of non-missing values
                df[col].nunique(),  # Number of unique values (excluding missing values)
                df[col].unique() )  # Unique values
        row.append(tmp)  # Append tmp to row
    df = pd.DataFrame(row)  # Convert row to a DataFrame
    df.columns = ['feature', 'dtypes', 'nan',  'count', 'num_unique', 'unique']  # Set the column names
    return df

check(X_train)

In the output, feature is the feature name, dtypes is the data type, nan is the number of missing values, count is the number of non-missing values, num_unique is the number of unique values (excluding missing values), and unique lists the unique values.

4. Handling of categorical variables

Basically, object columns are treated as categorical variables, but a numeric column can also be treated as categorical. Credit_History is numeric (float64), but since it has only 2 unique values, let's treat it as a categorical variable this time.

Change the data type of Credit_History to object.

# Convert Credit_History to object
X_train['Credit_History'] = X_train['Credit_History'].astype(object)

Let's look at the relationship between the object-type features and the target. First, replace the missing values of the object columns with the string 'nan'.

# Replace missing values of object columns with the string 'nan'
c_list = X_train.dtypes[X_train.dtypes=='object'].index.tolist()  # Get the list of object column names
X_train[c_list] = X_train[c_list].fillna('nan')

Use sns.barplot to plot the relationship between each object-type feature and Loan_Status. Bar height shows the mean, and the error bars show confidence intervals.

#Relationship between object type features and targets
import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure(figsize=(14,6))
for j, i in enumerate([0, 1, 2, 3, 4, 9, 10]):  # column indices of the object-type features
    ax = fig.add_subplot(2, 4, j+1)
    sns.barplot(x=X_train.iloc[:, i], y=y_train, data=X_train, palette='Set3')

plt.tight_layout()
plt.show()

Credit_History seems to have the most influence on Loan_Status. Categorical variables are often one-hot encoded: by decomposing a feature into its individual categories, only the categories that are actually useful for prediction can later be selected. So let's one-hot encode the object columns.
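As a small illustration of that idea (a minimal sketch using the Property_Area categories), a single column is decomposed into one binary column per category; the actual encoding of X_train follows below.

# Toy illustration: one categorical column becomes one binary column per category,
# so a feature selector can later keep only the informative categories
tmp = pd.DataFrame({'Property_Area': ['Urban', 'Rural', 'Semiurban']})
print(pd.get_dummies(tmp))
# -> columns Property_Area_Rural, Property_Area_Semiurban, Property_Area_Urban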

#One-hot encoding of categorical variables
X_train1 = pd.get_dummies(X_train) 
check(X_train1)

The total number of features is now 26.

5. Processing of numeric variables

For numeric variables, missing values can be imputed with the mean, median, most frequent value, a constant, and so on; here we use the mean.

# Impute missing numeric values with the mean
X_train2 = X_train1.fillna(X_train1.mean())
check(X_train2)
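The imputation strategies listed above also correspond to scikit-learn's SimpleImputer (strategy='mean', 'median', 'most_frequent', or 'constant'). A minimal sketch of the equivalent mean imputation, in case you prefer to keep this step inside an sklearn workflow; the fillna result above is what is actually used from here on.

from sklearn.impute import SimpleImputer

# Equivalent mean imputation with scikit-learn (a sketch, not the step used above)
imputer = SimpleImputer(strategy='mean')
X_train2_alt = pd.DataFrame(imputer.fit_transform(X_train1),
                            columns=X_train1.columns)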

Now, let's use sns.FacetGrid to draw the distribution (kernel density estimate) of each numeric variable for Loan_Status 0 and 1.

X_tmp = X_train2.join(y_train)  # Join the target back on for plotting
for i in [0, 1, 2, 3]:  # column indices of the numeric features
    facet = sns.FacetGrid(X_tmp, hue='Loan_Status', aspect=2)
    facet.map(sns.kdeplot, X_tmp.columns.values[i], shade=True)
    facet.set(xlim=(0, X_tmp.iloc[:, i].max()))
    facet.add_legend()
    plt.show()

All four features have regions where Loan_Status = 1 is clearly dominant.

8. Importance of features

Let's rank the 26 features using the Gini importance from a random forest.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Fit a random forest
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train2, y_train)

# Plot feature importances
labels = X_train2.columns.values  # Get feature labels
importances = clf.feature_importances_  # Get feature importance scores
index = np.argsort(importances)  # Indices that sort importances in ascending order

plt.figure(figsize=(8, 5))
plt.barh(range(X_train2.shape[1]), importances[index])
plt.yticks(range(X_train2.shape[1]), labels[index])
plt.title('feature importances')
plt.show()

The top three are Credit_History_0.0, ApplicantIncome, and LoanAmount.

9. Narrowing down the features

RFE (Recursive Feature Elimination) reduces the number of features. The algorithm builds a model starting with all features and removes the least important ones. Building a model and deleting the least important features is then repeated until the predetermined number of features is reached.

from sklearn.feature_selection import RFE

clf = RandomForestClassifier(n_estimators=100, random_state=0)
selector = RFE(estimator=clf,
               n_features_to_select=10,  # Number of features to keep
               step=.05)  # Fraction of features removed at each step

selector.fit(X_train2, y_train)

# Compress 26 dimensions down to 10
select = selector.transform(X_train2)
X_train3 = pd.DataFrame(select,
                        columns=X_train2.columns[selector.support_])

check(X_train3)

This is the final training data.
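If you want to see which of the 26 features survived, the fitted selector also exposes support_ (a boolean mask of the selected features) and ranking_ (1 for selected features); a minimal sketch:

# Which features were kept, and the elimination ranking (1 = selected)
print(X_train2.columns[selector.support_].tolist())
print(selector.ranking_)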

10. Reading and processing test data

Next, load the test data.

#Read test data
test = pd.read_csv('./test.csv')
X_test = test.drop('Loan_ID', axis=1)  # Drop the Loan_ID column
check(X_test)

As before, convert Credit_History to object, then one-hot encode the categorical variables, this time including missing values.

By adding the dummy_na=True option to pd.get_dummies(), one-hot encoding that includes missing values is done in one shot.

# Convert Credit_History to object
X_test['Credit_History'] = X_test['Credit_History'].astype(object)

# One-hot encode the categorical variables, treating missing values as a category
X_test1 = pd.get_dummies(X_test,
                         dummy_na = True)  # Include missing values as a dummy column
check(X_test1)

Hmm, there are two more features than in the training data.

This is because the dummy_na=True option creates an XX_nan column filled with zeros even for columns that have no missing values.

Here, Married_nan, Education_nan, and Property_Area_nan are those columns.
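A minimal sketch on a toy frame (hypothetical columns, just to illustrate the behavior): 'color' has a missing value while 'size' does not, yet both get a _nan column.

# Toy example: dummy_na=True adds an XX_nan column for every encoded column,
# even when that column has no missing values (it is then all zeros)
toy = pd.DataFrame({'color': ['red', None, 'blue'], 'size': ['S', 'M', 'L']})
print(pd.get_dummies(toy, dummy_na=True).columns.tolist())
# -> ['color_blue', 'color_red', 'color_nan', 'size_L', 'size_M', 'size_S', 'size_nan']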

Let's see the difference between the features of the training data and the test data.

# Get the feature names as sets
cols_train = set(X_train1.columns.values)
cols_test = set(X_test1.columns.values)

#Features in train but not in test
diff1 = cols_train - cols_test
print('train only', diff1)

#Features in test but not in train
diff2 = cols_test - cols_train
print('test only', diff2)

Only the test data has Education_nan and Property_Area_nan. Married_nan, on the other hand, already exists in the training data with unique values [0, 1], because Married has missing values there that were encoded as the 'nan' category.

The basic rule: features that are in train but not in test are added back (as all-zero columns), and features that are in test but not in train are deleted. Here, only the latter applies, so we just delete the two test-only features.
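A minimal sketch of that general rule (not the step actually used below): reindex handles both cases at once, adding train-only features as all-zero columns and dropping test-only ones.

# General column alignment: add missing train-only columns as zeros, drop test-only ones
X_test_aligned = X_test1.reindex(columns=X_train1.columns, fill_value=0)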

X_test1 = X_test1.drop(['Education_nan', 'Property_Area_nan'], axis=1)
check(X_test1)


Now, impute the missing values of the numeric variables using the mean values of the training data.

# Impute missing numeric values using the training-data means
X_test2 = X_test1.fillna(X_train1.mean())

Then, as with the training data, the features are narrowed down to the final 10. Before that, make sure the test data has its features in the same order as the training data.

# Reorder the columns of X_test2 to match X_train2
X_test2 = X_test2.reindex(X_train2.columns.values, axis=1)

Finally, the test data is also narrowed down using selector.support_, the RFE selection result obtained on the training data.

X_test3 = X_test2.loc[ : , X_test2.columns[selector.support_]]
check(X_test3)

This is the final test data, and the data preprocessing is complete.
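As a final sanity check (a minimal sketch, not part of the original steps), confirm that the training and test frames have exactly the same columns in the same order before handing them to a model.

# The two frames must match in column names and order
assert list(X_train3.columns) == list(X_test3.columns)
print(X_train3.shape, X_test3.shape)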
