I will summarize, as a memorandum in Jupyter Notebook format, the preprocessing that is often used in machine learning. The dataset is Kaggle's home_loan.
The screen display of Jupyter Notebook is rather narrow, and if you display many columns and rows, the middle part is omitted. It is therefore convenient to change the display settings first if necessary.
# Expand the screen display width
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
#Change the maximum number of displayed columns and rows
import pandas as pd
pd.options.display.max_columns = 50 # Maximum number of columns to display (default 20)
pd.options.display.max_rows = 100 # Maximum number of rows to display (default 60)
This time, the training data and the test data are read and processed separately. First, read the training data.
train = pd.read_csv('./train.csv')
train.head()
Next, split it into features and target.
# Split into features and target
X_train = train.iloc[ : , :-1] # All rows, all columns except the last
y_train = train.iloc[ : , -1] # All rows, last column only
X_train = X_train.drop('Loan_ID', axis=1) # Drop the Loan_ID column
y_train = y_train.map( {'Y':1, 'N':0} ) # Replace Y with 1 and N with 0
X_train.head()
To make it easy to check the features at each step, define a function check(df).
# Feature check
def check(df):
    col_list = df.columns.values # Get column names
    row = []
    for col in col_list:
        tmp = ( col, # Column name
                df[col].dtypes, # Data type
                df[col].isnull().sum(), # Number of nulls
                df[col].count(), # Number of data points (excluding missing values)
                df[col].nunique(), # Number of unique values (excluding missing values)
                df[col].unique() ) # Unique values
        row.append(tmp) # Append tmp to row
    df = pd.DataFrame(row) # Convert row to a dataframe
    df.columns = ['feature', 'dtypes', 'nan', 'count', 'num_unique', 'unique'] # Set the dataframe column names
    return df
check(X_train)
In the output, feature is the feature name, dtypes is the data type, nan is the number of missing values, count is the number of data points (excluding missing values), num_unique is the number of unique values (excluding missing values), and unique is the list of unique values.
Basically, an object column is a categorical variable, but a numerical column can also be treated as categorical. Credit_History is numeric (float64), but since it has only 2 unique values, let's treat it as a categorical variable this time.
Change the data type of Credit_History to object.
# Convert Credit_History to object
X_train['Credit_History'] = X_train['Credit_History'].astype(object)
Let's look at the relationship between the object-type features and the target. First, replace the missing values of the object columns with the string 'nan'.
# Replace missing values of object columns with the string 'nan'
c_list = X_train.dtypes[X_train.dtypes=='object'].index.tolist() # Get the list of object column names
X_train[c_list] = X_train[c_list].fillna('nan')
Use sns.barplot to plot the relationship between each object-type feature and Loan_Status. In the plots, the bar heights are the means and the error bars are confidence intervals.
# Relationship between object-type features and the target
import matplotlib.pyplot as plt
import seaborn as sns
fig = plt.figure(figsize=(14,6))
for j , i in enumerate([0, 1, 2, 3, 4, 9, 10]):
    ax = fig.add_subplot(2, 4 , j+1)
    sns.barplot(x=X_train.iloc[:,i], y=y_train, data=X_train, palette='Set3' )
plt.tight_layout()
plt.show()
Credit_History seems to have the most influence on Loan_Status.
Categorical variables are often one-hot encoded. By decomposing each feature into its individual elements, only the elements that are effective for prediction can be selected later. So, let's one-hot encode the object columns.
#One-hot encoding of categorical variables
X_train1 = pd.get_dummies(X_train)
check(X_train1)
The total number of features is 26.
For numerical variables, missing values can be imputed with the mean, median, most frequent value, a constant, and so on; here we use the mean.
# Impute missing values of numerical variables
X_train2 = X_train1.fillna(X_train1.mean())
check(X_train2)
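For reference, the same imputation can also be done with scikit-learn's SimpleImputer, which directly supports the mean, median, most_frequent, and constant strategies mentioned above. This is a minimal sketch, not part of the original notebook; it assumes X_train1 from the previous step, and X_train2_alt is just an illustrative name.
# Minimal sketch: numerical imputation with SimpleImputer (assumes X_train1 from above)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean') # could also be 'median', 'most_frequent' or 'constant'
X_train2_alt = pd.DataFrame(imputer.fit_transform(X_train1),
                            columns=X_train1.columns) # back to a DataFrame with the same column names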
Now, let's use sns.FacetGrid to draw, for each numerical variable, its distribution (kernel density) when Loan_Status is 0 and when it is 1.
X_tmp = X_train2.join(y_train)
for i in [0, 1, 2, 3]:
    facet = sns.FacetGrid(X_tmp, hue='Loan_Status', aspect=2)
    facet.map(sns.kdeplot, X_tmp.columns.values[i], shade= True)
    facet.set(xlim=(0, X_tmp.iloc[:, i].max()))
    facet.add_legend()
    plt.show()
All four features have regions where Loan_Status = 1 is clearly dominant.
Let's rank the 26 features using the Gini importance of a Random Forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
# Fit a random forest
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train2, y_train)
# Display feature importances
labels = X_train2.columns.values # Get feature labels
importances = clf.feature_importances_ # Get feature importances
index = np.argsort(importances) # Indices that sort the importances in ascending order
plt.figure(figsize=(8, 5))
plt.barh(range(X_train2.shape[1]), importances[index])
plt.yticks(range(X_train2.shape[1]), labels[index])
plt.title('feature importances')
plt.show()
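If a plot is hard to read, the same importances can also be listed as a sorted table. A small sketch for reference, using the labels and importances computed above:
# Sketch: show the feature importances as a sorted table (uses labels and importances from above)
imp_df = pd.DataFrame({'feature': labels, 'importance': importances})
print(imp_df.sort_values('importance', ascending=False).head(10))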
The top three are Credit_History_0.0, ApplicantIncome, and LoanAmount.
RFE (Recursive Feature Elimination) reduces the number of features. The algorithm builds a model starting with all features and removes the least important ones; it then repeats the process of building a model and removing the least important features until the predetermined number of features remains.
from sklearn.feature_selection import RFE
clf = RandomForestClassifier(n_estimators=100, random_state=0)
selector = RFE(estimator=clf,
               n_features_to_select=10, # Number of features to keep
               step=.05) # Fraction of features to remove at each step
selector.fit(X_train2, y_train)
#Compress 26 dimensions to 10 dimensions
select = selector.transform(X_train2)
X_train3 = pd.DataFrame(select,
                        columns=X_train2.columns[selector.support_])
check(X_train3)
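If you want to see not only which features were kept but also how the dropped ones were ranked, the fitted selector exposes support_ (a boolean mask of the selected features) and ranking_ (1 for selected features, larger numbers for features eliminated earlier). A small sketch for reference, assuming the selector fitted above:
# Sketch: inspect the RFE result (uses the selector fitted above)
rfe_result = pd.DataFrame({'feature': X_train2.columns,
                           'selected': selector.support_,
                           'ranking': selector.ranking_})
print(rfe_result.sort_values('ranking'))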
This is the training data finally obtained.
Next, load the test data.
#Read test data
test = pd.read_csv('./test.csv')
X_test = test.drop('Loan_ID', axis=1) # Drop the Loan_ID column
check(X_test)
As before, convert Credit_History to object, then one-hot encode including missing values.
Here, by adding the option dummy_na = True to pd.get_dummies(), one-hot encoding that includes the missing values is done in a single step.
# Convert Credit_History to object
X_test['Credit_History'] = X_test['Credit_History'].astype(object)
# Handle missing values of categorical variables and one-hot encode
X_test1 = pd.get_dummies(X_test,
                         dummy_na = True) # Include missing values
check(X_test1)
Ah, there are two more features than in the training data.
This is because the dummy_na = True option creates an XX_nan column filled with zeros even when the column has no missing values.
Here, Married_nan, Education_nan, and Property_Area_nan (rows 9, 17, and 27 in the check output) correspond to this.
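As a toy illustration (hypothetical data, not from this dataset): even when a column has no missing values at all, dummy_na = True still appends an all-zero nan dummy for it.
# Toy example (hypothetical data): 'color' has no missing values,
# but dummy_na=True still adds a 'color_nan' column containing only zeros
demo = pd.DataFrame({'color': ['red', 'blue', 'red']})
print(pd.get_dummies(demo, dummy_na=True))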
Let's see the difference between the features of the training data and the test data.
# Get the feature names as sets
cols_train = set(X_train1.columns.values)
cols_test = set(X_test1.columns.values)
#Features in train but not in test
diff1 = cols_train - cols_test
print('train only', diff1)
#Features in test but not in train
diff2 = cols_test - cols_train
print('test only', diff2)
Education_nan and Property_Area_nan exist only in the test data. As for Married_nan, it already exists in the training data with unique values [0, 1], so it does not show up in the difference.
As a general rule, features that are in train but not in test are restored, and features that are in test but not in train are deleted. Here, only the features that are in test but not in train need to be deleted (a more general alignment sketch follows the code below).
X_test1 = X_test1.drop(['Education_nan', 'Property_Area_nan'], axis=1)
check(X_test1)
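For reference, the general rule above can be written in one line with reindex: train-only columns are restored (filled with 0) and test-only columns are dropped. This is an alternative sketch rather than what this notebook does, and X_test_aligned is just an illustrative name.
# Alternative sketch: align the test columns to the training columns in one step
# (restores train-only columns filled with 0 and drops test-only columns)
X_test_aligned = X_test1.reindex(columns=X_train1.columns, fill_value=0)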
Now, let's impute the missing values of the numerical variables using the mean values of the training data.
# Impute missing values of numerical variables
X_test2 = X_test1.fillna(X_train1.mean())
Then, as with the training data, the features are finally narrowed down to 10. Before that, put the features of the test data in the same order as those of the training data.
# Put the features of X_test2 in the same order as X_train2
X_test2 = X_test2.reindex(X_train2.columns.values, axis=1)
Finally, the test data is also narrowed down using selector.support_, the RFE selection result obtained from the training data.
X_test3 = X_test2.loc[ : , X_test2.columns[selector.support_]]
check(X_test3)
This is the final test data. This completes the data preprocessing.