[Python] Complete preprocessing Memo as it is

This is a memo I wrote and tried before.

It is as it is more than I imagined.

I refer to the following books. Complete preprocessing [SQL / R / Python practice technique for data analysis]

Data to use

https://github.com/ghmagazine/awesomebook

Extraction

#use query function
pd.query('"2018-01-01" <= checkout_data <= "2018-01-20"')

Sampling

#df to 50%To sample
df.sample(frac=0.5)

set

Sampling based on set ID


pd.Series(df['customer_id'].unique()).sample(frac=0.5)

Calculation of the number of data and the number of types


df.groupby('hotel_id').agg({'reserve_id': 'count', 'customer_id': 'nunique'})

Calculation of total value


df.groupby(['hotel_id', 'people_num'])['total_price'].sum().reset_index()

Calculation of mode


#The method with the most frequent mode
df['total_price'].round(-3).mode()

About round

If a positive integer is specified, the digits after the decimal point are specified, and if a negative integer is specified, the digits (digits) of the integer are specified. -1 is rounded to the tens place, -2 is rounded to the 100s place. 0 is rounded to an integer (1's place), but unlike the case where it is omitted, it returns a float type.

Calculation of ranking


df['reserve_datetime'] = pd.to_datetime(df['reserve_datetime'], format='%Y-%m-%d %H:%M:%S')
df['log_no'] = df.groupby('customer_id')['reserve_datetime'].rank(ascending=True, method='first')
df

Join

Master table join


pd.merge(df.query('people_num == 1'), df_hotel.query('is_business'), on='hotel_id', how='inner')

Total value of the past n cases


df['price_sum'] = pd.Series(
 df
    .groupby('customer_id')
    .apply(lambda x: x.sort_values(by='reserve_datetime', ascending=True))
    .loc[:, 'total_price']
    .rolling(center=False, window=3, min_periods=3).sum()
    .reset_index(drop=True)
)

Split

What is cross-validation?

Data split
Learning using some data
Verification using other data
Repeat patterns 2 and 3 https://mathwords.net/kousakakunin

k-validated cross-validation

Divide the data into k pieces
Learn with k-1 and verify with the remaining one
Repeat 1 and 2 https://mathwords.net/kousakakunin

Holdout verification

If you only do cross-validation, you will approach a state of over-learning about the cross-validation problem. The solution to this is holdout verification. Prepare the data separately from the cross-validation, and use the data prepared at the final verification of accuracy.

Generate

Data generation is used when adjusting for imbalanced data There are three patterns, and the most commonly used is the combination of oversampling and undersampling.

Increase the number of data so that oversampling does not adversely affect Reduce the number of data so that undersampling does not have an adverse effect.

Oversampling

Increase the number of data

Undersampling

Reduce the number of data


#Oversampling
from imblearn.over_sampling import SMOTE

sm = SMOTE(ratio='auto', k_neighbors=5, random_state=71)
balance_data, balance_target = sm.fit_sample(df[['length', 'thickness']], df['fault_flg'])

Numeric type

Convert to integer type and floating point type


#Convert to integer type
df['value'].astype('int8')

#Convert to floating point type
df['values'].astype('float64')

Logarithmic

Convert the input value to logarithm.


df['total_price_log'] = df['total_price'].apply(lambda x: np.log(x / 1000 + 1))

Numerical categorization


df = (np.floor(df['age'] / 10) * 10).astype('category')

Normalization

Mechanism to prevent overfitting.


from sklearn.preprocessing import StandardScaler

df['people_num'] = df['people_num'].astype(float)
ss = StandardScaler()
result = ss.fit_transform(df[['people_num', 'total_price']])

df['people_num_normalized'] = [x[0] for x in result]
df['total_price_normalized'] = [x[1] for x in result]

Removal of outliers

#Removes values that are more than a certain multiple of the standard deviation from the mean
#Setting a value greater than 3
#The value according to the normal distribution is about 99 within the range of 3 times the standard deviation value from the mean value..73%Since the value of is settled, the probability of occurrence is 0.27%Consider the following values as outliers
df[(abs(df['total_price'] - np.mean(df['total_price'])) / np.std(df['total_price']) <= 3)].reset_index()

Dimensional compression by principal component analysis


from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca_values = pca.fit_transform(df[['length', 'thickness']])

print('Cumulative contribution rate: {0}'.format(sum(pca.explained_variance_ratio_)))
print('Contribution rate of each dimension: {0}'.format(pca.explained_variance_ratio_))
print('Contribution rate of each dimension: {}'.format(pca.explained_variance_ratio_))

pca_newvalues = pca.transform(df[['length', 'thickness']])

PCA Information Loss

Principal component analysis PCA is used to analyze handwritten numbers. Part 1 --Qiita Machine learning term "Dimensionality Reduction" that can be understood in 30 minutes-Qiita

Numerical completion

Constant completion


#Convert None to nan with replace function
df.replace('None', np.nan, inplace=True)

#Fill missing values for thickness with fillna function
df['thickness'].fillna(1, inplace=True)

Category type

Conversion to category type


#Convert to bool type
df[['sex_is_man']] = (df[['sex']] == 'man').astype('bool')

#Convert sex to category type
df['sex_c'] = pd.Categorical(df['sex'], categories=['man', 'woman'])

#Can be converted with astype
df['sex_c'] = df['sex_c'].astype('category')

df['sex_c'].cat.codes
df['sex_c'].cat.categories

Dummy variable


#Convert to category type
df['sex'] = pd.Categorical(df['sex'])
# get_Dummy variable sex with dummies function
dummy_vars = pd.get_dummies(df['sex'], drop_first=False)

Aggregation of category values

#Convert to Category type
df['age_rank'] = pd.Categorical(np.floor(df['age']/10)*10)

#To master data`60 or more`Add
df['age_rank'].cat.add_categories(['60 or more'], inplace=True)

#isin function
#Check if a column in the data frame contains a value
df.loc[df['age_rank'].isin([60.0, 70.0, 80.0]), 'age_rank'] = '60 or more'
df['age_rank'].cat.remove_unused_categories(inplace=True)

Complement by KNN


from sklearn.neighbors import KNeighborsClassifier

df.replace('None', np.nan, inplace=True)

#Extraction of non-missing data
train = df.dropna(subset=['type'], inplace=False)

#Extraction of missing data
test = df.loc[df.index.difference(train.index), :]

#knn model generation
kn = KNeighborsClassifier(n_neighbors=3)

#knn model learning
kn.fit(train[['length', 'thickness']], train['type'])

#Calculate the predicted value with knnn and complement the type
test['type'] = kn.predict(test[['length', 'thickness']])