# [Python] Complete preprocessing Memo as it is

This is a memo I wrote and tried before.

• It is as it is more than I imagined.

I refer to the following books. Complete preprocessing [SQL / R / Python practice technique for data analysis]

## Data to use

https://github.com/ghmagazine/awesomebook

## Extraction

``````#use query function
pd.query('"2018-01-01" <= checkout_data <= "2018-01-20"')
``````

### Sampling

``````#df to 50%To sample
df.sample(frac=0.5)
``````

## set

### Sampling based on set ID

``````
pd.Series(df['customer_id'].unique()).sample(frac=0.5)
``````

### Calculation of the number of data and the number of types

``````
df.groupby('hotel_id').agg({'reserve_id': 'count', 'customer_id': 'nunique'})
``````

### Calculation of total value

``````
df.groupby(['hotel_id', 'people_num'])['total_price'].sum().reset_index()
``````

### Calculation of mode

``````
#The method with the most frequent mode
df['total_price'].round(-3).mode()
``````

If a positive integer is specified, the digits after the decimal point are specified, and if a negative integer is specified, the digits (digits) of the integer are specified. -1 is rounded to the tens place, -2 is rounded to the 100s place. 0 is rounded to an integer (1's place), but unlike the case where it is omitted, it returns a float type.

### Calculation of ranking

``````
df['reserve_datetime'] = pd.to_datetime(df['reserve_datetime'], format='%Y-%m-%d %H:%M:%S')
df['log_no'] = df.groupby('customer_id')['reserve_datetime'].rank(ascending=True, method='first')
df
``````

## Join

### Master table join

``````
pd.merge(df.query('people_num == 1'), df_hotel.query('is_business'), on='hotel_id', how='inner')
``````

### Total value of the past n cases

``````
df['price_sum'] = pd.Series(
df
.groupby('customer_id')
.apply(lambda x: x.sort_values(by='reserve_datetime', ascending=True))
.loc[:, 'total_price']
.rolling(center=False, window=3, min_periods=3).sum()
.reset_index(drop=True)
)
``````

## Split

### What is cross-validation?

1. Data split
2. Learning using some data
3. Verification using other data
4. Repeat patterns 2 and 3 https://mathwords.net/kousakakunin

### k-validated cross-validation

1. Divide the data into k pieces
2. Learn with k-1 and verify with the remaining one
3. Repeat 1 and 2 https://mathwords.net/kousakakunin

### Holdout verification

If you only do cross-validation, you will approach a state of over-learning about the cross-validation problem. The solution to this is holdout verification. Prepare the data separately from the cross-validation, and use the data prepared at the final verification of accuracy.

## Generate

Data generation is used when adjusting for imbalanced data There are three patterns, and the most commonly used is the combination of oversampling and undersampling.

Increase the number of data so that oversampling does not adversely affect Reduce the number of data so that undersampling does not have an adverse effect.

### Oversampling

Increase the number of data

### Undersampling

Reduce the number of data

``````
#Oversampling
from imblearn.over_sampling import SMOTE

sm = SMOTE(ratio='auto', k_neighbors=5, random_state=71)
balance_data, balance_target = sm.fit_sample(df[['length', 'thickness']], df['fault_flg'])

``````

## Numeric type

### Convert to integer type and floating point type

``````
#Convert to integer type
df['value'].astype('int8')

#Convert to floating point type
df['values'].astype('float64')

``````

### Logarithmic

Convert the input value to logarithm.

``````
df['total_price_log'] = df['total_price'].apply(lambda x: np.log(x / 1000 + 1))

``````

### Numerical categorization

``````
df = (np.floor(df['age'] / 10) * 10).astype('category')
``````

### Normalization

Mechanism to prevent overfitting.

``````
from sklearn.preprocessing import StandardScaler

df['people_num'] = df['people_num'].astype(float)
ss = StandardScaler()
result = ss.fit_transform(df[['people_num', 'total_price']])

df['people_num_normalized'] = [x for x in result]
df['total_price_normalized'] = [x for x in result]
``````

### Removal of outliers

``````#Removes values that are more than a certain multiple of the standard deviation from the mean
#Setting a value greater than 3
#The value according to the normal distribution is about 99 within the range of 3 times the standard deviation value from the mean value..73%Since the value of is settled, the probability of occurrence is 0.27%Consider the following values as outliers
df[(abs(df['total_price'] - np.mean(df['total_price'])) / np.std(df['total_price']) <= 3)].reset_index()
``````

### Dimensional compression by principal component analysis

``````
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca_values = pca.fit_transform(df[['length', 'thickness']])

print('Cumulative contribution rate: {0}'.format(sum(pca.explained_variance_ratio_)))
print('Contribution rate of each dimension: {0}'.format(pca.explained_variance_ratio_))
print('Contribution rate of each dimension: {}'.format(pca.explained_variance_ratio_))

pca_newvalues = pca.transform(df[['length', 'thickness']])
``````

### Numerical completion

#### Constant completion

``````
#Convert None to nan with replace function
df.replace('None', np.nan, inplace=True)

#Fill missing values for thickness with fillna function
df['thickness'].fillna(1, inplace=True)
``````

## Category type

### Conversion to category type

``````
#Convert to bool type
df[['sex_is_man']] = (df[['sex']] == 'man').astype('bool')

#Convert sex to category type
df['sex_c'] = pd.Categorical(df['sex'], categories=['man', 'woman'])

#Can be converted with astype
df['sex_c'] = df['sex_c'].astype('category')

df['sex_c'].cat.codes
df['sex_c'].cat.categories

``````

### Dummy variable

``````
#Convert to category type
df['sex'] = pd.Categorical(df['sex'])
# get_Dummy variable sex with dummies function
dummy_vars = pd.get_dummies(df['sex'], drop_first=False)

``````

### Aggregation of category values

``````#Convert to Category type
df['age_rank'] = pd.Categorical(np.floor(df['age']/10)*10)

#To master data`60 or more`Add
df['age_rank'].cat.add_categories(['60 or more'], inplace=True)

#isin function
#Check if a column in the data frame contains a value
df.loc[df['age_rank'].isin([60.0, 70.0, 80.0]), 'age_rank'] = '60 or more'
df['age_rank'].cat.remove_unused_categories(inplace=True)
``````

### Complement by KNN

``````
from sklearn.neighbors import KNeighborsClassifier

df.replace('None', np.nan, inplace=True)

#Extraction of non-missing data
train = df.dropna(subset=['type'], inplace=False)

#Extraction of missing data
test = df.loc[df.index.difference(train.index), :]

#knn model generation
kn = KNeighborsClassifier(n_neighbors=3)

#knn model learning
kn.fit(train[['length', 'thickness']], train['type'])

#Calculate the predicted value with knnn and complement the type
test['type'] = kn.predict(test[['length', 'thickness']])
``````