This is a memo I wrote and tried before.

- It is as it is more than I imagined.

I refer to the following books. Complete preprocessing [SQL / R / Python practice technique for data analysis]

https://github.com/ghmagazine/awesomebook

```
#use query function
pd.query('"2018-01-01" <= checkout_data <= "2018-01-20"')
```

```
#df to 50%To sample
df.sample(frac=0.5)
```

```
pd.Series(df['customer_id'].unique()).sample(frac=0.5)
```

```
df.groupby('hotel_id').agg({'reserve_id': 'count', 'customer_id': 'nunique'})
```

```
df.groupby(['hotel_id', 'people_num'])['total_price'].sum().reset_index()
```

```
#The method with the most frequent mode
df['total_price'].round(-3).mode()
```

If a positive integer is specified, the digits after the decimal point are specified, and if a negative integer is specified, the digits (digits) of the integer are specified. -1 is rounded to the tens place, -2 is rounded to the 100s place. 0 is rounded to an integer (1's place), but unlike the case where it is omitted, it returns a float type.

```
df['reserve_datetime'] = pd.to_datetime(df['reserve_datetime'], format='%Y-%m-%d %H:%M:%S')
df['log_no'] = df.groupby('customer_id')['reserve_datetime'].rank(ascending=True, method='first')
df
```

```
pd.merge(df.query('people_num == 1'), df_hotel.query('is_business'), on='hotel_id', how='inner')
```

```
df['price_sum'] = pd.Series(
df
.groupby('customer_id')
.apply(lambda x: x.sort_values(by='reserve_datetime', ascending=True))
.loc[:, 'total_price']
.rolling(center=False, window=3, min_periods=3).sum()
.reset_index(drop=True)
)
```

- Data split
- Learning using some data
- Verification using other data
- Repeat patterns 2 and 3 https://mathwords.net/kousakakunin

- Divide the data into k pieces
- Learn with k-1 and verify with the remaining one
- Repeat 1 and 2 https://mathwords.net/kousakakunin

If you only do cross-validation, you will approach a state of over-learning about the cross-validation problem. The solution to this is holdout verification. Prepare the data separately from the cross-validation, and use the data prepared at the final verification of accuracy.

Data generation is used when adjusting for imbalanced data There are three patterns, and the most commonly used is the combination of oversampling and undersampling.

Increase the number of data so that oversampling does not adversely affect Reduce the number of data so that undersampling does not have an adverse effect.

Increase the number of data

Reduce the number of data

```
#Oversampling
from imblearn.over_sampling import SMOTE
sm = SMOTE(ratio='auto', k_neighbors=5, random_state=71)
balance_data, balance_target = sm.fit_sample(df[['length', 'thickness']], df['fault_flg'])
```

```
#Convert to integer type
df['value'].astype('int8')
#Convert to floating point type
df['values'].astype('float64')
```

Convert the input value to logarithm.

```
df['total_price_log'] = df['total_price'].apply(lambda x: np.log(x / 1000 + 1))
```

```
df = (np.floor(df['age'] / 10) * 10).astype('category')
```

Mechanism to prevent overfitting.

```
from sklearn.preprocessing import StandardScaler
df['people_num'] = df['people_num'].astype(float)
ss = StandardScaler()
result = ss.fit_transform(df[['people_num', 'total_price']])
df['people_num_normalized'] = [x[0] for x in result]
df['total_price_normalized'] = [x[1] for x in result]
```

```
#Removes values that are more than a certain multiple of the standard deviation from the mean
#Setting a value greater than 3
#The value according to the normal distribution is about 99 within the range of 3 times the standard deviation value from the mean value..73%Since the value of is settled, the probability of occurrence is 0.27%Consider the following values as outliers
df[(abs(df['total_price'] - np.mean(df['total_price'])) / np.std(df['total_price']) <= 3)].reset_index()
```

```
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca_values = pca.fit_transform(df[['length', 'thickness']])
print('Cumulative contribution rate: {0}'.format(sum(pca.explained_variance_ratio_)))
print('Contribution rate of each dimension: {0}'.format(pca.explained_variance_ratio_))
print('Contribution rate of each dimension: {}'.format(pca.explained_variance_ratio_))
pca_newvalues = pca.transform(df[['length', 'thickness']])
```

Principal component analysis PCA is used to analyze handwritten numbers. Part 1 --Qiita Machine learning term "Dimensionality Reduction" that can be understood in 30 minutes-Qiita

```
#Convert None to nan with replace function
df.replace('None', np.nan, inplace=True)
#Fill missing values for thickness with fillna function
df['thickness'].fillna(1, inplace=True)
```

```
#Convert to bool type
df[['sex_is_man']] = (df[['sex']] == 'man').astype('bool')
#Convert sex to category type
df['sex_c'] = pd.Categorical(df['sex'], categories=['man', 'woman'])
#Can be converted with astype
df['sex_c'] = df['sex_c'].astype('category')
df['sex_c'].cat.codes
df['sex_c'].cat.categories
```

```
#Convert to category type
df['sex'] = pd.Categorical(df['sex'])
# get_Dummy variable sex with dummies function
dummy_vars = pd.get_dummies(df['sex'], drop_first=False)
```

```
#Convert to Category type
df['age_rank'] = pd.Categorical(np.floor(df['age']/10)*10)
#To master data`60 or more`Add
df['age_rank'].cat.add_categories(['60 or more'], inplace=True)
#isin function
#Check if a column in the data frame contains a value
df.loc[df['age_rank'].isin([60.0, 70.0, 80.0]), 'age_rank'] = '60 or more'
df['age_rank'].cat.remove_unused_categories(inplace=True)
```

```
from sklearn.neighbors import KNeighborsClassifier
df.replace('None', np.nan, inplace=True)
#Extraction of non-missing data
train = df.dropna(subset=['type'], inplace=False)
#Extraction of missing data
test = df.loc[df.index.difference(train.index), :]
#knn model generation
kn = KNeighborsClassifier(n_neighbors=3)
#knn model learning
kn.fit(train[['length', 'thickness']], train['type'])
#Calculate the predicted value with knnn and complement the type
test['type'] = kn.predict(test[['length', 'thickness']])
```

Recommended Posts