[Python] Complete preprocessing Memo as it is

This is a memo I wrote and tried before.

I refer to the following books. Complete preprocessing [SQL / R / Python practice technique for data analysis]

Data to use



#use query function
pd.query('"2018-01-01" <= checkout_data <= "2018-01-20"')


#df to 50%To sample


Sampling based on set ID


Calculation of the number of data and the number of types

df.groupby('hotel_id').agg({'reserve_id': 'count', 'customer_id': 'nunique'})

Calculation of total value

df.groupby(['hotel_id', 'people_num'])['total_price'].sum().reset_index()

Calculation of mode

#The method with the most frequent mode

About round

If a positive integer is specified, the digits after the decimal point are specified, and if a negative integer is specified, the digits (digits) of the integer are specified. -1 is rounded to the tens place, -2 is rounded to the 100s place. 0 is rounded to an integer (1's place), but unlike the case where it is omitted, it returns a float type.

Calculation of ranking

df['reserve_datetime'] = pd.to_datetime(df['reserve_datetime'], format='%Y-%m-%d %H:%M:%S')
df['log_no'] = df.groupby('customer_id')['reserve_datetime'].rank(ascending=True, method='first')


Master table join

pd.merge(df.query('people_num == 1'), df_hotel.query('is_business'), on='hotel_id', how='inner')

Total value of the past n cases

df['price_sum'] = pd.Series(
    .apply(lambda x: x.sort_values(by='reserve_datetime', ascending=True))
    .loc[:, 'total_price']
    .rolling(center=False, window=3, min_periods=3).sum()


What is cross-validation?

  1. Data split
  2. Learning using some data
  3. Verification using other data
  4. Repeat patterns 2 and 3 https://mathwords.net/kousakakunin

k-validated cross-validation

  1. Divide the data into k pieces
  2. Learn with k-1 and verify with the remaining one
  3. Repeat 1 and 2 https://mathwords.net/kousakakunin

Holdout verification

If you only do cross-validation, you will approach a state of over-learning about the cross-validation problem. The solution to this is holdout verification. Prepare the data separately from the cross-validation, and use the data prepared at the final verification of accuracy.


Data generation is used when adjusting for imbalanced data There are three patterns, and the most commonly used is the combination of oversampling and undersampling.

Increase the number of data so that oversampling does not adversely affect Reduce the number of data so that undersampling does not have an adverse effect.


Increase the number of data


Reduce the number of data

from imblearn.over_sampling import SMOTE

sm = SMOTE(ratio='auto', k_neighbors=5, random_state=71)
balance_data, balance_target = sm.fit_sample(df[['length', 'thickness']], df['fault_flg'])

Numeric type

Convert to integer type and floating point type

#Convert to integer type

#Convert to floating point type


Convert the input value to logarithm.

df['total_price_log'] = df['total_price'].apply(lambda x: np.log(x / 1000 + 1))

Numerical categorization

df = (np.floor(df['age'] / 10) * 10).astype('category')


Mechanism to prevent overfitting.

from sklearn.preprocessing import StandardScaler

df['people_num'] = df['people_num'].astype(float)
ss = StandardScaler()
result = ss.fit_transform(df[['people_num', 'total_price']])

df['people_num_normalized'] = [x[0] for x in result]
df['total_price_normalized'] = [x[1] for x in result]

Removal of outliers

#Removes values that are more than a certain multiple of the standard deviation from the mean
#Setting a value greater than 3
#The value according to the normal distribution is about 99 within the range of 3 times the standard deviation value from the mean value..73%Since the value of is settled, the probability of occurrence is 0.27%Consider the following values as outliers
df[(abs(df['total_price'] - np.mean(df['total_price'])) / np.std(df['total_price']) <= 3)].reset_index()

Dimensional compression by principal component analysis

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca_values = pca.fit_transform(df[['length', 'thickness']])

print('Cumulative contribution rate: {0}'.format(sum(pca.explained_variance_ratio_)))
print('Contribution rate of each dimension: {0}'.format(pca.explained_variance_ratio_))
print('Contribution rate of each dimension: {}'.format(pca.explained_variance_ratio_))

pca_newvalues = pca.transform(df[['length', 'thickness']])

PCA Information Loss

Principal component analysis PCA is used to analyze handwritten numbers. Part 1 --Qiita Machine learning term "Dimensionality Reduction" that can be understood in 30 minutes-Qiita

Numerical completion

Constant completion

#Convert None to nan with replace function
df.replace('None', np.nan, inplace=True)

#Fill missing values for thickness with fillna function
df['thickness'].fillna(1, inplace=True)

Category type

Conversion to category type

#Convert to bool type
df[['sex_is_man']] = (df[['sex']] == 'man').astype('bool')

#Convert sex to category type
df['sex_c'] = pd.Categorical(df['sex'], categories=['man', 'woman'])

#Can be converted with astype
df['sex_c'] = df['sex_c'].astype('category')


Dummy variable

#Convert to category type
df['sex'] = pd.Categorical(df['sex'])
# get_Dummy variable sex with dummies function
dummy_vars = pd.get_dummies(df['sex'], drop_first=False)

Aggregation of category values

#Convert to Category type
df['age_rank'] = pd.Categorical(np.floor(df['age']/10)*10)

#To master data`60 or more`Add
df['age_rank'].cat.add_categories(['60 or more'], inplace=True)

#isin function
#Check if a column in the data frame contains a value
df.loc[df['age_rank'].isin([60.0, 70.0, 80.0]), 'age_rank'] = '60 or more'

Complement by KNN

from sklearn.neighbors import KNeighborsClassifier

df.replace('None', np.nan, inplace=True)

#Extraction of non-missing data
train = df.dropna(subset=['type'], inplace=False)

#Extraction of missing data
test = df.loc[df.index.difference(train.index), :]

#knn model generation
kn = KNeighborsClassifier(n_neighbors=3)

#knn model learning
kn.fit(train[['length', 'thickness']], train['type'])

#Calculate the predicted value with knnn and complement the type
test['type'] = kn.predict(test[['length', 'thickness']])

Recommended Posts

[Python] Complete preprocessing Memo as it is
What is Python? What is it used for?
Python memo
python memo
Python memo
python memo
Python memo
Python memo
Python memo
Python memo using perl --format is convenient.
When you want to use it as it is when using it with lambda memo
Read CSV file with Python and convert it to DataFrame as it is
[Python] Memo dictionary
Python is easy
python beginner memo (9.2-10)
python memo: Treat lists as a set type
python beginner memo (9.1)
Since python is read as "Pichon", it can be executed with "Pichon" (it is a story)
[Python] Pre-processing tricks
[Python] EDA memo
Python 3 operator memo
What is python
Python is instance
[My memo] python
Python3 metaclass memo
[Python] Basemap memo
Judge whether it is a prime number [Python]
Python beginner memo (2)
[Python] Numpy memo
What is Python
Since memory_profiler of python is heavy, I measured it
[Python] What is energy data? Calender Heatmap [Plotly] Memo
Python memo using perl-Dictionary type (case is not valid)
python int is infinite
Python class (Python learning memo ⑦)
My python environment memo
python openCV installation (memo)
Python module (Python learning memo ④)
[Python] What is Pipeline ...
Is it Google Colaboratory?
Python test package memo
I tried "Streamlit" which turns the Python code into a web application as it is
[Python] Memo about functions
python regular expression memo
Binary search (python2.7) memo
[My memo] python -v / python -V
Python3 List / dictionary memo
[Memo] Python3 list sort
[Python] Memo about errors
DynamoDB Script Memo (Python)
Python basic memo --Part 2
python recipe book Memo
Complete Python compound pretreatment
Basic Python command memo
Python OpenCV tutorial memo
Python basic grammar memo
TensorFlow API memo (Python)
python useful memo links
Python decorator operation memo
Python basic memo --Part 1
Effective Python Memo Item 3