Hello, this is Motty. I recently started Kaggle, so I would like to summarize the basic data-processing operations in Pandas, which are essential there.
・ Kaggle official website https://www.kaggle.com/
・ Pandas User Guide https://pandas.pydata.org/docs/user_guide/index.html
Pandas is a Python library for data processing. It makes it easy to manipulate tables, aggregate values, and fill in missing values. It can also do much more, such as joining columns, displaying correlation matrices, and cross-tabulating.
This article uses the Titanic survival data, a Kaggle classic.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
df_train = pd.read_csv("train.csv")
df_train.columns #Item list
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')
df_train.shape #size
(891, 12)
This is the shape of the table: 891 rows and 12 columns. Let's take a look at only the first 5 rows.
df_train.head()
Quantitative data and qualitative data are mixed. (The Pclass values are numbers, but they are class labels rather than measured quantities, so Pclass can be regarded as qualitative data!)
df_train["Name"] #Extract a single column
df_train.loc[:,["Name","Sex"]] #Extract multiple columns
df_train[10:15] #Extract rows 10 to 14 by position
df_train[10:20:2] #Extract every other row from 10 to 19
df_train[df_train["Age"] < 20] #Extract rows that satisfy a condition
df_train[1:300].query('Age < 20 and Sex == "male"') #Specify multiple conditions with query
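The extraction patterns above can be tried without train.csv. The following is only a minimal sketch on a made-up table (the five rows and their values are hypothetical, not taken from the Titanic data). It shows that a boolean mask and query() give the same rows, and that loc can filter rows and pick columns in one step.

```python
import pandas as pd

# Hypothetical miniature table; the values are made up for illustration.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Sex": ["male", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, 4.0, 15.0, 19.0],
})

# Boolean mask: combine conditions with & and wrap each one in parentheses.
mask_result = df[(df["Age"] < 20) & (df["Sex"] == "male")]

# The same filter written with query().
query_result = df.query('Age < 20 and Sex == "male"')

# loc filters rows and selects columns in a single step.
names = df.loc[df["Age"] < 20, ["Name", "Sex"]]

print(mask_result)
print(names)
```

Which style to use is mostly a matter of taste. query() tends to read better when there are many conditions, while a boolean mask works with any Python expression.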
df_train["Pclass"].value_counts()
3    491
1    216
2    184
Name: Pclass, dtype: int64
value_counts() returns the number of rows for each distinct value.
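If proportions are more useful than raw counts, value_counts() also accepts normalize=True. A small sketch on a made-up Series (the six values are hypothetical):

```python
import pandas as pd

# Hypothetical class labels; not the real Titanic column.
pclass = pd.Series([3, 1, 3, 2, 3, 1], name="Pclass")

# Raw counts, sorted from most to least frequent.
counts = pclass.value_counts()

# Share of each value instead of the count.
shares = pclass.value_counts(normalize=True)

print(counts)
print(shares)
```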
df_train.isnull() #True → Missing value
df_train.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
This gives the total number of NaN values in each column. Based on these counts, we can decide on a policy for filling in the missing values.
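This article does not settle on a specific filling policy, so the following is only one possible sketch under my own assumptions: fill a numeric column with its median and a string column with its most frequent value. The table is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical table with missing values; not the real Titanic data.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 4.0, 30.0],
    "Embarked": ["S", "C", None, "S"],
})

filled = df.copy()

# Assumed policy 1: numeric column -> fill with the median of the known values.
filled["Age"] = filled["Age"].fillna(filled["Age"].median())

# Assumed policy 2: string column -> fill with the most frequent value (mode).
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])

print(filled.isnull().sum())
```

For a column like Cabin, where most of the values are missing, dropping the column may be more reasonable than filling it, but that depends on the analysis.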
A correlation matrix holds the correlation coefficient for every pair of variables.
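df_train.corr(numeric_only=True) #Display the correlation coefficients (pandas 2.0 and later raises an error on string columns without numeric_only=True)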
             PassengerId  Survived    Pclass       Age     SibSp     Parch      Fare
PassengerId     1.000000 -0.005007 -0.035144  0.036847 -0.057527 -0.001652  0.012658
Survived       -0.005007  1.000000 -0.338481 -0.077221 -0.035322  0.081629  0.257307
Pclass         -0.035144 -0.338481  1.000000 -0.369226  0.083081  0.018443 -0.549500
Age             0.036847 -0.077221 -0.369226  1.000000 -0.308247 -0.189119  0.096067
SibSp          -0.057527 -0.035322  0.083081 -0.308247  1.000000  0.414838  0.159651
Parch          -0.001652  0.081629  0.018443 -0.189119  0.414838  1.000000  0.216225
Fare            0.012658  0.257307 -0.549500  0.096067  0.159651  0.216225  1.000000
The diagonal components are always 1 because each variable is perfectly correlated with itself. However, the data contains not only quantitative data but also character strings (Sex, Embarked) and qualitative data expressed as numbers (Pclass). These must be converted into one-hot representations before they can be used in the correlation matrix.
dummy_df = pd.get_dummies(df_train, columns = ["Sex","Embarked","Pclass"]) #Get dummy variables
dummy_df.columns
Index(['PassengerId', 'Survived', 'Name', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Pclass_1', 'Pclass_2', 'Pclass_3'], dtype='object')
dummy_df.corr(numeric_only=True) #Name, Ticket, and Cabin are still strings, so they are skipped
I won't show the output because it's huge. The earlier correlation matrix left out the string columns. This time, the correlation matrix is returned with the qualitative variables replaced by dummy variables (0 or 1).
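Since the full result is too large to show, here is a minimal sketch of the same idea on a made-up table of six rows (the values are hypothetical). I pass dtype=int to get_dummies so that the dummies are 0/1 integers regardless of the pandas version; that argument is my addition, not part of the code above.

```python
import pandas as pd

# Hypothetical miniature table; values are made up for illustration.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 2, 3, 1, 2],
})

# One column per category value; the original Sex and Pclass columns are dropped.
dummies = pd.get_dummies(df, columns=["Sex", "Pclass"], dtype=int)

# Every column is numeric now, so corr() covers all of them.
corr = dummies.corr()

print(dummies.columns.tolist())
print(corr["Survived"])
```

In this toy table Survived is identical to Sex_female, so their correlation is exactly 1. Real data will of course be less clear-cut.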
Let's look at the correlation matrix using seaborn's heatmap. For the color palette, I chose a diverging one centered on 0, where the color gets deeper as the value moves away from 0.
import seaborn as sns
plt.figure(figsize = (12,10))
sns.heatmap(dummy_df.corr(numeric_only=True), cmap = "seismic", vmin = -1, vmax = 1, annot = True)
Variables that are strongly correlated with Survived can be thought of as being related to survival. Sex_female and Pclass_1 both show a high correlation with Survived. This means that passengers who were female or who had a room class of 1 had a high survival rate. If you use this well, you can get a head start on deciding a policy before the machine learning stage!
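Instead of reading the heatmap by eye, the column for Survived can also be pulled out and sorted. This is a minimal sketch on a made-up table (the eight rows are hypothetical, so the numbers do not match the real Titanic data).

```python
import pandas as pd

# Hypothetical data after one-hot encoding; values are made up for illustration.
df = pd.DataFrame({
    "Survived":   [0, 1, 1, 0, 1, 0, 1, 0],
    "Sex_female": [0, 1, 1, 0, 1, 0, 0, 1],
    "Fare":       [7.0, 71.0, 8.0, 8.5, 53.0, 8.0, 30.0, 10.0],
})

# Take the Survived column of the correlation matrix, drop the trivial
# self-correlation, and sort from the most positive to the most negative.
target_corr = (
    df.corr()["Survived"]
    .drop("Survived")
    .sort_values(ascending=False)
)

print(target_corr)
```

Keep in mind that a high correlation only suggests a relationship. It does not prove that the variable causes survival.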
I referred to the following articles.
Reference URL
・ Pandas basic operations that frequently occur in data analysis https://qiita.com/ysdyt/items/9ccca82fc5b504e7913a
・ PS. It's time to use the seaborn heatmap as well. from mom https://qiita.com/hiroyuki_kageyama/items/00d0f52724f16ad7cf77
Reference book
・ Complete preprocessing [SQL / R / Python practice technique for data analysis]
・ Can be used in the field! Introduction to pandas data preprocessing: Preprocessing methods useful in machine learning and data science
At first I wondered whether it would be enough to just put the data in a NumPy array for processing, but the appeal of Pandas is that data processing can be done so easily. Conversely, if you want to perform advanced calculations on a DataFrame, it may be better to move the data to an array first ...?
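As a rough sketch of what "moving it to an array once" could look like, the following converts a made-up numeric DataFrame with to_numpy(), standardizes each column in NumPy, and wraps the result back into a DataFrame. The column standardization is just my example of a calculation, and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric table; values are made up for illustration.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
})

# Move the values to a plain NumPy array (row and column labels are left behind).
arr = df.to_numpy()

# Example calculation in NumPy: standardize each column to mean 0, std 1.
standardized = (arr - arr.mean(axis=0)) / arr.std(axis=0)

# Put the labels back by wrapping the array in a DataFrame again.
df_std = pd.DataFrame(standardized, columns=df.columns, index=df.index)

print(df_std)
```

This only works cleanly when all columns are numeric. With mixed columns, to_numpy() falls back to an object array, so it is safer to select the numeric columns first.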