[PYTHON] Basic operation of Pandas

Hello, this is Motty. I recently started using Kaggle, so I would like to summarize the basic data-processing operations in Pandas, which are essential there. ・ Kaggle official website https://www.kaggle.com/ ・ Pandas User Guide https://pandas.pydata.org/docs/user_guide/index.html

Overview

Pandas is a Python data processing library. It makes it easy to manipulate tables, aggregate values, and fill in missing values. It can do much more, such as joining columns, displaying correlation matrices, and cross-tabulating.

Data to use

The data is the Titanic survival dataset, a Kaggle classic.

import numpy as np
import matplotlib.pyplot as plt
#Jupyter magic: draw plots inside the notebook
%matplotlib inline
import pandas as pd
df_train = pd.read_csv("train.csv")

Data confirmation

df_train.columns #Item list

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')

df_train.shape #size

(891, 12)

This is the shape of the table: 891 rows and 12 columns. Let's look at the first 5 rows.

df_train.head()

Quantitative and qualitative data are mixed. (The Pclass values are class labels, not quantities that relate to each other numerically, so Pclass can be regarded as qualitative data!)
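One way to check which columns pandas treats as numeric is select_dtypes. The following is only a minimal sketch on a made-up three-row table (the values are not from train.csv); casting Pclass to the category dtype is one possible way to mark it as qualitative, not something the original workflow does.

```python
import pandas as pd

# Toy stand-in for the Titanic table (values are made up for illustration)
df = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})

# Columns stored as numbers vs. everything else
numeric_cols = df.select_dtypes(include="number").columns.tolist()
other_cols = df.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, other_cols)

# Pclass is stored as an integer but is really a label;
# casting it to "category" tells pandas to treat it as qualitative
df["Pclass"] = df["Pclass"].astype("category")
numeric_after_cast = df.select_dtypes(include="number").columns.tolist()
print(numeric_after_cast)
```

After the cast, Pclass no longer counts as a numeric column, so it would be left out of numeric summaries.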

Column / row extraction

df_train["Name"] #Extract one column
df_train.loc[:,["Name","Sex"]] #Extract multiple columns

df_train[10:15] #Extract rows 10 to 14
df_train[10:20:2] #Every other row from 10 to 18

df_train[df_train["Age"] < 20] #Filter rows by value
df_train[1:300].query('Age < 20 and Sex == "male"') #Specify multiple conditions with query
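To make the extraction patterns above easier to try without train.csv, here is a small sketch on a made-up five-row table. It shows that a boolean mask and query can express the same filter, and how loc (labels) and iloc (positions) differ at the end of a slice. The table and variable names are hypothetical.

```python
import pandas as pd

# Toy stand-in for the Titanic table (values are made up for illustration)
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Sex": ["male", "female", "male", "male", "female"],
    "Age": [22.0, 38.0, 4.0, 15.0, 19.0],
})

# Boolean mask: combine conditions with & and wrap each one in parentheses
by_mask = df[(df["Age"] < 20) & (df["Sex"] == "male")]

# query(): the same filter written as a string expression
by_query = df.query('Age < 20 and Sex == "male"')

# loc selects by label, iloc by integer position
by_loc = df.loc[1:3, ["Name", "Sex"]]   # labels 1 to 3, end label included
by_iloc = df.iloc[1:3, 0:2]             # positions 1 and 2, end position excluded

print(by_mask.equals(by_query))
print(len(by_loc), len(by_iloc))
```

Note that the loc slice includes its end label while the iloc slice does not, which is an easy source of off-by-one mistakes.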

Aggregate

df_train["Pclass"].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

This returns the number of rows for each distinct value.
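value_counts has a few options that are handy when aggregating. The sketch below uses a made-up Series rather than the real Pclass column, and shows counts, proportions, and reordering by label.

```python
import pandas as pd

# Made-up class labels for illustration (not the real Titanic column)
pclass = pd.Series([3, 1, 3, 2, 3, 1], name="Pclass")

counts = pclass.value_counts()                 # sorted by count, largest first
shares = pclass.value_counts(normalize=True)   # fractions instead of counts
by_label = counts.sort_index()                 # reorder by class label

print(counts)
print(shares)
print(by_label)
```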

Confirmation of missing values

df_train.isnull() #True → Missing value
df_train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

This gives the total number of NaNs in each column. Based on these counts, we can decide on a policy for filling in the missing values.
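As one possible policy (an assumption for illustration, not the only reasonable choice), a numeric column can be filled with its median, a categorical column with its most frequent value, and a mostly empty column dropped. The sketch below applies this to a made-up table whose column names mirror the Titanic data.

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps, mirroring the Titanic column names
df = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, np.nan, 26.0],
    "Embarked": ["S", "C", None, "S", "S"],
    "Cabin": [None, "C85", None, None, None],
})

# Fraction of missing values per column helps decide what to do with each
missing_ratio = df.isnull().mean()
print(missing_ratio)

filled = df.copy()
# Numeric column: fill with the median of the observed values
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
# Categorical column: fill with the most frequent value
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])
# Mostly empty column: drop it instead of guessing
filled = filled.drop(columns=["Cabin"])

print(filled.isnull().sum())
```

Whether the median or the mode is appropriate depends on the data, so it is worth comparing model results under different fill policies.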

Display of correlation matrix

The correlation matrix contains the correlation coefficient between each pair of variables.
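To make it concrete what each entry means, the sketch below computes the Pearson correlation coefficient for one pair of columns by hand and compares it with pandas. The numbers are made up, and Pearson is the default method of corr. Rows where either value is missing are skipped.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; one Age is missing
df = pd.DataFrame({
    "Age":  [22.0, 38.0, 26.0, 35.0, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05],
})

# pandas: Pearson correlation, ignoring rows where either value is missing
r_pandas = df["Age"].corr(df["Fare"])

# By hand: sum of products of deviations, divided by the
# square root of the product of the summed squared deviations
pair = df.dropna()
x = pair["Age"] - pair["Age"].mean()
y = pair["Fare"] - pair["Fare"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(r_pandas, r_manual)
```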

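df_train.corr(numeric_only=True) #Display the correlation coefficients (numeric columns only; pandas 2.0 and later raise an error on string columns without numeric_only=True)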

             PassengerId  Survived    Pclass       Age     SibSp     Parch      Fare
PassengerId     1.000000 -0.005007 -0.035144  0.036847 -0.057527 -0.001652  0.012658
Survived       -0.005007  1.000000 -0.338481 -0.077221 -0.035322  0.081629  0.257307
Pclass         -0.035144 -0.338481  1.000000 -0.369226  0.083081  0.018443 -0.549500
Age             0.036847 -0.077221 -0.369226  1.000000 -0.308247 -0.189119  0.096067
SibSp          -0.057527 -0.035322  0.083081 -0.308247  1.000000  0.414838  0.159651
Parch          -0.001652  0.081629  0.018443 -0.189119  0.414838  1.000000  0.216225
Fare            0.012658  0.257307 -0.549500  0.096067  0.159651  0.216225  1.000000

The diagonal components are always 1 because each variable correlates perfectly with itself. However, the data contains not only quantitative columns but also string columns (Sex, Embarked) and qualitative data encoded as numbers (Pclass). To include these in the correlation matrix, they must be converted into one-hot representations.

dummy_df = pd.get_dummies(df_train, columns = ["Sex","Embarked","Pclass"]) #Get dummy variables
dummy_df.columns

Index(['PassengerId', 'Survived', 'Name', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Pclass_1', 'Pclass_2', 'Pclass_3'], dtype='object')

dummy_df.corr(numeric_only=True) #The remaining string columns (Name, Ticket, Cabin) are skipped

The output is too large to show here. The earlier correlation matrix left out the string columns. This time, the correlation matrix is returned with the qualitative variables replaced by dummy (0, 1) columns.
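Since the full matrix is large, a tiny made-up example may make the effect of the dummy variables easier to see. In this sketch, dtype=int is passed so the dummy columns hold 0/1 integers regardless of the pandas version. The data is hypothetical and deliberately extreme.

```python
import pandas as pd

# Made-up rows: every female survives, every male does not
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
})

# One column per category, holding 0 or 1
dummies = pd.get_dummies(df, columns=["Sex"], dtype=int)
print(dummies)

# The string column is gone, so every column can enter the correlation matrix
corr = dummies.corr()
print(corr)
```

Sex_female and Sex_male carry the same information with opposite sign, so their correlations with Survived mirror each other. This is why one of the two is sometimes dropped (get_dummies has a drop_first option for that).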

Visualization

Let's view the correlation matrix with seaborn's heatmap. For the color palette, I chose a diverging palette centered on 0, with contrasting colors that deepen toward -1 and 1.

import seaborn as sns
plt.figure(figsize = (12,10))
#numeric_only=True skips the remaining string columns (Name, Ticket, Cabin)
sns.heatmap(dummy_df.corr(numeric_only=True), cmap = "seismic", vmin = -1, vmax = 1, annot = True)


Variables that are highly correlated with Survived can be thought of as related to survival. Sex_female and Pclass_1 both have a high correlation with Survived. This means that passengers who were female or in cabin class 1 had a high survival rate. Used well, this lets you set a policy before the machine learning stage!
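Instead of reading the values off the heatmap, the Survived column of the correlation matrix can be sorted by absolute value. The sketch below is only an illustration on made-up data. The column names mirror the dummy variables above, but the numbers are not from the Titanic dataset.

```python
import pandas as pd

# Made-up table shaped like the dummy-encoded Titanic data
dummy = pd.DataFrame({
    "Survived":   [1, 0, 1, 0, 1, 0],
    "Sex_female": [1, 0, 1, 0, 1, 1],
    "Pclass_3":   [0, 1, 0, 1, 1, 1],
    "Fare":       [80.0, 7.0, 50.0, 8.0, 10.0, 9.0],
})

# Correlation of every other column with the target
target_corr = dummy.corr()["Survived"].drop("Survived")

# Order by strength of the relationship, keeping the sign visible
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```

Sorting by absolute value keeps strongly negative correlations near the top, where they are just as informative as strongly positive ones.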

References

I referred to the following articles.
Reference URLs
・ Pandas basic operations that frequently occur in data analysis https://qiita.com/ysdyt/items/9ccca82fc5b504e7913a
・ PS. It's time to use the seaborn heatmap as well. from mom https://qiita.com/hiroyuki_kageyama/items/00d0f52724f16ad7cf77
Reference books
・ Complete preprocessing [SQL / R / Python practice technique for data analysis]
・ Can be used in the field! Introduction to pandas data preprocessing: Preprocessing methods useful in machine learning and data science

At the end

At first I wondered whether it would be enough to put the data in a NumPy array for processing, but the ease of data processing in Pandas is attractive. Conversely, if you want to perform advanced calculations on a DataFrame, it may be better to move the data to an array first ...?
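Moving between the two is short in practice. As a rough sketch (made-up numbers, with standardization chosen only as an example calculation), numeric columns can be pulled out with to_numpy, processed in NumPy, and wrapped back into a DataFrame.

```python
import numpy as np
import pandas as pd

# Made-up numeric columns for illustration
df = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Fare": [7.25, 71.28, 7.93]})

# DataFrame -> NumPy: a 2-D float array with one column per feature
arr = df[["Age", "Fare"]].to_numpy()

# Example calculation in NumPy: standardize each column
standardized = (arr - arr.mean(axis=0)) / arr.std(axis=0)

# NumPy -> DataFrame: restore the column names and the original index
df_std = pd.DataFrame(standardized, columns=["Age", "Fare"], index=df.index)
print(df_std)
```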
