[PYTHON] I want to do ○○ with Pandas

Change log


In Pandas, I often ask "What should I do when I want to do it?", So I will summarize it according to the purpose.

Premise

In this sample code, A list of Titanic survivors (train.csv) provided by Kaggle Read and use with pandas.read_csv ().

Titanic: Machine Learning from Disaster | Kaggle

import pandas as pd
df = pd.read_csv('train.csv')

pandas.read_csv — pandas 1.0.5 documentation

I want to output statistical information

df.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
#Narrow down the output columns
df['Age'].describe()
count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

pandas.DataFrame.describe — pandas 1.0.5 documentation

I want to check the number of data

df['Age'].count()
714

You can check the number of rows / columns that contain values other than None, NaN, and NaT.

I want to narrow down the data

# 20 < Age <Extract 40 rows
df[(20 < df['Age']) & (df['Age'] < 40)].head()
Index PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

If you want to narrow down by multiple AND / OR conditions, specify the conditions by enclosing them in (), as in df [(A) & (B)].

I want to convert categorized data to numbers

# Embarked(C, Q, S)Numerical value(1, 2, 3)Conversion to
df['Embarked'] = df['Embarked'].map({'C': 1, 'Q': 2, 'S': 3})
Index PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN 3.0
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 1.0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN 3.0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 3.0
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN 3.0

pandas.Series.map — pandas 1.0.4 documentation

I want to change the column name

# Sex(female, male)Numerical value(0, 1)Convert to and column name(Sex)To Male
df['Sex'] = df['Sex'].map({'female': 0, 'male': 1})
df = df.rename(columns={'Sex': 'Male'})
Index PassengerId Survived Pclass Name Male Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris 1 22.0 1 0 A/5 21171 7.2500 NaN 3.0
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 38.0 1 0 PC 17599 71.2833 C85 1.0
2 3 1 3 Heikkinen, Miss. Laina 0 26.0 0 0 STON/O2. 3101282 7.9250 NaN 3.0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 35.0 1 0 113803 53.1000 C123 3.0
4 5 0 3 Allen, Mr. William Henry 1 35.0 0 0 373450 8.0500 NaN 3.0

pandas.DataFrame.rename — pandas 1.0.4 documentation

If you pass an array with a list of column names, you can change all the column names at once.

pd.DataFrame({'c': [1, 2], 'd': [10, 20]}).columns = ['a', 'b']
Index a b
0 1 10
1 2 20

python - Renaming columns in pandas - Stack Overflow

I want to check the number of missing values for each column

df.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Male             0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

pandas.isnull — pandas 1.0.4 documentation pandas.DataFrame.sum — pandas 1.0.4 documentation

I want to exclude missing values

#Exclude all rows containing missing values
df_dn = df.dropna()
df_dn.count()
PassengerId    183
Survived       183
Pclass         183
Name           183
Male           183
Age            183
SibSp          183
Parch          183
Ticket         183
Fare           183
Cabin          183
Embarked       183
dtype: int64

pandas.DataFrame.dropna — pandas 1.0.5 documentation

I want to retrieve the specified column

#Extract Survived and Age columns
df[['Survived', 'Age']]
Index Survived Age
0 0 22.0
1 1 38.0
2 1 26.0
3 1 35.0
4 0 35.0

Indexing and selecting data — pandas 1.0.4 documentation Get / change the value of any position with pandas at, iat, loc, iloc | note.nkmk.me

I want to exclude the specified column

df_dn = df.drop('Cabin', axis='columns')
Index PassengerId Survived Pclass Name Male Age SibSp Parch Ticket Fare Embarked
0 1 0 3 Braund, Mr. Owen Harris 1 22.0 1 0 A/5 21171 7.2500 3.0
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 38.0 1 0 PC 17599 71.2833 1.0
2 3 1 3 Heikkinen, Miss. Laina 0 26.0 0 0 STON/O2. 3101282 7.9250 3.0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 35.0 1 0 113803 53.1000 3.0
4 5 0 3 Allen, Mr. William Henry 1 35.0 0 0 373450 8.0500 3.0

pandas.DataFrame.dropna — pandas 1.0.5 documentation

I want to apply a function to a row / column value and create a new row / column

import re

#Function to extract title
def getTitle(row):
    name = row['Name']
    p = re.compile('.*\ (.*)\.\ .*')
    surname = p.search(name)
    return surname.group(1)

df['Title'] = df.apply(getTitle, axis='columns')
Index PassengerId Survived Pclass Name Male Age SibSp Parch Ticket Fare Cabin Embarked Title
0 1 0 3 Braund, Mr. Owen Harris 1 22.0 1 0 A/5 21171 7.2500 NaN 3.0 Mr
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 38.0 1 0 PC 17599 71.2833 C85 1.0 Mrs
2 3 1 3 Heikkinen, Miss. Laina 0 26.0 0 0 STON/O2. 3101282 7.9250 NaN 3.0 Miss
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 35.0 1 0 113803 53.1000 C123 3.0 Mrs
4 5 0 3 Allen, Mr. William Henry 1 35.0 0 0 373450 8.0500 NaN 3.0 Mr

pandas.DataFrame.apply — pandas 1.0.5 documentation

I want to calculate the average for each category

#Find the average age for each title
df.groupby('Title').mean()['Age']
Title
Capt        70.000000
Col         58.000000
Countess    33.000000
Don         40.000000
Dr          42.000000
Jonkheer    38.000000
L           54.000000
Lady        48.000000
Major       48.500000
Master       4.574167
Miss        21.773973
Mlle        24.000000
Mme         24.000000
Mr          32.368090
Mrs         35.728972
Ms          28.000000
Rev         43.166667
Sir         49.000000
Name: Age, dtype: float64

You can also find the number of data items for each title by using df.groupby ('Title'). Count (). How to use Pandas groupby --Qiita

I want to sort rows by value

df.sort_values(by='Age')
Index PassengerId Survived Pclass Name Male Age SibSp Parch Ticket Fare Cabin Embarked Title AgeMean
803 804 1 3 Thomas, Master. Assad Alexander 1 0.42 0 1 2625 8.5167 NaN 1.0 Master NaN
755 756 1 2 Hamalainen, Master. Viljo 1 0.67 1 1 250649 14.5000 NaN 3.0 Master NaN
644 645 1 3 Baclini, Miss. Eugenie 0 0.75 2 1 2666 19.2583 NaN 1.0 Miss NaN
469 470 1 3 Baclini, Miss. Helene Barbara 0 0.75 2 1 2666 19.2583 NaN 1.0 Miss NaN
78 79 1 2 Caldwell, Master. Alden Gates 1 0.83 0 2 248738 29.0000 NaN 3.0 Master NaN

pandas.DataFrame.sort_values — pandas 1.0.5 documentation

Normally, DaraFrame that executed sort_values () is not changed, and the return value is obtained in the sorted state. If ʻascending = False is specified, the specified columns will be sorted in descending order. If ʻinplace = True is specified, the DataFrame that executed sort_values () will be sorted and the return value will be None.

I want to find out the unique values contained in a column

df['Survived'].unique()
array([0, 1], dtype=int64)

pandas.unique — pandas 1.0.5 documentation

I want to retrieve a line that contains a specific string

df[df['Name'].str.contains('Thomas')]
Index PassengerId Survived Pclass Name Male Age SibSp Parch Ticket Fare Cabin Embarked Title AgeMean
149 150 0 2 Byles, Rev. Thomas Roussel Davids 1 42.00 0 0 244310 13.0000 NaN 3.0 Rev NaN
151 152 1 1 Pears, Mrs. Thomas (Edith Wearne) 0 22.00 1 0 113776 66.6000 C2 3.0 Mrs NaN
159 160 0 3 Sage, Master. Thomas Henry 1 NaN 8 2 CA. 2343 69.5500 NaN 3.0 Master NaN
186 187 1 3 O'Brien, Mrs. Thomas (Johanna "Hannah" Godfrey) 0 NaN 1 0 370365 15.5000 NaN 2.0 Mrs NaN
252 253 0 1 Stead, Mr. William Thomas 1 62.00 0 0 113514 26.5500 C87 3.0 Mr NaN

pandas.Series.str.contains — pandas 1.0.5 documentation python - How to filter rows containing a string pattern from a Pandas dataframe - Stack Overflow

Use the ~ operator if you want to retrieve a * value that does not * contain a specific string.

df[~df['Name'].str.contains('Thomas')]
Index PassengerId Survived Pclass Name Male Age SibSp Parch Ticket Fare Cabin Embarked Title AgeMean
0 1 0 3 Braund, Mr. Owen Harris 1 22.0 1 0 A/5 21171 7.2500 NaN 3.0 Mr NaN
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 38.0 1 0 PC 17599 71.2833 C85 1.0 Mrs NaN
2 3 1 3 Heikkinen, Miss. Laina 0 26.0 0 0 STON/O2. 3101282 7.9250 NaN 3.0 Miss NaN
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 35.0 1 0 113803 53.1000 C123 3.0 Mrs NaN
4 5 0 3 Allen, Mr. William Henry 1 35.0 0 0 373450 8.0500 NaN 3.0 Mr NaN

python - Search for "does-not-contain" on a DataFrame in pandas - Stack Overflow

I want to add color when displaying a data frame

#value"Mr"Make the background color of the column yellow
df.style.apply(lambda x: ['background-color: yellow' if v == 'Mr' else '' for v in x])
Index PassengerId Survived Pclass Name Male Age SibSp Parch Ticket Fare Cabin Embarked Title AgeMean
0 1 0 3 Braund, Mr. Owen Harris 1 22.000000 1 0 A/5 21171 7.250000 nan 3.000000 Mr nan
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 0 38.000000 1 0 PC 17599 71.283300 C85 1.000000 Mrs nan
2 3 1 3 Heikkinen, Miss. Laina 0 26.000000 0 0 STON/O2. 3101282 7.925000 nan 3.000000 Miss nan
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 35.000000 1 0 113803 53.100000 C123 3.000000 Mrs nan
4 5 0 3 Allen, Mr. William Henry 1 35.000000 0 0 373450 8.050000 nan 3.000000 Mr nan

When opened in Jupyter Notebook, the corresponding column is displayed with a colored background. Note that when you open Jupyter Notebook on GitHub, the background color will not be added.

pandas.io.formats.style.Styler.apply — pandas 1.0.5 documentation python - Pandas style function to highlight specific columns - Stack Overflow

I want to output in CSV format

df.to_csv('output.csv', index=False)

If you do not want to include the index (line number), specify ʻindex = False`. pandas.DataFrame.to_csv — pandas 1.0.5 documentation

If you don't want to put a line break on the last line of the file, pass line_terminator =" " only on the last line python - How to stop writing a blank line at the end of csv file - pandas - Stack Overflow

Recommended Posts

I want to do ○○ with Pandas
I want to debug with Python
I want to detect objects with OpenCV
I want to blog with Jupyter Notebook
I want to pip install with PythonAnywhere
I want to analyze logs with Python
I want to play with aws with python
Convert 202003 to 2020-03 with pandas
I want to do a full text search with elasticsearch + python
I want to do Dunnett's test in Python
I want to use MATLAB feval with python
I want to analyze songs with Spotify API 2
I want to mock datetime.datetime.now () even with pytest!
I want to display multiple images with matplotlib.
I want to knock 100 data sciences with Colaboratory
I want to make a game with Python
I want to analyze songs with Spotify API 1
I want to use Temporary Directory with Python2
I don't want to use -inf with np.log
#Unresolved I want to compile gobject-introspection with Python3
I want to use ip vrf with SONiC
I want to solve APG4b with Python (Chapter 2)
I want to do pyenv + pipenv on Windows
I want to start over with Django's Migrate
I want to write to a file with Python
I want to do it with Python lambda Django, but I will stop
Note: I want to do home automation with Home Assistant + Raspberry Pi + sensor # 1
I want to solve Sudoku (Sudoku)
I want to detect unauthorized login to facebook with Jubatus (1)
I want to give a group_id to a pandas data frame
I want to transition with a button in flask
I want to handle optimization with python and cplex
I want to climb a mountain with reinforcement learning
Links to do what you want with Sublime Text
I want to inherit to the back with python dataclass
I want to split a character string with hiragana
I want to AWS Lambda with Python on Mac!
I want to manually create a legend with matplotlib
[TensorFlow] I want to process windows with Ragged Tensor
I want to run a quantum computer with Python
I want to do something in Python when I finish
I want to do Wake On LAN fully automatically
I want to bind a local variable with lambda
I want to improve efficiency with Python even in an experimental system (3) I want to do something like Excel with Pandas
I want to be able to analyze data with Python (Part 3)
I want to remove Python's Unresolved Import Warning with vsCode
I want to use R functions easily with ipython notebook
[AWS] What to do when you want to pip with Lambda
I want to specify another version of Python with pyvenv
I want to be able to analyze data with Python (Part 1)
I want to do something like sort uniq in Python
I want to understand systemd roughly
I want to start a jupyter environment with one command
[NetworkX] I want to search for nodes with specific attributes
I want to make a click macro with pyautogui (desire)
I want to change the Japanese flag to the Palau flag with Numpy
To do tail recursion with Python2
I want to be able to analyze data with Python (Part 4)
I want to save a file with "Do not compress images in file" set in OpenPyXL
I want to color black-and-white photos of memories with GAN
I want to automatically attend online classes with Python + Selenium!