[PYTHON] I wrote the basic operation of Pandas with Jupyter Lab (Part 1)

In this article, I actually coded the basic Pandas operations described in Kame (@usdatascientist)'s blog (https://datawokagaku.com/python_for_ds_summary/) using Jupyter Lab.

Summary of basic operations of Pandas

Lesson 10

import pandas as pd
import numpy as np

Series

data = {'name':'John', 'sex':'male', 'age': 22}
john_s = pd.Series(data)
print(john_s)
name    John
sex     male
age       22
dtype: object
array = np.array([10,20,30])
pd.Series(array)
0    10
1    20
2    30
dtype: int64
array = np.array([10,20,30])
labels = ['a','b','c']
pd.Series(array, labels)
a    10
b    20
c    30
dtype: int64
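As a small supplement to the Series examples above, here is a minimal sketch of reading values back out of a labeled Series. The variable names (s, value_by_label, and so on) are made up for illustration and do not come from the original blog.

```python
import numpy as np
import pandas as pd

# Same values and labels as the example above
s = pd.Series(np.array([10, 20, 30]), index=['a', 'b', 'c'])

value_by_label = s['b']         # access by label
value_by_position = s.iloc[1]   # access by integer position
labels = list(s.index)          # the labels that were passed as the index

print(value_by_label, value_by_position, labels)
```

Both accesses point at the same element, because label 'b' sits at position 1.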

Lesson 11

How to make a DataFrame

Make from ndarray

data = {'name':'John', 'sex':'male', 'age': 22}
john_s = pd.Series(data)
print(john_s)
print(john_s['age'])
name    John
sex     male
age       22
dtype: object
22
ndarray = np.random.randint(5, size=(5,4))
pd.DataFrame(data=ndarray)
0 1 2 3
0 1 1 1 0
1 4 1 0 0
2 3 2 1 0
3 3 1 1 3
4 4 0 1 3
columns = ['a','b','c','d']
index = np.arange(0,50,10)
pd.DataFrame(data=ndarray, index=index, columns=columns)
a b c d
0 1 1 1 0
10 4 1 0 0
20 3 2 1 0
30 3 1 1 3
40 4 0 1 3
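The random array above gives different numbers on every run unless a seed is fixed, so the following sketch uses np.arange as a deterministic stand-in (my own substitution, not part of the original) to show how the index and columns arguments label the rows and columns.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random 5x4 array
ndarray = np.arange(20).reshape(5, 4)
columns = ['a', 'b', 'c', 'd']
index = np.arange(0, 50, 10)   # row labels 0, 10, 20, 30, 40

df_demo = pd.DataFrame(data=ndarray, index=index, columns=columns)

shape = df_demo.shape            # (rows, columns)
cell = df_demo.loc[20, 'c']      # row labeled 20, column 'c'
print(shape, cell)
```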

Make from dictionary

data1 = {
    'name':'John',
    'sex':'male',
    'age':22
}
data2 = {
    'name':'Zack',
    'sex':'male',
    'age':30
}
data3 ={
    'name':'Emily',
    'sex':'female',
    'age':32
}
pd.DataFrame([data1, data2, data3])
name sex age
0 John male 22
1 Zack male 30
2 Emily female 32
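The example above passes a list of dictionaries, one per row. As a hedged aside, pandas also accepts a single dictionary of lists, one per column. The sketch below builds the same table both ways; the names df_from_rows and df_from_columns are hypothetical.

```python
import pandas as pd

# One dictionary per row (the style used above)
rows = [
    {'name': 'John', 'sex': 'male', 'age': 22},
    {'name': 'Zack', 'sex': 'male', 'age': 30},
    {'name': 'Emily', 'sex': 'female', 'age': 32},
]
df_from_rows = pd.DataFrame(rows)

# One list per column (an alternative style)
cols = {
    'name': ['John', 'Zack', 'Emily'],
    'sex': ['male', 'male', 'female'],
    'age': [22, 30, 32],
}
df_from_columns = pd.DataFrame(cols)

same = df_from_rows.equals(df_from_columns)
print(same)
```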

Make from a CSV file

df = pd.read_csv('train.csv')
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Lesson 12

Display the first 5 rows with .head()

df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Check statistics with .describe()

df.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
type(df.describe())  # The result is itself a DataFrame
pandas.core.frame.DataFrame
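In the output above, the count for Age (714) is smaller than the others (891), and text columns such as Name do not appear at all. A minimal sketch with a made-up two-column table suggests why: by default .describe() summarizes only numeric columns, and count ignores missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one text column and one numeric column with a missing value
df_demo = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'age': [20.0, np.nan, 40.0],
})

stats = df_demo.describe()
described_columns = list(stats.columns)   # only the numeric column is summarized
age_count = stats.loc['count', 'age']     # NaN is not counted
age_mean = stats.loc['mean', 'age']
print(stats)
```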

Show the list of columns with .columns

df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
type(df.columns)  # The type is Index
pandas.core.indexes.base.Index
df.index  # The rows have an index as well
RangeIndex(start=0, stop=891, step=1)

Get a specific column as a Series by putting its name in brackets []

df['Age'].head()
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64
type(df['Age'])
pandas.core.series.Series

Put a list of columns in the brackets [] to extract multiple columns at once

df[['Age','Parch','Fare']].head()
Age Parch Fare
0 22.0 0 7.2500
1 38.0 0 71.2833
2 26.0 0 7.9250
3 35.0 0 53.1000
4 35.0 0 8.0500
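The difference between the two bracket styles can be sketched as follows, using a tiny made-up table with the same column names: a single name returns a Series, while a list of names returns a DataFrame, even when the list has only one entry.

```python
import pandas as pd

# Hypothetical two-row table with the same column names as above
df_demo = pd.DataFrame({
    'Age': [22.0, 38.0],
    'Parch': [0, 0],
    'Fare': [7.25, 71.2833],
})

one_col = df_demo['Age']                # single label -> Series
one_col_frame = df_demo[['Age']]        # list with one label -> DataFrame
two_cols = df_demo[['Age', 'Fare']]     # list of labels -> DataFrame

print(type(one_col), type(one_col_frame), list(two_cols.columns))
```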

Get a specific row as a Series with .iloc[int]

df.iloc[888]  # iloc = integer location (position-based)
PassengerId                                         889
Survived                                              0
Pclass                                                3
Name           Johnston, Miss. Catherine Helen "Carrie"
Sex                                              female
Age                                                 NaN
SibSp                                                 1
Parch                                                 2
Ticket                                       W./C. 6607
Fare                                              23.45
Cabin                                               NaN
Embarked                                              S
Name: 888, dtype: object
df.iloc[888]['Age']
nan
np.isnan(df.iloc[888]['Age'])
True
np.random.seed(1)
ndarray = np.random.randint(10, size=(5,5))
columns = [0,1,2,3,4]
index = ['a','b','c','d','e']
df_1 = pd.DataFrame(data=ndarray, index=index, columns=columns)
df_1
0 1 2 3 4
a 5 8 9 5 0
b 0 1 7 6 9
c 2 4 5 2 4
d 2 4 7 7 9
e 1 7 0 6 9
df_1[0] 
a    5
b    0
c    2
d    2
e    1
Name: 0, dtype: int64
df_1.loc['c']  # When the row label is not an int, use .loc['str']
0    2
1    4
2    5
3    2
4    4
Name: c, dtype: int64
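To make the difference between .iloc (by position) and .loc (by label) concrete, here is a minimal sketch. It uses np.arange instead of random numbers so the values are predictable; that substitution and the variable names are my own assumptions.

```python
import numpy as np
import pandas as pd

# String row labels and integer column labels, as in df_1 above
df_demo = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    index=['a', 'b', 'c'],
    columns=[0, 1, 2],
)

row_by_label = df_demo.loc['c']      # row whose label is 'c'
row_by_position = df_demo.iloc[2]    # third row, counted from 0
cell = df_demo.loc['c', 1]           # row 'c', column labeled 1

print(row_by_label.equals(row_by_position), cell)
```

Here the label 'c' happens to be the third row, so both lookups return the same Series.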

Drop specific rows and columns with .drop()

Drop index = 0 (the 0th row)

df.drop(0).head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q

Drop the 'Age' column

df.drop('Age', axis=1).head()
PassengerId Survived Pclass Name Sex SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 0 0 373450 8.0500 NaN S

When dropping multiple columns, pass a list as the argument: .drop([...]). Note that .drop() does not change the original df.

df.drop(['Age','PassengerId'], axis=1).head()
Survived Pclass Name Sex SibSp Parch Ticket Fare Cabin Embarked
0 0 3 Braund, Mr. Owen Harris male 1 0 A/5 21171 7.2500 NaN S
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 1 0 PC 17599 71.2833 C85 C
2 1 3 Heikkinen, Miss. Laina female 0 0 STON/O2. 3101282 7.9250 NaN S
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 1 0 113803 53.1000 C123 S
4 0 3 Allen, Mr. William Henry male 0 0 373450 8.0500 NaN S
df.head()  # .drop() did not change the original df
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
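The behavior described above can be checked with a small made-up table: .drop() returns a new DataFrame and leaves the original as it was. The sketch also shows the columns= keyword, which is an alternative to axis=1 that pandas provides.

```python
import pandas as pd

# Hypothetical two-row table
df_demo = pd.DataFrame({
    'PassengerId': [1, 2],
    'Age': [22.0, 38.0],
    'Fare': [7.25, 71.2833],
})

dropped_cols = df_demo.drop(['Age', 'PassengerId'], axis=1)   # drop columns
dropped_cols_kw = df_demo.drop(columns=['Age', 'PassengerId'])  # same, with the columns= keyword
dropped_row = df_demo.drop(0)                                  # drop the row labeled 0

# The original is untouched
print(list(df_demo.columns), len(df_demo))
```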

There are two ways to overwrite df. The first is to set inplace=True, which changes the original DataFrame. The second is to reassign the result, as in df = df.drop(...).

df = pd.read_csv('train.csv')
df.drop(['Age', 'Cabin'], axis=1, inplace=True)
df.head()
PassengerId Survived Pclass Name Sex SibSp Parch Ticket Fare Embarked
0 1 0 3 Braund, Mr. Owen Harris male 1 0 A/5 21171 7.2500 S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 1 0 PC 17599 71.2833 C
2 3 1 3 Heikkinen, Miss. Laina female 0 0 STON/O2. 3101282 7.9250 S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 1 0 113803 53.1000 S
4 5 0 3 Allen, Mr. William Henry male 0 0 373450 8.0500 S
df = pd.read_csv('train.csv')
df = df.drop(['Age', 'Cabin'], axis=1)
id(df)
140285150057616
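The two ways of overwriting can be compared side by side in a minimal sketch. The table is made up; one detail worth noticing is that a call with inplace=True returns None, so its result should not be assigned back to df.

```python
import pandas as pd

def make_demo():
    # Hypothetical helper that rebuilds the same small table each time
    return pd.DataFrame({'Age': [22.0, 38.0], 'Cabin': [None, 'C85'], 'Fare': [7.25, 71.2833]})

# Way 1: modify the object in place
df_a = make_demo()
result = df_a.drop(['Age', 'Cabin'], axis=1, inplace=True)   # returns None

# Way 2: reassign the returned DataFrame
df_b = make_demo()
df_b = df_b.drop(['Age', 'Cabin'], axis=1)

print(result, list(df_a.columns), list(df_b.columns))
```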

Get multiple rows with slicing

df.iloc[5:10]
PassengerId Survived Pclass Name Sex SibSp Parch Ticket Fare Embarked
5 6 0 3 Moran, Mr. James male 0 0 330877 8.4583 Q
6 7 0 1 McCarthy, Mr. Timothy J male 0 0 17463 51.8625 S
7 8 0 3 Palsson, Master. Gosta Leonard male 3 1 349909 21.0750 S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 0 2 347742 11.1333 S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 1 0 237736 30.0708 C
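One point that is easy to miss with slicing: .iloc[5:10] excludes the end position, like normal Python slicing, whereas a label slice with .loc includes the end label. A minimal sketch with a made-up single-column table:

```python
import pandas as pd

# Hypothetical table whose index is the default 0..9
df_demo = pd.DataFrame({'x': range(10)})

by_position = df_demo.iloc[5:10]   # positions 5, 6, 7, 8, 9 (10 is excluded)
by_label = df_demo.loc[5:9]        # labels 5 through 9 (9 is included)

print(len(by_position), by_position.equals(by_label))
```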

Lesson 13

Filter the DataFrame by specific conditions

df = pd.read_csv('train.csv')
(df['Survived'] == 1).head()  # Boolean Series that is True for survivors
0    False
1     True
2     True
3     True
4    False
Name: Survived, dtype: bool
survived_filter = df['Survived'] == 1  # Put the condition in a variable
df[survived_filter].head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
df[df['Survived'] == 1].head()  # Writing the condition inline is more common
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
df[df['Survived'] == 1].describe()  # Statistics for survivors only
PassengerId Survived Pclass Age SibSp Parch Fare
count 342.000000 342.0 342.000000 290.000000 342.000000 342.000000 342.000000
mean 444.368421 1.0 1.950292 28.343690 0.473684 0.464912 48.395408
std 252.358840 0.0 0.863321 14.950952 0.708688 0.771712 66.596998
min 2.000000 1.0 1.000000 0.420000 0.000000 0.000000 0.000000
25% 250.750000 1.0 1.000000 19.000000 0.000000 0.000000 12.475000
50% 439.500000 1.0 2.000000 28.000000 0.000000 0.000000 26.000000
75% 651.500000 1.0 3.000000 36.000000 1.000000 1.000000 57.000000
max 890.000000 1.0 3.000000 80.000000 4.000000 5.000000 512.329200
df.describe()  # Raw data, for comparison
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
df[df['Age'] >= 60].describe()  # Only rows where 'Age' >= 60
PassengerId Survived Pclass Age SibSp Parch Fare
count 26.000000 26.000000 26.000000 26.000000 26.000000 26.000000 26.000000
mean 455.807692 0.269231 1.538462 65.096154 0.230769 0.307692 43.467950
std 240.078490 0.452344 0.811456 5.110811 0.429669 0.837579 51.269998
min 34.000000 0.000000 1.000000 60.000000 0.000000 0.000000 6.237500
25% 277.250000 0.000000 1.000000 61.250000 0.000000 0.000000 10.500000
50% 489.000000 0.000000 1.000000 63.500000 0.000000 0.000000 28.275000
75% 629.750000 0.750000 2.000000 69.000000 0.000000 0.000000 58.860450
max 852.000000 1.000000 3.000000 80.000000 1.000000 4.000000 263.000000
df[(df['Age'] >= 60) & (df['Sex'] == 'female')]  # Only women aged 60 or over
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
275 276 1 1 Andrews, Miss. Kornelia Theodosia female 63.0 1 0 13502 77.9583 D7 S
366 367 1 1 Warren, Mrs. Frank Manley (Anna Sophia Atkinson) female 60.0 1 0 110813 75.2500 D37 C
483 484 1 3 Turkula, Mrs. (Hedwig) female 63.0 0 0 4134 9.5875 NaN S
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0000 B28 NaN
df[(df['Pclass'] == 1) | (df['Age'] < 10)].head(4)  # 1st class OR under 10 years old (first 4 rows)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
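When combining conditions, each comparison needs its own parentheses, and the operators are & (AND) and | (OR) rather than Python's and / or. The sketch below uses a made-up four-row table to show both, and also that a missing Age never satisfies a comparison such as Age < 10.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers; row 2 has a missing age
df_demo = pd.DataFrame({
    'Pclass': [1, 3, 3, 2],
    'Age': [40.0, 5.0, np.nan, 30.0],
    'Sex': ['male', 'female', 'female', 'female'],
})

# OR: 1st class, or younger than 10
first_or_child = df_demo[(df_demo['Pclass'] == 1) | (df_demo['Age'] < 10)]

# AND: aged 30 or over, and female
female_30_plus = df_demo[(df_demo['Age'] >= 30) & (df_demo['Sex'] == 'female')]

print(list(first_or_child.index), list(female_30_plus.index))
```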

Adding ~ (tilde) in front of a condition inverts it, so you can filter with a NOT operation.

data =[{'Name':'John', 'Survived':True},
      {'Name':'Emily', 'Survived':False},
      {'Name':'Ben', 'Survived':True}]
df = pd.DataFrame(data)
df
Name Survived
0 John True
1 Emily False
2 Ben True

This is often used when filtering by a column whose values are Boolean.

df[df['Survived']==True] 
Name Survived
0 John True
2 Ben True

Since the Survived column is already Boolean, you don't need == True. df['Survived'] is itself a Boolean Series, so you can pass it directly as the filter, as shown below.

df[df['Survived']] 
Name Survived
0 John True
2 Ben True

If you want to narrow down to Survived == False, you can write the following instead of df[df['Survived'] == False].

df[~df['Survived']] 
Name Survived
1 Emily False
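A short sketch of the NOT filter, plus one caveat that I believe is worth knowing: ~ only inverts Boolean columns. On an integer 0/1 column (such as Survived in train.csv) it performs a bitwise NOT instead, so the column should be compared or converted first. The variable names are hypothetical.

```python
import pandas as pd

df_demo = pd.DataFrame([
    {'Name': 'John', 'Survived': True},
    {'Name': 'Emily', 'Survived': False},
    {'Name': 'Ben', 'Survived': True},
])

not_survived = df_demo[~df_demo['Survived']]                # NOT filter
same_result = df_demo[df_demo['Survived'] == False]         # longer equivalent

# Caveat: on integers, ~ is a bitwise NOT, not a logical one
int_flags = pd.Series([1, 0])
bitwise = ~int_flags                   # gives -2 and -1, not a Boolean mask
logical = ~int_flags.astype(bool)      # convert to bool first to invert logically

print(list(not_survived['Name']), list(bitwise), list(logical))
```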

Change index

Renumber the index with .reset_index()

df = pd.read_csv('train.csv')
df = df[df['Sex']=='male']
df.head()  # The index now has gaps
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S

Align indexes

As with .drop(), the original df is not overwritten, so if you want to update df, either pass inplace=True or reassign it with df = df.reset_index().

df.reset_index().head()
index PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
2 5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
3 6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
4 7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
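In the output above, the old index is kept as a new column called index. If that column is not needed, .reset_index() accepts drop=True, which discards it. A minimal sketch with a made-up table:

```python
import pandas as pd

# Hypothetical three-row table
df_demo = pd.DataFrame({'Sex': ['male', 'female', 'male']})
males = df_demo[df_demo['Sex'] == 'male']      # remaining index: 0, 2

kept = males.reset_index()                     # old index saved in an 'index' column
dropped = males.reset_index(drop=True)         # old index discarded

print(list(kept.columns), list(dropped.columns), list(dropped.index))
```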

Use .set_index() to make a specific column the index

Set the index to 'Name'

As with .reset_index(), you can overwrite the original df with inplace=True.

df.set_index('Name').head()
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
Name
Braund, Mr. Owen Harris 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
Allen, Mr. William Henry 5 0 3 male 35.0 0 0 373450 8.0500 NaN S
Moran, Mr. James 6 0 3 male NaN 0 0 330877 8.4583 NaN Q
McCarthy, Mr. Timothy J 7 0 1 male 54.0 0 0 17463 51.8625 E46 S
Palsson, Master. Gosta Leonard 8 0 3 male 2.0 3 1 349909 21.0750 NaN S
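Once a column has become the index, rows can be looked up by that value with .loc. The following is a minimal sketch on a made-up two-row table; by_name and emily_age are hypothetical names.

```python
import pandas as pd

# Hypothetical table
df_demo = pd.DataFrame({'Name': ['John', 'Emily'], 'Age': [22, 32]})

by_name = df_demo.set_index('Name')        # returns a new DataFrame
emily_age = by_name.loc['Emily', 'Age']    # look up a row by name

# 'Name' moved from the columns to the index; the original is unchanged
print(by_name.index.name, list(by_name.columns), list(df_demo.columns), emily_age)
```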
