In this article, I work through the basic Pandas operations described in Kame (@usdatascientist)'s blog (https://datawokagaku.com/python_for_ds_summary/), coding each one myself in Jupyter Lab.

Summary of basic operations of Pandas

Lesson 10

import pandas as pd
import numpy as np

Series

data = {'name':'John', 'sex':'male', 'age': 22}
john_s = pd.Series(data)
print(john_s)

name    John
sex     male
age       22
dtype: object

array = np.array([10,20,30])
pd.Series(array)

0    10
1    20
2    30
dtype: int64

array = np.array([10,20,30])
labels = ['a','b','c']
pd.Series(array, labels)

a    10
b    20
c    30
dtype: int64
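As a small follow-up to the labeled Series above, the sketch below shows how such a Series can be read by label or by position, and that arithmetic is applied element by element. The variable names and values are illustrative assumptions, not part of the original blog.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, chosen only for illustration
scores = pd.Series(np.array([10, 20, 30]), index=['a', 'b', 'c'])

by_label = scores['b']          # look up a value by its index label
by_position = scores.iloc[1]    # look up a value by its integer position
doubled = scores * 2            # arithmetic is applied to every element

print(by_label, by_position)
print(doubled)
```

Both lookups return the same element here because the label 'b' sits at position 1.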

Lesson 11

How to make a DataFrame

Make from ndarray

data = {'name':'John', 'sex':'male', 'age': 22}
john_s = pd.Series(data)
print(john_s)
print(john_s['age'])

name    John
sex     male
age       22
dtype: object
22

ndarray = np.random.randint(5, size=(5,4))
pd.DataFrame(data=ndarray)

	0	1	2	3
0	1	1	1	0
1	4	1	0	0
2	3	2	1	0
3	3	1	1	3
4	4	0	1	3

columns = ['a','b','c','d']
index = np.arange(0,50,10)
pd.DataFrame(data=ndarray, index=index, columns=columns)

	a	b	c	d
0	1	1	1	0
10	4	1	0	0
20	3	2	1	0
30	3	1	1	3
40	4	0	1	3

Make from dictionary

data1 = {
    'name':'John',
    'sex':'male',
    'age':22
}
data2 = {
    'name':'Zack',
    'sex':'male',
    'age':30
}
data3 ={
    'name':'Emily',
    'sex':'female',
    'age':32
}
pd.DataFrame([data1, data2, data3])

	name	sex	age
0	John	male	22
1	Zack	male	30
2	Emily	female	32
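The list-of-dictionaries form above is row-oriented. As a hedged aside, pandas also accepts a single dictionary whose values are lists, which is column-oriented. The sketch below builds the same table both ways; the variable names are my own and are not from the original blog.

```python
import pandas as pd

# Row-oriented: one dictionary per row (same data as above)
records = [
    {'name': 'John', 'sex': 'male', 'age': 22},
    {'name': 'Zack', 'sex': 'male', 'age': 30},
    {'name': 'Emily', 'sex': 'female', 'age': 32},
]
df_rows = pd.DataFrame(records)

# Column-oriented: one list per column
columns = {
    'name': ['John', 'Zack', 'Emily'],
    'sex': ['male', 'male', 'female'],
    'age': [22, 30, 32],
}
df_cols = pd.DataFrame(columns)

# Both constructions should describe the same table
print(df_rows.equals(df_cols))
```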

df = pd.read_csv('train.csv')
df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

Lesson 12

Display the first 5 rows with .head()

df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

Check statistics with .describe()

df.describe()

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

type(df.describe()) #type is DataFrame

pandas.core.frame.DataFrame

Show the list of columns with .columns

df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

type(df.columns)  # type is Index

pandas.core.indexes.base.Index

df.index #There is also an index.

RangeIndex(start=0, stop=891, step=1)

Get a specific column as a Series by putting its name in brackets [].

df['Age'].head()

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

type(df['Age'])

pandas.core.series.Series

Pass a list of column names inside the brackets [] to extract multiple columns at once

df[['Age','Parch','Fare']].head()

	Age	Parch	Fare
0	22.0	0	7.2500
1	38.0	0	71.2833
2	26.0	0	7.9250
3	35.0	0	53.1000
4	35.0	0	8.0500
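The difference between the two bracket forms above is easy to miss, so here is a minimal sketch on a small hand-made DataFrame (a hypothetical stand-in for the Titanic data, so it runs without train.csv). A single label returns a Series, while a list of labels returns a DataFrame, even when the list holds only one name.

```python
import pandas as pd

# Hypothetical stand-in for a few Titanic-style columns
people = pd.DataFrame({'Age': [22.0, 38.0, 26.0],
                       'Parch': [0, 0, 0],
                       'Fare': [7.25, 71.2833, 7.925]})

age_series = people['Age']          # single label -> Series
age_frame = people[['Age']]         # list with one label -> DataFrame
subset = people[['Age', 'Fare']]    # list of labels -> DataFrame

print(type(age_series).__name__, type(age_frame).__name__)
print(subset.shape)
```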

Get a specific row as a Series with .iloc[int]

df.iloc[888] #index location

PassengerId                                         889
Survived                                              0
Pclass                                                3
Name           Johnston, Miss. Catherine Helen "Carrie"
Sex                                              female
Age                                                 NaN
SibSp                                                 1
Parch                                                 2
Ticket                                       W./C. 6607
Fare                                              23.45
Cabin                                               NaN
Embarked                                              S
Name: 888, dtype: object

df.iloc[888]['Age']

nan

np.isnan(df.iloc[888]['Age'])

True

np.random.seed(1)
ndarray = np.random.randint(10, size=(5,5))
columns = [0,1,2,3,4]
index = ['a','b','c','d','e']
df_1 = pd.DataFrame(data=ndarray, index=index, columns=columns)
df_1

	0	1	2	3	4
a	5	8	9	5	0
b	0	1	7	6	9
c	2	4	5	2	4
d	2	4	7	7	9
e	1	7	0	6	9

df_1[0]

a    5
b    0
c    2
d    2
e    1
Name: 0, dtype: int64

df_1.loc['c']  # When the row label is not an int, use .loc['str'] instead

0    2
1    4
2    5
3    2
4    4
Name: c, dtype: int64
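To make the contrast between .iloc and .loc concrete, the following sketch uses a small hand-written table (the values and the name grid are assumptions for illustration). .iloc selects by integer position, while .loc selects by label, and .loc also accepts a row label and a column label together.

```python
import pandas as pd

# Hypothetical 3x3 table with string row labels and integer column labels
grid = pd.DataFrame([[5, 8, 9], [0, 1, 7], [2, 4, 5]],
                    index=['a', 'b', 'c'], columns=[0, 1, 2])

row_by_label = grid.loc['c']      # row whose index label is 'c'
row_by_position = grid.iloc[2]    # third row, whatever its label is
cell = grid.loc['b', 2]           # row label 'b', column label 2

print(row_by_label.equals(row_by_position))
print(cell)
```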

Drop certain rows and columns with .drop()

Drop index = 0 (0th row)

df.drop(0).head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q

Drop the 'Age' column

df.drop('Age', axis=1).head()

	PassengerId	Survived	Pclass	Name	Sex	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	0	0	373450	8.0500	NaN	S

When dropping multiple columns, pass a list as the argument: .drop([]). Note that .drop() does not change the original df.

df.drop(['Age','PassengerId'], axis=1).head()

	Survived	Pclass	Name	Sex	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	0	3	Braund, Mr. Owen Harris	male	1	0	A/5 21171	7.2500	NaN	S
1	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	1	0	PC 17599	71.2833	C85	C
2	1	3	Heikkinen, Miss. Laina	female	0	0	STON/O2. 3101282	7.9250	NaN	S
3	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	1	0	113803	53.1000	C123	S
4	0	3	Allen, Mr. William Henry	male	0	0	373450	8.0500	NaN	S

df.head()  # .drop() does not change the original df

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

There are two ways to overwrite df. The first is to set inplace=True, which changes the original DataFrame; the second is to reassign the result to df.

df = pd.read_csv('train.csv')
df.drop(['Age', 'Cabin'], axis=1, inplace=True)

df.head()  # 'Age' and 'Cabin' are gone from the original df

	PassengerId	Survived	Pclass	Name	Sex	SibSp	Parch	Ticket	Fare	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	1	0	A/5 21171	7.2500	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	1	0	PC 17599	71.2833	C
2	3	1	3	Heikkinen, Miss. Laina	female	0	0	STON/O2. 3101282	7.9250	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	1	0	113803	53.1000	S
4	5	0	3	Allen, Mr. William Henry	male	0	0	373450	8.0500	S

df = pd.read_csv('train.csv')
df = df.drop(['Age', 'Cabin'], axis=1)

id(df)

140285150057616
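The two ways of overwriting can be compared side by side on a small hand-made DataFrame (hypothetical data, so the sketch runs without train.csv). It also uses the columns= keyword, which in current pandas is an equivalent spelling of axis=1 when dropping columns. This is only a sketch of the behavior described above, not the blog's own code.

```python
import pandas as pd

# Hypothetical stand-in for a few Titanic-style columns
passengers = pd.DataFrame({'PassengerId': [1, 2, 3],
                           'Age': [22.0, 38.0, 26.0],
                           'Cabin': [None, 'C85', None]})

# .drop() returns a new DataFrame and leaves the original untouched
trimmed = passengers.drop(['Age', 'Cabin'], axis=1)
columns_before = list(passengers.columns)

# columns= is another way to say "drop these column labels"
same = passengers.drop(columns=['Age', 'Cabin'])

# inplace=True modifies the original and returns None
result = passengers.drop(['Age', 'Cabin'], axis=1, inplace=True)
columns_after = list(passengers.columns)

print(columns_before)
print(columns_after)
print(result)
```

Because inplace=True returns None, writing df = df.drop(..., inplace=True) would replace df with None, so the two approaches should not be combined.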

Get multiple rows with slicing

df.iloc[5:10]

	PassengerId	Survived	Pclass	Name	Sex	SibSp	Parch	Ticket	Fare	Embarked
5	6	0	3	Moran, Mr. James	male	0	0	330877	8.4583	Q
6	7	0	1	McCarthy, Mr. Timothy J	male	0	0	17463	51.8625	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	3	1	349909	21.0750	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	0	2	347742	11.1333	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	1	0	237736	30.0708	C
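One detail worth checking when slicing rows is which end of the slice is included. The sketch below, on a hypothetical ten-row DataFrame, shows that .iloc slices exclude the end position (like ordinary Python slices), whereas .loc slices include the end label.

```python
import pandas as pd

# Hypothetical frame with the default RangeIndex 0..9
frame = pd.DataFrame({'value': range(10, 20)})

by_position = frame.iloc[5:8]   # positions 5, 6, 7 (end is excluded)
by_label = frame.loc[5:8]       # labels 5, 6, 7, 8 (end is included)

print(by_position.index.tolist())
print(by_label.index.tolist())
```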

Lesson 13

Filter the DataFrame by specific conditions

df = pd.read_csv('train.csv')
(df['Survived'] == 1).head()  # Boolean Series: True for survivors

0    False
1     True
2     True
3     True
4    False
Name: Survived, dtype: bool

mask = df['Survived'] == 1  # Put the condition in a variable (named mask so it does not shadow the built-in filter)
df[mask].head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C

df[df['Survived'] == 1].head()  # Writing the condition inline is more common

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C

df[df['Survived'] ==1].describe() #Describe only survivor data

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	342.000000	342.0	342.000000	290.000000	342.000000	342.000000	342.000000
mean	444.368421	1.0	1.950292	28.343690	0.473684	0.464912	48.395408
std	252.358840	0.0	0.863321	14.950952	0.708688	0.771712	66.596998
min	2.000000	1.0	1.000000	0.420000	0.000000	0.000000	0.000000
25%	250.750000	1.0	1.000000	19.000000	0.000000	0.000000	12.475000
50%	439.500000	1.0	2.000000	28.000000	0.000000	0.000000	26.000000
75%	651.500000	1.0	3.000000	36.000000	1.000000	1.000000	57.000000
max	890.000000	1.0	3.000000	80.000000	4.000000	5.000000	512.329200

df.describe() #raw data

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

df[df['Age'] >= 60].describe() #'Age'>=60 only

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	26.000000	26.000000	26.000000	26.000000	26.000000	26.000000	26.000000
mean	455.807692	0.269231	1.538462	65.096154	0.230769	0.307692	43.467950
std	240.078490	0.452344	0.811456	5.110811	0.429669	0.837579	51.269998
min	34.000000	0.000000	1.000000	60.000000	0.000000	0.000000	6.237500
25%	277.250000	0.000000	1.000000	61.250000	0.000000	0.000000	10.500000
50%	489.000000	0.000000	1.000000	63.500000	0.000000	0.000000	28.275000
75%	629.750000	0.750000	2.000000	69.000000	0.000000	0.000000	58.860450
max	852.000000	1.000000	3.000000	80.000000	1.000000	4.000000	263.000000

df[(df['Age']>=60) & (df['Sex']=='female')]  # Women aged 60 or over only

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
275	276	1	1	Andrews, Miss. Kornelia Theodosia	female	63.0	1	13502	77.9583	D7	S
366	367	1	1	Warren, Mrs. Frank Manley (Anna Sophia Atkinson)	female	60.0	1	110813	75.2500	D37	C
483	484	1	3	Turkula, Mrs. (Hedwig)	female	63.0	0	4134	9.5875	NaN	S
829	830	1	1	Stone, Mrs. George Nelson (Martha Evelyn)	female	62.0	0	113572	80.0000	B28	NaN

df[(df['Pclass']==1) | (df['Age']<10)].head(4)  # First 4 rows that are 1st class or under 10 years old

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
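The combined filters above rely on two details: each comparison must be wrapped in parentheses, and the element-wise operators & and | are used instead of the Python keywords and / or. The sketch below illustrates this on a small hypothetical DataFrame (not the Titanic file) and also shows that a missing Age never satisfies a comparison.

```python
import pandas as pd

# Hypothetical passengers; the last one has a missing Age (NaN)
passengers = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 2],
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 8.0, 62.0, None],
})

is_first_class = passengers['Pclass'] == 1
is_child = passengers['Age'] < 10     # NaN compares as False

# OR: first class, or a child under 10
either = passengers[is_first_class | is_child]

# AND: each comparison is parenthesized because & binds tighter than >= and ==
older_women = passengers[(passengers['Age'] >= 60) & (passengers['Sex'] == 'female')]

print(either.index.tolist())
print(older_women.index.tolist())
```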

Prefixing a condition with ~ (tilde) negates it, so you can filter with a NOT operation.

data =[{'Name':'John', 'Survived':True},
      {'Name':'Emily', 'Survived':False},
      {'Name':'Ben', 'Survived':True}]
df = pd.DataFrame(data)
df

	Name	Survived
0	John	True
1	Emily	False
2	Ben	True

It is often used when filtering by a column whose value is boolean.

df[df['Survived']==True]

	Name	Survived
0	John	True
2	Ben	True

Since the Survived column is already Boolean, you don't need == True. df['Survived'] is itself a Boolean Series, so you can pass it directly as the filter, as shown below.

df[df['Survived']]

	Name	Survived
0	John	True
2	Ben	True

If you want to narrow down to Survived == False, you can do the following instead of writing df[df['Survived'] == False].

df[~df['Survived']]

	Name	Survived
1	Emily	False
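The ~ operator is not limited to columns that are already Boolean. As a hedged sketch, the example below adds a hypothetical Age column to the small table above and negates a comparison; the comparison is parenthesized first because ~ binds more tightly than >= or <.

```python
import pandas as pd

# Same small table as above, plus a hypothetical Age column
people = pd.DataFrame({'Name': ['John', 'Emily', 'Ben'],
                       'Survived': [True, False, True],
                       'Age': [22, 32, 15]})

not_survived = people[~people['Survived']]

# Parenthesize the comparison before negating it
not_adult = people[~(people['Age'] >= 18)]

# NOT can be combined with & and |
survived_adults = people[people['Survived'] & ~(people['Age'] < 18)]

print(not_survived['Name'].tolist())
print(not_adult['Name'].tolist())
print(survived_adults['Name'].tolist())
```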

Change index

Reassign the index with .reset_index()

df = pd.read_csv('train.csv')
df = df[df['Sex']=='male']
df.head()  # The index is no longer consecutive

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S

Align indexes

As with .drop(), the original df is not overwritten, so if you want to update df, either pass inplace=True or reassign with df = df.reset_index().

df.reset_index().head()

	index	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
2	5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
3	6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
4	7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
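Note that the output above keeps the old index as a new column named index. If that column is not wanted, .reset_index() accepts drop=True, which discards the old index instead. The sketch below shows both behaviors on a small hypothetical DataFrame so it runs without train.csv.

```python
import pandas as pd

# Hypothetical table; filtering leaves gaps in the index
people = pd.DataFrame({'Name': ['John', 'Emily', 'Ben', 'Zack'],
                       'Sex': ['male', 'female', 'male', 'male']})
men = people[people['Sex'] == 'male']       # index is now 0, 2, 3

kept = men.reset_index()                # old index becomes an 'index' column
dropped = men.reset_index(drop=True)    # old index is discarded

print(kept.columns.tolist())
print(dropped.index.tolist())
```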

Use .set_index() to make a specific column the index

Set the index to 'Name'

As with .reset_index(), you can overwrite the original df with inplace=True.

df.set_index('Name').head()

	PassengerId	Survived	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
Name
Braund, Mr. Owen Harris	1	0	3	male	22.0	1	0	A/5 21171	7.2500	NaN	S
Allen, Mr. William Henry	5	0	3	male	35.0	0	0	373450	8.0500	NaN	S
Moran, Mr. James	6	0	3	male	NaN	0	0	330877	8.4583	NaN	Q
McCarthy, Mr. Timothy J	7	0	1	male	54.0	0	0	17463	51.8625	E46	S
Palsson, Master. Gosta Leonard	8	0	3	male	2.0	3	1	349909	21.0750	NaN	S
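One reason to index by a column such as 'Name' is that rows can then be looked up by that value with .loc. The following is a minimal sketch on hypothetical data; it also shows that .reset_index() turns the index back into an ordinary column.

```python
import pandas as pd

# Hypothetical table with unique names
people = pd.DataFrame({'Name': ['John', 'Emily', 'Ben'],
                       'Age': [22, 32, 15]})

by_name = people.set_index('Name')         # 'Name' moves from a column to the index
emily_age = by_name.loc['Emily', 'Age']    # look up a row by its name
restored = by_name.reset_index()           # 'Name' becomes a regular column again

print(emily_age)
print(restored.columns.tolist())
```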

[PYTHON] I wrote the basic operation of Pandas with Jupyter Lab (Part 1)
