[PYTHON] I wrote the basic operation of Pandas with Jupyter Lab (Part 2)

14th

About NaN in a DataFrame

import numpy as np
import pandas as pd
df = pd.read_csv('train.csv')
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

.dropna() drops all records containing NaN. (The index stays the same; basically, the index is not reassigned unless you call .reset_index().)

df.dropna().head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S

You can drop columns containing NaN by passing axis=1 as an argument (the default, axis=0, drops rows).

I don't use this much. When building a model, dropping columns corresponds to the strategy of reducing the explanatory variables without reducing the number of records, but it is rare to drop an explanatory variable just because it contains a single NaN. Which explanatory variables matter for the model is an important question that deserves careful consideration.
df.dropna(axis=1).head()
PassengerId Survived Pclass Name Sex SibSp Parch Ticket Fare
0 1 0 3 Braund, Mr. Owen Harris male 1 0 A/5 21171 7.2500
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 1 0 PC 17599 71.2833
2 3 1 3 Heikkinen, Miss. Laina female 0 0 STON/O2. 3101282 7.9250
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 1 0 113803 53.1000
4 5 0 3 Allen, Mr. William Henry male 0 0 373450 8.0500

By passing a list of column names to the subset argument, only rows containing NaN in those columns are dropped.

This is handy when you only need to drop the rows whose NaN is in a particular column. It's very convenient, so remember it.

Of course, .dropna() does not overwrite the original df. If you want to update the original df, reassign it with the familiar inplace=True or df = df.dropna().

df.dropna(subset=['Age']).head()  # the row at index 888, whose Age is NaN, is dropped
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
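.dropna() also takes how and thresh arguments, which this post doesn't cover. A minimal sketch (my addition):

# drop a row only when ALL of its values are NaN
df.dropna(how='all').head()
# keep only rows that have at least 11 non-NaN values
df.dropna(thresh=11).head()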

.fillna(value)

Assigns a specific value to every NaN.

df.fillna('THIS IS IT').head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 THIS IS IT S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 THIS IS IT S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 THIS IS IT S

Fill the NaN in a particular column with that column's mean

df['Age'].mean()
29.69911764705882
df['Age'].fillna(df['Age'].mean()).head()
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64
df['Age'] = df['Age'].fillna(df['Age'].mean())
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
pd.isna(df).head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 False False False False False False False False False False True False
1 False False False False False False False False False False False False
2 False False False False False False False False False False True False
3 False False False False False False False False False False False False
4 False False False False False False False False False False True False

Store the result of the NaN judgment on Cabin in a new Cabin_nan column

df['Cabin_nan'] = pd.isna(df['Cabin'])
df
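Before deciding between dropping and filling, it helps to know how many NaN each column has. A quick check, my addition (note that at this point in the notebook Age has already been filled):

# count the NaN in each column
df.isna().sum()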

15th

Group rows with the .groupby() function

df = pd.read_csv('train.csv')
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Group by Pclass

df.groupby('Pclass') alone does not know what to do with the groups, so call an aggregation function such as .mean() or .count() after it.

df.groupby('Pclass').mean()
PassengerId Survived Age SibSp Parch Fare
Pclass
1 461.597222 0.629630 38.233441 0.416667 0.356481 84.154687
2 445.956522 0.472826 29.877630 0.402174 0.380435 20.662183
3 439.154786 0.242363 25.140620 0.615071 0.393075 13.675550
# filter to Pclass 1 without overwriting df (later cells still need the full df)
df[df['Pclass']==1].head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S

Take the statistics of each column for 'Pclass' == 1

df[df['Pclass']==1].describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 216.000000 216.000000 216.0 186.000000 216.000000 216.000000 216.000000
mean 461.597222 0.629630 1.0 38.233441 0.416667 0.356481 84.154687
std 246.737616 0.484026 0.0 14.802856 0.611898 0.693997 78.380373
min 2.000000 0.000000 1.0 0.920000 0.000000 0.000000 0.000000
25% 270.750000 0.000000 1.0 27.000000 0.000000 0.000000 30.923950
50% 472.000000 1.000000 1.0 37.000000 0.000000 0.000000 60.287500
75% 670.500000 1.000000 1.0 49.000000 1.000000 0.000000 93.500000
max 890.000000 1.000000 1.0 80.000000 3.000000 4.000000 512.329200

Extract only mean

Taking them out one by one is tedious, so it's a good idea to use .groupby() to see them all together.

df[df['Pclass']==1].describe().loc['mean']
PassengerId    461.597222
Survived         0.629630
Pclass           1.000000
Age             38.233441
SibSp            0.416667
Parch            0.356481
Fare            84.154687
Name: mean, dtype: float64

After a groupby, the index holds the values of the column specified in by, the first argument of groupby. In the example above, these are the values of Pclass (1, 2, 3). Of course, the result of groupby is also a DataFrame, so you can fetch a specific group as a Series with .loc[].

df.groupby('Pclass').mean().loc[1]
PassengerId    461.597222
Survived         0.629630
Age             38.233441
SibSp            0.416667
Parch            0.356481
Fare            84.154687
Name: 1, dtype: float64
df.groupby('Pclass').count().loc[1]  # count(), sum(), etc. also work
PassengerId    216
Survived       216
Name           216
Sex            216
Age            186
SibSp          216
Parch          216
Ticket         216
Fare           216
Cabin          176
Embarked       214
Name: 1, dtype: int64
df.groupby('Pclass').describe()
PassengerId Survived Age SibSp Parch Fare
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
Pclass
1 216.0 461.597222 246.737616 2.0 270.75 472.0 670.5 890.0 216.0 0.629630 0.484026 0.0 0.0 1.0 1.0 1.0 186.0 38.233441 14.802856 0.92 27.0 37.0 49.0 80.0 216.0 0.416667 0.611898 0.0 0.0 0.0 1.0 3.0 216.0 0.356481 0.693997 0.0 0.0 0.0 0.0 4.0 216.0 84.154687 78.380373 0.0 30.92395 60.2875 93.5 512.3292
2 184.0 445.956522 250.852161 10.0 234.50 435.5 668.0 887.0 184.0 0.472826 0.500623 0.0 0.0 0.0 1.0 1.0 173.0 29.877630 14.001077 0.67 23.0 29.0 36.0 70.0 184.0 0.402174 0.601633 0.0 0.0 0.0 1.0 3.0 184.0 0.380435 0.690963 0.0 0.0 0.0 1.0 3.0 184.0 20.662183 13.417399 0.0 13.00000 14.2500 26.0 73.5000
3 491.0 439.154786 264.441453 1.0 200.00 432.0 666.5 891.0 491.0 0.242363 0.428949 0.0 0.0 0.0 0.0 1.0 355.0 25.140620 12.495398 0.42 18.0 24.0 32.0 74.0 491.0 0.615071 1.374883 0.0 0.0 0.0 1.0 8.0 491.0 0.393075 0.888861 0.0 0.0 0.0 0.0 6.0 491.0 13.675550 11.778142 0.0 7.75000 8.0500 15.5 69.5500
df.groupby('Pclass').describe()['Age']  # extract only Age
count mean std min 25% 50% 75% max
Pclass
1 186.0 38.233441 14.802856 0.92 27.0 37.0 49.0 80.0
2 173.0 29.877630 14.001077 0.67 23.0 29.0 36.0 70.0
3 355.0 25.140620 12.495398 0.42 18.0 24.0 32.0 74.0
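If you want several statistics at once without the full describe(), .agg() is handy. A minimal sketch (my addition, not in the original):

# mean and median of Age, mean and max of Fare, per Pclass
df.groupby('Pclass').agg({'Age': ['mean', 'median'], 'Fare': ['mean', 'max']})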

In Jupyter, if a DataFrame has too many columns or rows to display, the output is truncated.

If you want to display all columns (or all rows) without truncation, run the following.

# display all columns without truncation
pd.set_option('display.max_columns', None)
# display all rows without truncation
pd.set_option('display.max_rows', None)

Loop over the result of groupby with a for statement

for i, group_df in df.groupby('Pclass'):
    print("{}: group_df's type is {} and has {}".format(i, type(group_df), len(group_df)))
1: group_df's type is <class 'pandas.core.frame.DataFrame'> and has 216
2: group_df's type is <class 'pandas.core.frame.DataFrame'> and has 184
3: group_df's type is <class 'pandas.core.frame.DataFrame'> and has 491

Rank each record within its Pclass group by Fare

df = pd.read_csv('train.csv')
results = []
for i, group_df in df.groupby('Pclass'):
    sorted_group_df = group_df.sort_values('Fare')
    sorted_group_df['RankInClass'] = np.arange(len(sorted_group_df))
    results.append(sorted_group_df)

results_df = pd.concat(results)
results_df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked RankInClass
633 634 0 1 Parr, Mr. William Henry Marsh male NaN 0 0 112052 0.0 NaN S 0
822 823 0 1 Reuchlin, Jonkheer. John George male 38.0 0 0 19972 0.0 NaN S 1
815 816 0 1 Fry, Mr. Richard male NaN 0 0 112058 0.0 B102 S 2
806 807 0 1 Andrews, Mr. Thomas Jr male 39.0 0 0 112050 0.0 A36 S 3
263 264 0 1 Harrison, Mr. William male 40.0 0 0 112059 0.0 B94 S 4
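The same ranking can be written without an explicit loop. A minimal sketch (my addition), assuming Fare has no NaN (true for train.csv):

# GroupBy.rank gives 1-based ranks within each group; subtract 1 to match the loop above
df['RankInClass'] = df.groupby('Pclass')['Fare'].rank(method='first').astype(int) - 1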

Join table

There are two main types of table joins.

Join by using a specific column or index as a Key

Simply concatenating DataFrames horizontally or vertically

import pandas as pd
df1 = pd.DataFrame({'Key':['k0','k','k2'],
                   'A':['a0','a1','a2'],
                   'B':['b0','b1','b2']})
df2 = pd.DataFrame({'Key':['k0','k1','k2'],
                   'C':['c0','c2','c3'],
                   'D':['d0','d1','d2']})
df1
Key A B
0 k0 a0 b0
1 k a1 b1
2 k2 a2 b2
df2
Key C D
0 k0 c0 d0
1 k1 c2 d1
2 k2 c3 d2

Both have a column called 'Key'. Their values mostly match (note that df1 has 'k' where df2 has 'k1'), while the other columns differ.

We use this 'Key' column as the Key to join the two DataFrames horizontally. Use .merge() for merging.

df1.merge(df2)
Key A B C D
0 k0 a0 b0 c0 d0
1 k2 a2 b2 c3 d2

Simply concatenating DataFrames horizontally or vertically

Use pd.concat() (concat = concatenate).

# vertical concatenation (often used)
pd.concat([df1,df2], axis=0)  # the default is axis=0
/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.
A B C D Key
0 a0 b0 NaN NaN k0
1 a1 b1 NaN NaN k
2 a2 b2 NaN NaN k2
0 NaN NaN c0 d0 k0
1 NaN NaN c2 d1 k1
2 NaN NaN c3 d2 k2
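The vertical result keeps the original indexes, so 0, 1, 2 appear twice. If you want a fresh 0..n-1 index, pass ignore_index=True; sort=False also silences the FutureWarning above (my addition):

pd.concat([df1, df2], axis=0, ignore_index=True, sort=False)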
# horizontal concatenation
pd.concat([df1,df2], axis=1)
Key A B Key C D
0 k0 a0 b0 k0 c0 d0
1 k a1 b1 k1 c2 d1
2 k2 a2 b2 k2 c3 d2

16th

How to use .merge()

how: How to combine → {‘left’, ‘right’, ‘outer’, ‘inner’}, default is ‘inner’

on: the column to use as the Key (a column that exists in both DataFrames). If not specified, the join uses the common column(s).

left_on: Column to be the key of the left DataFrame

right_on: Column to be the key of right DataFrame

left_index: specify True to use the left DataFrame's index as the Key

right_index: specify True to use the right DataFrame's index as the Key

how

df1
Key A B
0 k0 a0 b0
1 k a1 b1
2 k2 a2 b2
df2
Key C D
0 k0 c0 d0
1 k1 c2 d1
2 k2 c3 d2
df1.merge(df2, how='left')
Key A B C D
0 k0 a0 b0 c0 d0
1 k a1 b1 NaN NaN
2 k2 a2 b2 c3 d2
df1.merge(df2, how='outer')
Key A B C D
0 k0 a0 b0 c0 d0
1 k a1 b1 NaN NaN
2 k2 a2 b2 c3 d2
3 k1 NaN NaN c2 d1
df1.merge(df2, how='inner')
Key A B C D
0 k0 a0 b0 c0 d0
1 k2 a2 b2 c3 d2
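how='right' is listed above but not demonstrated; for completeness (my addition):

df1.merge(df2, how='right')
# keeps every row of df2: k0 and k2 match df1, while k1 gets NaN for A and B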

on

The argument on specifies which column should be used as the Key when joining. You can only specify columns that are in both the left table and the right table.

If there is a common column, it is used as the Key without specifying anything, but make it a habit to specify on explicitly. It's easier to understand and safer: "it joined on an unexpected column" is a common complaint. If there are multiple common columns you can't tell which will be used, and even with a single common column it's better to be explicit. (The examples so far intentionally omitted the on argument because they came before this explanation.)

To use the index as the Key, set the right_index and left_index arguments described later to True. If the key column names differ between the two tables (DataFrames), use left_on and right_on, also described later.

df1 = pd.DataFrame({'Key':['k0','k1','k2'],
                    'ID':['aa','bb','cc'],
                   'A':['a0','a1','a2'],
                   'B':['b0','b1','b2']})
df2 = pd.DataFrame({'Key':['k0','k1','k3'],
                    'ID':['aa','bb','cc'],
                   'C':['c0','c1','c3'],
                   'D':['d0','d1','d3']})
df1.merge(df2, on='Key')
Key ID_x A B ID_y C D
0 k0 aa a0 b0 aa c0 d0
1 k1 bb a1 b1 bb c1 d1
df1.merge(df2, on='ID')
Key_x ID A B Key_y C D
0 k0 aa a0 b0 k0 c0 d0
1 k1 bb a1 b1 k1 c1 d1
2 k2 cc a2 b2 k3 c3 d3

Changing the suffixes

df1.merge(df2, on='ID', suffixes=('_left', '_right'))
Key_left ID A B Key_right C D
0 k0 aa a0 b0 k0 c0 d0
1 k1 bb a1 b1 k1 c1 d1
2 k2 cc a2 b2 k3 c3 d3

left_on, right_on

Specify this argument when the column name you want to use as Key is different between left and right.

df1 = pd.DataFrame({'Key1':['k0','k1','k2'],
                   'A':['a0','a1','a2'],
                   'B':['b0','b1','b2']})
df2 = pd.DataFrame({'Key2':['k0','k1','k3'],
                   'C':['c0','c1','c3'],
                   'D':['d0','d1','d3']})
df1.merge(df2, left_on='Key1', right_on='Key2')
Key1 A B Key2 C D
0 k0 a0 b0 k0 c0 d0
1 k1 a1 b1 k1 c1 d1

left_index, right_index

If you want to specify Index as Key instead of column, specify True for left_index and right_index.

df1.merge(df2, left_index=True, right_index=True)
Key1 A B Key2 C D
0 k0 a0 b0 k0 c0 d0
1 k1 a1 b1 k1 c1 d1
2 k2 a2 b2 k3 c3 d3

join

If you use the .join() function, it joins on the index, but you hardly need to remember it because you can do almost the same thing with .merge().

df1 = pd.DataFrame({'Key1':['k0','k1','k2'],
                   'A':['a0','a1','a2'],
                   'B':['b0','b1','b2']})
df2 = pd.DataFrame({'Key2':['k0','k1','k3'],
                   'C':['c0','c1','c3'],
                   'D':['d0','d1','d3']})
df1.join(df2)
Key1 A B Key2 C D
0 k0 a0 b0 k0 c0 d0
1 k1 a1 b1 k1 c1 d1
2 k2 a2 b2 k3 c3 d3
df1.merge(df2, left_index=True, right_index=True)
Key1 A B Key2 C D
0 k0 a0 b0 k0 c0 d0
1 k1 a1 b1 k1 c1 d1
2 k2 a2 b2 k3 c3 d3
df1 = pd.DataFrame({'Key1':['k0','k1','k2'],
                   'A':['a0','a1','a2'],
                   'B':['b0','b1','b2']})
df2 = pd.DataFrame({'Key2':['k0','k1','k3'],
                   'C':['c0','c1','c3'],
                   'D':['d0','d1','d3']})
df3 = pd.DataFrame({'Key3':['k0','k1','k4'],
                   'E':['c0','c1','c3'],
                   'F':['d0','d1','d3']})
df1.join([df2, df3])
Key1 A B Key2 C D Key3 E F
0 k0 a0 b0 k0 c0 d0 k0 c0 d0
1 k1 a1 b1 k1 c1 d1 k1 c1 d1
2 k2 a2 b2 k3 c3 d3 k4 c3 d3

17th

.unique() .nunique()

import pandas as pd
df = pd.read_csv('train.csv')
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
df['Pclass'].unique()
array([3, 1, 2])
df['Pclass'].nunique()
3

.value_counts()

df['Pclass'].value_counts()
3    491
1    216
2    184
Name: Pclass, dtype: int64
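A variant I find useful (my addition): pass normalize=True to get proportions instead of counts.

df['Pclass'].value_counts(normalize=True)
# 3    0.551066
# 1    0.242424
# 2    0.206510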

(Super important) .apply()

You can use the .apply() function to process every record in a DataFrame and store the result in a separate column. Think of it as applying a function to each row (or to each value of a Series).

def get_age_group(age):
    return str(age)[0] + '0s'

get_age_group(45)
'40s'
df = pd.DataFrame({'name':['John','Mike','Emily'],
                  'age':['23','36','42']})
df
name age
0 John 23
1 Mike 36
2 Emily 42
df['age'].apply(get_age_group)
0    20s
1    30s
2    40s
Name: age, dtype: object

How to use .apply() with a lambda function

# assign the lambda function to the variable f
f = lambda x: str(x)[0] + '0s'
# try it with 43
f(43)
'40s'
df['age_group'] = df['age'].apply(lambda x: str(x)[0] + '0s')
df
name age age_group
0 John 23 20s
1 Mike 36 30s
2 Emily 42 40s

How to use .apply() on whole records (rows)

df = pd.DataFrame({'name':['John','Mike','Emily'],
                  'age':['23','36','42']})
df['description'] = df.apply(lambda row:'{} is {} years old'.format(row['name'], row['age']), axis=1)
df
name age description
0 John 23 John is 23 years old
1 Mike 36 Mike is 36 years old
2 Emily 42 Emily is 42 years old
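For simple string formatting like this, a vectorized alternative (my addition) avoids the per-row Python call; it works here because age is stored as strings:

df['description'] = df['name'] + ' is ' + df['age'] + ' years old'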

18th

Save a DataFrame in csv format with .to_csv()

df = pd.read_csv('train.csv')
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
df['Adult'] = df['Age'].apply(lambda x: x>20)
df.tail()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Adult
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.00 NaN S True
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.00 B42 S False
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.45 NaN S False
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C True
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q True
df.to_csv('train_w_adult.csv')
df = pd.read_csv('train_w_adult.csv')
df.head(3)
Unnamed: 0 PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Adult
0 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S True
1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C True
2 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S True

Looking at the result of df.head(3), the index from the previous save has been stored in a mysterious column, 'Unnamed: 0'.

If you specify index=False in .to_csv(), the index is not saved. Basically, always specify index=False when saving to csv format.

Please note that if the same file already exists in the save destination, it will be overwritten.

df = pd.read_csv('train.csv')
df['Adult'] = df['Age'].apply(lambda x: x>20)
df.to_csv('train_w_adult.csv', index=False)
df = pd.read_csv('train_w_adult.csv')
df.head(3)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Adult
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S True
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C True
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S True
df = pd.DataFrame({'A':[['a', 'b'], 2, 3], 'B':[['c', 'd'], 5, 6]})
df
A B
0 [a, b] [c, d]
1 2 5
2 3 6
# confirm that the stored value is a list
type(df['A'].iloc[0])
list
#Save as csv
df.to_csv('temp.csv', index=False)
#Load the saved csv
df = pd.read_csv('temp.csv')
df
A B
0 ['a', 'b'] ['c', 'd']
1 2 5
2 3 6
type(df['A'].iloc[0])
str
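If you need the lists back after such a round trip, one common workaround (my addition, assuming every cell is a valid Python literal) is ast.literal_eval:

import ast
df['A'] = df['A'].apply(ast.literal_eval)
type(df['A'].iloc[0])  # list again (and '2' becomes the int 2)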

Iterate over a DataFrame with .iterrows()

It is used to iterate over a DataFrame with a for statement. The name is hard to remember, but since it "iterates" over "rows", think iter + rows: the s is there because there are many rows.

"Iteration" means repeating a process; in other words, a loop. A list, for example, can be iterated with for i in list: (see Part 4).

You cannot iterate over a DataFrame's rows directly with for i in df: the way you can with a list. Instead, write the following using the .iterrows() function.

df = pd.read_csv('train.csv')
for idx, row in df.iterrows():
    if row['Age'] > 40 and row['Pclass'] == 3 and row['Sex'] == 'male' and row['Survived'] == 1:
        print('{} is very lucky guy...!'.format(row['Name']))
Dahl, Mr. Karl Edwart is very lucky guy...!
Sundman, Mr. Johan Julian is very lucky guy...!
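If performance matters, .itertuples() is a faster alternative that yields namedtuples (my addition; same result here):

for row in df.itertuples():
    if row.Age > 40 and row.Pclass == 3 and row.Sex == 'male' and row.Survived == 1:
        print('{} is very lucky guy...!'.format(row.Name))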

.apply() is typically used to store the result of processing each record in another column, whereas .iterrows() is handy when you want a side effect for each row rather than a return value.

For example, when file paths are stored in a DataFrame, you can use .iterrows() to move or read each file.

Sort by a specific column with .sort_values()

# sort by Age (sort_values returns a new DataFrame; df itself is unchanged,
# which is why df.head() below still shows the original order)
df.sort_values('Age')
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
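You can also sort in descending order or by multiple columns (my addition):

df.sort_values('Age', ascending=False).head()                        # oldest first
df.sort_values(['Pclass', 'Fare'], ascending=[True, False]).head()  # two sort keys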

Create a pivot table with .pivot_table()

data = {'Data':['Jan-1','Jan-1','Jan-1','Jan-2','Jan-2','Jan-2'],
       'User':['Emily', 'John','Nick','Kevin','Emily','John'],
       'Method':['Card','Card','Cash','Card','Cash','Cash'],
       'Price':[100,250,200,460,200,130]}
df = pd.DataFrame(data)
df
Data User Method Price
0 Jan-1 Emily Card 100
1 Jan-1 John Card 250
2 Jan-1 Nick Cash 200
3 Jan-2 Kevin Card 460
4 Jan-2 Emily Cash 200
5 Jan-2 John Cash 130

Pass the column you want to aggregate to values; this time it is Price. Each cell of the pivot table holds the aggregated value of that column (the mean by default).

For index and columns, simply pass the columns you want as labels, in a list.

df.pivot_table(values='Price', index=['Data', 'User'], columns=['Method'])
Method Card Cash
Data User
Jan-1 Emily 100.0 NaN
John 250.0 NaN
Nick NaN 200.0
Jan-2 Emily NaN 200.0
John NaN 130.0
Kevin 460.0 NaN
df.pivot_table(values='Price', index=['Data', 'Method'], columns=['User'])
User Emily John Kevin Nick
Data Method
Jan-1 Card 100.0 250.0 NaN NaN
Cash NaN NaN NaN 200.0
Jan-2 Card NaN NaN 460.0 NaN
Cash 200.0 130.0 NaN NaN

First, clarify which column you want to aggregate and put it in values; then put the labels you want along the rows and columns into index and columns.
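By default pivot_table aggregates with the mean; you can change this with the aggfunc argument (my addition):

# total Price per day and payment method instead of the mean
df.pivot_table(values='Price', index='Data', columns='Method', aggfunc='sum')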

Cross-section operation with .xs()

.xs() is an abbreviation of "cross section". I don't use it much either, but it comes in handy with DataFrames that have a MultiIndex, such as pivot tables. It's a good idea to remember it together with pivot_table.

What .xs() lets you do with the pivot table above is, for example, extract only the 'Card' rows. (A cross section, exactly.)

pivot = df.pivot_table(values='Price', index=['Data', 'Method'], columns=['User'])
pivot.xs('Card', level='Method')
