[PYTHON] Processing summary 2 often done in Pandas (data reference, editing operation)

Continuing from Last time, this time I would like to summarize the processes that are often used when referencing and editing data.

First, run the code below to create a sample dataframe.

import pandas as pd

df = pd.DataFrame({
    'name' : ['userA', 'userB', 'userC', 'userD', 'userE'],
    'age' : [20, 25, 41, 33, 22],
    'sex' : ['male', 'female', 'female', 'male', 'male'],
    'price' : 10000
})
name age sex price
0 userA 20 male 10000
1 userB 25 female 10000
2 userC 41 female 10000
3 userD 33 male 10000
4 userE 22 male 10000

To get a specific column

If you specify by column name (column name), the data of the corresponding column will be You can get it in Series format.

df['name']
name
0 userA
1 userB
2 userC
3 userD
4 userE

When retrieving multiple columns

If you specify in the list of column names (column names), the data of the corresponding column (plural) will be displayed. It can be obtained in the form of DataFrame.

target_columns_list = ['name', 'age']
df[target_columns_list]
name age
0 userA 20
1 userB 25
2 userC 41
3 userD 33
4 userE 22

When adding a new column

If you specify a new column name and perform an assign operation, the column will be added.

df['new'] = df['age'] * 10
name age sex price new
0 userA 20 male 10000 200
1 userB 25 female 10000 250
2 userC 41 female 10000 410
3 userD 33 male 10000 330
4 userE 22 male 10000 220

When retrieving only specific row data

If you want to get only specific row data, specify the index in the loc method. You can get the data of the specified row in the Series format.

df.loc[0]
name age sex price
0 userA 20 male 10000

When retrieving multiple lines of data

If you want to get multiple rows of data, specify a list of indexes in the loc method. You can get all the data in the specified row in the DataFrame format.

target_index_list = [0, 1, 2]
df.loc[target_index_list]
name age sex price
0 userA 20 male 10000
1 userB 25 female 10000
2 userC 41 female 10000

When retrieving data by specifying rows and columns

It's not a very frequent operation, but if you specify the index and column name in the loc method, You can get only the specified element.


df.loc[1, 'sex']
# female

#If you assign to the element obtained above, it will be updated.
df.loc[1, 'sex'] = 'updated'
name age sex price
0 userA 20 male 10000
1 userB 25 updated 10000
2 userC 41 female 10000
3 userD 33 male 10000
4 userE 22 male 10000

When extracting rows by specifying conditions

#For example, when extracting rows with age 25 or more
df[(df['age'] >= 25)]

#At the beginning~If you add, it becomes NOT
df[~(df['age'] >= 25)]

#When extracting rows with AND
df[(df['age'] >= 25) & (df['sex'] == 'female')]

#When extracting rows with OR
df[(df['age'] < 25) | (df['age'] > 40)]

The following is an example of extracting rows with AND.

name age sex price
1 userB 25 female 10000
2 userC 41 female 10000

When calculating statistics with groupby

#This is convenient when you want to add the result of calculating statistics for each category to the original data frame.
#The following calculates the average age for each gender
#Other statistics can be calculated by changing the argument specified for transform.
df['average_age_by_sex'] = df.groupby('sex')['age'].transform('mean')

#If you just want to see the average on jupyter notebook etc., the following is OK
# df.groupby('sex')['age'].mean()
name age sex price new average_age_by_sex
0 userA 20 male 10000 200 25
1 userB 25 female 10000 250 33
2 userC 41 female 10000 410 33
3 userD 33 male 10000 330 25
4 userE 22 male 10000 220 25

When implementing the simplest loop processing in Pandas

With iterrows, the index and the data in that row are in Series format, row by row. You can get it at. However, if the number of lines in the data frame increases, the processing will slow down, so I think that the data frame is about 100 lines, and it is limited to use when you want to skip a little.

for index, row in df.iterrows():
    print(index)
    print(type(row))

When creating a new column by referring to a specific column

In the case like the title, and when you want to use the if statement, you will want to process line by line using iterrows, but in most cases it is better to stop from the viewpoint of processing speed. is. In that case, you can speed up the process by using the apply method as shown below.

df['age_boundary'] = df['age'].apply(lambda x: '25 years and over' if x >=25 else 'Under 25 years old')

#Even if you define a function and pass it to the apply method, the result is the same
# def get_age_boundary(age):
#     if age >= 25:
#         return '25 years and over'
#     else:
#         return 'Under 25 years old'

# df['age_boundary'] = df['age'].apply(get_age_boundary)

name age sex price age_boundary
0 userA 20 male 10000 Under 25 years old
1 userB 25 female 10000 25 years and over
2 userC 41 female 10000 25 years and over
3 userD 33 male 10000 25 years and over
4 userE 22 male 10000 Under 25 years old

When creating a new column by referring to multiple columns

In some cases, you may need to refer to multiple columns to create a new column. In that case, you can refer to the value of each column for each row by specifying the argument axis = 1 of the apply method.

def get_price_with_discount_rate(row):

    age = row['age']
    price = row['price']
    
    discount_rate = 1.0
    
    if age >= 40:
        discount_rate = 0.5
            
    return int(price * discount_rate)

df['price_with_discount_rate'] = df.apply(get_price_with_discount_rate, axis=1)

name age sex price price_with_discount_rate
0 userA 20 male 10000 10000
1 userB 25 female 10000 10000
2 userC 41 female 10000 5000
3 userD 33 male 10000 10000
4 userE 22 male 10000 10000

When using np.vectorize

In the above case of creating a new column by referring to multiple columns, we took the method of specifying the argument axis = 1 of the apply method, but this method has a problem in terms of processing speed. As an alternative to the above, I haven't used it in practice, but you might want to consider using Numpy's vectorize method as well.

import numpy as np

def get_price_with_discount_rate(age, price):
        
    discount_rate = 1.0
    
    if age >= 40:
        discount_rate = 0.5
            
    return int(price * discount_rate)

vectorized_func=np.vectorize(get_price_with_discount_rate)
df['price_with_discount_rate'] = vectorized_func(df['age'], df['price'])

name age sex price price_with_discount_rate
0 userA 20 male 10000 10000
1 userB 25 female 10000 10000
2 userC 41 female 10000 5000
3 userD 33 male 10000 10000
4 userE 22 male 10000 10000

Mutual conversion of Series, ndarray, list

How to convert between Pandas Series, NumPy ndarray, and Python standard list I often forget it, so I will summarize it here.


import numpy as np

# Series → ndarray
df['name'].values

# ndarray → list
df['name'].values.tolist()

# list → Series
pd.Series([1, 2, 3, 4, 5])

# list → ndarray
np.array([1, 2, 3, 4, 5])


In writing this article, I referred to the following sites. http://sinhrks.hatenablog.com/entry/2015/07/11/223124 https://qiita.com/3x8tacorice/items/3cc5399e18a7e3f9db86 https://note.nkmk.me/python-pandas-numpy-conversion/

Recommended Posts

Processing summary 2 often done in Pandas (data reference, editing operation)
Grammar summary often used in pandas
Summary of methods often used in pandas
Processing memos often used in pandas (beginners)
Data processing tips with Pandas
Check the data summary in CASTable
Summary of processes often performed in Pandas 1 (CSV, Excel file related operations)
Summary of probability distributions that often appear in statistics and data analysis
RDS data via stepping stones in Pandas
Working with 3D data structures in pandas
Example of efficient data processing with PANDAS