[PYTHON] [Pandas speedup] Flag odd or even rows

Introduction

While processing some data, I needed to flag the odd or even rows. I tried several approaches to speed this up, so I am noting the results here.

Prerequisites

As a prerequisite, assume a data frame with 10,000 rows, as shown below.

import numpy as np
import pandas as pd

df = pd.DataFrame({'hoge': np.zeros(10000)})
df
hoge
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
... ...
9995 0.0
9996 0.0
9997 0.0
9998 0.0
9999 0.0

Add a column called 'target_record' to this data frame that flags the odd or even rows, as follows.

df['target_record'] = [1,0,1,0,1,...,0,1,0,1,0]  # schematic: "..." stands for the omitted values
df
hoge target_record
0 0.0 1
1 0.0 0
2 0.0 1
3 0.0 0
4 0.0 1
... ... ...
9995 0.0 0
9996 0.0 1
9997 0.0 0
9998 0.0 1
9999 0.0 0

Measure the time needed to create this target_record column. Each processing time below is the average of 10,000 measurements.
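
The article does not show the measurement code, so the following is only a minimal sketch of how such an average could be taken, assuming the standard timeit module. The helper name average_seconds and the reduced run count are assumptions made for illustration, not the author's actual setup.

```python
import timeit

import numpy as np
import pandas as pd

N_ROWS = 10_000  # same size as the data frame in this article
N_RUNS = 100     # assumption: far fewer than the article's 10,000 runs, to keep the sketch quick

df = pd.DataFrame({'hoge': np.zeros(N_ROWS)})


def average_seconds(func, n_runs=N_RUNS):
    """Return the mean wall-clock time of a single call to func (hypothetical helper)."""
    return timeit.timeit(func, number=n_runs) / n_runs


def flag_with_loc():
    # Recreate the column on every call so that each run does the same work.
    df['target_record'] = 0
    df.loc[0::2, 'target_record'] = 1


avg = average_seconds(flag_with_loc)
print(f'Average processing time: {avg} sec')
```

Absolute numbers depend on the machine and on the pandas and numpy versions, so only the relative differences between approaches are meaningful.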

Processing speed comparison

loc + slice

First, what is probably the simplest approach: add a 'target_record' column with 0 assigned to every record, then assign 1 to the target rows with loc + slice.

df['target_record'] = 0
df.loc[0::2, 'target_record'] = 1  # for even rows: df.loc[1::2, 'target_record'] = 1

# Average processing time: 0.0009912237882614137 sec

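Incidentally, with iloc:

df['target_record'] = 0
df.iloc[0::2, 1] = 1  # column position 1 is 'target_record'

# Average processing time: 0.0009658613920211792 sec

This looks slightly faster than loc, although the difference is only about 2.5e-05 seconds.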

np.zeros + slice

loc is well known to be slow, so instead create an array with numpy and assign 1 to a slice of it.

target_record = np.zeros(10000, dtype=int)
target_record[0::2] = 1  # for even rows: target_record[1::2] = 1
df['target_record'] = target_record

# Average processing time: 0.00035130116939544677 sec

The processing time dropped to about one third of the loc version.

np.arange + remainder

Create an array from 0 to 9999 with np.arange(10000) and assign the remainder of dividing each value by 2.

target_record = np.arange(10000)
df['target_record'] = (target_record + 1) % 2  # for even rows: df['target_record'] = target_record % 2

# Average processing time: 0.00046031529903411863 sec

This one takes a bit more ingenuity, but np.zeros + slice still appears to be about 0.0001 seconds faster.
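
Since the comparison only makes sense if every approach produces the same flags, here is a small sketch that checks this for the odd-row case. The variable names df_loc, by_slice, and by_remainder are made up for this check.

```python
import numpy as np
import pandas as pd

n = 10_000

# loc + slice
df_loc = pd.DataFrame({'hoge': np.zeros(n)})
df_loc['target_record'] = 0
df_loc.loc[0::2, 'target_record'] = 1

# np.zeros + slice
by_slice = np.zeros(n, dtype=int)
by_slice[0::2] = 1

# np.arange + remainder
by_remainder = (np.arange(n) + 1) % 2

# All three should hold the same values: 1 at positions 0, 2, 4, ... and 0 elsewhere.
same_as_loc = np.array_equal(df_loc['target_record'].to_numpy(), by_slice)
same_as_remainder = np.array_equal(by_slice, by_remainder)
print(same_as_loc, same_as_remainder)
```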

Summary

For flagging odd or even rows, np.zeros + slice appears to be the fastest.

Incidentally, when I timed each step separately, the difference in processing speed came from whether the values were computed as a remainder or assigned through a slice. There was almost no difference in processing speed between np.zeros and np.arange (zeros was faster by about 1.0e-06 seconds).
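
A per-step breakdown like this could be measured roughly as follows. This is a sketch assuming timeit and a reduced run count, not the original measurement code, and the names t_zeros, t_arange, t_slice, and t_remainder are hypothetical.

```python
import timeit

import numpy as np

n = 10_000
runs = 1_000  # assumption: fewer runs than the article's 10,000

# Step 1: creating the base array.
t_zeros = timeit.timeit(lambda: np.zeros(n, dtype=int), number=runs) / runs
t_arange = timeit.timeit(lambda: np.arange(n), number=runs) / runs

# Step 2: producing the flags from an array that already exists.
zeros = np.zeros(n, dtype=int)
seq = np.arange(n)


def assign_slice():
    zeros[0::2] = 1  # slice assignment, modifies the array in place


def take_remainder():
    return (seq + 1) % 2  # remainder, allocates new arrays on every call


t_slice = timeit.timeit(assign_slice, number=runs) / runs
t_remainder = timeit.timeit(take_remainder, number=runs) / runs

print(f'np.zeros : {t_zeros} sec')
print(f'np.arange: {t_arange} sec')
print(f'slice    : {t_slice} sec')
print(f'remainder: {t_remainder} sec')
```

Timing the steps separately like this shows which part of each approach dominates, although the exact values will vary between environments.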
