That's why I quit pandas [Data Science 100 Knock (Structured Data Processing) # 1]

We will solve the Python problem of Data Science 100 Knock (Structured Data Processing). This group of questions uses pandas for data processing in the model answer, but we will process it using NumPy's structured array after studying.

▶ Next article (# 2)

Introduction

Many people doing data science in Python are pandas lovers, but in fact **you can do the same things with NumPy alone**, and NumPy is usually faster. As a pandas lover myself who is still not used to working with NumPy, I'd like to try graduating from pandas by solving this "Data Science 100 Knock" with NumPy.

This time I will cover questions 1 through 8. Only receipt.csv is used. I read the initial data as follows (explicit data type specification is postponed for now).

import numpy as np
import pandas as pd

#For model answer
df_receipt = pd.read_csv('data/receipt.csv')

#Data we handle
arr_receipt = np.genfromtxt(
    'data/receipt.csv', delimiter=',', encoding='utf-8',
    names=True, dtype=None)

The structured array is also more memory-efficient.

import sys

sys.getsizeof(df_receipt)
# 26065721
sys.getsizeof(arr_receipt)
# 15074160
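Instead of letting np.genfromtxt infer the types with dtype=None, you can pin them down explicitly. A minimal runnable sketch, assuming the column names and types shown in the dtype of Out[001] below; here two sample rows stand in for the real file, where you would pass 'data/receipt.csv' instead of the StringIO:

```python
import io
import numpy as np

# Explicit dtype matching what genfromtxt inferred (see Out[001]).
receipt_dtype = [
    ('sales_ymd', '<i4'), ('sales_epoch', '<i4'), ('store_cd', '<U6'),
    ('receipt_no', '<i4'), ('receipt_sub_no', '<i4'),
    ('customer_id', '<U14'), ('product_cd', '<U10'),
    ('quantity', '<i4'), ('amount', '<i4'),
]

# Two sample rows from the output below; with the real file,
# pass 'data/receipt.csv' here instead.
sample = io.StringIO(
    "20181103,1257206400,S14006,112,1,CS006214000001,P070305012,1,158\n"
    "20181118,1258502400,S13008,1132,2,CS008415000097,P070701017,1,81\n"
)
arr_sample = np.genfromtxt(sample, delimiter=',', dtype=receipt_dtype)
print(arr_sample['amount'])
```

With an explicit dtype, genfromtxt also skips its type-guessing pass, which helps on large files.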

P_001

P-001: Display the first 10 items of all items from the data frame (df_receipt) of the receipt details, and visually check what kind of data you have.

Just take a slice.

In[001]


arr_receipt[:10]

You can get the data as follows. The repr is even smart enough to align the values neatly (unless a full-width character string is involved).

Out[001]


array([(20181103, 1257206400, 'S14006',  112, 1, 'CS006214000001', 'P070305012', 1, 158),
       (20181118, 1258502400, 'S13008', 1132, 2, 'CS008415000097', 'P070701017', 1,  81),
       (20170712, 1215820800, 'S14028', 1102, 1, 'CS028414000014', 'P060101005', 1, 170),
       (20190205, 1265328000, 'S14042', 1132, 1, 'ZZ000000000000', 'P050301001', 1,  25),
       (20180821, 1250812800, 'S14025', 1102, 2, 'CS025415000050', 'P060102007', 1,  90),
       (20190605, 1275696000, 'S13003', 1112, 1, 'CS003515000195', 'P050102002', 1, 138),
       (20181205, 1259971200, 'S14024', 1102, 2, 'CS024514000042', 'P080101005', 1,  30),
       (20190922, 1285113600, 'S14040', 1102, 1, 'CS040415000178', 'P070501004', 1, 128),
       (20170504, 1209859200, 'S13020', 1112, 2, 'ZZ000000000000', 'P071302010', 1, 770),
       (20191010, 1286668800, 'S14027', 1102, 1, 'CS027514000015', 'P071101003', 1, 680)],
      dtype=[('sales_ymd', '<i4'), ('sales_epoch', '<i4'), ('store_cd', '<U6'), ('receipt_no', '<i4'), ('receipt_sub_no', '<i4'), ('customer_id', '<U14'), ('product_cd', '<U10'), ('quantity', '<i4'), ('amount', '<i4')])

Let's compare the speed with the model answer (pandas).

Time[001]


#Model answer
%timeit df_receipt.head(10)
# 130 µs ± 5.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit arr_receipt[:10]
# 244 ns ± 8.23 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

It runs in about 1/500 of the time of the model answer.

P_002

P-002: Specify columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount) from the receipt statement data frame (df_receipt), and display 10 items.

NumPy structured arrays can be manipulated in the same way as the pd.DataFrame we are used to.

In[002]


arr_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']][:10]

Out[002]


array([(20181103, 'CS006214000001', 'P070305012', 158),
       (20181118, 'CS008415000097', 'P070701017',  81),
       (20170712, 'CS028414000014', 'P060101005', 170),
       (20190205, 'ZZ000000000000', 'P050301001',  25),
       (20180821, 'CS025415000050', 'P060102007',  90),
       (20190605, 'CS003515000195', 'P050102002', 138),
       (20181205, 'CS024514000042', 'P080101005',  30),
       (20190922, 'CS040415000178', 'P070501004', 128),
       (20170504, 'ZZ000000000000', 'P071302010', 770),
       (20191010, 'CS027514000015', 'P071101003', 680)],
      dtype={'names':['sales_ymd','customer_id','product_cd','amount'], 'formats':['<i4','<U14','<U10','<i4'], 'offsets':[0,40,96,140], 'itemsize':144})

It is also faster than manipulating the pd.DataFrame.

Time[002]


%timeit df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']].head(10)
# 5.19 ms ± 43.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit arr_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']][:10]
# 906 ns ± 17.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

P_003

P-003: Specify columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount) from the receipt statement data frame (df_receipt), and display 10 items. However, sales_ymd should be extracted while changing the item name to sales_date.

Renaming is simple but annoying. The easiest route is np.lib.recfunctions.rename_fields(), though this function is still quite awkward to use.

In[003]


import numpy.lib.recfunctions  # not imported by `import numpy as np` alone

np.lib.recfunctions.rename_fields(arr_receipt, {'sales_ymd': 'sales_date'})[
    ['sales_date', 'customer_id', 'product_cd', 'amount']][:10]

Out[003]


array([(20181103, 'CS006214000001', 'P070305012', 158),
       (20181118, 'CS008415000097', 'P070701017',  81),
       (20170712, 'CS028414000014', 'P060101005', 170),
       (20190205, 'ZZ000000000000', 'P050301001',  25),
       (20180821, 'CS025415000050', 'P060102007',  90),
       (20190605, 'CS003515000195', 'P050102002', 138),
       (20181205, 'CS024514000042', 'P080101005',  30),
       (20190922, 'CS040415000178', 'P070501004', 128),
       (20170504, 'ZZ000000000000', 'P071302010', 770),
       (20191010, 'CS027514000015', 'P071101003', 680)],
      dtype={'names':['sales_date','customer_id','product_cd','amount'], 'formats':['<i4','<U14','<U10','<i4'], 'offsets':[0,40,96,140], 'itemsize':144})

However, it processes very fast compared to pandas.

Time[003]


%timeit df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']].rename(columns={'sales_ymd': 'sales_date'}).head(10)
# 12.8 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit np.lib.recfunctions.rename_fields(arr_receipt, {'sales_ymd': 'sales_date'})[['sales_date', 'customer_id', 'product_cd', 'amount']][:10]
# 6.19 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

pandas is slow and heavy because it creates a new DataFrame object at every step.
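If rename_fields feels too awkward, another route is to view the same buffer under a dtype that carries the new field names. A sketch on a small stand-in array; this is only valid when the new dtype's layout matches the old one exactly:

```python
import numpy as np

arr = np.array([(20181103, 158)],
               dtype=[('sales_ymd', '<i4'), ('amount', '<i4')])

# View the same memory under a dtype with the new field names.
# No data is copied; the layout (formats, order) must match exactly.
renamed = arr.view([('sales_date', '<i4'), ('amount', '<i4')])

print(renamed['sales_date'])
```

Since it is a view, writes through `renamed` also change `arr`; copy first if that matters.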

P_004

P-004: From the receipt details data frame (df_receipt), specify columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount), and extract data that meets the following condition.

- Customer ID (customer_id) is "CS018205000001"

This can also be obtained intuitively.

In[004]


arr_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']][
    arr_receipt['customer_id'] == 'CS018205000001']

Out[004]


array([(20180911, 'CS018205000001', 'P071401012', 2200),
       (20180414, 'CS018205000001', 'P060104007',  600),
       (20170614, 'CS018205000001', 'P050206001',  990),
       (20170614, 'CS018205000001', 'P060702015',  108),
       (20190216, 'CS018205000001', 'P071005024',  102),
       (20180414, 'CS018205000001', 'P071101002',  278),
       (20190226, 'CS018205000001', 'P070902035',  168),
       (20190924, 'CS018205000001', 'P060805001',  495),
       (20190226, 'CS018205000001', 'P071401020', 2200),
       (20180911, 'CS018205000001', 'P071401005', 1100),
       (20190216, 'CS018205000001', 'P040101002',  218),
       (20190924, 'CS018205000001', 'P091503001',  280)],
      dtype={'names':['sales_ymd','customer_id','product_cd','amount'], 'formats':['<i4','<U14','<U10','<i4'], 'offsets':[0,40,96,140], 'itemsize':144})

The model answer seems to love pd.DataFrame.query(). I think plain boolean indexing is the normal way to do it.

Time[004]


%timeit df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']].query('customer_id == "CS018205000001"')
# 11.6 ms ± 477 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df_receipt.loc[df_receipt['customer_id'] == 'CS018205000001', ['sales_ymd', 'customer_id', 'product_cd', 'amount']]
# 9.49 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit arr_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']][arr_receipt['customer_id'] == 'CS018205000001']
# 2.7 ms ± 475 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

P_005

P-005: From the receipt details data frame (df_receipt), specify columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount), and extract data that meets the following conditions.

- Customer ID (customer_id) is "CS018205000001"
- Sales amount (amount) is 1,000 or more

The approach is the same even with multiple conditions.

In[005]


arr_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']][
    (arr_receipt['customer_id'] == 'CS018205000001') & (arr_receipt['amount'] >= 1000)]

Out[005]


array([(20180911, 'CS018205000001', 'P071401012', 2200),
       (20190226, 'CS018205000001', 'P071401020', 2200),
       (20180911, 'CS018205000001', 'P071401005', 1100)],
      dtype={'names':['sales_ymd','customer_id','product_cd','amount'], 'formats':['<i4','<U14','<U10','<i4'], 'offsets':[0,40,96,140], 'itemsize':144})
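Note that, just as in pandas, `&` and `|` bind more tightly than the comparison operators, so each condition needs its own parentheses. A tiny sketch with made-up values illustrating the pattern used above:

```python
import numpy as np

amount = np.array([158, 1100, 2200, 81])
customer = np.array(['A', 'B', 'B', 'A'])

# Each comparison must be parenthesized: without the parentheses,
# `customer == 'B' & amount >= 1000` would apply & first and fail.
mask = (customer == 'B') & (amount >= 1000)
print(amount[mask])
```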

P_006

P-006: From the receipt details data frame (df_receipt), specify columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), sales quantity (quantity), and sales amount (amount), and extract data that meets the following conditions.

- Customer ID (customer_id) is "CS018205000001"
- Sales amount (amount) is 1,000 or more, or sales quantity (quantity) is 5 or more

The line is getting long, so I will split it up.

In[006]


col_list = ['sales_ymd', 'customer_id', 'product_cd', 'quantity', 'amount']
cond = ((arr_receipt['customer_id'] == 'CS018205000001')
        & ((arr_receipt['amount'] >= 1000) | (arr_receipt['quantity'] >= 5)))

arr_receipt[col_list][cond]

Out[006]


array([(20180911, 'CS018205000001', 'P071401012', 1, 2200),
       (20180414, 'CS018205000001', 'P060104007', 6,  600),
       (20170614, 'CS018205000001', 'P050206001', 5,  990),
       (20190226, 'CS018205000001', 'P071401020', 1, 2200),
       (20180911, 'CS018205000001', 'P071401005', 1, 1100)],
      dtype={'names':['sales_ymd','customer_id','product_cd','quantity','amount'], 'formats':['<i4','<U14','<U10','<i4','<i4'], 'offsets':[0,40,96,136,140], 'itemsize':144})

P_007

P-007: From the receipt details data frame (df_receipt), specify columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount), and extract data that meets the following conditions.

- Customer ID (customer_id) is "CS018205000001"
- Sales amount (amount) is 1,000 or more and 2,000 or less

In[007]


col_list = ['sales_ymd', 'customer_id', 'product_cd', 'quantity', 'amount']
cond = ((arr_receipt['customer_id'] == 'CS018205000001')
        & ((1000 <= arr_receipt['amount']) & (arr_receipt['amount'] <= 2000)))

arr_receipt[col_list][cond]

Out[007]


array([(20180911, 'CS018205000001', 'P071401005', 1, 1100)],
      dtype={'names':['sales_ymd','customer_id','product_cd','quantity','amount'], 'formats':['<i4','<U14','<U10','<i4','<i4'], 'offsets':[0,40,96,136,140], 'itemsize':144})

P_008

P-008: From the receipt details data frame (df_receipt), specify columns in the order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd), and sales amount (amount), and extract data that meets the following conditions.

- Customer ID (customer_id) is "CS018205000001"
- Product code (product_cd) is other than "P071401019"

In[008]


col_list = ['sales_ymd', 'customer_id', 'product_cd', 'quantity', 'amount']
cond = ((arr_receipt['customer_id'] == 'CS018205000001')
        & (arr_receipt['product_cd'] != 'P071401019'))

arr_receipt[col_list][cond]

Out[008]


array([(20180911, 'CS018205000001', 'P071401012', 1, 2200),
       (20180414, 'CS018205000001', 'P060104007', 6,  600),
       (20170614, 'CS018205000001', 'P050206001', 5,  990),
       (20170614, 'CS018205000001', 'P060702015', 1,  108),
       (20190216, 'CS018205000001', 'P071005024', 1,  102),
       (20180414, 'CS018205000001', 'P071101002', 1,  278),
       (20190226, 'CS018205000001', 'P070902035', 1,  168),
       (20190924, 'CS018205000001', 'P060805001', 1,  495),
       (20190226, 'CS018205000001', 'P071401020', 1, 2200),
       (20180911, 'CS018205000001', 'P071401005', 1, 1100),
       (20190216, 'CS018205000001', 'P040101002', 1,  218),
       (20190924, 'CS018205000001', 'P091503001', 1,  280)],
      dtype={'names':['sales_ymd','customer_id','product_cd','quantity','amount'], 'formats':['<i4','<U14','<U10','<i4','<i4'], 'offsets':[0,40,96,136,140], 'itemsize':144})

Time[008]


col_list = ['sales_ymd', 'customer_id', 'product_cd', 'quantity', 'amount']

%timeit df_receipt[col_list].query('customer_id == "CS018205000001" & product_cd != "P071401019"')
# 15.6 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit arr_receipt[col_list][((arr_receipt['customer_id'] == 'CS018205000001') & (arr_receipt['product_cd'] != 'P071401019'))]
# 4.28 ms ± 86.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In conclusion

There you have it: high-speed data processing with NumPy. You can write the result out to CSV after processing, without making pandas do any of the heavy lifting. That's why I quit pandas.
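The CSV export mentioned above can be done with np.savetxt, which accepts a structured array if you give one format per field. A small sketch with a stand-in array and an in-memory buffer; with real data you would pass a file path instead:

```python
import io
import numpy as np

arr = np.array([(20181103, 'CS006214000001', 158),
                (20190226, 'ZZ000000000000', 680)],
               dtype=[('sales_ymd', '<i4'), ('customer_id', '<U14'),
                      ('amount', '<i4')])

# One format per field; savetxt joins them with the delimiter.
# comments='' stops the header line from being prefixed with '# '.
buf = io.StringIO()
np.savetxt(buf, arr, fmt=['%d', '%s', '%d'], delimiter=',',
           header=','.join(arr.dtype.names), comments='')
print(buf.getvalue())
```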
