That's why I quit pandas [Data Science 100 Knock (Structured Data Processing) # 5]

We will solve the Python problems of Data Science 100 Knock (Structured Data Processing). The model answers use pandas for the data processing, but after some study we will do the same processing with NumPy.

First article (# 1) | Previous article (# 4) | Next article (# 6)

Introduction

As a study of NumPy, I am solving the Python problems of Data Science 100 Knock (Structured Data Processing).

Many people doing data science in Python may be pandas lovers, but in fact **you can do the same things with NumPy, without pandas**, and NumPy is usually faster. As a pandas lover myself, I am still not used to operating NumPy, so I'd like to try to graduate from pandas by working through this Data Science 100 Knock with NumPy.

This time, I will tackle questions 36-44, whose theme seems to be joining multiple data frames. The initial data was loaded as follows (data type specification is deferred for now).

import numpy as np
import pandas as pd
from numpy.lib import recfunctions as rfn

# For the model answers
df_category = pd.read_csv('data/category.csv', dtype='string')
df_customer = pd.read_csv('data/customer.csv')
df_product = pd.read_csv(
    'data/product.csv',
    dtype={col: 'string' for col in
           ['category_major_cd', 'category_medium_cd', 'category_small_cd']})
df_receipt = pd.read_csv('data/receipt.csv')
df_store = pd.read_csv('data/store.csv')

# Data we handle
arr_category = np.genfromtxt(
    'data/category.csv', delimiter=',', encoding='utf-8-sig',
    names=True, dtype=tuple(['<U15']*6))
arr_customer = np.genfromtxt(
    'data/customer.csv', delimiter=',', encoding='utf-8',
    names=True, dtype=None)
arr_product = np.genfromtxt(
    'data/product.csv', delimiter=',', encoding='utf-8-sig',
    names=True, dtype=tuple(['<U10']*4+['<i4']*2))
arr_receipt = np.genfromtxt(
    'data/receipt.csv', delimiter=',', encoding='utf-8',
    names=True, dtype=None)
arr_store = np.genfromtxt(
    'data/store.csv', delimiter=',', encoding='utf-8',
    names=True, dtype=None)

Finally, here is a helper function for packing computed results into a structured array.


def make_array(size, **kwargs):
    arr = np.empty(size, dtype=[(colname, subarr.dtype)
                                for colname, subarr in kwargs.items()])
    for colname, subarr in kwargs.items():
        arr[colname] = subarr
    return arr
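As a quick sanity check, here is how make_array assembles named columns, on toy data (hypothetical values, not part of the 100-knock dataset):

make_array(3,
           key=np.array(['A', 'B', 'C']),
           value=np.array([1., 2., 3.]))
# array([('A', 1.), ('B', 2.), ('C', 3.)],
#       dtype=[('key', '<U1'), ('value', '<f8')])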

P_036

P-036: Inner-join the receipt detail data frame (df_receipt) and the store data frame (df_store), and display all columns of the receipt detail data frame together with the store name (store_name) from the store data frame, for 10 rows.

It's an inner join, but in substance it's a VLOOKUP. First, apply np.unique() to the concatenation of the key columns of both tables to digitize the keys. Then use np.searchsorted() to look up the store name for each store code in the receipt data.
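To see why this works, here is the same trick on toy keys (hypothetical values): return_inverse=True maps each key to its rank among the unique keys, and np.searchsorted() with a sorter then acts as the lookup.

# Minimal sketch of the digitize-and-lookup trick (hypothetical keys).
keys = np.array(['S2', 'S1', 'S3'])    # key column on the store side
names = np.array(['b', 'a', 'c'])      # value attached to each key
query = np.array(['S1', 'S3', 'S1'])   # key column on the receipt side

_, inv = np.unique(np.concatenate([keys, query]), return_inverse=True)
inv_map, inv_arr = inv[:keys.size], inv[keys.size:]
sorter = np.argsort(inv_map)
names[sorter[np.searchsorted(inv_map, inv_arr, sorter=sorter)]]
# array(['a', 'c', 'a'], dtype='<U1')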

In[036]


_, inv = np.unique(np.concatenate([arr_store['store_cd'],
                                   arr_receipt['store_cd']]),
                   return_inverse=True)
inv_map, inv_arr = inv[:arr_store.size], inv[arr_store.size:]
sorter_index = np.argsort(inv_map)
idx = np.searchsorted(inv_map, inv_arr, sorter=sorter_index)
store_name = arr_store['store_name'][sorter_index[idx]]

new_arr = make_array(arr_receipt.size, **{col: arr_receipt[col]
                                          for col in arr_receipt.dtype.names},
                     store_name=store_name)
new_arr[:10]

Out[036]


array([(20181103, 1257206400, 'S14006',  112, 1, 'CS006214000001', 'P070305012', 1, 158, 'Kuzugaya store'),
       (20181118, 1258502400, 'S13008', 1132, 2, 'CS008415000097', 'P070701017', 1,  81, 'Seijo store'),
       (20170712, 1215820800, 'S14028', 1102, 1, 'CS028414000014', 'P060101005', 1, 170, 'Futatsubashi store'),
       (20190205, 1265328000, 'S14042', 1132, 1, 'ZZ000000000000', 'P050301001', 1,  25, 'Shinyamashita store'),
       (20180821, 1250812800, 'S14025', 1102, 2, 'CS025415000050', 'P060102007', 1,  90, 'Yamato store'),
       (20190605, 1275696000, 'S13003', 1112, 1, 'CS003515000195', 'P050102002', 1, 138, 'Komae store'),
       (20181205, 1259971200, 'S14024', 1102, 2, 'CS024514000042', 'P080101005', 1,  30, 'Sanda store'),
       (20190922, 1285113600, 'S14040', 1102, 1, 'CS040415000178', 'P070501004', 1, 128, 'Nagatsuta store'),
       (20170504, 1209859200, 'S13020', 1112, 2, 'ZZ000000000000', 'P071302010', 1, 770, 'Jujonakahara store'),
       (20191010, 1286668800, 'S14027', 1102, 1, 'CS027514000015', 'P071101003', 1, 680, 'Minamifujisawa store')],
      dtype=[('sales_ymd', '<i4'), ('sales_epoch', '<i4'), ('store_cd', '<U6'), ('receipt_no', '<i4'), ('receipt_sub_no', '<i4'), ('customer_id', '<U14'), ('product_cd', '<U10'), ('quantity', '<i4'), ('amount', '<i4'), ('store_name', '<U6')])

With the model answer's pd.merge(), the rows end up sorted by the key column (because the key column is not unique). If that is a problem, you can do it in pandas like this:

store_name = df_receipt['store_cd'].map(
    df_store.set_index('store_cd')['store_name'])
df = df_receipt.assign(store_name=store_name)

This keeps the original row order.

P_037

P-037: Inner-join the product data frame (df_product) and the category data frame (df_category), and display all columns of the product data frame together with the subcategory name (category_small_name) from the category data frame, for 10 rows.

Same approach as the previous problem.

In[037]


_, inv = np.unique(np.concatenate([arr_category['category_small_cd'],
                                   arr_product['category_small_cd']]),
                   return_inverse=True)
inv_map, inv_arr = inv[:arr_category.size], inv[arr_category.size:]
sorter_index = np.argsort(inv_map)
idx = np.searchsorted(inv_map, inv_arr, sorter=sorter_index)
category_small_name = arr_category['category_small_name'][sorter_index[idx]]

new_arr = make_array(arr_product.size, **{col: arr_product[col]
                                          for col in arr_product.dtype.names},
                     category_small_name=category_small_name)
new_arr[:10]

Out[037]


array([('P040101001', '04', '0401', '040101', 198, 149, 'Lunch boxes'),
       ('P040101002', '04', '0401', '040101', 218, 164, 'Lunch boxes'),
       ('P040101003', '04', '0401', '040101', 230, 173, 'Lunch boxes'),
       ('P040101004', '04', '0401', '040101', 248, 186, 'Lunch boxes'),
       ('P040101005', '04', '0401', '040101', 268, 201, 'Lunch boxes'),
       ('P040101006', '04', '0401', '040101', 298, 224, 'Lunch boxes'),
       ('P040101007', '04', '0401', '040101', 338, 254, 'Lunch boxes'),
       ('P040101008', '04', '0401', '040101', 420, 315, 'Lunch boxes'),
       ('P040101009', '04', '0401', '040101', 498, 374, 'Lunch boxes'),
       ('P040101010', '04', '0401', '040101', 580, 435, 'Lunch boxes')],
      dtype=[('product_cd', '<U10'), ('category_major_cd', '<U10'), ('category_medium_cd', '<U10'), ('category_small_cd', '<U10'), ('unit_price', '<i4'), ('unit_cost', '<i4'), ('category_small_name', '<U15')])

P_038

P-038: From the customer data frame (df_customer) and the receipt detail data frame (df_receipt), find the total sales amount per customer. However, customers with no shopping record should be displayed with a sales amount of 0. Target customers whose gender code (gender_cd) is female (1), and exclude non-members (customer IDs starting with 'Z'). Only 10 results need to be displayed.

As before, concatenate the key columns of both tables and digitize them. Next, use np.bincount() to compute the total per customer; passing unq.size as the third argument, minlength, fixes the size of the output array. Finally, pick up each customer's total by indexing with the digitized keys on the customer side.
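For reference, a minimal sketch of the weighted np.bincount() call on toy data (hypothetical values): each weight is added to the bin given by the corresponding group code, and minlength pads the result with zeros.

group = np.array([0, 1, 0, 1, 1])          # digitized customer per row
amount = np.array([10., 20., 5., 1., 4.])  # amount per row
np.bincount(group, weights=amount, minlength=4)
# array([15., 25.,  0.,  0.])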

In[038]


is_member_receipt = arr_receipt['customer_id'].astype('<U1') != 'Z'
is_member_customer = ((arr_customer['customer_id'].astype('<U1') != 'Z')
                      & (arr_customer['gender_cd'] == 1))
customer = arr_customer['customer_id'][is_member_customer]

unq, inv = np.unique(
    np.concatenate([customer, arr_receipt['customer_id'][is_member_receipt]]),
    return_inverse=True)
customer_size = customer.size

amount_sum = np.bincount(
    inv[customer_size:], arr_receipt['amount'][is_member_receipt], unq.size)

new_arr = make_array(customer_size,
                     customer_id=customer,
                     amount=amount_sum[inv[:customer_size]])
new_arr[:10]

Out[038]


array([('CS021313000114',    0.), ('CS031415000172', 5088.),
       ('CS028811000001',    0.), ('CS001215000145',  875.),
       ('CS015414000103', 3122.), ('CS033513000180',  868.),
       ('CS035614000014',    0.), ('CS011215000048', 3444.),
       ('CS009413000079',    0.), ('CS040412000191',  210.)],
      dtype=[('customer_id', '<U14'), ('amount', '<f8')])

P_039

P-039: From the receipt detail data frame (df_receipt), extract the top 20 customers by number of distinct sales days and the top 20 customers by total sales amount, and full-outer-join the two. However, exclude non-members (customer IDs starting with 'Z').

Use np.partition() to get the 20th-place value, replace the values outside the top 20 with np.nan, and then take the union of the two masks.
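A minimal sketch of the np.partition() step on toy data (hypothetical values): negating the array turns the k-th smallest into the k-th largest, and comparing against that threshold yields the top mask.

a = np.array([5., 1., 9., 3., 7.])
threshold = -np.partition(-a, 2)[2]  # 3rd-largest value (0-based index 2)
a >= threshold
# array([ True, False,  True, False,  True])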

In[039]


is_member = arr_receipt['customer_id'].astype('<U1') != 'Z'
unq, inv = np.unique(arr_receipt['customer_id'][is_member],
                     return_inverse=True)

sums = np.bincount(inv, arr_receipt['amount'][is_member], unq.size)
is_sum_top = sums >= -np.partition(-sums, 20)[20]
sums[~is_sum_top] = np.nan

unq2 = np.unique([inv, arr_receipt['sales_ymd'][is_member]], axis=-1)
counts = np.bincount(unq2[0]).astype(float)
is_cnt_top = counts >= -np.partition(-counts, 20)[20]
counts[~is_cnt_top] = np.nan

is_top = is_cnt_top | is_sum_top  # union of the two top-20 sets
make_array(is_top.sum(),
           customer_id=unq[is_top],
           amount=sums[is_top],
           sales_ymd=counts[is_top])

Out[039]


array([('CS001605000009', 18925., nan), ('CS006515000023', 18372., nan),
       ('CS007514000094', 15735., nan), ('CS007515000107',    nan, 18.),
       ('CS009414000059', 15492., nan), ('CS010214000002',    nan, 21.),
       ('CS010214000010', 18585., 22.), ('CS011414000106', 18338., nan),
       ('CS011415000006', 16094., nan), ('CS014214000023',    nan, 19.),
       ('CS014415000077',    nan, 18.), ('CS015415000185', 20153., 22.),
       ('CS015515000034', 15300., nan), ('CS016415000101', 16348., nan),
       ('CS016415000141', 18372., 20.), ('CS017415000097', 23086., 20.),
       ('CS021514000045',    nan, 19.), ('CS021515000056',    nan, 18.),
       ('CS021515000089', 17580., nan), ('CS021515000172',    nan, 19.),
       ('CS021515000211',    nan, 18.), ('CS022515000028',    nan, 18.),
       ('CS022515000226',    nan, 19.), ('CS026414000059', 15153., nan),
       ('CS028415000007', 19127., 21.), ('CS030214000008',    nan, 18.),
       ('CS030415000034', 15468., nan), ('CS031414000051', 19202., 19.),
       ('CS031414000073',    nan, 18.), ('CS032414000072', 16563., nan),
       ('CS032415000209',    nan, 18.), ('CS034415000047', 16083., nan),
       ('CS035414000024', 17615., nan), ('CS038415000104', 17847., nan),
       ('CS039414000052',    nan, 19.), ('CS040214000008',    nan, 23.)],
      dtype=[('customer_id', '<U14'), ('amount', '<f8'), ('sales_ymd', '<f8')])

P_040

P-040: I would like to know how many rows would result from combining all stores with all products. Calculate the size of the Cartesian product of the stores (df_store) and products (df_product).

I don't understand the intent of the problem ...

In[040]


arr_store.size * arr_product.size

Out[040]


531590

Time[040]


# Model answer
%%timeit
df_store_tmp = df_store.copy()
df_product_tmp = df_product.copy()

df_store_tmp['key'] = 0
df_product_tmp['key'] = 0
len(pd.merge(df_store_tmp, df_product_tmp, how='outer', on='key'))
# 277 ms ± 6.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# NumPy
%timeit arr_store.size * arr_product.size
# 136 ns ± 1.69 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

P_041

P-041: Aggregate the sales amount (amount) of the receipt detail data frame (df_receipt) for each date (sales_ymd), and calculate the increase or decrease in the sales amount from the previous day. It is sufficient to display 10 calculation results.

Use np.ediff1d() to take the differences between consecutive elements of the array.
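For example, applied to the first three daily totals that appear in the output below (33723, 24165, 27503), it returns the day-over-day changes:

np.ediff1d(np.array([33723., 24165., 27503.]))
# array([-9558.,  3338.])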

In[041]


unq, inv = np.unique(arr_receipt['sales_ymd'], return_inverse=True)
diff_amount = np.ediff1d(np.bincount(inv, arr_receipt['amount']))
make_array(unq.size, sales_ymd=unq,
           diff_amount=np.concatenate([[np.nan], diff_amount]))[:10]

Out[041]


array([(20170101,    nan), (20170102, -9558.), (20170103,  3338.),
       (20170104,  8662.), (20170105,  1665.), (20170106, -5443.),
       (20170107, -8972.), (20170108,  1322.), (20170109,  1981.),
       (20170110, -6575.)],
      dtype=[('sales_ymd', '<i4'), ('diff_amount', '<f8')])

P_042

P-042: Aggregate the sales amount (amount) of the receipt detail data frame (df_receipt) for each date (sales_ymd), and combine the data of 1 day ago, 2 days ago, and 3 days ago with the data of each date. Only 10 results should be displayed.

Just take slices at different offsets.
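The idea, on a toy array (hypothetical values): four slices of the same daily series, offset by 0 to 3 elements, line up as the current value and its three lags.

x = np.array([10, 20, 30, 40, 50])
x[3:], x[2:-1], x[1:-2], x[:-3]
# (array([40, 50]), array([30, 40]), array([20, 30]), array([10, 20]))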

In[042]


unq, inv = np.unique(arr_receipt['sales_ymd'], return_inverse=True)
amount = np.bincount(inv, arr_receipt['amount'])
make_array(unq.size - 3,
           sales_ymd=unq[3:], amount=amount[3:],
           lag_ymd_1=unq[2:-1], lag_amount_1=amount[2:-1],
           lag_ymd_2=unq[1:-2], lag_amount_2=amount[1:-2],
           lag_ymd_3=unq[:-3], lag_amount_3=amount[:-3])[:10]

Out[042]


array([(20170104, 36165., 20170103, 27503., 20170102, 24165., 20170101, 33723.),
       (20170105, 37830., 20170104, 36165., 20170103, 27503., 20170102, 24165.),
       (20170106, 32387., 20170105, 37830., 20170104, 36165., 20170103, 27503.),
       (20170107, 23415., 20170106, 32387., 20170105, 37830., 20170104, 36165.),
       (20170108, 24737., 20170107, 23415., 20170106, 32387., 20170105, 37830.),
       (20170109, 26718., 20170108, 24737., 20170107, 23415., 20170106, 32387.),
       (20170110, 20143., 20170109, 26718., 20170108, 24737., 20170107, 23415.),
       (20170111, 24287., 20170110, 20143., 20170109, 26718., 20170108, 24737.),
       (20170112, 23526., 20170111, 24287., 20170110, 20143., 20170109, 26718.),
       (20170113, 28004., 20170112, 23526., 20170111, 24287., 20170110, 20143.)],
      dtype=[('sales_ymd', '<i4'), ('amount', '<f8'), ('lag_ymd_1', '<i4'), ('lag_amount_1', '<f8'), ('lag_ymd_2', '<i4'), ('lag_amount_2', '<f8'), ('lag_ymd_3', '<i4'), ('lag_amount_3', '<f8')])

P_043

P-043: Join the receipt detail data frame (df_receipt) and the customer data frame (df_customer), and create a sales summary data frame (df_sales_summary) that totals the sales amount (amount) by gender (gender_cd) and age group (computed from age). Gender is 0 for male, 1 for female, and 9 for unknown.

However, the output should consist of four columns: age group, sales amount for women, sales amount for men, and sales amount for unknown gender (a cross tabulation with age group vertically and gender horizontally). Age groups should be 10-year bins.

First, the key columns of both tables are concatenated and digitized. Next, create an array map_array filled with missing values, and scatter the ages and genders into it using the digitized customer IDs on the customer side as indices. Then gather each receipt row's age and gender by indexing with the digitized customer IDs on the receipt side. Finally, create a two-dimensional age x gender plane and accumulate the sales amounts into it, using the ages and genders as indices.
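The accumulation step relies on np.add.at(), which, unlike a plain fancy-indexed +=, adds correctly when the same index appears more than once. A minimal sketch on toy data (hypothetical values):

plane = np.zeros((2, 3), dtype=int)
rows = np.array([0, 0, 1])
cols = np.array([1, 1, 2])
np.add.at(plane, (rows, cols), np.array([5, 7, 2]))
plane
# array([[ 0, 12,  0],
#        [ 0,  0,  2]])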

In[043]


# Digitize the customer IDs
unq, inv = np.unique(np.concatenate([arr_customer['customer_id'],
                                     arr_receipt['customer_id']]),
                     return_inverse=True)
inv_map, inv_arr = inv[:arr_customer.size], inv[arr_customer.size:]

# Get the age group (missing value = 0)
map_array = np.zeros(unq.size, dtype=int)
map_array[inv_map] = arr_customer['age']//10
arr_age = map_array[inv_arr]
max_age = arr_age.max()+1

# Get the gender (missing value = 9)
# map_array = np.full(unq.size, 9, dtype=int)
map_array[:] = 9
map_array[inv_map] = arr_customer['gender_cd']
arr_gender = map_array[inv_arr]

# Accumulate sales on the two-dimensional age x gender plane
arr_sales_summary = np.zeros((max_age, arr_gender.max()+1), dtype=int)
np.add.at(arr_sales_summary, (arr_age, arr_gender), arr_receipt['amount'])

# Convert to a structured array
make_array(max_age,
           era=np.arange(max_age)*10,
           male=arr_sales_summary[:, 0],
           female=arr_sales_summary[:, 1],
           unknown=arr_sales_summary[:, 9])

Out[043]


array([( 0,      0,       0, 12395003), (10,   1591,  149836,     4317),
       (20,  72940, 1363724,    44328), (30, 177322,  693047,    50441),
       (40,  19355, 9320791,   483512), (50,  54320, 6685192,   342923),
       (60, 272469,  987741,    71418), (70,  13435,   29764,     2427),
       (80,  46360,  262923,     5111), (90,      0,    6260,        0)],
      dtype=[('era', '<i4'), ('male', '<i4'), ('female', '<i4'), ('unknown', '<i4')])

Time[043]


Model answer: 177 ms ± 3.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
NumPy: 66.4 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

P_044

P-044: The sales summary data frame (df_sales_summary) created in the previous question held gender horizontally. Starting from that data, convert gender to a vertical (long) format with three columns: age group, gender code, and sales amount. The gender code is '00' for men, '01' for women, and '99' for unknown.

In[044]


arr_amount = arr_sales_summary[:, [0, 1, 9]].ravel()
make_array(arr_amount.size,
           era=(np.arange(max_age)*10).repeat(3),
           gender_cd=np.tile(np.array(['00', '01', '99']), max_age),
           amount=arr_amount)

Out[044]


array([( 0, '00',        0), ( 0, '01',        0), ( 0, '99', 12395003),
       (10, '00',     1591), (10, '01',   149836), (10, '99',     4317),
       (20, '00',    72940), (20, '01',  1363724), (20, '99',    44328),
       (30, '00',   177322), (30, '01',   693047), (30, '99',    50441),
       (40, '00',    19355), (40, '01',  9320791), (40, '99',   483512),
       (50, '00',    54320), (50, '01',  6685192), (50, '99',   342923),
       (60, '00',   272469), (60, '01',   987741), (60, '99',    71418),
       (70, '00',    13435), (70, '01',    29764), (70, '99',     2427),
       (80, '00',    46360), (80, '01',   262923), (80, '99',     5111),
       (90, '00',        0), (90, '01',     6260), (90, '99',        0)],
      dtype=[('era', '<i4'), ('gender_cd', '<U2'), ('amount', '<i4')])
