That's why I quit pandas [Data Science 100 Knock (Structured Data Processing) # 6]

We will solve the Python problem of Data Science 100 Knock (Structured Data Processing). This group of questions uses pandas for data processing in the model answer, but we will process it using NumPy after studying.

: arrow_up: First article (# 1) : arrow_backward: Previous article (# 5) : arrow_forward: Next article (# 7)

Introduction

As a study of NumPy, I will solve the Python problem of Data Science 100 Knock (Structured Data Processing).

Many people who do data science in Python may be pandas lovers, but in fact ** you can do the same with NumPy without using pandas **. And NumPy is usually faster. As a person who loves pandas, I'm still not used to operating NumPy, so I'd like to try to graduate from pandas by operating this "Data Science 100 Knock" with NumPy this time.

This time, I will do the 52nd to 62nd questions. It seems to be the theme of data categorization. The initial data was read as follows.

import numpy as np
import pandas as pd
from numpy.lib import recfunctions as rfn

#For model answer
df_customer = pd.read_csv('data/customer.csv')
df_receipt = pd.read_csv('data/receipt.csv')

#Data we handle
arr_customer = np.genfromtxt(
    'data/customer.csv', delimiter=',', encoding='utf-8',
    names=True, dtype=None)
arr_receipt = np.genfromtxt(
    'data/receipt.csv', delimiter=',', encoding='utf-8',
    names=True, dtype=None)

`Finally, a function to output the calculation result as a structured array`


def make_array(size, **kwargs):
    arr = np.empty(size, dtype=[(colname, subarr.dtype)
                                for colname, subarr in kwargs.items()])
    for colname, subarr in kwargs.items():
        arr[colname] = subarr
    return arr

P_052

P-052: Total sales amount (amount) of receipt detail data frame (df_receipt) for each customer ID (customer_id), and 0 for 2000 yen or less and 1 for 2000 yen or more for the total sales amount. Then, display 10 items together with the customer ID and the total sales amount. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation.

Sales by customer are calculated by the usual np.unique () ＋ np.bincount (). "0 for 2000 yen or less, 1 for more" simply converts the boolean array compared to 2000 into a numerical value (False is equivalent to 0, True is equivalent to 1).

`In[052]`


is_member = arr_receipt['customer_id'].astype('<U1') != 'Z'
unq_id, inv_id = np.unique(arr_receipt['customer_id'][is_member],
                           return_inverse=True)
amount_arr = np.bincount(inv_id, arr_receipt['amount'][is_member])
make_array(unq_id.size,
           customer_id=unq_id,
           amount=amount_arr,
           sales_flg=(amount_arr > 2000).view(np.int8))[:10]

`Out[052]`


array([('CS001113000004', 1298., 0), ('CS001114000005',  626., 0),
       ('CS001115000010', 3044., 1), ('CS001205000004', 1988., 0),
       ('CS001205000006', 3337., 1), ('CS001211000025',  456., 0),
       ('CS001212000027',  448., 0), ('CS001212000031',  296., 0),
       ('CS001212000046',  228., 0), ('CS001212000070',  456., 0)],
      dtype=[('customer_id', '<U14'), ('amount', '<f8'), ('sales_flg', 'i1')])

`Time[052]`


#Model answer
%%timeit
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python')
df_sales_amount = df_sales_amount[['customer_id', 'amount']].groupby('customer_id').sum().reset_index()
df_sales_amount['sales_flg'] = df_sales_amount['amount'].apply(lambda x: 1 if x > 2000 else 0)
df_sales_amount.head(10)
# 72.9 ms ± 1.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#Improvement
%%timeit
df_sales_amount = df_receipt.loc[~df_receipt['customer_id'].str.startswith('Z'), ['customer_id', 'amount']].groupby('customer_id', as_index=False).sum()
df_sales_amount['sales_flg'] = (df_sales_amount['amount'] > 2000).astype(int)
df_sales_amount.head(10)
# 63.5 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

#NumPy (see code above)
# 33.8 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

P_053

P-053: Binar Tokyo (the first 3 digits are 100 to 209) to 1 and the other ones to 0 for the postal code (postal_cd) of the customer data frame (df_customer). In addition, combine it with the receipt detail data frame (df_receipt) and count the number of customers who have a shopping record for the entire period for each of the created two values.

First, use np.in1d () to extract only customers with a purchase history. Next, convert the first 3 digits of the zip code to a number. You can use .astype ('<U3'). Astype (int), but here we use np.frombuffer () to convert without going through .astype () ([P_011) ](See https://qiita.com/nkay/items/65a77b97ae9b7331b39c#p_011).

`In[053]`


isin_rec = np.in1d(arr_customer['customer_id'], arr_receipt['customer_id'])
post_int_arr = ((np.frombuffer(
    arr_customer['postal_cd'][isin_rec].tobytes(), dtype=np.int32)
    - 48).reshape(-1, 8)[:, :3]*np.array([100, 10, 1])).sum(1)
postal_flg = ((101 <= post_int_arr) & (post_int_arr <= 209))

make_array(2, postal_flg=np.arange(2), count=np.bincount(postal_flg))

`Out[053]`


array([(0, 3906), (1, 4400)],
      dtype=[('postal_flg', '<i4'), ('count', '<i8')])

`Time[053]`


#Model answer
%%timeit
df_tmp = df_customer[['customer_id', 'postal_cd']].copy()
df_tmp['postal_flg'] = df_tmp['postal_cd'].apply(lambda x: 1 if 100 <= int(x[0:3]) <= 209 else 0)

pd.merge(df_tmp, df_receipt, how='inner', on='customer_id').groupby('postal_flg').agg({'customer_id':'nunique'})
# 109 ms ± 4.89 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

#NumPy (see code above)
# 50.2 ms ± 349 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

P_054

P-054: The address of the customer data data frame (df_customer) is one of Saitama prefecture, Chiba prefecture, Tokyo metropolitan area, and Kanagawa prefecture. Create a code value for each prefecture and extract it along with the customer ID and address. The values should be 11 for Saitama prefecture, 12 for Chiba prefecture, 13 for Tokyo, and 14 for Kanagawa prefecture. You can display 10 results.

Look at the first letter of the address column and replace it with any number (see [P_036](see https://qiita.com/nkay/items/a77fcb2dd3810d1f20c3#p_036)).

`In[054]`


caps = np.array(['Sai', 'thousand', 'east', 'God'], dtype='<U1')
sorter_index = caps.argsort()
idx = np.searchsorted(caps, arr_customer['address'].astype('<U1'),
                      sorter=sorter_index)
address_code = np.array([11, 12, 13, 14])[sorter_index[idx]]

make_array(arr_customer.size,
           customer_id=arr_customer['customer_id'],
           address=arr_customer['address'],
           address_code=address_code)[:10]

`Out[054]`


array([('CS021313000114', 'Awakubo, Isehara City, Kanagawa Prefecture**********', 14),
       ('CS037613000071', 'Minamisuna, Koto-ku, Tokyo**********', 13),
       ('CS031415000172', 'Yoyogi, Shibuya-ku, Tokyo**********', 13),
       ('CS028811000001', 'Izumi-cho, Izumi-ku, Yokohama-shi, Kanagawa**********', 14),
       ('CS001215000145', 'Nakarokugo, Ota-ku, Tokyo**********', 13),
       ('CS020401000016', 'Wakagi, Itabashi-ku, Tokyo**********', 13),
       ('CS015414000103', 'Kitasuna, Koto-ku, Tokyo**********', 13),
       ('CS029403000008', 'Kairaku, Urayasu City, Chiba Prefecture**********', 12),
       ('CS015804000004', 'Kitasuna, Koto-ku, Tokyo**********', 13),
       ('CS033513000180', 'Zenbu-cho, Asahi-ku, Yokohama-shi, Kanagawa**********', 14)],
      dtype=[('customer_id', '<U14'), ('address', '<U26'), ('address_code', '<i4')])

`Time[054]`


#Model answer
pd.concat([df_customer[['customer_id', 'address']],
           df_customer['address'].str[0:3].map({'Saitama': '11', 'Chiba': '12', 'Tokyo': '13', 'Kanagawa': '14'})],
          axis=1).head(10)
# 16.2 ms ± 963 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

#NumPy (see code above)
# 4.52 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

P_055

P-055: Sum the sales amount (amount) of the receipt detail data frame (df_receipt) for each customer ID (customer_id), and find the quartile of the total amount. Then, create a category value for the total sales amount for each customer based on the following criteria, and display both the customer ID and sales amount. Category values are 1 to 4 in order from the top. You can display 10 results.

--Minimum value or more and less than the first quartile --From the first quartile to less than the second quartile --From the second quartile to less than the third quartile --Third quartile and above

Sales by customer are calculated by the usual np.unique () ＋ np.bincount (). Then use np.quantile () to find the quartile. Finally, compare the sales by customer with the quartile value to find out which group you belong to (you can also use np.searchsorted ()).

`In[055]`


unq_id, inv_id = np.unique(arr_receipt['customer_id'],
                           return_inverse=True)
amount_arr = np.bincount(inv_id, arr_receipt['amount'])
quantiles = np.quantile(amount_arr, np.arange(1, 4)/4)
pct_group = (quantiles[:, None] <= amount_arr).sum(0) + 1

make_array(unq_id.size,
           customer_id=unq_id, amount=amount_arr, pct_group=pct_group)[:10]

`Out[055]`


array([('CS001113000004', 1298., 2), ('CS001114000005',  626., 2),
       ('CS001115000010', 3044., 3), ('CS001205000004', 1988., 3),
       ('CS001205000006', 3337., 3), ('CS001211000025',  456., 1),
       ('CS001212000027',  448., 1), ('CS001212000031',  296., 1),
       ('CS001212000046',  228., 1), ('CS001212000070',  456., 1)],
      dtype=[('customer_id', '<U14'), ('amount', '<f8'), ('pct_group', '<i8')])

P_056

P-056: Calculate the age in 10-year increments based on the age of the customer data frame (df_customer), and extract it together with the customer ID (customer_id) and date of birth (birth_day). However, all people over the age of 60 should be in their 60s. The category name indicating the age is arbitrary. The first 10 items should be displayed.

`In[056]`


age = arr_customer['age']//10*10
age[age > 60] = 60
arr_customer_era = make_array(arr_customer.size,
                              customer_id=arr_customer['customer_id'],
                              birth_day=arr_customer['birth_day'],
                              age=age)
arr_customer_era[:10]

You can also use np.clip () or np.where ().

`Out[056]`


array([('CS021313000114', '1981-04-29', 30),
       ('CS037613000071', '1952-04-01', 60),
       ('CS031415000172', '1976-10-04', 40),
       ('CS028811000001', '1933-03-27', 60),
       ('CS001215000145', '1995-03-29', 20),
       ('CS020401000016', '1974-09-15', 40),
       ('CS015414000103', '1977-08-09', 40),
       ('CS029403000008', '1973-08-17', 40),
       ('CS015804000004', '1931-05-02', 60),
       ('CS033513000180', '1962-07-11', 50)],
      dtype=[('customer_id', '<U14'), ('birth_day', '<U10'), ('age', '<i4')])

P_057

P-057: Combine the extraction result of the previous question and gender (gender), and create new category data that represents the combination of gender and age. The value of the category representing the combination is arbitrary. The first 10 items should be displayed.

`In[057]`


arr_customer_era = make_array(arr_customer.size,
                              customer_id=arr_customer['customer_id'],
                              birth_day=arr_customer['birth_day'],
                              age=age,
                              era_gender=arr_customer['gender_cd']*100+age)
arr_customer_era[:10]

`Out[057]`


array([('CS021313000114', '1981-04-29', 30, 130),
       ('CS037613000071', '1952-04-01', 60, 960),
       ('CS031415000172', '1976-10-04', 40, 140),
       ('CS028811000001', '1933-03-27', 60, 160),
       ('CS001215000145', '1995-03-29', 20, 120),
       ('CS020401000016', '1974-09-15', 40,  40),
       ('CS015414000103', '1977-08-09', 40, 140),
       ('CS029403000008', '1973-08-17', 40,  40),
       ('CS015804000004', '1931-05-02', 60,  60),
       ('CS033513000180', '1962-07-11', 50, 150)],
      dtype=[('customer_id', '<U14'), ('birth_day', '<U10'), ('age', '<i4'), ('era_gender', '<i4')])

P_058

P-058: Make the gender code (gender_cd) of the customer data frame (df_customer) a dummy variable and extract it together with the customer ID (customer_id). You can display 10 results.

Broca.

`In[058]`


dummies = (arr_customer['gender_cd']
           == np.array([[0], [1], [9]])).view(np.int8)
make_array(arr_customer.size,
           customer_id=arr_customer['customer_id'],
           gender_cd_0=dummies[0],
           gender_cd_1=dummies[1],
           gender_cd_9=dummies[2])[:10]

`Out[058]`


gender_cd = np.ascontiguousarray(arr_customer['gender_cd'])...
array([('CS021313000114', 0, 1, 0), ('CS037613000071', 0, 0, 1),
       ('CS031415000172', 0, 1, 0), ('CS028811000001', 0, 1, 0),
       ('CS001215000145', 0, 1, 0), ('CS020401000016', 1, 0, 0),
       ('CS015414000103', 0, 1, 0), ('CS029403000008', 1, 0, 0),
       ('CS015804000004', 1, 0, 0), ('CS033513000180', 0, 1, 0)],
      dtype=[('customer_id', '<U14'), ('gender_cd_0', 'i1'), ('gender_cd_1', 'i1'), ('gender_cd_9', 'i1')])

P_059

P-059: Total sales amount (amount) of receipt detail data frame (df_receipt) is totaled for each customer ID (customer_id), and the total sales amount is standardized to average 0 and standard deviation 1 to total customer ID and sales amount Display with. The standard deviation used for standardization may be either unbiased standard deviation or sample standard deviation. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation. You can display 10 results.

Just apply to the formula. It's easy with scipy.stats.zscore ().

`In[059]`


is_member = arr_receipt['customer_id'].astype('<U1') != 'Z'
unq_id, inv_id = np.unique(arr_receipt['customer_id'][is_member],
                           return_inverse=True)
amount_arr = np.bincount(inv_id, arr_receipt['amount'][is_member])

amount_mean = amount_arr.mean()
amount_std = np.sqrt(((amount_arr-amount_mean)**2).mean())
amount_ss = (amount_arr - amount_mean) / amount_std
# amount_ss = scipy.stats.zscore(amount_mean)But OK

make_array(unq_id.size,
           customer_id=unq_id, amount=amount_arr, amount_ss=amount_ss)[:10]

`Out[059]`


array([('CS001113000004', 1298., -0.45937788),
       ('CS001114000005',  626., -0.70639037),
       ('CS001115000010', 3044.,  0.18241349),
       ('CS001205000004', 1988., -0.20574899),
       ('CS001205000006', 3337.,  0.29011387),
       ('CS001211000025',  456., -0.76887864),
       ('CS001212000027',  448., -0.77181927),
       ('CS001212000031',  296., -0.82769114),
       ('CS001212000046',  228., -0.85268645),
       ('CS001212000070',  456., -0.76887864)],
      dtype=[('customer_id', '<U14'), ('amount', '<f8'), ('amount_ss', '<f8')])

P_060

P-060: The sales amount (amount) of the receipt detail data frame (df_receipt) is totaled for each customer ID (customer_id), and the total sales amount is normalized to the minimum value 0 and the maximum value 1 to the customer ID and sales amount. Display with the total. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation. You can display 10 results.

Just apply to the formula.

`In[060]`


is_member = arr_receipt['customer_id'].astype('<U1') != 'Z'
unq_id, inv_id = np.unique(arr_receipt['customer_id'][is_member],
                           return_inverse=True)
amount_arr = np.bincount(inv_id, arr_receipt['amount'][is_member])
amount_min, amount_max = amount_arr.min(), amount_arr.max()
amount_mm = (amount_arr - amount_min) / (amount_max - amount_min)

make_array(unq_id.size,
           customer_id=unq_id, amount=amount_arr, amount_mm=amount_mm)[:10]

`Out[060]`


array([('CS001113000004', 1298., 0.05335419),
       ('CS001114000005',  626., 0.02415711),
       ('CS001115000010', 3044., 0.12921446),
       ('CS001205000004', 1988., 0.08333333),
       ('CS001205000006', 3337., 0.14194473),
       ('CS001211000025',  456., 0.01677094),
       ('CS001212000027',  448., 0.01642336),
       ('CS001212000031',  296., 0.00981926),
       ('CS001212000046',  228., 0.00686479),
       ('CS001212000070',  456., 0.01677094)],
      dtype=[('customer_id', '<U14'), ('amount', '<f8'), ('amount_mm', '<f8')])

P_061

P-061: The sales amount (amount) of the receipt detail data frame (df_receipt) is totaled for each customer ID (customer_id), and the total sales amount is converted to the common logarithm (base = 10) to total the customer ID and sales amount. Display with. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation. You can display 10 results.

`In[061]`


is_member = arr_receipt['customer_id'].astype('<U1') != 'Z'
unq_id, inv_id = np.unique(arr_receipt['customer_id'][is_member],
                           return_inverse=True)
amount_arr = np.bincount(inv_id, arr_receipt['amount'][is_member])

make_array(unq_id.size, customer_id=unq_id, amount=amount_arr,
           amount_log10=np.log10(amount_arr + 1))[:10]

`Out[061]`


array([('CS001113000004', 1298., 3.11360915),
       ('CS001114000005',  626., 2.79726754),
       ('CS001115000010', 3044., 3.4835873 ),
       ('CS001205000004', 1988., 3.29863478),
       ('CS001205000006', 3337., 3.52348633),
       ('CS001211000025',  456., 2.6599162 ),
       ('CS001212000027',  448., 2.65224634),
       ('CS001212000031',  296., 2.47275645),
       ('CS001212000046',  228., 2.35983548),
       ('CS001212000070',  456., 2.6599162 )],
      dtype=[('customer_id', '<U14'), ('amount', '<f8'), ('amount_log10', '<f8')])

P_062

P-062: The sales amount (amount) of the receipt detail data frame (df_receipt) is totaled for each customer ID (customer_id), and the total sales amount is converted to natural logarithm (base = e) to total the customer ID and sales amount. Display with. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation. You can display 10 results.

`In[062]`


is_member = arr_receipt['customer_id'].astype('<U1') != 'Z'
unq_id, inv_id = np.unique(arr_receipt['customer_id'][is_member],
                           return_inverse=True)
amount_arr = np.bincount(inv_id, arr_receipt['amount'][is_member])

make_array(unq_id.size, customer_id=unq_id, amount=amount_arr,
           amount_log10=np.log1p(amount_arr))[:10]

`Out[062]`


array([('CS001113000004', 1298., 7.16935002),
       ('CS001114000005',  626., 6.44094654),
       ('CS001115000010', 3044., 8.02125618),
       ('CS001205000004', 1988., 7.59538728),
       ('CS001205000006', 3337., 8.1131271 ),
       ('CS001211000025',  456., 6.12468339),
       ('CS001212000027',  448., 6.10702289),
       ('CS001212000031',  296., 5.69373214),
       ('CS001212000046',  228., 5.433722  ),
       ('CS001212000070',  456., 6.12468339)],
      dtype=[('customer_id', '<U14'), ('amount', '<f8'), ('amount_log10', '<f8')])

[PYTHON] That's why I quit pandas [Data Science 100 Knock (Structured Data Processing) # 6]

That's why I quit pandas [Data Science 100 Knock (Structured Data Processing) # 6]

Introduction

Finally, a function to output the calculation result as a structured array

In[052]

Out[052]

Time[052]

In[053]

Out[053]

Time[053]

In[054]

Out[054]

Time[054]

In[055]

Out[055]

In[056]

Out[056]

In[057]

Out[057]

In[058]

Out[058]

In[059]

Out[059]

In[060]

Out[060]

In[061]

Out[061]

In[062]

Out[062]