That's why I quit pandas [Data Science 100 Knock (Structured Data Processing) # 3]

We will solve the Python problem of Data Science 100 Knock (Structured Data Processing). This group of questions uses pandas for data processing in the model answer, but we will process it using NumPy after studying.

First article: #1 | Previous article: #2 | Next article: #4

Introduction

To study NumPy, I will solve the Python problems of Data Science 100 Knock (Structured Data Processing).

Many people doing data science in Python may be pandas lovers, but in fact **you can do the same things with NumPy alone, without pandas**, and NumPy is usually faster. As a pandas lover myself, I am still not used to working in NumPy, so this time I would like to try graduating from pandas by working through this Data Science 100 Knock with NumPy.

This time, I will tackle questions 17 through 22, whose theme seems to be sorting and ranking. The initial data was read as follows. As a policy, I do not vectorize functions with `np.vectorize()` or `np.frompyfunc()`.

import numpy as np
import pandas as pd
from numpy.lib import recfunctions as rfn

# For the model answers
df_customer = pd.read_csv('data/customer.csv')
df_receipt = pd.read_csv('data/receipt.csv')

# Data we handle
arr_customer = np.genfromtxt(
    'data/customer.csv', delimiter=',', encoding='utf-8',
    names=True, dtype=None)
arr_receipt = np.genfromtxt(
    'data/receipt.csv', delimiter=',', encoding='utf-8',
    names=True, dtype=None)

P_017

P-17: Sort the customer data frame (df_customer) by date of birth (birth_day), oldest first, and display all columns of the first 10 rows.

If you want a sorted array, use `np.sort()`, which returns a sorted copy. This time, however, I want the sorting indices, so I use `np.argsort()` (or the `np.ndarray.argsort()` method). There is also a way to pass the column name to the `order` argument of `np.sort()`, but I don't recommend it because its behavior is surprising.
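
For reference, a minimal sketch of the `order` approach (for illustration only): fields not named in `order` are still used as tie-breakers, in the order they appear in the dtype, and comparing whole records is also slower.

np.sort(arr_customer, order='birth_day')[:10]
# Sorts by birth_day, but ties are then broken by the remaining fields
# (customer_id, customer_name, ...) whether you want that or not.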

In[017]


arr_customer[arr_customer['birth_day'].argsort()][:10]

Out[017]


array([('CS003813000014', 'Nanami Murayama', 1, 'Female', '1928-11-26', 90, '182-0007', 'Kikunodai, Chofu-shi, Tokyo**********', 'S13003', 20160214, '0-00000000-0'),
       ('CS026813000004', 'Yoshimura Chaoyang', 1, 'Female', '1928-12-14', 90, '251-0043', 'Tsujido Motomachi, Fujisawa City, Kanagawa Prefecture**********', 'S14026', 20150723, '0-00000000-0'),
       ('CS018811000003', 'Misato Kumazawa', 1, 'Female', '1929-01-07', 90, '204-0004', 'Noshio, Kiyose City, Tokyo**********', 'S13018', 20150403, '0-00000000-0'),
       ('CS027803000004', 'Takuro Uchimura', 0, 'male', '1929-01-12', 90, '251-0031', 'Kugenumafujigaya, Fujisawa City, Kanagawa Prefecture**********', 'S14027', 20151227, '0-00000000-0'),
       ('CS013801000003', 'Takuro Amano', 0, 'male', '1929-01-15', 90, '274-0824', 'Maebarahigashi, Funabashi City, Chiba Prefecture**********', 'S12013', 20160120, '0-00000000-0'),
       ('CS001814000022', 'Riho Tsuruta', 1, 'Female', '1929-01-28', 90, '144-0045', 'Minamirokugo, Ota-ku, Tokyo**********', 'S13001', 20161012, 'A-20090415-7'),
       ('CS016815000002', 'Miki Yamamoto', 1, 'Female', '1929-02-22', 90, '184-0005', 'Sakuracho, Koganei City, Tokyo**********', 'S13016', 20150629, 'C-20090923-C'),
       ('CS009815000003', 'Riho Nakata', 1, 'Female', '1929-04-08', 89, '154-0014', 'Shinmachi, Setagaya-ku, Tokyo**********', 'S13009', 20150421, 'D-20091021-E'),
       ('CS005813000015', 'Erika Kanaya', 1, 'Female', '1929-04-09', 89, '165-0032', 'Saginomiya, Nakano-ku, Tokyo**********', 'S13005', 20150506, '0-00000000-0'),
       ('CS012813000013', 'Uno Minami Tomo', 1, 'Female', '1929-04-09', 89, '231-0806', 'Honmoku-cho, Naka-ku, Yokohama-shi, Kanagawa**********', 'S14012', 20150712, '0-00000000-0')],
      dtype=[('customer_id', '<U14'), ('customer_name', '<U10'), ('gender_cd', '<i4'), ('gender', '<U2'), ('birth_day', '<U10'), ('age', '<i4'), ('postal_cd', '<U8'), ('address', '<U26'), ('application_store_cd', '<U6'), ('application_date', '<i4'), ('status_cd', '<U12')])

Time[017]


%timeit df_customer.sort_values('birth_day', ascending=True).head(10)
# 21.8 ms ± 346 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit arr_customer[arr_customer['birth_day'].argsort()][:10]
# 17.1 ms ± 473 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

P_018

P-18: Sort the customer data frame (df_customer) by date of birth (birth_day), youngest first, and display all columns of the first 10 rows.

Just reverse the order of the previous answer.
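
As a side note, reversing the indexer with `[::-1]` also reverses the order of tied rows, so this is not a stable descending sort. For a numeric key you could negate the key instead; a minimal sketch, using the receipt `amount` field purely as an illustration:

# Stable descending sort on a numeric field: negate the key rather than
# reversing the sorted indexer, so tied rows keep their original order.
arr_receipt[(-arr_receipt['amount']).argsort(kind='stable')][:10]

Since birth_day is a string field, negation is unavailable there, and the `[::-1]` trick is the practical choice for this problem.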

In[018]


arr_customer[arr_customer['birth_day'].argsort()][::-1][:10]

Out[018]


array([('CS035114000004', 'Misato Omura', 1, 'Female', '2007-11-25', 11, '156-0053', 'Sakura, Setagaya-ku, Tokyo**********', 'S13035', 20150619, '6-20091205-6'),
       ('CS022103000002', 'Fukuyama Hajime', 9, 'unknown', '2007-10-02', 11, '249-0006', 'Zushi, Zushi City, Kanagawa Prefecture**********', 'S14022', 20160909, '0-00000000-0'),
       ('CS002113000009', 'Mayuko Shibata', 1, 'Female', '2007-09-17', 11, '184-0014', 'Nukuiminami-cho, Koganei-shi, Tokyo**********', 'S13002', 20160304, '0-00000000-0'),
       ('CS004115000014', 'Kyoko Matsui', 1, 'Female', '2007-08-09', 11, '165-0031', 'Kamisaginomiya, Nakano-ku, Tokyo**********', 'S13004', 20161120, '1-20081231-1'),
       ('CS002114000010', 'Haruka Yamauchi', 1, 'Female', '2007-06-03', 11, '184-0015', 'Nukui Kitamachi, Koganei City, Tokyo**********', 'S13002', 20160920, '6-20100510-1'),
       ('CS025115000002', 'Natsuki Koyanagi', 1, 'Female', '2007-04-18', 11, '245-0018', 'Kamiiida-cho, Izumi-ku, Yokohama-shi, Kanagawa**********', 'S14025', 20160116, 'D-20100913-D'),
       ('CS002113000025', 'Hirosue Manami', 1, 'Female', '2007-03-30', 12, '184-0015', 'Nukui Kitamachi, Koganei City, Tokyo**********', 'S13002', 20171030, '0-00000000-0'),
       ('CS033112000003', 'Miki Nagano', 1, 'Female', '2007-03-22', 12, '245-0051', 'Nase-cho, Totsuka-ku, Yokohama-shi, Kanagawa**********', 'S14033', 20150606, '0-00000000-0'),
       ('CS007115000006', 'Fukuoka Shun', 1, 'Female', '2007-03-10', 12, '285-0845', 'Nishishizu, Sakura City, Chiba Prefecture**********', 'S12007', 20151118, 'F-20101016-F'),
       ('CS014113000008', 'Rio Yaguchi', 1, 'Female', '2007-03-05', 12, '260-0041', 'Higashichiba, Chuo-ku, Chiba City, Chiba Prefecture**********', 'S12014', 20150622, '3-20091108-6')],
      dtype=[('customer_id', '<U14'), ('customer_name', '<U10'), ('gender_cd', '<i4'), ('gender', '<U2'), ('birth_day', '<U10'), ('age', '<i4'), ('postal_cd', '<U8'), ('address', '<U26'), ('application_store_cd', '<U6'), ('application_date', '<i4'), ('status_cd', '<U12')])

P_019

P-19: Rank the rows of the receipt detail data frame (df_receipt) in descending order of sales amount (amount), and extract the first 10 rows. The result should show customer ID (customer_id), sales amount (amount), and the assigned rank. If sales amounts (amount) are tied, give them the same rank.

Sorting in descending order of sales amount is the easy part.

sorted_array = arr_receipt[['customer_id', 'amount']][
    arr_receipt['amount'].argsort()[::-1]]

sorted_array[:10]
# array([('CS011415000006', 10925), ('ZZ000000000000',  6800),
#        ('CS028605000002',  5780), ('ZZ000000000000',  5480),
#        ('ZZ000000000000',  5480), ('CS015515000034',  5480),
#        ('CS021515000089',  5440), ('ZZ000000000000',  5440),
#        ('ZZ000000000000',  5280), ('ZZ000000000000',  5280)],
#       dtype={'names':['customer_id','amount'], 'formats':['<U14','<i4'], 'offsets':[40,140], 'itemsize':144})

Now rank the sales amounts. The easiest way is `scipy.stats.rankdata()`. We want descending ranks here, but this function ranks in ascending order, so we flip it by subtracting from the maximum and adding 1.

import scipy.stats

rank_asc = scipy.stats.rankdata(sorted_array['amount'], 'max')
rank = rank_asc.max() - rank_asc + 1

rank[:10]
# array([1, 2, 3, 4, 4, 4, 7, 7, 9, 9], dtype=int64)

Without `scipy.stats.rankdata()`, we can build the rank array ourselves: detect where ties occur, work out which distinct-value group each row belongs to, and look up the position at which that group starts.

# True where a row's amount differs from the row just above it (the start
# of a new rank); False where it ties with the row above.
rank_cutidx = np.concatenate(
    ([True], sorted_array['amount'][1:] != sorted_array['amount'][:-1]))

# rank_cutidx.cumsum()-1: for each row, the id of the distinct-value group it belongs to
# rank_cutidx.nonzero()[0]: the position at which each distinct value first appears
rank = rank_cutidx.nonzero()[0][rank_cutidx.cumsum()-1] + 1

rank[:10]
# array([1, 2, 3, 4, 4, 4, 7, 7, 9, 9], dtype=int64)
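
To see how this works, here is a tiny worked example with made-up values (already sorted in descending order):

a = np.array([50, 40, 40, 40, 30])
cut = np.concatenate(([True], a[1:] != a[:-1]))
cut                                      # [ True,  True, False, False,  True]
cut.nonzero()[0]                         # [0, 1, 4]: where each distinct value starts
cut.cumsum() - 1                         # [0, 1, 1, 1, 2]: group id of each row
cut.nonzero()[0][cut.cumsum() - 1] + 1   # [1, 2, 2, 2, 5]: min-method ranks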

To attach the rank column I used `numpy.lib.recfunctions.append_fields()`, but this function turned out to be extremely slow (it appears to rebuild the whole array from scratch).

In[019]


sorted_array = arr_receipt[['customer_id', 'amount']][
    arr_receipt['amount'].argsort()[::-1]]
rank_cutidx = np.concatenate(
    ([True], sorted_array['amount'][1:] != sorted_array['amount'][:-1]))
rank = rank_cutidx.nonzero()[0][rank_cutidx.cumsum()-1]+1
rfn.append_fields(sorted_array, 'ranking', rank,
                  dtypes=rank.dtype, usemask=False)[:10]

So instead, why not prepare the blank table ourselves?

In[019]


sorter_index = arr_receipt['amount'].argsort()[::-1]
sorted_id = arr_receipt['customer_id'][sorter_index]
sorted_amount = arr_receipt['amount'][sorter_index]
rank_cutidx = np.concatenate(
    ([True], sorted_amount[1:] != sorted_amount[:-1]))
rank = rank_cutidx.nonzero()[0][rank_cutidx.cumsum()-1]+1

# Create the output array
new_arr = np.empty(arr_receipt.size, dtype=[('customer_id', sorted_id.dtype),
                                            ('amount', sorted_amount.dtype),
                                            ('ranking', rank.dtype)])
new_arr['customer_id'] = sorted_id
new_arr['amount'] = sorted_amount
new_arr['ranking'] = rank
new_arr[:10]

Out[019]


array([('CS011415000006', 10925, 1), ('ZZ000000000000',  6800, 2),
       ('CS028605000002',  5780, 3), ('ZZ000000000000',  5480, 4),
       ('ZZ000000000000',  5480, 4), ('CS015515000034',  5480, 4),
       ('CS021515000089',  5440, 7), ('ZZ000000000000',  5440, 7),
       ('ZZ000000000000',  5280, 9), ('ZZ000000000000',  5280, 9)],
      dtype=[('customer_id', '<U14'), ('amount', '<i4'), ('ranking', '<i8')])

Time[019]


# Model answer
%%timeit
df_tmp = pd.concat([df_receipt[['customer_id', 'amount']],
                    df_receipt['amount'].rank(method='min', ascending=False)], axis=1)
df_tmp.columns = ['customer_id', 'amount', 'ranking']
df_tmp.sort_values('ranking', ascending=True).head(10)
# 35 ms ± 643 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# A slight improvement, in one line
%%timeit
df_receipt[['customer_id', 'amount']] \
    .assign(ranking=df_receipt['amount'].rank(method='min', ascending=False)) \
    .sort_values('ranking', ascending=True).head(10)
# 34.1 ms ± 943 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# NumPy
%%timeit
tmp_amount = np.ascontiguousarray(arr_receipt['amount'])
sorter_index = tmp_amount.argsort()[::-1]
sorted_id = arr_receipt['customer_id'][sorter_index]
sorted_amount = tmp_amount[sorter_index]
rank_cutidx = np.concatenate(([True], sorted_amount[1:] != sorted_amount[:-1]))
rank = rank_cutidx.nonzero()[0][rank_cutidx.cumsum()-1]+1
new_arr = np.empty(arr_receipt.size, dtype=[('customer_id', sorted_id.dtype), ('amount', sorted_amount.dtype), ('ranking', rank.dtype)])
new_arr['customer_id'] = sorted_id
new_arr['amount'] = sorted_amount
new_arr['ranking'] = rank
new_arr[:10]
# 22.6 ms ± 464 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

I managed to make it faster than pandas.

P_020

P-020: Rank the rows of the receipt detail data frame (df_receipt) in descending order of sales amount (amount), and extract the first 10 rows. The result should show customer ID (customer_id), sales amount (amount), and the assigned rank. This time, give different ranks even when sales amounts (amount) are tied.

This is less a ranking than a plain serial number, which `np.arange()` provides.

In[020]


rfn.append_fields(arr_receipt[['customer_id', 'amount']]
                  [arr_receipt['amount'].argsort()[::-1]],
                  'ranking', np.arange(1, arr_receipt.size+1),
                  dtypes='<i4', usemask=False)[:10]

In[020]


sorter_index = arr_receipt['amount'].argsort()[::-1]
sorted_id = arr_receipt['customer_id'][sorter_index]
sorted_amount = arr_receipt['amount'][sorter_index]
rank = np.arange(1, sorted_id.size+1)

# Create the output array
new_arr = np.empty(arr_receipt.size, dtype=[('customer_id', sorted_id.dtype),
                                            ('amount', sorted_amount.dtype),
                                            ('ranking', rank.dtype)])
new_arr['customer_id'] = sorted_id
new_arr['amount'] = sorted_amount
new_arr['ranking'] = rank
new_arr[:10]

Out[020]


array([('CS011415000006', 10925,  1), ('ZZ000000000000',  6800,  2),
       ('CS028605000002',  5780,  3), ('ZZ000000000000',  5480,  4),
       ('ZZ000000000000',  5480,  5), ('CS015515000034',  5480,  6),
       ('CS021515000089',  5440,  7), ('ZZ000000000000',  5440,  8),
       ('ZZ000000000000',  5280,  9), ('ZZ000000000000',  5280, 10)],
      dtype=[('customer_id', '<U14'), ('amount', '<i4'), ('ranking', '<i4')])

P_021

P-021: Count the number of rows in the receipt detail data frame (df_receipt).

In[021]


arr_receipt.size
len(arr_receipt)
arr_receipt.shape[0]

Out[021]


104681
104681
104681

P_022

P-022: Count the number of unique values of customer ID (customer_id) in the receipt detail data frame (df_receipt).

Counting unique string values is not NumPy's strong suit.

In[022]


np.unique(arr_receipt['customer_id']).size

Out[022]


8307

As pandas boasts in its official docs, this operation is known to be much faster in pandas (which uses a hash table) than in NumPy (whose `np.unique()` sorts), and there is probably not much we NumPy users can do about it.
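
For reference, a couple of hash-based alternatives one could try (a sketch; timings will vary by machine and data):

# Plain Python set: hash-based, no sorting involved.
len(set(arr_receipt['customer_id']))

# pandas' hash-based unique, applied directly to the NumPy array.
pd.unique(arr_receipt['customer_id']).size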

Time[022]


%timeit len(df_receipt['customer_id'].unique())
# 8.19 ms ± 204 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit np.unique(arr_receipt['customer_id']).size
# 30.9 ms ± 635 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
