That's why I quit pandas [Data Science 100 Knock (Structured Data Processing) # 4]

We are solving the Python problems of Data Science 100 Knock (Structured Data Processing). The model answers use pandas for the data processing, but after some study we will process the same data with NumPy.

:arrow_up: First article (# 1) :arrow_backward: Previous article (# 3) :arrow_forward: Next article (# 5)

Introduction

As NumPy practice, I will solve the Python problems of Data Science 100 Knock (Structured Data Processing).

Many people who do data science in Python may be pandas lovers, but in fact **you can do the same things with NumPy, without using pandas**. And NumPy is usually faster. As someone who loves pandas, I'm still not used to working with NumPy, so this time I'd like to try to graduate from pandas by tackling this "Data Science 100 Knock" with NumPy.

This time I will do questions 23 to 35. The theme is group-by operations. The initial data is loaded as follows (specifying the data types is put off for now).

import numpy as np
import pandas as pd
from numpy.lib import recfunctions as rfn

# For the model answers
df_customer = pd.read_csv('data/customer.csv')
df_receipt = pd.read_csv('data/receipt.csv')

# The data we will work with
arr_customer = np.genfromtxt(
    'data/customer.csv', delimiter=',', encoding='utf-8',
    names=True, dtype=None)
arr_receipt = np.genfromtxt(
    'data/receipt.csv', delimiter=',', encoding='utf-8',
    names=True, dtype=None)

Preparation

With this theme, the inefficiency of a structured array read straight from the CSV becomes noticeable, so I stopped using it this time :stuck_out_tongue_winking_eye:. To operate on NumPy arrays efficiently, two points matter: the arrays should hold numerical data, and their memory layout should be optimized. So we introduce the following function.

def array_to_dict(arr):
    dic = dict()
    for colname in arr.dtype.names:
        if np.issubdtype(arr[colname].dtype, np.number):
            # For numerical data, optimize the memory layout and store it in the dictionary
            dic[colname] = np.ascontiguousarray(arr[colname])
        else:
            # For string data, convert it to numerical codes and store those in the dictionary
            unq, inv = np.unique(arr[colname], return_inverse=True)
            dic[colname] = inv
            dic['Code_' + colname] = unq  # Array used to convert the codes back to strings
    return dic


dic_customer = array_to_dict(arr_customer)
dic_receipt = array_to_dict(arr_receipt)

dic_receipt

This function returns the table as a dictionary holding one ndarray per column.

{'sales_ymd':
    array([20181103, 20181118, 20170712, ..., 20170311, 20170331, 20190423]),
 'sales_epoch':
    array([1257206400, 1258502400, 1215820800, ..., 1205193600, 1206921600, 1271980800]),
 'store_cd':
    array([29, 10, 40, ..., 44,  6, 13], dtype=int64),
 'Code_store_cd':
    array(['S12007', 'S12013', 'S12014', ..., 'S14048', 'S14049', 'S14050'], dtype='<U6'),
 'receipt_no':
    array([ 112, 1132, 1102, ..., 1122, 1142, 1102]),
 'receipt_sub_no':
    array([1, 2, 1, ..., 1, 1, 2]),
 'customer_id':
    array([1689, 2112, 5898, ..., 8103,  582, 8306], dtype=int64),
 'Code_customer_id':
    array(['CS001113000004', 'CS001114000005', 'CS001115000010', ..., 'CS052212000002', 'CS052514000001', 'ZZ000000000000'], dtype='<U14'),
 'product_cd':
    array([2119, 3235,  861, ...,  457, 1030,  808], dtype=int64),
 'Code_product_cd':
    array(['P040101001', 'P040101002', 'P040101003', ..., 'P091503003', 'P091503004', 'P091503005'], dtype='<U10'),
 'quantity':
    array([1, 1, 1, ..., 1, 1, 1]),
 'amount':
    array([158,  81, 170, ..., 168, 148, 138])}

Let's see how much this work matters by trying np.argsort().

arr_receipt['sales_ymd'].flags['C_CONTIGUOUS']
# False
dic_receipt['sales_ymd'].flags['C_CONTIGUOUS']
# True

%timeit np.argsort(arr_receipt['sales_ymd'])
# 7.33 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.argsort(dic_receipt['sales_ymd'])
# 5.98 ms ± 58.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Looking good. We have thrown away the structured array and its convenience, and the table is no longer as easy to handle, but this article is about abandoning pandas and studying NumPy. No problem! :hugging:

By the way, the code that builds a structured array, which I hacked together last time, has also been turned into a function.

def make_array(size, **kwargs):
    arr = np.empty(size, dtype=[(colname, subarr.dtype)
                                for colname, subarr in kwargs.items()])
    for colname, subarr in kwargs.items():
        arr[colname] = subarr
    return arr
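
As a quick illustration, here is a toy call with made-up columns (the names and values are hypothetical, not from the dataset):

toy = make_array(3,
                 name=np.array(['A', 'B', 'C']),
                 value=np.array([1.0, 2.0, 3.0]))
toy
# array([('A', 1.), ('B', 2.), ('C', 3.)],
#       dtype=[('name', '<U1'), ('value', '<f8')])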

P_023

P-023: Sum the sales amount (amount) and sales quantity (quantity) for each store code (store_cd) for the receipt detail data frame (df_receipt).

Use np.bincount() to get the sum for each value in the store code column. This technique is available only because the store codes have been converted from strings to numerical codes.
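
If np.bincount() with a weights argument is unfamiliar, here is a minimal sketch on made-up group codes and amounts (toy variables, not the actual data): passing weights makes each bin accumulate the corresponding amounts instead of just counting.

codes = np.array([0, 1, 0, 2, 1, 0])
amounts = np.array([10., 20., 30., 40., 50., 60.])
np.bincount(codes, weights=amounts)  # per-group sums   -> [100., 70., 40.]
np.bincount(codes)                   # per-group counts -> [3, 2, 1]
np.bincount(codes, weights=amounts) / np.bincount(codes)  # per-group means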

In[023]


make_array(
    dic_receipt['Code_store_cd'].size,
    store_cd=dic_receipt['Code_store_cd'],
    amount=np.bincount(dic_receipt['store_cd'], dic_receipt['amount']),
    quantity=np.bincount(dic_receipt['store_cd'], dic_receipt['quantity']))

Out[023]


array([('S12007', 638761., 2099.), ('S12013', 787513., 2425.),
       ('S12014', 725167., 2358.), ('S12029', 794741., 2555.),
       ('S12030', 684402., 2403.), ('S13001', 811936., 2347.),
       ('S13002', 727821., 2340.), ('S13003', 764294., 2197.),
       ('S13004', 779373., 2390.), ('S13005', 629876., 2004.),
       ('S13008', 809288., 2491.), ('S13009', 808870., 2486.),
       ('S13015', 780873., 2248.), ('S13016', 793773., 2432.),
       ('S13017', 748221., 2376.), ('S13018', 790535., 2562.),
       ('S13019', 827833., 2541.), ('S13020', 796383., 2383.),
       ('S13031', 705968., 2336.), ('S13032', 790501., 2491.),
       ('S13035', 715869., 2219.), ('S13037', 693087., 2344.),
       ('S13038', 708884., 2337.), ('S13039', 611888., 1981.),
       ('S13041', 728266., 2233.), ('S13043', 587895., 1881.),
       ('S13044', 520764., 1729.), ('S13051', 107452.,  354.),
       ('S13052', 100314.,  250.), ('S14006', 712839., 2284.),
       ('S14010', 790361., 2290.), ('S14011', 805724., 2434.),
       ('S14012', 720600., 2412.), ('S14021', 699511., 2231.),
       ('S14022', 651328., 2047.), ('S14023', 727630., 2258.),
       ('S14024', 736323., 2417.), ('S14025', 755581., 2394.),
       ('S14026', 824537., 2503.), ('S14027', 714550., 2303.),
       ('S14028', 786145., 2458.), ('S14033', 725318., 2282.),
       ('S14034', 653681., 2024.), ('S14036', 203694.,  635.),
       ('S14040', 701858., 2233.), ('S14042', 534689., 1935.),
       ('S14045', 458484., 1398.), ('S14046', 412646., 1354.),
       ('S14047', 338329., 1041.), ('S14048', 234276.,  769.),
       ('S14049', 230808.,  788.), ('S14050', 167090.,  580.)],
      dtype=[('store_cd', '<U6'), ('amount', '<f8'), ('quantity', '<f8')])

Time[023]


# Model answer
%%timeit
df_receipt.groupby('store_cd').agg({'amount':'sum', 'quantity':'sum'}).reset_index()
# 9.14 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
make_array(
    dic_receipt['Code_store_cd'].size,
    store_cd=dic_receipt['Code_store_cd'],
    amount=np.bincount(dic_receipt['store_cd'], dic_receipt['amount']),
    quantity=np.bincount(dic_receipt['store_cd'], dic_receipt['quantity']))
# 473 µs ± 19.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

P_024

P-024: For the receipt detail data frame (df_receipt), find the newest sales date (sales_ymd) for each customer ID (customer_id) and display 10 items.

Use np.maximum() to find the newest sales date. First, sort the sales dates by the customer ID column. Next, get the positions of the rows where the customer ID changes, and use np.ufunc.reduceat() to apply the same reduction (np.maximum()) within each customer's block.
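
To see what np.ufunc.reduceat() does, here is a minimal sketch on made-up data: the IDs are already sorted, cut marks the first row of each ID block, and reduceat applies np.maximum within each block.

ids = np.array([0, 0, 1, 1, 1, 2])
ymds = np.array([20190101, 20190301, 20180501, 20190701, 20180101, 20170101])
cut = np.array([0, 2, 5])  # start index of each ID block
np.maximum.reduceat(ymds, cut)  # -> [20190301, 20190701, 20170101]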

In[024]


sorter_index = np.argsort(dic_receipt['customer_id'])
sorted_id = dic_receipt['customer_id'][sorter_index]
sorted_ymd = dic_receipt['sales_ymd'][sorter_index]

cut_index, = np.concatenate(
    ([True], sorted_id[1:] != sorted_id[:-1])).nonzero()

make_array(
    cut_index.size,
    customer_id=dic_receipt['Code_customer_id'],
    sales_ymd=np.maximum.reduceat(sorted_ymd, cut_index))[:10]

Out[024]


array([('CS001113000004', 20190308), ('CS001114000005', 20190731),
       ('CS001115000010', 20190405), ('CS001205000004', 20190625),
       ('CS001205000006', 20190224), ('CS001211000025', 20190322),
       ('CS001212000027', 20170127), ('CS001212000031', 20180906),
       ('CS001212000046', 20170811), ('CS001212000070', 20191018)],
      dtype=[('customer_id', '<U14'), ('sales_ymd', '<i4')])

P_025

P-025: For the receipt detail data frame (df_receipt), find the oldest sales date (sales_ymd) for each customer ID (customer_id) and display 10 items.

In[025]


sorter_index = np.argsort(dic_receipt['customer_id'])
sorted_id = dic_receipt['customer_id'][sorter_index]
sorted_ymd = dic_receipt['sales_ymd'][sorter_index]

cut_index, = np.concatenate(
    ([True], sorted_id[1:] != sorted_id[:-1])).nonzero()

make_array(
    cut_index.size,
    customer_id=dic_receipt['Code_customer_id'],
    sales_ymd=np.minimum.reduceat(sorted_ymd, cut_index))[:10]

Out[025]


array([('CS001113000004', 20190308), ('CS001114000005', 20180503),
       ('CS001115000010', 20171228), ('CS001205000004', 20170914),
       ('CS001205000006', 20180207), ('CS001211000025', 20190322),
       ('CS001212000027', 20170127), ('CS001212000031', 20180906),
       ('CS001212000046', 20170811), ('CS001212000070', 20191018)],
      dtype=[('customer_id', '<U14'), ('sales_ymd', '<i4')])

P_026

P-026: For the receipt detail data frame (df_receipt), find the newest sales date (sales_ymd) and the oldest sales date for each customer ID (customer_id), and display 10 records where the two dates differ.

In[026]


sorter_index = np.argsort(dic_receipt['customer_id'])
sorted_id = dic_receipt['customer_id'][sorter_index]
sorted_ymd = dic_receipt['sales_ymd'][sorter_index]

cut_index, = np.concatenate(
    ([True], sorted_id[1:] != sorted_id[:-1])).nonzero()
sales_ymd_max = np.maximum.reduceat(sorted_ymd, cut_index)
sales_ymd_min = np.minimum.reduceat(sorted_ymd, cut_index)

new_arr = make_array(cut_index.size,
                     customer_id=dic_receipt['Code_customer_id'],
                     sales_ymd_max=sales_ymd_max, sales_ymd_min=sales_ymd_min)
new_arr[sales_ymd_max != sales_ymd_min][:10]

Out[026]


array([('CS001114000005', 20190731, 20180503),
       ('CS001115000010', 20190405, 20171228),
       ('CS001205000004', 20190625, 20170914),
       ('CS001205000006', 20190224, 20180207),
       ('CS001214000009', 20190902, 20170306),
       ('CS001214000017', 20191006, 20180828),
       ('CS001214000048', 20190929, 20171109),
       ('CS001214000052', 20190617, 20180208),
       ('CS001215000005', 20181021, 20170206),
       ('CS001215000040', 20171022, 20170214)],
      dtype=[('customer_id', '<U14'), ('sales_ymd_max', '<i4'), ('sales_ymd_min', '<i4')])

P_027

P-027: Calculate the average sales amount (amount) for each store code (store_cd) for the receipt detail data frame (df_receipt), and display the TOP5 in descending order.

Use np.bincount() to get the count and the total amount for each store code, and compute the average as total ÷ count (as in the sketch shown for P-023).

In[027]


mean_amount = (np.bincount(dic_receipt['store_cd'], dic_receipt['amount'])
               / np.bincount(dic_receipt['store_cd']))

new_arr = make_array(dic_receipt['Code_store_cd'].size,
                     store_cd=dic_receipt['Code_store_cd'],
                     amount=mean_amount)
new_arr[np.argsort(mean_amount)[::-1]][:5]

Out[027]


array([('S13052', 402.86746988), ('S13015', 351.11196043),
       ('S13003', 350.91551882), ('S14010', 348.79126214),
       ('S13001', 348.47038627)],
      dtype=[('store_cd', '<U6'), ('amount', '<f8')])

P_028

P-028: Calculate the median sales amount (amount) for each store code (store_cd) for the receipt detail data frame (df_receipt), and display the TOP5 in descending order.

Proceed as when finding the maximum and minimum values, but loop over the store code blocks and apply np.median() to each. Note that here a sentinel is appended to the end of cut_index so that consecutive pairs delimit each block; see the sketch below.
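
A minimal sketch of the sentinel trick on made-up store codes (toy data): with the trailing True, consecutive cut_index entries give (start, end) pairs that slice out each store's block.

sorted_cd = np.array([0, 0, 1, 2, 2, 2])
cut_index, = np.concatenate(
    ([True], sorted_cd[1:] != sorted_cd[:-1], [True])).nonzero()
cut_index  # -> [0, 2, 3, 6]
list(zip(cut_index[:-1], cut_index[1:]))  # -> [(0, 2), (2, 3), (3, 6)]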

In[028]


sorter_index = np.argsort(dic_receipt['store_cd'])
sorted_cd = dic_receipt['store_cd'][sorter_index]
sorted_amount = dic_receipt['amount'][sorter_index]

cut_index, = np.concatenate(
    ([True], sorted_cd[1:] != sorted_cd[:-1], [True])).nonzero()
median_amount = np.array([np.median(sorted_amount[s:e])
                          for s, e in zip(cut_index[:-1], cut_index[1:])])

new_arr = make_array(dic_receipt['Code_store_cd'].size,
                     store_cd=dic_receipt['Code_store_cd'],
                     amount=median_amount)
new_arr[np.argsort(median_amount)[::-1]][:5]

As shown below, there is also a method that uses np.lexsort() to build an array ordered by "store code → sales amount" and then counts the occurrences of each store code to locate the median indices directly. However, np.lexsort() turned out to be too slow.

# Sort the array by store code, then by amount
sorter_index = np.lexsort((dic_receipt['amount'], dic_receipt['store_cd']))
sorted_cd = dic_receipt['store_cd'][sorter_index]
sorted_amount = dic_receipt['amount'][sorter_index]

# Find the median indices
counts = np.bincount(sorted_cd)
median_index = counts//2
median_index[1:] += counts.cumsum()[:-1]

# Compute the medians (average the two middle values when the count is even)
med_a = sorted_amount[median_index]
med_b = sorted_amount[median_index - 1]
median_amount = np.where(counts % 2, med_a, (med_a+med_b)/2)

Out[028]


array([('S13052', 190.), ('S14010', 188.), ('S14050', 185.),
       ('S14040', 180.), ('S13003', 180.)],
      dtype=[('store_cd', '<U6'), ('amount', '<f8')])

P_029

P-029: Find the mode of the product code (product_cd) for each store code (store_cd) for the receipt details data frame (df_receipt).

Create a 2-D count table indexed by store code and product code, and use np.add.at() to add 1 for each row of the receipt data. Then use np.argmax() to get the product code with the largest count for each store code.
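
Here is a minimal sketch of np.add.at() on a made-up 2 × 3 table (toy data): it accumulates 1 at each (row, column) index pair, including repeated pairs, which plain fancy-index assignment would not.

table = np.zeros((2, 3), dtype=int)
np.add.at(table, (np.array([0, 0, 1]), np.array([2, 2, 0])), 1)
table
# array([[0, 0, 2],
#        [1, 0, 0]])
np.argmax(table, axis=1)  # most frequent column per row -> [2, 0]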

In[029]


mapping = np.zeros((dic_receipt['Code_store_cd'].size,
                    dic_receipt['Code_product_cd'].size), dtype=int)
np.add.at(mapping, (dic_receipt['store_cd'], dic_receipt['product_cd']), 1)

make_array(
    dic_receipt['Code_store_cd'].size,
    store_cd=dic_receipt['Code_store_cd'],
    product_cd=dic_receipt['Code_product_cd'][np.argmax(mapping, axis=1)])

Out[029]


array([('S12007', 'P060303001'), ('S12013', 'P060303001'),
       ('S12014', 'P060303001'), ('S12029', 'P060303001'),
       ('S12030', 'P060303001'), ('S13001', 'P060303001'),
       ('S13002', 'P060303001'), ('S13003', 'P071401001'),
       ('S13004', 'P060303001'), ('S13005', 'P040503001'),
       ('S13008', 'P060303001'), ('S13009', 'P060303001'),
       ('S13015', 'P071401001'), ('S13016', 'P071102001'),
       ('S13017', 'P060101002'), ('S13018', 'P071401001'),
       ('S13019', 'P071401001'), ('S13020', 'P071401001'),
       ('S13031', 'P060303001'), ('S13032', 'P060303001'),
       ('S13035', 'P040503001'), ('S13037', 'P060303001'),
       ('S13038', 'P060303001'), ('S13039', 'P071401001'),
       ('S13041', 'P071401001'), ('S13043', 'P060303001'),
       ('S13044', 'P060303001'), ('S13051', 'P050102001'),
       ('S13052', 'P050101001'), ('S14006', 'P060303001'),
       ('S14010', 'P060303001'), ('S14011', 'P060101001'),
       ('S14012', 'P060303001'), ('S14021', 'P060101001'),
       ('S14022', 'P060303001'), ('S14023', 'P071401001'),
       ('S14024', 'P060303001'), ('S14025', 'P060303001'),
       ('S14026', 'P071401001'), ('S14027', 'P060303001'),
       ('S14028', 'P060303001'), ('S14033', 'P071401001'),
       ('S14034', 'P060303001'), ('S14036', 'P040503001'),
       ('S14040', 'P060303001'), ('S14042', 'P050101001'),
       ('S14045', 'P060303001'), ('S14046', 'P060303001'),
       ('S14047', 'P060303001'), ('S14048', 'P050101001'),
       ('S14049', 'P060303001'), ('S14050', 'P060303001')],
      dtype=[('store_cd', '<U6'), ('product_cd', '<U10')])

P_030

P-030: For the receipt detail data frame (df_receipt), calculate the sample variance of the sales amount (amount) for each store code (store_cd), and display the TOP5 in descending order.

First, use np.bincount() to get the count and the total amount for each store code, and compute the mean as total ÷ count. Next, compute each row's deviation from its store's mean, and apply np.bincount() to the squared deviations: the variance for each store code is the sum of squared deviations ÷ count.
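
As a sanity check, a minimal sketch on made-up codes and values (toy data) shows that this bincount-based calculation matches np.var() (which also divides by N) computed group by group:

codes = np.array([0, 0, 0, 1, 1])
vals = np.array([1., 2., 6., 4., 8.])
counts = np.bincount(codes)
means = np.bincount(codes, vals) / counts
dev = means[codes] - vals
np.bincount(codes, dev**2) / counts  # -> [4.6667, 4.]
np.var(vals[:3]), np.var(vals[3:])   # -> (4.6667, 4.0)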

In[030]


counts = np.bincount(dic_receipt['store_cd'])
mean_amount = (np.bincount(dic_receipt['store_cd'], dic_receipt['amount'])
               / counts)
deviation_array = mean_amount[dic_receipt['store_cd']] - dic_receipt['amount']
var_amount = np.bincount(dic_receipt['store_cd'], deviation_array**2) / counts

new_arr = make_array(dic_receipt['Code_store_cd'].size,
                     store_cd=dic_receipt['Code_store_cd'],
                     amount=var_amount)
new_arr[np.argsort(var_amount)[::-1]][:5]

Out[030]


array([('S13052', 440088.70131127), ('S14011', 306314.55816389),
       ('S14034', 296920.08101128), ('S13001', 295431.99332904),
       ('S13015', 295294.36111594)],
      dtype=[('store_cd', '<U6'), ('amount', '<f8')])

P_031

P-031: Calculate the sample standard deviation of the sales amount (amount) for each store code (store_cd) for the receipt detail data frame (df_receipt), and display the TOP5 in descending order.

Since the standard deviation is the square root of the variance, just apply np.sqrt() to the previous answer.

In[031]


counts = np.bincount(dic_receipt['store_cd'])
mean_amount = (np.bincount(dic_receipt['store_cd'], dic_receipt['amount'])
               / counts)
deviation_array = mean_amount[dic_receipt['store_cd']] - dic_receipt['amount']
var_amount = np.bincount(dic_receipt['store_cd'], deviation_array**2) / counts

new_arr = make_array(dic_receipt['Code_store_cd'].size,
                     store_cd=dic_receipt['Code_store_cd'],
                     amount=np.sqrt(var_amount))
new_arr[np.argsort(var_amount)[::-1]][:5]

Out[031]


array([('S13052', 663.39181583), ('S14011', 553.45691627),
       ('S14034', 544.90373555), ('S13001', 543.53656117),
       ('S13015', 543.40993837)],
      dtype=[('store_cd', '<U6'), ('amount', '<f8')])

P_032

P-032: Find the percentile values of the sales amount (amount) of the receipt detail data frame (df_receipt) in 25% increments.

In[032]


np.percentile(dic_receipt['amount'], np.arange(5) * 25)  # q is given in the range 0-100

Out[032]


array([ 10., 102., 170., 288., 10925.])

P_033

P-033: Calculate the average sales amount (amount) for each store code (store_cd) for the receipt detail data frame (df_receipt), and extract the stores with an average of 330 or more.

Same as question 27.

In[033]


mean_amount = (np.bincount(dic_receipt['store_cd'], dic_receipt['amount'])
               / np.bincount(dic_receipt['store_cd']))

new_arr = make_array(dic_receipt['Code_store_cd'].size,
                     store_cd=dic_receipt['Code_store_cd'],
                     amount=mean_amount)
new_arr[mean_amount >= 330]

Out[033]


array([('S12013', 330.19412998), ('S13001', 348.47038627),
       ('S13003', 350.91551882), ('S13004', 330.94394904),
       ('S13015', 351.11196043), ('S13019', 330.20861588),
       ('S13020', 337.87993212), ('S13052', 402.86746988),
       ('S14010', 348.79126214), ('S14011', 335.71833333),
       ('S14026', 332.34058847), ('S14045', 330.08207343),
       ('S14047', 330.07707317)],
      dtype=[('store_cd', '<U6'), ('amount', '<f8')])

P_034

P-034: For the receipt detail data frame (df_receipt), add up the sales amount (amount) for each customer ID (customer_id) and calculate the average of all customers. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation.

First, identify the non-members: using the array that converts the codes back to strings, take the first character of each customer ID code with .astype('<U1'), compare it with 'Z', and extract the matching code numbers with .nonzero(). Then use np.in1d() to build a boolean array is_member that says whether each row of the customer ID column belongs to a member. Finally, sum the amounts per customer ID and take the mean.
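
A minimal sketch of np.in1d() with invert=True on made-up IDs (toy data): rows whose ID is in the excluded set come back False, everything else True.

ids = np.array([0, 3, 1, 4, 0])
non_members = np.array([3, 4])
np.in1d(ids, non_members, invert=True)  # -> [ True, False, True, False, True]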

In[034]


startswithZ, = (dic_receipt['Code_customer_id'].astype('<U1') == 'Z').nonzero()
is_member = np.in1d(dic_receipt['customer_id'], startswithZ, invert=True)

sums = np.bincount(dic_receipt['customer_id'][is_member],
                   dic_receipt['amount'][is_member])

np.mean(sums)

Out[034]


2547.742234529256

P_035

P-035: For the receipt detail data frame (df_receipt), sum the sales amount (amount) for each customer ID (customer_id), compute the average over all customers, and extract the customers who spent more than that average. However, customer IDs starting with "Z" represent non-members, so exclude them from the calculation. Displaying only the first 10 records is sufficient.

Just add the extraction step to the previous question.

In[035]


startswithZ, = (dic_receipt['Code_customer_id'].astype('<U1') == 'Z').nonzero()
is_member = np.in1d(dic_receipt['customer_id'], startswithZ, invert=True)

sums = np.bincount(dic_receipt['customer_id'][is_member],
                   dic_receipt['amount'][is_member])
mean = np.mean(sums)

new_arr = make_array(dic_receipt['Code_customer_id'].size - startswithZ.size,
                     store_cd=np.delete(dic_receipt['Code_customer_id'],
                                        startswithZ),  # drop the non-member codes
                     amount=sums)
new_arr[sums > mean][:10]

Out[035]


array([('CS001113000004', 3044.), ('CS001113000004', 3337.),
       ('CS001113000004', 4685.), ('CS001113000004', 4132.),
       ('CS001113000004', 5639.), ('CS001113000004', 3496.),
       ('CS001113000004', 3726.), ('CS001113000004', 3485.),
       ('CS001113000004', 4370.), ('CS001113000004', 3300.)],
      dtype=[('store_cd', '<U14'), ('amount', '<f8')])

By the way, in the model answer, df_receipt.groupby('customer_id').amount.sum() is for some reason computed twice.

In conclusion

NumPy doesn't have a built-in group-by, so it was a tough fight.
