[PYTHON] Data science 100 knock commentary (P021 ~ 040)

1.First of all

Following the previous, I will explain 100 knocks of data science. For introduction, please refer to this article (* Docker is handled on Mac)

Basically, it is an explanation of the answer, but it also describes another solution.

2. Commentary

P-021: Count the number of receipt details data frame (df_receipt).

P-021


#The number is len()Use the method.
len(df_receipt)

P-022: Count the number of unique cases for the customer ID (customer_id) of the receipt detail data frame (df_receipt).

P-022


#The number of unique cases is unique()Use the method.
# df_receipt['customer_id'].unique() >>>Returns a list of unique element values in a NumPy array ndarray
len(df_receipt['customer_id'].unique())

Reference: https://note.nkmk.me/python-pandas-value-counts/

P-023: Sum the sales amount (amount) and sales quantity (quantity) for each store code (store_cd) for the receipt detail data frame (df_receipt).

P-023


#with the groupby method'store_cd'Grouping by
# agg()Aggregate the data with. Dictionary type sum of sales amount (amount) and sales quantity (quantity)
# reset_index()Re-index with
df_receipt.groupby('store_cd').agg({'amount': 'sum', 'quantity': 'sum'}).reset_index()

#(Another solution)
df_receipt[['amount', 'quantity', 'store_cd']].groupby('store_cd', as_index=False).sum()

Reference (groupby): https://note.nkmk.me/python-pandas-groupby-statistics/ Reference (agg): https://note.nkmk.me/python-pandas-agg-aggregate/

P-024: For the receipt detail data frame (df_receipt), find the newest sales date (sales_ymd) for each customer ID (customer_id) and display 10 items.

P-024


#Customer ID (customer) with groupby_Group by id).
#Sales date_ymd) newest(max()Get in)
df_receipt.groupby('customer_id').sales_ymd.max().reset_index().head(10)

# (Another solution)
#Customer ID (customer) with loc_Sales dates (sales) grouped by id)_ymd) newest(idxmax()Get in)
df_receipt[['customer_id', 'sales_ymd']].loc[df_receipt.groupby('customer_id').sales_ymd.idxmax()].head(10)

Reference (groupby): https://note.nkmk.me/python-pandas-groupby-statistics/

P-025: For the receipt detail data frame (df_receipt), find the oldest sales date (sales_ymd) for each customer ID (customer_id) and display 10 items.

P-025


# P-See 023. The oldest sales date is agg({'sales_ymd': min})Can be expressed by
df_receipt.groupby('customer_id').agg({'sales_ymd': min}).reset_index().head(10)

# (Another solution)
#Customer ID (customer) with loc_Sales dates (sales) grouped by id)_ymd) oldest(idxmin()Get in)
df_receipt[['customer_id', 'sales_ymd']].loc[df_receipt.groupby('customer_id').sales_ymd.idxmin()].head(10)

P-026: For the receipt detail data frame (df_receipt), find the newest sales date (sales_ymd) and the oldest sales date for each customer ID (customer_id), and display 10 different data.

P-026


#Customer ID (customer)_Sales date (sales) for each id)_ymd) new sales date(max)And old sales date(min)Create a data frame for
df_tmp = df_receipt.groupby('customer_id').agg({'sales_ymd':['max','min']}).reset_index()

#Change column(Details are explained below)
df_tmp.columns = ["_".join(pair) for pair in df_tmp.columns]

#Both query different data()Search by.
df_tmp.query('sales_ymd_max != sales_ymd_min').head(10)

Since df.columns becomes MultiIndex ([('customer_id',''), ('sales_ymd','max'), ('sales_ymd','min')],), extract them one by one with a for statement. Join the characters in parentheses with "_" .join (). スクリーンショット 2020-08-21 18.04.58.png

P-027: Calculate the average sales amount (amount) for each store code (store_cd) for the receipt statement data frame (df_receipt), and display the TOP5 in descending order.

P-027


#Receipt detail data frame (df_Receipt) to store code (store_Group by cd).
# agg({'amount':'mean'})Calculate the average sales amount (amount) at
# reset_index()Re-index with and sort_values('amount', ascending=False)Sort by sales amount (amount) in descending order
df_receipt.groupby('store_cd').agg({'amount':'mean'}) \
      .reset_index().sort_values('amount', ascending=False).head(5)

Reference (groupby): https://note.nkmk.me/python-pandas-groupby-statistics/ Reference (agg): https://note.nkmk.me/python-pandas-agg-aggregate/

P-028: Calculate the median sales amount (amount) for each store code (store_cd) for the receipt statement data frame (df_receipt), and display the TOP5 in descending order.

P-028


# P-Refer to 027
#Median is median
df_receipt.groupby('store_cd').agg({'amount':'median'}).reset_index().sort_values('amount', ascending=False).head(5)

Reference (median): https://note.nkmk.me/python-statistics-mean-median-mode-var-stdev/

P-029: Find the mode of the product code (product_cd) for each store code (store_cd) for the receipt statement data frame (df_receipt).

P-029


#The first half is P-Refer to 027
#Product code (product_cd) to function(Lambda expression)To apply.
df_receipt.groupby('store_cd').product_cd.apply(lambda x: x.mode()).reset_index()
# df.groupby('grouping_content').What you want to find the mode.apply(lambda x: x.mode())

# (Wrong answer)
df_receipt.groupby('store_cd').agg({'product_cd':'mode'}).reset_index()
# >>>AttributeError: 'SeriesGroupBy' object has no attribute 'mode'

Reference (lambda expression): https://note.nkmk.me/python-lambda-usage/

P-030: For the receipt detail data frame (df_receipt), calculate the sample variance of the sales amount (amount) for each store code (store_cd), and display the TOP5 in descending order.

P-030


#Store code (store_Sample distribution of sales amount (amount) of cd)(var(ddof=0))
# reset_Re-index and sort by index(sort_values),descending order(ascending=False)
df_receipt.groupby('store_cd').amount.var(ddof=0).reset_index().sort_values('amount', ascending=False).head(5)

Reference: https://deepage.net/features/numpy-var.html Reference (sample variance and unbiased variance): https://bellcurve.jp/statistics/course/8614.html

P-031: Calculate the sample standard deviation of the sales amount (amount) for each store code (store_cd) for the receipt detail data frame (df_receipt), and display the TOP5 in descending order.

P-031


#Store code (store_cd) sample standard deviation of sales amount (amount)(std(ddof=0))
# reset_Re-index and sort by index(sort_values),descending order(ascending=False)
df_receipt.groupby('store_cd').amount.std(ddof=0).reset_index().sort_values('amount', ascending=False).head()

Reference: https://deepage.net/features/numpy-var.html Reference (sample variance and unbiased variance): https://bellcurve.jp/statistics/course/8614.html

P-032: Find the percentile value of the sales amount (amount) of the receipt detail data frame (df_receipt) in 25% increments.

P-032


#Percentile value of sales amount (amount)(quantile)
# np.arange(5)/4 >>> [0, 0.25, 0.5, 0.75, 1]
df_receipt.amount.quantile(q=np.arange(5)/4)

# (Another solution)
np.percentile(df_receipt['amount'], q=[25, 50, 75,100])

Reference (quantile): https://note.nkmk.me/python-pandas-quantile/

P-033: Calculate the average sales amount (amount) for each store code (store_cd) for the receipt detail data frame (df_receipt), and extract 330 or more.

P-033


#Store code (store_Group by cd).
#Average sales amount (amount)(mean)Is calculated and reset_index()Re-index with
# query()More than 300 sales amount (amount) by method
df_receipt.groupby('store_cd').amount.mean().reset_index().query('amount >= 330')

# (Another solution)
df_receipt.groupby('store_cd').amount.mean()[df_receipt.groupby('store_cd').amount.mean() >= 300]

P-034: For the receipt detail data frame (df_receipt), add up the sales amount (amount) for each customer ID (customer_id) and calculate the average of all customers. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation.

P-034


#How to write without using query
# ~Is Not. , Customer ID (customer_id) starts with Z(str.startswith("Z"))
#Customer ID (customer)_The average of all customers by summing the sales amount (amount) for each id)
df_receipt[~df_receipt['customer_id'].str.startswith("Z")].groupby('customer_id').amount.sum().mean()

#How to write using query
df_receipt.query('not customer_id.str.startswith("Z")', engine='python').groupby('customer_id').amount.sum().mean()

P-035: For the receipt statement data frame (df_receipt), add up the sales amount (amount) for each customer ID (customer_id) to find the average of all customers, and extract the customers who shop above the average. .. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation. Only 10 data items need to be displayed.

P-035


#Customer ID"Z"Exclude.'customer_id'Divide each group. (2547.742234529256) Average for all customers
amount_mean = df_receipt[~df_receipt['customer_id'].str.startswith("Z")].groupby('customer_id').amount.sum().mean()

#Customer ID (customer)_Sum the sales amount (amount) for each id) (data frame type)
df_amount_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()

# amount_mean(Average for all customers)Display the above 10 items
df_amount_sum[df_amount_sum['amount'] >= amount_mean].head(10)

P-036: Internally combine the receipt detail data frame (df_receipt) and the store data frame (df_store), and display all items of the receipt detail data frame and 10 store names (store_name) of the store data frame.

P-036


# merge(A(df), B(df), how='inner'(Inner join), on='Common columns')
pd.merge(df_receipt, df_store[['store_cd','store_name']], how='inner', on='store_cd').head(10)

Reference: https://note.nkmk.me/python-pandas-merge-join/

P-037: Internally combine the product data frame (df_product) and the category data frame (df_category), and display all items of the product data frame and 10 subcategory names (category_small_name) of the category data frame.

P-037


# merge(A(df), B(df), how='inner'(Inner join), on='Common columns')
pd.merge(df_product
         , df_category[['category_small_cd','category_small_name']]
         , how='inner', on='category_small_cd').head(10)

Reference: https://note.nkmk.me/python-pandas-merge-join/

P-038: Find the total sales amount for each customer from the customer data frame (df_customer) and receipt detail data frame (df_receipt). However, for customers who have no shopping record, the sales amount should be displayed as 0. In addition, customers should target those whose gender code (gender_cd) is female (1), and exclude non-members (customer IDs starting with'Z'). Only 10 results need to be displayed.

P-038


#Total sales amount for each customer.
#Group by customer ID and total sales amount (amount)(sum)
df_amount_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()

#Customer is gender code (gender_For those whose cd) is female (1), non-member (customer ID is'Z'Excludes those starting with)
df_tmp = df_customer.query('gender_cd == "1" and not customer_id.str.startswith("Z")', engine='python')

# merge(A(df), B(df), how='inner'(Inner join), on='Common columns')
# fillna(0)For customers who have no shopping record at, the sales amount is 0
pd.merge(df_tmp['customer_id'], df_amount_sum, how='left', on='customer_id').fillna(0).head(10)

P-039: Extract the top 20 customers with the highest number of sales days and the top 20 customers with the highest total sales amount from the receipt detail data frame (df_receipt), and perform a complete outer join. However, non-members (customer IDs starting with'Z') should be excluded.

P-039


#Customer ID('customer_id')Group by and total sales amount (amount)(sum)
df_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()
# customer_id.str.startswith("Z")(Except for customer IDs that start with Z)
df_sum = df_sum.query('not customer_id.str.startswith("Z")', engine='python')
#Sort the total sales amount (amount) and extract the top 20
df_sum = df_sum.sort_values('amount', ascending=False).head(20)

#Number of sales days(sales_ymd)To extract duplicates.
df_cnt = df_receipt[~df_receipt.duplicated(subset=['customer_id', 'sales_ymd'])]
# customer_id.str.startswith("Z")(Except for customer IDs that start with Z)
df_cnt = df_cnt.query('not customer_id.str.startswith("Z")', engine='python')
#Customer ID('customer_id')Group by and the number of sales days(sales_ymd)Number of cases(count)
df_cnt = df_cnt.groupby('customer_id').sales_ymd.count().reset_index()
#Number of sales days(sales_ymd)Sort by the number of cases, in descending order(ascending=False)The top 20 items sorted by are extracted.
df_cnt = df_cnt.sort_values('sales_ymd', ascending=False).head(20)

#Outer merge with merge
pd.merge(df_sum, df_cnt, how='outer', on='customer_id')

Reference (duplicate): https://note.nkmk.me/python-pandas-duplicated-drop-duplicates/ Reference (outer join): https://note.nkmk.me/python-pandas-merge-join/

P-040: I would like to investigate how many data will be obtained by combining all stores and all products. Calculate the number of direct products of stores (df_store) and products (df_product).

P-040


#Store (df_make a copy of store)
df_store_tmp = df_store.copy()
#Goods (df_make a copy of product)
df_product_tmp = df_product.copy()

#Key to combine(Column)が必要なのでそれぞれColumnを追加する。
df_store_tmp['key'] = 0
df_product_tmp['key'] = 0

#Outer join and check the number with len method
len(pd.merge(df_store_tmp, df_product_tmp, on='key', how='outer'))

Reference: https://note.nkmk.me/python-pandas-merge-join/

3. References

100 knocks of data science How to run 100 data science knocks on Mac

Recommended Posts

Data science 100 knock commentary (P021 ~ 040)
Data science 100 knock commentary (P061 ~ 080)
Data science 100 knock commentary (P041 ~ 060)
Data science 100 knock commentary (P081 ~ 100)
"Data Science 100 Knock (Structured Data Processing)" Python-007 Explanation
"Data Science 100 Knock (Structured Data Processing)" Python-006 Explanation
Learn data science
"Data Science 100 Knock (Structured Data Processing)" Python-001 Explanation
"Data Science 100 Knock (Structured Data Processing)" Python-002 Explanation
[Python] Data Science 100 Knock (Structured Data Processing) 021 Explanation
"Data Science 100 Knock (Structured Data Processing)" Python-005 Explanation
"Data Science 100 Knock (Structured Data Processing)" Python-004 Explanation
[Python] Data Science 100 Knock (Structured Data Processing) 020 Explanation
[Python] Data Science 100 Knock (Structured Data Processing) 025 Explanation
"Data Science 100 Knock (Structured Data Processing)" Python-003 Explanation
[Python] Data Science 100 Knock (Structured Data Processing) 019 Explanation
Preparing to try "Data Science 100 Knock (Structured Data Processing)"
Data science 100 knock (structured data processing) environment construction (Windows10)
Challenge 100 data science knocks
OPT data science competition
Data Science 100 Knock ~ Battle for less than beginners part3
Data Science 100 Knock ~ Battle for less than beginners part6
Data Science 100 Knock ~ Battle for less than beginners part2
Data Science 100 Knock ~ Battle for less than beginners part1
Data Science 100 Knock ~ Battle for less than beginners part9
Data Science 100 Knock ~ Battle for less than beginners part7
Data Science 100 Knock ~ Battle for less than beginners part4
Data Science 100 Knock ~ Battle for less than beginners part11
[Python] Data Science 100 Knock (Structured Data Processing) 001-010 Impressions + Explanation Link Summary
That's why I quit pandas [Data Science 100 Knock (Structured Data Processing) # 2]
That's why I quit pandas [Data Science 100 Knock (Structured Data Processing) # 1]
Data Science Cheat Sheet (Python)
That's why I quit pandas [Data Science 100 Knock (Structured Data Processing) # 3]
Try "100 knocks on data science" ①
That's why I quit pandas [Data Science 100 Knock (Structured Data Processing) # 5]
Infra_ Data Science Course Output
That's why I quit pandas [Data Science 100 Knock (Structured Data Processing) # 4]
That's why I quit pandas [Data Science 100 Knock (Structured Data Processing) # 6]
Start data science on the cloud
Introducing books related to data science.
[Data science basics] Data acquisition from API
Data science environment construction with Docker