1.First of all

Following the previous, I will explain 100 knocks of data science. For introduction, please refer to this article (* Docker is handled on Mac)

Basically, it is an explanation of the answer, but it also describes another solution.

2. Commentary

P-021: Count the number of receipt details data frame (df_receipt).

`P-021`


#The number is len()Use the method.
len(df_receipt)

P-022: Count the number of unique cases for the customer ID (customer_id) of the receipt detail data frame (df_receipt).

`P-022`


#The number of unique cases is unique()Use the method.
# df_receipt['customer_id'].unique() >>>Returns a list of unique element values in a NumPy array ndarray
len(df_receipt['customer_id'].unique())

Reference: https://note.nkmk.me/python-pandas-value-counts/

P-023: Sum the sales amount (amount) and sales quantity (quantity) for each store code (store_cd) for the receipt detail data frame (df_receipt).

`P-023`


#with the groupby method'store_cd'Grouping by
# agg()Aggregate the data with. Dictionary type sum of sales amount (amount) and sales quantity (quantity)
# reset_index()Re-index with
df_receipt.groupby('store_cd').agg({'amount': 'sum', 'quantity': 'sum'}).reset_index()

#(Another solution)
df_receipt[['amount', 'quantity', 'store_cd']].groupby('store_cd', as_index=False).sum()

Reference (groupby): https://note.nkmk.me/python-pandas-groupby-statistics/ Reference (agg): https://note.nkmk.me/python-pandas-agg-aggregate/

P-024: For the receipt detail data frame (df_receipt), find the newest sales date (sales_ymd) for each customer ID (customer_id) and display 10 items.

`P-024`


#Customer ID (customer) with groupby_Group by id).
#Sales date_ymd) newest(max()Get in)
df_receipt.groupby('customer_id').sales_ymd.max().reset_index().head(10)

# (Another solution)
#Customer ID (customer) with loc_Sales dates (sales) grouped by id)_ymd) newest(idxmax()Get in)
df_receipt[['customer_id', 'sales_ymd']].loc[df_receipt.groupby('customer_id').sales_ymd.idxmax()].head(10)

Reference (groupby): https://note.nkmk.me/python-pandas-groupby-statistics/

P-025: For the receipt detail data frame (df_receipt), find the oldest sales date (sales_ymd) for each customer ID (customer_id) and display 10 items.

`P-025`


# P-See 023. The oldest sales date is agg({'sales_ymd': min})Can be expressed by
df_receipt.groupby('customer_id').agg({'sales_ymd': min}).reset_index().head(10)

# (Another solution)
#Customer ID (customer) with loc_Sales dates (sales) grouped by id)_ymd) oldest(idxmin()Get in)
df_receipt[['customer_id', 'sales_ymd']].loc[df_receipt.groupby('customer_id').sales_ymd.idxmin()].head(10)

P-026: For the receipt detail data frame (df_receipt), find the newest sales date (sales_ymd) and the oldest sales date for each customer ID (customer_id), and display 10 different data.

`P-026`


#Customer ID (customer)_Sales date (sales) for each id)_ymd) new sales date(max)And old sales date(min)Create a data frame for
df_tmp = df_receipt.groupby('customer_id').agg({'sales_ymd':['max','min']}).reset_index()

#Change column(Details are explained below)
df_tmp.columns = ["_".join(pair) for pair in df_tmp.columns]

#Both query different data()Search by.
df_tmp.query('sales_ymd_max != sales_ymd_min').head(10)

Since df.columns becomes MultiIndex ([('customer_id',''), ('sales_ymd','max'), ('sales_ymd','min')],), extract them one by one with a for statement. Join the characters in parentheses with "_" .join (). スクリーンショット 2020-08-21 18.04.58.png

P-027: Calculate the average sales amount (amount) for each store code (store_cd) for the receipt statement data frame (df_receipt), and display the TOP5 in descending order.

`P-027`


#Receipt detail data frame (df_Receipt) to store code (store_Group by cd).
# agg({'amount':'mean'})Calculate the average sales amount (amount) at
# reset_index()Re-index with and sort_values('amount', ascending=False)Sort by sales amount (amount) in descending order
df_receipt.groupby('store_cd').agg({'amount':'mean'}) \
      .reset_index().sort_values('amount', ascending=False).head(5)

Reference (groupby): https://note.nkmk.me/python-pandas-groupby-statistics/ Reference (agg): https://note.nkmk.me/python-pandas-agg-aggregate/

P-028: Calculate the median sales amount (amount) for each store code (store_cd) for the receipt statement data frame (df_receipt), and display the TOP5 in descending order.

`P-028`


# P-Refer to 027
#Median is median
df_receipt.groupby('store_cd').agg({'amount':'median'}).reset_index().sort_values('amount', ascending=False).head(5)

Reference (median): https://note.nkmk.me/python-statistics-mean-median-mode-var-stdev/

P-029: Find the mode of the product code (product_cd) for each store code (store_cd) for the receipt statement data frame (df_receipt).

`P-029`


#The first half is P-Refer to 027
#Product code (product_cd) to function(Lambda expression)To apply.
df_receipt.groupby('store_cd').product_cd.apply(lambda x: x.mode()).reset_index()
# df.groupby('grouping_content').What you want to find the mode.apply(lambda x: x.mode())

# (Wrong answer)
df_receipt.groupby('store_cd').agg({'product_cd':'mode'}).reset_index()
# >>>AttributeError: 'SeriesGroupBy' object has no attribute 'mode'

Reference (lambda expression): https://note.nkmk.me/python-lambda-usage/

P-030: For the receipt detail data frame (df_receipt), calculate the sample variance of the sales amount (amount) for each store code (store_cd), and display the TOP5 in descending order.

`P-030`


#Store code (store_Sample distribution of sales amount (amount) of cd)(var(ddof=0))
# reset_Re-index and sort by index(sort_values),descending order(ascending=False)
df_receipt.groupby('store_cd').amount.var(ddof=0).reset_index().sort_values('amount', ascending=False).head(5)

Reference: https://deepage.net/features/numpy-var.html Reference (sample variance and unbiased variance): https://bellcurve.jp/statistics/course/8614.html

P-031: Calculate the sample standard deviation of the sales amount (amount) for each store code (store_cd) for the receipt detail data frame (df_receipt), and display the TOP5 in descending order.

`P-031`


#Store code (store_cd) sample standard deviation of sales amount (amount)(std(ddof=0))
# reset_Re-index and sort by index(sort_values),descending order(ascending=False)
df_receipt.groupby('store_cd').amount.std(ddof=0).reset_index().sort_values('amount', ascending=False).head()

Reference: https://deepage.net/features/numpy-var.html Reference (sample variance and unbiased variance): https://bellcurve.jp/statistics/course/8614.html

P-032: Find the percentile value of the sales amount (amount) of the receipt detail data frame (df_receipt) in 25% increments.

`P-032`


#Percentile value of sales amount (amount)(quantile)
# np.arange(5)/4 >>> [0, 0.25, 0.5, 0.75, 1]
df_receipt.amount.quantile(q=np.arange(5)/4)

# (Another solution)
np.percentile(df_receipt['amount'], q=[25, 50, 75,100])

Reference (quantile): https://note.nkmk.me/python-pandas-quantile/

P-033: Calculate the average sales amount (amount) for each store code (store_cd) for the receipt detail data frame (df_receipt), and extract 330 or more.

`P-033`


#Store code (store_Group by cd).
#Average sales amount (amount)(mean)Is calculated and reset_index()Re-index with
# query()More than 300 sales amount (amount) by method
df_receipt.groupby('store_cd').amount.mean().reset_index().query('amount >= 330')

# (Another solution)
df_receipt.groupby('store_cd').amount.mean()[df_receipt.groupby('store_cd').amount.mean() >= 300]

P-034: For the receipt detail data frame (df_receipt), add up the sales amount (amount) for each customer ID (customer_id) and calculate the average of all customers. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation.

`P-034`


#How to write without using query
# ~Is Not. , Customer ID (customer_id) starts with Z(str.startswith("Z"))
#Customer ID (customer)_The average of all customers by summing the sales amount (amount) for each id)
df_receipt[~df_receipt['customer_id'].str.startswith("Z")].groupby('customer_id').amount.sum().mean()

#How to write using query
df_receipt.query('not customer_id.str.startswith("Z")', engine='python').groupby('customer_id').amount.sum().mean()

P-035: For the receipt statement data frame (df_receipt), add up the sales amount (amount) for each customer ID (customer_id) to find the average of all customers, and extract the customers who shop above the average. .. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation. Only 10 data items need to be displayed.

`P-035`


#Customer ID"Z"Exclude.'customer_id'Divide each group. (2547.742234529256) Average for all customers
amount_mean = df_receipt[~df_receipt['customer_id'].str.startswith("Z")].groupby('customer_id').amount.sum().mean()

#Customer ID (customer)_Sum the sales amount (amount) for each id) (data frame type)
df_amount_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()

# amount_mean(Average for all customers)Display the above 10 items
df_amount_sum[df_amount_sum['amount'] >= amount_mean].head(10)

P-036: Internally combine the receipt detail data frame (df_receipt) and the store data frame (df_store), and display all items of the receipt detail data frame and 10 store names (store_name) of the store data frame.

`P-036`


# merge(A(df), B(df), how='inner'(Inner join), on='Common columns')
pd.merge(df_receipt, df_store[['store_cd','store_name']], how='inner', on='store_cd').head(10)

Reference: https://note.nkmk.me/python-pandas-merge-join/

P-037: Internally combine the product data frame (df_product) and the category data frame (df_category), and display all items of the product data frame and 10 subcategory names (category_small_name) of the category data frame.

`P-037`


# merge(A(df), B(df), how='inner'(Inner join), on='Common columns')
pd.merge(df_product
         , df_category[['category_small_cd','category_small_name']]
         , how='inner', on='category_small_cd').head(10)

Reference: https://note.nkmk.me/python-pandas-merge-join/

P-038: Find the total sales amount for each customer from the customer data frame (df_customer) and receipt detail data frame (df_receipt). However, for customers who have no shopping record, the sales amount should be displayed as 0. In addition, customers should target those whose gender code (gender_cd) is female (1), and exclude non-members (customer IDs starting with'Z'). Only 10 results need to be displayed.

`P-038`


#Total sales amount for each customer.
#Group by customer ID and total sales amount (amount)(sum)
df_amount_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()

#Customer is gender code (gender_For those whose cd) is female (1), non-member (customer ID is'Z'Excludes those starting with)
df_tmp = df_customer.query('gender_cd == "1" and not customer_id.str.startswith("Z")', engine='python')

# merge(A(df), B(df), how='inner'(Inner join), on='Common columns')
# fillna(0)For customers who have no shopping record at, the sales amount is 0
pd.merge(df_tmp['customer_id'], df_amount_sum, how='left', on='customer_id').fillna(0).head(10)

P-039: Extract the top 20 customers with the highest number of sales days and the top 20 customers with the highest total sales amount from the receipt detail data frame (df_receipt), and perform a complete outer join. However, non-members (customer IDs starting with'Z') should be excluded.

`P-039`


#Customer ID('customer_id')Group by and total sales amount (amount)(sum)
df_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()
# customer_id.str.startswith("Z")(Except for customer IDs that start with Z)
df_sum = df_sum.query('not customer_id.str.startswith("Z")', engine='python')
#Sort the total sales amount (amount) and extract the top 20
df_sum = df_sum.sort_values('amount', ascending=False).head(20)

#Number of sales days(sales_ymd)To extract duplicates.
df_cnt = df_receipt[~df_receipt.duplicated(subset=['customer_id', 'sales_ymd'])]
# customer_id.str.startswith("Z")(Except for customer IDs that start with Z)
df_cnt = df_cnt.query('not customer_id.str.startswith("Z")', engine='python')
#Customer ID('customer_id')Group by and the number of sales days(sales_ymd)Number of cases(count)
df_cnt = df_cnt.groupby('customer_id').sales_ymd.count().reset_index()
#Number of sales days(sales_ymd)Sort by the number of cases, in descending order(ascending=False)The top 20 items sorted by are extracted.
df_cnt = df_cnt.sort_values('sales_ymd', ascending=False).head(20)

#Outer merge with merge
pd.merge(df_sum, df_cnt, how='outer', on='customer_id')

Reference (duplicate): https://note.nkmk.me/python-pandas-duplicated-drop-duplicates/ Reference (outer join): https://note.nkmk.me/python-pandas-merge-join/

P-040: I would like to investigate how many data will be obtained by combining all stores and all products. Calculate the number of direct products of stores (df_store) and products (df_product).

`P-040`


#Store (df_make a copy of store)
df_store_tmp = df_store.copy()
#Goods (df_make a copy of product)
df_product_tmp = df_product.copy()

#Key to combine(Column)が必要なのでそれぞれColumnを追加する。
df_store_tmp['key'] = 0
df_product_tmp['key'] = 0

#Outer join and check the number with len method
len(pd.merge(df_store_tmp, df_product_tmp, on='key', how='outer'))

Reference: https://note.nkmk.me/python-pandas-merge-join/

3. References

100 knocks of data science How to run 100 data science knocks on Mac

[PYTHON] Data science 100 knock commentary (P021 ~ 040)

1.First of all

2. Commentary

P-021

P-022

P-023

P-024

P-025

P-026

P-027

P-028

P-029

P-030

P-031

P-032

P-033

P-034

P-035

P-036

P-037

P-038

P-039

P-040

3. References

`P-021`

`P-022`

`P-023`

`P-024`

`P-025`

`P-026`

`P-027`

`P-028`

`P-029`

`P-030`

`P-031`

`P-032`

`P-033`

`P-034`

`P-035`

`P-036`

`P-037`

`P-038`

`P-039`

`P-040`