[PYTHON] Data Science 100 Knock ~ Battle for less than beginners part6

This is a struggle record of knocking 100 eggs without knowing the data scientist's egg. It is a mystery whether I can finish the race. ~~ Even if it disappears on the way, please think that it is not given to Qiita. ~~

100 knock articles 100 Knock Guide

** Be careful if you are trying to do it as it includes spoilers **

The link of the page that I referred to first has been changed to the first time only.

This is hard to see! This way of writing is dangerous! If you have any questions, please let me know. ~~ I will use it as food while suffering damage to my heart.

This solution is wrong! This interpretation is different! Please comment if you have any.

This time from 33 to 35. [Last time] 29-32 [First time with table of contents]

33rd

P-033: Calculate the average sales amount (amount) for each store code (store_cd) for the receipt detail data frame (df_receipt), and extract 330 or more.

mine33.py


df=df_receipt
df=df.groupby('store_cd').agg({'amount':'mean'}).query("amount>=330")
df

'''Model answer'''
df_receipt.groupby('store_cd').amount.mean().reset_index().query('amount >= 330')

Problems like reviewing so far. Specify the range of calculation data with guroupby and Add a line of operation with agg and Specify the condition with query.

If you think it's just, the model answer is .amount.mean () I see. However, it seems that it may not work if you write the column specifications by connecting them with ., so be careful.

~~ I gave in to query ~~ I think it would be nice to be able to use the convenient part of the model answer by summarizing it in this way.

34th

P-034: For the receipt detail data frame (df_receipt), add up the sales amount (amount) for each customer ID (customer_id) and calculate the average of all customers. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation.

mine34.py


df=df_receipt
df=df[df['customer_id'].str.contains('^[^Z]')].reset_index()
df.groupby('customer_id').amount.sum().mean()

'''Model answer'''
#How to write without using query
df_receipt[~df_receipt['customer_id'].str.startswith("Z")].groupby('customer_id').amount.sum().mean()

#How to write using query
df_receipt.query('not customer_id.str.startswith("Z")', engine='python').groupby('customer_id').amount.sum().mean()

This time it was a question with two patterns of model answers. I use contains and usestart within the model answer. If you are used to regular expressions to some extent, contains is easier to write, but is startwith looking only ahead, so the response is faster?

Also, although I have specified "other than those that start with Z" in the regular expression, both model answers define "things that start with Z" and are "other than that". If the regular expression is too long, it will take time to read it, so if it is too long, I wonder if this is the case.

If you write this problem in SQL

py34.sql


SELECT SUM(amount)/COUNT(DISTINCT customer_id)
FROM receipt
WHERE NOT customer_id LIKE 'Z%'

I think it will feel like (operation confirmed for the time being)

35th

P-035: For the receipt statement data frame (df_receipt), add up the sales amount (amount) for each customer ID (customer_id) to find the average of all customers, and extract the customers who shop above the average. .. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation. Only 10 data items need to be displayed.

mine35.py


df=df_receipt
df=df[df['customer_id'].str.contains('^[^Z]')].reset_index()
df=df.groupby('customer_id').amount.sum().reset_index()
df[df['amount'].mean() <= df['amount']].head(10)

'''Model answer'''
amount_mean = df_receipt[~df_receipt['customer_id'].str.startswith("Z")].groupby('customer_id').amount.sum().mean()
df_amount_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()
df_amount_sum[df_amount_sum['amount'] >= amount_mean].head(10)

Finally, the code for the model answer has become longer. Thanks to the two letters ~~ df ~~

When writing this time, I was very conscious of which calculation was done on the data field. The model answer is also divided into ʻamount_mean and df_amount_sum`.

  1. df=df_receipt
  1. df=df[df['customer_id'].str.contains('^[^Z]')].reset_index() --Collect costomer_ids other than the first" Z "
  2. df=df.groupby('customer_id').amount.sum().reset_index() --Calculate ʻamount total for each costomer_id`
  3. df[df['amount'].mean() <= df['amount']].head(10) --Narrow down to data above the average of 3

I didn't refer to the average except for the comparison of 4, and I didn't like to make two dfs, so I wrote this this time. Of course, the problem says to find the average first, so if you want to use the average in other processes, this writing may be wrong.

In any case, we will compare the "average of the total" and the "total", so we will make that comparison in 4.

The frequent occurrence of ~~ .reset_index () also has the meaning of magic ~~

Up to here for this time

The table join will start from the next time, and it will be a little short because it is halfway, so I will separate it here. ~~ For that reason, I earned even a little with question sentences and model answers ~~

Recommended Posts

Data Science 100 Knock ~ Battle for less than beginners part3
Data Science 100 Knock ~ Battle for less than beginners part6
Data Science 100 Knock ~ Battle for less than beginners part2
Data Science 100 Knock ~ Battle for less than beginners part1
Data Science 100 Knock ~ Battle for less than beginners part9
Data Science 100 Knock ~ Battle for less than beginners part7
Data Science 100 Knock ~ Battle for less than beginners part4
Data Science 100 Knock ~ Battle for less than beginners part11
Data science 100 knocks ~ Battle for less than beginners part5
Data science 100 knocks ~ Battle for less than beginners part10
Data science 100 knocks ~ Battle for less than beginners part8
Data science 100 knock commentary (P021 ~ 040)
Data science 100 knock commentary (P061 ~ 080)
Data science 100 knock commentary (P041 ~ 060)
Data science 100 knock commentary (P081 ~ 100)
How to implement 100 data science knocks for data science beginners (for windows10 Home)
"Data Science 100 Knock (Structured Data Processing)" Python-007 Explanation
"Data Science 100 Knock (Structured Data Processing)" Python-006 Explanation
"Data Science 100 Knock (Structured Data Processing)" Python-001 Explanation
Time series data anomaly detection for beginners
"Data Science 100 Knock (Structured Data Processing)" Python-002 Explanation
[Python] Data Science 100 Knock (Structured Data Processing) 021 Explanation
"Data Science 100 Knock (Structured Data Processing)" Python-005 Explanation
"Data Science 100 Knock (Structured Data Processing)" Python-004 Explanation
[Python] Data Science 100 Knock (Structured Data Processing) 020 Explanation
[Python] Data Science 100 Knock (Structured Data Processing) 025 Explanation
"Data Science 100 Knock (Structured Data Processing)" Python-003 Explanation
[Python] Data Science 100 Knock (Structured Data Processing) 019 Explanation
[Hands-on for beginners] Read kaggle's "Forecasting Home Prices" line by line (Part 1: Reading data)
For new students (Recommended efforts for Python beginners Part 1)
How to use data analysis tools for beginners
Preparing to try "Data Science 100 Knock (Structured Data Processing)"
Data science 100 knock (structured data processing) environment construction (Windows10)
Basics of pandas for beginners ② Understanding data overview