This is a struggle record of knocking 100 eggs without knowing the data scientist's egg. It is a mystery whether I can finish the race. ~~ Even if it disappears on the way, please think that it is not given to Qiita. ~~

** Be careful if you are trying to do it as it includes spoilers **

The link of the page that I referred to first has been changed to the first time only.

This is hard to see! This way of writing is dangerous! If you have any questions, please let me know. ~~ I will use it as food while suffering damage to my heart.

This solution is wrong! This interpretation is different! Please comment if you have any.

This time from 33 to 35. [Last time] 29-32 [First time with table of contents]

33rd

P-033: Calculate the average sales amount (amount) for each store code (store_cd) for the receipt detail data frame (df_receipt), and extract 330 or more.

`mine33.py`


df=df_receipt
df=df.groupby('store_cd').agg({'amount':'mean'}).query("amount>=330")
df

'''Model answer'''
df_receipt.groupby('store_cd').amount.mean().reset_index().query('amount >= 330')

Problems like reviewing so far. Specify the range of calculation data with guroupby and Add a line of operation with agg and Specify the condition with query.

If you think it's just, the model answer is .amount.mean () I see. However, it seems that it may not work if you write the column specifications by connecting them with ., so be careful.

~~ I gave in to query ~~ I think it would be nice to be able to use the convenient part of the model answer by summarizing it in this way.

34th

P-034: For the receipt detail data frame (df_receipt), add up the sales amount (amount) for each customer ID (customer_id) and calculate the average of all customers. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation.

`mine34.py`


df=df_receipt
df=df[df['customer_id'].str.contains('^[^Z]')].reset_index()
df.groupby('customer_id').amount.sum().mean()

'''Model answer'''
#How to write without using query
df_receipt[~df_receipt['customer_id'].str.startswith("Z")].groupby('customer_id').amount.sum().mean()

#How to write using query
df_receipt.query('not customer_id.str.startswith("Z")', engine='python').groupby('customer_id').amount.sum().mean()

This time it was a question with two patterns of model answers. I use contains and usestart within the model answer. If you are used to regular expressions to some extent, contains is easier to write, but is startwith looking only ahead, so the response is faster?

Also, although I have specified "other than those that start with Z" in the regular expression, both model answers define "things that start with Z" and are "other than that". If the regular expression is too long, it will take time to read it, so if it is too long, I wonder if this is the case.

If you write this problem in SQL

`py34.sql`


SELECT SUM(amount)/COUNT(DISTINCT customer_id)
FROM receipt
WHERE NOT customer_id LIKE 'Z%'

I think it will feel like (operation confirmed for the time being)

35th

P-035: For the receipt statement data frame (df_receipt), add up the sales amount (amount) for each customer ID (customer_id) to find the average of all customers, and extract the customers who shop above the average. .. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation. Only 10 data items need to be displayed.

`mine35.py`


df=df_receipt
df=df[df['customer_id'].str.contains('^[^Z]')].reset_index()
df=df.groupby('customer_id').amount.sum().reset_index()
df[df['amount'].mean() <= df['amount']].head(10)

'''Model answer'''
amount_mean = df_receipt[~df_receipt['customer_id'].str.startswith("Z")].groupby('customer_id').amount.sum().mean()
df_amount_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()
df_amount_sum[df_amount_sum['amount'] >= amount_mean].head(10)

Finally, the code for the model answer has become longer. Thanks to the two letters ~~ df ~~

When writing this time, I was very conscious of which calculation was done on the data field. The model answer is also divided into ʻamount_mean and df_amount_sum`.

df=df_receipt

The usual

df=df[df['customer_id'].str.contains('^[^Z]')].reset_index() --Collect costomer_ids other than the first" Z "
df=df.groupby('customer_id').amount.sum().reset_index() --Calculate ʻamount total for each costomer_id`
df[df['amount'].mean() <= df['amount']].head(10) --Narrow down to data above the average of 3

I didn't refer to the average except for the comparison of 4, and I didn't like to make two dfs, so I wrote this this time. Of course, the problem says to find the average first, so if you want to use the average in other processes, this writing may be wrong.

In any case, we will compare the "average of the total" and the "total", so we will make that comparison in 4.

The frequent occurrence of ~~ .reset_index () also has the meaning of magic ~~

Up to here for this time

The table join will start from the next time, and it will be a little short because it is halfway, so I will separate it here. ~~ For that reason, I earned even a little with question sentences and model answers ~~

[PYTHON] Data Science 100 Knock ~ Battle for less than beginners part6

33rd

mine33.py