[PYTHON] Data science 100 knocks ~ Battle for less than beginners part8

This is my struggle record of working through the 100 knocks even though I am not even a fledgling data scientist. Whether I can finish the race is anyone's guess. ~~If the series disappears partway through, please assume I just couldn't keep posting to Qiita.~~

The 100 knocks themselves: 100 Knock Guide

**Be careful if you plan to try the knocks yourself, as this article contains spoilers.**

It scares me how different my code is from the model answers.

"This is hard to read!" "This way of writing is risky!" If you have comments like that, please let me know. ~~I will take the damage to heart and use it as food for growth.~~

"This solution is wrong!" "That interpretation is off!" Please comment if you spot anything.

This time covers problems 41 to 44. [Last time] 36-40 [Part 1, with the table of contents]

41st

P-041: Aggregate the sales amount (amount) of the receipt detail data frame (df_receipt) by date (sales_ymd), and calculate the increase or decrease in sales amount from the previous day. Displaying 10 results is sufficient.

mine41.py


# Daily sales totals
df_day=df_receipt.groupby('sales_ymd').agg({'amount':'sum'}).reset_index()
# Same totals keyed on "date + 1", so merging on sales_ymd attaches the previous day's amount
df_Yday=pd.concat([df_day['sales_ymd']+1,df_day['amount']],axis=1).rename(columns={'amount': 'Y_amount'})
df=pd.merge(df_day.head(10),df_Yday,on='sales_ymd',how='outer')
# Difference from the previous day
df['sa']=df['amount']-df['Y_amount']
df.head(10)

'''Model answer'''
df_sales_amount_by_date = df_receipt[['sales_ymd', 'amount']].groupby('sales_ymd').sum().reset_index()
df_sales_amount_by_date = pd.concat([df_sales_amount_by_date, df_sales_amount_by_date.shift()], axis=1)
df_sales_amount_by_date.columns = ['sales_ymd','amount','lag_ymd','lag_amount']
df_sales_amount_by_date['diff_amount'] = df_sales_amount_by_date['amount'] - df_sales_amount_by_date['lag_amount']
df_sales_amount_by_date.head(10)

In short, this is the daily totals shifted by one day. My approach was to:

  1. Build a table df_Yday of the form [date + 1 day, daily total]
  2. Merge on the date, which yields [date, total, total from 1 day earlier]
  3. Add a column with the difference between the totals

I don't think this is wrong, but it would probably be better to take the previous date that actually appears in the table rather than literally "one day before".

- For example, if a store is closed on Saturdays and Sundays, dates get skipped, and the literal "previous day" does not exist in the table.

After seeing the model answer, I realized that once you know how to use .shift(), this becomes very easy, so I will use it from the next problem onward.
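To convince myself of the difference, here is a minimal sketch on toy data (my own sketch, not part of the guide): .shift() pulls the previous record regardless of calendar gaps, whereas my merge on sales_ymd + 1 would find nothing for the day right after a closure.

sketch41.py


import pandas as pd

# Toy daily totals with 01/03 and 01/04 missing (e.g. the store was closed)
toy = pd.DataFrame({'sales_ymd': [20170101, 20170102, 20170105],
                    'amount': [100, 150, 130]})

# Row-based lag: the previous *record*, whatever the calendar gap is
toy['lag_amount'] = toy['amount'].shift()
toy['diff_amount'] = toy['amount'] - toy['lag_amount']
toy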

42nd

P-042: Aggregate the sales amount (amount) of the receipt detail data frame (df_receipt) by date (sales_ymd), and join the data from 1 day before, 2 days before, and 3 days before onto each date's data. Displaying 10 results is sufficient.

mine42.py


# Daily totals, keeping sales_ymd as the index so that .shift() moves only the amounts
df_day=df_receipt.groupby('sales_ymd').agg({'amount':'sum'})
df=df_day
# Merge in the totals shifted back by 1, 2 and 3 rows
for i in range(1,4):
    df=pd.merge( df,df_day.shift(i).reset_index(),how='outer',on='sales_ymd')

df.columns=['sales_ymd','amount_sum_0','amount_sum_1','amount_sum_2','amount_sum_3']
df.head(10)

'''Model answer'''
# Code example 2: wide-format ("horizontal holding") case
df_sales_amount_by_date = df_receipt[['sales_ymd', 'amount']].groupby('sales_ymd').sum().reset_index()
for i in range(1, 4):
    if i == 1:
        df_lag = pd.concat([df_sales_amount_by_date, df_sales_amount_by_date.shift(i)],axis=1)
    else:
        df_lag = pd.concat([df_lag, df_sales_amount_by_date.shift(i)],axis=1)
df_lag.columns = ['sales_ymd', 'amount', 'lag_ymd_1', 'lag_amount_1', 'lag_ymd_2', 'lag_amount_2', 'lag_ymd_3', 'lag_amount_3']
df_lag.dropna().sort_values(['sales_ymd']).head(10)

At first I tried to merge all four tables at once without a for loop, but apparently pd.merge() can only merge two frames at a time (Reference).

So I made one base table and looped over it with a for statement.

Personally, this was the most fun data manipulation I have done recently. Incidentally, it seems that .shift(0) leaves the data unchanged.

What I originally wanted to write was something like this:

df=pd.merge( df_day.shift(0).reset_index() , df_day.shift(1).reset_index() , df_day.shift(2).reset_index() , df_day.shift(3).reset_index() ,how='outer',on='sales_ymd')
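Since pd.merge() only takes two frames, one way to get close to a single expression is to fold the lagged tables together with functools.reduce. This is my own sketch (not the model answer), assuming df_day from mine42.py above; I rename the amount column per lag so the merged columns don't collide.

sketch42.py


from functools import reduce
import pandas as pd

# Lagged daily totals; .shift(0) leaves the data unchanged, so lag 0 is the original table
lagged = [df_day.shift(i).reset_index().rename(columns={'amount': f'amount_sum_{i}'})
          for i in range(0, 4)]
# Fold the list into one frame, merging two tables at a time
df = reduce(lambda left, right: pd.merge(left, right, how='outer', on='sales_ymd'), lagged)
df.head(10)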

43rd

P-043: Combine the receipt detail data frame (df_receipt) and the customer data frame (df_customer), and create a sales summary data frame (df_sales_summary) that totals the sales amount (amount) for each gender (gender) and age group (calculated from age). Gender code is 0 for male, 1 for female, and 9 for unknown. The result should consist of four columns: age group, sales amount for women, sales amount for men, and sales amount for unknown gender (a cross tabulation with age group vertically and gender horizontally). The age groups should be 10-year bins.

For this problem I took the following steps:

  1. Create an age-group column
  2. Group by the two keys (age group and gender) with .groupby
  3. Reshape the result into a cross tabulation

mine43.py


# Join receipts to customers, then cut ages into 10-year bins
df=pd.merge(df_receipt,df_customer,how='inner',on='customer_id')
df_bins=pd.cut(df.age,range(0,100,10))
df=pd.concat([df[['gender_cd','amount']],df_bins],axis=1)
# Total amount per (age bin, gender)
df=df.groupby(['age','gender_cd']).agg({'amount':'sum'}).reset_index()

# Build the cross tabulation by merging one gender at a time on the age bin
df_cross=pd.merge( df.query("gender_cd=='0'")[['age','amount']]
                  ,df.query("gender_cd=='1'")[['age','amount']]
                  ,how='outer',on='age')
df_cross=pd.merge( df_cross
                  ,df.query("gender_cd=='9'")[['age','amount']]
                  ,how='outer',on='age')
df_cross.columns=['age','male','female','unknown']
df_cross

'''Model answer'''
import math
df_tmp = pd.merge(df_receipt, df_customer, how ='inner', on="customer_id")
df_tmp['era'] = df_tmp['age'].apply(lambda x: math.floor(x / 10) * 10)
df_sales_summary = pd.pivot_table(df_tmp, index='era', columns='gender_cd', values='amount', aggfunc='sum').reset_index()
df_sales_summary.columns = ['era', 'male', 'female', 'unknown']
df_sales_summary

Looking at the model answer, instead of cutting the ages into bins, .apply(lambda x: math.floor(x / 10) * 10) divides by 10, floors to an integer, and multiplies by 10 again. I still don't really know how to use lambda. And then a pivot table... I see, so it can be used like that...

I will go and study pivot tables.
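To see what the lambda and pivot_table actually do, here is a toy sketch of my own (made-up data, not the 100-knocks data):

sketch43.py


import math
import pandas as pd

toy = pd.DataFrame({'age': [23, 37, 45, 23],
                    'gender_cd': ['0', '1', '1', '9'],
                    'amount': [100, 200, 300, 50]})

# Floor each age to its decade: 23 -> 20, 37 -> 30, 45 -> 40
toy['era'] = toy['age'].apply(lambda x: math.floor(x / 10) * 10)
# One row per era, one column per gender code, summed amounts in the cells
pd.pivot_table(toy, index='era', columns='gender_cd', values='amount', aggfunc='sum').reset_index()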

44th

P-044: The sales summary data frame (df_sales_summary) created in the previous question held gender horizontally. Convert it so that gender is held vertically, giving three columns: age group, gender code, and sales amount. The gender code should be '00' for men, '01' for women, and '99' for unknown.

Since this problem continues from the previous one, I reuse the df_cross I put together last time as it is. Converting to the vertical format just means splitting the frame into three pieces, one per gender, and stacking them vertically. Or rather, since I went from vertical to horizontal last time, going back the other way was easy...

mine44.py


# One slice per gender, each renamed to the same column name so they stack cleanly
df_cross_M=df_cross[['age','male']].rename(columns={'male':'sum'})
df_cross_M['gender']='00'
df_cross_F=df_cross[['age','female']].rename(columns={'female':'sum'})
df_cross_F['gender']='01'
df_cross_U=df_cross[['age','unknown']].rename(columns={'unknown':'sum'})
df_cross_U['gender']='99'

# Stack the three slices vertically
df=pd.concat([df_cross_M,df_cross_F,df_cross_U])
df
'''Model answer'''
df_sales_summary = df_sales_summary.set_index('era'). \
        stack().reset_index().replace({'female':'01',
                                        'male':'00',
                                        'unknown':'99'}).rename(columns={'level_1':'gender_cd', 0: 'amount'})

The model answer finishes everything in one very long expression. I suppose set_index() moves the era column into the index, and stack() then folds all the remaining columns into rows? Being able to rename all the values at once with .replace() is impressive. I definitely want to master that.
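As an aside, the same wide-to-long reshape can also be done with pd.melt(); here is a toy sketch of my own (not the model answer):

sketch44.py


import pandas as pd

toy = pd.DataFrame({'era': [20, 30],
                    'male': [100, 200],
                    'female': [150, 250],
                    'unknown': [10, 20]})

# id_vars stays as a column, the remaining columns become (variable, value) pairs
toy_long = pd.melt(toy, id_vars='era', var_name='gender_cd', value_name='amount') \
             .replace({'male': '00', 'female': '01', 'unknown': '99'})
toy_long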

That's all for this time.

As always, the model answers keep showing me approaches different from my own, which makes this very educational.

Taking this opportunity:

Thank you to everyone who always stocks these articles. If you want to follow my progress without stocking each individual article: I update the table of contents in part 1 every time, so stocking only part 1 will get you an update notification for each new post, and you can follow along without any extra effort. (Or so I like to think.)

Thank you to my regular readers, to those reading for the first time, and to everyone else. Thanks for reading.
