[PYTHON] Data science 100 knocks ~ Battle for less than beginners part10

This is a struggle record of knocking 100 eggs without knowing the data scientist's egg. It is a mystery whether I can finish the race. ~~ Even if it disappears on the way, please think that it is not given to Qiita. ~~

100 knock articles 100 Knock Guide

** Be careful if you are trying to do it as it includes spoilers **

I don't understand the problem of statistics. (59 ~)

This is hard to see! This way of writing is dangerous! If you have any questions, please let me know. ~~ I will use it as food while suffering damage to my heart.

This solution is wrong! This interpretation is different! Please comment if you have any.

This time from 52 to 56. [Last time] 45-51 [First time with table of contents]

52nd

P-052: Total sales amount (amount) of receipt detail data frame (df_receipt) for each customer ID (customer_id), and 0 for 2000 yen or less and 1 for 2000 yen or more for the total sales amount. Display 10 items together with the customer ID and the total sales amount. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation.

mine52.py


df=df_receipt[df_receipt['customer_id'].str.contains('^[^Z]')].reset_index()
df=df.groupby('customer_id').agg({'amount':'sum'}).reset_index()
df['amount']=df['amount'].apply(lambda x : 0 if x<=2000 else 1)
df.head(10)

'''Model answer'''
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python')
df_sales_amount = df_sales_amount[['customer_id', 'amount']].groupby('customer_id').sum().reset_index()
df_sales_amount['sales_flg'] = df_sales_amount['amount'].apply(lambda x: 1 if x > 2000 else 0)
df_sales_amount.head(10)

After doing groupby, add up and divide it into 0 and 1 while comparing them.

If anything, it's a question of how to master lambda, so it was a python problem after a long time.

Python isn't the only problem, but it's really annoying to have to solve it with type-aware sentences in sentences that aren't type-aware.

In this case, it took a long time to understand which hierarchy type of data is contained in x oflambda x: ~.

53rd

P-053: Binar Tokyo (the first 3 digits are 100 to 209) to 1 and the other ones to 0 for the postal code (postal_cd) of the customer data frame (df_customer). In addition, combine it with the receipt detail data frame (df_receipt) and count the number of customers who have a shopping record for the entire period for each of the created two values.

mine53.py


df=df_customer.copy()
df['postal_cd']=df.postal_cd.apply(lambda x: 1 if 100<=int(x[:3])<=209 else 0)
df=pd.merge(df,df_receipt,how='inner',on='customer_id')
df.groupby('postal_cd').agg({'customer_id':'nunique'})

'''Model answer'''
df_tmp = df_customer[['customer_id', 'postal_cd']].copy()
df_tmp['postal_flg'] = df_tmp['postal_cd'].apply(lambda x: 1 if 100 <= int(x[0:3]) <= 209 else 0)

pd.merge(df_tmp, df_receipt, how='inner', on='customer_id'). \
    groupby('postal_flg').agg({'customer_id':'nunique'})

It is too convenient to be able to write the form 100 <= x <= 209 in the if statement in lambda. It's a secret that I forgot to write ʻif because I concentrated too much on ʻelse.

54th

P-054: The address of the customer data data frame (df_customer) is one of Saitama prefecture, Chiba prefecture, Tokyo metropolitan area, and Kanagawa prefecture. Create a code value for each prefecture and extract it along with the customer ID and address. The values should be 11 for Saitama prefecture, 12 for Chiba prefecture, 13 for Tokyo, and 14 for Kanagawa prefecture. You can display 10 results.

mine54.py


df=df_customer.copy()
df['to_cd']=df['address'].apply(lambda x:11 if x.startswith('Saitama')\
                                else 12 if x.startswith('Chiba')\
                                else 13 if x.startswith('Tokyo') \
                                else 14 if x.startswith('Kanagawa Prefecture') else 0)
df[['customer_id','address','to_cd']].head(10)

'''Model answer'''
pd.concat([df_customer[['customer_id', 'address']], df_customer['address'].str[0:3].map({'Saitama': '11',
                                                                           'Chiba':'12', 
                                                                           'Tokyo':'13', 
                                                                           'Kanagawa':'14'})], axis=1).head(10)

Well, it's convenient for start with () (I forgot to spell contains) What you have to be careful about is to return with bool type.

55th

P-055: Sum the sales amount (amount) of the receipt detail data frame (df_receipt) for each customer ID (customer_id), and find the quartile of the total amount. Then, create a category value for the total sales amount for each customer based on the following criteria, and display both the customer ID and sales amount. Category values are 1 to 4 in order from the top. You can display 10 results.

-More than the minimum and less than the first quartile --From the first quartile to less than the second quartile --From the second quartile to less than the third quartile --Third quartile and above

mine55.py


df=df_receipt.copy()
df=df.groupby('customer_id')['amount'].sum().reset_index()
si=df.quantile(q=[0.25,0.5,0.75]).T
#float(si[0.25])
df['sibun']=df['amount'].apply(lambda x:1 if x<float(si[0.25]) 
                              else 2 if x<float(si[0.5])
                              else 3 if x<float(si[0.75]) 
                              else 4)
df.head(10)

'''Model answer'''
#Code example 1
df_sales_amount = df_receipt[['customer_id', 'amount']].groupby('customer_id').sum().reset_index()
pct25 = np.quantile(df_sales_amount['amount'], 0.25)
pct50 = np.quantile(df_sales_amount['amount'], 0.5)
pct75 = np.quantile(df_sales_amount['amount'], 0.75)

def pct_group(x):
    if x < pct25:
        return 1
    elif pct25 <= x < pct50:
        return 2
    elif pct50 <= x < pct75:
        return 3
    elif pct75 <= x:
        return 4

df_sales_amount['pct_group'] = df_sales_amount['amount'].apply(lambda x: pct_group(x))
df_sales_amount.head(10)

#Code example 2
df_temp = df_receipt.groupby('customer_id')[['amount']].sum()
df_temp['quantile'], bins = pd.qcut(df_receipt.groupby('customer_id')['amount'].sum(), 4, retbins=True)
display(df_temp.head())
print('quantiles:', bins)

Get a quartile from the total and divide by less than or equal to ... However, in code example 2, the total is put in pd.cut as it is and it is specified as 4 divisions.

The quartile was what I was doing on the 32nd run, but I forgot as expected. Fukushu Shinakan

56th

P-056: Calculate the age in 10-year increments based on the age of the customer data frame (df_customer), and extract it together with the customer ID (customer_id) and date of birth (birth_day). However, all people over the age of 60 should be in their 60s. The category name indicating the age is arbitrary. The first 10 items should be displayed.

mine56.py


df=df_customer.copy()
df_bins=pd.cut(df.age,[10,20,30,40,50,60,150],right=False,labels=[10,20,30,40,50,60])

df=pd.concat([df[['customer_id','birth_day']],df_bins],axis=1)
df.head(10)

'''Model answer'''
#Code example 1
df_customer_era = pd.concat([df_customer[['customer_id', 'birth_day']],
                             df_customer['age'].apply(lambda x: min(math.floor(x / 10) * 10, 60))],
                             axis=1)

df_customer_era.head(10)

#Code example 2
df_customer['age_group'] = pd.cut(df_customer['age'], bins=[0, 10, 20, 30, 40, 50, 60, np.inf], right=False)
df_customer[['customer_id', 'birth_day', 'age_group']].head(10)

This time it's the 43rd review. I didn't know the existence of np, inf, so I put in the number 150, which is obviously impossible. Also, I used df.age.min () to get the minimum age and confirmed that 10 ~ was fine, and then pulled out 0.

There is an item called labels in pd.cut, so if you put it in, you will get a beautiful number. Also, right = False will take the range 10 <= x <20.

Reference

Up to here for this time

Recommended Posts

Data science 100 knocks ~ Battle for less than beginners part5
Data science 100 knocks ~ Battle for less than beginners part10
Data science 100 knocks ~ Battle for less than beginners part8
Data Science 100 Knock ~ Battle for less than beginners part6
Data Science 100 Knock ~ Battle for less than beginners part2
Data Science 100 Knock ~ Battle for less than beginners part1
Data Science 100 Knock ~ Battle for less than beginners part9
Data Science 100 Knock ~ Battle for less than beginners part7
Data Science 100 Knock ~ Battle for less than beginners part4
Data Science 100 Knock ~ Battle for less than beginners part11
How to implement 100 data science knocks for data science beginners (for windows10 Home)
Challenge 100 data science knocks
Try "100 knocks on data science" ①
100 Pandas knocks for Python beginners
Time series data anomaly detection for beginners
[Hands-on for beginners] Read kaggle's "Forecasting Home Prices" line by line (Part 1: Reading data)
[Linux command] less command option list [Must-see for beginners]
[Python] 100 knocks on data science (structured data processing) 018 Explanation
[Python] 100 knocks on data science (structured data processing) 023 Explanation
[Python] 100 knocks on data science (structured data processing) 030 Explanation
[Python] 100 knocks on data science (structured data processing) 022 Explanation
For new students (Recommended efforts for Python beginners Part 1)
How to use data analysis tools for beginners
[Python] 100 knocks on data science (structured data processing) 017 Explanation
[Python] 100 knocks on data science (structured data processing) 026 Explanation
[Python] 100 knocks on data science (structured data processing) 016 Explanation
[Python] 100 knocks on data science (structured data processing) 024 Explanation
[Python] 100 knocks on data science (structured data processing) 027 Explanation
Basics of pandas for beginners ② Understanding data overview
[Python] 100 knocks on data science (structured data processing) 029 Explanation
[Python] 100 knocks on data science (structured data processing) 015 Explanation
[Python] 100 knocks on data science (structured data processing) 028 Explanation