[PYTHON] Data Science 100 Knock ~ Battle for less than beginners part1

This is a record of the struggles of a data scientist still in the egg (freshly laid), working through the 100 Knocks without really understanding what is going on. Whether I can finish the course is a mystery. ~~If the series vanishes partway through, please assume Qiita was not for me.~~

Reference articles: 100 Knock article, 100 Knock Guide

Page used as a reference for building the environment last time

**Be careful if you plan to work through the problems yourself: this article contains spoilers.**

There were many idioms I did not know, and many cases of "I wrote this, but the model answer was that", so I am recording them here as a memo.

If you think "this is hard to read!" or "this way of writing is dangerous!", please let me know. ~~I will take the damage to heart and use it as nourishment.~~

Likewise, if a solution is wrong or I have misread a problem, please comment.

Table of contents (part: problems covered)
part1: 1~9
part2: 10~18
part3: 19~22
part4: 23~28
part5: 29~32
part6: 33~35
part7: not posted

1st

As expected, I could write this one. Even if the preparation material has not fully sunk in, not being able to write this would hold me back later.

mine01.py


df_receipt.head(10)

2nd

~~I got an error right away.~~ When selecting multiple columns, you pass a list of names inside the brackets: [['column A','column B']]. Throwing errors is an everyday occurrence for me.

mine02.py


df_receipt[['sales_ymd','customer_id','product_cd','amount']].head(10)
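The difference between the two bracket forms can be seen on a toy frame. This is only a minimal sketch: the three-row `df_receipt` below is made-up data standing in for the real 100 Knock table.

```python
import pandas as pd

# Made-up stand-in for df_receipt (hypothetical data, not the real table).
df_receipt = pd.DataFrame({
    "sales_ymd": [20181103, 20181118, 20170712],
    "customer_id": ["CS006214000001", "CS008415000097", "CS028414000014"],
    "amount": [158, 81, 170],
})

# A list of names inside the indexing brackets selects several columns
# and returns a DataFrame.
selected = df_receipt[["sales_ymd", "amount"]]

# Without the inner list, pandas looks for ONE column whose label is the
# tuple ("sales_ymd", "amount"), does not find it, and raises KeyError.
try:
    df_receipt["sales_ymd", "amount"]
    raised = False
except KeyError:
    raised = True

print(list(selected.columns), raised)
```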

3rd

Already by the third problem, my answer differs from the model answer. I wrote:

mine03.py


df=df_receipt[['sales_ymd','customer_id','product_cd','amount']]
df.columns=['sales_date','customer_id','product_cd','amount']
df.head(10)

Using three lines for something this simple may look silly, but that is what came out while I was sorting out my thoughts. Or rather, I had not used {} (a dict) since I started with Python, so I simply did not know about rename.

**Model answer**

df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']].rename(columns={'sales_ymd': 'sales_date'}).head(10)
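To convince myself that the two approaches really agree, here is a small check. It is a sketch on made-up data (the two-row `df_receipt` is hypothetical), not the real table.

```python
import pandas as pd

# Made-up stand-in for df_receipt (hypothetical data).
df_receipt = pd.DataFrame({
    "sales_ymd": [20181103, 20181118],
    "customer_id": ["CS006214000001", "CS008415000097"],
    "product_cd": ["P070305012", "P070701017"],
    "amount": [158, 81],
})
cols = ["sales_ymd", "customer_id", "product_cd", "amount"]

# Three-line version: overwrite the whole list of column names.
df = df_receipt[cols]
df.columns = ["sales_date", "customer_id", "product_cd", "amount"]

# rename takes a dict {old_name: new_name}; columns that are not
# listed keep their names, so only the changed one has to be spelled out.
renamed = df_receipt[cols].rename(columns={"sales_ymd": "sales_date"})

print(renamed.equals(df))
```

The practical difference is that assigning to `columns` requires every name in the right order, while `rename` only mentions the column that changes.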

4th

I gave a different answer here as well.

mine04.py


df=df_receipt[['sales_ymd','customer_id','product_cd','amount']]
df=df[df['customer_id']=='CS018205000001']
df

**Model answer**

df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']].query('customer_id == "CS018205000001"')

Am I the only one who feels uneasy about writing the query condition as a string? ~~Strings ... WAF ... regular-expression matching ... my head ...~~ With a string argument, a typo inside the string is not flagged by the editor; it only surfaces when the line actually runs. Is query faster?
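A small sketch of the worry about strings, on made-up data (the `df_receipt` below and the misspelled name `custmer_id` are hypothetical). It only shows that the bracket form and query select the same rows and what happens with a typo; it says nothing about which is faster.

```python
import pandas as pd

# Made-up stand-in for df_receipt (hypothetical data).
df_receipt = pd.DataFrame({
    "customer_id": ["CS018205000001", "CS018205000001", "CS006214000001"],
    "amount": [2064, 102, 158],
})

# Bracket (boolean mask) form and query form select the same rows.
by_mask = df_receipt[df_receipt["customer_id"] == "CS018205000001"]
by_query = df_receipt.query('customer_id == "CS018205000001"')

# A misspelled column name inside the string is only detected when the
# line runs; pandas then raises an error that derives from NameError.
try:
    df_receipt.query('custmer_id == "CS018205000001"')
    typo_error = None
except NameError as exc:
    typo_error = type(exc).__name__

print(by_mask.equals(by_query), typo_error)
```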

5th

mine05.py


#df=df_receipt[['sales_ymd','customer_id','product_cd','amount']]
#df=df[df['customer_id']=='CS018205000001']
df[df['amount']>=1000]

**Model answer**

df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] \
    .query('customer_id == "CS018205000001" & amount >= 1000')

Since the preconditions were the same as in the 4th problem, I reused df as it was. Again, the model answer uses query. It occurred to me while writing: is my bracket style a deprecated way of writing, like `ix`?

Digression

Joining the two lines

df=df[df['customer_id']=='CS018205000001']
df[df['amount']>=1000]

into one,

df=df[df['customer_id']=='CS018205000001'][df['amount']>=1000]

gives the same result, but with a warning:

/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2: UserWarning: Boolean Series key will be reindexed to match DataFrame index.

Since df[condition] is itself a DataFrame, I assumed df[condition 1][condition 2] would work as well, and got scolded for it. The second condition is still built from the unfiltered df, so its index no longer matches the already filtered frame.
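Here is a minimal sketch of the chained form next to two ways that avoid the index mismatch. The data is made up, and the names `is_customer`, `is_large`, and `step1` are my own, not from the model answers.

```python
import warnings

import pandas as pd

# Made-up stand-in for the receipt data (hypothetical values).
df = pd.DataFrame({
    "customer_id": ["CS018205000001", "CS006214000001", "CS018205000001"],
    "amount": [2064, 1500, 102],
})

is_customer = df["customer_id"] == "CS018205000001"
is_large = df["amount"] >= 1000  # built from the UNFILTERED df

# Chained form: the second mask has more rows than the filtered frame,
# so pandas has to reindex it (this is where the UserWarning came from).
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    chained = df[is_customer][is_large]

# Option 1: combine both masks on the original frame with &.
combined = df[is_customer & is_large]

# Option 2: filter step by step, building the second mask from the
# already filtered frame so the indexes line up.
step1 = df[is_customer]
stepwise = step1[step1["amount"] >= 1000]

print(chained.equals(combined), combined.equals(stepwise))
```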

6th

mine06.py


df=df_receipt[['sales_ymd','customer_id','product_cd','quantity','amount']]
df=df[df['customer_id']=='CS018205000001']
df=df[(1000<=df['amount'])|(5<=df['quantity'])]
df

**Model answer**

df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'quantity', 'amount']].query('customer_id == "CS018205000001" & (amount >= 1000 | quantity >=5)')

As the conditions get longer, I split them across more lines. In particular, I find AND conditions easier to follow when each one gets its own step. Does that slow things down, though?
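One way to keep long conditions readable without filtering in several passes is to give each condition a name first. The sketch below uses made-up data, and the mask names are my own; it does not measure speed.

```python
import pandas as pd

# Made-up stand-in for df_receipt (hypothetical data).
df_receipt = pd.DataFrame({
    "customer_id": ["CS018205000001"] * 3 + ["CS006214000001"],
    "quantity": [1, 6, 1, 9],
    "amount": [2064, 300, 102, 5000],
})

# Name each condition, then combine them in one place.
is_customer = df_receipt["customer_id"] == "CS018205000001"
is_large_amount = df_receipt["amount"] >= 1000
is_large_quantity = df_receipt["quantity"] >= 5

# The parentheses matter: in Python, & binds more tightly than |.
by_masks = df_receipt[is_customer & (is_large_amount | is_large_quantity)]

# Same selection written as a query string.
by_query = df_receipt.query(
    'customer_id == "CS018205000001" & (amount >= 1000 | quantity >= 5)'
)

print(by_masks.equals(by_query))
```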

7th

mine07.py


df=df_receipt[['sales_ymd','customer_id','product_cd','amount']]
df=df[df['customer_id']=='CS018205000001']
df=df[(1000<=df['amount'])&(df['amount']<=2000)]
df

**Model answer**

df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] \
    .query('customer_id == "CS018205000001" & 1000 <= amount <= 2000')

Being able to write a BETWEEN condition as a single chained comparison is attractive. Incidentally, from around here on, all I could think was "just let me write SQL ...".
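For comparison, here are three ways to express the same inclusive range. This is a sketch on made-up amounts; `Series.between` is a pandas method that the model answer does not use, shown only as one more option.

```python
import pandas as pd

# Made-up amounts around the boundaries (hypothetical data).
df_receipt = pd.DataFrame({"amount": [999, 1000, 1500, 2000, 2001]})

# Two explicit comparisons joined with &.
by_masks = df_receipt[
    (1000 <= df_receipt["amount"]) & (df_receipt["amount"] <= 2000)
]

# Series.between includes both endpoints by default.
by_between = df_receipt[df_receipt["amount"].between(1000, 2000)]

# Chained comparison inside a query string.
by_query = df_receipt.query("1000 <= amount <= 2000")

print(by_masks.equals(by_between), by_masks.equals(by_query))
```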

8th

mine08.py


df=df_receipt[['sales_ymd','customer_id','product_cd','amount']]
df=df[df['customer_id']=='CS018205000001']
df=df[df['product_cd'] != 'P071401019']
df

**Model answer**

df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] \
    .query('customer_id == "CS018205000001" & product_cd != "P071401019"')

With conditions like these, I wonder whether I should combine them inside a single expression instead.

9th

mine09.py


df_store.query('prefecture_cd != "13" & not (floor_area > 900)')

**Model answer**

df_store.query('prefecture_cd != "13" & floor_area <= 900')

I finally gave in to query. ~~Not because it is a rewrite problem.~~ I was not sure whether it was necessary to use not. Or rather, it was not.
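The rewrite from OR to AND is De Morgan's law: not (A or B) is the same as (not A) and (not B). A quick check on made-up store data (the four-row `df_store` is hypothetical) shows all three spellings agree. This assumes floor_area has no missing values; with NaN, `not (floor_area > 900)` and `floor_area <= 900` can differ, because both comparisons are False for NaN.

```python
import pandas as pd

# Made-up stand-in for df_store (hypothetical data, no missing values).
df_store = pd.DataFrame({
    "prefecture_cd": ["13", "14", "12", "13"],
    "floor_area": [1200.0, 808.0, 1500.0, 700.0],
})

# OR form wrapped in not.
original = df_store.query('not (prefecture_cd == "13" | floor_area > 900)')

# My answer: AND form, with not left on the second condition.
with_not = df_store.query('prefecture_cd != "13" & not (floor_area > 900)')

# Model answer: AND form with the comparison flipped instead of not.
without_not = df_store.query('prefecture_cd != "13" & floor_area <= 900')

print(original.equals(with_not), with_not.equals(without_not))
```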

That is all for this time

To be honest, this is about as far as I could get without looking things up. From next time on, I expect to be recording how I go down in flames while forcing my way through things I do not understand.
