Continuing from the last time, I will explain 100 knocks of data science. Data science 100 knock commentary (P001 ~ 020) Data science 100 knock commentary (P021 ~ 040) Data science 100 knock commentary (P041 ~ 060)

For introduction, please refer to this article (* Docker is handled on Mac)

Basically, it is an explanation of the answer, but it also describes another solution.

- Difficulty is gradually increasing.

P-061: The sales amount (amount) of the receipt detail data frame (df_receipt) is totaled for each customer ID (customer_id), and the total sales amount is converted to the common logarithm (base = 10) to total the customer ID and sales amount. Display with. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation. You can display 10 results.

`P-061`

```
#Customer ID"Z"Except for those starting with
#Receipt detail data frame (df_The sales amount (amount) of receipt) is the customer ID (customer)_id) total
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")',
engine='python').groupby('customer_id').amount.sum().reset_index()
#Convert sales amount (amount) to common logarithm (bottom)=10)
#math to apply with apply.log10(x+1)Common logarithm+1 is required.
df_sales_amount['amount_log10'] = df_amount_sum['amount'].apply(lambda x: math.log10(x+1))
df_sales_amount.head(10)
# (Another solution) np.using log10
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python'). \
groupby('customer_id').agg({'amount':'sum'}).reset_index()
df_sales_amount['amount_log10'] = np.log10(df_sales_amount['amount'] + 1)
df_sales_amount.head(10)
```

Reference: Calculate exponential / logarithmic functions with Python (exp, log, log10, log2)

P-062: The sales amount (amount) of the receipt detail data frame (df_receipt) is totaled for each customer ID (customer_id), and the total sales amount is converted to natural logarithm (base = e) to total the customer ID and sales amount. Display with. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation. You can display 10 results.

`P-062`

```
#Customer ID"Z"Except for those starting with
#Receipt detail data frame (df_The sales amount (amount) of receipt) is the customer ID (customer)_id) total
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python') \
.groupby('customer_id').amount.sum().reset_index()
#Natural logarithm of sales amount (amount)(bottom=e) (remember +1)
df_sales_amount['amount_log'] = np.log(df_sales_amount['amount'] + 1)
df_sales_amount.head()
# (Another solution)
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python'). \
groupby('customer_id').agg({'amount':'sum'}).reset_index()
df_sales_amount['amount_loge'] = np.log(df_sales_amount['amount'] + 1)
df_sales_amount.head(10)
```

Reference: Calculate exponential / logarithmic functions with Python (exp, log, log10, log2)

P-063: Calculate the profit amount of each product from the unit price (unit_price) and cost price (unit_cost) of the product data frame (df_product). You can display 10 results.

`P-063`

```
#Product data frame (df_copy product).
df_tmp = df_product.copy()
#Unit price (unit)_price) to cost (unit)_Profit amount minus cost)(unit_profit)To display
df_tmp['unit_profit'] = df_tmp['unit_price'] - df_tmp['unit_cost']
df_tmp.head(10)
```

P-064: Calculate the overall average profit margin of each product from the unit price (unit_price) and cost (unit_cost) of the product data frame (df_product). However, note that there are nulls in unit prices and costs.

`P-64`

```
#Product data frame (df_copy product).
df_tmp = df_product.copy()
#Unit price (unit)_price) to cost (unit)_The unit price (unit) minus the cost)_Profit margin divided by price)(unit_profit_rate)To display
df_tmp['unit_profit_rate'] = (df_tmp['unit_price'] - df_tmp['unit_cost']) / df_tmp['unit_price']
#Calculate the average profit amount(Excluding Null: skipna=True)
df_tmp['unit_profit_rate'].mean(skipna=True)
```

P-065: For each product in the product data frame (df_product), find a new unit price with a profit margin of 30%. However, round down less than 1 yen. Then display 10 results and confirm that the profit margin is around 30%. However, note that there are NULLs in the unit price (unit_price) and cost (unit_cost).

`P-065`

```
#Product data frame (df_copy product).
df_tmp = df_product.copy()
#Cost (unit)_cost) to a profit margin of 30%New unit price(new_price)Seeking(Round down less than 1 yen)
#Cost (unit)_cost） x 10/7 (Profit margin is 30%Becomes)a formula. math.floor is an error
df_tmp['new_price'] = np.floor(df_tmp['unit_cost'] * 10/7)
#Check the profit margin
df_tmp['new_price_rate'] = (df_tmp['new_price'] - df_tmp['unit_cost']) / df_tmp['new_price']
#Calculate the average profit amount(Excluding Null)
df_tmp.head(10)
# (Another solution)
df_tmp = df_product.copy()
df_tmp['new_price'] = df_tmp['unit_cost'].apply(lambda x: np.floor(x / 0.7))
df_tmp['new_profit_rate'] = (df_tmp['new_price'] - df_tmp['unit_cost']) / df_tmp['new_price']
df_tmp.head(10)
```

Reference: [NumPy] Real number → Integer (rounding, rounding down, rounding up)

P-066: For each product in the product data frame (df_product), find a new unit price with a profit margin of 30%. This time, round off less than 1 yen (for 0.5, rounding in even numbers is fine). Then display 10 results and confirm that the profit margin is around 30%. However, note that there are NULLs in the unit price (unit_price) and cost (unit_cost).

`P-066`

```
#Product data frame (df_copy product).
df_tmp = df_product.copy()
#Cost (unit)_cost) to a profit margin of 30%New unit price(new_price)Seeking(Round off less than 1 yen 0.Rounding in even numbers for 5)
#Cost (unit)_cost） x 10/7 (Profit margin is 30%Becomes)a formula.
df_tmp['new_price'] = np.round(df_tmp['unit_cost'] * 10/7)
#Check the profit margin
df_tmp['new_price_rate'] = (df_tmp['new_price'] - df_tmp['unit_cost']) / df_tmp['new_price']
#Calculate the average profit amount(Excluding Null)
df_tmp.head(10)
# (Another solution)
#Built-in round gives an error on NaN, but numpy.round does not result in an error
df_tmp = df_product.copy()
df_tmp['new_price'] = df_tmp['unit_cost'].apply(lambda x: np.round(x / 0.7))
df_tmp['new_profit_rate'] = (df_tmp['new_price'] - df_tmp['unit_cost']) / df_tmp['new_price']
df_tmp.head(10)
```

Reference: Round numbers with pandas (rounding, rounding to even numbers)

P-067: For each product in the product data frame (df_product), find a new unit price with a profit margin of 30%. This time, round up less than 1 yen. Then display 10 results and confirm that the profit margin is around 30%. However, note that there are NULLs in the unit price (unit_price) and cost (unit_cost).

`P-067`

```
#Product data frame (df_copy product).
df_tmp = df_product.copy()
#Cost (unit)_cost) to a profit margin of 30%New unit price(new_price)Seeking(Round up less than 1 yen)
#Cost (unit)_cost） x 10/7 (Profit margin is 30%Becomes)a formula.
df_tmp['new_price'] = np.ceil(df_tmp['unit_cost'] * 10/7)
#Check the profit margin
df_tmp['new_price_rate'] = (df_tmp['new_price'] - df_tmp['unit_cost']) / df_tmp['new_price']
#Calculate the average profit amount(Excluding Null)
df_tmp.head(10)
# (Another solution)
#Code example 1
# math.ceil gives an error on NaN, but numpy.ceil does not result in an error
df_tmp = df_product.copy()
df_tmp['new_price'] = df_tmp['unit_cost'].apply(lambda x: np.ceil(x / 0.7))
df_tmp['new_profit_rate'] = (df_tmp['new_price'] - df_tmp['unit_cost']) / df_tmp['new_price']
df_tmp.head(10)
```

Reference: [NumPy] Real number → Integer (rounding, rounding down, rounding up)

P-068: For each product in the product data frame (df_product), find the amount including tax of 10% consumption tax. Fractions less than 1 yen should be rounded down and 10 results should be displayed. However, note that NULL exists in the unit price (unit_price).

`P-068`

```
#Product data frame (df_copy product).
df_tmp = df_product.copy()
#Consumption tax rate 10%Amount including tax(price_tax)To ask.(Unit price (unit)_price)*1.1)
df_tmp['price_tax'] = np.floor(df_tmp['unit_price'] * 1.1)
df_tmp.head(10)
# (Another solution)
# math.floor gives an error with NaN, but numpy.floor does not cause an error
df_tmp = df_product.copy()
df_tmp['price_tax'] = df_tmp['unit_price'].apply(lambda x: np.floor(x * 1.1))
df_tmp.head(10)
```

Reference: Round numbers with pandas (rounding, rounding to even numbers)

P-069: The receipt detail data frame (df_receipt) and the product data frame (df_product) are combined, and the total sales amount of all products for each customer and the sales of "07" (bottled canned) in the category major category (category_major_cd). Calculate the total amount and find the ratio between the two. The extraction target is limited to customers who have a purchase record of the category major category "07" (bottled canned), and 10 results can be displayed.

`P-069`

```
#Receipt detail data frame (df_Receipt) and product data frame (df_product)(pd.merge,The key is product_cd)
#Calculate the total sales amount of all products for each customer
df_tmp_1 = pd.merge(df_receipt, df_product, on='product_cd',
how='inner').groupby('customer_id').amount.sum().reset_index()
#Category_major_cd)"07"Total sales amount of (bottled canned)
# merge(df1, df2, on='Key', how='inner')Combine with
#Group by customer for each customer. amount.sum()Total sales amount
df_tmp_2 = pd.merge(df_receipt, df_product.query('category_major_cd=="07"'), on='product_cd',
how='inner').groupby('customer_id').amount.sum().reset_index()
# df_tmp1 and df_Combine tmp2.
df_tmp_3 = pd.merge(df_tmp_1, df_tmp_2, on='customer_id', how='inner')
#Find the ratio of the two.
df_tmp_3['rate_07'] = df_tmp_3['amount_y'] / df_tmp_3['amount_x']
df_tmp_3.head(10)
# (Another solution) agg({column:Function to apply})
df_tmp_1 = pd.merge(df_receipt, df_product,
how='inner', on='product_cd').groupby('customer_id').agg({'amount':'sum'}).reset_index()
df_tmp_2 = pd.merge(df_receipt, df_product.query('category_major_cd == "07"'),
how='inner', on='product_cd').groupby('customer_id').agg({'amount':'sum'}).reset_index()
df_tmp_3 = pd.merge(df_tmp_1, df_tmp_2, how='inner', on='customer_id')
df_tmp_3['rate_07'] = df_tmp_3['amount_y'] / df_tmp_3['amount_x']
df_tmp_3.head(10)
```

Reference: merge, join (column / index criteria) to join pandas.DataFrame

P-070: Calculate the number of days elapsed from the member application date (application_date) of the customer data frame (df_customer) to the sales date (sales_ymd) of the receipt detail data frame (df_receipt), and calculate the customer ID (customer_id) and sales date. , Display with the membership application date. You can display 10 results (note that sales_ymd is a numerical value and application_date is a character string).

`P-70`

```
#Receipt detail data frame (df_Receipt) and customer data frame (df_customer) to join
df_tmp = pd.merge(df_receipt[['sales_ymd', 'customer_id']],
df_customer[['application_date', 'customer_id']], how='inner', on='customer_id')
#Remove duplicates
df_tmp = df_tmp.drop_duplicates()
#Member application date (application_Sales from date_ymd) pull
# pandas.to_datetime()Using the function, a string of strings representing the date and time (date / time) pandas.Series datetime64[ns]Can be converted to a type.
# sales_Since ymd is a numeric type, astype('str')Change to string type with
df_tmp['elapsed_date'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str')) - pd.to_datetime(df_tmp['application_date'])
df_tmp[['sales_ymd', 'application_date', 'elapsed_date']].head(10)
```

Reference: Extract / delete duplicate rows of pandas.DataFrame, Series Reference: Process date / time columns with pandas (string conversion, date extraction, etc.)

P-071: For the sales date (sales_ymd) of the receipt detail data frame (df_receipt), calculate the number of months elapsed from the member application date (application_date) of the customer data frame (df_customer), and calculate the customer ID (customer_id) and sales. Display with the date and membership application date. You can display 10 results (note that sales_ymd is a numerical value and application_date is a character string). Truncate less than a month.

`P-071`

```
# pd.Receipt detail data frame with merge (df_Receipt) and customer data frame (df_customer) join(The key is customer_id)
df_tmp = pd.merge(df_receipt[['customer_id', 'sales_ymd']],
df_customer[['customer_id', 'application_date']], how='inner', on='customer_id')
#Remove duplicates(drop_duplicates())
df_tmp = df_tmp.drop_duplicates()
# pandas.to_datetime()Using the function, a string of strings representing the date and time (date / time) pandas.Series datetime64[ns]Can be converted to a type.
df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str'))
df_tmp['application_date'] = pd.to_datetime(df_tmp['application_date'])
#Relativedelta to calculate the number of months(datetime1, datetime2) datetime1 - datetime2
df_tmp['elapsed_date'] = df_tmp[['sales_ymd', 'application_date']].apply(lambda x:
relativedelta(x[0], x[1]).years * 12 + relativedelta(x[0], x[1]).months, axis=1)
#Customer ID (customer)_Sort by id) and display 10 items
df_tmp.sort_values('customer_id').head(10)
```

Reference: Python date / time comparison / calculation / others

P-072: For the sales date (sales_ymd) of the receipt detail data frame (df_receipt), calculate the number of years elapsed from the member application date (application_date) of the customer data frame (df_customer), and calculate the customer ID (customer_id) and sales date. , Display with the member application date. You can display 10 results. (Note that sales_ymd is a numerical value and application_date is a character string to hold the data). Truncate less than a year.

`P-072`

```
# pd.Receipt detail data frame with merge (df_Receipt) and customer data frame (df_customer) join(The key is customer_id)
df_tmp = pd.merge(df_receipt[['customer_id', 'sales_ymd']],
df_customer[['customer_id', 'application_date']], how='inner', on='customer_id')
#Remove duplicates(drop_duplicates())
df_tmp = df_tmp.drop_duplicates()
# pandas.to_datetime()Using the function, a string of strings representing the date and time (date / time) pandas.Series datetime64[ns]Can be converted to a type.
df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str'))
df_tmp['application_date'] = pd.to_datetime(df_tmp['application_date'])
#Calculate the number of years elapsed. relativedelta(datetime1, datetime2).Year difference in years(axis=Processing along the line with 1)
df_tmp['elapsed_date'] = df_tmp[['sales_ymd', 'application_date']].apply(lambda x:
relativedelta(x[0], x[1]).years, axis=1)
df_tmp.sort_values('customer_id').head(10)
```

Reference: Python date / time comparison / calculation / others

P-073: For the sales date (sales_ymd) of the receipt detail data frame (df_receipt), the elapsed time in epoch seconds from the member application date (application_date) of the customer data frame (df_customer) is calculated, and the customer ID (customer_id) , Display with sales date and membership application date. You can display 10 results (note that sales_ymd is a numerical value and application_date is a character string). Since time information is not retained, each date represents 0:00:00.

`P-073`

```
# pd.Receipt detail data frame with merge (df_Receipt) and customer data frame (df_customer) join(The key is customer_id)
df_tmp = pd.merge(df_receipt[['customer_id', 'sales_ymd']],
df_customer[['customer_id', 'application_date']], how='inner', on='customer_id')
#Remove duplicates(drop_duplicates())
df_tmp = df_tmp.drop_duplicates()
# pandas.to_datetime()Using the function, a string of strings representing the date and time (date / time) pandas.Series datetime64[ns]Can be converted to a type.
df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str'))
df_tmp['application_date'] = pd.to_datetime(df_tmp['application_date'])
#Calculate elapsed time in epoch seconds(timestamp()Convert datetime object to UNIX time with method)
#Note: Timestamp is unix time including nanoseconds (10)**Divide by 9)
df_tmp['elapsed_date'] = (df_tmp['sales_ymd'].astype(np.int64) / 10**9) - \
(df_tmp['application_date'].astype(np.int64) / 10 **9)
#Display 10 items
df_tmp.head(10)
```

`reference`

```
#Why divide by 10 to the 9th power?
nano_time = pd.to_datetime('1970-01-02')
nano_time.value
# >>>86400000000000
```

Since the epoch time is the elapsed time from 1970-01-01, 24 hours x 60 minutes x 60 seconds = 86400 seconds It turns out that the unix time is nanoseconds (10 to the 9th power)

Reference: Python date / time comparison / calculation / others

P-074: For the sales date (sales_ymd) of the receipt detail data frame (df_receipt), calculate the number of days elapsed from the Monday of the week, and display it together with the sales date and the date of the Monday of the week. You can display 10 results (note that sales_ymd holds the data numerically).

`P-074`

```
#Receipt detail data frame (df_receipt) customer data(customer_id),Sales date_ymd）
df_tmp = df_receipt[['customer_id', 'sales_ymd']]
#Remove duplicates
df_tmp = df_tmp.drop_duplicates()
# datetime64[ns]Convert to type
df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str'))
#Add Monday data.(relativedelta(days=x.weekday())Monday: 0,Tuesday: 1 ...)
df_tmp['monday'] = df_tmp['sales_ymd'].apply(lambda x: x - relativedelta(days=x.weekday()))
#Sales date_ymd) minus Monday to find the number of days elapsed
df_tmp['elapsed_weekday'] = df_tmp['sales_ymd'] - df_tmp['monday']
df_tmp.head(10)
```

Reference: Handling of pandas datetime type (pd.to_datetime, pd.offsets, timezone)

P-075: Randomly extract 1% data from the customer data frame (df_customer) and extract the first 10 data.

`P-075`

```
#Customer data frame (df_Extract from customer)(sample)
#Randomly 1%Data is the argument frac=0.01
df_customer.sample(frac=0.01).head(10)
```

Reference: sample that randomly samples (extracts) pandas rows and columns

P-076: Randomly stratified 10% of data from the customer data frame (df_customer) based on the ratio of gender (gender_cd), and aggregate the number of cases by gender.

`P-076`

```
# sklearn.model_selection.train_test_Example using split
# _(Training data 90%)、df_tmp(Test data 10%)10% is randomly extracted by(Stratified sampling:Argument stratify)
_, df_tmp = train_test_split(df_customer, test_size=0.1, stratify=df_customer['gender_cd'])
#Gender_Group by cd), customer_Number of ids(count)Aggregate
df_tmp.groupby('gender_cd').agg({'customer_id' : 'count'})
# (Another solution)
#Randomly extracted using sample method(frac=0.Extract 10% data with 1)
#Gender_Group by cd), customer_Number of ids(count)Aggregate
df_customer.sample(frac=0.1).groupby('gender_cd').agg({'customer_id' : 'count'})
```

Reference: sample that randomly samples (extracts) pandas rows and columns

P-077: Sum the sales amount (amount) of the receipt detail data frame (df_receipt) for each customer, and extract the outliers of the total sales amount. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation. Here, it is assumed that the outliers are separated from the average by 3σ or more. You can display 10 results.

`P-077`

```
#Customer ID"Z"Exclude anything that starts with
#Receipt detail data frame (df_Total sales amount (amount) of receipt) for each customer
df_sales_amount = df_receipt.query("not customer_id.str.startswith('Z')",
engine='python').groupby('customer_id').amount.sum().reset_index()
#Normalization / standardization(Average 0, standard deviation 1)：preprocessing.scale()Use a function.
df_sales_amount['amount_ss'] = preprocessing.scale(df_sales_amount['amount'])
#Extract outliers(average(mean)± 3σ(std)Extract those that are more apart)
#Extract the ones that meet the conditions with the query method. Absolute value is more than 3σ away(abs)Use
df_sales_amount.query('abs(amount_ss)>=3').head(10)
# (Another solution)
#Customer ID"Z"Exclude anything that starts with
#Receipt detail data frame (df_Total sales amount (amount) of receipt) for each customer
df_sales_amount = df_receipt.query("not customer_id.str.startswith('Z')",
engine='python').groupby('customer_id').amount.sum().reset_index()
#Find the average and standard deviation of the total sales of each customer
mean = df_sales_amount.describe().loc['mean']
std = df_sales_amount.describe().loc['std']
#Extract outliers(average(mean)± 3σ(std)Extract those that are more apart)
df_sales_amount[(df_sales_amount['amount'] >= int(mean) + 3*int(std)) | \
(df_sales_amount['amount'] <= int(mean) - 3*int(std))].head(10)
```

Reference: Normalize / Standardize with Python (list, NumPy array, pandas.DataFrame)

P-078: Sum the sales amount (amount) of the receipt detail data frame (df_receipt) for each customer, and extract the outliers of the total sales amount. However, if the customer ID starts with "Z", it represents a non-member, so exclude it from the calculation. Here, the outlier is less than "1st quartile-1.5 x IQR" using IQR, which is the difference between the 1st and 3rd quartiles, or "3rd quartile" It shall exceed "interquartile range + 1.5 x IQR". You can display 10 results.

`P-078`

```
#Customer ID"Z"Exclude anything that starts with
#Receipt detail data frame (df_Total sales amount (amount) of receipt) for each customer
df_sales_amount = df_receipt.query("not customer_id.str.startswith('Z')",
engine='python').groupby('customer_id').amount.sum().reset_index()
#Find the IQR, which is the difference between the first and third quartiles
pct75 = df_sales_amount['amount'].quantile(0.75)
pct25 = df_sales_amount['amount'].quantile(0.25)
iqr = pct75 - pct25
#First quartile-1.5×IQR(amount_low),Third quartile+1.5×IQR(amount_hight)Ask for each
amount_low = pct25 - (iqr * 1.5)
amount_hight = pct75 + (iqr * 1.5)
#Extract outliers(@Variable name in)
df_sales_amount.query('amount > @amount_hight | amount < @amount_low').head(10)
# (Another solution) np.Answer using percentile
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python'). \
groupby('customer_id').agg({'amount':'sum'}).reset_index()
pct75 = np.percentile(df_sales_amount['amount'], q=75)
pct25 = np.percentile(df_sales_amount['amount'], q=25)
iqr = pct75 - pct25
amount_low = pct25 - (iqr * 1.5)
amount_hight = pct75 + (iqr * 1.5)
df_sales_amount.query('amount < @amount_low or @amount_hight < amount').head(10)
```

Reference: quantile to get quantiles and percentiles with pandas

P-079: Check the number of missing items for each item in the product data frame (df_product).

`P-079`

```
# isnull()Judge whether it is a missing value for each element with
df_product.isnull().sum()
```

Reference: Determine whether pandas contains missing value NaN and count the number

P-080: Create a new df_product_1 by deleting all records that are missing in any item of the product data frame (df_product). In addition, display the number of cases before and after the deletion, and confirm that the number has decreased by the number confirmed in the previous question.

`P-080`

```
#Product data frame (df_product) copy
df_product_1 = df_product.copy()
#Check the number of cases before deleting missing values
print('Before deletion: ', len(df_product_1))
#Dropna to exclude (delete) missing values()Method
df_product_1.dropna(inplace=True)
print('After deletion: ', len(df_product_1))
```

Reference: Exclude (delete) / replace (fill in) / extract missing value NaN with pandas

100 knocks of data science How to run 100 data science knocks on Mac

Difficulty varies depending on the problem. Difficult things are difficult.

Recommended Posts