[PYTHON] Basic machine learning procedure: ② Prepare data

Introduction

Last time, in Basic machine learning procedure: ① Classification model, I organized the procedure for creating a very basic classification model. However, I skipped over the part where the data is actually prepared. So this time, I would like to organize the procedure for creating the following table with BigQuery and importing it into the Python environment.

id  | result | product1 | product2 | product3 | product4 | product5
001 | 1      | 2500     | 1200     | 1890     | 530      | null
002 | 0      | 750      | 3300     | null     | 1250     | 2000

Analytical environment

Google BigQuery
Google Colaboratory

Target data

Purchasing data is stored in a table with the structure shown below.

id  | Store name | Purchase date | Purchase goods | Purchase price | Number of purchases
001 | 〇〇 store   | 2020-01-10    | product1       | 2500           | 1
002 | ☓☓ store   | 2020-01-10    | product1       | 750            | 3
002 | ☓☓ store   | 2020-01-10    | product2       | 1000           | 2

The table name on BigQuery is as follows (the data is divided into one table per day): myproject.mydataset.transaction_YYYYMMDD (e.g. transaction_20200110).

In practice, column names cannot be given in Japanese, but for the sake of clarity they are written in Japanese here. Please change the names to suit your own environment.

1. What to do on BigQuery

I said we would work on BigQuery, but all of the actual operations are done from Python (Google Colaboratory). A notebook is also useful for keeping a record of what you have done.

To operate BigQuery from Colaboratory, you can authenticate easily with the following two lines. When a URL is displayed, open it with your own account, get the one-time code that is issued, and paste it into the cell shown in the notebook.


from google.colab import auth
auth.authenticate_user()

1-1. Processing of purchasing data

Now, let's proceed with the process of actually aggregating the original purchase data by person and product.

from google.cloud import bigquery

# Here, the target period is January 2020.
from_day = "20200101"
to_day = "20200131"

query = f"""
SELECT
  id, Purchase goods, SUM(Purchase price) AS total amount
FROM `myproject.mydataset.transaction_*`
WHERE _table_suffix BETWEEN '{from_day}' AND '{to_day}'
GROUP BY id, Purchase goods
"""

# Project / dataset / table name for the output destination
project = "myproject"
client = bigquery.Client(project=project)
dataset = "mydataset"
ds = client.dataset(dataset)
table = "tmp"

job_config = bigquery.QueryJobConfig()
job_config.destination = ds.table(table)
job_config.write_disposition = "WRITE_TRUNCATE"  # Overwrite the table; use "WRITE_APPEND" to append instead
job_config.allow_large_results = True  # Allow large result sets; basically True
job_config.use_legacy_sql = False  # Use legacy SQL? Basically False
job = client.query(query, job_config=job_config)

result = job.result()

When the cell finishes without printing anything, the query result has been written to myproject.mydataset.tmp. The job_config settings are a little tedious, but at first write them out while being aware of what each one does. (Once you get used to it, you can just copy and paste.)
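
As a quick sanity check (a minimal sketch, not part of the original procedure), you can confirm that the destination table was actually written and see how many rows it holds:

# Optional: fetch the destination table's metadata and print its row count
tmp_table = client.get_table(f"{project}.{dataset}.{table}")
print(tmp_table.num_rows)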

If you run it so far, you will see the following table.

id  | Purchase goods | total amount
001 | product1       | 2500
001 | product2       | 1200
002 | product1       | 750
002 | product2       | 3300

1-2. Transforming into a per-person table

Next, I would like to create a pivot table of id × purchased product (with the purchase amount as values) to load into the machine learning model.

With Python and pandas, pivot_table makes this easy, but depending on the target period the data can be several GB or even tens of GB, and importing and processing it in Python takes extra time. So I do even the pivot conversion on BigQuery.
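
For reference, here is a minimal pandas sketch of the same pivot. It assumes the aggregated table is small enough to fit in memory, and it uses the illustrative column names from this article:

import pandas as pd

# Pull the aggregated table into memory and pivot it locally
df_small = client.query(
    f'SELECT * FROM `{project}.{dataset}.{table}`'
).to_dataframe()
df_pivot = df_small.pivot_table(
    index="id", columns="Purchase goods", values="total amount", aggfunc="sum"
)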


new_columns = []

# Create a list of the "purchased products" to spread horizontally, as a DataFrame
new_column = "Purchase goods"
query = f'SELECT DISTINCT {new_column} FROM `{project}.{dataset}.{table}` ORDER BY {new_column}'
df_product = client.query(query).to_dataframe()

# Loop over the DataFrame of purchased products to auto-generate the SQL fragments
for index, row in df_product.iterrows():
    item_frag = []
    condition_frag = []

    for i, v in zip(row.index, row.values):
        formula = f"{i}='{v}'"
        condition_frag.append(formula)
        item_frag.append(v)

    item = ''.join(item_frag)
    condition = ' AND '.join(condition_frag)
    query_frag = f'SUM(IF(({condition}), Purchase price, 0)) AS {new_column}_{item}'
    new_columns.append(query_frag)

# Join the list new_columns into a single string for the query
query_parts = ',\n'.join(new_columns)
query = f'SELECT id, {query_parts} FROM `{project}.{dataset}.{table}` GROUP BY id'

# Output destination table name
table2 = "pivoted_table"

# Executing on BigQuery works the same as above
job_config = bigquery.QueryJobConfig()
job_config.destination = ds.table(table2)
job_config.write_disposition = "WRITE_TRUNCATE"  # Overwrite the table; use "WRITE_APPEND" to append instead
job_config.allow_large_results = True  # Allow large result sets; basically True
job_config.use_legacy_sql = False  # Use legacy SQL? Basically False
job = client.query(query, job_config=job_config)

result = job.result()
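
If the generated query ever misbehaves, a handy first step (a small aside, not in the original flow) is to print it and inspect what the loop produced:

# Inspect the auto-generated pivot SQL
print(query)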

The process of building the SQL for the pivot table is a bit complicated because it loops twice, but what I want to end up with is the following SQL.

SELECT
  id,
  SUM(IF((Purchase goods='product1'), Purchase price, 0)) AS Purchase goods_product1,
  SUM(IF((Purchase goods='product2'), Purchase price, 0)) AS Purchase goods_product2,
  SUM(IF((Purchase goods='product3'), Purchase price, 0)) AS Purchase goods_product3,
  ...
FROM `myproject.mydataset.tmp`
GROUP BY id

This is the standard SQL pattern for pivoting a table; the SUM(IF(~)) parts are generated automatically according to the number of products. At this point, the following table has been saved as myproject.mydataset.pivoted_table.

id  | product1 | product2 | product3 | product4 | product5
001 | 2500     | 1200     | 1890     | 530      | null
002 | 750      | 3300     | null     | 1250     | 2000

2. Import into the Python environment

Finally, import the data created on BigQuery into the Python environment as a pandas DataFrame. Since we already imported a DataFrame the same way in the middle of the pivot conversion, there is hardly anything left to write at this point.

query = f'SELECT * FROM `{project}.{dataset}.{table2}`'
df = client.query(query).to_dataframe()
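
One caveat (my assumption, not something the original spells out): the pivoted table contains null for products a customer never bought, so before feeding the DataFrame to a model you will likely want to treat those as zero spend:

# Assumption: treat "never purchased" (null) as 0 before modeling
df = df.fillna(0)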

In conclusion

Although this series is called "basic machine learning procedure", this time I started with BigQuery operations. Authentication from Colaboratory is easy, so I feel the hurdle for working with BigQuery is low.

BigQuery is already plenty powerful as it is, but new features are released all the time. I want to keep making good use of them.
