[PYTHON] Preprocessing in machine learning 2 Data acquisition

Aidemy 2020/10/29

Introduction

Hello, it is Yope! I am a liberal arts student, but I was interested in the possibilities of AI, so I went to the AI-specialized school "Aidemy" to study. I would like to share the knowledge gained here with you, and I am summarizing it on Qiita. I am very happy that many people have read the previous summary article. Thank you! This is the second post of machine learning pre-processing. Nice to meet you.

What to learn this time ・ Data acquisition from Excel ・ Data acquisition from database

Data acquisition from Excel

Data read

-The first thing to do in data preprocessing is __ "Read data" __. -Data sources that read data include __ "files", "databases", "websites" __, etc., and this time we will deal with __Excel files and data acquisition from databases, which are often read. ..

-Read data from Excel with pandas. -As you learned in "Data Handling 2", use __pd.read_excel () __ to read the Excel file. -"File link" and "Sheet name in Excel" can be specified in this argument. -Also, you can pass the individually set __pd.ExcelFile ("file name") __ as the first argument.

Combine Excel data

-__ When you want to handle the data organized in multiple sheets as one data __, you need to perform data combination as preprocessing. -As an example, it is conceivable to combine the product data sheet A and the sales data sheet B in the horizontal direction. -Use __pd.merge (left, right, on =) __ for this merge. -Arguments "left" and "right" specify the sheet to be joined on the left / right side, and "on" specifies the name of the column used for joining.

・ Code![Screenshot 2020-10-29 14.43.14.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/e487d03e-75a2-0ef9- 8df3-f855081f1d55.png)

・ Result![Screenshot 2020-10-29 14.43.44.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/963997fe-836e-c819- f87a-563a13a7143a.png)

Filtering Excel data

-If you want to narrow down the data (row) that has a specific character string from the read or combined Excel data (sheet), use __DataFlame.query ('column with specified character string> character string') __ To do. -If you want to specify more than one of this character string, pass it in a list, and change the ">" part of the above code to "==" or "in".

-For example, if you want to extract lines with product_ids 1 and 3 of the data "df" read by pandas, do as follows.

df.query('product_id == [1,3]')

Data aggregation (grouping)

-There is a column called "type" in the loaded product sheet "df", where the data belongs to one of "food" representing food, "drink" representing beverages, and "sweet" representing sweets. And. -Of these, when you want to know the number of __ data belonging to "food" __ etc., it is necessary to aggregate and group the data belonging to "food". -Use __df.groupby ('name of column to group') __ for grouping. -At this time, when grouping with multiple columns, use __ ('column 1','column 2']) __. -The form of the data returned by this groupby method is "GroupBy object", and it is grouped by using methods such as __count (), mean (), sum (), max () __. It is possible to obtain the number of such data. (Already learned in the Pandas course)

・ In this example, do as follows.

gb = df.groupby('type')
gb['food'].count()

Data acquisition from database

Read data from database

-Reading from the database is done by __pd.read_sql () __, but the value passed as an argument is special. -In the first argument, specify the table and column of the database to be read as follows.

''' SELECT Column 1, Column 2 ("," is not required in the last column) FROM table '''

-In the second argument, pass the information of the database to be connected together as follows.

__sqla.create_engine ('connection database + driver name: // connection user name: password @ host name: port number / database name? charset = character code') __

-The flow of data reading is described below. (The information in the database is fictitious)

import sqalchemy as sqla
#Summarize database information
engine = sqla.create_engine('mysql+mysqldb://ngayope:ngayope@mysql-service:3307/database1?charset=utf8')
#From the "products" table of the database "database1", "product"_id "and" product_get "name"
pd.read_sql('''
SELECT
 product_id,
 product_name
FROM products
''',engine)

Join table

-The database tables can be merged with pd.merge (left, right, on =) as in Excel, but in the case of the database, they can be merged more easily. -For joins, use SQL JOIN. -The join method is described below (when joining table B to table A using column 1 as a key).

''' SELECT Table A. Column 1, Table B. Column 1 FROM Table A JOIN table B ON table A. column 1 = table B. column 1 (describe to add) '''

Data refinement

-__ Data narrowing __ is performed using the WHERE clause. When using it, do it in the same line as SELECT and FROM. -Use __ "OR" "AND" __ when you want to narrow down by multiple conditions. "AND" extracts items that satisfy all the conditions.

''' WHERE Table name. Column name = Value you want to extract 1 OR Table name. Column name = Value you want to extract 2 '''

・ Code up to this point (data shows the sales date and list price of the product)![Screenshot 2020-10-29 14.58.27.png](https://qiita-image-store.s3.ap- northeast-1.amazonaws.com/0/698700/f9e63923-93c6-3822-50d1-bb0b09407083.png)

・ Explanation of the above code: __ "SELECT" __ specifies the three data to be output this time. __ "FROM" __ specifies mlprep_sales_products, which is a table of "Which products were sold", and __ "JOIN ON" __ combines this with mlprep_sales and mlprep_products. Finally, in __ "WHERE" __, only those whose column "catalog_price" indicating the list price of the mlprep_products table is 200 or more are extracted.

Data aggregation

-I explained the method using GroupBy of pandas, but SQL also has __GROUP BY clause __ which has the same function, and data can be aggregated using this. -When aggregating data, join the data as needed and then aggregate the data into one table.

''' JOIN Table B ON Table A. Column 1 = Table B. Column 1 GROUP BY columns to aggregate '''

Summary

-Excel data read is _pd.read_excel () ___, data merge is __pd.merge () __, data narrowing is __df.query () __, data aggregation (grouping) is __df.groupby () _ Do with _. -Use __pd.read_sql () __ to read database data. -For the first argument, specify the table and column to be read by __ "SELECT column FROM table" __. -Similarly, use __ "JOIN" __ for data binding, __ "WHERE" __ for narrowing down, and __ "GROUP BY" __ for aggregation. -For the second argument, specify the database information summarized by __sqla.create_engine () __.

This time is over. Thank you for reading until the end.

Recommended Posts

Preprocessing in machine learning 2 Data acquisition
Python: Preprocessing in machine learning: Data acquisition
Machine learning in Delemas (data acquisition)
Preprocessing in machine learning 4 Data conversion
Python: Preprocessing in machine learning: Data conversion
Preprocessing in machine learning 1 Data analysis process
Python: Preprocessing in Machine Learning: Overview
Pre-processing in machine learning 3 Missing values, outliers, and imbalanced data
Data supply tricks using deques in machine learning
[Python] Data analysis, machine learning practice (Kaggle) -Data preprocessing-
I started machine learning with Python Data preprocessing
Python: Preprocessing in machine learning: Handling of missing, outlier, and imbalanced data
Data set for machine learning
Japanese preprocessing for machine learning
Machine learning in Delemas (practice)
Used in machine learning EDA
About data preprocessing of systems that use machine learning
Introduction to Machine Learning with scikit-learn-From data acquisition to parameter optimization
Classification and regression in machine learning
Machine learning
Performance verification of data preprocessing for machine learning (numerical data) (Part 1)
Random seed research in machine learning
Basic machine learning procedure: ② Prepare data
How to collect machine learning data
Machine learning imbalanced data sklearn with k-NN
[python] Frequently used techniques in machine learning
[Python] First data analysis / machine learning (Kaggle)
[Python] Saving learning results (models) in machine learning
[Updated Ver1.3.1] I made a data preprocessing library DataLiner for machine learning.
[Memo] Machine learning
Machine learning classification
Machine Learning sample
Multivariate LSTM and data preprocessing in TensorFlow 2.x
Machine learning Training data division and learning / prediction / verification
Summary of evaluation functions used in machine learning
Get a glimpse of machine learning in Python
Stock price forecast using deep learning [Data acquisition]
A story about data analysis by machine learning
[For beginners] Introduction to vectorization in machine learning
Machine learning tutorial summary
About machine learning overfitting
Build an interactive environment for machine learning in Python
Machine learning ⑤ AdaBoost Summary
Machine Learning: Supervised --AdaBoost
Sampling in imbalanced data
Tool MALSS (application) that supports machine learning in Python
Machine learning logistic regression
Coursera Machine Learning Challenges in Python: ex2 (Logistic Regression)
How to split machine learning training data into objective variables and others in Pandas
Tool MALSS (basic) that supports machine learning in Python
About testing in the implementation of machine learning models
Studying Machine Learning ~ matplotlib ~
Machine learning linear regression
Machine learning course memo
Machine learning library dlib
Coursera Machine Learning Challenges in Python: ex1 (Linear Regression)
Time series data prediction by AutoML (automatic machine learning)
Attempt to include machine learning model in python package
Cross-entropy to review in Coursera Machine Learning week 2 assignments
Preprocessing of prefecture data
xgboost: A valid machine learning model for table data