Python: Preprocessing in machine learning: Data acquisition

Data acquisition from Excel

Reading data from Excel

The first task in preprocessing is getting the data. Data can come from various types of sources, such as files, databases, and websites. This time, we learn how to get data from Excel and databases, two sources that are read from often. Data acquisition from CSV was introduced in Data Cleansing.

Reading data from Excel is easy with pandas. The data comes back as a pandas DataFrame, which also makes subsequent processing easier to write. For pandas, see About data processing with Pandas.

To read an Excel file with pandas, use the function pd.read_excel(io, sheet_name). Older pandas versions named the second argument sheetname.


For the first argument, io, you can pass an ExcelFile object created with pd.ExcelFile('The name of the file you want to read').

Check the official pandas documentation for more information on the pd.read_excel() function.

import pandas as pd

#Create an ExcelFile object
xlsx = pd.ExcelFile('The name of the file you want to read')

#Read one sheet of the Excel file into a DataFrame
df = pd.read_excel(xlsx, 'The name of the sheet you want to read')

Combine data read in Excel

Earlier, we loaded a single sheet of an Excel file. In actual preprocessing, data is often organized and saved across multiple sheets, and you may want to combine it into one set of data for analysis.

As a concrete example, consider the situation where each sheet has the following data.

image.png

In this situation, what kind of data should be created when you want to use both the selling price and the list price of the sold product as analysis data?

To do so, you need to join the data from the two sheets on product_id to create the following data.

image.png

This join operation uses the pandas function pd.merge(left, right, on).


Schematically, the arguments of the merge function for the previous example look like this:

pd.merge(sales sheet DataFrame, product sheet DataFrame, on='product_id')

See the official documentation for more information on the merge function. The following is an example.

import pandas as pd
xlsx = pd.ExcelFile('file name')
df1 = pd.read_excel(xlsx, 'employees')
df2 = pd.read_excel(xlsx, 'projects')
df3 = pd.read_excel(xlsx, 'project_assigns')

#Join employees to their assignments, then attach the project details
df13 = pd.merge(df1, df3, on='employee_id')
df123 = pd.merge(df13, df2, on='project_id')
df123
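The sales and product case described earlier can be sketched the same way without an Excel file. The two DataFrames below are hypothetical stand-ins for the two sheets. The column names other than product_id, and all values, are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the sales sheet (illustrative values)
sales = pd.DataFrame({
    "sales_id": [1, 2, 3],
    "product_id": [10, 20, 10],
    "sales_price": [90, 180, 95],
})

# Hypothetical stand-in for the product sheet (illustrative values)
products = pd.DataFrame({
    "product_id": [10, 20],
    "list_price": [100, 200],
})

# Join on the shared key: each sales row gains the matching list price
merged = pd.merge(sales, products, on="product_id")
print(merged)
```

pd.merge performs an inner join by default, so rows whose product_id has no match in the other DataFrame are dropped. Passing how='left' keeps every row of the left DataFrame instead.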

Narrowing down the data read by Excel

Given the following joined data, what if you want to know only the members participating in "Marketing Project" and "System Development Project"?

image.png

In this case, you need to narrow down the data in the table above.

There are several ways to do this, but the pandas method DataFrame.query makes the code easier to write and read.

In the function DataFrame.query, various conditions can be specified as strings as follows.

df.query('employee_id > 1')
df.query('(employee_id > 1) & (project_id > 3)')
df.query('project_id in [1, 2]')
df.query('project_id == [1, 2]')

To find only the members participating in "Marketing Project" and "System Development Project", narrow the data down to rows whose project_id is 1 or 3, as follows.

df.query('project_id in [1, 3]')
#Or
df.query('project_id == [1, 3]')

See the official documentation for various examples of DataFrame.query.
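As a rough sketch of the filtering above, the following compares DataFrame.query with an equivalent boolean mask. The DataFrame is a hypothetical stand-in for the joined table. The rows and the "Sales Project" name are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical joined table (illustrative rows)
df = pd.DataFrame({
    "employee_id": [1, 2, 3, 4],
    "project_id": [1, 2, 3, 1],
    "project_name": [
        "Marketing Project",
        "Sales Project",
        "System Development Project",
        "Marketing Project",
    ],
})

# Filter with a query string
by_query = df.query("project_id in [1, 3]")

# The same filter written as a boolean mask
by_mask = df[df["project_id"].isin([1, 3])]

# A Python variable can be referenced inside the string with the @ prefix
wanted = [1, 3]
by_variable = df.query("project_id in @wanted")

print(by_query)
```

All three forms select the same rows. The query string tends to read better once several conditions are combined.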

Aggregation of data read by Excel

Given the list of project members, you may want to know the number of members in each project. To do this, group the data by project and count the rows in each group.

You can use the pandas method DataFrame.groupby for this.

DataFrame.groupby allows you to specify columns to group in the following format:

df.groupby('Column name you want to group')
#If you want to group by multiple columns, do the following:

df.groupby(['Column 1', 'Column 2'])

The groupby method returns a GroupBy object, not a DataFrame object. The GroupBy object provides convenient methods such as count(), mean(), sum(), and max() that are applied to each group.

See the official documentation for the methods available in the GroupBy object.

Here is a practical example.

import pandas as pd
xlsx = pd.ExcelFile('file name')
df1 = pd.read_excel(xlsx, 'employees')
df2 = pd.read_excel(xlsx, 'projects')
df3 = pd.read_excel(xlsx, 'project_assigns')

df13 = pd.merge(df1, df3, on='employee_id')
df = pd.merge(df13, df2, on='project_id')
df.groupby('project_name').count()['employee_id']

Output result

image.png
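The same aggregation can be tried without the Excel file. This is a minimal sketch in which the DataFrame is a hypothetical stand-in for the joined table, with illustrative rows.

```python
import pandas as pd

# Hypothetical joined table (illustrative rows)
df = pd.DataFrame({
    "employee_id": [1, 2, 3, 4, 5],
    "project_name": [
        "Marketing Project",
        "Marketing Project",
        "Sales Project",
        "System Development Project",
        "Marketing Project",
    ],
})

# Select the column first, then count the non-null values in each group
members = df.groupby("project_name")["employee_id"].count()

# size() counts the rows in each group, including rows with missing values
rows = df.groupby("project_name").size()

print(members)
```

Selecting the column before calling count() gives the same result as count()['employee_id'] but avoids counting every other column first.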

Data acquisition from database

Read data from database

When reading data from a database during preprocessing, it is convenient to use the pandas read_sql function, which returns the data as a DataFrame. Connecting to the database requires the following connection information, passed to sqlalchemy's create_engine.

import sqlalchemy as sqla
engine = sqla.create_engine('mysql+mysqldb://ai:ai@mysql-service:3307/ai_mlprep?charset=utf8')
#The connection string passed to create_engine has the following format:
#database type+driver name://user name:password@host name:port number/database name?charset=character encoding

To read data from a database table, use the following SQL SELECT statement.

SELECT
Table column name 1,
Table column name 2
FROM table name

Then pass the SQL to the pandas read_sql function as follows. Writing the SQL string across multiple lines inside ''' (triple quotes) makes the code easier to read.

pd.read_sql('''
SELECT
  column1,
  column2
FROM table1
''', engine)

In the second argument of read_sql, pass the engine for database connection created by sqlalchemy.
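To try read_sql without a MySQL server, the sketch below swaps in an in-memory SQLite database from Python's standard library. This is an assumption made for illustration, not the setup used in this article. pandas accepts a sqlite3 connection in place of a sqlalchemy engine. The table contents are made up.

```python
import sqlite3

import pandas as pd

# Assumption: an in-memory SQLite database stands in for the MySQL server
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (column1 INTEGER, column2 TEXT, column3 REAL)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [(1, "a", 0.5), (2, "b", 1.5)],
)
conn.commit()

# Only the columns named in the SELECT statement end up in the DataFrame
df = pd.read_sql('''
SELECT
  column1,
  column2
FROM table1
''', conn)
conn.close()

print(df)
```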

Joining data using JOIN

We have prepared the following tables.

image.png

In this situation, what should you do if you want to compare the list price of the products actually sold with their selling price? You need to join the sales product table, which includes the selling price, and the product table, which contains the list price, on a column they have in common.

Just as when reading data from Excel, you could read each table into a DataFrame and merge them with the pandas merge function. When the tables are in a database, however, the join can be written concisely in SQL, and only the joined result is retrieved.

To join table A and table B of the database using column 1 of each as the key, write SQL with JOIN ON as shown below.

SELECT
  Table A.Column 1,
  Table B.Column 1
FROM
  Table A
  JOIN Table B ON Table A.Column 1 = Table B.Column 1

When joining three or more tables, write the JOIN ON clauses one after another as shown below.

SELECT
  Table A.Column 1,
  Table B.Column 1
FROM
  Table A
  JOIN Table B ON Table A.Column 1 = Table B.Column 1
  JOIN Table C ON Table B.Column 2 = Table C.Column 2
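The list price comparison described above might look like the following sketch. It assumes an in-memory SQLite database in place of the article's MySQL server. The table names sales_products and products, their columns, and all rows are hypothetical.

```python
import sqlite3

import pandas as pd

# Assumption: in-memory SQLite with hypothetical tables and rows
conn = sqlite3.connect(":memory:")
conn.executescript('''
CREATE TABLE sales_products (sales_id INTEGER, product_id INTEGER, sales_price INTEGER);
CREATE TABLE products (product_id INTEGER, product_name TEXT, list_price INTEGER);
INSERT INTO sales_products VALUES (1, 4, 90), (1, 6, 140), (2, 4, 95);
INSERT INTO products VALUES (4, 'tea', 100), (6, 'coffee', 150);
''')

# The database performs the join, so only the joined rows are transferred
df = pd.read_sql('''
SELECT
  products.product_name,
  products.list_price,
  sales_products.sales_price
FROM
  sales_products
  JOIN products ON sales_products.product_id = products.product_id
''', conn)
conn.close()

# Discount of each sale relative to the list price
df["discount"] = df["list_price"] - df["sales_price"]
print(df)
```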

Narrowing down data using WHERE

We use the same tables as before.

Suppose that, for an analysis, you want to look at trends in when drinks sell. What should you do?

To do this, you need to do the following two things.

The sales date is in the sales table and the products sold are in the sales product table, so join the two tables.

Since the output should cover only drinks, narrow it down to the product_id values of drinks.

In this case, specify product_id = 4 or product_id = 6 in the SQL WHERE clause to narrow down the output.

Here's how to write SQL that meets the above requirements and includes a WHERE clause to filter the output:

SELECT
  product_id,
  sales_year,
  sales_month
FROM
  mlprep_sales_products JOIN mlprep_sales ON mlprep_sales_products.sales_id = mlprep_sales.sales_id
WHERE
  mlprep_sales_products.product_id = 4 OR
  mlprep_sales_products.product_id = 6

As shown above, you can combine multiple conditions with OR in the WHERE clause. The conditions above use =, but inequality signs such as < and > can also be used. Although not shown in the example, multiple conditions can also be combined with AND, in which case only rows that meet all of the conditions are output.
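The sketch below runs a query of this shape against an in-memory SQLite database, which is an assumption standing in for the article's MySQL server. The table and column names follow the query above, and the rows are made up. It uses IN (4, 6), which is equivalent to the two conditions joined with OR, and adds an AND condition with an inequality.

```python
import sqlite3

import pandas as pd

# Assumption: in-memory SQLite with illustrative rows
conn = sqlite3.connect(":memory:")
conn.executescript('''
CREATE TABLE mlprep_sales (sales_id INTEGER, sales_year INTEGER, sales_month INTEGER);
CREATE TABLE mlprep_sales_products (sales_id INTEGER, product_id INTEGER);
INSERT INTO mlprep_sales VALUES (1, 2017, 7), (2, 2017, 12), (3, 2018, 1);
INSERT INTO mlprep_sales_products VALUES (1, 4), (1, 5), (2, 6), (3, 4);
''')

# Shared SELECT and JOIN part of both queries
base = '''
SELECT
  mlprep_sales_products.product_id,
  mlprep_sales.sales_year,
  mlprep_sales.sales_month
FROM
  mlprep_sales_products
  JOIN mlprep_sales ON mlprep_sales_products.sales_id = mlprep_sales.sales_id
'''

# IN (4, 6) matches rows where product_id is 4 or 6
drinks = pd.read_sql(
    base + "WHERE mlprep_sales_products.product_id IN (4, 6)",
    conn,
)

# AND keeps only rows that satisfy both conditions
drinks_before_2018 = pd.read_sql(
    base
    + "WHERE mlprep_sales_products.product_id IN (4, 6)"
    + " AND mlprep_sales.sales_year < 2018",
    conn,
)
conn.close()

print(drinks)
print(drinks_before_2018)
```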

Aggregate data using GROUP BY

We use the same tables again.

Given this data, what if you want to use the total selling price of all products by year and month for analysis?

You can use pandas' GroupBy, but you can do the same with SQL.

The total selling price for each year and month cannot be calculated with a JOIN or a WHERE filter alone. The following two points need to be addressed.

1. The sales table holds the year and month, and the sales product table holds the selling price, so join the two tables to bring the data to be aggregated into one table.

2. Aggregate the data by year and month and calculate the total.

Step 1 can be achieved with JOIN, while step 2 uses the SQL GROUP BY clause.

The GROUP BY clause lists the columns to group by, and it is applied to the final table after any joins. If the aggregation has two levels, such as year and month, write the two columns separated by commas.

You can also specify an aggregate function to apply to the grouped data. Here, we want the total selling price for each year and month, so we use SUM over sales price × sales quantity (sales_price * sales_amount).

SELECT
  sales_year,
  sales_month,
  SUM(sales_price * sales_amount)
FROM
  mlprep_sales
  JOIN mlprep_sales_products ON mlprep_sales.sales_id = mlprep_sales_products.sales_id
GROUP BY sales_year, sales_month
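A query of this shape can be tried on an in-memory SQLite database, used here as an assumption in place of the article's MySQL server. The table and column names follow the query above, and the rows are made up. The AS total_sales alias and the ORDER BY clause are additions for readable, predictably ordered output.

```python
import sqlite3

import pandas as pd

# Assumption: in-memory SQLite with illustrative rows
conn = sqlite3.connect(":memory:")
conn.executescript('''
CREATE TABLE mlprep_sales (sales_id INTEGER, sales_year INTEGER, sales_month INTEGER);
CREATE TABLE mlprep_sales_products (
  sales_id INTEGER, product_id INTEGER, sales_price INTEGER, sales_amount INTEGER
);
INSERT INTO mlprep_sales VALUES (1, 2017, 7), (2, 2017, 7), (3, 2017, 8);
INSERT INTO mlprep_sales_products VALUES (1, 4, 100, 2), (2, 6, 150, 1), (3, 4, 100, 3);
''')

# One output row per (year, month) pair, with price * amount summed per group
totals = pd.read_sql('''
SELECT
  sales_year,
  sales_month,
  SUM(sales_price * sales_amount) AS total_sales
FROM
  mlprep_sales
  JOIN mlprep_sales_products ON mlprep_sales.sales_id = mlprep_sales_products.sales_id
GROUP BY sales_year, sales_month
ORDER BY sales_year, sales_month
''', conn)
conn.close()

print(totals)
```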
