Data acquisition from Excel

Reading data from Excel

The first task in data preprocessing is acquiring the data. Data can come from many kinds of sources, such as files, databases, and websites. This article covers how to get data from two sources that are read frequently:

Excel file
Database

Data acquisition from CSV was introduced in Data Cleansing.

Reading data from Excel is easy with pandas. pandas returns the data as a DataFrame, which also makes subsequent processing easier to write. For pandas, see About data processing with Pandas.

To read an Excel file with pandas, use the function pd.read_excel(io, sheet_name).

The first argument, io, can be an ExcelFile object created with the function pd.ExcelFile('The name of the file you want to read').

Check the official pandas documentation for more information on the pd.read_excel() function.

import pandas as pd

# Create an ExcelFile object
xlsx = pd.ExcelFile('The name of the file you want to read')

# Read one sheet of the Excel file into a DataFrame
df = pd.read_excel(xlsx, 'The name of the sheet you want to read')

Combining data read from Excel

Earlier, we loaded a single sheet of an Excel file. In actual preprocessing, the data is often organized and saved across multiple sheets, and you may want to use it for analysis as a single set of data.

As a concrete example, consider the situation where each sheet has the following data.

In this situation, what kind of data should be created when you want to use both the selling price and the list price of the sold product as analysis data?

To do so, you need to join the data from the two sheets on product_id to create the following data.

This join operation uses the pandas function pd.merge(left, right, on).

Schematically, the arguments of the merge function for the previous example look like this:

pd.merge(sales sheet data frame, product sheet data frame, on='product_id')

See the official documentation for more information on the merge function. The following is an example.

import pandas as pd

# Read the three sheets from the same workbook
xlsx = pd.ExcelFile('file name')
df1 = pd.read_excel(xlsx, 'employees')
df2 = pd.read_excel(xlsx, 'projects')
df3 = pd.read_excel(xlsx, 'project_assigns')

# Join employees to their assignments, then join the result to projects
df13 = pd.merge(df1, df3, on='employee_id')
df123 = pd.merge(df13, df2, on='project_id')
df123
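
As a minimal sketch of the product_id join described earlier, the two sheets can be replaced by small in-memory DataFrames, so no Excel file is needed to try it. The column names other than product_id and all of the values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the sales sheet (selling price per sale)
sales = pd.DataFrame({
    "sales_id": [1, 2, 3],
    "product_id": [10, 20, 10],
    "sales_price": [90, 180, 95],
})

# Hypothetical stand-in for the product sheet (list price per product)
products = pd.DataFrame({
    "product_id": [10, 20],
    "list_price": [100, 200],
})

# Join the two frames on the shared product_id column
merged = pd.merge(sales, products, on="product_id")
print(merged)
```

pd.merge performs an inner join by default, so rows whose product_id appears in only one of the two frames are dropped from the result.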

Narrowing down the data read from Excel

From the combined data below, what if you want to know only the members participating in "Marketing Project" and "System Development Project"?

In this case, you need to narrow down the data in the table above.

There are several ways to do this, but the pandas method DataFrame.query makes the code easier to write and read.

With DataFrame.query, various conditions can be specified as strings, as follows.

df.query('employee_id > 1')
df.query('(employee_id > 1) & (project_id > 3)')
df.query('project_id in [1, 2]')
df.query('project_id == [1, 2]')

If you want only the members participating in "Marketing Project" and "System Development Project", you only need to narrow down the rows to those whose project_id is 1 or 3, so write as follows.

df.query('project_id in [1, 3]')
#Or
df.query('project_id == [1, 3]')

See the official documentation for various examples of DataFrame.query.
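
The filter on project_id 1 and 3 can be tried on a small in-memory DataFrame. This is only a sketch: the rows are hypothetical, and the boolean-mask version with isin is shown as one of the other ways to narrow down data, not as something the steps above require.

```python
import pandas as pd

# Hypothetical assignment data: which employee is on which project
df = pd.DataFrame({
    "employee_id": [1, 2, 3, 4],
    "project_id": [1, 2, 3, 1],
})

# Filter with DataFrame.query, passing the condition as a string
by_query = df.query("project_id in [1, 3]")

# Equivalent filter written as a boolean mask
by_mask = df[df["project_id"].isin([1, 3])]

print(by_query)
```

Both forms keep the original row index, so the two results can be compared directly.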

Aggregation of data read from Excel

From the list of project members, you may want to know the number of members in each project. To do this, group the data by project and count the rows in each group.

For this, you can use the pandas method DataFrame.groupby.

DataFrame.groupby allows you to specify columns to group in the following format:

df.groupby('Column name you want to group')

If you want to group by multiple columns, do the following:

df.groupby(['Column 1', 'Column 2'])

The groupby method returns a GroupBy object, not a DataFrame object. The GroupBy object provides convenient methods such as count(), mean(), sum(), and max() that are applied to each group.

See the official documentation for the methods available in the GroupBy object.
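
A small sketch of how the GroupBy object is used, on hypothetical in-memory data. The hours column and the project names "A" and "B" are assumptions made up for this illustration.

```python
import pandas as pd

# Hypothetical joined data: one row per employee assignment
df = pd.DataFrame({
    "employee_id": [1, 2, 3, 4, 5],
    "project_name": ["A", "A", "B", "B", "B"],
    "hours": [10, 20, 5, 15, 10],
})

# groupby returns a GroupBy object; nothing is aggregated yet
grouped = df.groupby("project_name")

# Apply an aggregation method to one column of each group
members = grouped["employee_id"].count()  # number of rows per project
total_hours = grouped["hours"].sum()      # sum of hours per project

print(members)
print(total_hours)
```

Selecting the column before calling count() or sum() aggregates only that column, rather than every column in the frame.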

Here is a practical example.

import pandas as pd

# Read the three sheets from the same workbook
xlsx = pd.ExcelFile('file name')
df1 = pd.read_excel(xlsx, 'employees')
df2 = pd.read_excel(xlsx, 'projects')
df3 = pd.read_excel(xlsx, 'project_assigns')

# Join the sheets into one DataFrame
df13 = pd.merge(df1, df3, on='employee_id')
df = pd.merge(df13, df2, on='project_id')

# Count the members in each project
df.groupby('project_name')['employee_id'].count()

Output result

Data acquisition from database

Read data from database

When reading data from the database in preprocessing, it is convenient to use the read_sql function of pandas to get the data in DataFrame format. The following information is required to connect to the database.

Type of database to connect to (MySQL, PostgreSQL, Oracle, etc.)
Connection information to the connection destination database (database name, host name, port number, connection user authentication information, character code)
Driver information used for connection (mysqlclient, mysqlDB, mysql-connector-python, oursql, etc.)

Note that mysqlclient is a MySQL driver that SQLAlchemy uses internally, so it does not appear directly in the code.

import sqlalchemy as sqla

# Create the engine used to connect to the database
engine = sqla.create_engine('mysql+mysqldb://ai:ai@mysql-service:3307/ai_mlprep?charset=utf8')

The format of the connection string passed to create_engine is as follows.

Database type+Driver name://User name:Password@Host name:Port number/Database name?charset=Character code
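
To make the parts of the connection string easier to see, here is a sketch that assembles it from its components. build_connection_url is a hypothetical helper written for this illustration and is not part of SQLAlchemy; percent-encoding the user name and password is an added precaution for credentials that contain characters such as @ or :.

```python
from urllib.parse import quote


def build_connection_url(dialect, driver, user, password, host, port,
                         database, charset):
    """Assemble a connection string from its parts (hypothetical helper)."""
    # Percent-encode the credentials so that characters such as '@' or ':'
    # cannot be mistaken for URL separators
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return (
        f"{dialect}+{driver}://{safe_user}:{safe_password}"
        f"@{host}:{port}/{database}?charset={charset}"
    )


# Rebuild the connection string used in this article
url = build_connection_url("mysql", "mysqldb", "ai", "ai",
                           "mysql-service", 3307, "ai_mlprep", "utf8")
print(url)
```

The resulting string can be passed to create_engine in place of the hand-written literal.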

To read data from a database table, use the following SQL SELECT statement.

SELECT
Table column name 1,
Table column name 2
FROM table name

Then use the pandas read_sql function as follows. Writing the SQL string across multiple lines with ''' (triple quotes) makes the code easier to read.

pd.read_sql('''
SELECT
  column1,
  column2
FROM table1
''', engine)

In the second argument of read_sql, pass the engine for database connection created by sqlalchemy.
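
To try read_sql without a MySQL server, the sketch below swaps in an in-memory SQLite database from the Python standard library. This is an assumption made for illustration only: pandas also accepts a sqlite3 connection as the second argument, and the table contents are hypothetical.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database used as a stand-in for the MySQL server
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (column1 INTEGER, column2 TEXT, column3 REAL)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [(1, "a", 0.5), (2, "b", 1.5)],  # hypothetical rows
)

# Only the selected columns are returned, as a DataFrame
df = pd.read_sql('''
SELECT
  column1,
  column2
FROM table1
''', conn)

print(df)
conn.close()
```

With a SQLAlchemy engine, the call to read_sql is the same; only the second argument changes.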

Joining data using JOIN

We have prepared the following tables.

At this point, what should you do if you want to compare the list price of the products actually sold with their selling price? To do this, you need to join the sales product table, which contains the selling price, with the product table, which contains the list price, using a column they have in common.

As with data read from Excel, you can read each table into a DataFrame and then merge the DataFrames with the pandas merge function. However, when the tables are in a database, SQL lets you write the join concisely and retrieve only the joined result.

To join table A and table B of the database using column 1 of each table as the key, write SQL using JOIN ON as shown below.

SELECT
  Table A.Column 1,
  Table B.Column 1
FROM
  Table A
  JOIN Table B ON Table A.Column 1 = Table B.Column 1

When joining three or more tables, write the JOIN ON clauses consecutively, as shown below.

SELECT
  Table A.Column 1,
  Table B.Column 1
FROM
  Table A
  JOIN Table B ON Table A.Column 1 = Table B.Column 1
  JOIN Table C ON Table B.Column 2 = Table C.Column 2
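
The consecutive JOIN ON pattern can be run end to end with Python's built-in sqlite3 module. In this sketch, SQLite stands in for the real database, and the three tables, their columns, and their rows are all hypothetical.

```python
import sqlite3

# In-memory SQLite database with three small hypothetical tables
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (employee_id INTEGER, employee_name TEXT);
CREATE TABLE projects (project_id INTEGER, project_name TEXT);
CREATE TABLE project_assigns (employee_id INTEGER, project_id INTEGER);
INSERT INTO employees VALUES (1, 'Sato'), (2, 'Suzuki');
INSERT INTO projects VALUES (1, 'Marketing'), (2, 'Development');
INSERT INTO project_assigns VALUES (1, 1), (2, 1), (2, 2);
""")

# Two consecutive JOIN ON clauses link the three tables
rows = conn.execute("""
SELECT
  employees.employee_name,
  projects.project_name
FROM
  project_assigns
  JOIN employees ON project_assigns.employee_id = employees.employee_id
  JOIN projects ON project_assigns.project_id = projects.project_id
ORDER BY employees.employee_id, projects.project_id
""").fetchall()

print(rows)
conn.close()
```

The ORDER BY clause is added only to make the row order predictable; it is not part of the join itself.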

Narrowing down data using WHERE

The table structure is the same as in the previous section.

Suppose that, for an analysis, you want to look at trends in when drinks sell. What should you do?

To do this, you need to do the following two things.

Join the sales table and the sales product table, because the sales date is stored in the sales table and the products sold are stored in the sales product table.

Narrow down the output to the product_id values of drinks, because only drinks are of interest.

Of the products included in the product table, the only drinks are tea with product_id 4 and orange juice with product_id 6.

In this case, specify product_id = 4 or product_id = 6 in the SQL WHERE clause to narrow down the output.

Here's how to write SQL that meets the above requirements and includes a WHERE clause to filter the output:

SELECT
  product_id,
  sales_year,
  sales_month
FROM
  mlprep_sales_products JOIN mlprep_sales ON mlprep_sales_products.sales_id = mlprep_sales.sales_id
WHERE
  mlprep_sales_products.product_id = 4 OR
  mlprep_sales_products.product_id = 6

As shown above, you can specify multiple conditions using OR in the WHERE clause. The conditions above use =, but inequality signs such as < and > can also be used. Although not shown in the example, multiple conditions can also be combined using AND in addition to OR. If you use AND, only data that meets all of the conditions is output.
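
The following sketch runs both kinds of condition against an in-memory SQLite database, used here as a stand-in for the real one. The table names follow the query above, but the rows are hypothetical. The first query uses IN, a shorthand for a chain of OR conditions on the same column, together with ? placeholders; both are swapped in for illustration. The second query combines two inequalities with AND.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mlprep_sales (sales_id INTEGER, sales_year INTEGER, sales_month INTEGER);
CREATE TABLE mlprep_sales_products (sales_id INTEGER, product_id INTEGER);
INSERT INTO mlprep_sales VALUES (1, 2017, 7), (2, 2017, 8), (3, 2017, 12);
INSERT INTO mlprep_sales_products VALUES (1, 4), (1, 2), (2, 6), (3, 5);
""")

# IN (4, 6) is equivalent to product_id = 4 OR product_id = 6
drink_rows = conn.execute("""
SELECT
  product_id,
  sales_year,
  sales_month
FROM
  mlprep_sales_products
  JOIN mlprep_sales ON mlprep_sales_products.sales_id = mlprep_sales.sales_id
WHERE
  mlprep_sales_products.product_id IN (?, ?)
ORDER BY mlprep_sales.sales_id
""", (4, 6)).fetchall()

# AND keeps only rows that satisfy both inequality conditions
summer_ids = conn.execute("""
SELECT sales_id
FROM mlprep_sales
WHERE sales_month > 6 AND sales_month < 9
ORDER BY sales_id
""").fetchall()

print(drink_rows)
print(summer_ids)
conn.close()
```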

Aggregate data using GROUP BY

The table structure is again the same.

Given this data, what if you want to use the total selling price of all products by year and month for analysis?

You can use pandas' GroupBy, but you can do the same with SQL.

The total selling price for each year and month cannot be calculated with JOIN and WHERE alone. The following two points need to be addressed.

1. The sales table holds the year and month, and the sales product table holds the selling price, so join the two tables to bring the data to be aggregated into one table.

2. Aggregate the data by year and month and calculate the total.

Point 1 can be achieved with JOIN, while point 2 uses the SQL GROUP BY clause.

In the GROUP BY clause, specify the columns to group by for the final table obtained after joining the tables (if needed), as shown below. If the aggregation has two levels, such as year and month, write the two columns separated by commas.

You can also specify an aggregate function to apply to the grouped data. Here, we want the total selling price for each year and month, so we use SUM to add up the selling price multiplied by the sales quantity (sales_price * sales_amount).

SELECT
  sales_year,
  sales_month,
  SUM(sales_price * sales_amount)
FROM
  mlprep_sales
  JOIN mlprep_sales_products ON mlprep_sales.sales_id = mlprep_sales_products.sales_id
GROUP BY sales_year, sales_month
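
To check the arithmetic, the same query can be run against an in-memory SQLite database, used here as a stand-in for the real one. The rows are hypothetical, and the alias total_sales and the ORDER BY clause are additions made only to give the result a readable column name and a predictable row order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mlprep_sales (sales_id INTEGER, sales_year INTEGER, sales_month INTEGER);
CREATE TABLE mlprep_sales_products (
  sales_id INTEGER, product_id INTEGER, sales_price INTEGER, sales_amount INTEGER
);
INSERT INTO mlprep_sales VALUES (1, 2017, 7), (2, 2017, 7), (3, 2017, 8);
INSERT INTO mlprep_sales_products VALUES
  (1, 4, 100, 2),
  (2, 6, 150, 1),
  (3, 4, 100, 3);
""")

# Join the tables, then total price * amount within each (year, month) group
totals = conn.execute("""
SELECT
  sales_year,
  sales_month,
  SUM(sales_price * sales_amount) AS total_sales
FROM
  mlprep_sales
  JOIN mlprep_sales_products ON mlprep_sales.sales_id = mlprep_sales_products.sales_id
GROUP BY sales_year, sales_month
ORDER BY sales_year, sales_month
""").fetchall()

print(totals)
conn.close()
```

For July 2017 the two sales contribute 100 * 2 + 150 * 1 = 350, and for August 2017 the single sale contributes 100 * 3 = 300.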

Python: Preprocessing in machine learning: Data acquisition