Python Application: Data Handling Part 2: Parsing Various Data Formats

Various data formats

File input / output using the pandas library

Each format has its own main use: HTML for web pages, JSON for web APIs, and CSV and Excel for organizing data. With the pandas library, data can be converted between these formats.

HTML file structure

An HTML file is a data format that describes the contents of a web page.

Because every web page is written in HTML, once you know how to parse HTML, any web page can be a target of analysis. Extracting information from HTML files on the web is called scraping. In Python, you can scrape with the following libraries.

pandas library: scraping table elements in HTML files
Separate libraries such as BeautifulSoup and lxml: scraping elements other than tables

JSON file structure

JSON is an abbreviation for "JavaScript Object Notation". It is a text format originally modeled on the notation of the programming language JavaScript.

JSON is a text format that is independent of the JavaScript language. Because most programming languages support reading and writing it, it is often used to exchange data between different programming languages.

The structure of a JSON file is basically the same as that of a Python dictionary. Key and value pairs are written inside curly braces {}, separated by commas, with a colon (:) between each key and its value.
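
The similarity to a Python dictionary can be checked with the standard json module. The following is a minimal sketch; the keys and values (name, price, tags) are made-up sample data, not taken from any real API.

```python
import json

# A small JSON document; the keys and values are hypothetical sample data.
json_text = '{"name": "apple", "price": 120, "tags": ["fruit", "red"]}'

# json.loads() parses JSON text into Python objects:
# a JSON object becomes a dict and a JSON array becomes a list.
record = json.loads(json_text)
print(type(record).__name__)
print(record["price"])

# json.dumps() converts the dictionary back into JSON text.
round_trip = json.dumps(record)
print(round_trip)
```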

CSV file structure

CSV is an abbreviation for "Comma Separated Values", a data format in which values are separated by commas.

Because CSV files are saved as plain text, the data can be opened without depending on any specific software.

The data structure is simple, there is no extra metadata, and the files are lightweight. CSV has long been used to exchange data between spreadsheet software and database software.

The structure of a CSV file is very simple: each line is one row, and the values within a line are separated by commas to represent columns. This makes it possible to describe tabular data concisely.
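
To make the row and column structure concrete, here is a small sketch using the standard csv module. The column names (date, price) and the values are hypothetical sample data.

```python
import csv
import io

# Hypothetical sample data: the first line is a header,
# and each later line is one row of the table.
csv_text = "date,price\n2020-01-06,100\n2020-01-07,105\n"

# csv.reader splits each line at the commas and yields one list of strings per row.
rows = list(csv.reader(io.StringIO(csv_text)))
header, records = rows[0], rows[1:]

print(header)
print(records)
```

Note that every value comes back as a string; converting "100" to a number is left to the program that reads the file.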

Excel file structure

Excel is spreadsheet software used all over the world. Many companies, public institutions, and other organizations use it and publish information in Excel file format.

Therefore, being able to handle Excel files when collecting and analyzing data with Python greatly expands the range of data analysis.

When you handle Excel files in spreadsheet software, you operate on them graphically and do not have to be very conscious of their structure. When working with Excel files from a programming language, however, you use the following terms to specify what you want to do, so remember them.

Term     Details
book     Excel file
sheet    Sheet in the book
row      Row
column   Column
cell     Cell
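
The terms above map directly onto objects in Excel libraries. The sketch below uses openpyxl, which this article does not mention; it is an assumption that the package is installed. The sheet title and the values are hypothetical, and the workbook is built in memory, so no file is needed.

```python
from openpyxl import Workbook

book = Workbook()                  # book: the Excel file itself
sheet = book.active                # sheet: one sheet inside the book
sheet.title = "prices"             # hypothetical sheet name

sheet.append(["item", "price"])    # row 1: header
sheet.append(["apple", 120])       # row 2: data

cell = sheet["B2"]                 # cell: column B, row 2
print(sheet.title, cell.coordinate, cell.value)
print(sheet.max_row, sheet.max_column)
```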

DataFrame and conversion of each data format

Read a file into a DataFrame

Use the read_***() functions of the pandas library to read files such as HTML, JSON, and CSV.

read_***()
# Use this function to load a file.
# *** is replaced by a different name for each file format.

For HTML files, use the read_html() function; for Excel files, the read_excel() function; and so on.

The pandas library also supports formats other than those listed in the table below, and they can likewise be read with a read_***() function. The loaded file is converted to a DataFrame object, so you can process it with the various features of pandas.

File format   Function
HTML          read_html()
JSON          read_json()
CSV           read_csv()
Excel         read_excel()

For example, to parse an HTML file with pandas, use the read_html() function. By passing the path or URL of the HTML file as the argument, you can generate DataFrame objects from the table elements in the file. read_html() returns a list of DataFrame objects, one for each table it finds.

import pandas as pd  
tables = pd.read_html("HTML file you want to parse")
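
The following is a runnable sketch of the same call, using a small HTML document held in memory instead of a file or URL. The HTML content is hypothetical, and the sketch assumes that an HTML parser used by pandas (lxml, or BeautifulSoup with html5lib) is installed.

```python
import io

import pandas as pd

# Hypothetical HTML with one table; the paragraph is not part of any table.
html = """
<html><body>
<p>Text outside the table is not returned by read_html().</p>
<table>
  <thead><tr><th>item</th><th>price</th></tr></thead>
  <tbody>
    <tr><td>apple</td><td>120</td></tr>
    <tr><td>banana</td><td>80</td></tr>
  </tbody>
</table>
</body></html>
"""

# read_html() returns a list with one DataFrame per <table> element.
tables = pd.read_html(io.StringIO(html))
first_table = tables[0]

print(len(tables))
print(first_table)
```

Because the result is a list, you pick the table you need by index, even when the page contains only one table.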

Export to a file from a DataFrame

Use the to_***() functions of a pandas DataFrame object to export it as a file such as an HTML file, JSON file, or CSV file.

to_***()
# Use this function to export.
# As with read_***(), *** is replaced by a different name for each file format.

For HTML, use the to_html() function; for Excel, the to_excel() function; and so on. The pandas library also supports formats other than those listed in the table below, and they can likewise be written with a to_***() function.

File format   Function
HTML          to_html()
JSON          to_json()
CSV           to_csv()
Excel         to_excel()

For example, to output to an Excel file with pandas, use the to_excel() function. By passing the name of the Excel file you want to export as the argument, you can generate an Excel file from a DataFrame object.

# Export the pandas.DataFrame object `df` to an Excel file
df.to_excel("Excel file name you want to export")
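
The other to_***() functions work the same way. As a sketch of mutual conversion, the example below exports a small DataFrame to CSV and JSON text and reads the CSV back. The data is hypothetical; no path is passed to to_csv() and to_json(), so they return strings instead of writing files.

```python
import io
import json

import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"item": ["apple", "banana"], "price": [120, 80]})

# With no path argument, to_csv() and to_json() return the text they would write.
csv_text = df.to_csv(index=False)
json_text = df.to_json(orient="records")

# Read the CSV text back into a DataFrame, and parse the JSON text into Python objects.
restored = pd.read_csv(io.StringIO(csv_text))
records = json.loads(json_text)

print(csv_text)
print(records)
```

Passing index=False leaves the row index out of the CSV, which keeps the columns identical after the round trip.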

Get CSV file data and plot it on a graph

Read CSV file data

First, read the data.

import pandas as pd

# Pass the location of the CSV file as a string, for example a relative path starting with ./
stock_data = pd.read_csv("path to the CSV file")

print(stock_data)
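
The CSV file used above is not included in this article, so the sketch below reads hypothetical data from an in-memory string instead. The column names (date, price, volume) are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical stock data standing in for the real CSV file.
csv_text = (
    "date,price,volume\n"
    "2020-01-06,100,1200\n"
    "2020-01-07,105,900\n"
    "2020-01-08,98,1500\n"
)

# index_col uses the date column as the row index;
# parse_dates=True converts that index to datetime values.
stock_data = pd.read_csv(io.StringIO(csv_text), index_col="date", parse_dates=True)

print(stock_data.head())
print(stock_data.dtypes)
```

Using the dates as the index is a design choice that pays off when plotting, because pandas puts the index on the horizontal axis.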

Draw a graph using Pandas features

In pandas, you can draw a graph directly from a DataFrame object by calling its plot() function; the index is used for the horizontal axis. Assuming you have a DataFrame object df, you can write:

from matplotlib import pyplot as plt
df.plot()
plt.show()

# To plot only a specific column ("price" is assumed to be a column name in data)
df = data["price"]
df.plot()
plt.show()

# To plot all the data
df = data
df.plot()
plt.show()

# When no column is specified, the data is plotted as it is
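
The snippets above assume that data already exists and that a window can be opened. The following self-contained sketch uses hypothetical data, selects the non-interactive Agg backend, and saves the graph to a temporary PNG file instead of calling plt.show(), so it also runs on a machine without a display.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend: draw to a file, not a window

import pandas as pd
from matplotlib import pyplot as plt

# Hypothetical stock data indexed by date.
data = pd.DataFrame(
    {"price": [100, 105, 98], "volume": [1200, 900, 1500]},
    index=pd.date_range("2020-01-06", periods=3),
)

# Plot only the price column on an explicitly created axes.
fig, ax = plt.subplots()
data["price"].plot(ax=ax, title="price")
n_lines = len(ax.lines)

# Save the figure to a temporary file and release the figure.
out_path = os.path.join(tempfile.mkdtemp(), "price.png")
fig.savefig(out_path)
plt.close(fig)

print(out_path)
```

Passing ax explicitly keeps each graph on its own axes; without it, successive plot() calls on a Series can draw onto the axes that is currently active.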
