[PYTHON] Read CSV and analyze with Pandas and Seaborn

Data analysis is popular these days, so this article walks through a simple analysis using sample code.

the code

The execution environment is Python 3.

In this article we will do the following:

- Read a CSV file
- Do a simple column conversion
- Aggregate and plot from various perspectives

Seaborn is used for plotting.

Seaborn: statistical data visualization

Data to be used

The data to be analyzed is as follows.

target.csv


datetime, id, value
20170606121314, 1,2
20170606121315, 1,3
20170606121316, 1,4
20170608121616, 1,4
20170608121617, 1,1
20170608121618, 1,2
20170606121540, 2,10
20170606121541, 2,8
20170606121542, 2,11
20170608121543, 2,4
20170606134002, 3,21
20170606134003, 3,10
20170606134004, 3,4
20170608134005, 3,50

datetime is a 14-digit string made up of year, month, day, hour, minute, and second. Assume that, for each id, a value is recorded once per second over periods lasting a few seconds.

Analytical work in Python

Read csv file

python


import pandas as pd

# Read the CSV file
df = pd.read_csv("target.csv", sep=",")
# Set the column names explicitly
df.columns = ["datetime", "id", "value"]

To check that the file was read correctly, call:

df.head()

The output is as follows.

         datetime  id  value
0  20170606121314   1      2
1  20170606121315   1      3
2  20170606121316   1      4
3  20170608121616   1      4
4  20170608121617   1      1

The head() method displays the first 5 rows of the data by default and is often used to check its contents.

There is also a method called tail(), which displays the last 5 rows. The result is as follows. (This output was taken after the datetime conversion described later, so the datetime column already shows timestamps.)

              datetime  id  value
9  2017-06-08 12:15:43   2      4
10 2017-06-06 13:40:02   3     21
11 2017-06-06 13:40:03   3     10
12 2017-06-06 13:40:04   3      4
13 2017-06-08 13:40:05   3     50
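Both methods accept the number of rows to show. The following is a minimal sketch on a small stand-in frame; the name demo and its values are made up for illustration and are not the article's data.

```python
import pandas as pd

# Hypothetical stand-in frame; any DataFrame behaves the same way.
demo = pd.DataFrame(
    {"id": [1, 1, 1, 2, 2, 3, 3], "value": [2, 3, 4, 10, 8, 21, 10]}
)

first_three = demo.head(3)  # first 3 rows instead of the default 5
last_two = demo.tail(2)     # last 2 rows; the original index labels are kept

print(first_three)
print(last_two)
```

Note that tail() keeps the original row labels, which is why the tail() output above starts at index 9 rather than 0.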

The following line sets the column names of the DataFrame.

python


df.columns = ["datetime","id","value"]
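As an alternative, the column names can be supplied while reading. The sketch below assumes the file layout shown earlier (a header row and a blank after some commas) and inlines a few rows so it runs without target.csv; CSV_TEXT and sample_df are names made up for this example.

```python
import io

import pandas as pd

# A few rows of the sample data, inlined so the sketch runs without target.csv.
CSV_TEXT = """datetime, id, value
20170606121314, 1,2
20170606121315, 1,3
20170608121543, 2,4
"""

# header=0 tells pandas that the first row is a header to be replaced,
# names= supplies the replacement column names, and
# skipinitialspace=True drops the blank that follows a comma.
sample_df = pd.read_csv(
    io.StringIO(CSV_TEXT),
    header=0,
    names=["datetime", "id", "value"],
    skipinitialspace=True,
)

print(sample_df.dtypes)
```

All three columns come back as integers here, which is why the datetime column needs an explicit conversion before it can be treated as a date.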

Convert the datetime column from string to datetime

python


from datetime import datetime as dt

df.datetime = df.datetime.apply(lambda d: dt.strptime(str(d), "%Y%m%d%H%M%S"))

The purpose of this step is to make the datetime column easier to work with. df.datetime accesses the datetime column, and apply() passes each row's value to strptime, which parses it according to the format string. Because read_csv loads values such as 20170606121314 as integers, each value is first turned into a string with str(). The result is a column of date and time values.
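The same conversion can also be written without a per-row lambda by using pd.to_datetime, which parses a whole column at once. This is a sketch of an alternative, not the approach used in this article; the frame raw is a made-up two-row example.

```python
import pandas as pd

# Two values in the same layout as the datetime column (read as integers).
raw = pd.DataFrame({"datetime": [20170606121314, 20170608121618]})

# Cast the integers to strings first, then parse the whole column in one call.
raw["datetime"] = pd.to_datetime(
    raw["datetime"].astype(str), format="%Y%m%d%H%M%S"
)

print(raw.dtypes)
print(raw)
```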

Aggregate by ID and see the number of records

python


# Count the records for each id
df_by_id = df.groupby("id")["value"].count().reset_index()
df_by_id

groupby("id") groups the records by the value in the id column, and count() counts the records in each group.

The contents of df_by_id are as follows.

   id  value
0   1      6
1   2      4
2   3      4
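One detail worth knowing: count() counts only non-null values in the selected column, while size() counts rows. The sketch below uses size() and gives the result column a clearer name; demo is an inlined copy of the id and value columns from the sample data, and record_count is a name chosen for this example.

```python
import pandas as pd

# Inlined copy of the id and value columns from the sample data.
demo = pd.DataFrame(
    {
        "id": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
        "value": [2, 3, 4, 4, 1, 2, 10, 8, 11, 4, 21, 10, 4, 50],
    }
)

# size() counts the rows in each group; name= labels the resulting column.
record_counts = demo.groupby("id").size().reset_index(name="record_count")

print(record_counts)
```

With no missing values in the data, the counts match the count() result above.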

Draw a histogram with the number of records on the horizontal axis

python


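import seaborn as sns
import matplotlib.pyplot as plt

# Jupyter magic: show figures inline in the notebook
%matplotlib inline

id_df = pd.DataFrame(df_by_id)
# Histogram of the per-id record counts (distplot is deprecated since seaborn 0.11)
sns.distplot(id_df.value, kde=False, rug=False, axlabel="record_count", bins=10)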

We use Seaborn, a library for drawing attractive statistical plots.

[Figure: histogram with record_count on the horizontal axis]

Aggregate by ID and see the total of the value column

python


df_value_sum = df.groupby("id")["value"].sum().reset_index()

This is the same as the count example, with count() replaced by sum().

The contents of df_value_sum are as follows.

   id  value
0   1     16
1   2     33
2   3     85
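When both the count and the total are needed, agg() can compute them in a single groupby. This is a small sketch rather than part of the original analysis; demo is an inlined copy of the id and value columns, and summary is a name made up for the example.

```python
import pandas as pd

# Inlined copy of the id and value columns from the sample data.
demo = pd.DataFrame(
    {
        "id": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
        "value": [2, 3, 4, 4, 1, 2, 10, 8, 11, 4, 21, 10, 4, 50],
    }
)

# One column per aggregate: "count" and "sum".
summary = demo.groupby("id")["value"].agg(["count", "sum"]).reset_index()

print(summary)
```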

Aggregate by ID and get the time when the data first occurred

python


start_datetime_by_id = df.groupby(["id"])["datetime"].first().reset_index()
df_date = pd.DataFrame(start_datetime_by_id)

The contents of df_date are as follows.

   id            datetime
0   1 2017-06-06 12:13:14
1   2 2017-06-06 12:15:40
2   3 2017-06-06 13:40:02
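Note that first() returns the first row of each group in file order, so it gives the earliest time only when each id's rows are already sorted by time, as they are in the sample file. If that is not guaranteed, min() is a safer choice. The sketch below shows the difference on a made-up frame whose rows are deliberately out of order.

```python
import pandas as pd

# Rows for id 1 are deliberately out of chronological order.
demo = pd.DataFrame(
    {
        "id": [1, 1, 2],
        "datetime": pd.to_datetime(
            [
                "2017-06-08 12:16:16",
                "2017-06-06 12:13:14",
                "2017-06-06 12:15:40",
            ]
        ),
    }
)

# first(): the first row seen for each id, in row order.
by_first = demo.groupby("id")["datetime"].first().reset_index()
# min(): the earliest timestamp for each id, regardless of row order.
by_min = demo.groupby("id")["datetime"].min().reset_index()

print(by_first)
print(by_min)
```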

Display how many records occurred on each day of the month, with the day on the horizontal axis

python


# Plot the day of the month of each id's first record (df_date is defined above)
sns.distplot(df_date.datetime.dt.day, kde=False, rug=False, axlabel="record_generate_date", hist_kws={"range": [1, 30]}, bins=30)

With the option hist_kws={"range": [1, 30]}, the horizontal axis covers the range 1 to 30. This makes it easy to see on which of the 30 days of June 2017 the data occurred.

[Figure: histogram with record_generate_date on the horizontal axis, range 1 to 30]
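The numbers behind such a histogram can also be checked directly in pandas. The sketch below counts the start times per day of the month and fills in the days with no data; start_times is an inlined copy of the three start times shown earlier, and per_day is a name made up for the example.

```python
import pandas as pd

# The first-occurrence times per id, inlined from the result above.
start_times = pd.Series(
    pd.to_datetime(
        [
            "2017-06-06 12:13:14",
            "2017-06-06 12:15:40",
            "2017-06-06 13:40:02",
        ]
    )
)

# Count occurrences per day of the month, then add days 1 to 30 with no data as 0.
per_day = start_times.dt.day.value_counts().reindex(range(1, 31), fill_value=0)

# Show only the days on which something occurred.
print(per_day[per_day > 0])
```

All three ids started on the 6th, so the histogram has a single bar of height 3 at that day.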
