[PYTHON] Merge datasets with pandas

Latest version of pandas 0.14.0

The latest version of pandas, 0.14.0, was released on 5/30. To upgrade from the previous version 0.13.1, enter the following command.

pip install --upgrade pandas 

Dataset merging

It was mentioned earlier that pandas is a high-level and powerful analysis library that provides objects called data frames. It's just (http://qiita.com/ynakayama/items/4362c439d9ea814cbe60), but you can use it to combine multiple datasets like RDBMS.

Try from the actual dataset

The read_json function of pandas can treat JSON as it is as a data frame. Here, we will use the API of the open source project House API Project. Here, information such as the number of Wifi client connections and temperature is released as open data using sensors. Let's merge this.

import pandas as pd

#Get the latest 100 Wifi connections on the 1st floor
wifi = pd.read_json('http://house-api-project.org/api/shibuhouse/wifi/clients?limit=100house-api-project.org/api/shibuhouse/wifi/1f/clients?limit=100')

#Get the latest 100 temperatures on the 1st floor
temp = pd.read_json('http://house-api-project.org/api/shibuhouse/1f/temperature?limit=100')

print( wifi.head(5) )
#=>
#                                       _id  clients                     time
#  0   {'$oid': '538c3799c3ca7c08531d645e'}       21  2014-06-02 08:36:36 UTC
#  1   {'$oid': '538c3760c3ca7c08531d6454'}       21  2014-06-02 08:35:37 UTC
#  2   {'$oid': '538c371fc3ca7c08531d6445'}       21  2014-06-02 08:34:36 UTC
#  3   {'$oid': '538c36e4c3ca7c08531d643b'}       21  2014-06-02 08:33:37 UTC
#  4   {'$oid': '538c36a7c3ca7c08531d642f'}       21  2014-06-02 08:32:36 UTC

print( temp.head(5) )
#=>
#                                       _id  temperature                     time
#  0   {'$oid': '538c37cac3ca7c08531d6465'}       34.000  2014-06-02 08:37:29 UTC
#  1   {'$oid': '538c378dc3ca7c08531d6459'}       34.000  2014-06-02 08:36:29 UTC
#  2   {'$oid': '538c3752c3ca7c08531d644d'}       34.000  2014-06-02 08:35:29 UTC
#  3   {'$oid': '538c3715c3ca7c08531d6442'}       34.000  2014-06-02 08:34:29 UTC
#  4   {'$oid': '538c36dbc3ca7c08531d6437'}       34.000  2014-06-02 08:33:30 UTC

#Merge two datasets by index
merged = pd.merge(wifi, temp, left_index=True, right_index=True)

Various options for merging

The frequently used arguments are as shown in the table below.

argument Description
left Merge to the left
right Merge to the right
how How to join(inner, outer, left, right)
on When you want to explicitly specify the column name to use for merging
left_on/right_on Alias used as a join key
left_index/right_index When you want to merge by index

Dataset concatenation

The concat function concatenates datasets. It's similar to merging and confusing, but it's a good idea to read the documentation for more details.

Merge, join, and concatenate http://pandas.pydata.org/pandas-docs/stable/merging.html

In the JSON API example above, it looks like this.

concat = pd.concat([wifi, temp], ignore_index=True, axis=1)

concat.head(10)
#=>
#                                        0   1                        2  \
#  0  {'$oid': '538c4081c3ca7c08531d6616'}  20  2014-06-02 09:14:37 UTC   
#  1  {'$oid': '538c4045c3ca7c08531d660a'}  20  2014-06-02 09:13:37 UTC   
#  2  {'$oid': '538c4005c3ca7c08531d65fb'}  20  2014-06-02 09:12:37 UTC   
#  3  {'$oid': '538c3fcac3ca7c08531d65ef'}  20  2014-06-02 09:11:37 UTC   
#  4  {'$oid': '538c3f8ec3ca7c08531d65e4'}  20  2014-06-02 09:10:36 UTC   
#  5  {'$oid': '538c3f51c3ca7c08531d65d8'}  20  2014-06-02 09:09:36 UTC   
#  6  {'$oid': '538c3f17c3ca7c08531d65cf'}  20  2014-06-02 09:08:36 UTC   
#  7  {'$oid': '538c3edbc3ca7c08531d65c2'}  20  2014-06-02 09:07:37 UTC   
#  8  {'$oid': '538c3ea1c3ca7c08531d65b9'}  20  2014-06-02 09:06:36 UTC   
#  9  {'$oid': '538c3e61c3ca7c08531d65a8'}  22  2014-06-02 09:05:37 UTC   

#                                        3       4                        5  
#  0  {'$oid': '538c4077c3ca7c08531d6611'}  34.125  2014-06-02 09:14:30 UTC  
#  1  {'$oid': '538c403ac3ca7c08531d6606'}  34.125  2014-06-02 09:13:30 UTC  
#  2  {'$oid': '538c3fffc3ca7c08531d65fa'}  34.125  2014-06-02 09:12:30 UTC  
#  3  {'$oid': '538c3fc4c3ca7c08531d65ee'}  34.125  2014-06-02 09:11:31 UTC  
#  4  {'$oid': '538c3f86c3ca7c08531d65e2'}  34.125  2014-06-02 09:10:29 UTC  
#  5  {'$oid': '538c3f49c3ca7c08531d65d7'}  34.125  2014-06-02 09:09:29 UTC  
#  6  {'$oid': '538c3f0ec3ca7c08531d65cb'}  34.125  2014-06-02 09:08:29 UTC  
#  7  {'$oid': '538c3ed1c3ca7c08531d65bf'}  34.125  2014-06-02 09:07:29 UTC  
#  8  {'$oid': '538c3e96c3ca7c08531d65b3'}  34.125  2014-06-02 09:06:29 UTC  
#  9  {'$oid': '538c3e5bc3ca7c08531d65a7'}  34.125  2014-06-02 09:05:30 UTC  

Various options for concatenation

The frequently used arguments are as shown in the table below.

argument Description
objs Specify the pandas object to be concatenated with a list etc.
axis Row or column
join inner or outer, the default is outer
keys Values associated between concatenated datasets
ignore_index Generate a new index separately from the axis to be connected

Summary

We've found that pandas makes it easy to merge and concatenate external datasets. For example, by plotting from here, you can see the possibilities of various analyzes such as drawing a scatter plot and examining the correlation.

Recommended Posts

Merge datasets with pandas
Processing datasets with pandas (1)
Processing datasets with pandas (2)
Quickly try to visualize datasets with pandas
Quickly visualize with Pandas
Bootstrap sampling with Pandas
Convert 202003 to 2020-03 with pandas
Learn Pandas with Cheminformatics
Merge array with PyYAML
Data visualization with pandas
Data manipulation with Pandas!
Shuffle data with pandas
Read csv with python pandas
Load nested json with pandas
[Python] Change dtype with pandas
Standardize by group with pandas
Prevent omissions with pandas print
Data processing tips with Pandas
Pandas
Versatile data plotting with pandas + matplotlib
[Python] Join two tables with pandas
Dynamically create new dataframes with pandas
Extract specific multiple columns with pandas
1. Statistics learned with Python 1-1. Basic statistics (Pandas)
Convenient analysis with Pandas + Jupyter notebook
Draw a graph with pandas + XlsxWriter
Manipulating strings with pandas group by
Bulk Insert Pandas DataFrame with psycopg2
I want to do ○○ with Pandas
Create an age group with pandas
Merge JSON format data with Ansible
Classifying SNLI datasets with Word2Vec + LSTM
Excel aggregation with Python pandas Part 1
[Python] Format when to_csv with pandas
Feature generation with pandas group by
Handle various date formats with pandas
I get an error with import pandas.
Load csv with duplicate columns in pandas
Import of japandas with pandas 1.0 and above
Excel aggregation with Python pandas Part 2 Variadic
Tips for plotting multiple lines with pandas
Try converting to tidy data with pandas
Draw hierarchical axis labels with matplotlib + pandas
Replace column names / values with pandas dataframe
[Easy Python] Reading Excel files with pandas
Load csv with pandas and play with Index
Working with 3D data structures in pandas
Read CSV and analyze with Pandas and Seaborn
Example of efficient data processing with PANDAS
Best practices for messing with data with pandas
[Pandas 1.0.1 Memorial] Fierce battle record with cookbook