[PYTHON] I tried Pandas' Sql Upsert

Introduction

This is Miyano from estie CTO (@estie_mynfire).

Since estie constructs real estate data from various resources, pandas is used for data shaping. for that reason ** Data created with pandas-> DataBase ** It is important to carry out the flow smoothly.

Until now, there were only replace and append, so I had to do my best on the pandas side to mold and then update. (This is very annoying)

Meanwhile, the other day, pandas had the long-awaited Pull request for adding Sql Upsert function, so I tried to move it. This should explode your work efficiency ...! !!

What is Upsert in the first place?

It means to do Insert and Update. There are two main functions of Upsert in Sql.

Based on the Primary Key, nothing exists, and Insert (** upsert_keep **) does not exist.
Based on Primary Key, if it exists, update it, and if it does not exist, insert (** upsert_overwrite **)

Why are you happy to be able to do it with pandas

Difficult to handle missing values in Sql
Only NULL string is judged as missing value
Python nan and None-> Enter the table as a string as it is
Recognized as a blank string-> 0
csv spitting is troublesome
Even though I wrote it in Python, it is inefficient to spit it out to csv and then plunge it into another wrapper.

Therefore, Python engineers have a desire to update tables with confidence ** directly from pandas.

Conclusion

It is convenient because Sql operation can be done almost only with pandas. And simple and easy to use
It seems that upsert_keep is not behaving as expected, so investigate the cause

'''
Attributes:
    df (pd.DataFrame): any DataFrame
    tablename (str):table name
'''
import pandas as pd
from sqlalchemy import create_engine

con = create_engine('mysql+mysqlconnector://[user]:[pass]@[host]:[port]/[schema]')

# upsert_keep ->Basically do nothing(?)There is a possibility of a bug, so I will look at it in the future.
df.to_sql(tablename, con, if_exist='upsert_keep', index=False)

# upsert_overwrite ->Update those that exist and Insert if they do not exist. As intended
df.to_sql(tablename, con, if_exist='upsert_overwrite', index=False)

Env From here, I will write a concrete story. The environment is as follows.

macOS Mojave
- 10.14.5
Python -V
- 3.6.8
package version
- SQLAlchemy 1.3.11

Setting

Environment

git clone https://github.com/V0RT3X4/pandas.git
cd pandas
git checkout -b sql-upsert
git pull origin  sql-upsert

git rev-parse HEAD #Current commit hash value
# d0eb251075883902280cba6cd0dd9a1a1c4a69a4

Installation from sources

pip install cython
# Successfully installed cython-0.29.14
python setup.py build_ext --inplace --force #It takes a lot of time. Let's wait patiently.

Mysql server

Set up a mysql server and create a users table for local testing.

users table (** id column is the primary key **)

id	name	age
0	taro	20
1	jiro	17
2	saburo	18


mysql.server start --skip-grant-tables #Start a mysql server that allows you to log in without a password
mysql #Login

mysql> CREATE DATABASE testdb;
mysql> USE testdb;
mysql> CREATE TABLE users (
         id INT NOT NULL,
         name VARCHAR(256),
         age INT
       );
mysql> ALTER TABLE users ADD PRIMARY KEY (id);
mysql> INSERT INTO users (id, name, age) VALUES
       (0, 'taro', 20),
       (1, 'jiro', 19),
       (2, 'saburo', 18);

Main subject

From here, I will try Sql Upsert using pandas and sqlalchemy.

Db connect First of all, from the connection with DB. You will need sqlalchemy, so if you don't have it, please do pip install sqlalchemy

from sqlalchemy import create_engine
import pandas as pd

con = create_engine('mysql+mysqlconnector://@localhost/testdb')
# format: 'mysql+mysqlconnector://[user]:[pass]@[host]:[port]/[schema]'
users = pd.read_sql('SELECT * FROM users;', con)
users
#   id    name  age
#0   0    taro   20
#1   1    jiro   19
#2   2  saburo   18

Upsert

First, rewrite the DataFrame

users.loc[users.id==0, 'name'] = 'syo'
users = pd.concat([users, pd.DataFrame({'id': [3], 'name': ['shiro'], 'age': [28]})])
#   id    name  age
#0   0     syo   20
#1   1    jiro   19
#2   2  saburo   18
#0   3   shiro   28

The basic syntax of the to_sql method is

df.to_sql(tablename, con, if_exist, index=False)
# df: pd.DataFrame
# tablename:table name

# if_exist: {'fail', 'replace', 'append', 'upsert_overwrite', 'upsert_keep'}

# * fail: Raise a ValueError. 
# * replace: Drop the table before inserting new values. 
# * append: Insert new values to the existing table. 
# * upsert_overwrite: Overwrite matches in database with incoming data. 
# * upsert_keep: Keep matches in database instead of incoming data.

(Check pandas / core / generic.py for other arguments!)

upsert_keep

Do nothing (?)-> See below
I thought it would insert 3 shiro 28, but it wasn't.

users.to_sql('users', con, if_exists='upsert_keep', index=False)
pd.read_sql('SELECT * FROM users;', con) #Verification
#   id    name  age
#0   0     taro   20
#1   1    jiro   19
#2   2  saburo   18

upsert_overwrite

primary key (id in this case) Insert if not wearing, update the wearing record
As intended

users.to_sql('users', con, if_exists='upsert_overwrite', index=False)
pd.read_sql('SELECT * FROM users;', con) #Verification
#   id    name  age
#0   0     syo   20
#1   1    jiro   19
#2   2  saburo   18
#3   3   shiro   28

in conclusion

Problems with upsert_keep not behaving as intended

It seems necessary to look around the _upsert_keep_processing method of pandas / io / sql.py. As soon as the cause is known, we will update the pull request and article.

About estie

At estie, we are always looking for engineers who are enthusiastic about new technologies and full-stack engineers! https://www.wantedly.com/companies/company_6314859/projects

estie -> https://www.estie.jp estiepro -> https://pro.estie.jp Company site-> https://www.estie.co.jp