[PYTHON] I tried Pandas' Sql Upsert

Introduction

This is Miyano from estie CTO (@estie_mynfire).

Since estie constructs real estate data from various resources, pandas is used for data shaping. for that reason ** Data created with pandas-> DataBase ** It is important to carry out the flow smoothly.

Until now, there were only replace and append, so I had to do my best on the pandas side to mold and then update. (This is very annoying)

Meanwhile, the other day, pandas had the long-awaited Pull request for adding Sql Upsert function, so I tried to move it. This should explode your work efficiency ...! !!

image.png

What is Upsert in the first place?

It means to do Insert and Update. There are two main functions of Upsert in Sql.

  1. Based on the Primary Key, nothing exists, and Insert (** upsert_keep **) does not exist.
  2. Based on Primary Key, if it exists, update it, and if it does not exist, insert (** upsert_overwrite **)

Why are you happy to be able to do it with pandas

Therefore, Python engineers have a desire to update tables with confidence ** directly from pandas.

Conclusion

'''
Attributes:
    df (pd.DataFrame): any DataFrame
    tablename (str):table name
'''
import pandas as pd
from sqlalchemy import create_engine

con = create_engine('mysql+mysqlconnector://[user]:[pass]@[host]:[port]/[schema]')

# upsert_keep ->Basically do nothing(?)There is a possibility of a bug, so I will look at it in the future.
df.to_sql(tablename, con, if_exist='upsert_keep', index=False)

# upsert_overwrite ->Update those that exist and Insert if they do not exist. As intended
df.to_sql(tablename, con, if_exist='upsert_overwrite', index=False)

Env From here, I will write a concrete story. The environment is as follows.

Setting

Environment

git clone https://github.com/V0RT3X4/pandas.git
cd pandas
git checkout -b sql-upsert
git pull origin  sql-upsert

git rev-parse HEAD #Current commit hash value
# d0eb251075883902280cba6cd0dd9a1a1c4a69a4

Installation from sources

pip install cython
# Successfully installed cython-0.29.14
python setup.py build_ext --inplace --force #It takes a lot of time. Let's wait patiently.

Mysql server

Set up a mysql server and create a users table for local testing.

users table (** id column is the primary key **)

id name age
0 taro 20
1 jiro 17
2 saburo 18

mysql.server start --skip-grant-tables #Start a mysql server that allows you to log in without a password
mysql #Login

mysql> CREATE DATABASE testdb;
mysql> USE testdb;
mysql> CREATE TABLE users (
         id INT NOT NULL,
         name VARCHAR(256),
         age INT
       );
mysql> ALTER TABLE users ADD PRIMARY KEY (id);
mysql> INSERT INTO users (id, name, age) VALUES
       (0, 'taro', 20),
       (1, 'jiro', 19),
       (2, 'saburo', 18);

Main subject

From here, I will try Sql Upsert using pandas and sqlalchemy.

Db connect First of all, from the connection with DB. You will need sqlalchemy, so if you don't have it, please do pip install sqlalchemy

from sqlalchemy import create_engine
import pandas as pd

con = create_engine('mysql+mysqlconnector://@localhost/testdb')
# format: 'mysql+mysqlconnector://[user]:[pass]@[host]:[port]/[schema]'
users = pd.read_sql('SELECT * FROM users;', con)
users
#   id    name  age
#0   0    taro   20
#1   1    jiro   19
#2   2  saburo   18

Upsert

First, rewrite the DataFrame

users.loc[users.id==0, 'name'] = 'syo'
users = pd.concat([users, pd.DataFrame({'id': [3], 'name': ['shiro'], 'age': [28]})])
#   id    name  age
#0   0     syo   20
#1   1    jiro   19
#2   2  saburo   18
#0   3   shiro   28

The basic syntax of the to_sql method is

df.to_sql(tablename, con, if_exist, index=False)
# df: pd.DataFrame
# tablename:table name

# if_exist: {'fail', 'replace', 'append', 'upsert_overwrite', 'upsert_keep'}

# * fail: Raise a ValueError. 
# * replace: Drop the table before inserting new values. 
# * append: Insert new values to the existing table. 
# * upsert_overwrite: Overwrite matches in database with incoming data. 
# * upsert_keep: Keep matches in database instead of incoming data.

(Check pandas / core / generic.py for other arguments!)

upsert_keep

users.to_sql('users', con, if_exists='upsert_keep', index=False)
pd.read_sql('SELECT * FROM users;', con) #Verification
#   id    name  age
#0   0     taro   20
#1   1    jiro   19
#2   2  saburo   18

upsert_overwrite

users.to_sql('users', con, if_exists='upsert_overwrite', index=False)
pd.read_sql('SELECT * FROM users;', con) #Verification
#   id    name  age
#0   0     syo   20
#1   1    jiro   19
#2   2  saburo   18
#3   3   shiro   28

in conclusion

Problems with upsert_keep not behaving as intended

It seems necessary to look around the _upsert_keep_processing method of pandas / io / sql.py. As soon as the cause is known, we will update the pull request and article.

About estie

At estie, we are always looking for engineers who are enthusiastic about new technologies and full-stack engineers! https://www.wantedly.com/companies/company_6314859/projects

estie -> https://www.estie.jp estiepro -> https://pro.estie.jp Company site-> https://www.estie.co.jp

Recommended Posts

I tried Pandas' Sql Upsert
I tried PyQ
I tried AutoKeras
I tried papermill
I tried django-slack
I tried Django
I tried spleeter
I tried cgo
I tried the pivot table function of pandas
I tried using argparse
I tried using anytree
I tried competitive programming
I tried using aiomysql
I tried using Summpy
I tried using coturn
I tried using "Anvil".
I tried using Hubot
I tried using ESPCN
I tried PyCaret2.0 (pycaret-nightly)
I tried deep learning
I tried AWS CDK!
I tried to debug.
I tried using PyCaret
I tried using cron
I tried Kivy's mapview
I tried using ngrok
I tried using face_recognition
I tried to paste
I tried using Jupyter
I tried using PyCaret
I tried moving EfficientDet
I tried shell programming
I tried using Heapq
I tried using doctest
I tried Python> decorator
I tried running TensorFlow
I tried Auto Gluon
I tried using folium
I tried using jinja2
I tried AWS Iot
I tried Bayesian optimization!
I tried using folium
I tried using time-window
I tried to summarize how to use pandas in python
I tried Value Iteration Networks
I tried fp-growth with python
I tried scraping with Python
I tried AutoGluon's Image Classification
Correspondence between pandas and SQL
I tried Learning-to-Rank with Elasticsearch!
[I tried using Pythonista 3] Introduction
I tried using easydict (memo).
I tried to organize SVM.
I tried face recognition using Face ++
I tried using Random Forest
I tried clustering with PyCaret
I tried using BigQuery ML
I tried "K-Fold Target Encoding"
I tried to implement PCANet
I tried using Amazon Glacier
I tried to summarize the code often used in Pandas