[python] Around generating XML string without using to_sql () of pandas and updating data using OPENXML function in SQL Server stored procedure

1. Overview

pandas has methods to write / read data from various data sources such as csv and Excel files, relational databases (RDB) and Google's BigQuery. For example, if you want to write data to RDBMS tables such as SQL Server. , You can easily update the data in the table by using the to_sql () method of pandas. If you try to write the same data to the table again during the second update process of the data already stored in the table, the data will be duplicated. Therefore, data on the table should not be written for such duplicated data. If you want to update the data in a unique way, you can't deny the feeling that to_sql () of pandas doesn't work a little. I can't just run to_sql () in replace mode ... So, this time, I tried to prototype / consider how to update only the difference data to the table on SQL Server by the following flow without using to_sql () of pandas.

1: Generate XML string with elementTree from pandas df data frame with python ... (1) 2: Execute the SQL Server stored procedure with SQLAlchemy using the XML string of (1) as an argument 3: Read the XML character string of (1) with the OPENXML function of SQL Server and expand the data ... (2) 4: Store the data of (2) in the temporary table #table, JOIN with the real table and update only the difference

It should be noted that the data can be updated in this above image on the premise that there is support for the OPENXML function, which is a function to operate XML strings provided as standard in Microsoft's RDBMS product SQL Server, and other RDBMS. I'm not sure if the product has a function to manipulate XML strings like the OPENXML function, but what about? (Note: I'm not familiar with MySQL and postgres, but as a direction, if there is a function to easily operate XML strings on RDB, such as MySQL and postgres, I think that those functions can be substituted. I was wondering if there was any ...) Therefore, please note that this issue is a method that can be implemented only in SQL Server!

2. First half (generate XML string on python)

file1.py


import pandas as pd
import xml.etree.ElementTree as ET

#Sample data
lei = ['353800PIEETYXIDK6K51','5493006W3QUS5LMH6R84','not found']
cname = ['Toyota Motor','Polar Ocean','Sakura Internet']
isin = [' JP3633400001' , 'JP3257200000','JP3317300006']
sic =['7203','1301','3778']

#df generation
Pythondata={'sic': sic, 'isin': isin, 'cname': cname ,'lei': lei }
df = pd.DataFrame(data)
columns = ['sic', 'isin', 'cname', 'lei' ]
df.columns = columns
print(df)

#XML string generation
roots = ET.Element('root') 
for i in range(len(df)):
    f0 = ET.SubElement(roots, 'sb')
    f1 = ET.SubElement(f0, 'hoge')
    f1.set('sic', df.iloc[i,0])
    f1.set('isin', df.iloc[i,1])
    f1.set('cname', df.iloc[i,2])
    f1.set('lei', df.iloc[i,3])
tree = ET.tostring(roots)
tree = tree.decode()
tree = "'" + tree + "'"
print(tree)

First, the above three sample data (Toyota Motor, Kyokuyo, Sakura Internet) are from JPX TSE page I checked the code etc. I also looked up the LEI code on the GLEIF page. When you execute the above python code, the pandas data frame df is generated first, and the XML string is generated from that df. In the above, the python module elementTree for creating and manipulating XML objects is used to generate XML objects, and XML tags (to be exact, XML attribute information) are added sequentially from the root tag in a for loop. ..

The following is the execution result of the above sample python code ↓ スクリーンショット 2020-07-19 20.49.34.png

The XML string generated by the above sample python code is as follows ↓

'<root><sb><hoge cname="&#12488;&#12520;&#12479;&#33258;&#21205;&#36554;" isin=" JP3633400001" lei="353800PIEETYXIDK6K51" sic="7203" /></sb><sb><hoge cname="&#26997;&#27915;" isin="JP3257200000" lei="5493006W3QUS5LMH6R84" sic="1301" /></sb><sb><hoge cname="&#12373;&#12367;&#12425;&#12452;&#12531;&#12479;&#12540;&#12493;&#12483;&#12488;" isin="JP3317300006" lei="not found" sic="3778" /></sb></root>'

If you make the XML string a little more readable and write it with line breaks, the XML structure will be as follows ↓

'<root>
<sb>
<hoge cname="&#12488;&#12520;&#12479;&#33258;&#21205;&#36554;" isin="JP3633400001" lei="353800PIEETYXIDK6K51" sic="7203" />
</sb>
<sb>
<hoge cname="&#26997;&#27915;" isin="JP3257200000" lei="5493006W3QUS5LMH6R84" sic="1301" />
</sb>
<sb>
<hoge cname="&#12373;&#12367;&#12425;&#12452;&#12531;&#12479;&#12540;&#12493;&#12483;&#12488;" isin="JP3317300006" lei="not found" sic="3778" />
</sb>
</root>'

In the above XML, it is not defined by element, but by attribute. This is defined by an attribute based on the specifications when reading XML with the SQL Server OPEN XML function described later. After that, I wonder how to handle NULL (NaN / None) when generating an XML string. I think that missing values will naturally appear in the data handled by pandas, but at that time, NULL (NaN / None) cannot be handled well when defining it as an XML character string, so add some character string. It is necessary to think of a workaround such as giving. (For example, in the sample above, 1301 Kyokuyo and 7203 Toyota have LEI code, while 3778 Sakura Internet has [LEI code]. ](Https://www.gleif.org/ja/) does not have, and forcibly converts NULL data to characters such as "not found" in order to generate an XML string, and SQL Server described later. Is it necessary to deal with them well after reading with the OPENXML function in? There is room for improvement here)

3. Second half (read / composition data from XML string with OPENXML function on SQL Server)

file2.sql


DECLARE @idoc INT
DECLARE @xml AS NVARCHAR(MAX)
SET @xml = '<root><sb><hoge cname="&#12488;&#12520;&#12479;&#33258;&#21205;&#36554;" isin=" JP3633400001" lei="353800PIEETYXIDK6K51" sic="7203" /></sb><sb><hoge cname="&#26997;&#27915;" isin="JP3257200000" lei="5493006W3QUS5LMH6R84" sic="1301" /></sb><sb><hoge cname="&#12373;&#12367;&#12425;&#12452;&#12531;&#12479;&#12540;&#12493;&#12483;&#12488;" isin="JP3317300006" lei="not found" sic="3778" /></sb></root>'

--SELECT @xml


DROP TABLE IF EXISTS #temp;

EXEC sp_xml_preparedocument @idoc OUTPUT, @xml;

SELECT *  
INTO #temp
FROM OPENXML(@idoc, '/root/sb/hoge',1)  
WITH (SIC varchar(50) './@sic'
		,	ISIN varchar(50) './@isin'
		,	CNAME varchar(50) './@cname'
		,	LEI varchar(50) './@lei'
		);

SELECT *  
FROM #temp;

DROP TABLE IF EXISTS #temp;

Execution result of the above SQL query ↓ スクリーンショット 2020-07-20 10.15.39.png SQL Server supports a function to operate XML character strings called OPENXML function as standard equipment, and by using this OPENXML function, the XML character string generated by the above python is converted to data that can be operated by SQL. Then, if you store the data in the temporary table #temp (SELECT * INTO ~ clause), then JOIN the #temp and the storage destination table, and record the #temp that does not yet exist in the storage destination table. If you describe the transaction query to be inserted in the storage destination table as a difference record, it seems that you can update the data in the table while ensuring the uniqueness of the data.

It should be noted that the XML character string (XML structure) generated in the first half and the XML reading description when reading with the SQL OPENXML function in the second half must be exactly the same, and the XML structure to be handled is the first half. It seems that this is a point to note because the XML string ejection part of the above and the XML reading part of the latter half must be the same. (If the XML string cannot be read well in the part read by the OPENXML function of SQL Server, the part generated on python in the first half of the XML description is usually synchronized with the part read by OPENXML on SQL Server. May not be)

4. Execute SQL Server stored procedure in SQLAlchemy

file1.py


import sqlalchemy 

#DB connection setting string
CONNECT_INFO = 'mssql+pyodbc://hoge'      #hoge = ODBC name
engine = sqlalchemy.create_engine(CONNECT_INFO, encoding='utf-8')

#...(Omission)...

##DB update
Session = sqlalchemy.orm.sessionmaker(bind=engine)
session = Session()
    
##Stored execution
query = 'EXEC dbo.spUpdatehoge @prm = {0}'.format(tree)
res = session.execute(query)  #sp execution
print(res.fetchall())
session.commit()      #commit
session.close()

When executing a SQL Server stored procedure with python (SQLAlchemy), I want to get the return value (response) of executing the stored procedure on python and display it at the session.execute (query) part of the above query. In that case, if you return the update result status with the Select statement instead of the Return clause in the stored procedure, you can see the result of the stored execution on python immediately. Apparently, the ResultProxy object of SQLAlchemy cannot get the result returned by the Return clause in the stored procedure well. The update method described this time is divided into the part that generates the XML string with the Python script in the first half, the part of OPENXML processing in the stored procedure in SQL Server in the second half, and the first half / second half of the processing. So, I think it would be reassuring if the processing result of the stored procedure (success or failure) can be acquired and displayed by Python's SQLAlchemy.

Recommended Posts

[python] Around generating XML string without using to_sql () of pandas and updating data using OPENXML function in SQL Server stored procedure
Graph time series data in Python using pandas and matplotlib
I tried to make a function to retrieve data from database column by column using sql with sqlite3 of python [sqlite3, sql, pandas]
[In-Database Python Analysis Tutorial with SQL Server 2017] Step 4: Feature extraction of data using T-SQL
Full-width and half-width processing of CSV data in Python