Convert XML document stored in XML database (BaseX) to CSV format (using Python)

Article content

Suppose you have multiple XML files at hand and want to weigh those data. It's hard for humans to compare multiple XML files, so I'll convert them to a CSV file with one line per XML file. If you open the CSV file with Excel etc., it will be easier to compare and examine the data. I implemented the conversion to CSV in Python, but forgive me that the code isn't very well-behaved.

environment

--OS: Windows10 (I haven't tried it, but it can be executed on other OS) --XML database: BaseX 9.3.2

Reference URL (Thank you)

-[Addition little by little] From installation of XML database "BaseX" to benchmarking

Load an XML document into BaseX

First, load multiple XML documents at hand into BaseX. I wrote and worked the following Python code with reference to Code here.


from pathlib import Path
import os
import pprint
from BaseXClient import BaseXClient

#Create session
session = BaseXClient.Session('localhost', 1984, 'admin', 'admin')

try:
    #XML file to be stored in DB
    xml_directory = Path("C:\\") / "xml_data"
    list_xml_path = sorted(xml_directory.glob("*.xml"), key=os.path.getmtime)
    print("XML file to load:")
    pprint.pprint(list_xml_path)

    #DB open
    session.execute("open testdb")
    print(session.info())

    #Read the xml file and add it to the DB
    for path in list_xml_path:
        with open(path, mode='r', encoding="utf-8") as fi:
            str_xml = fi.read()

        session.add(path.name, str_xml)
        print(session.info())

    #Display the contents of DB
    print("\n" + session.execute("xquery /"))
    print("Completed normally\n")

finally:
    #Close session
    if session:
        session.close()

To supplement the explanation of the code,

--Load the XML document into the existing database "testdb". Please refer to this article for how to create a DB. --The location of the XML file is "C: \ xml_data". --For the XML file, download sample-1.xml to sample-8.xml of this site (thank you), and the contents are "encoding". I rewrote it to "UTF-8" and saved it again in UTF-8 format. --XML files are loaded in chronological order with file timestamps.

When I ran the code, the XML document was loaded with the following display:

XML file to load:
[WindowsPath('C:/xml_data/sample-1.xml'),
 WindowsPath('C:/xml_data/sample-2.xml'),
 WindowsPath('C:/xml_data/sample-3.xml'),
 WindowsPath('C:/xml_data/sample-4.xml'),
 WindowsPath('C:/xml_data/sample-5.xml'),
 WindowsPath('C:/xml_data/sample-7.xml'),
 WindowsPath('C:/xml_data/sample-6.xml'),
 WindowsPath('C:/xml_data/sample-8.xml')]
Database 'testdb' was opened in 1.48 ms.

Resource(s) added in 3.72 ms.

Resource(s) added in 1.93 ms.

Resource(s) added in 8.89 ms.

Resource(s) added in 1.91 ms.

Resource(s) added in 2.05 ms.

Resource(s) added in 2.05 ms.

Resource(s) added in 1.93 ms.

(Omitted)

Completed normally

Issue a query to BaseX and output to a CSV file

Then it queries BaseX to output a CSV file that makes one XML document one line. I wrote the following Python code and made it work.


import pprint
from BaseXClient import BaseXClient


#Own exception class
class MyException(Exception):
    pass


#Read the xpath of the root element from the DB
def read_xpath_root(session):
    set_xpath_root = set()

    query = f'''\
for $root in /*
return fn:path($root)
'''

    query_obj = session.query(query)
    query_obj.execute()

    #Add xpaths for all root elements to a unique set
    for typecode, item in query_obj.iter():
        set_xpath_root.add(item)

    print("root element XPath:")
    pprint.pprint(set_xpath_root)

    query_obj.close()

    #I decided to allow only one type of root element.
    if len(set_xpath_root) == 1:
        pass
    else:
        msg = f"assert len(set_xpath_root)<{len(set_xpath_root)}> == 1"
        raise MyException(msg)

    return set_xpath_root.pop().replace("[1]", "[01]") + '/'


#Collect all types of xpaths of text elements from DB
def read_xpath_text(session):
    set_xpath_text = set()

    query = f'''\
for $text in //text()
return fn:path($text)
'''

    query_obj = session.query(query)
    query_obj.execute()

    #Add xpaths for all text elements to a unique set
    for typecode, item in query_obj.iter():
        set_xpath_text.add(item)

    query_obj.close()

    #I want to sort, so move from set to list
    list_xpath_text = []
    for xpath_text in set_xpath_text:
        list_xpath_text.append(
            #When sorting[1]But[10]For problems that will be later. For the time being
            xpath_text
                .replace("[1]", "[01]")
                .replace("[2]", "[02]")
                .replace("[3]", "[03]")
                .replace("[4]", "[04]")
                .replace("[5]", "[05]")
                .replace("[6]", "[06]")
                .replace("[7]", "[07]")
                .replace("[8]", "[08]")
                .replace("[9]", "[09]")
        )
    list_xpath_text.sort()
    print("XPath of text element (all types):")
    pprint.pprint(list_xpath_text)
    return list_xpath_text


#Create session
session = BaseXClient.Session('localhost', 1984, 'admin', 'admin')

try:
    #DB open
    session.execute("open testdb")
    print(session.info())

    #read the xpath of the root element
    xpath_root = read_xpath_root(session)

    #Collect all types of xpath of text element (use csv column name)
    list_xpath_text = read_xpath_text(session)

    #Assemble and issue a query that returns a row of csv
    csv_header = "input_path"
    query = f'''\
for $root in /*
let $base_uri := fn:base-uri($root)
order by $base_uri
return <ROW>"{{fn:substring($base_uri, fn:string-length(db:name($root)) + 3)}}"\
'''

    #csv column loop
    for xpath_text in list_xpath_text:
        # xpath_Format from text to csv header,+=
        csv_header += ',' \
                      + xpath_text\
                          .replace(xpath_root, "")\
                          .replace("/text()[01]", "")\
                          .replace("Q{}", "")

        # xpath_Format from text for XQuery,+=
        query += ',"{' + xpath_text.replace(xpath_root, "$root/") + '}"'

    query += "</ROW>"

    #DB inquiry
    query_obj = session.query(query)
    query_obj.execute()

    #Output file name
    basename = xpath_root\
                       .replace("Q{}", "")\
                       .replace("[01]", "")\
                       .replace('/', "")

    #Query statement output
    with open(basename + "_xquery.txt", 'w') as fo:
        fo.write(query)
        fo.write('\n')

    #csv output
    with open(basename + ".csv", 'w') as fo:
        #Header output
        fo.write(csv_header)
        fo.write('\n')

        #Line output
        for typecode, item in query_obj.iter():
            fo.write(item.replace("<ROW>", "").replace("</ROW>", '\n'))

    query_obj.close()
    print("Completed normally\n")

finally:
    #Close session
    if session:
        session.close()

To supplement the explanation of the code,

--This code requires that all XML documents in the DB have the same root element name. --The CSV file name is automatically generated based on the root element name. --Extract the text element from the XML document and use it as CSV data. --Set the XPath of the text element as the CSV column name. --As a rough procedure, first collect all types of XPath of text elements from all XML documents (implemented by read_xpath_text function). --Next, use that XPath to assemble an XQuery statement and output the query result to CSV. --Since XML is semi-structured data, there may be XML documents that do not have the XPath included in the query. If XPath does not exist in the XML document, an empty string will be output to CSV as its column data. --Only the first column of CSV (column name = input_path) is output by extracting the XML file name from the result of the db: name function of XQuery, not from the text element of XML.

When I ran the code, the following was displayed and the CSV file "manyosyu.csv" was output.

Database 'testdb' was opened in 1.12 ms.

root element XPath:
{'/Q{}manyosyu[1]'}
XPath of text element (all types):
['/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[01]/Q{}image[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[01]/Q{}mean[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[01]/Q{}mkana[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[01]/Q{}pno[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[01]/Q{}poet[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[01]/Q{}yomi[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[02]/Q{}image[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[02]/Q{}mean[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[02]/Q{}mkana[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[02]/Q{}pno[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[02]/Q{}poet[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[02]/Q{}yomi[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[03]/Q{}image[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[03]/Q{}mean[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[03]/Q{}mkana[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[03]/Q{}pno[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[03]/Q{}poet[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[03]/Q{}yomi[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[04]/Q{}image[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[04]/Q{}mean[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[04]/Q{}mean[01]/text()[02]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[04]/Q{}mkana[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[04]/Q{}pno[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[04]/Q{}poet[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[04]/Q{}yomi[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[05]/Q{}image[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[05]/Q{}mean[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[05]/Q{}mkana[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[05]/Q{}pno[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[05]/Q{}poet[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[05]/Q{}yomi[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[06]/Q{}image[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[06]/Q{}mean[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[06]/Q{}mkana[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[06]/Q{}pno[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[06]/Q{}poet[01]/text()[01]',
 '/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[06]/Q{}yomi[01]/text()[01]']
Completed normally

The following query statement debug file is also output.

manyosyu_xquery.txt



for $root in /*
let $base_uri := fn:base-uri($root)
order by $base_uri
return <ROW>"{fn:substring($base_uri, fn:string-length(db:name($root)) + 3)}","{$root/Q{}volume[01]/Q{}poem[01]/Q{}image[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[01]/Q{}mean[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[01]/Q{}mkana[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[01]/Q{}pno[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[01]/Q{}poet[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[01]/Q{}yomi[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[02]/Q{}image[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[02]/Q{}mean[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[02]/Q{}mkana[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[02]/Q{}pno[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[02]/Q{}poet[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[02]/Q{}yomi[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[03]/Q{}image[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[03]/Q{}mean[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[03]/Q{}mkana[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[03]/Q{}pno[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[03]/Q{}poet[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[03]/Q{}yomi[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[04]/Q{}image[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[04]/Q{}mean[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[04]/Q{}mean[01]/text()[02]}","{$root/Q{}volume[01]/Q{}poem[04]/Q{}mkana[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[04]/Q{}pno[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[04]/Q{}poet[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[04]/Q{}yomi[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[05]/Q{}image[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[05]/Q{}mean[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[05]/Q{}mkana[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[05]/Q{}pno[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[05]/Q{}poet[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[05]/Q{}yomi[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[06]/Q{}image[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[06]/Q{}mean[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[06]/Q{}mkana[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[06]/Q{}pno[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[06]/Q{}poet[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[06]/Q{}yomi[01]/text()[01]}"</ROW>

Let's open the generated "manyosyu.csv" in Excel.

manyosyu.png

A CSV line was generated for each XML file loaded into the DB.

Recommended Posts

Convert XML document stored in XML database (BaseX) to CSV format (using Python)
Convert / return class object to JSON format in Python
Convert markdown to PDF in Python
[Python] How to convert db file to csv
Convert xml format data to txt format data (yolov3)
How to convert csv to tsv in CLI
[Python] Convert csv file delimiters to tab delimiters
Convert psd file to png in Python
Log in to Slack using requests in Python
Convert Python date types to RFC822 format
How to convert DateTimeField format in Django
Convert from Markdown to HTML in Python
Download files in any format using Python
Convert absolute URLs to relative URLs in Python
Note: [Python3] Convert datetime to a string in any format you like
I want to convert a table converted to PDF in Python back to CSV
How to plot galaxy visible light data using OpenNGC database in python
I made a script in python to convert .md files to Scrapbox format
Preprocessing with Python. Convert Nico Nico Douga tag search results to CSV format
Bad post for using "animeface-2009" in Python & Implementation of function to output to PASCAL VOC format XML file
Convert FBX files to ASCII <-> BINARY in Python
Convert PDFs to images in bulk with Python
[Python] Convert from DICOM to PNG or CSV
Convert csv, tsv data to matrix with python --using MovieLens as an example
Convert UTF-8 CSV files to read in Excel
Convert STL to Voxel mesh using Python VTK
Csv in python
Convert json format data to txt (using yolo)
format in python
Convert exponential notation float to str in Python
Convert cubic mesh code to WKT in Python
Convert strings to character-by-character list format with python
Pass dataframe containing True / False from Python to R in csv format (pd.DataFrame-> tbl_df)
Convert timezoned date and time to Unixtime in Python2.7
I want to write in Python! (1) Code format check
Convert CIDR notation netmask to dotted decimal notation in Python
[Python] Convert PDF text to CSV page by page (2/24 postscript)
How to convert floating point numbers to binary numbers in Python
Convert the image in .zip to PDF with Python
How to convert JSON file to CSV file with Python Pandas
Convert callback-style asynchronous API to async / await in Python
How to exit when using Python in Terminal (Mac)
[Python] Created a method to convert radix in 1 second
How to retrieve multiple arrays using slice in python.
How to convert Json file to CSV format or EXCEL format
Convert Webpay Entity type to Dict type (recursively in Python)
How to execute a command using subprocess in Python
How to read csv containing only integers in Python
Convert Pascal VOC format xml file to COCO format json file
Parse XML in Python
Image format in Python
Try to make it using GUI and PyQt in Python
Convert from Pandas DataFrame to System.Data.DataTable using Python for .NET
How to output a document in pdf format with Sphinx
Python / datetime> Implementation to convert YYYYMMDD format to YYYY / MM / DD
[R] [Python] Memo to read multiple csv files in multiple zip files
To return char * in a callback function using ctypes in Python
Batch convert all xlsx files in the folder to CSV files
Get Python scripts to run quickly in Cloud Run using responder
Solve the Japanese problem when using the CSV module in Python.
Things to keep in mind when using Python with AtCoder