Process PubMed XML data with Python

Introduction

This article is a personal memo on how to read bibliographic data (XML format) retrieved from a PubMed search with Python.

I would appreciate it if you could point out anything you notice.

The data to process

A single record looks like the following. Eventually I want to process multiple records, but first I will handle them one at a time.

001.xml


<PubmedArticle>
    <MedlineCitation Status="Publisher" Owner="NLM">
        <PMID Version="1">12345678</PMID>
        <DateRevised>
            <Year>2020</Year>
            <Month>03</Month>
            <Day>27</Day>
        </DateRevised>
        <Article PubModel="Print-Electronic">
            <Journal>
                <ISSN IssnType="Electronic">1873-3700</ISSN>
                <JournalIssue CitedMedium="Internet">
                    <PubDate>
                        <Year>2020</Year>
                        <Month>Mar</Month>
                    </PubDate>
                </JournalIssue>
                <Title>Journal of XXX</Title>
            </Journal>
            <ArticleTitle>Identification of XXX.</ArticleTitle>
            <AuthorList CompleteYN="Y">
                <Author ValidYN="Y">
                    <LastName>Sendai</LastName>
                    <ForeName>Shiro</ForeName>
                    <Initials>S</Initials>
                    <AffiliationInfo>
                        <Affiliation>Sendai, Japan.</Affiliation>
                    </AffiliationInfo>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Tohoku</LastName>
                    <ForeName>Taro</ForeName>
                    <Initials>T</Initials>
                    <AffiliationInfo>
                        <Affiliation>Miyagi, Japan.</Affiliation>
                    </AffiliationInfo>
                </Author>
            </AuthorList>
            <Language>eng</Language>
            <PublicationTypeList>
                <PublicationType UI="D016428">Journal Article</PublicationType>
            </PublicationTypeList>
            <ArticleDate DateType="Electronic">
                <Year>2020</Year>
                <Month>03</Month>
                <Day>23</Day>
            </ArticleDate>
        </Article>
        <CitationSubset>IM</CitationSubset>
    </MedlineCitation>
    <PubmedData>
        <PublicationStatus>aheadofprint</PublicationStatus>
        <ArticleIdList>
            <ArticleId IdType="pubmed">32213359</ArticleId>
            <ArticleId IdType="pii">S0031-9422(19)30971-9</ArticleId>
            <ArticleId IdType="doi">10.1016/j.phytochem.2020.112349</ArticleId>
        </ArticleIdList>
    </PubmedData>
</PubmedArticle>

Understanding basic usage

Import the library for reading XML.

001.py


import xml.etree.ElementTree as ET

Read the XML data from the file. The records appear to be separated by blank lines (two consecutive line breaks), so split the contents into a list with split().

002.py


with open("./xxxx/pubmed.xml", "r") as test_data:
    contents = test_data.read()
records = contents.split('\n\n')
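Splitting raw text on blank lines is somewhat fragile. As an aside, a sketch of an alternative, under the assumption that the download is a single well-formed document whose root is `<PubmedArticleSet>` (PubMed's XML export usually looks like this): records can then be taken straight from the parsed tree. The inline string below is a toy stand-in for the real file.

```python
import xml.etree.ElementTree as ET

# Toy stand-in for a PubMed export file: one <PubmedArticleSet> root
# containing one <PubmedArticle> element per record.
doc = """<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID>111</PMID></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID>222</PMID></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""

# Each <PubmedArticle> child is one record, with no string splitting needed.
articles = ET.fromstring(doc).findall('PubmedArticle')
pmids = [a.find('MedlineCitation/PMID').text for a in articles]
print(pmids)  # ['111', '222']
```

With a real file, `ET.parse(path).getroot()` would replace `ET.fromstring(doc)`.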

The first record (records[0]) is parsed with ET.fromstring() and stored in the variable root. Checking root with type() shows that it is an Element object.

003.py


root = ET.fromstring(records[0])
type(root)
#<class 'xml.etree.ElementTree.Element'>

You can check the tag name with root.tag. Let's try it.

004.py


root.tag
#'PubmedArticle'

Roughly speaking, one record has the following structure. We accessed the outermost tag with root.tag.

002.xml


<PubmedArticle>
    <MedlineCitation>
    </MedlineCitation>
    <PubmedData>
    </PubmedData>
</PubmedArticle>

Inside <PubmedArticle> there are two elements (MedlineCitation and PubmedData), which can be accessed with subscripts. Let's access them by index and check their types.

005.py


root[0]
#<Element 'MedlineCitation' at 0x10a9d5b38>
type(root[0])
#<class 'xml.etree.ElementTree.Element'>

root[1]
#<Element 'PubmedData' at 0x10aa78868>
type(root[1])
#<class 'xml.etree.ElementTree.Element'>

You can see that both are Element objects.

In short, every node is an Element object. Element objects are iterable, so child nodes can be retrieved and processed one by one.

for i in root:
    print(i.tag)

You can look up an Element's tag with .tag, and you can look up the attributes and attribute values attached to that tag with .attrib.

root[0].tag
#'MedlineCitation'

root[0].attrib
#{'Status': 'Publisher', 'Owner': 'NLM'}
# The tag around root[0] looks like this:
#    <MedlineCitation Status="Publisher" Owner="NLM">


type(root[0].attrib)
#<class 'dict'> #Dictionary class

How to access Element objects

There are three main methods. For find() and findall() you can specify a path of one or more tags: enclose the whole path in quotation marks and separate the tags with slashes. Note that iter() takes a single tag name, not a path.

  1. find('tag1/tag2')
  2. findall('tag1/tag2')
  3. iter('tag')

find() returns a single Element object, findall() returns a list of Element objects, and iter() returns an iterator over matching elements. Let's check.

root.find('MedlineCitation/DateRevised/Year')
#<Element 'Year' at 0x10a9f8ae8>

root.findall('MedlineCitation')
#[<Element 'MedlineCitation' at 0x10a9d5b38>]

root.iter('Author')
#<_elementtree._element_iterator object at 0x10aa65990>

#Let's iterate with a for statement.
for i in root.iter('Author'):
    print(i)
#<Element 'Author' at 0x10aa6e9f8>
#<Element 'Author' at 0x10aa6ec28>

It turns out that findall() matches relative to the Element's direct children (though a slash-separated path can reach deeper), while iter() searches all of the Element's descendants: children, grandchildren, great-grandchildren, and so on.
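This difference can be sketched with a small snippet that reuses the structure of the sample record above: Author is a great-grandchild of the root, so findall() with a bare tag name finds nothing, while iter() searches the whole subtree.

```python
import xml.etree.ElementTree as ET

# Author sits three levels below the root in this structure.
xml_snippet = """<PubmedArticle>
  <MedlineCitation>
    <Article>
      <AuthorList>
        <Author><LastName>Sendai</LastName></Author>
        <Author><LastName>Tohoku</LastName></Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""

demo_root = ET.fromstring(xml_snippet)
print(len(demo_root.findall('Author')))              # 0: not a direct child
print(len(list(demo_root.iter('Author'))))           # 2: found anywhere in the subtree
print(len(demo_root.findall(
    'MedlineCitation/Article/AuthorList/Author')))   # 2: a full path works too
```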

Accessing the values of an Element object

An Element object holds two kinds of values: attribute values and text data. An attribute value can be obtained with .get('attribute name') on the Element, or equivalently with .attrib['attribute name'].

# Using .get()
root.find('MedlineCitation').get('Status')
#'Publisher'

# Using .attrib
root.find('MedlineCitation').attrib['Status']
#'Publisher'

You can also get text data by using .text for the Element object.

Here, text data means the part enclosed by the tags: 2020 in <Year>2020</Year>.

Let's get a value by specifying the path to the Element with find().

root.find('MedlineCitation/Article/Journal/JournalIssue/PubDate/Year').text
#'2020'

To get information about multiple Authors, iterate over the list obtained with findall ().

Corrected to handle multiple author affiliations (March 31, 2020).

for x in root.findall('MedlineCitation/Article/AuthorList/Author'):
    print(x.find('LastName').text)   # Author's last name
    print(x.find('ForeName').text)   # Author's first name
    for y in x.findall('AffiliationInfo'):
        print(y.find('Affiliation').text)

The DOI (digital object identifier) is stored in the ELocationID tag, but ELocationID can appear with several attribute values, so we need to take the text only when EIdType="doi".

for x in root.findall('MedlineCitation/Article/ELocationID'):
    if x.get('EIdType') == 'doi':
        print(x.text)

We also need to distinguish whether a record is a Review or a Journal Article, which is recorded in PublicationType. A record usually has multiple PublicationType entries, and if any of them has the value Review, the record is a review.

For example, a Review record looks like this:

.xml


<PublicationTypeList>
    <PublicationType UI="D016428">Journal Article</PublicationType>
    <PublicationType UI="D016454">Review</PublicationType>
    <PublicationType UI="D013485">Research Support, Non-U.S. Gov't</PublicationType>
</PublicationTypeList>

So, to judge whether a record is a review:

isReview = False
for x in root.findall('MedlineCitation/Article/PublicationTypeList/PublicationType'):
    if x.text == 'Review':
        isReview = True

I think this works well.

Putting the above together, along with other information you may want to extract:

import xml.etree.ElementTree as ET

with open("./pubmed.xml", "r") as test_data:
    contents = test_data.read()
records = contents.split('\n\n')
root = ET.fromstring(records[0])  # For now, only the first record

# Author information
for x in root.findall('MedlineCitation/Article/AuthorList/Author'):
    print(x.find('LastName').text)   # Author's last name
    print(x.find('ForeName').text)   # Author's first name
    for y in x.findall('AffiliationInfo'):
        print(y.find('Affiliation').text)

# Review judgment
isReview = False
for x in root.findall('MedlineCitation/Article/PublicationTypeList/PublicationType'):
    if x.text == 'Review':
        isReview = True

# doi
for x in root.findall('MedlineCitation/Article/ELocationID'):
    if x.get('EIdType') == 'doi':
        print(x.text)

# PMID
print(root.find('MedlineCitation/PMID').text)
# Article title
print(root.find('MedlineCitation/Article/ArticleTitle').text)
# Journal name
print(root.find('MedlineCitation/Article/Journal/Title').text)
# Publication year
print(root.find('MedlineCitation/Article/Journal/JournalIssue/PubDate/Year').text)
# Publication month
print(root.find('MedlineCitation/Article/Journal/JournalIssue/PubDate/Month').text)
# Language
print(root.find('MedlineCitation/Article/Language').text)

That should do it. The code above processes only the first record, but

for record in records:
    root = ET.fromstring(record)
    # process each record here

lets you handle every record.

Now, given XML data, the necessary information can be extracted in one pass. All that remains is to think about how to shape the output.
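One way to shape the output, for example, is to collect each record into a dictionary. A minimal sketch; the helper name and dictionary keys below are my own choices, not any PubMed convention:

```python
import xml.etree.ElementTree as ET

def record_to_dict(root):
    """Collect a few fields from one <PubmedArticle> Element into a dict."""
    pub_types = [p.text for p in root.findall(
        'MedlineCitation/Article/PublicationTypeList/PublicationType')]
    return {
        'pmid': root.find('MedlineCitation/PMID').text,
        'title': root.find('MedlineCitation/Article/ArticleTitle').text,
        'authors': [a.find('LastName').text for a in root.findall(
            'MedlineCitation/Article/AuthorList/Author')],
        'is_review': 'Review' in pub_types,
    }

# A trimmed-down record, reusing the structure of the sample above.
sample = """<PubmedArticle><MedlineCitation>
  <PMID>12345678</PMID>
  <Article>
    <ArticleTitle>Identification of XXX.</ArticleTitle>
    <AuthorList><Author><LastName>Sendai</LastName></Author></AuthorList>
    <PublicationTypeList>
      <PublicationType>Journal Article</PublicationType>
    </PublicationTypeList>
  </Article>
</MedlineCitation></PubmedArticle>"""

print(record_to_dict(ET.fromstring(sample)))
```

Applying record_to_dict to every record in records yields a list of dictionaries, which is easy to pass on to CSV output or pandas.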

Now you know how to handle xml data.
