[PYTHON] Extract taxonomy information etc. from GenBank data in XML format

Script for registered sequences

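Given GenBank sequence records in XML format, you can retrieve the taxon information with the following script. It writes one tab-separated line per record: the accession version, the taxon ID (the qualifier value containing "taxon:"), and the taxonomy lineage.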


import xml.etree.ElementTree as ET

# Parse the GenBank XML file and get the root element
tree = ET.parse("./gene_file.xml")
root = tree.getroot()

# Path from a GBSeq record down to its individual qualifiers
QUALIFIER_PATH = 'GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier'

out = ""
for seq in root.findall('GBSeq'):
    # findtext returns the default instead of failing when a tag is missing
    accession = seq.findtext('GBSeq_accession-version', default="")
    taxon = seq.findtext('GBSeq_taxonomy', default="")
    # Reset for every record so a value from the previous record is never reused
    taxon_id_out = ""
    for qualifier in seq.findall(QUALIFIER_PATH):
        value = qualifier.findtext('GBQualifier_value')
        # Qualifiers without a value are skipped; they must not clear a found ID
        if value is not None and 'taxon:' in value:
            taxon_id_out = value
            break
    out += accession + "\t" + taxon_id_out + "\t" + taxon + "\n"

with open("out10.taxon.txt", mode='w') as f:
    f.write(out)
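
To check the extraction logic without a real download, the same idea can be sketched as a function that takes an XML string. This is only a minimal sketch: the function name extract_taxon_rows and the sample record (accession, taxon ID, lineage) are made up for illustration and are not real GenBank data. The sample deliberately ends with a qualifier that has no value, to confirm that such a qualifier does not erase a taxon ID that was already found.

```python
import xml.etree.ElementTree as ET

# Hypothetical hand-written record in the GBSet/GBSeq layout (not real GenBank data)
SAMPLE_XML = """<GBSet>
  <GBSeq>
    <GBSeq_accession-version>XX000001.1</GBSeq_accession-version>
    <GBSeq_taxonomy>Eukaryota; Metazoa</GBSeq_taxonomy>
    <GBSeq_feature-table>
      <GBFeature>
        <GBFeature_key>source</GBFeature_key>
        <GBFeature_quals>
          <GBQualifier>
            <GBQualifier_name>organism</GBQualifier_name>
            <GBQualifier_value>Example species</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>db_xref</GBQualifier_name>
            <GBQualifier_value>taxon:12345</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>pseudo</GBQualifier_name>
          </GBQualifier>
        </GBFeature_quals>
      </GBFeature>
    </GBSeq_feature-table>
  </GBSeq>
</GBSet>"""

QUALIFIER_PATH = 'GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier'


def extract_taxon_rows(xml_text):
    """Return a list of (accession, taxon_id, taxonomy) tuples, one per GBSeq."""
    root = ET.fromstring(xml_text)
    rows = []
    for seq in root.findall('GBSeq'):
        accession = seq.findtext('GBSeq_accession-version', default="")
        taxonomy = seq.findtext('GBSeq_taxonomy', default="")
        taxon_id = ""  # stays empty when the record has no "taxon:" qualifier
        for qualifier in seq.findall(QUALIFIER_PATH):
            value = qualifier.findtext('GBQualifier_value')
            if value is not None and 'taxon:' in value:
                taxon_id = value
                break
        rows.append((accession, taxon_id, taxonomy))
    return rows


rows = extract_taxon_rows(SAMPLE_XML)
for row in rows:
    # One tab-separated line per record, same column order as the script above
    print("\t".join(row))
```

Returning tuples instead of building one large string keeps the parsing separate from the output step, so the rows can be written with any writer you prefer.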

Why I wrote this

Parsing the flat file is tedious and has many exceptions to handle, so I tried reading and extracting the information from the XML instead.
