[PYTHON] Create a correspondence table between EC number and Uniprot entry from enzyme.dat

What is enzyme.dat

ENZYME is a database of information on enzyme nomenclature. Each entry in the file enzyme.dat is made up of lines of the following types:

ID  Identification                         (Begins each entry; 1 per entry)
DE  Description (official name)            (>=1 per entry)
AN  Alternate name(s)                      (>=0 per entry)
CA  Catalytic activity                     (>=1 per entry)
CF  Cofactor(s)                            (>=0 per entry)
CC  Comments                               (>=0 per entry)
PR  Cross-references to PROSITE            (>=0 per entry)
DR  Cross-references to Swiss-Prot         (>=0 per entry)

Contents of enzyme.dat (excerpt)

ID   1.1.1.1
DE   Alcohol dehydrogenase.
AN   Aldehyde reductase.
CA   (1) A primary alcohol + NAD(+) = an aldehyde + NADH.
CA   (2) A secondary alcohol + NAD(+) = a ketone + NADH.
CF   Zn(2+) or Fe cation.
CC   -!- Acts on primary or secondary alcohols or hemi-acetals with very broad
CC       specificity; however the enzyme oxidizes methanol much more poorly
CC       than ethanol.
CC   -!- The animal, but not the yeast, enzyme acts also on cyclic secondary
CC       alcohols.
PR   PROSITE; PDOC00058;
PR   PROSITE; PDOC00059;
PR   PROSITE; PDOC00060;
DR   P07327, ADH1A_HUMAN;  P28469, ADH1A_MACMU;  Q5RBP7, ADH1A_PONAB;
DR   P25405, ADH1A_SAAHA;  P25406, ADH1B_SAAHA;  P00327, ADH1E_HORSE;

As part of my research, I needed a correspondence table between EC numbers (the ID lines above), which classify enzymes by function, and the UniProt entries of each protein (the DR lines above). I therefore decided to extract the ID, the DR, and the description of each EC number (the DE lines above) from enzyme.dat and build a table that links them.

What you need

- enzyme.dat (obtained from ftp://ftp.expasy.org/databases/enzyme)

Python modules used

- pandas (used to create the DataFrame)

Things to do

Extract the lines that start with ID, DE, and DR into lists. Build a table from those lists as a DataFrame and export it as a CSV file.

What I did

Open file

path = "enzyme.dat"
with open(path) as f:
    s = f.readlines()  # Read the file as a list of lines
    s = s[24:]  # Skip the explanatory header at the top of the file
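
The fixed offset of 24 depends on the header length of the particular release that was downloaded. A minimal sketch of a more tolerant alternative is shown below; the helper name skip_header is hypothetical, and it assumes that no header line begins with "ID   ".

```python
def skip_header(lines):
    """Return the lines starting at the first ID line, dropping any header."""
    for index, line in enumerate(lines):
        if line.startswith("ID   "):
            return lines[index:]
    return []  # no entry found at all


# Made-up miniature input: one header line, a separator, then one entry.
demo = [
    "CC   header text\n",
    "//\n",
    "ID   1.1.1.1\n",
    "DE   Alcohol dehydrogenase.\n",
    "//\n",
]
body = skip_header(demo)
print(body[0])
```

With the real file, `s = skip_header(f.readlines())` could stand in for the two lines that read and slice the list.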

Creating the ID list

id_list = []
for i in s:
    if i.startswith("ID   "):  # Find lines that start with "ID"
        x = i[5:-1]  # Take the text after "ID   ", dropping the trailing newline
        id_list.append(x)  # Add to the list
id_list[:10]

['1.1.1.1',
 '1.1.1.2',
 '1.1.1.3',
 '1.1.1.4',
 '1.1.1.5',
 '1.1.1.6',
 '1.1.1.7',
 '1.1.1.8',
 '1.1.1.9',
 '1.1.1.10']

Creating the description list

Because DE and DR may span two or more lines, look one line ahead while building each element. Keep concatenating strings as long as the next line still starts with "DE", and once the last DE line of the entry is reached, append the result to the list.

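description_list = []
name = ""
for i in range(len(s)):
    if s[i].startswith("DE   "):
        x = s[i][5:-1]  # Text after "DE   ", without the trailing newline
        name += x
        # The next line is not a DE line, so this description is complete
        if not s[i + 1].startswith("DE   "):
            description_list.append(name)
            name = ""
description_list[:10]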

['Alcohol dehydrogenase.',
 'Alcohol dehydrogenase (NADP(+)).',
 'Homoserine dehydrogenase.',
 '(R,R)-butanediol dehydrogenase.',
 'Transferred entry: 1.1.1.303 and 1.1.1.304.',
 'Glycerol dehydrogenase.',
 'Propanediol-phosphate dehydrogenase.',
 'Glycerol-3-phosphate dehydrogenase (NAD(+)).',
 'D-xylulose reductase.',
 'L-xylulose reductase.']
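
Since the same look-ahead pattern applies to both DE and DR lines, it could be factored into one helper. The sketch below is only an illustration: collect_field is a hypothetical name, and the demo lines are made up. It joins continuation lines without a separator, as the loop above does, and it avoids indexing `s[i + 1]`, so it does not depend on the last line of the input being something other than the target field.

```python
def collect_field(lines, prefix):
    """Join each run of consecutive lines starting with `prefix` into one string."""
    results = []
    current = []
    for line in lines:
        if line.startswith(prefix):
            current.append(line[len(prefix):].rstrip("\n"))
        elif current:
            # The run just ended, so store the joined text
            results.append("".join(current))
            current = []
    if current:  # flush a run that reaches the end of the input
        results.append("".join(current))
    return results


# Made-up lines for illustration, including a two-line DE field.
demo = [
    "ID   1.1.1.1\n",
    "DE   Alcohol\n",
    "DE    dehydrogenase.\n",
    "//\n",
    "ID   9.9.9.9\n",
    "DE   Example enzyme.\n",
]
descriptions = collect_field(demo, "DE   ")
print(descriptions)
```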

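Creating the accession list

accession_list = []
name = ""
for i in range(len(s)):
    if s[i].startswith("DR   "):
        x = s[i][5:-1]  # Text after "DR   ", without the trailing newline
        name += x
        # The next line is not a DR line, so this entry's cross-references are complete
        if not s[i + 1].startswith("DR   "):
            accession_list.append(name)
            name = ""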

accession_list[1]

'Q6AZW2, A1A1A_DANRE;  Q568L5, A1A1B_DANRE;  Q24857, ADH3_ENTHI ;Q04894, ADH6_YEAST ;  P25377, ADH7_YEAST ;  O57380, ADH8_PELPE ;Q9F282, ADHA_THEET ;  P0CH36, ADHC1_MYCS2;  P0CH37, ADHC2_MYCS2;P0A4X1, ADHC_MYCBO ;  P9WQC4, ADHC_MYCTO ;  P9WQC5, ADHC_MYCTU ;P27250, AHR_ECOLI  ;  Q3ZCJ2, AK1A1_BOVIN;  Q5ZK84, AK1A1_CHICK;O70473, AK1A1_CRIGR;  P14550, AK1A1_HUMAN;  Q9JII6, AK1A1_MOUSE;P50578, AK1A1_PIG  ;  Q5R5D5, AK1A1_PONAB;  P51635, AK1A1_RAT  ;Q6GMC7, AK1A1_XENLA;  Q28FD1, AK1A1_XENTR;  Q9UUN9, ALD2_SPOSA ;P27800, ALDX_SPOSA ;  P75691, YAHK_ECOLI ;'
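
Each element is still one long string. If individual accessions are needed later, the string can be split into (accession, entry name) pairs. The following is a rough sketch using a regular expression; the names PAIR and split_dr are hypothetical, and the pattern assumes every cross-reference has the form "accession, entry name ;" seen in the output above.

```python
import re

# One cross-reference: accession, comma, entry name, optional padding, semicolon.
PAIR = re.compile(r"([A-Z0-9]+),\s*([A-Z0-9_]+)\s*;")


def split_dr(text):
    """Return a list of (accession, entry_name) tuples found in a DR string."""
    return PAIR.findall(text)


pairs = split_dr(
    "Q6AZW2, A1A1A_DANRE;  Q568L5, A1A1B_DANRE;  Q24857, ADH3_ENTHI ;Q04894, ADH6_YEAST ;"
)
print(pairs)
```

Because the pattern requires a comma, a string that contains no cross-reference yields an empty list.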

At this point it should be possible to create a DataFrame from these three lists, so first compare the number of elements in each list:

len(id_list), len(description_list), len(accession_list)

(7876, 7876, 5001)

Only accession_list has a different length.

Why doesn't the number match for accession_list?

If you check the dat file carefully, you will find entries like these:

//
ID   1.14.13.42
DE   Deleted entry.
//
ID   1.14.13.43
DE   Questin monooxygenase.
AN   Questin oxygenase.
CA   Questin + NADPH + O(2) = demethylsulochrin + NADP(+).
CC   -!- The enzyme cleaves the anthraquinone ring of questin to form a
CC       benzophenone.
CC   -!- Involved in the biosynthesis of the seco-anthraquinone (+)-geodin.
//

There are quite a few IDs that have no DR line. Therefore:

# Find enzymes without DR: their last line (PR, CC, DE, CA, or CF) is directly followed by "//"
for name in ("PR", "CC", "DE", "CA", "CF"):
    print("start", name)
    no_dr_enzyme = []
    for i in range(len(s)):
        if s[i].startswith(f"{name}   "):
            if s[i + 1].startswith("//"):
                no_dr_enzyme.append(i)
    x = 1
    for i in no_dr_enzyme:
        s.insert(i + x, "DR   none ;\n")
        x += 1

This adds the line "DR   none ;" to every ID that has no DR.

If you create accession_list again and compare the number of elements:

len(id_list), len(description_list), len(accession_list)
(7876, 7876, 7876)

Now that all three lists have the same length, we can create a DataFrame.
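
As an alternative to inserting placeholder lines, the file could be read one entry at a time, so that a missing DR never shifts the lists out of step. The sketch below is not the method used in this post. The function name parse_entries is hypothetical, it assumes every entry ends with a "//" line, and it reuses the "none ;" placeholder for entries without DR.

```python
def parse_entries(lines):
    """Yield (ec_id, description, dr_text) for each entry terminated by '//'."""
    ec_id, de_parts, dr_parts = None, [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("ID   "):
            ec_id = line[5:]
        elif line.startswith("DE   "):
            de_parts.append(line[5:])
        elif line.startswith("DR   "):
            dr_parts.append(line[5:])
        elif line.startswith("//"):
            if ec_id is not None:  # ignore a separator that precedes the first entry
                yield ec_id, "".join(de_parts), "".join(dr_parts) or "none ;"
            ec_id, de_parts, dr_parts = None, [], []


# Abbreviated sample built from the excerpts shown earlier.
sample = """\
//
ID   1.1.1.1
DE   Alcohol dehydrogenase.
DR   P07327, ADH1A_HUMAN;  P28469, ADH1A_MACMU;
//
ID   1.14.13.42
DE   Deleted entry.
//
""".splitlines(keepends=True)

rows = list(parse_entries(sample))
print(rows)
```

Each tuple already holds one ID, one description, and one accession string, so the three columns cannot end up with different lengths.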

Create a DataFrame and export it to a csv file

import pandas as pd

df = pd.DataFrame(
    {"ID": id_list, "Description": description_list, "Accession": accession_list}
)

# Export as a CSV file
df.to_csv("enzyme.csv", index=False)
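
The table above keeps every accession for an EC number in a single cell. If one row per UniProt accession is easier to work with, the rows can be expanded before writing. This is a small sketch using only the standard csv module; the function name to_long_rows and the EntryName column are hypothetical, and the two input rows are abbreviated examples.

```python
import csv
import io


def to_long_rows(rows):
    """Expand (ec_id, description, dr_text) rows into one row per accession."""
    for ec_id, description, dr_text in rows:
        for pair in dr_text.split(";"):
            if "," not in pair:
                continue  # skips empty pieces and the "none" placeholder
            accession, entry_name = (part.strip() for part in pair.split(",", 1))
            yield ec_id, description, accession, entry_name


wide = [
    ("1.1.1.4", "(R,R)-butanediol dehydrogenase.", "P14940, ADH_CUPNE  ;  Q0KDL6, ADH_CUPNH  ;"),
    ("1.1.1.5", "Transferred entry: 1.1.1.303 and 1.1.1.304.", "none ;"),
]

buffer = io.StringIO()  # stands in for a real output file
writer = csv.writer(buffer)
writer.writerow(["ID", "Description", "Accession", "EntryName"])
writer.writerows(to_long_rows(wide))
print(buffer.getvalue())
```

Entries without cross-references produce no rows in this layout.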

Completed script (make_enzyme_table.py)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# make_enzyme_table.py
#

import pandas as pd

def main():
    # Read the file
    path = "enzyme.dat"

    with open(path) as f:
        s = f.readlines()
        s = s[24:]  # Skip the explanatory header at the top of the file
        print(s[:10])

    # Create the ID column
    id_list = []
    for i in s:
        if i.startswith("ID   "):
            x = i[5:-1]
            id_list.append(x)

    # Create the description column
    description_list = []
    name = ""
    for i in range(len(s)):
        if s[i].startswith("DE   "):
            x = s[i][5:-1]
            name += x
            if not s[i + 1].startswith("DE   "):
                description_list.append(name)
                name = ""

    # Find enzymes without DR (their last line is a PR, CC, DE, CA, or CF line
    # directly followed by "//") and give them a placeholder DR line
    for name in ("PR", "CC", "DE", "CA", "CF"):
        print("start", name)
        no_dr_enzyme = []
        for i in range(len(s)):
            if s[i].startswith(f"{name}   "):
                if s[i + 1].startswith("//"):
                    no_dr_enzyme.append(i)
        x = 1
        for i in no_dr_enzyme:
            s.insert(i + x, "DR   none ;\n")
            x += 1

    # Create the accession column
    accession_list = []
    name = ""
    for i in range(len(s)):
        if s[i].startswith("DR   "):
            x = s[i][5:-1]
            name += x
            if not s[i + 1].startswith("DR   "):
                accession_list.append(name)
                name = ""

    # Create the DataFrame
    df = pd.DataFrame(
        {"ID": id_list, "Description": description_list, "Accession": accession_list}
    )

    # Write the CSV file
    df.to_csv("enzyme.csv", index=False)

if __name__ == "__main__":
    main()

Complete

Contents of enzyme.csv (excerpt)

ID,Description,Accession
1.1.1.1,Alcohol dehydrogenase.,"P07327, ADH1A_HUMAN;  P28469, ADH1A_MACMU;  Q5RBP7, ADH1A_PONAB;P25405, ADH1A_SAAHA;  P25406, ADH1B_SAAHA;  P00327, ADH1E_HORSE;P00326, ADH1G_HUMAN;  O97959, ADH1G_PAPHA;  P00328, ADH1S_HORSE;P80222, ADH1_ALLMI ;  P30350, ADH1_ANAPL ;  P49645, ADH1_APTAU ;P06525, ADH1_ARATH ;  P41747, ADH1_ASPFN ;  Q17334, ADH1_CAEEL ;P43067, ADH1_CANAX ;  P85440, ADH1_CATRO ;  P14219, ADH1_CENAM ;P48814, ADH1_CERCA ;  Q70UN9, ADH1_CERCO ;  P23991, ADH1_CHICK ;P86883, ADH1_COLLI ;  P19631, ADH1_COTJA ;  P23236, ADH1_DROHY ;P48586, ADH1_DROMN ;  P09370, ADH1_DROMO ;  P22246, ADH1_DROMT ;P07161, ADH1_DROMU ;  P12854, ADH1_DRONA ;  P08843, ADH1_EMENI ;P26325, ADH1_GADMC ;  Q9Z2M2, ADH1_GEOAT ;  Q64413, ADH1_GEOBU ;Q64415, ADH1_GEOKN ;  P12311, ADH1_GEOSE ;  P05336, ADH1_HORVU ;P20369, ADH1_KLULA ;  Q07288, ADH1_KLUMA ;  P00333, ADH1_MAIZE ;P86885, ADH1_MESAU ;  P00329, ADH1_MOUSE ;  P80512, ADH1_NAJNA ;Q9P6C8, ADH1_NEUCR ;  Q75ZX4, ADH1_ORYSI ;  Q2R8Z5, ADH1_ORYSJ ;P12886, ADH1_PEA   ;  P22797, ADH1_PELPE ;  P41680, ADH1_PERMA ;P25141, ADH1_PETHY ;  O00097, ADH1_PICST ;  Q03505, ADH1_RABIT ;P06757, ADH1_RAT   ;  P14673, ADH1_SOLTU ;  P80338, ADH1_STRCA ;P13603, ADH1_TRIRP ;  P00330, ADH1_YEAST ;  Q07264, ADH1_ZEALU ;P20368, ADH1_ZYMMO ;  O45687, ADH2_CAEEL ;  O94038, ADH2_CANAL ;P48815, ADH2_CERCA ;  Q70UP5, ADH2_CERCO ;  Q70UP6, ADH2_CERRO ;P27581, ADH2_DROAR ;  P25720, ADH2_DROBU ;  P23237, ADH2_DROHY ;P48587, ADH2_DROMN ;  P09369, ADH2_DROMO ;  P07160, ADH2_DROMU ;P24267, ADH2_DROWH ;  P37686, ADH2_ECOLI ;  P54202, ADH2_EMENI ;Q24803, ADH2_ENTHI ;  P42327, ADH2_GEOSE ;  P10847, ADH2_HORVU ;P49383, ADH2_KLULA ;  Q9P4C2, ADH2_KLUMA ;  P04707, ADH2_MAIZE ;Q4R1E8, ADH2_ORYSI ;  Q0ITW7, ADH2_ORYSJ ;  O13309, ADH2_PICST ;P28032, ADH2_SOLLC ;  P14674, ADH2_SOLTU ;  F2Z678, ADH2_YARLI ;P00331, ADH2_YEAST ;  F8DVL8, ADH2_ZYMMA ;  P0DJA2, ADH2_ZYMMO ;P07754, ADH3_EMENI ;  P42328, ADH3_GEOSE ;  P10848, ADH3_HORVU ;P49384, ADH3_KLULA ;  P14675, ADH3_SOLTU ;  
P07246, ADH3_YEAST ;P49385, ADH4_KLULA ;  Q09669, ADH4_SCHPO ;  A6ZTT5, ADH4_YEAS7 ;P10127, ADH4_YEAST ;  Q6XQ67, ADH5_SACPS ;  P38113, ADH5_YEAST ;P28332, ADH6_HUMAN ;  P41681, ADH6_PERMA ;  Q5R7Z8, ADH6_PONAB ;Q5XI95, ADH6_RAT   ;  P40394, ADH7_HUMAN ;  Q64437, ADH7_MOUSE ;P41682, ADH7_RAT   ;  P9WQC0, ADHA_MYCTO ;  P9WQC1, ADHA_MYCTU ;O31186, ADHA_RHIME ;  Q7U1B9, ADHB_MYCBO ;  P9WQC6, ADHB_MYCTO ;P9WQC7, ADHB_MYCTU ;  P9WQB8, ADHD_MYCTO ;  P9WQB9, ADHD_MYCTU ;P33744, ADHE_CLOAB ;  P0A9Q8, ADHE_ECO57 ;  P0A9Q7, ADHE_ECOLI ;P81600, ADHH_GADMO ;  P72324, ADHI_RHOS4 ;  Q9SK86, ADHL1_ARATH;Q9SK87, ADHL2_ARATH;  A1L4Y2, ADHL3_ARATH;  Q8VZ49, ADHL4_ARATH;Q0V7W6, ADHL5_ARATH;  Q8LEB2, ADHL6_ARATH;  Q9FH04, ADHL7_ARATH;P81601, ADHL_GADMO ;  P39451, ADHP_ECOLI ;  O46649, ADHP_RABIT ;O46650, ADHQ_RABIT ;  Q96533, ADHX_ARATH ;  Q3ZC42, ADHX_BOVIN ;Q17335, ADHX_CAEEL ;  Q54TC2, ADHX_DICDI ;  P46415, ADHX_DROME ;P19854, ADHX_HORSE ;  P11766, ADHX_HUMAN ;  P93629, ADHX_MAIZE ;P28474, ADHX_MOUSE ;  P80360, ADHX_MYXGL ;  P81431, ADHX_OCTVU ;A2XAZ3, ADHX_ORYSI ;  Q0DWH1, ADHX_ORYSJ ;  P80572, ADHX_PEA   ;O19053, ADHX_RABIT ;  P12711, ADHX_RAT   ;  P80467, ADHX_SAAHA ;P86884, ADHX_SCYCA ;  P79896, ADHX_SPAAU ;  Q9NAR7, ADH_BACOL  ;P14940, ADH_CUPNE  ;  Q0KDL6, ADH_CUPNH  ;  Q00669, ADH_DROAD  ;P21518, ADH_DROAF  ;  P25139, ADH_DROAM  ;  Q50L96, ADH_DROAN  ;P48584, ADH_DROBO  ;  P22245, ADH_DRODI  ;  Q9NG42, ADH_DROEQ  ;P28483, ADH_DROER  ;  P48585, ADH_DROFL  ;  P51551, ADH_DROGR  ;Q09009, ADH_DROGU  ;  P51549, ADH_DROHA  ;  P21898, ADH_DROHE  ;Q07588, ADH_DROIM  ;  Q9NG40, ADH_DROIN  ;  Q27404, ADH_DROLA  ;P10807, ADH_DROLE  ;  P07162, ADH_DROMA  ;  Q09010, ADH_DROMD  ;P00334, ADH_DROME  ;  Q00671, ADH_DROMM  ;  P25721, ADH_DROMY  ;Q00672, ADH_DRONI  ;  P07159, ADH_DROOR  ;  P84328, ADH_DROPB  ;P37473, ADH_DROPE  ;  P23361, ADH_DROPI  ;  P23277, ADH_DROPL  ;Q6LCE4, ADH_DROPS  ;  Q9U8S9, ADH_DROPU  ;  Q9GN94, ADH_DROSE  ;Q24641, ADH_DROSI  ;  P23278, ADH_DROSL  ;  Q03384, 
ADH_DROSU  ;P28484, ADH_DROTE  ;  P51550, ADH_DROTS  ;  B4M8Y0, ADH_DROVI  ;Q05114, ADH_DROWI  ;  P26719, ADH_DROYA  ;  P17648, ADH_FRAAN  ;P48977, ADH_MALDO  ;  P81786, ADH_MORSE  ;  P9WQC2, ADH_MYCTO  ;P9WQC3, ADH_MYCTU  ;  P39462, ADH_SACS2  ;  P25988, ADH_SCAAL  ;Q00670, ADH_SCACA  ;  P00332, ADH_SCHPO  ;  Q2FJ31, ADH_STAA3  ;Q2G0G1, ADH_STAA8  ;  Q2YSX0, ADH_STAAB  ;  Q5HI63, ADH_STAAC  ;Q99W07, ADH_STAAM  ;  Q7A742, ADH_STAAN  ;  Q6GJ63, ADH_STAAR  ;Q6GBM4, ADH_STAAS  ;  Q8NXU1, ADH_STAAW  ;  Q5HRD6, ADH_STAEQ  ;Q8CQ56, ADH_STAES  ;  Q4J781, ADH_SULAC  ;  P50381, ADH_SULSR  ;Q96XE0, ADH_SULTO  ;  P51552, ADH_ZAPTU  ;  Q5AR48, ASQE_EMENI ;A5JYX5, DHS3_CAEEL ;  P32771, FADH_YEAST ;  A7ZIA4, FRMA_ECO24 ;Q8X5J4, FRMA_ECO57 ;  A7ZX04, FRMA_ECOHS ;  A1A835, FRMA_ECOK1 ;Q0TKS7, FRMA_ECOL5 ;  Q8FKG1, FRMA_ECOL6 ;  B1J085, FRMA_ECOLC ;P25437, FRMA_ECOLI ;  B1LIP1, FRMA_ECOSM ;  Q1RFI7, FRMA_ECOUT ;P44557, FRMA_HAEIN ;  P39450, FRMA_PHODP ;  Q3Z550, FRMA_SHISS ;P73138, FRMA_SYNY3 ;  E1ACQ9, NOTN_ASPSM ;  N4WE73, OXI1_COCH4 ;N4WE43, RED2_COCH4 ;  N4WW42, RED3_COCH4 ;  P33010, TERPD_PSESP;O07737, Y1895_MYCTU;"
1.1.1.2,Alcohol dehydrogenase (NADP(+)).,"Q6AZW2, A1A1A_DANRE;  Q568L5, A1A1B_DANRE;  Q24857, ADH3_ENTHI ;Q04894, ADH6_YEAST ;  P25377, ADH7_YEAST ;  O57380, ADH8_PELPE ;Q9F282, ADHA_THEET ;  P0CH36, ADHC1_MYCS2;  P0CH37, ADHC2_MYCS2;P0A4X1, ADHC_MYCBO ;  P9WQC4, ADHC_MYCTO ;  P9WQC5, ADHC_MYCTU ;P27250, AHR_ECOLI  ;  Q3ZCJ2, AK1A1_BOVIN;  Q5ZK84, AK1A1_CHICK;O70473, AK1A1_CRIGR;  P14550, AK1A1_HUMAN;  Q9JII6, AK1A1_MOUSE;P50578, AK1A1_PIG  ;  Q5R5D5, AK1A1_PONAB;  P51635, AK1A1_RAT  ;Q6GMC7, AK1A1_XENLA;  Q28FD1, AK1A1_XENTR;  Q9UUN9, ALD2_SPOSA ;P27800, ALDX_SPOSA ;  P75691, YAHK_ECOLI ;"
1.1.1.3,Homoserine dehydrogenase.,"P00561, AK1H_ECOLI ;  P27725, AK1H_SERMA ;  P00562, AK2H_ECOLI ;Q9SA18, AKH1_ARATH ;  P49079, AKH1_MAIZE ;  O81852, AKH2_ARATH ;P49080, AKH2_MAIZE ;  P57290, AKH_BUCAI  ;  Q8K9U9, AKH_BUCAP  ;Q89AR4, AKH_BUCBP  ;  P37142, AKH_DAUCA  ;  P44505, AKH_HAEIN  ;P19582, DHOM_BACSU ;  P08499, DHOM_CORGL ;  Q5B998, DHOM_EMENI ;Q9ZL20, DHOM_HELPJ ;  P56429, DHOM_HELPY ;  Q9CGD8, DHOM_LACLA ;P52985, DHOM_LACLC ;  P37143, DHOM_METGL ;  Q58997, DHOM_METJA ;P63630, DHOM_MYCBO ;  P46806, DHOM_MYCLE ;  P9WPX0, DHOM_MYCTO ;P9WPX1, DHOM_MYCTU ;  P29365, DHOM_PSEAE ;  O94671, DHOM_SCHPO ;P52986, DHOM_SYNY3 ;  P31116, DHOM_YEAST ;  P37144, DHON_METGL ;"
1.1.1.4,"(R,R)-butanediol dehydrogenase.","P14940, ADH_CUPNE  ;  Q0KDL6, ADH_CUPNH  ;  P39714, BDH1_YEAST ;O34788, BDHA_BACSU ;  Q00796, DHSO_HUMAN ;"
1.1.1.5,Transferred entry: 1.1.1.303 and 1.1.1.304.,none ;
1.1.1.6,Glycerol dehydrogenase.,"A4IP64, ADH1_GEOTN ;  O13702, GLD1_SCHPO ;  P45511, GLDA_CITFR ;P0A9S6, GLDA_ECOL6 ;  P0A9S5, GLDA_ECOLI ;  P32816, GLDA_GEOSE ;P50173, GLDA_PSEPU ;  Q9WYQ4, GLDA_THEMA ;  Q92EU6, GOLD_LISIN ;"
1.1.1.7,Propanediol-phosphate dehydrogenase.,none ;

After that, by matching BLAST results against the created table, you can obtain the list of EC numbers for the enzymes contained in the sample used for BLAST (in other words, you can see what roles the enzymes in the sample play).
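
The matching step could look roughly like the sketch below. It assumes the BLAST hits have already been reduced to a list of UniProt accessions. The function name build_accession_index and the hit list are hypothetical, and the table rows are abbreviated from the output above.

```python
from collections import defaultdict


def build_accession_index(rows):
    """Map each UniProt accession to the set of EC numbers that list it."""
    index = defaultdict(set)
    for ec_id, dr_text in rows:
        for pair in dr_text.split(";"):
            accession = pair.split(",")[0].strip()
            if accession and accession != "none":
                index[accession].add(ec_id)
    return index


# Abbreviated (ID, Accession) rows taken from the table above.
table = [
    ("1.1.1.1", "P14940, ADH_CUPNE  ;  Q0KDL6, ADH_CUPNH  ;"),
    ("1.1.1.4", "P14940, ADH_CUPNE  ;  Q0KDL6, ADH_CUPNH  ;  P39714, BDH1_YEAST ;"),
    ("1.1.1.5", "none ;"),
]
index = build_accession_index(table)

# Hypothetical BLAST hits; "X00000" is a made-up accession with no match.
blast_hits = ["P14940", "P39714", "X00000"]
ec_by_hit = {hit: sorted(index.get(hit, ())) for hit in blast_hits}
print(ec_by_hit)
```

One accession can appear under more than one EC number, which is why each accession maps to a set rather than a single value.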
