Biopython Tutorial and Cookbook Japanese Translation (4.7)

4.7 Slicing a SeqRecord

You can slice a SeqRecord to get a new SeqRecord covering just part of the sequence. Importantly, any per-letter annotations are sliced as well, and any features that fall completely within the new sequence are preserved, with their locations adjusted.
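
As a minimal sketch of this behaviour, the following builds a small record in memory instead of reading a file. The sequence, the `demo` identifiers, the quality scores and the feature coordinates are all invented for illustration:

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Toy record: 20 bases, one per-letter annotation and two features.
toy = SeqRecord(
    Seq("ACGTACGTACGTACGTACGT"),
    id="demo.1",
    name="demo",
    description="made-up example record",
)
toy.letter_annotations["phred_quality"] = list(range(20))
toy.features.append(SeqFeature(FeatureLocation(5, 10, strand=+1), type="gene"))
toy.features.append(SeqFeature(FeatureLocation(12, 18, strand=+1), type="misc_feature"))

# Take the half-open slice 3:12 of the toy record.
toy_sub = toy[3:12]

print(len(toy_sub))
# The quality scores are sliced alongside the sequence.
print(toy_sub.letter_annotations["phred_quality"])
for feature in toy_sub.features:
    # Only features lying entirely inside 3:12 survive, shifted by -3.
    print(feature.type, feature.location)
```

The made-up misc_feature at 12:18 extends past the end of the slice, so only the gene feature should remain, now located at 2:7.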

For example, taking the same GenBank file used earlier:

>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.gb", "genbank")

>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence',
dbxrefs=['Project:58037'])

>>> len(record)
9609
>>> len(record.features)
41

For this example we’re going to focus on the pim gene, YP_pPCP05. If you look at the GenBank file directly you’ll find that this gene/CDS has the location string 4343..4780, or 4342:4780 in Python counting. In the parsed record used here, the gene and CDS entries are the 21st and 22nd features, so in Python zero-based counting they are entries 20 and 21 in the features list:

>>> print(record.features[20])
type: gene
location: [4342:4780](+)
qualifiers:
    Key: db_xref, Value: ['GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
<BLANKLINE>

>>> print(record.features[21])
type: CDS
location: [4342:4780](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
    Key: note, Value: ['similar to many previously sequenced pesticin immunity ...']
    Key: product, Value: ['pesticin immunity protein']
    Key: protein_id, Value: ['NP_995571.1']
    Key: transl_table, Value: ['11']
    Key: translation, Value: ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSH...']
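
The step from the GenBank location string 4343..4780 to the Python slice 4342:4780 is plain arithmetic: GenBank counts from one and includes the end position, while Python counts from zero and excludes it. A small sketch of the conversion follows; `genbank_to_python` is a hypothetical helper written for this note, not a Biopython function, and it only handles plain `start..end` strings:

```python
def genbank_to_python(location):
    """Convert a simple 'start..end' GenBank location to a Python slice pair.

    Hypothetical helper: it does not handle complement(...), join(...)
    or fuzzy positions such as '<1..200'.
    """
    first, last = (int(part) for part in location.split(".."))
    # One-based inclusive start becomes zero-based; the inclusive end
    # is numerically the same as the exclusive Python end.
    return first - 1, last


start, end = genbank_to_python("4343..4780")
print(start, end)   # 4342 4780
print(end - start)  # 438 bases
```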

Let’s slice this parent record from 4300 to 4800 (enough to include the pim gene/CDS), and see how many features we get:

>>> sub_record = record[4300:4800]

>>> sub_record
SeqRecord(seq=Seq('ATAAATAGATTATTCCAAATAATTTATTTATGTAAGAACAGGATGGGAGGGGGA...TTA',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence',
dbxrefs=[])

>>> len(sub_record)
500
>>> len(sub_record.features)
2

Our sub-record has just two features, the gene and CDS entries for YP_pPCP05 (reference on CDS entries: https://www.ddbj.nig.ac.jp/ddbj/cds.html):

>>> print(sub_record.features[0])
type: gene
location: [42:480](+)
qualifiers:
    Key: db_xref, Value: ['GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
<BLANKLINE>

>>> print(sub_record.features[1])
type: CDS
location: [42:480](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
    Key: note, Value: ['similar to many previously sequenced pesticin immunity ...']
    Key: product, Value: ['pesticin immunity protein']
    Key: protein_id, Value: ['NP_995571.1']
    Key: transl_table, Value: ['11']
    Key: translation, Value: ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSH...']

Notice that their locations have been adjusted to reflect the new parent sequence!
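
The adjustment can be modelled in a few lines of plain Python. This is a simplified sketch of the rule described in this section, not Biopython's actual code: `slice_features` is a made-up name, features are reduced to `(type, start, end)` tuples, and the misc_feature entry is invented to show a feature that gets dropped:

```python
def slice_features(features, start, stop):
    """Keep features lying entirely inside start:stop, shifted by -start."""
    kept = []
    for kind, f_start, f_end in features:
        if start <= f_start <= f_end <= stop:
            kept.append((kind, f_start - start, f_end - start))
    return kept


parent_features = [
    ("misc_feature", 4250, 4350),  # invented: straddles the left edge
    ("gene", 4342, 4780),
    ("CDS", 4342, 4780),
]
sub_features = slice_features(parent_features, 4300, 4800)
print(sub_features)
# [('gene', 42, 480), ('CDS', 42, 480)]
```

Subtracting the slice start (4300) from 4342:4780 gives 42:480, matching the locations printed for the sub-record.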

While Biopython has done something sensible and hopefully intuitive with the features (and any per-letter annotation), for the other annotation it is impossible to know whether it still applies to the sub-sequence. To avoid guessing, the annotations and dbxrefs are omitted from the sub-record, and it is up to you to transfer any relevant information as appropriate.

>>> sub_record.annotations
{}
>>> sub_record.dbxrefs
[]
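
Transferring the relevant information by hand might look like the sketch below. It uses a small in-memory record, and the identifiers, the annotation values and the `Project:00000` cross-reference are all made up for illustration. Which entries still apply to a fragment is a judgement call; here the organism is copied, while the circular topology is left behind because it no longer describes a linear fragment:

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

parent = SeqRecord(
    Seq("ACGTACGTACGT"),
    id="demo.1",
    name="demo",
    description="made-up parent record",
    dbxrefs=["Project:00000"],
    annotations={"organism": "Example organism", "topology": "circular"},
)

child = parent[2:8]

# Neither of the values set above is carried over by slicing.
print("organism" in child.annotations)  # False
print(child.dbxrefs)                    # []

# Copy across only what still holds for the fragment.
child.annotations["organism"] = parent.annotations["organism"]
child.dbxrefs = list(parent.dbxrefs)  # a copy, so the two lists stay independent
```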

The same point could be made about the record id, name and description, but for practicality these are preserved:

>>> sub_record.id
'NC_005816.1'
>>> sub_record.name
'NC_005816'
>>> sub_record.description
'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence'

This illustrates the problem nicely, though: our new sub-record is not the complete sequence of the plasmid, so the description is wrong! Let’s fix this and then view the sub-record as a reduced GenBank file using the format method described in Section 4.6:

>>> sub_record.description = "Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, partial."
>>> print(sub_record.format("genbank"))
...
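
One version-dependent caveat, stated as an assumption rather than something this section covers: the session above shows alphabet objects such as IUPACAmbiguousDNA(), which belong to older Biopython releases. On Biopython 1.78 or later, where alphabets were removed, writing GenBank output instead needs a `molecule_type` entry in the record's annotations. A minimal sketch under that assumption, using a made-up record:

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

fragment = SeqRecord(
    Seq("ATGGGAGGGGGAATGATCTCA"),
    id="demo.1",
    name="demo",
    description="made-up plasmid fragment, partial",
)
# Assumption: Biopython 1.78+ needs this to write GenBank or EMBL output.
fragment.annotations["molecule_type"] = "DNA"

genbank_text = fragment.format("genbank")
print(genbank_text)
```

If the annotation is missing on such a release, the format call is expected to raise a ValueError rather than guess the molecule type.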

See Sections 20.1.7 and 20.1.8 for some FASTQ examples where the per-letter annotations (the read quality scores) are also sliced.
