Biopython Tutorial and Cookbook Japanese translation (4.7)

4.7 Slicing a SeqRecord

You can slice a SeqRecord to give you a new SeqRecord covering just part of the sequence. What is important here is that any per-letter annotations are also sliced, and any features which fall completely within the new sequence are preserved (with their locations adjusted).

For example, taking the same GenBank file used earlier:

>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.gb", "genbank")

>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence',
dbxrefs=['Project:58037'])

>>> len(record)
9609
>>> len(record.features)
41

For this example we’re going to focus on the pim gene, YP_pPCP05. If you look at the GenBank file directly you’ll find this gene/CDS has location string 4343..4780, or in Python counting 4342:4780. In the features list parsed from this file, the gene and CDS are the twenty-first and twenty-second entries, so in Python zero-based counting they are entries 20 and 21:

>>> print(record.features[20])
type: gene
location: [4342:4780](+)
qualifiers:
    Key: db_xref, Value: ['GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
<BLANKLINE>

>>> print(record.features[21])
type: CDS
location: [4342:4780](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
    Key: note, Value: ['similar to many previously sequenced pesticin immunity ...']
    Key: product, Value: ['pesticin immunity protein']
    Key: protein_id, Value: ['NP_995571.1']
    Key: transl_table, Value: ['11']
    Key: translation, Value: ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSH...']
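
The conversion between the GenBank location string and the Python-style location shown above can be sketched in a few lines. This is only an illustration of the arithmetic; the helper name genbank_to_python is made up for this example and is not part of Biopython.

```python
def genbank_to_python(start, end):
    """Convert a 1-based, inclusive GenBank span to a 0-based, half-open Python slice.

    Hypothetical helper for illustration only (not a Biopython function).
    """
    # Only the start moves: GenBank counts from 1 and includes the end,
    # Python counts from 0 and excludes the end.
    return start - 1, end


py_start, py_end = genbank_to_python(4343, 4780)
print(py_start, py_end)   # 4342 4780
print(py_end - py_start)  # 438 letters in the pim gene/CDS
```

Because the Python end is exclusive, the length of the feature is simply the difference between the two numbers.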

Let’s slice this parent record from 4300 to 4800 (enough to include the pim gene/CDS), and see how many features we get:

>>> sub_record = record[4300:4800]

>>> sub_record
SeqRecord(seq=Seq('ATAAATAGATTATTCCAAATAATTTATTTATGTAAGAACAGGATGGGAGGGGGA...TTA',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence',
dbxrefs=[])

>>> len(sub_record)
500
>>> len(sub_record.features)
2

Our sub-record just has two features, the gene and CDS entries for YP_pPCP05 (for background on CDS features, see https://www.ddbj.nig.ac.jp/ddbj/cds.html):

>>> print(sub_record.features[0])
type: gene
location: [42:480](+)
qualifiers:
    Key: db_xref, Value: ['GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
<BLANKLINE>

>>> print(sub_record.features[1])
type: CDS
location: [42:480](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
    Key: note, Value: ['similar to many previously sequenced pesticin immunity ...']
    Key: product, Value: ['pesticin immunity protein']
    Key: protein_id, Value: ['NP_995571.1']
    Key: transl_table, Value: ['11']
    Key: translation, Value: ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSH...']

Notice that their locations have been adjusted to reflect the new parent sequence!
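
To see the same behaviour without the GenBank file, here is a minimal sketch using a made-up 20-letter record. The sequence, the feature positions and the feature types are all invented for illustration; only the slicing behaviour mirrors what is described above.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Toy parent record (made-up data, not the plasmid used in the text).
parent = SeqRecord(Seq("ACGTACGTACGTACGTACGT"), id="toy", name="toy",
                   description="toy parent")
# Lies completely inside the slice taken below.
parent.features.append(SeqFeature(FeatureLocation(5, 10, strand=+1), type="gene"))
# Only partly overlaps the slice.
parent.features.append(SeqFeature(FeatureLocation(2, 8, strand=+1), type="misc_feature"))
# Lies completely outside the slice.
parent.features.append(SeqFeature(FeatureLocation(12, 18, strand=+1), type="misc_feature"))

child = parent[4:11]

# Only the fully contained feature should survive, shifted by the slice start (4).
for feature in child.features:
    print(feature.type, int(feature.location.start), int(feature.location.end))
```

The same subtraction explains the numbers above: slicing from 4300 turns 4342:4780 into 42:480.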

While Biopython has done something sensible and hopefully intuitive with the features (and any per-letter annotation), for the other annotation it is impossible to know if this still applies to the sub-sequence or not. To avoid guessing, the annotations and dbxrefs are omitted from the sub-record, and it is up to you to transfer any relevant information as appropriate.

>>> sub_record.annotations
{}
>>> sub_record.dbxrefs
[]
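
If some of the parent's annotation does still apply, you can copy it across yourself. The following is one possible sketch on a made-up record; the annotation keys, their values and the dbxref string are assumptions chosen for illustration, and which entries are worth transferring depends on your data.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Toy parent record with made-up annotation (illustrative values only).
parent = SeqRecord(Seq("ACGTACGTAC"), id="toy", name="toy", description="toy parent")
parent.annotations["organism"] = "Example organism"
parent.annotations["comment"] = "Applies to the full-length sequence only"
parent.dbxrefs = ["Example:0000"]

child = parent[2:8]

# Copy only the keys judged to remain valid for the sub-sequence.
for key in ("organism",):
    if key in parent.annotations:
        child.annotations[key] = parent.annotations[key]

# Copy the cross-references as a new list so later edits to the child
# do not change the parent.
child.dbxrefs = list(parent.dbxrefs)

print(child.annotations["organism"])
print(child.dbxrefs)
```

Copying key by key, rather than the whole dictionary, keeps entries such as the comment above from being attached to a sub-sequence they no longer describe.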

The same point could be made about the record id, name and description, but for practicality these are preserved:

>>> sub_record.id
'NC_005816.1'
>>> sub_record.name
'NC_005816'
>>> sub_record.description
'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence'

This illustrates the problem nicely though: our new sub-record is not the complete sequence of the plasmid, so the description is wrong! Let’s fix this and then view the sub-record as a reduced GenBank file using the format method described above in Section 4.6:

>>> sub_record.description = "Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, partial."
>>> print(sub_record.format("genbank"))
...

See Sections 20.1.7 and 20.1.8 for some FASTQ examples where the per-letter annotations (the read quality scores) are also sliced.
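
As a small preview of per-letter annotation slicing, here is a sketch using a made-up read rather than a real FASTQ file. The sequence and quality values are invented; phred_quality is the key Biopython uses for FASTQ quality scores.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Toy read with one made-up quality score per letter.
read = SeqRecord(Seq("ACGTACGTAC"), id="toy_read", description="made-up read")
read.letter_annotations["phred_quality"] = [40, 38, 35, 30, 30, 25, 20, 15, 10, 5]

# Keep the first six letters; the quality scores are sliced to match.
trimmed = read[:6]

print(trimmed.seq)
print(trimmed.letter_annotations["phred_quality"])
```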
