Biopython Tutorial and Cookbook Japanese translation (4.2)

4.2 Creating a SeqRecord

Using a SeqRecord object is not very complicated, since all of the information is presented as attributes of the class.

Usually you won’t create a SeqRecord “by hand”, but instead use Bio.SeqIO to read in a sequence file for you (see Chapter 5 and the examples below). However, creating a SeqRecord can be quite simple.

4.2.1 SeqRecord objects from scratch

To create a SeqRecord at a minimum you just need a Seq object:


>>> from Bio.Seq import Seq
>>> simple_seq = Seq("GATC")
>>> from Bio.SeqRecord import SeqRecord
>>> simple_seq_r = SeqRecord(simple_seq)

Additionally, you can pass the id, name and description to the initialization function. If you don’t, they are set to placeholder strings indicating that they are unknown, and can be modified afterwards:

>>> simple_seq_r.id
'<unknown id>'
>>> simple_seq_r.id = "AC12345"
>>> simple_seq_r.description = "Made up sequence I wish I could write a paper about"
>>> print(simple_seq_r.description)
Made up sequence I wish I could write a paper about
>>> simple_seq_r.seq
Seq('GATC')

Including an identifier is very important if you want to output your SeqRecord to a file. You would normally include this when creating the object:

>>> from Bio.Seq import Seq
>>> simple_seq = Seq("GATC")
>>> from Bio.SeqRecord import SeqRecord
>>> simple_seq_r = SeqRecord(simple_seq, id="AC12345")
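To see why the identifier matters for output, here is a minimal sketch that sets id, name and description when the record is created and then renders it as FASTA text with the SeqRecord format() method. The name and description values are made up for illustration.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# The name and description values are illustrative, not from any database.
full_record = SeqRecord(
    Seq("GATC"),
    id="AC12345",
    name="example",
    description="Made up sequence",
)

# format("fasta") returns the record as a FASTA-formatted string;
# the id becomes the first word of the title line.
fasta_text = full_record.format("fasta")
print(fasta_text)
```

A record left with the placeholder identifier would be written with that placeholder in its title line, which is rarely what you want in a file.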

As mentioned above, the SeqRecord has a dictionary attribute, annotations. This is used for any miscellaneous annotations that don’t fit under one of the other more specific attributes.

Adding annotations is easy, and just involves dealing directly with the annotations dictionary:

>>> simple_seq_r.annotations["evidence"] = "None. I just made it up."
>>> print(simple_seq_r.annotations)
{'evidence': 'None. I just made it up.'}
>>> print(simple_seq_r.annotations["evidence"])
None. I just made it up.

Working with per-letter annotations is similar: letter_annotations is a dictionary-like attribute which lets you assign any Python sequence (i.e. a string, list or tuple) that has the same length as the sequence:

>>> simple_seq_r.letter_annotations["phred_quality"] = [40, 40, 38, 30]
>>> print(simple_seq_r.letter_annotations)
{'phred_quality': [40, 40, 38, 30]}
>>> print(simple_seq_r.letter_annotations["phred_quality"])
[40, 40, 38, 30]
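The length requirement is enforced when you assign. The sketch below assumes a recent Biopython, where a value of the wrong length is rejected with a TypeError. The key "too_short" is a made-up name for illustration.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

checked_record = SeqRecord(Seq("GATC"), id="AC12345")

# Four letters in the sequence, four quality values: accepted.
checked_record.letter_annotations["phred_quality"] = [40, 40, 38, 30]

# Three values for a four-letter sequence: rejected.
try:
    checked_record.letter_annotations["too_short"] = [40, 40, 38]
    length_mismatch_rejected = False
except TypeError as error:
    length_mismatch_rejected = True
    print("Rejected:", error)
```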

The dbxrefs and features attributes are just Python lists, and should be used to store strings and SeqFeature objects (discussed later in this chapter) respectively.
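As a rough sketch of filling both lists by hand, the example below appends a cross-reference string and a single SeqFeature. The cross-reference value and the feature type are made up for illustration.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

annotated_record = SeqRecord(Seq("GATC"), id="AC12345")

# dbxrefs holds plain strings (this cross-reference is made up).
annotated_record.dbxrefs.append("Project:00000")

# features holds SeqFeature objects; this one covers the first two letters.
# Locations use Python-style, zero-based, end-exclusive coordinates.
feature = SeqFeature(FeatureLocation(0, 2), type="misc_feature")
annotated_record.features.append(feature)

print(annotated_record.dbxrefs)
print(feature.extract(annotated_record.seq))
```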

4.2.2 SeqRecord objects from FASTA files

This example uses a fairly large FASTA file containing the whole sequence for Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, originally downloaded from the NCBI.

This file is included with the Biopython unit tests under the GenBank folder, or available online as NC_005816.fna from our website.

The file starts like this, and you can check there is only one record present (i.e. only one line starting with a greater than symbol):

>gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ... pPCP1, complete sequence
TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCC
...

Back in Chapter 2 you will have seen the function Bio.SeqIO.parse(...) used to loop over all the records in a file as SeqRecord objects.

The Bio.SeqIO module has a sister function, Bio.SeqIO.read(...), for use on files which contain just one record. We’ll use it here to load the FASTA file into a variable named record (see Chapter 5 for details).
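The difference between the two functions can be sketched without the real file. The example below uses an in-memory stand-in that holds only the title line and the first sequence line shown above, plus a made-up second record to show the failure case.

```python
from io import StringIO

from Bio import SeqIO

# In-memory stand-in for the FASTA file: only the title line and the
# first sequence line shown above, not the full plasmid sequence.
one_record = (
    ">gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ... pPCP1, complete sequence\n"
    "TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCC\n"
)

# Exactly one record: read() returns a single SeqRecord.
single = SeqIO.read(StringIO(one_record), "fasta")
print(single.id)

# Two records (the second is made up): read() raises ValueError.
two_records = one_record + ">second made-up record\nGATC\n"
try:
    SeqIO.read(StringIO(two_records), "fasta")
    read_rejected_two = False
except ValueError:
    read_rejected_two = True

# parse() handles any number of records.
ids = [rec.id for rec in SeqIO.parse(StringIO(two_records), "fasta")]
print(ids)
```

With the real file you would pass the filename instead of a StringIO handle.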

Now, let’s have a look at the key attributes of this SeqRecord individually, starting with the seq attribute which gives you a Seq object:

>>> record.seq
Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', SingleLetterAlphabet())

Here Bio.SeqIO has defaulted to a generic alphabet, rather than guessing that this is DNA. (Output like this comes from Biopython releases before 1.78; that release removed Bio.Alphabet, so newer versions show only the sequence.)

If you know in advance what kind of sequence your FASTA file contains, you can tell Bio.SeqIO which alphabet to use (see Chapter 5).

Next, the identifiers and description:

>>> record.id
'gi|45478711|ref|NC_005816.1|'
>>> record.name
'gi|45478711|ref|NC_005816.1|'
>>> record.description
'gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ... pPCP1, complete sequence'

As you can see above, the first word of the FASTA record’s title line (after removing the greater than symbol) is used for both the id and name attributes.

The whole title line (after removing the greater than symbol) is used for the record description. This is deliberate, partly for backwards compatibility reasons, but it also makes sense if you have a FASTA file like this:

>Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1
TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCC
...

Note that none of the other annotation attributes get populated when reading a FASTA file: dbxrefs, annotations, letter_annotations and features are all left empty.
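A quick way to confirm this is to parse a small FASTA record and inspect each attribute. The sketch below uses an in-memory stand-in built from the title line and first sequence line of the example above, not the full file.

```python
from io import StringIO

from Bio import SeqIO

# In-memory stand-in for a FASTA file (first sequence line only).
fasta_text = (
    ">Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1\n"
    "TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCC\n"
)
fasta_record = SeqIO.read(StringIO(fasta_text), "fasta")

# Only id, name, description and seq are filled in from FASTA;
# the remaining containers stay empty.
print(fasta_record.dbxrefs)
print(fasta_record.annotations)
print(fasta_record.letter_annotations)
print(fasta_record.features)
```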

In this case our example FASTA file was from the NCBI, and they have a fairly well defined set of conventions for formatting their FASTA lines.

This means it would be possible to parse this information and extract the GI number and accession for example. However, FASTA files from other sources vary, so this isn’t possible in general.
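For identifiers that follow the pipe-separated layout seen in this example, such parsing could look like the sketch below. The helper name split_ncbi_fasta_id is hypothetical, and the layout is an assumption based only on the identifier shown above. Other sources will not match it, so the function raises an error rather than guessing.

```python
def split_ncbi_fasta_id(fasta_id):
    """Split an identifier of the form 'gi|<number>|<db>|<accession>|'.

    Hypothetical helper: the layout is assumed from the single
    example identifier in this section.
    """
    fields = fasta_id.split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError(f"Not a 'gi|...|...|...|' identifier: {fasta_id!r}")
    return {"gi": fields[1], "database": fields[2], "accession": fields[3]}


parts = split_ncbi_fasta_id("gi|45478711|ref|NC_005816.1|")
print(parts["gi"], parts["accession"])
```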

4.2.3 SeqRecord objects from GenBank files

As in the previous example, we’re going to look at the whole sequence for Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, originally downloaded from the NCBI, but this time as a GenBank file.

Again, this file is included with the Biopython unit tests under the GenBank folder, or available online as NC_005816.gb from our website.

This file contains a single record (i.e. only one LOCUS line) and starts:

LOCUS       NC_005816               9609 bp    DNA     circular BCT 21-JUL-2008
DEFINITION  Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete
            sequence.
ACCESSION   NC_005816
VERSION     NC_005816.1  GI:45478711
PROJECT     GenomeProject:10638
...

Again, we’ll use Bio.SeqIO to read this file in, and the code is almost identical to that used above for the FASTA file (see Chapter 5 for details):

>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.gb", "genbank")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.',
dbxrefs=['Project:10638'])

You should be able to spot some differences already! Taking the attributes individually, the sequence string is the same as before, but this time Bio.SeqIO has been able to automatically assign a more specific alphabet (see Chapter 5 for details).

The name ('NC_005816') comes from the LOCUS line, while the id ('NC_005816.1') includes the version suffix. The description comes from the DEFINITION line.

GenBank files don’t have any per-letter annotations:

>>> record.letter_annotations
{}

Most of the annotation information gets recorded in the annotations dictionary, for example:

>>> len(record.annotations)
11
>>> record.annotations["source"]
'Yersinia pestis biovar Microtus str. 91001'

The dbxrefs list gets populated from any PROJECT or DBLINK lines:

>>> record.dbxrefs
['Project:10638']

Finally, and perhaps most interestingly, all the entries in the features table (e.g. the genes or CDS features) get recorded as SeqFeature objects in the features list.
