Biopython Tutorial and Cookbook Japanese translation (Chapter 1, 2)

As a bioinformatician, I wanted to be able to use Biopython. That's all. I didn't really need to translate this much... I got tired partway through, so some parts are skipped.

Japanese translation of Biopython Tutorial and Cookbook

Reference: Biopython Tutorial and Cookbook, Biopython web site

Chapter 1 Introduction

1.1 What is Biopython?

Biopython is a set of freely available Python tools for computational molecular biology. Python is an object-oriented, interpreted language that is becoming increasingly common in scientific computing. It is easy to learn, has a very clear syntax, and can be extended with modules written in C, C++, or FORTRAN. The goal of Biopython is to make Python as easy as possible to use for bioinformatics by creating high-quality, reusable modules and classes. Biopython includes parsers for various bioinformatics file formats (BLAST, Clustalw, FASTA, GenBank, and so on), access to online services (NCBI, ExPASy, and so on), interfaces to common and not-so-common programs (Clustalw, DSSP, MSMS, and so on), a standard sequence class, clustering modules, a KD tree data structure, and documentation.

1.2 What can I find in the Biopython package

The main features of Biopython are as follows.

- Bioinformatics files in the following formats can be parsed into data structures that are easy to handle in Python:
  - Blast output, both from standalone and WWW Blast
  - Clustalw
  - FASTA
  - GenBank
  - PubMed and Medline
  - ExPASy files, like Enzyme and Prosite
  - SCOP, including 'dom' and 'lin' files
  - UniGene
  - SwissProt
- Files in the supported formats can be iterated over record by record, or indexed and accessed through a dictionary interface.
- Various online services can be used:
  - NCBI: Blast, Entrez and PubMed services
  - ExPASy: Swiss-Prot and Prosite entries, as well as Prosite searches
- Interfaces to common bioinformatics tools are provided:
  - Standalone Blast from NCBI
  - Clustalw alignment program
  - EMBOSS command line tools
- Standard classes are available for working with sequences, sequence IDs, and sequence metadata.
- Common operations on sequences, such as transcription, translation, and molecular weight calculation, are supported.
- Data can be classified using k-nearest neighbors, naive Bayes, and support vector machines.
- Alignments can be handled, including a standard way to create and work with substitution matrices.
- Tasks that can be parallelized are easy to split up.
- GUI-based programs are included for sequence manipulation, translation, running BLAST, and so on.
- In addition to this tutorial, extensive documentation and help are available through the wiki, the web site, and the mailing lists.
- Integration with BioSQL, a sequence database schema that is also supported by the BioPerl and BioJava projects.

1.3 Installing Biopython

download: http://biopython.org/wiki/Download

Supported OS: Windows, Mac, Linux

$ python setup.py build
$ python setup.py test
$ sudo python setup.py install

Detailed installation instructions, including the installation of Python and Biopython dependencies, are provided below.

PDF
HTML

1.4 Frequently Asked Questions (FAQ)

1. Which reference should I cite?

Cite the application note [1, Cock et al., 2009] as the main reference. Depending on what you used, the following may also apply:

- For the official project announcement: [13, Chapman and Chang, 2000]
- For Bio.PDB: [18, Hamelryck and Manderick, 2003]
- For Bio.Cluster: [14, De Hoon et al., 2004]
- For Bio.Graphics.GenomeDiagram: [2, Pritchard et al., 2006]
- For Bio.Phylo and Bio.Phylo.PAML: [9, Talevich et al., 2012]
- For the FASTQ file format as supported in Biopython, BioPerl, BioRuby, BioJava, and EMBOSS: [7, Cock et al., 2010]

2. Is it “Biopython” or “BioPython”? The correct name is “Biopython”, not “BioPython”.

The rest of the FAQ is omitted.

Chapter 2 Quick Start – What can you do with Biopython?

This section gives an overview of what Biopython can do and how to use it, so that you can get started quickly. All examples in this section assume that you have a basic knowledge of Python and that Biopython is installed. If you need to brush up on your Python, the official Python documentation (http://www.python.org/doc/) offers plenty of free material to get you started.

Some examples access online databases, so they need a working internet connection to run.

2.1 General overview of what Biopython provides

As mentioned in the introduction, Biopython is a set of libraries that lets biologists working at a computer deal with the "things" they are interested in. This means you need some programming experience (in Python, of course) or at least an interest in learning to program. Biopython aims to make your work as a programmer easier by providing reusable libraries, so that you can focus on the question you actually care about instead of on the details of parsing a particular file format. (Of course, if you want to help by writing a parser that doesn't exist yet and contributing it to Biopython, please do!)

One thing to keep in mind about Biopython is that it often offers multiple ways to "do the same thing". This can be frustrating. However, it can also be a real benefit in practice, because it gives you a lot of flexibility and control over the libraries. This tutorial shows the common or easy way. To learn about the alternatives, look at the Cookbook (Chapter 20, which contains some cool tricks and tips), the Advanced section (Chapter 22), the built-in docstrings (via the Python help command or the API documentation), and ultimately the code itself.

2.2 Working with sequences

Arguably, the central object in bioinformatics is the sequence. So this quick introduction to how Biopython works begins with sequences, that is, Seq objects. They are discussed in more detail in Chapter 3. Most of the time, when we think of a sequence we have in mind a string of letters like 'AGTACACTGGT'. A Seq object for such a sequence can be created as follows. Below, ">>>" indicates the Python prompt.


>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> print(my_seq)
AGTACACTGGT
>>> my_seq.alphabet
Alphabet()

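What we have here is a Seq object with a generic alphabet, which reflects the fact that we did not specify whether the sequence is DNA or protein (granted, it would be a protein with a lot of alanine, glycine, cysteine, and threonine!). Alphabets are explained further in Chapter 3. (Biopython 1.78 and later removed the alphabet system, so on recent versions the Seq repr shows only the sequence and my_seq.alphabet is not available.)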

In addition to having an alphabet, Seq objects differ from Python strings in the methods they support. You cannot do the following with a plain string:

>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> my_seq.complement()
Seq('TCATGTGACCA', Alphabet())
>>> my_seq.reverse_complement()
Seq('ACCAGTGTACT', Alphabet())
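
To make clear what these two methods compute, here is a minimal plain-Python sketch. It is not how Biopython implements them: the helper names and the translation table are made up for illustration, and only the unambiguous DNA letters are handled.

```python
# Plain-Python sketch of what complement() and reverse_complement() compute.
# NOT Biopython's implementation; handles only unambiguous DNA letters.
COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna: str) -> str:
    """Swap each base for its pairing partner (A<->T, C<->G)."""
    return dna.translate(COMPLEMENT_TABLE)


def reverse_complement(dna: str) -> str:
    """Complement the sequence, then read it in the opposite direction."""
    return complement(dna)[::-1]


my_seq = "AGTACACTGGT"
print(complement(my_seq))          # TCATGTGACCA
print(reverse_complement(my_seq))  # ACCAGTGTACT
```

The printed strings match the sequences inside the Seq objects shown above.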

The next most important class is SeqRecord, or Sequence Record. It holds a sequence (as a Seq object) together with annotation such as an identifier, a name, and a description. The Bio.SeqIO module, which reads and writes sequence file formats, works with SeqRecord objects. They are introduced below and explained in detail in Chapter 5.
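
As a rough mental model, a sequence record can be pictured as a small container that bundles a sequence with its annotation. The class below is a hypothetical stand-in written with the standard library only. It is not Biopython's SeqRecord, and the field values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleRecord:
    """Hypothetical stand-in for a sequence record: a sequence plus annotation."""
    seq: str
    id: str
    name: str = ""
    description: str = ""
    annotations: dict = field(default_factory=dict)

    def __len__(self) -> int:
        # The length of a record is the length of its sequence.
        return len(self.seq)


record = SimpleRecord(
    seq="AGTACACTGGT",
    id="example_1",  # made-up identifier
    name="example",
    description="toy record for illustration",
)
record.annotations["source"] = "made up"
print(record.id, len(record))
```

The real class carries more than this, but the idea of "sequence plus identifier, name, and description" is the part the parsers in the following sections rely on.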

This covers the basic features and usage of the Biopython sequence classes. Now that you have some idea of what it is like to work with the Biopython libraries, let's delve into the fun and interesting world of biological file formats!

2.3 A usage example

Before diving into parsers and everything else you can do with Biopython, let's set up an example to motivate what follows. After all, if this tutorial had no biology in it, why would you want to read it?

I like plants, so it has to be a plant-based example (sorry to all the fans of other organisms!). After a recent trip to our local greenhouse, I became fascinated by Lady Slipper Orchids (if you wonder why, take a look at Lady Slipper Orchid photos on Flickr or Google Image Search).

Of course, orchids are not only beautiful to look at, they are also very interesting for people who study evolution and phylogenetics. So let's suppose we are thinking about writing a funding proposal for a molecular study of Lady Slipper evolution, and want to find out what research has already been done and what we could add to it. After a little reading, we find that Lady Slipper Orchids belong to the family Orchidaceae, subfamily Cypripedioideae, which is made up of five genera: Cypripedium, Paphiopedilum, Phragmipedium, Selenipedium and Mexipedium.

This is enough for us to start digging deeper, so let's see how Biopython's tools can help. We start with sequence parsing in Section 2.4, but the orchids will come back later as well. For example, we search PubMed for orchid-related articles and extract sequence data from GenBank in Chapter 9, extract orchid protein data from SwissProt in Chapter 10, and work with Clustal W multiple alignments of orchid proteins in Section 6.4.1.

2.4 Parsing sequence file formats

Much of the work in bioinformatics involves dealing with the many file formats designed to hold biological information. These files are full of interesting biological data, and the challenge is to parse them into a form that you can manipulate in a programming language. However, parsing these files can be frustrating, because the formats change fairly regularly and may contain subtleties that break even the best-designed parsers.

We now briefly introduce the Bio.SeqIO module. Chapter 5 covers it in much more detail. We start with an online search for our friends, the lady slipper orchids. To keep this introduction simple, we use the NCBI web site by hand: with an Entrez online search, we look through the nucleotide databases at NCBI for everything that mentions Cypripedioideae.

At the time this tutorial was written, this search returned only 94 hits. We saved them as a FASTA-format file and a GenBank-format text file: ls_orchid.fasta and [ls_orchid.gbk](https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.gbk) (also included with the Biopython source code under docs/tutorial/examples/).

If you run this search now, you will get hundreds of results! If you want to follow the tutorial with the same list of genes, either download the two files above or copy them from docs/examples/ in the Biopython source code. In Section 2.5, we'll see how to do these searches in Python.

2.4.1 Simple FASTA parsing example

If you open the lady slipper orchids FASTA file (ls_orchid.fasta) in your favorite text editor, you will see that the file starts like this:

>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
...

This file contains 94 records. Each record begins with a line starting with ">", followed by the sequence on one or more lines. Let's try this in Python:

from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

You should see something like this:

gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
592

Because the FASTA format does not specify an alphabet, Bio.SeqIO defaults to the rather generic SingleLetterAlphabet() instead of something DNA-specific.
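
To illustrate the record structure described above (a ">" line followed by one or more sequence lines), here is a minimal standard-library sketch of a FASTA reader. It is not how Bio.SeqIO is implemented. The function names and the toy data are made up, and real files have edge cases this ignores.

```python
import io


def _make_record(header, chunks):
    # The identifier is taken to be the first whitespace-delimited word
    # of the header; the full header serves as the description.
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    return identifier, header, "".join(chunks)


def parse_fasta(handle):
    """Yield (identifier, description, sequence) tuples from FASTA text."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield _make_record(header, chunks)


# Made-up toy data standing in for a FASTA file.
toy_fasta = io.StringIO(
    ">toy1 first made-up record\n"
    "ACGTAC\n"
    "GGT\n"
    ">toy2 second made-up record\n"
    "TTGACA\n"
)
records = list(parse_fasta(toy_fasta))
for identifier, description, sequence in records:
    print(identifier, len(sequence))

# Records can also be indexed by identifier through a dictionary.
by_id = {identifier: sequence for identifier, _, sequence in records}
print(by_id["toy2"])
```

In practice you would use SeqIO.parse as shown above. The sketch only shows why the loop yields one record per ">" line.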

2.4.2 Simple GenBank parsing example

Now let's load the GenBank file ls_orchid.gbk instead. The script is almost the same as the snippet used for the FASTA file above. The only differences are the file name and the format string.

from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

This should give:

Z78533.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', IUPACAmbiguousDNA())
740
...
Z78439.1
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', IUPACAmbiguousDNA())
592

This time, Bio.SeqIO was able to choose a sensible alphabet, IUPAC Ambiguous DNA. You will also notice that a shorter string is used as seq_record.id.
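
Comparing the two outputs, the GenBank identifier (Z78533.1) appears as the fourth pipe-separated field of the FASTA identifier (gi|2765658|emb|Z78533.1|CIZ78533). If you need to match records between the two files, a small helper along these lines may be enough. The function name is made up, and the field layout is inferred only from the examples above.

```python
def accession_from_ncbi_header(fasta_id: str) -> str:
    """Return the accession.version field of a 'gi|...|db|acc|locus' identifier.

    The layout is assumed from the example identifiers in this section;
    identifiers that do not match it are returned unchanged.
    """
    fields = fasta_id.split("|")
    if len(fields) >= 4 and fields[0] == "gi":
        return fields[3]
    return fasta_id


print(accession_from_ncbi_header("gi|2765658|emb|Z78533.1|CIZ78533"))  # Z78533.1
print(accession_from_ncbi_header("gi|2765564|emb|Z78439.1|PBZ78439"))  # Z78439.1
```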

2.4.3 I love parsing – please don’t stop talking about it!

Biopython has many parsers, each with its own special features depending on the sequence format it handles. Chapter 5 covers Bio.SeqIO in more detail, and Chapter 6 introduces Bio.AlignIO.

The most commonly used file formats have parsers in Bio.SeqIO and Bio.AlignIO, but some rarely used or older file formats have no parser yet. Check the SeqIO and AlignIO wiki pages for the latest information, or ask on the mailing list. The wiki has an up-to-date list of supported file formats and some examples. For information about specific parsers and how to do cool things with them, the Cookbook (Chapter 20 of this tutorial) is useful. If you can't find the information you're looking for, please consider helping the busy authors by contributing a cookbook entry.

2.5 Connecting with biological databases

One of the most common tasks in bioinformatics is extracting information from biological databases. Accessing these databases by hand can be tedious, especially if you have to do it repeatedly. Biopython tries to save you time and energy by making some online databases available from Python scripts. Currently, you can use Biopython to extract data from the following databases.

Entrez (and PubMed) from the NCBI – See Chapter 9.
ExPASy – See Chapter 10.
SCOP – See the Bio.SCOP.search() function.

The corresponding modules make it easy to write Python code that interacts with the CGI scripts on these pages and returns the results in a format that is easy to handle. In some cases, the results can be fed directly to the Biopython parsers, which makes extracting information even easier.
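
To give a feel for what "interacting with the CGI scripts" means, the sketch below only builds the URL for a search request and sends nothing. The endpoint and the parameter names (db, term, retmax) are assumptions based on NCBI's public E-utilities interface, not something stated in this tutorial. The helper name is made up. In real work, use the Entrez module described in Chapter 9.

```python
from urllib.parse import urlencode

# Assumed endpoint of NCBI's E-utilities search service (not from this tutorial).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database: str, term: str, retmax: int = 20) -> str:
    """Build (but do not send) a search URL for the given database and term."""
    query = urlencode({"db": database, "term": term, "retmax": retmax})
    return f"{EUTILS_ESEARCH}?{query}"


url = build_esearch_url("nucleotide", "Cypripedioideae")
print(url)
```

A library module hides this URL construction, handles the network request, and hands the response to a parser.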

2.6 What to do next

Now that you've made it this far, you hopefully have a good understanding of the basics of Biopython and are ready to start doing useful work with it. The best next step is to finish reading this tutorial. Then, if you're interested, read the source code and the auto-generated documentation.

Once you have a picture of what you want to do and which Biopython libraries can do it, take a look at the Cookbook (Chapter 20). It may contain code that does something similar to what you want. Enjoy coding!