[PYTHON] Orthologous analysis using OrthoFinder

(2017/2/22, CentOS x86_64)

Introduction

OrthoFinder performs ortholog analysis based on the genomic information of multiple species. It uses MCL (Markov Cluster algorithm) to estimate orthologs. According to the paper, OrthoFinder is faster than other methods (such as OrthoMCL), performs well in benchmarking tests using OrthoBench, and improves the classification of orthologs through its own normalization step.

Reference

http://www.stevekellylab.com/software/orthofinder
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4531804/

The idea of OrthoFinder

The term ortholog is used under various definitions these days, but in OrthoFinder,

What you can do with OrthoFinder

  1. OrthoGroup (OG) estimation
  2. Estimating the orthologous gene sets for each pair of species
  3. Creating a phylogenetic tree
  4. Selection of single copy genes

OrthoFinder does the above four things automatically. Regarding 3, it creates a phylogenetic tree of the species and a phylogenetic tree for each OG. If you want a species tree built only from single-copy genes, you will have to build it yourself.

Installation

OrthoFinder depends on Python 2.7, so if you are using Python 3.x, build a virtual environment with pyenv, Anaconda, etc. (Reference items / 5b62d31cb7e6ed50f02c). In addition to OrthoFinder itself, you need to install BLAST+, MCL, FastME, and DLCpar.

  1. OrthoFinder
     1. Download the package with git clone. If you downloaded the release archive instead, extract it.
$git clone https://github.com/davidemms/OrthoFinder.git
$tar xzf OrthoFinder-1.1.2.tar.gz
     2. Add the orthofinder directory to your PATH.

  2. MCL, FastME
     There are no particular points to note. If you have root privileges, you can install them easily with sudo etc.; if you do not, download them from their respective websites and build them yourself. Follow the OrthoFinder Manual when installing.

  3. DLCpar
     This one needs a little care. It can be installed in the same way as 2., but when building with setup.py, the install target must be the directory whose bin contains the python you are using (you can check with which python). Either copy the package into that directory and run setup.py there, or use the --prefix option to specify the install directory. If you don't do this, the dlcpar module will not be visible to Python and OrthoFinder will not work.
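
Before the first run, it can help to confirm that the tools above are actually visible. The following is only a minimal sketch: the executable names (blastp, makeblastdb, mcl, fastme) are my assumptions about what a typical installation provides, so adjust them to yours. The code avoids Python 3 only features so that it can be run with the same interpreter that will run OrthoFinder.

```python
import os

# Assumed names; change them if your installation uses different ones.
REQUIRED_COMMANDS = ["blastp", "makeblastdb", "mcl", "fastme"]
REQUIRED_MODULES = ["dlcpar"]


def find_executable(name):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def module_available(name):
    """Return True if `name` can be imported by this interpreter."""
    try:
        __import__(name)
    except ImportError:
        return False
    return True


def check_dependencies(commands=REQUIRED_COMMANDS, modules=REQUIRED_MODULES):
    """Map each dependency name to True (found) or False (missing)."""
    status = {}
    for command in commands:
        status[command] = find_executable(command) is not None
    for module in modules:
        status[module] = module_available(module)
    return status


if __name__ == "__main__":
    for name, found in sorted(check_dependencies().items()):
        print("{0}: {1}".format(name, "found" if found else "MISSING"))
```

If dlcpar is reported as missing here, the setup.py step above most likely installed it under a different Python prefix.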

How to use

Preparation

  1. Prepare multiple Fasta files (.fa, .faa) you want to analyze
  2. Combine all Fasta files into one directory
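
If the FASTA files are scattered across several places, the two preparation steps can be sketched as a small helper like the one below. The function name collect_fasta_files and the choice to copy rather than move are my own assumptions for illustration; they are not part of OrthoFinder.

```python
import os
import shutil

# Extensions mentioned in the preparation step above.
FASTA_EXTENSIONS = (".fa", ".faa")


def collect_fasta_files(source_paths, target_dir):
    """Copy FASTA files into target_dir and return the copied file names."""
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    copied = []
    for path in source_paths:
        name = os.path.basename(path)
        if not name.lower().endswith(FASTA_EXTENSIONS):
            continue  # skip anything that is not a FASTA file
        destination = os.path.join(target_dir, name)
        if os.path.exists(destination):
            # Two inputs with the same file name would overwrite each other.
            raise ValueError("duplicate file name: " + name)
        shutil.copy(path, destination)
        copied.append(name)
    return sorted(copied)
```

The returned list makes it easy to check that every species you expected ended up in the directory.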

Specify the directory containing the FASTA files you want to analyze. The extracted OrthoFinder package contains an ExampleData directory with FASTA files directly underneath, so it is a good idea to do a test run with it first.

$python orthofinder.py -f your_fasta_dir -t 5 # -f specifies the directory of FASTA files, -t the number of threads to use.

You can also set the number of parallel jobs for the OrthoFinder algorithm with the -a option. Choose the value with memory in mind so that the run does not crash; the guideline is as follows.

  • 0.02 GB per species for small genomes (e.g. bacteria)
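
As a rough illustration of how this guideline could be turned into a value for -a, here is a small sketch. It assumes that each parallel job needs about (GB per species) x (number of species) of memory, which is my reading of the guideline and not something OrthoFinder reports; max_parallel_jobs is a hypothetical helper name.

```python
def max_parallel_jobs(available_gb, n_species, gb_per_species=0.02):
    """Estimate the largest -a value that should fit in available_gb.

    Assumption: one job needs roughly gb_per_species * n_species GB.
    """
    if n_species <= 0 or gb_per_species <= 0:
        raise ValueError("n_species and gb_per_species must be positive")
    per_job_gb = gb_per_species * n_species
    jobs = int(available_gb // per_job_gb)
    # Never return less than 1, since at least one job has to run.
    return max(1, jobs)


if __name__ == "__main__":
    # Example: 10 GB of free memory, 30 small genomes.
    print(max_parallel_jobs(10, 30))
```

Treat the result as an upper bound and leave some headroom for the rest of the system.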

When the analysis is finished, the Results_Date directory will be created directly under your_fasta_dir.

Check the result

The following files are generated in this directory:

  1. Orthogroups.csv
  2. Orthogroups.txt
  3. Orthogroups_SpeciesOverlaps.csv
  4. Orthogroups_UnassignedGenes.csv
  5. Orthologues_Date (directory) → contains the Tree directory and the Orthologue directory directly underneath
  6. Statistics_Overall.csv
  7. Statistics_PerSpecies.csv

Orthogroups.csv file

File 1 (Orthogroups.csv) contains the estimated orthogroups in the format below. Species are separated by tabs and genes by commas. File 2 (Orthogroups.txt) contains the same orthogroups in OrthoMCL format.

OG Species1 Species2 Species3
OG000001 gene_s1_1, gene_s1_3 gene_s2_1, gene_s2_2 gene_s3_2
OG000002 gene_s1_2, gene_s1_4 gene_s2_3 gene_s3_1, gene_s3_3
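
A file in this layout can be read with a few lines of Python. The sketch below assumes exactly what is described above (a header row of species names, tab-separated columns, comma-separated genes inside each column); parse_orthogroups and single_copy_orthogroups are hypothetical helper names of my own. The second function only illustrates how single-copy orthogroups could be picked out of the parsed table and is not OrthoFinder's own selection code.

```python
import csv


def parse_orthogroups(lines):
    """Parse tab-separated rows into {orthogroup: {species: [genes]}}."""
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    species = header[1:]  # the first column holds the orthogroup ID
    orthogroups = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        og_id, cells = row[0], row[1:]
        genes_by_species = {}
        for index, name in enumerate(species):
            # A species with no gene in this orthogroup has an empty cell.
            cell = cells[index] if index < len(cells) else ""
            genes = [gene.strip() for gene in cell.split(",")]
            genes_by_species[name] = [gene for gene in genes if gene]
        orthogroups[og_id] = genes_by_species
    return orthogroups


def single_copy_orthogroups(orthogroups):
    """Return IDs of orthogroups with exactly one gene in every species."""
    selected = []
    for og_id, genes_by_species in orthogroups.items():
        counts = [len(genes) for genes in genes_by_species.values()]
        if counts and all(count == 1 for count in counts):
            selected.append(og_id)
    return sorted(selected)
```

To use it on a real result, pass an open file handle for Orthogroups.csv to parse_orthogroups; the function accepts any iterable of lines.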

Statistics file

File 6 (Statistics_Overall.csv) contains information such as 1) the total number of genes used, 2) the estimated total number of OGs, and 3) the percentage of genes assigned to an OG. File 7 (Statistics_PerSpecies.csv) has the same data for each species.

Tree directory, Orthologue directory

The Tree directory contains a tree file with the phylogenetic tree for each OG, and the phylogenetic tree of the species is in the directory directly above it. The Orthologue directory contains, for each species used, tables of orthologous genes for each pair of species.

Useful functions

1. After the analysis is completed, add a new species and re-analyze.

Thankfully, OrthoFinder can add species to a finished analysis. The procedure is as follows.

  1. Create a new directory and put the FASTA files you want to add in it.
  2. Run the analysis, specifying the WorkingDirectory directly under the Results_Date directory of the original analysis as follows. Specify the WorkingDirectory that contains SpeciesIDs.txt.
$python orthofinder.py -b previous_working_dir -f new_fasta_dir
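
If you are not sure which directory to pass to -b, a small search like the following can locate candidates. This is only a sketch: find_working_dirs is a hypothetical helper, and the marker file name defaults to SpeciesIDs.txt, which is the name I know from OrthoFinder releases; pass a different name if your version differs.

```python
import os


def find_working_dirs(results_dir, marker="SpeciesIDs.txt"):
    """Return directories under results_dir that contain the marker file."""
    matches = []
    for current_dir, _subdirs, files in os.walk(results_dir):
        if marker in files:
            matches.append(current_dir)
    return sorted(matches)
```

Calling find_working_dirs("your_fasta_dir") lists every directory that could serve as previous_working_dir; if more than one is returned, pick the one belonging to the run you want to extend.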

2. After the analysis is completed, exclude the species and re-analyze.

Species can also be excluded.

  1. Open SpeciesIDs.txt in the WorkingDirectory directly under the Results directory of the original analysis with an editor.
  2. Comment out the species you want to exclude by adding # at the start of the line.
  3. Run the analysis as follows.
$python orthofinder.py -b previous_working_dir
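
With many species, the editing step can be scripted instead of done by hand. The sketch below assumes that each entry in the species ID file looks like "0: Species1.fa" (a number, a colon, then the FASTA file name); that format is an assumption on my part, and comment_out_species is a hypothetical helper, so check your own file before relying on it.

```python
def comment_out_species(lines, species_to_exclude):
    """Return the lines with '#' prefixed to the excluded species entries."""
    excluded = set(species_to_exclude)
    result = []
    for line in lines:
        stripped = line.strip()
        # Assumed entry format: "<number>: <fasta file name>"
        name = stripped.split(":", 1)[-1].strip()
        already_commented = stripped.startswith("#")
        if stripped and not already_commented and name in excluded:
            result.append("#" + line)
        else:
            result.append(line)
    return result
```

Read the file with readlines(), pass the lines through this function, and write the result back to the same file. Keeping a backup copy of the original is a good idea.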

3. Add and exclude at the same time

Of course, you can add and exclude at the same time. Prepare the FASTA files you want to add, edit SpeciesIDs.txt, and run the same command as when adding new species above.

4. Other

It is also possible to run only individual steps, such as BLAST, independently. You can also create phylogenetic trees using MAFFT and FastTree. See the OrthoFinder Manual (https://github.com/davidemms/OrthoFinder/blob/master/OrthoFinder-manual.pdf) for more information.
