Hello. This article is for Bioinformatics Advent Calendar 2020.
To tell the truth, I wanted to proceed with the development of Ruby-htslib and make it public, but I haven't made much progress, so I'll give up and write the background as fluttering.
This article is incomplete, but I will give up and publish it. I will update it as the development of ruby-htslib progresses.
R, Python, Perl, Julia, C ++ .. There are as many great languages out there as there are stars. But I like the Ruby language. It happened that the first time I could write was the Ruby language. So, Ruby became the mother tongue, as if the spot-billed duck chicks thought they were the first programming language they saw. (It's a lie. I tried Objective-C language before Ruby language, but gave up. Ruby looked simple and beautiful compared to Objective-C language.)
The feature of such Ruby language is that almost everything is composed of objects and uses methods instead of functions, compared to R and Julia. In this world, "data" and "procedure" exist as a group of "objects". I think this is generally a disadvantage in data analysis. On the other hand, it is a powerful support for those who want to make something. When I notice it, I try to make a tool in Ruby.
Well, I said that the Ruby language may not be very suitable for data analysis, but there are also deep-rooted fans, and there is also a library for life information called BioRuby.
However, the data of today's amazing research often uses the data output by the next-generation sequencer. We handle file formats of such next-generation sequencer-related data, that is, formats such as SAM
, BAM
, VCF
, and BCF
. I've come to realize that it seems difficult to handle this with BioRuby.
Does anyone have a nice library? I thought and searched for it. There was one. Bioruby-samtools.
However, this project is a tool that calls Samtools from the command line, and it feels a little different from the tool that can manipulate files in detail. (This project was previously configured as a binding for Samtools using FFI, but it seems that it started using Open3 when HTSlib was separated from Samtools. Open3 is to run a program and its Connect a pipe to the standard input, standard output, and standard error output of the process. In other words, it is not a binding. It seems that when the API became stable, it was planned to create an HTSlib binding using FFI, but the developer can afford it. It seems that it was difficult to create without
On the other hand, in languages such as Python and R, there are libraries that can use the files of the next-generation sequencer. As a result, many tools and pipelines for analysis have been developed.
Normally, you should flexibly switch languages by selecting the tool according to your purpose. I know it in my head, but it's hard to take such a rational method. Like Mogura, I started wondering if there was any way to handle these files in the Ruby language.
Formats such as the SAM
file and the BAM
file are files that represent lead alignment. The next-generation sequencer will generate a read file called FASTQ. The SAM/BAM file is the result of mapping this file to the reference genome with an aligner such as bwa
or STAR
.
I think this is a slightly strange file format. There are more common databases around us, such as SQLite. There are also file formats such as tab-delimited text and MS Excel. However, in the world of life information, it seems that SAM/BAM files are preferred instead of those files.
@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1
↑ An example of a SAM file listed in the specifications (hts-spec). It has unfamiliar headers, flags and tags. The flag contains very important information such as whether the lead is mapped or not.
Flags are written in binary, and even if you look at Pat, you can't tell what it means. You can use the samtools flag
command to display the meaning of the flags.
samtools flag 2064
# 0x810 2064 REVERSE,SUPPLEMENTARY
↑ When the meaning of the flag is displayed with samtools
Integer | Binary | Description |
---|---|---|
1 | 000000000001 | There is a pair lead |
2 | 000000000010 | Both are mapped |
4 | 000000000100 | read1 is not mapped |
8 | 000000001000 | read2 (mate) is not mapped |
16 | 000000010000 | read1 is mapped to the inverse complementary strand |
32 | 000000100000 | read2 is mapped to the inverse complementary strand |
64 | 000001000000 | read1 |
128 | 000010000000 | read2 |
256 | 000100000000 | Mapped to multiple locations |
512 | 001000000000 | Mapping QV is low |
1024 | 010000000000 | PCR or optical duplicate |
2048 | 100000000000 | supplementary alignment (e.g. aligner specific, could be a portion of a split read or a tied region) |
↑ Flag table
BAM file is a compressed format of SAM file, and it seems to be a format compatible with gzip. The BAM file seems to be devised to be easy to search even if it is compressed. There is also a similar file format called CRAM
. This seems to be a file format with a higher compression rate than BAM. You can manipulate these special files using a command line tool called Samtools.
In fact, I think the reason why this seemingly harassing file format for beginners is used is largely due to historical background and the technical limitations of humankind. RNA-Seq BAM files can be as small as a few GB, and whole-genome sequences can be as large as tens of GB. This alone is huge enough, but SQLite and TSV will be several times larger than that. Searching for such a huge file in Excel is naturally difficult. It will be difficult to analyze, and it will be difficult to secure space for storage. Perhaps it's a size issue, and a special format has been developed to search as fast as possible. (If you are interested, please read the entry below)
However, computer performance continues to improve. The same situation does not always last. In 10 or 20 years, the time may come when smartphones will be sufficient for processing current sequence information. For example, there is a project called biowasm. This seems to be like running samtools`` bedtools
bowtie`` seqkit
etc. on the browser by using WebAssembly.
↑ An ambitious project biowasm that runs genomics tools on a browser
At present, such tools may not be useful. Get a huge budget in the world, operate a large number of PCs such as cloud environments and supercomputers using workflow languages, efficiently process large quantities of samples by violence of numbers and do not discover new facts It may not be the target of evaluation. But even if you say such a brave thing, in reality, only a few people on the planet can do it. So, if you catch things like that, you may feel darker and darker. (At least I feel dark)
Therefore, I feel that the time to touch and look at tools that are far from the world is an important time to warm your heart.
The story jumped, but the story jumps further.
I'm curious that I can't find the SAM or BAM file icons. I searched quite a bit for an icon set for bioinfo. People who are good at command line operations may not care about that, but I personally think that the existence of file icons is very important for working comfortably.
If you know a great set of icons for bioinfo, please let us know in the comments. Maybe somewhere in the world there are people making such things, but I think they just haven't been found. Well, there is also a last resort to design it yourself, cut the money and have the designer finish it, and publish it copyright-free.
Is published as a Github repository. There is not much information in Japanese, so if you want to know more details, you need to refer to the original text. However, because it is a PDF file, Google Translate does not work very well. So I transcribed it from PDF and posted it to Qiita as an article.
By Google Translate the site above, you can understand the specifications of SAM/BAM files in Japanese to some extent.
VCF/BCF is a form of mutation. I can't say that I understand this on the surface yet, so I need to study it in the future. First of all, I'm thinking of machine-translating the specifications like SAM/BAM format to get an overview of the whole.
By the way, it turned out that this Samtools is written in C language, and inside Samtools, a library called htslib works like an engine. HTSLib seems to have separated what was originally a part of Samtools as HTSLib. And I found that bindings to htslib were created in various languages such as Python and Java. I learned that even in languages like Python and R, tools that directly handle SAM/BAM/VCF/BCF use htslib indirectly.
language | URL |
---|---|
Python | https://github.com/pysam-developers/pysam |
JAVA | https://github.com/samtools/htsjdk |
Rust | https://github.com/rust-bio/rust-htslib |
Nim | https://github.com/brentp/hts-nim |
Julia | https://github.com/bicycle1885/HTSLib.jl |
R | https://github.com/Bioconductor/Rhtslib |
Perl | https://github.com/Ensembl/Bio-DB-HTS |
↑ htslib binding in various languages
What's interesting is that bindings are being enthusiastically developed in new statically typed languages such as Rust and Nim. I think this is a big demand for speed. Since the amount of genome data is huge, it may be required to process it faster using languages such as Rust, Nim, and Go rather than scripting languages such as Python.
The fifth Julia language binding from the top seems to have been developed by the Japanese. Looking inside, it seems that the C language header file is parsed and the binding is automatically generated to some extent.
Figure: HTSLib.jl comes with its own binding [auto-generated script](https://github.com/bicycle1885/HTSLib.jl/blob/master/scripts/translate_exported_functions.jl)So I thought. I think that the Ruby language will be able to handle next-generation sequencer files to some extent if htslib bindings are created.
At such times, if you are in a trendy language such as Python, R, Julia, you may be wondering which binding to use by looking for a binding made by a wonderful person in the world, but in Ruby there is only one solution. But you don't have to worry about it.
You can make it yourself or not.
By chance, the HTSLib treatise was uploaded on December 16th. In response to this, the creation of htslib bindings may be activated in various languages.
↑ Figure in the treatise
↑ Twitter that the author of Nim language binding hts-nim commented "It seems convenient Danner"
So how do you create a Ruby binding? There are two main methods. One is to use native extension and the other is to use FFI. To use the native extension, you need to write C language. I can't use this method because I can't handle C. The other is that FFI is easy to use, and you can create Ruby bindings even if you are not very familiar with C language.
I have been making a tool for drawing graphs in Ruby called GR.rb for a year.
↑ GR.rb that can draw beautiful graphs and figures even in Ruby
It is a binding of a graph drawing library called GR, which is the de facto de facto standard in the Julia language. FFI uses a Ruby standard library called Fiddle. Creating a Ruby binding is like connecting wires to create a plastic model, which can be done without such advanced technology.
If you observe the code for the great bindings that precede it, learn the patterns, and make them the same, you'll almost always have something that works. This plotting tool was also started by myself because there is not enough library to draw graphs in Ruby.
In the Ruby community, Andrew Kane has been energetically creating bindings for C language machine learning tools for about two years, and uses FFI for C language bindings in Ruby. For those who want to make it, I think it will be very helpful to observe Ankane's repository.
Next, what is the significance of creating a Ruby binding for htslib?
As a personal merit of developing, while making ruby-htslib, it will be necessary to read the file specifications of SAM and VCF, so I feel that I will naturally acquire various knowledge about the format. I will. In the future, I think the time will surely come when various new modality will emerge beyond the next-generation sequencer. Even in that case, human activities such as the emergence of various formats, the creation of tools and libraries that form the basis for processing them, and the creation of bindings from various languages do not change significantly. I feel that it is of great significance to learn a place where various people are working hard on common issues.
Next, while there are various languages that are more suitable for data analysis than Ruby, let us consider the social meaning of creating bindings in the Ruby language. I don't know until I try this. Ruby has a reputation for ease of use in the field of converting data into objects and handling it, as represented by ActiveRecord. So, if something like the goodness of HTSLib's Ruby binding appears as you continue to make it, I think there is one such direction. Of course, it would be better if we could create an analysis tool using the Ruby language or something like an analysis pipeline, but I think this is quite difficult to achieve. In any case, I think diversity is important in the world of programming languages as well.
Now, from here on, how can a person without technical skills make ruby-htslib? It will be a realistic story.
The first point is choosing the FFI library. In the Ruby language, there are two types of libraries for creating FFI bindings, Ruby-FFI and Fiddle. Both are historic libraries. Ruby-FFI is a more sophisticated third party library. Fiddle is a standard library of Ruby, but its functions are simple and not very popular.
I decided to use Ruby-FFI this time. Ruby-FFI has been adopted by many projects and is good at dealing with mutual conversion between Ruby arrays and C pointers, nested structs, etc. The bad points of Ruby-FFI are that it is not a Ruby standard library, so it is necessary to install an additional library, and there is a risk that it will not be able to keep up with changes when the Ruby configuration changes significantly in the future. Since there is no developer, it is difficult to ask questions in Japanese.
↑ First of all, in order to get the whole picture of Ruby FFI, I made a complete translation of the FFI wiki using Google Translate.
Well, I can think of two points that it is difficult to make even if you use Ruby-FFI with low level binding.
One is that Ruby-FFI does not support the C language bit field. There is also a Gem that makes FFI compatible with bit field, but I think that it can not be used as it is because the release date is old.
Another difficulty is memory management. Ruby automatically deletes objects whose garbage collection is no longer referenced. It seems that FFI also releases the memory at that timing. However, for some reason, it may not work well, and the memory to be released may not be released, or conversely, the GC may release the memory without permission. When that happens, I think it will be very difficult for me to find and debug the problem. When you hit such a thing, I think you have to get help from others.
When creating an FFI binding, I think that the function of the header file usually written in C language is rewritten to the FFI module of Ruby by repeating the replacement using regular expressions. However, it is not bad to rewrite each function into a method by yourself. Anyone can do it by handwriting each method, if it takes time and effort. It may seem like a monotonous task, but it actually has a big advantage. By copying the API of the library, you will be able to gain a deeper understanding of the overall picture of the tool.
However, when humans manually create bindings, typographical errors and typographical errors are inevitable. HTSLib is still under development and functions will continue to be added and modified. As the version of HTSLib goes up, the Ruby bindings will have to be tweaked as well. I like manual work and think it's important, but I also want to think about some kind of automation.
One policy is to create your own converter, as in the case of the Julia language. In this case, you can use regular expression substitution or use a C language parser implemented in the Ruby language to create a script that automatically generates bindings. Another possibility is to use an existing converter made by someone, such as Rust's HTSlib binding.
Ruby-FFI had a project to automatically generate bindings, but development is virtually interrupted. Since the last update date is old, you will not be able to use it as it is.
Also, there seems to be the following projects. This seems to work fine, so I'm looking into whether it's available.
A related method is to look directly at the binary files in the shared library.
nm htslib.so
Then you can see the name of the function directly from the shared library binary file. For the purpose of covering as much as possible the functions that can be implemented, a way to look directly at such binaries may be useful somewhere.
The advantage of including the shared library in the package is that you can easily adjust the version. However, the size of the gem will be large, and some users will not like the binary files included in the gem and will want to use their own htslib. Also, when considering cross-platform, if you do not divide the gem for each platform, you need to pack three types of binaries: Windows, Mac, and Linux. (I'm not familiar with it, so there may be a better way)
There are two ways to find/specify a shared library, such as specifying the path with an environment variable, or using Pkg-config to find the file. It seems that htslib also supports pkg-config. However, it seems necessary to check the operation when htslib is installed in Anaconda etc.
Bindings using ffi create two types of APIs.
The low-level API creates an FFI module and prepares it as a singular method. When you call the singular method of the FFI module, the htslib function is called.
High-level APIs provide a more Ruby-like API. What is Ruby-like is a difficult question. I think the goal is to be able to handle NGS-related files reasonably in an object-oriented manner. I don't understand it yet, so I'm thinking of porting hts-python and improving it.
The ActiveRecords API may be helpful for uses such as searching for bam files. I'm not familiar with how to use ActiveRecord, so I have to study a lot.
In order to think about high-level API, it seems that you need to understand htslib as a whole by using tools of other languages. I still haven't got the ability, and I feel that I need to study and experience a lot from here, so I think I should realize it on a yearly basis.
At this stage, I've just finished creating a low-level API, mostly by hand.
You can load a bam file with a sample like the following.
require 'htslib'
bam_path = File.expand_path('../assets/poo.sort.bam', __dir__)
htf = HTS::FFI.hts_open(bam_path, 'r')
idx = HTS::FFI.sam_index_load(htf, bam_path)
hdr = HTS::FFI.sam_hdr_read(htf)
b = HTS::FFI.bam_init1
nuc = { 1 => 'A', 2 => 'C', 4 => 'G', 8 => 'T', 15 => 'N' }
cig = {
0 => :BAM_CMATCH,
1 => :BAM_CINS,
2 => :BAM_CDEL,
3 => :BAM_CREF_SKIP,
4 => :BAM_CSOFT_CLIP,
5 => :BAM_CHARD_CLIP,
6 => :BAM_CPAD,
7 => :BAM_CEQUAL,
8 => :BAM_CDIFF,
9 => :BAM_CBACK
}
10.times do
HTS::FFI.sam_read1(htf, hdr, b)
p b[:core].members.zip(b[:core].values)
p name: b[:data].read_string,
flag: b[:core][:flag],
pos: b[:core][:pos] + 1,
mpos: b[:core][:mpos] + 1,
mqual: b[:core][:qual],
seq: HTS::FFI.bam_get_seq(b).read_bytes(b[:core][:l_qseq] / 2).unpack1('B*')
.each_char.each_slice(4).map { |i| nuc[i.join.to_i(2)] }.join,
cigar: HTS::FFI.bam_get_cigar(b).read_array_of_uint32(b[:core][:n_cigar])
.map { |i| s = format('%32d', i.to_s(2)); [s[0..27].to_i(2), cig[s[28..-1].to_i(2)]] },
qu
As you can see, it's confusing because it's forced to display using only low-level APIs. There is still more ... The destination is long. When you do this, you should see something like this:
[[:pos, 3289], [:tid, 0], [:bin, 4681], [:qual, 0], [:l_extranul, 2], [:flag, 133], [:l_qname, 32], [:n_cigar, 0], [:l_qseq, 70], [:mtid, 0], [:mpos, 3289], [:isize, 0]]
{:name=>"poo_3290_3833_2:0:0_2:0:0_119", :flag=>133, :pos=>3290, :mpos=>3290, :mqual=>0, :seq=>"GGGGCAGCTTGTTCGAAGCGTGACCCCCAAGACGTCGTCCTGACGAGCACAAACTCCCATTGAGAGTGGC", :cigar=>[], :qual=>"2222222222222222222222222222222222222222222222222222222222222222222222"}
[[:pos, 3292], [:tid, 0], [:bin, 4681], [:qual, 0], [:l_extranul, 3], [:flag, 133], [:l_qname, 32], [:n_cigar, 0], [:l_qseq, 70], [:mtid, 0], [:mpos, 3292], [:isize, 0]]
{:name=>"poo_3293_3822_1:0:0_0:0:0_76", :flag=>133, :pos=>3293, :mpos=>3293, :mqual=>0, :seq=>"TTCGATGCGGGACCCCCAAGACGTCGTCCTGACGAGCACAAACTCCCATTGAGAGTGGCACATGATTTCC", :cigar=>[], :qual=>"2222222222222222222222222222222222222222222222222222222222222222222222"}
[[:pos, 3293], [:tid, 0], [:bin, 4681], [:qual, 0], [:l_extranul, 2], [:flag, 69], [:l_qname, 32], [:n_cigar, 0], [:l_qseq, 70], [:mtid, 0], [:mpos, 3293], [:isize, 0]]
{:name=>"poo_3294_3861_2:0:0_2:0:0_2d7", :flag=>69, :pos=>3294, :mpos=>3294, :mqual=>0, :seq=>"TGGGGACCGTGTGACTATCAGAAGGGTGGGGTCAGCTTGTTCGATGCGGGACCCCCAAGACGTCGTCCTG", :cigar=>[], :qual=>"2222222222222222222222222222222222222222222222222222222222222222222222"}
[[:pos, 3298], [:tid, 0], [:bin, 4681], [:qual, 0], [:l_extranul, 2], [:flag, 69], [:l_qname, 32], [:n_cigar, 0], [:l_qseq, 70], [:mtid, 0], [:mpos, 3298], [:isize, 0]]
{:name=>"poo_3299_3808_0:0:0_4:0:0_2e0", :flag=>69, :pos=>3299, :mpos=>3299, :mqual=>0, :seq=>"CCCAAGACGTCGACCTGAGGAGCACAAACTCCCAATGAGAGTGGCACATGATTTGCCCAACCATACCATT", :cigar=>[], :qual=>"2222222222222222222222222222222222222222222222222222222222222222222222"}
[[:pos, 3303], [:tid, 0], [:bin, 4681], [:qual, 0], [:l_extranul, 2], [:flag, 133], [:l_qname, 32], [:n_cigar, 0], [:l_qseq, 70], [:mtid, 0], [:mpos, 3303], [:isize, 0]]
{:name=>"poo_3304_3813_0:0:0_2:0:0_2f5", :flag=>133, :pos=>3304, :mpos=>3304, :mqual=>0, :seq=>"GGACCCCCAAGACGTCGTCCTGACGCGCACAAACTCCCATTGAGAGTGGCACATTATTTCCCCAACCATA", :cigar=>[], :qual=>"2222222222222222222222222222222222222222222222222222222222222222222222"}
[[:pos, 3311], [:tid, 0], [:bin, 4681], [:qual, 0], [:l_extranul, 3], [:flag, 133], [:l_qname, 32], [:n_cigar, 0], [:l_qseq, 70], [:mtid, 0], [:mpos, 3311], [:isize, 0]]
{:name=>"poo_3312_3825_0:0:0_1:0:0_6f", :flag=>133, :pos=>3312, :mpos=>3312, :mqual=>0, :seq=>"TTGTTCGATGCGGGACCCCCAATACGTCGTCCTGACGAGCACAAACTCCCATTGAGAGTGGCACATGATT", :cigar=>[], :qual=>"2222222222222222222222222222222222222222222222222222222222222222222222"}
[[:pos, 3317], [:tid, 0], [:bin, 4681], [:qual, 0], [:l_extranul, 2], [:flag, 133], [:l_qname, 32], [:n_cigar, 0], [:l_qseq, 70], [:mtid, 0], [:mpos, 3317], [:isize, 0]]
{:name=>"poo_3318_3847_3:0:0_1:0:0_142", :flag=>133, :pos=>3318, :mpos=>3318, :mqual=>0, :seq=>"CTATCAGAAGGGTGGGGGCAGCTTGTTCGATGCGGGACCCCCAAGACGACGTCCTGACGAGCACAAACTC", :cigar=>[], :qual=>"2222222222222222222222222222222222222222222222222222222222222222222222"}
[[:pos, 3319], [:tid, 0], [:bin, 4681], [:qual, 0], [:l_extranul, 2], [:flag, 69], [:l_qname, 32], [:n_cigar, 0], [:l_qseq, 70], [:mtid, 0], [:mpos, 3319], [:isize, 0]]
{:name=>"poo_3320_3822_3:0:0_6:0:0_333", :flag=>69, :pos=>3320, :mpos=>3320, :mqual=>0, :seq=>"TTCGATGCGGGACCCCCAAGACGTCGTGCTGACGAGCACAACCTCGCAATGAGAGTGGCACAAGATTTGC", :cigar=>[], :qual=>"2222222222222222222222222222222222222222222222222222222222222222222222"}
[[:pos, 3321], [:tid, 0], [:bin, 4681], [:qual, 0], [:l_extranul, 2], [:flag, 69], [:l_qname, 32], [:n_cigar, 0], [:l_qseq, 70], [:mtid, 0], [:mpos, 3321], [:isize, 0]]
{:name=>"poo_3322_3847_1:0:0_0:0:0_2cb", :flag=>69, :pos=>3322, :mpos=>3322, :mqual=>0, :seq=>"CTATCAGAAGGGTGGGGGCAGCTTGTTCGATGCGGGACCCCCAAGACGTCGTCCTGACGAGCACAAACTC", :cigar=>[], :qual=>"2222222222222222222222222222222222222222222222222222222222222222222222"}
[[:pos, 3327], [:tid, 0], [:bin, 4681], [:qual, 0], [:l_extranul, 2], [:flag, 69], [:l_qname, 32], [:n_cigar, 0], [:l_qseq, 70], [:mtid, 0], [:mpos, 3327], [:isize, 0]]
{:name=>"poo_3328_3840_1:0:0_2:0:0_211", :flag=>69, :pos=>3328, :mpos=>3328, :mqual=>0, :seq=>"AAGGGTGGGGGCAGCTTGTTCGATGCGGGACCCCCAAGACGTCGTCCTGACGAGCACCAACTCCGATTGA", :cigar=>[], :qual=>"2222222222222222222222222222222222222222222222222222222222222222222222"}
poo is a suitable bam file created using wgsim.
This project was selected for the 2020 Ruby Association Development Grant.
This article is over (unfinished)
Recommended Posts