[PYTHON] A memo about the behavior of bowtie2 during multiple hits

A story about how multi-maps are handled by bowtie2.

background

In the default behavior of bowtie2, even if the array is hit multiple times, only one of them is reported. Then, how to choose one of them is usually randomly selected from the ones with the highest alignment score. The document also says so [^ 1]. … But in some cases it seems that it will not be random and will solidify [^ 2], so I would like to clarify the conditions.

Most of the conclusions are written here [^ 2], but this problem is due to the random number generation of bowtie2. The --non-deterministic option section of the Bowtie2 instruction manual [^ 3] has the following explanation.

Normally, Bowtie 2 re-initializes its pseudo-random generator for each read. It seeds the generator with a number derived from (a) the read name, (b) the nucleotide sequence, (c) the quality sequence, (d) the value of the --seed option. This means that if two reads are identical (same name, same nucleotides, same qualities) Bowtie 2 will find and report the same alignment(s) for both, even if there was ambiguity. When --non-deterministic is specified, Bowtie 2 re-initializes its pseudo-random generator for each read using the current time. This means that Bowtie 2 will not necessarily report the same alignment for two identical reads. This is counter-intuitive for some users, but might be more appropriate in situations where the input consists of many identical reads.

In summary, bowtie2 initializes a random number generator for each read, and its seed value seems to depend on the following four.

--Lead name --Base sequence --Quality array --- The value of the --seed option

The point is that leads with the same information, including the __name, are mapped to the same location __. Below, let's confirm that the mapping result will be different if the read name, quality array, and --seed option are different.

Preparation

Prepare the following FASTQ based on the multi-mapped base sequence. (Arrangements etc. are created based on the verification example here [^ 2])

--FASTQ with the same read name, base sequence, and quality sequence --FASTQ has the same base sequence, but the name and quality sequence {either or both} are different for each read.

Since it is troublesome to prepare each time, the next generator will be used instead.

gen_testseq.py


#!/usr/bin/env python
import argparse
import random

parser = argparse.ArgumentParser()
parser.add_argument("-r", "--reads", type=int, default=10000)
parser.add_argument("-n", "--name", action="store_true")
parser.add_argument("-q", "--quality", action="store_true")
args = parser.parse_args()

for i in range(args.reads):
    print "@test_sequence{}".format(random.random() if args.name else '')
    print "CAAAGTGTTTACTTTTGGATTTTTGCCAGTCTAACAGGTGAAGCCCTGGAGATTCTTATTAGTGATTTGGGCTGGGGCCTGGCCATGTGTATTTTTTTAAA"
    print '+'
    print ''.join([chr(random.randrange(33, 127)) for _ in range(101)]) if args.quality else 'I' * 101

When the array on the second line of print is mapped to hg19, 9 places are multi-hit, and the following 3 places have the highest alignment score and are map candidates.

chrom position strand
chr1 11635 +
chr2 114359380 -
chr15 102519535 -

bowtie2 used version 2.2.4.

Verification

When all lead information is the same FASTQ

Like this, only the position is taken out from the map result and the number is counted.

$ bowtie2 --no-hd -x hg19 -U <(./gen_testseq.py) | cut -f 3-4 | sort | uniq -c

result

10000 reads; of these:
  10000 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    0 (0.00%) aligned exactly 1 time
    10000 (100.00%) aligned >1 times
100.00% overall alignment rate
  10000 chr15	102519435

All were mapped in one place.

Try specifying the --seed option

$ bowtie2 --no-hd --seed 17320508 -x hg19 -U <(./gen_testseq.py) | cut -f 3-4 | sort | uniq -c
10000 reads; of these:
  10000 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    0 (0.00%) aligned exactly 1 time
    10000 (100.00%) aligned >1 times
100.00% overall alignment rate
  10000 chr1	11635

The location changes depending on the value, but only one location is mapped.

Try specifying the --non-deterministic option

$ bowtie2 --no-hd --non-deterministic -x hg19 -U <(./gen_testseq.py) | cut -f 3-4 | sort | uniq -c
10000 reads; of these:
  10000 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    0 (0.00%) aligned exactly 1 time
    10000 (100.00%) aligned >1 times
100.00% overall alignment rate
   3302 chr1	11635
   3379 chr15	102519435
   3319 chr2	114359280

This time it was randomly mapped. Of course, the result will be different each time you run it.

If the name or quality sequence is different for each read

$ bowtie2 --no-hd -x hg19 -U <(./gen_testseq.py -n) | cut -f 3-4 | sort | uniq -c
10000 reads; of these:
  10000 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    0 (0.00%) aligned exactly 1 time
    10000 (100.00%) aligned >1 times
100.00% overall alignment rate
   3379 chr1	11635
   3294 chr15	102519435
   3327 chr2	114359280

This time it was randomly mapped. here,

$ ./gen_testseq.py -n > test.fastq
$ bowtie2 --no-hd -x hg19 -U test.fastq | cut -f 3-4 | sort | uniq -c

If you drop it to a file as described above, you will get the same result at each runtime. Of course, this does not apply if the value of the --seed option is different or if the --non-deterministic option is specified.

Randomly mapped even if the quality array is different

$ bowtie2 --no-hd -x hg19 -U <(./gen_testseq.py -q) | cut -f 3-4 | sort | uniq -c
10000 reads; of these:
  10000 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    0 (0.00%) aligned exactly 1 time
    10000 (100.00%) aligned >1 times
100.00% overall alignment rate
   3351 chr1	11635
   3312 chr15	102519435
   3337 chr2	114359280

Conclusion

--Usually, you can think that the map destination is randomly selected (it is unlikely that the lead name will be duplicated in addition to the array). --If the same lead is included, it will be mapped randomly by using the --non-deterministic option. --However, reproducibility is lost. --Use the --seed option if you need results with different multimap destinations and want to ensure reproducibility.

reference

It seems that there was a case where it did not vary well [^ 2], but was it a bug?

[^ 1]: See the -k option section [^ 3]

Recommended Posts

A memo about the behavior of bowtie2 during multiple hits
About the behavior of Queue during parallel processing
About the behavior of yield_per of SqlAlchemy
About the behavior of enable_backprop of Chainer v2
A memo explaining the axis specification of axis
About the behavior of copy, deepcopy and numpy.copy
A note about the python version of python virtualenv
About the behavior of Model.get_or_create () of peewee in Python
A memorandum about the warning of the pylint output result
A memo to visually understand the axis of pandas.Panel
A story about changing the master name of BlueZ
A reminder about the implementation of recommendations in Python
A note on the default behavior of collate_fn in PyTorch
Reuse the behavior of the @property method by using a descriptor [16/100]
Tank game made with python About the behavior of tanks
About the ease of Python
I tried a little bit of the behavior of the zip function
About the components of Luigi
About the features of Python
[python] A note that started to understand the behavior of matplotlib.pyplot
About the return value of pthread_mutex_init ()
About the return value of the histogram.
About the basic type of Go
About the upper limit of threads-max
About the size of matplotlib points
About the basics list of Python basics
The story of writing a program
I wanted to be careful about the behavior of Python's default arguments
A note about the functions of the Linux standard library that handles time
Check the behavior of destructor in Python
A class that hits the DMM API
A rough understanding of python-fire and a memo
About the virtual environment of python version 3.7
Measure the relevance strength of a crosstab
A quick overview of the Linux kernel
A memorandum of understanding about django's QueryDict
About the arguments of the setup function of PyCaret
[python] [meta] Is the type of python a type?
About the Normal Equation of Linear Regression
Get the filename of a directory (glob)
The story of blackjack A processing (python)
Memo of troubles about coexistence of Python 2/3 system
Memo about Sphinx Part 1 (Creating a project)
Notice the completion of a time-consuming command
A note about doing the Pyramid tutorial