A story about how multi-maps are handled by bowtie2.

background

In the default behavior of bowtie2, even if the array is hit multiple times, only one of them is reported. Then, how to choose one of them is usually randomly selected from the ones with the highest alignment score. The document also says so [^ 1]. … But in some cases it seems that it will not be random and will solidify [^ 2], so I would like to clarify the conditions.

Most of the conclusions are written here [^ 2], but this problem is due to the random number generation of bowtie2. The --non-deterministic option section of the Bowtie2 instruction manual [^ 3] has the following explanation.

Normally, Bowtie 2 re-initializes its pseudo-random generator for each read. It seeds the generator with a number derived from (a) the read name, (b) the nucleotide sequence, (c) the quality sequence, (d) the value of the --seed option. This means that if two reads are identical (same name, same nucleotides, same qualities) Bowtie 2 will find and report the same alignment(s) for both, even if there was ambiguity. When --non-deterministic is specified, Bowtie 2 re-initializes its pseudo-random generator for each read using the current time. This means that Bowtie 2 will not necessarily report the same alignment for two identical reads. This is counter-intuitive for some users, but might be more appropriate in situations where the input consists of many identical reads.

In summary, bowtie2 initializes a random number generator for each read, and its seed value seems to depend on the following four.

--Lead name --Base sequence --Quality array --- The value of the --seed option

The point is that leads with the same information, including the __name, are mapped to the same location __. Below, let's confirm that the mapping result will be different if the read name, quality array, and --seed option are different.

Preparation

Prepare the following FASTQ based on the multi-mapped base sequence. (Arrangements etc. are created based on the verification example here [^ 2])

--FASTQ with the same read name, base sequence, and quality sequence --FASTQ has the same base sequence, but the name and quality sequence {either or both} are different for each read.

Since it is troublesome to prepare each time, the next generator will be used instead.

`gen_testseq.py`


#!/usr/bin/env python
import argparse
import random

parser = argparse.ArgumentParser()
parser.add_argument("-r", "--reads", type=int, default=10000)
parser.add_argument("-n", "--name", action="store_true")
parser.add_argument("-q", "--quality", action="store_true")
args = parser.parse_args()

for i in range(args.reads):
    print "@test_sequence{}".format(random.random() if args.name else '')
    print "CAAAGTGTTTACTTTTGGATTTTTGCCAGTCTAACAGGTGAAGCCCTGGAGATTCTTATTAGTGATTTGGGCTGGGGCCTGGCCATGTGTATTTTTTTAAA"
    print '+'
    print ''.join([chr(random.randrange(33, 127)) for _ in range(101)]) if args.quality else 'I' * 101

When the array on the second line of print is mapped to hg19, 9 places are multi-hit, and the following 3 places have the highest alignment score and are map candidates.

chrom	position	strand
chr1	11635	+
chr2	114359380	-
chr15	102519535	-

bowtie2 used version 2.2.4.

Verification

When all lead information is the same FASTQ

Like this, only the position is taken out from the map result and the number is counted.

$ bowtie2 --no-hd -x hg19 -U <(./gen_testseq.py) | cut -f 3-4 | sort | uniq -c

result

10000 reads; of these:
  10000 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    0 (0.00%) aligned exactly 1 time
    10000 (100.00%) aligned >1 times
100.00% overall alignment rate
  10000 chr15	102519435

All were mapped in one place.

Try specifying the `--seed` option

$ bowtie2 --no-hd --seed 17320508 -x hg19 -U <(./gen_testseq.py) | cut -f 3-4 | sort | uniq -c

10000 reads; of these:
  10000 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    0 (0.00%) aligned exactly 1 time
    10000 (100.00%) aligned >1 times
100.00% overall alignment rate
  10000 chr1	11635

The location changes depending on the value, but only one location is mapped.

Try specifying the `--non-deterministic` option

$ bowtie2 --no-hd --non-deterministic -x hg19 -U <(./gen_testseq.py) | cut -f 3-4 | sort | uniq -c

10000 reads; of these:
  10000 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    0 (0.00%) aligned exactly 1 time
    10000 (100.00%) aligned >1 times
100.00% overall alignment rate
   3302 chr1	11635
   3379 chr15	102519435
   3319 chr2	114359280

This time it was randomly mapped. Of course, the result will be different each time you run it.

If the name or quality sequence is different for each read

$ bowtie2 --no-hd -x hg19 -U <(./gen_testseq.py -n) | cut -f 3-4 | sort | uniq -c

10000 reads; of these:
  10000 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    0 (0.00%) aligned exactly 1 time
    10000 (100.00%) aligned >1 times
100.00% overall alignment rate
   3379 chr1	11635
   3294 chr15	102519435
   3327 chr2	114359280

This time it was randomly mapped. here,

$ ./gen_testseq.py -n > test.fastq
$ bowtie2 --no-hd -x hg19 -U test.fastq | cut -f 3-4 | sort | uniq -c

If you drop it to a file as described above, you will get the same result at each runtime. Of course, this does not apply if the value of the --seed option is different or if the --non-deterministic option is specified.

Randomly mapped even if the quality array is different

$ bowtie2 --no-hd -x hg19 -U <(./gen_testseq.py -q) | cut -f 3-4 | sort | uniq -c

10000 reads; of these:
  10000 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    0 (0.00%) aligned exactly 1 time
    10000 (100.00%) aligned >1 times
100.00% overall alignment rate
   3351 chr1	11635
   3312 chr15	102519435
   3337 chr2	114359280

Conclusion

--Usually, you can think that the map destination is randomly selected (it is unlikely that the lead name will be duplicated in addition to the array). --If the same lead is included, it will be mapped randomly by using the --non-deterministic option. --However, reproducibility is lost. --Use the --seed option if you need results with different multimap destinations and want to ensure reproducibility.

reference

It seems that there was a case where it did not vary well [^ 2], but was it a bug?

[^ 1]: See the -k option section [^ 3]

[PYTHON] A memo about the behavior of bowtie2 during multiple hits