Mysterious creature ** Pooh ** It is a ** mysterious creature that may have existed since ancient times **.
Pooh's imagination
wgsim is an NGS simulator created by the bioinfo giant Heng Li. It's 2011 software, but it works fine.
https://github.com/lh3/wgsim
If you have Bioconda installed
conda install wgsim
The mysterious creature Pooh was created from a total whim. God creates living things. (Evolution is outrageous.) Since you are a god, let's create an organism with 5000 bases, which is as sequenced as a little virus. I named this creature "Poo".
Pooh's reference array is randomly generated. (I'm an endangered ** Ruby sect **, but you can do the same with other programming languages.)
genome = Array.new(5000){ %w[A C T G].sample }
puts '>poo chr1'
genome.each_slice(80) do |l|
puts l.join('')
end
If you execute this script, the array will be output like this.
** This is the reference genome of the mysterious creature ** Pooh **! !! !! ** **
>poo chr1
AAAGCGCTGGATCATGCTTTAAGGCTTAGTTTTATGGCCCTAAACACGGAGACTGAAGCCGCTTGAGTCCGGTTTCAGAG
AGATTTGGTTTCACCCACAACTGCCAGCATATAATGGACCGTCGGCCGTACACGTTATGAGCCTGGCAGGGGTTACTACG
CGCCACGGGCTAGAGACCTAGTGGCGAAACGTGAAGGGCAATTATGGAATCCCGTACGCGCAATAAATGTCTACGCATTC
CCGGGGGGTGTTGTAGTCCCACGGCCCCGGACCTGCATTGGTCCAAATTCTCTTCGGTGCATATTAGCTTCATCGGAACC
GGCCTTTCTAATTCAGAACTGTTCGGTCATACACTATCTTCCAACACGCCGTGCACGAGTTGCACGGAGATCTAGTTGTT
ACTCGACGCGCGTACGACCGAGGCGGGGCTTAACACGCGAGTAGAAGCCCTGGGTTATCCACCACTTAATTTTGTAAAAA
GCGCTTTCAGTACTACACGGCTTGACCGTAGCTCTGTCGGACTTGTGACCCCTTCCGGCGAACATTAATTGTACCCCTTC
TTCTGACGGATTGGTTAGTAGTCCAAGCAGTCTTCACCCCAGGAGCACTTCTTCTTGCCGCGAGGCCATTCAGTCCAGTT
CGTGATGGCGGGGGTACCGATGCAAAAGCTCGTCTTTCTTGAGGCATCCACAGCCAGAGGCGGGGTATTTACGAGATTAA
GTTCTATCATTGCACTAGACAAAGCAAGTACAGCTCGGCAAAAAGAAACTCGTGGGCGCGTCTTGACCTGCAGCTAATAC
TGACTAGCATCCGTAGATATTACCGATCGTGCCGCACAAGGGAAATCTCGAGGCTGACAACGTCTTTGCCAAGGCAGTGG
CCGTGTCTTCTCGCCAGGGGTAAAGGTGCCGCTGGACTAGCGCTAACTTAGTGTCGTAATATCTCCATATGGACGATTTC
CCCGATCTTACCGGCTAGCCGCTGGTTGTATGCCGCGCCCAGGCCACCGAGTCTGACATGTTCCAAGGGCCTCATGTGAA
AGCGATGTGCTGATCCAGCATGGCCTTCAGCCTCACAAGACTGAGAGTTGGTGTAGAAACCCGGTTTGTTGGTAGTATGA
TGCTTAAAAATCGGTTTAATTTAGCTTGGTTAGATTTGAACGTGAAGAGCCTATGTCCAAAGACCAACTAACCTACTCAG
ACTAAGATTTACCGATTTCTCGTTTTGACTGAGATACGTAGTTAGCTATACGGGGCGTCGCCATCCTCTACCCCTTGAAT
ATGATTCGTAAGTGATGCCATTCTCGTGCTCGTGCGCTATGAATAGCAATGTCACCACAATAGTGCCCGGTGGCCGACAA
TCCGAGACCATGTCCGTCCTCATTCATGCGTAGTTAGTAGTGTATCGACTCACCTGGAGATGACCTCTATGTCCTGCTGC
GGTACGGCACGGCAGGATCTTTCTAGAAACATCGGCGATGCCCCAGAGGTTCTCTAAACTGTGTCATCTTTTTCTGCGGT
GCGTGGCTGGTCAAACCATTCAAATATATGAGAGTCATGCGCTAGCTCCCGGTAAAACCCCACAATAACGGGATTTGGGG
TTGAAAATGCGAGGCAGCGCACATGGATCAGGCCATTTAGGGTGTGATTGGTATGTTCCGAAAAAGCTGATGAGAACTAT
GCCCCGTTTACTTTGCCAGTAGAAAGAGTAGGAAGTAGTGTAACGCGGGCACCGGAAAACGACCGACACGATACTTTCCC
TTGTGCTGGAATTGTTTGATATATTAGGACGAGAATAGTTCTCTCCATTCCGCCGGCGTATAGACCAGGCCATGCCGTTC
TATTTCCAAGTTCCTTTGAAGGATCGGTGTTATTTCGGATGTCGTCGAGTAGTCGTGGGGCGTTAGGGTCTAAGCGACCA
GTTGTAGCTAATCGGGTGTCCTCCACTAACTGCTATTAGCATAGTACACCATGCTGCTTCAACTCATTTGGGATTTGCAC
TACTCTAGGCCAGAGATACAGATGCAAGCCGACTTAGCTGCACTAGCCAAGATTGAGCCCTAGGGCGTAGCGACCCTGAG
AATCCTGGCTAGCCCCCCTAGCCCCGGAAACTGCCCGTGTTGCTCTCGAAGCAGTTAGTAATCATCGTAATTTTGTCAAG
ATGGCCGTAGAGTGCAACCTGCAAGGCTGCAGGGTGTGTTCGGCATATTAATGCGGACAGTGCAGTGGACAAAGAGTCTC
GGAAGCGACGTGTATACAGTAGTGAGTAAATTGTTAAGCTTACTATCGCAGGAACTTCTGATCTGCCCGACCGTCGAGAT
CGAGTCTAGTAACAAGATTGCAGTCCTCAAGACTAGAAGTACCTAACTCCCTCAAATGCGAGTCCCAACAACTCATCCTC
ACCTCTGGAATCTCACCTCTAAGCATGACAGGCTAGGACCATTTAACATAGTTACAGTGACCTGATACGTAACGCCTCGA
AAGGAAAGGCGCGCCTAGGCAGCAATTGCAACGAACTCGGACGTCAGTACTCAACCAACCGAGTTATTTCAGCGTCGCGA
ATCTAATGCATCTTCCTGCTTAAGACGTCGTACTCCGGTGTCACCCTTCGACTAACTTAAAATCGCGCATTGGCAAGCAT
ACCTCTAATACCGGATCGCTCCCCTCAGGTTTACTTGATGAGCGCTGAACCCAACCATAATTTCGGCAAGGTAATCGCTT
ATCGTTCCACCCAGGCGTGAATAATTCATGCCCAATGCAGGTGACCGTATCTCAGGGAGATATAAGTGAAAGTTATGTGG
AAGGTATTGTAGTACATTTCAGGGTTCATCTAAGCTTCAGGACCACAGATGCAGAAGACGGGTTAAAGGACATGTCGGCA
CGAAGACATATCTTGTGGATCCCTATTCAGTGCCCGTGACTCAACAGATAGACTCATTTTCTCCTTTGGCGTGTCCGCAG
GTAAGATAACTTCAGTGTCCGAGCGGCGGAGTGGGTGTTCACGTCGTCGAGTGTCCATTTCGATTGGGGATGTTGGCTCG
CATACCACACGACCCTTTAGTGGACGGTCTAGTTTGTATACTCCCCGATTGGACTTTACCCGCGAAGCTAGCACTGTAAA
TATTGCCCCGGAGTGGCGGGATACGTTGGATATGCCTATACTGGACCGCAGCCTATTACATCAGATCGGCTTACCGCCAA
TCTTAGTGCTCAGCTACCCAGAGCTGTATCCATGGACATAGGCGCTATGACAAGCAGTACTGTAAAGCAGGGACAGCTTT
ACTATGTCGACAGCCGGCGTACGTCGTTGGGCTACCGCCGGCCTAGTTCCTAGTCAGTTCCGAGGCATAATCTACAAGTC
TAAGTAGGGCGCCATGATGTCACGAGACTTATAGGAGTCCACCGGCAAGTGCGTTGACGTAAATACATATGTGTATGCAG
CACTTTTGACAGGCCCGTAGATTTTGCACAAACAGCCGGGTGTGGTTCAAAAAGGGTGTTGCGGTCAGCCGCAGCAGCGC
AATTGTAACCAATCTACCTCCCGTTTGCCGACCGCTCGCGCAGATTCATGTAGGAACCGCTAGTTCCCCGTCCGCTACAT
TTCAATGGGCTTTTGCATGTCCTATCGCGATACCATCTCTGGTGGCGATGATATTTTATCAGGTTGCATTGGCCCAGTCC
ATCTTCGCGGCCGATTGGGCACGCCATTAGTTGCTCTCTGCGCAATCGCAGCTATACTAATGGTATGGTTGGGGAAATCA
TTGGAGTGCATTATTCCGCCCTCTTAACCTGTTGTGAACGGAACCGCAGTACGGTAGATTGCAAAGTCCAAGGTTTCTGG
ATTGTCTCTGAAAATGAAAGTTCTGCTCTGTCAAGTGATAAACGGAACGTCCTAGGAGGCGCCTACCGCGTTTAGAGAAG
TAACCCACGATATCTCGAACAAAAAATCAAAAGCGGTGTAGGTACAAAACGTATCGCTTACATCCCGTGAAGCTTAGTAG
GACGCACCCTTCATCTCTCAGGGGGTCGCGTACACGATGTCCGTCTTTTATCGTGCCACGTGTACAAGTAGGGGTGCAGA
AATAACTGTACTAATGCAACGTTTCAGTTCTTCTGATGAACATCTCACACGGTTGTCCCCTCTAATATTAGGATTATCGG
CCTAGACGTGTTATGCACGCCGACCCCTTTCCCTAAGTGCTTCCGCAATGAAGTCGCGTATTTGTAACACACATTGGGTA
TGAATATAACTGTCCCTCCTGGAATAAGGAGTGCGCTTATCCTTTCGGACGTCAGAGCCCAGGGAGGAGACGCGTCAGAT
GCCATAGCCTATCGGGATCCGTCGTGCTAGTAGAACCAAAGTAACGGTGCTGTTAAATCGTTCCCTCAGCAGTCCGGGGT
TGACAGGTACAGTGTTTCCAACTACTAGAGGATCTCGCTGCTATAGAACGATACCACTGCGGGGGACACCGCCACTTTTC
GAGCTGTGGGGCCAGAAGTGCTGTCTGGATAATTCGGTAGGCATCAGCGGACAGCGGAGCTCGTAAGTGCCATGCACTTG
TAGCCCATTACGATGAGATCGGTTCCGCCAACTCCGTCAGCACAGGTATCCACGAGGAGTCACCGTCCGGCATGACGCAG
GACGGGGGCGGTGGAAATAATCGAGGCGGATACTCTTCATTTGTAGGCGCCAGCTACAGCGGTCCGGCGGGTCAATACCG
CATCCGCAGACATAATTGACCTCACTCAGCTCGCACGAACCTCCCTACTCTTTTCTACTCGAGGGTGCCTGATCAGAGAA
CGGAAGACGCTATAGGCAAGGCGGGCACGCCTGGAATACAGTCCCTTTCCGAATATTCACACCCGCTACCTATGTCGTAG
CGTTGCCGGGCCCGTTATCCAGACCGAACCCGTTGTCTGTTATAACATATTGAGTAATGGGCAGACAAACCCGTGTAGAG
TCCTCTCGGTTCCACATATCGAAGAGGGCGATCCGTACGA
Long.
Save it as poo.fa
.
By the way, the mysterious creature ** Pooh ** was discovered in toothpaste for sheep. By the way, do sheep brush their teeth? Well, I don't know the details. However, Pooh is reported to have been found in the toothpaste secretly used by young sheep, so let's believe it. The scientific name is said to be pull-pull-pull because it looks plump when viewed under a microscope. (I put it on properly now)
Now, the person who discovered Pooh lost interest in Pooh, who had mountains and valleys in his life, and left Pooh for the next 50 years. However, due to various accidental events such as the heavens and the earth, the rise in the temperature of the equator, the trembling of the chairman of the neighborhood, the faint spring tale, and the reconciliation between the chairman and the president, Pooh was sequenced throughout the genome. It's time to go. Now, since you are a god, use wgsim to find the result of the genome sequence.
Before running wgsim, let's take a look at the options.
wgsim --help
Options: -e FLOAT base error rate [0.020]
-d INT outer distance between the two ends [500]
-s INT standard deviation [50]
-N INT number of read pairs [1000000]
-1 INT length of the first read [70]
-2 INT length of the second read [70]
-r FLOAT rate of mutations [0.0010]
-R FLOAT fraction of indels [0.15]
-X FLOAT probability an indel is extended [0.30]
-S INT seed for random generator [0, use the current time]
-A FLOAT discard if the fraction of ambiguous bases higher than FLOAT [0.05]
-h haplotype mode
There are some options I'm not sure about (what standard deviation is -s, how -A and -h work), but I'll study them later. (do not do)
Here, the number of leads is set to 1000.
wgsim -N 1000 poo.fa poo_1.fq poo_2.fq
When executed, the mutation will be displayed. I think that poo_1.fq
and poo_2.fq
have been generated. It's easy.
poo 977 A W +
poo 1210 T W +
poo 1655 G K +
poo 1874 T K +
poo 2043 C S +
poo 2378 G T -
Let's take a look at the contents of the generated fasta file.
head -n 12 poo_1.fq
@poo_418_960_2:0:0_1:0:0_0/1
TCGTCGAGTAGTCGTGGGGCGTTAGGGTCTAAGCGACCAGTTGTAGCTAATCGGGTGTCCTCCACTAACT
+
2222222222222222222222222222222222222222222222222222222222222222222222
@poo_624_1182_3:0:0_1:0:0_1/1
CTTAGGACATAGGCTCTTCACGTTCAAATCTAACCAAGCTAAATTAAACCGATTTTTAAGCATCATACTA
+
2222222222222222222222222222222222222222222222222222222222222222222222
@poo_2608_3166_3:0:0_0:0:0_2/1
GTCCAGTATAGGCATATCCAACGTATCCCGCCACTCCGGGGCAATATTTACAGTGCTAGCTTCGCGGGTA
+
2222222222222222222222222222222222222222222222222222222222222222222222
Well, I think that the quality score usually drops in the latter half of the lead, but that is not the case. Let's check the sequence quality with Fastqc.
You can see that the quality is constant (and low).
In general, the quality should decrease as you move behind the lead as shown below. (The image is from the official website of Fastqc https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
Well, there is no help for it because it is old software.
Now, since it is a whole genome sequence, we will map it with bwa. What what. What about minimap? This is the world view of 10 years ago, it's an article to play with wgsim, there is no long lead! !! Oops, I'm upset. God must be careful not to be upset.
First of all, create an index.
bwa index poo.fa
Pooh's genome size is small, so I think execution will be completed in an instant. Let's apply bwa. (Bwa is also a tool made by Heng Li. It can be called a god.)
bwa mem poo.fa poo_1.fq poo_2.fq > poo_wgs.sam
Now, let's add SAM → BAM, BAM → sorted BAM, and even index.
samtools view -bS poo_wgs.sam > poo_wgs.bam
samtools sort poo_wgs.bam poo_wgs.sorted
samtools index poo_wgs.sorted.bam
In IGV, set poo.fa
as the reference genome and open poo_was.sorted.bam
.
Uooo! It's done! I'm done, brother! (Noisy)
Well, IGV actually has a convenient split view function. Let's use this to observe the location of the mutation.
Uooo! Brother (omitted)
Mutations have been detected properly.
Well, I wrote an article about the mysterious creature ** Pooh ** on a whim, but it turned out that it was fun to write and it was also a learning experience. Pooh series, maybe I will write it again if I feel like it.