This article is really awesome, so don't really take it.
It is said that the Human Genome Project has revealed the entire sequence of human DNA, but there must be a lot of unknown parts. Draw a graph of what you don't know.
You can download the sequence of the human genome from Gencode. https://www.gencodegenes.org/human/
Here, download ALL luxuriously.
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/GRCh38.p13.genome.fa.gz
You can unzip it, but you can see it to some extent by using the zcat command.
First, let's display head.
zcat GRCh38.p13.genome.fa.gz | head
>chr1 1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Yes, it came out immediately. It is a mountain of N. In other words, the part where the sequence cannot be specified comes from the beginning. Next, let's check the chromosomes for the time being.
zcat GRCh38.p13.genome.fa.gz | grep "^>"

Well, it comes out in large quantities like this.
Next, let's check if the array really consists of 5 types of strings that are only TCGAN. I'm sure there is a way to use command line tools, but I'm going to write a small program using the Crystal language. The reason for the Crystal language is that it is explosive.
count.cr
counter = Hash(Char, Int32).new(0)
while l = gets
l.chomp.each_char do |c|
counter[c] += 1
end
end
p counter
Yes, it's just plain code for anyone who understands Ruby. The only point is that you specified the type when you generated the Hash. Everything else is the same as Ruby. This is Crystal. Build.
crystal build counter.cr --release
If you include this build time, the execution speed may not be much different from Julia etc., but I will not worry about it. Let's count the number of TCGAN
zcat GRCh38.p13.genome.fa.gz | grep -v "^>" | ./count
The line that defines the chromosome at the beginning of the line with grep -v" ^> " is omitted. Result is
{'N' => 161331703, 'T' => 916993511, 'A' => 914265135, 'C' => 635937481, 'G' => 638590158}
have become. Certainly, it was confirmed that there are no strings other than TCGA and N. Next, let's count the continuous distribution of N. In other words, let's find out how many characters N are consecutive.
nseq.cr
temp = 0
while l = gets
l.chomp.each_char do |c|
if c == 'N'
temp += 1
elsif temp != 0
puts temp
temp = 0
end
end
end
puts temp if temp != 0
Build.
crystal build nseq.cr --release
First, let's find out how many consecutive N points there are.
zcat GRCh38.p13.genome.fa.gz | grep -v "^>" | ./nseq | wc -l
1234
You can see that there are not so many. If you display them side by side in long order, it will look like this.

I think it's strange that 1 is, and I get the impression that there are many 100s and 50,000s.
Now, write a program that calculates the ratio of N, AT, and CG for every 10000 characters.
n2
tcgan = Hash(Char, Int32).new(0)
target = ARGV[0]
chr = ""
loc = 1
flag = false
while l = gets
if l.starts_with?(">")
exit if flag
if l == target
puts "loc\tN\tAT\tCG"
flag = true
end
next
end
if flag
l.chomp.each_char do |c|
tcgan[c] += 1
loc += 1
if loc % 10000 == 0
total = tcgan.values.sum.to_f
ta = (tcgan['A'] + tcgan['T']) / total
cg = (tcgan['G'] + tcgan['C']) / total
n = tcgan['N'] / total
puts "#{loc}\t#{n}\t#{ta}\t#{cg}"
tcgan = Hash(Char, Int32).new(0)
end
end
end
end
Run it as a trial.
zcat GRCh38.p13.genome.fa.gz | ./n2 ">chr1 1" | head
loc N AT CG
10000 1.0 0.0 0.0
20000 0.0001 0.4076 0.5923
30000 0.0 0.4826 0.5174
40000 0.0 0.5288 0.4712
50000 0.0 0.6439 0.3561
60000 0.0 0.6346 0.3654
70000 0.0 0.6669 0.3331
80000 0.0 0.6199 0.3801
90000 0.0 0.6294 0.3706
It seems that it is working well. Creating a program with the same behavior in Ruby takes a lot of execution time, but Crystal is super fast. Throw this to uplot on the command line.
uplot is a Ruby tool that I personally make that allows me to display graphs on the terminal using UnicodePlots.rb.
From here I'm doing something pretty confusing to draw a graph on the terminal, take a screenshot and paste it into Qiita.
chr1

chr2

chr3

chr4

chr5

chr6

chr7

chr8

chr9

chr10

chr11

chr12

chr13

chr14

chr 15

chr 16

chr 17

chr 18

chr 19

chr 20

chr 21

Added scatter

chr 22

Added scatter

chr X

chr Y

According to Wikipedia It seems that the heterochromatin region is mainly undeciphered and that region is inactive. I think it means that it is rarely transcribed. In other words, it may be biologically meaningless.
However, when I try this, it seems that there are many parts where the arrangement is unknown.
If you are familiar with it, please feel free to drop in.
That's all for this article.