In my earlier article, "How to get a sample report from a hash value using VirusTotal's API", I obtained malware names from VirusTotal. This article introduces one method for determining a single malware name from those results.
If you want the complete malware name, the sort and uniq approach from the previous article is enough. However, if you want to use the result to gather more information from the Web and elsewhere, it is more useful to break the acquired strings apart and determine the malware name only roughly.
This article describes an approach that may be useful in such cases.
The target data this time are the following malware names acquired in the above article.
Malware name obtained from VirusTotal
Trojan.Linux.Mirai.1
RDN/Generic BackDoor
Backdoor.Mirai.Linux.91998
Trojan.Gen.NPE
a variant of Linux/Mirai.OX
Other:Malware-gen [Trj]
Unix.Dropper.Mirai-7135870-0
HEUR:Backdoor.Linux.Mirai.b
Trojan.Linux.Mirai.1
Trojan.Mirai.hrbzkk
Backdoor.Linux.Mirai.wao
.UnclassifiedMalware@0
Malware.LINUX/Mirai.lpnjw
Linux.Mirai.671
Backdoor.Linux.MIRAI.USELVH120
Linux/DDoS-CIA
LINUX/Mirai.lpnjw
ELF/DDoS.CIA!tr
Trojan.Linux.Mirai.1
Trojan.Linux.Mirai.K!c
HEUR:Backdoor.Linux.Mirai.b
Trojan:Win32/Skeeyah.A!rfn
Malicious (score: 85)
malware (ai score=89)
Backdoor.Mirai/Linux!1.BAF6 (CLASSIC)
Trojan.Linux.Mirai
Trojan.Linux.Mirai.1
Other:Malware-gen [Trj]
Linux/Backdoor.6f4
The goal is to determine a rough malware name from this information.
In this case, the expected output is "Mirai". With that result, you can run further Web searches and collect more detailed information starting from the hash value of the malware.
The approach is to split each name into words at delimiters, count how often each word appears, and take the most frequent word as the malware name. Searching for a full string such as "Trojan.Linux.Mirai.1" may return few or no hits, so a rough result is what we want here.
To get that rough result, the names obtained above are first split into words at delimiters.
The following characters are specified as delimiters:
- space
- . (period)
- / (slash)
- : (colon)
- - (hyphen)
- [ and ] (square brackets)
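The splitting step can be sketched as follows. This is a minimal example, assuming the delimiter set is the same character class used in the program later in this article; the names DELIMITERS and split_name are introduced here for illustration only.

```python
import re

# Delimiters: space, square brackets, period, slash, colon, hyphen.
# DELIMITERS and split_name are illustrative names, not part of the original program.
DELIMITERS = r'[ \[\]./:-]'


def split_name(name):
    """Split one detection name into words at the delimiter characters."""
    return re.split(DELIMITERS, name)


print(split_name('Trojan.Linux.Mirai.1'))
# ['Trojan', 'Linux', 'Mirai', '1']
print(split_name('Other:Malware-gen [Trj]'))
# ['Other', 'Malware', 'gen', '', 'Trj', '']
```

Note that re.split returns an empty string wherever two delimiters are adjacent or a delimiter ends the name, which is why an empty string can show up among the counted words.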
After splitting the names into words, count how often each word occurs. In this case, the frequencies are as follows.
Frequency of occurrence (top 10 only)
[('Mirai', 17),
('Trojan', 9),
('Backdoor', 7),
('', 6),
('1', 4),
('Malware', 3),
('Other', 2),
('gen', 2),
('Trj', 2),
('HEUR', 2)]
It turned out that "Mirai" appears most often. If you use this logic, the malware name will be determined to be Mirai.
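The counting step can be illustrated on a small subset of the names. This is a sketch under the assumption that four of the names listed earlier are enough to show the idea; the variable names (names, words, counts, top_word, top_count) are chosen here for illustration.

```python
import collections
import re

# Four of the names listed earlier, used as sample input
names = [
    'Trojan.Linux.Mirai.1',
    'Trojan.Mirai.hrbzkk',
    'Unix.Dropper.Mirai-7135870-0',
    'Linux.Mirai.671',
]

# Split every name into words and collect them in one list
words = []
for name in names:
    words += re.split(r'[ \[\]./:-]', name)

# Count the words and take the most frequent one
counts = collections.Counter(words)
top_word, top_count = counts.most_common(1)[0]
print(top_word, top_count)  # Mirai 4
```

When two words have the same count, most_common() returns them in the order they were first encountered, so a tie would be decided by input order rather than by relevance.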
The following program implements this logic. It takes the file containing the names obtained above as its first argument and reads it line by line. Each line is split into words at the delimiters above, the words are stored in word_list, and collections.Counter() counts how many times each word appears in word_list.
The line pprint.pprint(list(candidate.most_common(10))), which is commented out in the listing, outputs the 10 most frequent words shown earlier.
The last line, print(candidate.most_common(1)[0][0]), prints the most frequent word. In this case, it is "Mirai".
In addition, words that occur frequently but are clearly not malware names (here, "Linux") are removed from the list before counting.
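One possible way to generalize the removal step is a set of stop words. This is only a sketch, not what the program in this article does: the program removes the exact string 'Linux' and nothing else. The STOP_WORDS entries other than 'linux', the case-insensitive comparison, and the dropping of empty strings are all assumptions added for illustration.

```python
import collections

# Hypothetical stop-word set: only 'Linux' comes from the article;
# the other entries are assumptions added for illustration.
STOP_WORDS = {'linux', 'trojan', 'backdoor', 'malware'}


def filter_words(words):
    """Drop empty strings and stop words (compared case-insensitively)."""
    return [w for w in words if w and w.lower() not in STOP_WORDS]


words = ['Trojan', 'Linux', 'Mirai', '1', 'Backdoor', 'LINUX', 'Mirai', '', 'Malware']
filtered = filter_words(words)
print(filtered)  # ['Mirai', '1', 'Mirai']
print(collections.Counter(filtered).most_common(1)[0][0])  # Mirai
```

Comparing in lowercase also catches variants such as 'LINUX', which an exact match on 'Linux' leaves in the list.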
Malware name determination program
import sys
import re
import collections
import pprint

file = sys.argv[1]  # Text file with one malware name per line

with open(file) as f:
    malware = f.read().splitlines()

word_list = []
for name in malware:
    # Split on space, square brackets, period, slash, colon, and hyphen
    word = re.split(r'[ \[\]./:-]', name)
    #print(word)
    word_list += word
#print(word_list)

# Remove frequent words that are not malware names
word_list = [w for w in word_list if w != 'Linux']

candidate = collections.Counter(word_list)
#pprint.pprint(candidate)
#pprint.pprint(list(candidate.most_common(10)))
print(candidate.most_common(1)[0][0])  # Output only malware name
I worked out the judgment logic backward from the desired result and implemented it. Ideally, a security analyst would perform dynamic or static analysis, but how do you determine the malware name when you cannot afford to do that, or do not have the skills? That was my concern.
With the method introduced in this article, once you have the hash value of the malware you can determine the malware name, and I felt that by searching the Web with that result, more detailed information could be obtained automatically. (Searching by hand, reading the results by eye, and making the judgment yourself clearly takes time and energy, so I would like to make it easier.)
In the future, I plan to build a system that, once a hash value is available, uses this program to automatically search for and collect various information and combine it.