[PYTHON] I tried to uniquely determine the malware name from the report information obtained from Virus Total

Introduction

In the article I posted earlier How to get a sample report from a hash value using VirusTotal's API, I got the malware name from VirusTotal, but I got the malware name from there. This article introduces one method for unique judgment.

If you want to determine the complete malware name, you can use the previous sort and uniq method, but if you try to acquire more information from the Web etc. by this method, the character string of the acquisition result will be decomposed and roughly. I hope that the malware name can be determined.

In this article, we have provided information that may be useful in such cases.

Target data

The target data this time are the following malware names acquired in the above article.

Malware name obtained from VirusTotal


Trojan.Linux.Mirai.1
RDN/Generic BackDoor
Backdoor.Mirai.Linux.91998
Trojan.Gen.NPE
a variant of Linux/Mirai.OX
Other:Malware-gen [Trj]
Unix.Dropper.Mirai-7135870-0
HEUR:Backdoor.Linux.Mirai.b
Trojan.Linux.Mirai.1
Trojan.Mirai.hrbzkk
Backdoor.Linux.Mirai.wao
.UnclassifiedMalware@0
Malware.LINUX/Mirai.lpnjw
Linux.Mirai.671
Backdoor.Linux.MIRAI.USELVH120
Linux/DDoS-CIA
LINUX/Mirai.lpnjw
ELF/DDoS.CIA!tr
Trojan.Linux.Mirai.1
Trojan.Linux.Mirai.K!c
HEUR:Backdoor.Linux.Mirai.b
Trojan:Win32/Skeeyah.A!rfn
Malicious (score: 85)
malware (ai score=89)
Backdoor.Mirai/Linux!1.BAF6 (CLASSIC)
Trojan.Linux.Mirai
Trojan.Linux.Mirai.1
Other:Malware-gen [Trj]
Linux/Backdoor.6f4

From this information, the malware name is roughly determined.

Expected output

In this case, the expected output is "Mirai". Based on the results obtained in this article, you will be able to collect more detailed information from the hash values ​​of malware by conducting further Web searches.

It decomposes into words by delimiters, calculates the frequency of appearance of words, and determines the one with the highest frequency of occurrence as the malware name. Even if you search for a character string such as "Trojan.Linux.Mirai.1", there may be little or no hit information, so we would like a rough judgment result here.

To make a rough judgment

In order to make a rough judgment of the result, the above acquisition result is separated by a delimiter and decomposed into words.

The following 5 characters are specified as the delimiter. -\ s (space) --. (Period) -/ (Slash) -: (Colon) --- (hyphen) -[,](Key brackets)

Judgment logic

After breaking it down into words, calculate the frequency of occurrence of the words. In this case, the frequency of appearance is as follows.

Results of frequency of occurrence(Limited to Top 10)


[('Mirai', 17),
 ('Trojan', 9),
 ('Backdoor', 7),
 ('', 6),
 ('1', 4),
 ('Malware', 3),
 ('Other', 2),
 ('gen', 2),
 ('Trj', 2),
 ('HEUR', 2)]

It turned out that "Mirai" appears most often. If you use this logic, the malware name will be determined to be Mirai.

program

This is a program that describes this logic. The acquisition result of this time is taken as the first argument and imported line by line. After that, after decomposing into words with the above delimiter, the words are stored in word_list, and the number of elements in word_list is counted by using collections.Counter () from word_list.

In pprint.pprint (list (count.most_common (10))), the words with the top 10 appearance frequencies introduced earlier are output.

The last line, print (list (count.most_common (1)) [0] [0]), prints the word with the highest frequency of occurrence. In this case, it is "Mirai".

Also, remove frequently occurring words from the list in advance, although they are clearly not malware names.

Malware name determination program


import sys
import json
import time
import requests
import numpy as np
import pandas as pd
import re
import collections
import pprint

file = sys.argv[1]
word_list = []
malware = []

with open(file) as f:
    malware = f.read().splitlines()

for i in range(len(malware)):
    word = re.split('[ \[\]./:-]', malware[i])
    #print(word)
    word_list += word
    #print(word_list)

for i in range(len(word_list)):
    try:
        word_list.remove('Linux') #Remove frequent words that are not malware names
    except ValueError:
        pass

candidate = collections.Counter(word_list)
#pprint.pprint(candidate)
#pprint.pprint(list(candidate.most_common(10)))
print(candidate.most_common(1)[0][0]) #Output only malware name

in conclusion

I thought about the judgment logic in reverse order from the result and implemented it. It would be nice if an actual security analyst could perform dynamic analysis or static analysis, but when you can't afford to do that, or when you don't have the skills, how do you determine the malware name? Is worrisome.

If the judgment is made by the method introduced in this article, if you can get the hash value of the malware, the malware name will be known, and if you search the Web based on the result, more detailed information will be automatically obtained. I wonder if it can be obtained in. I felt that. (Actually, it is clear that it takes time and physical strength to make a judgment by searching by hand, getting information by eye, and using the head. It is easy to do.)

In the future, when a hash value can be obtained using this program, we will create a system that automatically searches for and obtains various information and combines them.

References

Recommended Posts

I tried to uniquely determine the malware name from the report information obtained from Virus Total
I tried to get various information from the codeforces API
[Python] I tried to get the type name as a string from the type function
I want to see the file name from DataLoader
I tried to detect the iris from the camera image
I tried to visualize the spacha information of VTuber
I tried to notify the honeypot report on LINE
I tried to get the location information of Odakyu Bus
I tried to notify the train delay information with LINE Notify
I tried changing the python script from 2.7.11 to 3.6.0 on windows10
I tried to cut out a still image from the video
I tried to move the ball
I tried to estimate the interval.
Python programming: I tried to get company information (crawling) from Yahoo Finance in the US using BeautifulSoup4
I tried to execute SQL from the local environment using Looker SDK
I tried to learn the angle from sin and cos with chainer
Get the song name from the title of the video you tried to sing
I tried to get the movie information of TMDb API with Python
I tried to summarize the umask command
I tried to recognize the wake word
I tried to summarize the graphical modeling.
I tried to estimate the pi stochastically
I tried to touch the COTOHA API
I tried to summarize the relationship between probability distributions starting from the Bernoulli distribution
[IBM Cloud] I tried to access the Db2 on Cloud table from Cloud Funtions (python)
I tried to pass the G test and E qualification by training from 50
I tried web scraping to analyze the lyrics.
[Python] I tried substituting the function name for the function name
I tried hitting the Qiita API from go
I tried to touch the API of ebay
I tried to correct the keystone of the image
Qiita Job I tried to analyze the job offer
LeetCode I tried to summarize the simple ones
I tried to implement the traveling salesman problem
I tried to predict the price of ETF
I tried to vectorize the lyrics of Hinatazaka46!
I tried to extract various information of remote PC from Python by WMI Library
I tried to visualize the characteristics of new coronavirus infected person information with wordcloud
I tried to learn the sin function with chainer
I tried to graph the packages installed in Python
I tried to summarize the basic form of GPLVM
I tried to predict the J-League match (data analysis)
I tried to solve the soma cube with python
I wanted to use the Python library from MATLAB
I tried to approximate the sin function using chainer
I tried to put pytest into the actual battle
[Python] I tried to graph the top 10 eyeshadow rankings
I tried to erase the negative part of Meros
I tried to solve the problem with Python Vol.1
I tried to simulate the dollar cost averaging method
I tried to redo the non-negative matrix factorization (NMF)
[Deep Learning from scratch] I tried to explain Dropout
I read the Chainer reference (updated from time to time)
I tried to identify the language using CNN + Melspectogram
I tried to complement the knowledge graph using OpenKE
I tried to classify the voices of voice actors
I tried to compress the image using machine learning
I tried to summarize the string operations of Python
Sentiment analysis with natural language processing! I tried to predict the evaluation from the review text
I tried to summarize the languages that beginners should learn from now on by purpose
I tried to predict the genre of music from the song title on the Recurrent Neural Network