Get PowerShell commands from malware dynamic analysis site with BeautifulSoup + Python

Introduction

I would like to scrape Joe Sandbox's malware analysis reports with BeautifulSoup and Python to extract the PowerShell command lines they contain.

About Joe Sandbox

Joe Sandbox is a site that analyzes malware and outputs a report. https://www.joesandbox.com

Joe Sandbox comes in several editions. The one called Cloud Basic lets you analyze malware for free. Reports produced by Cloud Basic are published, so you can also read other people's analysis reports. Note that the Web API is available in the other editions, but apparently not in Cloud Basic.

If you want to know more details, please refer to the following.

What I want to do this time

Get the PowerShell command line from the JoeSandbox Cloud Basic analysis report.

If you simply collect every command line, those run by malware and those run by legitimate files end up mixed together. To tell them apart, the script also retrieves the score that indicates how malicious Joe Sandbox judged the sample to be.

The score is taken from the following part of the report.

[Screenshot: detection score in the report]

The PowerShell command line is taken from the following part. In the example below, C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -noP -sta -w 1 -enc (omitted) is extracted.

[Screenshot: PowerShell command line in the report]

The results are written to a text file as shown below. Each line contains the report number, the score, and the PowerShell command line, separated by commas.

ReportNumber,DetectionScore,PowerShellCommandLine
236546,56,C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -noP -sta -w 1 -enc (omitted)
236547,99,C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -NoP -NonI -W Hiden -Exec Bypass (omitted)
236548,10,C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\powershell.exe (omitted)
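
Because each line carries the score, the output can later be narrowed down to entries that are likely malicious. The following is a minimal sketch of such a filter and is not part of the original script: the function name filter_by_score is hypothetical, and the default threshold of 40 is a tentative value, not an official Joe Sandbox cutoff. It splits on the first two commas only, so command lines that contain commas themselves stay intact.

```python
def filter_by_score(lines, min_score=40):
    """Return (report_number, score, command_line) for lines scoring above min_score."""
    kept = []
    for line in lines:
        # Split on the first two commas only: the command line may contain commas
        parts = line.rstrip('\n').split(',', 2)
        if len(parts) != 3 or not parts[1].isdigit():
            continue  # header line, error line, or anything malformed
        report_number, score, command_line = parts[0], int(parts[1]), parts[2]
        if score > min_score:
            kept.append((report_number, score, command_line))
    return kept


# Made-up sample lines in the output format described above
sample_lines = [
    'ReportNumber,DetectionScore,PowerShellCommandLine',
    '236546,56,powershell.exe -noP -sta -w 1',
    '236548,10,powershell.exe',
]
print(filter_by_score(sample_lines))
```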

Code

The snippets below are written so that they work when joined together.

Import the required libraries

In addition to BeautifulSoup, import requests, os, and re. The 'lxml' parser passed to BeautifulSoup later also requires the lxml package to be installed.

import requests
from bs4 import BeautifulSoup
import os
import re

Specify the report number to extract

Set a from value and a to value so that information can be extracted from multiple reports. For testing, both are set to the same report here.

report_num_from = 236543
report_num_to = 236543

Crawl the specified page

The URL of a report is shown below. The "236543" part appears to be the report number, so the script loops over this number.

https://www.joesandbox.com/analysis/236543/0/html

def extract_powershell_command(report_num_from, report_num_to):
    for reoprt_num in range(report_num_from, report_num_to + 1):
        ps_cmdline = []
        try:
            target_url = 'https://www.joesandbox.com/analysis/' + str(reoprt_num) + '/0/html'
            response = requests.get(target_url)
            soup = BeautifulSoup(response.text, 'lxml')
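
The crawl step can also be separated from the parsing. The following is a rough sketch under a few assumptions, not the author's code: the helper names build_report_url and fetch_reports are hypothetical, the HTTP function is passed in as the get parameter (for example requests.get) so the loop can be tried without network access, and the one-second pause between requests is an arbitrary choice meant to go easy on the site.

```python
import time

# URL pattern taken from the report URL shown above
BASE_URL = 'https://www.joesandbox.com/analysis/{}/0/html'


def build_report_url(report_num):
    """Return the HTML report URL for one report number."""
    return BASE_URL.format(report_num)


def fetch_reports(report_num_from, report_num_to, get, delay_seconds=1.0):
    """Yield (report_num, html_text) for each report in the inclusive range.

    `get` is any callable that takes a URL and returns an object with a
    `.text` attribute, for example `requests.get`.
    """
    for report_num in range(report_num_from, report_num_to + 1):
        response = get(build_report_url(report_num))
        yield report_num, response.text
        if report_num < report_num_to:
            time.sleep(delay_seconds)  # pause before requesting the next report
```

With requests installed, it could be called as fetch_reports(236543, 236543, requests.get), or with a wrapper such as lambda url: requests.get(url, timeout=30) so that a stalled connection does not block the loop forever.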

Get score

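The score is shown at the top of the report page. In this example the score is 56, and this is the number to retrieve.

[Screenshot: score at the top of the report page]

The corresponding HTML is shown below.

[Screenshot: HTML source of the score table]

It is a bit rough, but the score can be obtained as follows. The first row of the table with id detection-details-overview-table is converted to a string, and the regular expression keeps only the text of its last cell.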

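detection_score = 0
# The score is in the first row of the detection overview table
table = soup.find_all("table", {"id": "detection-details-overview-table"})[0]
rows = table.find_all("tr")
# Keep only the text of the last cell in that row
detection_score = re.sub(r'.+>(.+)</td></tr>', r'\1', str(rows[0]))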
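
One caveat with re.sub here is that it returns its input unchanged when the pattern does not match, so an unexpected page layout would put the whole row HTML into the score field. The following is a small sketch of a stricter variant using re.search, not the author's method: the function name parse_score is hypothetical, and the sample rows are made up for illustration because the real markup is only visible in the screenshot.

```python
import re

# A number in the last cell of a table row, allowing surrounding whitespace
SCORE_PATTERN = re.compile(r'>\s*(\d+)\s*</td>\s*</tr>')


def parse_score(row_html):
    """Return the score in the last cell of a table row as an int, or None."""
    match = SCORE_PATTERN.search(row_html)
    if match is None:
        return None  # layout differs from what was expected
    return int(match.group(1))


# Hypothetical rows, only shaped like the one the regular expression expects
print(parse_score('<tr><td>Score:</td><td>56</td></tr>'))
print(parse_score('<tr><td>no score here</td></tr>'))
```

Returning None for a non-matching row lets the caller decide whether to skip the report or record an error.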

Get PowerShell command line

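Below is an example of a report page that includes a PowerShell command line.

[Screenshot: report page with a PowerShell command line]

The goal is to get the text that follows cmdline for powershell.exe entries. The corresponding HTML is shown below.

[Screenshot: HTML source of the process list]

Looking at the contents in more detail:

[Screenshot: expanded HTML of the process list]

Apparently the data is stored in a table (the element with id startup1), with each process in its own li element. The code picks out each li that contains powershell.exe, strips the wbr tags, and extracts the text that follows cmdline. There may be a smarter way, but I did the following: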

startup = soup.find('div', id='startup1')

for line in startup.find_all('li'):
    if 'powershell.exe' in str(line):
        # Drop the <wbr> line-break hints inserted into long strings
        tmp = str(line).replace('<wbr>', '').replace('</wbr>', '')
        # Keep the text between "cmdline: " and " MD5: "
        cmdline = re.sub(r'.+ cmdline: (.*) MD5: <span.+', r'\1', tmp)
        ps_cmdline.append(str(reoprt_num) + ',' + detection_score + ',' + cmdline)
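
The string handling in this step can be tried on its own, without BeautifulSoup. The following is a minimal sketch under stated assumptions: the function name extract_cmdline and the sample li string are hypothetical, the extra handling of a self-closing <wbr/> form is a precaution in case the tag is serialized that way, and html.unescape is added because characters such as & appear as entities in the serialized HTML.

```python
import html
import re

# Text between " cmdline: " and " MD5: <span", as in the regular expression above
CMDLINE_PATTERN = re.compile(r' cmdline: (.*) MD5: <span', flags=re.DOTALL)


def extract_cmdline(li_html):
    """Return the command line found in one <li> string, or None if absent."""
    cleaned = li_html
    for tag in ('<wbr>', '</wbr>', '<wbr/>'):
        cleaned = cleaned.replace(tag, '')  # remove line-break hints
    match = CMDLINE_PATTERN.search(cleaned)
    if match is None:
        return None
    return html.unescape(match.group(1))  # turn &amp; back into &


# Hypothetical list item, shaped only like the pattern expects
sample_li = ('<li>powershell.exe (PID: 1234 cmdline: powershell.exe -noP<wbr/> -sta '
             '-c "a &amp; b" MD5: <span>0123</span>)</li>')
print(extract_cmdline(sample_li))
```

Unlike re.sub, this returns None when the expected markers are missing, so the whole li element is never written out by mistake.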

Exception handling

Basic exception handling is included for now. When processing of a report finishes, whether it succeeded or failed, the save_file function (described later) is called.

except IndexError as e:
    ps_cmdline.append('{},ERROR:{}'.format(reoprt_num,e))

except Exception as e:
    ps_cmdline.append('{},ERROR:{}'.format(reoprt_num,e))

finally:
    save_file(ps_cmdline)

File writing

This part writes the results to a file. It does not strictly need to be a separate function, but I plan to replace the file output with other processing later, so I made it a function that is easy to swap out. It creates output.txt in the same folder and appends to it.

def save_file(ps_cmdline):
    with open('./output.txt', 'a') as f:
        if os.stat('./output.txt').st_size == 0:
            f.write('ReportNumber,DetectionScore,PowerShellCommandLine\n')

        for x in ps_cmdline:
            f.write(str(x) + "\n")
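
One thing to watch with a comma-separated format is that a PowerShell command line can itself contain commas. As a hedged alternative, the sketch below uses Python's standard csv module, which quotes such fields automatically. The function name save_rows, the tuple layout, and the file name output.csv are assumptions made for this example.

```python
import csv
import os

HEADER = ['ReportNumber', 'DetectionScore', 'PowerShellCommandLine']


def save_rows(rows, path='./output.csv'):
    """Append (report_num, score, cmdline) tuples to a CSV file.

    The header is written only when the file is new or empty.
    """
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline='' is what the csv module expects for files it writes
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(HEADER)
        writer.writerows(rows)


# Example call (writes to ./output.csv):
# save_rows([(236546, 56, 'powershell.exe -noP -sta -w 1')])
```

Reading the file back with csv.reader then yields exactly three fields per row, regardless of what the command line contains.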

Completed code

Here is the complete code, with the snippets above joined together and comments added. Needless to say, keep the range specified by from and to modest. The code was written and run in Jupyter Notebook, which is why it is laid out as follows.

import requests
from bs4 import BeautifulSoup
import os
import re

report_num_from = 236547
report_num_to = 236547

def extract_powershell_command(report_num_from, report_num_to):
    """
    Extract PowerShell Command from JoeSandbox analysis result.

    Parameters
    ----------
    report_num_from : int
        First report number to analyze
    report_num_to : int
        Last report number to analyze
    """

    for reoprt_num in range(report_num_from, report_num_to + 1):
        ps_cmdline = []
        try:
            target_url = 'https://www.joesandbox.com/analysis/' + str(reoprt_num) + '/0/html'
            response = requests.get(target_url)
            soup = BeautifulSoup(response.text, 'lxml')

            # Check JoeSandbox Detection Score (Maybe score above 40 is malicious)
            detection_score = 0
            table = soup.find_all("table", {"id": "detection-details-overview-table"})[0]
            rows = table.find_all("tr")
            detection_score = re.sub(r'.+>(.+)</td></tr>', r'\1', str(rows[0]))

            startup = soup.find('div', id='startup1')  # 'startup1' is a table with ProcessName & CommandLine

            for line in startup.find_all('li'):
                if 'powershell.exe' in str(line):
                    tmp = str(line).replace('<wbr>', '').replace('</wbr>', '')
                    cmdline = re.sub(r'.+ cmdline: (.*) MD5: <span.+', r'\1', tmp)
                    ps_cmdline.append(str(reoprt_num) + ',' + detection_score + ',' + cmdline)

        # Report number does not exist
        except IndexError as e:
            ps_cmdline.append('{},ERROR:{}'.format(reoprt_num,e))

        except Exception as e:
            ps_cmdline.append('{},ERROR:{}'.format(reoprt_num,e))

        finally:
            save_file(ps_cmdline)

def save_file(ps_cmdline):
    """
    Save the extraction results to a file.
    File output is kept in its own function because it may be replaced later.

    Parameters
    ----------
    ps_cmdline : list of str
        Lines to write, each in the form 'report number,score,command line'.
    """

    with open('./output.txt', 'a') as f:
        if os.stat('./output.txt').st_size == 0:
            f.write('ReportNumber,DetectionScore,PowerShellCommandLine\n')

        for x in ps_cmdline:
            f.write(str(x) + "\n")

extract_powershell_command(report_num_from, report_num_to)

Execution result

Running the above code will give you the following results:

ReportNumber,DetectionScore,PowerShellCommandLine
236547,56,C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -noP -sta -w 1 -enc SQBmACgAJABQAFMAVgBFAFIAcwBJAG8AbgBUAGEAYgBMAGUALgBQA (omitted)
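
The value after -enc is Base64 text, and PowerShell expects it to encode the script as UTF-16LE. As a possible next step, the sketch below decodes such a value. It is not part of the original script: the function name decode_encoded_command is hypothetical, and the sample is a harmless command encoded on the spot rather than data from a report.

```python
import base64


def decode_encoded_command(encoded):
    """Decode the argument of PowerShell's -enc / -EncodedCommand option."""
    # The option takes the Base64 form of the UTF-16LE encoded script text
    return base64.b64decode(encoded).decode('utf-16-le')


# Round trip with a harmless, made-up command
sample = base64.b64encode('Write-Output "hello"'.encode('utf-16-le')).decode('ascii')
print(decode_encoded_command(sample))
```

A value that has been cut short, like the one in the output above, may fail to decode until the full string is available.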
