[PYTHON] Get Splunk download link by scraping

1. I want to get the download link of the installer

For work I use AWS EC2 whenever I want to quickly spin up a Splunk verification environment. I automate building and tearing it down with CloudFormation, but every time I still have to visit the Splunk site to grab a download link for the installer. Since fetching the link by hand is quite tedious, I decided to implement scraping in Python.

2. Functions and technologies used

The implementation of API Gateway and Lambda themselves is omitted in this article, since many predecessors have already left plenty of knowledge on Qiita. The following are the articles I referred to.

[Python] Using an external module with AWS Lambda
I made a simple REST API with AWS Lambda

3. Library

Pull the HTML from Splunk's official website with requests, read the acquired HTML with Beautiful Soup, and use re to extract the necessary information with a regular expression. There are plenty of other ways to scrape, but I find requests easiest when no screen transitions are needed. I didn't use Selenium this time because managing the browser driver is a hassle and I have little experience with it.
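To illustrate the flow, here is a minimal sketch of the requests → Beautiful Soup → re pipeline (the URL and the data-link attribute follow the full code in section 4; filtering on .tgz is just an example):

import re
import requests
from bs4 import BeautifulSoup

#Fetch the download page and parse the HTML
r = requests.get('https://www.splunk.com/ja_jp/download/universal-forwarder.html')
soup = BeautifulSoup(r.content, "html.parser")

#Find <a> tags whose data-link attribute contains .tgz, then pull the URL out with a regular expression
for a in soup.find_all("a", attrs={"data-link": re.compile(r'\.tgz')}):
    print(re.search(r'data-link=\"([^\"]+)\"', str(a)).group(1))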

4. Code

Without further ado, here is the implemented code.

lambda_function.py


import requests
import re
from bs4 import BeautifulSoup

def lambda_handler(event, context):
    #Get the OS type
    os_type = event.get("os")

    #For Linux, also get the extension
    filename_extension = ""
    if os_type == "Linux":
        filename_extension = event.get("filename_extension")

    #Get Installer Type
    installer_type = event.get("installer")

    #Get the target version
    target_version = event.get("version")

    #Get the HTML tag of the target version
    html_tag = get_old_installer_link(os_type, installer_type, filename_extension, target_version)
    #If the target version does not exist in Older Releases, get the latest version
    if len(html_tag) == 0:
        html_tag = get_new_installer_link(os_type, installer_type, filename_extension)

    #Extract the download link from the acquired tag
    dl_link = dl_link_extraction(html_tag)

    #Return of execution result
    return {
        'statusCode': 200,
        'body': dl_link
    }


def get_old_installer_link(os, installer, extension, version):
    #Default to an empty list so an unsupported installer/OS combination returns no match instead of raising an error
    html_list = []

    #Branch the execution contents for each installer
    if installer == "EP":
        # EnterPrise
        #Get old version
        old_r = requests.get('https://www.splunk.com/page/previous_releases')
        old_soup = BeautifulSoup(old_r.content, "html.parser")

        #Branch execution contents for each os
        if os == "Windows":
            html_list = old_soup.find_all("a", attrs={"data-version": version, "data-arch": "x86_64", "data-platform": "Windows"})
        elif os == "Linux":
            html_list = old_soup.find_all("a", attrs={"data-version": version, "data-arch": "x86_64", "data-platform": "Linux", "data-link": re.compile(r'\.' + extension)})

    elif installer == "UF":
        # UniversalForwarder
        #Get old version
        old_r = requests.get('https://www.splunk.com/page/previous_releases/universalforwarder')
        old_soup = BeautifulSoup(old_r.content, "html.parser")

        #Branch execution contents for each os
        if os == "Windows":
            html_list = old_soup.find_all("a", attrs={"data-version": version, "data-arch": "x86_64", "data-platform": "Windows"})
        elif os == "Linux":
            html_list = old_soup.find_all("a", attrs={"data-version": version, "data-arch": "x86_64", "data-platform": "Linux", "data-link": re.compile(r'\.' + extension)})

    return html_list


def get_new_installer_link(os, installer, extension):
    #Default to an empty list so an unsupported installer/OS combination returns no match instead of raising an error
    html_list = []

    #Branch the execution contents for each installer
    if installer == "EP":
        # EnterPrise
        #Get new version
        new_r = requests.get('https://www.splunk.com/ja_jp/download/splunk-enterprise.html')
        new_soup = BeautifulSoup(new_r.content, "html.parser")

        #Branch execution contents for each os
        if os == "Windows":
            html_list = new_soup.find_all("a", attrs={"data-arch": "x86_64", "data-platform": "Windows"})
        elif os == "Linux":
            html_list = new_soup.find_all("a", attrs={"data-arch": "x86_64", "data-platform": "Linux", "data-link": re.compile(r'\.' + extension)})

    elif installer == "UF":
        # UniversalForwarder
        new_r = requests.get('https://www.splunk.com/ja_jp/download/universal-forwarder.html')
        new_soup = BeautifulSoup(new_r.content, "html.parser")

        #Branch execution contents for each os
        if os == "Windows":
            html_list = new_soup.find_all("a", attrs={"data-arch": "x86_64", "data-platform": "Windows"})
        elif os == "Linux":
            html_list = new_soup.find_all("a", attrs={"data-arch": "x86_64", "data-platform": "Linux", "data-link": re.compile(r'\.' + extension)})

    return html_list


def dl_link_extraction(tag):
    #Extract download links with regular expressions
    link = re.search(r'data-link=\"([^\"]+)\"', str(tag[0])).group(1)
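    #Note: tag[0].get("data-link") would return the same URL directly; the regex is kept to match the re-based approach described in section 3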
    return link

Basically, there is no error handling: no matter what happens, the status comes back as 200. Really it should be built properly, but please forgive me this time.
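If you do want to harden it, a minimal sketch of the handler might look like the following (it reuses the functions above; returning 400 for bad input and 404 for no match is my own convention, not part of the original code):

#A minimal sketch of a hardened handler (assumption: 400 for bad input, 404 for no match)
def lambda_handler(event, context):
    os_type = event.get("os")
    installer_type = event.get("installer")
    target_version = event.get("version")
    filename_extension = event.get("filename_extension", "")

    #Reject parameters the scraping functions cannot handle
    if os_type not in ("Windows", "Linux") or installer_type not in ("EP", "UF"):
        return {'statusCode': 400, 'body': 'invalid os or installer'}

    html_tag = get_old_installer_link(os_type, installer_type, filename_extension, target_version)
    if len(html_tag) == 0:
        html_tag = get_new_installer_link(os_type, installer_type, filename_extension)
    if len(html_tag) == 0:
        return {'statusCode': 404, 'body': 'no matching installer found'}

    return {'statusCode': 200, 'body': dl_link_extraction(html_tag)}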

5. Execution example

Pass the parameters to API Gateway and hit the endpoint with curl. There are four parameters: the target OS, the target installer, the installer's file extension, and the target version. The accepted values for each parameter are as follows. (If you pass anything else, no download link is returned ...)

os: "Windows" or "Linux"
installer: "EP" (Enterprise) or "UF" (Universal Forwarder)
filename_extension: the package extension for Linux, e.g. "tgz" or "rpm"
version: the target version number, e.g. "7.2.3"

For Windows there is no extension to choose, so filename_extension can be left empty. The following example gets the 7.2.3 Universal Forwarder installer for Linux as a tgz.

curl -X POST "https://xxxxxxxxxx.execute-api.ap-xxxxxxxxx-x.amazonaws.com/xxx/xxxxx" -d "{\"os\": \"Linux\",\"installer\": \"UF\",\"version\": \"7.2.3\",\"filename_extension\": \"tgz\"}"

The following is the execution result: a status code and the download link are returned. (The status code always comes back as 200 ...)
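For reference, the response looks something like this (illustrative only; the file name and the build-hash part of the URL, shown here as xxxxxxxxxx, differ by version):

{
    "statusCode": 200,
    "body": "https://download.splunk.com/products/universalforwarder/releases/7.2.3/linux/splunkforwarder-7.2.3-xxxxxxxxxx-Linux-x86_64.tgz"
}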

6. Summary

Writing this up made me realize how rough the code is. I'll fix it if I get the chance.

Anyway, the download link can now be scraped. Adding a curl statement to the CloudFormation UserData means the link can be fetched while the EC2 instance is being created. I'm glad to be freed from fetching and pasting download links by hand ... Next I have to write the CloudFormation ...
