[Python scraping] Output the URL and title of the site containing a specific keyword to a text file

Overview

This script runs a Google search and, for each hit page, checks whether the title, `og:description`, or any `h1` to `h4` heading contains a specific keyword. If it does, the page's title and URL are written to a text file, separated by a tab. URLs that are already in the text file are skipped.
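The matching rule can be sketched without any network access. The helper name `contains_keyword` below is hypothetical and used only for illustration. It takes the texts already extracted from a page (title, og:description, headings) and reports whether any keyword appears in any of them. This is a minimal sketch of the idea, not the code the script itself runs.

```python
from typing import Iterable, List


def contains_keyword(texts: Iterable[str], keywords: Iterable[str]) -> bool:
    """Return True if any keyword is a substring of any text (case-sensitive)."""
    texts = list(texts)  # Materialize so the texts can be scanned once per keyword
    return any(keyword in text for keyword in keywords for text in texts)


# Texts as they might be collected from a page: title, og:description, headings
page_texts: List[str] = ['Medical news today', '', 'Latest updates']

print(contains_keyword(page_texts, ['Medical']))  # True
print(contains_keyword(page_texts, ['corona']))   # False
```

Note that the comparison is case-sensitive, so `Medical` does not match `medical`.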

Python version

3.8.2

Execution procedure

Create a virtual environment (with venv or similar), then run:

pip install -r requirements.txt
python main.py

Code

requirements.txt

beautifulsoup4==4.9.1
requests==2.24.0

settings.py

settings = {
    # Keywords used for the Google search
    'google_search_keywords': ['Medical', 'corona'],

    # Number of search results to request
    'google_search_num': 10,

    # Keywords to look for within each hit page
    'search_keywords_in_page': ['Medical']
}

main.py

import urllib.parse
import re

import requests
import bs4

from settings import settings
from output import OutputText


def get_ogdesc_from_soup(soup: bs4.BeautifulSoup) -> str:
    """
    Find <meta property="og:description" content="..."> in a BeautifulSoup
    instance and return the value of its content attribute.
    Return an empty string if the tag is not found.
    """
    og_desc = soup.find('meta', attrs={'property': 'og:description', 'content': True})
    if og_desc:
        return og_desc['content']
    return ''


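def get_href_from_soup(tag: bs4.Tag) -> str:
    """Extract the target URL from a Google result link ('' if not found)."""
    match = re.search('(http)(.+)(&sa)', tag.get('href') or '')
    if match is None:
        return ''
    href = match.group()[0:-3]  # Drop the trailing '&sa'
    return urllib.parse.unquote(href)  # Decode percent-encoding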


def do_google_search(keywords: [str], search_num: int) -> [str]:
    """
    Perform a Google search with the given keywords.
    Return a list of the hit URLs.
    """
    # Perform the Google search
    url = 'https://www.google.co.jp/search'
    params = {
        'hl': 'ja',
        'num': search_num,
        'q': ' '.join(keywords)
    }
    response = requests.get(url, params=params)

    # Return the list of hit URLs.
    # The `.kCrYT` selector may need updating whenever Google changes its markup.
    soup = bs4.BeautifulSoup(response.content, 'html.parser')
    links = soup.select('.kCrYT > a')
    return [get_href_from_soup(link) for link in links]


def main():
    output_text = OutputText('output.txt')
    urls = do_google_search(settings['google_search_keywords'], settings['google_search_num'])

    for url in urls:
        # Skip the URL if the text file already contains it
        if url in output_text.get_urls():
            continue

        try:
            response = requests.get(url)
            response.encoding = 'utf-8'
            response.raise_for_status()
        except requests.RequestException:
            # Skip the URL on a connection error or an HTTP error status
            continue

        soup = bs4.BeautifulSoup(response.content, 'html.parser')

        titles = [a.text for a in soup.select('title')]
        desc = get_ogdesc_from_soup(soup)
        h1s = [a.text for a in soup.select('h1')]
        h2s = [a.text for a in soup.select('h2')]
        h3s = [a.text for a in soup.select('h3')]
        h4s = [a.text for a in soup.select('h4')]

        # Skip the page if none of the keywords appear in it
        texts = titles + [desc] + h1s + h2s + h3s + h4s
        keywords = settings['search_keywords_in_page']
        if not any(keyword in text for keyword in keywords for text in texts):
            continue

        # Write the title and URL to the text file
        title = '**No title**' if not titles else titles[0].strip().replace('\n', '')
        output_text.write(title, url)

    # Also write the results in an easy-to-read format
    output_text.output_readable_file()


if __name__ == '__main__':
    main()
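The regular expression in `get_href_from_soup` assumes that Google wraps each result in a link shaped like `/url?q=<target>&sa=...`. As an alternative sketch under that same assumption, the target can be read from the `q` query parameter with the standard library. The function name `extract_target_url` and the sample link are hypothetical and only for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def extract_target_url(href: str) -> str:
    """Return the decoded value of the `q` query parameter, or '' if absent."""
    query = urlsplit(href).query
    values = parse_qs(query).get('q', [])
    return values[0] if values else ''


# Assumed shape of a Google result link
sample = '/url?q=https://example.com/page%3Fid%3D1&sa=U&ved=abc'
print(extract_target_url(sample))  # https://example.com/page?id=1
print(repr(extract_target_url('/search?hl=ja')))  # ''
```

`parse_qs` already percent-decodes the value, so no separate `unquote` call is needed, and a link without a `q` parameter yields an empty string instead of raising an exception.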

output.py

import os

import myutil as u


class OutputText:
    def __init__(self, file_path):
        self.file_path = file_path

        # Create an empty file if it does not exist yet
        if not os.path.isfile(file_path):
            with open(self.file_path, 'w', encoding='utf-8'):
                pass

    def write(self, title, url):
        with open(self.file_path, mode='a', encoding='utf-8') as f:
            u.write_with_tab(f, title, url)
            f.write('\n')

    def get_urls(self):
        lines = self.get_lines()
        return [self.get_url(line) for line in lines]

    def output_readable_file(self):
        file = self.file_path.replace('.txt', '_readable.txt')
        with open(file, mode='w', encoding='utf-8') as f:
            lines = self.get_lines()
            for line in lines:
                f.write(self.get_title(line) + '\n' + self.get_url(line) + '\n')
                f.write('\n------------------------------\n\n')

    def get_lines(self):
        with open(self.file_path, mode='r', encoding='utf-8') as f:
            text = f.read()
            lines = text.strip().split('\n')
            return lines

    def get_title(self, line):
        texts_in_line = line.split('\t')
        return texts_in_line[0] if len(texts_in_line) >= 1 else ''

    def get_url(self, line):
        texts_in_line = line.split('\t')
        return texts_in_line[1] if len(texts_in_line) >= 2 else ''
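`OutputText` stores one `title<TAB>url` pair per line, and `get_urls()` rereads the file to build a list on every call. For larger files, one possible variation is to load the URL column into a set once and test membership against it. The sketch below uses a hypothetical helper `load_known_urls` and a temporary sample file; it is not part of the original script.

```python
import os
import tempfile


def load_known_urls(file_path: str) -> set:
    """Collect the URL column (second tab-separated field) of every line."""
    urls = set()
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            fields = line.rstrip('\n').split('\t')
            if len(fields) >= 2:
                urls.add(fields[1])
    return urls


# Build a small sample file in the same "title<TAB>url" format
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'output.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('Example A\thttps://example.com/a\n')
        f.write('Example B\thttps://example.com/b\n')

    known = load_known_urls(path)
    print('https://example.com/a' in known)  # True
    print('https://example.com/c' in known)  # False
```

If this approach were used in `main()`, each newly written URL would also have to be added to the set so that duplicates within a single run are still skipped.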

myutil.py

def write_with_tab(file, *strings):
    """
    Write the strings to the file, separated by tabs.
    """
    for i, string in enumerate(strings):
        file.write(string)
        if i != len(strings) - 1:  # Not the last item
            file.write('\t')
    return file
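The loop in `write_with_tab` produces the same text as joining the strings with a tab. A shorter equivalent, shown here under the hypothetical name `write_with_tab_join` and tested against an in-memory buffer, might look like this:

```python
import io


def write_with_tab_join(file, *strings):
    """Write the strings to the file separated by tabs, using str.join."""
    file.write('\t'.join(strings))
    return file


buffer = io.StringIO()
write_with_tab_join(buffer, 'Example title', 'https://example.com/')
print(repr(buffer.getvalue()))  # 'Example title\thttps://example.com/'
```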
