[Python scraping] I tried getting the Google search top 10 with BeautifulSoup & Selenium

Introduction

When I used to do SEO writing, I manually collected the top-10 URLs and titles for each search keyword. Scraping saved me a lot of that work, so this article describes how to do it.

If you write your own blog for income, this makes it easy to see which kinds of titles attract clicks, and since you can jump to each URL straight from Excel, it can greatly reduce your writing workload.

Table of contents

Flow

The whole code

Commentary

  1. Launch Google
  2. Enter the search word in the search box and press Enter
  3. Get the URLs from the search results
  4. Visit each URL and get its title and meta description
  5. Export the collected data to an Excel file

Source code

Flow

  1. Launch Google
  2. Enter the search word in the search box and press Enter
  3. Get the URLs from the search results
  4. Visit each URL and get its title and meta description
  5. Export the collected data to an Excel file

The whole code

import time                                 # needed for time.sleep
from selenium import webdriver              # drives the web browser automatically (python -m pip install selenium)
from selenium.webdriver.common.keys import Keys
import chromedriver_binary                  # puts a matching ChromeDriver on the PATH
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
# adjust the options below if you like
from selenium.webdriver.chrome.options import Options

class Chrome_search:
    def __init__(self):
        self.url = "https://www.google.co.jp/search"
        self.search_word = input("Please enter a search word:")
        self.search_num = int(input("How many results do you want:"))  # number of results to fetch

        self.options = Options()
        #self.options.add_argument('--headless')  ## run without showing a browser window
        #self.options.add_argument('--no-sandbox')  ## removes the sandbox restrictions, but dangerous since any downloaded program could run
        self.options.add_argument('--disable-dev-shm-usage')  # keeps Chrome from crashing when /dev/shm runs out of memory
        
    def search(self):
        driver = webdriver.Chrome(options=self.options)  # Mac users: comment out this line
        driver.get(self.url)
        search = driver.find_element_by_name('q')  # locate the search box (name='q') in the HTML
        search.send_keys(self.search_word)         # type in the search word
        search.submit()                            # run the search
        time.sleep(1)
        # create lists to hold the results
        title_list = []        # stores titles
        url_list = []          # stores URLs
        description_list = []  # stores meta descriptions
        ## get the HTML
        html = driver.page_source.encode('utf-8')
        soup = BeautifulSoup(html, "html.parser")
        # get the search-result titles and links (Google's markup may change over time)
        link_elem01 = soup.select('.yuRUbf > a')
        # keep only the links, stripping any extra query prefix
        if self.search_num <= len(link_elem01):  # if there are fewer URLs than requested, process only what is there
            for i in range(self.search_num):
                url_text = link_elem01[i].get('href').replace('/url?q=', '')
                url_list.append(url_text)  
        elif self.search_num > len(link_elem01):
            for i in range(len(link_elem01)):
                url_text = link_elem01[i].get('href').replace('/url?q=','')
                url_list.append(url_text)
        
        time.sleep(1)
        
        # the list of URLs is now complete
        # fetch the title of each URL in url_list in turn
        for i in range(len(url_list)):
            driver.get(url_list[i])
            ## get the HTML
            html2 = driver.page_source.encode('utf-8')
            ## parse with BeautifulSoup
            soup2 = BeautifulSoup(html2, "html.parser")
            #Get title
            title_list.append(driver.title)
            #Get Description
            try:
                description = driver.find_element_by_xpath("//meta[@name='description']").get_attribute("content")
                description_list.append(description)
            except:
                description_list.append("")
            # go back in the browser
            driver.back()
            time.sleep(0.3)
        # (you could save intermediate results to disk here)
        print(url_list)
        print(title_list)
        print(description_list)

        search_ranking = np.arange(1,len(url_list)+1)
        
        my_list = {"url": url_list,"ranking":search_ranking, "title": title_list,"description":description_list}
        my_file = pd.DataFrame(my_list)
        driver.quit()
        my_file.to_excel(self.search_word+".xlsx",self.search_word,startcol=2,startrow=1)
        df = pd.read_excel(self.search_word+".xlsx")
        return df
    
    
if __name__ == '__main__':
    se = Chrome_search()
    df = se.search()
    print(df.head())

Commentary

I will explain the code.

Loading the libraries

import time                                 # needed for time.sleep
from selenium import webdriver              # drives the web browser automatically (python -m pip install selenium)
from selenium.webdriver.common.keys import Keys
import chromedriver_binary                  # puts a matching ChromeDriver on the PATH
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
# adjust the options below if you like
from selenium.webdriver.chrome.options import Options

The key libraries are as follows.

time: time-control library (provides sleep)
selenium: a library for controlling the browser
beautifulsoup: a scraping / HTML-parsing library
chromedriver_binary: lets Selenium drive Google Chrome

1. Launch Google

The following code launches the browser and opens Google.

driver = webdriver.Chrome(options=self.options)  # Mac users: comment out this line
driver.get(self.url)

driver = webdriver.Chrome(options=self.options) launches the Selenium-controlled browser.

driver.get("url") navigates Chrome to the URL you want to visit.

In this case, self.url points to Google's search page.
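
As a minimal, self-contained sketch (assuming chromedriver_binary has put a matching driver on the PATH, as in the imports above), launching the browser, opening a page, and shutting it down looks like this:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
#options.add_argument('--headless')          # optional: run without a visible window
driver = webdriver.Chrome(options=options)   # start Chrome under Selenium's control
driver.get("https://www.google.co.jp")       # navigate to Google's top page
print(driver.title)                          # e.g. "Google"
driver.quit()                                # close the browser again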

2. Enter the search word in the search box and press Enter

The code that performs a search from the Google search form using Selenium is as follows.

search = driver.find_element_by_name('q')  # locate the search box (name='q') in the HTML
search.send_keys(self.search_word)         # type in the search word
search.submit()                            # run the search
time.sleep(1)                              # pause for 1 second

driver.find_element_by_name() extracts the element with the given name attribute; after find_element_by_ you can also specify class, id, and so on. .send_keys("word") types the given word into the element you located. submit() acts as the Enter key. time.sleep(seconds) pauses execution for the given number of seconds; using it while the browser loads gives the page time to render, so communication-lag errors are less likely.
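
A fixed time.sleep works, but an explicit wait is more robust against slow connections. Below is a small sketch using Selenium's WebDriverWait; note that the "#search" selector for Google's results container is my assumption about the markup, not something taken from the script above:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.google.co.jp/search?q=python")
wait = WebDriverWait(driver, 10)   # wait at most 10 seconds
# block until the results container appears ("#search" is an assumed selector)
results = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#search")))
print(results.tag_name)
driver.quit()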

3. Get the URLs from the search results

After step 2, the screen changes and the search results are displayed.

The code to get the URLs from the search results is as follows.

# create lists to hold the results
title_list = []        # stores titles
url_list = []          # stores URLs
description_list = []  # stores meta descriptions
## get the HTML
html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, "html.parser")
# get the search-result titles and links (Google's markup may change over time)
link_elem01 = soup.select('.yuRUbf > a')
# keep only the links, stripping any extra query prefix
if self.search_num <= len(link_elem01):  # if there are fewer URLs than requested, process only what is there
    for i in range(self.search_num):
        url_text = link_elem01[i].get('href').replace('/url?q=', '')
        url_list.append(url_text)  
elif self.search_num > len(link_elem01):
    for i in range(len(link_elem01)):
        url_text = link_elem01[i].get('href').replace('/url?q=','')
        url_list.append(url_text)

time.sleep(1)

The most important parts of this code are:

## get the HTML
html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, "html.parser")
# get the search-result titles and links (Google's markup may change over time)
link_elem01 = soup.select('.yuRUbf > a')

driver.page_source.encode('utf-8') forces the character encoding to UTF-8. BeautifulSoup(html, "html.parser") is a declaration, boilerplate you can treat like an incantation. soup.select() extracts the elements matching a CSS selector. The code after that, link_elem01[i].get('href'), reads the href attribute of each element returned by soup.select().
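
To see select() and get('href') in isolation, here is a self-contained sketch run on a made-up HTML snippet shaped like Google's result markup (the URLs are placeholders):

from bs4 import BeautifulSoup

# made-up HTML mimicking the .yuRUbf wrapper around each result link
html = """
<div class="yuRUbf"><a href="https://example.com/first">First result</a></div>
<div class="yuRUbf"><a href="https://example.com/second">Second result</a></div>
"""
soup = BeautifulSoup(html, "html.parser")
for a in soup.select('.yuRUbf > a'):
    print(a.get('href'))
# -> https://example.com/first
# -> https://example.com/second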

4. Get title and description

The code to get the title and description is below.

for i in range(len(url_list)):
    driver.get(url_list[i])
    ## get the HTML
    html2 = driver.page_source.encode('utf-8')
    ## parse with BeautifulSoup
    soup2 = BeautifulSoup(html2, "html.parser")
    #Get title
    title_list.append(driver.title)
    #Get Description
    try:
        description = driver.find_element_by_xpath("//meta[@name='description']").get_attribute("content")
        description_list.append(description)
    except:
        description_list.append("")
    # go back in the browser
    driver.back()
    time.sleep(0.3)

Here we visit each URL in the list obtained in step 3 with Selenium. The code only uses BeautifulSoup and Selenium features already covered. driver.back() is the command that makes the browser go back one page.
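
As an aside, if the target pages do not need JavaScript to render, the same title/description extraction can be done without a browser at all. The helper below is a sketch under that assumption using requests; it is not part of the original script, and pages that render content client-side still need the Selenium approach:

import requests
from bs4 import BeautifulSoup

def fetch_title_and_description(url):
    """Hypothetical helper: fetch a page's title and meta description without Selenium."""
    resp = requests.get(url, timeout=10)
    soup = BeautifulSoup(resp.content, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else ""
    meta = soup.find("meta", attrs={"name": "description"})
    description = meta.get("content", "") if meta else ""
    return title, description

print(fetch_title_and_description("https://www.python.org"))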

5. Export the collected data to an Excel file

By step 4 we have lists of URLs, titles, and descriptions. Finally, we use pandas to shape the data and write it to an Excel file. The corresponding code is below.

search_ranking = np.arange(1, len(url_list) + 1)  # ranking 1..N, as defined in the full code

my_list = {"url": url_list, "ranking": search_ranking, "title": title_list, "description": description_list}
my_file = pd.DataFrame(my_list)
driver.quit()
my_file.to_excel(self.search_word + ".xlsx", self.search_word, startcol=2, startrow=1)
df = pd.read_excel(self.search_word + ".xlsx")

The pandas part needs no special explanation; note that the second argument to to_excel() is the sheet name. Finally, shut the browser down with driver.quit().
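
For reference, here is a minimal, self-contained sketch of the same export with dummy data (the file name, sheet name, and values are made up; writing .xlsx requires an engine such as openpyxl):

import pandas as pd

my_list = {"url": ["https://example.com"], "ranking": [1],
           "title": ["Example Domain"], "description": ["An example page"]}
my_file = pd.DataFrame(my_list)
my_file.to_excel("example.xlsx", sheet_name="example", index=False)
print(pd.read_excel("example.xlsx"))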

Source code

The source code is available on GitHub below. Feel free to use it for your writing work. https://github.com/marumaru1019/python_scraping/tree/master
