・ You can scrape data from web pages with simple pagination
・ Data is scraped using Beautiful Soup
・ Basic usage of requests
・ Basic usage of Jupyter Notebook
・ macOS
・ Python 3.x
・ The following modules are installed (the ones used in this article are beautifulsoup4, requests, and pandas)
beautifulsoup4 4.9.3
certifi 2020.11.8
chardet 3.0.4
chromedriver-binary 87.0.4280.20.0
click 7.1.2
cssselect 1.1.0
idna 2.10
isodate 0.6.0
lxml 4.6.2
numpy 1.19.5
pandas 1.2.0
parsel 1.6.0
pip 20.3.3
ppprint 0.1.0
pyparsing 2.4.7
python-dateutil 2.8.1
pytz 2020.5
PyYAML 5.3.1
rdflib 5.0.0
requests 2.25.1
selectorlib 0.16.0
selenium 3.141.0
setuptools 49.2.1
six 1.15.0
soupsieve 2.1
SPARQLWrapper 1.8.5
urllib3 1.26.2
w3lib 1.22.0
If they are not installed, copy and paste the following:
pip install beautifulsoup4 && pip install requests && pip install pandas
We will scrape the following web page: opencodez
from bs4 import BeautifulSoup
import csv
import pandas as pd
from pandas import DataFrame
import requests
import logging
import pdb
#Pass the client's User-Agent header information to the server when calling requests.get
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
article_link = []
article_title = []
article_para = []
article_author = []
article_date = []
def main():
opencodezscraping('https://www.opencodez.com/page', 0)
    data = {'Article_link': article_link, 'Article_Title': article_title, 'Article_para': article_para, 'Article_Author': article_author, 'Article_Date': article_date}
    df = DataFrame(data, columns=['Article_link', 'Article_Title', 'Article_para', 'Article_Author', 'Article_Date'])
    #df.to_csv also writes the header row, so a separate csv writer is not needed
    df.to_csv('./Opencodez_Articles.csv')
def opencodezscraping(webpage, page_number):
next_page = webpage + str(page_number)
response = requests.get(next_page, headers=headers)
    logging.info(f'scraping page {page_number} ...')
soup = BeautifulSoup(response.content, 'html.parser')
soup_title = soup.findAll('h2', {'class': 'title'})
soup_para = soup.findAll('div', {'class': 'post-content image-caption-format-1'})
soup_date = soup.findAll('span', {'class': 'thetime'})
for x in range(len(soup_title)):
article_title.append(soup_title[x].a['title'])
article_link.append(soup_title[x].a['href'])
article_author.append(soup_para[x].a.text.strip())
article_date.append(soup_date[x].text)
article_para.append(soup_para[x].text.strip())
    #Call the function recursively until there are no more pages (status code 404)
    if response.status_code != 404:
        page_number = page_number + 1
        opencodezscraping(webpage, page_number)
#Run main() only when this file is executed directly
if __name__ == '__main__':
logging.basicConfig(level=logging.INFO)
main()
I will explain each function below.
main
Its role is to call the opencodezscraping function and to assemble and output the collected data.
def main():
    #Call the opencodezscraping function
    opencodezscraping('https://www.opencodez.com/page', 0)
    #Data to pass to the DataFrame
    data = {'Article_link': article_link, 'Article_Title': article_title, 'Article_para': article_para, 'Article_Author': article_author, 'Article_Date': article_date}
    df = DataFrame(data, columns=['Article_link', 'Article_Title', 'Article_para', 'Article_Author', 'Article_Date'])
    #Output the data to a CSV file
    df.to_csv('./Opencodez_Articles.csv')
opencodezscraping
def opencodezscraping(webpage, page_number):
#Concatenate url and page number
next_page = webpage + str(page_number)
#Get web page
response = requests.get(next_page, headers=headers)
    logging.info(f'scraping page {page_number} ...')
#Define Beautiful Soup
soup = BeautifulSoup(response.content, 'html.parser')
#Get article title
soup_title = soup.findAll('h2', {'class': 'title'})
#Get article description
soup_para = soup.findAll('div', {'class': 'post-content image-caption-format-1'})
#Get article posting date
soup_date = soup.findAll('span', {'class': 'thetime'})
#Loop by the number of article titles
for x in range(len(soup_title)):
        #Append each value to the global lists defined at the top
article_title.append(soup_title[x].a['title'])
article_link.append(soup_title[x].a['href'])
article_author.append(soup_para[x].a.text.strip())
article_date.append(soup_date[x].text)
article_para.append(soup_para[x].text.strip())
    #Call the function until there are no more pages (status code 404)
    if response.status_code != 404:
        page_number = page_number + 1
        opencodezscraping(webpage, page_number)
Let's check the behavior of requests.get and BeautifulSoup's findAll() method in Jupyter Notebook (the Python interactive console also works). If you don't have Jupyter Notebook, copy and paste the following.
pip install jupyter
To start it, copy and paste the following command; it will open in your web browser.
jupyter notebook
Send a GET request to the web page and confirm that the response comes back normally. In the following case, the status code is 200, so you can confirm that the page was fetched successfully.
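As a minimal sketch (using the same headers defined at the top of the script and page number 0, as in the code above):
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
#Send a GET request for page number 0
response = requests.get('https://www.opencodez.com/page' + str(0), headers=headers)
#200 means the page was fetched successfully, 404 means there is no such page
print(response.status_code)
#Parse the response so the findAll() calls below can be tried out
soup = BeautifulSoup(response.content, 'html.parser')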
#Get article title
soup_title = soup.findAll('h2', {'class': 'title'})
#Get article description
soup_para = soup.findAll('div', {'class': 'post-content image-caption-format-1'})
#Get article posting date
soup_date = soup.findAll('span', {'class': 'thetime'})
Let's check the return value of each of these three findAll() calls.
The following gets the HTML tags that contain the article titles.
From the result, you can see that the title is stored in the title attribute of the a tag, so individual titles can be extracted as follows.
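For example, checking the first element in Jupyter Notebook might look like this (index 0 is only for illustration):
#Inspect the first matched h2 tag
soup_title[0]
#Value of the title attribute of the a tag (the article title)
soup_title[0].a['title']
#Value of the href attribute of the a tag (the article link)
soup_title[0].a['href']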
The following gets the HTML tags that contain the article descriptions.
From the result, you can see that the description is in the p tag. To get a description individually, do as follows. strip() removes leading and trailing whitespace (such as newlines) from the string.
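For example (index 0 again for illustration; the script above takes the text of the whole div, while find('p') is an alternative that narrows it to the p tag):
#Whole text of the post-content div, as used in the script above
soup_para[0].text.strip()
#Only the p tag that holds the description (assuming the div contains one)
soup_para[0].find('p').text.strip()
#The author name comes from the first a tag inside the same div
soup_para[0].a.text.strip()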
The following gets the HTML tags that contain the article posting dates.
From the result, you can see that the posting date is in the span tag with class='thetime'. To get a posting date individually, do as follows.
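For example:
#Inspect the first matched span tag
soup_date[0]
#Only the text, i.e. the posting date
soup_date[0].text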
This time the URL parameter was simple, so I could get each page just by incrementing page_number one at a time and requesting it. Scraping a site with complex URL parameters becomes much harder all at once. I'm currently trying to scrape an e-commerce site and struggling with its complicated URL parameters. I'll write another article when that is done! If you find any mistakes, I'd appreciate it if you could point them out!
Reference article: Web Scraping a Site with Pagination using BeautifulSoup