[PYTHON] Scraping pages with pagination using Beautiful Soup

What you can learn from this article

- You can scrape data from web pages with simple pagination.
- Data can be scraped with Beautiful Soup.
- Basic usage of requests.
- Basic usage of Jupyter Notebook.

Prerequisite environment

- macOS
- Python 3.x
- The following modules are installed (only beautifulsoup4, requests, and pandas are needed for this article)

beautifulsoup4 4.9.3
certifi 2020.11.8
chardet 3.0.4
chromedriver-binary 87.0.4280.20.0
click 7.1.2
cssselect 1.1.0
idna 2.10
isodate 0.6.0
lxml 4.6.2
numpy 1.19.5
pandas 1.2.0
parsel 1.6.0
pip 20.3.3
ppprint 0.1.0
pyparsing 2.4.7
python-dateutil 2.8.1
pytz 2020.5
PyYAML 5.3.1
rdflib 5.0.0
requests 2.25.1
selectorlib 0.16.0
selenium 3.141.0
setuptools 49.2.1
six 1.15.0
soupsieve 2.1
SPARQLWrapper 1.8.5
urllib3 1.26.2
w3lib 1.22.0

If they are not installed, copy and paste the following:


pip install beautifulsoup4 && pip install requests && pip install pandas

We will scrape the following web page: opencodez

Full code

from bs4 import BeautifulSoup
import csv
import pandas as pd
from pandas import DataFrame
import requests
import logging
import pdb

# Header information (User-Agent) passed to the server with each requests.get call.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
article_link = []
article_title = []
article_para = []
article_author = []
article_date = []



def main():
    opencodezscraping('https://www.opencodez.com/page', 0)
    data = {'Article_link': article_link, 'Article_Title': article_title, 'Article_para': article_para, 'Article_Author': article_author, 'Article_Date': article_date}
    df = DataFrame(data, columns=['Article_link', 'Article_Title', 'Article_para', 'Article_Author', 'Article_Date'])
    df.to_csv('./Opencodez_Articles.csv')



# Create the CSV file with a header row (main() later overwrites this file via df.to_csv).
with open('Opencodez_Articles.csv', 'w') as csv_file:
    fieldnames = ['Link', 'Title', 'Para', 'Author', 'Date']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    


def opencodezscraping(webpage, page_number):
    next_page = webpage + str(page_number)
    response = requests.get(next_page, headers=headers)
    logging.info(f'scraping page {page_number} ...')
    soup = BeautifulSoup(response.content, 'html.parser')
    soup_title = soup.findAll('h2', {'class': 'title'})
    soup_para = soup.findAll('div', {'class': 'post-content image-caption-format-1'})
    soup_date = soup.findAll('span', {'class': 'thetime'})
    
    for x in range(len(soup_title)):
        article_title.append(soup_title[x].a['title'])
        article_link.append(soup_title[x].a['href'])
        article_author.append(soup_para[x].a.text.strip())
        article_date.append(soup_date[x].text)
        article_para.append(soup_para[x].text.strip())
        
    # Keep calling the function until there are no more pages (the request returns 404).
    if response.status_code != 404:
        page_number = page_number + 1
        opencodezscraping(webpage, page_number)

# If this file is run as the main script, execute main().
if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    main()



I will explain each function one by one.

main

Its main role is to call the scraping function and build the data objects that are written out.


def main():
    # Call the opencodezscraping function
    opencodezscraping('https://www.opencodez.com/page', 0)
    # Data to pass to the DataFrame
    data = {'Article_link': article_link, 'Article_Title': article_title, 'Article_para': article_para, 'Article_Author': article_author, 'Article_Date': article_date}
    df = DataFrame(data, columns=['Article_link', 'Article_Title', 'Article_para', 'Article_Author', 'Article_Date'])
    # Write the collected data out to CSV
    df.to_csv('./Opencodez_Articles.csv')

opencodezscraping


 
def opencodezscraping(webpage, page_number):
    #Concatenate url and page number
    next_page = webpage + str(page_number)
    # Get the web page
    response = requests.get(next_page, headers=headers)
    logging.info(f'scraping page {page_number} ...')
    #Define Beautiful Soup
    soup = BeautifulSoup(response.content, 'html.parser')  

    #Get article title
    soup_title = soup.findAll('h2', {'class': 'title'})
    # Get the article description
    soup_para = soup.findAll('div', {'class': 'post-content image-caption-format-1'})
    #Get article posting date
    soup_date = soup.findAll('span', {'class': 'thetime'})
    #Loop by the number of article titles
    for x in range(len(soup_title)):
        # Append each value to the global lists defined at the top
        article_title.append(soup_title[x].a['title'])
        article_link.append(soup_title[x].a['href'])
        article_author.append(soup_para[x].a.text.strip())
        article_date.append(soup_date[x].text)
        article_para.append(soup_para[x].text.strip())

    # Keep calling the function until there are no more pages (the request returns 404).
    if response.status_code != 404:
        page_number = page_number + 1
        opencodezscraping(webpage, page_number)
      

Check the behavior with Jupyter Notebook

Let's check requests.get and the soup object's findAll() method in Jupyter Notebook (the Python interactive console also works). If you don't have Jupyter Notebook, copy and paste the following:

pip install jupyter notebook

To start it, copy and paste the following command; Jupyter will open in your web browser.

jupyter notebook

Confirm requests.get

Send a GET request to the web page and confirm that a response comes back normally. In the case below, the status code is 200, so the page was fetched successfully.

(Screenshot: the notebook cell shows requests.get returning status code 200.)
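A minimal check in a notebook cell might look like the following sketch; here I request the top page of the site, and the headers dictionary is the same one defined in the script above.

import requests

# Same User-Agent header as in the script above
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}

response = requests.get('https://www.opencodez.com/', headers=headers)
print(response.status_code)  # 200 means the page was fetched successfully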

Check findAll()


#Get article title
soup_title = soup.findAll('h2', {'class': 'title'})
#Get article description
soup_para = soup.findAll('div', {'class': 'post-content image-caption-format-1'})
#Get article posting date
soup_date = soup.findAll('span', {'class': 'thetime'}) 

Let's check the return value of each of these three findAll calls.

Article title

Here, the h2 tags that contain the article title information are retrieved; findAll returns a list of every matching tag.

From the result, you can see that the title is stored in the title attribute of the a tag, so each title can be extracted from that attribute, as sketched below.
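A minimal sketch of that extraction, assuming soup is built with html.parser from the response fetched above:

from bs4 import BeautifulSoup

# Build the soup from the response fetched in the previous cell
soup = BeautifulSoup(response.content, 'html.parser')
soup_title = soup.findAll('h2', {'class': 'title'})
print(soup_title[0].a['title'])  # title of the first article on the page
print(soup_title[0].a['href'])   # link (href) of the first article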

Article description

Here, the div tags that contain the article description are retrieved.

From the result, you can see that the article description sits inside the p tag. To get each description individually, pull the text out of that tag, as sketched below; strip() removes the leading and trailing whitespace from the string.
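A minimal sketch, assuming the same soup object as above; here .p accesses the first p tag inside each post-content div.

soup_para = soup.findAll('div', {'class': 'post-content image-caption-format-1'})
print(soup_para[0].p.text.strip())  # description of the first article, surrounding whitespace removed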

Article posting date

Here, the span tags that contain the article posting date are retrieved.

From the result, you can see that the posting date is in the span tag with class='thetime'. Each date can be read from that tag's text, as sketched below.
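A minimal sketch, again assuming the same soup object:

soup_date = soup.findAll('span', {'class': 'thetime'})
print(soup_date[0].text)  # posting date of the first article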

Summary

This time the URL parameter was simple: by incrementing page_number one at a time and requesting each page, I was able to collect the page data. Scraping a site with complex URL parameters becomes much harder all at once. I am currently trying to scrape an e-commerce site and struggling with its complicated URLs; I will write another article when it is done! If there are any mistakes, I would appreciate it if you could point them out!

Reference article: Web Scraping a Site with Pagination using BeautifulSoup
