・ You can scrape data from web pages with simple pagination
・ Data is scraped using Beautiful Soup
・ Basic usage of requests
・ Basic usage of Jupyter Notebook
・ macOS
・ Python 3.x
・ The following modules are installed (the ones used in this article are beautifulsoup4, requests, and pandas)
beautifulsoup4 4.9.3
certifi 2020.11.8
chardet 3.0.4
chromedriver-binary 87.0.4280.20.0
click 7.1.2
cssselect 1.1.0
idna 2.10
isodate 0.6.0
lxml 4.6.2
numpy 1.19.5
pandas 1.2.0
parsel 1.6.0
pip 20.3.3
ppprint 0.1.0
pyparsing 2.4.7
python-dateutil 2.8.1
pytz 2020.5
PyYAML 5.3.1
rdflib 5.0.0
requests 2.25.1
selectorlib 0.16.0
selenium 3.141.0
setuptools 49.2.1
six 1.15.0
soupsieve 2.1
SPARQLWrapper 1.8.5
urllib3 1.26.2
w3lib 1.22.0
If they are not installed, copy and paste the following:
pip install beautifulsoup4 && pip install requests && pip install pandas
We will scrape the following web page: opencodez
from bs4 import BeautifulSoup
import csv
import pandas as pd
from pandas import DataFrame
import requests
import logging
import pdb
#Pass the client's User-Agent header information to the server when calling requests.get
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
article_link = []
article_title = []
article_para = []
article_author = []
article_date = []
def main():
opencodezscraping('https://www.opencodez.com/page', 0)
    data = {'Article_link': article_link, 'Article_Title': article_title, 'Article_para': article_para, 'Article_Author': article_author, 'Article_Date': article_date}
    df = DataFrame(data, columns=['Article_link', 'Article_Title', 'Article_para', 'Article_Author', 'Article_Date'])
    #df.to_csv also writes the header row, so a separate csv writer is not needed
    df.to_csv('./Opencodez_Articles.csv')
def opencodezscraping(webpage, page_number):
next_page = webpage + str(page_number)
response = requests.get(next_page, headers=headers)
    logging.info(f'scraping page {page_number} ...')
soup = BeautifulSoup(response.content, 'html.parser')
soup_title = soup.findAll('h2', {'class': 'title'})
soup_para = soup.findAll('div', {'class': 'post-content image-caption-format-1'})
soup_date = soup.findAll('span', {'class': 'thetime'})
for x in range(len(soup_title)):
article_title.append(soup_title[x].a['title'])
article_link.append(soup_title[x].a['href'])
article_author.append(soup_para[x].a.text.strip())
article_date.append(soup_date[x].text)
article_para.append(soup_para[x].text.strip())
    #Call the function recursively until there are no more pages (status code 404)
    if response.status_code != 404:
        page_number = page_number + 1
        opencodezscraping(webpage, page_number)
#Run main() only when this file is executed directly
if __name__ == '__main__':
logging.basicConfig(level=logging.INFO)
main()
I will explain each function below.
main
Its role is to call the opencodezscraping function and to assemble and output the collected data.
def main():
    #Call the opencodezscraping function
    opencodezscraping('https://www.opencodez.com/page', 0)
    #Data to pass to the DataFrame
    data = {'Article_link': article_link, 'Article_Title': article_title, 'Article_para': article_para, 'Article_Author': article_author, 'Article_Date': article_date}
    df = DataFrame(data, columns=['Article_link', 'Article_Title', 'Article_para', 'Article_Author', 'Article_Date'])
    #Output the data to a CSV file
    df.to_csv('./Opencodez_Articles.csv')
opencodezscraping
def opencodezscraping(webpage, page_number):
#Concatenate url and page number
next_page = webpage + str(page_number)
#Get web page
response = requests.get(next_page, headers=headers)
    logging.info(f'scraping page {page_number} ...')
#Define Beautiful Soup
soup = BeautifulSoup(response.content, 'html.parser')
#Get article title
soup_title = soup.findAll('h2', {'class': 'title'})
#Get article description
soup_para = soup.findAll('div', {'class': 'post-content image-caption-format-1'})
#Get article posting date
soup_date = soup.findAll('span', {'class': 'thetime'})
#Loop by the number of article titles
for x in range(len(soup_title)):
        #Append each value to the global lists defined at the top
article_title.append(soup_title[x].a['title'])
article_link.append(soup_title[x].a['href'])
article_author.append(soup_para[x].a.text.strip())
article_date.append(soup_date[x].text)
article_para.append(soup_para[x].text.strip())
    #Call the function until there are no more pages (status code 404)
    if response.status_code != 404:
        page_number = page_number + 1
        opencodezscraping(webpage, page_number)
Let's check the behavior of requests.get and BeautifulSoup's findAll() method in Jupyter Notebook (the Python interactive console also works). If you don't have Jupyter Notebook, copy and paste the following.
pip install jupyter
To start it, copy and paste the following command; it will open in your web browser.
jupyter notebook
Send a GET request to the web page and confirm that the response comes back normally. In the following case, the status code is 200, so you can confirm that the page was fetched successfully.
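As a minimal sketch (using the same headers defined at the top of the script and page number 0, as in the code above):
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
#Send a GET request for page number 0
response = requests.get('https://www.opencodez.com/page' + str(0), headers=headers)
#200 means the page was fetched successfully, 404 means there is no such page
print(response.status_code)
#Parse the response so the findAll() calls below can be tried out
soup = BeautifulSoup(response.content, 'html.parser')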
#Get article title
soup_title = soup.findAll('h2', {'class': 'title'})
#Get article description
soup_para = soup.findAll('div', {'class': 'post-content image-caption-format-1'})
#Get article posting date
soup_date = soup.findAll('span', {'class': 'thetime'})
Let's check the return value of each of these three findAll() calls.
The following gets the HTML tags that contain the article titles.
From the result, you can see that the title is stored in the title attribute of the a tag, so individual titles can be extracted as follows.
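For example, checking the first element in Jupyter Notebook might look like this (index 0 is only for illustration):
#Inspect the first matched h2 tag
soup_title[0]
#Value of the title attribute of the a tag (the article title)
soup_title[0].a['title']
#Value of the href attribute of the a tag (the article link)
soup_title[0].a['href']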
The following gets the HTML tags that contain the article descriptions.
From the result, you can see that the description is in the p tag. To get a description individually, do as follows. strip() removes leading and trailing whitespace (such as newlines) from the string.
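For example (index 0 again for illustration; the script above takes the text of the whole div, while find('p') is an alternative that narrows it to the p tag):
#Whole text of the post-content div, as used in the script above
soup_para[0].text.strip()
#Only the p tag that holds the description (assuming the div contains one)
soup_para[0].find('p').text.strip()
#The author name comes from the first a tag inside the same div
soup_para[0].a.text.strip()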
The following gets the HTML tags that contain the article posting dates.
From the result, you can see that the posting date is in the span tag with class='thetime'. To get a posting date individually, do as follows.
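For example:
#Inspect the first matched span tag
soup_date[0]
#Only the text, i.e. the posting date
soup_date[0].text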
This time the URL parameter was simple, so I could get each page just by incrementing page_number one at a time and requesting it. Scraping a site with complex URL parameters becomes much harder all at once. I'm currently trying to scrape an e-commerce site and struggling with its complicated URL parameters. I'll write another article when that is done! If you find any mistakes, I'd appreciate it if you could point them out!
Reference article: Web Scraping a Site with Pagination using BeautifulSoup