This article shows how to scrape a site protected by **Digest authentication** in Python. (The code hardly differs from the Basic authentication case.) For scraping itself and for other authentication methods, the following will be helpful: Introduction to Python Web Scraping Practice, [[Python] Scraping to pages with Basic authentication](https://aga-note.com/python-scraping-basic-auth/)
Note: when scraping, you need to respect the target site's rules and general good manners. See: List of precautions for web scraping
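One such precaution can be checked in code. The following is a minimal sketch, not a complete politeness policy, that uses the standard library's `urllib.robotparser` to ask whether a path may be fetched. The helper name `is_allowed`, the sample rules, and the `example.com` URLs are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt lines permit fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules that are already in memory
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, standing in for a real site's robots.txt
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

print(is_allowed(sample_rules, "http://example.com/private/data.html"))  # False
print(is_allowed(sample_rules, "http://example.com/public/index.html"))  # True
```

Against a live site, `set_url()` followed by `read()` on the parser downloads the real rules instead of calling `parse()`.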
Language: Python 3.7.4
Libraries: requests, requests.auth, bs4, urllib.request
Install the following two with the pip command.
pip install requests
pip install beautifulsoup4
Once the installation is complete, let's put it into practice.
As an example, this article uses the sample page with Digest authentication published by the administrator of the following site: [Let's make an HTTP client (6): Digest authentication](http://x68000.q-e-d.net/~68user/net/http-auth-2.html)
import requests
from requests.auth import HTTPDigestAuth
from bs4 import BeautifulSoup
# 1. Website URL and the Digest authentication username and password
url = 'http://X68000.q-e-d.net/~68user/net/sample/http-auth-digest/secret.html'
username = 'hoge'
password = 'fuga'
# 2. Fetch the URL using Digest authentication
res = requests.get(url, auth=HTTPDigestAuth(username, password))
res.raise_for_status()  # stop here if the request failed (e.g. 401 Unauthorized)
content = res.content
# 3. Parse the HTML
# Whole document
data = BeautifulSoup(content, 'html.parser')
# Get the title
title = data.title.string
# Get the body text (.string returns None when <body> has more than one child)
body = data.body.get_text()
print(title, body)
Next is the case of downloading a file, such as an image or an Excel workbook, directly from a URL protected by Digest authentication. I could not find a public file URL with Digest authentication to test against, so only the method is shown.
import os
import urllib.request
# 1. Website URL and the Digest authentication username and password
url = ******************
username = ******************
password = ******************
# 2. Read the file at the URL using Digest authentication
# Explanation 1
password_manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_manager.add_password(None, url, username, password)
# Explanation 2
authhandler = urllib.request.HTTPDigestAuthHandler(password_manager)
opener = urllib.request.build_opener(authhandler)
# Read the file contents
file_content = opener.open(url).read()
# 3. Save the file next to this script (extension .xlsx because Excel is assumed)
excel_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'file.xlsx')
with open(excel_path, mode="wb") as f:
    f.write(file_content)
print("Saved")
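Explanation 1: Register the information required for Digest authentication (URL, username, password) in a password manager object. Passing None as the realm makes these credentials the default for whatever realm the server announces.
Explanation 2: Create a Digest authentication handler from the password manager and build an opener with it. Opening the URL through this opener answers the server's Digest challenge.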
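The two steps explained above can also be wrapped in small functions. This is only a sketch under assumptions: the names `build_digest_opener` and `save_bytes` and the `example.com` URL are hypothetical and not part of the original code. No request is sent here. The example only shows that credentials registered with realm `None` are found for any realm name, and that saving bytes works as in step 3.

```python
import os
import tempfile
import urllib.request


def build_digest_opener(url, username, password):
    """Return (opener, password_manager) prepared for Digest authentication."""
    password_manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # realm=None registers the credentials as the default for any realm
    password_manager.add_password(None, url, username, password)
    handler = urllib.request.HTTPDigestAuthHandler(password_manager)
    return urllib.request.build_opener(handler), password_manager


def save_bytes(content, directory, filename):
    """Write bytes to directory/filename and return the full path."""
    path = os.path.join(directory, filename)
    with open(path, mode="wb") as f:
        f.write(content)
    return path


# Hypothetical URL and credentials, for illustration only
url = "http://example.com/files/report.xlsx"
opener, manager = build_digest_opener(url, "hoge", "fuga")

# The lookup succeeds even for a realm name that was never registered
credentials = manager.find_user_password("any realm", url)
print(credentials)

# Dummy bytes stand in for downloaded content; a fresh temp directory is used
saved_path = save_bytes(b"dummy", tempfile.mkdtemp(), "file.xlsx")
print(saved_path)
```

With a real URL, `opener.open(url).read()` would return the bytes to pass to `save_bytes`.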
**Manners in scraping** List of precautions for web scraping
**Scraping practice** Introduction to Python Web Scraping Practice, [[Python] Scraping to pages with Basic authentication](https://aga-note.com/python-scraping-basic-auth/)
**Official documents** Authentication (Requests 2.23.0 documentation), official documentation for urllib.request