Scraping an authenticated site with Python

Introduction

I would like to show you how to scrape a site protected by **Digest authentication** in Python. (The code hardly differs from the Basic authentication case, though.) For scraping itself and for other kinds of authentication, the following articles are helpful: Introduction to Python Web Scraping Practice and [[Python] Scraping to pages with Basic authentication](https://aga-note.com/python-scraping-basic-auth/)

Notes: When scraping, you need to follow various rules and manners. See: List of precautions for web scraping
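One of those rules can be checked in code. As a rough sketch, the standard library's urllib.robotparser can tell you whether a path is allowed by a site's robots.txt. The robots.txt content, the user agent name my-scraper, and the example.com URLs below are made up for illustration; against a real site you would call set_url() and read() instead of parse().

```python
import urllib.robotparser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

def build_parser(robots_txt):
    """Parse robots.txt text and return a ready-to-query parser."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
# "my-scraper" is an assumed user agent name.
allowed = parser.can_fetch("my-scraper", "http://example.com/public/page.html")
blocked = parser.can_fetch("my-scraper", "http://example.com/private/page.html")
delay = parser.crawl_delay("my-scraper")  # seconds to wait between requests, if declared
print(allowed, blocked, delay)
```

If a crawl delay is declared, sleeping for that many seconds between requests with time.sleep() is a simple way to respect it.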

What do you need?

Language: Python 3.7.4
Libraries: requests, requests.auth, bs4, urllib.request

Library installation

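Install the following two packages with the pip command. (requests.auth is included in requests, and urllib.request is part of the standard library, so neither needs a separate install.)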

pip install requests
pip install beautifulsoup4
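If you want to confirm that both packages are visible to your interpreter, something like the following should work. This is only a sketch; the helper name missing_packages is my own. Note that beautifulsoup4 is imported under the name bs4.

```python
import importlib.util

def missing_packages(names):
    """Return the import names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# An empty list means both packages can be imported.
print(missing_packages(["requests", "bs4"]))
```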

Once the installation is complete, let's move on to practice.

Practice

This time, as an example, I used the sample web page with Digest authentication created by the administrator of the following site: [Let's make an HTTP client (6) - Digest authentication -](http://x68000.q-e-d.net/~68user/net/http-auth-2.html)

import requests
from requests.auth import HTTPDigestAuth
from bs4 import BeautifulSoup

# 1. Website URL and Digest authentication username and password
url = 'http://X68000.q-e-d.net/~68user/net/sample/http-auth-digest/secret.html'
username = 'hoge'
password = 'fuga'

# 2. Fetch the URL using Digest authentication
res = requests.get(url, auth=HTTPDigestAuth(username, password))
res.raise_for_status()  # stop here if authentication or the request failed
content = res.content

# 3. Extract data from the HTML
# Parse the whole document
data = BeautifulSoup(content, 'html.parser')
# Get the title
title = data.title.string
# Get the body text (.string is None when <body> has more than one child)
body = data.body.get_text()
print(title, body)
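For reference, HTTPDigestAuth does more than attach the password. The server first answers with a challenge (a realm and a nonce), and the client replies with a hash instead of the password itself. Below is a minimal sketch of that hash for the MD5, qop=auth case described in RFC 2617. The function names are my own, and the values are the example from the RFC, not taken from the sample site, which may use different parameters.

```python
import hashlib

def md5_hex(text):
    """Return the hexadecimal MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def digest_response(username, realm, password, method, uri,
                    nonce, nc, cnonce, qop="auth"):
    """Compute the Digest 'response' value for the MD5 / qop=auth case."""
    ha1 = md5_hex(f"{username}:{realm}:{password}")  # who is asking
    ha2 = md5_hex(f"{method}:{uri}")                 # what is being requested
    return md5_hex(f"{ha1}:{nonce}:{nc}:{cnonce}:{qop}:{ha2}")

# Example values from RFC 2617, section 3.5.
response = digest_response(
    username="Mufasa", realm="testrealm@host.com", password="Circle Of Life",
    method="GET", uri="/dir/index.html",
    nonce="dcd98b7102dd2f0e8b11d0f600bfb0c093", nc="00000001", cnonce="0a4f113b",
)
print(response)
```

With Basic authentication the username and password are simply Base64-encoded and sent on every request, which is why Digest is the safer of the two over plain HTTP.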

A slightly more advanced example

I will also show how to download files such as images or Excel workbooks directly from a URL protected by Digest authentication. I couldn't find a real file URL with Digest authentication to test against, so I'll only list the method.

import os
import urllib.request

# 1. Website URL and Digest authentication username and password
url = ******************
username = ******************
password = ******************

# 2. Read the file at the URL using Digest authentication
# Explanation 1
password_manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_manager.add_password(None, url, username, password)
# Explanation 2
authhandler = urllib.request.HTTPDigestAuthHandler(password_manager)
opener = urllib.request.build_opener(authhandler)
# Read the file contents
file_content = opener.open(url).read()

# 3. Save the file next to this script (extension .xlsx because Excel is assumed)
excel_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'file.xlsx')
with open(excel_path, mode="wb") as f:
    f.write(file_content)
    print("Saved")
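Note that opener.open(url).read() loads the whole file into memory. For larger files, one option is to copy the response to disk in chunks. In the sketch below, the helper name save_stream is my own, and an in-memory io.BytesIO stands in for the real response so that it runs without a server; in the script above you would pass opener.open(url) instead.

```python
import io
import os
import shutil
import tempfile

def save_stream(source, dest_path, chunk_size=64 * 1024):
    """Copy a file-like object to dest_path in chunks and return the saved size."""
    with open(dest_path, mode="wb") as f:
        shutil.copyfileobj(source, f, chunk_size)
    return os.path.getsize(dest_path)

# Stand-in for opener.open(url): any object with a read() method works.
fake_response = io.BytesIO(b"dummy file content")
dest = os.path.join(tempfile.mkdtemp(), "file.xlsx")
size = save_stream(fake_response, dest)
print("Saved", size, "bytes")
```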

Explanation 1: Register the information required for Digest authentication in the password manager object.

HTTPPasswordMgrWithDefaultRealm(): Creates the password manager object
add_password: Method that registers the realm (None here, meaning the default realm), URL, username, and password in the manager
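To see what was registered, you can ask the manager which credentials it would use for a given URL. This is a small sketch with a made-up example.com URL and credentials. Because the realm passed to add_password is None, the entry acts as the default for any realm, and it also covers URLs below the registered one.

```python
import urllib.request

password_manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# Hypothetical URL and credentials, for illustration only.
password_manager.add_password(None, "http://example.com/files/", "user", "secret")

# A URL under the registered one, looked up with an arbitrary realm name.
found = password_manager.find_user_password(
    "Some Realm", "http://example.com/files/report.xlsx")
# A different host has no registered credentials.
not_found = password_manager.find_user_password(
    "Some Realm", "http://other.example.org/")
print(found, not_found)
```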

Explanation 2: Open the URL with Digest authentication.

HTTPDigestAuthHandler: Creates a handler that performs Digest authentication using the password manager
build_opener: Creates an opener that uses the handler to open the authenticated URL

If you want to know more details, please see the reference sites.
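Since the introduction mentions that Basic authentication works almost the same way, here is a sketch of how the opener could be built for either scheme with urllib.request. The helper build_auth_opener and its scheme parameter are hypothetical names of mine, and the URL and credentials are made up. The only difference between the two cases is which handler class receives the password manager.

```python
import urllib.request

def build_auth_opener(url, username, password, scheme="digest"):
    """Build an opener that answers Digest or Basic authentication challenges."""
    password_manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_manager.add_password(None, url, username, password)
    handler_classes = {
        "digest": urllib.request.HTTPDigestAuthHandler,
        "basic": urllib.request.HTTPBasicAuthHandler,
    }
    handler = handler_classes[scheme](password_manager)
    return urllib.request.build_opener(handler)

# No request is sent here; we only check which handler the opener contains.
opener = build_auth_opener("http://example.com/files/", "user", "secret", scheme="basic")
handler_names = [type(h).__name__ for h in opener.handlers]
print("HTTPBasicAuthHandler" in handler_names)
```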

References

**Manners in scraping**: List of precautions for web scraping

**Scraping practice**: Introduction to Python Web Scraping Practice; [[Python] Scraping to pages with Basic authentication](https://aga-note.com/python-scraping-basic-auth/)

**Official documents**: Requests 2.23.0 documentation: Authentication; official documentation for urllib.request