Scraping from an authenticated site with Python

Introduction

This article shows how to scrape a site protected by **Digest authentication** in Python. (The code is almost the same as for Basic authentication.) For scraping in general and for other authentication methods, the following articles are helpful:
Introduction to Python Web Scraping Practice
[[Python] Scraping to pages with Basic authentication](https://aga-note.com/python-scraping-basic-auth/)
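The Python code looks nearly identical for the two schemes, but what travels over the wire is different. The sketch below is only an illustration using the standard library, not what requests runs internally line for line: it assumes the MD5 algorithm with `qop=auth`, and the function names are made up for this example. The values in the usage lines are the well-known examples from the HTTP authentication RFCs, not credentials for any real site.

```python
import base64
import hashlib


def basic_header(username, password):
    """Basic: the credentials are only base64-encoded, so they are easy to recover."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def md5_hex(text):
    """Hex MD5 digest of a string (helper for the Digest calculation)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username, password, realm, nonce, method, uri, nc, cnonce, qop="auth"):
    """Digest (MD5, qop=auth): only a hash derived from the password is sent."""
    ha1 = md5_hex(f"{username}:{realm}:{password}")   # who you are
    ha2 = md5_hex(f"{method}:{uri}")                  # what you are requesting
    return md5_hex(f"{ha1}:{nonce}:{nc}:{cnonce}:{qop}:{ha2}")


# Example values taken from the RFC examples (assumed here purely for illustration)
print(basic_header("Aladdin", "open sesame"))
print(digest_response(
    username="Mufasa",
    password="Circle Of Life",
    realm="testrealm@host.com",
    nonce="dcd98b7102dd2f0e8b11d0f600bfb0c093",
    method="GET",
    uri="/dir/index.html",
    nc="00000001",
    cnonce="0a4f113b",
))
```

The server supplies `realm` and `nonce` in its 401 response, which is why a Digest client normally needs one extra round trip before the authenticated request succeeds.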

Note: When scraping, you need to respect various rules and etiquette. See "List of precautions for web scraping".

What you need

Language: Python 3.7.4
Libraries: requests, requests.auth, bs4, urllib.request

Library installation

Install the following two packages with the pip command (urllib.request is part of the standard library, so it needs no installation).

pip install requests
pip install beautifulsoup4

Once the installation is complete, move on to the practice.

Practice

As an example, I used the sample web page with Digest authentication provided by the administrator of the following site. [Let's make an HTTP client (6) - Digest authentication -](http://x68000.q-e-d.net/~68user/net/http-auth-2.html)

import requests
from requests.auth import HTTPDigestAuth
from bs4 import BeautifulSoup

# 1. URL of the site, plus the user name and password for Digest authentication
url = 'http://X68000.q-e-d.net/~68user/net/sample/http-auth-digest/secret.html'
username = 'hoge'
password = 'fuga'

# 2. Fetch the URL using Digest authentication
res = requests.get(url, auth=HTTPDigestAuth(username, password))
res.raise_for_status()  # stop here if the request failed (e.g. 401 Unauthorized)
content = res.content

# 3. Parse the HTML
# Whole document
data = BeautifulSoup(content, 'html.parser')
# Get the title
title = data.title.string
# Get the body text
body = data.body.string
print(title, body)
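One thing to watch in step 3: in Beautiful Soup, `.string` returns `None` when a tag has more than one child, so on pages with a richer `<body>` it may be safer to use `get_text()`. The following is a minimal sketch under that assumption. The function name and the HTML snippet are made up for illustration, and the snippet stands in for `res.content` so the example runs without network access.

```python
from bs4 import BeautifulSoup


def extract_title_and_text(html):
    """Return (title, body_text) from an HTML document given as str or bytes."""
    soup = BeautifulSoup(html, "html.parser")
    # Guard against pages that have no <title> or <body> at all
    title = soup.title.get_text(strip=True) if soup.title else None
    body = soup.body.get_text(" ", strip=True) if soup.body else None
    return title, body


# Hypothetical stand-in for res.content from step 2
sample_html = (
    b"<html><head><title>Secret page</title></head>"
    b"<body><h1>Hello</h1><p>Authenticated content</p></body></html>"
)

title, body = extract_title_and_text(sample_html)
print(title, "|", body)
```

Here `<body>` has two children, so `.string` on it would be `None`, while `get_text()` joins the text of both.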

A slightly more advanced example

I will also show how to download files such as images or Excel workbooks directly from a URL protected by Digest authentication. I couldn't find an actual file URL behind Digest authentication, so I'll only present the method.

import os
import urllib.request

# 1. URL of the file, plus the user name and password for Digest authentication
url = '******************'
username = '******************'
password = '******************'

# 2. Read the file at the URL using Digest authentication
# Explanation 1
password_manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_manager.add_password(None, url, username, password)
# Explanation 2
authhandler = urllib.request.HTTPDigestAuthHandler(password_manager)
opener = urllib.request.build_opener(authhandler)
# Read the file contents
file_content = opener.open(url).read()

# 3. Save the file next to this script (extension .xlsx because an Excel file is assumed)
excel_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'file.xlsx')
with open(excel_path, mode="wb") as f:
    f.write(file_content)
    print("Saved")

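Explanation 1: Register the URL, user name, and password required for Digest authentication with the password manager object. Passing None as the realm makes the credentials apply to any realm at that URL.

Explanation 2: Create a Digest authentication handler from the password manager and build an opener with it. The URL is then opened through that opener.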
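The two explanations can also be packaged as small reusable functions. This is only a sketch: the names `build_digest_opener` and `download`, the `timeout` default, and the use of `pathlib` are my own choices for illustration, and like the method above it has not been run against a real file behind Digest authentication.

```python
import urllib.request
from pathlib import Path


def build_digest_opener(url, username, password):
    """Return an opener that answers Digest challenges for `url`."""
    password_manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # realm=None: use these credentials whatever realm the server announces
    password_manager.add_password(None, url, username, password)
    handler = urllib.request.HTTPDigestAuthHandler(password_manager)
    return urllib.request.build_opener(handler)


def download(url, username, password, dest, timeout=30):
    """Fetch `url` with Digest authentication and write the bytes to `dest`."""
    opener = build_digest_opener(url, username, password)
    with opener.open(url, timeout=timeout) as response:
        Path(dest).write_bytes(response.read())
    return Path(dest)
```

A call would look like `download(url, username, password, 'file.xlsx')`, with the same `url`, `username`, and `password` variables as in the listing above. Building the opener in its own function makes it possible to inspect or reuse it without touching the network.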

Reference

**Manners in scraping**
List of precautions for web scraping

**Scraping practice**
Introduction to Python Web Scraping Practice
[[Python] Scraping to pages with Basic authentication](https://aga-note.com/python-scraping-basic-auth/)

**Official documentation**
Authentication (Requests 2.23.0 documentation)
Official documentation for urllib.request
