[PYTHON] How to scrape pages that are “Access Denied” in Selenium + Headless Chrome

Introduction

While scraping with Selenium + headless Chrome, I came across a site that throws a NoSuchElementException as soon as I enable headless mode, even though the same code retrieves the information just fine in headed mode. There were few articles in Japanese about a workaround, so I'm posting one here.

Symptoms

- Scraping works in headed mode.
- A NoSuchElementException occurs as soon as the headless option is added.

Debugging

Investigating the cause

Since the element apparently couldn't be found, I checked the HTML the driver actually received with driver.page_source.

scraping.py


print(driver.page_source)  # dump the HTML the driver actually received

The returned HTML contains the words "Access Denied", so it seems that access from headless Chrome is being denied.

<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>

You don't have permission to access "http://www.xxxxxxx/" on this server.<p>
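For reference, a minimal, self-contained sketch of this check (without the user-agent fix described below), assuming Selenium 4 and a chromedriver on PATH; the URL is a placeholder, not the actual site:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--no-sandbox')
options.add_argument('--headless')
options.add_argument('--disable-gpu')

driver = webdriver.Chrome(options=options)
try:
    driver.get('https://www.example.com/')  # placeholder URL, not the actual site
    print(driver.page_source)               # in headless mode this prints the "Access Denied" page
finally:
    driver.quit()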

Countermeasures

After some digging, I found that ChromeOptions accepts a user-agent argument that makes headless Chrome look like a regular browser. Adding it to the chromedriver options lets you retrieve the element without any problems.

scraping.py


options = webdriver.ChromeOptions()
options.binary_location = '/usr/bin/google-chrome'
options.add_argument('--no-sandbox')
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--lang=ja-JP')
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36')  # added: spoof a regular browser user agent
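A minimal, self-contained sketch of how these options might be wired up and verified, again assuming Selenium 4 and a chromedriver on PATH; the URL and the h1 lookup are placeholders, and the other flags above are omitted for brevity:

from selenium import webdriver
from selenium.webdriver.common.by import By

ua = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36')

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument(f'user-agent={ua}')  # the key line: present a regular browser user agent

driver = webdriver.Chrome(options=options)
try:
    driver.get('https://www.example.com/')            # placeholder URL, not the actual site
    element = driver.find_element(By.TAG_NAME, 'h1')  # placeholder lookup; no longer raises NoSuchElementException
    print(element.text)
finally:
    driver.quit()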

           

That's all.
