Selenium + WebDriver (Chrome) + Python | Building an environment for scraping

Background

What I wanted to do and where I started

I wanted to scrape a web page, specifically the HTML as it looks after JavaScript has run.

Starting the investigation

At first I planned to scrape with curl or PHP, but I got stuck when I realized that curl does not return the source as it looks after JavaScript has run.
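To make the problem concrete, here is a minimal sketch of what a plain HTTP client such as curl receives. The page below is a made-up example, not a real site: the div with id "app" is only filled in when a browser runs the script, so anything that parses the raw response finds it empty.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7

# Hypothetical page: <div id="app"> stays empty until a browser runs the script.
RAW_HTML = """
<html><body>
<div id="app"></div>
<script>document.getElementById("app").textContent = "Hello";</script>
</body></html>
"""


class AppTextCollector(HTMLParser):
    """Collects the text found inside <div id="app">."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.inside_app = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "app":
            self.inside_app = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside_app = False

    def handle_data(self, data):
        if self.inside_app:
            self.chunks.append(data)


def app_text(html):
    """Return the text inside <div id="app"> as seen without running any script."""
    parser = AppTextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# The script is never executed here, so the div has no text.
print(repr(app_text(RAW_HTML)))
```

A real browser would show "Hello" in that div, which is why a tool that drives a browser is needed.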

After looking into it, I found the following two candidates.

phantomjs

There was plenty of information on it and it looked usable as is, but I found that development ended in June 2018 and it is no longer supported.

Selenium + WebDriver

When I looked it up, there was plenty of information, including many recent articles, so I decided to try Selenium first.

Environment setup

Requirements

python
pip
chromedriver
selenium

I am using a Mac, which ships with Python, so I will skip installing Python.

Install pip

$ curl -kL https://bootstrap.pypa.io/get-pip.py | python

Execution result

$ curl -kL https://bootstrap.pypa.io/get-pip.py | python
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1841k  100 1841k    0     0  4630k      0 --:--:-- --:--:-- --:--:-- 4649k
DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. pip 21.0 will drop support for Python 2.7 in January 2021. More details about Python 2 support in pip can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support pip 21.0 will remove support for this functionality.
Defaulting to user installation because normal site-packages is not writeable
Collecting pip
  Downloading pip-20.2.3-py2.py3-none-any.whl (1.5 MB)
     |████████████████████████████████| 1.5 MB 4.0 MB/s 
Collecting wheel
  Downloading wheel-0.35.1-py2.py3-none-any.whl (33 kB)
Installing collected packages: pip, wheel
  WARNING: The scripts pip, pip2 and pip2.7 are installed in '/Users/xxx/Library/Python/2.7/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The script wheel is installed in '/Users/xxx/Library/Python/2.7/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed pip-20.2.3 wheel-0.35.1

The warnings in the output say that the install directory is not on PATH, so add it.

$ export PATH="$HOME/Library/Python/2.7/bin:$PATH"
$ echo 'export PATH="$HOME/Library/Python/2.7/bin:$PATH"' >> ~/.bash_profile

Check that the path has been added

$ echo $PATH
/Users/xxx/Library/Python/2.7/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin

$ cat ~/.bash_profile
export PATH="$HOME/Library/Python/2.7/bin:$PATH"
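The same check can be done from Python. The sketch below is only an illustration: the helper name is_on_path is made up here, and the ":" separator assumes macOS or Linux.

```python
import os


def is_on_path(directory, path_value=None, sep=":"):
    """Return True if `directory` is one of the entries of a PATH-style string."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    target = os.path.normpath(os.path.expanduser(directory))
    entries = [
        os.path.normpath(os.path.expanduser(entry))
        for entry in path_value.split(sep)
        if entry
    ]
    return target in entries


# PATH value taken from the output above ("xxx" is the placeholder user name).
example_path = "/Users/xxx/Library/Python/2.7/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin"
print(is_on_path("/Users/xxx/Library/Python/2.7/bin", example_path))  # True
```

Called without path_value, the helper reads the PATH of the current process instead.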

The pip command should now be available, so check it

$ pip -V
pip 20.2.3 from /Users/xxx/Library/Python/2.7/lib/python/site-packages/pip (python 2.7)

Install Chrome driver

First, check the version of Chrome installed on your computer. Mine was version 85.0.4181.101.

The major version is 85, so install the matching chromedriver-binary with the following command

pip install chromedriver-binary==85.*

Execution result

$ pip install chromedriver-binary==85.*
DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. pip 21.0 will drop support for Python 2.7 in January 2021. More details about Python 2 support in pip can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support pip 21.0 will remove support for this functionality.
Defaulting to user installation because normal site-packages is not writeable
Collecting chromedriver-binary==85.*
  Downloading chromedriver-binary-85.0.4183.87.0.tar.gz (3.6 kB)
Building wheels for collected packages: chromedriver-binary
  Building wheel for chromedriver-binary (setup.py) ... done
  Created wheel for chromedriver-binary: filename=chromedriver_binary-85.0.4183.87.0-py2-none-any.whl size=7722067 sha256=901454e21156aef8f8bf4b0e302098747ea378a435c801330ea46d03ed
  Stored in directory: /Users/xxx/Library/Caches/pip/wheels/12/27/b7/69d38bfd65642b45a64e7e97e3160aba20f20be91cd5a
Successfully built chromedriver-binary
Installing collected packages: chromedriver-binary
Successfully installed chromedriver-binary-85.0.4183.87.0
$ 
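The pin used above only fixes the major version, which is why Chrome 85.0.4181.101 ended up with driver 85.0.4183.87. As a small sketch of that rule, the helpers below (hypothetical names, not part of chromedriver-binary) build the pip requirement from a Chrome version string and compare major versions.

```python
def chromedriver_requirement(chrome_version):
    """Build a pip requirement pinning chromedriver-binary to Chrome's major version."""
    major = chrome_version.strip().split(".")[0]
    if not major.isdigit():
        raise ValueError("unexpected Chrome version: %r" % chrome_version)
    return "chromedriver-binary==%s.*" % major


def same_major(chrome_version, driver_version):
    """Return True if both version strings share the same major number."""
    return chrome_version.strip().split(".")[0] == driver_version.strip().split(".")[0]


print(chromedriver_requirement("85.0.4181.101"))        # chromedriver-binary==85.*
print(same_major("85.0.4181.101", "85.0.4183.87.0"))    # True
```

When passing the requirement to pip on the command line, quoting it keeps the shell from treating the asterisk as a wildcard.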

Install Selenium

Command used

pip install selenium

Execution result

$ pip install selenium
DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. pip 21.0 will drop support for Python 2.7 in January 2021. More details about Python 2 support in pip can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support pip 21.0 will remove support for this functionality.
Defaulting to user installation because normal site-packages is not writeable
Collecting selenium
  Downloading selenium-3.141.0-py2.py3-none-any.whl (904 kB)
     |████████████████████████████████| 904 kB 5.2 MB/s 
Collecting urllib3
  Downloading urllib3-1.25.10-py2.py3-none-any.whl (127 kB)
     |████████████████████████████████| 127 kB 10.7 MB/s 
Installing collected packages: urllib3, selenium
Successfully installed selenium-3.141.0 urllib3-1.25.10
$ 
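Before writing the actual script, it can help to confirm that both packages are importable by the Python you are running. This is a minimal sketch; missing_modules is a name invented for this example. Note that the pip package chromedriver-binary is imported as chromedriver_binary.

```python
import importlib


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# An empty list means both packages were found.
print(missing_modules(["selenium", "chromedriver_binary"]))
```

If a name shows up in the list, the package was probably installed for a different Python than the one running the check.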

Now you are ready to go.

Try running with Python

Create a Python file

test.py


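import chromedriver_binary  # adds the bundled chromedriver to PATH on import
from selenium import webdriver

options = webdriver.ChromeOptions()
# options.add_argument('--incognito')  # open an incognito window
# options.add_argument('--headless')   # run without a visible window

print('connect...try...connect...try...')
driver = webdriver.Chrome(options=options)

# Open the page and print the URL the browser ended up on
driver.get('https://qiita.com')
print(driver.current_url)

# Uncomment to close the browser when the script finishes
# driver.quit()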

Run

$ python test.py 

This launches the Chrome browser and opens the page. I'm happy.
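Since the original goal was the HTML after JavaScript has run, the next step is to read it back from the browser. The sketch below uses the driver's page_source attribute; the function names are made up for this example, and main() is left uncalled so nothing starts a browser by accident.

```python
def fetch_rendered_html(driver, url):
    """Open `url` in the given WebDriver and return the current DOM as HTML."""
    driver.get(url)
    return driver.page_source


def main():
    """Hypothetical usage against a real Chrome; requires the setup from this article."""
    import chromedriver_binary  # noqa: F401  (adds the bundled chromedriver to PATH)
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        html = fetch_rendered_html(driver, "https://qiita.com")
        print(len(html))
    finally:
        driver.quit()


# main()  # uncomment to run against a real Chrome
```

page_source is read as soon as get() returns, so content that a page loads later may still be missing and could need an explicit wait.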

To launch in an incognito window, uncomment the following line.

options.add_argument('--incognito')

To run Chrome headless (without a visible window), uncomment the following line.

options.add_argument('--headless')
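Instead of commenting lines in and out, the two switches can be chosen with parameters. This is one possible sketch, not the only way to do it; chrome_arguments and build_driver are names invented for this example.

```python
def chrome_arguments(incognito=False, headless=False):
    """Return the Chrome command-line switches for the selected modes."""
    args = []
    if incognito:
        args.append("--incognito")
    if headless:
        args.append("--headless")
    return args


def build_driver(incognito=False, headless=False):
    """Create a Chrome WebDriver with the selected switches applied."""
    # Imported here so chrome_arguments() can be used without Selenium installed.
    import chromedriver_binary  # noqa: F401  (adds the bundled chromedriver to PATH)
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(incognito=incognito, headless=headless):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)


print(chrome_arguments(incognito=True, headless=True))  # ['--incognito', '--headless']
```

For example, build_driver(headless=True) would start Chrome without a window, assuming the packages from this article are installed.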

From here, I think anyone can start scraping by reading up on Selenium's element selection and XPath.
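As a starting point, here is a minimal sketch of pulling text out of a page with an XPath expression. The helper name and the "//a" expression are only illustrations; find_elements with By.XPATH is the Selenium call being shown, and main() is left uncalled so no browser starts unless you ask for it.

```python
try:
    from selenium.webdriver.common.by import By
    XPATH = By.XPATH
except ImportError:
    # Selenium is not installed; By.XPATH is simply the string "xpath".
    XPATH = "xpath"


def texts_by_xpath(driver, xpath):
    """Return the stripped text of every element matching `xpath`."""
    elements = driver.find_elements(XPATH, xpath)
    return [element.text.strip() for element in elements]


def main():
    """Hypothetical usage: print the text of every link on the page."""
    import chromedriver_binary  # noqa: F401  (adds the bundled chromedriver to PATH)
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get("https://qiita.com")
        for text in texts_by_xpath(driver, "//a"):
            print(text)
    finally:
        driver.quit()


# main()  # uncomment to run against a real Chrome
```

The right XPath depends entirely on the target page, so it is worth testing expressions in the browser's developer tools first.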

Supplement

The Python bundled with my Mac was 2.7, which is old: its support ended on January 1, 2020. I don't usually use Python, so I left it as it is, which is why a DEPRECATION message about 2.7 appears in the output of each command. Please forgive me m(_ _)m

Reference articles

pip installation
https://qiita.com/suzuki_y/items/3261ffa9b67410803443
https://qiita.com/tom-u/items/134e2b8d4e11feea8e12

Selenium setup
https://qiita.com/Chanmoro/items/9a3c86bb465c1cce738a

Summary of how to select elements in Selenium
https://qiita.com/VA_nakatsu/items/0095755dc48ad7e86e2f

Scraping with XPath
https://qiita.com/rllllho/items/cb1187cec0fb17fc650a
