Scraping with Tor in Python

Caution

Warning: This article does not recommend scraping with Tor.

Scraping itself is generally acceptable, but you may be held liable if the target site's terms of use prohibit it or if you overload the target site's server.
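One way to avoid overloading a server is to enforce a minimum pause between requests. The following is only a minimal sketch: the `Throttle` class, its `min_interval` parameter, and the injectable `clock` and `sleep` arguments are hypothetical names introduced here for illustration, and a suitable interval depends on the target site.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first one

    def wait(self):
        """Sleep just long enough that calls are at least min_interval apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` right before each request keeps the requests spaced out without any further bookkeeping in the scraping code.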

What is Tor

Tor is a technology that anonymizes the route a connection takes. In theory, when a site is accessed through Tor, it is difficult to identify who made the request.

Execution environment

Homebrew 2.2.4
pip 20.0.2
Python 3.7.3

1. Get an IP address

First, let's check the global IP address without Tor. You can get your global IP address from the HTML returned by http://checkip.dyndns.com/, and you can check whether you are using Tor from the HTML returned by https://check.torproject.org/.

The code uses Beautiful Soup, so install it first.

# Install beautifulsoup4 with pip
$ pip install beautifulsoup4
# Verification
$ pip list | grep beautifulsoup4
beautifulsoup4 4.7.1

import urllib.request
from bs4 import BeautifulSoup

# Fetch the URL and return the parsed HTML
def fetch_html(url):
  res = urllib.request.urlopen(url)
  return BeautifulSoup(res, 'html.parser')

# Return the current global IP address
def get_ip_addr():
  html = fetch_html('http://checkip.dyndns.com/')
  # The body text has the form "<label>: <address>"
  return html.body.text.split(': ')[1].strip()

# Return True if the request went through Tor
def check_use_tor():
  html = fetch_html('https://check.torproject.org/')
  # The first class of the h1 element is 'off' when Tor is not in use
  return html.find('h1')['class'][0] != 'off'

print('You are using tor.' if check_use_tor() else 'You are not using tor.')
print('Current IP address is ' + get_ip_addr())

Execution result

You are not using tor.
Current IP address is XXX.XXX.XX.XXX
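Note that `split(': ')[1]` raises an `IndexError` if the page text ever stops containing `': '`. As a slightly more defensive alternative, the address could be pulled out with a regular expression. This is only a sketch: `extract_ipv4` is a hypothetical helper, the sample HTML is made up for illustration, and the pattern only checks the dotted-quad shape, not that each number is 255 or less.

```python
import re

# Matches a dotted-quad string such as 203.0.113.7 (shape only, no range check)
IPV4_PATTERN = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')


def extract_ipv4(text):
    """Return the first IPv4-looking string in text, or None if there is none."""
    match = IPV4_PATTERN.search(text)
    return match.group(0) if match else None


# Hypothetical response body, used only to demonstrate the helper
sample = '<html><body>Current IP Address: 203.0.113.7</body></html>'
print(extract_ipv4(sample))
```

With this helper, `get_ip_addr()` could return `extract_ipv4(html.body.text)` and the caller could handle a `None` result instead of an exception.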

2. Install Tor

If you are on macOS, you can install Tor with Homebrew. Here, brew services start is also used to run it as a daemon.

$ brew install tor
$ brew services start tor
# Verification
$ tor --version
Tor version 0.4.2.6.
$ brew services list | grep tor
tor started your_name /Users/your_name/Library/LaunchAgents/homebrew.mxcl.tor.plist

To stop or restart Tor, run the following commands.

$ brew services stop tor
$ brew services restart tor

Also, although this article does not cover it, the config file is located at /usr/local/etc/tor/torrc.
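Before moving on, it can be handy to confirm from Python that the daemon is actually listening. The sketch below assumes Tor is listening on 127.0.0.1 port 9050 (the proxy address used in the next section); `is_port_open` is a hypothetical helper, and a successful connection only shows that something accepts TCP connections on that port, not that it is Tor.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


if __name__ == '__main__':
    # 9050 is assumed to be the local Tor SOCKS port
    if is_port_open('127.0.0.1', 9050):
        print('Something is listening on port 9050.')
    else:
        print('Nothing is listening on port 9050.')
```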

3. Scraping through Tor

The code uses PySocks, so install it first.

$ pip install PySocks
# Verification
$ pip list | grep PySocks
PySocks 1.7.1

Tor provides a SOCKS5 proxy at socks5://localhost:9050, so add the following to the code from section 1, before any request is made:

import socks, socket

socks.set_default_proxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9050)
socket.socket = socks.socksocket

Execution result

You are using tor.
Current IP address is YY.YYY.YYY.YY
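As a side note, assigning to `socket.socket` changes the socket class for the whole process, so every later connection goes through the proxy. If only some requests should use Tor, the replacement could be limited to a `with` block. The following is a sketch of that idea, not part of PySocks: `use_socket_class` is a hypothetical helper name.

```python
import contextlib
import socket


@contextlib.contextmanager
def use_socket_class(socket_class):
    """Temporarily replace socket.socket, restoring the original on exit."""
    original = socket.socket
    socket.socket = socket_class
    try:
        yield
    finally:
        # Restore the original class even if the body raised an exception
        socket.socket = original
```

With PySocks installed and `socks.set_default_proxy(...)` already called, the calls to `check_use_tor()` and `get_ip_addr()` could be placed inside `with use_socket_class(socks.socksocket):`, leaving the rest of the program on the normal socket class.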

Make sure that the global IP address displayed differs from the one shown when you ran the code in section 1. The IP address seen when using Tor changes at regular intervals.
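To observe the address changing, one option is to poll it until the value differs from the first reading. This is a rough sketch: `wait_for_change` is a hypothetical helper, and the `interval` and `max_checks` defaults are arbitrary assumptions, since how often the address changes is not specified here.

```python
import time


def wait_for_change(get_value, interval=10.0, max_checks=30, sleep=time.sleep):
    """Poll get_value() until its result differs from the first result.

    Returns (old, new) on change, or None if nothing changed within max_checks polls.
    """
    first = get_value()
    for _ in range(max_checks):
        sleep(interval)
        current = get_value()
        if current != first:
            return first, current
    return None
```

Passing the `get_ip_addr` function from section 1 as `get_value` would report the old and new addresses. Keep the interval generous, because every poll is a request to the IP-checking site.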
