[PYTHON] How to crawl pages that scroll infinitely

What is infinite scrolling?

You can see it on Facebook and Twitter timelines: when you scroll to the bottom of the page, new content is loaded automatically.

Motivation

I decided to crawl infinite scroll pages because I needed to pull in past tweets from Twitter for school. You might say that Twitter has an official API. Unfortunately, the official Twitter API is not very generous: it is designed so that **you cannot get tweets older than about a week**. In other words, if you want older tweets, you have to crawl them yourself. And since Twitter search results are displayed with **infinite scroll**, you have to crawl a page that scrolls infinitely.

Why it's difficult to crawl infinite scrolling

A crawler basically works as follows (see the sketch after the list):

  1. Fetch the HTML response from a given URL and process it
  2. Find further URLs to crawl within that response
  3. Repeat steps 1-2 with the new URLs
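
As a concrete illustration, here is a minimal sketch of that loop using the requests library. The starting URL and the link-extraction regex are placeholders for whatever site you actually target.

# Minimal sketch of the basic crawl loop (start_url is a placeholder).
import re
import requests

def crawl(start_url, max_pages=10):
    to_visit = [start_url]
    seen = set()
    while to_visit and len(seen) < max_pages:
        url = to_visit.pop()
        if url in seen:
            continue
        seen.add(url)
        html = requests.get(url).text                              # 1. fetch the HTML response
        # ... process the HTML here ...
        links = re.findall(r'href="(https?://[^"]+)"', html)       # 2. find further URLs in the response
        to_visit.extend(links)                                     # 3. repeat with the new URLs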

This is how a crawler fetches large amounts of data from the web. The problem with crawling an infinite scroll page is that, unlike pages with conventional paging (links such as "page 1", "page 2", "next page" below the search results), **there is no link to the next set of results anywhere in the page's HTML**. This means existing crawler frameworks (such as Scrapy for Python) cannot handle it out of the box. In this article, partly as a memo to myself, I will introduce how to crawl such a troublesome infinite scroll page.

Example

Rather than explaining only the theory, I will use a crawler I actually wrote, which pulls past tweets from Twitter, as an example. The source is available in the GitHub repository: https://github.com/keitakurita/twitter_past_crawler

Incidentally, you can install it with:

$ pip install twitterpastcrawler

Method

Infinite scrolling mechanism

So how does infinite scrolling work in the first place? Even with infinite scroll, it is impossible, in terms of data volume, to load an endless number of results somewhere in advance. In other words, infinite scrolling **dynamically** adds more data each time the user scrolls down. Therefore, for infinite scrolling to work, the page must be able to:

  1. Know the range that is currently displayed
  2. Based on that, know which data should be fetched next

In most cases, infinite scrolling has some **key parameter** that represents the **currently displayed range**, and that parameter is used to fetch the next batch of results.
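
In Python-flavored pseudocode, the general pattern looks something like the sketch below. The endpoint URL and the parameter/field names here are hypothetical; each site uses its own names, as the Twitter example in the next section shows.

# Generic sketch of how a client consumes an infinite scroll endpoint.
# The endpoint and the parameter/field names are hypothetical placeholders.
import requests

def fetch_batches(endpoint, query):
    cursor = ""  # the key parameter: where the currently displayed range ends
    while True:
        resp = requests.get(endpoint, params={"q": query, "cursor": cursor}).json()
        yield resp["html"]               # the newly loaded chunk of results
        if not resp.get("has_more"):     # stop when the server says there is nothing left
            break
        cursor = resp["cursor"]          # continue from the new position next time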

For Twitter

You can see how this is actually implemented by looking at the requests Twitter sends behind the scenes. As a test, search for the word qiita. I am using Chrome, but any browser can show the network activity going on behind a page. In Chrome, open "View" -> "Developer" -> "Developer Tools" and select the Network tab. When you open it, you should see a screen like the one below (screenshot: the Network panel showing the requests).

If you scroll down a few times, you'll see a suspicious URL that appears several times in the list of requests:

https://twitter.com/i/search/timeline?vertical=default&q=qiita&src=typd&composed_count=0&include_available_features=1&include_entities=1&include_new_items_bar=true&interval=30000&lang=en&latent_count=0&min_position=TWEET-829694142603145216-833144090631942144-BD1UO2FFu9QAAAAAAAAETAAAAAcAAAASAAAAAAAAQAAEIIAAAAYAAAAAAACAAAAAAAgQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAgAAAAAAAQAAAAAEAAAAAAAAAAAABAAAAAAAAAAAAAIAAAAAAAAAAAAAaAAAAAAAAAAAAAAAAAAAAAAAAEAIACAIQIAAAgAAAAAAAASAAAAAAAAAAAAAAAAAAAAAA

The last parameter, min_position, is obviously suspicious. If you download the body of this response and look at it, you can see that it is a JSON response. Its contents look like this:

focused_refresh_interval: 240000
has_more_items: false
items_html: ...
max_position: "TWEET-829694142603145216-833155909996077056-BD1UO2FFu9QAAAAAAAAETAAAAAcAAAASAAAAAAAAQAAEIIAAAAYAAAAAAACAAAAAAAgQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAgAAAAAAAQAAAAAEAAAAAAAAAAAABAAAAAAAAAAAAAIAAAAAAAAAAAAAaAAAAAAAAAAAAAAAAAAAAAAAAEAIACAIQIAAAgAAAAAAAASAAAAAAAAAAAAAAAAAAAAAA"

`items_html` contains the raw HTML of the tweets. This is the tweet content you are looking for. Also of note is the parameter `max_position`: it has the same format as the earlier parameter `min_position`. If you substitute this value for `min_position` in the URL and send the request again, you get another response in the same format. In other words, this `min_position` is the key parameter we were looking for.
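
To check this for yourself, you can reproduce the request with requests and inspect the fields of the JSON response. This is only a minimal sketch, assuming the endpoint above is still reachable; the min_position value is a placeholder you would replace with the one observed in the developer tools.

# Sketch: reproduce the request seen in the developer tools and inspect the JSON.
import requests

params = {
    "vertical": "default",
    "q": "qiita",
    "src": "typd",
    "include_available_features": "1",
    "include_entities": "1",
    "lang": "en",
    "min_position": "TWEET-...",  # placeholder cursor, paste in the one you observed
}
resp = requests.get("https://twitter.com/i/search/timeline", params=params).json()

print(resp["has_more_items"])    # whether more results are available
print(resp["max_position"])      # the cursor to substitute for min_position next time
print(resp["items_html"][:300])  # raw HTML of the newly loaded tweets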

How to crawl

At this point, the rest is easy. In principle, you can crawl by repeating the following process (a rough sketch follows the list):

  1. Send a request to a URL in the format above, adjusting its parameters (for example, q: the query).
  2. Extract `items_html` and `max_position` from the JSON response.
  3. Process the contents of `items_html` appropriately.
  4. Set `min_position` to the value of `max_position` and send the next request.
  5. Repeat steps 2-4.
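
Put together, a rough sketch of this loop might look like the following. It is simplified compared to the actual package code and assumes the search endpoint and response fields observed in the previous section.

# Rough sketch of the crawl loop above (simplified; not the actual package code).
import requests

SEARCH_URL = "https://twitter.com/i/search/timeline"

def crawl_query(query, max_batches=100):
    min_position = ""
    for _ in range(max_batches):
        params = {"q": query, "src": "typd", "min_position": min_position}
        response = requests.get(SEARCH_URL, params=params).json()  # 1. send the request
        items_html = response["items_html"]                        # 2. extract items_html
        process(items_html)                                        # 3. process the tweet HTML
        if not response.get("has_more_items"):
            break
        min_position = response["max_position"]                    # 4. reuse max_position as min_position

def process(items_html):
    # Placeholder: parse the raw tweet HTML here, e.g. with an HTML parser.
    pass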

How to use twitterpastcrawler

With the package I created, you just give it a query and the above process runs automatically, writing the tweet information to a CSV file, as shown below.

sample.py


import twitterpastcrawler

crawler = twitterpastcrawler.TwitterCrawler(
    query="qiita",            # search for tweets containing the keyword "qiita"
    output_file="qiita.csv"   # output tweet information to a file called qiita.csv
)

crawler.crawl()  # start crawling

Finally

If you can get past tweets from Twitter, you can find out what kinds of tweets were made around a certain event (for example, an election or a game's release date), which I think is interesting. Since the number of pages that use infinite scrolling is growing, I expect crawling infinitely scrolling pages to become useful in more and more situations.
