[PYTHON] [Selenium] Use a while statement to repeatedly move to the "next page"

Tips

1. Repeating page transitions with a while statement (works)

selenium_success.py


#On the first page, get the URL of the next (second) page
next_page_url = driver.find_element_by_class_name("js-next-page-link").get_attribute("href")

#Loop from the second page to the last page
while len(next_page_url) > 0:
	driver.get(next_page_url)
	#Wait up to 10 seconds for elements to load
	driver.implicitly_wait(10)
	next_page_html = driver.page_source.encode('utf-8')

	#Write the code for any per-page processing here

	#Look up the next-page link again on the page that was just loaded
	next_page_url = driver.find_element_by_class_name("js-next-page-link").get_attribute("href")
else:
	print("\n\n The processing of the last page is finished.\n\n")
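The loop above assumes every page has a "next page" link. As a hedged sketch of one way to stop cleanly on the last page: `find_elements` (plural) returns an empty list when nothing matches, so the link lookup can return `None` instead of raising. `FakeLink` and `FakeDriver` are hypothetical stand-ins so the sketch runs without a browser, and `next_page_url_or_none` and `crawl` are illustrative names, not part of the original script. With a real driver you would pass Selenium's `By.CLASS_NAME` (whose value is the string `"class name"`); note also that the `find_element_by_*` helpers used in this article were removed in Selenium 4.3.

```python
class FakeLink:
    """Hypothetical stand-in for a Selenium WebElement (demo only)."""

    def __init__(self, href):
        self.href = href

    def get_attribute(self, name):
        return self.href if name == "href" else None


class FakeDriver:
    """Hypothetical stand-in for a WebDriver over a three-page site (demo only)."""

    NEXT = {"p1": "p2", "p2": "p3", "p3": None}

    def __init__(self):
        self.current = "p1"

    def get(self, url):
        self.current = url

    @property
    def page_source(self):
        return "<html>{}</html>".format(self.current)

    def find_elements(self, by, value):
        # Like Selenium, return an empty list when no element matches.
        nxt = self.NEXT[self.current]
        return [FakeLink(nxt)] if nxt else []


def next_page_url_or_none(driver, class_name="js-next-page-link"):
    """Return the href of the next-page link, or None on the last page."""
    # "class name" is the string value of Selenium's By.CLASS_NAME.
    links = driver.find_elements("class name", class_name)
    return links[0].get_attribute("href") if links else None


def crawl(driver, max_pages):
    """Follow next-page links and collect the source of each visited page."""
    sources = [driver.page_source]  # the first page is already loaded
    url = next_page_url_or_none(driver)
    while url and len(sources) < max_pages:
        driver.get(url)
        sources.append(driver.page_source)
        url = next_page_url_or_none(driver)
    return sources


print(len(crawl(FakeDriver(), max_pages=50)))  # 3
print(len(crawl(FakeDriver(), max_pages=2)))   # 2
```

The `while` condition checks both the link and the page limit, so the loop ends either on the last page or at the upper bound, whichever comes first.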

2. Repeating page transitions with an if statement (stops partway)

selenium_failure.py


#On the first page, get the URL of the next (second) page
next_page_url = driver.find_element_by_class_name("js-next-page-link").get_attribute("href")

#An if statement is evaluated only once, so processing stops after the second page
if len(next_page_url) != 0:
	driver.get(next_page_url)
	#Wait up to 10 seconds for elements to load
	driver.implicitly_wait(10)
	next_page_html = driver.page_source.encode('utf-8')

	#Write the code for any per-page processing here

	#This value is updated, but the condition is never checked again
	next_page_url = driver.find_element_by_class_name("js-next-page-link").get_attribute("href")
else:
	print("\n\n The processing of the last page of the article search result is finished.\n\n")
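Why the second version stops partway can be shown without a browser. The following is a minimal sketch in which the dictionary `NEXT_LINK` is a hypothetical stand-in for a three-page site (each page maps to the URL of its next page, and an empty string marks the last page). The `if` version checks the condition once, while the `while` version checks it again after every page.

```python
# Hypothetical stand-in for a paginated site: page -> URL of the next page.
NEXT_LINK = {"page1": "page2", "page2": "page3", "page3": ""}


def visit_with_if(start):
    """Follow the next-page link using an if statement (runs at most once)."""
    visited = [start]
    next_url = NEXT_LINK[start]
    if len(next_url) != 0:
        visited.append(next_url)
        next_url = NEXT_LINK[next_url]  # updated, but never checked again
    return visited


def visit_with_while(start):
    """Follow the next-page link using a while statement (runs until empty)."""
    visited = [start]
    next_url = NEXT_LINK[start]
    while len(next_url) > 0:
        visited.append(next_url)
        next_url = NEXT_LINK[next_url]  # checked again on the next iteration
    return visited


print(visit_with_if("page1"))     # ['page1', 'page2']
print(visit_with_while("page1"))  # ['page1', 'page2', 'page3']
```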

The code I actually wrote

selenium_python_multi_pages_while.py


# coding: utf-8
import time, argparse, datetime
from selenium import webdriver
from bs4 import BeautifulSoup
from pprint import pprint
import pandas as pd
import numpy as np

###constant###
url = "http://qiita.com"

###Variable initialization declaration###
pagenum = 0
#Receive command line arguments
parser = argparse.ArgumentParser()

#Define the two command line arguments (search word and maximum number of pages)
parser.add_argument('-word', '--search_word', default='Python', help='Please specify the search word to enter on the article search page of Qiita.')
parser.add_argument('-max', '--max_page_num', default='50', help='If there are multiple applicable pages and you want to set an upper limit on the number of pages for which data can be acquired, specify the upper limit.')
args = parser.parse_args()
search_word = args.search_word
num_of_search_pages = int(args.max_page_num)

print("\n\n Entered search string:", args.search_word, "\n")
print("If the article list page extends to multiple pages, the process will be terminated after {} pages.".format(num_of_search_pages))

#output file name
output_file_name = str(datetime.datetime.now()) + "_Search: " + search_word

###Function definition###
def proceed_each_page(page_num, this_page_html, driver):
	this_soup = BeautifulSoup(this_page_html, 'lxml')
	print("""
	================================
		   Processing page {}...	
	===============================
	""".format(page_num))
	results = this_soup.find_all("h1", class_="searchResult_itemTitle")
	#Store the article titles in a list
	this_page_title_list = []
	for result in results:
		#get_text() returns the link text with nested tags such as <em> removed
		title_text = result.find_all("a")[0].get_text()
		this_page_title_list.append(title_text)

	console_message = "Number of articles on page {} of the search results: ".format(page_num) + str(len(this_page_title_list)) + "\n\n"
	pprint(this_page_title_list)
	
	#Store the result in the URL list
	this_page_url_list = []
	for result in results:
		href = result.findAll("a")[0].get("href")
		this_page_url_list.append(str(url + href))

	#Store the contributors in a list
	# <div class="searchResult_header"><a href="/niiku-y">niiku-y</a> posted on 2019/08/07</div>
	this_page_author_list = []
	
	results = this_soup.findAll(class_="searchResult_header")
	for result in results:
		author = result.findAll("a")[0].get("href")
		author = author.replace("/", "")
		this_page_author_list.append(author)

	#Save data that the article was obtained from the nth page
	this_page_num_list = [page_num]*len(this_page_title_list)

	##Take a screenshot of the nth page of the search results
	#Get the full width and height of the page
	w = driver.execute_script("return document.body.scrollWidth;")
	h = driver.execute_script("return document.body.scrollHeight;")
	driver.set_window_size(w,h)
	#Specify the save location and file name of the screenshot (image file)
	FILENAME = "./{search_word}_page{number}_screen_{datetime}.png".format(search_word=search_word, number=page_num, datetime=str(datetime.datetime.now()))
	#Save the image
	driver.save_screenshot(FILENAME)
	
	#Returns each list containing information about the processed web page
	return [this_page_num_list, this_page_author_list, this_page_title_list, this_page_url_list, driver]

###Method definition end

###main processing
driver = webdriver.Chrome()

#Access the top page of Qiita
driver.get(url)

#Enter a keyword in the article search box field
search = driver.find_element_by_class_name("st-Header_searchInput")
search.send_keys(search_word)
search.submit()
time.sleep(5)

#Get the HTML of the first page of the article list page of the search results
first_page_html = driver.page_source.encode('utf-8')

#Process the first page
page_num = 1
all_page_num_list = []
all_page_author_list = []
all_page_title_list = []
all_page_url_list = []

this_page_num_list, this_page_author_list, this_page_title_list, this_page_url_list, driver = proceed_each_page(page_num, first_page_html, driver)

all_page_num_list = all_page_num_list + this_page_num_list
all_page_author_list = all_page_author_list + this_page_author_list
all_page_title_list = all_page_title_list + this_page_title_list
all_page_url_list = all_page_url_list + this_page_url_list

#If the page currently loaded in the driver has a next page, move to it.
# get_attribute("href") returns a str, not a list. If no element has this class, find_element_by_class_name raises NoSuchElementException rather than returning an empty value
next_page_url = driver.find_element_by_class_name("js-next-page-link").get_attribute("href")
print("=======")
print(type(next_page_url))
print("=======")

#Loop from the second page up to the last page or page {num_of_search_pages}, whichever comes first
# To make page (num_of_search_pages) the final page, move to the next page (num_of_search_pages - 1) times
while len(next_page_url) > 0 and page_num <= (num_of_search_pages - 1):
	driver.get(next_page_url)
	#Set the wait time for an element to load to 10 seconds
	driver.implicitly_wait(10)
	#time.sleep(5)
	next_page_html = driver.page_source.encode('utf-8')
	page_num += 1
	this_page_num_list, this_page_author_list, this_page_title_list, this_page_url_list, driver = proceed_each_page(page_num, next_page_html, driver)
	all_page_num_list = all_page_num_list + this_page_num_list
	all_page_author_list = all_page_author_list + this_page_author_list
	all_page_title_list = all_page_title_list + this_page_title_list
	all_page_url_list = all_page_url_list + this_page_url_list
	next_page_url = driver.find_element_by_class_name("js-next-page-link").get_attribute("href")
	print("=======")
	print(next_page_url)
	print(len(next_page_url))
	print("=======")
else:
	print("\n\n The processing of the last page of the article search result is finished.\n\n")

#Excel file output
print("\n\n Outputs the acquired search result data to an Excel file.\n\n")
data = np.array([all_page_num_list, all_page_author_list, all_page_title_list, all_page_url_list]).T

index_list = list(range(len(all_page_num_list)))
column_list = ['Posting page number in the search result screen', 'Contributor', 'Article title', 'Article URL']
output_df = pd.DataFrame(data, columns=column_list, index=index_list)
pprint(output_df)

#Output the result to an Excel file
output_df.to_excel('./'+output_file_name + '.xlsx', sheet_name='Qiita_Articles_list')

#Close and erase (free memory) the driver instance created for automatic access to the web page
time.sleep(5)
driver.close()
driver.quit()
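The script above puts `str(datetime.datetime.now())` directly into the Excel and screenshot file names, which yields names containing spaces and colons (colons are not allowed in Windows file names). As a hedged alternative, the sketch below formats the timestamp with `strftime`. The helper name `timestamped_name` and the chosen format are illustrative assumptions, not part of the original script.

```python
import datetime


def timestamped_name(prefix, ext, now=None):
    """Build a file name such as 'prefix_20201108_002717.ext'.

    `now` can be passed explicitly (useful for testing); by default the
    current local time is used.
    """
    now = now or datetime.datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")  # digits and underscore only
    return "{}_{}.{}".format(prefix, stamp, ext)


fixed = datetime.datetime(2020, 11, 8, 0, 27, 17)
print(timestamped_name("Haskell_page1_screen", "png", fixed))
# Haskell_page1_screen_20201108_002717.png
```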

(How to use)

Console


$ python  selenium_python_multi_pages_while.py --help
usage: selenium_python_multi_pages_while.py [-h] [-word SEARCH_WORD]
                                            [-max MAX_PAGE_NUM]

optional arguments:
  -h, --help            show this help message and exit
  -word SEARCH_WORD, --search_word SEARCH_WORD
                        Please specify the search word to enter on the article search page of Qiita.
  -max MAX_PAGE_NUM, --max_page_num MAX_PAGE_NUM
                        If there are multiple applicable pages and you want to set an upper limit on the number of pages for which data can be acquired, specify the upper limit.
$ 
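The script stores `--max_page_num` as a string and converts it later with `int()`. A small sketch of an alternative, assuming the same two options: passing `type=int` lets argparse do the conversion and reject non-numeric input with a usage error. The `build_parser` function and the shortened help strings are illustrative, and the argument list is passed explicitly here so the example runs without a real command line.

```python
import argparse


def build_parser():
    """Return a parser with the same two options as the script above."""
    parser = argparse.ArgumentParser()
    parser.add_argument('-word', '--search_word', default='Python',
                        help='Search word to enter on the Qiita search page.')
    # type=int makes argparse convert (and validate) the value itself.
    parser.add_argument('-max', '--max_page_num', type=int, default=50,
                        help='Upper limit on the number of result pages to process.')
    return parser


args = build_parser().parse_args(['-word', 'Haskell', '-max', '5'])
print(args.search_word, args.max_page_num)  # Haskell 5
```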

(Execution example and its result)

Console


$ python  selenium_python_multi_pages_while.py -word Haskell -max 5


Search string entered: Haskell

If the article list page extends to multiple pages, the process will be terminated after 5 pages.

	================================
Processing the first page...
	===============================

['MD5 in Haskell#Haskell',
 'stylish-Make haskell compatible with HexFloatLiterals and Numeric Underscores',
 'Haskell tutorial(Haskell Day 2016)',
 'Build the fastest development environment with VS Code Haskell extensions',
 'stylish-Make haskell correspond to BlockArguments',
 'Using Functor with Haskell',
 'Docker on Windows 10 Home+Haskell environment construction with VS Code',
 'Docker +Haskell Hello World build',
 'Play in Haskell typeclass',
 'About Haskell Either']
=======
<class 'str'>
=======

	================================
Processing the second page...
	===============================

['Haskell function types and currying#Haskell',
 'Haskell and SQLite',
 'Getting Started with Haskell',
 'Haskell/References for searching the meaning of GHC symbols',
 'Approaching Haskell',
 'Haskell Primer Articles Memorandum',
 'Read Typing Haskell in Haskell',
 '[Haskell] Some memo about learning haskell',
 'Getting Started with Haskell-Stack installation and configuration',
 'I made a Gem called Haskell that embeds Haskell code in Ruby!']
=======
https://qiita.com/search?page=3&q=Haskell
41
=======

	================================
Processing the third page...
	===============================

['Read Typing Haskell in Haskell',
 'Set up a Haskell development environment with Visual Studio Code',
 'Why Learn Haskell',
 '[Haskell] Some memo about learning haskell',
 'Haskell installation notes',
 "VS Code shows Couldn't start client Haskell IDE (Windows 10)",
 "Let's use Stack in haskell-mode",
 'Haskell Study Part 1-Haskell environment construction',
 'Haskeller got started with Rust',
 'Guidelines for getting started with Haskell to becoming an intermediate']
=======
https://qiita.com/search?page=4&q=Haskell
41
=======

	================================
Processing the 4th page...
	===============================

['Haskell in Atom Editor',
 'Make a reverse Polish notation calculator in Haskell',
 'Ide in Atom editor-Steps to use haskell',
 'Touch Haskell on Mac Note 0.1',
 'I started Haskell',
 'ATOM ide-haskell installation procedure (MacOS X)',
 '[Translation] Difference between PureScript and Haskell\u3000+α',
 'Haskell notes',
 'Haskell development environment construction on Windows 10',
 'You can eat it in Haskell!!']
=======
https://qiita.com/search?page=5&q=Haskell
41
=======

	================================
Processing page 5...
	===============================

['Haskell Weekly News Japanese Edition(Trial) (5/8)',
 'Full message for Haskell experts around the world',
 'Implemented quicktype Haskell output to generate code for each language from JSON',
 'haskell-ide-engine introduction',
 'The point that fits in the Haskell environment construction of VS Code on macOS',
 'Implement Go Tools in Haskell',
 'Haskell($)When(.)The difference of',
 'Reply from Samuel Gélineau Part 1(translation)',
 'Haskell environment construction memo',
 "Haskeller's Weekly Rust Introductory Challenge Day 1#Rust"]
=======
https://qiita.com/search?page=6&q=Haskell
41
=======


The processing of the last page of the article search results is finished.




The acquired search result data is output to an Excel file.
\m
   Posting page number in the search result screen        Contributor  \
0                1          Tatsuki-I
1                1          mod_poppo
2                1           hiratara
3                1             sgmryk
4                1      sparklingbaby
5                1         oskats1987
6                1     atsuyoshi-muta
7                1             dd0125
8                1         oskats1987
9                1             Izawa_
10               2          Tatsuki-I
11               2        satosystems
12               2            a163236
13               2        takenobu-hs
14               2         pumbaacave
15               2               F_cy
16               2              nka0i
17               2          zhupeijun
18               2      sparklingbaby
19               2         gogotanaka
20               3              nka0i
21               3          legokichi
22               3              arowM
23               3          zhupeijun
24               3             tnoda_
25               3            yutasth
26               3        t-mochizuki
27               3       CPyRbJvCHlCs
28               3            kanimum
29               3           Lugendre
30               4              eielh
31               4       inatatsu_csg
32               4       busyoumono99
33               4       hiroyuki_hon
34               4              Cj-bc
35               4  nakamurau1@github
36               4         hiruberuto
37               4             sahara
38               4       kitsukitsuki
39               4         reotasosan
40               5          imokurity
41               5         reotasosan
42               5              algas
43               5         dyoshikawa
44               5                dsm
45               5        kwhrstr1206
46               5         TTsurutani
47               5         reotasosan
48               5              1ain2
49               5          Tatsuki-I

Article title\
0 MD5 in Haskell#Haskell
1   stylish-haskell to Hex Float Liters and Numeric Unders...
2 Haskell tutorial(Haskell Day 2016)
3 Build the fastest development environment with VS Code Haskell extensions
4                stylish-Make haskell correspond to BlockArguments
5 Using Functor with Haskell
6 Docker on Windows10 Home+Haskell environment construction with VS Code
7                   Docker +Haskell Hello World build
8 Play in Haskell typeclass
9 About Haskell Either
10 Haskell function types and currying#Haskell
11 Haskell and SQLite
12 Getting Started with Haskell
13                   Haskell/References for searching the meaning of GHC symbols
14                                Approaching Haskell
15 Haskell Introductory Articles Memorandum
16 Read Typing Haskell in Haskell
17         [Haskell] Some memo about learning haskell
18 Getting Started with Haskell-Stack installation and configuration
19 Embed Haskell code in Ruby I made a Gem called Haskell!
20 Read Typing Haskell in Haskell
21 Prepare Haskell development environment with Visual Studio Code
22 Why Learn Haskell
23         [Haskell] Some memo about learning haskell
24 Haskell installation notes
25 VS Code shows Couldn't start client Haskell IDE (Wi...
26 Stack haskell-Let's use it in mode
27 Haskell Study Part 1-Haskell environment construction
28 Haskeller got started with Rust
29 Guidelines for getting started with Haskell to intermediate level
30 Haskell in Atom Editor
31 Make a reverse Polish notation calculator in Haskell
32 ide in Atom editor-Steps to use haskell
33 Touch Haskell on Mac Note 0.1
34 I started Haskell
35 ATOM ide-haskell installation procedure (MacOS X)
36 [Translation] Difference between PureScript and Haskell + α
37 Haskell notes
38 Haskell development environment construction on Windows 10
39 You can eat at Haskell!!
40 Haskell Weekly News Japanese Edition(Trial) (5/8)
41 Full Message to Haskell Experts Around the World
42 Implemented quicktype Haskell output to generate code for each language from JSON
43                               haskell-ide-engine introduction
44 The points that fit in with Haskell's environment construction of VS Code on macOS
45 Implement Go Tools in Haskell
46 Haskell($)When(.)The difference of
47 Reply from Samuel Gélineau Part 1(translation)
48 Haskell environment construction memo
49 Haskeller's Weekly Rust Introductory Challenge Day 1#Rust

Article URL
0   http://qiita.com/Tatsuki-I/items/6d4a2d9f767ae...
1   http://qiita.com/mod_poppo/items/418da906f6621...
2   http://qiita.com/hiratara/items/169b5cb83b0adb...
3   http://qiita.com/sgmryk/items/bc99efe36ad1c910...
4   http://qiita.com/sparklingbaby/items/a46f299dd...
5   http://qiita.com/oskats1987/items/30f9078c5096...
6   http://qiita.com/atsuyoshi-muta/items/9dd10d48...
7   http://qiita.com/dd0125/items/a141000ead36b382...
8   http://qiita.com/oskats1987/items/dcd46780ff5e...
9   http://qiita.com/Izawa_/items/ed0579a0e7d93e5c...
10  http://qiita.com/Tatsuki-I/items/d1d122107da8c...
11  http://qiita.com/satosystems/items/32bf104a041...
12  http://qiita.com/a163236/items/5e0d0e373e87ca8...
13  http://qiita.com/takenobu-hs/items/b95f0a4409c...
14  http://qiita.com/pumbaacave/items/17e6699d4db8...
15   http://qiita.com/F_cy/items/9c49e351196943e38ad9
16  http://qiita.com/nka0i/items/d44f0c6d4df1ef582fd3
17  http://qiita.com/zhupeijun/items/4abcc5fa1cdce...
18  http://qiita.com/sparklingbaby/items/a901cb3a7...
19  http://qiita.com/gogotanaka/items/78a3ffd04abc...
20  http://qiita.com/nka0i/items/d44f0c6d4df1ef582fd3
21  http://qiita.com/legokichi/items/8e7a68ffee522...
22  http://qiita.com/arowM/items/0305d4f439752f285438
23  http://qiita.com/zhupeijun/items/4abcc5fa1cdce...
24  http://qiita.com/tnoda_/items/22b265fe9ad8ee1e...
25  http://qiita.com/yutasth/items/28af2eb0371f645...
26  http://qiita.com/t-mochizuki/items/d831df3a920...
27  http://qiita.com/CPyRbJvCHlCs/items/9da9b43b55...
28  http://qiita.com/kanimum/items/d89547235070038...
29  http://qiita.com/Lugendre/items/70e517e59698e0...
30  http://qiita.com/eielh/items/b2e85f8ea4c6cdb8012d
31  http://qiita.com/inatatsu_csg/items/b035c76ec6...
32  http://qiita.com/busyoumono99/items/220bd3c30f...
33  http://qiita.com/hiroyuki_hon/items/3eb41a16fe...
34  http://qiita.com/Cj-bc/items/583fa82805775cf17dd6
35  http://qiita.com/nakamurau1@github/items/7feae...
36  http://qiita.com/hiruberuto/items/3eb21ef81b3d...
37  http://qiita.com/sahara/items/7c7ef646fb3e9b08...
38  http://qiita.com/kitsukitsuki/items/a56cbfc0de...
39  http://qiita.com/reotasosan/items/e80ab706baef...
40  http://qiita.com/imokurity/items/f90e4c35c74fe...
41  http://qiita.com/reotasosan/items/2b37fdef025a...
42  http://qiita.com/algas/items/1ebb9b8c77fc5f344708
43  http://qiita.com/dyoshikawa/items/a1789bf7ff1d...
44    http://qiita.com/dsm/items/861d08844b1fba32f07b
45  http://qiita.com/kwhrstr1206/items/fdf460f2a9a...
46  http://qiita.com/TTsurutani/items/201200c1f288...
47  http://qiita.com/reotasosan/items/cce796d32105...
48  http://qiita.com/1ain2/items/09ad8b0e4992f7ceae0f
49  http://qiita.com/Tatsuki-I/items/e19953c051e55...
$ 

(Output files)

スクリーンショット 2020-11-08 0.27.17.png

(Excel file)

スクリーンショット 2020-11-08 0.29.50.png

(Png file)

スクリーンショット 2020-11-08 0.30.42.png

(Screen capture image file on the middle page is omitted)

スクリーンショット 2020-11-08 0.31.21.png

Others: The search string for articles can be kanji, katakana, or kana.

Console


$ python  selenium_python_multi_pages_while.py -word 圏論 -max 20
