I had to scrape sites with dynamic elements, so I had no choice but to start learning Selenium.
pip install selenium
I want to use Chrome as the browser, so download ChromeDriver and place it inside the virtual environment. I put it in the environment's bin directory.
https://sites.google.com/a/chromium.org/chromedriver/downloads
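Before going further, it can help to check that the driver really is where the script will look for it. Here is a minimal sketch, assuming a POSIX-like system. `find_driver` is a hypothetical helper name, not part of Selenium, and the directories you pass in are whichever ones you chose when placing the driver.

```python
import os
import shutil


def find_driver(name="chromedriver", extra_dirs=()):
    """Return the path of the first executable called `name`, or None.

    Directories in `extra_dirs` are searched first, then the regular PATH.
    """
    search_path = os.pathsep.join(list(extra_dirs) + [os.environ.get("PATH", "")])
    return shutil.which(name, path=search_path)
```

For example, `find_driver(extra_dirs=[os.path.dirname(__file__)])` looks next to the script first, which is where `DRIVER_PATH` in test.py points.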
Let's check right away that it works. I use Yahoo! as the test URL.
test.py
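import os
import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

# chromedriver is expected to sit next to this script
DRIVER_PATH = os.path.join(os.path.dirname(__file__), 'chromedriver')

browser = webdriver.Chrome(DRIVER_PATH)
browser.get('https://www.yahoo.co.jp')
try:
    # Grab the first element with class "emphasis" and print its text
    elem_1 = browser.find_element_by_class_name('emphasis')
    print('<{}>Discover!'.format(elem_1.text))
    time.sleep(3)  # keep the browser window open briefly
except NoSuchElementException:
    print('No')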
(flaskworks) $ python test.py
<GDP year 1.0%Downward revision to increase
Contradiction photo NEW to the Prime Minister's answer
Uncle angry testimony photo of British terrorist suspect
Mt. Fuji in Gunma?Misleading station name photo NEW
Former Idol Bartender No.1 photo
Tanaka learn the language Commentator apology photo NEW
Honda Photograph of passive smoking immediately after the game
Yamazaki Anna Photograph admitted to dating with Obata NEW>Discover!
Confirmed that it works. Next, I'll try moving to another page.
test.py
import os
import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

DRIVER_PATH = os.path.join(os.path.dirname(__file__), 'chromedriver')

browser = webdriver.Chrome(DRIVER_PATH)
browser.get('https://www.yahoo.co.jp')
try:
    # Follow the "See more" link
    link_elem = browser.find_element_by_link_text('See more')
    link_elem.click()
    # Print the first element with class "ttl" on the new page
    text_elem = browser.find_element_by_class_name('ttl')
    print(text_elem.text)
    time.sleep(3)
except NoSuchElementException:
    print('No')
(flaskworks)$ python test.py
North Korea launches unknown projectile
Huh? I only got one item, because find_element returns just the first match.
When I rewrite the lookup as
text_elem = browser.find_element_by_class_name('list')
the output becomes:
(flaskworks) $ python test.py
North Korea launches unknown projectile
international
6/8(Thu) 7:42
Nishikori's defeat regret is a tiebreaker
Sports
6/8(Thu) 5:10
Nishikori reversal defeated French Open 4 not strong
Sports
6/8(Thu) 2:12
North Korean ballistic missile launch signs
international
... (rest omitted)
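Another way to get every headline, rather than the text of the enclosing container, would be find_elements (plural), which returns all matches instead of just the first. Here is a minimal sketch, assuming the headlines still carry the `ttl` class used earlier. `collect_texts` is a hypothetical helper name. The locator is passed as the plain strategy string "class name" (the value behind Selenium's `By.CLASS_NAME`), a form that does not rely on the find_element_by_* shortcuts, which were removed in Selenium 4.3.

```python
def collect_texts(driver, strategy, value):
    """Return the text of every element matching the locator.

    `driver` is any object with Selenium's find_elements(by, value) method.
    An empty list comes back when nothing matches; no exception is raised.
    """
    return [element.text for element in driver.find_elements(strategy, value)]


# Hypothetical usage with the `browser` object from test.py:
# for title in collect_texts(browser, "class name", "ttl"):
#     print(title)
```

Each headline then arrives as its own string, which is easier to process than one large block of text.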
I see. Maybe this is easier than Beautiful Soup.
Next, paging. I simply append a page parameter to the URL. Since I still end up clicking "next" on each page, the process isn't elegant. There is probably a better way, but this is as far as I can get as a beginner.
test.py
import os
import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

DRIVER_PATH = os.path.join(os.path.dirname(__file__), 'chromedriver')
BASE_URL = 'https://news.yahoo.co.jp/list/?c=domestic&p='

browser = webdriver.Chrome(DRIVER_PATH)
url = BASE_URL
page = 1
for _ in range(5):
    try:
        browser.get(url)
        # Click "next", so the list printed below is the page after `url`
        link_elem = browser.find_element_by_link_text('next')
        link_elem.click()
        text_elem = browser.find_element_by_css_selector('.list')
        print(text_elem.text)
        time.sleep(3)
        page += 1
        url = BASE_URL + str(page)
    except NoSuchElementException:
        print('No')
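Since the page number is already in the URL, clicking "next" may not be needed at all. Here is a minimal sketch of that idea, assuming the `p` parameter starts at 1 and that every page exposes a `.list` element as in the script above. `page_url` and `scrape_pages` are hypothetical helper names, and `driver` stands for the browser object from the script.

```python
from urllib.parse import urlencode

BASE_URL = "https://news.yahoo.co.jp/list/"


def page_url(page, category="domestic"):
    """Build the list URL for one page, e.g. ...?c=domestic&p=2."""
    return BASE_URL + "?" + urlencode({"c": category, "p": page})


def scrape_pages(driver, first_page, last_page, css_selector=".list"):
    """Load each page directly and collect the text of matching elements."""
    texts = []
    for page in range(first_page, last_page + 1):
        driver.get(page_url(page))
        # find_elements returns [] when nothing matches, so a missing
        # list simply contributes nothing instead of raising.
        for element in driver.find_elements("css selector", css_selector):
            texts.append(element.text)
    return texts
```

With the browser from the script, `scrape_pages(browser, 1, 5)` would visit pages 1 to 5 without any clicks. A short time.sleep between pages may still be a good idea to go easy on the server.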