[PYTHON] Getting Started with Scraping

Introduction

Preparation

-- Set up the following environment:

Google Chrome

WebDriver

[Folder structure] rpa.png

Python

selenium

selenium


pip install selenium

Beautifulsoup4

BS4


pip install beautifulsoup4
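
After the two installs, it can help to confirm that both packages are importable before going further. The following is a minimal sketch that uses only the standard library; the helper name `check_packages` is hypothetical and not part of the original setup.

```python
from importlib.util import find_spec


def check_packages(names):
    """Return a dict mapping each top-level module name to True if it is importable."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # "bs4" is the import name of the beautifulsoup4 package.
    for name, found in check_packages(["selenium", "bs4"]).items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

If either line reports `missing`, re-run the matching `pip install` command in the same Python environment.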

Hello Beautifulsoup4!

scraping.py


from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html>
<html>
    <head>
        <title>TEST SOUP</title>
    </head>
    <body>
        <h1>Hello BS4</h1>
        <p class="font-big">python scraping</p>
        <button id="start" @click="getURI">Start</button>

        <ul>
            <li><a href="https://www.yahoo.co.jp">Yahoo</a></li>
            <li><a href="https://www.google.co.jp">Google</a></li>
            <li><a href="https://www.amazon.co.jp/">Amazon</a></li>
        </ul>
    </body>
</html>
"""

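# Parse the document with Python's built-in HTML parser
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())          # the whole document, re-indented
print(soup.title)               # the <title> tag itself
print(soup.title.name)          # tag name: title
print(soup.title.string)        # text inside the tag: TEST SOUP
print(soup.title.parent.name)   # name of the enclosing tag: head
print(soup.h1)                  # first <h1> tag
print(soup.p)                   # first <p> tag
print(soup.p['class'])          # class is multi-valued, so this is a list: ['font-big']
print(soup.button)              # first <button> tag
print(soup.find(id='start'))    # look up by id; the same <button>
print(soup.a)                   # first <a> tag only
print(soup.find_all('a'))       # list of every <a> tag

# Print the href attribute of each link
for link in soup.find_all('a'):
    print(link.get('href'))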

print(soup.get_text())
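
Besides `find` and `find_all`, Beautiful Soup also accepts CSS selectors through `select`. The sketch below reuses a trimmed copy of the sample HTML above; the helper name `extract_links` is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<ul>
    <li><a href="https://www.yahoo.co.jp">Yahoo</a></li>
    <li><a href="https://www.google.co.jp">Google</a></li>
    <li><a href="https://www.amazon.co.jp/">Amazon</a></li>
</ul>
<button id="start">Start</button>
"""


def extract_links(html):
    """Return (text, href) pairs for every <a> that is a direct child of a <li>."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a.get("href")) for a in soup.select("li > a")]


if __name__ == "__main__":
    for text, href in extract_links(html_doc):
        print(text, href)
```

Returning plain tuples keeps the parsing step separate from whatever is done with the links afterwards, such as printing or saving them.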

Hello Selenium!

driver


# How to locate an element
# (requires `from selenium.webdriver.common.by import By`; the older
#  find_element_by_* helpers are deprecated in Selenium 4)
#driver.find_element(By.ID, 'ID')
#driver.find_element(By.CLASS_NAME, 'CLASS_NAME')
#driver.find_element(By.NAME, 'NAME')
#driver.find_element(By.CSS_SELECTOR, 'CSS_SELECTOR')
#driver.find_element(By.XPATH, 'XPath')
#driver.find_element(By.LINK_TEXT, 'LINK_TEXT')
#driver.find_element(By.PARTIAL_LINK_TEXT, 'LINK_TEXT')

# Operating on elements
#driver.find_element(By.ID, 'ID').click()
#el = driver.find_element(By.ID, 'ID')
#driver.execute_script("arguments[0].click();", el)  # click via JavaScript
#driver.find_element(By.ID, 'ID').send_keys('STRINGS')
#driver.find_element(By.ID, 'ID').text
#driver.find_element(By.ID, 'ID').get_attribute('ATTRI_NAME')
#driver.find_element(By.ID, 'ID').clear()

# Page operations
#driver.back()
#driver.forward()
#driver.refresh()
#driver.close()  # closes the current window
#driver.quit()   # closes every window and ends the WebDriver session

selenium_sample.py (avoid naming the file selenium.py, which would shadow the selenium package on import)


import time
import os
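# Append the ChromeDriver folder to PATH; os.pathsep is ';' on Windows and ':' elsewhere
os.environ['PATH'] = os.getenv('PATH', '') + os.pathsep + './Scripts/chromedriver_binary'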

# WebDriver: https://sites.google.com/a/chromium.org/chromedriver/downloads
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from bs4 import SoupStrainer

HEADLESS = False
URL = 'https://docs.python.org/ja/3/py-modindex.html'
SELECTOR = 'body > div.footer'

op = Options()
if HEADLESS:
    op.add_argument("--headless")

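driver = webdriver.Chrome(options=op)  # `chrome_options=` is deprecated; use `options=`
driver.get(URL)
# Wait up to 30 seconds for the footer to be present, as a sign that the page has loaded
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, SELECTOR))
)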

# Parse only the <code> tags instead of the whole page
code_tag = SoupStrainer('code')
sp = BeautifulSoup(driver.page_source, features='html.parser', parse_only=code_tag)

for c in sp.find_all('code'):
    print(c.string)

driver.quit()
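
The parsing step does not need a live browser, so it can be factored out and tried on any HTML string. A minimal sketch follows: `extract_code_texts` is a hypothetical helper, and it uses `get_text()` rather than `.string` because `.string` returns `None` for a tag that has more than one child.

```python
from bs4 import BeautifulSoup, SoupStrainer


def extract_code_texts(page_source):
    """Return the text of every <code> element found in `page_source`."""
    only_code = SoupStrainer("code")  # skip everything except <code> tags
    soup = BeautifulSoup(page_source, "html.parser", parse_only=only_code)
    return [tag.get_text() for tag in soup.find_all("code")]


if __name__ == "__main__":
    sample = (
        "<p>ignored</p>"
        "<code>os</code>"
        "<code>os.<span>path</span></code>"
    )
    print(extract_code_texts(sample))
```

In the script above, the same function could be called as `extract_code_texts(driver.page_source)` before `driver.quit()`.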

Conclusion
