Every time you scrape
test.py
from bs4 import BeautifulSoup
Writing imports out one line at a time like this every time is a pain, so I keep a template that is guaranteed to cover everything I need.
test.py
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium
!pip install requests-html
First, the library setup. I usually work in Google Colab, so I run this cell first to install Chromium, the matching chromedriver, and the Python packages.
test.py
import pandas as pd
import datetime
from tqdm.notebook import tqdm
import requests
from bs4 import BeautifulSoup
import time
import re
from urllib.request import urlopen
import urllib.request, urllib.error
from requests_html import HTMLSession
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# Everything up to fetching the HTML
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome('chromedriver', options=options)
driver.implicitly_wait(10)  # wait up to 10 seconds for elements to appear
url = "https://www.XXX.com"  # replace with the target site
driver.get(url)
html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, "html.parser")
Up to this point it is safe to paste in without thinking. After that,
test.py
soup
with this you can get from nothing to dumped HTML in a few seconds.
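From here, extraction is the usual BeautifulSoup calls on `soup`. A minimal sketch — the HTML string, tag names, and `item` class below are made up for illustration; in practice the HTML comes from `driver.page_source` as in the template above:

```python
from bs4 import BeautifulSoup

# Stand-in HTML; with the template, this would be driver.page_source
html = """
<html><body>
  <h1>Example Title</h1>
  <ul>
    <li class="item">apple</li>
    <li class="item">banana</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Grab a single element by tag, and a list of elements by CSS selector
title = soup.find("h1").get_text(strip=True)
items = [li.get_text(strip=True) for li in soup.select("li.item")]
print(title, items)  # Example Title ['apple', 'banana']
```

`find` returns the first match, while `select` takes a CSS selector and returns every match, which covers most day-to-day scraping.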
Strictly speaking, the set includes libraries I don't always use, such as tqdm, but I prefer to pack every import I could possibly need for scraping into one block. I just copy and paste this template every time.