[PYTHON] Cheat sheet for scraping with Google Colaboratory (Colab)

Table of contents

- How to use Beautiful Soup
- How to use Selenium
- How to use Pandas
- How to work with spreadsheets

Regular expression lookahead and lookbehind are described in another article.

How to use Beautiful Soup

How to eliminate garbled characters

When using requests, you would normally write the following:

from bs4 import BeautifulSoup
import requests

res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')

Some sites come out garbled with this approach. Parsing them as follows eliminates most of the garbled characters.

from bs4 import BeautifulSoup
import requests

res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml", from_encoding='utf-8')
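The effect of `from_encoding` can be checked offline. The following is a minimal sketch, assuming a page served as Shift_JIS (the sample HTML and the choice of Shift_JIS are illustrative assumptions, not taken from a real site). It shows that decoding the bytes with the wrong codec produces replacement characters, while passing the real encoding to Beautiful Soup recovers the text.

```python
from bs4 import BeautifulSoup

# Assumed sample: the bytes a Shift_JIS site would return in res.content
html_bytes = (
    '<html><head><title>日本語のページ</title></head><body></body></html>'
).encode('shift_jis')

# Decoding with the wrong codec yields U+FFFD replacement characters (garbled text)
garbled = html_bytes.decode('utf-8', errors='replace')

# Telling Beautiful Soup the actual encoding recovers the original text
soup = BeautifulSoup(html_bytes, 'html.parser', from_encoding='shift_jis')
title = soup.title.string

print(title)
```

If the encoding of a site is not known in advance, `soup.original_encoding` reports which encoding Beautiful Soup ended up using.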

find code list

| Description | Code example |
| --- | --- |
| Find the first match | `soup.find('li')` |
| Find all matches | `soup.find_all('li')` |
| Search by attribute | `soup.find('a', href='http://www.google.com/')` |
| Search for several tags at once | `soup.find_all(['a', 'p'])` |
| Search by id | `soup.find('a', id="first")` |
| Search by class | `soup.find('a', class_="first")` |
| Get an attribute value | `first_link_element['href']` |
| Search by exact text (`text=` is the older name of `string=`) | `soup.find('dt', string='Search word')` |
| Search by partial text (requires `import re`) | `soup.find('dt', string=re.compile('Search word'))` |
| Get the parent element | `.parent` |
| Get the next sibling | `.next_sibling` |
| Get all following siblings | `.next_siblings` |
| Get the previous sibling | `.previous_sibling` |
| Get all preceding siblings | `.previous_siblings` |
| Get the text | `.string` |
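Several of the calls listed above can be tried on a small inline document. This is a sketch under stated assumptions: the HTML snippet and the variable names are made up for illustration, and `string=` is used as the current name of the `text=` argument. `find_next_sibling` is used instead of `.next_sibling` because `.next_sibling` can return a whitespace text node when the tags are separated by line breaks.

```python
import re
from bs4 import BeautifulSoup

# Assumed sample document
html = """
<ul>
  <li><a id="first" class="nav" href="http://www.google.com/">Google</a></li>
  <li><a class="nav" href="http://example.com/">Example</a></li>
</ul>
<dl>
  <dt>Search word</dt>
  <dd>definition</dd>
</dl>
"""
soup = BeautifulSoup(html, 'html.parser')

# id search, then attribute acquisition
first_link = soup.find('a', id='first')
href = first_link['href']

# class search over all matching tags
nav_texts = [a.string for a in soup.find_all('a', class_='nav')]

# several tag names at once, returned in document order
tag_names = [tag.name for tag in soup.find_all(['a', 'dt'])]

# partial text match, then move to the following <dd>
dt = soup.find('dt', string=re.compile('Search'))
dd_text = dt.find_next_sibling('dd').string

# parent element of the first link
parent_name = first_link.parent.name
```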

Select code list

| Description | Code example |
| --- | --- |
| Find the first match | `soup.select_one('css selector')` |
| Find all matches | `soup.select('css selector')` |

List of selector specification methods

| Description | Code example |
| --- | --- |
| Search by id | `soup.select('a#id')` |
| Search by class | `soup.select('a.class')` |
| Search by several classes | `soup.select('a.class1.class2')` |
| Attribute search 1 | `soup.select('a[class="class"]')` |
| Attribute search 2 | `soup.select('a[href="http://www.google.com"]')` |
| Attribute search 3 | `soup.select('a[href]')` |
| Get child elements | `soup.select('.class > a[href]')` |
| Get descendant elements | `soup.select('.class a[href]')` |

Change the attribute according to what you want to search for: `id`, `class`, `href`, `name`, `summary`, and so on. Put `>` between the selectors to get only child elements (one level down), and put a space between them to get descendant elements (every level below).
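The difference between the child combinator `>` and the descendant combinator (a space) is easiest to see side by side. A minimal sketch, assuming a made-up snippet with a `menu` class:

```python
from bs4 import BeautifulSoup

# Assumed sample: one direct child link, one nested link, one <a> without href
html = """
<div class="menu">
  <a href="/top">Top</a>
  <p><a href="/nested">Nested</a></p>
  <a>No link</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# '>' matches only <a href> elements directly under .menu
children = [a['href'] for a in soup.select('.menu > a[href]')]

# a space matches <a href> elements at any depth under .menu
descendants = [a['href'] for a in soup.select('.menu a[href]')]

# select_one returns the first match, or None when nothing matches
first = soup.select_one('.menu a[href]')['href']
missing = soup.select_one('.menu img')
```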

How to use Selenium

Preparations for using Selenium

On Colab, Selenium has to be installed first and the browser cannot display a UI, so the following setup is required.

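# Install the libraries needed to use Selenium
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium

from selenium import webdriver

# Settings for using the driver without a UI (headless)
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# chromedriver was copied to /usr/bin above, so it is found on the PATH
driver = webdriver.Chrome(options=options)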

When using Selenium and Beautiful Soup

Use this when Beautiful Soup alone cannot get the element: load the page with Selenium first, then extract the information you need with Beautiful Soup.

html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, 'html.parser')
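The hand-off from Selenium to Beautiful Soup can be wrapped in a small helper so that the parsing step is easy to try without a browser. This is only a sketch: `soup_from_driver` and `link_targets` are hypothetical names, and the only thing assumed about `driver` is that it exposes the rendered HTML as `page_source`, as a Selenium WebDriver does.

```python
from bs4 import BeautifulSoup


def soup_from_driver(driver, parser='html.parser'):
    """Parse the HTML the browser currently holds (driver.page_source)."""
    return BeautifulSoup(driver.page_source, parser)


def link_targets(soup):
    """Return the href of every <a> element that has one."""
    return [a['href'] for a in soup.select('a[href]')]
```

With a real driver, the call would be `link_targets(soup_from_driver(driver))` after `driver.get(...)` has loaded the page.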

Selenium basic code

The element lookups use `By`, which is imported with `from selenium.webdriver.common.by import By`. The older `find_element_by_*` methods were removed in Selenium 4.3.

| Description | Code example |
| --- | --- |
| Open a URL | `driver.get('URL')` |
| Go back one page | `driver.back()` |
| Go forward one page | `driver.forward()` |
| Reload the page | `driver.refresh()` |
| Get the current URL | `driver.current_url` |
| Get the current title | `driver.title` |
| Close the current window | `driver.close()` |
| Close all windows | `driver.quit()` |
| Get an element by class name | `driver.find_element(By.CLASS_NAME, 'classname')` |
| Get an element by ID | `driver.find_element(By.ID, 'id')` |
| Get an element by XPath | `driver.find_element(By.XPATH, 'xpath')` |
| Search by text with XPath | `driver.find_element(By.XPATH, '//*[text()="strings"]')` |
| Search by partial text with XPath | `driver.find_element(By.XPATH, '//*[contains(text(), "strings")]')` |
| Click an element | `driver.find_element(By.XPATH, 'XPATH').click()` |
| Enter text | `driver.find_element(By.ID, 'ID').send_keys('strings')` |
| Get text | `driver.find_element(By.ID, 'ID').text` |
| Get an attribute (here, href) | `driver.find_element(By.ID, 'ID').get_attribute('href')` |
| Check whether the element is displayed | `driver.find_element(By.XPATH, 'xpath').is_displayed()` |
| Check whether the element is enabled | `driver.find_element(By.XPATH, 'xpath').is_enabled()` |
| Check whether the element is selected | `driver.find_element(By.XPATH, 'xpath').is_selected()` |

When you want to select a dropdown

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

element = driver.find_element(By.XPATH, "xpath")
Select(element).select_by_index(indexnum)  # select by index
Select(element).select_by_value("value")  # select by the value attribute
Select(element).select_by_visible_text("text")  # select by the displayed text

List of XPath expressions

| Description | Code example |
| --- | --- |
| Select all elements | `//*` |
| Select all a elements | `//a` |
| Select an attribute | `@href` |
| Select several kinds of element (a or h2) | `//*[self::a or self::h2]` |
| Get an element by id | `//*[@id="id"]` |
| Get elements by class | `//*[@class="class"]` |
| Search by text | `//*[text()="strings"]` |
| Search by partial text | `//*[contains(text(), "strings")]` |
| Partial match on class | `//*[contains(@class, "class")]` |
| Get the next node | `/following-sibling::*[1]` |
| Get the second following a element | `/following-sibling::a[2]` |
| Get the previous node | `/preceding-sibling::*[1]` |
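The expressions above can be checked without starting a browser by evaluating them with lxml, the same library behind the `"lxml"` parser used earlier. This is a sketch on an assumed sample document. Selenium evaluates XPath in the browser, so results on a live page can differ if JavaScript changes the DOM.

```python
from lxml import html as lxml_html

# Assumed sample document
doc = lxml_html.fromstring("""
<div>
  <h2 id="title" class="head main">Title</h2>
  <a href="/one">first link</a>
  <a href="/two">second link</a>
  <p>plain text</p>
</div>
""")

# Get element by id
title_text = doc.xpath('//*[@id="title"]')[0].text

# Partial match on class ("head main" contains "head")
partial_class = [e.tag for e in doc.xpath('//*[contains(@class, "head")]')]

# Select an attribute of every a element
hrefs = doc.xpath('//a/@href')

# Partial text match
partial_text = [e.get('href') for e in doc.xpath('//a[contains(text(), "second")]')]

# Several kinds of element at once
a_or_h2 = [e.tag for e in doc.xpath('//*[self::a or self::h2]')]

# Second a element after the h2, and the node just before the p
second_a = doc.xpath('//h2/following-sibling::a[2]')[0].get('href')
before_p = doc.xpath('//p/preceding-sibling::*[1]')[0].get('href')
```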

Refer to here for how to get other nodes

When changing tabs

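Use this when a click opens a new tab instead of navigating in the current one.

handle_array = driver.window_handles  # handles of all open tabs
driver.switch_to.window(handle_array[-1])  # switch to the last handle, normally the new tab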

Wait until a specific element is displayed

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait until at least one element matching the locator is present (times out after 15 seconds).
# The condition has to be called with a locator; '//*' matches any element.
WebDriverWait(driver, 15).until(EC.presence_of_all_elements_located((By.XPATH, '//*')))

# Wait until the element with the specified ID is present (times out after 15 seconds)
WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.ID, 'ID name')))

# Wait until the element with the specified class name is present (times out after 15 seconds)
WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.CLASS_NAME, 'CLASS name')))

# Wait until the element specified by XPath is present (times out after 15 seconds)
WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.XPATH, 'xpath')))

What to do when you can't click

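from selenium.webdriver.common.by import By

target = driver.find_element(By.XPATH, 'xpath')
# Click through JavaScript when the normal click() is blocked, for example by an overlapping element
driver.execute_script("arguments[0].click();", target)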

How to use Pandas

How to create a data frame and add data

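import pandas as pd
columns = ['Item 1', 'Item 2', 'Item 3', 'Item 4', 'Item 5']
df = pd.DataFrame(columns=columns)

# Data acquisition process (data1 to data5 stand for the scraped values)

se = pd.Series([data1, data2, data3, data4, data5], index=columns)
# DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
df = pd.concat([df, se.to_frame().T], ignore_index=True)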
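When many rows are scraped in a loop, a common alternative is to collect them in a plain list and build the data frame once at the end, which avoids copying the frame on every iteration. A minimal sketch, assuming three columns and made-up values in place of real scraped data:

```python
import pandas as pd

columns = ['Item 1', 'Item 2', 'Item 3']

# Stand-in for the values obtained on each pass of the scraping loop
scraped_rows = [('apple', 100, 1.5), ('orange', 80, 2.5)]

rows = []
for values in scraped_rows:
    # One dict per row, keyed by column name
    rows.append(dict(zip(columns, values)))

# Build the data frame once, after the loop
df = pd.DataFrame(rows, columns=columns)
print(df)
```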

When downloading Pandas data

from google.colab import files

filename = 'filename.csv'
df.to_csv(filename, encoding='utf-8-sig')
files.download(filename)  # send the saved file to the browser as a download
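The `utf-8-sig` encoding writes a byte order mark (BOM) at the start of the file, which Excel uses to recognize the file as UTF-8, so Japanese text opens without garbling. The sketch below checks this on a temporary file. The sample data and `index=False` are assumptions made to keep the round trip simple.

```python
import os
import tempfile

import pandas as pd

# Assumed sample data containing non-ASCII text
df = pd.DataFrame({'Item 1': ['りんご', 'みかん'], 'Item 2': [100, 80]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'filename.csv')
    df.to_csv(path, encoding='utf-8-sig', index=False)

    # Read the raw bytes to look at the start of the file
    with open(path, 'rb') as f:
        raw = f.read()

    # Reading back with the same encoding strips the BOM again
    restored = pd.read_csv(path, encoding='utf-8-sig')

# The UTF-8 BOM is the three bytes EF BB BF
has_bom = raw.startswith(b'\xef\xbb\xbf')
```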

When saving Pandas data to My Drive

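from google.colab import drive

# Mount Google Drive first so that My Drive is visible under /content/drive
drive.mount('/content/drive')

filename = 'filename.csv'
path = '/content/drive/My Drive/' + filename

with open(path, 'w', encoding='utf-8-sig') as f:
    df.to_csv(f)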

How to work with spreadsheets

Preparations for working with spreadsheets

# Install the library needed to work with spreadsheets
!pip install gspread

from google.colab import auth
from google.auth import default
import gspread

# Authentication process
auth.authenticate_user()
# oauth2client is deprecated, so the credentials are taken from google.auth instead
creds, _ = default()
gc = gspread.authorize(creds)

Frequently used code

ss_id = 'Spreadsheet ID'
sht_name = 'Sheet name'

workbook = gc.open_by_key(ss_id)
worksheet = workbook.worksheet(sht_name)

#When acquiring data
worksheet.cell(2, 1).value

#When updating
worksheet.update_cell(row, column, 'Update contents')

gspread code list

Workbook operation

| Description | Code example |
| --- | --- |
| Select a spreadsheet by ID | `gc.open_by_key('ID')` |
| Select a spreadsheet by URL | `gc.open_by_url('URL')` |
| Get the spreadsheet title | `workbook.title` |
| Get the spreadsheet ID | `workbook.id` |

Sheet operation

| Description | Code example |
| --- | --- |
| Get a sheet by sheet name | `workbook.worksheet('Sheet name')` |
| Get a sheet by index | `workbook.get_worksheet(index)` |
| Get all sheets as a list | `workbook.worksheets()` |
| Get the sheet name | `worksheet.title` |
| Get the sheet ID | `worksheet.id` |

Cell manipulation

| Description | Code example |
| --- | --- |
| Get a value by A1 notation | `worksheet.acell('B1').value` |
| Get a value by row and column number | `worksheet.cell(1, 2).value` |
| Select a range of cells and get them as a one-dimensional list | `worksheet.range('A1:B10')` |
| Get the values of a row | `worksheet.row_values(1)` |
| Get the formulas of a row | `worksheet.row_values(1, value_render_option='FORMULA')` |
| Get the values of a column | `worksheet.col_values(1)` |
| Get the formulas of a column | `worksheet.col_values(1, value_render_option='FORMULA')` |
| Get all data | `worksheet.get_all_values()` |
| Update a cell value by A1 notation | `worksheet.update_acell('B1', 'Value to update')` |
| Update a cell value by row and column number | `worksheet.update_cell(1, 2, 'Value to update')` |
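The two addressing styles in the table refer to the same cells: `'B1'` in A1 notation is row 1, column 2. The following sketch shows the conversion in plain Python. `parse_a1` and `format_a1` are hypothetical helper names written for this illustration and are not part of gspread, which ships its own conversion helpers in `gspread.utils`.

```python
import re


def parse_a1(label):
    """Convert an A1 label such as 'B1' to a 1-based (row, col) tuple."""
    match = re.fullmatch(r'([A-Za-z]+)([1-9][0-9]*)', label)
    if match is None:
        raise ValueError(f'not an A1 label: {label!r}')
    letters, digits = match.groups()
    col = 0
    for ch in letters.upper():
        # Column letters behave like base-26 digits where A=1 ... Z=26
        col = col * 26 + (ord(ch) - ord('A') + 1)
    return int(digits), col


def format_a1(row, col):
    """Convert 1-based row and column numbers to an A1 label."""
    letters = ''
    while col > 0:
        col, remainder = divmod(col - 1, 26)
        letters = chr(ord('A') + remainder) + letters
    return f'{letters}{row}'
```

For example, `parse_a1('B1')` gives the arguments for `worksheet.cell(1, 2)`, and `format_a1(1, 2)` gives the label for `worksheet.acell('B1')`.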

