- How to use Beautiful Soup
- How to use Selenium
- How to use Pandas
- How to work with spreadsheets
- Regular-expression lookahead and lookbehind are covered in another article.

How to use Beautiful Soup

How to eliminate garbled characters

With requests, you would normally write the following:

from bs4 import BeautifulSoup
import requests

res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')

Some sites come out garbled with this approach. Specifying the parser and the encoding as follows eliminates most of the garbled characters.

from bs4 import BeautifulSoup
import requests

res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml", from_encoding='utf-8')
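
As a small offline illustration of why the encoding matters, the sketch below decodes the same bytes with the wrong codec and then parses them with from_encoding. The content variable is a hypothetical stand-in for res.content, and html.parser is used instead of lxml only so that the sketch has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for res.content: UTF-8 bytes of a Japanese page.
content = '<html><body><p>スクレイピング</p></body></html>'.encode('utf-8')

# Decoding with the wrong codec is what produces garbled characters.
garbled = content.decode('latin-1')

# Telling Beautiful Soup the encoding up front avoids the guesswork.
soup = BeautifulSoup(content, 'html.parser', from_encoding='utf-8')
text = soup.p.get_text()

print(garbled)
print(text)
```

If a site is not UTF-8, the same argument takes that site's encoding instead.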

Find code list

Description	Code example
Find the first match	soup.find('li')
Find all matching tags	soup.find_all('li')
Attribute search	soup.find('a', href='http://www.google.com/')
Search several tag names at once	soup.find_all(['a','p'])
id search	soup.find('a', id="first")
class search	soup.find('a', class_="first")
Get an attribute value	first_link_element['href']
Text search	soup.find('dt', string='Search word')
Search for partial text matches	soup.find('dt', string=re.compile('Search word'))
Get the parent element	.parent
Get the next sibling	.next_sibling
Get all following siblings	.next_siblings
Get the previous sibling	.previous_sibling
Get all previous siblings	.previous_siblings
Get the text	.string
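
The calls in the table can be tried on a small inline document. This is a minimal sketch: the HTML is made up for illustration, and it is written without whitespace between tags so that .next_sibling returns a tag rather than a whitespace text node.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML for illustration; no whitespace between tags on purpose.
html = (
    '<ul>'
    '<li><a id="first" class="link" href="http://www.google.com/">Google</a></li>'
    '<li><a class="link" href="http://example.com/">Example</a></li>'
    '</ul>'
    '<dl><dt>Search word</dt><dd>definition</dd></dl>'
)
soup = BeautifulSoup(html, 'html.parser')

first_link = soup.find('a', id='first')            # id search
all_links = soup.find_all('a', class_='link')      # class search, all matches
href = first_link['href']                          # attribute value
parent_name = first_link.parent.name               # parent element
dt = soup.find('dt', string=re.compile('Search'))  # partial text match
dd = dt.next_sibling                               # next sibling of the dt

print(href, parent_name, len(all_links), dd.string)
```

On real pages there is usually whitespace between tags, in which case .next_sibling may return a text node; find_next_sibling() skips over those.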

Select code list

Description	Code example
Find the first match	soup.select_one('css selector')
Find all matches	soup.select('css selector')

List of selector specification methods

Description	Code example
id search	soup.select('a#id')
class search	soup.select('a.class')
Multiple search for class	soup.select('a.class1.class2')
Attribute search 1	soup.select('a[class="class"]')
Attribute search 2	soup.select('a[href="http://www.google.com"]')
Attribute search 3	soup.select('a[href]')
Get child elements	soup.select('.class > a[href]')
Get descendant elements	soup.select('.class a[href]')

Change the attribute name according to what you want to search for: id, class, href, name, summary, and so on. Insert > if you want only child elements (one level down), and a space if you want descendant elements (all levels below).
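
The difference between the child combinator and the descendant combinator is easiest to see side by side. The following is a minimal sketch on made-up HTML; the class name menu and the paths are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up HTML: one direct child link, one nested link, one link without href.
html = (
    '<div class="menu">'
    '<a href="/top">Top</a>'
    '<p><a href="/nested">Nested</a></p>'
    '<a>No href</a>'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')

# '>' keeps only links that are direct children of .menu
child_links = [a['href'] for a in soup.select('.menu > a[href]')]

# a space keeps links at any depth below .menu
descendant_links = [a['href'] for a in soup.select('.menu a[href]')]

# select_one returns the first match only
nested_text = soup.select_one('a[href="/nested"]').get_text()

print(child_links, descendant_links, nested_text)
```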

How to use Selenium

Preparations for using Selenium

On Colab, Selenium is not installed by default and a browser with a UI cannot be used, so the following setup is required.

#Download the libraries needed to use Selenium
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium

from selenium import webdriver

# Settings for using the driver without a UI
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# chromedriver was copied to /usr/bin above, so it is found on PATH
driver = webdriver.Chrome(options=options)
driver.implicitly_wait(10)

Using Selenium together with Beautiful Soup

A typical use case: when an element cannot be obtained with Beautiful Soup alone, load the page with Selenium first and then extract the necessary information with Beautiful Soup.

driver.get(url)
html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, 'html.parser')
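
The same pattern can be wrapped in a small helper. This is a sketch only: page_to_soup and FakeDriver are hypothetical names, and FakeDriver exists purely so the example runs without a browser. With a real WebDriver the helper is called the same way.

```python
from bs4 import BeautifulSoup


def page_to_soup(driver, url):
    """Load url with Selenium and return the rendered page as a BeautifulSoup object."""
    driver.get(url)
    # page_source is already a str, so it can be passed to Beautiful Soup directly.
    return BeautifulSoup(driver.page_source, 'html.parser')


class FakeDriver:
    """Hypothetical stand-in for a WebDriver so the sketch runs without a browser."""
    page_source = '<html><body><h1 id="title">Rendered</h1></body></html>'

    def get(self, url):
        self.current_url = url


fake = FakeDriver()
soup = page_to_soup(fake, 'http://example.com/')
title = soup.find('h1', id='title').string
print(title)
```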

Selenium basic code

The find_element_by_* helpers were removed in Selenium 4.3. The locator examples below use driver.find_element with By, which needs from selenium.webdriver.common.by import By.

Description	Code example
Open a URL	driver.get('URL')
Go back one page	driver.back()
Go forward one page	driver.forward()
Reload the page	driver.refresh()
Get the current URL	driver.current_url
Get the current title	driver.title
Close the current window	driver.close()
Close all windows	driver.quit()
Get an element by class name	driver.find_element(By.CLASS_NAME, 'classname')
Get an element by ID	driver.find_element(By.ID, 'id')
Get an element by XPath	driver.find_element(By.XPATH, 'xpath')
Text search with XPath	driver.find_element(By.XPATH, '//*[text()="strings"]')
Partial text match with XPath	driver.find_element(By.XPATH, '//*[contains(text(), "strings")]')
Click an element	driver.find_element(By.XPATH, 'XPATH').click()
Text input	driver.find_element(By.ID, 'ID').send_keys('strings')
Get text	driver.find_element(By.ID, 'ID').text
Get an attribute (href in this case)	driver.find_element(By.ID, 'ID').get_attribute('href')
Determine if the element is displayed	driver.find_element(By.XPATH, 'xpath').is_displayed()
Determine if the element is enabled	driver.find_element(By.XPATH, 'xpath').is_enabled()
Determine if the element is selected	driver.find_element(By.XPATH, 'xpath').is_selected()

Selecting from a dropdown

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

element = driver.find_element(By.XPATH, "xpath")
Select(element).select_by_index(indexnum) # Select by index
Select(element).select_by_value("value") # Select by the value attribute
Select(element).select_by_visible_text("text") # Select by the displayed text

List of Xpath specification methods

Description	Code example
Select all elements	//*
Select all a elements	//a
Select an attribute	@href
Select multiple element types	//*[self::a or self::h2]
Get element by id	//*[@id="id"]
Get elements by class	//*[@class="class"]
Text search	//*[text()="strings"]
Partial text match	//*[contains(text(), "strings")]
Partial match of class	//*[contains(@class, "class")]
Get the next node	/following-sibling::*[1]
Second a element after	/following-sibling::a[2]
Get the previous node	/preceding-sibling::*[1]
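
XPath expressions like these can be checked offline before handing them to Selenium. The sketch below uses lxml, which is an assumption of this example (the article only uses lxml as a Beautiful Soup parser), and the HTML is made up.

```python
from lxml import html as lxml_html

# Made-up HTML fragment for trying the expressions.
doc = lxml_html.fromstring(
    '<div>'
    '<h2 id="title" class="head main">News</h2>'
    '<a href="/one">first link</a>'
    '<a href="/two">second link</a>'
    '<a href="/three">third</a>'
    '</div>'
)

by_id = doc.xpath('//*[@id="title"]')                      # element by id
by_class_part = doc.xpath('//*[contains(@class, "main")]')  # partial class match
by_text_part = doc.xpath('//a[contains(text(), "link")]')   # partial text match
hrefs = doc.xpath('//a/@href')                             # attribute values
second_a = doc.xpath('//h2/following-sibling::a[2]')       # second a after the h2
a_or_h2 = doc.xpath('//*[self::a or self::h2]')            # several element types

print(list(hrefs), second_a[0].get('href'), len(a_or_h2))
```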

See an XPath axes reference for how to get other nodes.

When changing tabs

Use this when a click opens a new tab instead of navigating in the current one.

handle_array = driver.window_handles
driver.switch_to.window(handle_array[1])
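
Indexing with [1] assumes exactly one extra tab. A slightly more defensive variant is sketched below: it switches to the first handle that is not the current one. The helper name switch_to_new_tab and the Fake classes are hypothetical; the Fake classes only mimic the window_handles, current_window_handle and switch_to.window members so the sketch runs without a browser.

```python
def switch_to_new_tab(driver):
    """Switch to the first window handle that is not the current one."""
    current = driver.current_window_handle
    for handle in driver.window_handles:
        if handle != current:
            driver.switch_to.window(handle)
            return handle
    return None  # no other tab is open


class FakeSwitchTo:
    """Hypothetical stand-in for driver.switch_to."""
    def __init__(self, driver):
        self._driver = driver

    def window(self, handle):
        self._driver.current_window_handle = handle


class FakeDriver:
    """Hypothetical stand-in for a WebDriver with some open tabs."""
    def __init__(self, handles):
        self.window_handles = handles
        self.current_window_handle = handles[0]
        self.switch_to = FakeSwitchTo(self)


two_tabs = FakeDriver(['tab-a', 'tab-b'])
switched = switch_to_new_tab(two_tabs)
one_tab = FakeDriver(['tab-a'])
not_switched = switch_to_new_tab(one_tab)
print(switched, not_switched)
```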

Wait until a specific element is displayed


from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait until at least one element matching the locator is present (here the body tag); times out after 15 seconds.
# The expected condition must be called with a locator, otherwise until() returns immediately.
WebDriverWait(driver, 15).until(EC.presence_of_all_elements_located((By.TAG_NAME, 'body')))

# Wait until the element with the specified ID is present (times out after 15 seconds)
WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.ID, 'ID name')))

# Wait until the element with the specified class name is present (times out after 15 seconds)
WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.CLASS_NAME, 'CLASS name')))

# Wait until the element specified by XPath is present (times out after 15 seconds)
WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.XPATH, 'xpath')))

What to do when you can't click

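from selenium.webdriver.common.by import By

# Click through JavaScript when a normal .click() fails (for example, when the element is covered)
target = driver.find_element(By.XPATH, 'xpath')
driver.execute_script("arguments[0].click();", target)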

How to use Pandas

How to create a data frame and add data

import pandas as pd

columns = ['Item 1', 'Item 2', 'Item 3', 'Item 4', 'Item 5']
df = pd.DataFrame(columns=columns)

# Data acquisition process (data1 ... data5 are the scraped values)

# Add one row at the end; DataFrame.append was removed in pandas 2.0
df.loc[len(df)] = [data1, data2, data3, data4, data5]
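
When many rows are scraped in a loop, an alternative is to collect them in a list and build the data frame once at the end. This is a sketch under that assumption; the loop body is a placeholder for the real data acquisition process, and only three columns are used to keep it short.

```python
import pandas as pd

columns = ['Item 1', 'Item 2', 'Item 3']

rows = []
for i in range(3):
    # Placeholder for the data acquisition process: one dict per scraped record.
    rows.append({'Item 1': i, 'Item 2': i * 2, 'Item 3': f'row {i}'})

# Build the data frame in one step instead of growing it row by row.
df = pd.DataFrame(rows, columns=columns)
print(df)
```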

When downloading Pandas data

from google.colab import files

filename = 'filename.csv'
df.to_csv(filename, encoding = 'utf-8-sig') 
files.download(filename)
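
The utf-8-sig encoding writes a byte order mark (BOM) at the start of the file, which helps Excel recognize the CSV as UTF-8. The sketch below checks this on a temporary file; the sample data and the temporary path are made up, and index=False is an addition that leaves out the index column.

```python
import os
import tempfile

import pandas as pd

# Made-up sample data containing non-ASCII text.
df = pd.DataFrame({'Item 1': ['日本語'], 'Item 2': [1]})

path = os.path.join(tempfile.mkdtemp(), 'filename.csv')
df.to_csv(path, encoding='utf-8-sig', index=False)

# The first three bytes of the file are the UTF-8 BOM.
with open(path, 'rb') as f:
    head = f.read(3)

# Reading back with the same encoding strips the BOM again.
restored = pd.read_csv(path, encoding='utf-8-sig')
print(head, restored.columns.tolist())
```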

When saving Pandas data to My Drive

from google.colab import drive

# Mount Google Drive so that /content/drive is available
drive.mount('/content/drive')

filename = 'filename.csv'
path = '/content/drive/My Drive/' + filename

with open(path, 'w', encoding = 'utf-8-sig') as f:
  df.to_csv(f)

How to work with spreadsheets

Preparations for working with spreadsheets

# Install the library needed to work with spreadsheets
!pip install gspread

from google.colab import auth
from google.auth import default
import gspread

# Authentication process
auth.authenticate_user()
# oauth2client is deprecated, so get google-auth credentials instead
creds, _ = default()
gc = gspread.authorize(creds)

Frequently used code

ss_id = 'Spreadsheet ID'
sht_name = 'Sheet name'

workbook = gc.open_by_key(ss_id)
worksheet = workbook.worksheet(sht_name)

#When acquiring data
worksheet.acell('B1').value
worksheet.cell(2, 1).value

#When updating
worksheet.update_cell(row, column, 'Update contents')
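
Sheet data often ends up in Pandas. The sketch below assumes the list-of-lists shape that worksheet.get_all_values() returns, with every cell as a string and the first row as the header. The values list is made-up sample data, so no spreadsheet access is needed to run it.

```python
import pandas as pd

# Made-up sample in the shape returned by worksheet.get_all_values().
values = [
    ['name', 'price'],
    ['apple', '100'],
    ['orange', '80'],
]

# Treat the first row as the header and the rest as data rows.
header, *rows = values
df = pd.DataFrame(rows, columns=header)

# Cells arrive as strings, so convert numeric columns explicitly.
df['price'] = pd.to_numeric(df['price'])
print(df)
```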

gspread code list

Workbook operation

Description	Code example
Spreadsheet selection by ID	gc.open_by_key('ID')
Spreadsheet selection by URL	gc.open_by_url('URL')
Get Spreadsheet Title	workbook.title
Get Spreadsheet ID	workbook.id

Sheet operation

Description	Code example
Get sheet by sheet name	workbook.worksheet('Sheet name')
Get a sheet with Index	workbook.get_worksheet(index)
Get all sheets in an array	workbook.worksheets()
Get sheet name	worksheet.title
Get sheet ID	worksheet.id

Cell manipulation

Description	Code example
Data acquisition by A1 method	worksheet.acell('B1').value
Data acquisition by R1C1 method	worksheet.cell(1, 2).value
Select multiple cells and get them as a one-dimensional list of cells	worksheet.range('A1:B10')
Get the values of a row	worksheet.row_values(1)
Get the formulas of a row	worksheet.row_values(1, value_render_option='FORMULA')
Get the values of a column	worksheet.col_values(1)
Get the formulas of a column	worksheet.col_values(1, value_render_option='FORMULA')
Get all data	worksheet.get_all_values()
Update cell values with A1 method	worksheet.update_acell('B1','Value to update')
Update cell value with R1C1 method	worksheet.update_cell(1,2,'Value to update')
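
The A1 method and the row and column method address the same cells: acell('B1') and cell(1, 2) both point at row 1, column 2. The conversion can be sketched in plain Python as below. The function a1_to_row_col is a hypothetical helper written for illustration; gspread has its own utilities for this in gspread.utils.

```python
import re


def a1_to_row_col(label):
    """Convert an A1-style label such as 'B1' into a (row, column) pair, both 1-based."""
    match = re.fullmatch(r'([A-Za-z]+)([1-9][0-9]*)', label)
    if match is None:
        raise ValueError(f'not an A1 label: {label!r}')
    letters, digits = match.groups()
    col = 0
    for ch in letters.upper():
        # Column letters work like a base-26 number where A=1 ... Z=26.
        col = col * 26 + (ord(ch) - ord('A') + 1)
    return int(digits), col


print(a1_to_row_col('B1'))
print(a1_to_row_col('AA10'))
```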


[PYTHON] Cheat sheet when scraping with Google Colaboratory (Colab)
