[Scraping] Python scraping

Environment

Linux Ubuntu Xfce

Reference

  • Web scraping with Python
  • Python Crawling & Scraping: Practical Development Guide for Data Collection and Analysis
  • Practice Selenium WebDriver

Tool

Chrome


curl -sS https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" | sudo tee -a /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable

Other


sudo apt install chromium-chromedriver liblzma-dev \
&& pip install beautifulsoup4 selenium pandas

Basic

What you can do with bs4

bs4 provides a wide range of methods, and if you make full use of them together with regular expressions (re), **there is almost nothing you cannot extract**.
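
As one illustration of combining the two, find_all() accepts a compiled regular expression as a filter. A minimal sketch (the URLs here are made up for the example):


import re
from bs4 import BeautifulSoup

html_doc = '''
<a href="http://example.com/old">old</a>
<a href="https://example.com/new">new</a>
'''
soup = BeautifulSoup(html_doc, 'lxml')

# a compiled pattern is matched against the attribute value
for a in soup.find_all('a', href=re.compile(r'^https://')):
    print(a['href'])  # https://example.com/new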

Use lxml for parsing

It is the fastest parser and supports the widest range of CSS selectors.

from bs4 import BeautifulSoup

html_doc = '<html>...</html>'
soup = BeautifulSoup(html_doc, 'lxml')

Be sure to call close() and quit() after execution

If you do not, leftover browser processes will pile up.

from selenium import webdriver

driver = webdriver.Chrome()

# quit the driver: close() closes the current window, quit() ends the session and the driver process
driver.close()
driver.quit()
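
One way to guarantee that quit() runs even when the scraping code raises is a try/finally block — a minimal sketch:


from selenium import webdriver

driver = webdriver.Chrome()
try:
    driver.get('https://www.example.com/')
    # ... scraping work ...
finally:
    # quit() also terminates the chromedriver process
    driver.quit()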

Drive the page with Selenium, then hand the HTML source over to bs4

After the handoff, go treasure hunting with bs4.

from bs4 import BeautifulSoup
from selenium.webdriver import Chrome, ChromeOptions

options = ChromeOptions()
options.add_argument('--headless')  # windowless mode
driver = Chrome(options=options)
url = 'https://www.example.com/'
driver.get(url)

#Start operation of Selenium
...
   ...
      ...
#End of Selenium operation

html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, "lxml")

#BS4 processing started
...
   ...
      ...
#BS4 processing finished
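
A concrete version of the skeleton above — a sketch assuming chromedriver is on the PATH and the page has an h1 element:


from bs4 import BeautifulSoup
from selenium.webdriver import Chrome, ChromeOptions

options = ChromeOptions()
options.add_argument('--headless')
driver = Chrome(options=options)
try:
    driver.get('https://www.example.com/')
    html = driver.page_source
finally:
    driver.quit()

soup = BeautifulSoup(html, 'lxml')
print(soup.h1.text)  # the page's top-level heading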

You do not need the find method when the HTML has only a few tags

Search by tag name directly as an attribute of the BeautifulSoup object

When there are only a few tags, like this:


from bs4 import BeautifulSoup

html_doc = '''
<html>
    <head>
        <title>hello soup</title>
    </head>
    <body>
        <p class="my-story">my story</p>
    </body>
</html>
'''
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.title)
print(soup.title.text)
print(soup.p)
print(soup.p['class'])
print(soup.p.text)

Execution result


<title>hello soup</title>
hello soup
<p class="my-story">my story</p>
['my-story']
my story

Know the 4 objects in bs4

BeautifulSoup has four object types: Tag, NavigableString, BeautifulSoup, and Comment.

Of these, the two used most often are BeautifulSoup and Tag.

BeautifulSoup and Tag objects

  • BeautifulSoup: the whole HTML source converted into a Python-friendly tree structure
  • Tag: created when you access or search for a specific element on a BeautifulSoup object
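
Checking type() makes the distinction visible — a minimal sketch:


from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="x">hi</p>', 'lxml')

print(type(soup))    # <class 'bs4.BeautifulSoup'>
print(type(soup.p))  # <class 'bs4.element.Tag'>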

Understand the difference between find and find_all

You can search for almost anything with the find and find_all methods of a BeautifulSoup object, but to search effectively you need to know what each method returns.

**Object returned by each method**

  • find: bs4.element.Tag
  • find_all: bs4.element.ResultSet

**Return value when nothing is found**

  • find: None
  • find_all: [] (an empty list)
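
A short sketch of the not-found behavior, which is why a find result needs a None check before chaining:


from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><body><p>hi</p></body></html>', 'lxml')

print(soup.find('table'))      # None
print(soup.find_all('table'))  # []

# guard against None before calling Tag methods
table = soup.find('table')
if table is not None:
    print(table.text)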

bs4.element.Tag

You can think of it as what is produced by bs4 searches other than the find_all method, the select method, and the BeautifulSoup constructor.

from bs4 import BeautifulSoup

html_doc = '''
<html>
    <head>
        <title>hello soup</title>
    </head>
    <body>
        <p class="my-story">my story</p>
        <a class='brother' href='http://example.com/1' id='link1'>Link 1</a>
        <a class='brother' href='http://example.com/2' id='link2'>Link 2</a>
        <a class='brother' href='http://example.com/3' id='link3'>Link 3</a>
    </body>
</html>
'''

soup = BeautifulSoup(html_doc, 'lxml')

print('tag1')
tag1 = soup.find('a')
print(tag1)
print(type(tag1))


print('tag2')
tag2 = soup.a
print(tag2)
print(type(tag2))

bs4.element.ResultSet

Generated by the find_all and select methods.

Think of it as a list packed with bs4.element.Tag objects (**this mental image is quite important**).

Image of bs4.element.ResultSet:


bs4.element.ResultSet = [bs4.element.Tag, bs4.element.Tag, bs4.element.Tag,...]

Because of this, a ResultSet cannot be searched directly; take an element out of the list first. Once extracted, an element supports the same methods as bs4.element.Tag above.

  • When you get a "this method cannot be used" error, it is almost always because you called a bs4.element.Tag method on a bs4.element.ResultSet.

from bs4 import BeautifulSoup

html_doc = '''
<html>
    <head>
        <title>hello soup</title>
    </head>
    <body>
        <p class="my-story">my story</p>
        <a class='brother' href='http://example.com/1' id='link1'>Link 1</a>
        <a class='brother' href='http://example.com/2' id='link2'>Link 2</a>
        <a class='brother' href='http://example.com/3' id='link3'>Link 3</a>
    </body>
</html>
'''

soup = BeautifulSoup(html_doc, 'lxml')

print('tag3')
tag3 = soup.select('a:nth-of-type(2)') # select the second a tag among its siblings
print(tag3)
print(type(tag3))

print('tag4')
tag4 = soup.select('.brother') # search by CSS class
print(tag4)
print(type(tag4))

print('tag5')
tag5 = soup.select('a[href]') # select a tags that have an href attribute
print(tag5)
print(type(tag5))
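
To actually use Tag methods on the results, index into or loop over the ResultSet. A short sketch, continuing from the soup above:


links = soup.select('a')  # bs4.element.ResultSet

first = links[0]          # bs4.element.Tag
print(first['href'], first.text)

# iterating yields one Tag at a time
for a in links:
    print(a.get('href'))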

Tips

Loosen the print output limit

With the default settings, printing something large raises the error `IOPub data rate exceeded.`, so raise the limit.

Create configuration file


jupyter notebook --generate-config

python:~/.jupyter/jupyter_notebook_config.py


# before change: 1000000 -> after change: 1e10
c.NotebookApp.iopub_data_rate_limit = 1e10

The same setting can also be passed on the command line:


jupyter notebook --NotebookApp.iopub_data_rate_limit=1e10

Read and write at high speed with pickle

It is fast because it reads and writes in binary format (the 'b' in the file mode means binary).

The joblib library offers the same functionality; it is the better choice when you want to shrink the file size at the expense of speed.
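
For comparison, a minimal joblib sketch (assuming joblib is installed; compress trades speed for file size):


import joblib

example = {'key': 'value'}

# compress=3 yields a smaller file than plain pickle, at some speed cost
joblib.dump(example, 'example.joblib', compress=3)
restored = joblib.load('example.joblib')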

Write (dump)


import pickle

example = 'example'

with open('example.pickle', 'wb') as f:
    pickle.dump(example, f)

Read (load)


with open('example.pickle', 'rb') as f:
    example = pickle.load(f)

Working around the error when pickling something other than a string

Trying to pickle a bs4 object (bs4.BeautifulSoup, etc.) raises the error "maximum recursion depth exceeded while pickling an object", so convert it to a string before saving.

dump


import pickle

# example stands for a bs4 object such as a BeautifulSoup instance
with open('example.pickle', 'wb') as f:
    pickle.dump(str(example), f)  # str() avoids the recursion error

load


from bs4 import BeautifulSoup

with open('example.pickle', 'rb') as f:
    example = BeautifulSoup(pickle.load(f), 'lxml')

What comes back from load is a plain str, which bs4 cannot work with directly, so re-parse it into a bs4 object when reading.

**If the above method does not work**

If a dict will not pickle, dumping it with json instead works well.

dump


import json

example = {'key': 'value'}

with open('example.json', 'w') as f:
    json.dump(example, f)

load


with open('example.json', 'r') as f:
    example = json.load(f)

Jupyter Notebook

Maximize cell width

When viewing a pandas DataFrame, the default cell width cuts off long text, so set the cell width to the maximum.

css:~/.jupyter/custom/custom.css


.container { width:100% !important; }

Measure processing time

Use %time, which is available only in the Jupyter (IPython) environment. It is built in, so no import is required.

How to use


%time example_function()

Regular expressions

Get the characters before and after the slash in the URL

When you want the `scraping` part of https://www.example.com/topics/scraping, split on '/' and take the last element.

code


url = 'https://www.example.com/topics/scraping'

print(url.split('/'))
#['https:', '', 'www.example.com', 'topics', 'scraping']
print(url.split('/')[-1])
#scraping
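
Since this section is headed "Regular expressions", here is the same extraction done with re for comparison — a minimal sketch:


import re

url = 'https://www.example.com/topics/scraping'

# capture the segments before and after the final slash
m = re.search(r'/([^/]+)/([^/]+)$', url)
if m:
    print(m.group(1))  # topics
    print(m.group(2))  # scraping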

Pandas

Pandas UserWarning: Could not import the lzma module. Your installed Python is incomplete

This warning appears when a library that pandas depends on is missing. Install it with:

sudo apt install liblzma-dev

(If you built Python yourself, e.g. with pyenv, you may need to rebuild it after installing the library.)

Extract Column of DataFrame and make it a list

  1. Put the Column you want to retrieve into a Series
  2. Use the Series tolist method


import pandas as pd

df = pd.DataFrame({'Column1': ['a1', 'a2', 'a3'], 'Column2': ['b1', 'b2', 'b3'], 'Column3': ['c1', 'c2', 'c3']}, index=[1, 2, 3])

# Extract Column3 and make it a list
col3 = pd.Series(df['Column3']).tolist()  # ['c1', 'c2', 'c3']

Left justify the output result of DataFrame

The default is right-justified, which makes URLs and English text hard to read.

df.style.set_properties(**{'text-align': 'left'})  # left-justify; returns a Styler for display, df itself is unchanged
