[Python] Scraping a table using Beautiful Soup

For tables that are updated frequently, or tables that are hard to copy and paste, I wondered whether I could make data collection more efficient. This time I wrote code that scrapes a table with Python and writes it to CSV.

Environment

MacBook Air (13-inch, Mid 2011)
Processor: 1.8 GHz Intel Core i7
Memory: 4 GB 1333 MHz DDR3
OS X: 10.11.5
Python: 3.6.2

Preparation

Install Beautiful Soup, a library for extracting data from HTML and XML documents.

This time I installed it using pip.

$ pip3 install beautifulsoup4
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.6.0-py3-none-any.whl (86kB)
    100% |████████████████████████████████| 92kB 1.8MB/s 
Installing collected packages: beautifulsoup4
Successfully installed beautifulsoup4-4.6.0

Other options include easy_install, apt-get, and downloading the source and installing it directly. For details, see "Installing Beautiful Soup" in the official documentation:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/
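
To confirm the install worked before moving on, a quick check like the following (a minimal sketch, not from the original article) is enough:

from bs4 import BeautifulSoup

# Parse a tiny HTML snippet to confirm BeautifulSoup imports and parses correctly
soup = BeautifulSoup("<table><tr><td>hello</td></tr></table>", "html.parser")
print(soup.td.get_text())  # prints: hello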

Try scraping table elements

Once beautifulsoup4 is installed, let's grab O'Reilly Japan's new e-book listing in one go.

**2019/03/20 update**: the output file is now opened using a with statement.

scraping_table.py


import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Skip SSL certificate verification (works around certificate errors in some environments)
ssl._create_default_https_context = ssl._create_unverified_context

# Fetch the page and parse it
html = urlopen("https://www.oreilly.co.jp/ebook/")
bsObj = BeautifulSoup(html, "html.parser")

# Select the first table with class "tablesorter" and collect its rows
table = bsObj.findAll("table", {"class": "tablesorter"})[0]
rows = table.findAll("tr")

with open("ebooks.csv", "w", encoding='utf-8') as file:
    writer = csv.writer(file)
    for row in rows:
        csvRow = []
        for cell in row.findAll(['td', 'th']):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow)
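
The script can then be run like any other Python file; it writes ebooks.csv to the current directory:

$ python3 scraping_table.py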

The exported CSV looks like this. If you run this regularly, you won't miss any new releases! Note that because the cells are read with get_text(), the image link in the "Add to cart" column comes out empty (a sketch for pulling the link itself follows the CSV sample below).

ISBN,Title,price,Issue month,add to cart
978-4-87311-755-3,Design design to improve performance,"2,073",2016/06,
978-4-87311-700-3,Network security through data analysis,"3,110",2016/06,
978-4-87311-754-6,UX strategy,"2,592",2016/05,
978-4-87311-768-3,An introduction to mathematics starting with Python,"2,419",2016/05,
978-4-87311-767-6,What is the software doing without your knowledge?,"2,246",2016/05,
978-4-87311-763-8,Fermentation technique,"3,110",2016/04,
978-4-87311-765-2,First Ansible,"2,764",2016/04,
978-4-87311-764-5,Kanban work technique,"3,110",2016/03,
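
If you also want the link URL from a cell instead of its empty text, one approach is to look for an a tag and read its href. This is a sketch of a modified write block, and the assumption that the cart cell wraps its link in an <a> tag should be checked against the actual page markup:

with open("ebooks.csv", "w", encoding='utf-8') as file:
    writer = csv.writer(file)
    for row in rows:
        csvRow = []
        for cell in row.findAll(['td', 'th']):
            link = cell.find("a")  # assumption: the cell contains an <a> tag with an href
            if link is not None and link.get("href"):
                csvRow.append(link["href"])
            else:
                csvRow.append(cell.get_text())
        writer.writerow(csvRow)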

How to apply to other sites

Basically, you can reuse this for tables on other sites by modifying the following part of the code.

  1. Change the class name to that of the table you want to get.
  2. If the page contains multiple tables with the same class name, pick the one you want by changing the index in [] (an example follows the snippet below).
# Select the table and collect its rows
table = bsObj.findAll("table", {"class": "tablesorter"})[0]
rows = table.findAll("tr")
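
For example, if the target page had two tables with a hypothetical class name price-list (the class name and index here are made up for illustration), the change would look like this:

# Select the second table with class "price-list" (index 1)
table = bsObj.findAll("table", {"class": "price-list"})[1]
rows = table.findAll("tr")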

About CSV

I'm using a Mac, so the exported CSV is UTF-8. If you open it in Excel as-is, the characters will be garbled, so it is handy to convert the character encoding and reformat it first. For how to convert, see this page (another site): http://help.peatix.com/customer/portal/articles/530797-%E3%83%80%E3%82%A6%E3%83%B3%E3%83%AD%E3%83%BC%E3%83%89%E3%81%97%E3%81%9F%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E3%81%AE%E6%96%87%E5%AD%97%E5%8C%96%E3%81%91%E3%81%AB%E3%81%A4%E3%81%84%E3%81%A6-for-mac
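
Alternatively, if you would rather have Excel open the file directly, one option (not part of the original script, just a sketch) is to write the CSV with a BOM using the utf-8-sig encoding, which Excel detects reliably:

# Writing with a BOM so Excel recognizes the file as UTF-8
with open("ebooks.csv", "w", encoding="utf-8-sig") as file:
    writer = csv.writer(file)
    for row in rows:
        writer.writerow([cell.get_text() for cell in row.findAll(['td', 'th'])])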
