Introduction

I wanted to scrape text data from a local HTML file and tried various approaches. The Python library Beautiful Soup turned out to be very convenient, so I will share how to use it and how to output the results to a CSV file.

Development environment

pyenv: 1.2.15
python: 3.6.5
Beautiful Soup: 4.4.0
VSCode: 1.41.1

Setting up the Python environment

To set up the environment, I referred to the following Progate lesson: Prepare a Python development environment! (Mac)

What is Beautiful Soup?

Beautiful Soup is a Python library that lets you scrape data out of HTML based on HTML tags and CSS selectors.
Official reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Japanese translation of the reference (ver 3.0): https://tdoc.info/beautifulsoup/
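As a rough illustration of the two lookup styles mentioned above, the sketch below parses a small inline HTML string and fetches elements once by tag name with `find_all` and once by CSS selector with `select`. The HTML snippet and the variable names are made up for this example; they are not the sample file used later in this article.

```python
import bs4

# Hypothetical HTML snippet, used only for this illustration
html = """
<ul class="menu">
  <li class="item"><a href="aaa">first</a></li>
  <li class="item"><a href="bbb">second</a></li>
</ul>
<p><a href="ccc">third</a></p>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# Look up by tag name: every <a> element in the document
by_tag = [a.get_text() for a in soup.find_all("a")]

# Look up by CSS selector: only <a> elements inside <li class="item">
by_selector = [a.get_text() for a in soup.select("li.item a")]

print(by_tag)
print(by_selector)
```

The tag-name lookup returns all three links, while the selector narrows the result to the two links inside the list items.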

Installing Beautiful Soup

Install it using pip. I referred to the following articles.

[Introduction to Python] What is pip? An easy-to-understand explanation of how to use it!
Let's scrape with Python's Beautiful Soup

You can install it with the following command.

pip install beautifulsoup4

Preparing the HTML file

Prepare the HTML file to be scraped locally. The following is a sample file.

`/sample_file/sample.html`


<!--~abridgement~-->
<div>
  <ul class="sample">
    <li class="sample">
      <a href="aaa">aaaaaa</a>
    </li>
    <li class="sample">
      <a href="bbb">bbbbb</a>
    </li>
  </ul>
  <div class="sample">
    <a href="ccc">ccc</a>
  </div>
  <div class="sample">
    <div class="sample">
        <a href="ddd">ddddd</a>
    </div>
  </div>
</div>
<!--~abridgement~-->

Scraping work

Creating the Python file

Next, create the Python program. Put it in the same directory as the HTML file.

`/sample_file/script.py`


import csv  # standard-library module for reading and writing CSV files

import bs4

# Create the soup from the local HTML file to be scraped
with open('sample.html') as html_file:
    soup = bs4.BeautifulSoup(html_file, 'html.parser')

links = soup.find_all('a')  # get all <a> elements

csvlist = []  # list that will hold the text of each link

for link in links:  # store the text of each <a> element in the list
    sample_txt = link.text
    csvlist.append(sample_txt)

# Open the CSV file for writing; it is created if it does not exist.
# The with statement closes the file automatically.
with open('output_sample.csv', 'w', newline='') as f:
    writecsv = csv.writer(f, lineterminator='\n')
    writecsv.writerow(csvlist)  # write all link texts as a single row

I referred to the following articles.

Parsing HTML with Python (Beautiful Soup)
Output HTML scraped by Beautiful Soup to CSV
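The script above writes every link text into a single row. If one row per link is more convenient, a variation along the following lines may help. This is only a sketch: the inline HTML stands in for part of sample.html, the column names `text` and `href` are my own choice, and an in-memory buffer is used in place of the output file so the example runs without touching the disk.

```python
import csv
import io

import bs4

# Inline stand-in for part of sample.html
html = '<div><a href="aaa">aaaaaa</a><a href="bbb">bbbbb</a></div>'
soup = bs4.BeautifulSoup(html, "html.parser")

# In-memory buffer standing in for open("output_sample.csv", "w", newline="")
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")

writer.writerow(["text", "href"])  # header row (column names are assumptions)
for link in soup.find_all("a"):
    # One row per link: the visible text and the value of its href attribute
    writer.writerow([link.get_text(strip=True), link.get("href", "")])

result = buffer.getvalue()
print(result)
```

Keeping the href next to the text makes it easier to trace each value back to its link later.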

Running the Python file

$ cd sample_file
$ python script.py

Output result

The CSV file is written to the same directory, as shown below.

`output_sample.csv`


aaaaaa,bbbbb,ccc,ddddd
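To check the result from Python rather than by eye, the output can be read back with `csv.reader`. The sketch below uses a string holding the line shown above in place of the real file, so it is self-contained; with the actual file you would pass the object returned by `open("output_sample.csv", newline="")` instead.

```python
import csv
import io

# Stand-in for the contents of output_sample.csv
csv_text = "aaaaaa,bbbbb,ccc,ddddd\n"

# csv.reader accepts any iterable of lines, such as a file or a StringIO
rows = list(csv.reader(io.StringIO(csv_text)))

print(rows)
```

Each row comes back as a list of strings, so the single output line becomes one list with four elements.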

CSV processing

1. Improving the readability of CSV files

If you're using VS Code, you can make your CSV files much easier to read by installing an extension called Rainbow CSV.

[Screenshots: the CSV file before and after installing Rainbow CSV]

To install Rainbow CSV, I referred to the following article: Introducing "Rainbow CSV" that makes CSV easier to see with VS Code

2. Processing the CSV for seed data

Depending on the structure of the HTML file, the text scraped with the above method may contain many line breaks. When you want to use the contents of the CSV file as an array in seed data, for example, you may want to remove the line breaks so that everything fits on one line. In that case, I recommend deleting all line breaks with the VS Code find-and-replace function. I referred to the following article for this method: [Visual Studio Code] How to replace line feed codes to make one line (https://kukka.me/vsc-newline/)
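As an alternative to editing the file afterwards, the line breaks could also be removed in Python before the text is written to the CSV. This is not the method used in this article, just a minimal sketch: `collapse_whitespace` is a hypothetical helper, the scraped strings are made up, and an in-memory buffer stands in for the output file.

```python
import csv
import io


def collapse_whitespace(text):
    """Replace every run of whitespace (including line breaks) with one space."""
    return " ".join(text.split())


# Hypothetical scraped strings containing line breaks and indentation
scraped = ["\n      aaaaaa\n    ", "bbb\nbb", "  ccc  "]
cleaned = [collapse_whitespace(s) for s in scraped]

# In-memory buffer standing in for the output CSV file
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerow(cleaned)
one_line = buffer.getvalue()
print(one_line)
```

Because `str.split()` with no arguments splits on any whitespace, leading and trailing blanks disappear and inner line breaks become single spaces.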

Summary

You can scrape a local HTML file with Beautiful Soup.
You can output a CSV file by using Python's csv module.
If the CSV file contains many line breaks because of the structure of the HTML file, the VS Code find-and-replace function is useful.

Reference URL

https://prog-8.com/docs/python-env
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
https://www.sejuku.net/blog/50417
https://www.sejuku.net/blog/75137
https://maku77.github.io/python/parse-html-by-beautiful-soup.html
https://5log.hateblo.jp/entry/2019/01/03/075552
https://qiita.com/0w0/items/07a481921a2ac09a049f
https://kukka.me/vsc-newline/

[Python] How to scrape a local html file and output it as CSV using Beautiful Soup