[Python] How to scrape a local html file and output it as CSV using Beautiful Soup

Introduction

I wanted to scrape text data from a local HTML file and tried several approaches. The Python library Beautiful Soup turned out to be very convenient, so here I share how to use it and how to write the results to a CSV file.

Development environment

pyenv: 1.2.15
python: 3.6.5
Beautiful Soup: 4.4.0
VSCode: 1.41.1

Setting up the Python environment

To set up the environment, I followed this Progate lesson: Prepare a Python development environment! (Mac)

What is Beautiful Soup?

Beautiful Soup is a Python library that lets you extract data from HTML by searching for HTML tags and CSS selectors.
Official reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Japanese translation of the reference (ver 3.0): https://tdoc.info/beautifulsoup/
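As a minimal sketch of the two search styles mentioned above, the following compares a tag search (find_all) with a CSS selector search (select). The HTML string is a shortened stand-in for the sample file used later in this article, and the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample HTML used later in this article.
html = """
<ul class="sample">
  <li class="sample"><a href="aaa">aaaaaa</a></li>
  <li class="sample"><a href="bbb">bbbbb</a></li>
</ul>
<div class="sample"><a href="ccc">ccc</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Search by tag name: every <a> element in the document.
by_tag = [a.get_text() for a in soup.find_all("a")]

# Search by CSS selector: only <a> elements inside <li class="sample">.
by_selector = [a.get_text() for a in soup.select("li.sample a")]

print(by_tag)
print(by_selector)
```

The tag search returns all three links, while the selector narrows the result to the two links inside the list items.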

Installing Beautiful Soup

Install it with pip. I referred to the following articles.

[Introduction to Python] What is pip? Easy-to-understand explanation of how to use!
Let's scrape with beautiful Python soup

You can install it with the following command.

pip install beautifulsoup4
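One quick way to confirm that the install worked is to import the package and print its version. This is only a sanity check, and the version printed depends on your environment.

```python
import bs4

# If this import succeeds, Beautiful Soup is installed for the interpreter in use.
installed_version = bs4.__version__
print("Beautiful Soup version:", installed_version)
```

If the import fails with ModuleNotFoundError, pip probably installed the package for a different Python interpreter than the one running the script.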

Preparing the HTML file

Prepare the HTML file to be scraped locally. The following is a sample file.

/sample_file/sample.html


<!--~abridgement~-->
<div>
  <ul class="sample">
    <li class="sample">
      <a href="aaa">aaaaaa</a>
    </li>
    <li class="sample">
      <a href="bbb">bbbbb</a>
    </li>
  </ul>
  <div class="sample">
    <a href="ccc">ccc</a>
  </div>
  <div class="sample">
    <div class="sample">
        <a href="ddd">ddddd</a>
    </div>
  </div>
</div>
<!--~abridgement~-->

Scraping work

Creating a python file

Next, create a Python program. Create it in the same directory as the html file.

/sample_file/script.py


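import csv  # standard-library module for reading and writing CSV files

import bs4

# Build the soup from the local HTML file to be scraped
# (encoding='utf-8' assumes the file is saved as UTF-8)
with open('sample.html', encoding='utf-8') as html_file:
    soup = bs4.BeautifulSoup(html_file, 'html.parser')

links = soup.find_all('a')  # get every <a> element

csvlist = []  # list that collects the link texts

for link in links:  # store the text of each <a> element in the list
    sample_txt = link.text
    csvlist.append(sample_txt)

# Open the CSV file for writing; it is created if it does not exist.
# newline='' lets the csv module control the line endings itself,
# and the with block closes the file automatically.
with open('output_sample.csv', 'w', newline='', encoding='utf-8') as f:
    writecsv = csv.writer(f, lineterminator='\n')
    writecsv.writerow(csvlist)  # write all link texts as a single row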

I referred to the following articles.

Parsing HTML with Python (Beautiful Soup)
Output HTML scraped by Beautiful Soup to CSV
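The script above writes every link text into a single row. As a variation that is not part of the original script, the sketch below writes one row per link and also records the href attribute. The HTML string, the header names, and the file name output_links.csv are assumptions chosen for illustration.

```python
import csv

from bs4 import BeautifulSoup

# Inline copy of the sample HTML so the sketch runs on its own.
html = """
<div>
  <ul class="sample">
    <li class="sample"><a href="aaa">aaaaaa</a></li>
    <li class="sample"><a href="bbb">bbbbb</a></li>
  </ul>
  <div class="sample"><a href="ccc">ccc</a></div>
  <div class="sample"><div class="sample"><a href="ddd">ddddd</a></div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# One (text, href) pair per <a> element; strip=True trims surrounding whitespace.
rows = [(a.get_text(strip=True), a.get("href", "")) for a in soup.find_all("a")]

# output_links.csv is a hypothetical file name used only in this sketch.
with open("output_links.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "href"])  # header row
    writer.writerows(rows)             # one row per link
```

A layout with one record per row is usually easier to load into a spreadsheet or a database than a single long row.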

Executing a Python file

$ cd sample_file
$ python script.py

Output result

The CSV file will be output to the same directory as shown below.

output_sample.csv


aaaaaa,bbbbb,ccc,ddddd
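To check the result from Python instead of opening the file by hand, you can read the CSV back with csv.reader. The sketch below first recreates the one-row file shown above so that it runs on its own. In practice you would skip that step and read the file written by script.py.

```python
import csv

# Recreate the one-row file shown above so the sketch is self-contained.
with open("output_sample.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f, lineterminator="\n").writerow(["aaaaaa", "bbbbb", "ccc", "ddddd"])

# Read the file back; each row comes out as a list of strings.
with open("output_sample.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

print(rows)
```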

CSV processing

1. Improving the readability of CSV files

If you're using VS Code, you can make your CSV files much easier to read by installing an extension called Rainbow CSV.

[Screenshots: the CSV file before and after enabling Rainbow CSV]

To install Rainbow CSV, I referred to the following article: Introducing "Rainbow CSV" that makes CSV easier to see with VS Code

2. Processing for seed data

If you scrape with the above method, the output may contain many line breaks. When you want to use the contents of the CSV file as an array, for example in seed data, you may want to remove the line breaks so that everything is on one line. In that case, I recommend deleting all the line breaks with the VS Code find-and-replace function.

I referred to the following article for this method: [Visual Studio Code] How to replace the line feed code to one line (https://kukka.me/vsc-newline/)
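As an alternative to cleaning up in the editor, the line breaks can also be removed in Python before the row is written. The following is a minimal sketch using only the standard library. collapse_whitespace is a hypothetical helper name, and the texts list stands in for the strings returned by link.text.

```python
import csv
import io

def collapse_whitespace(text):
    """Replace every run of spaces, tabs, and line breaks with one space."""
    return " ".join(text.split())

# Stand-ins for scraped link texts that contain line breaks and indentation.
texts = ["aaaaaa", "\n      bbbbb\n    ", "ccc\n   ccc"]

cleaned = [collapse_whitespace(t) for t in texts]

# Write to an in-memory buffer here; a real script would write to a file.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerow(cleaned)

print(buffer.getvalue())
```

Because str.split() with no arguments splits on any whitespace, this also trims leading and trailing spaces, so each value ends up on a single line.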

Summary

You can scrape a local HTML file with Beautiful Soup. You can write the result to a CSV file with the csv module. If the CSV file contains many line breaks because of the structure of the HTML file, the VS Code find-and-replace function is useful.

Reference URL

https://prog-8.com/docs/python-env
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
https://www.sejuku.net/blog/50417
https://www.sejuku.net/blog/75137
https://maku77.github.io/python/parse-html-by-beautiful-soup.html
https://5log.hateblo.jp/entry/2019/01/03/075552
https://qiita.com/0w0/items/07a481921a2ac09a049f
https://kukka.me/vsc-newline/
