I wanted to scrape text data from a local HTML file and tried several approaches. The Python library Beautiful Soup turned out to be very convenient, so this article shows how to use it and how to write the results to a CSV file.
pyenv: 1.2.15
Python: 3.6.5
Beautiful Soup: 4.4.0
VS Code: 1.41.1
To set up the environment, I followed this Progate lesson: Prepare a Python development environment! (Mac)
Beautiful Soup is a Python library for scraping data out of HTML, using HTML tags and CSS selectors to locate the elements you want.
Official reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Japanese translation of the reference (ver 3.0): https://tdoc.info/beautifulsoup/
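As a rough illustration of the two lookup styles mentioned above, the sketch below parses a small inline HTML string and retrieves links once by tag name with find_all and once by CSS selector with select. The HTML string and the class name "menu" are made up for this example and are not the sample file used later.

```python
import bs4

# Hypothetical inline HTML, used only for this illustration
html = """
<ul class="menu">
  <li><a href="aaa">first</a></li>
  <li><a href="bbb">second</a></li>
</ul>
<p><a href="ccc">third</a></p>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')

# Lookup by tag name: every <a> element in the document
by_tag = [a.text for a in soup.find_all('a')]

# Lookup by CSS selector: only <a> elements inside <ul class="menu">
by_selector = [a.text for a in soup.select('ul.menu a')]

print(by_tag)
print(by_selector)
```

The tag lookup returns all three link texts, while the selector narrows the result to the two links inside the list.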
Install it with pip. I referred to the following articles:
[Introduction to Python] What is pip? Easy-to-understand explanation of how to use!
Let's scrape with Python's Beautiful Soup
You can install it with the following command.
pip install beautifulsoup4
Prepare the HTML file to be scraped on your local machine. The following is a sample file.
/sample_file/sample.html
<!--~abridgement~-->
<div>
<ul class="sample">
<li class="sample">
<a href="aaa">aaaaaa</a>
</li>
<li class="sample">
<a href="bbb">bbbbb</a>
</li>
</ul>
<div class="sample">
<a href="ccc">ccc</a>
</div>
<div class="sample">
<div class="sample">
<a href="ddd">ddddd</a>
</div>
</div>
</div>
<!--~abridgement~-->
Next, write the Python program. Save it in the same directory as the HTML file.
/sample_file/script.py
import csv  # Standard-library module for reading and writing CSV

import bs4

# Build the soup from the local HTML file to be scraped
with open('sample.html') as html_file:
    soup = bs4.BeautifulSoup(html_file, 'html.parser')

links = soup.find_all('a')  # Get all <a> elements

csvlist = []  # List that will hold the values of one CSV row
for link in links:  # Store the text of each <a> tag in the list
    sample_txt = link.text
    csvlist.append(sample_txt)

# Open the CSV file for writing. If the file does not exist, it is created.
# The with statement closes the file automatically.
with open('output_sample.csv', 'w', newline='') as f:
    writecsv = csv.writer(f, lineterminator='\n')
    writecsv.writerow(csvlist)  # Write all link texts as a single row
I referred to the following articles:
Parsing HTML with Python (Beautiful Soup)
Output HTML scraped by Beautiful Soup to CSV
Run the script from the terminal:
$ cd sample_file
$ python script.py
The CSV file will be output to the same directory as shown below.
output_sample.csv
aaaaaa,bbbbb,ccc,ddddd
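The script above writes every link text into a single row. If one row per link is more convenient, a variant along the following lines might work. This is only a sketch and not part of the original script: the inline HTML is a shortened copy of the sample file, the href column and the header row are additions, and io.StringIO stands in for the output file so the result can be printed.

```python
import csv
import io

import bs4

# Shortened copy of sample.html, inlined so the sketch is self-contained
html = """
<div>
  <ul class="sample">
    <li class="sample"><a href="aaa">aaaaaa</a></li>
    <li class="sample"><a href="bbb">bbbbb</a></li>
  </ul>
  <div class="sample"><a href="ccc">ccc</a></div>
</div>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')

# One (text, href) pair per <a> element
rows = [(link.text, link.get('href')) for link in soup.find_all('a')]

# io.StringIO stands in for open('output_sample.csv', 'w', newline='')
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['text', 'href'])  # Header row (an addition for this sketch)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

To write to a real file instead, the buffer could be replaced with a file opened in a with statement, as in the script above.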
If you're using VS Code, you can make CSV files much easier to read by installing an extension called Rainbow CSV.
(Screenshots: the CSV file before and after installing Rainbow CSV)
To install Rainbow CSV, I referred to the following article: Introducing "Rainbow CSV" that makes CSV easier to see with VS Code
If you scrape with the above method, the output may contain many line breaks, depending on the structure of the HTML file. When you want to use the CSV contents as the elements of an array, for example in seed data, you may want to remove the line breaks so that everything is on one line. In that case, I recommend deleting all line breaks with VS Code's find-and-replace function. I referred to the following article for this method: [Visual Studio Code] How to replace line feed codes to make one line (https://kukka.me/vsc-newline/)
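As an alternative to editing the file afterwards, the line breaks could also be removed in Python before the row is written. The sketch below is not taken from the referenced article: collapse_whitespace is a hypothetical helper, raw_texts imitates what link.text might return when the content of an <a> tag spans several lines, and io.StringIO stands in for the output file.

```python
import csv
import io


def collapse_whitespace(text):
    """Replace every run of whitespace (including line breaks) with one space."""
    return ' '.join(text.split())


# Hypothetical values of link.text for <a> tags whose content spans several lines
raw_texts = ['\n    aaaaaa\n  ', 'bbb\n    bb', ' ccc ']

cleaned = [collapse_whitespace(text) for text in raw_texts]

# io.StringIO stands in for the output file so the row can be printed
buffer = io.StringIO()
csv.writer(buffer, lineterminator='\n').writerow(cleaned)
one_line = buffer.getvalue()
print(one_line)
```

In the script above, the same helper could be applied to link.text before each value is appended to csvlist.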
You can scrape a local HTML file with Beautiful Soup. You can write the result to a CSV file with the csv module. If the CSV file contains many line breaks because of the structure of the HTML file, VS Code's find-and-replace function is useful.
https://prog-8.com/docs/python-env
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
https://www.sejuku.net/blog/50417
https://www.sejuku.net/blog/75137
https://maku77.github.io/python/parse-html-by-beautiful-soup.html
https://5log.hateblo.jp/entry/2019/01/03/075552
https://qiita.com/0w0/items/07a481921a2ac09a049f
https://kukka.me/vsc-newline/