How to save a table scraped with Python to CSV

About this article

I needed to scrape a table from a web page for my research, so here I introduce the Python program I used. I had no scraping experience and wrote it while looking things up along the way, but I found almost no explanation of how to convert the table part of the HTML to CSV once the web page had been fetched as HTML. That is why I wrote this article.

Introduction

Please see the following URL for notes on scraping. https://qiita.com/Azunyan1111/items/b161b998790b1db2ff7a

Scraping with Python

The entire program can be found here.

import

import csv
import urllib.request
from bs4 import BeautifulSoup

Description of the imported libraries:
- csv is part of the Python standard library; here it is used to write the CSV file.
- urllib is used to access the web and fetch data (HTML).
- BeautifulSoup is used to extract the target data from the HTML.

Get HTML

url = "https://en.wikipedia.org/wiki/List_of_cities_in_Japan"
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
# Get all table elements (table tags) from the HTML
table = soup.find_all("table")

This time, I will scrape the Wikipedia table that lists Japanese cities.

urllib.request.urlopen fetches the HTML at the specified URL. Beautiful Soup then parses it into a form that is easy to work with, and soup.find_all("table") collects every table in the HTML (each part enclosed in a table tag). With that, the preparation is done.
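As a minimal offline sketch of the parsing step, the snippet below runs Beautiful Soup on a small hand-written HTML string instead of a live page. The `sample_html` string and its contents are made up for illustration, and the sketch assumes the `beautifulsoup4` package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML that urlopen() would return.
sample_html = """
<html><body>
  <table class="navbox"><tr><td>menu</td></tr></table>
  <p>Some text between the tables.</p>
  <table class="wikitable"><tr><th>City</th></tr><tr><td>Sapporo</td></tr></table>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns every matching tag in document order.
tables = soup.find_all("table")
table_count = len(tables)

# The class attribute comes back as a list of class names.
first_classes = tables[0].get("class")
print(table_count)
print(first_classes)
```

Because the result behaves like a list, it can be indexed or looped over to pick out the table you need.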

Find the class name of the table tag you want to get

image.png

If you are using the Chrome browser, you can open the developer tools (the black panel in the screenshot) by pressing F12 (Command + Option + I on a Mac). You can then view the HTML source in the Elements tab, so look for the table tag you want to scrape. This time, I want the table highlighted in blue. It can be obtained simply by picking, from all the table tags, the one whose class name is "wikitable".

for tab in table:
    # get("class") returns a list of class names, or None if the tag has no class
    table_className = tab.get("class")
    print(table_className)
    if table_className and table_className[0] == "wikitable":
        break

# Output when there is no break statement
# ['vertical-navbox', 'nowraplinks', 'hlist']
# ['wikitable']  <- the loop exits here because of the break statement
# ['wikitable', 'sortable']
# ['wikitable', 'sortable']
# ['wikitable']
# ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner']

- table_className[0] is checked because "wikitable" comes first in the list of class names.
- The HTML contains several other tables that also have the wikitable class, but the table I want is always the first one, so the loop exits with a break statement as soon as the if condition is satisfied for the first time.
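As an alternative sketch, Beautiful Soup can also filter by class directly. `find("table", class_="wikitable")` returns the first table whose class list contains `wikitable`, at any position in the list, and it skips tables that have no class attribute. Note that this is slightly looser than checking only the first class name. The HTML below is a made-up sample, not the Wikipedia page.

```python
from bs4 import BeautifulSoup

# Made-up sample: one table without a class, then two wikitables.
sample_html = """
<table><tr><td>no class</td></tr></table>
<table class="sortable wikitable"><tr><td>first</td></tr></table>
<table class="wikitable"><tr><td>second</td></tr></table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# class_ is compared against each individual class name of a tag.
target = soup.find("table", class_="wikitable")
target_classes = target.get("class")
target_text = target.get_text(strip=True)
print(target_classes)
print(target_text)
```

If no table matches, `find` returns None, so real code would probably want to check for that before using the result.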

Once you have the desired table, convert it to CSV and save it.

Finally, add the CSV save function to the above program.

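for tab in table:
    table_className = tab.get("class")
    # Skip tables without a class attribute (get returns None for them)
    if table_className and table_className[0] == "wikitable":
        # CSV save part; newline="" is what the csv module docs recommend
        with open("test.csv", "w", encoding="utf-8", newline="") as file:
            writer = csv.writer(file)
            rows = tab.find_all("tr")
            for row in rows:
                csvRow = []
                # Collect both data cells (td) and header cells (th)
                for cell in row.find_all(["td", "th"]):
                    csvRow.append(cell.get_text())
                writer.writerow(csvRow)
        break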

The CSV save part takes the rows of the table ("tr"), then the cells of each row ("td" and "th"), appends the text of each cell to a list, and writes that list as one line of the CSV file. Once you can extract the table tag, this part can be reused as is by copy and paste.
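One detail that may be worth illustrating: text taken from table cells can contain commas or trailing line breaks. The sketch below uses made-up rows and an in-memory buffer instead of test.csv, and it suggests that `csv.writer` quotes such cells so they survive a round trip. Stripping each cell is an optional extra step, not part of the program above.

```python
import csv
import io

# Made-up rows standing in for the cell texts collected from the table.
rows = [
    ["City", "Prefecture", "Note"],
    ["Sapporo", "Hokkaido", "capital, largest city"],
    ["Sendai\n", "Miyagi", ""],
]

# newline="" leaves line endings to the csv module, as with a real file.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
for row in rows:
    # strip() removes surrounding whitespace such as a trailing newline.
    writer.writerow([cell.strip() for cell in row])

csv_text = buffer.getvalue()

# Read the text back to confirm that the cells are unchanged.
read_back = list(csv.reader(io.StringIO(csv_text, newline="")))
print(csv_text)
```

The cell containing a comma is written inside double quotes, which is why it is still read back as a single cell.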

For confirmation, try displaying CSV using pandas

import pandas as pd
pd.read_csv("test.csv")

The saved CSV was displayed by pandas without any problems!

Summary

It depends on the site you want to scrape, but I think tables can be saved in CSV format this way. Thank you for reading to the end!

References

https://qiita.com/Azunyan1111/items/b161b998790b1db2ff7a
