[PYTHON] Extract a table from Wikipedia

easy_read_html.py


import pandas as pd
url = 'http://www.example.com'
df = pd.read_html(url)

I am writing this because the simple approach above went wrong.

At first it went well

first.py


import pandas as pd
url = 'https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Italy'
df=pd.read_html(url)

The table was extracted, so I thought "pandas is great", closed my computer, and left the rest for the next day.
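
One detail that is easy to miss at this point: pd.read_html returns a list of DataFrames, one per table found, not a single DataFrame. A minimal sketch, assuming a small inline HTML string (hypothetical data, used in place of the Wikipedia page so it runs offline) and an installed lxml parser:

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for a downloaded page containing two tables.
html = """
<table>
  <tr><th>Region</th><th>Cases</th></tr>
  <tr><td>LOM</td><td>10</td></tr>
  <tr><td>VEN</td><td>4</td></tr>
</table>
<table>
  <tr><th>Date</th><th>Deaths</th></tr>
  <tr><td>2020-03-01</td><td>1</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html))
print(type(tables), len(tables))

# Pick the table you want by position before using DataFrame methods.
cases = tables[0]
print(cases)
```

So the variable named df in first.py is really a list, and the table of interest still has to be selected with an index such as [0].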

The next day

ValueError: invalid literal for int() with base 10: '2;'

Now all I got was this error. Reinstalling Anaconda did not change anything, so I reinstalled the OS. The result was still the same :cry:
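
One possible cause, which is an assumption and was not confirmed for this page, is that the article was edited overnight and a table cell gained a malformed attribute such as colspan="2;". pandas converts rowspan and colspan values with int(), and int('2;') fails with exactly this message. The sketch below uses a hypothetical helper, clean_spans, and made-up table data to strip such values before parsing:

```python
import re
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical table with a malformed span attribute ("2;" instead of "2").
html = """
<table>
  <tr><th>Date</th><th>New</th><th>Total</th></tr>
  <tr><td>2020-03-01</td><td colspan="2;">x</td></tr>
  <tr><td>2020-03-02</td><td>5</td><td>7</td></tr>
</table>
"""


def clean_spans(html_text):
    """Keep only the leading digits of every rowspan/colspan value."""
    soup = BeautifulSoup(html_text, "html.parser")
    for cell in soup.find_all(["td", "th"]):
        for attr in ("rowspan", "colspan"):
            if cell.has_attr(attr):
                match = re.match(r"\s*(\d+)", cell[attr])
                # Fall back to a span of 1 when no digits are present.
                cell[attr] = match.group(1) if match else "1"
    return str(soup)


df = pd.read_html(StringIO(clean_spans(html)))[0]
print(df)
```

If this guess is right, reinstalling Anaconda or the OS cannot help, because the problem is in the downloaded HTML and not in the local installation.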

requests and Beautiful Soup

requests_bs.py


import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Italy'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
data=soup.find_all('table', {"wikitable mw-collapsible"})

In the end I could not leave the whole job to pandas and had to fetch the page from the web myself, so I used requests and Beautiful Soup (https://qiita.com/itkr/items/513318a9b5b92bd56185) for scraping.

It does not become a DataFrame

I thought df = pd.read_html(data) would do it, but it raised TypeError: cannot parse from 'ResultSet'.

Checking the contents of data shows:

[<table class="wikitable mw-collapsible" style="float:left; text-align:right; font-size:83%;">
 <caption style="font-size:116%"><span class="nowrap">Daily COVID-19 cases in Italy by region ...

So data looks like a list, and data[0] is the HTML of the table. I therefore tried pd.read_html(data[0]), which raised TypeError: 'NoneType' object is not callable. I don't know why this happens :cry:
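
A likely explanation, offered as an assumption based on how Beautiful Soup resolves unknown attributes and not checked against the pandas source for the version used here: data[0] is a bs4 Tag, and attribute access on a Tag is shorthand for find(). Asking for .read therefore returns None instead of raising AttributeError, so the Tag can pass for a file-like object, and calling .read() ends up calling None. A small offline sketch with a made-up table:

```python
from bs4 import BeautifulSoup

# Hypothetical one-cell table standing in for the Wikipedia table.
html = '<table class="wikitable"><tr><td>1</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")
table = soup.find_all("table")[0]

# table.read is treated as table.find("read"); there is no <read> child,
# so the result is None rather than an AttributeError.
print(table.read)
print(hasattr(table, "read"))  # the Tag therefore looks file-like

message = ""
try:
    table.read()  # this calls None
except TypeError as exc:
    message = str(exc)
print(message)
```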

Google

I googled "'nonetype' object is not callable pandas read_html" and found some Stack Overflow answers, but everything I tried from them failed. Is a list simply not usable here?

read_html list extraction

After searching with the keywords above, I finally found an approach that works: https://teratail.com/questions/188717

type(data) is bs4.element.ResultSet, so it apparently cannot be passed to read_html as it is.

It worked
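
The type check and the conversion to text can be sketched offline as follows. The inline HTML is hypothetical and only the class name mirrors the article. Wrapping the text in StringIO is an addition of this sketch, since recent pandas versions warn when read_html is given a raw HTML string:

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature of the page; only the class name mirrors the article.
html = """
<table class="wikitable">
  <tr><th>Date</th><th>LOM</th></tr>
  <tr><td>2020-03-01</td><td>10</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
data = soup.find_all("table", class_="wikitable")

# find_all returns a ResultSet, not a string or a file-like object.
print(type(data).__name__)

# str() turns the ResultSet into plain HTML text such as "[<table ...>...]".
html_text = str(data)

# Hand the text to read_html through a file-like wrapper.
df = pd.read_html(StringIO(html_text))[0]
print(df)
```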

wiki_get.py


import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Italy'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
data=soup.find_all('table', {"wikitable mw-collapsible"})
df = pd.read_html(str(data), keep_default_na=False)[0]
df=df.iloc[:,0:28]
df.columns=['Date','VDA','LIG','PIE','LOM','VEN','TN','BZ','FVG','EMR','MAR','TOS','UMB','LAZ','ABR','MOL','CAM','BAS','PUG','CAL','SIC','SAR','ConfirmedNew','ConfirmedTotal','DeathsNew','DeathsTotal','ActiveICU','ActiveTotal']
daf=df[df['Date'].str.contains('^2020')]
daf.to_csv('Splunk_COVID19_Italy.csv', index=False)

The script extracts the per-region table from https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Italy and saves it as CSV.

df = pd.read_html(str(data), keep_default_na=False)[0]
Wrapping data in str() converts it to a string, which read_html accepts as an argument. read_html returns a list, so only the first element is taken.

df=df.iloc[:,0:28]
Only the first 28 columns are kept.

daf=df[df['Date'].str.contains('^2020')]
The rows I need are the ones whose Date value starts with 2020. str.contains cannot be applied to a whole DataFrame, so the column name is specified.

daf.to_csv('Splunk_COVID19_Italy.csv', index=False)
The index numbers are not needed in the CSV output, so they are left out.
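
The column selection, row filter, and CSV output can be tried on a tiny in-memory table. This is a sketch with made-up data. It uses str.startswith in place of the regex '^2020', which is a swap for readability, and writes to a StringIO buffer instead of a file:

```python
from io import StringIO

import pandas as pd

# Hypothetical miniature of the parsed table: data rows plus stray
# header/footer rows like those a wiki table can contain.
df = pd.DataFrame({
    "Date": ["2020-03-01", "2020-03-02", "Date", "Total"],
    "LOM": ["10", "12", "LOM", "22"],
    "Notes": ["", "", "", ""],
})

# Keep the first two columns by position (the article keeps the first 28).
df = df.iloc[:, 0:2]

# Keep rows whose Date starts with "2020".
daf = df[df["Date"].str.startswith("2020")]

# index=False leaves the row index out of the CSV.
buffer = StringIO()
daf.to_csv(buffer, index=False)
print(buffer.getvalue())
```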

Then I found this ...

https://python-forum.io/Thread-table-from-wikipedia I always find these only after I have built my own ... :cry:

A better one

https://qiita.com/Retsuki/items/88eded5e61af200305fb Cool work :smile:

Summary

Splunk's Web Site Input didn't work, so I googled and found articles saying that pandas makes this easy, even godlike. When I tried it myself, I ran into a mountain of errors that I did not understand. In short, I still have a long way to go with Python.
