`easy_read_html.py`


import pandas as pd
url = 'http://www.example.com'
df = pd.read_html(url)

This should have been all it takes, but it went wrong, so here are my notes.

At first it went well

`first.py`


import pandas as pd
url = 'https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Italy'
df=pd.read_html(url)

The table was extracted, so I thought "that's pandas for you", closed my computer, and left the rest for the next day.

Next day

ValueError: invalid literal for int() with base 10: '2;'

Now all I got was this error. Reinstalling Anaconda did not change anything, so I reinstalled the OS. The result was still the same :cry:
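One possible cause (an assumption, not something the error message confirms) is a table cell whose `rowspan` or `colspan` attribute is not a clean integer, for example `rowspan="2;"`, because `read_html` has to convert those values to integers. The sketch below cleans such attributes before parsing. The HTML is made up for illustration and `clean_span_attrs` is a hypothetical helper, not part of pandas or Beautiful Soup.

```python
import re
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Made-up HTML for illustration; note the stray ";" in rowspan.
HTML = """
<table>
  <tr><th>Date</th><th>Cases</th></tr>
  <tr><td rowspan="2;">2020-03-01</td><td>10</td></tr>
  <tr><td>20</td></tr>
</table>
"""


def clean_span_attrs(html):
    """Hypothetical helper: keep only the leading digits of rowspan/colspan."""
    soup = BeautifulSoup(html, "html.parser")
    for cell in soup.find_all(["td", "th"]):
        for attr in ("rowspan", "colspan"):
            if cell.has_attr(attr):
                match = re.match(r"\d+", cell[attr].strip())
                if match:
                    cell[attr] = match.group()
                else:
                    # No usable number at all: drop the attribute
                    del cell[attr]
    return str(soup)


cleaned = clean_span_attrs(HTML)
# StringIO wraps the HTML text as a file-like object for read_html
df = pd.read_html(StringIO(cleaned))[0]
print(df)
```

If the real page has a different problem, this cleanup will not help, so it is worth printing the offending cell first.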

requests and Beautiful Soup

`requests_bs.py`


import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Italy'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
data=soup.find_all('table', {"wikitable mw-collapsible"})

Since I could no longer leave fetching the page to pandas, I had to get the information from the web myself, so I used requests and [Beautiful Soup](https://qiita.com/itkr/items/513318a9b5b92bd56185) for scraping.

It does not become a DataFrame

I thought `df = pd.read_html(data)` would do it, but I got `TypeError: cannot parse from 'ResultSet'`.

Checking the contents of `data`:

[<table class="wikitable mw-collapsible" style="float:left; text-align:right; font-size:83%;">
 <caption style="font-size:116%"><span class="nowrap">Daily COVID-19 cases in Italy by region ...

So it looks like a list. `data[0]` is the HTML of the table, so I tried `pd.read_html(data[0])` and got `TypeError: 'NoneType' object is not callable`. I don't know why this happens :cry:

Google

I googled `'nonetype' object is not callable pandas read_html` and found some results on Stack Overflow. I tried them, but none of them worked. Is a list no good?

read_html list extraction

After searching with the keywords above, I finally found a way that works. https://teratail.com/questions/188717

`type(data)` is `bs4.element.ResultSet`, so it apparently cannot be passed to `read_html` as is.
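A minimal sketch of the type problem, using made-up HTML in place of the Wikipedia page (the numbers are invented). It converts a single matched table with `str` and wraps the text in `StringIO`, since recent pandas versions warn when a literal HTML string is passed to `read_html` directly. Picking `data[0]` instead of the whole result set is a choice made for this sketch.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Made-up HTML standing in for the Wikipedia page.
HTML = """
<table class="wikitable mw-collapsible">
  <tr><th>Date</th><th>LOM</th></tr>
  <tr><td>2020-03-01</td><td>984</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
data = soup.find_all("table", class_="wikitable mw-collapsible")

print(type(data))     # a bs4 ResultSet, not a string
print(type(data[0]))  # a bs4 Tag

# str() turns the Tag back into HTML text that read_html can parse
table_html = str(data[0])
df = pd.read_html(StringIO(table_html))[0]
print(df)
```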

It worked

`wiki_get.py`


import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Italy'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
data=soup.find_all('table', {"wikitable mw-collapsible"})
df = pd.read_html(str(data), keep_default_na=False)[0]
df=df.iloc[:,0:28]
df.columns=['Date','VDA','LIG','PIE','LOM','VEN','TN','BZ','FVG','EMR','MAR','TOS','UMB','LAZ','ABR','MOL','CAM','BAS','PUG','CAL','SIC','SAR','ConfirmedNew','ConfirmedTotal','DeathsNew','DeathsTotal','ActiveICU','ActiveTotal']
daf=df[df['Date'].str.contains('^2020')]
daf.to_csv('Splunk_COVID19_Italy.csv', index=False)

The script above extracts the table by region from https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Italy and writes it to CSV.

`df = pd.read_html(str(data), keep_default_na=False)[0]` Wrapping `data` in `str` converts it to a string, which `read_html` accepts as an argument. The return value is a list, so I take just the first result.

`df=df.iloc[:,0:28]` Keeps the first 28 columns.

`daf=df[df['Date'].str.contains('^2020')]` The rows I need are those whose Date column starts with 2020. `str.contains` cannot be applied to a whole DataFrame, so the column name is specified.

`daf.to_csv('Splunk_COVID19_Italy.csv', index=False)` The index numbers are not needed in the CSV output, so they are dropped.
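The post-processing steps can be tried without touching the network. This is a small sketch on made-up rows (three columns instead of 28, invented values) that mimics a parsed table with a repeated header row and a totals row.

```python
import pandas as pd

# Made-up rows standing in for the parsed Wikipedia table
raw = pd.DataFrame(
    [
        ["Date", "VDA", "LIG", "extra"],
        ["2020-02-21", "0", "0", "x"],
        ["2020-02-22", "0", "1", "y"],
        ["Total", "0", "1", "z"],
    ]
)

# Keep the first 3 columns (positions 0, 1, 2); copy() avoids editing a view
df = raw.iloc[:, 0:3].copy()
df.columns = ["Date", "VDA", "LIG"]

# Keep only rows whose Date starts with "2020"
daf = df[df["Date"].str.contains("^2020")]

# With no path given, to_csv returns the CSV text instead of writing a file
csv_text = daf.to_csv(index=False)
print(csv_text)
```

The header-like row and the totals row are dropped because their Date value does not start with 2020.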

I found it ...

https://python-forum.io/Thread-table-from-wikipedia I always find these after I have already built my own :cry:

A better one

https://qiita.com/Retsuki/items/88eded5e61af200305fb Cool work :smile:

Summary

Splunk's Web Site Input didn't work, and when I googled it I found an article on pandas. I thought "this is easy, pandas is a godsend", but when I tried it I hit a mountain of errors I didn't understand. In the end, I still have a long way to go with Python.

[PYTHON] Extract table from wikipedia
