easy_read_html.py
import pandas as pd
url = 'http://www.example.com'
df = pd.read_html(url)  # read_html parses every <table> on the page into a list of DataFrames
It was supposed to be that easy, but for me it went wrong.
first.py
import pandas as pd
url = 'https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Italy'
df=pd.read_html(url)
The table seemed to come out fine; I thought "that's pandas for you" and closed my laptop for the day. The next day, all I got was:
ValueError: invalid literal for int() with base 10: '2;'
Nothing but an error. Reinstalling Anaconda at this point changed nothing, so I went as far as reinstalling the OS. The result was still the same :cry:
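My guess (I have not confirmed this) is that read_html trips over a malformed cell attribute such as rowspan="2;" when it converts it with int(), which would produce exactly this ValueError. A quick diagnostic sketch along those lines:

check_rowspan.py
# my guess at the cause (unconfirmed): list table cells whose rowspan/colspan
# attribute is not a plain integer, e.g. a stray "2;" in the page markup
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Italy'
soup = BeautifulSoup(requests.get(url).content, "lxml")

for cell in soup.find_all(['td', 'th']):
    for attr in ('rowspan', 'colspan'):
        value = cell.get(attr)
        if value is not None and not str(value).isdigit():
            print(attr, repr(value))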
requests_bs.py
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Italy'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
data = soup.find_all('table', {"class": "wikitable mw-collapsible"})
In the end, it seems I can't leave fetching data from the web entirely to pandas, so I used requests and [Beautiful Soup](https://qiita.com/itkr/items/513318a9b5b92bd56185) for the scraping, as above.
df = pd.read_html(data)
Just when I thought this would work: TypeError: cannot parse from 'ResultSet'
Inspecting `data` gives:
[<table class="wikitable mw-collapsible" style="float:left; text-align:right; font-size:83%;">
<caption style="font-size:116%"><span class="nowrap">Daily COVID-19 cases in Italy by region ...
which looks list-like, so I tried:
pd.read_html(data[0])
`data[0]` is the HTML of the table itself, so I tried the command above, only to get TypeError: 'NoneType' object is not callable
I have no idea why this happens :cry:
Googling `'nonetype' object is not callable pandas read_html` brought up some Stack Overflow results, and I tried what they suggested, but everything failed. Is passing a list just no good?
Searching with those keywords eventually turned up an approach that works: https://teratail.com/questions/188717
type(data)
`data` is a `bs4.element.ResultSet`, which (it seems) cannot be passed to `read_html` as is.
wiki_get.py
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Italy'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
data = soup.find_all('table', {"class": "wikitable mw-collapsible"})
df = pd.read_html(str(data), keep_default_na=False)[0]
df=df.iloc[:,0:28]
df.columns=['Date','VDA','LIG','PIE','LOM','VEN','TN','BZ','FVG','EMR','MAR','TOS','UMB','LAZ','ABR','MOL','CAM','BAS','PUG','CAL','SIC','SAR','ConfirmedNew','ConfirmedTotal','DeathsNew','DeathsTotal','ActiveICU','ActiveTotal']
daf=df[df['Date'].str.contains('^2020')]
daf.to_csv('Splunk_COVID19_Italy.csv', index=False)
This extracts the by-region table from https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Italy and writes it out as a CSV.
df = pd.read_html(str(data), keep_default_na=False)[0]
Wrapping `data` in `str()` converts it to a string, which can be passed to `read_html` as an argument. Since the return value is a list, I just take the first element.
df=df.iloc[:,0:28]
This keeps only the first 28 columns.
daf=df[df['Date'].str.contains('^2020')]
The rows I need are the ones whose `Date` starts with 2020. Since `str.contains` cannot be applied to the DataFrame as a whole, I specify the column name.
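A toy illustration of the same filter, using made-up values rather than the real table:

import pandas as pd

# rows whose Date does not start with 2020 (e.g. header leftovers) get dropped
demo = pd.DataFrame({'Date': ['2020-02-21', 'Total', '2020-02-22'],
                     'LOM': ['14', '16', '2']})
print(demo[demo['Date'].str.contains('^2020')])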
daf.to_csv('Splunk_COVID19_Italy.csv', index=False)
The row index is not needed in the CSV output, so I drop it with `index=False`.
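As a quick sanity check of my own (not part of the original workflow), the CSV can be read back in to confirm the date filter and the 28 columns:

import pandas as pd

check = pd.read_csv('Splunk_COVID19_Italy.csv')
print(check.shape)           # (number of rows starting with 2020, 28)
print(check['Date'].head())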
https://python-forum.io/Thread-table-from-wikipedia As always, I only found this after building it myself :cry:
https://qiita.com/Retsuki/items/88eded5e61af200305fb This one does it really elegantly :smile:
Splunk's Web Site Input didn't work for me, and when I googled around I found articles about pandas.
"This is easy", "pandas is a godsend", they said, but when I actually tried it I hit a mountain of errors and couldn't make sense of them.
The takeaway: I still have a long way to go with Python.