[PYTHON] Notes on feedparser error handling

TL;DR

Motivation

>>> import feedparser
>>> 
>>> resp1 = feedparser.parse('http://qiita.com/tags/python/feed')
>>> type(resp1)
<class 'feedparser.FeedParserDict'>
>>> 
>>> resp2 = feedparser.parse('http://qiita.com/tags/python1/feed')
>>> type(resp2)
<class 'feedparser.FeedParserDict'>

feedparser.parse() returns a feedparser.FeedParserDict no matter what URL you pass it, even one that is not a valid feed. That makes failures hard to detect from the return type alone, so I decided to take a look at the contents.

What to look for

Variable information

Variable name   URL                                  Description
resp1           http://qiita.com/tags/python/feed    Ordinary RSS feed
resp2           http://qiita.com/tags/python         Not an RSS feed
resp3           http://qiita.com/tags/python1/feed   Status code other than 200
resp4           http://qiitta.com/tags/python/feed   Domain that does not exist

Target

Try to find out

Examine the keys

>>> resp1.keys()
dict_keys(['bozo', 'encoding', 'status', 'etag', 'href', 'entries', 'version', 'namespaces', 'feed', 'headers'])
>>> 
>>> resp2.keys()
dict_keys(['bozo', 'encoding', 'status', 'bozo_exception', 'etag', 'href', 'entries', 'version', 'namespaces', 'feed', 'headers'])
>>> 
>>> resp3.keys()
dict_keys(['bozo', 'encoding', 'bozo_exception', 'status', 'href', 'entries', 'version', 'namespaces', 'feed', 'headers'])
>>> 
>>> resp4.keys()
dict_keys(['bozo', 'entries', 'feed', 'bozo_exception'])
>>> 

Unexpectedly, the set of keys differs from result to result. Only `bozo`, `entries`, and `feed` are common to all four.
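
Since the key set varies, reading a key such as status directly can fail on a result like resp4. As a minimal sketch (the helper name summarize is my own and not part of feedparser), the fields of interest can be read with .get() so that missing keys come back as None. It assumes only that the result behaves like a dict, which FeedParserDict does.

```python
def summarize(result):
    """Collect a few fields from a parse result without raising on missing keys.

    `result` is assumed to be dict-like, e.g. the object returned by
    feedparser.parse(). The helper name and the output layout are illustrative.
    """
    return {
        "bozo": result.get("bozo"),
        # 'status' is absent when no HTTP response was received at all
        "status": result.get("status"),
        "entry_count": len(result.get("entries", [])),
        # 'bozo_exception' is absent when nothing went wrong
        "error": result.get("bozo_exception"),
    }
```

Calling summarize(resp4) would then give a status of None instead of raising.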

Contents of common keys

>>> resp1.bozo
0
>>> resp1.entries
[{'summary': '<p>Chapter 8 describes the graphical model. A graphical model is a method of graphically expressing relationships such as random variables and model parameters.
# (remaining output omitted)
}]
>>> 
>>> resp2.bozo
1
>>> resp2.version
''
>>> resp2.entries
[]
>>>
>>> resp3.bozo
1
>>> resp3.version
''
>>> resp3.entries
[]
>>>
>>> resp4.bozo
1
>>> resp4.entries
[]
>>>

Judging from these four cases, **the RSS feed can be considered successfully parsed only when bozo is 0**.

That already covers my original goal, but let's dig a little deeper.
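
The bozo check can be wrapped in a tiny predicate. This is only a sketch: parsed_ok is a name I made up, and treating a missing bozo key as a failure is my own assumption. Comparing by truthiness rather than with == 0 also keeps it working if a feedparser version reports bozo as False/True instead of 0/1.

```python
def parsed_ok(result):
    """Return True when the parse result reports no problem.

    `result` is assumed to be dict-like (e.g. from feedparser.parse()).
    A missing 'bozo' key is treated as a failure, which is an assumption
    made for this sketch.
    """
    return not result.get("bozo", 1)
```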

Look at the structure of a parse failure

When parsing succeeds you can simply work with `entries`, so from here on I will look at what information is available for error handling when parsing fails.

Existence of bozo_exception

Comparing resp1 and resp2, both have HTTP-related keys such as status and headers, probably because both requests themselves succeeded. However, one key exists only in resp2: bozo_exception.

>>> resp2.bozo_exception
SAXParseException('undefined entity',)

It holds an exception with a reasonably informative message. In most cases, looking at this should be enough.

resp3 and resp4

>>> resp3.bozo_exception
NonXMLContentType('text/html; charset=utf-8 is not an XML media type',)
>>>
>>> resp4.bozo_exception
URLError(gaierror(8, 'nodename nor servname provided, or not known'),)
>>>

The other results look like this. One drawback is that extracting just the message string from the exception takes a little extra work. Also, resp3 shows that feedparser still tries to parse the response even though it was a 404 Not Found, so when status is present it is a good idea to check it as well.
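
Putting these observations together, one possible way to turn a result into a short error message is sketched below. The function name describe_failure, the message format, and the choice to treat any status of 400 or above as an error are all my own assumptions, not something feedparser prescribes. It uses str() on the exception to get the inner message without the repr wrapper, although the exact text depends on the exception type.

```python
def describe_failure(result):
    """Return None if the result looks fine, otherwise a short reason string.

    `result` is assumed to be dict-like (e.g. from feedparser.parse()).
    """
    status = result.get("status")
    # Check the HTTP status first: an error page may still have been "parsed".
    # Treating >= 400 as an error is an assumption of this sketch.
    if status is not None and status >= 400:
        return "HTTP status %d" % status

    if result.get("bozo"):
        exc = result.get("bozo_exception")
        if exc is None:
            return "unknown parse error"
        # str(exc) yields the message only; the class name adds context.
        return "%s: %s" % (type(exc).__name__, exc)

    return None
```

With the results above, resp3 would be reported by its status code, while resp2 and resp4 would be reported by their exception class and message.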
