Motivation

>>> import feedparser
>>> 
>>> resp1 = feedparser.parse('http://qiita.com/tags/python/feed')
>>> type(resp1)
<class 'feedparser.FeedParserDict'>
>>> 
>>> resp2 = feedparser.parse('http://qiita.com/tags/python1/feed')
>>> type(resp2)
<class 'feedparser.FeedParserDict'>

feedparser.parse() returns a feedparser.FeedParserDict no matter what URL you give it, so the return type alone does not tell you whether parsing succeeded. That's a little inconvenient, so I decided to take a look at the contents.

What to look for

Variable information

Variable name	URL	Description
resp1	http://qiita.com/tags/python/feed	Ordinary RSS feed
resp2	http://qiita.com/tags/python	Not an RSS feed
resp3	http://qiita.com/tags/python1/feed	Status code other than 200
resp4	http://qiitta.com/tags/python/feed	Domain that does not exist

Points to check

The keys of FeedParserDict
Which keys are common to every result?

Try to find out

Examine the keys

>>> resp1.keys()
dict_keys(['bozo', 'encoding', 'status', 'etag', 'href', 'entries', 'version', 'namespaces', 'feed', 'headers'])
>>> 
>>> resp2.keys()
dict_keys(['bozo', 'encoding', 'status', 'bozo_exception', 'etag', 'href', 'entries', 'version', 'namespaces', 'feed', 'headers'])
>>> 
>>> resp3.keys()
dict_keys(['bozo', 'encoding', 'bozo_exception', 'status', 'href', 'entries', 'version', 'namespaces', 'feed', 'headers'])
>>> 
>>> resp4.keys()
dict_keys(['bozo', 'entries', 'feed', 'bozo_exception'])
>>>

Unexpectedly, the set of keys differs from result to result. Only `bozo`, `entries`, and `feed` are present in all four.
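The comparison can also be done mechanically. The following is a minimal sketch: `common_keys` is a hypothetical helper (not part of feedparser), and the four dictionaries are stand-ins built from the key lists printed above, so the example runs without network access. The real results should work the same way, since they respond to `keys()` as shown in the session above.

```python
def common_keys(*results):
    """Return the set of keys present in every dict-like result."""
    if not results:
        return set()
    key_sets = [set(r.keys()) for r in results]
    return set.intersection(*key_sets)


# Stand-ins for resp1..resp4, built from the key lists printed above.
resp1_like = dict.fromkeys(['bozo', 'encoding', 'status', 'etag', 'href',
                            'entries', 'version', 'namespaces', 'feed', 'headers'])
resp2_like = dict.fromkeys(['bozo', 'encoding', 'status', 'bozo_exception', 'etag',
                            'href', 'entries', 'version', 'namespaces', 'feed', 'headers'])
resp3_like = dict.fromkeys(['bozo', 'encoding', 'bozo_exception', 'status', 'href',
                            'entries', 'version', 'namespaces', 'feed', 'headers'])
resp4_like = dict.fromkeys(['bozo', 'entries', 'feed', 'bozo_exception'])

print(sorted(common_keys(resp1_like, resp2_like, resp3_like, resp4_like)))
# → ['bozo', 'entries', 'feed']
```

Note that `feed` also appears in all four key lists, alongside `bozo` and `entries`.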

Contents of common keys

>>> resp1.bozo
0
>>> resp1.entries
[{'summary': '<p>Chapter 8 describes the graphical model. A graphical model is a method of graphically expressing relationships such as random variables and model parameters.
# (rest omitted)
}]
>>> 
>>> resp2.bozo
1
>>> resp2.version
''
>>> resp2.entries
[]
>>>
>>> resp3.bozo
1
>>> resp3.version
''
>>> resp3.entries
[]
>>>
>>> resp4.bozo
1
>>> resp4.entries
[]
>>>

Based on these four cases, it seems safe to say that **the RSS feed was parsed successfully only when bozo is 0**.
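As a minimal sketch of that rule, the check can be wrapped in a small function. `parsed_cleanly` is a hypothetical name, not a feedparser API, and treating a missing `bozo` key as a failure is an assumption of this sketch. It accepts any dict-like object, so it can be tried with plain dictionaries.

```python
def parsed_cleanly(result):
    """Return True when the result reports no parse problem."""
    # bozo is 0 in the session above; `not` also handles a boolean False,
    # in case the installed feedparser version reports bozo that way.
    # Assumption: a result without a bozo key is treated as a failure.
    return not result.get("bozo", 1)


print(parsed_cleanly({"bozo": 0, "entries": [{"title": "example"}]}))  # → True
print(parsed_cleanly({"bozo": 1, "entries": []}))                      # → False
```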

That nearly settles the original question, but let's dig a little deeper.

Look at the structure of a parse failure

When parsing succeeds, you can simply work with `entries`, so there is not much to think about. From here on, I will look at what information is available for handling parse failures.

Existence of bozo_exception

If you compare resp1 and resp2, both have HTTP-related keys such as status and headers, probably because both HTTP requests succeeded. However, one key exists only in resp2: bozo_exception.

>>> resp2.bozo_exception
SAXParseException('undefined entity',)

It contains a reasonably descriptive message. In most cases, looking at this should be enough to tell what went wrong.
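Since bozo_exception holds an exception object rather than a string, one way to get a printable message is to combine the class name with `str()` of the exception. This is only a sketch: `bozo_message` is a hypothetical helper, and the exact text produced by `str()` depends on the exception class.

```python
def bozo_message(result):
    """Return 'ClassName: message' for the stored exception, or '' if none."""
    exc = result.get("bozo_exception")
    if exc is None:
        return ""
    # str(exc) is whatever the exception class chooses to show.
    return f"{type(exc).__name__}: {exc}"


# A plain dictionary standing in for a failed parse result.
failed = {"bozo": 1, "bozo_exception": ValueError("undefined entity")}
print(bozo_message(failed))       # → ValueError: undefined entity
print(repr(bozo_message({})))     # → ''
```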

`resp3` and `resp4`

>>> resp3.bozo_exception
NonXMLContentType('text/html; charset=utf-8 is not an XML media type',)
>>>
>>> resp4.bozo_exception
URLError(gaierror(8, 'nodename nor servname provided, or not known'),)
>>>

The other results look like this. One drawback is that extracting just the message string from the exception takes some extra work. Also, feedparser tried to parse resp3 even though the server returned 404 Not Found, so when status is present it is a good idea to check it as well.
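Putting these observations together, a check might look at status first (when it exists) and then at bozo. The following is a sketch under stated assumptions, not the library's recommended procedure: `check_feed` is a hypothetical function, and treating any status of 400 or above as a failure is my own choice.

```python
def check_feed(result):
    """Return (ok, reason) for a dict-like parse result."""
    # status is missing when no HTTP response was received (as with resp4).
    status = result.get("status")
    if status is not None and status >= 400:  # assumption: 4xx/5xx means failure
        return False, f"HTTP status {status}"
    # Assumption: a result without a bozo key is treated as a failure.
    if result.get("bozo", 1):
        exc = result.get("bozo_exception")
        detail = type(exc).__name__ if exc is not None else "unknown error"
        return False, f"parse failed: {detail}"
    return True, "ok"


# Plain dictionaries standing in for the kinds of results seen above.
print(check_feed({"bozo": 0, "status": 200, "entries": [{}]}))
# → (True, 'ok')
print(check_feed({"bozo": 1, "status": 404, "bozo_exception": ValueError("x")}))
# → (False, 'HTTP status 404')
print(check_feed({"bozo": 1, "bozo_exception": OSError("no such host")}))
# → (False, 'parse failed: OSError')
```

Checking status before bozo means a 404 is reported as an HTTP problem rather than as a content-type or XML error, which is usually the more useful message.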
