Hi, this is Chogo again. It's a cold day today, which makes it a good day for programming inside a warm home :)
So today's topic is scraping again, but before that, I'd like to explain the goal of this series. My goal is to build a tag-suggestion system using machine learning with a Bayesian method: it learns from the articles and tags I have already bookmarked, then checks new articles and suggests tags for them. There are many things to learn along the way, so I don't know how many articles it will take to reach the goal; I will go one step at a time.
OK, so back to today's topic, which is still scraping. In article #1 I explained how to scrape articles from Hatena Blog, but that script only worked for Hatena Blog. I have to extend it to other web sites.
First, I'd like to show you the modified script.
import re
import urllib2
from bs4 import BeautifulSoup

# one row per site: [domain, tag name, attribute name, attribute value]
scraper = [
    ["hatenablog.com", "div", "class", "entry-content"],
    ["qiita.com", "section", "itemprop", "articleBody"]
]

# url is assumed to be set before this point
c = 0
for domain in scraper:
    print url, domain[0]
    if re.search(domain[0], url):
        break
    c += 1

response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html, "lxml")
soup.original_encoding  # detected encoding of the page (not used further)

# pull out the article body using the matched site's tag information
tag = soup.find(scraper[c][1], {scraper[c][2]: scraper[c][3]})

# strip any HTML tags that remain inside the article body
text = ""
p = re.compile(r'<.*?>')
for con in tag.contents:
    text += p.sub('', con.encode('utf8'))
This script can scrape articles from both Hatena Blog and Qiita.
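As a side note, the same logic can be wrapped in a reusable function. This is just a minimal sketch under my own assumptions: the function name scrape_article, the unknown-domain guard, and the example URL are all made up for illustration.

import re
import urllib2
from bs4 import BeautifulSoup

SCRAPER = [
    ["hatenablog.com", "div", "class", "entry-content"],
    ["qiita.com", "section", "itemprop", "articleBody"]
]

def scrape_article(url):
    # find the matching site entry; give up on unknown domains
    entry = None
    for domain in SCRAPER:
        if re.search(domain[0], url):
            entry = domain
            break
    if entry is None:
        return None
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")
    tag = soup.find(entry[1], {entry[2]: entry[3]})
    if tag is None:
        return None
    # strip leftover HTML tags, same as the script above
    p = re.compile(r'<.*?>')
    text = ""
    for con in tag.contents:
        text += p.sub('', con.encode('utf8'))
    return text

# hypothetical example URL
print scrape_article("http://example.hatenablog.com/entry/2016/01/01/000000")

(BeautifulSoup's tag.get_text() would do roughly the same tag stripping in one call, but I kept the regex approach from the script above.) Anyway, the article body on each site is wrapped in the following tags.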
Hatena Blog:
<div class="entry-content">
CONTENTS to SCRAPE!
</div>
Qiita:
<div class="col-sm-9 itemsShowBody_articleColumn"><section class="markdownContent markdownContent-headingEnabled js-task-list-container clearfix position-relative js-task-list-enabled" id="item-xxx" itemprop="articleBody">
CONTENTS to SCRAPE!
</section>
</div>
So with BeautifulSoup, I wrote it up like this, feeding the elements for the soup...
scraper = [
["hatenablog.com","div","class","entry-content"],
["qiita.com","section","itemprop", "articleBody"]
]
Then, have the soup!
tag = soup.find( scraper[c][1], {scraper[c][2] : scraper[c][3]})
Good. Now that I can specify the elements for the soup per web site, I can extend the scraping to articles on other sites!
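For example, to support one more site, I would only have to add one row with its tag information. The third row here is a made-up example, not a real site's markup:

scraper = [
    ["hatenablog.com", "div", "class", "entry-content"],
    ["qiita.com", "section", "itemprop", "articleBody"],
    ["example.com", "article", "class", "post-body"]  # hypothetical new site
]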
That's all for today, but this series is still going on.