[PYTHON] [Recommended Tagging with Machine Learning #2] Extending the Scraping Script

Hi, this is Chogo again. It's a cold day today, a good day for programming inside a warm home :)

So today's topic is scraping again. Before that, I'd like to explain the goal of this series. My goal is to build a tag suggestion system using machine learning with a Bayesian method: it learns from the articles I have bookmarked and the tags I have already put on them, then checks new articles and suggests tags for them. I have too many things to learn, so I don't know how many articles it will take to reach the goal. I will go one by one.

OK, so today's topic is still scraping. In article #1 I explained how to scrape articles from Hatena Blog. However, that script only worked for Hatena Blog, so I have to extend it to other web sites.

First, I'd like to show you the modified script.

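    import re
    import urllib.request

    from bs4 import BeautifulSoup

    # One entry per site: [domain, tag name, attribute name, attribute value]
    scraper = [
            ["hatenablog.com", "div", "class", "entry-content"],
            ["qiita.com", "section", "itemprop", "articleBody"]
            ]

    # Find the index of the first entry whose domain appears in the URL.
    # `url` is assumed to be set before this point.
    c = 0
    for domain in scraper:
        print(url, domain[0])
        if re.search(re.escape(domain[0]), url):
            break
        c += 1
    else:
        # No entry matched, so scraper[c] would be out of range.
        raise ValueError("no scraper entry for " + url)

    response = urllib.request.urlopen(url)
    html = response.read()

    soup = BeautifulSoup(html, "lxml")
    tag = soup.find(scraper[c][1], {scraper[c][2]: scraper[c][3]})
    if tag is None:
        raise ValueError("article body not found in " + url)

    # Strip the remaining markup from each child node and join the text.
    p = re.compile(r'<.*?>')
    text = ""
    for con in tag.contents:
        text += p.sub('', str(con))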
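
The last loop in the script removes leftover markup with the non-greedy pattern `<.*?>`. Here is a minimal standalone sketch of just that step, written for Python 3 with only the standard library. The `strip_tags` name and the sample fragment are my own for illustration and are not part of the original script.

```python
import re

# Same non-greedy pattern as the script: "<", as few characters as possible, ">".
TAG_PATTERN = re.compile(r'<.*?>')


def strip_tags(markup):
    """Remove anything that looks like an HTML tag and keep the text in between."""
    return TAG_PATTERN.sub('', markup)


# Hypothetical fragment, made up for this example.
sample = '<p>Hello <a href="https://example.com">world</a>!</p>'
print(strip_tags(sample))  # Hello world!
```

Because `.` does not match a newline by default, a tag that is split across two lines would slip through this pattern. BeautifulSoup's own `tag.get_text()` is an alternative that returns the text without a regular expression.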

The modified script can scrape articles from Hatena Blog and Qiita. Below are the tags that enclose the article body on each site.

Hatena Blog:

    <div class="entry-content">
    CONTENTS to SCRAPE!
    </div>

Qiita:

    <div class="col-sm-9 itemsShowBody_articleColumn"><section class="markdownContent markdownContent-headingEnabled js-task-list-container clearfix position-relative js-task-list-enabled" id="item-xxx" itemprop="articleBody">
    CONTENTS to SCRAPE!
    </section></div>

So with BeautifulSoup, I wrote it like this. First, prepare the ingredients for the soup...

    scraper = [ 
            ["hatenablog.com","div","class","entry-content"],
            ["qiita.com","section","itemprop", "articleBody"]
            ]

Then, have the soup!

    tag = soup.find( scraper[c][1], {scraper[c][2] : scraper[c][3]})
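
To see what the (tag, attribute, value) triple selects without installing anything, here is a rough stand-in for that `find` call built on the standard library's `html.parser`. It is not BeautifulSoup and only approximates its matching: the attribute value is compared against the whitespace-separated tokens of the attribute. The `ElementText` class, the `extract` function, and the shortened sample markup are all assumptions of mine for illustration.

```python
from html.parser import HTMLParser


class ElementText(HTMLParser):
    """Collect the text inside the first element matching (tag, attribute, value)."""

    def __init__(self, tag, attr, value):
        super().__init__()
        self.tag, self.attr, self.value = tag, attr, value
        self.depth = 0     # 0 = outside the target, >0 = nesting level inside it
        self.done = False  # True once the target element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            # Count nested elements of the same name so the right close tag ends it.
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag and self.value in (dict(attrs).get(self.attr) or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract(html, tag, attr, value):
    """Return the stripped text of the first matching element, or '' if none matches."""
    parser = ElementText(tag, attr, value)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


# Shortened versions of the two structures shown above.
hatena = '<div class="entry-content">\nCONTENTS to SCRAPE!\n</div>'
qiita = ('<div class="col-sm-9"><section class="markdownContent" '
         'itemprop="articleBody">\nCONTENTS to SCRAPE!\n</section></div>')

print(extract(hatena, "div", "class", "entry-content"))
print(extract(qiita, "section", "itemprop", "articleBody"))
```

Both calls pick out the article body even though the two sites use different tags and attributes, which is the reason the table stores all three pieces per site.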

Good. Now that I can describe the target element for each web site, I can extend the scraper to other sites just by adding their tag information to the list!
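
As a small sketch of that extension step, the lookup of the matching entry can be wrapped in a function, and a new site becomes one more row in the list. Note the differences from the script above: this version compares the host name parsed by `urllib.parse` instead of running `re.search` over the whole URL, and it raises an error when nothing matches. The `find_rule` name and the `example.com` entry are hypothetical and only here for illustration.

```python
from urllib.parse import urlparse

# [domain, tag name, attribute name, attribute value], as in the article.
scraper = [
    ["hatenablog.com", "div", "class", "entry-content"],
    ["qiita.com", "section", "itemprop", "articleBody"],
]


def find_rule(url, rules):
    """Return the first rule whose domain equals the URL's host or is a parent of it."""
    host = urlparse(url).hostname or ""
    for rule in rules:
        domain = rule[0]
        if host == domain or host.endswith("." + domain):
            return rule
    raise LookupError("no scraping rule for " + url)


# Supporting another site is one more row. This entry is hypothetical:
# example.com and its <article id="main"> markup are made up for illustration.
scraper.append(["example.com", "article", "id", "main"])

print(find_rule("https://someone.hatenablog.com/entry/2017/01/01", scraper))
print(find_rule("https://example.com/posts/1", scraper))
```

The returned row can then be passed to `soup.find(rule[1], {rule[2]: rule[3]})` in place of indexing with `c`.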

That's all for today, but this series will continue.