Web page summarization (preprocessing)

Preface

This article is based on item ❷ of "Collection and classification of machine learning related information (concept)".

To automatically classify web pages collected by crawling, it is possible to classify them in an on-premises environment with algorithms such as random forests and decision trees, as previously investigated in "Multi-label classification by random forest with scikit-learn"; on the other hand, it is also possible to use services on the cloud, as in "Example of using Watson Natural Language Classifier from Ruby".

Pre-processing, as described in "Types of pre-processing in natural language processing and its power", is very important.

However, automatically classifying web pages with services on the cloud requires a preprocessing step that is not covered in that article.

That step is "web page summarization".

Generally, services on the cloud limit the amount of data per record. For example, Watson Natural Language Classifier limits each record to be classified to 1024 characters, so in "Example of using Watson Natural Language Classifier from Ruby" I cut out the first 1024 bytes of each blog article and passed them to the service. This, too, is a rudimentary "summary".
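That rudimentary truncation amounts to nothing more than the following sketch (my illustration, not code from the linked article); article_text is a hypothetical variable holding the body of a blog article.

def truncate_for_classifier(article_text, limit=1024):
    # Rudimentary "summary": keep only the beginning of the article
    # (the original cut 1024 bytes from the front; here the first `limit` characters are kept),
    # ignoring sentence boundaries and information content.
    return article_text[:limit]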

Web pages, on the other hand, have no upper limit on their size, so it is necessary to devise a way to reduce the amount of data passed to the cloud service without dropping the information required for classification.

I suspect that the most orthodox technique for this is "web page summarization."

General summarization techniques

I could not find much about recent trends in summarization technology, but "Research trends in automatic summarization technology", from about three years ago, was helpful.

Also, as concrete implementations using LexRank, "I will summarize Donald Trump's speech in three lines with the automatic summarization algorithm LexRank" and "Summary of product value of EC site using automatic summarization algorithm LexRank" are posted on Qiita.

These implementations treat a document as a set of "sentences", assign an importance score to each sentence, and extract sentences in order of decreasing importance to form the summary.

The question here is how to divide a web page into “sentences”.

Regarding this, I could not find any implementation code even after searching the net; the only implementable logic I found was the paper ["HTML text segmentation for Web page summarization by extracting important sentences"](http://harp.lib.hiroshima-u.ac.jp/hiroshima-cu/metadata/5532)[^1].

Since no implementation code was available after all, this time I decided to implement the logic of this paper myself[^2].

Specific logic

The specific logic consists of the following three stages.

(1) Divide the whole page into text units (sentence candidates). The delimiters are:

- Block-level elements
- Link tags (when two or more are consecutive)
- Kuten ("。", the Japanese full stop)

(2) Classify text units into sentence units and non-sentence units. A text unit is classified as a sentence unit if it satisfies 3 or more of the following conditions:

- C1. The number of independent words is 7 or more.
- C2. The ratio of independent words to the total number of words is 0.64 or less.
- C3. The ratio of attached words to independent words is 0.22 or more.[^3]
- C4. The ratio of particles to independent words is 0.26 or more.
- C5. The ratio of auxiliary verbs to independent words is 0.06 or more.

(3) Combine and divide non-sentence units based on the number of independent words

For the sake of simplicity, this implementation only joins non-sentence units and does not split them.

Also, the code below extracts only the body part of the HTML, because the content of the title element is assumed to be used for the file name of the downloaded HTML file. In general, it is a good idea to include the content of the title element in the summary as well.

Splitting HTML into text units and converting it to plain text

html2plaintext_part_1.py


import codecs
import re

class Article:

    # Try these character encodings in this order
    encodings = [
        "utf-8",
        "cp932",
        "euc-jp",
        "iso-2022-jp",
        "latin_1"
    ]

    # Regular expression for extracting block-level elements
    block_level_tags = re.compile("(?i)</?(" + "|".join([
        "address", "blockquote", "center", "dir", "div", "dl",
        "fieldset", "form", "h[1-6]", "hr", "isindex", "menu",
        "noframes", "noscript", "ol", "pre", "p", "table", "ul",
        "dd", "dt", "frameset", "li", "tbody", "td", "tfoot",
        "th", "thead", "tr"
        ]) + ")(>|[^a-z].*?>)")

    def __init__(self,path):
        print(path)
        self.path = path
        self.contents = self.get_contents()

    def get_contents(self):
        for encoding in self.encodings:
            try:
                lines = ' '.join([line.rstrip('\r\n') for line in codecs.open(self.path, 'r', encoding)])
                parts = re.split("(?i)<(?:body|frame).*?>", lines, 1)
                if len(parts) == 2:
                    head, body = parts
                else:
                    print('Cannot split ' + self.path)
                    body = lines
                body = re.sub(r"(?i)<(script|style|select).*?>.*?</\1\s*>"," ", body)
                body = re.sub(self.block_level_tags, ' _BLOCK_LEVEL_TAG_ ', body)
                body = re.sub(r"(?i)<a\s.+?>",' _ANCHOR_LEFT_TAG_ ', body)
                body = re.sub("(?i)</a>",' _ANCHOR_RIGHT_TAG_ ', body)
                body = re.sub("(?i)<[/a-z].*?>", " ", body)
                blocks = []
                for block in body.split("_BLOCK_LEVEL_TAG_"):
                    units = []
                    for unit in block.split("。"):
                        unit = re.sub("_ANCHOR_LEFT_TAG_ +_ANCHOR_RIGHT_TAG_", " ", unit) # drop anchors with no text, e.g. links around images
                        if not re.match(r"^ *$", unit):
                            # two or more consecutive link tags are treated as a delimiter
                            for fragment in re.split("((?:_ANCHOR_LEFT_TAG_ .+?_ANCHOR_RIGHT_TAG_ ){2,})", unit):
                                fragment = re.sub("_ANCHOR_(LEFT|RIGHT)_TAG_", ' ', fragment)
                                if not re.match(r"^ *$", fragment):
                                    if TextUnit(fragment).is_sentence():
                                        # Sentence units are terminated with "。"
                                        if len(units) > 0 and units[-1] == '―':
                                            units.append('。\n')
                                        units.append(fragment)
                                        units.append(' 。\n')
                                    else:
                                        # Non-sentence units are joined with "―" and the run is terminated with "。"
                                        # (Constraint) Unlike the paper, non-sentence units are only combined, not split
                                        units.append(fragment)
                                        units.append('―')
                    if len(units) > 0 and units[-1] == '―':
                       units.append('。\n')
                    blocks += units
                return re.sub(" +", " ", "".join(blocks))
            except UnicodeDecodeError:
                continue
        print('Cannot detect encoding of ' + self.path)
        return None
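Note that Article uses the TextUnit class defined in the next part, so the three parts are intended to be combined into a single script (html2plaintext.py). As a usage sketch (my addition; the path is hypothetical), a single downloaded HTML file can be converted like this:

article = Article('/home/samba/example/links/bookmarks.crawled/sample.html')  # hypothetical path
if article.contents is not None:
    print(article.contents)  # the extracted text units, each terminated with "。"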

Distinguishing between sentence units and non-sentence units

html2plaintext_part_2.py


from janome.tokenizer import Tokenizer
from collections import defaultdict
import mojimoji
#import re

class TextUnit:

    tokenizer = Tokenizer("user_dic.csv", udic_type="simpledic", udic_enc="utf8")

    def __init__(self,fragment):
        self.fragment   = fragment
        self.categories = defaultdict(int)
        for token in self.tokenizer.tokenize(self.preprocess(self.fragment)):
            self.categories[self.categorize(token.part_of_speech)] += 1

    def categorize(self, part_of_speech):
        # janome returns part-of-speech tags in Japanese, so the patterns below match the Japanese names
        # (名詞 = noun, 動詞 = verb, 形容詞 = adjective, 助詞 = particle, 助動詞 = auxiliary verb)
        if re.match("^名詞,(一般|代名詞|固有名詞|サ変接続|[^,]+語幹)", part_of_speech):
            return 'Independence'
        if re.match("^動詞", part_of_speech) and not re.match("サ変", part_of_speech):
            return 'Independence'
        if re.match("^形容詞,自立", part_of_speech):
            return 'Independence'
        if re.match("^助詞", part_of_speech):
            return 'Particle'
        if re.match("^助動詞", part_of_speech):
            return 'Auxiliary verb'
        return 'Other'

    def is_sentence(self):
        if self.categories['Independence'] == 0:
            return False
        match = 0
        if self.categories['Independence'] >= 7:
            match += 1
        if 100 * self.categories['Independence'] / sum(self.categories.values()) <= 64:
            match += 1
        if 100 * (self.categories['Particle'] + self.categories['Auxiliary verb']) / self.categories['Independence'] >= 22:
            # Following the paper, "attached words" are interpreted as particles ∪ auxiliary verbs (differs from the usual definition)
            match += 1
        if 100 * self.categories['Particle'] / self.categories['Independence'] >= 26:
            match += 1
        if 100 * self.categories['Auxiliary verb'] / self.categories['Independence'] >= 6:
            match += 1
        return match >= 3

    def preprocess(self, text):
        text = re.sub("&[^;]+;", " ", text)            # drop HTML character entities
        text = mojimoji.han_to_zen(text, digit=False)  # convert half-width characters to full-width (digits excluded)
        text = re.sub('[\t 　]+', " ", text)            # collapse tabs and half/full-width spaces into a single space
        return text
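To check the sentence / non-sentence classification in isolation, a sketch like the following can be used (my addition). The class-level tokenizer expects user_dic.csv to be present; for a throwaway test without a user dictionary, janome's default Tokenizer can be swapped in.

TextUnit.tokenizer = Tokenizer()  # bypass the user dictionary for a quick test
unit = TextUnit("この記事では、クローリングで収集したWebページを要約するための前処理について説明します。")
print(dict(unit.categories))  # word counts per category: 'Independence', 'Particle', 'Auxiliary verb', 'Other'
print(unit.is_sentence())     # True when 3 or more of the conditions C1-C5 are satisfied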

Convert HTML files to plain text files

html2plaintext_part_3.py


if __name__ == '__main__':
    import glob
    import os

    path_pattern = '/home/samba/example/links/bookmarks.crawled/**/*.html'
    # The converted plaintext is put as '/home/samba/example/links/bookmarks.plaintext/**/*.txt'
    for path in glob.glob(path_pattern, recursive=True):
        article = Article(path)
        plaintext_path = re.sub("(?i)html?$", "txt", path.replace('.crawled', '.plaintext'))
        plaintext_dir  = re.sub("/[^/]+$", "", plaintext_path)
        if not os.path.exists(plaintext_dir):
            os.makedirs(plaintext_dir)
        if article.contents is not None:  # skip files whose encoding could not be detected
            with open(plaintext_path, 'w') as f:
                f.write(article.contents)

I'm new to Python, so this may be clumsy code; I'd appreciate it if you could point out any improvements.

Plain text summary

The plain text generated in this way should be summarized using something like LexRank.

In this example, janome was used for morphological analysis for simplicity, but since "HTML -> plain text" and "plain text -> summary" are separate steps, you may use a completely different morphological analysis tool for the "plain text -> summary" step.
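As one possible sketch (my addition; the Qiita posts referenced above implement LexRank themselves), the LexRank summarizer in the sumy library can consume the generated plain text files directly. The path below is hypothetical, and I am assuming sumy's Japanese tokenization, which relies on the tinysegmenter package, is available in your environment.

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

# Hypothetical path to one of the plain text files generated above
plaintext_path = '/home/samba/example/links/bookmarks.plaintext/sample.txt'

# Tokenizer("japanese") assumes tinysegmenter is installed
parser = PlaintextParser.from_file(plaintext_path, Tokenizer("japanese"))
summarizer = LexRankSummarizer()

# Print the three most important sentences as the summary
for sentence in summarizer(parser.document, 3):
    print(sentence)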

[^1]: The original research dates from 2002 to 2004, so more efficient logic may have been proposed more recently.
