**Preprocessing is essential for natural language processing.** Text is just a sequence of characters with no structure, so it is difficult to process as is. Web text in particular contains noise such as HTML tags and JavaScript code. Unless such noise is removed by preprocessing, you will not get the results you expect.
(Figure source: [Deep learning for computational biology](http://msb.embopress.org/content/12/7/878))

This article describes **the types of preprocessing used in natural language processing and their power**. The types of preprocessing are explained first: each one is covered from the viewpoints of 1. what the processing is, 2. why it is performed, and 3. how to implement it (as far as possible). After that, we compare the results of document classification with and without preprocessing to measure how powerful preprocessing actually is.
This section describes the five preprocessing steps listed below, each from the viewpoints of 1. what the processing is, 2. why it is performed, and 3. how to implement it.
Text cleaning removes the noise contained in text. Typical noise includes JavaScript code and HTML tags. Removing this noise reduces its negative impact on the results of the task. The idea looks like this:
Removing JavaScript and HTML tags is common, but in practice the noise you want to remove depends on your data. Regular expressions are a handy tool in such cases. When writing a regular expression, using an online editor such as the one below to check pattern matches in real time makes the work go much more smoothly.
Python has libraries that are useful for cleaning, such as Beautiful Soup and lxml. Here is an example of text cleaning with Beautiful Soup: preprocessings/ja/cleaning.py
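As a rough illustration (not the repository's implementation), a minimal cleaning function might strip script and style blocks with Beautiful Soup and tidy up the leftover whitespace with a regular expression; the helper name `clean_html` and the exact rules are assumptions for this sketch.

```python
import re

from bs4 import BeautifulSoup


def clean_html(html):
    """Remove HTML tags plus embedded JavaScript/CSS from raw web text."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop <script> and <style> blocks entirely, not just their tags.
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    # Collapse the runs of whitespace left behind by removed elements.
    return re.sub(r"\s+", " ", text).strip()


print(clean_html("<p>Hello <script>alert('x');</script><b>world</b></p>"))
# => "Hello world"
```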
For languages such as Japanese, where word boundaries are not marked, the first step is word segmentation. The reason for splitting text into words is that most natural language processing systems handle their input at the word level. Segmentation is usually done with a morphological analyzer; the main ones are MeCab, [JUMAN++](http://nlp.ist.i.kyoto-u.ac.jp/index.php?JUMAN++), and Janome.
The text is segmented as shown below. At this point, words may also be reduced to their base form to keep the vocabulary small:
The problem with morphological analysis is that, by default, it is not good at handling new words. In the example above, "The National Art Center, Tokyo" (国立新美術館) is split into three pieces: "National" (国立), "New" (新), and "Museum" (美術館). This happens because the dictionary used for the analysis does not contain "The National Art Center, Tokyo". The problem is especially serious on the web, where new words appear constantly.
This problem can be solved to some extent by using a dictionary called NEologd. NEologd contains many more new words than the standard dictionaries, so new words become easier to analyze correctly. Below is the result of analyzing the same sentence with NEologd:
Subsequent processing is applied to the text after it has been segmented into words as described above. A Python implementation is here: preprocessings/ja/tokenizer.py
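As a rough sketch (not the repository's tokenizer.py), word segmentation with MeCab could look like the following; the NEologd dictionary path is an assumption and depends on where mecab-ipadic-neologd was installed.

```python
import MeCab

# "-d" points MeCab at the NEologd dictionary; this path is only an example.
NEOLOGD_PATH = "/usr/lib/mecab/dic/mecab-ipadic-neologd"
tagger = MeCab.Tagger("-Owakati -d {}".format(NEOLOGD_PATH))


def tokenize(text):
    """Split a Japanese sentence into a list of surface forms."""
    return tagger.parse(text).strip().split()


print(tokenize("国立新美術館まで行ってきた"))
```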
Word normalization replaces words so that character types are unified and spelling and notational variations are absorbed. With this processing, for example, the full-width katakana "ネコ", the half-width katakana "ﾈｺ", and the hiragana "ねこ" can all be treated as the same word. This matters for the computational cost and memory usage of the subsequent processing.
There are many ways to normalize words, but this article introduces the following three.
Unifying character types means converting uppercase letters to lowercase and half-width characters to full-width. For example, the uppercase letters in "Natural" are converted to lowercase to get "natural", and the half-width "ﾈｺ" is converted to full-width to get "ネコ". With this processing, words can be treated as the same word regardless of their character type.
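As a small sketch of one common convention (not necessarily the article's exact rules), the snippet below lowercases the alphabet and applies Unicode NFKC normalization, which converts half-width katakana to full-width; note that NFKC also converts full-width ASCII to half-width, so adjust this to your own needs.

```python
import unicodedata


def unify_char_types(text):
    """Lowercase the alphabet and normalize character widths via NFKC."""
    # NFKC turns half-width katakana into full-width katakana
    # (and full-width ASCII into half-width ASCII).
    return unicodedata.normalize("NFKC", text).lower()


print(unify_char_types("Natural"))  # => "natural"
print(unify_char_types("ﾈｺ"))       # => "ネコ"
```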
Number replacement replaces the numbers that appear in text with another symbol (for example, 0). Suppose a text contains a date string such as "2017/1/1". Number replacement converts the digits in this string into something like "0/0/0".
The reason for replacing numbers is that, despite the huge variety of numerical expressions and their high frequency, they are often not useful for natural language processing tasks. For example, consider the task of classifying news articles into categories such as "sports" or "politics". Many numerical expressions will appear in the articles, but they contribute little to the classification. Replacing them all with a single symbol therefore shrinks the vocabulary.
For the same reason, numbers are not replaced in tasks where numerical expressions do matter (such as information extraction).
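A minimal sketch of number replacement using a regular expression; replacing every run of digits with a single "0" is one common convention, not the only possible one.

```python
import re


def replace_numbers(text, symbol="0"):
    """Replace every run of digits with a placeholder symbol."""
    return re.sub(r"\d+", symbol, text)


print(replace_numbers("2017/1/1"))  # => "0/0/0"
```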
In dictionary-based word unification, words are replaced with a representative expression. For example, when dealing with sentences in which the notations "ソニー" and "Sony" are mixed, "ソニー" is replaced with "Sony". The two can then be treated as the same word in later processing. Note that the replacement has to take the context into account.
The world of word normalization is deep. Beyond the normalizations explained above, there are others such as absorbing spelling variations (loooooooooooool -> lol), expanding abbreviations (4eva -> forever), and normalizing colloquial expressions (っす -> です). With a large amount of data, some of this can probably be handled by the distributed representations of words described later, but in the end it is best to apply whatever processing the task you want to solve requires.
An implementation of some of the word normalizations described above is here: preprocessings/ja/normalization.py
A stop word is a word that is excluded from processing because it is too common to be useful, among other reasons. Examples are function words such as particles and auxiliary verbs ("は", "の", "です", "ます"). They appear very frequently but carry little useful information, and they are removed because they hurt both computation cost and performance.
There are various methods for removing stop words, but this article introduces the following two.
In the dictionary-based method, stop words are defined in a dictionary in advance, and the words contained in the dictionary are removed from the text. You can build the dictionary yourself, but predefined dictionaries already exist. Here we look at the contents of Slothlib, one of the Japanese stop word dictionaries. It defines about 300 words, one per line:
あそこ
あたり
あちら
あっち
あと
あな
あなた
あれ
いくつ
いつ
いま
いや
いろいろ
...
The words defined in this dictionary are read in and used as stop words: any of them that appear in the word-segmented text are removed. The idea looks like this:
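A minimal sketch of the dictionary-based approach, assuming the Slothlib word list has been saved to a local file (the file name is an assumption):

```python
def load_stopwords(path="stopwords_ja.txt"):
    """Load one stop word per line from a dictionary file."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def remove_stopwords(words, stopwords):
    """Drop every token that appears in the stop word set."""
    return [w for w in words if w not in stopwords]


stopwords = load_stopwords()
print(remove_stopwords(["あなた", "は", "猫", "が", "好き"], stopwords))
```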
The dictionary-based method is simple, but it has drawbacks. One is the cost of building the dictionary. Another is that a given dictionary may not be useful for some domains, so you need to rebuild it for the corpus you are targeting.
The frequency-based method counts word frequencies in the text and removes high-frequency (and sometimes very low-frequency) words. High-frequency words are removed because, although they make up a large share of the text, they are of little use. The figure below plots the cumulative frequency of the 50 most frequent words in an English book: words that seem useless for document classification, such as "the", "of", and commas, account for nearly 50% of the text. The frequency-based method removes such high-frequency words from the text as stop words.
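A small sketch of the frequency-based approach, treating the N most frequent tokens across a corpus as stop words (the value of N is an arbitrary choice for illustration):

```python
from collections import Counter


def stopwords_by_frequency(tokenized_docs, n_most_common=50):
    """Treat the n most frequent tokens in the corpus as stop words."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return {token for token, _ in counts.most_common(n_most_common)}


docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "ate", "the", "bone"]]
stopwords = stopwords_by_frequency(docs, n_most_common=1)  # just "the"
print([[t for t in doc if t not in stopwords] for doc in docs])
```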
Click here for an implementation that removes stopwords: preprocessings/ja/stopwords.py
Vector representation of words converts a word, which is a character string, into a vector. The reason for converting strings into vectors is that strings are variable-length and awkward to handle, and it is hard to compute similarity between them. There are various vector representations, but I will introduce the following two.
The first way to represent a word as a vector is the one-hot representation. A one-hot representation is a vector in which exactly one element is 1 and all other elements are 0. By setting each dimension to 1 or 0, the vector expresses "whether or not it is that word".
For example, let's represent the word python as a one-hot vector. Suppose the vocabulary, the set of known words, consists of five words (nlp, python, word, ruby, one-hot). Then the vector representing python looks like this: (0, 1, 0, 0, 0).
The one-hot representation is simple, but it has the drawback that operations between vectors produce no meaningful result. For example, suppose you take the dot product to compute the similarity between two words. Because different words have their 1 in different positions and every other element is 0, the dot product between any two different words is 0, which is not what we want. In addition, since each word gets its own dimension, the vectors become extremely high-dimensional as the vocabulary grows.
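A tiny numpy sketch of one-hot vectors and the dot-product problem described above, using the toy vocabulary from the example:

```python
import numpy as np

vocab = ["nlp", "python", "word", "ruby", "one-hot"]
word_to_id = {w: i for i, w in enumerate(vocab)}


def one_hot(word):
    """Return a vector with a 1 at the word's index and 0 elsewhere."""
    vec = np.zeros(len(vocab))
    vec[word_to_id[word]] = 1
    return vec


print(one_hot("python"))                    # [0. 1. 0. 0. 0.]
print(one_hot("python") @ one_hot("ruby"))  # 0.0 -> no similarity signal
```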
Distributed representations, on the other hand, represent words as low-dimensional real-valued vectors, typically around 50 to 300 dimensions. For example, the words mentioned above could be expressed as follows with a distributed representation.
Distributed representations solve the problems of one-hot representations. For example, operations between vectors now let you compute the similarity between words: looking at the vectors above, the similarity between python and ruby should come out higher than the similarity between python and word. Also, the dimensionality of each word's vector does not have to grow as the vocabulary grows.
Click here for the implementation that obtains distributed representation vectors: preprocessings/ja/word_vector.py
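As a rough sketch (not the repository's word_vector.py), distributed representations can be trained with gensim's word2vec; the corpus below is a toy placeholder, and in practice you would feed in the word-segmented sentences produced by the earlier steps.

```python
from gensim.models import Word2Vec

# Toy corpus of already-tokenized sentences.
sentences = [["python", "is", "a", "programming", "language"],
             ["ruby", "is", "a", "programming", "language"],
             ["word", "vectors", "capture", "similarity"]]

# vector_size controls the dimensionality of the distributed representation.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

print(model.wv["python"])                     # 100-dimensional real vector
print(model.wv.similarity("python", "ruby"))  # cosine similarity
```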
This section examines how effective preprocessing actually is. Specifically, we compared classification performance and execution time on a document classification task with and without preprocessing. The result: preprocessing improved classification performance and cut the execution time roughly in half.
I used the livedoor news corpus as the document classification dataset. The livedoor news corpus is a collection of livedoor news articles with the HTML tags stripped. The dataset contains the nine classes listed below:
Here we briefly describe which preprocessing was applied in each setting.
Without preprocessing, the text is morphologically analyzed (with ipadic), converted to bag-of-words (BoW), and weighted with TF-IDF.
With preprocessing, on the other hand, the text is first cleaned. Three cleaning steps were applied:
After cleaning, the text is segmented using NEologd as the morphological analysis dictionary, and the segmented words are then normalized. Two normalization steps were applied:
Stop words are removed from the normalized words based on frequency of occurrence, and finally the BoW vectors weighted with TF-IDF are classified. RandomForest is used as the classifier.
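This is not the notebook's exact code, but with scikit-learn the TF-IDF weighting and RandomForest classification described above might look like the following, assuming `texts` already holds cleaned, segmented documents joined by spaces.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Placeholder data: preprocessed documents (space-joined tokens) and labels.
texts = ["livedoor news article one", "livedoor news article two"]
labels = ["sports", "politics"]

pipeline = make_pipeline(
    TfidfVectorizer(),          # BoW counts weighted by TF-IDF
    RandomForestClassifier(),   # the classifier used in the experiment
)
pipeline.fit(texts, labels)
print(pipeline.predict(["another news article"]))
```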
For the results, we compare classification performance and execution time. First, the classification performance (accuracy):
| With preprocessing | Without preprocessing |
|---|---|
| 0.917 | 0.898 |
~~There was almost no change in classification performance with and without preprocessing. I expected it to improve performance, but... this needs further investigation.~~ After correcting a mistake in the implementation, there is a 1.9-point difference between the runs with and without preprocessing. At this level of performance, a 1.9-point gap is a meaningful difference.
Comparing the execution times gives the following result. Without preprocessing the run takes about 600 seconds, but with preprocessing the computation finishes in roughly half that time. This is likely because cleaning, normalization, and especially stop word removal shrink the vocabulary, which in turn reduces the execution time.
notebooks/document_classification.ipynb
Preprocessing is indispensable for natural language processing. Handling natural language on a computer requires a variety of processing steps, and this article has introduced some of them. I hope you find it useful.
I also tweet about machine learning and natural language processing on my Twitter account: @Hironsan. Feel free to follow me if you are interested in this area.