Python: Natural language processing

Basics of natural language processing

Overview of natural language processing

Natural language (NL, Natural Language)

Refers to naturally occurring languages such as Japanese and English. It stands in contrast to artificial languages (Artificial Language) such as programming languages.

Natural language processing (NLP, Natural Language Processing)

A technology that allows a computer to process the natural language that humans use on a daily basis. Tasks that use natural language processing include document classification, machine translation, document summarization, question answering, and dialogue.

The following words are often used in natural language processing.

Token: A character or character string that is treated as the smallest unit of a sentence when analyzing natural language.

Type: A distinct kind of word. Counting types counts distinct words, while counting tokens counts every occurrence.

Sentence: A unit that expresses a coherent piece of content. In natural language processing, it usually refers to a single sentence.

Document: Often refers to a single piece of data consisting of multiple sentences.

Corpus: Text or speech data to which some kind of information has been added.

Thesaurus: A systematic dictionary that classifies words by relationships such as hypernym/hyponym (broader/narrower), part/whole, and synonymy.

Morpheme: The smallest unit of meaning. For example, the Japanese word 食べた (tabeta, "ate") can be broken down into two morphemes, 食べ and た.

Word: A unit consisting of one or more morphemes.

Surface form: The form of a word as it actually appears in the original text.

Base form: The form of a word before conjugation (its dictionary form).

Feature: Information extracted from sentences and documents.

Dictionary: In natural language processing, it refers to a list of words.

Language differences

In order for a machine to understand the meaning of a sentence, it must first split the sentence into words. For example, the Japanese sentence 私はアイデミーで勉強します ("I study at Aidemy.") can be segmented as follows:

私 | は | アイデミー | で | 勉強し | ます
noun, particle, noun, case particle, verb, auxiliary verb

As shown above, we can identify word boundaries and find each word's part of speech and base form. In Japanese, identifying word boundaries is difficult; on the other hand, finding parts of speech and base forms is comparatively easy, because few words allow multiple interpretations.

Chinese and Thai are also languages that require word splitting.

In English:

I | study | at | Aidemy.
noun, verb, preposition, noun

Word boundaries can be identified by symbols such as spaces and periods, but identifying parts of speech is difficult because many words have multiple possible parts of speech. As this shows, a characteristic of natural language processing is that where the problems lie, and how difficult they are, differs from language to language.

Segmentation is especially difficult when a sentence is written only in hiragana, as below. The string くるまでまつ can be segmented in two ways:

くるま | で | まつ ("wait with the car")
くる | まで | まつ ("wait until [someone] comes")

Word division of sentences

Morphological analysis and N-gram

There are two main methods for dividing sentences into words:

① Morphological analysis (dictionary-based analysis)
② N-gram (mechanical analysis by number of characters)

A morpheme is the smallest linguistic unit that has meaning, and a word consists of one or more morphemes. Morphological analysis means dividing text into morphemes using a dictionary, then tagging each morpheme with its part of speech (adding information).

An N-gram, on the other hand, is an analysis method that cuts words into chunks of N characters, or sentences into chunks of N words.

A unigram cuts out one character (or word) at a time, a bigram cuts out every two characters (words), and a trigram cuts out every three characters (words).

For example, the unigrams, bigrams, and trigrams of the string あいうえお (aiueo) are as follows:

Unigram: {あ, い, う, え, お}
Bigram: {あい, いう, うえ, えお}
Trigram: {あいう, いうえ, うえお}

Unlike morphological analysis, N-grams require neither a dictionary nor any grammatical interpretation, so they can be used regardless of language. N-grams also have the advantage that feature-extraction omissions are less likely to occur, but the disadvantage that they produce more noise. Morphological analysis is the opposite: its results vary with dictionary quality, but it generates less noise.

For example, suppose you search for the somewhat long string 東京都の世界一有名なIT企業 ("the world's most famous IT company in Tokyo") by splitting it into character bigrams. Documents about 京都 (Kyoto) may then be hit, because the character bigrams of 東京都 include {東京, 京都}. This does not happen if you use morphological analysis properly, but that requires a high-performance dictionary.
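To see concretely why this happens, here is a minimal sketch (added for illustration; char_bigrams is a hypothetical helper, not part of the original course material) that extracts character bigrams with plain string slicing:

def char_bigrams(text):
    # Slide a window of width 2 across the string
    return [text[i:i+2] for i in range(len(text) - 1)]

print(char_bigrams("東京都"))
#Output result
# ['東京', '京都']  -- the bigram 京都 (Kyoto) appears, so a bigram search for 東京都 can also match Kyoto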

<Terms>

Word segmentation: splitting a sentence into words.
Part-of-speech tagging: classifying words by part of speech and tagging them.
Morphological analysis: the combined work of dividing text into morphemes and tagging parts of speech.

MeCab

Ready-made tools exist for performing morphological analysis. Typical Japanese morphological analyzers are MeCab and janome.

Both MeCab and janome perform morphological analysis by referring to a dictionary. Here you will learn how to use MeCab. See the example of morphological analysis using MeCab below.

import MeCab

mecab = MeCab.Tagger("-Owakati")
print(mecab.parse("明日は晴れるでしょう。"))
#Output result
#明日 は 晴れる でしょ う 。
mecab = MeCab.Tagger("-Ochasen")
print(mecab.parse("明日は晴れるでしょう。"))

"""
#Output result
明日	アシタ	明日	名詞-副詞可能
は	ハ	は	助詞-係助詞
晴れる	ハレル	晴れる	動詞-自立	一段	基本形
でしょ	デショ	です	助動詞	特殊・デス	未然形
う	ウ	う	助動詞	不変化型	基本形
。	。	。	記号-句点

EOS
"""

In MeCab, you can change the output format by changing the argument given to MeCab.Tagger(), as in the example above:

When "-Owakati" is used as the argument, the text is split into words (wakati-gaki).
When "-Ochasen" is used as the argument, morphological analysis is performed in ChaSen format.
Calling .parse("sentence") outputs the sentence given as the argument in the specified format.

janome (1)

janome is also one of the well-known Japanese morphological analyzers.

The advantage of janome is ease of installation: while installing MeCab can be a hassle, janome is installed as an ordinary Python package. To use janome, first import Tokenizer and create a Tokenizer object with Tokenizer().

from janome.tokenizer import Tokenizer

t = Tokenizer()

You can perform morphological analysis by passing the character string you want to parse to the tokenize method of Tokenizer.

The return value of the tokenize method is a list of tagged tokens (Token objects).

#Morphological analysis
from janome.tokenizer import Tokenizer

tokenizer = Tokenizer()  #Creating a Tokenizer object
tokens = tokenizer.tokenize("pythonの本を読んだ")
for token in tokens:
    print(token)

"""
#Output result
python	名詞,固有名詞,組織,*,*,*,python,*,*
の	助詞,連体化,*,*,*,*,の,ノ,ノ
本	名詞,一般,*,*,*,*,本,ホン,ホン
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
読ん	動詞,自立,*,*,五段・マ行,連用タ接続,読む,ヨン,ヨン
だ	助動詞,*,*,*,特殊・タ,基本形,だ,ダ,ダ
"""

janome (2)

In janome (1) we performed full morphological analysis. By specifying wakati=True in the arguments of the tokenize method, you can have the text split into words only (wakati-gaki).

The return value when wakati=True is a list of the segmented words.

from janome.tokenizer import Tokenizer

#Word-separation
t = Tokenizer()
tokens = t.tokenize("pythonの本を読んだ", wakati=True)
print(tokens)
#Output result
["python", "of", "Book", "To", "Read", "Is"]

split function

When you download document data for natural language processing, words are often separated by special characters, as in "banana, apple, strawberry".

In such cases, Python's built-in split method for strings is often used, so it is explained here.

The split method separates a string of numbers, letters, symbols, and so on into a list according to a given rule.

When a string is separated by whitespace or by delimiters (",", ".", "_", etc.), you can get a list split on that separator by calling string.split("delimiter"); calling split() with no argument splits on whitespace.

fruits = "banana, apple, strawberry"
print(fruits)  # str
print(fruits.split(","))  # list
#Output result
banana, apple, strawberry
['banana', ' apple', ' strawberry']
fruits = "banana apple strawberry"
print(fruits)  # str
print(fruits.split())  # list
#Output result
banana apple strawberry
['banana', 'apple', 'strawberry']

janome (3)

For each token (Token object):

Token.surface retrieves the surface form.
Token.part_of_speech retrieves the part of speech.

The surface form is the form that actually appears as a character string in a sentence.

from janome.tokenizer import Tokenizer

t = Tokenizer()  # create a Tokenizer object
tokens = t.tokenize("pythonの本を読んだ")
#Surface form
for token in tokens:
    print(token.surface)
"""
#Output result
python
の
本
を
読ん
だ
"""
#Part of speech
for token in tokens:
    print(token.part_of_speech)
"""
#Output result
名詞,固有名詞,組織,*
助詞,連体化,*,*
名詞,一般,*,*
助詞,格助詞,一般,*
動詞,自立,*,*
助動詞,*,*,*
"""

Here is an example:

from janome.tokenizer import Tokenizer

t = Tokenizer()
tokens = t.tokenize("鹿肉を食べた")

word = []

#The following is an example.
for token in tokens:
    part_of_speech = token.part_of_speech.split(",")[0]
    if part_of_speech == "noun" or part_of_speech == "verb":
        word.append(token.surface)
print(word)

#Output result
# ['鹿', '肉', '食べ']

N-gram

As mentioned earlier, an N-gram is an analysis method that cuts a word into chunks of N characters, or a sentence into chunks of N words.

The N-gram algorithm can be written as the gen_Ngram function below.

If you want the N-grams of a word, pass the word and the chunk size N as arguments. If you want the N-grams of a sentence, first create a list of words with janome's tokenize method, then pass that list and the chunk size N as arguments.

Consider the case where the sentence pythonの本を読んだ ("I read a python book") is divided into chunks of 3 words (N = 3).

With janome, it is segmented into ['python', 'の', '本', 'を', '読ん', 'だ']. There are 6 - 3 + 1 (number of segmented words - N + 1) = 4 runs of three consecutive words, so the result is ['pythonの本', 'の本を', '本を読ん', 'を読んだ'].

from janome.tokenizer import Tokenizer
tokenizer = Tokenizer()
tokens = tokenizer.tokenize("pythonの本を読んだ", wakati=True)
# tokens = ['python', 'の', '本', 'を', '読ん', 'だ']

def gen_Ngram(words,N):
    ngram = []  # the extracted N-grams are collected here
    for i in range(len(words)-N+1):  # iterate over every run of N consecutive elements
        cw = "".join(words[i:i+N])  # join N consecutive elements and assign to cw
        ngram.append(cw)

    return ngram

print(gen_Ngram(tokens, 2))
print(gen_Ngram(tokens, 3))
#Word trigrams of the sentence
gen_Ngram(tokens, 3)
#Output result
# ['pythonの本', 'の本を', '本を読ん', 'を読んだ']

#Character bigrams of a word
gen_Ngram("bird", 2)
#Output result
# ['bi', 'ir', 'rd']
from janome.tokenizer import Tokenizer
t = Tokenizer()
tokens = t.tokenize("1ro gave this book to the woman who saw 2ro.", wakati=True)

def gen_Ngram(words,N):
    #Generate Ngram
    ngram = []
    for i in range(len(words)-N+1):
        cw = "".join(words[i:i+N])
        ngram.append(cw)

    return ngram

print(gen_Ngram(tokens, 2))
print(gen_Ngram(tokens, 3))

Normalization

Normalization (1)

In natural language processing, when extracting features from multiple documents, input conventions are not always unified, so notation variations occur (e.g., iPhone vs. iphone).

Words that should be identical are then analyzed as different words, leading to unintended results. Converting characters based on rules, such as unifying full-width characters to half-width or uppercase letters to lowercase, is called normalization.

Note that if you over-normalize, you will no longer be able to distinguish things that should be distinguished.
<Terms>

Notation variation: the same word, with the same sound and meaning, written in different ways within the same document.
Normalization: converting letters and numbers on a rule basis to prevent notation variations.
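As a related sketch using only the standard library (an addition for illustration, not from the original text), Python's unicodedata module performs some of these rule-based conversions, such as full-width alphanumerics to half-width and half-width katakana to full-width, via NFKC normalization:

import unicodedata

# NFKC normalization unifies full-width alphanumerics to half-width
# and half-width katakana to full-width, among other compatibility mappings
print(unicodedata.normalize("NFKC", "ｱｲﾃﾞﾐｰでＰｙｔｈｏｎ"))
#Output result
# アイデミーでPython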

For string normalization, the neologdn library lets you perform normalization easily:

neologdn.normalize("the string you want to normalize")
#The normalized string is returned as the return value.

The normalizations that neologdn can perform include:

Unifying full-width alphanumeric characters to half-width: ａｉ -> ai
Unifying half-width katakana to full-width: ｶﾀｶﾅ -> カタカナ
Shortening long-vowel runs: メーーーー -> メー
Unifying similar character types: "˗֊‐‑‒–⁃⁻₋−" -> "-"
Removing unnecessary spaces: "ス ペ ー ス" -> "スペース"
Limiting character repetition (see the repeat argument in the example below)

Here is an example:

#Please import neologdn
import neologdn

#Unify half-width katakana to full-width
a = neologdn.normalize("ｶﾀｶﾅ")
print(a)

#Shorten long-vowel runs
b = neologdn.normalize("メーーーーーーーーー")
print(b)

#Unify similar character types
c = neologdn.normalize("いろんなハイフン˗֊‐‑‒–⁃⁻₋−")
print(c)

#Unify full-width alphanumeric characters to half-width + delete unnecessary spaces
d = neologdn.normalize("ＤＬ　ディープ　ラーニング")
print(d)

#Limit repetition
e = neologdn.normalize("みいいいいいいいいいい", repeat=6)
print(e)

#Output result
# カタカナ
# メー
# いろんなハイフン-
# DLディープラーニング
# みいいいいいい

Normalization (2)

In Normalization (1) we looked at normalization using a library. Next, for cases where you want to apply your own rules to your data, let's learn how to do the normalization yourself.

When the two spellings "iphone" and "iPhone" both appear in a document, the notation must be unified so that they are treated as the same word.

If you want to convert uppercase letters to lowercase, you can do so by appending .lower() to the string. Processing that is used many times, as in the example below, is often wrapped in a function.

def lower_text(text):
    return text.lower()
lower_letters = lower_text("iPhone")
print(lower_letters)
#Output result
# iphone

Normalization (3)

Normalization may also involve replacing numbers.

The reason for replacing numbers is that numerical expressions are diverse and appear frequently, yet they are often not useful for natural language processing tasks.

For example, consider the task of classifying news articles into categories such as "sports" and "politics". Various numerical expressions will appear in the articles, but they are of little use in determining the category. For that reason, we replace number strings with a single symbol, which reduces the vocabulary size.

Here, strings are converted using a special notation called regular expressions. Regular expressions are explained in more detail in the final section below.

For regular expression operations, use re from the Python standard library.

To replace parts of a string:

re.sub(pattern, replacement, string[, count])
#The first argument is a regular expression specifying what to replace.
#The second argument is the string to replace matches with.
#The third argument is the whole string to operate on.
#If the fourth argument (count) is omitted, all matches are replaced.

Let's look at the example below. Suppose you have the sentence 昨日6時に起きて11時に寝た。("I woke up at 6 o'clock yesterday and went to bed at 11 o'clock."). normalize_number is a function that removes the numbers from the sentence.

The first argument of re.sub, r"\d+", is a regular expression that matches a run of digits; using it, you can match any number string. The second argument is "<NUM>", so each matched number string is rewritten to <NUM>. The sentence is given as the third argument, and since the fourth argument is omitted, every number string in the sentence is replaced.

import re

def normalize_number(text):
    # r"\d+" matches one or more consecutive digits
    replaced_text = re.sub(r"\d+", "<NUM>", text)
    return replaced_text

replaced_text = normalize_number("昨日6時に起きて11時に寝た。")
print(replaced_text)

#Output result
#昨日<NUM>時に起きて<NUM>時に寝た。
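Putting the pieces together, here is a minimal sketch (my own combination of the steps covered so far; normalize_text is a hypothetical helper, not from the original text) of a normalization pipeline that applies neologdn, lowercasing, and number replacement in sequence:

import re
import neologdn

def normalize_text(text):
    text = neologdn.normalize(text)       # unify character types
    text = text.lower()                   # unify uppercase to lowercase
    text = re.sub(r"\d+", "<NUM>", text)  # replace number strings
    return text

print(normalize_text("ｉＰｈｏｎｅを１１時に買った。"))
#Output result
# iphoneを<NUM>時に買った。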

Regular expressions

A regular expression is a notation for expressing a set of strings with a single pattern. It is widely used in string search functionality.

12A3B -> \d\dA\dB

In the example above, every half-width digit is represented by the regular expression \d.

In natural language processing, as in Normalization (3), sets of strings that are considered unnecessary for analysis are replaced to reduce the variety of the data. Regular expressions are used to specify such sets of strings.

The table below shows the regular expressions that are often used in natural language processing.

[Image: table of regular expressions commonly used in natural language processing]
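Since the table itself is an image, here is a small substitute sketch (patterns chosen by me for illustration; the exact character ranges vary by use case) showing a few regular expressions often used in Japanese text processing:

import re

text = "Python3で自然言語処理ヲ学ぶ2020"

print(re.findall(r"\d+", text))        # runs of digits: ['3', '2020']
print(re.findall(r"[a-zA-Z]+", text))  # half-width alphabet: ['Python']
print(re.findall(r"[ぁ-ん]+", text))    # hiragana runs: ['で', 'ぶ']
print(re.findall(r"[ァ-ヴ]+", text))    # katakana runs: ['ヲ']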
