Python: Japanese text: Morphological analysis

Language processing and text corpus

What is natural language processing?

The words we usually speak and the sentences we write are called "natural language."

The technology that lets computers process natural language is called natural language processing (Natural Language Processing, NLP).

Natural language processing consists of elemental technologies such as morphological analysis, parsing, and semantic analysis. Combinations of these elemental technologies power applications such as machine translation, speech recognition, and information retrieval.

Natural language originally arose so that humans could communicate with each other, and humans can interpret and exchange expressions that contain ambiguity. Computers, however, are built to process data accurately and at high speed, so they are poor at dealing with natural language that contains ambiguous elements.

As an example of natural language processing, consider categorizing Japanese news documents. Assuming each document contains about 100 words, a set of about 10 documents can be classified manually. With about 1,000 documents, however, you will want to let a computer do the work.

So how do you handle natural language on your computer?

All you have to do is convert natural language into a form that is easy for computers to process, that is, numerical values.

Natural language is a sequence of words. If the words can be converted into numerical data "in some way," the text can be analyzed with machine learning and deep learning algorithms.

In this post, we will convert natural language into numerical data and learn how to extract topics with machine learning algorithms.
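As a minimal sketch of what "converting words to numerical data" can mean, the snippet below builds bag-of-words count vectors using only the standard library. The toy documents are made up for illustration.

```python
from collections import Counter

# Two tiny "documents", already split into words
docs = [["I", "like", "apples"],
        ["I", "like", "apples", "and", "oranges"]]

# Build a sorted vocabulary over all documents
vocab = sorted({w for doc in docs for w in doc})

# For each document, count how often each vocabulary word occurs
vectors = [[Counter(doc)[w] for w in vocab] for doc in docs]

print(vocab)
print(vectors)
```

Each document is now a fixed-length numeric vector, which is the form machine learning algorithms expect.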

Chat Dialogue Corpus

A corpus is a collection of large amounts of natural language documents.

Because natural language is a means of human communication, corpora exist in many languages, not only Japanese but also English, German, and others.

Since it is difficult to introduce them all, here we will introduce some familiar Japanese corpora. Many Japanese corpora are available, both paid and free.

You can use them to classify documents and extract topics.

Aozora Bunko
Balanced Corpus of Contemporary Written Japanese (BCCWJ)
Chat Dialogue Corpus
Nagoya University Conversation Corpus (a transcription corpus of natural Japanese conversation)
Corpus of Spontaneous Japanese (CSJ)
livedoor news corpus

These corpora are provided in various file formats, such as CSV, JSON, and XML. In particular, because JSON and XML have hierarchical structures, we recommend extracting the data you need, converting it to a CSV file, and then using it.
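For example, a nested JSON structure can be flattened into CSV rows using only the standard library. The JSON string below merely mimics the shape of the corpus files: the keys follow the corpus, but the data is made up.

```python
import csv
import io
import json

# A tiny JSON string whose structure mimics the corpus files (made-up data)
raw = ('{"dialogue-id": "0001", "turns": ['
       '{"speaker": "U", "utterance": "Hello"}, '
       '{"speaker": "S", "utterance": "Hi there"}]}')
data = json.loads(raw)

# Flatten the hierarchy: one CSV row per turn, repeating the dialogue ID
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["dialogue-id", "speaker", "utterance"])
for turn in data["turns"]:
    writer.writerow([data["dialogue-id"], turn["speaker"], turn["utterance"]])

print(out.getvalue())
```

Writing to a real file instead of `io.StringIO` gives you a CSV you can load back with pandas.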

Now, let's extract data using Python's json library. To extract an item, specify the variable that holds the loaded data and the key of the item you want to get.

import json

# Open the file in read-only mode
f = open("./6110_nlp_preprocessing_data/init100/1407219916.log.json",
         "r", encoding='utf-8')
json_data = json.load(f)

# Get the dialogue ID (key 'dialogue-id') from `json_data` into the variable `dialogue`
dialogue = json_data["dialogue-id"]
print(dialogue)
# >>> Output result
1407219916

Click here for usage examples

import json

# Open the file in read-only mode
f = open("./6110_nlp_preprocessing_data/init100/1407219916.log.json",
         "r", encoding='utf-8')
json_data = json.load(f)

# Get the dialogue ID
print("dialogue-id : " + json_data["dialogue-id"])

# Get the speaker ID
print("speaker-id  : " + json_data["speaker-id"])

# Get each speaker and utterance
for turn in json_data["turns"]:
    # The speaker key is 'speaker'; the utterance-text key is 'utterance'
    print(turn["speaker"] + ":" + turn["utterance"])


The corpus used here is as follows.

Corpus

Chat Dialogue Corpus: data intended for joint analysis of dialogue-system errors, in which every human-system conversation (chat) and every system response is labeled.

Directory structure

The downloaded data is divided into the init100 and rest1046 directories. init100 contains 100 sets of chat data, and rest1046 contains 1,046 sets. Here we use the data in the init100 directory.

File structure

The data files are provided in JSON format and are roughly divided into human utterance (question) data and system utterance (answer) data. One file corresponds to one dialogue.

Data structure

The utterance data consists of Japanese sentences and is stored under the 'turns' key in each file. 'utterance' holds the utterance text, and 'speaker' indicates who spoke: "U" for the human and "S" for the system.

In addition, each system utterance carries a flag (label) 'breakdown' indicating whether the system's answer to the human's question broke down, along with a free-text 'comment'.

The flag has three values: O for an utterance that is not a breakdown, T for an utterance that cannot clearly be called a breakdown but feels strange, and X for an utterance that clearly feels strange.

Because 'breakdown' is assigned by multiple annotators ('annotator-id'), one system answer has multiple 'breakdown' labels.
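One simple way to reduce multiple 'breakdown' labels to a single label per utterance is a majority vote over the annotators. The annotations below are made-up examples in the corpus's shape.

```python
from collections import Counter

# Made-up annotations for one system utterance, in the corpus's shape
annotations = [
    {"annotator-id": "01_A", "breakdown": "O"},
    {"annotator-id": "01_B", "breakdown": "T"},
    {"annotator-id": "15_A", "breakdown": "O"},
]

# Count the labels and take the most common one as the utterance's label
counts = Counter(a["breakdown"] for a in annotations)
majority_label, _ = counts.most_common(1)[0]
print(majority_label)
```

Other reductions are possible (for example, treating any X as a breakdown); majority vote is just one common choice.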

A part of the corpus is shown below.

{
  'dialogue-id': '1407219916',
  'group-id': 'init100',
  'speaker-id': '12_08',
  'turns': [
    {
      'annotations': [
        {
          'annotator-id': '01_A',
          'breakdown': 'O',
          'comment': '',
          'ungrammatical-sentence': 'O'
        },
        {
          'annotator-id': '01_B',
          'breakdown': 'O',
          'comment': '',
          'ungrammatical-sentence': 'O'
        },
        ...
        {
          'annotator-id': '15_A',
          'breakdown': 'T',
          'comment': 'The numbers are completely different',
          'ungrammatical-sentence': 'O'
        }
      ],
      'speaker': 'S',
      'time': '2014-08-05 15:23:07',
      'turn-index': 2,
      'utterance': 'Is the maximum temperature expected to be 17 degrees Celsius?'
    },
    {
      'annotations': [],
      'speaker': 'U',
      'time': '2014-08-05 15:23:15',
      'turn-index': 3,
      'utterance': "No, it's extremely hot"
    },
    ...
}

Extraction of analytical data

We will quantitatively analyze the utterances together with their breakdown labels. The sample data to be analyzed is the 10 files in the init100 directory; from them we use each person's utterance and the flag indicating whether the system's response to it broke down.

Once you have the data you need for your analysis, first remove unwanted duplicate data from it.

Deleting duplicate data: to remove rows that duplicate other rows, use pandas' drop_duplicates() method.
from pandas import DataFrame

# DataFrame in which the rows at index 0 and index 2 are duplicates
df=DataFrame([['AA','Camela',150000,20000],
              ['BB','Camera',70000,10000],
              ['AA','Camela',150000,20000],
              ['AA','Video',3000,150]],
              columns=['CUSTOMER','PRODUCT','PRICE','DISCOUNT'])
df
# >>>Output result
    CUSTOMER      PRODUCT        PRICE    DISCOUNT
0       AA          Camela       150000    20000
1       BB          Camera        70000    10000
2       AA          Camela       150000    20000 
3       AA          Video         3000      150
# Remove duplicate rows (the first occurrence is kept)
drop = df.drop_duplicates()
drop
# >>>Output result
    CUSTOMER     PRODUCT      PRICE        DISCOUNT
0       AA          Camela      150000      20000
1       BB          Camera       70000      10000
3       AA          Video        3000        150
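drop_duplicates() also takes a subset= parameter (which columns to compare) and a keep= parameter ('first', 'last', or False), which is handy when only some columns define a duplicate. A small sketch with the same example data:

```python
from pandas import DataFrame

df = DataFrame([['AA', 'Camela', 150000, 20000],
                ['BB', 'Camera',  70000, 10000],
                ['AA', 'Camela', 150000, 20000],
                ['AA', 'Video',    3000,   150]],
               columns=['CUSTOMER', 'PRODUCT', 'PRICE', 'DISCOUNT'])

# Deduplicate on CUSTOMER only, keeping the last occurrence of each customer
dedup = df.drop_duplicates(subset=['CUSTOMER'], keep='last')
print(dedup)
```

Here all three 'AA' rows count as duplicates of each other because only the CUSTOMER column is compared, so only the last 'AA' row survives.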

Click here for usage examples

import os
import json
import pandas as pd

#Specify init100 directory
file_path = './6110_nlp_preprocessing_data/init100/'
file_dir = os.listdir(file_path)

# Create an empty list to store flag/utterance pairs
label_text = []

# Process the first 10 JSON files one by one
for file in file_dir[:10]:
    # Open in read-only mode
    with open(file_path + file, 'r', encoding='utf-8') as r:
        json_data = json.load(r)

    # Extract utterance text and flags from the utterance array `turns`
    for turn in json_data['turns']:
        turn_index = turn['turn-index']  # utterance turn number
        speaker = turn['speaker']        # speaker ID
        utterance = turn['utterance']    # utterance text
        # Skip turn 0, which is the system's opening utterance
        if turn_index != 0:
            # Keep the human's utterance text
            if speaker == 'U':
                u_text = utterance
            else:
                # For each annotator, store the breakdown flag
                # together with the preceding human utterance
                for annotate in turn['annotations']:
                    a = annotate['breakdown']
                    label_text.append([a, u_text])

# Convert the list `label_text` to a DataFrame
df_label_text = pd.DataFrame(label_text)

# Remove duplicate rows
df_label_text = df_label_text.drop_duplicates()
df_label_text.head(25)


Morphological analysis of text

What is morphological analysis?

One of the basic techniques of natural language processing is morphological analysis (Morphological Analysis).

Morphological analysis is the process of dividing a sentence into words based on grammatical rules and dictionary data, and assigning a part of speech to each word.

A "morpheme" is the smallest unit, or word, that has meaning in the language. Here, we will look at Japanese morphological analysis.

[Text] It will be fine today.
  ↓
[Morphemes] Today | is | sunny | masu | 。
     (noun)(particle)(verb)(auxiliary verb)(symbol)

The morphemes are separated by "|" for readability. For a short sentence like this example, you can split it into words by hand, but the documents you actually handle contain long sentences, so it is more realistic to process them on a computer.

Morphological analysis on a computer is performed by a tool called a morphological analysis engine. Such engines are offered in various forms, both paid and free: some are installed and run locally, some are called as a Web API, and some are called as a programming-language library. The main difference between them lies in the grammar and dictionaries used for the analysis.

ChaSen: Developed and provided by Matsumoto Laboratory, Nara Institute of Science and Technology.
JUMAN: Developed and provided by the Kurohashi-Kawahara Laboratory, Kyoto University.
MeCab: Developed and provided as open source by Taku Kudo.
Janome: Developed by Tomoko Uchida and provided as a Python library.
Rosette Base Linguistics: Developed and provided by Basis Technology (paid).

Morphological analysis and word-separation using MeCab

Let's try morphological analysis and word-separation of Japanese text using the morphological analysis engine MeCab.

Morphological analysis
(1) Create a Tagger() object; its argument specifies the output mode and the dictionary used to split morphemes.
(2) If nothing is specified, MeCab's standard system dictionary is used.
(3) parse('string') splits the given string into morphemes and returns the analysis result with parts of speech attached.
import MeCab

k = MeCab.Tagger()
print(k.parse('Words you want to morphologically analyze'))
# >>> Output result
Morpheme   noun, general, *, *, *, *, morpheme, Keitaiso, Keitaiso
Analysis   noun, suru-verb connective, *, *, *, *, analysis, Kaiseki, Kaiseki
Shi        verb, independent, *, *, sahen-suru, continuative form, suru, Shi, Shi
Tai        auxiliary verb, *, *, *, special tai, base form, tai, Tai, Tai
Words      noun, general, *, *, *, *, word, Kotoba, Kotoba
EOS

The fields of the output, in order from the left, are as follows. The names in parentheses are the attribute names used to access each field.

Surface form (surface): the word as it appears in the text
Part of speech (part_of_speech)
Part-of-speech subclassifications 1-3 (part_of_speech)
Inflection type (infl_type)
Inflected form (infl_form)
Base form (base_form): the dictionary form of the word used in the sentence
Reading (reading)
Pronunciation (phonetic)
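Since these fields are plain text, one line of MeCab output can be split apart without MeCab itself: the surface form comes first, then a tab, then the comma-separated features. The sample line below is an illustrative line in the IPAdic output format.

```python
# One line of MeCab (ipadic) output has the form:
#   surface<TAB>pos,pos-sub1,pos-sub2,pos-sub3,infl_type,infl_form,base_form,reading,phonetic
# The sample line below is illustrative of that format.
line = "晴れ\t動詞,自立,*,*,一段,連用形,晴れる,ハレ,ハレ"

# Split the surface form from the feature string, then the features themselves
surface, features = line.split("\t")
(pos, sub1, sub2, sub3,
 infl_type, infl_form, base_form, reading, phonetic) = features.split(",")

print(surface, pos, base_form)
```

Looping this parse over `k.parse(text).splitlines()` (skipping the final "EOS" line) gives structured access to every morpheme.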

Word-separation: if you pass '-Owakati' as the output mode of the Tagger() object, the text is only split into morphemes separated by spaces, without part-of-speech information.

import MeCab

k = MeCab.Tagger('-Owakati')
print(k.parse('Words you want to divide'))
>>> Output result
Words you want to divide

Other output modes:
-Oyomi : output readings only
-Ochasen : ChaSen-compatible format
-Odump : output all information

Click here for usage examples

import MeCab

# Morphological analysis
# 'すもももももももものうち' is a classic demo sentence
# ("both plums and peaches are kinds of peach")
m = MeCab.Tagger()
print(m.parse('すもももももももものうち'))

# Word-separation
w = MeCab.Tagger('-Owakati')
print(w.parse('すもももももももものうち'))


Morphological analysis and word-separation using Janome

Next, use the morphological analysis engine Janome to perform morphological analysis and word-separation of Japanese text.

Morphological analysis
(1) Create a Tokenizer() object and pass the string you want to analyze to its tokenize() method.
(2) The output of the analysis is read the same way as MeCab's.
from janome.tokenizer import Tokenizer

#Creating a Tokenizer object
t = Tokenizer()
tokens = t.tokenize("Words you want to morphologically analyze")
for token in tokens:
    print(token)
>>> Output result
Morpheme   noun, general, *, *, *, *, morpheme, Keitaiso, Keitaiso
Analysis   noun, suru-verb connective, *, *, *, *, analysis, Kaiseki, Kaiseki
Shi        verb, independent, *, *, sahen-suru, continuative form, suru, Shi, Shi
Tai        auxiliary verb, *, *, *, special tai, base form, tai, Tai, Tai
Words      noun, general, *, *, *, *, word, Kotoba, Kotoba

Word-separation: if you pass wakati=True to the tokenize() method, only word-separation is performed and each token is returned as a plain string.

from janome.tokenizer import Tokenizer

#Creating a Tokenizer object
t = Tokenizer()
tokens = t.tokenize("Words you want to divide", wakati=True)
for token in tokens:
    print(token)
>>>Output result
Word-separation
Shi
Want
word

Other functions

(1) You can filter by part of speech. Note that with the default IPAdic-based dictionary, part-of-speech names are in Japanese, e.g. '名詞' (noun) and '助詞' (particle).

If you want to exclude parts of speech, list the ones to exclude:
POSStopFilter(['接続詞', '記号', '助詞', '助動詞'])  # conjunction, symbol, particle, auxiliary verb

If you want to keep only certain parts of speech, list the ones to keep:
POSKeepFilter(['名詞'])  # noun

(2) Analyzer is a framework for building templates that combine pre-processing and post-processing with morphological analysis. You pass it the pre-processing filters, a Tokenizer object, and the token filters: Analyzer(char_filters, tokenizer, token_filters). The pre-processing part is set up as follows.

char_filters = [UnicodeNormalizeCharFilter(),
                RegexReplaceCharFilter('regular expression', 'replacement string')]

UnicodeNormalizeCharFilter()
Normalizes notational variation in Unicode strings.
The argument is "NFKC", "NFC", "NFKD", or "NFD"; the default is NFKC.
For example, full-width "ＡＢＣ" is normalized to half-width "ABC", and half-width "ｶﾅ" to full-width "カナ".

RegexReplaceCharFilter('regular expression', 'replacement string')
Replaces substrings that match the regular-expression pattern.
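The effect of the NFKC normalization that UnicodeNormalizeCharFilter performs can be previewed with the standard library's unicodedata module:

```python
import unicodedata

# Full-width Latin letters become half-width (ASCII)
print(unicodedata.normalize("NFKC", "ＡＢＣ"))  # → "ABC"

# Half-width katakana becomes full-width
print(unicodedata.normalize("NFKC", "ｶﾅ"))  # → "カナ"
```

This is often useful on its own as a first normalization pass before any morphological analysis.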
from janome.tokenizer import Tokenizer
from janome.tokenfilter import POSKeepFilter
from janome.analyzer import Analyzer

#Creating a Tokenizer object
t = Tokenizer()

# Create a filter that keeps only nouns ('名詞')
token_filters = [POSKeepFilter(['名詞'])]

# Create the analysis framework with the filters attached
analyzer = Analyzer([], t, token_filters)

#Run
for token in analyzer.analyze("Words you want to filter"):
    print(token)
>>>Output result
Filter noun,General,*,*,*,*,filter,filter,filter
Words nouns,General,*,*,*,*,word,Kotoba,Kotoba

Click here for usage examples

from janome.tokenizer import Tokenizer
from janome.tokenfilter import POSKeepFilter
from janome.analyzer import Analyzer

# Create a Tokenizer object
t = Tokenizer()

# Create a filter that keeps only nouns ('名詞')
token_filters = [POSKeepFilter(['名詞'])]

# Create the analysis framework with the filters attached
analyzer = Analyzer([], t, token_filters)

for token in analyzer.analyze('すもももももももものうち'):
    print(token)

Text normalization

Dictionary used for morphological analysis

The result of morphological analysis depends on the dictionary. By default, a standard system dictionary is used to divide sentences into words and assign parts of speech.

While the standard dictionary covers common words, it often lacks technical jargon and newly coined words.

In such cases, words may be split unnaturally or their part of speech may be analyzed as unknown.

[Text] I go to Tokyo Tower.
[Analysis result] I | go | to | Tokyo | tower | . ("Tokyo Tower" is split into two words)

To prevent this, you can prepare a user dictionary separately from the standard dictionary. How to create a user dictionary differs by morphological analysis engine; for now, just remember that user dictionaries exist.

Some user dictionaries are distributed free of charge, so you may find and install one that suits your purpose. (Use them at your own discretion and responsibility.)

Text normalization

Before performing morphological analysis, perform work to normalize notation fluctuations, such as deleting unnecessary symbols and unifying notations.

[Text] Yesterday I ate an apple (りんご), and today I will drink apple (リンゴ) juice.
[After normalization] Yesterday I ate an apple (りんご), and today I will drink apple (りんご) juice.

In the original Japanese example, two punctuation marks with the same meaning, "、" and ",", are unified into "、", and the two spellings of "apple", the hiragana "りんご" and the katakana "リンゴ", are likewise unified.
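A minimal sketch of such unification with plain string replacement; the example sentence is an assumed reconstruction of the Japanese text above:

```python
# Unify the comma variant "," into "、",
# and the katakana spelling "リンゴ" into the hiragana "りんご"
text = "昨日,りんごを食べた、今日はリンゴジュースを飲む。"
normalized = text.replace(",", "、").replace("リンゴ", "りんご")
print(normalized)
```

For more than a handful of fixed replacements, regular expressions (introduced next) scale better.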

When normalizing on a computer, regular expressions are used to specify the target strings. A regular expression expresses a set of strings as a single pattern. For example, to search for strings in a text, the strings being searched for can be represented by character classes such as the following:

[0-9] :Matches any one of the numbers 0-9
[0-9a-z]+ :Matches one or more of the numbers 0-9 and lowercase letters a-z
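These character classes can be tried directly with the re module:

```python
import re

# [0-9] matches each single digit separately
print(re.findall("[0-9]", "room 42"))      # → ['4', '2']

# [0-9a-z]+ matches runs of digits and lowercase letters
print(re.findall("[0-9a-z]+", "room 42"))  # → ['room', '42']
```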

When you want to remove or replace a certain string contained in a text, use re.sub(); the string to remove is specified as a regular-expression pattern in the first argument.

import re

re.sub("pattern to remove", "replacement string", "text to process")

Click here for usage examples

import re

# Remove the alphanumeric characters ("A" and "10") from the text
re.sub("[0-9a-zA-Z]+", "", "私はA商品を10個買います。")
