[PYTHON] 100 Language Processing Knock-35 (using pandas): Noun concatenation

This is a record of problem 35, "Noun concatenation", from "Chapter 4: Morphological analysis" of Language Processing 100 Knocks 2015 (.ac.jp/nlp100/#ch4). Like last time, this is something pandas cannot handle easily by itself. However, the "noun concatenation" part is not difficult and takes only about 10 lines.

Reference link

| Link | Remarks |
|---|---|
| 035. Noun Concatenation.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks: 35 | Source from which many parts of the code were copied |
| MeCab Official | The first MeCab page to look at |

Environment

| type | version | Contents |
|---|---|---|
| OS | Ubuntu 18.04.01 LTS | Running on a virtual machine |
| pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.8.1 | I use Python 3.8.1 on pyenv. Packages are managed using venv |
| MeCab | 0.996-5 | Installed with apt-get |

In the above environment, I use the following additional Python package. It can be installed with regular pip.

| type | version |
|---|---|
| pandas | 1.0.1 |

Chapter 4: Morphological analysis

Content of study

Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" to obtain the statistics of the words in the novel.

Morphological analysis, MeCab, part of speech, frequency of occurrence, Zipf's law, matplotlib, Gnuplot

Knock content

Use MeCab to morphologically analyze the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.

For problems 37, 38, and 39, use matplotlib or Gnuplot.

35. Noun concatenation

Extract the concatenation of nouns (nouns that appear consecutively) with the longest match.

Problem supplement (about "")

Answer

Answer program: [035. Noun Concatenation.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/04.%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90/035.%E5%90%8D%E8%A9%9E%E3%81%AE%E9%80%A3%E6%8E%A5.ipynb)

import pandas as pd

def read_text():
    # Columns kept from the MeCab output:
    # 0: surface form (surface)
    # 1: part of speech (pos)
    # 2: part-of-speech subclassification 1 (pos1)
    # 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    # MeCab writes the part of speech in Japanese, so nouns are tagged '名詞'
    return df[df['pos'] == '名詞']

df = read_text()

nouns = []
for index in df['surface'].index:
    nouns.append(df['surface'][index])

    # The run of nouns ends when the next row (index + 1) is not a noun
    if (index + 1) not in df.index:

        # Print only when two or more nouns are concatenated
        if len(nouns) > 1:
            print(len(nouns), '\t', index, '\t', ''.join(nouns))
        nouns = []

    # Stop early because there is a lot of output
    if index > 2000:
        break

Answer commentary

Read file

This time, only nouns are needed, so the entries are narrowed down to only nouns immediately after reading the file.

def read_text():
    # Columns kept from the MeCab output:
    # 0: surface form (surface)
    # 1: part of speech (pos)
    # 2: part-of-speech subclassification 1 (pos1)
    # 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    # MeCab writes the part of speech in Japanese, so nouns are tagged '名詞'
    return df[df['pos'] == '名詞']
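To see what the `sep='\t|,'` separator does to a single line, here is a minimal sketch that uses only the standard library. The sample line is hand-written for illustration and assumes MeCab's default output format with the IPA dictionary (surface form, a tab, then comma-separated features), in which nouns are tagged 名詞. It is not taken from the actual neko.txt.mecab file.

```python
import re

# Hypothetical sample line in MeCab's default output format (IPA dictionary assumed)
line = '吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ'

# Same pattern as the sep argument: split on a tab or a comma
fields = re.split(r'\t|,', line)

# Pick the same column positions as usecols=[0, 1, 2, 7]
surface, pos, pos1, base = (fields[i] for i in (0, 1, 2, 7))
print(len(fields), surface, pos, pos1, base)
```

One caveat of this separator: if the surface form itself is a half-width comma, the split produces extra fields and the columns shift, so this approach is best suited to text that does not contain such tokens.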

Noun concatenation extraction

The rest is done in a loop. Because the DataFrame keeps its original row numbers after being narrowed down to nouns, a run of nouns ends when the index of the next entry is not the current index + 1. The concatenation is output only when the list holds two or more nouns at the end of a run.

nouns = []
for index in df['surface'].index:
    nouns.append(df['surface'][index])

    # The run of nouns ends when the next row (index + 1) is not a noun
    if (index + 1) not in df.index:

        # Print only when two or more nouns are concatenated
        if len(nouns) > 1:
            print(len(nouns), '\t', index, '\t', ''.join(nouns))
        nouns = []
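The index-gap check can be tried without MeCab or pandas. The following is a rough sketch of the same logic on a plain dict. The function name `concatenated_nouns` and the sample data are hypothetical, made up for illustration. The dict keys stand in for the row numbers that remain after the non-noun rows have been dropped.

```python
def concatenated_nouns(noun_rows):
    """noun_rows maps original row index -> surface form (nouns only)."""
    results = []
    nouns = []
    for index in sorted(noun_rows):
        nouns.append(noun_rows[index])
        # The run ends when the next row index is missing, i.e. not a noun
        if (index + 1) not in noun_rows:
            # Keep only runs of two or more nouns
            if len(nouns) > 1:
                results.append((len(nouns), index, ''.join(nouns)))
            nouns = []
    return results

# Hypothetical sample: rows 2-3 and 5-6 are consecutive nouns, row 8 stands alone
sample = {2: '人間', 3: '中', 5: '一番', 6: '獰悪', 8: '種族'}
result = concatenated_nouns(sample)
print(result)  # [(2, 3, '人間中'), (2, 6, '一番獰悪')]
```

As in the loop above, the second value in each tuple is the index of the last noun in the run, and the single noun at row 8 is not reported.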

Output result (execution result)

When the program is executed, the following results will be output.

Output result


2 28
2 66 Human
2 69 The worst
2 172
2 190 Hair
2 209 Then the cat
2 222 once
2 688 House
2 860 Other than students
3 1001
2 1028 The other day
2 1031 Mima
2 1106 Midaidokoro
2 1150 as it is
2 1235 All-day study
2 1255 studyer
2 1266 Studyer
2 1288 Diligent
3 1392 Page 23
2 1515 Other than the master
2 1581 As long as I am
2 1599 Morning master
2 1690 Most mindful
2 1733 One floor
2 1781 Last hard
3 1829 Neurogastric weakness
2 1913 Language break
2 1961
3 1965 Total family
