[PYTHON] 100 Language Processing Knock-36 (using pandas): Frequency of word occurrence

This is a record of the 36th knock, "Frequency of word occurrence", from "Chapter 4: Morphological analysis" of Language Processing 100 Knocks 2015 (.ac.jp/nlp100/#ch4). This one is very easy, because pandas is good at counting occurrences and sorting.

Reference link

| Link | Remarks |
|:--|:--|
| 036.Frequency of word occurrence.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks:36 | Source from which many parts of the code were copied |
| MeCab Official | The first MeCab page to look at |

Environment

| Type | Version | Contents |
|:--|:--|:--|
| OS | Ubuntu18.04.01 LTS | Running in a virtual machine |
| pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed using venv |
| Mecab | 0.996-5 | Installed with apt-get |

In addition to the above environment, I use the following Python package. It installs with regular pip.

| Type | Version |
|:--|:--|
| pandas | 1.0.1 |

Chapter 4: Morphological analysis

Content of study

Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" to obtain the statistics of the words in the novel.

Morphological analysis, MeCab, part of speech, frequency of occurrence, Zipf's law, matplotlib, Gnuplot

Knock content

Use MeCab to morphologically analyze the text (neko.txt) of Natsume Soseki's novel "I Am a Cat", and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.

For problems 37, 38, and 39, use matplotlib or Gnuplot.

36. Frequency of word occurrence

Find the words that appear in the text and their frequency of occurrence, and arrange them in descending order of frequency.
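
For comparison, the task itself can be sketched without pandas. This is only a minimal illustration using the standard library's collections.Counter; the token list is a hypothetical stand-in for the surface forms that would be read from neko.txt.mecab.

```python
from collections import Counter

# Hypothetical token list standing in for the surface forms in neko.txt.mecab
tokens = ['猫', 'の', '猫', 'は', 'の', '猫']

# Count each word, then list (word, count) pairs from most to least frequent
ranking = Counter(tokens).most_common()
print(ranking)
```

The pandas answer below gets the same count-and-sort behavior from a single value_counts call.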

Answer

Answer program: [036.Frequency of word occurrence.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/04.%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90/036.%E5%8D%98%E8%AA%9E%E3%81%AE%E5%87%BA%E7%8F%BE%E9%A0%BB%E5%BA%A6.ipynb)

import pandas as pd

def read_text():
    # Columns kept from the MeCab output:
    # 0: surface form (surface)
    # 1: part of speech (pos)
    # 2: part-of-speech subclassification 1 (pos1)
    # 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    # MeCab writes part-of-speech tags in Japanese: 空白 = blank, 記号 = symbol
    return df[(df['pos'] != '空白') & (df['surface'] != 'EOS') & (df['pos'] != '記号')]

df = read_text()

# Top 30 surface forms over all words
df['surface'].value_counts()[:30]

# Exclude particles (助詞) and auxiliary verbs (助動詞)
df[~df['pos'].str.startswith('助')]['surface'].value_counts()[:30]
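
To see how the regular-expression separator in read_text splits one MeCab line into columns, here is a minimal sketch on an in-memory sample. The three sample lines are hand-written for illustration (they are not copied from neko.txt.mecab), and df_sample is a hypothetical name; the read_table arguments mirror the ones above, except that skiprows and skipfooter are dropped because the sample has no lines to skip.

```python
import io
import pandas as pd

# Hand-written sample in MeCab's default layout (an assumption for illustration):
# surface<TAB>pos,pos1,pos2,pos3,conjugation type,conjugation form,base,reading,pronunciation
sample = (
    "吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ\n"
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n"
    "猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ\n"
)

# sep='\t|,' splits on either a tab or a comma, so every line yields 10 fields;
# usecols keeps fields 0, 1, 2 and 7, and names labels them in that order
df_sample = pd.read_table(
    io.StringIO(sample), sep='\t|,', header=None,
    usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
    engine='python',
)
print(df_sample)
```

A regular-expression separator is only supported by the Python parsing engine, which is why engine='python' is passed explicitly.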

Answer commentary

Occurrence frequency count and sort

As in [Knock 19, "Calculate the frequency of appearance of the character string in the first column of each line and arrange it in descending order of frequency of appearance"](https://qiita.com/FukuharaYohei/items/87f0413b87c6109e8ca4#019%E5%90%84%E8%A1%8C%E3%81%AE1%E3%82%B3%E3%83%A9%E3%83%A0%E7%9B%AE%E3%81%AE%E6%96%87%E5%AD%97%E5%88%97%E3%81%AE%E5%87%BA%E7%8F%BE%E9%A0%BB%E5%BA%A6%E3%82%92%E6%B1%82%E3%82%81%E5%87%BA%E7%8F%BE%E9%A0%BB%E5%BA%A6%E3%81%AE%E9%AB%98%E3%81%84%E9%A0%86%E3%81%AB%E4%B8%A6%E3%81%B9%E3%82%8Bipynb), I use value_counts to count the occurrences of each unique value and sort the result. Conveniently, it sorts in descending order by default.

df['surface'].value_counts()[:30]
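
The behavior of value_counts can be checked on a small Series. This is a minimal sketch with made-up words; the names words, counts, and top2 are hypothetical, and head(2) plays the role that the [:30] slice plays above.

```python
import pandas as pd

# Made-up surface forms standing in for df['surface']
words = pd.Series(['猫', 'の', '猫', 'は', 'の', '猫'])

# value_counts collapses duplicates into one row per unique value and
# sorts the counts in descending order by default
counts = words.value_counts()
print(counts)

# Keep only the most frequent entries
top2 = counts.head(2)
print(top2)
```

The result is a Series whose index holds the words and whose values hold the counts.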

Excluding particles and auxiliary verbs only requires adding one condition. The only parts of speech whose tag starts with 助 are particles (助詞) and auxiliary verbs (助動詞), so I negate the result of str.startswith.

# Exclude particles (助詞) and auxiliary verbs (助動詞)
df[~df['pos'].str.startswith('助')]['surface'].value_counts()[:30]
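
The negated condition can be tried on a toy DataFrame. This is a minimal sketch; the rows and part-of-speech tags are hand-assigned for illustration rather than taken from real MeCab output, and toy, mask, and content_words are hypothetical names.

```python
import pandas as pd

# Hand-assigned tags for illustration: 名詞 = noun, 助詞 = particle, 助動詞 = auxiliary verb
toy = pd.DataFrame({
    'surface': ['吾輩', 'は', '猫', 'で', 'ある'],
    'pos': ['名詞', '助詞', '名詞', '助動詞', '助動詞'],
})

# True for every row whose part-of-speech tag starts with 助
mask = toy['pos'].str.startswith('助')

# ~ inverts the boolean mask, keeping only the rows that do not match
content_words = toy[~mask]
print(content_words)
```

Because both tags share the prefix 助, one startswith call covers both categories and no second condition is needed.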

Output result (execution result)

Running the program produces the following results. First, the top 30 over all words. Naturally, a list made up almost entirely of particles and auxiliary verbs says nothing about the content of the novel.

Output result (all targets)


9109
6697
Is 6384
To 6147
6068
And 5476
Is 5259
3916
At 3774
Also 2433
2272
2264
Not 2254
From 2001
There is 1705
1579
Or 1446
1416
1249
Thing 1177
To 1033
986
974
Things 971
You 955
Say 937
Master 928
U 922
Yo 687
673
Name: surface, dtype: int64

Next, the output with particles and auxiliary verbs excluded. It is much easier to guess that the text is "I Am a Cat" from this list than from the list over all words.

Output result (particles and auxiliary verbs excluded)


2201
1597
1249
Thing 1177
986
Things 971
You 955
Say 937
Master 928
There is 723
Not 708
Yo 687
Hmm 667
This 635
Go 598
That 560
What 518
I 477
Person 449
Yes 448
443
Become 410
403
This 397
It 370
Coming 367
See 349
Labyrinth 343
Re 327
Time 316
Name: surface, dtype: int64
