[PYTHON] 100 language processing knocks-37 (using pandas): Top 10 most frequent words

This is my record of problem 37, "Top 10 most frequent words", from "Chapter 4: Morphological analysis" of Language Processing 100 Knocks 2015. This time we use matplotlib for graph display. It seems everyone runs into matplotlib's "tofu problem" (the phenomenon where Japanese characters are rendered as tofu-like boxes on the graph).

Reference link

Link | Remarks
037. Top 10 most frequent words.ipynb | Answer program (GitHub link)
100 amateur language processing knocks: 37 | Source of many copied-and-pasted parts
MeCab Official | The first MeCab page to look at

Environment

Type | Version | Contents
OS | Ubuntu 18.04.01 LTS | Running as a virtual machine
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments
Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed using venv
MeCab | 0.996-5 | Installed with apt-get

In the above environment, I use the following additional Python packages. Just install them with regular pip.

Type | Version
matplotlib | 3.1.3
pandas | 1.0.1

Chapter 4: Morphological analysis

Content of study

Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" to obtain the statistics of the words in the novel.

Morphological analysis, MeCab, part of speech, frequency of occurrence, Zipf's law, matplotlib, Gnuplot

Knock content

Apply MeCab to the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" to perform morphological analysis, and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.

For problems 37, 38, and 39, use matplotlib or Gnuplot.

37. Top 10 most frequent words

Display the 10 most frequently appearing words and their frequencies in a graph (for example, a bar graph).

Answer

Answer program: [037. Top 10 most frequently used words.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/04.%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90/037.%20%E9%A0%BB%E5%BA%A6%E4%B8%8A%E4%BD%8D10%E8%AA%9E.ipynb)

import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams['font.family'] = 'IPAexGothic'

def read_text():
    # Column 0: surface form (surface)
    # Column 1: part of speech (pos)
    # Column 2: part-of-speech subcategory 1 (pos1)
    # Column 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    return df[(df['pos'] != 'Blank') & (df['surface'] != 'EOS') & (df['pos'] != 'symbol')]

df = read_text()

df['surface'].value_counts()[:10].plot.bar()

# Exclude particles and auxiliary verbs
df[~df['pos'].str.startswith('Assist')]['surface'].value_counts()[:10].plot.bar()
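As an aside, the `sep='\t|,'` regex separator is what makes a single `read_table` call work on MeCab's mixed tab-and-comma output format. A minimal sketch on an inline sample (the words and part-of-speech labels below are made up for illustration):

```python
import io
import pandas as pd

# A made-up two-line sample in MeCab's default output layout:
# surface<TAB>pos,pos1,pos2,pos3,inflection,conjugation,base,reading,pronunciation
sample = (
    "cat\tnoun,general,*,*,*,*,cat,NEKO,NEKO\n"
    "is\tverb,independent,*,*,*,*,be,DA,DA\n"
)

# The regex separator '\t|,' splits on both the tab and the commas,
# so column 0 is the surface form and column 7 is the base form.
df = pd.read_table(io.StringIO(sample), sep='\t|,', header=None,
                   usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                   engine='python')
print(df)
```

Because the separator is a regular expression, `engine='python'` is required; the C engine only accepts single-character separators.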

Answer commentary

Fixing tofu (garbled characters in graphs)

I fixed the tofu (garbled characters in graphs) by referring to the following articles. Note that the fix depends heavily on the OS and Python environment (for example, whether you use pyenv).

- [Resolve garbled Japanese characters in matplotlib](https://qiita.com/katuemon/items/5c4db01997ad9dc343e0#%E3%83%95%E3%82%A9%E3%83%B3%E3%83%88%E3%82%AD%E3%83%A3%E3%83%83%E3%82%B7%E3%83%A5%E3%81%AE%E5%89%8A%E9%99%A4)
- About garbled Japanese characters in matplotlib

1. Font installation

Install the fonts with `apt-get`:

apt-get install fonts-ipaexfont

2. Delete cache

Physically delete the following files, which are matplotlib's font cache. I don't know the difference between the two, but I deleted both on the assumption that it would clear the cache.

- /Users/username/.cache/matplotlib/fontlist-v300.json
- /Users/username/.cache/matplotlib/fontlist-v310.json
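The paths above are one environment's example. A stdlib-only sketch that lists the cache files under the default Linux location `~/.cache/matplotlib` (this path is an assumption; `matplotlib.get_cachedir()` reports the authoritative directory on your platform):

```python
from pathlib import Path

# Assumed default matplotlib cache location on Linux;
# matplotlib.get_cachedir() gives the real path per platform.
cache_dir = Path.home() / ".cache" / "matplotlib"

# List the font-list cache files that would be deleted
# (glob simply yields nothing if the directory does not exist).
targets = sorted(cache_dir.glob("fontlist-*.json"))
for path in targets:
    print(f"would delete: {path}")
    # path.unlink()  # uncomment to actually delete
```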

3. Specify font in Python

Specify the font for graph output with the following setting in Python. This completes the tofu fix.

python


plt.rcParams['font.family'] = 'IPAexGothic'
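If the graph still shows tofu after this, it may help to confirm that matplotlib actually registered the font. A small sketch, assuming matplotlib is installed and that fonts-ipaexfont provides the family name `IPAexGothic`:

```python
from matplotlib import font_manager

# Collect every font family name matplotlib has discovered,
# then check whether the IPAex Gothic family is among them.
families = {f.name for f in font_manager.fontManager.ttflist}
print('IPAexGothic' in families)
```

If this prints `False`, matplotlib cannot see the font and the rcParams setting will silently fall back to a default family.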

Graph output

pandas is very convenient here: the Series can be plotted as-is with plot.

python


df['surface'].value_counts()[:10].plot.bar()

# Exclude particles and auxiliary verbs
df[~df['pos'].str.startswith('Assist')]['surface'].value_counts()[:10].plot.bar()
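To see what gets handed to `plot.bar`, here is a toy sketch of `value_counts` on a made-up word list (the words are invented for illustration):

```python
import pandas as pd

# Made-up surface forms standing in for the novel's words.
words = pd.Series(['cat', 'is', 'cat', 'the', 'cat', 'is'])

# value_counts sorts by descending frequency, so slicing [:10]
# keeps the most frequent words; counts.plot.bar() would chart it.
counts = words.value_counts()[:10]
print(counts)
```

Because `value_counts` already sorts in descending order of frequency, slicing the first ten entries yields exactly the top-10 series the knock asks for.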

Output result (execution result)

When the program is executed, the following results are output. As expected, a graph is easier to understand than the raw numbers alone.

All words


Words excluding particles and auxiliary verbs

