[PYTHON] 100 Language Processing Knock-39 (using pandas): Zipf's Law

This is a record of the 39th exercise, "Zipf's Law", from "Chapter 4: Morphological analysis" of the [Language Processing 100 Knocks 2015](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch4). "Zipf's law" is described on [Wikipedia](https://ja.wikipedia.org/wiki/%E3%82%B8%E3%83%83%E3%83%97%E3%81%AE%E6%B3%95%E5%89%87) as follows; to put it plainly, it is the law that **the more frequently a word appears, the greater its share of the whole**.

Zipf's law is an empirical rule stating that the proportion of the $k$-th most frequent element in the whole is proportional to $\frac{1}{k}$.
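For example (my own illustration, with an assumed top count of 10,000), the ideal relation $f(k) = \frac{f(1)}{k}$ predicts the following counts for the top ranks:

```python
# A minimal sketch of the ideal Zipf relation f(k) = f(1) / k.
# The top frequency of 10,000 is an assumed example value.
top_frequency = 10000
for rank in range(1, 6):
    print(rank, round(top_frequency / rank))
# 1 10000 / 2 5000 / 3 3333 / 4 2500 / 5 2000
```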

Reference links

| Link | Remarks |
|:--|:--|
| [039.Zipf's law.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/04.%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90/039.Zipf%E3%81%AE%E6%B3%95%E5%89%87.ipynb) | GitHub link to the answer program |
| 100 amateur language processing knocks: 39 | Source of many copied-and-pasted parts |
| [MeCab Official](https://taku910.github.io/mecab/) | The first MeCab page to look at |

environment

| Type | Version | Contents |
|:--|:--|:--|
| OS | Ubuntu 18.04.01 LTS | Runs virtually |
| pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
| MeCab | 0.996-5 | Installed with apt-get |

In the environment above, I use the following additional Python packages. Just install them with regular pip.

| Type | Version |
|:--|:--|
| matplotlib | 3.1.3 |
| pandas | 1.0.1 |

Chapter 4: Morphological analysis

content of study

Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" to obtain the statistics of the words in the novel.

Morphological analysis, MeCab, part of speech, frequency of occurrence, Zipf's law, matplotlib, Gnuplot

Knock content

Use MeCab to perform morphological analysis on the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.

For problems 37, 38, and 39, use matplotlib or Gnuplot.

39. Zipf's Law

Plot a log-log graph with the rank of word occurrence frequency on the horizontal axis and the frequency of occurrence on the vertical axis.

Answer

Answer program: [039.Zipf's Law.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/04.%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90/039.Zipf%E3%81%AE%E6%B3%95%E5%89%87.ipynb)

```python
import matplotlib.pyplot as plt
import pandas as pd


def read_text():
    # Column 0: surface form (surface)
    # Column 1: part of speech (pos)
    # Column 2: part-of-speech subcategory 1 (pos1)
    # Column 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    # Exclude blanks (pos '空白'), EOS markers, and symbols (pos '記号');
    # the POS tags in the MeCab output are Japanese strings
    return df[(df['pos'] != '空白') & (df['surface'] != 'EOS') & (df['pos'] != '記号')]


df = read_text()

# Occurrence count of each surface form, sorted in descending order
frequency = df['surface'].value_counts().values.tolist()

plt.xscale('log')
plt.yscale('log')

plt.xlim(1, len(frequency) + 1)
plt.ylim(1, frequency[0])
plt.xlabel('Rank')
plt.ylabel('Frequency of appearance')

plt.scatter(x=range(1, len(frequency) + 1), y=frequency)
```
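As a quick visual check (my addition, not part of the original program), the ideal $\frac{1}{k}$ curve scaled to the top frequency could be overlaid on the same axes; if Zipf's law holds, the scatter should stay close to this reference line:

```python
# Hypothetical overlay of the ideal Zipf curve f(k) = frequency[0] / k,
# reusing the frequency list computed above
ranks = range(1, len(frequency) + 1)
plt.plot(ranks, [frequency[0] / k for k in ranks], color='red')
```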

Answer commentary

List of appearance frequencies

The value_counts function counts the occurrences of each unique value and returns them in descending order; tolist then converts the result into a plain Python list.

```python
frequency = df['surface'].value_counts().values.tolist()
```
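For example (a made-up toy Series, not the knock data), the counts come back sorted in descending order:

```python
import pandas as pd

# 'cat' appears 3 times, 'dog' 2 times, 'bird' once
s = pd.Series(['cat', 'dog', 'cat', 'bird', 'cat', 'dog'])
print(s.value_counts().values.tolist())  # [3, 2, 1]
```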

Log scale display

Since the problem statement asks for a "log-log graph", both axes are set to a log scale.

Plot a log-log graph with the rank of word occurrence frequency on the horizontal axis and the frequency of occurrence on the vertical axis.

```python
plt.xscale('log')
plt.yscale('log')
```
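This is also why Zipf's law shows up as a straight line on these axes: taking logarithms of $f(k) = \frac{C}{k}$ gives

$$\log f(k) = \log C - \log k$$

so if the law holds, the plotted points should fall roughly on a line with slope $-1$.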

The x-axis maximum is set to the list length + 1 (ranks run from 1 through the list length), and the y-axis maximum to the 0th element of the list, which is the largest count because value_counts sorts in descending order.

```python
plt.xlim(1, len(frequency) + 1)
plt.ylim(1, frequency[0])
```

Output result (execution result)

Running the program produces the following output: a splendid downward slope.

[Figure: log-log scatter plot of frequency of appearance versus rank]
