[PYTHON] 100 Language Processing Knock-83 (using pandas): Measuring word / context frequency

This is a record of solving task 83, "Measuring word / context frequency," from the 2015 edition of the 100 Language Processing Knock. It takes time (about 7 minutes) because it processes a file of about 800 MB. I assumed a memory error would occur if I read it all at once, so I tried to get by using pandas' chunksize option, but I struggled because I couldn't make it work at all. In the end, reading it all at once caused no particular problem.
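For reference, here is a minimal sketch of the chunked-reading approach I gave up on, assuming the same input file as the answer program below; the chunk size is an arbitrary illustration:

import pandas as pd

# Read the pair file in chunks of 1 million rows instead of all at once.
# Each chunk is an ordinary DataFrame, but merging groupby counts
# across chunks is what made this approach cumbersome.
reader = pd.read_table('./082.context.txt', header=None,
                       names=['t', 'c'], chunksize=1_000_000)
for chunk in reader:
    print(len(chunk))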

Reference link

| Link | Remarks |
|:--|:--|
| 083.Word / context frequency measurement.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks: 83 | I am always indebted to this site for the knocks |
| 100 language processing knock 2015 version (83, 84) | I referred to it in Chapter 9 |
| How to use Pandas groupby | An easy-to-understand explanation of how to use pandas groupby |
| to_pickle function | Official help for the to_pickle function |

environment

| type | version | Contents |
|:--|:--|:--|
| OS | Ubuntu 18.04.01 LTS | Running virtually |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | python 3.6.9 on pyenv; there is no deep reason not to use the 3.7 or 3.8 series; packages are managed using venv |

In the above environment, I am using the following additional Python packages. Just install with regular pip.

| type | version |
|:--|:--|
| pandas | 0.25.3 |

Task

Chapter 9: Vector Space Method (I)

enwiki-20150112-400-r10-105752.txt.bz2 is a bzip2-compressed file containing the text of 105,752 articles randomly sampled at 1/10 from the English Wikipedia articles as of January 12, 2015 that consist of more than 400 words. Using this text as a corpus, we want to learn vectors (distributed representations) that express the meanings of words. In the first half of Chapter 9, the process of learning word vectors is implemented as a series of steps, applying principal component analysis to a word-context co-occurrence matrix created from the corpus. In the second half of Chapter 9, the learned word vectors (300 dimensions) are used to calculate word similarity and to solve analogies.

Note that if problem 83 is implemented straightforwardly, a large amount (about 7 GB) of main memory is required. If you run out of memory, devise a workaround or use the 1/100 sampling corpus [enwiki-20150112-400-r100-10576.txt.bz2](http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2).

This time * "1/100 sampling corpus [enwiki-20150112-400-r100-10576.txt.bz2](http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-" 400-r100-10576.txt.bz2) ”* is used.

83. Measuring word / context frequency

Use the output of 82 to find the following occurrence distributions and constants.

- $f(t,c)$: Co-occurrence count of the word $t$ and the context word $c$
- $f(t,*)$: Number of occurrences of the word $t$
- $f(*,c)$: Number of occurrences of the context word $c$
- $N$: Total number of occurrences of word/context word pairs

Problem supplement

"Word $ t $" is [previous article](https://qiita.com/FukuharaYohei/items/64e20ced30cba76383dc#%E6%96%87%E8%84%88%E8%AA%9E%E3%81% It is ** "Target Word" ** written in A8% E3% 81% AF). Suppose the output of 82 is as follows.

| t (target word) | c (context word) |
|:--|:--|
| t1 | c1 |
| t1 | c2 |
| t2 | c1 |
| t1 | c1 |

Then the following are output. This is the so-called SQL GROUP BY.

**$f(t,c)$: Co-occurrence count of word $t$ and context word $c$**

| t (target word) | c (context word) | Co-occurrence count |
|:--|:--|--:|
| t1 | c1 | 2 |
| t1 | c2 | 1 |
| t2 | c1 | 1 |

**$f(t,*)$: Number of occurrences of word $t$**

| t (target word) | Number of occurrences |
|:--|--:|
| t1 | 3 |
| t2 | 1 |

**$f(*,c)$: Number of occurrences of context word $c$**

| c (context word) | Number of occurrences |
|:--|--:|
| c1 | 3 |
| c2 | 1 |

**$N$: Total number of occurrences of word/context word pairs**: 4 (because there are 4 rows in total)
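As a sanity check, here is a minimal sketch that reproduces this toy example with pandas; the column names t and c match the answer program below:

import pandas as pd

# The four word/context pairs from the example above
df = pd.DataFrame({'t': ['t1', 't1', 't2', 't1'],
                   'c': ['c1', 'c2', 'c1', 'c1']})

print(df.groupby(['t', 'c'])['c'].agg('count'))  # f(t,c): (t1,c1)=2, (t1,c2)=1, (t2,c1)=1
print(df.groupby('t')['c'].agg('count'))         # f(t,*): t1=3, t2=1
print(df.groupby('c')['c'].agg('count'))         # f(*,c): c1=3, c2=1
print(len(df))                                   # N = 4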

Answer

Answer program [083. Measuring Word / Context Frequency.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/09.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(I)/083.%20%E5%8D%98%E8%AA%9E%EF%BC%8F%E6%96%87%E8%84%88%E3%81%AE%E9%A0%BB%E5%BA%A6%E3%81%AE%E8%A8%88%E6%B8%AC.ipynb)

import pandas as pd

# Load the word/context pair file produced in problem 82
df = pd.read_table('./082.context.txt', header=None, names=['t', 'c'])
print(df.info())

def to_pickle_file(grouped, path):
    print('length:', grouped.size)
    grouped.to_pickle(path)

# f(t,c): co-occurrence counts of word t and context word c
to_pickle_file(df.groupby(['t', 'c'])['c'].agg('count'), './083_group_tc.zip')

# f(t,*): occurrence counts of word t
to_pickle_file(df.groupby('t')['c'].agg('count'), './083_group_t.zip')

# f(*,c): occurrence counts of context word c
to_pickle_file(df.groupby('c')['c'].agg('count'), './083_group_c.zip')

Answer commentary

I'm using pandas to load the file, naming the two columns t and c.

df = pd.read_table('./082.context.txt', header=None, names=['t', 'c'])
print(df.info())

As the result of df.info(), the following is output, and you can see that *"$N$: total number of occurrences of word/context word pairs"* is 68,000,317. You can also see that about 1 GB of memory is used. Incidentally, the reading part took about 1.5 minutes.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68000317 entries, 0 to 68000316
Data columns (total 2 columns):
t    object
c    object
dtypes: object(2)
memory usage: 1.0+ GB
None
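Incidentally, $N$ can also be taken directly from the DataFrame instead of reading it off the df.info() output:

# Total number of word/context pairs, i.e. the number of rows
N = len(df)
print(N)  # 68000317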

Here, the counts produced by pandas' groupby are saved with pickle. If you give the file the extension zip, it is compressed automatically, which is convenient. Loading a file saved this way restores the very same pandas Series object that was saved.

def to_pickle_file(grouped, path):
    print('length:', grouped.size)
    grouped.to_pickle(path)
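As a minimal sketch of the restore side, pd.read_pickle infers the zip compression from the extension; the lookup key below is just an illustrative pair, not one guaranteed to exist in the real corpus:

import pandas as pd

# Restore the f(t,c) counts saved above; the result is the same
# MultiIndexed Series that was passed to to_pickle
group_tc = pd.read_pickle('./083_group_tc.zip')

# Look up the co-occurrence frequency of a (target word, context word) pair
print(group_tc[('t1', 'c1')])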

This is the main part of this task. Grouping is done with pandas' groupby, and the size of each group is counted.

to_pickle_file(df.groupby(['t','c'])['c'].agg('count'), './083_group_tc.zip')
to_pickle_file(df.groupby('t')['c'].agg('count'), './083_group_t.zip')
to_pickle_file(df.groupby('c')['c'].agg('count'), './083_group_c.zip')
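As a side note, since the pair file has no missing values, counting the c column is equivalent to simply counting rows per group; a quick check under that assumption:

# groupby(...).size() counts rows per group, which matches counting
# the non-null values of 'c' when there are no NaNs in the data
assert df.groupby(['t', 'c']).size().equals(df.groupby(['t', 'c'])['c'].agg('count'))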

By the way, here are the number of lines, processing time, and file size for each step.

| | Number of lines | Processing time | File size |
|:--|--:|--:|--:|
| $f(t,c)$ | 21,327,945 | 4 min 38 s | 103.7 MB |
| $f(t,*)$ | 388,836 | 34.7 s | 2.8 MB |
| $f(*,c)$ | 388,836 | 24.2 s | 2.8 MB |

Tips / Troubleshooting

Tips: File shrinkage

Since the target file is large (about 800 MB), trial and error on it took a very long time. So at first I created a file containing only the first 100,000 lines and developed the code against that.

cat 082.context.txt | head -n 100000 > 082.context_mini.txt

Output specific lines in large files

When I got an error at a specific row of the DataFrame, I combined head and tail to look at the contents of the file around it. Normally I would simply open the file, but opening a large file takes time, so I did it this way. The following command displays the three lines ending at line 124150 of the file (lines 124148 to 124150). Incidentally, this error is the sentence-tokenization failure described in the "Aside" section of the previous article.

$ cat 082.context.txt | head -n 124150 | tail -n 3

"b")("s"	"c
−	"b")("s"
−	"c
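The same extraction can also be done in Python without loading the whole file; here is a sketch using itertools.islice with the line numbers from the example above:

from itertools import islice

# Print lines 124148-124150 (1-indexed) while streaming the file
with open('./082.context.txt') as f:
    for line in islice(f, 124147, 124150):
        print(line, end='')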
