[PYTHON] 100 Language Processing Knock-80 (Replace with Regular Expression): Corpus Formatting

This is a record of the 80th task, "Corpus Formatting", of the 100 Language Processing Knock 2015. We have finally reached Chapter 9, "Vector Space Method (I)". This one is easy: it is just character replacement with regular expressions as a preprocessing step.

Reference links

| Link | Remarks |
|:--|:--|
| 080.Corpus shaping.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks: 80 | The blog I am always indebted to when solving the knocks |
| 100 language processing knock 2015 version (80-82) | Helpful for Chapter 9 |

Environment

| Type | Version | Contents |
|:--|:--|:--|
| OS | Ubuntu 18.04.01 LTS | Running virtually |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | python 3.6.9 on pyenv; there is no deep reason not to use the 3.7 or 3.8 series; packages are managed with venv |

Task

Chapter 9: Vector Space Method (I)

enwiki-20150112-400-r10-105752.txt.bz2 is the text of 105,752 articles, randomly sampled at 1/10 from the English Wikipedia articles as of January 12, 2015 that consist of more than 400 words, compressed in bzip2 format. Using this text as a corpus, we want to learn vectors (distributed representations) that express the meanings of words. In the first half of Chapter 9, the process of learning word vectors is implemented in several steps by applying principal component analysis to a word-context co-occurrence matrix created from the corpus. In the second half of Chapter 9, the learned word vectors (300 dimensions) are used to compute word similarity and perform analogy.

Note that if problem 83 is implemented naively, a large amount (about 7 GB) of main memory is required. If you run out of memory, devise a workaround or use the 1/100 sampling corpus [enwiki-20150112-400-r100-10576.txt.bz2](http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2).

This time, the *1/100 sampling corpus [enwiki-20150112-400-r100-10576.txt.bz2](http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2)* is used.
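As a side note (my own addition, assuming the URL above is still reachable), the corpus can be downloaded with a couple of lines of Python:

import urllib.request

# Download the 1/100 sampling corpus into the working directory
url = 'http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2'
urllib.request.urlretrieve(url, './enwiki-20150112-400-r100-10576.txt.bz2')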

80. Corpus Formatting

The simplest way to convert a sentence into a word sequence is to split it into words on whitespace characters. However, with this method, symbols such as sentence-final periods and parentheses are included in the words. Therefore, split the text on each line of the corpus into a list of tokens on whitespace characters, and then apply the following processing to each token to remove the symbols from the words.

- Remove the following characters when they appear at the beginning or end of a token: .,!?;:()[]'"
- Delete tokens that become empty strings

After applying the above processing, concatenate the tokens with spaces and save the result to a file.

A similar process appeared in a past knock: "100 Language Processing Knock-71 (using Stanford NLP): Stopwords".

Answer

Answer Program [080. Corpus Shaping.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/09.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20%28I%29/080.%E3%82%B3%E3%83%BC%E3%83%91%E3%82%B9%E3%81%AE%E6%95%B4%E5%BD%A2.ipynb)

import bz2
import re

# Regular expression to remove symbols at the beginning and end of a token
# (the trailing alternative also removes the newline at the end of the line)
reg_sym = re.compile(r'^[.,!?;:\(\)\[\]\'"]+|[.,!?;:\(\)\[\]\'"\n]+$')

with bz2.open('./enwiki-20150112-400-r100-10576.txt.bz2', 'rt') as data_file, \
         open('./080.corpus.txt', mode='w') as out_file:
    for i, line in enumerate(data_file):

        # Split on spaces, strip leading/trailing symbols,
        # and drop tokens that become empty strings
        tokens = [reg_sym.sub('', chunk) for chunk in line.split(' ')
                  if len(reg_sym.sub('', chunk)) > 0]

        # Skip lines that end up empty
        if tokens:
            # Write the tokens separated by spaces
            print(*tokens, sep=' ', end='\n', file=out_file)

        # Also print the first three lines to the console
        if i < 3:
            print(i, line, tokens)

Answer commentary

The process completes in about one minute. The console shows the following result:

0 Anarchism
 ['Anarchism']
1 
 []
2 Anarchism is a political philosophy that advocates stateless societies often defined as self-governed voluntary institutions, but that several authors have defined as more specific institutions based on non-hierarchical free associations. Anarchism holds the state to be undesirable, unnecessary, or harmful. While anti-statism is central, anarchism entails opposing authority or hierarchical organisation in the conduct of human relations, including, but not limited to, the state system.
 ['Anarchism', 'is', 'a', 'political', 'philosophy', 'that', 'advocates', 'stateless', 'societies', 'often', 'defined', 'as', 'self-governed', 'voluntary', 'institutions', 'but', 'that', 'several', 'authors', 'have', 'defined', 'as', 'more', 'specific', 'institutions', 'based', 'on', 'non-hierarchical', 'free', 'associations', 'Anarchism', 'holds', 'the', 'state', 'to', 'be', 'undesirable', 'unnecessary', 'or', 'harmful', 'While', 'anti-statism', 'is', 'central', 'anarchism', 'entails', 'opposing', 'authority', 'or', 'hierarchical', 'organisation', 'in', 'the', 'conduct', 'of', 'human', 'relations', 'including', 'but', 'not', 'limited', 'to', 'the', 'state', 'system']

The compressed file is opened directly with the bz2 package, in text mode ('rt'). Note that the input file and the output file are open at the same time in a single with statement.

with bz2.open('./enwiki-20150112-400-r100-10576.txt.bz2', 'rt') as data_file, \
         open('./080.corpus.txt', mode='w') as out_file:
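Incidentally, because bz2.open with mode 'rt' returns an ordinary text stream, the compressed corpus can also be peeked at without decompressing it to disk. A minimal sketch of my own (itertools.islice is just one way to limit the output):

import bz2
from itertools import islice

# Print the first three lines of the compressed corpus
with bz2.open('./enwiki-20150112-400-r100-10576.txt.bz2', 'rt') as data_file:
    for line in islice(data_file, 3):
        print(repr(line))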

This is the heart of this knock: the symbol-removal regular expression. Here is a brief explanation.

| Regular expression | Meaning |
|:--|:--|
| `^` | Beginning of the string |
| `[]` | Character class: matches any one of the characters enclosed in it |
| `.,!?;:()[]'"` | The symbols to remove; the same set is used at the beginning and at the end, with `(`, `)`, `[`, `]` and `'` backslash-escaped |
| `+` | One or more repetitions, so runs of consecutive symbols are also removed |
| `\n` | Newline character (end of the string only) |
| Vertical bar | Alternation (OR) between the leading and trailing patterns |
| `$` | End of the string |

reg_sym = re.compile(r'^[.,!?;:\(\)\[\]\'"]+|[.,!?;:\(\)\[\]\'"\n]+$')
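To see the regular expression in action, here is a small check with a few made-up tokens (the examples are mine, not from the article):

import re

reg_sym = re.compile(r'^[.,!?;:\(\)\[\]\'"]+|[.,!?;:\(\)\[\]\'"\n]+$')

# Made-up tokens with leading, trailing, and surrounding symbols
for token in ['institutions,', '(self-governed', 'system.\n', '"quoted"', '...']:
    print(repr(token), '->', repr(reg_sym.sub('', token)))
# '...' becomes an empty string, which is why empty tokens are dropped afterwards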

About \xa0

At first I thought some symbols had escaped the removal, but looking closely at the contents, some parts contain \xa0 (a non-breaking space). I left \xa0 as it is. For example, the file contains text that reads "However B. nutans", and I first thought "the period at the end of B. has not been removed", but internally the text is B.\xa0nutans: "B.\xa0nutans" is treated as a single token, so the period sits in the middle of the token rather than at its end. I noticed this thanks to the article "[Python3] What to do if you encounter [\xa0] during scraping".
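To make the behavior concrete, here is a minimal sketch; the sentence fragment is reconstructed from the example above, not quoted verbatim from the corpus:

import re

reg_sym = re.compile(r'^[.,!?;:\(\)\[\]\'"]+|[.,!?;:\(\)\[\]\'"\n]+$')

# \xa0 is a non-breaking space, so split(' ') does not break the token apart
line = 'However B.\xa0nutans'
tokens = line.split(' ')
print(tokens)  # ['However', 'B.\xa0nutans']

# The period is inside the token, not at its end, so the anchored
# regular expression leaves it untouched
print([reg_sym.sub('', t) for t in tokens])  # ['However', 'B.\xa0nutans']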
