[PYTHON] 100 language processing knock-81 (batch replacement): Dealing with country names consisting of compound words

This is the record of problem 81, "Dealing with country names consisting of compound words", from Language Processing 100 Knock 2015. Like the previous corpus formatting task, this is preprocessing, and the main processing is character replacement using regular expressions. However, I built the country name list by hand, which was the troublesome part; the programming itself is not difficult, but it took time.

Reference link

|Link|Remarks|
|:--|:--|
|081. Dealing with country names consisting of compound words.ipynb|Answer program GitHub link|
|100 amateur language processing knocks: 81|The article I always rely on when solving the 100 language processing knocks|
|100 language processing knock 2015 version (80~82)|Chapter 9 was helpful|

environment

|type|version|Contents|
|:--|:--|:--|
|OS|Ubuntu 18.04.01 LTS|Running virtually|
|pyenv|1.2.15|I use pyenv because I sometimes use multiple Python environments|
|Python|3.6.9|I use Python 3.6.9 on pyenv. There is no deep reason not to use the 3.7 or 3.8 series. Packages are managed with venv|

Task

Chapter 9: Vector Space Method (I)

enwiki-20150112-400-r10-105752.txt.bz2 is the text of 105,752 articles, randomly sampled at 1/10 from the English Wikipedia articles as of January 12, 2015 that consist of more than 400 words, compressed in bzip2 format. Using this text as a corpus, we want to learn vectors (distributed representations) that express the meanings of words. In the first half of Chapter 9, the process of learning word vectors is implemented in several steps by applying principal component analysis to a word-context co-occurrence matrix created from the corpus. In the second half of Chapter 9, the learned word vectors (300 dimensions) are used to calculate word similarity and perform analogies.

Note that a straightforward implementation of problem 83 requires a large amount of main memory (about 7 GB). If you run out of memory, devise a workaround or use the 1/100 sampling corpus enwiki-20150112-400-r100-10576.txt.bz2 (http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2).

This time, the *1/100 sampling corpus [enwiki-20150112-400-r100-10576.txt.bz2](http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2)* is used.

81. Dealing with country names consisting of compound words

In English, a concatenation of multiple words can form a single meaningful unit. For example, the United States of America is expressed as "United States" and the United Kingdom as "United Kingdom", but the words "United", "States", and "Kingdom" on their own are ambiguous as to the concept or entity they refer to. Therefore, we would like to estimate the meaning of a compound word by recognizing compound words contained in the corpus and treating each compound word as a single word. However, since it is very difficult to identify compound words accurately, here we only identify country names consisting of compound words.

Obtain a list of country names from the Internet on your own, and for compound-word country names appearing in the corpus from problem 80, replace the spaces with underscores. For example, "United States" should become "United_States" and "Isle of Man" should become "Isle_of_Man".

The "obtain a list of country names from the Internet on your own" part is the troublesome bit...

Answer

Country name list creation

1. Get a list of country names

I thought the page ["Country codes / names"](http://www.fao.org/countryprofiles/iso3list/en/) would be enough, but it does not contain the "Isle of Man" from the problem statement. "Isle of Man" seems to be in ISO 3166-1, so I also took a list from [Wikipedia's ISO 3166-1](https://en.wikipedia.org/wiki/ISO_3166-1). In other words, the country name list is created from the following three sources (a rough sketch of fetching the Wikipedia table programmatically follows the list).

  1. " Country codes / names "](http://www.fao.org/countryprofiles/iso3list/en/) Short name column
  2. " Country codes / names "](http://www.fao.org/countryprofiles/iso3list/en/) Official name column
  3. [「Wikipedia ISO 3166-1」] (https://en.wikipedia.org/wiki/ISO_3166-1) ʻEnglish short name` column
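
The author gathered these lists by hand, but as a reference, here is a hedged sketch of pulling the Wikipedia table with `pandas.read_html` (it needs lxml or a similar HTML parser installed). The table selection and the column header containing "short name" are assumptions about the current page layout, and the FAO page would need a similar or manual step.

```python
import pandas as pd

# Sketch only: assumes the ISO 3166-1 page still has a table whose columns
# include "Alpha-2 code" and an "English short name ..." column.
tables = pd.read_html('https://en.wikipedia.org/wiki/ISO_3166-1', match='Alpha-2 code')
codes = tables[0]

# Pick whichever column mentions "short name" (the exact header text may vary)
short_name_col = [c for c in codes.columns if 'short name' in str(c).lower()][0]
wiki_names = codes[short_name_col].astype(str).tolist()
print(len(wiki_names), wiki_names[:5])
```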

2. Removal of the leading "the"

Some names obtained from the `Official name` column of ["Country codes / names"](http://www.fao.org/countryprofiles/iso3list/en/) are prefixed with "the". I removed it because it would get in the way later.

3. Duplicate deletion

Since the names come from three sources, some country names are duplicated, so I removed the duplicates.

4. Removal of single-word names

The theme this time is country names consisting of compound words, so single-word country names are not needed. In Excel I used `=COUNTIF(A1,"* *")` to treat names containing a space as compound words, and removed the country names for which the formula returned 0. A Python sketch covering steps 2 to 4 is shown below.
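
The cleanup above was done by hand in Excel, but as a minimal sketch, assuming the names collected from the three sources are in a hypothetical file `countries_raw.txt` (one name per line), steps 2 to 4 could also be scripted:

```python
import re

# Read the raw country names gathered from the three sources
with open('countries_raw.txt') as f:
    names = [line.strip() for line in f]

# Step 2: drop a leading "the " (case-insensitive)
names = [re.sub(r'^the\s+', '', name, flags=re.IGNORECASE) for name in names]

# Step 3: remove duplicates while keeping the original order
names = list(dict.fromkeys(names))

# Step 4: keep only compound-word names, i.e. names containing a space
names = [name for name in names if ' ' in name]

with open('./081.countries.txt', mode='w') as f:
    f.write('\n'.join(names) + '\n')
```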

5. Manual fine adjustment

Some names cannot be used as they are, so I adjusted them manually. This takes time... The following is an example.

|Before|After|
|:--|:--|
|Bolivia (Plurinational State of)|Plurinational State of Bolivia|
|Cocos (Keeling) Islands|Cocos Keeling Islands<br>Cocos Keeling<br>Cocos Islands<br>Keeling Islands|

In the end, 247 country names were created.

Answer program [081. Dealing with country names consisting of compound words.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/09.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(I)/081.%E8%A4%87%E5%90%88%E8%AA%9E%E3%81%8B%E3%82%89%E3%81%AA%E3%82%8B%E5%9B%BD%E5%90%8D%E3%81%B8%E3%81%AE%E5%AF%BE%E5%87%A6.ipynb)

Here is the program. The processing is short and simple (it still took me a couple of hours to write due to lack of skill...). However, it takes about 12 minutes to search and replace all 247 country names across the full text. According to the article "100 language processing knock 2015 version (80~82)", using the sed command seems to be faster.

```python
import re

# Read the country name list, strip the trailing newline from each line,
# and prepend the word count so the list can be sorted by length
with open('./081.countries.txt') as countries:
    country_num = [[len(country.split()), country.rstrip('\n')] for country in countries]

# Sort in descending order so that longer names are replaced first
country_num.sort(reverse=True)

with open('./080.corpus.txt') as file_in:
    body = file_in.read()

# Replace each country name with its underscore-joined form, ignoring case
for i, country in enumerate(country_num):
    print(i, country[1])
    regex = re.compile(country[1], re.IGNORECASE)
    body = regex.sub(country[1].replace(' ', '_'), body)

with open('./081.corpus.txt', mode='w') as file_out:
    file_out.write(body)
```
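
Regarding the roughly 12-minute runtime: each of the 247 patterns scans the whole corpus. As a rough sketch (not the author's method, and not verified to be exactly equivalent), combining all names into a single alternation pattern would let the corpus be scanned only once:

```python
import re

# Read and sort the country names as above (longer names first, so they win
# when the alternation tries alternatives left to right)
with open('./081.countries.txt') as countries:
    country_num = [[len(c.split()), c.rstrip('\n')] for c in countries]
country_num.sort(reverse=True)

# Map each lowercased name to its underscore-joined canonical form
canonical = {c[1].lower(): c[1].replace(' ', '_') for c in country_num}

# One combined pattern: the corpus is scanned a single time
pattern = re.compile('|'.join(re.escape(c[1]) for c in country_num), re.IGNORECASE)

with open('./080.corpus.txt') as file_in:
    body = file_in.read()

body = pattern.sub(lambda m: canonical[m.group(0).lower()], body)

with open('./081.corpus.txt', mode='w') as file_out:
    file_out.write(body)
```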

Answer commentary

The country name list file is read, the word count is prepended to each entry, and the list is sorted in descending order. This is to prevent, for example, "United States of America" from being matched first by the shorter "United States" and turned into "United_States of America"; with the descending sort, the longer name is replaced first and becomes "United_States_of_America".

```python
# Read the country name list, strip the trailing newline from each line,
# and prepend the word count so the list can be sorted by length
with open('./081.countries.txt') as countries:
    country_num = [[len(country.split()), country.rstrip('\n')] for country in countries]

# Sort in descending order so that longer names are replaced first
country_num.sort(reverse=True)
```
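
As a small illustration of why the descending sort matters (the three names below are a made-up sample, not the contents of the actual file), the longer name ends up first and is therefore replaced first:

```python
# Hypothetical three-entry sample, just to show the sort order
country_num = [[len(c.split()), c] for c in
               ['United States', 'United States of America', 'United Kingdom']]
country_num.sort(reverse=True)
print(country_num)
# [[4, 'United States of America'], [2, 'United States'], [2, 'United Kingdom']]
```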

By specifying re.IGNORECASE in the regular expression, the replacement is done without case sensitivity (I have not confirmed whether handling this variation actually matters).

```python
    regex = re.compile(country[1], re.IGNORECASE)
```
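
A minimal check of what the case-insensitive substitution does (the sentence is made up, not taken from the corpus). Note that the replacement string always uses the capitalization from the country list, so lowercase occurrences are normalized:

```python
import re

regex = re.compile('United States', re.IGNORECASE)
print(regex.sub('United_States', 'the united states of america'))
# -> the United_States of america
```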
