[PYTHON] 100 language processing knock-81 (batch replacement): Dealing with country names consisting of compound words

This is the record of problem 81, "Dealing with country names consisting of compound words", from Language Processing 100 Knock 2015. Like the previous corpus formatting task, this is preprocessing, and the main processing is character replacement using regular expressions. However, I built the country name list by hand, which was the troublesome part; the programming itself is not difficult, but it took time.

Reference link

|Link|Remarks|
|:--|:--|
|081. Dealing with country names consisting of compound words.ipynb|Answer program GitHub link|
|100 amateur language processing knocks: 81|The article I always rely on when solving the 100 language processing knocks|
|100 language processing knock 2015 version (80~82)|Chapter 9 was helpful|

environment

|type|version|Contents|
|:--|:--|:--|
|OS|Ubuntu 18.04.01 LTS|Running virtually|
|pyenv|1.2.15|I use pyenv because I sometimes use multiple Python environments|
|Python|3.6.9|I use Python 3.6.9 on pyenv. There is no deep reason not to use the 3.7 or 3.8 series. Packages are managed with venv|

Task

Chapter 9: Vector Space Method (I)

enwiki-20150112-400-r10-105752.txt.bz2 is the text of 105,752 articles, randomly sampled at 1/10 from the English Wikipedia articles as of January 12, 2015 that consist of more than 400 words, compressed in bzip2 format. Using this text as a corpus, we want to learn vectors (distributed representations) that express the meanings of words. In the first half of Chapter 9, the process of learning word vectors is implemented in several steps by applying principal component analysis to a word-context co-occurrence matrix created from the corpus. In the second half of Chapter 9, the learned word vectors (300 dimensions) are used to calculate word similarity and perform analogies.

Note that a straightforward implementation of problem 83 requires a large amount of main memory (about 7 GB). If you run out of memory, devise a workaround or use the 1/100 sampling corpus enwiki-20150112-400-r100-10576.txt.bz2 (http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2).

This time, the *1/100 sampling corpus [enwiki-20150112-400-r100-10576.txt.bz2](http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2)* is used.

81. Dealing with country names consisting of compound words

In English, a concatenation of multiple words can form a single meaningful unit. For example, the United States of America is expressed as "United States" and the United Kingdom as "United Kingdom", but the words "United", "States", and "Kingdom" on their own are ambiguous as to the concept or entity they refer to. Therefore, we would like to estimate the meaning of a compound word by recognizing compound words contained in the corpus and treating each compound word as a single word. However, since it is very difficult to identify compound words accurately, here we only identify country names consisting of compound words.

Obtain a list of country names from the Internet on your own, and for compound-word country names appearing in the corpus from problem 80, replace the spaces with underscores. For example, "United States" should become "United_States" and "Isle of Man" should become "Isle_of_Man".

The "obtain a list of country names from the Internet on your own" part is the troublesome bit...

Answer

Country name list creation

1. Get a list of country names

I thought the page ["Country codes / names"](http://www.fao.org/countryprofiles/iso3list/en/) would be enough, but it does not contain the "Isle of Man" from the problem statement. "Isle of Man" seems to be in ISO 3166-1, so I also took a list from [Wikipedia's ISO 3166-1](https://en.wikipedia.org/wiki/ISO_3166-1). In other words, the country name list is created from the following three sources (a rough sketch of fetching the Wikipedia table programmatically follows the list).

  1. " Country codes / names "](http://www.fao.org/countryprofiles/iso3list/en/) Short name column
  2. " Country codes / names "](http://www.fao.org/countryprofiles/iso3list/en/) Official name column
  3. [「Wikipedia ISO 3166-1」] (https://en.wikipedia.org/wiki/ISO_3166-1) ʻEnglish short name` column
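
The author gathered these lists by hand, but as a reference, here is a hedged sketch of pulling the Wikipedia table with `pandas.read_html` (it needs lxml or a similar HTML parser installed). The table selection and the column header containing "short name" are assumptions about the current page layout, and the FAO page would need a similar or manual step.

```python
import pandas as pd

# Sketch only: assumes the ISO 3166-1 page still has a table whose columns
# include "Alpha-2 code" and an "English short name ..." column.
tables = pd.read_html('https://en.wikipedia.org/wiki/ISO_3166-1', match='Alpha-2 code')
codes = tables[0]

# Pick whichever column mentions "short name" (the exact header text may vary)
short_name_col = [c for c in codes.columns if 'short name' in str(c).lower()][0]
wiki_names = codes[short_name_col].astype(str).tolist()
print(len(wiki_names), wiki_names[:5])
```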

2. Removal of the leading "the"

Some names obtained from the `Official name` column of ["Country codes / names"](http://www.fao.org/countryprofiles/iso3list/en/) are prefixed with "the". I removed it because it would get in the way later.

3. Duplicate deletion

Since the names come from three sources, some country names are duplicated, so I removed the duplicates.

4. Removal of single-word names

The theme this time is country names consisting of compound words, so single-word country names are not needed. In Excel I used `=COUNTIF(A1,"* *")` to treat names containing a space as compound words, and removed the country names for which the formula returned 0. A Python sketch covering steps 2 to 4 is shown below.
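
The cleanup above was done by hand in Excel, but as a minimal sketch, assuming the names collected from the three sources are in a hypothetical file `countries_raw.txt` (one name per line), steps 2 to 4 could also be scripted:

```python
import re

# Read the raw country names gathered from the three sources
with open('countries_raw.txt') as f:
    names = [line.strip() for line in f]

# Step 2: drop a leading "the " (case-insensitive)
names = [re.sub(r'^the\s+', '', name, flags=re.IGNORECASE) for name in names]

# Step 3: remove duplicates while keeping the original order
names = list(dict.fromkeys(names))

# Step 4: keep only compound-word names, i.e. names containing a space
names = [name for name in names if ' ' in name]

with open('./081.countries.txt', mode='w') as f:
    f.write('\n'.join(names) + '\n')
```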

5. Manual fine adjustment

Some names cannot be used as they are, so I adjusted them manually. This takes time... The following is an example.

|Before|After|
|:--|:--|
|Bolivia (Plurinational State of)|Plurinational State of Bolivia|
|Cocos (Keeling) Islands|Cocos Keeling Islands<br>Cocos Keeling<br>Cocos Islands<br>Keeling Islands|

In the end, 247 country names were created.

Answer program [081. Dealing with country names consisting of compound words.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/09.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(I)/081.%E8%A4%87%E5%90%88%E8%AA%9E%E3%81%8B%E3%82%89%E3%81%AA%E3%82%8B%E5%9B%BD%E5%90%8D%E3%81%B8%E3%81%AE%E5%AF%BE%E5%87%A6.ipynb)

Here is the program. The processing is short and simple (it still took me a couple of hours to write due to lack of skill...). However, it takes about 12 minutes to search and replace all 247 country names across the full text. According to the article "100 language processing knock 2015 version (80~82)", using the sed command seems to be faster.

```python
import re

# Read the country name list, strip the trailing newline from each line,
# and prepend the word count so the list can be sorted by length
with open('./081.countries.txt') as countries:
    country_num = [[len(country.split()), country.rstrip('\n')] for country in countries]

# Sort in descending order so that longer names are replaced first
country_num.sort(reverse=True)

with open('./080.corpus.txt') as file_in:
    body = file_in.read()

# Replace each country name with its underscore-joined form, ignoring case
for i, country in enumerate(country_num):
    print(i, country[1])
    regex = re.compile(country[1], re.IGNORECASE)
    body = regex.sub(country[1].replace(' ', '_'), body)

with open('./081.corpus.txt', mode='w') as file_out:
    file_out.write(body)
```
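
Regarding the roughly 12-minute runtime: each of the 247 patterns scans the whole corpus. As a rough sketch (not the author's method, and not verified to be exactly equivalent), combining all names into a single alternation pattern would let the corpus be scanned only once:

```python
import re

# Read and sort the country names as above (longer names first, so they win
# when the alternation tries alternatives left to right)
with open('./081.countries.txt') as countries:
    country_num = [[len(c.split()), c.rstrip('\n')] for c in countries]
country_num.sort(reverse=True)

# Map each lowercased name to its underscore-joined canonical form
canonical = {c[1].lower(): c[1].replace(' ', '_') for c in country_num}

# One combined pattern: the corpus is scanned a single time
pattern = re.compile('|'.join(re.escape(c[1]) for c in country_num), re.IGNORECASE)

with open('./080.corpus.txt') as file_in:
    body = file_in.read()

body = pattern.sub(lambda m: canonical[m.group(0).lower()], body)

with open('./081.corpus.txt', mode='w') as file_out:
    file_out.write(body)
```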

Answer commentary

The country name list file is read, the word count is prepended to each entry, and the list is sorted in descending order. This is to prevent, for example, "United States of America" from being matched first by the shorter "United States" and turned into "United_States of America"; with the descending sort, the longer name is replaced first and becomes "United_States_of_America".

```python
# Read the country name list, strip the trailing newline from each line,
# and prepend the word count so the list can be sorted by length
with open('./081.countries.txt') as countries:
    country_num = [[len(country.split()), country.rstrip('\n')] for country in countries]

# Sort in descending order so that longer names are replaced first
country_num.sort(reverse=True)
```
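
As a small illustration of why the descending sort matters (the three names below are a made-up sample, not the contents of the actual file), the longer name ends up first and is therefore replaced first:

```python
# Hypothetical three-entry sample, just to show the sort order
country_num = [[len(c.split()), c] for c in
               ['United States', 'United States of America', 'United Kingdom']]
country_num.sort(reverse=True)
print(country_num)
# [[4, 'United States of America'], [2, 'United States'], [2, 'United Kingdom']]
```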

By specifying re.IGNORECASE in the regular expression, the replacement is done without case sensitivity (I have not confirmed whether handling this variation actually matters).

```python
    regex = re.compile(country[1], re.IGNORECASE)
```
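
A minimal check of what the case-insensitive substitution does (the sentence is made up, not taken from the corpus). Note that the replacement string always uses the capitalization from the country list, so lowercase occurrences are normalized:

```python
import re

regex = re.compile('United States', re.IGNORECASE)
print(regex.sub('United_States', 'the united states of america'))
# -> the United_States of america
```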
