This is my record of problem 96, "Extraction of vectors for country names," from the 2015 edition of the 100 Language Processing Knocks. It extracts only the country-name vectors from the Gensim word vectors saved in knock 90. It is technically easy, but preparing the country names is a bit tedious.

Reference links

Link	Remarks
096. Extraction of vector for country name.ipynb	GitHub link to the answer program
100 amateur language processing knocks: 96	I always rely on this article when working through the 100 knocks

Environment

Type	Version	Notes
OS	Ubuntu 18.04.01 LTS	Running in a virtual machine
pyenv	1.2.15	I use pyenv because I sometimes use multiple Python environments
Python	3.6.9	Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv

In the above environment, I use the following additional Python packages. They can be installed with plain pip.

Type	Version
gensim	3.8.1
numpy	1.17.4
pandas	0.25.3

Task

Chapter 10: Vector Space Method (II)

In Chapter 10, we will continue to study word vectors from the previous chapter.

96. Extraction of vectors for country names

From the word2vec training result, extract only the vectors for country names.

Problem supplement (about country names)

I considered using the country name file from "100 Language Processing Knock-81 (batch replacement): Dealing with country names consisting of compound words." However, that file contains no one-word country names (such as "England"), because I removed them in [step "4. Delete Single Name"](https://qiita.com/FukuharaYohei/items/67be619ce9dd33392fcd#4-%E5%8D%98%E4%B8%80%E5%90%8D%E5%89%8A%E9%99%A4). So I added the country names deleted in that step back to the file and used the result.

Answer

Answer program: [096. Extraction of vector for country name.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/10.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(II)/096.%E5%9B%BD%E5%90%8D%E3%81%AB%E9%96%A2%E3%81%99%E3%82%8B%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E3%81%AE%E6%8A%BD%E5%87%BA.ipynb)

import numpy as np
import pandas as pd
from gensim.models import Word2Vec

model = Word2Vec.load('./090.word2vec.model')
print(model)

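index = []   # country names found in the model
vector = []  # matching word vectors

with open('./096.countries.txt') as file_in:
    for line in file_in:
        # Country names in the corpus use underscores instead of spaces
        country = line.rstrip().replace(' ', '_')
        try:
            vector.append(model.wv[country].tolist())
            index.append(country)
        except KeyError:
            # Skip names that are not in the model vocabulary
            pass

# Save the vectors as a DataFrame indexed by country name
pd.DataFrame(vector, index=index).to_pickle('096.country_vector.zip')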

Answer commentary

I read the file line by line, look up the vector for each country name, and append it to a list. Spaces are replaced with underscores because I did the same replacement in "100 Language Processing Knock-81 (batch replacement): Dealing with country names consisting of compound words." Some names do not appear in the corpus, and others were excluded from the vocabulary because they appear too infrequently, so I use `except KeyError` to catch the lookup error and skip them.

for line in file_in:
    country = line.rstrip().replace(' ', '_')
    try:
        vector.append(model.wv[country].tolist())
        index.append(country)
    except KeyError:
        pass
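As a variation on the loop above, the lookup can also be written with a membership test instead of `try`/`except`, which makes it easy to keep a list of the names that were skipped. This is only a minimal sketch: `collect_vectors` is a hypothetical helper, and a plain dict with made-up values stands in for `model.wv`, on the assumption that `model.wv` supports `in` and `[]` in the same way.

```python
def collect_vectors(names, vectors):
    """Split names into those found in `vectors` and those missing.

    `vectors` is any mapping that supports `in` and `[]`; a plain dict
    stands in here for the `model.wv` object used in the article.
    """
    found, missing = {}, []
    for name in names:
        # Same normalization as in the article: spaces become underscores
        key = name.strip().replace(' ', '_')
        if key in vectors:
            found[key] = list(vectors[key])
        else:
            missing.append(key)
    return found, missing


# Toy data, made up for illustration (not taken from the real model)
toy_vectors = {'Japan': [0.1, 0.2], 'United_States': [0.3, 0.4]}
toy_names = ['Japan\n', 'United States\n', 'Atlantis\n']

found, missing = collect_vectors(toy_names, toy_vectors)
print(list(found))  # names that have a vector
print(missing)      # names that were skipped
```

Keeping the skipped names around is handy when checking why a particular country did not make it into the output.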

After that, I put the country names in the DataFrame index and write the DataFrame to a file. Vectors for 238 country names are output. The original file lists 416 names, so word vectors exist for a little under 60% of them.

pd.DataFrame(vector, index=index).to_pickle('096.country_vector.zip')
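To check the saved file, it can be read back with `pd.read_pickle`. The following is a rough sketch with toy data: the vectors, the temporary path, and the file name `country_vector.zip` are made up for illustration, and only the 238 and 416 counts come from the result above.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for the `vector` and `index` lists (values are made up)
index = ['Japan', 'United_States']
vector = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'country_vector.zip')  # hypothetical file name
    pd.DataFrame(vector, index=index).to_pickle(path)
    # Compression is inferred from the .zip extension on both write and read
    restored = pd.read_pickle(path)

print(restored.shape)        # (number of countries, vector dimensions)
print(list(restored.index))  # country names used as the index

# Coverage reported in the article: 238 of 416 names had a vector
coverage = 238 / 416
print(f'{coverage:.1%}')
```

With the real file, `restored.shape[0]` should match the 238 country names mentioned above.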

[PYTHON] 100 Language Processing Knock-96 (using Gensim): Extraction of vector for country name