This is a record of knock 96, "Extraction of vectors related to country names", from Language Processing 100 Knocks 2015. It extracts only the country-name vectors from the Gensim version of the word vectors saved in knock 90. Technically it is easy, but preparing the list of country names is a bit tedious.
| Link | Remarks |
|---|---|
| 096.Extraction of vector for country name.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks:96 | A resource I always rely on when working through the 100 language processing knocks |

| Type | Version | Notes |
|---|---|---|
| OS | Ubuntu 18.04.01 LTS | Running in a virtual machine |
| pyenv | 1.2.15 | I use pyenv because I sometimes work with multiple Python environments |
| Python | 3.6.9 | Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv |
In the above environment, I use the following additional Python packages. They can all be installed with plain pip.
| Type | Version |
|---|---|
| gensim | 3.8.1 | 
| numpy | 1.17.4 | 
| pandas | 0.25.3 | 
Chapter 10 continues the study of word vectors from the previous chapter.
Extract only the vectors related to country names from the word2vec training result.
I considered reusing the country name file from "Language processing 100 knock-81 (batch replacement): Dealing with country names consisting of compound words". However, that file has no single-word country names (such as "England"), because I deleted them in [Step "4. Delete Single Name"](https://qiita.com/FukuharaYohei/items/67be619ce9dd33392fcd#4-%E5%8D%98%E4%B8%80%E5%90%8D%E5%89%8A%E9%99%A4). So I added the country names deleted in that step back to the file and used the result.
import numpy as np
import pandas as pd
from gensim.models import Word2Vec
model = Word2Vec.load('./090.word2vec.model')
print(model)
index = []
vector = []
with open('./096.countries.txt') as file_in:
    for line in file_in:
        country = line.rstrip().replace(' ', '_')
        try:
            vector.append(model.wv[country].tolist())
            index.append(country)
        except KeyError:
            pass
pd.DataFrame(vector, index=index).to_pickle('096.country_vector.zip')
I read the file line by line, look up the vector for each country name, and append it to the list. Spaces are replaced with underscores because I made the same replacement in "Language processing 100 knock-81 (batch replacement): Dealing with country names consisting of compound words". Some country names never appear in the corpus, and others were excluded from the vocabulary because they appear too infrequently, so `except KeyError` catches the lookup error and skips those names.
for line in file_in:
    country = line.rstrip().replace(' ', '_')
    try:
        vector.append(model.wv[country].tolist())
        index.append(country)
    except KeyError:
        pass
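As a minimal sketch of the lookup-and-skip pattern above, the following runs without the trained model. A plain dict stands in for `model.wv` (an assumption made only so the example is self-contained), and `collect_country_vectors`, `toy_vectors`, and `toy_lines` are hypothetical names that do not appear in the original notebook. Unlike the original loop, it also records the skipped names so they can be inspected afterwards.

```python
def collect_country_vectors(lines, vectors):
    """Return (index, vector, missing) for the country names in `lines`.

    `vectors` is any mapping that raises KeyError for unknown words;
    here it stands in for `model.wv` (assumption for illustration).
    """
    index, vector, missing = [], [], []
    for line in lines:
        # Same normalisation as the article: compound names are joined with "_".
        country = line.rstrip().replace(' ', '_')
        try:
            vector.append(list(vectors[country]))
            index.append(country)
        except KeyError:
            # The name is not in the vocabulary, so keep it for later inspection.
            missing.append(country)
    return index, vector, missing


# Toy data: a plain dict plays the role of the trained word vectors.
toy_vectors = {'Japan': [0.1, 0.2], 'United_States': [0.3, 0.4]}
toy_lines = ['Japan\n', 'United States\n', 'Atlantis\n']

index, vector, missing = collect_country_vectors(toy_lines, toy_vectors)
print(index)    # ['Japan', 'United_States']
print(missing)  # ['Atlantis']
```

Keeping the skipped names makes it easy to check whether a name is missing because of spelling differences or because the word really is absent from the vocabulary.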
After that, I build a DataFrame with the country names as its index and write it to a file. 238 countries are output. Since the original file contained 416 country names, a little under 60% of them have word vectors.
pd.DataFrame(vector, index=index).to_pickle('096.country_vector.zip')
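The saved file can be loaded again with `pd.read_pickle`. The sketch below is a round trip on toy data, not the real output: the two-country `index` and `vector` lists and the file name `country_vector_demo.zip` are hypothetical, and a temporary directory is used so nothing is left on disk. Because the path ends in `.zip`, pandas infers zip compression when writing and when reading.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for the `index` and `vector` lists built in the loop.
index = ['Japan', 'United_States']
vector = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; the ".zip" suffix selects zip compression.
    path = os.path.join(tmp_dir, 'country_vector_demo.zip')
    pd.DataFrame(vector, index=index).to_pickle(path)
    restored = pd.read_pickle(path)

print(restored.shape)                  # (2, 3)
print(restored.loc['Japan'].tolist())  # [0.1, 0.2, 0.3]
```

Each row is one country and each column is one vector dimension, so a single country's vector can be selected by name with `restored.loc[...]`.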