[PYTHON] 100 Language Processing Knock-96 (using Gensim): Extraction of vector for country name

This is a record of knock 96, "Extraction of vector for country name," from the 100 Language Processing Knocks 2015. It extracts only the country-name vectors from the Gensim word vectors saved in knock 90. The technique is simple, but preparing the list of country names is a bit tedious.

Reference link

| Link | Remarks |
|---|---|
| 096.Extraction of vector for country name.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks:96 | The site I always rely on when working through the 100 language processing knocks |

environment

| Type | Version | Notes |
|---|---|---|
| OS | Ubuntu 18.04.01 LTS | Running as a virtual machine |
| pyenv | 1.2.15 | I use pyenv because I sometimes work with multiple Python environments |
| Python | 3.6.9 | Python 3.6.9 on pyenv. There is no particular reason for not using the 3.7 or 3.8 series. Packages are managed with venv |

On top of the environment above, I use the following additional Python packages. They can be installed with plain pip.

| Type | Version |
|---|---|
| gensim | 3.8.1 |
| numpy | 1.17.4 |
| pandas | 0.25.3 |

Task

Chapter 10: Vector Space Method (II)

In Chapter 10, we will continue to study word vectors from the previous chapter.

96. Extraction of vector for country name

From the word2vec training results, extract only the vectors for country names.

Problem supplement (about country name)

I considered reusing the country name file from "100 language processing knock-81 (batch replacement): Dealing with country names consisting of compound words." However, that file contains no single-word country names (such as "England"), because I removed them in [Step "4. Delete Single Name"](https://qiita.com/FukuharaYohei/items/67be619ce9dd33392fcd#4-%E5%8D%98%E4%B8%80%E5%90%8D%E5%89%8A%E9%99%A4). So I added back the country names deleted in that step and used the resulting file.

Answer

Answer program: [096. Extraction of vector for country name.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/10.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(II)/096.%E5%9B%BD%E5%90%8D%E3%81%AB%E9%96%A2%E3%81%99%E3%82%8B%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E3%81%AE%E6%8A%BD%E5%87%BA.ipynb)

import numpy as np
import pandas as pd
from gensim.models import Word2Vec

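# Load the Gensim word2vec model saved in knock 90
model = Word2Vec.load('./090.word2vec.model')
print(model)

index = []   # country names found in the vocabulary
vector = []  # the corresponding word vectors

with open('./096.countries.txt') as file_in:
    for line in file_in:
        # Compound names were joined with underscores in knock 81
        country = line.rstrip().replace(' ', '_')
        try:
            vector.append(model.wv[country].tolist())
            index.append(country)
        except KeyError:
            # The name is not in the model's vocabulary; skip it
            pass

# Save as a DataFrame with the country names as the index
pd.DataFrame(vector, index=index).to_pickle('096.country_vector.zip')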

Answer commentary

The program reads the file line by line, looks up the vector for each country name, and appends it to a list. Spaces are replaced with underscores because the same replacement was applied to the corpus in "100 language processing knock-81 (batch replacement): Dealing with country names consisting of compound words." Some country names do not appear in the corpus, and others were dropped from the vocabulary because they occur too rarely, so `except KeyError` catches the resulting lookup error.

for line in file_in:
    country = line.rstrip().replace(' ', '_')
    try:
        vector.append(model.wv[country].tolist())
        index.append(country)
    except KeyError:
        pass
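
As a variation on the loop above, the following minimal sketch also records which names were skipped, which can help when checking the country list. It is not part of the answer program: `collect_vectors` and `toy_vectors` are hypothetical names, and a plain dict stands in for `model.wv` on the assumption that both raise `KeyError` for unknown keys.

```python
def collect_vectors(names, lookup):
    """Split names into found vectors and missing names.

    `lookup` is any mapping that raises KeyError for unknown keys;
    in the answer program this role is played by `model.wv`.
    """
    found = {}
    missing = []
    for name in names:
        # Same normalization as the answer: spaces become underscores
        key = name.strip().replace(' ', '_')
        try:
            found[key] = list(lookup[key])
        except KeyError:
            missing.append(key)
    return found, missing


# Toy stand-in for model.wv (assumption: real vectors come from Word2Vec)
toy_vectors = {"Japan": [0.1, 0.2], "United_States": [0.3, 0.4]}
found, missing = collect_vectors(
    ["Japan", "United States", "Atlantis"], toy_vectors
)
print(list(found))  # ['Japan', 'United_States']
print(missing)      # ['Atlantis']
```

Keeping the missing names in a separate list makes it easy to see whether a name was skipped because of a spelling difference or because it really is absent from the vocabulary.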

After that, the country names become the index of a DataFrame, which is written to a file. 238 country names are output. Since the original file contained 416 names, word vectors exist for a little under 60% of them.

pd.DataFrame(vector, index=index).to_pickle('096.country_vector.zip')
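
The coverage figure can be checked with a few lines of arithmetic. This is only a sketch using the two counts reported in this article; the helper name `coverage` is hypothetical.

```python
def coverage(found, total):
    """Return the fraction of names for which a vector was found."""
    if total <= 0:
        raise ValueError("total must be positive")
    return found / total


# Counts reported in this article: 238 vectors found out of 416 names
ratio = coverage(238, 416)
print(f"{ratio:.1%}")  # 57.2%
```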
