[PYTHON] 100 Language Processing Knock-91: Preparation of Analogy Data

This is a record of problem 91, "Preparation of analogy data", from the 2015 edition of 100 Language Processing Knock. It is technically very easy this time, because it is only preprocessing for the knocks that follow.

Reference link

| Link | Remarks |
|:--|:--|
| 091.Preparation of analogy data.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks: 91 | A resource I always rely on when working through 100 Language Processing Knock |

Environment

| Type | Version | Contents |
|:--|:--|:--|
| OS | Ubuntu18.04.01 LTS | Running in a virtual machine |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed using venv |

Task

Chapter 10: Vector Space Method (II)

In Chapter 10, we will continue to study word vectors from the previous chapter.

91. Preparation of analogy data

Download Word Analogy Evaluation Data. The line starting with ":" in this data represents the section name. For example, the line ": capital-common-countries" marks the beginning of the section "capital-common-countries". From the downloaded evaluation data, extract the evaluation cases included in the section "family" and save them in a file.

Problem supplement

"Analogy data" is data for evaluating analogies. The first 10 lines are shown below. A line that begins with a colon, such as `: capital-common-countries`, marks the start of a block. It is followed by lines such as `Athens Greece Baghdad Iraq`, each of which holds two capital-country pairs. The whole file is a sequence of such blocks: a header line, then many lines of two pairs each. This time, we extract the contents of the family block from this data.

questions-words.txt


: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
Athens Greece Beijing China
Athens Greece Berlin Germany
Athens Greece Bern Switzerland
Athens Greece Cairo Egypt
Athens Greece Canberra Australia
Athens Greece Hanoi Vietnam
Athens Greece Havana Cuba
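
The block structure described above can be checked with a small sketch that counts the evaluation cases under each section header. The function name `count_cases_per_section` and the sample lines are made up for illustration; the sample is a hand-made fragment in the same format, not the real file.

```python
def count_cases_per_section(lines):
    """Return {section name: number of evaluation cases}."""
    counts = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(':'):
            # Header line such as ": capital-common-countries"
            current = line[1:].strip()
            counts[current] = 0
        elif current is not None:
            # Ordinary line: one evaluation case of the current section
            counts[current] += 1
    return counts


# Hand-made sample in the same format (an assumption, not the real data)
sample = [
    ': capital-common-countries',
    'Athens Greece Baghdad Iraq',
    'Athens Greece Bangkok Thailand',
    ': family',
    'boy girl brother sister',
]
print(count_cases_per_section(sample))
# → {'capital-common-countries': 2, 'family': 1}
```

Passing an open file object instead of `sample` should work the same way, since the function only iterates over lines.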

Answer

Answer program: [091. Preparation of Analogy Data.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/10.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(II)/091.%E3%82%A2%E3%83%8A%E3%83%AD%E3%82%B8%E3%83%BC%E3%83%87%E3%83%BC%E3%82%BF%E3%81%AE%E6%BA%96%E5%82%99.ipynb)

with open('./questions-words.txt') as file_in, \
        open('./091.analogy_family.txt', 'w') as file_out:

    target = False      # True while reading lines of the family section
    for line in file_in:

        if target:

            # Inside the family section: write lines until the next section header
            if line.startswith(': '):
                break
            print(line.strip(), file=file_out)

        elif line.startswith(': family'):

            # Found the header of the family section
            target = True
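
As a variation on the same idea, the extraction can be written as a small generator that works for any section name. This is only a sketch: `extract_section` is a name introduced here, and the `family-extra` section in the sample is hypothetical. It compares the header exactly, whereas `startswith(': family')` is a prefix match that would also accept such a header if one existed.

```python
def extract_section(lines, section):
    """Yield the evaluation cases that belong to one section."""
    in_target = False
    for line in lines:
        line = line.strip()
        if line.startswith(':'):
            if in_target:
                # The next header ends the target section
                break
            # Exact comparison with the section name (not a prefix match)
            in_target = (line[1:].strip() == section)
        elif in_target and line:
            yield line


# Hand-made sample; "family-extra" is a hypothetical section name
sample = [
    ': family-extra',
    'a b c d',
    ': family',
    'boy girl brother sister',
    'boy girl dad mom',
    ': currency',
    'Algeria dinar Angola kwanza',
]
print(list(extract_section(sample, 'family')))
# → ['boy girl brother sister', 'boy girl dad mom']
```

To write a file as in the answer program, the generator can be given the opened input file and each yielded line printed with `print(case, file=file_out)`.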

Answer commentary

To be honest, nothing here is technically special, so there is little to explain. If anything, more than 90% of the code is copied from "100 amateur language processing knocks: 91". The first 10 lines of the resulting file are:

091.analogy_family.txt


boy girl brother sister
boy girl brothers sisters
boy girl dad mom
boy girl father mother
boy girl grandfather grandmother
boy girl grandpa grandma
boy girl grandson granddaughter
boy girl groom bride
boy girl he she
boy girl his her
(The rest is omitted.)
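
Each output line holds four words, that is, two pairs with the same relationship. A minimal sketch for checking this and splitting a line into its pairs might look like the following; `parse_case` is a hypothetical helper, not part of the answer program.

```python
def parse_case(line):
    """Split one evaluation case into two word pairs."""
    words = line.split()
    if len(words) != 4:
        raise ValueError(
            'expected 4 words, got {}: {!r}'.format(len(words), line))
    return (words[0], words[1]), (words[2], words[3])


# Two lines taken from the output shown above
for line in ['boy girl brother sister', 'boy girl dad mom']:
    print(parse_case(line))
# → (('boy', 'girl'), ('brother', 'sister'))
# → (('boy', 'girl'), ('dad', 'mom'))
```

Running every line of `091.analogy_family.txt` through such a check is one way to confirm that no section header slipped into the output.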
