I have created code to search for and list synonyms using Japanese WordNet.
Please refer to the following articles: "Knowing Japanese WordNet" and "I made a tool with python that can search for synonyms using Japanese WordNet".
Environment: Google Colaboratory
The flow is: download "wnjpn.db" from the Japanese WordNet website, read it with sqlite, store the result in a pandas DataFrame, and search that DataFrame for synonyms.
import gzip
import shutil
import sqlite3
import pandas as pd
# Download and unzip Japanese WordNet
!wget "http://compling.hss.ntu.edu.sg/wnja/data/1.1/wnjpn.db.gz"  # takes 1-2 minutes
with gzip.open('wnjpn.db.gz', 'rb') as f_in:
    with open('wnjpn.db', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
# Build a DataFrame of synset (concept ID) and lemma (word) pairs
conn = sqlite3.connect("wnjpn.db")
q = 'SELECT synset,lemma FROM sense,word USING (wordid) WHERE sense.lang="jpn"'
sense_word = pd.read_sql(q, conn)
# Define a function that returns a list of synonyms
def get_synonyms(word):
    """Return a list of synonyms for the input word.

    Args:
        word (str): Word to look up synonyms for.

    Returns:
        list[str]: List of synonyms.
    """
    # Find the synsets that the word belongs to
    synsets = sense_word.loc[sense_word.lemma == word, "synset"]
    # Collect every word tied to those synsets (as a set, since duplicates are possible)
    synset_words = set(sense_word.loc[sense_word.synset.isin(synsets), "lemma"])
    # Remove the original word, which is also included
    if word in synset_words:
        synset_words.remove(word)
    return list(synset_words)
# Example usage
get_synonyms("word")
# >> ['Resignation', 'word', 'word', 'word']
# Returns an empty list for a word that is not in WordNet
get_synonyms("Super word")
# >> []
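To try the two-step lookup without downloading the database, the same logic can be run against a small hand-made DataFrame. This is only a sketch: the synset IDs and lemmas below are invented placeholders, not rows from wnjpn.db, and `toy_sense_word` and `toy_synonyms` are hypothetical names that do not appear in the code above.

```python
import pandas as pd

# Invented placeholder rows (not taken from wnjpn.db).
toy_sense_word = pd.DataFrame(
    {
        "synset": ["s1", "s1", "s1", "s2", "s2", "s3"],
        "lemma": ["big", "large", "huge", "big", "important", "small"],
    }
)


def toy_synonyms(word, table):
    """Same two-step lookup as get_synonyms, on an explicitly passed table."""
    # Step 1: every synset that contains the word.
    synsets = table.loc[table["lemma"] == word, "synset"]
    # Step 2: every lemma that belongs to one of those synsets.
    words = set(table.loc[table["synset"].isin(synsets), "lemma"])
    # Drop the query word itself (discard is a no-op if it is absent).
    words.discard(word)
    # Sorted so that the output order is reproducible.
    return sorted(words)


print(toy_synonyms("big", toy_sense_word))     # ['huge', 'important', 'large']
print(toy_synonyms("absent", toy_sense_word))  # []
```

Because "big" belongs to two synsets here, its synonyms are collected from both, which mirrors how a word with several senses would behave in the real data.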
The following part may be hard to follow, so here is a short supplementary explanation.
# Build a DataFrame of synset (concept ID) and lemma (word) pairs
conn = sqlite3.connect("wnjpn.db")
q = 'SELECT synset,lemma FROM sense,word USING (wordid) WHERE sense.lang="jpn"'
sense_word = pd.read_sql(q, conn)
Here, sqlite runs a query that joins the "sense" table and the "word" table contained in "wnjpn.db", producing a table of **all combinations of synset (concept ID) and lemma (word)**. A synset is a word concept expressed as an ID, and **words that share the same synset (concept) are synonyms**. The resulting table looks like the following: the five lemmas listed under the synset "00001740-v" (rows 3 to 7) are synonyms of one another.
| | synset | lemma |
|---|---|---|
| 1 | 00001740-n | entity |
| 2 | 00001740-r | With a cappella |
| 3 | 00001740-v | Breathe |
| 4 | 00001740-v | Breathing |
| 5 | 00001740-v | Vomiting |
| 6 | 00001740-v | Breathing |
| 7 | 00001740-v | Breath |
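The idea that a synset is just a shared ID can be illustrated by grouping (synset, lemma) pairs by their synset. The sketch below uses only the standard library; the rows are invented placeholders rather than WordNet data, and `group_by_synset` is a hypothetical helper that is not part of the article's code.

```python
from collections import defaultdict

# Invented (synset, lemma) pairs standing in for rows of sense_word.
pairs = [
    ("s1", "big"),
    ("s1", "large"),
    ("s2", "big"),
    ("s2", "important"),
    ("s3", "small"),
]


def group_by_synset(rows):
    """Map each synset ID to the set of lemmas that share it."""
    groups = defaultdict(set)
    for synset, lemma in rows:
        groups[synset].add(lemma)
    return dict(groups)


groups = group_by_synset(pairs)
for synset in sorted(groups):
    print(synset, sorted(groups[synset]))
# s1 ['big', 'large']
# s2 ['big', 'important']
# s3 ['small']
```

Each group is one set of synonyms. A lemma that appears in more than one group, like "big" here, has more than one sense.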
Without knowing what the "sense" and "word" tables contain, this is hard to picture, so I will briefly introduce each of them. If you want to know more, read the links given at the very beginning.
sense
This table shows which wordid (word ID) belongs to each synset (concept ID). Each row is uniquely identified by the combination of synset and wordid. As in the query above, the lang column tells whether the word is Japanese (jpn) or English (eng).
| | synset | wordid | lang | rank | lexid | freq | src |
|---|---|---|---|---|---|---|---|
| 0 | 02130160-v | 155287 | eng | 0 | 1 | 1 | eng-30 |
| 1 | 00001740-v | 186954 | jpn | nan | nan | nan | hand |
| 2 | 00001740-v | 216393 | jpn | nan | nan | nan | hand |
word
This table maps each wordid (word ID) to its lemma (word). Each row is uniquely identified by wordid. Here it is used to convert the wordid values in "sense" into words. The lang column has the same meaning as in "sense".
| | wordid | lang | lemma | pron | pos |
|---|---|---|---|---|---|
| 0 | 155287 | eng | lay_eyes_on | | v |
| 1 | 186954 | jpn | Breathing | | v |
| 2 | 216393 | jpn | Breath | | v |
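To see how the join turns each wordid in "sense" into a lemma, the two tables can be rebuilt in an in-memory SQLite database. This is a minimal sketch under several assumptions: the schemas are reduced to the columns the join needs, the IDs follow the sample rows above, and the Japanese lemmas are replaced by the placeholders `lemma_a` and `lemma_b`. The query is written with an explicit JOIN and a single-quoted string literal, which should select the same rows as the query in the article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Reduced, assumed schemas: only the columns the join needs.
conn.execute("CREATE TABLE sense (synset TEXT, wordid INTEGER, lang TEXT)")
conn.execute(
    "CREATE TABLE word (wordid INTEGER PRIMARY KEY, lang TEXT, lemma TEXT)"
)

# IDs follow the sample rows above; the jpn lemmas are placeholders.
conn.executemany(
    "INSERT INTO sense VALUES (?, ?, ?)",
    [
        ("02130160-v", 155287, "eng"),
        ("00001740-v", 186954, "jpn"),
        ("00001740-v", 216393, "jpn"),
    ],
)
conn.executemany(
    "INSERT INTO word VALUES (?, ?, ?)",
    [
        (155287, "eng", "lay_eyes_on"),
        (186954, "jpn", "lemma_a"),
        (216393, "jpn", "lemma_b"),
    ],
)

# Join on wordid, keep only Japanese senses, and order for a stable result.
query = """
    SELECT synset, lemma
    FROM sense JOIN word USING (wordid)
    WHERE sense.lang = 'jpn'
    ORDER BY wordid
"""
rows = conn.execute(query).fetchall()
print(rows)
# [('00001740-v', 'lemma_a'), ('00001740-v', 'lemma_b')]
conn.close()
```

The English row is filtered out by the lang condition, and the two remaining rows share one synset, so their lemmas would be reported as synonyms.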
That is all. I would appreciate it if you could point out any mistakes.