[PYTHON] Search / list synonyms using Japanese WordNet

I have created code to search for and list synonyms using Japanese WordNet.

What is WordNet in the first place?

Please refer to the following articles.

Knowing Japanese WordNet
I made a tool with python that can search for synonyms using Japanese WordNet

Straight to the code

Environment: Google Colaboratory

The flow is: download "wnjpn.db" from the Japanese WordNet website and decompress it, query it with sqlite, load the result into a pandas DataFrame, and search that DataFrame for synonyms.

import gzip
import shutil
import sqlite3
import pandas as pd

# Download and decompress the Japanese WordNet database
! wget "http://compling.hss.ntu.edu.sg/wnja/data/1.1/wnjpn.db.gz"  # takes 1-2 minutes

with gzip.open('wnjpn.db.gz', 'rb') as f_in:
    with open('wnjpn.db', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# Build a DataFrame of every (synset, lemma) pair: synset is the concept ID, lemma is the word
conn = sqlite3.connect("wnjpn.db")
q = "SELECT synset, lemma FROM sense JOIN word USING (wordid) WHERE sense.lang = 'jpn'"
sense_word = pd.read_sql(q, conn)

# Define a function that returns the list of synonyms
def get_synonyms(word):
    """Return a list of synonyms for the input word.

    Args:
        word (str): Word to look up synonyms for.

    Returns:
        list[str]: List of synonyms.
    """
    # Find every synset (concept ID) the word belongs to
    synsets = sense_word.loc[sense_word.lemma == word, "synset"]

    # Collect all words attached to those synsets (as a set, since duplicates are possible)
    synset_words = set(sense_word.loc[sense_word.synset.isin(synsets), "lemma"])

    # Drop the query word itself, which would otherwise be included
    synset_words.discard(word)

    return list(synset_words)

# Example usage
get_synonyms("word")
# >> ['Resignation', 'word', 'word', 'word']

# Returns an empty list for a word that is not in WordNet
get_synonyms("Super word")
# >> []
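
As a side note, the same two-step lookup (word to synsets, synsets to words) could also be pushed down into SQLite instead of holding every pair in a DataFrame. The following is only a minimal sketch under that assumption: get_synonyms_sql is a hypothetical name that does not appear in the code above, and the in-memory tables with placeholder lemmas stand in for wnjpn.db, reproducing only the columns the query needs.

```python
import sqlite3


def get_synonyms_sql(conn, word):
    """Return a sorted list of synonyms for `word`, queried directly from SQLite.

    Hypothetical helper for illustration; assumes `sense` and `word`
    tables with the columns used below.
    """
    q = """
        SELECT DISTINCT w2.lemma
        FROM word AS w1
        JOIN sense AS s1 ON s1.wordid = w1.wordid   -- synsets of the query word
        JOIN sense AS s2 ON s2.synset = s1.synset   -- every sense in those synsets
        JOIN word  AS w2 ON w2.wordid = s2.wordid   -- back to lemmas
        WHERE w1.lemma = ?
          AND s1.lang = 'jpn'
          AND s2.lang = 'jpn'
          AND w2.lemma <> ?
        ORDER BY w2.lemma
    """
    return [row[0] for row in conn.execute(q, (word, word))]


# Toy stand-in for wnjpn.db (placeholder lemmas, not real WordNet data)
toy_conn = sqlite3.connect(":memory:")
toy_conn.executescript("""
    CREATE TABLE word  (wordid INTEGER PRIMARY KEY, lang TEXT, lemma TEXT, pos TEXT);
    CREATE TABLE sense (synset TEXT, wordid INTEGER, lang TEXT);
    INSERT INTO word VALUES
        (1, 'jpn', 'alpha',   'v'),
        (2, 'jpn', 'beta',    'v'),
        (3, 'jpn', 'gamma',   'n'),
        (4, 'eng', 'delta',   'v'),
        (5, 'jpn', 'epsilon', 'v');
    INSERT INTO sense VALUES
        ('s1-v', 1, 'jpn'),
        ('s1-v', 2, 'jpn'),
        ('s1-v', 4, 'eng'),
        ('s2-n', 3, 'jpn'),
        ('s3-v', 1, 'jpn'),
        ('s3-v', 5, 'jpn');
""")

print(get_synonyms_sql(toy_conn, "alpha"))    # ['beta', 'epsilon']
print(get_synonyms_sql(toy_conn, "missing"))  # []
```

The DataFrame version is probably more convenient when many lookups are run in a row in a notebook; the SQL version avoids loading every pair into memory first.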

Supplement

The following part is probably the hardest to follow, so here is a short supplementary explanation.

# Build a DataFrame of every (synset, lemma) pair: synset is the concept ID, lemma is the word
conn = sqlite3.connect("wnjpn.db")
q = "SELECT synset, lemma FROM sense JOIN word USING (wordid) WHERE sense.lang = 'jpn'"
sense_word = pd.read_sql(q, conn)

This issues a sqlite query that joins the "sense" table and the "word" table in "wnjpn.db" and builds a table containing every combination of synset (concept ID) and lemma (word) for the Japanese entries. A synset is an ID assigned to a concept, and words that share a synset are synonyms. The resulting table looks like the following; for example, the five lemmas listed under synset "00001740-v" are synonyms of one another.

   synset      lemma
1  00001740-n  entity
2  00001740-r  With a cappella
3  00001740-v  Breathe
4  00001740-v  Breathing
5  00001740-v  Vomiting
6  00001740-v  Breathing
7  00001740-v  Breath

Contents of the tables used in the join

It is hard to picture the join without knowing what the "sense" table and the "word" table contain, so here is a brief look at each. If you want to know more, see the links at the very beginning of this post.

sense: a table that records which wordid (word ID) belongs to each synset (concept ID). Each combination of synset and wordid is unique. As in the query above, the lang column tells you whether the entry is Japanese (jpn) or English (eng).

   synset      wordid  lang  rank  lexid  freq  src
0  02130160-v  155287  eng   0     1      1     eng-30
1  00001740-v  186954  jpn   nan   nan    nan   hand
2  00001740-v  216393  jpn   nan   nan    nan   hand

word: a table that maps each wordid (word ID) to its lemma (word). Each wordid is unique. Here it is used to turn the wordid values in sense into actual words. The lang column has the same meaning as in sense.

   wordid  lang  lemma        pron  pos
0  155287  eng   lay_eyes_on        v
1  186954  jpn   Breathing          v
2  216393  jpn   Breath             v
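
To make the join concrete, the three sample rows shown for each table can be loaded into an in-memory database and run through the same kind of query. This is only a sketch: demo_conn and demo_rows are names introduced here for illustration, only the columns the query needs are recreated, and the lemmas are the glosses as printed in the tables above rather than the actual strings stored in wnjpn.db.

```python
import sqlite3

# In-memory stand-in holding only the sample rows shown above
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE sense (synset TEXT, wordid INTEGER, lang TEXT);
    CREATE TABLE word  (wordid INTEGER, lang TEXT, lemma TEXT, pos TEXT);
    INSERT INTO sense VALUES
        ('02130160-v', 155287, 'eng'),
        ('00001740-v', 186954, 'jpn'),
        ('00001740-v', 216393, 'jpn');
    INSERT INTO word VALUES
        (155287, 'eng', 'lay_eyes_on', 'v'),
        (186954, 'jpn', 'Breathing',   'v'),
        (216393, 'jpn', 'Breath',      'v');
""")

# Join the two tables on wordid and keep only the Japanese senses
q = """
    SELECT synset, lemma
    FROM sense JOIN word USING (wordid)
    WHERE sense.lang = 'jpn'
    ORDER BY wordid
"""
demo_rows = demo_conn.execute(q).fetchall()
print(demo_rows)
# [('00001740-v', 'Breathing'), ('00001740-v', 'Breath')]

# Check the uniqueness claim for sense: no (synset, wordid) pair appears twice
duplicate_pairs = demo_conn.execute(
    "SELECT COUNT(*) FROM ("
    "  SELECT synset, wordid FROM sense"
    "  GROUP BY synset, wordid HAVING COUNT(*) > 1"
    ")"
).fetchone()[0]
print(duplicate_pairs)  # 0
```

The English row drops out because of the lang filter, and the two remaining rows share a synset, which is exactly what makes them synonyms in the lookup.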

That is all. I would appreciate it if you could point out any mistakes.
