I implemented code that detects synonyms using Japanese WordNet. The script takes a csv file of words as input, searches for synonyms within that group of words, and writes the resulting synonym lists to a text file. The implementation is mainly based on the material linked here.
Knowing Japanese WordNet: because the network is visualized, it is easy to grasp intuitively. If you are interested in the definition of WordNet, please read it.
Here is the official website. Japanese WordNet is a Japanese semantic dictionary developed by the National Institute of Information and Communications Technology (NICT). This implementation requires downloading the sqlite3 database "Japanese Wordnet and English WordNet" from the official website. Download file name: wnjpn.db.gz. Unzipping it gives you the db file containing the dictionary data. Loading this db from Python makes it possible to detect synonyms.
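As a minimal sketch of the unzip-and-load step, assuming wnjpn.db.gz has been downloaded into the current directory: the helper names `gunzip_db` and `list_tables` are hypothetical, and only Python's standard `gzip`, `shutil`, and `sqlite3` modules are used.

```python
import gzip
import shutil
import sqlite3


def gunzip_db(gz_path, db_path):
    """Decompress a gzip file to db_path (hypothetical helper)."""
    with gzip.open(gz_path, "rb") as src, open(db_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return db_path


def list_tables(db_path):
    """Return the names of the tables stored in an SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "select name from sqlite_master where type='table' order by name"
        )
        return [row[0] for row in cur]
    finally:
        conn.close()
```

Calling `gunzip_db("wnjpn.db.gz", "wnjpn.db")` and then `list_tables("wnjpn.db")` is a quick way to check that the download was decompressed correctly before running the script.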
create_similar_words.py
import sqlite3
import csv
import re
# DB connection
conn = sqlite3.connect("wnjpn.db")
# Input and output file names
csvfile = 'words.csv'
outfile = 'similar_words.txt'
'''functions
csv_input: read the csv and return its rows as a list
SearchSimilarWords: create and return the synonym list for a word
create_similar_wordlst: shape the synonym lists into groups
save_synonyms: save the synonym groups
'''
def csv_input(path_name):
    rows = []
    with open(path_name, encoding='utf-8') as f:
        reader = csv.reader(f)
        for row in reader:
            rows.append(row)
    return rows
def SearchSimilarWords(word):
    # A csv row is a list of cells; join them into one lemma string
    word = ','.join(word)
    # Use query parameters so that quotes in the input cannot break the SQL
    cur = conn.execute("select wordid from word where lemma=?", (word,))
    word_id = 99999999 #temp
    for row in cur:
        word_id = row[0]
    # Determine whether the word exists in WordNet
    if word_id == 99999999:
        return
    cur = conn.execute("select synset from sense where wordid=?", (word_id,))
    synsets = []
    for row in cur:
        synsets.append(row[0])
    simdict = []
    for synset in synsets:
        # cur1 (synset name) and cur2 (Japanese definition) are fetched but not used below
        cur1 = conn.execute("select name from synset where synset=?", (synset,))
        cur2 = conn.execute("select def from synset_def where (synset=? and lang='jpn')", (synset,))
        cur3 = conn.execute("select wordid from sense where (synset=? and wordid!=?)", (synset, word_id))
        for row3 in cur3:
            target_word_id = row3[0]
            cur3_1 = conn.execute("select lemma from word where wordid=?", (target_word_id,))
            for row3_1 in cur3_1:
                # Store similar words in a list
                simdict.append(row3_1[0])
    return simdict
def create_similar_wordlst(full_word):
    parent = []
    child = []
    with open(csvfile, encoding='utf-8') as f:
        reader = csv.reader(f)
        for row in reader:
            child = []
            synonym = SearchSimilarWords(row)
            if synonym is not None:
                row = ','.join(row)
                child.append(row)
                for f_row in full_word:
                    f_row = ','.join(f_row)
                    for syn in synonym:
                        if f_row == syn:
                            child.append(syn)
            if len(child) > 1:
                parent.append(set(child))
    # print(parent)
    return parent
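def save_synonyms(lst):
    norlst = []
    for row in lst:
        row = list(row)
        row = ','.join(row)
        norlst.append(row)
    # Drop duplicate lines
    norlst = set(norlst)
    # Write as UTF-8 so the output matches the encoding used for the input csv
    with open(outfile, mode='w', encoding='utf-8') as f:
        for row in norlst:
            f.write(row+'\n')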
def main():
    full_word = csv_input(csvfile)
    save_synonyms(create_similar_wordlst(full_word))
if __name__ == "__main__":
    main()
Files used: create_similar_words.py, words.csv, wnjpn.db
To keep the implementation simple, this time it is assumed that the csv contains strings in only one column. Also note that synonyms are searched for only among the words in words.csv.
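The lookup restricted to the words in the csv can also be sketched as a single JOIN. This is only an illustration under stated assumptions: it runs against a toy in-memory database that has just the `word` and `sense` columns the script queries, the sample lemmas and synset ids are made up, and `build_toy_db` and `synonyms_in_list` are hypothetical names. Unlike the script, the JOIN considers every `wordid` that matches the lemma, not only the last one.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory stand-in with only the columns the script queries."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table word (wordid integer primary key, lemma text)")
    conn.execute("create table sense (synset text, wordid integer)")
    # Made-up sample data, not taken from wnjpn.db
    conn.executemany(
        "insert into word values (?, ?)",
        [(1, "development"), (2, "growth"), (3, "flock"), (4, "gather")],
    )
    conn.executemany(
        "insert into sense values (?, ?)",
        [("s-001", 1), ("s-001", 2), ("s-002", 3), ("s-002", 4)],
    )
    return conn


def synonyms_in_list(conn, word, candidates):
    """Return the synonyms of word that also appear in candidates."""
    cur = conn.execute(
        "select distinct w2.lemma "
        "from word w1 "
        "join sense s1 on s1.wordid = w1.wordid "
        "join sense s2 on s2.synset = s1.synset and s2.wordid != w1.wordid "
        "join word w2 on w2.wordid = s2.wordid "
        "where w1.lemma = ?",
        (word,),
    )
    found = {row[0] for row in cur}
    # Keep only synonyms that are themselves in the input word list
    return sorted(found & set(candidates))
```

With the toy data, `synonyms_in_list(build_toy_db(), "development", ["development", "growth", "flock"])` keeps only the synonym that is also in the list, which is the same filtering idea that create_similar_wordlst applies row by row.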
words.csv
development of
development
・
・
・
get together
Flock
Takaru
similar_words.txt
development of,development
・
・
・
get together,Flock,Takaru
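Because each output line is built from a Python `set`, the order of words within a line and of lines within the file is not guaranteed to be the same on every run. A minimal sketch, assuming a reproducible output is wanted; `format_groups` is a hypothetical helper and not part of the script.

```python
def format_groups(groups):
    """Turn synonym groups into sorted, de-duplicated comma-separated lines."""
    # Sorting inside each group makes equal groups produce identical lines,
    # so the set comprehension removes the duplicates reliably.
    lines = {",".join(sorted(group)) for group in groups if len(group) > 1}
    return sorted(lines)
```

For example, `format_groups([{"b", "a"}, {"a", "b"}, {"c"}])` collapses the two equal groups into one line and drops the single-word group.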
I created a script that searches for synonyms among the words in a csv and outputs them as comma-separated text. If you have any questions or notice imperfections in the implementation, please point them out. LGTM is also welcome! Thank you for reading.