This is out of the blue, but please take a quick look at the following sentences.
Hlelo eyrevnoe, how are you dniog? I'm dniog fnie. In tihs sentncee the lettres isnide ecah wrod are dlieberaetly mexid up, but smoehow you can sltil raed it, cna't you?
How was it? Surprisingly easy to read, isn't it? This passage is a (personally) fairly famous copy-and-paste, rendered here in English, and the phenomenon behind it is called [typoglycemia](https://ja.wikipedia.org/wiki/%E3%82%BF%E3%82%A4%E3%83%9D%E3%82%B0%E3%83%AA%E3%82%BB%E3%83%9F%E3%82%A2).
Roughly speaking, when humans recognize a word they do not read it character by character; they **recognize it visually as a cluster of characters**. Because the brain understands and predicts words instantly, **even if the characters that make up a word are slightly rearranged, the reader corrects them on the fly and can still read the text**. *Since this correction depends on each person's knowledge and vocabulary, there are individual differences.*
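The scrambling rule this article plays with is exactly that: keep the first and last characters of each word and shuffle everything in between. Here is a minimal sketch of just that rule (the full script later in this article applies it per token returned by COTOHA):

```python
import random

def scramble(word: str) -> str:
    """Keep the first and last characters and shuffle the middle."""
    if len(word) <= 3:
        # Words of 3 characters or fewer have no middle worth shuffling
        return word
    middle = list(word[1:-1])
    random.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

print(" ".join(scramble(w) for w in "typoglycemia is surprisingly readable".split()))
# e.g. "tliygpcmoeya is snpgsiriruly rdbaaele"
```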
This time, I use the [parsing API](https://api.ce-cotoha.com/contents/reference/apireference.html#parsing) provided by the COTOHA API to parse the input text and output it as a typoglycemia sentence.
```
before :
Post typoglycemia sentences output using Python and COTOHA to Qiita
after :
Psot tyopglyecmia sneetnces otuput unisg Pyhotn and CTOOHA to Qitia
```
And just like that, out comes the result!!
COTOHA is a **natural language processing / speech processing API platform** provided by NTT Communications. In addition to the parsing introduced in this article, it provides a variety of functions:

- named entity extraction
- coreference resolution
- keyword extraction
- similarity calculation
- sentence type judgment
- user attribute estimation
- sentiment analysis
- summarization
User registration is easy, and each API can be called up to **1,000 times per day** even on the free tier, so you can freely play around with it.
Right now there is also a project like this running in collaboration with Qiita, so please do take part!!
You can register as a user for free from the COTOHA API Portal. After entering a few basic details, a client ID and client secret for calling the API are issued; make a note of them if you want to try the scripts in this article yourself.
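As a quick way to confirm that the issued credentials work, you can request an access token directly. Here is a minimal sketch of that call, using the token endpoint from the config.ini shown later in this article (the CotohaApi class below wraps this same request):

```python
import json
import urllib.request

# Token endpoint from the config.ini shown later in this article
ACCESS_TOKEN_PUBLISH_URL = "https://api.ce-cotoha.com/v1/oauth/accesstokens"

data = json.dumps({
    "grantType": "client_credentials",
    "clientId": "[Client ID]",   # issued at registration
    "clientSecret": "[secret]"   # issued at registration
}).encode()
req = urllib.request.Request(
    ACCESS_TOKEN_PUBLISH_URL, data,
    {"Content-Type": "application/json;charset=UTF-8"})
with urllib.request.urlopen(req) as res:
    print(json.loads(res.read())["access_token"])
```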
I referred to the following articles. Both are very easy to understand and highly recommended!

- I tried using the COTOHA API, rumored to make natural language processing easy to handle, in Python
- The result of having COTOHA summarize "Mentos and the memories of Go". COTOHA with the fastest tutorial

The code is based on the articles above, but I tweaked the endpoint part of the API a little.
Originally BASE_URL included nlp, but I removed it there to match the official COTOHA URL format; each method now appends its own path instead.
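In other words, the base URL stays generic and each method joins on its own versioned path, which is how the official reference organizes the endpoints:

```python
# Endpoint composition used throughout the class below
DEVELOPER_API_BASE_URL = "https://api.ce-cotoha.com/api/dev/"
url = DEVELOPER_API_BASE_URL + "nlp/v1/parse"
# -> https://api.ce-cotoha.com/api/dev/nlp/v1/parse
```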
cotoha_api.py

```python
import os
import urllib.request
import json
import configparser
import re
import jaconv
import random
# COTOHA API operation class
class CotohaApi:
    # Initialization
    def __init__(self, client_id, client_secret, developer_api_base_url, access_token_publish_url):
        self.client_id = client_id
        self.client_secret = client_secret
        self.developer_api_base_url = developer_api_base_url
        self.access_token_publish_url = access_token_publish_url
        self.getAccessToken()

    # Get an access token
    def getAccessToken(self):
        # URL for acquiring the access token
        url = self.access_token_publish_url
        # Request headers
        headers = {
            "Content-Type": "application/json;charset=UTF-8"
        }
        # Request body
        data = {
            "grantType": "client_credentials",
            "clientId": self.client_id,
            "clientSecret": self.client_secret
        }
        # Encode the request body as JSON
        data = json.dumps(data).encode()
        # Build the request, send it, and receive the response
        req = urllib.request.Request(url, data, headers)
        res = urllib.request.urlopen(req)
        # Decode the response body from JSON and keep the access token
        res_body = json.loads(res.read())
        self.access_token = res_body["access_token"]
    # Parsing API
    def parse(self, sentence):
        # Parsing API URL
        url = self.developer_api_base_url + "nlp/v1/parse"
        # Request headers
        headers = {
            "Authorization": "Bearer " + self.access_token,
            "Content-Type": "application/json;charset=UTF-8",
        }
        # Request body
        data = {
            "sentence": sentence
        }
        # Encode the request body as JSON
        data = json.dumps(data).encode()
        # Build the request
        req = urllib.request.Request(url, data, headers)
        # Send the request and receive the response
        try:
            res = urllib.request.urlopen(req)
        # Handle request errors
        except urllib.request.HTTPError as e:
            # On 401 Unauthorized, re-acquire the access token and retry once
            if e.code == 401:
                print("get access token")
                self.getAccessToken()
                headers["Authorization"] = "Bearer " + self.access_token
                req = urllib.request.Request(url, data, headers)
                res = urllib.request.urlopen(req)
            # For errors other than 401, report the cause and give up
            else:
                print("<Error> " + e.reason)
                return None
        # Decode the response body from JSON and return the analysis result
        return json.loads(res.read())
    # Named entity extraction API
    def ne(self, sentence):
        # Named entity extraction API URL
        url = self.developer_api_base_url + "nlp/v1/ne"
        # Request headers
        headers = {
            "Authorization": "Bearer " + self.access_token,
            "Content-Type": "application/json;charset=UTF-8",
        }
        # Request body
        data = {
            "sentence": sentence
        }
        # Encode the request body as JSON
        data = json.dumps(data).encode()
        # Build the request
        req = urllib.request.Request(url, data, headers)
        # Send the request and receive the response
        try:
            res = urllib.request.urlopen(req)
        # Handle request errors
        except urllib.request.HTTPError as e:
            # On 401 Unauthorized, re-acquire the access token and retry once
            if e.code == 401:
                print("get access token")
                self.getAccessToken()
                headers["Authorization"] = "Bearer " + self.access_token
                req = urllib.request.Request(url, data, headers)
                res = urllib.request.urlopen(req)
            # For errors other than 401, report the cause and give up
            else:
                print("<Error> " + e.reason)
                return None
        # Decode the response body from JSON and return the analysis result
        return json.loads(res.read())
    # Coreference resolution API
    def coreference(self, document):
        # Coreference resolution API URL
        url = self.developer_api_base_url + "beta/coreference"
        # Request headers
        headers = {
            "Authorization": "Bearer " + self.access_token,
            "Content-Type": "application/json;charset=UTF-8",
        }
        # Request body
        data = {
            "document": document
        }
        # Encode the request body as JSON
        data = json.dumps(data).encode()
        # Build the request
        req = urllib.request.Request(url, data, headers)
        # Send the request and receive the response
        try:
            res = urllib.request.urlopen(req)
        # Handle request errors
        except urllib.request.HTTPError as e:
            # On 401 Unauthorized, re-acquire the access token and retry once
            if e.code == 401:
                print("get access token")
                self.getAccessToken()
                headers["Authorization"] = "Bearer " + self.access_token
                req = urllib.request.Request(url, data, headers)
                res = urllib.request.urlopen(req)
            # For errors other than 401, report the cause and give up
            else:
                print("<Error> " + e.reason)
                return None
        # Decode the response body from JSON and return the analysis result
        return json.loads(res.read())
    # Keyword extraction API
    def keyword(self, document):
        # Keyword extraction API URL
        url = self.developer_api_base_url + "nlp/v1/keyword"
        # Request headers
        headers = {
            "Authorization": "Bearer " + self.access_token,
            "Content-Type": "application/json;charset=UTF-8",
        }
        # Request body
        data = {
            "document": document
        }
        # Encode the request body as JSON
        data = json.dumps(data).encode()
        # Build the request
        req = urllib.request.Request(url, data, headers)
        # Send the request and receive the response
        try:
            res = urllib.request.urlopen(req)
        # Handle request errors
        except urllib.request.HTTPError as e:
            # On 401 Unauthorized, re-acquire the access token and retry once
            if e.code == 401:
                print("get access token")
                self.getAccessToken()
                headers["Authorization"] = "Bearer " + self.access_token
                req = urllib.request.Request(url, data, headers)
                res = urllib.request.urlopen(req)
            # For errors other than 401, report the cause and give up
            else:
                print("<Error> " + e.reason)
                return None
        # Decode the response body from JSON and return the analysis result
        return json.loads(res.read())
    # Similarity calculation API
    def similarity(self, s1, s2):
        # Similarity calculation API URL
        url = self.developer_api_base_url + "nlp/v1/similarity"
        # Request headers
        headers = {
            "Authorization": "Bearer " + self.access_token,
            "Content-Type": "application/json;charset=UTF-8",
        }
        # Request body
        data = {
            "s1": s1,
            "s2": s2
        }
        # Encode the request body as JSON
        data = json.dumps(data).encode()
        # Build the request
        req = urllib.request.Request(url, data, headers)
        # Send the request and receive the response
        try:
            res = urllib.request.urlopen(req)
        # Handle request errors
        except urllib.request.HTTPError as e:
            # On 401 Unauthorized, re-acquire the access token and retry once
            if e.code == 401:
                print("get access token")
                self.getAccessToken()
                headers["Authorization"] = "Bearer " + self.access_token
                req = urllib.request.Request(url, data, headers)
                res = urllib.request.urlopen(req)
            # For errors other than 401, report the cause and give up
            else:
                print("<Error> " + e.reason)
                return None
        # Decode the response body from JSON and return the analysis result
        return json.loads(res.read())
    # Sentence type judgment API
    def sentenceType(self, sentence):
        # Sentence type judgment API URL
        url = self.developer_api_base_url + "nlp/v1/sentence_type"
        # Request headers
        headers = {
            "Authorization": "Bearer " + self.access_token,
            "Content-Type": "application/json;charset=UTF-8",
        }
        # Request body
        data = {
            "sentence": sentence
        }
        # Encode the request body as JSON
        data = json.dumps(data).encode()
        # Build the request
        req = urllib.request.Request(url, data, headers)
        # Send the request and receive the response
        try:
            res = urllib.request.urlopen(req)
        # Handle request errors
        except urllib.request.HTTPError as e:
            # On 401 Unauthorized, re-acquire the access token and retry once
            if e.code == 401:
                print("get access token")
                self.getAccessToken()
                headers["Authorization"] = "Bearer " + self.access_token
                req = urllib.request.Request(url, data, headers)
                res = urllib.request.urlopen(req)
            # For errors other than 401, report the cause and give up
            else:
                print("<Error> " + e.reason)
                return None
        # Decode the response body from JSON and return the analysis result
        return json.loads(res.read())
    # User attribute estimation API
    def userAttribute(self, document):
        # User attribute estimation API URL
        url = self.developer_api_base_url + "beta/user_attribute"
        # Request headers
        headers = {
            "Authorization": "Bearer " + self.access_token,
            "Content-Type": "application/json;charset=UTF-8",
        }
        # Request body
        data = {
            "document": document
        }
        # Encode the request body as JSON
        data = json.dumps(data).encode()
        # Build the request
        req = urllib.request.Request(url, data, headers)
        # Send the request and receive the response
        try:
            res = urllib.request.urlopen(req)
        # Handle request errors
        except urllib.request.HTTPError as e:
            # On 401 Unauthorized, re-acquire the access token and retry once
            if e.code == 401:
                print("get access token")
                self.getAccessToken()
                headers["Authorization"] = "Bearer " + self.access_token
                req = urllib.request.Request(url, data, headers)
                res = urllib.request.urlopen(req)
            # For errors other than 401, report the cause and give up
            else:
                print("<Error> " + e.reason)
                return None
        # Decode the response body from JSON and return the analysis result
        return json.loads(res.read())
if __name__ == '__main__':
    # Locate the source file
    APP_ROOT = os.path.dirname(os.path.abspath(__file__)) + "/"
    # Read the configuration values
    config = configparser.ConfigParser()
    config.read(APP_ROOT + "config.ini")
    CLIENT_ID = config.get("COTOHA API", "Developer Client id")
    CLIENT_SECRET = config.get("COTOHA API", "Developer Client secret")
    DEVELOPER_API_BASE_URL = config.get("COTOHA API", "Developer API Base URL")
    ACCESS_TOKEN_PUBLISH_URL = config.get("COTOHA API", "Access Token Publish URL")
    # Instantiate the COTOHA API wrapper
    cotoha_api = CotohaApi(CLIENT_ID, CLIENT_SECRET, DEVELOPER_API_BASE_URL, ACCESS_TOKEN_PUBLISH_URL)
    # Sentence to analyze
    sentence = "Post typoglycemia sentences output using Python and COTOHA to Qiita"
    # Show the sentence before scrambling
    print('before :')
    print(sentence)
    # Run the parsing API
    result_json = cotoha_api.parse(sentence)
    # Word list before scrambling
    word_list_base = []
    # Word list after scrambling
    word_list = []
    # Regular expression for detecting half-width alphanumeric tokens
    alnumReg = re.compile(r'^[a-zA-Z0-9 ]+$')
    # Walk the analysis result
    for i in range(len(result_json['result'])):
        for j in range(len(result_json['result'][i]['tokens'])):
            # Use the 'form' value for alphanumeric tokens and the 'kana' value for Japanese tokens
            word = result_json['result'][i]['tokens'][j]['form']
            kana = result_json['result'][i]['tokens'][j]['kana']
            # Is this token half-width alphanumeric?
            if alnumReg.match(word) is not None:
                # If the token holds several space-separated words, split them apart
                if ' ' in word:
                    word_list_base.extend(word.split(' '))
                else:
                    word_list_base.append(word)
            # Japanese token
            else:
                # Convert katakana to hiragana and add it to the list
                word_list_base.append(jaconv.kata2hira(kana))
    # For each word, shuffle everything except the first and last characters
    for i in range(len(word_list_base)):
        # 4 characters or more
        if len(word_list_base[i]) > 3:
            # Break the word into a list of characters
            wl_all = list(word_list_base[i])
            # Keep the first and last characters
            first_word = wl_all[0]
            last_word = wl_all[len(wl_all) - 1]
            # Take the inner characters as a list
            wl = wl_all[1:len(wl_all) - 1]
            # Shuffle them
            random.shuffle(wl)
            word_list.append(first_word + ''.join(wl) + last_word)
        # Words shorter than 4 characters are left as they are
        else:
            word_list.append(word_list_base[i])
    # Show the scrambled result
    print('after :')
    print(' '.join(word_list))
```
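For reference, the loop above relies on only two fields per token, so the parse response it expects looks roughly like this (a trimmed sketch based on the fields the loop reads; the actual COTOHA response carries many more fields per chunk and token):

```python
result_json = {
    "result": [
        {
            # Each analysis unit carries its tokens;
            # only 'form' and 'kana' are used by the script above
            "tokens": [
                {"form": "Python", "kana": "パイソン"},
                {"form": "と", "kana": "ト"},
            ]
        }
    ]
}
```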
config.ini

```ini
[COTOHA API]
Developer API Base URL: https://api.ce-cotoha.com/api/dev/
Developer Client id: [Client ID]
Developer Client secret: [secret]
Access Token Publish URL: https://api.ce-cotoha.com/v1/oauth/accesstokens
```
To use it, enter your client ID and secret in config.ini and place the file in the same directory as cotoha_api.py. Then run:

```
python cotoha_api.py
```
Execution result:

```
before :
Post typoglycemia sentences output using Python and COTOHA to Qiita
after :
Psot tyopglyecmia sneetnces otuput unisg Pyhotn and CTOOHA to Qitia
```
I started with almost no knowledge of Python or natural language processing, and still managed to implement this (**the script is almost entirely other people's work**). Even so, I personally got results I'm happy with, so I think COTOHA is very easy to handle and makes a great first step into NLP. I hope this article gets you at least a little interested in COTOHA.
- COTOHA API Reference
- I tried using the COTOHA API, rumored to make natural language processing easy to handle, in Python
- The result of having COTOHA summarize "Mentos and the memories of Go". COTOHA with the fastest tutorial
As an aside, **"typoglycemia"** is a [portmanteau](https://ja.wikipedia.org/wiki/%E3%81%8B%E3%81%B0%E3%82%93%E8%AA%9E) of **"typo"** (typographical error) and **"hypoglycemia"**. A portmanteau is a word coined by blending multiple words, like "Netsusama-to" or "Nespresso".
Huh? That sounds like it would be fun to do with COTOHA too...
If I ever did, the title would be **"I tried to coin modern portmanteaus using COTOHA"** (laughs).