COTOHA API is a service from the NTT Group that offers a range of natural language processing and speech processing APIs, including parsing, anaphora resolution, keyword extraction, speech recognition, and summarization.
COTOHA API | Natural language processing and speech recognition API platform utilizing Japan's largest Japanese dictionary developed by NTT Communications https://api.ce-cotoha.com/contents/index.html
Some may wonder, "Isn't NTT a telephone company?", but NTT's laboratories have been researching Japanese language processing on computers for decades. (As a student, I benefited from visiting the NTT Communication Science Laboratories and reading their papers.)
Recently, with advances in cloud technology, processing that used to run on a local computer can now be offered as an API service over the Internet. Which NTT company or department develops the logic itself does not appear to be disclosed, but the API is provided by NTT Communications.
The assumed background knowledge is roughly: URLs, HTTP requests (POST) and responses, the curl command, JSON, Gson, Java, and Maven.
The COTOHA API page has the following:
Parsing: The parsing API receives sentences written in Japanese as input, analyzes them, and outputs the structure and meaning of the sentences. The input sentence is decomposed into phrases and morphemes, and semantic information such as dependency relations between phrases, dependency relations between morphemes, and part-of-speech information is added.
"Sentences written in Japanese (= natural sentences)"
I ran to school today.
It says a sentence like. If you divide this into phrases
Today / I went to school / ran.
It will be like. (In the API, the clause is called chunk) Also, if you divide by morpheme unit
Today / is / school / to / run / tsu / te / line / ki / better / ta /.
It will be like. (The morpheme is called token in the API) In morphological analysis, the original "go" is output for the "line" part, and the part of speech is output for each morpheme.
Machine learning is very popular these days, but my impression is that it is still difficult to process natural sentences as they are, without morphological analysis or parsing, or at least hard to get good results that way. Even when applying machine learning, I think it works better on the values obtained from morphological analysis and parsing. Japanese in particular has no spaces between words and relatively free word order, so there may be circumstances that make "naive machine learning" hard to apply.
Now, let's call the COTOHA API.
As preparation, the flow is as follows:
1. Register for a COTOHA API account.
2. Get the "Client ID" and "Client secret" from the API portal.
3. Get an access token.
4. Call the parsing API.
If you follow the guide, the account registration in step 1 completes without any problem.
For step 2, when you access the API portal, a screen showing your credentials is displayed; copy down the "Client ID" and "Client secret".
"Client ID" and "Client secret" are equivalent to user ID and password, but in recent APIs, it is not good to send user ID and password for each access, so first of all, "access" It is supposed to get a "token" and reuse it. The COTOHA API has a maximum 24-hour deadline, so you can reuse what you got in the first session for 24 hours.
Calling any API service goes much the same way: first, read the specification to find out how to access it.
Get access token|reference| COTOHA API https://api.ce-cotoha.com/contents/reference/accesstoken.html
When you look at the request example there... it turns out not to be valid curl command syntax. A typo, right off the bat!
Wrong documentation is common in the IT industry, so don't let it fool you. (lol) "The code is the specification," as the saying goes.
Still, the intent is clear enough: it means you should send a POST request like the curl command below.
$ curl -X POST -H "Content-Type:application/json;charset=UTF-8" -d '{"grantType":"client_credentials","clientId": "[client id]","clientSecret":"[client secret]"}' "[Access Token Publish URL]"
In the [client id], [client secret], and [Access Token Publish URL] parts, fill in the values listed on the portal.
There are several ways to send an HTTP request from Java, but for now I'll use a library called OkHttp. The request body is JSON, so I'll also use the well-known Gson.
Maven
<!-- https://mvnrepository.com/artifact/com.squareup.okhttp3/okhttp -->
<dependency>
<groupId>com.squareup.okhttp3</groupId>
<artifactId>okhttp</artifactId>
<version>3.14.2</version>
</dependency>
<!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.8.6</version>
</dependency>
If you write the code in Java + OkHttp, it will look like the following.
String url = "https://api.ce-cotoha.com/v1/oauth/accesstokens";
String clientId = "[client id]";
String clientSecret = "[client secret]";
// {
// "grantType": "client_credentials",
// "clientId": "[client id]",
// "clientSecret": "[client secret]"
// }
Gson gson = new Gson();
JsonObject jsonObj = new JsonObject();
jsonObj.addProperty("grantType", "client_credentials");
jsonObj.addProperty("clientId", clientId);
jsonObj.addProperty("clientSecret", clientSecret);
OkHttpClient client = new OkHttpClient();
MediaType JSON = MediaType.get("application/json; charset=utf-8");
RequestBody body = RequestBody.create(JSON, jsonObj.toString());
Request request = new Request.Builder() //
.url(url) //
.post(body) //
.build();
try (Response response = client.newCall(request).execute()) {
int responseCode = response.code();
String originalResponseBody = response.body().string();
System.err.println(responseCode); // 201
System.err.println(originalResponseBody);
// 201
// {
// "access_token": "xxx",
// "token_type": "bearer",
// "expires_in": "86399" ,
// "scope": "" ,
// "issued_at": "1581590104700"
// }
}
}
I think the output will look like the following. The actual access token appears where "xxx" is shown here. It's not very elegant programmatically, but let's just copy it and use it for now.
{
  "access_token": "xxx",
  "token_type": "bearer",
  "expires_in": "86399",
  "scope": "",
  "issued_at": "1581590104700"
}
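By the way, instead of copying the token by hand, you could also extract it from the response with Gson. A minimal sketch, assuming originalResponseBody holds the JSON above (it uses com.google.gson.JsonParser, available since Gson 2.8.6):
// Extract access_token from the response JSON with Gson,
// instead of copying it by hand.
JsonObject obj = JsonParser.parseString(originalResponseBody).getAsJsonObject();
String accessToken = obj.get("access_token").getAsString();
System.err.println(accessToken); // xxx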
Next, read the following carefully.
API Reference-Parsing https://api.ce-cotoha.com/contents/reference/apireference.html
... There are several calling options, but the following curl command example seems to be the simplest.
$ curl -X POST -H "Content-Type:application/json;charset=UTF-8" -H "Authorization:Bearer [Access Token]" -d '{"sentence":"The dog walks.","type": "default"}' "[API Base URL]/nlp/v1/parse"
The curl command is a simple line, but in Java it looks like this:
String url = "https://api.ce-cotoha.com/api/dev" + "/nlp/v1/parse";
String sentence = "It is a good weather today.";
String type = "default";
String access_token = "xxx";
Gson gson = new Gson();
JsonObject jsonObj = new JsonObject();
jsonObj.addProperty("sentence", sentence);
jsonObj.addProperty("type", type);
OkHttpClient client = new OkHttpClient();
MediaType JSON = MediaType.get("application/json; charset=utf-8");
RequestBody body = RequestBody.create(JSON, jsonObj.toString());
Request request = new Request.Builder() //
.addHeader("Authorization", "Bearer " + access_token) //
.url(url) //
.post(body) //
.build();
try (Response response = client.newCall(request).execute()) {
String originalResponseBody = response.body().string();
System.err.println(originalResponseBody);
}
Well, what about the result? I think it will look something like the following.
Reading the JSON against the specification, information on phrases and morphemes is output in considerable detail. (The results are noticeably more detailed than the analysis results of other companies' APIs.)
This JSON output will be parsed into Java objects and put to use. That part is general Java technique rather than an API call, so I'll cover it in the next article. → Continued: Parsing the COTOHA API parse results in Java
{
  "result": [
    {
      "chunk_info": {
        "id": 0,
        "head": 2,
        "dep": "D",
        "chunk_head": 0,
        "chunk_func": 1,
        "links": []
      },
      "tokens": [
        {
          "id": 0,
          "form": "今日",
          "kana": "キョウ",
          "lemma": "今日",
          "pos": "名詞",
          "features": ["日時"],
          "dependency_labels": [
            {"token_id": 1, "label": "case"}
          ],
          "attributes": {}
        },
        {
          "id": 1,
          "form": "は",
          "kana": "ハ",
          "lemma": "は",
          "pos": "連用助詞",
          "features": [],
          "attributes": {}
        }
      ]
    },
    {
      "chunk_info": {
        "id": 1,
        "head": 2,
        "dep": "D",
        "chunk_head": 0,
        "chunk_func": 1,
        "links": []
      },
      "tokens": [
        {
          "id": 2,
          "form": "い",
          "kana": "イ",
          "lemma": "良い",
          "pos": "形容詞語幹",
          "features": ["イ段"],
          "dependency_labels": [
            {"token_id": 3, "label": "aux"}
          ],
          "attributes": {}
        },
        {
          "id": 3,
          "form": "い",
          "kana": "イ",
          "lemma": "い",
          "pos": "形容詞接尾辞",
          "features": ["連体"],
          "attributes": {}
        }
      ]
    },
    {
      "chunk_info": {
        "id": 2,
        "head": -1,
        "dep": "O",
        "chunk_head": 0,
        "chunk_func": 1,
        "links": [
          {"link": 0, "label": "time"},
          {"link": 1, "label": "adjectivals"}
        ],
        "predicate": []
      },
      "tokens": [
        {
          "id": 4,
          "form": "天気",
          "kana": "テンキ",
          "lemma": "天気",
          "pos": "名詞",
          "features": [],
          "dependency_labels": [
            {"token_id": 0, "label": "nmod"},
            {"token_id": 2, "label": "amod"},
            {"token_id": 5, "label": "cop"},
            {"token_id": 6, "label": "punct"}
          ],
          "attributes": {}
        },
        {
          "id": 5,
          "form": "です",
          "kana": "デス",
          "lemma": "です",
          "pos": "判定詞",
          "features": ["終止"],
          "attributes": {}
        },
        {
          "id": 6,
          "form": "。",
          "kana": "",
          "lemma": "。",
          "pos": "句点",
          "features": [],
          "attributes": {}
        }
      ]
    }
  ],
  "status": 0,
  "message": ""
}
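Proper object mapping is the subject of the next article, but just to get a feel for the structure, the raw tree can already be walked with Gson. A minimal sketch, assuming originalResponseBody holds the JSON above (it uses com.google.gson.JsonElement, JsonObject, and JsonParser):
// Walk the raw parse result and print each token's form, lemma, and pos.
JsonObject root = JsonParser.parseString(originalResponseBody).getAsJsonObject();
for (JsonElement chunkEl : root.getAsJsonArray("result")) {
    JsonObject chunk = chunkEl.getAsJsonObject();
    for (JsonElement tokenEl : chunk.getAsJsonArray("tokens")) {
        JsonObject token = tokenEl.getAsJsonObject();
        System.err.println(token.get("form").getAsString()
                + "\t" + token.get("lemma").getAsString()
                + "\t" + token.get("pos").getAsString());
    }
}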