[PYTHON] Trying to extract trending keywords with COTOHA

Prescribed application procedure: see the COTOHA API Portal.

Full disclosure: I want an iPad. This article exists because, apparently, if you write an article using COTOHA you can win an iPad (ulterior motive included).

What I made

For those who just want to see the code

The code is on [github] 4, so please have a look. The only dummy values are the user credentials required to access COTOHA; overwrite them with your own information.

Overview of what I made

In short: language analysis of the titles and synopses of novels on [Become a Novelist] 2.

[Become a Novelist] 2 is a web novel posting site. A novel is a creative work, of course, but the site also has a game-like aspect: how do you attract page views? When a story goes trendy, you ride the wave and earn views with works that respond to it. In a sense the culture is close to Twitter's Ogiri (a collaborative joke-riffing game).

So the main point of this article is to try to hack that game with language analysis. Browsing the [COTOHA API Portal] 1, I settled on keyword extraction and similarity calculation:

  1. Can you extract popular keywords from the title and synopsis?
  2. Is there a correlation between title/synopsis similarity and popularity?

Item 1 seems clearly doable; item 2 is the part I am quietly curious about. Recently, titles are often full sentences, so the title is nearly the synopsis. Is that fine, or is it better for the synopsis to carry information that differs from the title? I would like to know.

How I made it

Naro Novel API

The titles and synopses come from the [Naro Novel API] 3, the official API for retrieving summary information about Naro novels.

For precise details, see the [Documentation] 3. Roughly speaking, it is a simple API: attach a query string to the URL below, send a GET request, and it returns summaries of the novels that match.

https://api.syosetu.com/novelapi/api/

Actually, I wanted to include the first 10 episodes in the analysis, but the official API does not expose the body text. Some sites pull the text out anyway and present it in polished UIs, and I suspect the official API omits it precisely because of the server load that causes. Extracting the text through unofficial means, even by scraping, is ethically questionable, so I dropped the idea.

https://api.syosetu.com/novelapi/api/?out=json&lim=50&order=dailypoint

For example, a request to get the top 50 of the daily ranking as JSON looks like the URL above. I felt bad about hitting the server on every test run, so this tool fetches the data once, saves it locally, and reuses it. By the way, the code looks like this.

import urllib.parse
import requests

# Build the ranking query: JSON output, top 50 by daily points
url = 'https://api.syosetu.com/novelapi/api/'
param = {
    'out': 'json',
    'lim': '50',
    'order': 'dailypoint',
}
url_format = '{}?{}'.format(url, urllib.parse.urlencode(param))
res = requests.get(url=url_format)
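
As a minimal sketch of the fetch-once-and-reuse idea mentioned above (the cache file name narou_cache.json is my own choice, not from the original repo; this reuses url_format from the snippet above and returns the same list as res.json()):

import json
import os

CACHE_PATH = 'narou_cache.json'

def fetch_ranking():
    # Reuse the locally cached response if we already fetched it once,
    # to avoid hitting the server on every test run
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, encoding='utf-8') as f:
            return json.load(f)
    res = requests.get(url=url_format)
    data = res.json()
    with open(CACHE_PATH, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False)
    return data

narou_datas = fetch_ranking()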

The first element of the response is a record holding allcount, the total number of hits, so, a little awkwardly, you have to skip it when looping with for.

narou_datas = res.json()
for data in narou_datas:
    # The first element only contains 'allcount', so skip anything
    # that is not an actual novel record
    if 'title' in data:
        title = data['title']
        story = data['story']
        daily_point = data['daily_point']

COTOHA API

According to the [COTOHA API Portal] 1, using the API is a two-step process:

  1. Throw your credentials and get an access token
  2. Use the API with an access token in the header

1. Throw your credentials and get an access token

import json
import requests

url = 'https://api.ce-cotoha.com/v1/oauth/accesstokens'
header = {
    'Content-Type': 'application/json'
}
param = {
    'grantType': 'client_credentials',
    'clientId': conf['clientId'],          # your own client ID
    'clientSecret': conf['clientSecret'],  # your own client secret
}
res = requests.post(url=url, headers=header, data=json.dumps(param))
access_token = res.json()['access_token']  # parse the JSON body first
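
Since the token eventually expires, it can be convenient to wrap this in a small helper and fetch a fresh token per run. A minimal sketch (the function name is my own, and I assume only that the response JSON contains access_token, as the code above does):

def get_access_token(conf):
    """Exchange client credentials for a COTOHA access token."""
    res = requests.post(
        'https://api.ce-cotoha.com/v1/oauth/accesstokens',
        headers={'Content-Type': 'application/json'},
        data=json.dumps({
            'grantType': 'client_credentials',
            'clientId': conf['clientId'],
            'clientSecret': conf['clientSecret'],
        }),
    )
    res.raise_for_status()  # fail loudly instead of parsing an error body
    return res.json()['access_token']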

2. Use the API with an access token in the header

Keyword extraction

url = 'https://api.ce-cotoha.com/api/dev/nlp/v1/keyword'
header = {
    'Content-Type': 'application/json;charset=UTF-8',
    'Authorization': f"Bearer {access_token}",
}
param = {
    'document': title,
    'type': 'kuzure',       # 'kuzure' = informal/colloquial text
    'max_keyword_num': 10,
}
res = requests.post(url=url, headers=header, data=json.dumps(param))
result = res.json()['result']
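
To turn per-title keywords into the site-wide ranking discussed below, one approach (my own sketch, reusing url, header, param, and narou_datas from the snippets above; I assume each entry in result carries the keyword string under a form key, per the COTOHA keyword response) is to tally everything with collections.Counter:

from collections import Counter

keyword_counts = Counter()
for data in narou_datas:
    if 'title' not in data:
        continue  # skip the allcount record
    param['document'] = data['title']
    res = requests.post(url=url, headers=header, data=json.dumps(param))
    for kw in res.json().get('result', []):
        keyword_counts[kw['form']] += 1

# Most frequent keywords across the top-ranked titles
print(keyword_counts.most_common(20))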

Similarity calculation

url = 'https://api.ce-cotoha.com/api/dev/nlp/v1/similarity'
header = {
    'Content-Type': 'application/json;charset=UTF-8',
    'Authorization': f"Bearer {access_token}",
}
param = {
    's1': title,
    's2': story,
    'type': 'kuzure',
}
res = requests.post(url=url, headers=header, data=json.dumps(param))
result = res.json()['result']
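
To check question 2, one way (again my own sketch, assuming the similarity value sits under result['score'], which matches the "score" field shown in the sample output below) is to collect (daily_point, score) pairs and compute a correlation coefficient:

from statistics import correlation  # Python 3.10+

points, scores = [], []
for data in narou_datas:
    if 'title' not in data:
        continue
    param['s1'] = data['title']
    param['s2'] = data['story']
    res = requests.post(url=url, headers=header, data=json.dumps(param))
    points.append(data['daily_point'])
    scores.append(res.json()['result']['score'])

# Pearson correlation between popularity and title/synopsis similarity
print(correlation(points, scores))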

What were the results?

1. Can you extract popular keywords from the title and synopsis?

This largely worked. The top 50 were less skewed than I expected, so it seems the strength of a trend could be measured by, for example, running the extraction regularly and charting which keywords rise and fall.

Picking the main items out of the top 100 results: "skill", "childhood friend", "expulsion", "another world", "musou", "the strongest", "cutting ties", "villainess". A lineup that is, on the whole, pretty convincing, isn't it?

That said, eyeballing the results turned up several kinds of keywords that need correcting.

Generic nouns are included

Generic nouns such as "man", "he", and "she" come in as keywords. From the point of view of what a keyword is supposed to be, this is noise, so an exclusion list or something similar seems necessary.
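
For example, a minimal exclusion-list sketch, reusing keyword_counts from the tally sketch above (the stop words here are the English glosses of the examples just mentioned; in practice the set would hold the actual Japanese nouns):

# Hypothetical exclusion list of generic nouns
STOP_WORDS = {'man', 'he', 'she'}

filtered = {kw: n for kw, n in keyword_counts.items()
            if kw not in STOP_WORDS}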

Compound nouns

For example, "strongest slow life" was counted as one keyword. If possible, count this as two keywords, "strongest" and "slow life." When nouns overlap, it tends to be counted as one word. However, for example, "reincarnation in another world" is a complex noun, but I want it to be one word. General-purpose measures seem to be quite difficult.

Tacit understanding

I would not find it strange if the counts for "another world" and "villainess" were even more overwhelming. Because the site has dedicated setting fields and categories for these, they tend not to be spelled out in the title or synopsis. Tags and the like would probably need their own weighting, combined into the processing.

2. Is there a correlation between title/synopsis similarity and popularity?

It pains me to say this without drawing a proper graph, but I did not find much correlation.

Looking at the tendencies:

"title": "Kensei's childhood friend hits me hard with power harassment, so I decided to insulate and start again in the frontier.", "daily_point": 5468, "score": 0.9695894

Sentence-style titles like this one come out with high similarity,

"title": "Cooking with Wild Game", "daily_point": 574, "score": 0.44519746

while old-school noun titles like this one come out with low similarity.

Looking at the distribution of scores, high-similarity values are common, so it does seem true that there are many sentence-style titles with that light-novel flavor. But from this alone I cannot tell whether sentence-style titles rank high because of the format, or simply because the format is popular and therefore dominates the sample.

To correct for that, you could compute the similarity over results sorted by posting date and compare the two distributions. If the ranking's distribution matches that baseline, the ranking simply reflects the overall population; if the baseline contains a smaller share of sentence-style titles, then sentence-style titles really do pull readers in.
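
A sketch of that correction, reusing url, header, and param from the similarity snippet, and assuming the Naro API accepts order=new for date-sorted results (the parameter value is my assumption; check the documentation):

def similarity_scores(order):
    """Fetch 50 novels with the given sort order and score each
    title against its synopsis."""
    q = {'out': 'json', 'lim': '50', 'order': order}
    res = requests.get('https://api.syosetu.com/novelapi/api/?'
                       + urllib.parse.urlencode(q))
    scores = []
    for data in res.json():
        if 'title' not in data:
            continue  # skip the allcount record
        param['s1'] = data['title']
        param['s2'] = data['story']
        r = requests.post(url=url, headers=header, data=json.dumps(param))
        scores.append(r.json()['result']['score'])
    return scores

# Compare the ranking cohort against the date-sorted baseline
ranked = similarity_scores('dailypoint')
baseline = similarity_scores('new')
print(sum(ranked) / len(ranked), sum(baseline) / len(baseline))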

That said, the similarity calculation started out as research into what kind of synopsis becomes popular. Even after poking at it further I could not reach a conclusion about how a synopsis should be written, so I stopped here. (What the title should be is, for me personally, a lower-priority question.)

Impressions from using it

Easy to use, but errors are hard to diagnose

It runs with just the test code above, so it is very easy. Anyone who can hit curl could be using it within seconds, which makes it quite approachable.

On the other hand, the response when an error occurs is hard to interpret. Whether the access token is wrong, the text is too long, or something else entirely, the response status and the granularity of the message are quite coarse, so pinning down the cause from them was fairly difficult.
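
One workable approach is simply to dump the raw status and body whenever a call fails; a trivial sketch:

res = requests.post(url=url, headers=header, data=json.dumps(param))
if res.status_code != 200:
    # Print everything COTOHA returns; the message is coarse,
    # but the raw body is still the best clue available
    print(res.status_code, res.text)
res.raise_for_status()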

Combining it with other APIs should expand its uses

As some people have already done by summarizing Aozora Bunko with it, I think it shows its true value when combined with other services.

For example, combining speech recognition with user attribute estimation could automatically collect the attributes of the people in a particular space. Install it in a restaurant, estimate the customers' attributes, and use that to stock more of the higher-end menu items, that sort of thing. (It looks like it would blow up over privacy issues.)

Conclusion

I want an iPad.
