[PYTHON] Trying to extract trending keywords with COTOHA

Prescribed application procedure: see the COTOHA API Portal.

Full disclosure: I want an iPad. This article exists because, apparently, if you write an article using COTOHA you can win an iPad (ulterior motive included).

What I made

For those who just want to see the code

The code is on [github] 4, so please have a look. The only dummy values are the user credentials required to access COTOHA; overwrite them with your own information.

Overview of what I made

In short: language analysis of the titles and synopses of novels on [Become a Novelist] 2.

[Become a Novelist] 2 is a web novel posting site. A novel is a creative work, of course, but the site also has a game-like aspect: how do you attract page views? When a story goes trendy, you ride the wave and earn views with works that respond to it. In a sense the culture is close to Twitter's Ogiri (a collaborative joke-riffing game).

So the main point of this article is to try to hack that game with language analysis. Browsing the [COTOHA API Portal] 1, I settled on keyword extraction and similarity calculation:

  1. Can you extract popular keywords from the title and synopsis?
  2. Is there a correlation between title/synopsis similarity and popularity?

Item 1 seems clearly doable; item 2 is the part I am quietly curious about. Recently, titles are often full sentences, so the title is nearly the synopsis. Is that fine, or is it better for the synopsis to carry information that differs from the title? I would like to know.

How I made it

Naro Novel API

The titles and synopses come from the [Naro Novel API] 3, the official API for retrieving summary information about Naro novels.

For precise details, see the [Documentation] 3. Roughly speaking, it is a simple API: attach a query string to the URL below, send a GET request, and it returns summaries of the novels that match.

https://api.syosetu.com/novelapi/api/

Actually, I wanted to include the first 10 episodes in the analysis, but the official API does not expose the body text. Some sites pull the text out anyway and present it in polished UIs, and I suspect the official API omits it precisely because of the server load that causes. Extracting the text through unofficial means, even by scraping, is ethically questionable, so I dropped the idea.

https://api.syosetu.com/novelapi/api/?out=json&lim=50&order=dailypoint

For example, a request to get the top 50 of the daily ranking as JSON looks like the URL above. I felt bad about hitting the server on every test run, so this tool fetches the data once, saves it locally, and reuses it. By the way, the code looks like this.

import urllib.parse
import requests

# Build the ranking query: JSON output, top 50 by daily points
url = 'https://api.syosetu.com/novelapi/api/'
param = {
    'out': 'json',
    'lim': '50',
    'order': 'dailypoint',
}
url_format = '{}?{}'.format(url, urllib.parse.urlencode(param))
res = requests.get(url=url_format)
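
As a minimal sketch of the fetch-once-and-reuse idea mentioned above (the cache file name narou_cache.json is my own choice, not from the original repo; this reuses url_format from the snippet above and returns the same list as res.json()):

import json
import os

CACHE_PATH = 'narou_cache.json'

def fetch_ranking():
    # Reuse the locally cached response if we already fetched it once,
    # to avoid hitting the server on every test run
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, encoding='utf-8') as f:
            return json.load(f)
    res = requests.get(url=url_format)
    data = res.json()
    with open(CACHE_PATH, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False)
    return data

narou_datas = fetch_ranking()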

The first element of the response is a record holding allcount, the total number of hits, so, a little awkwardly, you have to skip it when looping with for.

narou_datas = res.json()
for data in narou_datas:
    # The first element only contains 'allcount', so skip anything
    # that is not an actual novel record
    if 'title' in data:
        title = data['title']
        story = data['story']
        daily_point = data['daily_point']

COTOHA API

According to the [COTOHA API Portal] 1, using the API is a two-step process:

  1. Throw your credentials and get an access token
  2. Use the API with an access token in the header

1. Throw your credentials and get an access token

import json
import requests

url = 'https://api.ce-cotoha.com/v1/oauth/accesstokens'
header = {
    'Content-Type': 'application/json'
}
param = {
    'grantType': 'client_credentials',
    'clientId': conf['clientId'],          # your own client ID
    'clientSecret': conf['clientSecret'],  # your own client secret
}
res = requests.post(url=url, headers=header, data=json.dumps(param))
access_token = res.json()['access_token']  # parse the JSON body first
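
Since the token eventually expires, it can be convenient to wrap this in a small helper and fetch a fresh token per run. A minimal sketch (the function name is my own, and I assume only that the response JSON contains access_token, as the code above does):

def get_access_token(conf):
    """Exchange client credentials for a COTOHA access token."""
    res = requests.post(
        'https://api.ce-cotoha.com/v1/oauth/accesstokens',
        headers={'Content-Type': 'application/json'},
        data=json.dumps({
            'grantType': 'client_credentials',
            'clientId': conf['clientId'],
            'clientSecret': conf['clientSecret'],
        }),
    )
    res.raise_for_status()  # fail loudly instead of parsing an error body
    return res.json()['access_token']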

2. Use the API with an access token in the header

Keyword extraction

url = 'https://api.ce-cotoha.com/api/dev/nlp/v1/keyword'
header = {
    'Content-Type': 'application/json;charset=UTF-8',
    'Authorization': f"Bearer {access_token}",
}
param = {
    'document': title,
    'type': 'kuzure',       # 'kuzure' = informal/colloquial text
    'max_keyword_num': 10,
}
res = requests.post(url=url, headers=header, data=json.dumps(param))
result = res.json()['result']
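
To turn per-title keywords into the site-wide ranking discussed below, one approach (my own sketch, reusing url, header, param, and narou_datas from the snippets above; I assume each entry in result carries the keyword string under a form key, per the COTOHA keyword response) is to tally everything with collections.Counter:

from collections import Counter

keyword_counts = Counter()
for data in narou_datas:
    if 'title' not in data:
        continue  # skip the allcount record
    param['document'] = data['title']
    res = requests.post(url=url, headers=header, data=json.dumps(param))
    for kw in res.json().get('result', []):
        keyword_counts[kw['form']] += 1

# Most frequent keywords across the top-ranked titles
print(keyword_counts.most_common(20))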

Similarity calculation

url = 'https://api.ce-cotoha.com/api/dev/nlp/v1/similarity'
header = {
    'Content-Type': 'application/json;charset=UTF-8',
    'Authorization': f"Bearer {access_token}",
}
param = {
    's1': title,
    's2': story,
    'type': 'kuzure',
}
res = requests.post(url=url, headers=header, data=json.dumps(param))
result = res.json()['result']
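
To check question 2, one way (again my own sketch, assuming the similarity value sits under result['score'], which matches the "score" field shown in the sample output below) is to collect (daily_point, score) pairs and compute a correlation coefficient:

from statistics import correlation  # Python 3.10+

points, scores = [], []
for data in narou_datas:
    if 'title' not in data:
        continue
    param['s1'] = data['title']
    param['s2'] = data['story']
    res = requests.post(url=url, headers=header, data=json.dumps(param))
    points.append(data['daily_point'])
    scores.append(res.json()['result']['score'])

# Pearson correlation between popularity and title/synopsis similarity
print(correlation(points, scores))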

What were the results?

1. Can you extract popular keywords from the title and synopsis?

This largely worked. The top 50 were less skewed than I expected, so it seems the strength of a trend could be measured by, for example, running the extraction regularly and charting which keywords rise and fall.

Picking the main items out of the top 100 results: "skill", "childhood friend", "expulsion", "another world", "musou", "the strongest", "cutting ties", "villainess". A lineup that is, on the whole, pretty convincing, isn't it?

That said, eyeballing the results turned up several kinds of keywords that need correcting.

Generic nouns are included

Generic nouns such as "man", "he", and "she" come in as keywords. From the point of view of what a keyword is supposed to be, this is noise, so an exclusion list or something similar seems necessary.
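
For example, a minimal exclusion-list sketch, reusing keyword_counts from the tally sketch above (the stop words here are the English glosses of the examples just mentioned; in practice the set would hold the actual Japanese nouns):

# Hypothetical exclusion list of generic nouns
STOP_WORDS = {'man', 'he', 'she'}

filtered = {kw: n for kw, n in keyword_counts.items()
            if kw not in STOP_WORDS}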

Compound nouns

For example, "strongest slow life" was counted as one keyword. If possible, count this as two keywords, "strongest" and "slow life." When nouns overlap, it tends to be counted as one word. However, for example, "reincarnation in another world" is a complex noun, but I want it to be one word. General-purpose measures seem to be quite difficult.

Tacit understanding

I would not find it strange if the counts for "another world" and "villainess" were even more overwhelming. Because the site has dedicated setting fields and categories for these, they tend not to be spelled out in the title or synopsis. Tags and the like would probably need their own weighting, combined into the processing.

2. Is there a correlation between title/synopsis similarity and popularity?

It pains me to say this without drawing a proper graph, but I did not find much correlation.

Looking at the tendencies:

"title": "Kensei's childhood friend hits me hard with power harassment, so I decided to insulate and start again in the frontier.", "daily_point": 5468, "score": 0.9695894

Sentence-style titles like this one come out with high similarity,

"title": "Cooking with Wild Game", "daily_point": 574, "score": 0.44519746

while old-school noun titles like this one come out with low similarity.

Looking at the distribution of scores, high-similarity values are common, so it does seem true that there are many sentence-style titles with that light-novel flavor. But from this alone I cannot tell whether sentence-style titles rank high because of the format, or simply because the format is popular and therefore dominates the sample.

To correct for that, you could compute the similarity over results sorted by posting date and compare the two distributions. If the ranking's distribution matches that baseline, the ranking simply reflects the overall population; if the baseline contains a smaller share of sentence-style titles, then sentence-style titles really do pull readers in.
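
A sketch of that correction, reusing url, header, and param from the similarity snippet, and assuming the Naro API accepts order=new for date-sorted results (the parameter value is my assumption; check the documentation):

def similarity_scores(order):
    """Fetch 50 novels with the given sort order and score each
    title against its synopsis."""
    q = {'out': 'json', 'lim': '50', 'order': order}
    res = requests.get('https://api.syosetu.com/novelapi/api/?'
                       + urllib.parse.urlencode(q))
    scores = []
    for data in res.json():
        if 'title' not in data:
            continue  # skip the allcount record
        param['s1'] = data['title']
        param['s2'] = data['story']
        r = requests.post(url=url, headers=header, data=json.dumps(param))
        scores.append(r.json()['result']['score'])
    return scores

# Compare the ranking cohort against the date-sorted baseline
ranked = similarity_scores('dailypoint')
baseline = similarity_scores('new')
print(sum(ranked) / len(ranked), sum(baseline) / len(baseline))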

That said, the similarity calculation started out as research into what kind of synopsis becomes popular. Even after poking at it further I could not reach a conclusion about how a synopsis should be written, so I stopped here. (What the title should be is, for me personally, a lower-priority question.)

Impressions from using it

Easy to use, but errors are hard to diagnose

It runs with just the test code above, so it is very easy. Anyone who can hit curl could be using it within seconds, which makes it quite approachable.

On the other hand, the response when an error occurs is hard to interpret. Whether the access token is wrong, the text is too long, or something else entirely, the response status and the granularity of the message are quite coarse, so pinning down the cause from them was fairly difficult.
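
One workable approach is simply to dump the raw status and body whenever a call fails; a trivial sketch:

res = requests.post(url=url, headers=header, data=json.dumps(param))
if res.status_code != 200:
    # Print everything COTOHA returns; the message is coarse,
    # but the raw body is still the best clue available
    print(res.status_code, res.text)
res.raise_for_status()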

Combining it with other APIs should expand its uses

As some people have already done by summarizing Aozora Bunko with it, I think it shows its true value when combined with other services.

For example, combining speech recognition with user attribute estimation could automatically collect the attributes of the people in a particular space. Install it in a restaurant, estimate the customers' attributes, and use that to stock more of the higher-end menu items, that sort of thing. (It looks like it would blow up over privacy issues.)

Conclusion

I want an iPad.
