"Natural language processing" is fun!!
Of course it can be difficult, but more than that, it's simply fun.
I'd like this post to be a starting point for that kind of fun.
The language we normally use in conversation, such as Japanese, is called "**natural language**".
Think of "**natural language processing**" as "processing" that "natural language" into a form that is easy to handle on a computer.
For comparison, there is "**numerical data**", which is easy for a computer to understand. It covers anything that can be expressed as numbers, such as "EC site sales", "temperature", and "traffic volume". Numerical data is easy to handle with mathematical formulas and computers, so you can easily do things like the following.
- You can easily put it in a table and draw a graph in Excel
- You can see the relationship between temperature and beer sales
- You can turn on the air conditioner when the temperature drops below 10 degrees
- You can also do the AI / machine learning that is so popular now
...... If you could do something similar with "**natural language**", that is, with "conversation", "SNS posts", "novels", and "lyrics", wouldn't that be fun?
For example:
- Turn the lyrics of your favorite artist or your favorite novel into data (somehow) and graph it
- Compare the lines of your favorite manga
- Forcibly block negative comments on SNS
Lyrics, newspapers, Wikipedia, the setting materials for your favorite anime: anything is fine. If you analyze all sorts of things using "text" as the material, you may find something interesting.
Isn't that exciting? Let's do it!!
**Supplement: What is natural language, again?** "Natural language" refers to language that has developed naturally for everyday communication. In contrast, there are also "artificial languages (formal languages)"; a familiar example is a programming language, which was created "artificially" for a specific purpose. The big difference is that a natural language has vague grammar and word meanings and allows a wide variety of expressions, while an artificial (formal) language has clear, unambiguous meanings. That ambiguity leaves a lot of room for interpretation because it depends on each person's life and culture, and that is both the difficulty and the fun of natural language processing.
It sounds interesting, but "natural language processing" was not that easy... in the past. Now it is.
It's a really convenient world: by relying on an external API, you can take a shortcut straight to the fun part.
Services that provide natural language processing include AWS, Azure, and many other cloud services, but this time we will use the **COTOHA API**, which comes from a Japanese company and is strong at processing Japanese.
COTOHA API https://api.ce-cotoha.com/
The introduction was long, but the best thing is to play with it and have fun. The main thing I want to convey in this Qiita post is that enjoyment.
This time, I will skip the details of programming. The code is written in Python, which is popular these days, so that you can try it quickly and easily. Just paste it into Jupyter or something similar and it will work!
First, please register on the COTOHA API site to get the information you need.
This is the only part that takes some effort; if you get stuck, see the getting-started guide at https://api.ce-cotoha.com/contents/gettingStarted.html. The service is fairly easy to use, you can try it for free, and unlike other cloud services you don't have to enter a credit card up front, so you can play with it without worry.
Once you have registered, the information needed to run the code appears on the account home screen, as shown below.

The rest is easy. Replace the ★ parts in the code below with the values you got above, then run it in Python.
##############################
#★ Rewrite with the character string on the dashboard of COTOHA API
##############################
api_base_url = 'https://api.ce-cotoha.com/api/dev/'
client_id = '************'
client_secret = '************'
access_token_url = 'https://api.ce-cotoha.com/v1/oauth/accesstokens'
##############################
#★ Enter the text you want to analyze
##############################
text = "I'm sorry I was born."
##############################
import requests
import json
#Get Token
headers = { "Content-Type" : "application/json" }
data = { "grantType":"client_credentials", "clientId":client_id, "clientSecret":client_secret }
r = requests.post(access_token_url, data=json.dumps(data), headers=headers)
bearer_token = r.json()["access_token"]
#Call the sentiment analysis API
headers = { "Content-Type" : "application/json;charset=UTF-8", "Authorization":"Bearer "+bearer_token }
data = { "sentence":text }
url = api_base_url + "nlp/v1/sentiment"
r = requests.post(url, data=json.dumps(data), headers=headers)
r.json()
If you try it, you will get the following result.
{'message': 'OK',
 'result': {'emotional_phrase': [{'emotion': 'sad', 'form': "I'm sorry"}],
  'score': 0.21369802583055023,
  'sentiment': 'Negative'},
 'status': 0}
This is a sample using COTOHA's sentiment analysis API.
**Sentiment analysis** judges whether the writer's emotion when writing a sentence was positive or negative. It also recognizes specific emotions in the text, such as "joy" and "surprise".
The result of analyzing the famous sentence "**I'm sorry I was born.**" can be summarized as follows.
emotion  : 'sad'
sentiment: 'Negative'
This one sentence expresses the **emotion** **sad**, and the sentence as a whole is judged **Negative**.
With "natural language processing", I was able to convert the hard-to-describe natural sentence "I'm sorry I was born." into data that is easy to work with.
This is one step in what is called "natural language processing."
Please try it with various sentences. I think the results will be quite interesting...
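When you try many sentences, it can help to pull only the interesting fields out of each response. Here is a minimal sketch that assumes the response has the same shape as the sample output above. The function name summarize is my own, and treating a non-zero status as an error is an assumption based only on the sample showing 'status': 0 with 'message': 'OK'.

```python
def summarize(response):
    """Pick the main fields out of a sentiment response shaped like the sample."""
    # Assumption: a non-zero status means the request did not succeed.
    if response.get("status") != 0:
        raise ValueError("API error: {}".format(response.get("message")))
    result = response["result"]
    return {
        "sentiment": result["sentiment"],
        "score": result["score"],
        # One entry per emotional phrase the API found (may be empty)
        "emotions": [p["emotion"] for p in result.get("emotional_phrase", [])],
    }


# The sample response shown earlier in this post
sample = {
    "message": "OK",
    "result": {
        "emotional_phrase": [{"emotion": "sad", "form": "I'm sorry"}],
        "score": 0.21369802583055023,
        "sentiment": "Negative",
    },
    "status": 0,
}
print(summarize(sample))
```

In the first sample you could pass r.json() to summarize() instead of the hard-coded dictionary.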
Now try running the following code.
This time, let's analyze several texts at once by putting them in the texts array at the top of the code.
Run it the same way as the previous sample.
##############################
#★ Rewrite with the character string on the dashboard of COTOHA API
##############################
api_base_url = 'https://api.ce-cotoha.com/api/dev/'
client_id = '************'
client_secret = '************'
access_token_url = 'https://api.ce-cotoha.com/v1/oauth/accesstokens'
##############################
#★ Enter the text you want to analyze
##############################
texts = [
    'I am a cat.',
    'There is no name yet.',
    'I have no idea where I was born.',
    'I remember only crying in a dim and damp place.',
    'I saw human beings for the first time here.',
    'Moreover, I heard later that it was the most evil race of human beings called Shosei.',
    'This student is a story that sometimes catches us and boiled and eats.',
    "However, I didn't think anything at that time, so I didn't think it was particularly scary.",
    'It just felt fluffy when it was placed on his palm and lifted up.',
    "It is probably the beginning of what is called a human being that he calms down a little on his palm and sees the student's face.",
    'The feeling that I thought was strange at this time still remains. The face, which should be decorated with the first hair, is slippery and looks like a kettle.',
    'After that, I met a cat a lot, but I have never met such a single wheel.',
    'Not only that, the center of the face is too protruding.',
    'Then I sometimes blow smoke from the hole.',
    'It was so throaty that I was really weak.',
    'It was around this time that I finally learned that this is a cigarette that humans drink.'
]
##############################
print("sentiment,score")
import requests
import json
#Get Token
headers = { "Content-Type" : "application/json" }
data = { "grantType":"client_credentials", "clientId":client_id, "clientSecret":client_secret }
r = requests.post(access_token_url, data=json.dumps(data), headers=headers)
bearer_token = r.json()["access_token"]
#Call the sentiment analysis API for each text
headers = { "Content-Type" : "application/json;charset=UTF-8", "Authorization":"Bearer "+bearer_token }
for text in texts:
    data = { "sentence":text }
    url = api_base_url + "nlp/v1/sentiment"
    r = requests.post(url, data=json.dumps(data), headers=headers)
    r_json = r.json()
    print( "{},{}".format( r_json['result']['sentiment'], r_json['result']['score'] ) )
If you try it, you will get the following result.
sentiment,score
Neutral,0.3753601806177662
Neutral,0.28184469062696865
Neutral,0.3836848869293042
Negative,0.39071316583764915
Neutral,0.3702709760069095
Negative,0.513838361667319
Neutral,0.47572556634191593
Negative,0.6752951176068892
Neutral,0.42154746899352424
Positive,0.14142126089599155
Neutral,0.4397035866256947
Neutral,0.3335122613499773
Neutral,0.36874320529074195
Neutral,0.3721780539113525
Negative,0.19851456636071463
Neutral,0.4334376882848198
As you may have guessed from the code, this is Natsume Soseki's "I Am a Cat", analyzed sentence by sentence with the results listed in order.
If you load this output into Excel, you can draw all sorts of graphs.
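If copying and pasting the output feels tedious, one option is to write it to a CSV file that Excel can open. The following is only a sketch using Python's standard csv module. The function name write_results_csv, the extra "no" column, and the file name sentiment.csv are my own choices for illustration; the rows are the first four results from the output above.

```python
import csv
import io


def write_results_csv(rows, out):
    """Write (sentiment, score) pairs as CSV with a sentence-number column."""
    writer = csv.writer(out)
    writer.writerow(["no", "sentiment", "score"])
    for no, (sentiment, score) in enumerate(rows, start=1):
        writer.writerow([no, sentiment, score])


# First four results from the output above
rows = [
    ("Neutral", 0.3753601806177662),
    ("Neutral", 0.28184469062696865),
    ("Neutral", 0.3836848869293042),
    ("Negative", 0.39071316583764915),
]

# Write to memory here so the result can be checked on screen
buf = io.StringIO()
write_results_csv(rows, buf)
print(buf.getvalue())

# To save a real file instead (hypothetical file name):
# with open("sentiment.csv", "w", newline="", encoding="utf-8") as f:
#     write_results_csv(rows, f)
```

The sentence-number column makes it easy to use the row order as the horizontal axis of a graph.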
For example, it looks like this. A single work alone isn't very interesting, so Osamu Dazai's "No Longer Human" is included as well.
The top graph is "I Am a Cat" and the bottom is "No Longer Human".

Lining them up like this is interesting...
"I Am a Cat" keeps going back and forth between **Negative** and **Neutral**, while "No Longer Human" swings wildly. Its emotional range, back and forth between **Negative** and **Positive**, is striking.
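To put Negative, Neutral, and Positive on one axis of a graph, the label and the score need to be combined into a single number. One possible way, and only an assumption of mine rather than a description of how the graphs in this post were made, is to make the score negative for Negative, keep it positive for Positive, and use 0 for Neutral. The function name signed_score is made up for this sketch.

```python
def signed_score(sentiment, score):
    """Map a sentiment label and its score onto a single signed value."""
    if sentiment == "Positive":
        return score
    if sentiment == "Negative":
        return -score
    # Neutral (or any other label) sits on the zero line
    return 0.0


# Three rows taken from the output above
rows = [
    ("Neutral", 0.3702709760069095),
    ("Negative", 0.513838361667319),
    ("Positive", 0.14142126089599155),
]
for sentiment, score in rows:
    print(sentiment, signed_score(sentiment, score))
```

Other mappings are just as reasonable, for example fixed values of -1, 0, and 1 that ignore the score.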
I only did about the first 10 sentences, but I feel you could see much more by trying this on the full text, chapter by chapter, or by comparing works and writers.
Once text has been turned into data like this, you can apply all sorts of statistics to it, such as looking at the "average" or the "variance".
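As a small sketch of what that could look like, the following counts the labels and computes the average and variance of the scores with Python's standard library. The data is the "I Am a Cat" output shown above; using the population variance (pvariance) rather than the sample variance is simply my choice here.

```python
from collections import Counter
from statistics import mean, pvariance

# The sentiment,score output for "I Am a Cat" shown above
results = [
    ("Neutral", 0.3753601806177662),
    ("Neutral", 0.28184469062696865),
    ("Neutral", 0.3836848869293042),
    ("Negative", 0.39071316583764915),
    ("Neutral", 0.3702709760069095),
    ("Negative", 0.513838361667319),
    ("Neutral", 0.47572556634191593),
    ("Negative", 0.6752951176068892),
    ("Neutral", 0.42154746899352424),
    ("Positive", 0.14142126089599155),
    ("Neutral", 0.4397035866256947),
    ("Neutral", 0.3335122613499773),
    ("Neutral", 0.36874320529074195),
    ("Neutral", 0.3721780539113525),
    ("Negative", 0.19851456636071463),
    ("Neutral", 0.4334376882848198),
]

# How many sentences got each label
counts = Counter(sentiment for sentiment, _ in results)
print(counts)

# Average and (population) variance of the scores
scores = [score for _, score in results]
print("average :", mean(scores))
print("variance:", pvariance(scores))
```

Running the same thing on another work and comparing the numbers is one way to back up the impression you get from the graphs.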
How was it? You can do things that look fun more easily than you might expect.
For example, the COTOHA API has many other APIs, such as "estimating age and gender from sentences" and "extracting parts of speech (verbs and nouns)". Of course, you can also turn text into various kinds of data with Azure, AWS, IBM Cloud, and so on. Personality Insights on IBM Cloud is interesting too: it can diagnose personality from text.
By using "natural language processing" in this way, all kinds of text can be made easy to handle on a computer.
Once text is in a form that is easy to handle, all kinds of applications become possible.
For example: analyze the text of a website and improve it to raise the CVR, analyze SNS posts for marketing, or analyze essays and book reports together with their grades to find out what kind of writing scores highly...
Extract various data from essays and book reports as "explanatory variables", set the grade as the "objective variable", and perform regression analysis. If you draw a nice decision tree, you might even be able to aim for a high score...!
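To make "explanatory variable" and "objective variable" a little more concrete, here is a tiny sketch of regression with a single explanatory variable, fitted by ordinary least squares (not a decision tree). Everything in it is hypothetical: the function name fit_line, the idea of using the share of Positive sentences as the explanatory variable, and all of the numbers, which are toy values invented purely for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy values invented for illustration only (not real data)
positive_ratio = [0.1, 0.2, 0.3, 0.4]    # explanatory variable
essay_score = [52.0, 61.0, 68.0, 79.0]   # objective variable

slope, intercept = fit_line(positive_ratio, essay_score)
print("slope:", slope, "intercept:", intercept)

# Predicted score for an essay where half the sentences are Positive
print("prediction:", slope * 0.5 + intercept)
```

With real data you would use many more essays and several explanatory variables, and a library such as scikit-learn would be a more practical choice than hand-written formulas.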
Even without doing the difficult parts yourself (the specialists who provide the service take care of those!), it's fun to be able to play around in many ways with the "natural language" you're used to. I hope this Qiita post becomes your first step into natural language processing!!