[PYTHON] Challenge to generate Yahoo! News headlines using only COTOHA API

Introduction

The COTOHA API is running a collaboration campaign with Qiita. I want a PS4 so I can play the FF7 remake soon... (-p-)

https://zine.qiita.com/event/collaboration-cotoha-api/

My motive is completely impure, but I gave natural language processing a try with the COTOHA API. Today is the posting deadline, so it was a close call, but I somehow managed to get it done...

What I did

I tried summarizing news articles using only the APIs provided by COTOHA. The theme is **Yahoo! News headline generation**.

Yahoo! News Headlines

As you all know, Yahoo! News has a headline for each article. For example, it looks like the following.

(Screenshot: an example of a Yahoo! News headline)

These headlines, which I usually glance at without much thought, are actually written according to various rules, and there is surprising depth to them.

First of all, in order to communicate simply and consistently in a limited space, the length is limited to **at most 13 characters** (to be exact, 13.5 characters, counting half-width characters such as spaces as half a character).
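As an aside, here is a minimal sketch of what such a length check could look like, assuming full-width characters count as 1 and half-width characters as 0.5 (this helper is purely illustrative and is not used in the code at the end):

import unicodedata

def headline_length(text):
    # Full-width characters count as 1, half-width characters (ASCII, half-width spaces, etc.) as 0.5
    length = 0.0
    for ch in text:
        if unicodedata.east_asian_width(ch) in ('F', 'W', 'A'):
            length += 1.0
        else:
            length += 0.5
    return length

print(headline_length('テスト headline'))  # 3 full-width + 1 half-width space + 8 half-width letters = 7.5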

The headline also includes **location information**. For incidents and accidents, the importance of the news and the level of reader interest vary greatly depending on where it happened.

Also, the words used in a headline are basically **words that appear in the article**. Since the articles are distributed by various media outlets, the editors apparently stick to the article's own wording wherever it fits within the character limit, so as not to distort the content.

If headlines are built from words in the article, I figured the COTOHA API should be able to manage it to some extent.

There are other rules as well, but the ones covered this time are summarized below.

- **[Rule 1] The headline is at most 13 characters**
- **[Rule 2] Include location information in the headline**
- **[Rule 3] Use words from the article in the headline**

[Reference] The secret of Yahoo! News topics "13-character headlines" https://news.yahoo.co.jp/newshack/inside/yahoonews_topics_heading.html

COTOHA API

An **API for natural language processing and speech recognition** provided by NTT Communications. It offers 14 APIs for natural language and speech processing, such as parsing and speech recognition. https://api.ce-cotoha.com/contents/index.html

This time I used the **Developers version** of the COTOHA API. It has some restrictions compared to the Enterprise version, but it can be used for free.
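For reference, every request needs an access token obtained from your client ID and secret. Below is a minimal sketch, essentially the same as get_access_token in the code at the end (the credential strings are placeholders):

import json
import requests

def get_access_token():
    # Exchange the client credentials from the COTOHA portal for an access token
    url = 'https://api.ce-cotoha.com/v1/oauth/accesstokens'
    req_data = {
        'grantType': 'client_credentials',
        'clientId': 'Client ID',         # placeholder: your client ID
        'clientSecret': 'Client secret'  # placeholder: your client secret
    }
    headers = {'Content-Type': 'application/json'}
    response = requests.post(url, json.dumps(req_data), headers=headers)
    return response.json()['access_token']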

The article targeted this time

This time I targeted the following article, which reports that Bill Gates is retiring from the board of MS (Microsoft). https://news.yahoo.co.jp/pickup/6354056

The headline that was attached is here.

Bill Gates retires from MS board

Hmm. It is indeed concise and easy to understand.

Step 1: First, challenge with just the summary API

The COTOHA API provides a **summary API**. It is still in beta, but it can **extract the sentences it judges to be important from a text**.

First of all, I decided to extract one sentence using this API.
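The request itself looks roughly like this (same endpoint and parameters as get_summary in the code at the end; the token and article text are placeholders):

import json
import requests

token = '<access token>'                # obtained as shown earlier
article_text = '<body of the article>'  # scraped from the Yahoo! News page

response = requests.post(
    'https://api.ce-cotoha.com/api/dev/nlp/beta/summary',
    json.dumps({'document': article_text, 'sent_len': '1'}),  # ask for a one-sentence summary
    headers={'Content-Type': 'application/json;charset=UTF-8',
             'Authorization': 'Bearer {}'.format(token)})
print(response.json())

The response came back as follows.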

{
  "result": "Gates retired from management in 2008 and retired as chairman in 2014, but remained on the board.",
  "status": 0
}

The extraction worked fine, but the sentence obviously exceeds 13 characters, so it has to be **shortened**. I wondered how best to shorten it, and decided to go with the approach of **keeping only the keywords with high importance**.

Step 2: Extract important words using the keyword extraction API

I previously wrote in a Qiita article that you can extract keywords with high importance using term extract.

[Reference] Qiita tag automatic generator https://qiita.com/fukumasa/items/7f6f69d4f6336aff3d90

The COTOHA API also provides a **keyword extraction API**, which extracts characteristic phrases and words contained in a text as keywords.

Let's extract keywords for the one sentence extracted earlier.
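As before, this is a single POST (same as get_keywords in the code at the end; summary stands for the sentence from step 1):

import json
import requests

token = '<access token>'
summary = '<the sentence extracted in step 1>'

response = requests.post(
    'https://api.ce-cotoha.com/api/dev/nlp/v1/keyword',
    json.dumps({'document': summary, 'type': 'default', 'do_segment': True}),
    headers={'Content-Type': 'application/json;charset=UTF-8',
             'Authorization': 'Bearer {}'.format(token)})
keywords = [item['form'] for item in response.json()['result']]  # keep only the surface forms

The raw response looked like this.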

{
  "result": [
    {
      "form": "Chairman",
      "score": 14.48722
    },
    {
      "form": "Line",
      "score": 11.3583
    },
    {
      "form": "Retired",
      "score": 11.2471
    },
    {
      "form": "board of directors",
      "score": 10.0
    }
  ],
  "status": 0,
  "message": ""
}

At this point things already look shaky... the essential piece of information, **"who"** (Mr. Gates), was not extracted. Well, I'll press on for now.

Step 3: Extract location information using named entity extraction API

As noted in the rules at the beginning, the headline should include location information. COTOHA provides a convenient API for this: the **named entity extraction API**. With it, you can obtain named entities such as personal names and place names.

I tried it on the sentence extracted earlier, but it contained no location information.

When location information is found, things are easy: I decided to simply prepend **the extracted location followed by the particle "de" (で)** to the beginning of the headline.
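A minimal sketch of that handling, following get_ne_loc and create_news_topic in the code at the end (token and summary are placeholders):

import json
import requests

token = '<access token>'
summary = '<the sentence extracted in step 1>'

response = requests.post(
    'https://api.ce-cotoha.com/api/dev/nlp/v1/ne',
    json.dumps({'sentence': summary}),
    headers={'Content-Type': 'application/json;charset=UTF-8',
             'Authorization': 'Bearer {}'.format(token)})

# Keep only entities classified as LOC (place names) and drop duplicates
ne_loc = list({item['form'] for item in response.json()['result'] if item['class'] == 'LOC'})

topic = ''
if ne_loc:
    topic += ne_loc[0] + 'で'  # prepend "<location> de" to the headline being built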

Step 4: Add particles to keywords using the parsing API

It seems unlikely that a decent headline (sentence) can be produced just by lining up these extracted keywords. I can't do anything as advanced as automatically generating sentences from keywords, so I was rather stuck.

Since I had imposed the restriction of using only the COTOHA API, I went back over the API list, and an idea struck me: using the **parsing API**, I could **attach particles such as "ga" (が) and "o" (を) to each extracted keyword to connect them**.

This API decomposes a text into chunks (phrases) and morphemes, and annotates them with dependency relations between chunks, dependency relations between morphemes, and semantic information such as part-of-speech tags.

In other words, it looks like I can extract which particle goes with each extracted keyword (I'm not sure how best to put it...). For example, for a sentence like "the air is delicious" (空気がおいしい), the particle "ga" would be extracted for the keyword "air".

Let's use this API to attach particles to the keywords extracted earlier.
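The request itself is simple (same as parse_doc in the code at the end); the per-chunk token lists are then flattened into one list:

import json
import requests

token = '<access token>'
summary = '<the sentence extracted in step 1>'

response = requests.post(
    'https://api.ce-cotoha.com/api/dev/nlp/v1/parse',
    json.dumps({'sentence': summary}),
    headers={'Content-Type': 'application/json;charset=UTF-8',
             'Authorization': 'Bearer {}'.format(token)})

# Flatten the per-chunk token lists into a single list of tokens
tokens = [t for chunk in response.json()['result'] for t in chunk['tokens']]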

Returned response (folded because it's long)
{
  "result": [
    {
      "chunk_info": {
        "id": 0,
        "head": 7,
        "dep": "D",
        "chunk_head": 1,
        "chunk_func": 2,
        "links": []
      },
      "tokens": [
        {
          "id": 0,
          "form": "Gates",
          "kana": "Gates",
          "lemma": "Gates",
          "pos": "noun",
          "features": [
            "Unique",
            "Surname"
          ],
          "attributes": {}
        },
        {
          "id": 1,
          "form": "Mr",
          "kana": "Shi",
          "lemma": "Mr",
          "pos": "Noun suffix",
          "features": [
            "noun"
          ],
          "dependency_labels": [
            {
              "token_id": 0,
              "label": "name"
            },
            {
              "token_id": 2,
              "label": "case"
            }
          ],
          "attributes": {}
        },
        {
          "id": 2,
          "form": "Is",
          "kana": "C",
          "lemma": "Is",
          "pos": "Conjunctive particles",
          "features": [],
          "attributes": {}
        }
      ]
    },
    {
      "chunk_info": {
        "id": 1,
        "head": 4,
        "dep": "D",
        "chunk_head": 0,
        "chunk_func": 1,
        "links": []
      },
      "tokens": [
        {
          "id": 3,
          "form": "2008",
          "kana": "Nisen Hachinen",
          "lemma": "2008",
          "pos": "noun",
          "features": [
            "Date and time"
          ],
          "dependency_labels": [
            {
              "token_id": 4,
              "label": "case"
            }
          ],
          "attributes": {}
        },
        {
          "id": 4,
          "form": "To",
          "kana": "D",
          "lemma": "To",
          "pos": "Case particles",
          "features": [
            "Continuous use"
          ],
          "attributes": {}
        }
      ]
    },
    {
      "chunk_info": {
        "id": 2,
        "head": 3,
        "dep": "D",
        "chunk_head": 0,
        "chunk_func": 1,
        "links": []
      },
      "tokens": [
        {
          "id": 5,
          "form": "management",
          "kana": "Keiei",
          "lemma": "management",
          "pos": "noun",
          "features": [
            "motion"
          ],
          "dependency_labels": [
            {
              "token_id": 6,
              "label": "case"
            }
          ],
          "attributes": {}
        },
        {
          "id": 6,
          "form": "of",
          "kana": "No",
          "lemma": "of",
          "pos": "Case particles",
          "features": [
            "Attributive form"
          ],
          "attributes": {}
        }
      ]
    },
    {
      "chunk_info": {
        "id": 3,
        "head": 4,
        "dep": "D",
        "chunk_head": 0,
        "chunk_func": 1,
        "links": [
          {
            "link": 2,
            "label": "adjectivals"
          }
        ]
      },
      "tokens": [
        {
          "id": 7,
          "form": "Line",
          "kana": "Issen",
          "lemma": "Line",
          "pos": "noun",
          "features": [],
          "dependency_labels": [
            {
              "token_id": 5,
              "label": "nmod"
            },
            {
              "token_id": 8,
              "label": "case"
            }
          ],
          "attributes": {}
        },
        {
          "id": 8,
          "form": "From",
          "kana": "Kara",
          "lemma": "From",
          "pos": "Case particles",
          "features": [
            "Continuous use"
          ],
          "attributes": {}
        }
      ]
    },
    {
      "chunk_info": {
        "id": 4,
        "head": 7,
        "dep": "P",
        "chunk_head": 0,
        "chunk_func": 1,
        "links": [
          {
            "link": 1,
            "label": "goal"
          },
          {
            "link": 3,
            "label": "object"
          }
        ],
        "predicate": []
      },
      "tokens": [
        {
          "id": 9,
          "form": "Retire",
          "kana": "Sirizo",
          "lemma": "Retire",
          "pos": "Verb stem",
          "features": [
            "K"
          ],
          "dependency_labels": [
            {
              "token_id": 3,
              "label": "nmod"
            },
            {
              "token_id": 7,
              "label": "dobj"
            },
            {
              "token_id": 10,
              "label": "aux"
            },
            {
              "token_id": 11,
              "label": "punct"
            }
          ],
          "attributes": {}
        },
        {
          "id": 10,
          "form": "Ki",
          "kana": "Ki",
          "lemma": "Ki",
          "pos": "Verb suffix",
          "features": [
            "Continuous use"
          ],
          "attributes": {}
        },
        {
          "id": 11,
          "form": "、",
          "kana": "",
          "lemma": "、",
          "pos": "Comma",
          "features": [],
          "attributes": {}
        }
      ]
    },
    {
      "chunk_info": {
        "id": 5,
        "head": 7,
        "dep": "D",
        "chunk_head": 0,
        "chunk_func": 1,
        "links": []
      },
      "tokens": [
        {
          "id": 12,
          "form": "14 years",
          "kana": "Juyonen",
          "lemma": "14 years",
          "pos": "noun",
          "features": [
            "Date and time"
          ],
          "dependency_labels": [
            {
              "token_id": 13,
              "label": "case"
            }
          ],
          "attributes": {}
        },
        {
          "id": 13,
          "form": "To",
          "kana": "Niha",
          "lemma": "To",
          "pos": "Conjunctive particles",
          "features": [],
          "attributes": {}
        }
      ]
    },
    {
      "chunk_info": {
        "id": 6,
        "head": 7,
        "dep": "D",
        "chunk_head": 0,
        "chunk_func": 1,
        "links": []
      },
      "tokens": [
        {
          "id": 14,
          "form": "Chairman",
          "kana": "Kaicho",
          "lemma": "Chairman",
          "pos": "noun",
          "features": [],
          "dependency_labels": [
            {
              "token_id": 15,
              "label": "case"
            }
          ],
          "attributes": {}
        },
        {
          "id": 15,
          "form": "To",
          "kana": "Wo",
          "lemma": "To",
          "pos": "Case particles",
          "features": [
            "Continuous use"
          ],
          "attributes": {}
        }
      ]
    },
    {
      "chunk_info": {
        "id": 7,
        "head": 9,
        "dep": "D",
        "chunk_head": 0,
        "chunk_func": 3,
        "links": [
          {
            "link": 0,
            "label": "agent"
          },
          {
            "link": 4,
            "label": "manner"
          },
          {
            "link": 5,
            "label": "time"
          },
          {
            "link": 6,
            "label": "agent"
          }
        ],
        "predicate": [
          "past"
        ]
      },
      "tokens": [
        {
          "id": 16,
          "form": "Retired",
          "kana": "Tay Ninh",
          "lemma": "Retired",
          "pos": "noun",
          "features": [
            "motion"
          ],
          "dependency_labels": [
            {
              "token_id": 1,
              "label": "nsubj"
            },
            {
              "token_id": 9,
              "label": "advcl"
            },
            {
              "token_id": 12,
              "label": "nmod"
            },
            {
              "token_id": 14,
              "label": "nsubj"
            },
            {
              "token_id": 17,
              "label": "aux"
            },
            {
              "token_id": 18,
              "label": "aux"
            },
            {
              "token_id": 19,
              "label": "mark"
            },
            {
              "token_id": 20,
              "label": "punct"
            }
          ],
          "attributes": {}
        },
        {
          "id": 17,
          "form": "Shi",
          "kana": "Shi",
          "lemma": "Shi",
          "pos": "Verb conjugation ending",
          "features": [],
          "attributes": {}
        },
        {
          "id": 18,
          "form": "Ta",
          "kana": "Ta",
          "lemma": "Ta",
          "pos": "Verb suffix",
          "features": [
            "Connect"
          ],
          "attributes": {}
        },
        {
          "id": 19,
          "form": "But",
          "kana": "Moth",
          "lemma": "But",
          "pos": "Connection suffix",
          "features": [
            "Continuous use"
          ],
          "attributes": {}
        },
        {
          "id": 20,
          "form": "、",
          "kana": "",
          "lemma": "、",
          "pos": "Comma",
          "features": [],
          "attributes": {}
        }
      ]
    },
    {
      "chunk_info": {
        "id": 8,
        "head": 9,
        "dep": "D",
        "chunk_head": 0,
        "chunk_func": 1,
        "links": []
      },
      "tokens": [
        {
          "id": 21,
          "form": "board of directors",
          "kana": "Torishima Yakkai",
          "lemma": "board of directors",
          "pos": "noun",
          "features": [],
          "dependency_labels": [
            {
              "token_id": 22,
              "label": "case"
            }
          ],
          "attributes": {}
        },
        {
          "id": 22,
          "form": "To",
          "kana": "Niha",
          "lemma": "To",
          "pos": "Conjunctive particles",
          "features": [],
          "attributes": {}
        }
      ]
    },
    {
      "chunk_info": {
        "id": 9,
        "head": -1,
        "dep": "O",
        "chunk_head": 0,
        "chunk_func": 4,
        "links": [
          {
            "link": 7,
            "label": "manner"
          },
          {
            "link": 8,
            "label": "place"
          }
        ],
        "predicate": [
          "past",
          "past"
        ]
      },
      "tokens": [
        {
          "id": 23,
          "form": "Remaining",
          "kana": "Saw",
          "lemma": "Remain",
          "pos": "Verb stem",
          "features": [
            "R"
          ],
          "dependency_labels": [
            {
              "token_id": 16,
              "label": "advcl"
            },
            {
              "token_id": 21,
              "label": "nmod"
            },
            {
              "token_id": 24,
              "label": "aux"
            },
            {
              "token_id": 25,
              "label": "aux"
            },
            {
              "token_id": 26,
              "label": "aux"
            },
            {
              "token_id": 27,
              "label": "aux"
            },
            {
              "token_id": 28,
              "label": "punct"
            }
          ],
          "attributes": {}
        },
        {
          "id": 24,
          "form": "Tsu",
          "kana": "Tsu",
          "lemma": "Tsu",
          "pos": "Verb conjugation ending",
          "features": [],
          "attributes": {}
        },
        {
          "id": 25,
          "form": "hand",
          "kana": "Te",
          "lemma": "hand",
          "pos": "Verb suffix",
          "features": [
            "Connect",
            "Continuous use"
          ],
          "attributes": {}
        },
        {
          "id": 26,
          "form": "I",
          "kana": "I",
          "lemma": "Is",
          "pos": "Verb stem",
          "features": [
            "A",
            "L for continuous use"
          ],
          "attributes": {}
        },
        {
          "id": 27,
          "form": "Ta",
          "kana": "Ta",
          "lemma": "Ta",
          "pos": "Verb suffix",
          "features": [
            "stop"
          ],
          "attributes": {}
        },
        {
          "id": 28,
          "form": "。",
          "kana": "",
          "lemma": "。",
          "pos": "Kuten",
          "features": [],
          "attributes": {}
        }
      ]
    }
  ],
  "status": 0,
  "message": ""
}
['Chairman', 'From the line', 'Retired', 'To the board']

Step 5: Generate a headline!

Now let's combine the keywords with the particles attached in step 4 into a string of 13 characters or less. Most of the work was already done in step 4; the assembly itself is straightforward, as sketched below.
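This is a minimal sketch distilled from create_news_topic in the code at the end: keywords are taken in descending score order, each one gets the particle linked to it by a 'case' dependency, and keywords are appended only while the total stays within 13 characters (the optional location prefix from step 3 is passed in as location):

def build_topic(keywords, tokens, limit=13, location=None):
    # keywords: keyword strings in descending score order (step 2)
    # tokens: the flattened token list from the parsing API (step 4)
    topic = location + 'で' if location else ''
    for keyword in keywords:
        for token in tokens:
            if token['form'] == keyword:
                # Attach the particle linked to this keyword via a 'case' dependency
                for dep in token.get('dependency_labels', []):
                    if dep['label'] == 'case':
                        keyword += tokens[dep['token_id']]['form']
                        break
                break
        if len(topic) + len(keyword) <= limit:
            topic += keyword
        else:
            break
    return topic

The resulting headline was: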

Retired chairman from the line

I wouldn't call it an intriguing headline; it mostly just leaves you asking, "Who retired?"

However, as I noted in step 2, the keywords contain no "who" and no company name such as "Microsoft" or "MS", so the result feels underwhelming. I therefore decided to **check objectively how good the generated headline actually is**.

Step 6: Check the quality of the generated headline using the similarity calculation API

The quality of the generated headline can also be checked with the COTOHA API, using the **similarity calculation API**. It **calculates the semantic similarity between two texts**. The similarity is returned as a value between 0 and 1, and the closer it is to 1, the more similar the texts are.

I calculated the similarity between the headline attached to the article, "Bill Gates retires from MS board", and the generated headline, "Chairman retired from the line".
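The call mirrors get_similarity in the code at the end ('kuzure' is the type used there; the two strings below are placeholders for the two headlines):

import json
import requests

token = '<access token>'
title = '<original headline attached to the article>'
topic = '<headline generated in step 5>'

response = requests.post(
    'https://api.ce-cotoha.com/api/dev/nlp/v1/similarity',
    json.dumps({'s1': title, 's2': topic, 'type': 'kuzure'}),
    headers={'Content-Type': 'application/json;charset=UTF-8',
             'Authorization': 'Bearer {}'.format(token)})
print(response.json())

It returned the following.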

{
  "result": {
    "score": 0.9716939
  },
  "status": 0,
  "message": "OK"
}

Oh, 0.97, isn't that pretty high...!? (puzzled) Well, if COTOHA says so...

Bonus

For reference, I also tried other articles.

The "Lentin" revolution at convenience stores

The article is 4 pages in total, but for the time being, I tried it only on the 1st page. https://news.yahoo.co.jp/pickup/6353834

● Extracted sentence
Nowadays, it is attracting attention as a product that symbolizes the Chin Revolution in the range that is advancing in the convenience store industry.
● Extracted keywords
['Symbol', 'Attention', 'Range', 'Convenience store', 'Chin Revolution']

** ● Generated headline **

In the symbolic attention range (Similarity: 0.45899978)

The headline is a mess no matter how you look at it... and the similarity is very low too. Since the extracted keywords are simply **joined in descending order of score** to form the string, I suppose this is what you get. Incidentally, "Lentin" (レンチン) is slang for "chin"-ing something, that is, zapping it in the microwave ("range"), which some of you may be learning here for the first time. Or rather, what on earth is the "Chin Revolution"?

80% agree with the "60 minutes of games a day" plan

This is an article about the controversial game regulation in Kagawa Prefecture. https://news.yahoo.co.jp/pickup/6353894

● Extracted sentence
A measure ordinance for "game addiction" that the Kagawa Prefectural Assembly is aiming to enforce in April.
● Extracted keywords
['Countermeasures Ordinance', 'Kagawa Prefectural Assembly', 'Enforcement', 'Game Dependence']

** ● Generated headline **

Measures Ordinance Kagawa Prefectural Assembly Enforcement (Similarity: 0.2842004)

Looking at the headline, you cannot really tell that it is about games in Kagawa Prefecture, and what this article most wants to convey is probably that **80% agree**. The similarity is very low here too. To be fair, neither the extracted sentence nor the generated headline contained that numerical information. Also, while the article gives the specific figure of 84%, the original headline rounds it to the easier-to-grasp 80%. A rough figure gets the point across better than a precise one; perhaps this kind of judgment is still a skill unique to humans.

Snow in Tokyo, sleet in the city center

This is yesterday's article. Apparently it snowed in Tokyo. The cold days continue... https://news.yahoo.co.jp/pickup/6354091

● Extracted sentence
In central Tokyo, which is more than 10 ℃ lower than noon yesterday, after observing the maximum temperature of 12.3 ℃ after midnight on the 14th, the temperature has dropped steadily and is now 2.5 ℃ at 14:00.
● Extracted keywords
['Observation', 'Temperature', '12.3°C', '2.5°C', '10°C or more']
● Extracted location information
['Central Tokyo']

** ● Generated headline **

Observed temperature in central Tokyo (Similarity: 0.99335754)

When location information is included, "central Tokyo" alone consumes 4 characters, so the rest of the headline cannot carry much information. The extracted keywords also feel too heavy on numerical values. And yet the similarity comes out extremely high, at 0.99...

Summary

Calling the headlines generated this time a great success would be a stretch, but it was fun to try. When I looked into summarization in the first place, it turns out there are roughly the following two approaches.

- **Extractive type**: pick out the sentences judged to be important from the source text and use them as the summary
- **Abstractive type**: understand the meaning of the text and compose an appropriate summary, possibly using words that do not appear in the original

The COTOHA summary API used this time is the former, the **extractive type**.

However, when trying to produce a summary under various rules and constraints, as Yahoo! News does, extraction alone is hard to work with, so I felt it would be necessary to **combine it with other services** or to use a summarization service of the latter, **abstractive type**.

Also, when it comes to saving characters, abbreviating things such as country names seems easy because there are established conventions for them, but I feel the hurdle for this kind of shortening is still high even with natural language processing technology.

I came away feeling that the day when Yahoo! News headline writing (a genuine craft) is replaced by AI will not arrive for a while yet.

In closing

I personally love natural language processing because it is simply interesting. I don't get many chances to use it at work, though, so I would like to keep enjoying it on my own time. PS4, please.

Reference site

The following Qiita article was very helpful regarding summarization.

- Sentence summarization for the era of natural language (Qiita)

[Reference] Source code

The code may be somewhat rough, but take a look if you are interested.
import requests
import pprint
import json
import re
from bs4 import BeautifulSoup

base_url = 'https://api.ce-cotoha.com/api/dev/nlp'


'''
Get access token for COTOHA API
'''
def get_access_token():
    url = 'https://api.ce-cotoha.com/v1/oauth/accesstokens'

    req_data = {
      'grantType' : 'client_credentials',
      'clientId' : 'Client ID',
      'clientSecret' : 'Client secret'
    }

    headers = {
        'Content-Type' : 'application/json'
    }

    response = requests.post(url, json.dumps(req_data), headers=headers)
    token = response.json()['access_token']
    
    return token


'''
Call the summary API
'''
def get_summary(token, document) :
    url = base_url + '/beta/summary'

    req_data = {
        'document' : document,
        'sent_len' : '1'
    }

    headers = {
        'Content-Type' : 'application/json;charset=UTF-8',
        'Authorization' : 'Bearer {}'.format(token)
    }

    response = requests.post(url, json.dumps(req_data), headers=headers)
    summary = response.json()['result']
    
    return summary


'''
Call the keyword extraction API
'''
def get_keywords(token, document):
    url = base_url + '/v1/keyword'

    req_data = {
        'document' : document,
        'type' : 'default',
        'do_segment' : True
    }

    headers = {
        'Content-Type' : 'application/json;charset=UTF-8',
        'Authorization' : 'Bearer {}'.format(token)
    }

    response = requests.post(url, json.dumps(req_data), headers=headers)
    keywords = [item.get('form') for item in response.json()['result']]
    
    return keywords


'''
Call the named entity extraction API to get information about the location
'''
def get_ne_loc(token,sentence):
    url = base_url + '/v1/ne'

    req_data = {
        'sentence' : sentence
    }

    headers = {
        'Content-Type' : 'application/json;charset=UTF-8',
        'Authorization' : 'Bearer {}'.format(token)
    }
    
    response = requests.post(url, json.dumps(req_data), headers=headers)
    ne = response.json()['result']
    
    ne_loc = []
    for item in ne:
        if item['class'] == 'LOC':
            ne_loc.append(item['form'])
    
    # The same place name can appear more than once when keeping only the surface forms, so deduplicate
    if ne_loc:
        ne_loc = list(set(ne_loc))
        
    return ne_loc

    
'''
Call the parsing API
'''
def parse_doc(token, sentence) :
    url = base_url + '/v1/parse'

    req_data = {
        'sentence':sentence
    }

    headers = {
        'Content-Type' : 'application/json;charset=UTF-8',
        'Authorization' : 'Bearer {}'.format(token)
    }
    
    response = requests.post(url, json.dumps(req_data), headers=headers)
    parsed_result = response.json()['result']
    
    tokens = []
    for tokens_ary in parsed_result:
        for token in tokens_ary['tokens']:
            if token:
                tokens.append(token)
        
    return tokens


'''
Call the similarity calculation API
'''
def get_similarity(token, doc1, doc2):
    url = base_url + '/v1/similarity'

    req_data = {
        's1' : doc1,
        's2' : doc2,
        'type' : 'kuzure'
    }

    headers = {
        'Content-Type' : 'application/json;charset=UTF-8',
        'Authorization' : 'Bearer {}'.format(token)
    }
    
    response = requests.post(url, json.dumps(req_data), headers=headers)
    similarity = response.json()['result']
        
    return similarity       


'''
Extract the article body from a Yahoo! News article URL
(Supports only a single page, does not support multiple pages or specific article formats...)
'''
def get_contents(url):
    top_page = requests.get(url) 
    soup = BeautifulSoup(top_page.text, 'lxml')
    article_url = soup.find('div',class_=re.compile('pickupMain_articleInfo')).find('a').get('href')
    article_page = requests.get(article_url) 
    soup = BeautifulSoup(article_page.text, "lxml")
    for tag in soup.find_all('p',{'class':'photoOffer'}):
        tag.decompose()
    for tag in soup.find_all('a'):
        tag.decompose()
    contents =  re.sub('\n|\u3000','',soup.find('div',class_=re.compile('articleMain')).getText());
    
    return contents


'''
Extract the title from a Yahoo! News article URL
(This is the correct answer)
'''
def get_title(url):
    top_page = requests.get(url) 
    soup = BeautifulSoup(top_page.text, "lxml")
    title = soup.find("title").getText().split(' - ')[0]
    
    return title


'''
Generate a headline (topic) for a Yahoo! News article
'''
def create_news_topic(token, contents):
    #Implemented article summary
    summary = get_summary(token, contents)
    print(summary)
    print("------------")
    
    #If the summary is 13 characters or less, return it as a topic
    if len(summary) <= 13:
        return summary[:-1]

    #Extract keywords and place names from the abstract
    keywords = get_keywords(token, summary)
    print(keywords)
    print("------------")
    ne_loc = get_ne_loc(token, summary)
    print(ne_loc)
    print("------------") 

    topic = ''
    #Add to heading if there is location information
    #Even if there are multiple, only the first one for the time being
    if ne_loc:
        topic += ne_loc[0] + 'で'
        #Remove if it is also included in the keyword
        if ne_loc[0] in keywords:
            keywords.remove(ne_loc[0])

    #Parsing the abstract
    tokens = parse_doc(token, summary)

    # Build the headline while attaching the particle for each keyword
    for keyword in keywords:
        for token in tokens:
            if token['form'] == keyword:
                print(token)
                for dependency_label in token['dependency_labels']:
                    if dependency_label['label'] == 'case':
                        keyword += tokens[int(dependency_label['token_id'])]['form']
                        break
                break
    
        if len(topic) + len(keyword) <= 13:
            topic += keyword
        else:
            return topic

    return topic


'''
Main
'''
if __name__ == '__main__':
    # URL of the Yahoo! News article to generate a headline for
    url = 'https://news.yahoo.co.jp/pickup/6354056'   

    #Extract article content and title
    contents = get_contents(url)
    title = get_title(url)
    print("------------")
    print(contents)
    print("------------")
    print(title)
    print("------------")    
    
    #Get a token for COTOHA API
    token = get_access_token()
    
    #Generate article headlines
    topic = create_news_topic(token, contents)
    print(topic)
    print("------------") 
    
    #Calculate the similarity between the original heading and the generated heading
    similarity = get_similarity(token, title, topic)['score']
    print(similarity)
    print("------------") 
