[PYTHON] I tried to extract players and skill names from sports articles

Motive This is a post for the [Qiita x COTOHA API present plan].

I tried a different API from the one used in "Problem that give me chocolate is not made even if I analyze it with COTOHA API".

This time it is the Named entity recognition API (/nlp/v1/ne).

With MeCab, I feel that person names cannot be extracted without learning proper nouns and registering them in a dictionary. KNP seems to have good accuracy, but the package itself is heavy. :scream: Also, morphological analyzers output even the part of speech accurately, perhaps because their machine-learned classifiers are excellent, but I feel they do not go as far as classifying in detail the nouns that appear so often in a sentence. COTOHA classifies nouns in detail with the API alone.

To get a quick idea of how well proper nouns can be extracted, I tried to output person names and technique names from sports articles.

Environment

Dataset Tokyo Sports. The lofty reason for choosing it is that this sports newspaper is not sold in the area where I live. :camera_with_flash:

Method As mentioned above, I use the named entity extraction of the COTOHA API: https://api.ce-cotoha.com/contents/reference/apireference.html#parsing_io_part

Players (persons) are extracted with x["class"] == "PSN" and x["extended_class"] == "", and technique names with x["class"] == "ART" and x["extended_class"] in ["Doctrine_Method_Other"]. Doctrine_Method_Other means "doctrine / method name (other)".

name  Description
ORG   Organization name
PSN   Person name
LOC   Place
ART   Unique object name
DAT   Date expression
TIM   Time expression
NUM   Numerical expression
MNY   Amount expression
PCT   Percentage expression
OTH   Other
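
The two conditions above can be tried offline before calling the API. The following is only a minimal sketch: the items in `sample_result` are made up by hand to look like the "result" list of the API, they carry only the three fields the conditions touch, and `is_technique` is just an illustrative name.

```python
# Hypothetical items shaped like the "result" list of /nlp/v1/ne.
# Only the fields used by the conditions are included; the values are made up.
sample_result = [
    {"form": "Minoru Suzuki", "class": "PSN", "extended_class": ""},
    {"form": "Lariat", "class": "ART", "extended_class": "Doctrine_Method_Other"},
    {"form": "Tokyo Dome", "class": "LOC", "extended_class": ""},
    {"form": "WWE", "class": "ORG", "extended_class": ""},
]


def is_person(item):
    """Person name with an empty extended class."""
    return item["class"] == "PSN" and item["extended_class"] == ""


def is_technique(item):
    """Candidate technique name: ART / Doctrine_Method_Other."""
    return (item["class"] == "ART"
            and item["extended_class"] == "Doctrine_Method_Other")


persons = [item["form"] for item in sample_result if is_person(item)]
techniques = [item["form"] for item in sample_result if is_technique(item)]
print(persons)
print(techniques)
```

Keeping each condition as a small function makes it easy to swap in another class / extended_class pair when experimenting.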

Development

Script

Script code (main.py)
import argparse
import requests
from bs4 import BeautifulSoup
import json

# --- Get these 4 parameters from the Portal ---
# BASE_URL must end with "/" because the endpoint paths are appended to it as-is.
PUBLISH_URL = "--- get your parameter ---"
CLIENT_ID = "--- get your parameter ---"
CLIENT_SECRET = "--- get your parameter ---"
BASE_URL = "--- get your parameter ---"


class COTOHA:

    def __init__(self):
        self._token = self._getAccessToken()

    def _getAccessToken(self):
        header = {"Content-Type": "application/json"}
        contents = {
            "grantType": "client_credentials",
            "clientId": CLIENT_ID,
            "clientSecret": CLIENT_SECRET
        }
        raw_res = requests.post(PUBLISH_URL, headers=header, json=contents)
        response = raw_res.json()
        return response["access_token"]

    def compose(self, sentence):
        header = {
            "Authorization": "Bearer {}".format(self._token),
            "Content-Type": "application/json"
        }
        contents = {
            "sentence": sentence
        }
        raw_res = requests.post(
            BASE_URL +
            "nlp/v1/parse",
            headers=header,
            json=contents)
        response = raw_res.json()
        return response

    def properNoun(self, sentence):
        header = {
            "Authorization": "Bearer {}".format(self._token),
            "Content-Type": "application/json"
        }
        contents = {
            "sentence": sentence
        }
        raw_res = requests.post(
            BASE_URL +
            "nlp/v1/ne",
            headers=header,
            json=contents)
        response = raw_res.json()
        return response

    def keyword(self, sentence):
        header = {
            "Authorization": "Bearer {}".format(self._token),
            "Content-Type": "application/json"
        }
        contents = {
            "document": sentence
        }
        raw_res = requests.post(
            BASE_URL +
            "nlp/v1/keyword",
            headers=header,
            json=contents)
        response = raw_res.json()
        return response

    def coreference(self, sentence):
        header = {
            "Authorization": "Bearer {}".format(self._token),
            "Content-Type": "application/json"
        }
        contents = {
            "document": sentence
        }
        raw_res = requests.post(
            BASE_URL +
            "nlp/v1/coreference",
            headers=header,
            json=contents)
        response = raw_res.json()
        return response


def extract_norn_list(_apiobj, contents, condition):
    dst = []
    for p in contents:
        items = _apiobj.properNoun(p.text)["result"]
        _raw = list(filter(condition, items))

        # print(_raw)
        # Skip a name if it is contained in an already collected name
        # (this excludes abbreviations such as a surname only)
        for _p in _raw:
            name = _p["form"]
            _exist = False
            for pname in dst:
                if name in pname:
                    _exist = True
            if not _exist:
                dst.append(name)
    return dst


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--url", required=True, help="URL of the article to analyze")
    args = parser.parse_args()

    # Create the API object
    cotoha = COTOHA()

    # Get the article from the URL (specific to the Tokyo Sports page layout)
    res = requests.get(args.url)
    soup = BeautifulSoup(res.text, 'html.parser')
    title_text = soup.find('title').get_text()
    contents = soup.find('div', {"class": "detail-content"}).find_all("p")

    # Extraction conditions
    def is_person(x):
        return x["class"] == "PSN" and x["extended_class"] == ""

    def is_attack(x):
        return (x["class"] == "ART"
                and x["extended_class"] in ["Doctrine_Method_Other"])

    # Output the player names
    print(extract_norn_list(cotoha, contents, is_person))
    # Output the technique names
    print(extract_norn_list(cotoha, contents, is_attack))

if __name__ == "__main__":
    main()
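
A note on the abbreviation check in extract_norn_list: it only skips a name when a longer name containing it was collected earlier, so a short name that appears first stays in the list. The sketch below is one possible variant, not part of the script above; `add_unique` is a hypothetical helper name, and the substring comparison is the same simple heuristic as in the script.

```python
def add_unique(names, dst=None):
    """Collect names, skipping any name contained in an already kept one.

    Unlike the loop in the script, a longer name that arrives later replaces
    the shorter names it contains, so which names are kept does not depend
    on the order of appearance.
    """
    dst = [] if dst is None else dst
    for name in names:
        if any(name in kept for kept in dst):
            continue  # same name, or an abbreviation of a kept name
        # Drop shorter names that the new, longer name contains.
        dst[:] = [kept for kept in dst if kept not in name]
        dst.append(name)
    return dst


unique_names = add_unique(["Minoru Suzuki", "Minoru", "Moxley", "John Moxley"])
print(unique_names)  # ['Minoru Suzuki', 'John Moxley']
```

Because it is plain substring matching, two different people whose names happen to overlap would still be merged, so this is only a rough cleanup.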

Command

python main.py --url https://www.tokyo-sports.co.jp/prores/ddt/1754700/

Results

I ran the script on two articles.

[New Japan 1.5 Tokyo Dome] Minoru attacked Moxley, who had defended the US Championship "Who are you selling fights to!"

https://www.tokyo-sports.co.jp/prores/njpw/1682622/

dataset

In the IWGP / US Heavyweight Championship match held at New Japan Pro-Wrestling's biggest box office "Wrestle Kingdom 14" (Tokyo Dome on the 5th), champion John Moxley (34) was the IWGP tag champion Juice Robinson (30). He rejected the challenge and succeeded in his first defense.
At the Tokyo Dome tournament the night before (4th), Moxley regained the title from Lance Archer (32). Juice won the Tag Team Championship in combination with David Finlay (26). The next day, it was a decisive battle between the new champions, but Moxley had robbed Juice of the title in June last year and had declared that he would settle on the ring the night before.
Juice took the lead in the early stages, but Moxley took out a chair outside the ring and hit him on the back. Furthermore, he bit Juice's forehead. The rough man who rampaged as a "mad dog" during his WWE era forcibly regained his pace.
Juice counterattacked with a daring high-angle power bomb, but the champion fired a series of unexpected attacks with a four-shaped iron pillar from a four-legged character. The challenger went from an avalanche brainbuster to a Jackhammer and a German. He evaded Moxley's Deslider (double-armed DDT) and knocked him down with a lariat.
However, the champion landed a strong running knee out of an exchange of blows. After reversing Juice's Pulp Friction, he hit a DDT followed by his deadly Deslider and took the 3 count at 12 minutes 48 seconds.
After the match, the entrance theme song was played, and Minoru Suzuki (51) suddenly appeared. He was hit by Deslider from Moxley at the Hiroshima tournament on December 8th last year, and he cannot hide his anger with a rugged expression. After taking off the jersey on the flower road and getting ready for battle, he met the champion and elbow on the ring. Powerful Minoru KOed Moxley with a Gotch-type pile driver from rear-naked choke.
Minoru grabbed the microphone and declared war, "Who are you selling fights to, this Yarrow! I'm Minoru Suzuki, a professional wrestler. I'll buy this fight!" The outbreak of the "rabies" vs. "bad guys" conflict over the US Championship has given off a dangerous scent.
Minoru's story "Who are you selling fights to? Hey. I was waiting for you to come in front of me. John Moxley ... No, John Boy, take care of me. I'll kill you."
Juice's story "Everything ends here. Jon Moxley was stronger than me today. I couldn't surpass it again. I thought about today after yesterday's match. Until then, today's match. I didn't think about that. "

output

['John Moxley', 'Lance Archer', 'David finlay', 'Minoru Suzuki', 'John Boy']
['Foot 4 character consolidation', 'Avalanche', 'Jackhammer', 'Lariat', 'Rear-naked choke']

[New Japan 1.4 Tokyo Dome] Naito regained the IC title in a comeback "The purpose is not this belt"

https://www.tokyo-sports.co.jp/prores/njpw/1681815/

dataset

Tetsuya Naito (37) defeated champion Jay White (27) at the IWGP Intercontinental (IC) Championship held at New Japan Pro-Wrestling's biggest box office "Wrestle Kingdom 14" (Tokyo Dome on the 4th). In addition to regaining the title, he advanced to a double title match with the IWGP Heavyweight Champion (Kazuchika Okada VS Kota Ibushi's winner) at the Tokyo Dome Tournament on the 5th.
Lost to Jay at the Kobe tournament in September last year, and fell from the IC title for the second time last year. He also experienced the humiliation of the nomination "0" at the "Pro Wrestling Awards" established by the Tokyo Sports Newspaper. However, a large crowd is waiting for the resurrection of the "out of control man". When he pushed his back with a big Naito call from the beginning, he gave Jay a merciless boo.
Naito took the lead by shooting a neckbreaker with an apron outside the venue. However, Jay's second outer road pulls Naito's leg from the outside and disturbs the pace. The champion focused on Naito's left knee and attacked. Naito jumps from the corner and fights back with Frankensteiner. It's a low-altitude drop kick that is skewered after spitting on the opponent's face.
It seemed that he would keep the pace, but he writhed in agony after taking Jay's DDT and had his left knee attacked again. He was thrown out of the ring with a back drop, and his disadvantage did not change. In addition, his knee was squeezed with the four back legs.
In a big pinch, Naito reached the ropes for a break with a grimace on his face. Once he managed to escape, he counterattacked with a kick, then followed with an onslaught of a spine buster, a rotary DDT, an avalanche Frankensteiner, and Gloria. The referee broke into the gap when the referee went down due to an accident, but he was repulsed by a sneak attack.
Naito, who played the game, fired a series of Coryend-style Destino. After completely preventing Jay's deadly Blade Runner (transformed face crushing), he finally took 3 counts with the whole body of Destino.
Victory in a fierce battle at 33 minutes and 54 seconds. "Uncontrollable man" who has been advocating the ambition of IWGP and IC, two crowns since January last year, will challenge the big stage of the generation to complete revival.
[Naito's story] "The purpose of this two-game series is not to take this belt. I'm glad that the customer said" Congratulations to Naito. "But Tranquilo. It's not the purpose of this time, so there. Well, which is tomorrow's opponent? My plan is okada. Ideal is okada. Come on. "
[Jay White's story] "Where did he (Naito) go ... I was unfortunately one of the supporting characters in the story that everyone made. Everyone wanted Jay White to lose. It must have been. Naito, who you like, won. Why don't you laugh. My new Destino ... Fate begins tomorrow. "

output

['Tetsuya Naito', 'Jay White', 'Kazuchika Okada', 'Kota Ibushi', 'Destino...Destiny']
['Neckbreaker', 'Back drop', 'Foot 4 character consolidation', 'Spine Buster']

Consideration

- The player names are extracted correctly, except for "Destino...Destiny". General person names seem to be classified without problems.
- Technique names, unfortunately, do not appear as a category in the COTOHA API classification. After looking at the API output several times, the combination that seemed most likely to pick them up was class: ART, extended_class: Doctrine_Method_Other, so I used it, but "High angle power bomb" and "Coryend type Destino" are not covered. If class: ART, extended_class: Product is added as a second condition, things other than technique names are extracted as well, so reaching 100% was hard. :tired_face:
- For specialized documents rather than sports articles, it may be more effective, because the following type parameters can be added to the API. (They are available to Enterprise users only, so they are a paid feature.)
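
Looking for the class / extended_class combination that catches the most technique names can be done by tallying the pairs in the API output. This is a minimal sketch with hand-made items; `tally_classes` is a hypothetical helper, and in practice the items would come from properNoun(...)["result"].

```python
from collections import Counter


def tally_classes(items):
    """Count (class, extended_class) pairs in a list of NE result items."""
    return Counter((item["class"], item["extended_class"]) for item in items)


# Hypothetical items; the values are made up for illustration.
items = [
    {"form": "Lariat", "class": "ART", "extended_class": "Doctrine_Method_Other"},
    {"form": "Back drop", "class": "ART", "extended_class": "Doctrine_Method_Other"},
    {"form": "Jay White", "class": "PSN", "extended_class": ""},
]

counts = tally_classes(items)
for (cls, ext), n in counts.most_common():
    # Show "-" when the extended class is empty.
    print(cls, ext or "-", n)
```

Printing the `form` values next to each pair as well would show which pairs mix technique names with unrelated words.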

param         name
IT            Computer / Information / Communication
automobile    Automobile
chemistry     Chemical / petroleum industry
company       Company
construction  Civil engineering and construction
economy       Economy / Law
energy        Electric power / energy
institution   Institution / organization
machinery     Machinery
medical       Medicine
metal         Non-ferrous / metal
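
I could not try these values myself, so the following is only a sketch of how a request body might be built. It assumes that the request field is called `dic_type` and takes a list of the values in the table; that field name is my assumption and should be checked against the API reference. `build_ne_request` is a hypothetical helper.

```python
def build_ne_request(sentence, dic_types=None):
    """Build a JSON body for /nlp/v1/ne.

    `dic_type` is an assumed field name for the values in the table above;
    check the API reference before relying on it.
    """
    body = {"sentence": sentence}
    if dic_types:
        # Only add the field when at least one dictionary is requested.
        body["dic_type"] = list(dic_types)
    return body


body = build_ne_request("sample sentence", dic_types=["medical", "IT"])
print(body)
```

The returned dict could be passed as `json=` to requests.post in the same way as in properNoun.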

PostScript I said that the accuracy of person name extraction is good, but for some reason the recently retired "Jushin Thunder Liger" was not extracted correctly. He was classified as "ART: Unique object name". :japanese_ogre: Wouldn't it be better to send a support request to the staff of the talent directory? :thinking: :sushi:
