Find passages that scan as 575 in Wikipedia with Python

**Note in advance:** I wrote this during summer vacation, between long shifts at my part-time job, rubbing my sleepy eyes before bed. It works, but there may be inefficient parts, insecure parts, unused variables, and so on. If you spot something to improve, I'd be happy if you told me gently.

During this summer vacation, developing Slack bots suddenly became popular among the first-year students in our club. I rode that trend and built a bot of my own, and one of its features finds passages in Wikipedia that happen to scan as 575 (the 5-7-5 rhythm of haiku and senryu). This post is a memo of what I learned while implementing it.

Get the body of a random Wikipedia page

URL to access a random Wikipedia page

A senior member of the club had already made a Slack bot that introduces a random Wikipedia page, so I looked into how that could be done. It turns out that accessing the following URL redirects you to a random Wikipedia article.

http://ja.wikipedia.org/wiki/Special:Randompage

I had written scrapers in C# before, but this was my first time writing one in Python. Referring to the following article,

https://qiita.com/poorko/items/9140c75415d748633a10

I wrote this:

python


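import requests
from bs4 import BeautifulSoup

# Special:Randompage redirects to a random article; requests follows the redirect.
html = requests.get("http://ja.wikipedia.org/wiki/Special:Randompage").text
soup = BeautifulSoup(html, "html.parser")
# Drop <script> and <style> elements so that only readable text remains.
for script in soup(["script", "style"]):
    script.decompose()
# Split the page text at line breaks, as the full code further down does.
text = soup.get_text()
files = text.split("\n")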

(The cited article puts the text into a list, one entry per line break. Detecting 575 never needs to cross a sentence boundary, so I used that list as is.)
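As a rough illustration of the "strip scripts and styles, then split into lines" step, here is a minimal sketch that runs without network access. It swaps BeautifulSoup for the standard library's `html.parser` so that it needs no extra packages; `TextExtractor`, `page_lines`, and `SAMPLE_HTML` are hypothetical names invented for this example and are not part of the bot.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def page_lines(html):
    """Return the non-empty text lines of an HTML document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return [line.strip() for line in parser.text().split("\n") if line.strip()]


# A tiny hand-written page standing in for a downloaded article.
SAMPLE_HTML = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h2>概要[編集]</h2>\n<p>古池や蛙飛び込む水の音。</p>\n"
    "<script>var x = 1;</script></body></html>"
)

for line in page_lines(SAMPLE_HTML):
    print(line)
```

Unlike a plain `split("\n")`, this sketch also drops empty lines, which saves the tokenizer from being called on blank input.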

Detection of 575 parts

Morphological analysis

Detecting 575 means, in other words, looking at the reading of a sentence and checking whether it can be divided, word by word, into 5, 7, and 5 sounds. So you need both the reading of the sentence and its word boundaries. That is where **morphological analysis** comes in. (Strictly speaking, morphemes and words are not the same thing, but the distinction is a hassle, so I won't think about it too deeply.) First, let's count the number of sounds in a reading.

python


def howmuch(moziyomi):
    """Count the sounds (morae) in a full-width katakana reading."""
    small_kana = 'ャュョァィゥェォ'  # small kana merge with the preceding sound
    i = 0
    for chara in moziyomi:
        if chara == 'ー':  # the long-vowel mark counts as one sound
            i = i + 1
        # 12449 (ァ) to 12532 (ヴ) covers the katakana letters
        elif 12449 <= ord(chara) <= 12532 and chara not in small_kana:
            i = i + 1
    return i

When you run morphological analysis with Janome, the reading comes back in full-width katakana, so the function counts the characters in that katakana string. The long-vowel mark "ー" counts as one sound. Small katakana (ャ, ュ, ョ, ァ, ィ, ゥ, ェ, ォ) are not counted, while the small "ッ" counts as one.
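To make the counting rule concrete, here is a small self-contained sketch with a few worked examples. `count_morae` and `SMALL_KANA` are hypothetical names used only for this illustration; the rule itself is the one described above.

```python
# Small kana that merge with the preceding sound and are therefore not counted.
SMALL_KANA = set("ァィゥェォャュョ")


def count_morae(reading):
    """Count katakana letters and long-vowel marks, skipping small kana."""
    total = 0
    for ch in reading:
        if ch == "ー":
            total += 1  # the long-vowel mark counts as one sound
        elif "ァ" <= ch <= "ヴ" and ch not in SMALL_KANA:
            total += 1  # an ordinary katakana letter, including the small ッ
    return total


for word in ["シャシン", "コンピューター", "ガッコウ"]:
    print(word, count_morae(word))
```

Here "シャシン" counts as 3 (シャ, シ, ン), "コンピューター" as 6 (コ, ン, ピュ, ー, タ, ー), and "ガッコウ" as 4, because the small ッ is counted.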

Next is the 575 judgment part

python


        fin = False
        flag = False
        for file in files:
            # print(file)
            s = file
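            if s.find('編集') > 0:  # "編集" is the "edit" link on Japanese Wikipedia
                flag = True
            if flag:
                # Tokenize once and reuse the result for all three lists.
                tokens = list(t.tokenize(s))
                words = [token.surface for token in tokens]
                hinsi = [token.part_of_speech.split(',')[0] for token in tokens]
                yomi = [token.reading for token in tokens]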

                for i in range(len(words)):
                    if fin:
                        break
                    uta = ""
                    utayomi = ""
                    kami = ""
                    naka = ""
                    simo = ""
                    keyword = ""
                    if hinsi[i] == "名詞":  # 名詞 = noun; a check for "動詞" (verb) is disabled
                        keyword = words[i]
                        num = 0
                        utastat = 0
                        count = i
                        while num < 18 and count < len(yomi) and yomi[count].find("*") < 0:
                            num = num + howmuch(yomi[count])
                            uta = uta + words[count]
                            utayomi = utayomi + yomi[count]

                            if utastat == 0:
                                kami = kami + words[count]
                                if num > 5:
                                    break
                                elif num == 5:
                                    utastat = 1
                            elif utastat == 1:
                                naka = naka + words[count]
                                if num > 12:
                                    break
                                elif num == 12:
                                    utastat = 2
                            else:
                                simo = simo + words[count]

                            if num == 17:
                                if utayomi.find("。") >= 0:
                                    break  # crosses a sentence boundary; give up on this starting word
                                elif (utayomi.find("(") >= 0 and utayomi.find(")") >= 0) or (
                                        utayomi.find("「") >= 0 and utayomi.find("」") >= 0) or (
                                        utayomi.find("<") >= 0 and utayomi.find(">") >= 0) or (
                                        utayomi.find("『") >= 0 and utayomi.find("』") >= 0):
                                    fin = True
                                    break
                                elif utayomi.find("(") >= 0 or utayomi.find(")") >= 0 or utayomi.find(
                                        "「") >= 0 or utayomi.find("」") >= 0 or utayomi.find("<") >= 0 or utayomi.find(
                                    ">") >= 0 or utayomi.find("『") >= 0 or utayomi.find(
                                    "』") >= 0:
                                    break  # unmatched bracket; give up on this starting word
                                elif uta != "" and uta.find("リンク元") < 0:  # "リンク元" = "what links here"
                                    fin = True
                                    break
                            count = count + 1

What we are doing here

- Go through the lines one by one, ignoring every line until the word "edit" ("編集" on Japanese Wikipedia) appears. (Otherwise the result may include strings that are common to every page, such as "main page".)
- When a noun or verb appears, start counting sounds from it. (I figured a 575 that starts with a noun or verb would read as a natural senryu. In the code shown, the verb check is commented out, so only nouns are used.)
- Check whether the reading contains the symbol "*"; if it does, move on to the next noun or verb and start over. (Janome returns "*" as the reading of tokens it cannot read, such as numbers.)
- Walk through the text word by word; if the words do not break at exactly 5, 7, and 5, move on to the next noun or verb and count again from there.
- If the passage crosses a "。", move on to the next noun or verb and start over. (A 575 that straddles two sentences reads unnaturally.)
- If the passage contains an opening bracket, check that the closing bracket is also within the 575. (This check is not sufficient, though: a reversed pair such as "」「" also passes.)
- If the passage contains the string "link source" ("リンク元"), start over. (Otherwise it returns senryu that are not specific to the page, such as "update of link source related page".)
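The steps above can be sketched in a compact form that runs without Janome. This is only an illustration under stated assumptions: `TOKENS` is a hand-written list of (surface, reading) pairs, not actual Janome output, and `count_morae`, `split_575`, and `brackets_ok` are hypothetical names. `brackets_ok` is a stack-based alternative to the presence-only bracket check, so that a reversed pair like "」「" is rejected.

```python
# Hand-written (surface, reading) pairs for illustration; NOT real Janome output.
TOKENS = [
    ("古池", "フルイケ"), ("や", "ヤ"),
    ("蛙", "カエル"), ("飛び込む", "トビコム"),
    ("水", "ミズ"), ("の", "ノ"), ("音", "オト"),
]

SMALL_KANA = set("ァィゥェォャュョ")
PAIRS = {"(": ")", "「": "」", "『": "』", "<": ">"}


def count_morae(reading):
    """Count katakana letters and long-vowel marks, skipping small kana."""
    return sum(
        1 for ch in reading
        if ch == "ー" or ("ァ" <= ch <= "ヴ" and ch not in SMALL_KANA)
    )


def split_575(tokens, start=0):
    """Return (kami, naka, simo) if the tokens from `start` break at exactly
    5, 12, and 17 sounds on token boundaries; otherwise return None."""
    targets = [5, 12, 17]
    parts = ["", "", ""]
    total = 0
    stage = 0
    for surface, reading in tokens[start:]:
        total += count_morae(reading)
        parts[stage] += surface
        if total > targets[stage]:
            return None  # a word straddles the 5/7/5 boundary
        if total == targets[stage]:
            stage += 1
            if stage == 3:
                return tuple(parts)
    return None  # ran out of tokens before reaching 17 sounds


def brackets_ok(text):
    """Return True if every bracket in `text` is opened and closed in order."""
    stack = []
    closers = set(PAIRS.values())
    for ch in text:
        if ch in PAIRS:
            stack.append(PAIRS[ch])  # remember which closer we expect
        elif ch in closers:
            if not stack or stack.pop() != ch:
                return False
    return not stack


print(split_575(TOKENS))
print(brackets_ok("「水」の音"), brackets_ok("」水「"))
```

Starting from the first token the sample splits into 古池や / 蛙飛び込む / 水の音, while starting from the second token fails because 飛び込む would straddle the first boundary.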

If no 575 is found, repeat the whole procedure: fetch another random page and do the same.
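One possible way to structure that retry is sketched below, with an upper bound on attempts and a pause between requests. Everything here is an assumption made for illustration: `find_with_retries`, `fetch_lines`, `detect`, `max_tries`, and `delay` are hypothetical names, where `fetch_lines()` stands for "download a random page and split it into lines" and `detect(lines)` for the 575 search, returning None when nothing is found.

```python
import time


def find_with_retries(fetch_lines, detect, max_tries=10, delay=1.0):
    """Call fetch_lines() until detect(lines) returns something other than None.

    Returns (result, attempts); result is None if every attempt failed.
    """
    for attempt in range(1, max_tries + 1):
        result = detect(fetch_lines())
        if result is not None:
            return result, attempt
        if attempt < max_tries:
            time.sleep(delay)  # pause before asking the server for another page
    return None, max_tries


# Stand-ins for the real network fetch and the real 575 detector.
fake_pages = iter([["no luck"], ["still nothing"], ["古池や蛙飛び込む水の音"]])
found, tries = find_with_retries(
    fetch_lines=lambda: next(fake_pages),
    detect=lambda lines: lines[0] if "古池" in lines[0] else None,
    delay=0,
)
print(found, tries)
```

Bounding the number of attempts keeps the bot from looping forever on an unlucky run, and passing the fetch and detect steps in as functions makes the loop easy to try out without touching Wikipedia.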

The full picture of the crappy code I wrote

python



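    import requests
    import bs4
    from janome.tokenizer import Tokenizer

    def howmuch(moziyomi):
        """Count the sounds (morae) in a full-width katakana reading."""
        small_kana = 'ャュョァィゥェォ'  # small kana merge with the preceding sound
        i = 0
        for chara in moziyomi:
            if chara == 'ー':  # the long-vowel mark counts as one sound
                i = i + 1
            # 12449 (ァ) to 12532 (ヴ) covers the katakana letters
            elif 12449 <= ord(chara) <= 12532 and chara not in small_kana:
                i = i + 1
        return i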

    hujubun = True
    while hujubun:
        html = requests.get("http://ja.wikipedia.org/wiki/Special:Randompage").text
        soup = bs4.BeautifulSoup(html, "html.parser")
        for script in soup(["script", "style"]):
            script.decompose()
        text = soup.get_text()
        # print(text)
        t = Tokenizer()
        files = text.split("\n")
        fin = False
        flag = False
        for file in files:
            # print(file)
            s = file
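            if s.find('編集') > 0:  # "編集" is the "edit" link on Japanese Wikipedia
                flag = True
            if flag:
                # Tokenize once and reuse the result for all three lists.
                tokens = list(t.tokenize(s))
                words = [token.surface for token in tokens]
                hinsi = [token.part_of_speech.split(',')[0] for token in tokens]
                yomi = [token.reading for token in tokens]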

                for i in range(len(words)):
                    if fin:
                        break
                    uta = ""
                    utayomi = ""
                    kami = ""
                    naka = ""
                    simo = ""
                    keyword = ""
                    if hinsi[i] == "名詞":  # 名詞 = noun; a check for "動詞" (verb) is disabled
                        keyword = words[i]
                        num = 0
                        utastat = 0
                        count = i
                        while num < 18 and count < len(yomi) and yomi[count].find("*") < 0:
                            num = num + howmuch(yomi[count])
                            uta = uta + words[count]
                            utayomi = utayomi + yomi[count]

                            if utastat == 0:
                                kami = kami + words[count]
                                if num > 5:
                                    break
                                elif num == 5:
                                    utastat = 1
                            elif utastat == 1:
                                naka = naka + words[count]
                                if num > 12:
                                    break
                                elif num == 12:
                                    utastat = 2
                            else:
                                simo = simo + words[count]

                            if num == 17:
                                if utayomi.find("。") >= 0:
                                    break  # crosses a sentence boundary; give up on this starting word
                                elif (utayomi.find("(") >= 0 and utayomi.find(")") >= 0) or (
                                        utayomi.find("「") >= 0 and utayomi.find("」") >= 0) or (
                                        utayomi.find("<") >= 0 and utayomi.find(">") >= 0) or (
                                        utayomi.find("『") >= 0 and utayomi.find("』") >= 0):
                                    fin = True
                                    break
                                elif utayomi.find("(") >= 0 or utayomi.find(")") >= 0 or utayomi.find(
                                        "「") >= 0 or utayomi.find("」") >= 0 or utayomi.find("<") >= 0 or utayomi.find(
                                    ">") >= 0 or utayomi.find("『") >= 0 or utayomi.find(
                                    "』") >= 0:
                                    break  # unmatched bracket; give up on this starting word
                                elif uta != "" and uta.find("リンク元") < 0:  # "リンク元" = "what links here"
                                    fin = True
                                    break
                            count = count + 1
        # Accept the page only if a complete 575 was actually found (fin is True).
        if fin and uta.find("リンク元") < 0 and uta.find("Used under") < 0:
            hujubun = False
    print(kami + "\n" + naka + "\n" + simo)

I think this will probably work. The code has not been thoroughly reviewed, so there may be unused variables and obviously inefficient parts, but I'm the type who grows with praise, so please go easy on me when you point things out...
