[PYTHON] [PowerShell] Get the reading of a Japanese string

What I made

> "納豆(遺伝子組み換えでない)"|Get-ReadingWithSudachi|fl

Line     : 納豆(遺伝子組み換えでない)
Reading  : ナットウ(イデンシクミカエデナイ)
Tokenize : 納豆(ナットウ)/(/遺伝子(イデンシ)/組み換え(クミカエ)/で/ない/)
Markup   : <p><ruby>納豆<rt>ナットウ</rt></ruby>(<ruby>遺伝子<rt>イデンシ</rt></ruby>
           <ruby>組み換え<rt>クミカエ</rt></ruby>でない)</p>

Code

Environment:

> $PSVersionTable

Name                           Value
----                           -----
PSVersion                      7.0.3
PSEdition                      Core
GitCommitId                    7.0.3
OS                             Microsoft Windows 10.0.18362
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0

The function calls Invoke-SudachiTokenizer, the SudachiPy-based morphological analysis wrapper from my earlier post ([PowerShell] Morphological analysis with SudachiPy).
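For orientation, the sketch below shows the shape of one result object as Get-ReadingWithSudachi reads it: a line property plus a parsed list whose tokens carry surface, pos, and reading. Those property names come from the function that follows. The token values are hand-written for illustration and are an assumption, not captured SudachiPy output.

```powershell
# Hand-written sample (an assumption) mimicking one object emitted by Invoke-SudachiTokenizer.
# Only the property names are taken from Get-ReadingWithSudachi; the values are illustrative.
$sample = [PSCustomObject]@{
    line   = '納豆(遺伝子組み換えでない)'
    parsed = @(
        [PSCustomObject]@{ surface = '納豆';     pos = '名詞';     reading = 'ナットウ' }
        [PSCustomObject]@{ surface = '(';        pos = '補助記号'; reading = '' }
        [PSCustomObject]@{ surface = '遺伝子';   pos = '名詞';     reading = 'イデンシ' }
        [PSCustomObject]@{ surface = '組み換え'; pos = '名詞';     reading = 'クミカエ' }
        [PSCustomObject]@{ surface = 'で';       pos = '助動詞';   reading = 'デ' }
        [PSCustomObject]@{ surface = 'ない';     pos = '形容詞';   reading = 'ナイ' }
        [PSCustomObject]@{ surface = ')';        pos = '補助記号'; reading = '' }
    )
}

# Joining the token surfaces should give back the original line.
($sample.parsed.surface -join '') -eq $sample.line
```

A fixed object like this makes it possible to step through the loop logic without having SudachiPy installed.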

function Get-ReadingWithSudachi {
    param (
        [switch]$readingOnly,
        [switch]$ignoreParen
    )
    $ret = New-Object System.Collections.ArrayList
    $tokenizedResults = $input | Invoke-SudachiTokenizer -ignoreParen:$ignoreParen
    foreach ($result in $tokenizedResults) {
        $reading = New-Object System.Text.StringBuilder
        $tokenize = New-Object System.Collections.ArrayList
        $markup = New-Object System.Collections.ArrayList

        foreach ($token in $result.parsed) {

            $tokenSurface = $token.surface
            # Symbols, whitespace, katakana, and half-/full-width alphanumerics need no reading
            if ($token.pos -match "記号|空白" -or $tokenSurface -match "^([ァ-ヴー]|[a-zA-Za-zA-Z]|[0-90-9]|[\W\s])+$") {
                $tokenReading = $tokenSurface
                $tokenInfo = $tokenSurface
                $tokenMarkup = $tokenSurface
            }
            elseif (-not $token.reading) {
                $tokenReading = $tokenSurface
                $tokenInfo = "$($tokenSurface)(?)"
                $tokenMarkup = $tokenSurface
            }
            else {
                $tokenReading = $token.reading
                # Hiragana-only tokens are shown as-is; everything else gets its reading attached
                $tokenInfo = ($tokenSurface -match "^[ぁ-ん]+$") ?
                    $tokenSurface :
                    "$($tokenSurface)($tokenReading)"
                $tokenMarkup = ($tokenSurface -match "^[ぁ-ん]+$") ?
                    $tokenSurface :
                    ("<ruby>{0}<rt>{1}</rt></ruby>" -f $tokenSurface, $tokenReading)
            }
            $reading.Append($tokenReading) > $null
            $tokenize.Add($tokenInfo) > $null
            $markup.Add($tokenMarkup) > $null
        }

        $ret.Add([PSCustomObject]@{
            Line = $result.line
            Reading = $reading.ToString()
            Tokenize = $tokenize -join "/"
            Markup = "<p>{0}</p>" -f ($markup -join "")
        }) > $null

    }

    return ($readingOnly)? $ret.reading : $ret
}
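The per-token decision inside the loop can be tried in isolation. The following is a minimal sketch, assuming the character classes are katakana, hiragana, and half-/full-width alphanumerics, and that pos holds Japanese part-of-speech labels. Format-RubyToken is a hypothetical helper written for this illustration, not part of the function above.

```powershell
# Hypothetical helper (illustration only): returns the ruby markup for a single token.
function Format-RubyToken {
    param (
        [string]$Surface,
        [string]$Reading,
        [string]$Pos
    )
    # Katakana, alphanumerics (half- and full-width), symbols, whitespace: no ruby needed
    $noRubyNeeded = '^([ァ-ヴー]|[a-zA-Za-zA-Z]|[0-90-9]|[\W\s])+$'
    if ($Pos -match '記号|空白' -or $Surface -match $noRubyNeeded) { return $Surface }
    # No reading available: fall back to the surface form
    if (-not $Reading) { return $Surface }
    # Hiragana-only tokens already show their own reading
    if ($Surface -match '^[ぁ-ん]+$') { return $Surface }
    '<ruby>{0}<rt>{1}</rt></ruby>' -f $Surface, $Reading
}

Format-RubyToken -Surface '納豆' -Reading 'ナットウ' -Pos '名詞'      # kanji: wrapped in <ruby>
Format-RubyToken -Surface 'で' -Reading 'デ' -Pos '助動詞'            # hiragana: returned as-is
Format-RubyToken -Surface 'コーヒー' -Reading 'コーヒー' -Pos '名詞'  # katakana: returned as-is
```

Only tokens that contain kanji (or mix kanji with kana) end up wrapped in ruby tags, which keeps the generated HTML short.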

HTML markup

202009192183358.png

Sometimes the analysis gets technical terms wrong, as in the image above.

With one or two lines you can check the result by eye, but that does not scale to hundreds of lines, so I added a Markup property that outputs HTML markup.

(cat hogehoge.txt |Get-ReadingWithSudachi).markup|Out-File hogehoge.html

202009192184427.png

Converting the output to HTML like this and checking it in a browser should reduce oversights to some extent.
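If the browser guesses the wrong encoding for a bare list of <p> elements, wrapping the fragments in a minimal page with an explicit charset may help. This is a sketch under that assumption. ConvertTo-RubyHtmlPage, its parameters, and the lang="ja" attribute are hypothetical choices for illustration, not part of the original script.

```powershell
# Hypothetical helper (illustration only): wraps <p>...</p> fragments in a minimal HTML page.
function ConvertTo-RubyHtmlPage {
    param (
        [string[]]$Markup,
        [string]$Title = 'Readings'
    )
    $body = $Markup -join "`n"
    # Expandable here-string; the closing "@ must sit at the start of its line
    @"
<!DOCTYPE html>
<html lang="ja">
<head>
<meta charset="utf-8">
<title>$Title</title>
</head>
<body>
$body
</body>
</html>
"@
}

# Example with a hand-written fragment:
ConvertTo-RubyHtmlPage -Markup @('<p><ruby>納豆<rt>ナットウ</rt></ruby></p>')

# With the function from this post, usage might look like:
# ConvertTo-RubyHtmlPage -Markup (cat hogehoge.txt | Get-ReadingWithSudachi).markup | Out-File hogehoge.html
```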
