> "納豆(遺伝子組換えでない)" | Get-ReadingWithSudachi | fl
Line     : 納豆(遺伝子組換えでない)
Reading  : ナットウ(イデンシクミカエデナイ)
Tokenize : 納豆(ナットウ)/(/遺伝子(イデンシ)/組換え(クミカエ)/で/ない/)
Markup   : <p><ruby>納豆<rt>ナットウ</rt></ruby>(<ruby>遺伝子<rt>イデンシ</rt></ruby>
           <ruby>組換え<rt>クミカエ</rt></ruby>でない)</p>
Environment:
> $PSVersionTable
Name                           Value
----                           -----
PSVersion                      7.0.3
PSEdition                      Core
GitCommitId                    7.0.3
OS                             Microsoft Windows 10.0.18362
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0
The function below builds on the SudachiPy morphological analysis written in an earlier post ([PowerShell] Morphological analysis with SudachiPy).
function Get-ReadingWithSudachi {
    param (
        [switch]$readingOnly,
        [switch]$ignoreParen
    )
    $ret = New-Object System.Collections.ArrayList
    $tokenizedResults = $input | Invoke-SudachiTokenizer -ignoreParen:$ignoreParen
    foreach ($result in $tokenizedResults) {
        $reading = New-Object System.Text.StringBuilder
        $tokenize = New-Object System.Collections.ArrayList
        $markup = New-Object System.Collections.ArrayList
        foreach ($token in $result.parsed) {
            $tokenSurface = $token.surface
            # Symbols, whitespace, katakana and half-/full-width alphanumerics are passed through as-is
            if ($token.pos -match "記号|空白" -or $tokenSurface -match "^([ァ-ヴ]|[a-zA-Za-zA-Z]|[0-90-9]|[\W\s])+$") {
                $tokenReading = $tokenSurface
                $tokenInfo = $tokenSurface
                $tokenMarkup = $tokenSurface
            }
            elseif (-not $token.reading) {
                $tokenReading = $tokenSurface
                $tokenInfo = "$($tokenSurface)(?)"
                $tokenMarkup = $tokenSurface
            }
            else {
                $tokenReading = $token.reading
                # Hiragana-only surfaces need no reading annotation
                $tokenInfo = ($tokenSurface -match "^[ぁ-ん]+$")?
                    $tokenSurface :
                    "$($tokenSurface)($tokenReading)"
                $tokenMarkup = ($tokenSurface -match "^[ぁ-ん]+$")?
                    $tokenSurface :
                    "<ruby>{0}<rt>{1}</rt></ruby>" -f $tokenSurface, $tokenReading
            }
            $reading.Append($tokenReading) > $null
            $tokenize.Add($tokenInfo) > $null
            $markup.Add($tokenMarkup) > $null
        }
        $ret.Add([PSCustomObject]@{
            Line = $result.line
            Reading = $reading.ToString()
            Tokenize = $tokenize -join "/"
            Markup = "<p>{0}</p>" -f ($markup -join "")
        }) > $null
    }
    return ($readingOnly)? $ret.reading : $ret
}
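
The branching inside the loop comes down to two character-class tests on the token surface. The following is a minimal sketch of that decision on its own, so it can be tried without SudachiPy. It assumes hiragana is matched with [ぁ-ん] and katakana with [ァ-ヴ]; the helper name Test-NeedsRuby is hypothetical and not part of the function above.

```powershell
# Hypothetical helper: decides whether a token surface needs a <ruby> annotation.
function Test-NeedsRuby {
    param ([string]$Surface)
    # Katakana, half-/full-width alphanumerics, symbols and whitespace pass through unchanged
    if ($Surface -match '^([ァ-ヴ]|[a-zA-Za-zA-Z]|[0-90-9]|[\W\s])+$') { return $false }
    # Hiragana-only surfaces already show their own reading
    if ($Surface -match '^[ぁ-ん]+$') { return $false }
    # Anything else (typically a surface containing kanji) gets a reading
    return $true
}

$samples = '納豆', 'でない', 'ナットウ', 'ABC123', '('
$results = foreach ($s in $samples) {
    [PSCustomObject]@{ Surface = $s; NeedsRuby = Test-NeedsRuby $s }
}
$results | Format-Table
```

Only the kanji sample should be flagged as needing ruby; the other samples are surfaces that the function would copy into Reading and Markup unchanged.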

The analysis sometimes gets readings wrong, particularly for technical terms.
With only one or two lines you can check the readings by eye, but that is impractical for hundreds of lines, so I added a Markup property that emits HTML ruby markup.
(cat hogehoge.txt | Get-ReadingWithSudachi).markup | Out-File hogehoge.html

Converting the output to HTML as shown above and reviewing it in a browser, where each reading is displayed as ruby text next to its word, should reduce the number of misreadings that slip through.
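
One practical caveat: the file written this way contains only <p> elements, so the browser has to guess the character encoding. As a minimal sketch, assuming UTF-8 output, the Markup strings can be wrapped in a complete page that declares its charset. The helper name ConvertTo-RubyHtmlPage and the default title are hypothetical, and the sample fragment stands in for real Get-ReadingWithSudachi output.

```powershell
# Hypothetical helper (not part of the original function): wraps <p>...</p>
# fragments in a minimal HTML page that declares its encoding.
function ConvertTo-RubyHtmlPage {
    param (
        [Parameter(Mandatory)][string[]]$Markup,
        [string]$Title = 'Reading check'
    )
    $body = $Markup -join "`n"
    return @"
<!DOCTYPE html>
<html lang="ja">
<head>
<meta charset="utf-8">
<title>$Title</title>
</head>
<body>
$body
</body>
</html>
"@
}

# Sample fragment standing in for (cat hogehoge.txt | Get-ReadingWithSudachi).markup
$fragments = @('<p><ruby>納豆<rt>ナットウ</rt></ruby></p>')
$page = ConvertTo-RubyHtmlPage -Markup $fragments

# Written to the temp directory here so the sketch does not touch the working folder
$outPath = Join-Path ([System.IO.Path]::GetTempPath()) 'hogehoge.html'
$page | Out-File -FilePath $outPath -Encoding utf8
```

With the real function, the fragments would come from the Markup property instead of the hard-coded sample.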