[PowerShell] Morphological analysis with SudachiPy

I came across SudachiPy, an excellent morphological analyzer, so I tried calling it from PowerShell, my usual shell.

The finished product

[Screenshot: 202006212144302.png]

When you pipe strings in, each returned object holds the input string in its line property and the analysis result in its parsed property.

Code

The main analysis is written in Python and called from PowerShell. Strings could also be passed in as command-line arguments and returned through print on standard output, but that approach ran into problems, so temporary files are used instead.
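The temporary-file handoff can be sketched in Python alone. This is only an illustration of the contract (input file in, result file out): `analyze_file` is a hypothetical stand-in that splits on whitespace instead of calling SudachiPy, `round_trip` is a hypothetical caller playing the role of the PowerShell side, and writing the result with `json.dump` is an assumption of this sketch.

```python
import json
import os
import tempfile


def analyze_file(input_path, output_path):
    """Hypothetical stand-in for the analyzer: splits each line on whitespace."""
    with open(input_path, "r", encoding="utf_8_sig") as f:
        lines = f.read().splitlines()
    result = [{"line": line, "parsed": line.split()} for line in lines]
    with open(output_path, "w", encoding="utf_8_sig") as f:
        json.dump(result, f, ensure_ascii=False)


def round_trip(lines):
    """Caller side: write input to a temp file, run the analyzer, read the result."""
    in_fd, in_path = tempfile.mkstemp(suffix=".txt")
    out_fd, out_path = tempfile.mkstemp(suffix=".txt")
    os.close(in_fd)
    os.close(out_fd)
    try:
        with open(in_path, "w", encoding="utf_8_sig") as f:
            f.write("\n".join(lines))
        analyze_file(in_path, out_path)
        with open(out_path, "r", encoding="utf_8_sig") as f:
            return json.load(f)
    finally:
        # Clean up both temporary files, as the caller is responsible for them
        os.remove(in_path)
        os.remove(out_path)


result = round_trip(["a b", "", "c"])
```

Because everything travels through files with an explicit encoding, neither side depends on the console code page or on argument quoting.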

Processing on the Python side

Installing Python via Scoop takes care of the PATH setup for you. As preparation, install SudachiPy and fire with pip.

pip install sudachipy
pip install fire

The processing, "morphologically analyze the contents of a text file line by line and write the result to another text file", is wrapped in a single function and turned into a CLI tool with fire.Fire().

sudachi_tokenizer.py


import json
import re

import fire
from sudachipy import dictionary, tokenizer


def main(input_file_path, output_file_path, ignore_paren=False):
    tokenizer_obj = dictionary.Dictionary().create()
    mode = tokenizer.Tokenizer.SplitMode.C

    # utf_8_sig strips the BOM that PowerShell may add to the input file
    with open(input_file_path, "r", encoding="utf_8_sig") as input_file:
        lines = input_file.read().splitlines()

    json_style_list = []
    for line in lines:
        if not line:
            json_style_list.append({"line": "", "parsed": []})
            continue
        if ignore_paren:
            # Drop text in half-width and full-width parentheses and brackets
            target = re.sub(r"\(.+?\)|\[.+?\]|（.+?）|［.+?］", "", line)
        else:
            target = line
        parsed = []
        for t in tokenizer_obj.tokenize(target, mode):
            pos = t.part_of_speech()
            parsed.append({
                "surface": t.surface(),
                "pos": pos[0],
                "yomi": t.reading_form(),
                "c_type": pos[4],  # conjugation type
                "c_form": pos[5],  # conjugation form
            })
        json_style_list.append({"line": line, "parsed": parsed})

    # json.dumps emits strict JSON; str() would emit a Python repr instead
    with open(output_file_path, mode="w", encoding="utf_8_sig") as output_file:
        output_file.write(json.dumps(json_style_list, ensure_ascii=False))


if __name__ == "__main__":
    fire.Fire(main)

At work I often need to skip text inside parentheses (half-width () and full-width （）) and brackets (half-width [] and full-width ［］), so I added an option for that.
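As a rough illustration of what that option does, here is a standalone sketch of the stripping step. It assumes the second pair in each case is the full-width form; `strip_parens` and `PAREN_PATTERN` are hypothetical names used only for this example.

```python
import re

# Half-width and full-width parentheses and brackets.
# The full-width forms are an assumption about the intended pattern.
PAREN_PATTERN = re.compile(r"\(.+?\)|\[.+?\]|（.+?）|［.+?］")


def strip_parens(line):
    """Remove each bracketed span (non-greedy) together with its delimiters."""
    return PAREN_PATTERN.sub("", line)


cleaned = strip_parens("東京(とうきょう)へ［注］行く")
ascii_cleaned = strip_parens("a(b)c[d]e")
```

The non-greedy `.+?` keeps two separate bracketed spans on one line from being merged into a single match. Nested brackets are not handled by this pattern.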

The input and output files are encoded as UTF-8 with a BOM (utf_8_sig) because of the PowerShell behavior described later.
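The effect of the utf_8_sig codec can be checked with a few lines of standard-library Python. This is a small sketch, not part of the tool itself: encoding prepends the three BOM bytes, and decoding strips them if present while still accepting input that has no BOM.

```python
import codecs

text = "形態素解析"

# Encoding with utf_8_sig prepends the UTF-8 BOM (EF BB BF)
with_bom = text.encode("utf_8_sig")
has_bom = with_bom.startswith(codecs.BOM_UTF8)

# Decoding with utf_8_sig removes the BOM, and also accepts BOM-less input
decoded_with_bom = with_bom.decode("utf_8_sig")
decoded_without_bom = text.encode("utf_8").decode("utf_8_sig")

# Decoding the same bytes with plain utf_8 leaves U+FEFF at the start
leaked = with_bom.decode("utf_8")
```

Reading with utf_8_sig is therefore the safe choice whether or not the writer added a BOM.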

Processing on the PowerShell side

To use the function from the console, save the following as a .ps1 file in the same directory as sudachi_tokenizer.py above and load it from $PROFILE.

function Invoke-SudachiTokenizer {
    param (
        [switch]$ignoreParen
    )

    $outputTmp = New-TemporaryFile
    $inputTmp = New-TemporaryFile
    # Windows PowerShell 5.1 writes UTF-8 with a BOM here
    $input | Out-File -Encoding utf8 -FilePath $inputTmp.FullName

    $sudachiPath = Join-Path $PSScriptRoot "sudachi_tokenizer.py"
    $pyArgs = @("-B", $sudachiPath, $inputTmp.FullName, $outputTmp.FullName)
    if ($ignoreParen) {
        $pyArgs += "--ignore_paren=True"
    }

    # Call python directly instead of building a string for Invoke-Expression
    & python @pyArgs
    $parsed = Get-Content -Path $outputTmp.FullName -Encoding UTF8

    @($inputTmp, $outputTmp) | Remove-Item # Manually clean up temporary files

    return ($parsed | ConvertFrom-Json)
}

A Python list of dictionaries has the same shape as a JSON array of objects, so PowerShell's ConvertFrom-Json can turn the output into objects.
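One caveat is worth illustrating: a list of dictionaries printed with str() only resembles JSON, whereas json.dumps produces strict JSON. The sketch below uses a made-up record (not real SudachiPy output) containing an apostrophe to show the difference.

```python
import json

# Made-up record for illustration only
records = [{"line": "it's", "parsed": []}]

as_repr = str(records)                             # Python repr, single-quoted keys
as_json = json.dumps(records, ensure_ascii=False)  # strict JSON, double-quoted keys

# Strict JSON round-trips cleanly
round_tripped = json.loads(as_json)

# The repr is rejected by a strict JSON parser
try:
    json.loads(as_repr)
    repr_is_valid_json = True
except json.JSONDecodeError:
    repr_is_valid_json = False
```

Some JSON parsers are lenient about single quotes, but values such as True, None, or strings with mixed quotes make the repr form fragile, so json.dumps is the safer way to hand data to ConvertFrom-Json.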

As the comment notes, specifying utf8 for the -Encoding parameter in Windows PowerShell 5.1 automatically writes a BOM. In PowerShell 6 and later, utf8 means no BOM and utf8BOM requests one.
