I got a ValueError when using JUMAN++ with PyKNP

Error details

When I run morphological analysis on a Japanese sentence with JUMAN++ through PyKNP, I get this error: ValueError: invalid literal for int() with base 10: 'input' (How to use JUMAN and JUMAN++ from Python is omitted here.)

  File "/$HOME/.pyenv/versions/anaconda3-2019.03/lib/python3.6/site-packages/pyknp/juman/morpheme.py", line 143, in _parse_spec
    self.hinsi_id = int(parts[4])

ValueError: invalid literal for int() with base 10: 'input'

Possible causes of the error

As a first step, I searched for the error message and found these articles:

- Symbols to watch out for when using JUMAN from PyKNP
- On the cause of, and countermeasures for, the ValueError when playing with JUMAN++ (EnsekiTT Blog)

According to them, half-width spaces and some half-width characters cause trouble.

Fixing the input text

- [Python] Convert full-width and half-width characters to each other in one line (alphabetic characters + numbers + symbols)

Following that article, I replaced all half-width characters with full-width characters.
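The linked article's code is not reproduced here, so the following is only a minimal sketch of one common way to do this conversion. The function name `to_fullwidth` is hypothetical. It relies on the fact that the printable ASCII characters U+0021 to U+007E have full-width counterparts at a fixed offset of 0xFEE0, and it maps the half-width space to the ideographic space U+3000.

```python
# Build a translation table once: printable ASCII -> full-width forms.
# U+0021..U+007E map to U+FF01..U+FF5E (offset 0xFEE0).
_HALF_TO_FULL = {code: code + 0xFEE0 for code in range(0x21, 0x7F)}
# The half-width space maps to the ideographic space U+3000.
_HALF_TO_FULL[0x20] = 0x3000


def to_fullwidth(text):
    """Return text with half-width ASCII characters replaced by full-width ones.

    Hypothetical helper for illustration; characters outside the
    ASCII range (for example kana and kanji) are left unchanged.
    """
    return text.translate(_HALF_TO_FULL)


converted = to_fullwidth("PyKNP 0.4!")
print(converted)
```

Half-width katakana are not covered by this table; if the input may contain them, `unicodedata.normalize` would be worth looking at as a separate step.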

However

Even after converting every half-width character to full-width, the same error kept appearing. Apparently the cause was different from the one described in the articles above.

Cause found

As an aside, pdb is handy

So I ran the script under pdb and inspected the variable parts at the moment of the error. ~~I should have done that from the start.~~

(Pdb) parts
['InvalidParameter:', 'byte', 'size', 'of', 'input', 'string', '(4302)', 'is', 'greater', 'than', 'maximum', 'allowed', '(4096)']

(So when an error occurs, the error message ends up in the list that would normally hold the analysis result. Was that the intended behavior all along ...?)
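The traceback and the pdb output fit together: the fifth element of parts is 'input', which is exactly the string that int() rejected. The snippet below is a minimal reproduction, not PyKNP's actual code. The error line is reconstructed by joining the list above with spaces, and splitting on a single space is an assumption about how the line was tokenized; only the expression `int(parts[4])` comes from the traceback.

```python
# Error line reconstructed from the pdb output (joined with spaces).
error_line = (
    "InvalidParameter: byte size of input string (4302) "
    "is greater than maximum allowed (4096)"
)

# Assumption: the analyzer output line is split on spaces.
parts = error_line.split(" ")
print(parts[4])  # the field that is expected to hold a numeric ID

try:
    hinsi_id = int(parts[4])  # same expression as in the traceback
except ValueError as exc:
    message = str(exc)
    print("ValueError:", message)
```

In other words, the ValueError is only a symptom: the real message from JUMAN++ is hidden inside parts.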

Apparently **the size (number of bytes) of the input string was too large**. **The limit on the input string appears to be 4096 bytes in total**, so the input should be kept at or below that size.

A workaround for now

I was in the middle of building a dataset to feed to BERT, so I simply skip sentences that are too long! ~~UTF-8 uses a different number of bytes depending on the character, so cutting the text is a hassle~~

Detect texts larger than 4096 bytes with the condition below and apply some workaround (split or skip). The condition computes the number of UTF-8 bytes in the string text and compares it with the limit.

if len(text.encode('utf-8')) > 4096:

For how to get the number of bytes, rather than the number of characters, in a string, see:

- Python string length and number of bytes by encoding (Memoize2)
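For the "split" option, the following is a rough sketch rather than the approach actually used in this post (which skips long texts). The names `MAX_BYTES`, `utf8_size`, and `split_by_bytes` are hypothetical, and the 4096 value is simply the limit reported in the error message. It cuts greedily at character boundaries so that no multi-byte character is broken.

```python
from typing import List

MAX_BYTES = 4096  # limit reported in the JUMAN++ error message


def utf8_size(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def split_by_bytes(text: str, max_bytes: int = MAX_BYTES) -> List[str]:
    """Greedily split text into chunks of at most max_bytes UTF-8 bytes.

    Cuts only between characters, never inside a multi-byte character.
    """
    chunks = []   # finished chunks
    current = []  # characters of the chunk being built
    size = 0      # UTF-8 byte size of the chunk being built
    for ch in text:
        ch_size = utf8_size(ch)
        if current and size + ch_size > max_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += ch_size
    if current:
        chunks.append("".join(current))
    return chunks


# Most Japanese characters take 3 bytes in UTF-8, so 2000 of them
# already exceed the limit even though len() reports only 2000.
long_text = "あ" * 2000
chunks = split_by_bytes(long_text)
print(len(long_text), utf8_size(long_text), [len(c) for c in chunks])
```

Cutting at an arbitrary character can split a word in the middle and change the analysis result near the cut, so splitting at sentence boundaries such as "。" first, and falling back to this only for overlong sentences, is probably a better fit for morphological analysis.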

Summary

Combining this with the articles introduced above, the causes of this error when using JUMAN++ from PyKNP are:

- Half-width spaces
- Some half-width symbols
- **An input string larger than 4096 bytes**
