This article in three lines

I created a wrapper for calling Juman++ and KNP from Java.
It runs multiple processes, so ~~if you have the machine power~~ you can process many sentences quickly.
The analysis result can be converted to any class if you provide your own Parser.

Overview

Recently the deep learning wave has swept into NLP, so some people may say, "I don't use KNP these days." However, KNP is still useful when you don't have decent data and can't use the common methods, or when you want to try rule-based results before trying deep methods, because it gives you a lot of information, such as case analysis results and various features.

On the other hand, it can't be used as a library the way Sudachi can, so it can be a hassle to use from a program (especially in languages other than Python, which has pyKNP).

So this time I made a wrapper library that calls KNP (and Juman++) from Java. (https://github.com/Natsuume/knp4j) ~~I wanted to publish it to the Maven repository, but I couldn't make it in time, so I will publish it soon.~~

~~In addition, although I have checked that it works to some extent, problems may still occur because I have not written proper tests.~~

Postscript

The library has been published to Maven Central, so it can now be used from Maven, Gradle, and so on.

`pom.xml`


<dependency>
  <groupId>dev.natsuume.knp4j</groupId>
  <artifactId>knp4j</artifactId>
  <version>1.1.3</version>
</dependency>

`build.gradle`


implementation 'dev.natsuume.knp4j:knp4j:1.1.3'

How to use

Usage is almost the same as described in README.md on GitHub.

`Sample.java`


// Builder for creating a KnpWrapper
ResultParser<KnpResult> knpResultParser = new KnpResultParser();
KnpWrapperBuilder<KnpResult> knpWrapperBuilder = new KnpWrapperBuilder<>();
KnpWrapper<KnpResult> wrapper = knpWrapperBuilder
    .setJumanCommand(List.of("bash", "-c", "jumanpp")) // Juman++ execution command
    .setKnpCommand(List.of("bash", "-c", "knp -tab -print-num -anaphora")) // KNP execution command (the "-tab", "-print-num", and "-anaphora" options are currently required)
    .setJumanMaxNum(1) // Maximum number of Juman++ processes running at the same time
    .setJumanStartNum(1) // Number of Juman++ processes to start at initialization
    .setKnpMaxNum(1) // Maximum number of KNP processes running at the same time
    .setKnpStartNum(1) // Number of KNP processes to start at initialization
    .setRetryNum(0) // Number of retries if result acquisition fails
    .setResultParser(knpResultParser) // Parser that converts the output List<String> into an arbitrary class
    .start();
var texts = List.of(
    "Test text 1",
    "Test text 2",
    "Test text 3"
);
texts.parallelStream().map(wrapper::analyze)
    .flatMap(List::stream)
    .map(KnpResult::getSurfaceForm)
    .forEach(System.out::println);

Configure the wrapper through KnpWrapperBuilder; the KnpWrapper is only created and started when start() is called. Pass setJumanCommand and setKnpCommand the same command list you would pass to ProcessBuilder. Depending on the environment, the paths to the JUMAN and KNP executables alone may be enough. (In my environment I had to call Juman++ and KNP on WSL, so I used commands like those in the example above.)

For every setting other than setResultParser(), the values in the example above are the defaults.
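To make the "same command as ProcessBuilder" point concrete, here is a minimal sketch of what passing such a list to ProcessBuilder looks like. It is not knp4j's internal code: the class name CommandSketch and the method runOnce are hypothetical, and `cat` stands in for jumanpp or knp so that the sketch runs without either tool installed.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class CommandSketch {

  /** Starts the command, writes one line to its stdin, and returns every line it prints. */
  static List<String> runOnce(List<String> command, String input) {
    try {
      Process process = new ProcessBuilder(command).redirectErrorStream(true).start();
      // Writing everything before reading is fine for one short line of input.
      try (BufferedWriter writer = new BufferedWriter(
          new OutputStreamWriter(process.getOutputStream(), StandardCharsets.UTF_8))) {
        writer.write(input);
        writer.newLine();
      } // closing stdin lets the child process see end-of-input
      List<String> lines = new ArrayList<>();
      try (BufferedReader reader = new BufferedReader(
          new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
          lines.add(line);
        }
      }
      process.waitFor();
      return lines;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // Same list shape as the setJumanCommand example; "cat" is only a placeholder command.
    System.out.println(runOnce(List.of("bash", "-c", "cat"), "Test text 1"));
  }
}
```

If a command list works in a sketch like this, the same list should be a reasonable candidate for setJumanCommand or setKnpCommand.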

Features

Run in multiple processes

Multiple processes are started and then reused. The maximum number of simultaneous processes can be set independently for JUMAN and for KNP.

JUMAN and KNP also have a server mode, but it is not currently supported (support is planned).
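The start/reuse behaviour can be pictured as a small pool that hands out idle workers and creates new ones only up to a cap. The following is a rough sketch of that idea, not knp4j's actual implementation: WorkerPoolSketch, withWorker, and createdCount are hypothetical names, and a worker here is any object rather than a real JUMAN or KNP process.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Supplier;

public class WorkerPoolSketch<W> {
  private final BlockingQueue<W> idle;
  private final Supplier<W> factory;
  private final int maxNum;
  private final AtomicInteger created = new AtomicInteger();

  WorkerPoolSketch(int startNum, int maxNum, Supplier<W> factory) {
    if (startNum < 0 || maxNum < 1 || startNum > maxNum) {
      throw new IllegalArgumentException("need 0 <= startNum <= maxNum and maxNum >= 1");
    }
    this.idle = new ArrayBlockingQueue<>(maxNum);
    this.factory = factory;
    this.maxNum = maxNum;
    for (int i = 0; i < startNum; i++) { // workers started at initialization
      created.incrementAndGet();
      idle.add(factory.get());
    }
  }

  /** Borrows a worker, runs the task with it, and always returns the worker to the pool. */
  <R> R withWorker(Function<W, R> task) {
    W worker = borrow();
    try {
      return task.apply(worker);
    } finally {
      idle.add(worker); // never overflows: at most maxNum workers exist
    }
  }

  private W borrow() {
    W worker = idle.poll();
    if (worker != null) {
      return worker; // reuse an idle worker
    }
    while (true) { // otherwise create a new one while under the cap
      int n = created.get();
      if (n >= maxNum) {
        break;
      }
      if (created.compareAndSet(n, n + 1)) {
        return factory.get();
      }
    }
    try {
      return idle.take(); // cap reached: wait until another thread returns a worker
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  int createdCount() {
    return created.get();
  }
}
```

With this shape, a parallel stream can call withWorker from many threads while the number of live workers stays at or below maxNum.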

Re-execution when analysis fails

This is not expected to be needed in normal use, but when an IOException or InterruptedException occurs during analysis, the process in which the exception occurred is terminated and the analysis is retried on another process.
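As an illustration of the retry behaviour, here is a minimal sketch under some assumptions. The Attempt interface and analyzeWithRetry are hypothetical names, not part of knp4j, and the sketch makes one deliberate simplification: on InterruptedException it restores the interrupt flag and gives up instead of retrying.

```java
import java.io.IOException;

public class RetrySketch {

  /** Hypothetical stand-in for one analysis attempt on one process. */
  interface Attempt<T> {
    T run(String text) throws IOException, InterruptedException;
  }

  /** Tries once, then up to retryNum more times; returns invalidResult if every attempt fails. */
  static <T> T analyzeWithRetry(String text, int retryNum, Attempt<T> attempt, T invalidResult) {
    for (int i = 0; i <= retryNum; i++) {
      try {
        return attempt.run(text);
      } catch (IOException e) {
        // A real wrapper would terminate the failed process here,
        // so that the next iteration runs on a different process.
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // simplification: stop instead of retrying
        return invalidResult;
      }
    }
    return invalidResult;
  }

  public static void main(String[] args) {
    int[] calls = {0};
    String result = analyzeWithRetry("text", 1, t -> {
      if (calls[0]++ == 0) {
        throw new IOException("simulated failure on the first attempt");
      }
      return t.toUpperCase();
    }, "INVALID");
    System.out.println(result);
  }
}
```

With retryNum set to 0, as in the defaults above, a single failure goes straight to the fallback value.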

Result Parser

Any Parser that implements the ResultParser interface can be used to convert the output. ResultParser defines the following two methods.

`ResultParser.java`


public interface ResultParser<OutputT> {

  /**
   * Takes the KNP analysis result as input and returns an arbitrary instance.
   *
   * @param list Knp analysis result
   * @return Instance representing the analysis result
   */
  OutputT parse(List<String> list);

  /**
   * Returns the instance to use when parsing fails.
   *
   * @return Instance to return when parsing fails
   */
  OutputT getInvalidResult();
}

getInvalidResult() returns the instance to use when a normal analysis result cannot be obtained. It is used when the retry after the exceptions described above also fails, or when KNP itself fails to analyze the input (KNP fails if the text contains a half-width + or *).
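A custom parser could look something like the sketch below. The interface is redeclared locally, copied from the listing above, only so that the example compiles on its own; in real code you would implement the ResultParser that ships with knp4j. PhraseCountParser is a hypothetical example, and it assumes that the list passed to parse contains the lines of KNP's -tab output, where each phrase (bunsetsu) header line starts with "* ".

```java
import java.util.List;

public class PhraseCountParserSketch {

  /** Local copy of the interface shown above, so the sketch is self-contained. */
  interface ResultParser<OutputT> {
    OutputT parse(List<String> list);

    OutputT getInvalidResult();
  }

  /** Hypothetical parser that only counts phrase header lines. */
  static class PhraseCountParser implements ResultParser<Integer> {
    @Override
    public Integer parse(List<String> list) {
      // Assumption: phrase header lines in the -tab format start with "* ".
      return (int) list.stream().filter(line -> line.startsWith("* ")).count();
    }

    @Override
    public Integer getInvalidResult() {
      return -1; // sentinel meaning "no usable analysis result"
    }
  }

  public static void main(String[] args) {
    // Synthetic lines shaped like -tab output, not real KNP output.
    List<String> lines = List.of("# S-ID:1", "* 1D", "+ 1D", "* -1D", "+ -1D", "EOS");
    System.out.println(new PhraseCountParser().parse(lines));
  }
}
```

A sentinel such as -1 keeps failed sentences distinguishable from sentences that simply have no phrases; an Optional or a dedicated result class would work just as well.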

Checking whether it is faster than a single process

I changed jumanMaxNum and knpMaxNum in the code below and compared the execution times (ms). The experimental environment has a Ryzen 7 3700X CPU and a heap size of 32GB.

In this environment, the JUMAN and KNP installed on WSL are called. (WSL is said to have slow I/O, so it may be a little faster in other environments.)

  public static void main(String[] args) {
    long time = System.currentTimeMillis();

    KnpWrapperBuilder<KnpResult> knpWrapperBuilder = new KnpWrapperBuilder<>();
    int jumanMaxNum = 1;
    int knpMaxNum = 1;
    int textSize = 100;
    KnpWrapper<KnpResult> wrapper =
        knpWrapperBuilder
            .setJumanMaxNum(jumanMaxNum)
            .setKnpMaxNum(knpMaxNum)
            .setResultParser(new KnpResultParser())
            .start();
    var sampleText = "I registered for the Advent calendar on a whim, "
        + "but I don't see any sign of making it in time, so today I can only sleep after working for %d hours.";
    var texts =
        IntStream.range(0, textSize)
            .mapToObj(i -> String.format(sampleText, i))
            .collect(Collectors.toList());
    var results =
        texts
            .parallelStream()
            .map(wrapper::analyze)
            .flatMap(List::stream)
            .collect(Collectors.toList());

    System.out.println("time: " + (System.currentTimeMillis() - time));
    System.exit(0);
  }

Results

jumanMaxNum	knpMaxNum	Run 1 (ms)	Run 2 (ms)	Run 3 (ms)	Run 4 (ms)	Run 5 (ms)	Average (ms)
1	1	17297	17320	17241	17159	17421	17287.6
1	5	2808	2764	2858	2791	2789	2802
5	1	20334	20211	19974	20037	20189	20149

For now, this shows that running KNP in multiple processes is faster than running both JUMAN and KNP in a single process each. On the other hand, unlike KNP, which is the bottleneck, JUMAN seems to get slower when its process count is increased too far.

To see how much the result depends on the number of JUMAN and KNP processes, I increased the number of texts from 100 to 500 and additionally measured the following combinations.

jumanMaxNum	knpMaxNum	Run 1 (ms)	Run 2 (ms)	Run 3 (ms)	Run 4 (ms)	Run 5 (ms)	Average (ms)
1	5	27953	27590	27674	27999	27669	27777
1	10	15825	16366	15118	15632	14931	15574.4
5	10	18704	17778	17355	16134	17254	17445
10	10	19514	19265	20459	19891	19233	19672.4
1	15	14533	22271	14187	21838	19794	18524.6
5	15	14149	14584	14929	15709	15228	14919.8
10	15	19313	17903	15478	18219	16740	17530.6
1	20	21620	14489	21960	20456	15671	18839.2
5	20	15899	15820	15713	14720	17053	15841
10	20	18850	15850	18461	18200	16357	17543.6

~~I don't really know.~~ For now, the combination with the fastest average in this environment was 5 JUMAN processes and 15 KNP processes. However, I feel that the differences between combinations such as [1, 10], [5, 15], and [5, 20] are within the margin of error.
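One rough way to judge the "margin of error" impression is to compare each combination's mean with its run-to-run spread. The sketch below only reuses the measurements from the table above; the class and method names are hypothetical, and the sample standard deviation over five runs is a coarse indicator rather than a proper significance test.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TimingStats {

  static double mean(long[] xs) {
    double sum = 0;
    for (long x : xs) {
      sum += x;
    }
    return sum / xs.length;
  }

  /** Sample standard deviation (divides by n - 1). */
  static double sampleStdDev(long[] xs) {
    double m = mean(xs);
    double squares = 0;
    for (long x : xs) {
      squares += (x - m) * (x - m);
    }
    return Math.sqrt(squares / (xs.length - 1));
  }

  public static void main(String[] args) {
    // Run times in ms, copied from the 500-text table above.
    Map<String, long[]> runs = new LinkedHashMap<>();
    runs.put("[1, 10]", new long[] {15825, 16366, 15118, 15632, 14931});
    runs.put("[5, 15]", new long[] {14149, 14584, 14929, 15709, 15228});
    runs.put("[5, 20]", new long[] {15899, 15820, 15713, 14720, 17053});
    runs.forEach((name, xs) ->
        System.out.printf("%s mean=%.1f ms, sd=%.1f ms%n", name, mean(xs), sampleStdDev(xs)));
  }
}
```

If the gap between two means is comparable to the spreads printed here, five runs are probably not enough to rank those combinations.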

For reference, here are the results of running the following command directly in the WSL terminal.

time echo "I registered it on the Advent calendar, but I don't see any sign of it in time, so I have to work for an hour today before I can sleep." | jumanpp | knp -tab -print-num -anaphora

time	Run 1 (ms)	Run 2 (ms)	Run 3 (ms)	Run 4 (ms)	Run 5 (ms)	Average (ms)
real	220	223	234	219	234	226
user	78	78	63	109	94	84.4
sys	125	125	141	78	125	118.8

Bonus

It's fun to see the CPU spinning around

I made a Wrapper that calls KNP from Java