Recently the deep learning wave has swept over NLP, so some people may say, "I don't use KNP these days." However, KNP is still useful when you don't have decent data and can't use the usual methods, or when you want to try a rule-based result before reaching for a deep learning approach, because it gives you a lot of information, including case analysis results and various features.
On the other hand, unlike Sudachi, it cannot be used as a library, so calling it from a program can be a hassle (especially in languages other than Python, which has pyKNP).
So this time I made a wrapper library that calls KNP (and Juman++) from Java: https://github.com/Natsuume/knp4j ~~I wanted to publish it to the Maven repository, but I couldn't make it in time, so I will publish it there soon.~~
~~Also, although I have checked that it works to some extent, problems may occur because I have not written proper tests.~~
It is now published to Maven Central and can be used from Maven, Gradle, and so on.
pom.xml
<dependency>
<groupId>dev.natsuume.knp4j</groupId>
<artifactId>knp4j</artifactId>
<version>1.1.3</version>
</dependency>
build.gradle
implementation 'dev.natsuume.knp4j:knp4j:1.1.3'
The usage is almost the same as described in README.md on GitHub.
Sample.java
// Builder for creating a KnpWrapper
ResultParser<KnpResult> knpResultParser = new KnpResultParser();
KnpWrapperBuilder<KnpResult> knpWrapperBuilder = new KnpWrapperBuilder<>();
KnpWrapper<KnpResult> wrapper = knpWrapperBuilder
    .setJumanCommand(List.of("bash", "-c", "jumanpp")) // Juman execution command
    .setKnpCommand(List.of("bash", "-c", "knp -tab -print-num -anaphora")) // KNP execution command (currently the "-tab", "-print-num", and "-anaphora" options are required)
    .setJumanMaxNum(1) // Maximum number of Juman processes running at the same time
    .setJumanStartNum(1) // Number of Juman processes to start at initialization
    .setKnpMaxNum(1) // Maximum number of KNP processes running at the same time
    .setKnpStartNum(1) // Number of KNP processes to start at initialization
    .setRetryNum(0) // Number of retries if obtaining a result fails
    .setResultParser(knpResultParser) // Parser that converts the output (List<String>) into an arbitrary class
    .start();
var texts = List.of(
    "Test text 1",
    "Test text 2",
    "Test text 3"
);
texts.parallelStream().map(wrapper::analyze)
    .flatMap(List::stream)
    .map(KnpResult::getSurfaceForm)
    .forEach(System.out::println);
Configure the various settings on `KnpWrapperBuilder`, then call `start()` to create and start the `KnpWrapper`.
For `setJumanCommand` and `setKnpCommand`, pass the same command list that you would give to `ProcessBuilder`.
Depending on the environment, the paths to the JUMAN and KNP executables alone may be enough.
(In my environment I had to call Juman++ and KNP on WSL, so I used commands like the ones in the example above.)
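As a rough illustration of the two command styles, the sketch below builds a `ProcessBuilder` from a token list that goes through `bash -c` and from one that calls the binary directly. The class name `CommandListSketch` and the path `/usr/local/bin/knp` are placeholders made up for this example and are not part of knp4j. No process is started here.

```java
import java.util.List;

class CommandListSketch {
  // Wraps a command token list in a ProcessBuilder, the same shape of list
  // that setJumanCommand / setKnpCommand take in the example above.
  static ProcessBuilder toProcessBuilder(List<String> command) {
    ProcessBuilder builder = new ProcessBuilder(command);
    builder.redirectErrorStream(true); // merge stderr into stdout so a single reader sees everything
    return builder;
  }

  public static void main(String[] args) {
    // Style 1: let a shell resolve the command (what the WSL-style example does).
    List<String> viaShell = List.of("bash", "-c", "knp -tab -print-num -anaphora");
    // Style 2: call the executable directly. The path is a hypothetical placeholder.
    List<String> direct = List.of("/usr/local/bin/knp", "-tab", "-print-num", "-anaphora");

    System.out.println(toProcessBuilder(viaShell).command());
    System.out.println(toProcessBuilder(direct).command());
  }
}
```

With the shell style the whole command line is a single token after `-c`, while the direct style needs one token per argument.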
For every setting other than `setResultParser()`, the values in the example above are the defaults.
Multiple processes are started and then reused. The maximum number of processes running at the same time can be set independently for JUMAN and for KNP.
JUMAN and KNP also have a server mode, but it is not currently supported (support is planned).
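To make the "start some, grow up to a maximum, reuse" idea concrete, here is a minimal sketch of such a pool built on a `BlockingQueue`. It is not knp4j's actual implementation. `WorkerPoolSketch`, `withWorker`, and `createdCount` are hypothetical names, and the workers are plain strings standing in for child processes.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical pool: starts startNum workers eagerly and grows lazily up to maxNum. */
class WorkerPoolSketch<W> {
  private final BlockingQueue<W> idle;
  private final Supplier<W> factory;
  private final int maxNum;
  private final AtomicInteger created = new AtomicInteger();

  WorkerPoolSketch(int startNum, int maxNum, Supplier<W> factory) {
    this.idle = new ArrayBlockingQueue<>(maxNum);
    this.factory = factory;
    this.maxNum = maxNum;
    for (int i = 0; i < Math.min(startNum, maxNum); i++) {
      created.incrementAndGet();
      idle.add(factory.get());
    }
  }

  /** Borrows an idle worker (or starts one, or waits), runs the task, then returns the worker. */
  <R> R withWorker(Function<W, R> task) {
    W worker = idle.poll();
    if (worker == null) {
      worker = reserveSlot() ? factory.get() : takeIdle();
    }
    try {
      return task.apply(worker);
    } finally {
      idle.add(worker); // never overflows: at most maxNum workers exist
    }
  }

  int createdCount() {
    return created.get();
  }

  // Atomically claims the right to start one more worker, if the limit allows it.
  private boolean reserveSlot() {
    while (true) {
      int current = created.get();
      if (current >= maxNum) {
        return false;
      }
      if (created.compareAndSet(current, current + 1)) {
        return true;
      }
    }
  }

  private W takeIdle() {
    try {
      return idle.take(); // block until another caller hands a worker back
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for a worker", e);
    }
  }

  public static void main(String[] args) {
    AtomicInteger ids = new AtomicInteger();
    WorkerPoolSketch<String> pool =
        new WorkerPoolSketch<>(1, 3, () -> "worker-" + ids.incrementAndGet());
    for (int i = 0; i < 5; i++) {
      System.out.println(pool.withWorker(name -> name + " handled a text"));
    }
    System.out.println("workers started: " + pool.createdCount());
  }
}
```

In this sketch a caller only blocks once `maxNum` workers exist and all of them are busy, which is the situation the max settings are meant to bound.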
This is not expected to be needed in normal use, but when an `IOException` or `InterruptedException` occurs during processing, the process in which the exception occurred is terminated and the analysis is retried on another process.
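The retry behaviour can be sketched roughly as follows. This is a simplified illustration, not knp4j's code. `RetrySketch`, `Attempt`, and `analyzeWithRetry` are hypothetical names, and the fallback value stands in for whatever the caller wants to receive when every attempt fails. One deliberate simplification: on `InterruptedException` this sketch restores the interrupt flag and stops instead of retrying.

```java
import java.io.IOException;
import java.util.function.Supplier;

class RetrySketch {
  /** One analysis attempt that may fail the way a broken pipe to a child process would. */
  interface Attempt<T> {
    T run() throws IOException, InterruptedException;
  }

  /** Runs the attempt once plus up to retryNum more times, then falls back. */
  static <T> T analyzeWithRetry(Attempt<T> attempt, int retryNum, Supplier<T> fallback) {
    for (int i = 0; i <= retryNum; i++) {
      try {
        return attempt.run();
      } catch (IOException e) {
        // A real wrapper would discard the failed process here and use another one next time.
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
        return fallback.get();
      }
    }
    return fallback.get();
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Fails on the first call, succeeds on the second.
    Attempt<String> flaky = () -> {
      if (calls[0]++ == 0) {
        throw new IOException("simulated broken pipe");
      }
      return "parsed result";
    };
    System.out.println(analyzeWithRetry(flaky, 1, () -> "invalid result"));
  }
}
```

With `retryNum` set to 0, as in the defaults above, a single failure goes straight to the fallback.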
Any parser that implements the `ResultParser` interface can be used as the output parser.
`ResultParser` defines the following two methods.
ResultParser.java
public interface ResultParser<OutputT> {
  /**
   * Takes the KNP analysis result as input and returns an arbitrary instance.
   *
   * @param list KNP analysis result
   * @return Instance representing the analysis result
   */
  OutputT parse(List<String> list);

  /**
   * Returns the instance to use when parsing fails.
   *
   * @return Instance to return when parsing fails
   */
  OutputT getInvalidResult();
}
`getInvalidResult()` returns the instance to use when a normal analysis result cannot be obtained.
It is used when the retry after one of the exceptions above also fails, or when KNP itself fails to analyze the input (KNP fails if the input contains a half-width `+` or `*`).
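As an example of plugging in your own parser, the sketch below implements `ResultParser` to return only the surface forms. The interface is repeated from the article so the block compiles on its own; with knp4j on the classpath you would use the library's interface instead. `SurfaceFormParser` is a hypothetical name. The filtering rests on my assumption about `knp -tab` output: that lines starting with `#`, `* `, or `+ ` and the `EOS` line are not morpheme lines, and that a morpheme line begins with the surface form followed by a space.

```java
import java.util.List;
import java.util.stream.Collectors;

// Repeated from the article so this sketch is self-contained.
interface ResultParser<OutputT> {
  OutputT parse(List<String> list);

  OutputT getInvalidResult();
}

/** Hypothetical parser that keeps only the first token of each morpheme line. */
class SurfaceFormParser implements ResultParser<List<String>> {
  @Override
  public List<String> parse(List<String> lines) {
    return lines.stream()
        .filter(line -> !line.isEmpty() && !line.equals("EOS"))
        // Assumed structure markers: comment, bunsetsu, and basic-phrase lines.
        .filter(line -> !line.startsWith("#") && !line.startsWith("* ") && !line.startsWith("+ "))
        .map(line -> line.split(" ", 2)[0])
        .collect(Collectors.toList());
  }

  @Override
  public List<String> getInvalidResult() {
    return List.of(); // an empty list signals "no usable analysis"
  }

  public static void main(String[] args) {
    // Hand-written stand-in lines, not real KNP output.
    List<String> fakeOutput = List.of(
        "# S-ID:1",
        "* -1D",
        "+ -1D",
        "test tesuto test noun",
        "EOS");
    System.out.println(new SurfaceFormParser().parse(fakeOutput));
  }
}
```

Returning an empty list from `getInvalidResult()` lets callers that use `flatMap(List::stream)`, as in the sample above, skip failed texts without a null check.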
Next, I varied `jumanMaxNum` and `knpMaxNum` in the code below and compared the execution time (ms).
The experimental environment has a Ryzen 7 3700X CPU and a heap size of 32GB.
JUMAN and KNP are called on WSL. (WSL is said to have slow I/O, so it may be a little faster in other environments.)
// Requires java.util.stream.Collectors, java.util.stream.IntStream, and the knp4j classes.
public static void main(String[] args) {
  long time = System.currentTimeMillis();
  KnpWrapperBuilder<KnpResult> knpWrapperBuilder = new KnpWrapperBuilder<>();
  int jumanMaxNum = 1;
  int knpMaxNum = 1;
  int textSize = 100;
  KnpWrapper<KnpResult> wrapper =
      knpWrapperBuilder
          .setJumanMaxNum(jumanMaxNum)
          .setKnpMaxNum(knpMaxNum)
          .setResultParser(new KnpResultParser())
          .start();
  // %d is filled in by String.format below so that every text is slightly different.
  var sampleText = "I signed up for the Advent calendar on a whim, but "
      + "I don't see any sign of making it in time, so today I can only sleep after working for %d hours.";
  var texts =
      IntStream.range(0, textSize)
          .mapToObj(i -> String.format(sampleText, i))
          .collect(Collectors.toList());
  var results =
      texts
          .parallelStream()
          .map(wrapper::analyze)
          .flatMap(List::stream)
          .collect(Collectors.toList());
  System.out.println("time: " + (System.currentTimeMillis() - time));
  System.exit(0);
}
jumanMaxNum | knpMaxNum | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) | Run 4 (ms) | Run 5 (ms) | Average (ms) |
---|---|---|---|---|---|---|---|
1 | 1 | 17297 | 17320 | 17241 | 17159 | 17421 | 17287.6 |
1 | 5 | 2808 | 2764 | 2858 | 2791 | 2789 | 2802 |
5 | 1 | 20334 | 20211 | 19974 | 20037 | 20189 | 20149 |
For now, this shows that running KNP in multiple processes is much faster than running JUMAN and KNP in a single process each. On the other hand, unlike KNP, which is the bottleneck, adding too many JUMAN processes seems to slow things down.
In order to see how much the result differs depending on the number of JUMAN and KNP processes, the number of texts was increased from 100 to 500 and the following combinations were additionally measured.
jumanMaxNum | knpMaxNum | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) | Run 4 (ms) | Run 5 (ms) | Average (ms) |
---|---|---|---|---|---|---|---|
1 | 5 | 27953 | 27590 | 27674 | 27999 | 27669 | 27777 |
1 | 10 | 15825 | 16366 | 15118 | 15632 | 14931 | 15574.4 |
5 | 10 | 18704 | 17778 | 17355 | 16134 | 17254 | 17445 |
10 | 10 | 19514 | 19265 | 20459 | 19891 | 19233 | 19672.4 |
1 | 15 | 14533 | 22271 | 14187 | 21838 | 19794 | 18524.6 |
5 | 15 | 14149 | 14584 | 14929 | 15709 | 15228 | 14919.8 |
10 | 15 | 19313 | 17903 | 15478 | 18219 | 16740 | 17530.6 |
1 | 20 | 21620 | 14489 | 21960 | 20456 | 15671 | 18839.2 |
5 | 20 | 15899 | 15820 | 15713 | 14720 | 17053 | 15841 |
10 | 20 | 18850 | 15850 | 18461 | 18200 | 16357 | 17543.6 |
~~I have no idea.~~ For now, the combination with the fastest average in this environment was 5 JUMAN processes and 15 KNP processes. However, I feel that the differences among [1, 10], [5, 15], and [5, 20] are within the margin of error.
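One rough way to check the "within the margin of error" impression is to compare the gaps between the averages against the spread of the individual runs. The sketch below computes the mean and sample standard deviation for the three combinations, using the run times from the table above. `TimingStats` is a hypothetical helper, and with only five runs per combination this gives a rough indication, not a statistical test.

```java
import java.util.Arrays;

class TimingStats {
  static double mean(long[] runsMs) {
    return Arrays.stream(runsMs).average().orElseThrow();
  }

  /** Sample standard deviation (divides by n - 1). */
  static double sampleStdDev(long[] runsMs) {
    double mean = mean(runsMs);
    double sumSq = Arrays.stream(runsMs)
        .mapToDouble(t -> (t - mean) * (t - mean))
        .sum();
    return Math.sqrt(sumSq / (runsMs.length - 1));
  }

  static void report(String label, long[] runsMs) {
    System.out.println(
        String.format("%s mean=%.1f ms, sd=%.1f ms", label, mean(runsMs), sampleStdDev(runsMs)));
  }

  public static void main(String[] args) {
    // Run times in ms, copied from the 500-text table above.
    report("[1, 10]", new long[] {15825, 16366, 15118, 15632, 14931});
    report("[5, 15]", new long[] {14149, 14584, 14929, 15709, 15228});
    report("[5, 20]", new long[] {15899, 15820, 15713, 14720, 17053});
  }
}
```

If the standard deviations come out comparable to the gaps between the means, that would support treating the three combinations as roughly equivalent.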
For reference, here are the results of running the following command in the WSL terminal.
time echo "I registered it on the Advent calendar, but I don't see any sign of it in time, so I have to work for an hour today before I can sleep." | jumanpp | knp -tab -print-num -anaphora
Result | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) | Run 4 (ms) | Run 5 (ms) | Average (ms) |
---|---|---|---|---|---|---|
real | 220 | 223 | 234 | 219 | 234 | 226 |
user | 78 | 78 | 63 | 109 | 94 | 84.4 |
sys | 125 | 125 | 141 | 78 | 125 | 118.8 |
It's fun to watch the CPU running flat out.