Building on my earlier post "I tried OCR processing a PDF file with Java", I analyzed which words appear most often in the security specialist AM2 question paper.
- I tried OCR processing of PDF file with Java
- Program to count the number of words contained in List
JapaneseAnalyser.java
package jpn;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import org.atilika.kuromoji.Token;
import org.atilika.kuromoji.Tokenizer;
public class JapaneseAnalyser {
    public static void main(String[] args) throws Exception {
        // parse_1.txt holds the OCR output of the security specialist AM2 question paper.
        // readAllLines already strips line terminators, so trim each line and join them.
        String input = Files.readAllLines(Paths.get("parse_1.txt"), Charset.forName("MS932"))
                .stream()
                .map(String::trim)
                .collect(Collectors.joining());
        analysis(input);
    }

    public static void analysis(String s) {
        Tokenizer tokenizer = Tokenizer.builder().build();
        List<Token> tokens = tokenizer.tokenize(s);
        tokens.stream()
                // Kuromoji's default dictionary reports part of speech in Japanese,
                // so keep only tokens tagged 名詞 (noun).
                .filter(a -> a.getPartOfSpeech().contains("名詞"))
                .map(Token::getSurfaceForm)
                // Count how many times each surface form occurs.
                .collect(Collectors.groupingBy(b -> b, Collectors.summingInt(b -> 1)))
                .forEach((word, count) ->
                        System.out.println(String.format("Frequency: %d  Word: %s", count, word)));
    }
}
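The listing above prints the counts in whatever order the map happens to iterate, which makes the most frequent words hard to spot. As a minimal sketch (not part of the original program), the counts could be ranked by frequency with the Stream API alone. The class name `WordFrequencyRanking`, the method `rank`, and the sample word list are hypothetical and stand in for the nouns returned by the tokenizer.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordFrequencyRanking {

    // Counts each word, then orders the entries by count (descending),
    // breaking ties alphabetically by word.
    public static Map<String, Long> rank(List<String> words) {
        Map<String, Long> counts = words.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                // LinkedHashMap keeps the sorted order when the map is iterated.
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        (a, b) -> a,
                        LinkedHashMap::new));
    }

    public static void main(String[] args) {
        // Hypothetical sample input standing in for the nouns extracted by the tokenizer.
        List<String> nouns = Arrays.asList("DNS", "TLS", "DNS", "AI", "TLS", "DNS");
        rank(nouns).forEach((word, count) ->
                System.out.println(String.format("Frequency: %d  Word: %s", count, word)));
    }
}
```

With the sample list, DNS (3) is printed first, then TLS (2), then AI (1). In the analyzer, the list of surface forms would be passed to `rank` instead of being grouped and printed directly.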
- As usual, there are many DNS questions. Perhaps that is because DNS is so often targeted as a security hole?
- Words related to machine learning are a new arrival. Anyone preparing for the Information-Technology Engineers Examination from now on may need to study AI, since it is likely to be a weak area.
- With Java's Stream API, you don't need Python for a bit of Japanese text analysis. In fact, the libraries available for Python differ between the 2.x and 3.x series, which is frankly frustrating. People say Python is easy to use, but how are you supposed to develop with it?