I checked the frequency of words that appeared in the Fall 2017 Security Specialist morning II (AM2) exam

Introduction

Building on my earlier post "I tried OCR processing of PDF file with Java", I analyzed the most frequently used words in the Security Specialist AM2 exam.

Reference page

- I tried OCR processing of PDF file with Java
- Program to count the number of words contained in List

Source for analysis

JapaneseAnalyser.java


package jpn;

import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.atilika.kuromoji.Token;
import org.atilika.kuromoji.Tokenizer;

public class JapaneseAnalyser {
    public static void main(String[] args) throws Exception {
        // parse_1.txt holds the OCR output of the Security Specialist AM2 question paper.
        String input;
        try (Stream<String> lines = Files.lines(Paths.get("parse_1.txt"), Charset.forName("MS932"))) {
            // Files.lines already strips line terminators, so trim each line and join them.
            input = lines.map(String::trim).collect(Collectors.joining());
        }
        analysis(input);
    }

    public static void analysis(String s) {
        Tokenizer tokenizer = Tokenizer.builder().build();
        List<Token> tokens = tokenizer.tokenize(s);
        tokens.stream()
            // Kuromoji reports part-of-speech tags in Japanese; "名詞" means noun.
            .filter(t -> t.getPartOfSpeech().contains("名詞"))
            .map(Token::getSurfaceForm)
            // Count the occurrences of each surface form.
            .collect(Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)))
            .forEach((word, count) ->
                System.out.println(String.format("Frequency: %d Word: %s", count, word)));
    }
}
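
The listing above prints the counts in whatever order the grouped map happens to iterate, since `Collectors.groupingBy` makes no ordering guarantee. To see the most frequent words first, the counts can be sorted before printing. The following is only a minimal sketch under some assumptions: the class name `WordFrequency`, the method `countByFrequency`, and the sample word list are hypothetical and not part of the original program, and the input is assumed to be a list of nouns that has already been extracted (Kuromoji is left out so the example runs on its own).

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of the original program.
public class WordFrequency {

    // Counts each word, then orders by descending count (ties broken alphabetically).
    static List<Map.Entry<String, Long>> countByFrequency(List<String> words) {
        return words.stream()
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
            .entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.<String, Long>comparingByKey()))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the nouns extracted by the tokenizer.
        List<String> nouns = Arrays.asList("DNS", "TLS", "DNS", "AI", "TLS", "DNS");
        countByFrequency(nouns).forEach(e ->
            System.out.println(String.format("Frequency: %d Word: %s", e.getValue(), e.getKey())));
        // Prints:
        // Frequency: 3 Word: DNS
        // Frequency: 2 Word: TLS
        // Frequency: 1 Word: AI
    }
}
```

In the original program, the same sorting step could presumably be applied to the map produced by `groupingBy` before the final `forEach`.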

Analysis result

I posted the results on my blog.

Impressions

- As usual, there are many DNS questions. Is DNS often targeted as a security hole?
- It is new that words related to machine learning appear. For those who are about to start studying for the Information-Technology Engineers Examination, coverage of the AI field may be thin, so it may be worth studying.
- If you have Java's Stream API, you don't need Python for a little Japanese text analysis. In fact, the libraries available for Python differ between the 2.x and 3.x series, which is frustrating to be honest. People say Python is easy to use, but how do they actually develop with it?
