NLP4J [006-034c] 100 language processing knocks with NLP4J # 34 Try to solve "A's B" smarter (final edition)

Task

In NLP4J [006-034b] Let's make an Annotator of 100 language processing knock # 34 "A's B" with NLP4J, I cut the logic out into its own Annotator class so that it could be reused.

However, this is still not enough. Because the keyword extraction rule is hard-coded in program logic, it is not flexible. This logic works for "A's B", but extracting something like "A is B" would require writing separate logic.

Writing a dedicated method just to extract "A's B" is fine as an exercise, but it is not an efficient way to use natural language processing in business.

/**
 * Extracts "noun of noun" sequences as keywords with the facet "word_nn_no_nn".
 * @author Hiroki Oya
 */
public class Nokku34Annotator extends AbstractDocumentAnnotator implements DocumentAnnotator {
	@Override
	public void annotate(Document doc) throws Exception {
		ArrayList<Keyword> newkwds = new ArrayList<>();
		Keyword meishi_a = null;
		Keyword no = null;
		for (Keyword kwd : doc.getKeywords()) {
			if (meishi_a == null && kwd.getFacet().equals("noun")) {
				meishi_a = kwd;
			} //
			else if (meishi_a != null && no == null && kwd.getLex().equals("of")) {
				no = kwd;
			} //
			else if (meishi_a != null && no != null && kwd.getFacet().equals("noun")) {
				Keyword kw = new DefaultKeyword();
				// Populate the new keyword (kw), not the loop variable (kwd)
				kw.setLex(meishi_a.getLex() + no.getLex() + kwd.getLex());
				kw.setFacet("word_nn_no_nn");
				kw.setBegin(meishi_a.getBegin());
				kw.setEnd(kwd.getEnd());
				kw.setStr(meishi_a.getStr() + no.getStr() + kwd.getStr());
				kw.setReading(meishi_a.getReading() + no.getReading() + kwd.getReading());
				newkwds.add(kw);
				meishi_a = null;
				no = null;
			} //
			else {
				meishi_a = null;
				no = null;
			}
		}
		doc.addKeywords(newkwds);
	}
}

How to solve

Looking back at problem #34 of the 100 Language Processing Knocks 2015, all it says is:

"A's B": Extract a noun phrase in which two nouns are connected by "no"

That is all. Writing dedicated program logic just to solve this one problem is not very AI-like.

Therefore, I decided to develop my own rule description.

Rule description

Extract a noun phrase in which two nouns are connected by "no"

So let's design a rule format that can express exactly this.

If you write it in JSON that everyone loves, the rules are as follows.

[{'facet':'noun'},{'lex':'of'},{'facet':'noun'}]

The rule lists the conditions for consecutive keywords as a JSON array, in order: a noun, the particle "no" (written as 'of' in the rule above), and a noun.
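As a rough illustration of how a rule like this could be evaluated, here is a minimal sketch that slides the rule over a list of tokens and reports every window where each condition matches the token at the same position. It is not NLP4J's actual implementation: the class SequenceRuleSketch, its methods, and the use of plain Map objects in place of NLP4J Keyword objects are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not part of NLP4J.
public class SequenceRuleSketch {

    /** True if every attribute named in the condition has the same value in the token. */
    static boolean matches(Map<String, String> condition, Map<String, String> token) {
        for (Map.Entry<String, String> e : condition.entrySet()) {
            if (!e.getValue().equals(token.get(e.getKey()))) {
                return false;
            }
        }
        return true;
    }

    /** Returns every run of consecutive tokens that satisfies the rule position by position. */
    static List<List<Map<String, String>>> find(
            List<Map<String, String>> rule, List<Map<String, String>> tokens) {
        List<List<Map<String, String>>> hits = new ArrayList<>();
        if (rule.isEmpty()) {
            return hits; // an empty rule matches nothing
        }
        for (int start = 0; start + rule.size() <= tokens.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < rule.size() && ok; i++) {
                ok = matches(rule.get(i), tokens.get(start + i));
            }
            if (ok) {
                hits.add(tokens.subList(start, start + rule.size()));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Same content as [{'facet':'noun'},{'lex':'of'},{'facet':'noun'}]
        List<Map<String, String>> rule = List.of(
                Map.of("facet", "noun"), Map.of("lex", "of"), Map.of("facet", "noun"));
        // Made-up tokens standing in for morphological analysis results
        List<Map<String, String>> tokens = List.of(
                Map.of("facet", "noun", "lex", "he"),
                Map.of("facet", "particle", "lex", "of"),
                Map.of("facet", "noun", "lex", "palm"),
                Map.of("facet", "particle", "lex", "of"),
                Map.of("facet", "noun", "lex", "top"));
        for (List<Map<String, String>> hit : find(rule, tokens)) {
            System.out.println(hit);
        }
    }
}
```

Because the rule is data rather than code, extracting a different pattern only means passing a different list of conditions. Note that this sketch reports overlapping matches (both "he of palm" and "palm of top" in the sample), which the hand-written loop shown earlier would not do, since it resets its state after each match.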

Description of extraction result

Extract a noun phrase in which two nouns are connected by "no"

However, the problem does not specify what to output from the noun phrase. Using common sense, let's assume it is the normal form (base form) of each word.

There are many possible syntaxes for describing the extraction result, but I want to keep the same format as IBM Watson Explorer, the most widely used enterprise text mining software in Japan. (So no JSON this time ...)

The documentation for the IBM Watson Explorer rules file is as follows: Content Analysis Collection Custom Rules File (https://www.ibm.com/support/knowledgecenter/en/SS5RWK_3.5.0/com.ibm.discovery.es.ta.doc/iiysatextanalrules.htm)

The manual is hard to read, but the part that describes the extracted keyword is written like this:

${0.lex}-${1.lex}-${2.lex}

The number is the index of the matched keyword, and the string after the period (here, lex) is the name of a keyword attribute. lex means the base form. Spelled out in words:

${base form of keyword 0}-${base form of keyword 1}-${base form of keyword 2}

That is, the base forms of the matched keywords are joined with hyphens.

Anything outside the ${...} placeholders is written as-is, so you can simply concatenate the values:

${0.lex}${1.lex}${2.lex}

or put other text between them:

${0.lex} ... ${1.lex} ... ${2.lex}
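To make the placeholder notation concrete, here is a minimal sketch of how a value template of this form could be expanded with a regular expression. It is only an illustration under stated assumptions: the class ValueTemplateSketch and its expand method are hypothetical, matched keywords are represented as plain Map objects rather than NLP4J Keyword objects, and neither NLP4J nor IBM Watson Explorer is claimed to work this way internally.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: not part of NLP4J.
public class ValueTemplateSketch {

    // Matches placeholders such as ${0.lex}: group 1 = keyword index, group 2 = attribute name.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\d+)\\.(\\w+)\\}");

    /** Replaces each ${index.attribute} with that attribute of the matched keyword at index. */
    static String expand(String template, List<Map<String, String>> matched) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int index = Integer.parseInt(m.group(1));
            String value = matched.get(index).get(m.group(2));
            // quoteReplacement keeps '$' and '\' in the value from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(value)));
        }
        m.appendTail(out); // copy the text after the last placeholder unchanged
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up matched keywords; only the lex attribute is filled in
        List<Map<String, String>> matched = List.of(
                Map.of("lex", "he"), Map.of("lex", "of"), Map.of("lex", "palm"));
        System.out.println(expand("${0.lex}-${1.lex}-${2.lex}", matched));       // he-of-palm
        System.out.println(expand("${0.lex}${1.lex}${2.lex}", matched));         // heofpalm
        System.out.println(expand("${0.lex} ... ${1.lex} ... ${2.lex}", matched)); // he ... of ... palm
    }
}
```

In this sketch an unknown attribute name is rendered as the text "null", and an index beyond the matched keywords throws an exception; a real implementation would likely want to validate the template up front.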

CODE

In code, the settings look like this:

String rule = "[{facet:'noun'},{lex:'of'},{facet:'noun'}]";
String facet = "word_nn_no_nn";
String value = "${0.lex}-${1.lex}-${2.lex}";

If extraction works from just these settings, that is the smart way to do it. It looks a bit more like AI. (It would be smarter still if the rule could be written in natural language ...)

I prepared nlp4j.annotator.KeywordSequencePatternAnnotator as an Annotator that extracts keywords according to the specified rule. Its code is long, so I omit it here.

String rule = "[{facet:'noun'},{lex:'of'},{facet:'noun'}]"; // the only part specific to #34
String facet = "word_nn_no_nn"; // the only part specific to #34
String value = "${0.lex}-${1.lex}-${2.lex}"; // the only part specific to #34

// Use the text file crawler provided by NLP4J
Crawler crawler = new TextFileLineSeparatedCrawler();
crawler.setProperty("file", "src/test/resources/nlp4j.crawler/neko_short_utf8.txt");
crawler.setProperty("encoding", "UTF-8");
crawler.setProperty("target", "text");

// Crawl the documents
List<Document> docs = crawler.crawlDocuments();

// Define the NLP pipeline (multiple processing steps connected in sequence)
DocumentAnnotatorPipeline pipeline = new DefaultDocumentAnnotatorPipeline();
{
	// Annotator that uses the Yahoo! Japan morphological analysis API
	DocumentAnnotator annotator = new YJpMaAnnotator();
	pipeline.add(annotator);
}
{
	KeywordSequencePatternAnnotator annotator = new KeywordSequencePatternAnnotator();
	annotator.setProperty("rule[0]", rule);
	annotator.setProperty("facet[0]", facet);
	annotator.setProperty("value[0]", value);
	pipeline.add(annotator);
}
// Run the annotation
pipeline.annotate(docs);

System.err.println("<Extracted keywords>");
for (Document doc : docs) {
	for (Keyword kwd : doc.getKeywords(facet)) {
		System.err.println(kwd);
	}
}
System.err.println("</Extracted keywords>");

Result

The result is as follows. Keywords were extracted just by specifying the rule!

<Extracted keywords>
he-of-palm[sequence=-1, facet=word_nn_no_nn, lex=he-of-palm, str=he-of-palm, reading=null, count=-1, begin=2, end=5, correlation=0.0]
palm-of-Up[sequence=-1, facet=word_nn_no_nn, lex=palm-of-Up, str=palm-of-Up, reading=null, count=-1, begin=0, end=3, correlation=0.0]
Student-of-face[sequence=-1, facet=word_nn_no_nn, lex=Student-of-face, str=Student-of-face, reading=null, count=-1, begin=11, end=15, correlation=0.0]
Should be-of-face[sequence=-1, facet=word_nn_no_nn, lex=Should be-of-face, str=Should be-of-face, reading=null, count=-1, begin=13, end=17, correlation=0.0]
face-of-middle[sequence=-1, facet=word_nn_no_nn, lex=face-of-middle, str=face-of-middle, reading=null, count=-1, begin=5, end=9, correlation=0.0]
hole-of-During ~[sequence=-1, facet=word_nn_no_nn, lex=hole-of-During ~, str=hole-of-During ~, reading=null, count=-1, begin=6, end=9, correlation=0.0]
</Extracted keywords>

Maven

The above code works with nlp4j-core 1.2.0.0 or later.

<dependency>
  <groupId>org.nlp4j</groupId>
  <artifactId>nlp4j-core</artifactId>
  <version>1.2.0.0</version>
</dependency>

Version 1.2.0.0 seems to have taken more than 12 hours from build and upload until it was available on Maven. Perhaps the server was busy.

Impressions

I think that solving the problem by writing a rule is smarter than other solutions to 100 language processing knock # 34.

Summary

With NLP4J, you can easily process natural language in Java!

Project URL

https://www.nlp4j.org/

