NLP4J [005-2] NLP4J + Twitter4J (Analysis 1)

Return to Index: [005-1] NLP4J + Twitter4J (data collection) > this page > Next page

Take a look at the results

The full text of the output is available here.

Now let's see the result ...

Processing time

processing time[ms]:34586

It took about 34 seconds, which is pretty slow. There are 90 documents in total, and each document calls the Yahoo API twice, so 180 API calls in all (roughly 190 ms per call); I think that is the cause. Yahoo's natural language processing API is easy to use, but given its rate limit and this performance, it is worth considering a library that can be run locally.

NLP4J provides a mechanism for wrapping other natural language processing libraries, so I would like to implement that later. [When?]
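As a point of reference, one locally runnable analyzer is kuromoji, which I already covered in [001b]. Below is a minimal sketch of calling kuromoji directly (not through NLP4J), assuming the atilika kuromoji-ipadic artifact is on the classpath; the class name is just for illustration, and no API calls or rate limits are involved.

import java.util.List;

import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;

public class LocalMaExample {
	public static void main(String[] args) {
		// Tokenize locally; no network access is required
		Tokenizer tokenizer = new Tokenizer();
		List<Token> tokens = tokenizer.tokenize("東京モーターショーに行きました。");
		for (Token token : tokens) {
			// Print the surface form and the top-level part of speech
			System.out.println(token.getSurface() + "\t" + token.getPartOfSpeechLevel1());
		}
	}
}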

Noun frequency order

count=117,facet=noun,lex=co
count=117,facet=noun,lex=https
count=76,facet=noun,lex=2019
count=50,facet=noun,lex=TMS
count=40,facet=noun,lex=Tokyo Motor Show
count=30,facet=noun,lex=Nissan
count=30,facet=noun,lex=RT
count=29,facet=noun,lex=2
count=28,facet=noun,lex=HondaTMS
count=25,facet=noun,lex=1
count=24,facet=noun,lex=3
count=24,facet=noun,lex=4
count=22,facet=noun,lex=6
count=22,facet=noun,lex=TOYOTA
count=21,facet=noun,lex=booth
count=20,facet=noun,lex=5
count=19,facet=noun,lex=Here
count=18,facet=noun,lex=Honda
count=18,facet=noun,lex=8
count=16,facet=noun,lex=Toyota
count=15,facet=noun,lex=10
count=14,facet=noun,lex=player
count=14,facet=noun,lex=future
count=14,facet=noun,lex=See
count=12,facet=noun,lex=9
count=12,facet=noun,lex=Experience
count=12,facet=noun,lex=NissanTMS
count=11,facet=noun,lex=By all means
count=10,facet=noun,lex=Venue
count=10,facet=noun,lex=fit
count=10,facet=noun,lex=Waiting
count=10,facet=noun,lex=PR

Since the Tokyo Motor Show is being held, terms such as "TMS", "Tokyo Motor Show", and "future" rank high.

So, "co" and "http" are the highest, so it's annoying. .. Apparently, Yahoo's natural language processing API doesn't treat "URLs" differently. Also, numbers like "2019" stand out. Yahoo's natural language processing API seems to be a specification that does not return "numerals".

Check Yahoo's morphological analysis

Let's see what the results look like for URLs and numerals.

//Assumed imports (package names follow the NLP4J documentation and may differ by version)
import java.util.ArrayList;

import nlp4j.Keyword;
import nlp4j.yhoo_jp.YJpMaService;

//Natural-language text containing a URL and a number
String text = "http://www.yahoo.co.jp/is. I picked up 100 yen.";
//Japanese morphological analysis via the Yahoo API
YJpMaService service = new YJpMaService();
//Get the result of morphological analysis
ArrayList<Keyword> kwds = service.getKeywords(text);
//Output all keywords
for (Keyword kwd : kwds) {
	System.out.println(kwd);
}

http [sequence=1, facet=noun, lex=http, str=http, reading=http, count=-1, begin=0, end=4, correlation=0.0]
: [sequence=2, facet=Special, lex=:, str=:, reading=:, count=-1, begin=4, end=5, correlation=0.0]
/ [sequence=3, facet=Special, lex=/, str=/, reading=/, count=-1, begin=5, end=6, correlation=0.0]
/ [sequence=4, facet=Special, lex=/, str=/, reading=/, count=-1, begin=6, end=7, correlation=0.0]
www [sequence=5, facet=noun, lex=www, str=www, reading=www, count=-1, begin=7, end=10, correlation=0.0]
. [sequence=6, facet=Special, lex=., str=., reading=., count=-1, begin=10, end=11, correlation=0.0]
yahoo [sequence=7, facet=noun, lex=yahoo, str=yahoo, reading=yahoo, count=-1, begin=11, end=16, correlation=0.0]
. [sequence=8, facet=Special, lex=., str=., reading=., count=-1, begin=16, end=17, correlation=0.0]
co [sequence=9, facet=noun, lex=co, str=co, reading=co, count=-1, begin=17, end=19, correlation=0.0]
. [sequence=10, facet=Special, lex=., str=., reading=., count=-1, begin=19, end=20, correlation=0.0]
jp [sequence=11, facet=noun, lex=jp, str=jp, reading=jp, count=-1, begin=20, end=22, correlation=0.0]
/ [sequence=12, facet=Special, lex=/, str=/, reading=/, count=-1, begin=22, end=23, correlation=0.0]
  [sequence=13, facet=Special, lex= , str= , reading= , count=-1, begin=23, end=24, correlation=0.0]
is[sequence=14, facet=Auxiliary verb, lex=is, str=is, reading=is, count=-1, begin=24, end=26, correlation=0.0]
。 [sequence=15, facet=Special, lex=。, str=。, reading=。, count=-1, begin=26, end=27, correlation=0.0]
100 [sequence=16, facet=noun, lex=100, str=100, reading=100, count=-1, begin=27, end=30, correlation=0.0]
Circle[sequence=17, facet=Suffix, lex=Circle, str=Circle, reading=yen, count=-1, begin=30, end=31, correlation=0.0]
pick up[sequence=18, facet=verb, lex=pick up, str=Pick up, reading=Wide, count=-1, begin=31, end=33, correlation=0.0]
Masu[sequence=19, facet=Auxiliary verb, lex=Masu, str=Better, reading=Better, count=-1, begin=33, end=35, correlation=0.0]
Ta[sequence=20, facet=Auxiliary verb, lex=Ta, str=Ta, reading=Ta, count=-1, begin=35, end=36, correlation=0.0]
。 [sequence=21, facet=Special, lex=。, str=。, reading=。, count=-1, begin=36, end=37, correlation=0.0]

... this is a bit of a problem. URLs and numerals are both judged to be "nouns", so this is something I would like to correct. NLP4J also has a mechanism for post-processing morphological analysis results, so I would like to handle this from the next article onwards.
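As a stopgap until that post-processing is in place, another option is to strip URLs out of the tweet text before it reaches the morphological analyzer at all. The following is a minimal sketch of that idea; the regular expression and class name are my own assumptions, not part of NLP4J.

import java.util.regex.Pattern;

public class UrlStripExample {

	// Rough pattern for http/https URLs embedded in tweet text (not exhaustive)
	private static final Pattern URL_PATTERN = Pattern.compile("https?://\\S+");

	static String stripUrls(String text) {
		// Replace each URL with a space so the remaining words stay separated
		return URL_PATTERN.matcher(text).replaceAll(" ").trim();
	}

	public static void main(String[] args) {
		String text = "Nissan booth at TMS2019 https://t.co/xxxxxxx";
		// Prints "Nissan booth at TMS2019": only natural language is left for analysis
		System.out.println(stripUrls(text));
	}
}

Passing the stripped text to the analyzer should keep tokens like "https" and "co" out of the noun counts.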

Return to Index: [005-1] NLP4J + Twitter4J (data collection) > this page > Next page
