I wrote a library that lets you do text mining with a simple Java program like the one below, and I am thinking of releasing it as open source soon. It is aimed at people who want to try natural language processing and text mining.
Morphological analysis is done with Yahoo Japan's Web service. The library then uses the analysis results to output the characteristic keywords in the documents.
Processing and Input
// Enclosing method header assumed; the original listing shows only its closing brace.
public static void main(String[] args) {
    // Input: each document is an (item, text) pair
    List<Document> docs = new ArrayList<Document>();
    {
        docs.add(createDocument("Toyota", "I am making a hybrid car."));
        docs.add(createDocument("Toyota", "We sell hybrid cars."));
        docs.add(createDocument("Toyota", "I'm making a car."));
        docs.add(createDocument("Toyota", "I sell cars."));
        docs.add(createDocument("Nissan", "I'm making an EV."));
        docs.add(createDocument("Nissan", "I sell EVs."));
        docs.add(createDocument("Nissan", "I sell cars."));
        docs.add(createDocument("Nissan", "We are affiliated with Renault."));
        docs.add(createDocument("Nissan", "I sell light cars."));
        docs.add(createDocument("Honda", "I'm making a car."));
        docs.add(createDocument("Honda", "I sell cars."));
        docs.add(createDocument("Honda", "I'm making a motorcycle."));
        docs.add(createDocument("Honda", "I sell motorcycles."));
        docs.add(createDocument("Honda", "I sell light cars."));
        docs.add(createDocument("Honda", "I am making a light car."));
    }

    Annotator annotator = new YJpMaAnnotator();
    {
        // Run morphological analysis on every document
        annotator.annotate(docs);
    }

    Index index = new SimpleDocumentIndex();
    {
        // Index the keywords found by the analysis
        index.addDocuments(docs);
    }

    {
        // Get the nouns that co-occur most strongly with Nissan
        List<Keyword> kwds = index.getKeywords("noun", "item=Nissan");
        System.out.println("Keywords(noun) for Nissan");
        for (Keyword kwd : kwds) {
            System.out.println(String.format("%.1f,%s", kwd.getCorrelation(), kwd.getLex()));
        }
    }
    {
        // Get the nouns that co-occur most strongly with Toyota
        List<Keyword> kwds = index.getKeywords("noun", "item=Toyota");
        System.out.println("Keywords(noun) for Toyota");
        for (Keyword kwd : kwds) {
            System.out.println(String.format("%.1f,%s", kwd.getCorrelation(), kwd.getLex()));
        }
    }
    {
        // Get the nouns that co-occur most strongly with Honda
        List<Keyword> kwds = index.getKeywords("noun", "item=Honda");
        System.out.println("Keywords(noun) for Honda");
        for (Keyword kwd : kwds) {
            System.out.println(String.format("%.1f,%s", kwd.getCorrelation(), kwd.getLex()));
        }
    }
}
Output: the keywords characteristic of Nissan, in descending order of correlation coefficient
Keywords(noun) for Nissan
3.0,EV
3.0,Renault
3.0,Alliance
1.0,Light car
0.6,Automobile
The corresponding output for Toyota and Honda:
Keywords(noun) for Toyota
3.8,hybrid
3.8,car
1.5,Automobile
Keywords(noun) for Honda
2.5,bike
1.7,Light car
1.0,Automobile
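
The scores printed above are consistent with a lift ratio counted over documents, P(keyword | item) / P(keyword). For example, EV appears in 2 of Nissan's 5 documents and in 2 of all 15 documents, which gives (2/5) / (2/15) = 3.0. The following is a minimal sketch under that assumption; it is not necessarily how the library computes getCorrelation(). The names LiftSketch, Doc, Scored, sampleDocs and keywordsFor are hypothetical, and the nouns are tokenized by hand here as a stand-in for the Yahoo Japan morphological analysis.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: scores keywords by lift, assuming that is what the
// correlation value in the output represents.
class LiftSketch {

    /** One document: the item it belongs to plus its nouns (hand-tokenized). */
    static final class Doc {
        final String item;
        final List<String> nouns;
        Doc(String item, String... nouns) {
            this.item = item;
            this.nouns = Arrays.asList(nouns);
        }
    }

    /** A keyword together with its lift score. */
    static final class Scored {
        final String keyword;
        final double lift;
        Scored(String keyword, double lift) {
            this.keyword = keyword;
            this.lift = lift;
        }
    }

    /** The 15 sample documents, with nouns labeled as in the printed output. */
    static List<Doc> sampleDocs() {
        return Arrays.asList(
            new Doc("Toyota", "hybrid", "car"), new Doc("Toyota", "hybrid", "car"),
            new Doc("Toyota", "Automobile"), new Doc("Toyota", "Automobile"),
            new Doc("Nissan", "EV"), new Doc("Nissan", "EV"),
            new Doc("Nissan", "Automobile"), new Doc("Nissan", "Renault", "Alliance"),
            new Doc("Nissan", "Light car"),
            new Doc("Honda", "Automobile"), new Doc("Honda", "Automobile"),
            new Doc("Honda", "bike"), new Doc("Honda", "bike"),
            new Doc("Honda", "Light car"), new Doc("Honda", "Light car"));
    }

    /** Returns the keywords seen with the item, sorted by descending lift. */
    static List<Scored> keywordsFor(String item, List<Doc> docs) {
        int total = docs.size();
        int itemDocs = 0;
        Map<String, Integer> kwDocs = new LinkedHashMap<>(); // docs containing the keyword
        Map<String, Integer> both = new LinkedHashMap<>();   // docs with keyword and item
        for (Doc d : docs) {
            boolean match = d.item.equals(item);
            if (match) {
                itemDocs++;
            }
            // Count each keyword at most once per document.
            for (String kw : new LinkedHashSet<>(d.nouns)) {
                kwDocs.merge(kw, 1, Integer::sum);
                if (match) {
                    both.merge(kw, 1, Integer::sum);
                }
            }
        }
        List<Scored> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : kwDocs.entrySet()) {
            Integer b = both.get(e.getKey());
            if (b == null) {
                continue; // keyword never appears with this item
            }
            // lift = (b / itemDocs) / (kwDocs / total), rearranged into one division
            double lift = (double) (b * total) / (itemDocs * e.getValue());
            result.add(new Scored(e.getKey(), lift));
        }
        // List.sort is stable, so ties keep first-appearance order.
        result.sort(Comparator.comparingDouble((Scored s) -> s.lift).reversed());
        return result;
    }

    public static void main(String[] args) {
        List<Doc> docs = sampleDocs();
        for (String item : Arrays.asList("Nissan", "Toyota", "Honda")) {
            System.out.println("Keywords(noun) for " + item);
            for (Scored s : keywordsFor(item, docs)) {
                System.out.println(String.format("%.1f,%s", s.lift, s.keyword));
            }
        }
    }
}
```

On the sample data this gives the same ranking as the output above, with scores that agree after rounding to one decimal place. A lift above 1 means the keyword is more frequent in that item's documents than in the collection as a whole, which is why a word like Automobile, shared by all three makers, scores low.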