I tried morphological analysis with MeCab

This post walks through the steps needed to run morphological analysis from Ruby (using the mecab gem) on Ubuntu (bionic).

First, install MeCab, its dictionary, the development headers, and the gem.

apt install mecab mecab-ipadic-utf8 libmecab-dev
gem install mecab

The following program prints the analysis result.

require 'mecab'

# Create a tagger with the default settings and print the raw analysis result.
tagger = MeCab::Tagger.new
puts tagger.parse(File.read('sample.txt'))
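With the default output format, each line holds the surface form, a tab, and a comma-separated feature list, and the output ends with an EOS line. The following is a minimal sketch of splitting such lines into fields. The sample text is hand-written in the IPA dictionary layout rather than captured from a real run, and `Morpheme` and `parse_mecab_output` are hypothetical names introduced only for illustration.

```ruby
# Hand-written sample in the IPA dictionary layout (illustrative, not captured output).
SAMPLE_OUTPUT = <<~OUT
  猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ
  が\t助詞,格助詞,一般,*,*,*,が,ガ,ガ
  鳴く\t動詞,自立,*,*,五段・カ行イ音便,基本形,鳴く,ナク,ナク
  。\t記号,句点,*,*,*,*,。,。,。
  EOS
OUT

# Hypothetical container for the fields used in this sketch.
Morpheme = Struct.new(:surface, :pos, :base_form)

# Hypothetical helper: turns MeCab-style output into Morpheme objects.
def parse_mecab_output(output)
  # Drop blank lines and the end-of-sentence marker.
  lines = output.each_line(chomp: true).reject { |l| l.empty? || l == 'EOS' }
  lines.map do |line|
    surface, feature = line.split("\t", 2) # surface form, then the feature list
    fields = feature.to_s.split(',')
    # In the IPA dictionary layout, field 0 is the part of speech and field 6 the base form.
    Morpheme.new(surface, fields[0], fields[6])
  end
end

parse_mecab_output(SAMPLE_OUTPUT).each do |m|
  puts "#{m.surface} (#{m.pos}, base form: #{m.base_form})"
end
```

The field positions depend on the dictionary, so they would need adjusting for anything other than the IPA dictionary.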

This sample parses the output string and displays the words in ascending order of how many times each appears.

require 'mecab'

tagger = MeCab::Tagger.new
t = tagger.parse(File.read('sample.txt'))
words = {}
t.split("\n").each do |l|
  next if l == 'EOS'        # skip MeCab's end-of-sentence marker
  w = l.split("\t")[0]      # the surface form comes before the tab
  c = words[w] || 0
  c += 1
  words[w] = c
end

# Sort by count (ascending) and print "word<TAB>count".
words.sort {|a,b| a[1] <=> b[1]}.each do |v|
  puts v[0]+"\t"+v[1].to_s
end

This example ignores the part of speech, so punctuation marks such as commas and periods are counted as well. Filter the results as needed for your purpose.
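One possible way to filter is to look at the first feature field, which holds the part of speech in the IPA dictionary layout (punctuation is tagged 記号 there). The sketch below assumes that layout and works on a hand-written sample string instead of real MeCab output. `count_surfaces` and its `exclude_pos` parameter are hypothetical names, not part of the mecab gem.

```ruby
# Hand-written sample in the IPA dictionary layout (illustrative, not captured output).
PUNCTUATED_SAMPLE = <<~OUT
  猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ
  、\t記号,読点,*,*,*,*,、,、,、
  猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ
  。\t記号,句点,*,*,*,*,。,。,。
  EOS
OUT

# Hypothetical helper: counts surface forms, skipping the given parts of speech.
def count_surfaces(output, exclude_pos: ['記号'])
  counts = Hash.new(0)
  output.each_line(chomp: true) do |line|
    next if line.empty? || line == 'EOS'     # skip the end-of-sentence marker
    surface, feature = line.split("\t", 2)
    pos = feature.to_s.split(',').first      # first feature field = part of speech
    next if exclude_pos.include?(pos)
    counts[surface] += 1
  end
  counts
end

count_surfaces(PUNCTUATED_SAMPLE).each do |word, count|
  puts "%4d %s" % [count, word]
end
```

Passing a different list, for example `exclude_pos: ['記号', '助詞']`, would also drop particles. Which parts of speech to exclude depends on what the counts are for.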

There is also a gem called natto, which may be worth using. If you want to try more specialized analysis methods easily, or to visualize the results as graphs, the free software KH Coder may be useful (it also appears to use MeCab internally).

Reference: I tried using mecab

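Addendum (20.06.13): I improved the code following the advice in the comment section. The Ruby version that ships as standard with Ubuntu (bionic-beaver) was 2.5.1p57, which predates Enumerable#tally (added in Ruby 2.7), so this version does not use tally.

require 'mecab'

tagger = MeCab::Tagger.new
t = tagger.parse(IO.read('sample.txt'))
words = Hash.new(0)           # missing keys start at 0
t.split("\n").each do |l|
  next if l == 'EOS'          # skip MeCab's end-of-sentence marker
  w = l.split("\t")[0]        # the surface form comes before the tab
  words[w] += 1
end

# Sort by count (ascending) and print the count followed by the word.
words.sort_by {|_,f| f}.each do |w,f|
  puts "%4d %s" % [f,w]
end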
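For comparison, on Ruby 2.7 or later the counting step could be written with tally. This is a minimal sketch that uses a hand-written word list in place of MeCab output. `frequency_table` is a hypothetical helper name, and it sorts in descending order so that the most frequent word comes first.

```ruby
# Hypothetical helper: returns [word, count] pairs, most frequent first.
# Requires Ruby 2.7 or later for Enumerable#tally.
def frequency_table(words)
  words.tally.sort_by { |_, count| -count }
end

# Hand-written word list standing in for the surface forms from MeCab.
sample_words = %w[猫 が 猫 を 猫 が]

frequency_table(sample_words).each do |word, count|
  puts "%4d %s" % [count, word]
end
```

Negating the count inside sort_by gives descending order without a separate reverse call.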
