It keeps getting more interesting to explore how to integrate Ruby and Rust. So far ((3)–(5)) I've tried numerical computation, so this time let's try text processing. Specifically, I'll use a Rust morphological analysis library to extract from a text only the morphemes of particular parts of speech: only proper nouns, all nouns, adjectives and adverbs, and so on.
I've been a Ruby user for a long time (though never very intensively), but I'm an amateur at Rust, and my only experience with morphological analysis is playing around a little with MeCab from Ruby. I don't know how difficult this will turn out to be.
As the Rust morphological analysis library, we use Lindera. This is a fork by @mosuka of an experimental library called kuromoji-rs; it takes the form of a fork, but in effect the development has been taken over under a new name. For details, see @mosuka's article: Rust beginner takes over development of a Japanese morphological analyzer written in Rust (Qiita).
As the mechanism connecting Ruby and Rust, we use Rutie, as in (4) and (5).
The division of roles between Ruby and Rust is as follows. On the Rust side, we create a Ruby class that works as a morpheme extractor (Rutie lets you write a Ruby class in Rust). At initialization it receives a list of the parts of speech to pick up (every part of speech in the list is picked up). If you create an instance of the extractor and give it a text, the matching morphemes are returned as an array of strings (in order of appearance, duplicates included).
In the sample program on the Ruby side, a frequency table is built from the returned list of morphemes and displayed in descending order of frequency.
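As a sketch of that Ruby-side post-processing (the morpheme array here is a hypothetical stand-in for what the extractor would return):

```ruby
# Hypothetical output of the extractor: morphemes in order of
# appearance, duplicates included.
morphemes = ["父", "道程", "父", "道", "道程", "父"]

# Build a frequency table and sort by descending frequency.
freq = morphemes.tally.sort_by { |_word, n| -n }

freq.each { |word, n| puts format("%4d %s", n, word) }
```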
For an overview of Lindera, see the links above; here I will just note the following points.
I'm grateful that a dictionary is bundled from the start. Having to download a dictionary from somewhere, run a command, and place files in the right location is a bit of a hurdle when you just want to try something out.
Also, while IPADIC has nowhere near enough words to handle the kind of text flying around social media and other platforms, it's good that a larger dictionary such as IPADIC-NEologd can also be used easily.
To add user-defined words, you simply place a CSV file and specify its path.
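As I understand it, Lindera accepts a simple kuromoji-style user dictionary CSV of surface form, part of speech, and reading; the entry below is a made-up example (check Lindera's documentation for the exact format expected by your version):

```csv
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
```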
In this article, I'd like to present code that is not practical as-is, but is simple enough that you can imagine the path to practical code, so that others can easily refer to it.
In Ruby, there are gems for using the morphological analyzers MeCab and JUMAN++: natto and jumanpp_ruby, respectively[^gem].
[^gem]: There seem to be others, but I'm not familiar with them.
So why go to the trouble of writing code that calls Rust from Ruby? The background is the following hypothesis: I want to avoid GC.
When you use MeCab or similar from Ruby, a large amount of string data is brought over to the Ruby side, one chunk per morpheme. Most of it becomes garbage, and once enough accumulates it becomes a target for garbage collection. This seems inefficient[^ef].
[^ef]: Whether it is really inefficient, and how much text you would have to process before it affects performance, can't be said without proper experiments.
For a task like noun extraction, it should be efficient to extract only the nouns on the Rust side and return to Ruby only the strings it actually wants. On the Rust side, garbage collection does not occur; a variable that goes out of scope is freed at that moment.
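To illustrate the difference (with made-up data standing in for a tokenizer's output): in the naive pipeline every token's strings cross into Ruby and most become garbage, whereas the approach taken here hands Ruby only the matching strings.

```ruby
# Hypothetical tokenizer output: [surface, part_of_speech] pairs.
tokens = [["父", "名詞"], ["を", "助詞"], ["道", "名詞"]]

# Naive approach: bring everything to Ruby, then filter.
# Every surface and POS string becomes a Ruby object first.
nouns_naive = tokens.select { |_s, pos| pos == "名詞" }.map { |s, _| s }

# This article's approach: filtering happens on the Rust side,
# so Ruby only ever sees the strings it asked for (simulated here).
nouns_filtered = ["父", "道"]

# Same result either way; what differs is the allocation on the Ruby side.
```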
First, create the project:

```shell
cargo new phoneme_extractor --lib
```

Here "phoneme" is intended to mean "morpheme." I don't know whether "morpheme extractor" is a valid term, or whether "phoneme extractor" is really right in English (strictly, a phoneme is a unit of sound rather than of meaning).
Then write the following in Cargo.toml:

Cargo.toml

```toml
[dependencies]
lindera = "0.5.1"
lazy_static = "1.4.0"
rutie = "0.8.1"
serde = { version = "1.0.115", features = ["derive"] }
serde_json = "1.0.57"

[lib]
crate-type = ["cdylib"]
```

(Note: the `derive` feature of serde is needed for `#[derive(Deserialize)]` below.)
(Added 2020-10-01) The version of Rutie was originally "0.7.0", but I've changed it to the latest, "0.8.1". This eliminates a warning that appeared with Rust 1.46. If anyone finds that it compiled with 0.7.0 but not with 0.8.1, please let me know.
lindera is the morphological-analysis crate at the heart of this task. rutie is the crate that connects Ruby and Rust. lazy_static is required when creating a class with Rutie.
I didn't know of a good way to pass information (such as which parts of speech to extract) from Ruby to Rust, so I decided to use a JSON-formatted string. serde and serde_json are used for that.
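On the Ruby side, such a parameter string can be built from an ordinary hash (a sketch; the key names `mode` and `allowed_poss` follow the struct defined below):

```ruby
require "json"

# Build the initialization parameters as a Ruby hash,
# then serialize them to the JSON string the Rust side expects.
params = {
  mode: "normal",
  allowed_poss: ["名詞,一般", "名詞,固有名詞"]
}
json = JSON.generate(params)
# `json` can now be handed to the extractor's constructor.
```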
This is the whole code on the Rust side.
src/lib.rs

```rust
#[macro_use]
extern crate rutie;
#[macro_use]
extern crate lazy_static;

use serde::Deserialize;
use rutie::{Object, Class, RString, Array};
use lindera::tokenizer::Tokenizer;

#[derive(Deserialize)]
pub struct RustPhonemeExtractor {
    mode: String,
    allowed_poss: Vec<String>,
}

wrappable_struct!(RustPhonemeExtractor, PhonemeExtractorWrapper, PHONEME_EXTRACTOR_WRAPPER);

class!(PhonemeExtractor);

methods!(
    PhonemeExtractor,
    rtself,

    fn phoneme_extractor_new(params: RString) -> PhonemeExtractor {
        let params = params.unwrap().to_string();
        let rpe: RustPhonemeExtractor = serde_json::from_str(&params).unwrap();
        Class::from_existing("PhonemeExtractor").wrap_data(rpe, &*PHONEME_EXTRACTOR_WRAPPER)
    }

    fn extract(input: RString) -> Array {
        let extractor = rtself.get_data(&*PHONEME_EXTRACTOR_WRAPPER);
        let input = input.unwrap();
        let mut tokenizer = Tokenizer::new(&extractor.mode, "");
        let tokens = tokenizer.tokenize(input.to_str());
        let mut result = Array::new();
        for token in tokens {
            let detail = token.detail;
            let pos: String = detail.join(",");
            if extractor.allowed_poss.iter().any(|s| pos.starts_with(s)) {
                result.push(RString::new_utf8(&token.text));
            }
        }
        result
    }
);

#[allow(non_snake_case)]
#[no_mangle]
pub extern "C" fn Init_phoneme_extractor() {
    Class::new("PhonemeExtractor", None).define(|klass| {
        klass.def_self("new", phoneme_extractor_new);
        klass.def("extract", extract);
    });
}
```
Below, I will add a little explanation.
RustPhonemeExtractor
Using Rutie, we create a class called PhonemeExtractor for Ruby to use. First we define a struct RustPhonemeExtractor, then wrap it to create PhonemeExtractor.
This is the definition of RustPhonemeExtractor.
```rust
#[derive(Deserialize)]
pub struct RustPhonemeExtractor {
    mode: String,
    allowed_poss: Vec<String>,
}
```
Oh, I haven't mentioned yet that Lindera has two "modes": `normal` and `decompose`. Roughly speaking, `decompose` is a mode that splits compound words; in other words, `decompose` segments more finely than `normal`. The `mode` field lets this be specified.
The other field, `allowed_poss`, holds the list of parts of speech to pick up, as a vector. The name `poss` is rather makeshift: the English term "part of speech" is abbreviated `pos`, and I made it plural (?) as `poss` (`poses` would be confusing, being the third-person singular present of "pose").
PhonemeExtractor

Next, we create the Ruby class PhonemeExtractor.
To wrap RustPhonemeExtractor and create PhonemeExtractor, we write:

```rust
wrappable_struct!(RustPhonemeExtractor, PhonemeExtractorWrapper, PHONEME_EXTRACTOR_WRAPPER);
```

For an explanation, see the previous article: Ruby/Rust linkage (5) Numerical calculation with Rutie ② Bézier (Qiita).
And to create the class, we write:

```rust
class!(PhonemeExtractor);
```
Next, we define PhonemeExtractor's methods with the `methods!` macro. The following two methods are defined.
phoneme_extractor_new (creates an instance)

This is its definition:
```rust
fn phoneme_extractor_new(params: RString) -> PhonemeExtractor {
    let params = params.unwrap().to_string();
    let rpe: RustPhonemeExtractor = serde_json::from_str(&params).unwrap();
    Class::from_existing("PhonemeExtractor").wrap_data(rpe, &*PHONEME_EXTRACTOR_WRAPPER)
}
```
`RString` is the Rust type (defined in Rutie) that corresponds to Ruby's String class.
`params` is a string representing, in JSON form, the Lindera mode and the part-of-speech list used for initialization.
Now, here is an interesting part: building the `RustPhonemeExtractor` value from the JSON string in `params` takes nothing more than

```rust
serde_json::from_str(&params).unwrap()
```
This is the impressive part of the Serde crate (I don't really know how it works). It interprets the JSON according to the struct definition. If a JSON string that doesn't match the struct definition is given, the program will crash at the `unwrap()`. If you want to make a practical library, you should handle the error properly.
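One cheap guard, sketched here in Ruby: validate the JSON before it ever reaches the Rust side. The helper `valid_params?` and its required-key check are my own illustration, not part of the article's code.

```ruby
require "json"

REQUIRED_KEYS = %w[mode allowed_poss].freeze

# Validate the parameter JSON on the Ruby side, so a malformed
# string never reaches the Rust-side unwrap().
def valid_params?(json)
  params = JSON.parse(json)
  params.is_a?(Hash) && REQUIRED_KEYS.all? { |k| params.key?(k) }
rescue JSON::ParserError
  false
end
```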
By the way, I'm expecting a JSON string like this to be given:
```json
{
  "mode": "normal",
  "allowed_poss": [
    "名詞,一般",
    "名詞,固有名詞",
    "名詞,副詞可能",
    "名詞,サ変接続",
    "名詞,形容動詞語幹",
    "名詞,ナイ形容詞語幹"
  ]
}
```
Parts of speech will be discussed later in a separate section.
extract (extracts morphemes)

This is an instance method of the PhonemeExtractor class. Its definition, pulled out of the macro, looks like this:
```rust
fn extract(input: RString) -> Array {
    let extractor = rtself.get_data(&*PHONEME_EXTRACTOR_WRAPPER);
    let input = input.unwrap();
    let mut tokenizer = Tokenizer::new(&extractor.mode, "");
    let tokens = tokenizer.tokenize(input.to_str());
    let mut result = Array::new();
    for token in tokens {
        let detail = token.detail;
        let pos: String = detail.join(",");
        if extractor.allowed_poss.iter().any(|s| pos.starts_with(s)) {
            result.push(RString::new_utf8(&token.text));
        }
    }
    result
}
```
Given the input text as an RString (corresponding to a Ruby String), a list of morphemes is returned in the form of an Array of String.
`rtself` is what we passed as the second argument to the `methods!` macro; it seems to correspond to the instance of the Ruby class PhonemeExtractor (?).
The variable `extractor` is the instance of `RustPhonemeExtractor`.
When not using a user dictionary, we create a tokenizer with `Tokenizer::new`. The first argument is the mode string described above, and the second argument is the directory path of the dictionary to use; giving an empty string uses the default IPADIC.
To use a user dictionary, use `Tokenizer::new_with_userdic` instead and give the path of the user dictionary (in CSV format) as the third argument.
Giving text to the tokenizer's `tokenize` method returns the token sequence as a vector. One token corresponds to one morpheme.
A `Token` is defined as:

```rust
#[derive(Serialize, Clone)]
pub struct Token<'a> {
    pub text: &'a str,
    pub detail: Vec<String>,
}
```
`text` is the segmented morpheme itself. For 「コードを書こう」 ("let's write some code"), the four tokens 「コード」「を」「書こ」「う」 are what you get.
`detail` is a vector of String that holds, all together, the information about one extracted morpheme. What information appears in what order depends on the dictionary used.
With the default IPADIC, indexes 0 through 3 are part-of-speech information, followed by information such as conjugation type, conjugated form, base form, and readings.
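In Ruby terms, the `detail` of one IPADIC token is shaped like this (the concrete values below are my own illustrative guess at an entry, not actual dictionary output):

```ruby
# Illustrative shape of an IPADIC `detail` (9 elements):
# [POS, POS sub 1, POS sub 2, POS sub 3, conjugation type,
#  conjugated form, base form, reading, pronunciation]
detail = ["名詞", "固有名詞", "人名", "名", "*", "*", "花子", "ハナコ", "ハナコ"]

pos_layers = detail[0..3]  # the four part-of-speech layers
```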
The heart of this function is checking whether an extracted morpheme matches any of the specified parts of speech, but since the part-of-speech system needs to be explained first, I'll set that aside for now.
In any case, the function pushes the `text` of each matching morpheme into the Ruby array `result` and finally returns `result`.
The rest is only:

```rust
#[allow(non_snake_case)]
#[no_mangle]
pub extern "C" fn Init_phoneme_extractor() {
    Class::new("PhonemeExtractor", None).define(|klass| {
        klass.def_self("new", phoneme_extractor_new);
        klass.def("extract", extract);
    });
}
```

This binds the Ruby PhonemeExtractor class's singleton method `new` and instance method `extract` to the functions defined by the `methods!` macro.
See previous article.
With IPADIC, parts of speech seem to follow the "IPA part-of-speech system," which consists of four layers. I couldn't find the primary source for this system, but for now it is described on the following page: Part-of-speech systems of morphological analysis tools.
According to this, we get, for example:

「花子」 (Hanako) → `["名詞", "固有名詞", "人名", "名"]` (noun, proper noun, person's name, given name)
「タマネギ」 (onion) → `["名詞", "一般", "*", "*"]` (noun, general)

(Think of these as the first four elements of an extracted token's `detail`.)
Note that the length (number of elements) of `detail` is basically 9 with IPADIC, but for morphemes judged to be "unknown words," `detail` is just `["UNK"]`, a vector of length 1.
Now, depending on the application, you may want to pick up everything whose 0th part-of-speech element is 名詞 (noun), or only those whose 0th and 1st elements are 名詞 and 固有名詞 (proper noun) respectively (with the 2nd and 3rd elements not mattering).
In other words, it depends on how deep you want the specification to go.
How should this be specified, and how should it be judged? I wanted to keep it as simple as possible, so I settled on the following.
A specification is a comma-joined string of part-of-speech information down to the required depth, such as `"名詞"` or `"名詞,固有名詞"`.
For each found morpheme, take the comma-joined string of its `detail` (that is, `join(",")`).
Then, whether the former occurs at the beginning of the latter is judged with String's `starts_with` method.
However, multiple part-of-speech specifications can be given, and matching any one of them should suffice. That is this part:
```rust
for token in tokens {
    let detail = token.detail;
    let pos: String = detail.join(",");
    if extractor.allowed_poss.iter().any(|s| pos.starts_with(s)) {
        result.push(RString::new_utf8(&token.text));
    }
}
```
`any` works just like Ruby's `Enumerable#any?`.
Note that `RString::new_utf8` creates a Ruby String from a Rust string.
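The same judgment can be sketched in Ruby with `start_with?` and `any?` (the data here is made up; note how the `["UNK"]` case falls through harmlessly):

```ruby
allowed_poss = ["名詞,一般", "名詞,固有名詞"]

details = [
  ["名詞", "固有名詞", "人名", "名", "*", "*", "花子", "ハナコ", "ハナコ"],
  ["助詞", "格助詞", "一般", "*", "*", "*", "を", "ヲ", "ヲ"],
  ["UNK"]  # unknown word: a length-1 detail, matched by nothing
]

# Keep only the details whose comma-joined form begins with
# one of the allowed part-of-speech prefixes.
matches = details.select do |detail|
  pos = detail.join(",")
  allowed_poss.any? { |s| pos.start_with?(s) }
end
```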
As usual, build with:

```shell
cargo build --release
```

The artifact will be at the path target/release/libphoneme_extractor.dylib (the extension depends on the target).
Here is the only Ruby script. As usual, it assumes it is placed in the root directory of the Rust project, and refers to the Rust library by that relative path.
```ruby
# encoding: utf-8

require "rutie"

Rutie.new(:phoneme_extractor, lib_path: "target/release").init "Init_phoneme_extractor", __dir__

pe = PhonemeExtractor.new <<JSON
{
  "mode": "normal",
  "allowed_poss": [
    "名詞,一般",
    "名詞,固有名詞",
    "名詞,副詞可能",
    "名詞,サ変接続",
    "名詞,形容動詞語幹",
    "名詞,ナイ形容詞語幹"
  ]
}
JSON

text = <<EOT
「道程」 高村光太郎

僕の前に道はない
僕の後ろに道は出来る
ああ、自然よ
父よ
僕を一人立ちにさせた広大な父よ
僕から目を離さないで守る事をせよ
常に父の気魄を僕に充たせよ
この遠い道程のため
この遠い道程のため
EOT

pe.extract(text).tally
  .sort_by { |word, freq| -freq }
  .each { |word, freq| puts "%4d %s" % [freq, word] }
```
Result:

```
   3 父
   3 道程
   2 道
   1 前
   1 後ろ
   1 自然
   1 一人立ち
   1 広大
   1 目
   1 気魄
   1 高村
   1 光太郎
```
Phew, I'm tired.
The more explanation I try to add, the longer it gets, and if I keep polishing I'll never finish writing. Sorry if the quality of the article suffers as a result. Questions are welcome, so please ask anything; I'll answer what I can.