RcppKagome: an R package that calls a Go library via Rcpp for morphological analysis

paithiov909/RcppKagome - GitHub

What is this?

This is an R package for morphological analysis of Japanese text. It wraps ikawaha/kagome, a morphological analyzer written in pure Go that comes with a built-in dictionary.

How to use

Installation

The package is built from source, so you need make, GCC, and Go installed.

remotes::install_github("paithiov909/RcppKagome")

Morphological analysis

kagome() accepts a character vector and returns a list with one element per input string. Each element is itself a list of tokens.

res <- RcppKagome::kagome("にわにはにわにわとりがいる")
str(res)
#> List of 1
#>  $ :List of 6
#>   ..$ 0:List of 5
#>   .. ..$ Id     : int 53040
#>   .. ..$ Start  : int 0
#>   .. ..$ End    : int 1
#>   .. ..$ Surface: chr "に"
#>   .. ..$ Feature: chr [1:9] "助詞" "格助詞" "一般" "*" ...
#>   ..$ 1:List of 5
#>   .. ..$ Id     : int 80172
#>   .. ..$ Start  : int 1
#>   .. ..$ End    : int 3
#>   .. ..$ Surface: chr "わに"
#>   .. ..$ Feature: chr [1:9] "名詞" "一般" "*" "*" ...
#>   ..$ 2:List of 5
#>   .. ..$ Id     : int 58916
#>   .. ..$ Start  : int 3
#>   .. ..$ End    : int 6
#>   .. ..$ Surface: chr "はにわ"
#>   .. ..$ Feature: chr [1:9] "名詞" "一般" "*" "*" ...
#>   ..$ 3:List of 5
#>   .. ..$ Id     : int 53999
#>   .. ..$ Start  : int 6
#>   .. ..$ End    : int 10
#>   .. ..$ Surface: chr "にわとり"
#>   .. ..$ Feature: chr [1:9] "名詞" "一般" "*" "*" ...
#>   ..$ 4:List of 5
#>   .. ..$ Id     : int 19676
#>   .. ..$ Start  : int 10
#>   .. ..$ End    : int 11
#>   .. ..$ Surface: chr "が"
#>   .. ..$ Feature: chr [1:9] "助詞" "格助詞" "一般" "*" ...
#>   ..$ 5:List of 5
#>   .. ..$ Id     : int 6652
#>   .. ..$ Start  : int 11
#>   .. ..$ End    : int 13
#>   .. ..$ Surface: chr "いる"
#>   .. ..$ Feature: chr [1:9] "動詞" "自立" "*" "*" ...
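Because the result is a nested list (one element per input string, then one element per token), individual fields can be pulled out with base R. The following is only a minimal sketch: `res` here is a hand-built stand-in that mirrors the structure printed above with two tokens, so it runs without RcppKagome installed, and the variable names `surfaces` and `pos1` are made up for illustration.

```r
# Hand-built stand-in for the structure returned by kagome()
# (an assumption for illustration; a real result has more tokens and fields).
res <- list(
  list(
    `0` = list(Id = 53040L, Start = 0L, End = 1L, Surface = "に",
               Feature = c("助詞", "格助詞", "一般", "*")),
    `1` = list(Id = 80172L, Start = 1L, End = 3L, Surface = "わに",
               Feature = c("名詞", "一般", "*", "*"))
  )
)

# One character vector of surface forms per input string.
surfaces <- lapply(res, function(tokens) {
  vapply(tokens, function(tok) tok$Surface, character(1), USE.NAMES = FALSE)
})

# The first element of Feature is the top-level part of speech.
pos1 <- lapply(res, function(tokens) {
  vapply(tokens, function(tok) tok$Feature[1], character(1), USE.NAMES = FALSE)
})

print(surfaces[[1]])
print(pos1[[1]])
```

Using `vapply` with `character(1)` makes the sketch fail loudly if a token is missing its `Surface` or `Feature` field, rather than silently returning a list.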

You can format the result into a data frame.

res <- RcppKagome::kagome(c("庭に埴輪に輪と李がいる", "庭には二羽鶏がいる"))
res <- RcppKagome::prettify(res)
print(res)
#>    Sid Surface POS1     POS2   POS3 POS4 X5StageUse1 X5StageUse2 Original    Yomi1
#> 1    1      庭 名詞     一般   <NA> <NA>        <NA>        <NA>       庭     ニワ
#> 2    1      に 助詞   格助詞   一般 <NA>        <NA>        <NA>       に       ニ
#> 3    1    埴輪 名詞     一般   <NA> <NA>        <NA>        <NA>     埴輪   ハニワ
#> 4    1      に 助詞   格助詞   一般 <NA>        <NA>        <NA>       に       ニ
#> 5    1      輪 名詞     一般   <NA> <NA>        <NA>        <NA>       輪       ワ
#> 6    1      と 助詞 並立助詞   <NA> <NA>        <NA>        <NA>       と       ト
#> 7    1      李 名詞 固有名詞   人名   姓        <NA>        <NA>       李       リ
#> 8    1      が 助詞   格助詞   一般 <NA>        <NA>        <NA>       が       ガ
#> 9    1    いる 動詞     自立   <NA> <NA>        一段      基本形     いる     イル
#> 10   2      庭 名詞     一般   <NA> <NA>        <NA>        <NA>       庭     ニワ
#> 11   2      に 助詞   格助詞   一般 <NA>        <NA>        <NA>       に       ニ
#> 12   2      は 助詞   係助詞   <NA> <NA>        <NA>        <NA>       は       ハ
#> 13   2      二 名詞       数   <NA> <NA>        <NA>        <NA>       二       ニ
#> 14   2      羽 名詞     接尾 助数詞 <NA>        <NA>        <NA>       羽       ワ
#> 15   2      鶏 名詞     一般   <NA> <NA>        <NA>        <NA>       鶏 ニワトリ
#> 16   2      が 助詞   格助詞   一般 <NA>        <NA>        <NA>       が       ガ
#> 17   2    いる 動詞     自立   <NA> <NA>        一段      基本形     いる     イル
#>       Yomi2
#> 1      ニワ
#> 2        ニ
#> 3    ハニワ
#> 4        ニ
#> 5        ワ
#> 6        ト
#> 7        リ
#> 8        ガ
#> 9      イル
#> 10     ニワ
#> 11       ニ
#> 12       ワ
#> 13       ニ
#> 14       ワ
#> 15 ニワトリ
#> 16       ガ
#> 17     イル

The formatted data frame consists of the following columns:

- Sid: sentence index
- Surface: surface form
- POS1 ~ POS4: part of speech, followed by part-of-speech subcategories 1, 2, and 3
- X5StageUse1: conjugation type (e.g. godan, shimo-nidan ...)
- X5StageUse2: conjugated form (e.g. continuative form, terminal form ...)
- Original: base form (lemma)
- Yomi1: reading
- Yomi2: pronunciation

pack() collapses only the Surface column, joining the tokens of each sentence with a half-width (ASCII) space, and returns one row per sentence.

RcppKagome::pack(res)
#>   Sid                           Text
#> 1   1 庭 に 埴輪 に 輪 と 李 が いる
#> 2   2      庭 に は 二 羽 鶏 が いる
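The collapsing step can be approximated in base R. This is a rough sketch of the idea, not the package's actual implementation: the `tokens` data frame is a small hand-made example with only the `Sid` and `Surface` columns, and `collapse_surface()` is a hypothetical helper name.

```r
# Small hand-made token table (an assumption for illustration).
tokens <- data.frame(
  Sid = c(1L, 1L, 1L, 2L, 2L),
  Surface = c("庭", "に", "いる", "鶏", "が"),
  stringsAsFactors = FALSE
)

# Join the surface forms of each sentence with a single ASCII space.
collapse_surface <- function(df) {
  text <- tapply(df$Surface, df$Sid, paste, collapse = " ")
  data.frame(
    Sid = as.integer(names(text)),
    Text = as.vector(text),
    stringsAsFactors = FALSE
  )
}

packed <- collapse_surface(tokens)
print(packed)
```

`tapply` groups by `Sid` and keeps the tokens in their original order within each group, which is what a space-separated text needs.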

I use this in practice in the following article, where the tokenized (space-separated) documents are loaded as a quanteda corpus.

- Text analysis with R (quanteda) - Qiita

Benchmark

It should not be noticeably slower than RMeCab. In the run below, the two have nearly identical median timings.

str <- "When the capicapi sound gets louder, it's a signal to really give it out! Please communicate firmly here"
tm <- microbenchmark::microbenchmark(
  RMeCab = RMeCab::RMeCabC(str),
  RcppKagome = RcppKagome::kagome(str),
  times = 500L
)
summary(tm)
#>         expr    min      lq     mean  median     uq     max neval
#> 1     RMeCab 1.9835 2.50085 3.339689 2.77485 3.1911 83.4071   500
#> 2 RcppKagome 2.1079 2.53700 3.246598 2.76890 3.2572 18.3404   500
ggplot2::autoplot(tm)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.

[Figure: benchmark_2-1.png, autoplot of the benchmark timings for RMeCab and RcppKagome]

About interoperation between R and Go

The following article covers this topic in Japanese.

- Use Go function from R → Fast - ★ Data analysis memorandum ★

That article is based on a blog post originally written by Romain Francois (apparently a prominent developer involved in Rcpp and rJava).

Go has a tool called cgo, which can build code written in Go into a library that is callable from C. By calling the C library generated this way from an R package, you can, at least in principle, use Go's assets from R.

Normally, calling a function written in C or a similar language directly from an R package requires registering the function so that it can be invoked with .Call. To avoid that work, RcppKagome wraps the cgo-generated library in C++ and exports the wrapper through Rcpp.

To make this more convenient, it would be desirable to define type mappings between R and Go, but RcppKagome does not go that far and only passes strings. For reference, here are some other examples of defining type mappings with Go.

Session information

sessioninfo::session_info()
#> - Session info ------------------------------------------------------------------------
#>  setting  value                       
#>  version  R version 4.0.3 (2020-10-10)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RStudio                     
#>  language (EN)                        
#>  collate  Japanese_Japan.932          
#>  ctype    Japanese_Japan.932          
#>  tz       Asia/Tokyo                  
#>  date     2021-01-20                  
#> 
#> - Packages ----------------------------------------------------------------------------
#>  package        * version   date       lib source        
#>  assertthat       0.2.1     2019-03-21 [1] CRAN (R 4.0.2)
#>  backports        1.2.1     2020-12-09 [1] CRAN (R 4.0.3)
#>  cli              2.2.0     2020-11-20 [1] CRAN (R 4.0.3)
#>  codetools        0.2-16    2018-12-24 [2] CRAN (R 4.0.3)
#>  colorspace       2.0-0     2020-11-11 [1] CRAN (R 4.0.3)
#>  crayon           1.3.4     2017-09-16 [1] CRAN (R 4.0.2)
#>  DBI              1.1.1     2021-01-15 [1] CRAN (R 4.0.3)
#>  digest           0.6.27    2020-10-24 [1] CRAN (R 4.0.3)
#>  dplyr            1.0.3     2021-01-15 [1] CRAN (R 4.0.3)
#>  ellipsis         0.3.1     2020-05-15 [1] CRAN (R 4.0.2)
#>  evaluate         0.14      2019-05-28 [1] CRAN (R 4.0.2)
#>  fansi            0.4.2     2021-01-15 [1] CRAN (R 4.0.3)
#>  farver           2.0.3     2020-01-16 [1] CRAN (R 4.0.2)
#>  furrr            0.2.1     2020-10-21 [1] CRAN (R 4.0.2)
#>  future           1.21.0    2020-12-10 [1] CRAN (R 4.0.3)
#>  generics         0.1.0     2020-10-31 [1] CRAN (R 4.0.3)
#>  ggplot2          3.3.3     2020-12-30 [1] CRAN (R 4.0.3)
#>  globals          0.14.0    2020-11-22 [1] CRAN (R 4.0.3)
#>  glue             1.4.2     2020-08-27 [1] CRAN (R 4.0.2)
#>  gtable           0.3.0     2019-03-25 [1] CRAN (R 4.0.2)
#>  htmltools        0.5.1     2021-01-12 [1] CRAN (R 4.0.3)
#>  jsonlite         1.7.2     2020-12-09 [1] CRAN (R 4.0.3)
#>  knitr            1.30      2020-09-22 [1] CRAN (R 4.0.2)
#>  lifecycle        0.2.0     2020-03-06 [1] CRAN (R 4.0.2)
#>  listenv          0.8.0     2019-12-05 [1] CRAN (R 4.0.2)
#>  magrittr         2.0.1     2020-11-17 [1] CRAN (R 4.0.3)
#>  microbenchmark   1.4-7     2019-09-24 [1] CRAN (R 4.0.2)
#>  munsell          0.5.0     2018-06-12 [1] CRAN (R 4.0.2)
#>  parallelly       1.23.0    2021-01-04 [1] CRAN (R 4.0.3)
#>  pillar           1.4.7     2020-11-20 [1] CRAN (R 4.0.3)
#>  pkgconfig        2.0.3     2019-09-22 [1] CRAN (R 4.0.2)
#>  purrr          * 0.3.4     2020-04-17 [1] CRAN (R 4.0.2)
#>  R.cache          0.14.0    2019-12-06 [1] CRAN (R 4.0.2)
#>  R.methodsS3      1.8.1     2020-08-26 [1] CRAN (R 4.0.2)
#>  R.oo             1.24.0    2020-08-26 [1] CRAN (R 4.0.2)
#>  R.utils          2.10.1    2020-08-26 [1] CRAN (R 4.0.2)
#>  R6               2.5.0     2020-10-28 [1] CRAN (R 4.0.3)
#>  Rcpp             1.0.6     2021-01-15 [1] CRAN (R 4.0.3)
#>  RcppKagome     * 0.0.0.500 2021-01-20 [1] local         
#>  rlang            0.4.10    2020-12-30 [1] CRAN (R 4.0.3)
#>  rmarkdown        2.6       2020-12-14 [1] CRAN (R 4.0.3)
#>  RMeCab         * 1.05      2020-04-28 [1] local         
#>  scales           1.1.1     2020-05-11 [1] CRAN (R 4.0.2)
#>  sessioninfo      1.1.1     2018-11-05 [1] CRAN (R 4.0.2)
#>  stringi          1.5.3     2020-09-09 [1] CRAN (R 4.0.2)
#>  stringr          1.4.0     2019-02-10 [1] CRAN (R 4.0.2)
#>  styler           1.3.2     2020-02-23 [1] CRAN (R 4.0.2)
#>  tibble           3.0.5     2021-01-15 [1] CRAN (R 4.0.3)
#>  tidyselect       1.1.0     2020-05-11 [1] CRAN (R 4.0.2)
#>  vctrs            0.3.6     2020-12-17 [1] CRAN (R 4.0.3)
#>  withr            2.4.0     2021-01-16 [1] CRAN (R 4.0.3)
#>  xfun             0.20      2021-01-06 [1] CRAN (R 4.0.3)
#>  yaml             2.2.1     2020-02-01 [1] CRAN (R 4.0.0)
#> 
#> [1] C:/Users/user/Documents/R/win-library/4.0
#> [2] C:/Program Files/R/R-4.0.3/library
