RcppKagome is an R package for morphological analysis. It wraps ikawaha/kagome, a morphological analyzer written in pure Go that bundles its own dictionary.
The package has to be built from source, which requires make, GCC, and Go.
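Before building, it can help to confirm that the three tools are visible from R. The snippet below is a minimal sketch using only base R; the executable names `make`, `gcc`, and `go` are assumptions about how the tools are installed on your system and may differ (for example under Rtools on Windows).

```r
# Look up the build tools on the PATH.
# Sys.which() returns an empty string for any executable it cannot find.
tools <- Sys.which(c("make", "gcc", "go"))
missing_tools <- names(tools)[!nzchar(tools)]

if (length(missing_tools) > 0) {
  message("Not found on PATH: ", paste(missing_tools, collapse = ", "))
} else {
  message("make, gcc and go are all available.")
}
```

If anything is reported as missing, install it or adjust the PATH before attempting the build.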
kagome() accepts a character vector. The return value is a list with one element per input string, and each element is itself a list of tokens.
res <- RcppKagome::kagome("There is a chicken in the haniwa")
str(res)
#> List of 1
#> $ :List of 6
#> ..$ 0:List of 5
#> .. ..$ Id : int 53040
#> .. ..$ Start : int 0
#> .. ..$ End : int 1
#> .. ..$ Surface: chr "To"
#> .. ..$ Feature: chr [1:9] "Particle" "格Particle" "General" "*" ...
#> ..$ 1:List of 5
#> .. ..$ Id : int 80172
#> .. ..$ Start : int 1
#> .. ..$ End : int 3
#> .. ..$ Surface: chr "Crocodile"
#> .. ..$ Feature: chr [1:9] "noun" "General" "*" "*" ...
#> ..$ 2:List of 5
#> .. ..$ Id : int 58916
#> .. ..$ Start : int 3
#> .. ..$ End : int 6
#> .. ..$ Surface: chr "Haniwa"
#> .. ..$ Feature: chr [1:9] "noun" "General" "*" "*" ...
#> ..$ 3:List of 5
#> .. ..$ Id : int 53999
#> .. ..$ Start : int 6
#> .. ..$ End : int 10
#> .. ..$ Surface: chr "Chicken"
#> .. ..$ Feature: chr [1:9] "noun" "General" "*" "*" ...
#> ..$ 4:List of 5
#> .. ..$ Id : int 19676
#> .. ..$ Start : int 10
#> .. ..$ End : int 11
#> .. ..$ Surface: chr "But"
#> .. ..$ Feature: chr [1:9] "Particle" "格Particle" "General" "*" ...
#> ..$ 5:List of 5
#> .. ..$ Id : int 6652
#> .. ..$ Start : int 11
#> .. ..$ End : int 13
#> .. ..$ Surface: chr "Is"
#> .. ..$ Feature: chr [1:9] "verb" "Independence" "*" "*" ...
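Because the result is a plain nested list, individual fields can be pulled out with base R. The following is a minimal sketch that does not call the package: `tokens` is a hand-built stand-in for one element of the returned list, with field names taken from the `str()` output above and purely illustrative values.

```r
# Hand-built stand-in for one element of the list returned by kagome().
# Field names follow the str() output above; the values are illustrative only.
tokens <- list(
  `0` = list(Id = 1L, Start = 0L, End = 1L, Surface = "a",
             Feature = c("Particle", "*")),
  `1` = list(Id = 2L, Start = 1L, End = 3L, Surface = "bc",
             Feature = c("noun", "General"))
)

# Each token contributes exactly one value, so vapply() with a typed
# template checks the shape of every result.
surfaces <- vapply(tokens, function(tok) tok$Surface, character(1))
pos      <- vapply(tokens, function(tok) tok$Feature[1], character(1))

# Token width in characters, derived from the Start/End offsets.
widths <- vapply(tokens, function(tok) tok$End - tok$Start, integer(1))
```

With a real result, the same calls would be applied to `res[[1]]`, assuming its tokens carry the fields shown above.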
prettify() converts the result into a data frame.
res <- RcppKagome::kagome(c("There is a ring and Lee in the haniwa in the garden", "There are two chickens in the yard"))
res <- RcppKagome::prettify(res)
print(res)
#> Sid Surface POS1 POS2 POS3 POS4 X5StageUse1 X5StageUse2 Original Yomi1
#>1 1 Garden nouns in general<NA> <NA> <NA> <NA>Garden Niwa
#>2 1 particle case particle general<NA> <NA> <NA>Ni ni
#>3 1 Haniwa noun general<NA> <NA> <NA> <NA>Haniwa Haniwa
#>4 1 particle case particle general<NA> <NA> <NA>Ni ni
#>5 1 wheel noun general<NA> <NA> <NA> <NA>Wawa
#>6 1 and particles Parallel particles<NA> <NA> <NA> <NA>And
#>7 1 Lee noun proper noun person name surname<NA> <NA>Lee Li
#>8 1 is a particle Case particle general<NA> <NA> <NA>Ga ga
#>9 1 verb independence<NA> <NA>Uninflected word
#>10 2 Garden nouns in general<NA> <NA> <NA> <NA>Garden Niwa
#>11 2 Particles Case particles in general<NA> <NA> <NA>Ni ni
#>12 2 is a particle particle<NA> <NA> <NA> <NA>Ha ha
#>13 2 Two nouns<NA> <NA> <NA> <NA>Nini
#>14 2 feather noun suffix classifier<NA> <NA> <NA>Wow
#>15 2 Chicken nouns in general<NA> <NA> <NA> <NA>Chicken chicken
#>16 2 is a particle Case particle general<NA> <NA> <NA>Ga ga
#>17 2 verb independence<NA> <NA>Uninflected word
#> Yomi2
#>1 Niwa
#>2 D
#>3 honey
#>4 D
#>5 wa
#>8 moth
#>9 Il
#>10 Niwa
#>11 D
#>12 wa
#>13 D
#>14 wa
#>15 chicken
#>16 moth
#>17 Il
The formatted data frame consists of the following columns:
- Sid: sentence index
- Surface: surface form
- POS1 ~ POS4: part of speech, followed by part-of-speech subcategories 1, 2, and 3
- X5StageUse1: conjugation type (e.g., godan, shimo-nidan ...)
- X5StageUse2: conjugated form (e.g., continuative form, terminal form ...)
- Original: lemmatised form
- Yomi1: reading
- Yomi2: pronunciation
Alternatively, only the Surface column can be returned, with the tokens of each sentence joined by a half-width (ASCII) space.
#> Sid Text
#>1 1 There is a ring and Lee in the haniwa in the garden
#>2 2 There are two chickens in the yard
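The same Sid/Text shape can be produced from any token-level data frame with base R. This is a sketch of the operation only, not the package's own implementation: `tokens_df` is a hypothetical data frame that reuses the `Sid` and `Surface` column names from the prettified output, with made-up values.

```r
# Hypothetical token-level data frame with the Sid and Surface columns
# described above; the values are made up for illustration.
tokens_df <- data.frame(
  Sid = c(1L, 1L, 1L, 2L, 2L),
  Surface = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# Join the surface forms of each sentence with a single space.
# Extra arguments to aggregate() (here `collapse`) are passed on to paste().
joined <- aggregate(Surface ~ Sid, data = tokens_df, FUN = paste, collapse = " ")
names(joined)[2] <- "Text"

print(joined)
```

Space-separated text of this kind is the usual input for tokenizers that split on whitespace.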
The following article shows the package in actual use. There, text segmented this way is loaded as a quanteda corpus.
- Text analysis by R (quanteda) - Qiita
It should not be noticeably slower than RMeCab.
str <- "When the capicapi sound gets louder, it's a signal to really give it out! Please communicate firmly here"
tm <- microbenchmark::microbenchmark(
  RMeCab = RMeCab::RMeCabC(str),
  RcppKagome = RcppKagome::kagome(str),
  times = 500L
)
#> expr min lq mean median uq max neval
#> 1 RMeCab 1.9835 2.50085 3.339689 2.77485 3.1911 83.4071 500
#> 2 RcppKagome 2.1079 2.53700 3.246598 2.76890 3.2572 18.3404 500
The following article (in Japanese) provides background.
- Use Go function from R → Fast - ★ Data analysis memorandum ★
That article is based on a blog post originally written by Romain Francois, who appears to be involved in the development of Rcpp and rJava.
Go has a tool called cgo, which can be used to build code written in Go into a library that is callable from C. By calling such a C library from an R package, Go code can, more or less, be used from R.
Normally, calling a function written in C or a similar language directly from an R package requires registering the function so that it can be invoked with .Call. To save this effort, RcppKagome wraps the cgo-generated library in C++ and exports the wrapper through Rcpp.
For more convenient use, it would be desirable to define type mappings with Go, but RcppKagome does not go that far and only passes strings. For reference, here are some other examples of defining type mappings with Go.
#> - Session info ------------------------------------------------------------------------
#> setting value
#> version R version 4.0.3 (2020-10-10)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RStudio
#> language (EN)
#> collate Japanese_Japan.932
#> ctype Japanese_Japan.932
#> tz Asia/Tokyo
#> date 2021-01-20
#> - Packages ----------------------------------------------------------------------------
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2)
#> backports 1.2.1 2020-12-09 [1] CRAN (R 4.0.3)
#> cli 2.2.0 2020-11-20 [1] CRAN (R 4.0.3)
#> codetools 0.2-16 2018-12-24 [2] CRAN (R 4.0.3)
#> colorspace 2.0-0 2020-11-11 [1] CRAN (R 4.0.3)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.2)
#> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.3)
#> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.3)
#> dplyr 1.0.3 2021-01-15 [1] CRAN (R 4.0.3)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.2)
#> fansi 0.4.2 2021-01-15 [1] CRAN (R 4.0.3)
#> farver 2.0.3 2020-01-16 [1] CRAN (R 4.0.2)
#> furrr 0.2.1 2020-10-21 [1] CRAN (R 4.0.2)
#> future 1.21.0 2020-12-10 [1] CRAN (R 4.0.3)
#> generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.3)
#> ggplot2 3.3.3 2020-12-30 [1] CRAN (R 4.0.3)
#> globals 0.14.0 2020-11-22 [1] CRAN (R 4.0.3)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
#> gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.2)
#> htmltools 0.5.1 2021-01-12 [1] CRAN (R 4.0.3)
#> jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.0.3)
#> knitr 1.30 2020-09-22 [1] CRAN (R 4.0.2)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.2)
#> listenv 0.8.0 2019-12-05 [1] CRAN (R 4.0.2)
#> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.3)
#> microbenchmark 1.4-7 2019-09-24 [1] CRAN (R 4.0.2)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.2)
#> parallelly 1.23.0 2021-01-04 [1] CRAN (R 4.0.3)
#> pillar 1.4.7 2020-11-20 [1] CRAN (R 4.0.3)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2)
#> purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.0.2)
#> R.cache 0.14.0 2019-12-06 [1] CRAN (R 4.0.2)
#> R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.0.2)
#> R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.0.2)
#> R.utils 2.10.1 2020-08-26 [1] CRAN (R 4.0.2)
#> R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.3)
#> Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.3)
#> RcppKagome * 2021-01-20 [1] local
#> rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.3)
#> rmarkdown 2.6 2020-12-14 [1] CRAN (R 4.0.3)
#> RMeCab * 1.05 2020-04-28 [1] local
#> scales 1.1.1 2020-05-11 [1] CRAN (R 4.0.2)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2)
#> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
#> styler 1.3.2 2020-02-23 [1] CRAN (R 4.0.2)
#> tibble 3.0.5 2021-01-15 [1] CRAN (R 4.0.3)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.2)
#> vctrs 0.3.6 2020-12-17 [1] CRAN (R 4.0.3)
#> withr 2.4.0 2021-01-16 [1] CRAN (R 4.0.3)
#> xfun 0.20 2021-01-06 [1] CRAN (R 4.0.3)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.0)
#> [1] C:/Users/user/Documents/R/win-library/4.0
#> [2] C:/Program Files/R/R-4.0.3/library