RcppKagome is an R package for Japanese morphological analysis. It wraps ikawaha/kagome, a morphological analyzer written in pure Go that ships with its own dictionary.
Install by building from source. You need make, GCC, and Go.
remotes::install_github("paithiov909/RcppKagome")
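Because the build needs make, GCC, and Go, it can help to confirm that they are on the PATH before installing. The check below is only a sketch using base R's `Sys.which()`; the executable names (`make`, `gcc`, `go`) are assumptions and may differ on your system, for example under Rtools on Windows.

```r
# Look up each build tool on the PATH.
# Sys.which() returns "" for a tool it cannot find.
tools <- c("make", "gcc", "go")  # assumed executable names
found <- nzchar(Sys.which(tools))
names(found) <- tools
print(found)

# Report anything that is missing before attempting the install.
missing_tools <- names(found)[!found]
if (length(missing_tools) > 0) {
  message("Not found on PATH: ", paste(missing_tools, collapse = ", "))
}
```

If a tool is reported as missing, install it or adjust the PATH before running `remotes::install_github()`.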
kagome() accepts a character vector and returns a list with one element per input string.
res <- RcppKagome::kagome("There is a chicken in the haniwa")
str(res)
#> List of 1
#> $ :List of 6
#> ..$ 0:List of 5
#> .. ..$ Id : int 53040
#> .. ..$ Start : int 0
#> .. ..$ End : int 1
#> .. ..$ Surface: chr "To"
#> .. ..$ Feature: chr [1:9] "Particle" "Case particle" "General" "*" ...
#> ..$ 1:List of 5
#> .. ..$ Id : int 80172
#> .. ..$ Start : int 1
#> .. ..$ End : int 3
#> .. ..$ Surface: chr "Crocodile"
#> .. ..$ Feature: chr [1:9] "noun" "General" "*" "*" ...
#> ..$ 2:List of 5
#> .. ..$ Id : int 58916
#> .. ..$ Start : int 3
#> .. ..$ End : int 6
#> .. ..$ Surface: chr "Haniwa"
#> .. ..$ Feature: chr [1:9] "noun" "General" "*" "*" ...
#> ..$ 3:List of 5
#> .. ..$ Id : int 53999
#> .. ..$ Start : int 6
#> .. ..$ End : int 10
#> .. ..$ Surface: chr "Chicken"
#> .. ..$ Feature: chr [1:9] "noun" "General" "*" "*" ...
#> ..$ 4:List of 5
#> .. ..$ Id : int 19676
#> .. ..$ Start : int 10
#> .. ..$ End : int 11
#> .. ..$ Surface: chr "But"
#> .. ..$ Feature: chr [1:9] "Particle" "Case particle" "General" "*" ...
#> ..$ 5:List of 5
#> .. ..$ Id : int 6652
#> .. ..$ Start : int 11
#> .. ..$ End : int 13
#> .. ..$ Surface: chr "Is"
#> .. ..$ Feature: chr [1:9] "verb" "Independence" "*" "*" ...
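The nested list can be traversed with base R alone. The sketch below does not call RcppKagome; it uses a hand-built stand-in (`tokens`, a hypothetical name) for one element of the returned list, with field names taken from the `str()` output above and the `Feature` vectors shortened to four entries.

```r
# Hand-built stand-in for one sentence of the result (two tokens only).
# Field names follow the str() output; Feature is shortened for brevity.
tokens <- list(
  "1" = list(Id = 80172L, Start = 1L, End = 3L, Surface = "Crocodile",
             Feature = c("noun", "General", "*", "*")),
  "5" = list(Id = 6652L, Start = 11L, End = 13L, Surface = "Is",
             Feature = c("verb", "Independence", "*", "*"))
)

# Pull one field out of every token; vapply() checks the type of each result.
surfaces <- vapply(tokens, function(tok) tok$Surface, character(1), USE.NAMES = FALSE)
pos1 <- vapply(tokens, function(tok) tok$Feature[1], character(1), USE.NAMES = FALSE)
widths <- vapply(tokens, function(tok) tok$End - tok$Start, integer(1), USE.NAMES = FALSE)

# Assemble the extracted fields into a small data frame.
data.frame(Surface = surfaces, POS1 = pos1, Width = widths)
```

For real results, the same `vapply()` calls would be applied to each element of the list returned by `kagome()`.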
prettify() formats the result into a data frame.
res <- RcppKagome::kagome(c("There is a ring and Lee in the haniwa in the garden", "There are two chickens in the yard"))
res <- RcppKagome::prettify(res)
print(res)
#> Sid Surface POS1 POS2 POS3 POS4 X5StageUse1 X5StageUse2 Original Yomi1
#>1 1 Garden nouns in general<NA> <NA> <NA> <NA>Garden Niwa
#>2 1 particle case particle general<NA> <NA> <NA>Ni ni
#>3 1 Haniwa noun general<NA> <NA> <NA> <NA>Haniwa Haniwa
#>4 1 particle case particle general<NA> <NA> <NA>Ni ni
#>5 1 wheel noun general<NA> <NA> <NA> <NA>Wawa
#>6 1 and particles Parallel particles<NA> <NA> <NA> <NA>And
#>7 1 Lee noun proper noun person name surname<NA> <NA>Lee Li
#>8 1 is a particle Case particle general<NA> <NA> <NA>Ga ga
#>9 1 verb independence<NA> <NA>Uninflected word
#>10 2 Garden nouns in general<NA> <NA> <NA> <NA>Garden Niwa
#>11 2 Particles Case particles in general<NA> <NA> <NA>Ni ni
#>12 2 is a particle particle<NA> <NA> <NA> <NA>Ha ha
#>13 2 Two nouns<NA> <NA> <NA> <NA>Nini
#>14 2 feather noun suffix classifier<NA> <NA> <NA>Wow
#>15 2 Chicken nouns in general<NA> <NA> <NA> <NA>Chicken chicken
#>16 2 is a particle Case particle general<NA> <NA> <NA>Ga ga
#>17 2 verb independence<NA> <NA>Uninflected word
#> Yomi2
#>1 Niwa
#>2 D
#>3 honey
#>4 D
#>5 wa
#>6
#>7
#>8 moth
#>9 Il
#>10 Niwa
#>11 D
#>12 wa
#>13 D
#>14 wa
#>15 chicken
#>16 moth
#>17 Il
The formatted data frame consists of the following columns:
- Sid: sentence index
- Surface: surface form
- POS1 ~ POS4: part of speech, part-of-speech subclassification 1, subclassification 2, subclassification 3
- X5StageUse1: conjugation type (e.g. godan, shimo-nidan, ...)
- X5StageUse2: conjugated form (e.g. continuative form, terminal form, ...)
- Original: lemmatised form
- Yomi1: reading
- Yomi2: pronunciation
Of these, pack() collapses only the Surface column, joining the tokens of each sentence with a half-width space.
RcppKagome::pack(res)
#> Sid Text
#>1 1 There is a ring and Lee in the haniwa in the garden
#>2 2 There are two chickens in the yard
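The idea behind this step can be reproduced in base R. The following is a conceptual sketch under stated assumptions, not the implementation of `pack()`: the data frame `df` is a mock reduced to the `Sid` and `Surface` columns, and its values are placeholders rather than analyzer output.

```r
# Mock of the prettified data frame, reduced to the two columns used here.
# The values are illustrative placeholders.
df <- data.frame(
  Sid = c(1L, 1L, 1L, 2L, 2L),
  Surface = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# Join the surface forms within each sentence with a single half-width space.
packed <- aggregate(Surface ~ Sid, data = df, FUN = paste, collapse = " ")
names(packed)[2] <- "Text"
print(packed)
```

The result has one row per sentence, which is a convenient shape for building a corpus.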
I use the package in the following article, where the space-separated (tokenized) documents are loaded as a quanteda corpus.
- Text analysis with R (quanteda) - Qiita
It should not be noticeably slower than RMeCab.
str <- "When the capicapi sound gets louder, it's a signal to really give it out! Please communicate firmly here"
tm <- microbenchmark::microbenchmark(
RMeCab = RMeCab::RMeCabC(str),
RcppKagome = RcppKagome::kagome(str),
times = 500L
)
summary(tm)
#> expr min lq mean median uq max neval
#> 1 RMeCab 1.9835 2.50085 3.339689 2.77485 3.1911 83.4071 500
#> 2 RcppKagome 2.1079 2.53700 3.246598 2.76890 3.2572 18.3404 500
ggplot2::autoplot(tm)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.
The following article is available as a Japanese-language reference.
- Use Go functions from R → Fast - ★ Data analysis memorandum ★
That article is based on a blog post originally written by Romain Francois (who appears to be a prominent developer involved in Rcpp and rJava).
Go has a tool called cgo, which can be used to build code written in Go into a library that is callable from C. By calling that C library from an R package, you can, at least in principle, use Go code from R.
Normally, to call a function written in C or a similar language directly from an R package, the function must first be registered so that it can be called with .Call. To save this effort, RcppKagome wraps the cgo-generated library in C++ and exports the wrapper through Rcpp.
For more convenient use, it would be desirable to define type mappings between R and Go, but RcppKagome does not go that far and only passes strings. For reference, here are some other examples of defining type mappings with Go.
sessioninfo::session_info()
#> - Session info ------------------------------------------------------------------------
#> setting value
#> version R version 4.0.3 (2020-10-10)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RStudio
#> language (EN)
#> collate Japanese_Japan.932
#> ctype Japanese_Japan.932
#> tz Asia/Tokyo
#> date 2021-01-20
#>
#> - Packages ----------------------------------------------------------------------------
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2)
#> backports 1.2.1 2020-12-09 [1] CRAN (R 4.0.3)
#> cli 2.2.0 2020-11-20 [1] CRAN (R 4.0.3)
#> codetools 0.2-16 2018-12-24 [2] CRAN (R 4.0.3)
#> colorspace 2.0-0 2020-11-11 [1] CRAN (R 4.0.3)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.2)
#> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.3)
#> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.3)
#> dplyr 1.0.3 2021-01-15 [1] CRAN (R 4.0.3)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.2)
#> fansi 0.4.2 2021-01-15 [1] CRAN (R 4.0.3)
#> farver 2.0.3 2020-01-16 [1] CRAN (R 4.0.2)
#> furrr 0.2.1 2020-10-21 [1] CRAN (R 4.0.2)
#> future 1.21.0 2020-12-10 [1] CRAN (R 4.0.3)
#> generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.3)
#> ggplot2 3.3.3 2020-12-30 [1] CRAN (R 4.0.3)
#> globals 0.14.0 2020-11-22 [1] CRAN (R 4.0.3)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
#> gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.2)
#> htmltools 0.5.1 2021-01-12 [1] CRAN (R 4.0.3)
#> jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.0.3)
#> knitr 1.30 2020-09-22 [1] CRAN (R 4.0.2)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.2)
#> listenv 0.8.0 2019-12-05 [1] CRAN (R 4.0.2)
#> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.3)
#> microbenchmark 1.4-7 2019-09-24 [1] CRAN (R 4.0.2)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.2)
#> parallelly 1.23.0 2021-01-04 [1] CRAN (R 4.0.3)
#> pillar 1.4.7 2020-11-20 [1] CRAN (R 4.0.3)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2)
#> purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.0.2)
#> R.cache 0.14.0 2019-12-06 [1] CRAN (R 4.0.2)
#> R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.0.2)
#> R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.0.2)
#> R.utils 2.10.1 2020-08-26 [1] CRAN (R 4.0.2)
#> R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.3)
#> Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.3)
#> RcppKagome * 0.0.0.500 2021-01-20 [1] local
#> rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.3)
#> rmarkdown 2.6 2020-12-14 [1] CRAN (R 4.0.3)
#> RMeCab * 1.05 2020-04-28 [1] local
#> scales 1.1.1 2020-05-11 [1] CRAN (R 4.0.2)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2)
#> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
#> styler 1.3.2 2020-02-23 [1] CRAN (R 4.0.2)
#> tibble 3.0.5 2021-01-15 [1] CRAN (R 4.0.3)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.2)
#> vctrs 0.3.6 2020-12-17 [1] CRAN (R 4.0.3)
#> withr 2.4.0 2021-01-16 [1] CRAN (R 4.0.3)
#> xfun 0.20 2021-01-06 [1] CRAN (R 4.0.3)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.0)
#>
#> [1] C:/Users/user/Documents/R/win-library/4.0
#> [2] C:/Program Files/R/R-4.0.3/library