While writing the previous article with {tensorflow}, I found a function called `import` that loads Python modules, and gave it a try. It turned out I could call all sorts of modules, so I'm writing this down so I don't forget. This time I'll run HDP (Hierarchical Dirichlet Process) with gensim. For reference, I'll also include the Python script that corresponds to the R execution.
To briefly explain what `import` can do: you can load and call a Python module from R as shown below.
import
numpy <- tensorflow::import(module = "numpy")
# The assigned name doesn't have to match the module name (similar to aliasing with import XXX as YYYY in Python)
pd <- tensorflow::import(module = "pandas")
# Generate a 5x10 matrix of random numbers with NumPy
> numpy$random$randn(5, 10)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0.2047579 1.4396232 1.39649920 -0.1621329 -0.6799612 -1.4375955 0.2590723 -1.4437292 1.1271664 -0.4434576
[2,] -0.3361488 0.5803642 0.95495572 0.6387092 0.6780120 -1.4093077 -1.3109703 -0.3436415 -0.3865685 1.7751260
[3,] -1.5810281 -1.2563752 0.89171330 0.6231667 -0.2083092 -0.3420242 0.3497256 -0.1856578 0.7441822 0.3541081
[4,] 0.8735340 1.3249083 -0.85038554 1.7963186 -0.4585773 0.8976184 -1.3661424 1.1888050 -1.0010572 -1.0871089
[5,] -0.5713605 0.6925751 -0.04234618 1.6234901 0.2355078 -0.2883208 1.6499782 1.6002583 -0.4482790 -0.0950429
# Create a data frame with pandas
> pd$DataFrame(data = numpy$array(c(1, 2)))
0
0 1.0
1 2.0
In the article I wrote earlier, I mentioned that the RStudio 1.x preview suggests completions for TensorFlow objects when using {tensorflow}; modules loaded with `import` support this as well. Taking the pandas module loaded above as an example: if you type `$` after the `pd` object and press Tab, its functions and objects are suggested. This makes it easy for RStudio users unfamiliar with Python to use useful modules. Note that in Python, methods are called with `.`, but on the R side you use `$`, so be careful when rewriting a Python script from GitHub or elsewhere in R.
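For example, a chained pandas call in Python and its R counterpart line up like this (a minimal sketch, reusing the `pd` object loaded above; the named list becomes a Python dict, per the conversion table shown later):
# Python: df = pandas.DataFrame({"x": [1.0, 2.0, 3.0]}); df.head(2)
df <- pd$DataFrame(data = list(x = c(1, 2, 3)))
df$head(2L)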
As an aside: R has no Python-style dictionary type, and I was confused about how to handle dictionary objects created with `tensorflow::dict()`, so I'll leave a memo here.
{tensorflow} dict
#Created by tensorflow::dict()
> (word_dict <- tensorflow::dict(a = "at"))
{'a': 'at'}
> class(x = word_dict)
[1] "tensorflow.builtin.dict" "tensorflow.builtin.object" "externalptr"
# Use the get() method to look up a value
> word_dict$get("a")
[1] "at"
# If you specify a key that does not exist, nothing is returned; use is.null() to check
> word_dict$get("b")
> is.null(x = word_dict$get("b"))
[1] TRUE
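Incidentally, Python's dict.get() also accepts a default value as its second argument, which saves the is.null() check above. A sketch, assuming positional arguments pass through to Python as-is:
word_dict$get("b", "unknown")  # should return "unknown" instead of nothing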
For the correspondence between R and Python objects, refer to the {tensorflow} API reference.
Python | R | Examples
---|---|---
Scalar | Single-element vector | `1`, `1L`, `TRUE`, `"foo"`
List | Multi-element vector | `c(1.0, 2.0, 3.0)`, `c(1L, 2L, 3L)`
Tuple | List of multiple types | `list(1L, TRUE, "foo")`
Dict | Named list or dict | `list(a = 1L, b = 2.0)`, `dict(x = x_data)`
NumPy ndarray | Matrix/Array | `matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)`
None, True, False | NULL, TRUE, FALSE | `NULL`, `TRUE`, `FALSE`
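As a quick exercise of a few rows in this table, R objects can be passed straight into NumPy calls (a minimal sketch, reusing the numpy object loaded earlier):
numpy$mean(c(1, 2, 3))              # multi-element vector -> Python list
numpy$array(matrix(1:4, nrow = 2))  # R matrix -> NumPy ndarray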
Now to the main topic: defining a function to run HDP with gensim. Of course, you need to install gensim in advance.
Constant definition part
library(pacman)
# Packages to load
SET_LOAD_PACKAGE <- c("tensorflow", "dplyr", "tidyr", "tidytext", "tibble", "stringi", "ggplot2")
Function definition part
# Based on __getitem__ of hdpmodel.py
# https://github.com/RaRe-Technologies/gensim/blob/master/gensim/models/hdpmodel.py
assignTopic <- function (hdp, corpus, eps = 0.01) {
  # Infer the variational parameter gamma (document-topic counts) for the corpus
  gamma <- hdp$inference(chunk = corpus)
  # Normalize each row of gamma into a per-document topic distribution
  topic_dist <- gamma / matrix(data = rep(x = rowSums(x = gamma), hdp$m_T), ncol = hdp$m_T, byrow = FALSE)
  # Keep only the (document, topic) pairs whose probability is at least eps
  tpc_idx <- dplyr::as_data_frame(which(x = topic_dist >= eps, arr.ind = TRUE)) %>%
    dplyr::arrange(row, col)
  return(
    dplyr::data_frame(
      doc_id = tpc_idx$row,
      topic_number = tpc_idx$col - 1,
      prob = topic_dist[as.matrix(tpc_idx)]
    )
  )
}
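To see what the row normalization inside assignTopic() does, here is a tiny sketch with a made-up gamma matrix (hypothetical values, not actual model output). Dividing a matrix by a vector recycles the vector down the columns, so each row gets divided by its own sum:
# Hypothetical gamma for 2 documents x 3 topics (illustration only)
gamma_toy <- matrix(data = c(2, 1, 1, 1, 3, 1), nrow = 2, byrow = TRUE)
gamma_toy / rowSums(x = gamma_toy)  # each row now sums to 1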
Next comes the data preparation in R needed to run gensim's HDP. The intermediate results are printed along the way, so you can compare them with the Python execution at the end of this article.
Preparation
# Load the packages used this time
pacman::p_load(char = SET_LOAD_PACKAGE, install = FALSE, character.only = TRUE)
# Load the gensim Python module
gensim <- tensorflow::import(module = "gensim")
#Text for HDP
documents <- c(
"Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"
)
> texts <- dplyr::data_frame(documents) %>%
tibble::rownames_to_column(var = "sid") %>%
tidytext::unnest_tokens(output = word, input = documents, token = "words") %>%
dplyr::anti_join(
y = dplyr::data_frame(word = c("for", "a", "of", "the", "and", "to", "in")),
by = "word"
) %>%
print
# A tibble: 52 × 2
sid word
<chr> <chr>
1 8 ordering
2 8 quasi
3 8 well
4 8 widths
5 8 iv
6 8 minors
7 9 minors
8 7 paths
9 7 graph
10 8 graph
# ... with 42 more rows
# Create a gensim.corpora.Dictionary object that associates each word with an ID
dictionary <- gensim$corpora$Dictionary(list(stringi::stri_unique(str = texts$word)))
# Below, the word "minors" has ID 0, and so on
> head(x = dictionary$token2id, n = 5)
$minors
[1] 0
$generation
[1] 1
$random
[1] 17
$iv
[1] 3
$engineering
[1] 4
> class(dictionary)
[1] "gensim.corpora.dictionary.Dictionary" "gensim.utils.SaveLoad" "_abcoll.Mapping"
[4] "tensorflow.builtin.object" "externalptr"
# Store the words in nested lists so that the doc2bow method can be applied:
# one list per sentence, each containing one element per word
sentence <- texts %>%
dplyr::group_by(sid) %>%
dplyr::summarize(sentence = list(as.list(x = word))) %>%
print
# A tibble: 9 × 2
sid sentence
<chr> <list>
1 1 <list [7]>
2 2 <list [7]>
3 3 <list [5]>
4 4 <list [6]>
5 5 <list [7]>
6 6 <list [5]>
7 7 <list [4]>
8 8 <list [8]>
9 9 <list [3]>
> sentence$sentence[[1]]
[[1]]
[1] "applications"
[[2]]
[1] "computer"
[[3]]
[1] "abc"
[[4]]
[1] "lab"
[[5]]
[1] "interface"
[[6]]
[1] "machine"
[[7]]
[1] "human"
#Create a corpus by applying the doc2bow method for each sentence
corpus <- lapply(X = sentence$sentence, FUN = dictionary$doc2bow)
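Each element of corpus should now be a bag of words, i.e. a list of (word ID, count) pairs like the pprint(corpus) output in the Python script at the end of this article. A minimal check (Python tuples come back as R lists):
corpus[[1]]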
Hierarchical Dirichlet Process
Call gensim's HDP using the corpus prepared above. As described later, the maximum number of topics here is set to 150.
HDP execution and result display
hdp <- gensim$models$hdpmodel$HdpModel(corpus = corpus, id2word = dictionary, alpha = 1, T = 150L)
# Get the top 100 words per topic with the show_topics method
topic_most_words <- dplyr::bind_rows(
lapply(
X = hdp$show_topics(topics = -1L, topn = 100L, formatted = FALSE),
FUN = function (topic_result) {
return(
dplyr::data_frame(
topic_number = topic_result[[1]],
word = sapply(X = topic_result[[2]], FUN = "[[", 1),
weight = sapply(X = topic_result[[2]], FUN = "[[", 2)
)
)
}
)
)
# Show the top-weighted word in each topic
> topic_most_words %>%
dplyr::group_by(topic_number) %>%
dplyr::top_n(n = 1, wt = weight) %>%
dplyr::glimpse()
Observations: 150
Variables: 3
$ topic_number <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 3...
$ word <chr> "iv", "abc", "machine", "opinion", "iv", "binary", "user", "time", "random", "unordered", "graph", "computer", "ordering"...
$ weight <dbl> 0.12797694, 0.12662267, 0.14265295, 0.22422048, 0.07968149, 0.15725564, 0.12947385, 0.12292357, 0.12069854, 0.08766465,...
# Show the top 10 words by weight for topic number 0
> topic_most_words %>%
dplyr::filter(topic_number == 0) %>%
dplyr::top_n(n = 10, wt = weight)
# A tibble: 10 × 3
topic_number word weight
<int> <chr> <dbl>
1 0 iv 0.12797694
2 0 error 0.09925327
3 0 unordered 0.08631031
4 0 widths 0.08567026
5 0 user 0.05359417
6 0 perceived 0.04646710
7 0 quasi 0.03670312
8 0 minors 0.03300030
9 0 survey 0.03216778
10 0 relation 0.03062379
# The word weights sum to 1.0 within each topic
> topic_most_words %>%
dplyr::group_by(topic_number) %>%
dplyr::summarize(sum_weight = sum(weight))
# A tibble: 150 × 2
topic_number sum_weight
<int> <dbl>
1 0 1
2 1 1
3 2 1
4 3 1
5 4 1
6 5 1
7 6 1
8 7 1
9 8 1
10 9 1
# ... with 140 more rows
# The raw output, before any processing in R, looks like this
> hdp$show_topics(topics = 1L, topn = 10L)
[1] "topic 0: 0.128*iv + 0.099*error + 0.086*unordered + 0.086*widths + 0.054*user + 0.046*perceived + 0.037*quasi + 0.033*minors + 0.032*survey + 0.031*relation"
HDP-LDA has no explicitly fixed number of topics; in theory the number of topics is unbounded. For practical execution, however, an upper bound is set (150 in the run above). My understanding is that the theoretically infinite number of topics gets truncated by this parameter, leaving a suitable number in practice, but please point out anything I've gotten wrong (apologies for my lack of study).
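To see the effect of the truncation level directly, the model could simply be refit with a smaller T and the results compared (a sketch reusing the corpus and dictionary from above; not part of the original run):
hdp_small <- gensim$models$hdpmodel$HdpModel(corpus = corpus, id2word = dictionary, alpha = 1, T = 20L)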
Number of topics
# Comparing topic assignments while varying the threshold eps: beyond a certain number of topics, the probabilities are almost 0 (suggesting an appropriate number of topics)
> assignTopic(hdp = hdp, corpus = corpus, eps = 0.5) %>%
tidyr::spread(key = topic_number, value = prob, fill = 0)
# A tibble: 7 × 4
doc_id `0` `1` `2`
* <int> <dbl> <dbl> <dbl>
1 2 0.0000000 0.000000 0.5794248
2 3 0.0000000 0.636553 0.0000000
3 4 0.0000000 0.549115 0.0000000
4 5 0.0000000 0.000000 0.6637392
5 7 0.8510308 0.000000 0.0000000
6 8 0.9171867 0.000000 0.0000000
7 9 0.8142352 0.000000 0.0000000
> assignTopic(hdp = hdp, corpus = corpus, eps = 0.05) %>%
tidyr::spread(key = topic_number, value = prob, fill = 0)
# A tibble: 9 × 5
doc_id `0` `1` `2` `3`
* <int> <dbl> <dbl> <dbl> <dbl>
1 1 0.0000000 0.4434944 0.4689728 0.0000000
2 2 0.0000000 0.0000000 0.5794248 0.3236852
3 3 0.2711764 0.6365530 0.0000000 0.0000000
4 4 0.3718040 0.5491150 0.0000000 0.0000000
5 5 0.2609743 0.0000000 0.6637392 0.0000000
6 6 0.4543387 0.4534745 0.0000000 0.0000000
7 7 0.8510308 0.0000000 0.0000000 0.0000000
8 8 0.9171867 0.0000000 0.0000000 0.0000000
9 9 0.8142352 0.0000000 0.0000000 0.0000000
> assignTopic(hdp = hdp, corpus = corpus, eps = 0.005) %>%
tidyr::spread(key = topic_number, value = prob, fill = 0)
# A tibble: 9 × 10
doc_id `0` `1` `2` `3` `4` `5` `6` `7` `8`
* <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.03606753 0.44349436 0.46897279 0.01310405 0.010056735 0.007288994 0.005646310 0.000000000 0.000000000
2 2 0.03447054 0.02405832 0.57942479 0.32368520 0.010056624 0.007288991 0.005646310 0.000000000 0.000000000
3 3 0.27117635 0.63655299 0.02365162 0.01747056 0.013409101 0.009718661 0.007528413 0.005514206 0.000000000
4 4 0.37180402 0.54911504 0.02026298 0.01497529 0.011494638 0.008330283 0.006452925 0.000000000 0.000000000
5 5 0.26097434 0.02382923 0.66373916 0.01309609 0.010056665 0.007288992 0.005646310 0.000000000 0.000000000
6 6 0.45433870 0.45347446 0.02357873 0.01746019 0.013408535 0.009718667 0.007528413 0.005514206 0.000000000
7 7 0.85103077 0.03833858 0.02829496 0.02095766 0.016090783 0.011662393 0.009034095 0.006617047 0.000000000
8 8 0.91718670 0.02139333 0.01567863 0.01164270 0.008939062 0.006479106 0.005018942 0.000000000 0.000000000
9 9 0.81423517 0.04750129 0.03534385 0.02619753 0.020113102 0.014577990 0.011292619 0.008271309 0.005930772
# The following verification uses the result with the default eps = 0.01
topic_assign <- assignTopic(hdp = hdp, corpus = corpus)
topic_assign %>%
tidyr::spread(key = topic_number, value = prob, fill = 0)
# A tibble: 9 × 8
doc_id `0` `1` `2` `3` `4` `5` `6`
* <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.03606753 0.44349436 0.46897279 0.01310405 0.01005674 0.00000000 0.00000000
2 2 0.03447054 0.02405832 0.57942479 0.32368520 0.01005662 0.00000000 0.00000000
3 3 0.27117635 0.63655299 0.02365162 0.01747056 0.01340910 0.00000000 0.00000000
4 4 0.37180402 0.54911504 0.02026298 0.01497529 0.01149464 0.00000000 0.00000000
5 5 0.26097434 0.02382923 0.66373916 0.01309609 0.01005666 0.00000000 0.00000000
6 6 0.45433870 0.45347446 0.02357873 0.01746019 0.01340854 0.00000000 0.00000000
7 7 0.85103077 0.03833858 0.02829496 0.02095766 0.01609078 0.01166239 0.00000000
8 8 0.91718670 0.02139333 0.01567863 0.01164270 0.00000000 0.00000000 0.00000000
9 9 0.81423517 0.04750129 0.03534385 0.02619753 0.02011310 0.01457799 0.01129262
# Assigning each document the topic with the highest probability concentrates the assignments on topics 0 to 2
> topic_assign %>%
dplyr::group_by(doc_id) %>%
dplyr::filter(prob == max(prob))
Source: local data frame [9 x 3]
Groups: doc_id [9]
doc_id topic_number prob
<int> <dbl> <dbl>
1 1 2 0.4689728
2 2 2 0.5794248
3 3 1 0.6365530
4 4 1 0.5491150
5 5 2 0.6637392
6 6 0 0.4543387
7 7 0 0.8510308
8 8 0 0.9171867
9 9 0 0.8142352
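# (Sketch added for illustration) Tally how many documents land on each topic
topic_assign %>%
dplyr::group_by(doc_id) %>%
dplyr::filter(prob == max(prob)) %>%
dplyr::ungroup() %>%
dplyr::count(topic_number)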
#Calculate the ratio for each topic number
> dplyr::left_join(
x = topic_assign %>%
dplyr::group_by(topic_number) %>%
dplyr::summarize(topic_sum_prob = sum(prob)),
y = topic_assign %>%
dplyr::mutate(sum_prob = sum(prob)) %>%
dplyr::distinct(topic_number, sum_prob),
by = "topic_number"
) %>%
dplyr::mutate(
ratio = topic_sum_prob / sum_prob,
cumsum_ratio = cumsum(x = ratio)
) %>%
dplyr::select(topic_number, ratio, cumsum_ratio)
# A tibble: 7 × 3
topic_number ratio cumsum_ratio
<dbl> <dbl> <dbl>
1 0 0.460601369 0.4606014
2 1 0.256953677 0.7175550
3 2 0.213456275 0.9310113
4 3 0.052658161 0.9836695
5 4 0.012020739 0.9956902
6 5 0.003013089 0.9987033
7 6 0.001296691 1.0000000
With eps = 0.01, three or four topics cover most of the probability mass, and at most seven topics appear.
Restricting to these seven topics (numbers 0 to 6), let's visualize the results from several perspectives.
Cumulative sum of word weights
tpc_words <- dplyr::inner_join(
x = topic_most_words,
y = topic_assign %>%
dplyr::distinct(topic_number),
by = "topic_number"
)
tpc_words %>%
dplyr::mutate(topic_number = as.factor(x = topic_number)) %>%
dplyr::group_by(topic_number) %>%
dplyr::mutate(
counter = dplyr::row_number(x = topic_number),
cumsum_weight = cumsum(x = weight)
) %>%
ggplot2::ggplot(
data = ., mapping = ggplot2::aes(x = counter, y = cumsum_weight)
) + ggplot2::geom_line(mapping = ggplot2::aes(group = topic_number, color = topic_number))
Looking at the top 5 words (where the X axis equals 5), topic 3 reaches a cumulative weight of about 0.50 while topic 4 is at about 0.28: topic 3 concentrates more probability mass in fewer words (some words seem to represent topic 3 strongly).
Top 10 words weighted by topic
tpc_words %>%
dplyr::group_by(topic_number) %>%
dplyr::arrange(topic_number, dplyr::desc(x = weight)) %>%
dplyr::top_n(n = 10, wt = weight) %>%
dplyr::mutate(word = reorder(x = word, X = weight)) %>%
dplyr::ungroup() %>%
ggplot2::ggplot(data = ., mapping = ggplot2::aes(x = word, y = weight, fill = factor(topic_number))) +
ggplot2::geom_bar(stat = "identity", show.legend = FALSE) +
ggplot2::facet_wrap(facets = ~ topic_number, scales = "free") +
ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90))
Topic 3 gives a large share of its weight to "opinion" and "user", while topic 4 weights its words fairly evenly. This confirms the previous visualization at the word level.
Stacked bar graph for each word
tpc_words %>%
dplyr::group_by(topic_number) %>%
dplyr::arrange(topic_number, dplyr::desc(x = weight)) %>%
dplyr::top_n(n = 30, wt = weight) %>%
dplyr::ungroup() %>%
dplyr::mutate(
topic_number = factor(topic_number),
word = reorder(x = word, X = weight)
) %>%
ggplot2::ggplot(ggplot2::aes(x = word, y = weight, fill = topic_number)) +
ggplot2::geom_bar(stat = "identity") + ggplot2::coord_flip()
This visualizes the share of each topic per word. Specifically, "error" is dominated by topic 0 and "opinion" by topic 3, consistent with the earlier visualizations. In contrast, "iv" and "user" are not tied to a single topic.
I ran gensim's HDP using `import`, the {tensorflow} function that loads Python modules. I estimated a number of topics that seemed plausible and visualized the results from several perspectives. I'm worried that my understanding of HDP isn't correct, but I gave it a try (I'd appreciate it if you could point out anything wrong). I also verified running gensim's Doc2Vec the same way, so I'll summarize that soon. With {tensorflow}'s `import` I thought I could retire {PythonInR}, which I used before, but since I can't enumerate the Python functions that can be called, I'd like to investigate a little more.
Call gensim DTM from R
> gensim$models$ldaseqmodel$LdaSeqModel
Error in eval(substitute(expr), envir, enclos):
AttributeError: 'module' object has no attribute 'ldaseqmodel'
Call gensim DTM from Python
import gensim
>>> gensim.models.ldaseqmodel.LdaSeqModel
<class 'gensim.models.ldaseqmodel.LdaSeqModel'>
Regarding the gensim DTM that could not be called above: it became available after I upgraded TensorFlow (execution was confirmed on 0.12.1). However, `pip install tensorflow` didn't work; what did work was following the official procedure and specifying the .whl via TF_BINARY_URL. I don't really understand the difference.
Call gensim DTM from R (additional version)
dtm <- gensim$models$ldaseqmodel$LdaSeqModel(corpus = corpus, time_slice = c(5L, 10L, 15L), num_topics = 3L)
> print(dtm)
LdaSeqModel
#Topic distribution of the specified document
> dtm$doc_topics(doc_number = 1)
[1] 0.001422475 0.001422475 0.997155050
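# (Sketch, not from the original run) Topic distributions for all documents at once;
# doc_topics indexes documents from 0, so iterate over 0..(number of documents - 1)
t(sapply(X = seq_along(corpus) - 1L, FUN = dtm$doc_topics))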
# Topic distribution per document, word distribution per topic, number of words per sentence, word frequencies, and the vocabulary (word IDs)
> dtm$dtm_vis(time = 1L, corpus = corpus)
[[1]]
[,1] [,2] [,3]
[1,] 0.001422475 0.001422475 0.997155050
[2,] 0.001422475 0.001422475 0.997155050
[3,] 0.001988072 0.001988072 0.996023857
[4,] 0.001658375 0.001658375 0.996683250
[5,] 0.001422475 0.001422475 0.997155050
[6,] 0.996023857 0.001988072 0.001988072
[7,] 0.995037221 0.002481390 0.002481390
[8,] 0.001245330 0.997509340 0.001245330
[9,] 0.003300330 0.993399340 0.003300330
[[2]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 0.05957456 0.01419498 0.01419498 0.014194979 0.01419498 0.01419498 0.01419498 0.01419498
[2,] 0.20229085 0.04711086 0.01035351 0.041352822 0.01035351 0.01035351 0.01035351 0.01035351
[3,] 0.01935312 0.01369708 0.01369708 0.007190831 0.01369708 0.07356165 0.01369708 0.07356165
[,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
[1,] 0.01419498 0.01419498 0.128644022 0.01419498 0.014194979 0.128644022 0.01419498 0.01419498
[2,] 0.01035351 0.01035351 0.010353508 0.01035351 0.041352822 0.202290845 0.01035351 0.01035351
[3,] 0.01369708 0.01369708 0.007190831 0.01369708 0.007190831 0.007190831 0.13245925 0.01369708
[,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
[1,] 0.014194979 0.01419498 0.01419498 0.01419498 0.12864402 0.01419498 0.01419498 0.01419498
[2,] 0.041352822 0.01035351 0.01035351 0.01035351 0.04135282 0.01035351 0.01035351 0.01035351
[3,] 0.007190831 0.01369708 0.01369708 0.01369708 0.01369708 0.01369708 0.01369708 0.01369708
[,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32]
[1,] 0.01419498 0.128644022 0.01419498 0.01419498 0.014194979 0.014194979 0.01419498 0.01419498
[2,] 0.01035351 0.010353508 0.01035351 0.01035351 0.041352822 0.041352822 0.01035351 0.04135282
[3,] 0.07356165 0.007190831 0.07356165 0.01369708 0.007190831 0.007190831 0.07356165 0.01369708
[,33] [,34] [,35]
[1,] 0.01419498 0.01419498 0.01419498
[2,] 0.01035351 0.01035351 0.01035351
[3,] 0.07356165 0.01369708 0.10274367
[[3]]
[1] 7 7 5 5 7 5 4 8 3
[[4]]
[1] 2 1 1 1 1 2 1 2 1 1 1 1 1 3 4 1 1 1 1 1 3 1 1 1 2 1 2 1 1 1 2 2 2 1 3
[[5]]
[1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17"
[19] "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31" "32" "33" "34"
References:
- Implementing the hierarchical Dirichlet process (1): Comparing HDP-LDA and LDA models
- Trying gensim's HDP (Hierarchical Dirichlet Process) on classical music information
Finally, here is the Python script corresponding to the HDP run above.
Python runtime environment
Python 2.7.12 (default, Jul 10 2016, 20:09:20)
[GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Python-HDP
import gensim
import pandas
from pprint import pprint  # used below to pretty-print intermediate results
#Text for HDP
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
#Stop word definition
stoplist = set('for a of the and to in'.split())
#Text preprocessing
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
>>> pprint(texts)
[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
['eps', 'user', 'interface', 'management', 'system'],
['system', 'human', 'system', 'engineering', 'testing', 'eps'],
['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
['generation', 'random', 'binary', 'unordered', 'trees'],
['intersection', 'graph', 'paths', 'trees'],
['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
['graph', 'minors', 'survey']]
dictionary = gensim.corpora.Dictionary(texts)
>>> pprint(dictionary.token2id)
{u'abc': 0,
u'applications': 3,
u'binary': 22,
u'computer': 4,
u'engineering': 15,
u'eps': 14,
u'error': 18,
u'generation': 21,
u'graph': 26,
u'human': 5,
u'interface': 6,
u'intersection': 27,
u'iv': 33,
u'lab': 1,
u'machine': 2,
u'management': 13,
u'measurement': 20,
u'minors': 29,
u'opinion': 11,
u'ordering': 30,
u'paths': 28,
u'perceived': 17,
u'quasi': 34,
u'random': 23,
u'relation': 19,
u'response': 12,
u'survey': 8,
u'system': 7,
u'testing': 16,
u'time': 10,
u'trees': 25,
u'unordered': 24,
u'user': 9,
u'well': 32,
u'widths': 31}
corpus = [dictionary.doc2bow(text) for text in texts]
>>> pprint(corpus)
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)],
[(4, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
[(6, 1), (7, 1), (9, 1), (13, 1), (14, 1)],
[(5, 1), (7, 2), (14, 1), (15, 1), (16, 1)],
[(9, 1), (10, 1), (12, 1), (17, 1), (18, 1), (19, 1), (20, 1)],
[(21, 1), (22, 1), (23, 1), (24, 1), (25, 1)],
[(25, 1), (26, 1), (27, 1), (28, 1)],
[(25, 1), (26, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)],
[(8, 1), (26, 1), (29, 1)]]
#HDP execution
hdp = gensim.models.hdpmodel.HdpModel(corpus = corpus, id2word = dictionary, alpha = 1.0, T = 150)
pandas.DataFrame([dict(hdp[x]) for x in corpus])
0 1 2 3 4 5 6
0 0.153961 0.024102 0.017490 0.013135 0.763441 NaN NaN
1 0.424649 0.506962 0.017490 0.013132 NaN NaN NaN
2 0.234524 0.033107 0.664496 0.017517 0.013194 NaN NaN
3 0.894362 0.027382 0.020088 0.015006 0.011309 NaN NaN
4 0.907610 0.023981 0.017512 0.013130 NaN NaN NaN
5 0.052698 0.856097 0.023335 0.017514 0.013194 NaN NaN
6 0.054569 0.835891 0.028101 0.021013 0.015832 0.011575 NaN
7 0.029701 0.478424 0.015533 0.011674 NaN 0.437527 NaN
8 0.066220 0.797049 0.034939 0.026260 0.019790 0.014469 0.01079
Execution environment
Session info ----------------------------------------------------------------------------------------------------------------------
setting value
version R version 3.3.1 (2016-06-21)
system x86_64, darwin13.4.0
ui RStudio (1.0.44)
language (EN)
collate ja_JP.UTF-8
tz Asia/Tokyo
date 2016-10-30
Packages --------------------------------------------------------------------------------------------------------------------------
package * version date source
assertthat 0.1 2013-12-06 CRAN (R 3.3.1)
broom 0.4.1 2016-06-24 CRAN (R 3.3.0)
colorspace 1.2-6 2015-03-11 CRAN (R 3.3.1)
DBI 0.5 2016-08-11 cran (@0.5)
devtools 1.12.0 2016-06-24 CRAN (R 3.3.0)
digest 0.6.9 2016-01-08 CRAN (R 3.3.0)
dplyr * 0.5.0 2016-06-24 CRAN (R 3.3.1)
ggplot2 * 2.1.0 2016-03-01 CRAN (R 3.3.1)
gtable 0.2.0 2016-02-26 CRAN (R 3.3.1)
janeaustenr 0.1.1 2016-06-20 CRAN (R 3.3.0)
lattice 0.20-33 2015-07-14 CRAN (R 3.3.1)
magrittr 1.5 2014-11-22 CRAN (R 3.3.1)
Matrix 1.2-6 2016-05-02 CRAN (R 3.3.1)
memoise 1.0.0 2016-01-29 CRAN (R 3.3.0)
mnormt 1.5-4 2016-03-09 CRAN (R 3.3.0)
munsell 0.4.3 2016-02-13 CRAN (R 3.3.1)
nlme 3.1-128 2016-05-10 CRAN (R 3.3.1)
pacman * 0.4.1 2016-03-30 CRAN (R 3.3.0)
plyr 1.8.4 2016-06-08 CRAN (R 3.3.1)
psych 1.6.6 2016-06-28 CRAN (R 3.3.0)
R6 2.1.3 2016-08-19 cran (@2.1.3)
Rcpp 0.12.7 2016-09-05 cran (@0.12.7)
reshape2 1.4.1 2014-12-06 CRAN (R 3.3.1)
scales 0.4.0 2016-02-26 CRAN (R 3.3.1)
SnowballC 0.5.1 2014-08-09 CRAN (R 3.3.1)
stringi * 1.1.1 2016-05-27 CRAN (R 3.3.1)
stringr 1.1.0 2016-08-19 cran (@1.1.0)
tensorflow * 0.3.0 2016-10-25 Github (rstudio/tensorflow@dfe2f1a)
tibble * 1.2 2016-08-26 cran (@1.2)
tidyr * 0.6.0 2016-08-12 cran (@0.6.0)
tidytext * 0.1.1 2016-06-25 CRAN (R 3.3.0)
tokenizers 0.1.4 2016-08-29 CRAN (R 3.3.0)
withr 1.0.2 2016-06-20 CRAN (R 3.3.0)