The tidytext package is one of the more popular natural language
processing packages in R’s ecosystem. It follows the conventions and
syntax of the “tidyverse.”
You may prefer to use tidytext for a couple of reasons. First, tidytext
has its own philosophy and syntax for handling text, particularly at the
early stages of an analysis. You may be more familiar or comfortable
with this approach. Second, tidytext does, in principle, offer more
flexibility in how you create document term matrices (DTMs) or term
co-occurrence matrices (TCMs). This early stage is critical to
successful topic modeling.
See Text Mining with R: A Tidy Approach for more details about tidytext.
What follows is a short script combining tidytext with textmineR.
Initial data curation and DTM creation are done with tidytext. Topic
modeling is done with textmineR, and the outputs are reformatted in the
style of tidytext’s “tidiers” for other topic models.
################################################################################
# Example: Using tidytext with textmineR
################################################################################
library(tidytext)
library(textmineR)
library(dplyr)
library(tidyr)
# load documents in a data frame
docs <- textmineR::nih_sample
# tokenize using tidytext's unnest_tokens
tidy_docs <- docs %>%
  select(APPLICATION_ID, ABSTRACT_TEXT) %>%
  unnest_tokens(output = word,
                input = ABSTRACT_TEXT,
                stopwords = c(stopwords::stopwords("en"),
                              stopwords::stopwords(source = "smart")),
                token = "ngrams",
                n_min = 1, n = 2) %>%
  count(APPLICATION_ID, word) %>%
  filter(n > 1) # keep unigrams/bigrams that occur more than once within a document, not across the corpus
# drop tokens that consist only of digits
tidy_docs <- tidy_docs %>%
  filter(!stringr::str_detect(word, "^[0-9]+$"))
# turn a tidy tbl into a sparse dgCMatrix for use in textmineR
d <- tidy_docs %>%
  cast_sparse(APPLICATION_ID, word, n)
# create a topic model
m <- FitLdaModel(dtm = d,
                 k = 20,
                 iterations = 200,
                 burnin = 175)
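# below is equivalent to tidy_beta <- tidy(x = m, matrix = "beta")
# note: data.frame() makes column names syntactic by default, so the
# space in a bigram such as "a b" becomes a dot ("a.b") in the term column
tidy_beta <- data.frame(topic = as.integer(stringr::str_replace_all(rownames(m$phi), "t_", "")),
                        m$phi,
                        stringsAsFactors = FALSE) %>%
  gather(term, beta, -topic) %>%
  tibble::as_tibble()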
# below is equivalent to tidy_gamma <- tidy(x = m, matrix = "gamma")
# note: topic stays a character label here ("t_1", "t_2", ...)
tidy_gamma <- data.frame(document = rownames(m$theta),
                         m$theta,
                         stringsAsFactors = FALSE) %>%
  gather(topic, gamma, -document) %>%
  tibble::as_tibble()
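Once the topic-term probabilities are in long form, the usual tidyverse verbs apply. The sketch below is illustrative only: it uses a small made-up matrix named phi as a stand-in for m$phi (the values and terms are invented, not taken from the NIH sample), reshapes it with tidyr::pivot_longer() instead of gather(), and picks the top terms per topic with dplyr::slice_max(). The names tidy_beta_toy and top_terms are hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for m$phi: 2 topics x 3 terms, each row sums to 1.
# These numbers are invented for illustration.
phi <- matrix(c(0.7, 0.2, 0.1,
                0.1, 0.3, 0.6),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("t_1", "t_2"),
                              c("cell", "gene", "trial")))

# Reshape to one row per topic-term pair, mirroring the tidy_beta layout
tidy_beta_toy <- as.data.frame(phi) %>%
  tibble::rownames_to_column("topic") %>%
  mutate(topic = as.integer(sub("^t_", "", topic))) %>%
  pivot_longer(-topic, names_to = "term", values_to = "beta")

# Keep the two most probable terms within each topic
top_terms <- tidy_beta_toy %>%
  group_by(topic) %>%
  slice_max(beta, n = 2, with_ties = FALSE) %>%
  ungroup()

print(top_terms)
```

The same group_by() and slice_max() steps should carry over to the tidy_beta object built above, assuming it has the topic, term, and beta columns shown in the script.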