---
title: "6. Using tidytext with textmineR"
author: "Thomas W. Jones"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{6. Using tidytext with textmineR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE
)
```

# Using tidytext with textmineR

The [`tidytext`](https://CRAN.R-project.org/package=tidytext) package is one of the more popular natural language processing packages in R's ecosystem. It follows the conventions and syntax of the "tidyverse." You may prefer to use `tidytext` for a couple of reasons. First, `tidytext` has its own philosophy and syntax for handling text, particularly at the early stages of an analysis, and you may be more familiar or comfortable with this approach. Second, `tidytext` offers more flexibility in the options for creating DTMs or TCMs. This early stage is critical to successful topic modeling. See _[Text Mining with R: A Tidy Approach](https://www.tidytextmining.com/)_ for more details about `tidytext`.

What follows is a short script combining `tidytext` with `textmineR`. Initial data curation and DTM creation are done with `tidytext`. Topic modeling is done with `textmineR`, and the outputs are re-formatted in the flavor of `tidytext`'s "tidiers" for other topic models.

```{r}
################################################################################
# Example: Using tidytext with textmineR
################################################################################

library(tidytext)
library(textmineR)
library(dplyr)
library(tidyr)

# load documents in a data frame
docs <- textmineR::nih_sample

# tokenize using tidytext's unnest_tokens
tidy_docs <- docs %>% 
  select(APPLICATION_ID, ABSTRACT_TEXT) %>% 
  unnest_tokens(output = word, 
                input = ABSTRACT_TEXT,
                stopwords = c(stopwords::stopwords("en"), 
                              stopwords::stopwords(source = "smart")),
                token = "ngrams",
                n_min = 1, 
                n = 2) %>% 
  count(APPLICATION_ID, word) %>% 
  filter(n > 1) # filter for words/bigrams per document, rather than per corpus

# filter out words that are just numbers
tidy_docs <- tidy_docs %>% 
  filter(! stringr::str_detect(word, "^[0-9]+$"))

# turn a tidy tbl into a sparse dgCMatrix for use in textmineR
d <- tidy_docs %>% 
  cast_sparse(APPLICATION_ID, word, n)

# create a topic model
m <- FitLdaModel(dtm = d, 
                 k = 20,
                 iterations = 200,
                 burnin = 175)

# below is equivalent to tidy_beta <- tidy(x = m, matrix = "beta")
tidy_beta <- data.frame(topic = as.integer(stringr::str_replace_all(rownames(m$phi), "t_", "")),
                        m$phi,
                        stringsAsFactors = FALSE) %>%
  gather(term, beta, -topic) %>% 
  tibble::as_tibble()

# below is equivalent to tidy_gamma <- tidy(x = m, matrix = "gamma")
tidy_gamma <- data.frame(document = rownames(m$theta),
                         m$theta,
                         stringsAsFactors = FALSE) %>%
  gather(topic, gamma, -document) %>% 
  tibble::as_tibble()
```
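
Because `tidy_beta` and `tidy_gamma` follow the same layout as `tidytext`'s tidiers for other topic models, they drop straight into tidyverse-style summaries. Below is a minimal sketch, assuming the chunk above has already been run, that pulls the ten highest-probability terms for each topic from `tidy_beta` using `dplyr`; the choice of ten terms is arbitrary and only for illustration.

```{r}
# summarize the model in tidytext style:
# keep the ten terms with the largest beta (term probability) in each topic
top_terms <- tidy_beta %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, desc(beta))

top_terms
```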