Title: Functions for Text Mining and Topic Modeling
Description: An aid for text mining in R, with a syntax that should be familiar to experienced R users. Provides a wrapper for several topic models that take similarly-formatted input and give similarly-formatted output. Includes additional functionality for analyzing and diagnosing topic models.
Authors: Tommy Jones [aut, cre], William Doane [ctb], Mattias Attbom [ctb]
Maintainer: Tommy Jones <[email protected]>
License: MIT + file LICENSE
Version: 3.0.5.999
Built: 2024-11-05 04:00:07 UTC
Source: https://github.com/tommyjones/textminer
This function takes a phi matrix (P(token|topic)) and a theta matrix (P(topic|document)) and returns the phi prime matrix (P(topic|token)). Phi prime can be used for classifying new documents and for alternative topic labels.
CalcGamma(phi, theta, p_docs = NULL, correct = TRUE)

phi: The phi matrix whose rows index topics and columns index words. The i, j entries are P(word_j | topic_i).
theta: The theta matrix whose rows index documents and columns index topics. The i, j entries are P(topic_j | document_i).
p_docs: A numeric vector of length nrow(theta) that is proportional to the number of terms in each document. Defaults to NULL.
correct: Logical. Do you want to set NAs or NaNs in the final result to zero? Useful when hitting computational underflow. Defaults to TRUE.
Returns a matrix whose rows correspond to topics and whose columns correspond to tokens. The i, j entry corresponds to P(topic_i | token_j).
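For intuition, here is a minimal sketch of the Bayes' rule computation involved, where phi and theta are the matrices described above. Treating P(topic) as a uniform-weighted average over documents is an assumption; the package may weight by document length when p_docs is supplied.

# P(topic), here with uniform document weights (an assumption)
p_topic <- colMeans(theta)

# numerator of Bayes' rule: P(token|topic) * P(topic); scales row i of phi by p_topic[i]
num <- phi * p_topic

# normalize each column (token) to sum to 1, giving P(topic|token)
gamma_sketch <- sweep(num, 2, colSums(num), "/")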
# Load a pre-formatted dtm and topic model
data(nih_sample_topic_model)

# Make a gamma matrix, P(topic|words)
gamma <- CalcGamma(phi = nih_sample_topic_model$phi,
                   theta = nih_sample_topic_model$theta)
Calculates the Hellinger distance for the rows or columns of a numeric matrix, or for two numeric vectors.

CalcHellingerDist(x, y = NULL, by_rows = TRUE)

x: A numeric matrix or numeric vector.
y: A numeric vector. Defaults to NULL.
by_rows: Logical. If TRUE (the default), the distance is computed between the rows of x; if FALSE, between the columns of x.
If x is a matrix, this returns a square, symmetric matrix whose i, j entries correspond to the Hellinger distance between the rows of x (or the columns of x if by_rows = FALSE). If x and y are vectors, this returns a numeric scalar whose value is the Hellinger distance between x and y.
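For reference, a minimal sketch of the vector-to-vector case under the standard definition. That the inputs are first normalized to probability distributions is an assumption here, not a documented detail of CalcHellingerDist.

hellinger_sketch <- function(p, q) {
  p <- p / sum(p)  # normalize to probability distributions (an assumption)
  q <- q / sum(q)
  sqrt(sum((sqrt(p) - sqrt(q))^2)) / sqrt(2)
}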
x <- rchisq(n = 100, df = 8)
y <- x^2

CalcHellingerDist(x = x, y = y)

mymat <- rbind(x, y)

CalcHellingerDist(x = mymat)
This function calculates the Jensen-Shannon divergence for the rows or columns of a numeric matrix, or for two numeric vectors.

CalcJSDivergence(x, y = NULL, by_rows = TRUE)

x: A numeric matrix or numeric vector.
y: A numeric vector. Defaults to NULL.
by_rows: Logical. If TRUE (the default), the divergence is computed between the rows of x; if FALSE, between the columns of x.
If x is a matrix, this returns a square, symmetric matrix whose i, j entries correspond to the Jensen-Shannon divergence between the rows of x (or the columns of x if by_rows = FALSE). If x and y are vectors, this returns a numeric scalar whose value is the Jensen-Shannon divergence between x and y.
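For reference, a minimal sketch of the vector-to-vector case under the standard definition. As above, normalizing the inputs to probability distributions is an assumption, and the natural log is used here (some definitions use log base 2).

jsd_sketch <- function(p, q) {
  p <- p / sum(p)
  q <- q / sum(q)
  m <- (p + q) / 2
  # KL divergence, treating 0 * log(0) as 0
  kl <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0))
  (kl(p, m) + kl(q, m)) / 2
}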
x <- rchisq(n = 100, df = 8)
y <- x^2

CalcJSDivergence(x = x, y = y)

mymat <- rbind(x, y)

CalcJSDivergence(x = mymat)
This function takes a DTM, phi matrix (P(word|topic)), and a theta matrix (P(topic|document)) and returns a single value for the likelihood of the data given the model.
CalcLikelihood(dtm, phi, theta, ...)

dtm: The document term matrix of class dgCMatrix.
phi: The phi matrix whose rows index topics and columns index words. The i, j entries are P(word_j | topic_i).
theta: The theta matrix whose rows index documents and columns index topics. The i, j entries are P(topic_j | document_i).
...: Other arguments to pass to TmParallelApply. See note below.
Returns an object of class numeric corresponding to the log likelihood.

This function performs parallel computation if dtm has more than 3,000 rows. The default is to use all available cores according to detectCores. However, this can be modified by passing the cpus argument when calling this function.
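Conceptually, the quantity computed looks like the sketch below, where dtm, phi, and theta are the arguments described above. This is an assumption about the form of the calculation; the package's version is parallelized over documents.

# predicted token probabilities per document: documents x tokens
p_hat <- theta %*% phi

# log likelihood of the observed counts (entries of p_hat assumed > 0)
ll_sketch <- sum(dtm * log(p_hat))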
# Load a pre-formatted dtm and topic model
data(nih_sample_dtm)
data(nih_sample_topic_model)

# Get the likelihood of the data given the fitted model parameters
ll <- CalcLikelihood(dtm = nih_sample_dtm,
                     phi = nih_sample_topic_model$phi,
                     theta = nih_sample_topic_model$theta)

ll
Calculates the probabilistic coherence of a topic or topics. This approximates semantic coherence or human understandability of a topic.
CalcProbCoherence(phi, dtm, M = 5)

phi: A numeric matrix or a numeric vector. The vector, or rows of the matrix, represent the numeric relationship between topic(s) and terms. For example, this relationship may be p(word|topic) or p(topic|word).
dtm: A document term matrix or co-occurrence matrix of class dgCMatrix.
M: An integer for the number of words to be used in the calculation. Defaults to 5.
Returns an object of class numeric corresponding to the probabilistic coherence of the input topic(s).
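To make the idea concrete, here is a rough sketch of the pairwise quantity for a single topic. The exact averaging inside CalcProbCoherence may differ, and phi_topic below is a hypothetical named vector (one row of phi) whose names appear in colnames(dtm).

M <- 5
top <- names(sort(phi_topic, decreasing = TRUE))[1:M]  # top M terms of one topic

present <- (dtm[, top] > 0) * 1                          # document-level presence
co <- as.matrix(Matrix::crossprod(present)) / nrow(dtm)  # P(w_i & w_j)
marg <- diag(co)                                         # P(w_i)

# average P(w_j | w_i) - P(w_j) over ordered pairs i < j
vals <- numeric(0)
for (i in 1:(M - 1)) {
  for (j in (i + 1):M) {
    vals <- c(vals, co[i, j] / marg[i] - marg[j])
  }
}
coherence_sketch <- mean(vals)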
# Load a pre-formatted dtm and topic model
data(nih_sample_topic_model)
data(nih_sample_dtm)

CalcProbCoherence(phi = nih_sample_topic_model$phi,
                  dtm = nih_sample_dtm,
                  M = 5)
Function to calculate R-squared for a topic model. This uses a geometric interpretation of R-squared: the proportion of the total squared distance of the documents from their centroid that is explained by the model.
CalcTopicModelR2(dtm, phi, theta, ...)

dtm: A documents-by-terms document term matrix of class dgCMatrix.
phi: A topics-by-terms matrix where each entry is p(term_i | topic_j).
theta: A documents-by-topics matrix where each entry is p(topic_j | document_d).
...: Other arguments to be passed to TmParallelApply. See note below.
Returns an object of class numeric representing the proportion of variability in the data that is explained by the topic model.

This function performs parallel computation if dtm has more than 3,000 rows. The default is to use all available cores according to detectCores. However, this can be modified by passing the cpus argument when calling this function.
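A minimal sketch of this geometric interpretation, assuming squared Euclidean distances; the package implementation is parallelized and may differ in detail. dtm, phi, and theta are the arguments described above.

# expected term counts under the model: scale each document's predicted
# distribution by that document's length
y_hat <- as.matrix(theta %*% phi) * Matrix::rowSums(dtm)

x <- as.matrix(dtm)
x_bar <- colMeans(x)  # centroid of all documents

sse <- sum((x - y_hat)^2)        # residual squared distance
sst <- sum(sweep(x, 2, x_bar)^2) # total squared distance from the centroid
r2_sketch <- 1 - sse / sst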
# Load a pre-formatted dtm and topic model
data(nih_sample_dtm)
data(nih_sample_topic_model)

# Get the R-squared of the model
r2 <- CalcTopicModelR2(dtm = nih_sample_dtm,
                       phi = nih_sample_topic_model$phi,
                       theta = nih_sample_topic_model$theta)

r2
Represents a document clustering as a topic model with two matrices: phi = P(term | cluster) and theta = P(cluster | document).

Cluster2TopicModel(dtm, clustering, ...)

dtm: A document term matrix of class dgCMatrix.
clustering: A vector of length nrow(dtm) whose entries indicate the cluster to which each document belongs.
...: Other arguments to be passed to TmParallelApply.
Returns a list with two elements, phi and theta. 'phi' is a matrix whose j-th row represents P(terms | cluster_j). 'theta' is a matrix whose j-th row represents P(clusters | document_j). Each row of theta should only have one non-zero element.
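Conceptually, the representation can be sketched as below, where dtm and clustering are the arguments described above. This is an assumption for illustration, not necessarily the package's implementation.

# one-hot theta: P(cluster | document) is 1 for the assigned cluster
theta_sketch <- model.matrix(~ 0 + factor(clustering))

# phi: term counts aggregated within cluster, normalized to P(term | cluster)
counts <- t(theta_sketch) %*% dtm
phi_sketch <- as.matrix(counts) / Matrix::rowSums(counts)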
## Not run:
# Load pre-formatted data for use
data(nih_sample_dtm)
data(nih_sample)

result <- Cluster2TopicModel(dtm = nih_sample_dtm,
                             clustering = nih_sample$IC_NAME)
## End(Not run)
This is the main document term matrix creating function for textmineR. In most cases, all you need to do is import documents as a character vector in R and then run this function to get a document term matrix that is compatible with the rest of textmineR's functionality and many other libraries. CreateDtm is built on top of the excellent text2vec library.
CreateDtm(
  doc_vec,
  doc_names = names(doc_vec),
  ngram_window = c(1, 1),
  stopword_vec = c(stopwords::stopwords("en"), stopwords::stopwords(source = "smart")),
  lower = TRUE,
  remove_punctuation = TRUE,
  remove_numbers = TRUE,
  stem_lemma_function = NULL,
  verbose = FALSE,
  ...
)
doc_vec: A character vector of documents.
doc_names: A vector of names for your documents. Defaults to names(doc_vec).
ngram_window: A numeric vector of length 2. The first entry is the minimum n-gram size; the second entry is the maximum n-gram size. Defaults to c(1, 1).
stopword_vec: A character vector of stopwords you would like to remove. Defaults to c(stopwords::stopwords("en"), stopwords::stopwords(source = "smart")). If you do not want any stopwords removed, pass an empty character vector.
lower: Do you want all words coerced to lower case? Defaults to TRUE.
remove_punctuation: Do you want to convert all non-alphanumeric characters to spaces? Defaults to TRUE.
remove_numbers: Do you want to convert all numbers to spaces? Defaults to TRUE.
stem_lemma_function: A function that you would like to apply to the documents for stemming, lemmatization, or similar. See examples for usage.
verbose: Defaults to FALSE. Do you want to see status during vectorization?
...: Other arguments to be passed to TmParallelApply.
A document term matrix of class dgCMatrix. The rows index documents. The columns index terms. The i, j entries represent the count of term j appearing in document i.
The following transformations are applied to stopword_vec as well as doc_vec: lower, remove_punctuation, remove_numbers.

See stopwords for details on the default to the stopword_vec argument.
## Not run:
data(nih_sample)

# DTM of unigrams and bigrams
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID,
                 ngram_window = c(1, 2))

# DTM of unigrams with Porter's stemmer applied
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID,
                 stem_lemma_function = function(x) SnowballC::wordStem(x, "porter"))
## End(Not run)
This is the main term co-occurrence matrix creating function for textmineR. In most cases, all you need to do is import documents as a character vector in R and then run this function to get a term co-occurrence matrix that is compatible with the rest of textmineR's functionality and many other libraries. CreateTcm is built on top of the excellent text2vec library.
CreateTcm(
  doc_vec,
  skipgram_window = Inf,
  ngram_window = c(1, 1),
  stopword_vec = c(stopwords::stopwords("en"), stopwords::stopwords(source = "smart")),
  lower = TRUE,
  remove_punctuation = TRUE,
  remove_numbers = TRUE,
  stem_lemma_function = NULL,
  verbose = FALSE,
  ...
)
doc_vec: A character vector of documents.
skipgram_window: An integer window, from 0 to Inf, for skip-grams. Defaults to Inf. See Details, below.
ngram_window: A numeric vector of length 2. The first entry is the minimum n-gram size; the second entry is the maximum n-gram size. Defaults to c(1, 1).
stopword_vec: A character vector of stopwords you would like to remove. Defaults to c(stopwords::stopwords("en"), stopwords::stopwords(source = "smart")).
lower: Do you want all words coerced to lower case? Defaults to TRUE.
remove_punctuation: Do you want to convert all non-alphanumeric characters to spaces? Defaults to TRUE.
remove_numbers: Do you want to convert all numbers to spaces? Defaults to TRUE.
stem_lemma_function: A function that you would like to apply to the documents for stemming, lemmatization, or similar. See examples for usage.
verbose: Defaults to FALSE. Do you want to see status during vectorization?
...: Other arguments to be passed to TmParallelApply.
Setting skipgram_window counts the number of times that term j appears within skipgram_window places of term i. Inf and 0 create somewhat special TCMs. Setting skipgram_window to Inf counts the number of documents in which term j and term i occur together. Setting skipgram_window to 0 counts the number of terms shared by document j and document i. A TCM where skipgram_window is 0 is the only TCM that will be symmetric.
A co-occurrence matrix of class dgCMatrix. The i, j entries give co-occurrence counts as determined by skipgram_window (see Details, above).
The following transformations are applied to stopword_vec as well as doc_vec: lower, remove_punctuation, remove_numbers.

See stopwords for details on the default to the stopword_vec argument.
## Not run:
data(nih_sample)

# TCM of unigrams and bigrams
tcm <- CreateTcm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 skipgram_window = Inf,
                 ngram_window = c(1, 2))

# TCM of unigrams and a skip-gram window of 3, applying Porter's word stemmer
tcm <- CreateTcm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 skipgram_window = 3,
                 stem_lemma_function = function(x) SnowballC::wordStem(x, "porter"))
## End(Not run)
This function takes a sparse matrix (DTM) as input and returns a character vector whose length is equal to the number of rows of the input DTM.
Dtm2Docs(dtm, ...)

dtm: A sparse matrix from the Matrix package whose rownames correspond to documents and colnames correspond to words.
...: Other arguments to be passed to TmParallelApply.
Returns a character vector. Each entry of this vector corresponds to the rows of dtm.

This function performs parallel computation if dtm has more than 3,000 rows. The default is to use all available cores according to detectCores. However, this can be modified by passing the cpus argument when calling this function.
# Load a pre-formatted dtm and topic model
data(nih_sample)
data(nih_sample_dtm)

# see the original documents
nih_sample$ABSTRACT_TEXT[1:3]

# see the new documents re-structured from the DTM
new_docs <- Dtm2Docs(dtm = nih_sample_dtm)

new_docs[1:3]
Represents a document term matrix as a list.
Dtm2Lexicon(dtm, ...)

dtm: A document term matrix (or term co-occurrence matrix) of class dgCMatrix.
...: Other arguments to be passed to TmParallelApply.
Returns a list. Each element of the list represents a row of the input matrix. Each list element contains a numeric vector with as many entries as tokens in the original document. The entries are the column index for that token, minus 1.
## Not run:
# Load pre-formatted data for use
data(nih_sample_dtm)

result <- Dtm2Lexicon(dtm = nih_sample_dtm, cpus = 2)
## End(Not run)
Turn a document term matrix, whose rows index documents and whose columns index terms, into a term co-occurrence matrix. A term co-occurrence matrix's rows and columns both index terms. See Details, below.
Dtm2Tcm(dtm)

dtm: A document term matrix, generally of class dgCMatrix.
Returns a square dgCMatrix whose rows and columns both index terms. The i, j entries of this matrix represent the count of term j across documents containing term i. Note that, while square, this matrix is not symmetric.
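One computation consistent with that description is sketched below. This is an assumption for illustration, not necessarily the package's implementation.

data(nih_sample_dtm, package = "textmineR")

# (i, j) = count of term j across documents containing term i
tcm_sketch <- Matrix::t((nih_sample_dtm > 0) * 1) %*% nih_sample_dtm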
data(nih_sample_dtm)

tcm <- Dtm2Tcm(nih_sample_dtm)
A wrapper for the CTM function (from the topicmodels package, based on Blei's original code) that returns a nicely-formatted topic model.
FitCtmModel(
  dtm,
  k,
  calc_coherence = TRUE,
  calc_r2 = FALSE,
  return_all = TRUE,
  ...
)

dtm: A document term matrix of class dgCMatrix.
k: Number of topics.
calc_coherence: Do you want to calculate probabilistic coherence of topics after the model is trained? Defaults to TRUE.
calc_r2: Do you want to calculate R-squared after the model is trained? Defaults to FALSE.
return_all: Logical. Do you want the raw results of the underlying function returned along with the formatted results? Defaults to TRUE.
...: Other arguments to pass to CTM or TmParallelApply. See note below.
Returns a list with a minimum of two objects, phi and theta. The rows of phi index topics and the columns index tokens. The rows of theta index documents and the columns index topics.
When passing additional arguments to CTM, you must unlist the elements in the control argument and pass them one by one. See examples for how to do this correctly.
# Load a pre-formatted dtm
data(nih_sample_dtm)

# Fit a CTM model on a sample of documents
model <- FitCtmModel(dtm = nih_sample_dtm[sample(1:nrow(nih_sample_dtm), 10), ],
                     k = 3,
                     return_all = FALSE)

# the correct way to pass control arguments to CTM
## Not run:
topics_CTM <- FitCtmModel(
  dtm = nih_sample_dtm[sample(1:nrow(nih_sample_dtm), 10), ],
  k = 10,
  calc_coherence = TRUE,
  calc_r2 = TRUE,
  return_all = TRUE,
  estimate.beta = TRUE,
  verbose = 0,
  prefix = tempfile(),
  save = 0,
  keep = 0,
  seed = as.integer(Sys.time()),
  nstart = 1L,
  best = TRUE,
  var = list(iter.max = 500, tol = 10^-6),
  em = list(iter.max = 1000, tol = 10^-4),
  initialize = "random",
  cg = list(iter.max = 500, tol = 10^-5)
)
## End(Not run)
Fit a Latent Dirichlet Allocation topic model using collapsed Gibbs sampling.
FitLdaModel(
  dtm,
  k,
  iterations = NULL,
  burnin = -1,
  alpha = 0.1,
  beta = 0.05,
  optimize_alpha = FALSE,
  calc_likelihood = FALSE,
  calc_coherence = TRUE,
  calc_r2 = FALSE,
  ...
)

dtm: A document term matrix or term co-occurrence matrix of class dgCMatrix.
k: Integer number of topics.
iterations: Integer number of iterations for the Gibbs sampler to run. A future version may include automatic stopping criteria.
burnin: Integer number of burn-in iterations. If burnin is greater than -1, the resulting phi and theta matrices are an average over all iterations after the burn-in. Defaults to -1.
alpha: Vector of length k for an asymmetric prior, or a single number for a symmetric prior, over topics in documents. Defaults to 0.1.
beta: Vector of length ncol(dtm) for an asymmetric prior, or a single number for a symmetric prior, over tokens in topics. Defaults to 0.05.
optimize_alpha: Logical. Do you want to optimize alpha every 10 Gibbs iterations? Defaults to FALSE.
calc_likelihood: Do you want to calculate the likelihood every 10 Gibbs iterations? Useful for assessing convergence. Defaults to FALSE.
calc_coherence: Do you want to calculate probabilistic coherence of topics after the model is trained? Defaults to TRUE.
calc_r2: Do you want to calculate R-squared after the model is trained? Defaults to FALSE.
...: Other arguments to be passed to TmParallelApply.
The model is fit by collapsed Gibbs sampling. When optimize_alpha = TRUE, alpha is re-estimated every 10 iterations; when calc_likelihood = TRUE, the log likelihood is recorded every 10 iterations; and when burnin is greater than -1, the returned phi and theta are averaged over all iterations after the burn-in.
Returns an S3 object of class lda_topic_model (the class on which the predict, update, and posterior methods below dispatch). At a minimum it contains phi, whose rows index topics and whose columns index tokens; theta, whose rows index documents and whose columns index topics; and the priors alpha and beta. Depending on the calc_* arguments, it may also contain coherence, r2, and a log likelihood history.
# load some data
data(nih_sample_dtm)

# fit a model
set.seed(12345)

m <- FitLdaModel(dtm = nih_sample_dtm[1:20, ],
                 k = 5,
                 iterations = 200,
                 burnin = 175)

str(m)

# predict on held-out documents using gibbs sampling "fold in"
p1 <- predict(m, nih_sample_dtm[21:100, ],
              method = "gibbs",
              iterations = 200,
              burnin = 175)

# predict on held-out documents using the dot product method
p2 <- predict(m, nih_sample_dtm[21:100, ], method = "dot")

# compare the methods
barplot(rbind(p1[1, ], p2[1, ]),
        beside = TRUE,
        col = c("red", "blue"))
A wrapper for RSpectra::svds that returns a nicely-formatted latent semantic analysis topic model.
FitLsaModel(dtm, k, calc_coherence = TRUE, return_all = FALSE, ...)

dtm: A document term matrix of class dgCMatrix.
k: Number of topics.
calc_coherence: Do you want to calculate probabilistic coherence of topics after the model is trained? Defaults to TRUE.
return_all: Should all objects returned from RSpectra::svds be returned here? Defaults to FALSE.
...: Other arguments to pass to RSpectra::svds.
Latent semantic analysis, LSA, uses singular value decomposition to factor the document term matrix. In many LSA applications, TF-IDF weights are applied to the DTM before model fitting. However, this is not strictly necessary.
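For intuition, here is a minimal sketch of the factorization. The exact scaling textmineR applies to form phi and theta is not documented here, so treat the conventions below as assumptions.

data(nih_sample_dtm, package = "textmineR")

# truncated SVD: dtm ~ u %*% diag(d) %*% t(v)
s <- RSpectra::svds(nih_sample_dtm, k = 5)

theta_sketch <- s$u %*% diag(s$d)  # documents x topics, one common convention
phi_sketch <- t(s$v)               # topics x terms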
Returns a list with a minimum of three objects: phi, theta, and sv. The rows of phi index topics and the columns index tokens. The rows of theta index documents and the columns index topics. sv is a vector of singular values.
# Load a pre-formatted dtm
data(nih_sample_dtm)

# Convert raw word counts to TF-IDF frequency weights
idf <- log(nrow(nih_sample_dtm) / Matrix::colSums(nih_sample_dtm > 0))

dtm_tfidf <- Matrix::t(nih_sample_dtm) * idf
dtm_tfidf <- Matrix::t(dtm_tfidf)

# Fit an LSA model
model <- FitLsaModel(dtm = dtm_tfidf, k = 5)

str(model)
Function extracts probable terms from a set of documents. Probable here implies more probable than in the corpus overall.

GetProbableTerms(docnames, dtm, p_terms = NULL)

docnames: A character vector of rownames of dtm for the set of documents.
dtm: A document term matrix of class dgCMatrix.
p_terms: A numeric vector representing the probability of each term in the corpus, whose names correspond to colnames(dtm). Defaults to NULL.
Returns a numeric vector in the same format as p_terms. The entries correspond to the difference between the probability of drawing a term from the set of documents given by docnames and the probability of drawing that term from the corpus overall (p_terms).
# Load a pre-formatted dtm and topic model
data(nih_sample_topic_model)
data(nih_sample_dtm)

# documents with a topic proportion of .25 or higher for topic 2
mydocs <- rownames(nih_sample_topic_model$theta)[nih_sample_topic_model$theta[, 2] >= 0.25]

term_probs <- Matrix::colSums(nih_sample_dtm) / sum(Matrix::colSums(nih_sample_dtm))

GetProbableTerms(docnames = mydocs,
                 dtm = nih_sample_dtm,
                 p_terms = term_probs)
Takes a topics-by-terms matrix and returns the top M terms for each topic.

GetTopTerms(phi, M, return_matrix = TRUE)

phi: A matrix whose rows index topics and columns index words.
M: An integer for the number of terms to return.
return_matrix: Do you want a matrix returned? If FALSE, a data.frame (or tibble) is returned. Defaults to TRUE.
If return_matrix = TRUE (the default), returns a matrix. Otherwise, returns a data.frame or tibble whose columns correspond to topics and whose m-th row corresponds to the m-th top term from the input phi.
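A conceptually equivalent one-liner for the matrix case, with phi and M as above. This is an assumption for illustration, not the package's source.

# columns index topics; row m holds the m-th top term
top_terms_sketch <- apply(phi, 1, function(x) names(sort(x, decreasing = TRUE))[1:M])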
# Load a pre-formatted dtm and topic model
data(nih_sample_topic_model)

top_terms <- GetTopTerms(phi = nih_sample_topic_model$phi, M = 5)

str(top_terms)
These functions are internal helper functions for textmineR. They are not designed to be called by users. Each of the functions here is a C++ function; there are corresponding R functions that call these and add additional functionality.
Function calls GetProbableTerms with some rules to get topic labels. This function is in "super-ultra-mega alpha"; use at your own risk/discretion.
LabelTopics(assignments, dtm, M = 2)

assignments: A documents-by-topics matrix similar to theta, but sparse or binary: entries indicate which topics are "in" which documents (the example below sparsifies theta).
dtm: A document term matrix of class dgCMatrix. For n-gram labels, its columns should include n-grams (see the example).
M: The number of n-gram labels you want to return. Defaults to 2.
Returns a matrix whose rows correspond to topics and whose j-th column corresponds to the j-th "best" label assignment.
# make a dtm with unigrams and bigrams
data(nih_sample_topic_model)

m <- nih_sample_topic_model

assignments <- t(apply(m$theta, 1, function(x){
  x[x < 0.05] <- 0
  x / sum(x)
}))

assignments[is.na(assignments)] <- 0

labels <- LabelTopics(assignments = assignments, dtm = m$data, M = 2)
This dataset holds information on research grants awarded by the National Institutes of Health (NIH) in 2014. The data set was downloaded in approximately January of 2015 from https://exporter.nih.gov/ExPORTER_Catalog.aspx. It includes both 'projects' and 'abstracts' files.
data("nih_sample")
data("nih_sample_dtm")
data("nih_sample_topic_model")
nih_sample: A data.frame of 100 randomly-sampled grants' abstracts and metadata.
nih_sample_dtm: A dgCMatrix representing the document term matrix of abstracts from 100 randomly-sampled grants.
nih_sample_topic_model: A list containing a topic model of these 100 sampled grants.
National Institutes of Health ExPORTER https://exporter.nih.gov/ExPORTER_Catalog.aspx
posterior will draw from the posterior distribution of a topic model.
posterior(object, ...)

object: An existing trained topic model.
...: Additional arguments to the call.
This function takes an object of class lda_topic_model and draws samples from the posterior of either phi or theta. This is useful for quantifying uncertainty around parameters of the final model.
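For intuition, sampling one document's topic distribution might look like the sketch below, assuming a Dirichlet posterior as in Heinrich (2005). Here cnt (the document's topic counts) and alpha (the prior) are hypothetical inputs, not objects the package exposes under these names.

# draw one sample from a Dirichlet distribution via gamma variates
rdirichlet_one <- function(a) {
  g <- rgamma(length(a), shape = a)
  g / sum(g)
}

# one posterior sample of theta for one document (cnt and alpha assumed)
theta_draw <- rdirichlet_one(cnt + alpha)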
## S3 method for class 'lda_topic_model'
posterior(object, which = "theta", num_samples = 100, ...)

object: An object of class lda_topic_model.
which: A character of either 'theta' or 'phi', indicating from which matrix to draw posterior samples.
num_samples: Integer number of samples to draw. Defaults to 100.
...: Other arguments to be passed to TmParallelApply.
Returns a data frame where each row is a single sample from the posterior. Each column is the distribution over a single parameter. The variable var is a facet for subsetting by document (for theta) or topic (for phi).
Heinrich, G. (2005) Parameter estimation for text analysis. Technical report. http://www.arbylon.net/publications/text-est.pdf
## Not run:
a <- posterior(object = nih_sample_topic_model, which = "theta", num_samples = 20)

plot(density(a$t1[a$var == "8693991"]))

b <- posterior(object = nih_sample_topic_model, which = "phi", num_samples = 20)

plot(density(b$research[b$var == "t_5"]))
## End(Not run)
Obtains predictions of topics for new documents from a fitted CTM model
## S3 method for class 'ctm_topic_model'
predict(object, newdata, ...)

object: a fitted object of class "ctm_topic_model".
newdata: a DTM or TCM of class dgCMatrix or a numeric vector.
...: further arguments passed to or from other methods.
a "theta" matrix with one row per document and one column per topic
Predictions for this method are performed using the "dot" method as described in the textmineR vignette "c_topic_modeling".
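A rough sketch of the "dot" idea follows, assuming the fitted model carries a gamma matrix of P(topic|token) (see CalcGamma); treat model$gamma and the exact normalization as assumptions rather than the vignette's code.

# project new documents onto topics, then normalize rows to sum to 1
scores <- newdata %*% t(model$gamma)           # documents x topics
theta_pred <- scores / Matrix::rowSums(scores)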
# Load a pre-formatted dtm
## Not run:
data(nih_sample_dtm)

model <- FitCtmModel(dtm = nih_sample_dtm[1:20, ],
                     k = 3,
                     calc_coherence = FALSE,
                     calc_r2 = FALSE)

# Get predictions on the next 80 documents
pred <- predict(model, nih_sample_dtm[21:100, ])
## End(Not run)
Obtains predictions of topics for new documents from a fitted LDA model
## S3 method for class 'lda_topic_model'
predict(
  object,
  newdata,
  method = c("gibbs", "dot"),
  iterations = NULL,
  burnin = -1,
  ...
)

object: a fitted object of class lda_topic_model.
newdata: a DTM or TCM of class dgCMatrix or a numeric vector.
method: one of either "gibbs" or "dot". If "gibbs", Gibbs sampling is used and iterations must be specified.
iterations: If method = "gibbs", an integer number of iterations for the Gibbs sampler to run. A future version may include automatic stopping criteria.
burnin: If method = "gibbs", an integer number of burn-in iterations. If burnin is greater than -1, the resulting theta matrix is an average over all iterations after the burn-in.
...: Other arguments to be passed to TmParallelApply.
a "theta" matrix with one row per document and one column per topic
## Not run:
# load some data
data(nih_sample_dtm)

# fit a model
set.seed(12345)

m <- FitLdaModel(dtm = nih_sample_dtm[1:20, ],
                 k = 5,
                 iterations = 200,
                 burnin = 175)

str(m)

# predict on held-out documents using gibbs sampling "fold in"
p1 <- predict(m, nih_sample_dtm[21:100, ],
              method = "gibbs",
              iterations = 200,
              burnin = 175)

# predict on held-out documents using the dot product method
p2 <- predict(m, nih_sample_dtm[21:100, ], method = "dot")

# compare the methods
barplot(rbind(p1[1, ], p2[1, ]),
        beside = TRUE,
        col = c("red", "blue"))
## End(Not run)
Obtains predictions of topics for new documents from a fitted LSA model
## S3 method for class 'lsa_topic_model'
predict(object, newdata, ...)

object: a fitted object of class "lsa_topic_model".
newdata: a DTM or TCM of class dgCMatrix or a numeric vector.
...: further arguments passed to or from other methods.
a "theta" matrix with one row per document and one column per topic
# Load a pre-formatted dtm
data(nih_sample_dtm)

# Convert raw word counts to TF-IDF frequency weights
idf <- log(nrow(nih_sample_dtm) / Matrix::colSums(nih_sample_dtm > 0))

dtm_tfidf <- Matrix::t(nih_sample_dtm) * idf
dtm_tfidf <- Matrix::t(dtm_tfidf)

# Fit an LSA model on the first 50 documents
model <- FitLsaModel(dtm = dtm_tfidf[1:50, ], k = 5)

# Get predictions on the next 50 documents
pred <- predict(model, dtm_tfidf[51:100, ])
Create a data frame summarizing the contents of each topic in a model
SummarizeTopics(model)

model: A list (or S3 object) with three named matrices: phi, theta, and gamma. These conform to the outputs of many of textmineR's native topic modeling functions, such as FitLdaModel.
'prevalence' is normalized to sum to 100. If your 'theta' matrix has negative values (as may be the case with an LSA model), a constant is added so that the least prevalent topic has a prevalence of 0.
'coherence' is calculated using CalcProbCoherence.
'label' is assigned using the top label from LabelTopics. This requires an "assignment" matrix, which is like a "theta" matrix except that it is binary: a topic is either "in" a document or it is not. The assignment is made by comparing each value of theta against a threshold, the smallest of the row-wise maxima of theta (i.e., the largest value in each document's row, minimized across documents). This ensures that every document has at least one topic assigned to it.
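A sketch of that assignment rule, matching the description above (an assumption about the implementation, with theta as described):

thresh <- min(apply(theta, 1, max))   # smallest of the row-wise maxima
assignments <- (theta >= thresh) * 1  # binary: topic is "in" the document or not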
An object of class data.frame or tibble with 6 columns: 'topic' is the name of the topic, 'label' is the label assigned by LabelTopics, 'prevalence' is the rough prevalence of the topic in all documents across the corpus, 'coherence' is the probabilistic coherence of the topic, 'top_terms_phi' are the top 5 terms for each topic according to P(word|topic), and 'top_terms_gamma' are the top 5 terms for each topic according to P(topic|word).
## Not run:
SummarizeTopics(nih_sample_topic_model)
## End(Not run)
This function takes a document term matrix as input and returns a data frame with columns for term frequency, document frequency, and inverse document frequency.
TermDocFreq(dtm)

dtm: A document term matrix of class dgCMatrix.
Returns a data.frame or tibble with 4 columns. The first column, term, is a vector of token labels. The second column, term_freq, is the count of times term appears in the entire corpus. The third column, doc_freq, is the count of the number of documents in which term appears. The fourth column, idf, is the log-weighted inverse document frequency of term.
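The three statistics can be sketched directly from the DTM; the idf formula below matches the TF-IDF weighting shown in the FitLsaModel examples, though treat the exact weighting as an assumption.

term_freq <- Matrix::colSums(dtm)     # corpus-wide term counts
doc_freq <- Matrix::colSums(dtm > 0)  # number of documents containing each term
idf <- log(nrow(dtm) / doc_freq)      # log-weighted inverse document frequency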
# Load a pre-formatted dtm and topic model
data(nih_sample_dtm)
data(nih_sample_topic_model)

# Get the term frequencies
term_freq_mat <- TermDocFreq(nih_sample_dtm)

str(term_freq_mat)
An OS-independent parallel version of lapply. This function takes a vector or list and a function and applies it in parallel.
TmParallelApply(
  X,
  FUN,
  cpus = parallel::detectCores(),
  export = NULL,
  libraries = NULL,
  envir = parent.frame()
)

X: A vector or list over which to apply FUN.
FUN: A function to apply over X.
cpus: Number of CPU cores to use. Defaults to the value returned by parallel::detectCores().
export: A character vector of objects in the workspace to export when using a Windows machine. Defaults to NULL.
libraries: A character vector of library/package names to load onto each cluster if using a Windows machine. Defaults to NULL.
envir: Environment from which to export variables in varlist.
This function is used to parallelize executions in textmineR. It is necessary because of differing capabilities between Windows and Unix. Unix systems use mclapply; Windows systems use parLapply.
This function returns a list of length length(X).
## Not run:
x <- 1:10000

f <- function(y) y * y + 12

result <- TmParallelApply(x, f)
## End(Not run)
update will update a previously-trained topic model based on new data. Useful for updates or transfer learning.
update(object, ...)

object: An existing trained topic model.
...: Additional arguments to the call.
Update an LDA model with new data using collapsed Gibbs sampling.
## S3 method for class 'lda_topic_model'
update(
  object,
  dtm,
  additional_k = 0,
  iterations = NULL,
  burnin = -1,
  new_alpha = NULL,
  new_beta = NULL,
  optimize_alpha = FALSE,
  calc_likelihood = FALSE,
  calc_coherence = TRUE,
  calc_r2 = FALSE,
  ...
)

object: a fitted object of class lda_topic_model.
dtm: A document term matrix or term co-occurrence matrix of class dgCMatrix.
additional_k: Integer number of topics to add. Defaults to 0.
iterations: Integer number of iterations for the Gibbs sampler to run. A future version may include automatic stopping criteria.
burnin: Integer number of burn-in iterations. If burnin is greater than -1, the resulting phi and theta matrices are an average over all iterations after the burn-in.
new_alpha: For now not used. This is the prior for topics over documents used when updating the model.
new_beta: For now not used. This is the prior for words over topics used when updating the model.
optimize_alpha: Logical. Do you want to optimize alpha every 10 Gibbs iterations? Defaults to FALSE.
calc_likelihood: Do you want to calculate the likelihood every 10 Gibbs iterations? Useful for assessing convergence. Defaults to FALSE.
calc_coherence: Do you want to calculate probabilistic coherence of topics after the model is trained? Defaults to TRUE.
calc_r2: Do you want to calculate R-squared after the model is trained? Defaults to FALSE.
...: Other arguments to be passed to TmParallelApply.
Returns an S3 object of class lda_topic_model.
## Not run:
# load a document term matrix
d1 <- nih_sample_dtm[1:50, ]
d2 <- nih_sample_dtm[51:100, ]

# fit a model
m <- FitLdaModel(d1, k = 10,
                 iterations = 200, burnin = 175,
                 optimize_alpha = TRUE,
                 calc_likelihood = FALSE,
                 calc_coherence = TRUE,
                 calc_r2 = FALSE)

# update an existing model by adding documents
m2 <- update(object = m,
             dtm = rbind(d1, d2),
             iterations = 200,
             burnin = 175)

# use an old model as a prior for a new model
m3 <- update(object = m,
             dtm = d2, # new documents only
             iterations = 200,
             burnin = 175)

# add topics while updating a model by adding documents
m4 <- update(object = m,
             dtm = rbind(d1, d2),
             additional_k = 3,
             iterations = 200,
             burnin = 175)

# add topics to an existing model
m5 <- update(object = m,
             dtm = d1, # this is the old data
             additional_k = 3,
             iterations = 200,
             burnin = 175)
## End(Not run)