--- title: "2. document clustering" author: "Thomas W. Jones" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{2. document clustering} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning = FALSE ) ``` # Document clustering A common task in text mining is document clustering. There are other ways to cluster documents. However, for this vignette, we will stick with the basics. The example below shows the most common method, using [TF-IDF](https://en.wikipedia.org/wiki/Tf–idf) and cosine distance. Let's read in some data and make a document term matrix (DTM) and get started. ```{r } library(textmineR) # load nih_sample data set from textmineR data(nih_sample) # create a document term matrix dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT, # character vector of documents doc_names = nih_sample$APPLICATION_ID, # document names ngram_window = c(1, 2), # minimum and maximum n-gram length stopword_vec = c(stopwords::stopwords("en"), # stopwords from tm stopwords::stopwords(source = "smart")), # this is the default value lower = TRUE, # lowercase - this is the default value remove_punctuation = TRUE, # punctuation - this is the default remove_numbers = TRUE, # numbers - this is the default verbose = FALSE, # Turn off status bar for this demo cpus = 2) # default is all available cpus on the system # construct the matrix of term counts to get the IDF vector tf_mat <- TermDocFreq(dtm) ``` First, we must re-weight the word counts in the document term matrix. We do this by multiplying the term frequency (in this case, count of words in documents) by an inverse document frequency (IDF) vector. textmineR calculates IDF for the $i$-th word as \begin{align} IDF_i = ln\big(\frac{N}{\sum_{j = 1}^N C(word_i, doc_j)}\big) \end{align} where $N$ is the number of documents in the corpus. By default, when you multiply a matrix with a vector, R multiplies the vector to each column. For this reason, we need to transpose the DTM before multiplying the IDF vector. Then we transpose it back to the original orientation. ```{r } # TF-IDF and cosine similarity tfidf <- t(dtm[ , tf_mat$term ]) * tf_mat$idf tfidf <- t(tfidf) ``` The next step is to calculate cosine similarity and change it to a distance. We're going to use some linear algebra to do this. The dot product of two positive-valued, unit-length vectors is the cosine similarity between the two vectors. For a deeper explanation of the math and logic, read [this article](https://anythingbutrbitrary.blogspot.com/2013/03/build-search-engine-in-20-minutes-or.html). ```{r } csim <- tfidf / sqrt(rowSums(tfidf * tfidf)) csim <- csim %*% t(csim) ``` R's various clustering functions work with distances, not similarities. We convert cosine similarity to cosine distance by subtracting it from $1$. This works because cosine similarity is bound between $0$ and $1$. While we are at it, we'll convert the matrix to a `dist` object. ```{r } cdist <- as.dist(1 - csim) ``` The last step is clustering. There are many clustering algorithms out there. My preference is [agglomerative hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) using [Ward's method](https://en.wikipedia.org/wiki/Ward%27s_method) as the merge rule. Compared to other methods, such as k-means, hierarchical clustering is computationally inexpensive. In the example below, I choose to cut the tree at $10$ clusters. 
The last step is clustering. There are many clustering algorithms out there; my preference is [agglomerative hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) using [Ward's method](https://en.wikipedia.org/wiki/Ward%27s_method) as the merge rule. Because it works directly off the distance matrix we just computed, hierarchical clustering is computationally inexpensive for a corpus of this size. In the example below, I choose to cut the tree at $10$ clusters. This is a somewhat arbitrary choice. I often prefer to use the silhouette coefficient instead; you can read about this method [here](https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/). Performing a full analysis is an exercise I'll leave to the reader, though a minimal sketch of the silhouette approach appears at the end of this vignette.

```{r fig.width = 7.5, fig.height = 4}
hc <- hclust(cdist, "ward.D")

clustering <- cutree(hc, 10)

plot(hc, main = "Hierarchical clustering of 100 NIH grant abstracts",
     ylab = "", xlab = "", yaxt = "n")

rect.hclust(hc, 10, border = "red")
```

It might be nice to get an idea of what's in each of these clusters. We can use a probability-difference method: for each cluster, take each word's frequency within the cluster and subtract its frequency across the whole corpus, so words that are over-represented in the cluster score highest.

```{r document_clustering_5}
p_words <- colSums(dtm) / sum(dtm)

cluster_words <- lapply(unique(clustering), function(x){
  rows <- dtm[ clustering == x , ]

  # for memory's sake, drop all words that don't appear in the cluster
  rows <- rows[ , colSums(rows) > 0 ]

  colSums(rows) / sum(rows) - p_words[ colnames(rows) ]
})
```

The code chunk below creates a summary table of the clusters, listing each cluster's size and its top 5 words.

```{r }
# create a summary table of the top 5 words defining each cluster
cluster_summary <- data.frame(cluster = unique(clustering),
                              size = as.numeric(table(clustering)),
                              top_words = sapply(cluster_words, function(d){
                                paste(
                                  names(d)[ order(d, decreasing = TRUE) ][ 1:5 ],
                                  collapse = ", ")
                              }),
                              stringsAsFactors = FALSE)
```

```{r eval = FALSE}
cluster_summary
```

```{r echo = FALSE}
knitr::kable(cluster_summary, caption = "Cluster summary table")
```

You may want a word cloud to visualize each cluster. Using the `wordcloud` package, we plot cluster 5 below.

```{r eval = FALSE}
# plot a word cloud of one cluster as an example
wordcloud::wordcloud(words = names(cluster_words[[ 5 ]]),
                     freq = cluster_words[[ 5 ]],
                     max.words = 50,
                     random.order = FALSE,
                     colors = c("red", "yellow", "blue"),
                     main = "Top words in cluster 5")
```

```{r echo = FALSE, warning = FALSE, fig.height = 7.5, fig.width = 7.5}
# plot a word cloud of one cluster as an example
suppressWarnings({
  wordcloud::wordcloud(words = names(cluster_words[[ 5 ]]),
                       freq = cluster_words[[ 5 ]],
                       max.words = 50,
                       random.order = FALSE,
                       colors = c("red", "yellow", "blue"),
                       main = "Top words in cluster 5")
})
```
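Finally, here is the minimal silhouette sketch promised above, for readers who want to pick the number of clusters less arbitrarily. It assumes the `cluster` package is available and reuses `hc` and `cdist` from earlier; it is one reasonable heuristic, not the only way to choose the number of clusters.

```{r eval = FALSE}
# Sketch: choose the number of clusters by average silhouette width.
# Assumes the cluster package is available and reuses hc and cdist from above.
library(cluster)

k_candidates <- 2:20

avg_sil <- sapply(k_candidates, function(k) {
  sil <- silhouette(cutree(hc, k), cdist)
  mean(sil[ , "sil_width" ]) # average silhouette width across all documents
})

# the cut with the largest average silhouette width
k_candidates[ which.max(avg_sil) ]
```

In practice, I would plot `avg_sil` against `k_candidates` and look for a clear peak rather than trusting the single maximum blindly.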