Clustering documents and words

Part II. Clustering documents and words

In this part we cluster documents using each word as a dimension, and then cluster words using each document in which the word occurs as a dimension. This is a simple primer that demonstrates how both dimensions can be used for clustering, and how to prepare your documents for clustering algorithms.

Before clustering, we need to prepare the data.

The input consists of a set of 9 paper titles (see input). The first 6 documents clearly belong to the category Human Computer Interaction, and the last 3 documents belong to theoretical computer science. We want to check whether the k-means algorithm can automatically discover these two classes.

First, we prepare the data by converting each document into a list of words.

Figure 1
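The tokenization step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three titles below are hypothetical stand-ins (the real 9 titles are in the input), and the rule "lowercase, keep runs of letters" is one possible definition of a word, not necessarily the one used to produce Figure 1.

```python
import re

# Hypothetical stand-in titles; the actual 9 titles come from the input.
titles = [
    "Human interface for computer applications",
    "A survey of user opinion of computer system response time",
    "Graph minors: a survey",
]

def tokenize(title):
    """Lowercase a title and return its alphabetic word tokens in order."""
    return re.findall(r"[a-z]+", title.lower())

# One list of words per document.
docs = [tokenize(t) for t in titles]
```

Punctuation such as the colon in the third title is dropped, so "minors:" becomes the token "minors".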

Second, we remove stop words and words that occur only once in the entire set (a word that appears in a single place cannot contribute to the similarity between two documents).

Figure 2
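A minimal sketch of this filtering step is shown below. The stop-word list is a short hypothetical one chosen for the example, and the token lists are illustrative rather than taken from the actual input; a real run would use a fuller stop-word list.

```python
from collections import Counter

# Hypothetical, deliberately short stop-word list (an assumption).
STOP_WORDS = {"a", "of", "for", "the", "and", "in", "to"}

def filter_tokens(docs, stop_words=STOP_WORDS):
    """Drop stop words, then drop words occurring only once in the whole set."""
    no_stop = [[w for w in doc if w not in stop_words] for doc in docs]
    # Count occurrences across the entire collection, not per document.
    counts = Counter(w for doc in no_stop for w in doc)
    return [[w for w in doc if counts[w] > 1] for doc in no_stop]

# Illustrative token lists, one per document.
docs = [
    ["human", "interface", "for", "computer", "applications"],
    ["a", "survey", "of", "user", "opinion", "of", "computer", "system"],
    ["graph", "minors", "a", "survey"],
]
filtered = filter_tokens(docs)
```

Here only "computer" and "survey" survive, because every other non-stop word occurs exactly once in the collection.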

Finally, we create word-document and document-word matrices, and store them in files Documents.csv and Words.csv respectively.
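The two matrices are transposes of each other, which the following sketch makes explicit. It assumes raw term counts as cell values, a first column holding the row label, and hypothetical document ids d1, d2, ...; the helper names are made up for this example, and the actual layout of Documents.csv and Words.csv may differ.

```python
import csv

def document_word_matrix(docs):
    """Return (vocabulary, rows), with one row of word counts per document."""
    vocab = sorted({w for doc in docs for w in doc})
    rows = [[doc.count(w) for w in vocab] for doc in docs]
    return vocab, rows

def transpose(rows):
    """Swap rows and columns: documents-by-words becomes words-by-documents."""
    return [list(col) for col in zip(*rows)]

def write_matrix(path, row_labels, col_labels, rows):
    """Write a labeled matrix to a CSV file (the layout is an assumption)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["label"] + list(col_labels))
        for label, row in zip(row_labels, rows):
            writer.writerow([label] + list(row))

# Illustrative filtered token lists, one per document.
docs = [["computer"], ["survey", "computer"], ["survey"]]
vocab, doc_rows = document_word_matrix(docs)
word_rows = transpose(doc_rows)
doc_ids = [f"d{i + 1}" for i in range(len(docs))]
```

Assuming the file with one row per document is Documents.csv and the file with one row per word is Words.csv, the files could then be produced with write_matrix("Documents.csv", doc_ids, vocab, doc_rows) and write_matrix("Words.csv", vocab, doc_ids, word_rows).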

Cluster documents by common words using SimpleKMeans. Did the program identify two groups of documents?
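To see what the clustering step does with such a matrix, here is a plain-Python sketch of the basic k-means iteration (Lloyd's algorithm) with Euclidean distance. It is not SimpleKMeans itself: the initial centroids are passed in explicitly to keep the example deterministic, whereas real implementations typically choose them at random, and the four rows are made-up count vectors rather than the actual data.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(rows, initial_centroids, max_iter=100):
    """Alternate assignment and centroid update until labels stop changing."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(row, centroids[k]))
            for row in rows
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            if members:
                centroids[k] = mean(members)
    return labels, centroids

# Made-up document vectors: two share early words, two share late words.
rows = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
labels, centroids = kmeans(rows, initial_centroids=[rows[0], rows[-1]])
```

Documents that share words end up with the same label. Running the same function on the transposed matrix clusters words instead of documents.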

Now, perform word clustering. Does the grouping of words make sense?