Part II. Clustering documents and words
In this part we cluster documents, using each word as a dimension, and then
cluster words, using each document in which the word occurs as a dimension.
This is a simple primer that demonstrates how both dimensions can be used for
clustering, and how to prepare your documents for clustering algorithms.
First, we need to prepare the data for clustering.
The input consists of a set of 9 paper titles (see input). Clearly, the first
6 documents belong to the category Human-Computer Interaction, and the last
3 documents belong to theoretical computer science. We want to check whether
the k-means algorithm can automatically discover these two classes.
First, we prepare the data by converting each document into a list of words.
Second, we remove stop words and the words that occur only once in the entire
set (such a word appears in a single document, is never shared between two
documents, and so cannot help in computing similarity).
Finally, we create the word-document and document-word matrices, and store
them in the files Documents.csv and Words.csv, respectively.
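The three preparation steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four titles and the stop-word list below are made up for the example (they are not the nine titles from the input), and the helper names tokenize and to_csv are hypothetical. The CSV text it produces can then be saved to the files named above.

```python
import csv
import io
from collections import Counter

# Hypothetical titles and stop-word list, for illustration only;
# substitute the nine titles from the input and your own stop-word list.
titles = [
    "A survey of user interface design",
    "User response time in interface systems",
    "The graph of trees",
    "Graph minors and trees",
]
stop_words = {"a", "of", "the", "in", "and"}


def tokenize(title):
    """Step 1 and part of step 2: split into lowercase words, drop stop words."""
    return [w for w in title.lower().split() if w not in stop_words]


docs = [tokenize(t) for t in titles]

# Rest of step 2: count each word over the whole collection and keep
# only the words that occur more than once.
totals = Counter(w for doc in docs for w in doc)
vocab = sorted(w for w, n in totals.items() if n > 1)

# Step 3: one matrix with a row per document and a column per word,
# and its transpose with a row per word and a column per document.
doc_rows = [[doc.count(w) for w in vocab] for doc in docs]
word_rows = [list(col) for col in zip(*doc_rows)]


def to_csv(header, rows):
    """Render a header line plus numeric rows as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


print(to_csv(vocab, doc_rows))
```

With these made-up titles, only four words survive the filtering, and the two pairs of titles end up with identical rows, which is the kind of structure a clustering algorithm can pick up.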
Cluster the documents by their common words using SimpleKMeans. Did the
program identify the two groups of documents?
Now, perform word clustering. Does the grouping of words make sense?
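To see what the two clustering runs do with the matrices, here is a minimal plain-Python k-means sketch. It is a stand-in for illustration, not the SimpleKMeans implementation itself: the matrix, the word labels, and the choice of the first and last rows as starting centroids are all assumptions made for this example (k-means is usually seeded randomly). Clustering words uses the same function on the transposed matrix.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, initial_centroids, max_iter=100):
    """Basic k-means: assign rows to the nearest centroid, then re-average."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(row, centroids[j]))
            for row in rows
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical data: rows are documents, columns are words.
words = ["interface", "user", "graph", "trees"]
doc_rows = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Document clustering: each word is a dimension.
doc_labels, _ = kmeans(doc_rows, [doc_rows[0], doc_rows[-1]])

# Word clustering: transpose so that each document is a dimension.
word_rows = [list(col) for col in zip(*doc_rows)]
word_labels, _ = kmeans(word_rows, [word_rows[0], word_rows[-1]])

print("documents:", doc_labels)
print("words:", dict(zip(words, word_labels)))
```

On real data the result depends on the starting centroids, so it is worth running the clustering with several seeds before judging whether the groups make sense.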