In the previous post I described how to gather content from Sitecore and find the similarities among the documents. This part covers clustering the documents. Where can this be used? Content structuring, defining better content, SEO and meta tags.
Note: Data mining only gives valid results if you have a large set of data; this example is written to give you an overall idea of how to use it.

Little bit of theory

What is clustering?

Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters). It is a type of unsupervised machine learning, which makes it more challenging (the data is not labelled). Each clustering algorithm belongs to one of four groups:

  • Exclusive Clustering
  • Overlapping Clustering
  • Hierarchical Clustering
  • Probabilistic Clustering

Rapidminer does not come with many algorithms by default, but there are plenty of modules on its marketplace.

For this example, two algorithms have been used: K-means and Fuzzy C-means.

Warning: the next section contains formulas.


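The objective function

K-means aims to minimize the following objective function:

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2$$

where $\| x_i^{(j)} - c_j \|^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster center $c_j$, and $J$ is an indicator of the distance of the n data points from their respective cluster centers.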

Translating the above formula into the algorithm steps:

  • Initialize K points in the space represented by the objects that are being clustered. These points represent the initial group centroids.
  • Assign each object to the group that has the closest centroid.
  • When all objects have been assigned, recalculate the positions of the K centroids.
  • Repeat the assignment and recalculation steps until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
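
The steps above can be sketched in plain Python. This is only a minimal illustration, not what Rapidminer does internally: the function names, the choice of K random objects as initial centroids, and the two-dimensional sample points are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def kmeans(points, k, max_iterations=100, seed=0):
    """Return (centroids, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Initialization (assumption): use K randomly chosen objects as centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment: each object goes to the group with the closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Stop when the assignments, and therefore the centroids, are stable.
        if new_labels == labels:
            break
        labels = new_labels
        # Recalculation: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = mean(members)
    return centroids, labels


# Made-up sample data: two well-separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

Which group ends up as cluster 0 or cluster 1 depends on the random initialization, so only the grouping itself is meaningful, not the cluster number.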

There is a great video which shows exactly how K-means works.

Fuzzy C-means (the “pimped” K-means)

Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters. With K-means, membership is either 0 or 1 (belongs or does not belong); with FCM, it is any value from 0 to 1 ("I belong to this group to this percentage and to that group to that percentage").

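It is based on minimization of the following objective function:

$$J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \left\| x_i - c_j \right\|^2, \qquad 1 < m < \infty$$

where m is any real number greater than 1, N is the number of data points, C is the number of clusters, $u_{ij}$ is the degree of membership of $x_i$ in the cluster j, $x_i$ is the i-th of the d-dimensional measured data, $c_j$ is the d-dimensional center of the cluster, and $\|\cdot\|$ is any norm expressing the similarity between any measured data and the center.
Fuzzy partitioning is carried out through an iterative optimization of the objective function shown above, with the update of membership $u_{ij}$ and the cluster centers $c_j$ by:

$$u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\| x_i - c_j \|}{\| x_i - c_k \|} \right)^{\frac{2}{m-1}}}, \qquad c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m} \, x_i}{\sum_{i=1}^{N} u_{ij}^{m}}$$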

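Translating the above formula into the algorithm steps:

  • Initialize the membership matrix, for example with random values, so that the memberships of each data point sum to 1.
  • Calculate the cluster centers from the current memberships.
  • Update the memberships from the new cluster centers.
  • Repeat the last two steps until the memberships change by less than a chosen threshold between two iterations.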
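
The membership and center updates described above can be sketched in plain Python. Treat this as a rough illustration under assumptions rather than the implementation of the Rapidminer operator: the function names, the random initialization, the stopping threshold `epsilon` and the sample points are all made up for the example.

```python
import random


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def fuzzy_c_means(points, c, m=2.0, epsilon=1e-5, max_iterations=100, seed=0):
    """Return (centers, memberships); memberships[i][j] is the degree to
    which point i belongs to cluster j."""
    rng = random.Random(seed)
    # Initialize the membership matrix with random rows that each sum to 1.
    u = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([value / total for value in row])

    dims = len(points[0])
    centers = []
    for _ in range(max_iterations):
        # Centers: mean of all points, weighted by membership ** m.
        centers = []
        for j in range(c):
            weights = [u[i][j] ** m for i in range(len(points))]
            total = sum(weights)
            centers.append(tuple(
                sum(w * p[axis] for w, p in zip(weights, points)) / total
                for axis in range(dims)
            ))
        # Memberships: 1 / sum over k of (d_j / d_k) ** (2 / (m - 1)).
        new_u = []
        for p in points:
            dists = [distance(p, center) for center in centers]
            if 0.0 in dists:
                # The point sits exactly on a center: full membership there.
                hits = [1.0 if value == 0.0 else 0.0 for value in dists]
                new_u.append([h / sum(hits) for h in hits])
            else:
                new_u.append([
                    1.0 / sum((dj / dk) ** (2.0 / (m - 1.0)) for dk in dists)
                    for dj in dists
                ])
        # Stop when the largest membership change falls below epsilon.
        change = max(abs(a - b)
                     for old, new in zip(u, new_u)
                     for a, b in zip(old, new))
        u = new_u
        if change < epsilon:
            break
    return centers, u


# Made-up sample data: two well-separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, memberships = fuzzy_c_means(points, c=2)
for point, row in zip(points, memberships):
    print(point, [round(value, 3) for value in row])
```

Each printed row holds the memberships of one point and sums to 1. Taking the cluster with the highest membership per point gives a hard assignment comparable to the K-means output.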

If you want to skip the formulas, here is a video that shows how it works.

Time to cluster

I take the processed data from the previous post and apply K-means to it. This is how the process looks:

  • Process Documents from Files – Filters out the stop words (using the English dictionary) and applies tokenizing, n-grams and stemming.
  • Store – Stores the filtered data in the repository so that it can be reused next time. The main reason for this is processing time: the more data you have, the longer it takes to process. You can name the repository entry however you want; here it is called sitecore_wordlist.
  • Clustering (K-means) – the actual clustering process
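
To give an idea of what the text processing step does to a document, here is a rough Python sketch. It is not what the Rapidminer operators run: the tiny stop-word list, the crude suffix stripping (a stand-in for a real stemmer) and the order of the steps are assumptions made for illustration.

```python
import re

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(token):
    """Strip a few common English suffixes (crude stand-in for a stemmer)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def ngrams(tokens, n=2):
    """Join every run of n consecutive tokens with an underscore."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def preprocess(text, n=2):
    """Tokenize, drop stop words, stem, then append the n-grams."""
    tokens = [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]
    return tokens + ngrams(tokens, n)


result = preprocess("The elephants are living in the savannas of Africa")
print(result)
# ['elephant', 'liv', 'savanna', 'africa',
#  'elephant_liv', 'liv_savanna', 'savanna_africa']
```

The output shows why a real stemmer matters: the naive version turns "living" into "liv", which is fine for matching documents against each other but not something you would show to a user.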

Clustering (K-means) parameters

We need to specify how many centroids we want, the maximum number of iterations, the measure type and the divergence. Keep in mind that if you wrote your own code in C#, you could choose the number of clusters in code, even randomly. In Rapidminer's case, it needs to be specified upfront, which involves some guessing and tweaking. For this case, 5 clusters are specified. There are many measures you can use for calculating the distance; I am using the most common one, squared Euclidean distance. I will not go into the details of how it works, you can look it up.
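
For a quick impression of the distance measure: squared Euclidean distance is the sum of the squared differences between two vectors, component by component. The sketch below uses made-up term weights for three documents; the numbers are purely hypothetical and do not come from the Sitecore data.

```python
def squared_euclidean(a, b):
    """Sum of squared component-wise differences of two vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical term weights for three documents over the same four terms.
african_elephant = [0.8, 0.1, 0.0, 0.3]
asian_elephant = [0.7, 0.2, 0.0, 0.3]
koala = [0.0, 0.1, 0.9, 0.2]

near = squared_euclidean(african_elephant, asian_elephant)
far = squared_euclidean(african_elephant, koala)
print(round(near, 2), round(far, 2))
```

The two elephant documents are close to each other (a small distance), while the koala document is far away from both, which is what the clustering algorithm relies on when it assigns documents to centroids.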

The results

Unfortunately, Rapidminer doesn't have a nice way of presenting the data, so it involves some manual analysis of the results.

Interpreting the results

Cluster 0 African elephant, Asian elephant
Cluster 1 Turaco, Common nighthawk, Panda, Red panda
Cluster 2 Southwest African lion
Cluster 3 Bengal tiger, Siberian tiger
Cluster 4 Koala

All clusters except cluster 1 make sense as categories. Cluster 0 can be labelled as elephants, cluster 2 as lions, and so on. Turaco and Common nighthawk are birds and should be grouped together, while the pandas should become a separate category. There might be several reasons for this result:

  • The documents are not specific enough
  • The number of clusters should be increased
  • The K-means algorithm itself

K-means drawbacks

Three key features of k-means which make it efficient are often regarded as its biggest drawbacks:

  • Euclidean distance is used as a metric and variance is used as a measure of cluster scatter.
  • The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results. That is why, when performing k-means, it is important to run diagnostic checks for determining the number of clusters in the data set.
  • Convergence to a local minimum may produce counterintuitive ("wrong") results.

There are several algorithms which came out of K-means, and one of them is Fuzzy C-means.

Fuzzy C-means

The process for this clustering algorithm is the same as for K-means; the only difference is the clustering operator. All parameters are the same as for K-means (5 clusters, squared Euclidean distance), with one additional parameter: fuzzyness, which has the value 2.0.

The results

Interpreting the results

Cluster 0 Southwest African lion
Cluster 1 Bengal tiger, Siberian tiger
Cluster 2 Panda, Red panda
Cluster 3 Common nighthawk, Turaco
Cluster 4 Asian elephant, African elephant, Koala

This algorithm already gives far better results than the first one just with the default parameters. The grouping is much better; the only exception is Koala. There are two simple reasons for this. First, the Koala document only contains 4 sentences, so it will always fall into a more or less random group. Second, there is no other document with similar data to compare it to. Adding a document about another animal from Australia and increasing the number of clusters should give far better results. Then again, this shows you which data sticks out and which content you need to modify.

Printing out the data

I prefer to save the results in an Excel file instead of the repository so that I can analyze them later, but it is not mandatory. Keep in mind that Rapidminer will provide you with lots of statistical data which you need to interpret yourself.

There is no magic.

The process for saving the data in Excel looks as follows:

Retrieve – Getting the previously saved repository

WordList to Data – This operator builds a data set from a word list

Generate Report – Generates a new report in PDF, Excel, HTML or RTF format (Note: this is just preparation for writing the file)

Report – Writing the data into the report. You need to specify which settings you want to be outputted. This example uses the following settings:

The data which is in WordList Data is printed out in Excel. The WordList output looks as follows:

With this, on an overall level, I can see how often each word is used and in how many documents it appears (frequency). If the same principle is applied to one document instead of a set of documents, this is how to get the keywords out. Examples of where the keywords can be used: SEO, meta tags and search.
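
The two numbers in the word list, total occurrences and the number of documents a word appears in, can be illustrated with a short Python sketch. The function name and the sample tokens are hypothetical; the real values come from the Rapidminer word list.

```python
from collections import Counter


def word_list(documents):
    """Return {word: (total_occurrences, document_occurrences)}."""
    total = Counter()
    in_documents = Counter()
    for tokens in documents.values():
        total.update(tokens)              # every occurrence counts
        in_documents.update(set(tokens))  # each document counts once per word
    return {word: (total[word], in_documents[word]) for word in total}


# Made-up, already processed tokens for three documents.
documents = {
    "african_elephant": ["elephant", "africa", "savanna", "elephant"],
    "asian_elephant": ["elephant", "asia", "forest"],
    "koala": ["koala", "eucalyptus", "koala"],
}

counts = word_list(documents)
for word, (total, docs) in sorted(counts.items(), key=lambda item: -item[1][0]):
    print(word, total, docs)
```

A word that occurs often but only in one document (like "koala" here) is a good keyword candidate for that document, while a word spread over many documents says more about the content set as a whole.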