Why?

The main goal of this post is to explain that content is more than just a stack of words, and to give some insight into how the structure of a website and its content could be better organized and written. It treats the content as an actual language instead of a meta language (html tags). This approach gives you insight into the content's relevancy, its structure and its “meaning”. It can be used for content strategy and data structuring.

This is the first of two posts.

  • Part 1 – Introduction to Rapidminer, extracting data and finding similarities among the documents
  • Part 2 – Clustering of the documents and finding the keywords in the documents

What is Rapidminer?

Rapidminer is a platform that provides an integrated environment for machine learning, data mining, text mining, and predictive and business analytics. It has a free edition, which lacks some features. For more info check here.

What is semantics?

Semantics is the study of the meaning of linguistic expressions. The language can be a natural language, such as English or Navajo, or an artificial language, like a computer programming language.

Step 1 – Extract data from Sitecore

Rapidminer has the option to extract data directly from web pages. The reason I am extracting the data out of Sitecore instead is to avoid clutter. With this approach, the focus is on the main content of the page, and not on its navigation, sidebars, footer etc.

This code was a quick way for me to get the data as txt files. It is neither bulletproof (you can crash it immediately) nor squeaky clean (do not hardcode things the way I do here), since my focus is on data analysis. I am using a vanilla installation of Sitecore with Sample Item as my main content type. You could also query the SQL database directly or export the content to Excel; the choice is yours.

using System;
using System.IO;
using Sitecore.Configuration;
using Sitecore.Data;
using Sitecore.Data.Items;

namespace SemanticsConsole {
    class Articles {
        // Writes the "Text" field of every Sample Item to D:\<parent name>\<item name>.txt
        public static void ReadContent() {
            Database database = Factory.GetDatabase("master");

            // Hardcoded ID of the item whose children are the content folders
            ID rootId = new ID("{110D559F-DEA5-42EA-9C1C-8A5DF7E70EF9}");
            Item root = database.GetItem(rootId);

            foreach (Item folder in root.Children) {
                // One output directory per folder item
                string folderPath = Path.Combine("D:\\", folder.Name);
                Directory.CreateDirectory(folderPath);

                foreach (Item page in folder.Children) {
                    if (page.TemplateName.Equals("Sample Item")) {
                        // Value holds the raw content of the Rich Text field
                        string filePath = Path.Combine(folderPath, page.Name + ".txt");
                        File.WriteAllText(filePath, page.Fields["Text"].Value);
                    }
                }
            }
        }

        // Recursively prints the files in every subdirectory of sDir
        // (files directly in sDir itself are not listed)
        public static void DirSearch(string sDir) {
            try {
                foreach (string d in Directory.GetDirectories(sDir)) {
                    foreach (string f in Directory.GetFiles(d)) {
                        Console.WriteLine(f);
                    }
                    DirSearch(d);
                }
            }
            catch (Exception excpt) {
                Console.WriteLine(excpt.Message);
            }
        }
    }
}

The content is taken from Wikipedia and consists of pages with basic information about different animals.

[Figure: content tree]

Step 2 – Create a process which will “understand” the data

Rapidminer has a powerful operator called Data to Similarity.

[Figure: Data to Similarity operator]

The Data to Similarity operator calculates the similarity among the examples of an ExampleSet. Each pair is compared only once: if example x has been compared with example y, then y is not compared with x again, because the result would be the same.
Before we can use it, we need to prepare our data for processing.
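
To make the "each pair only once" idea concrete, here is a minimal sketch in Python. It is not what Rapidminer runs internally: the cosine measure, the function names and the tiny made-up vectors are all assumptions for illustration.

```python
from itertools import combinations
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_similarity(vectors):
    """Compare every unordered pair once: (x, y) is computed, (y, x) is not."""
    return {
        (first, second): cosine_similarity(vectors[first], vectors[second])
        for first, second in combinations(sorted(vectors), 2)
    }


# Made-up weights, for illustration only.
vectors = {
    "doc_a": {"tiger": 1.0, "stripe": 1.0},
    "doc_b": {"tiger": 1.0, "snow": 1.0},
    "doc_c": {"bird": 1.0},
}
similarities = pairwise_similarity(vectors)
```

Three documents give three pairs instead of nine comparisons, which is why the result table stays manageable as the number of documents grows.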

Load your data

To load the data into Rapidminer, I am using the operator called Process Documents from Files.

This operator generates word vectors from a text collection stored in multiple files.

Parameters:

  • text directories – specify a folder which contains the files that you want to analyze
  • select extract text only
  • encoding – UTF-8
  • select create word vector
  • select TF-IDF (term frequency–inverse document frequency) as vector creation
  • prune method – absolute, with prune below 3, prune above 80 (ignore words that occur in fewer than 3 or in more than 80 documents)
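
As a rough illustration of what TF-IDF weighting with absolute pruning does, here is a small Python sketch. It uses one common TF-IDF variant (relative term frequency times the logarithm of the inverse document frequency); the exact formula and normalization Rapidminer applies may differ, and `tfidf_vectors` with its parameters is a hypothetical helper, not a Rapidminer API.

```python
from collections import Counter
from math import log


def tfidf_vectors(documents, prune_below=1, prune_above=None):
    """Build {doc_name: {term: weight}} from {doc_name: [tokens]}.

    Terms found in fewer than prune_below or in more than prune_above
    documents are dropped before weighting.
    """
    doc_count = len(documents)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))

    upper = doc_count if prune_above is None else prune_above
    kept = {term for term, df in doc_freq.items() if prune_below <= df <= upper}

    vectors = {}
    for name, tokens in documents.items():
        counts = Counter(token for token in tokens if token in kept)
        vectors[name] = {
            term: (count / len(tokens)) * log(doc_count / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


# Made-up token lists, for illustration only.
documents = {
    "a": ["tiger", "stripe", "tiger"],
    "b": ["tiger", "snow"],
    "c": ["bird", "snow"],
}
vectors = tfidf_vectors(documents, prune_below=2)
```

With `prune_below=2`, the terms "stripe" and "bird" disappear because each occurs in only one document.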

Now we need to open the operator's inner process and define its steps.

Process steps

Since I am using the Rich Text Field as the main content part, we first need to remove any html tags it may contain. For that, Extract Content is used.

[Figure: Extract Content parameters]

This operator extracts textual content from a given HTML document and returns the extracted text blocks as documents.
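
Outside Rapidminer, the same idea (keep the text, drop the tags) can be sketched with Python's standard `html.parser` module. This is only an approximation of what Extract Content does; the `TextExtractor` class and `strip_html` function are names made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join the text nodes and collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


text = strip_html("<p>The <strong>koala</strong> is a marsupial.</p>")
```

Joining the text nodes with a space is a simplification: it can leave a stray space before punctuation that directly follows a closing tag.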

Transform Cases

We want to make sure that all text has the same type of casing (lower or upper).

Tokenize

This operator splits the text of a document into a sequence of tokens. The chosen mode is linguistic tokens, with English as the language. The word vector is later built from these tokens.
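
The two steps so far, case transformation and tokenizing, can be sketched in a few lines of Python. Note the assumption: this sketch simply splits on anything that is not a letter, which is simpler than the linguistic tokens mode used in the process above.

```python
import re

# One or more letters; digits, underscores and punctuation act as separators.
LETTERS = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Lowercase the text and return its letter-only tokens in order."""
    return LETTERS.findall(text.lower())


tokens = tokenize("The Bengal tiger (Panthera tigris tigris) weighs 220 kg.")
```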

Filter Tokens (by Length)

This operator filters tokens based on their length (i.e. the number of characters they contain). For this case the minimum is 3 and the maximum is 25. This way we can exclude a lot of words which are not relevant to us (an, or, if, up, at, etc).
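
In code, this filter is a one-liner. The sketch below is Python and assumes both limits are inclusive, so a token of exactly 3 or exactly 25 characters is kept.

```python
def filter_by_length(tokens, min_chars=3, max_chars=25):
    """Keep only tokens whose length lies within [min_chars, max_chars]."""
    return [token for token in tokens if min_chars <= len(token) <= max_chars]


filtered = filter_by_length(["an", "or", "the", "koala", "up", "eucalyptus"])
```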

Filter Stopwords (English)

Rapidminer already contains a dictionary for English. In this case we want to remove all stop words from our text. These are common words such as prepositions, conjunctions, articles, adverbs and so on. The dictionary for the Dutch language is not included, but you can either create it yourself as a txt file or download the example I have here.
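
A custom stop word dictionary is just a list of words. The Python sketch below assumes a plain text format with one word per line; the handful of words shown is a made-up sample, not the dictionary Rapidminer ships with.

```python
def parse_stopwords(dictionary_text):
    """Read one stop word per line, ignoring blank lines and case."""
    return {line.strip().lower() for line in dictionary_text.splitlines() if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stop word set."""
    return [token for token in tokens if token not in stopwords]


# Tiny sample dictionary, for illustration only.
stopwords = parse_stopwords("the\nand\nwith\nfrom\n")
content_tokens = remove_stopwords(["the", "panda", "and", "the", "bamboo"], stopwords)
```

To use a dictionary file, read its contents with `open(path, encoding="utf-8").read()` and pass the result to `parse_stopwords`.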

Generate n-grams

This operator creates term n-grams of the tokens in a document. A term n-gram is a series of n consecutive tokens. In this example n is set to 3. The idea is that some words regularly occur together in the documents, and such a sequence can say more about the content than the individual words do.
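
Here is a small Python sketch of term n-gram generation. Two assumptions: the original tokens are kept and all n-grams up to the maximum length are appended, and the parts of an n-gram are joined with an underscore. The `add_ngrams` helper is hypothetical.

```python
def add_ngrams(tokens, max_length=3):
    """Return the tokens followed by every n-gram for n = 2..max_length."""
    result = list(tokens)
    for n in range(2, max_length + 1):
        for start in range(len(tokens) - n + 1):
            result.append("_".join(tokens[start:start + n]))
    return result


terms = add_ngrams(["red", "panda", "eats", "bamboo"])
```

For these four tokens the result contains the four single words, three 2-grams such as "red_panda", and two 3-grams such as "red_panda_eats".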

Stemming

Stemming reduces words to their stem, so that different forms of a word count as the same term. For example, the words “responsibilities” and “responsible” share a stem and indicate the same thing.
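
To show the principle, here is a deliberately naive suffix-stripping sketch in Python. It is a toy, not the stemming algorithm Rapidminer uses: the suffix list is made up for this example, and real stemmers apply far more careful rules.

```python
# Toy suffix list, longest first so the most specific suffix wins.
SUFFIXES = ("ibilities", "ibility", "ible", "ing", "ed", "es", "s")


def naive_stem(token, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token


stems = [naive_stem(word) for word in ("responsibilities", "responsible", "tigers")]
```

Both forms of "responsible" end up as the same term, so they add weight to one entry in the word vector instead of two.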

The finished inner process should look like this:

[Figure: the complete inner process]

Now the Data to Similarity operator can be connected to the output of Process Documents from Files:

[Figure: the final process, with Data to Similarity connected]

Now we can run the process and see its output.

Step 3 – Analyse your data

Since the data table is quite big (even though there are only 10 documents in Sitecore), the data is presented as a circle graph:

[Figure: similarity results shown as a circle graph]

IDs and Document names

01  African elephant
02  Asian elephant
03  Bengal tiger
04  Common nighthawk
05  Koala
06  Panda
07  Red panda
08  Siberian tiger
09  Southwest African lion
10  Turaco

Even though all the pages are stored under the same Animals folder, the algorithm shows that, for example, documents 3 and 8 are highly similar. Based on this, we could create a substructure called tigers and group that content together. The same goes for documents 1 and 2.

A more detailed analysis follows in part 2, where clustering algorithms are applied to this output.