
How to Classify Evergreen Content with Machine Learning and Eureqa

Posted by Matt Fleming

16.01.2014 09:18 AM


As a part of a recent Kaggle competition, StumbleUpon challenged users to build a model using machine learning that would classify whether a webpage should be considered evergreen or ephemeral.  The ability to better classify and understand evergreen content would allow StumbleUpon to greatly improve the performance of its recommendation engine.

Evergreen content, for those not familiar with the term, signifies content that remains relevant, valuable, and authoritative year after year.  Evergreen content is of immense value to marketers, as it continually generates traffic and leads season after season.  Hubspot has a great introductory article on the subject.

In this tutorial, we’ll review how Eureqa can be used to predict whether a webpage is evergreen or non-evergreen, using both structured and unstructured data provided by Kaggle and StumbleUpon.

The original competition and data, hosted by Kaggle, can be found here.

Examining the Data

After downloading the training dataset, we can see that we will be working with 27 variables and 7,395 records, with each record (or row) corresponding to a given webpage.  For the purposes of this tutorial, we are not going to use the ‘raw content’ file.  Some of the variables we will be working with include:

  • URL
  • Alchemy Category
  • Link Word Score
  • Avg Link Size
  • Compression Ratio
  • Page Title
  • Body Description
  • Number of Links
  • Number of Spelling Errors

At first glance, it looks like we'll need to do a little preparation to get our data properly set up for Eureqa.

Preparing the Data

The first thing we’re going to do is parse the ‘boiler plate’ variable so that “title”, “body”, and “url” are all in separate columns. This lets us examine not only which words are potentially indicative of evergreen content, but also the impact of their placement in the ‘title’, ‘body’, or ‘url’ sections. I also went ahead and parsed the ‘URL’ variable so that the domain is in its own column. All of this can be accomplished in Excel or any other stats program.
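If you would rather script this step than do it in Excel, here is a minimal Python sketch. It assumes each ‘boiler plate’ cell holds a JSON object with "title", "body", and "url" keys, which you should verify against your copy of the data. The helper names and the example row are hypothetical and are not part of Eureqa or the Kaggle files.

```python
import json
from urllib.parse import urlparse


def split_boilerplate(boilerplate_json):
    """Return (title, body, url) text from one 'boiler plate' cell.

    Assumes the cell is a JSON object; missing or null fields become
    empty strings so that every row yields three columns.
    """
    fields = json.loads(boilerplate_json)
    return tuple((fields.get(key) or "") for key in ("title", "body", "url"))


def extract_domain(url):
    """Return the host part of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


# Hypothetical example row, not taken from the competition data.
row = {
    "url": "http://www.example.com/recipes/chocolate-cake",
    "boilerplate": '{"title": "Easy Chocolate Cake", '
                   '"body": "Mix and bake.", '
                   '"url": "example recipes chocolate cake"}',
}

title, body, url_words = split_boilerplate(row["boilerplate"])
domain = extract_domain(row["url"])
print(title, "|", body, "|", url_words, "|", domain)
```

Applying both helpers to every row and writing the result out as a .csv gives a file with the same separate columns described above.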

Now that we’ve completed our initial adjustments to the training data, we can go ahead and import it into Eureqa.  To do this, save your .xls file (or equivalent) as a .csv and from within Eureqa, click ‘Import Data’.  After it’s done loading, your worksheet should look similar to the screenshot below:

[Screenshot: the StumbleUpon training data imported into Eureqa]

Once you’ve imported the data and confirmed that it looks as expected in the Enter Data tab, we can move on to the Prepare Data tab. This tab has options to further pre-process your data, including handling missing values and smoothing the data points. For this initial analysis, we will not choose any of those options, but you can return to them later to improve the performance of your model. For more information on how to take advantage of these options, see our tutorial on preparing data in Eureqa.

Before we move on and begin our model search, you may have noticed that several new columns were appended to your data.  Eureqa uses a basic ‘bag of words’ implementation for handling text data, which takes the most frequently used words and appends them to your data as columns with boolean values.  As an example, if the ‘title’ of a webpage was ‘6 Tips for Evergreen Content’, the column title_Evergreen would have a value of ‘True’ or 1.  For more information, take a look at Wikipedia’s article on Bag of Words.
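To make the idea concrete, here is a minimal Python sketch of a boolean bag of words. It illustrates the general technique only and is not Eureqa's actual implementation: the tokenizer, the `top_n` cutoff, the lowercased column names, and the lack of stop-word filtering are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words_columns(texts, prefix, top_n=3):
    """Build 0/1 columns for the top_n words that appear in the most texts."""
    # Count each word once per text (document frequency).
    counts = Counter(word for text in texts for word in set(tokenize(text)))
    # Break ties alphabetically so the chosen vocabulary is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    vocabulary = [word for word, _ in ranked[:top_n]]

    rows = []
    for text in texts:
        present = set(tokenize(text))
        rows.append({f"{prefix}_{word}": int(word in present) for word in vocabulary})
    return rows


titles = [
    "6 Tips for Evergreen Content",
    "Evergreen Recipes for Winter",
    "Breaking News Today",
]
for title, row in zip(titles, bag_of_words_columns(titles, "title", top_n=2)):
    print(title, row)
```

In this toy input the first title gets a 1 in the `title_evergreen` column, mirroring the title_Evergreen example above. Note that a common word like "for" also makes the cut, which is why real implementations often filter stop words.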


Let’s go ahead and click the tab labeled ‘Set Target’.  From this tab, we can tell Eureqa the variable we wish to model as well as what mathematical building blocks should be used during the model search.

We want to predict the variable ‘Label’, which signifies whether a given webpage is considered evergreen or ephemeral. Since this variable only contains values of 0 or 1, we’re going to use a special target expression that places a similar constraint on the resulting model. For this tutorial, we used the logistic function, which squashes values to be between 0 and 1. We chose the logistic function (as opposed to a step function) because it provides a better search gradient. For more information, see our tutorial on modeling binary values.
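As a rough illustration of why the logistic function gives a smoother search signal than a step function, the Python sketch below evaluates both on the same inputs. It is independent of Eureqa; the input `x` stands in for whatever inner expression f(...) the search builds, and the function names are chosen for this example.

```python
import math


def logistic(x):
    """Squash any real number into the range 0 to 1."""
    # Two branches keep math.exp from overflowing for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def step(x):
    """Hard threshold: the output jumps from 0 to 1 at x = 0."""
    return 1.0 if x >= 0 else 0.0


# The logistic output changes gradually as x moves, so a slightly better
# inner expression yields a slightly better fit. The step output is flat
# everywhere except at the jump, so small improvements are invisible.
for x in (-4.0, -0.5, 0.5, 4.0):
    print(f"x={x:+.1f}  logistic={logistic(x):.3f}  step={step(x):.0f}")
```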

We’re also going to make a couple of changes to the building blocks Eureqa will use during the search.  Go ahead and enable the Logistic building block, as well as all of the Logical operators.

Also, since the Kaggle competition uses the AUC (area under curve) error metric, we should select AUC from the Error Metric dropdown list at the bottom of the ‘Set Target’ screen.
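For readers unfamiliar with the metric, here is a small Python sketch of how AUC can be computed from labels and model scores. It uses the pairwise definition: AUC is the fraction of (evergreen, ephemeral) pairs in which the evergreen page gets the higher score. This is not Eureqa's internal code, and the labels and scores below are made up.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = evergreen) and model outputs between 0 and 1.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.8]

value = auc(labels, scores)
print(f"AUC = {value:.4f}, 1 - AUC = {1 - value:.4f}")
```

The pairwise loop is quadratic in the number of rows, which is fine for a demonstration. A rank-based formulation is the usual choice for larger datasets.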

At this point, Eureqa should look something like the screenshot below:

[Screenshot: the Set Target tab in Eureqa]

Now that we have set our target expression and selected what we believe are good ‘starter’ building blocks, we can begin our search. If you are interested in speeding up your searches by leveraging the cloud, take a look at our tutorial on using Amazon EC2 with Eureqa. Enabling the cloud can help you obtain results up to 100x faster.

The Results 

The “View Results” tab offers a digest view of the top solutions Eureqa has generated over the course of a search. For the purposes of this tutorial, we ran Eureqa on a 72-core private cloud for the better part of 3 hours, which generated 697,254 models.

[Screenshot: the View Results tab in Eureqa]

At first glance, we can see that the top two models are very close in predictive accuracy and complexity, with AUC error values of 0.2239 and 0.2276, using 29 and 27 terms respectively.

Perhaps most importantly, because the output of Eureqa is an analytical model, we can easily identify what characteristics are most indicative of evergreen content. Our most accurate model includes URL_cake, URL_chicken, URL_chocolate, URL_cupcakes, URL_kitchen, URL_make, URL_Recipe, and URL_Recipes.  This also seems to pass the sanity test, as recipes would seemingly be content that stands the test of time.

Given these variables, it would appear that other variables, like “domain”, “embed ratio”, “number of links”, and “link word score”, while possibly important indirectly, do not significantly improve accuracy and as a result are not used in the best models.

Summary

In just over three hours, we were able to go from a training dataset containing the characteristics of a given webpage to an analytical model that predicts evergreen content correctly 78% of the time and offers a much deeper understanding of the characteristics and relationships that are most indicative of evergreen content.

For real-world applications, we would likely want to improve on the predictive accuracy of our results by leveraging the ‘raw content by url’ zip file provided by StumbleUpon, more thoroughly preparing our training data, adding or removing building blocks, letting Eureqa search for a longer time period, and leveraging additional computational resources such as Amazon EC2 or your own Dedicated Eureqa Server.

Ready to try for yourself? Go ahead and download the Eureqa project file to get started.

Matt

Topics: Eureqa, Tutorial

One response to “How to Classify Evergreen Content with Machine Learning and Eureqa”

  1. Cram says:

    Nice tutorial Matt!
    Can you explain your target expression? It seems a bit more detailed than the one used in the “modeling binary output” tutorial, which uses: z = logistic( f(x,y) )
    In your example above in the screenshot there are additional coefficients prior to the listing of the variables and after the logistic operator.
