
How to Classify Evergreen Content with Machine Learning and Eureqa

Posted by Matt Fleming

16.01.2014 09:18 AM

Watch the Tutorial »

As a part of a recent Kaggle competition, StumbleUpon challenged users to build a model using machine learning that would classify whether a webpage should be considered evergreen or ephemeral.  The ability to better classify and understand evergreen content would allow StumbleUpon to greatly improve the performance of its recommendation engine.

Evergreen content, for those not familiar with the term, signifies content that remains relevant, valuable, and authoritative year after year.  Evergreen content is of immense value to marketers, as it continually generates traffic and leads season after season.  Hubspot has a great introductory article on the subject.

In this tutorial, we’ll review how Eureqa can be used to predict whether a webpage is evergreen or non-evergreen, using both structured and unstructured data provided by Kaggle and StumbleUpon.

The original competition and data, hosted by Kaggle, can be found here.

Examining the Data

After downloading the training dataset, we can see that we will be working with 27 variables and 7,395 records, with each record (or row) corresponding to a given webpage.  For the purposes of this tutorial, we are not going to use the ‘raw content’ file.  Some of the variables we will be working with include:

  • URL
  • Alchemy Category
  • Link Word Score
  • Avg Link Size
  • Compression Ratio
  • Page Title
  • Body Description
  • Number of Links
  • Number of Spelling Errors

At first glance, it does look like we’ll need to do a little bit of preparation to get our data properly set up for Eureqa.

Preparing the Data

The first thing we’re going to do is parse the ‘boiler plate’ variable so that “title”, “body”, and “url” are all in separate columns.  This will enable us to examine not only which words are potentially indicative of evergreen content, but also the impact of their placement in the ‘title’, ‘body’, or ‘url’ sections.  I also went ahead and parsed the ‘URL’ variable, so that the domain is in its own column.  All of this can be accomplished within Excel or any other stats program.
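If you prefer scripting this step, here is a minimal Python sketch. The field names and the sample record below are illustrative assumptions about the data’s shape, not the exact Kaggle schema:

```python
import json
from urllib.parse import urlparse

def split_boilerplate(boilerplate_json):
    """Split a 'boilerplate' JSON string into separate title/body/url fields.
    Missing or null keys come back as empty strings."""
    fields = json.loads(boilerplate_json)
    return {key: fields.get(key) or "" for key in ("title", "body", "url")}

def extract_domain(url):
    """Pull the bare domain out of a full URL."""
    return urlparse(url).netloc

# Hypothetical record illustrating the shape of the data:
record = '{"title": "6 Tips for Evergreen Content", "body": "Great recipes...", "url": "example.com/tips"}'
parts = split_boilerplate(record)
domain = extract_domain("http://www.example.com/evergreen-tips")
```

The same split can then be written back out as extra CSV columns before importing into Eureqa.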

Now that we’ve completed our initial adjustments to the training data, we can go ahead and import it into Eureqa.  To do this, save your .xls file (or equivalent) as a .csv and from within Eureqa, click ‘Import Data’.  After it’s done loading, your worksheet should look similar to the screenshot below:

Eureqa worksheet after importing the StumbleUpon data

Once you’ve imported the data and confirmed that it looks as expected in the Enter Data tab, we can move on to the Prepare Data tab. This tab has options to further pre-process your data, including handling missing values and smoothing the data points. For this initial analysis, we will not choose any of those options, but you can return to these later to improve on the performance of your model. For more information on how to take advantage of these options, see our tutorial on preparing data in Eureqa.

Before we move on and begin our model search, you may have noticed that several new columns were appended to your data.  Eureqa uses a basic ‘bag of words’ implementation for handling text data, which takes the most frequently used words and appends them to your data as columns with boolean values.  As an example, if the ‘title’ of a webpage was ‘6 Tips for Evergreen Content’, the column title_Evergreen would have a value of ‘True’ or 1.  For more information, take a look at Wikipedia’s article on Bag of Words.
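To make the idea concrete, here is a toy illustration of that expansion in Python. This is not Eureqa’s actual implementation, and the titles are made up:

```python
# A minimal 'bag of words' expansion: the most frequent words across all
# titles become boolean columns, one row per webpage.
from collections import Counter

titles = [
    "6 Tips for Evergreen Content",
    "Evergreen Chocolate Cake Recipe",
    "Today's Breaking News",
]

# Count word frequencies across all titles (case-insensitive)
counts = Counter(word.lower() for title in titles for word in title.split())
top_words = [word for word, _ in counts.most_common(5)]

# One boolean column per frequent word, one entry per title
columns = {
    f"title_{word}": [word in title.lower().split() for title in titles]
    for word in top_words
}
```

Here `columns["title_evergreen"]` is True for the first two pages and False for the third, exactly the kind of indicator column Eureqa appends.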


Let’s go ahead and click the tab labeled ‘Set Target’.  From this tab, we can tell Eureqa the variable we wish to model as well as what mathematical building blocks should be used during the model search.

We want to predict the variable ‘Label’, which signifies whether or not a given webpage is considered evergreen or ephemeral.  Since this variable only contains values of 0 or 1, we’re going to use a special target expression that will provide a similar constraint on the resulting model.  For this tutorial, we used the logistic function, which squashes values to be between 0 and 1.  We choose the logistic function (as opposed to a step function) because it provides a better search gradient.  For more information, see our tutorial on modeling binary values.
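For reference, a minimal Python version of the logistic function shows both properties at once:

```python
import math

def logistic(x):
    """Squash any real value into (0, 1). Smooth everywhere, so unlike a
    hard step function it gives the search a usable gradient."""
    return 1.0 / (1.0 + math.exp(-x))

# logistic(0) is exactly 0.5; large positive inputs approach 1,
# large negative inputs approach 0.
```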

We’re also going to make a couple of changes to the building blocks Eureqa will use during the search.  Go ahead and enable the Logistic building block, as well as all of the Logical operators.

Also, since the Kaggle competition uses the AUC (area under the ROC curve) error metric, we should select AUC from the Error Metric dropdown list at the bottom of the ‘Set Target’ screen.
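To make the metric concrete, here is a small, self-contained AUC computation using the rank (Mann-Whitney) formulation: the probability that a randomly chosen positive example is scored above a randomly chosen negative one. Production code would normally use a library routine instead:

```python
def auc(labels, scores):
    """AUC via the Mann-Whitney formulation: fraction of positive/negative
    pairs where the positive example gets the higher score (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, `auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` gives 0.75: three of the four positive/negative pairs are ranked correctly.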

At this point, Eureqa should look something like the screenshot below:

Set Target tab

Now that we have set our target expression and selected what we believe are good ‘starter’ building blocks, we can begin our search.  At this point, it’s worth noting that if you are interested in speeding up your searches by leveraging the cloud, take a look at our tutorial on using Amazon EC2 with Eureqa.  Enabling the cloud can help you obtain results up to 100x faster.

The Results 

The “View Results” tab offers a digest view of the top solutions Eureqa has generated over the course of a search.  For the purposes of this tutorial, we ran Eureqa on a 72-core private cloud for the better part of 3 hours, which generated 697,254 models.

View Eureqa Results

 

At first glance, we can see that the top two models are very close in predictive accuracy and complexity, with AUC error ranging between .2239 and .2276, and using 29 and 27 terms respectively.

Perhaps most importantly, because the output of Eureqa is an analytical model, we can easily identify what characteristics are most indicative of evergreen content. Our most accurate model includes URL_cake, URL_chicken, URL_chocolate, URL_cupcakes, URL_kitchen, URL_make, URL_Recipe, and URL_Recipes.  This also seems to pass the sanity test, as recipes would seemingly be content that stands the test of time.

Given these variables, it would appear that other variables, like “domain”, “embed ratio”, “number of links”, and “link word score”, while possibly important indirectly, do not significantly improve accuracy and as a result are not used in the best models.

Summary

In just over three hours, we were able to go from a training dataset containing the characteristics of a given webpage, to an analytical model that predicts Evergreen content correctly 78% of the time and offers us a much deeper understanding of the characteristics and relationships that are most indicative of evergreen content.

For real world applications, we would likely want to improve on the predictive accuracy of our results by leveraging the ‘raw content’ zip file provided by StumbleUpon, more thoroughly preparing our training data, adding or removing building blocks, letting Eureqa search for a longer time period, and leveraging additional computational resources such as Amazon EC2 or your own Dedicated Eureqa Server.

Ready to try for yourself? Go ahead and download the Eureqa project file to get started.

Matt

Topics: Eureqa, Tutorial

Predicting a Bond’s Next Trade Price with Eureqa: Part 2

Posted by Jess Lin

26.11.2013 10:00 AM

In my previous post, we walked through the process of using Eureqa to predict the next price a bond will trade at. Starting with a massive spreadsheet with >760,000 rows and 61 columns, we were able to generate 11 equations to describe the data in 20 minutes. While I focused on just one of the equations, there is still more we can learn from Eureqa.

So let’s review the last page we looked at – the View Results tab of Eureqa:

View results tab

 

I walked you through my thought process of how I chose a single equation out of the 11 that Eureqa generated. This equation had a size of 14, with only 4 parameters and 3 terms. Of all the equations, it seemed to best balance both accuracy and complexity, being able to predict the next price a bond will trade at within $0.55 based on only 3 variables. However, there’s far more information here in this tab – what else can we learn?

First, let’s talk a little more about the equation we chose:
trade_price = 0.6964*trade_price_last1 + 0.3026*curve_based_price + 0.1059/(trade_type – 2.759)
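Transcribed into Python, the chosen model is just a few lines. The sample inputs below are made up for illustration; note that the formula is undefined when trade_type equals 2.759 (in the dataset, trade_type takes small integer codes, so this doesn’t occur in practice):

```python
def predict_trade_price(trade_price_last1, curve_based_price, trade_type):
    """The equation Eureqa selected, transcribed term by term."""
    return (0.6964 * trade_price_last1
            + 0.3026 * curve_based_price
            + 0.1059 / (trade_type - 2.759))

# Hypothetical example: last trade at $100, curve-based price $100, trade type 2
predicted = predict_trade_price(100, 100, 2)
```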

When you click on that specific solution, you will see details about that solution directly below. Eureqa provides details on 8 different error metrics for each solution, ranging from Mean Absolute Error to Hybrid Correlation/Error. I used MAE to judge accuracy in this case, but different data sets may require different error metrics.

Solution details

While I didn’t touch upon R^2 Goodness of Fit in the previous tutorial, it can provide a meaningful way to evaluate your overall search. What this metric helps you understand is how much of the variance in your target variable is captured by each of your solutions. In this case, the R^2 value is telling us that our solution captures 98.9% of the variance in the predicted trade price. With this equation under our belt, let’s dig a little deeper.
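For reference, R^2 can be computed from first principles: one minus the ratio of the residual sum of squares to the total variance around the mean.

```python
def r_squared(actual, predicted):
    """Fraction of the variance in `actual` captured by `predicted`."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot
```

A perfect model scores 1.0; a model that always predicts the mean scores 0.0.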

For more details on all 8 error metrics, please see our tutorial on error metrics.

Even though we chose this specific equation as the best for now, what can the other equations tell us about this data? There are three different ways of ranking solutions – by size only, by fit only, or by a combination of size and fit. The third is what Eureqa defaults to, but you can still find valuable data by ranking by the other two methods.

Specifically, let’s look at what happens when you rank by size, looking at the simplest solutions first. By doing this, you can see which single variable Eureqa believes to be the most crucial to understanding the target variable. Then going through each successively more complicated solution, you can see which other variables begin appearing in what order. The simplest solution here is just:
trade_price = trade_price_last1.

Solution plot

 

When you look at the R^2 value for this solution, it actually shows us that this one variable captures 98.4% of the variance of the target variable. What does this mean for us? While we can (and did) find more sophisticated models that get us closer to modeling the future trade price, the last price that the bond traded at is by far the best indicator of the future price.

Finally, let’s focus on this trade_price_last1 variable. As we just discovered, it captures 98% of the variance in our target variable – trade_price. It could be interesting to look at what drives differences between the two variables – and Eureqa lets us do that extremely easily. All we need to do is set up a new search, and modify the target expression to find the difference between trade_price and trade_price_last1, as modeled by the rest of the variables:
trade_price – trade_price_last1 = f(weight, current_coupon, time_to_maturity, …, curve_based_price_last10)
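Outside Eureqa, the same residual target is trivial to construct as a new column; a sketch with made-up price pairs:

```python
# Hypothetical (trade_price, trade_price_last1) pairs; the residual column
# is what the new search above is asked to explain.
rows = [
    (101.2, 100.9),
    (99.8, 100.1),
]
residuals = [price - last for price, last in rows]
```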

After running this for almost 7 hours on 72 cores, the most accurate solution I could generate was:
trade_price – trade_price_last1 = (trade_type_last3 + 1.342*time_to_maturity)/(2.819*curve_based_price_last1 – curve_based_price*trade_type_last1) + (trade_type_last3 + 1.342*time_to_maturity)/(trade_type*curve_based_price – 2.819*curve_based_price)

Pareto front display

 

As you can see from the Pareto front display, solutions with much more complexity are being introduced. Keeping in mind that the average difference between trade price and the last price is actually 0.607, our most accurate equation here has a 0.52 MAE.  While this solution is the most accurate, you can choose for yourself which solution has a better balance of accuracy and complexity, such as the one with equation size of 13, using only 2 parameters. Additionally, doing more pre-processing on the dataset or choosing different building blocks will lead you to improved searches.

Last week, it was all about showing you how easy and intuitive it is to use Eureqa to quickly come up with incredibly accurate results. Today, I hope I was able to show you some of the hidden power behind Eureqa that allows you to accomplish far more.

Of course, this is still only touching the tip of the iceberg of Eureqa’s abilities. Using the fxp file I posted last time, go ahead and try it yourself! If you run into any questions, check out our user guides and tutorials, or come visit our forums and see what questions others have asked!

Happy modeling!

Jess

Topics: Advanced Techniques, Eureqa, Making predictions, Tutorial

Predicting a Bond’s Next Trade Price with Eureqa

Posted by Jess Lin

21.11.2013 01:30 PM

Watch the Tutorial »

Investors use predicted bond trade prices to inform their trading decisions throughout the day. While bond yield curves can be used to help make decisions, there are many other factors that could help predict a bond’s next trading price.

In this tutorial, we’ll review how Eureqa and the power of symbolic regression can be used to predict the next trading price of a US corporate bond. We use bond price data provided through Benchmark Solutions and Kaggle.com, which includes variables such as current coupon, time to maturity, and details of the previous 10 trades, among others.

The original competition and data, hosted by Kaggle, can be found here.

Understanding the Data

After downloading and viewing the data, we see that this dataset comprises 61 columns of parameters. These parameters include the row ID (can be used for time series analysis), the bond ID (there is data for almost 8,000 different bonds), current coupon, previous trade prices, and more. The most important column for us is the trade_price column, as this is the value that we are trying to solve for.

This dataset also includes over 700,000 rows of data. For this tutorial, we’re only going to take the first 200,000 rows of data for learning the model. Later, you can use the rest of this data for more training or validation, but let’s stick with 200,000 for now. To do this, you can extract the rows you want manually from the data, or create a new, smaller file by using the command line to run the following command: head -n 200000 originalFileNameAndLocation.csv > newFileName.csv.
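If you’d rather stay out of the shell, here is a rough Python equivalent. One subtlety: `head -n 200000` counts the header line among its 200,000, so it keeps 199,999 data rows, while this sketch keeps the header plus the full row count you ask for:

```python
import csv
import io

def take_first_rows(src, dst, n_rows=200_000):
    """Copy the header plus the first `n_rows` data rows from one CSV
    file object to another."""
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(next(reader))          # keep the header line
    for i, row in enumerate(reader):
        if i >= n_rows:
            break
        writer.writerow(row)

# Demo on an in-memory CSV; with real files you would pass
# open("originalFileNameAndLocation.csv", newline="") and
# open("newFileName.csv", "w", newline="") instead.
src = io.StringIO("price,size\n101,10\n102,20\n103,30\n")
dst = io.StringIO()
take_first_rows(src, dst, n_rows=2)
```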

For more details, please see a previous tutorial that covered this step.

import data

Preparing the Data

Once you’ve imported the data and confirmed that it looks as expected in the Enter Data tab, let’s move on to the Prepare Data tab. This tab has options for you to do further pre-processing with your data, such as handling missing values or smoothing the data points. For this initial exploration, we will not choose any of those options, but you can return to these later to improve on the performance of your model. For more information on how to take advantage of these options, see our tutorial on preparing data in Eureqa.

Finally, let’s give our search a target expression and choose its building blocks. Since we want to solve for the trade price of a particular bond given the other variables, the target expression should be set so that the trade_price variable is modeled as a function of all other variables:

trade_price = f(weight, current_coupon, time_to_maturity, …, curve_based_price_last10)

With regards to the building blocks, they are used to define the mathematical equation types that Eureqa will attempt to combine in your final model. We prefer using fewer building blocks initially to speed up the search, then later expanding the number of building blocks to add in more sophistication to subsequent models. In this case, we are only going to leave the basic building blocks checked and uncheck the two trigonometry building blocks.

For this dataset, we will leave all other options set to their defaults. Eureqa includes many other options to further refine your search which we don’t need to use for this example, but can be very useful in more targeted searches. If you want to read more about your options on setting search targets, please see our tutorial on setting search targets in Eureqa.

set target

Interpreting the Results

At this point, we are finally ready to move on to the Start Search tab and begin to run the formula search. You can run your search on your local computer, just using the cores you have on your laptop or desktop, or you can speed up your searches by using either your own dedicated private cloud or leveraging the cloud with Amazon EC2. For this search, we ran it on 72 cloud cores for just 20 minutes.

Eureqa gives us a few different methods for assessing the progress of your search. On the Start Search tab, you can monitor both the Confidence metrics in the Progress and performance view as well as the Progress over time chart. In conjunction with those two methods, the Pareto front display gives another visual indication of the performance of the generated equations.

In our case, Eureqa went through nearly 250,000 generations of equations in just 20 minutes, resulting in 11 equations. The top 4 most accurate equations (as judged by the Pareto front display) differ widely in complexity, ranging from 14 to 20 terms. The remaining, simpler models show a steep decrease in accuracy, but it is up to you to determine the correct trade-off of simplicity and accuracy.

The current most accurate solution has a 0.547 mean absolute error, signifying that this model can predict the future trading price of a bond with an average error of only $0.55. Using a less complex model with 20% fewer terms gives us a 0.554 mean absolute error. Given that the average future trading price across the entire training dataset is $105, an average error of only $0.55 shows that both formulas model the data very closely. Trading 20% fewer terms for only a 1% difference in accuracy seems like the ideal tradeoff in this scenario.

trade_price = 0.6964*trade_price_last1 + 0.3026*curve_based_price + 0.1059/(trade_type – 2.759)

view results

Summary

In just 20 minutes we were able to discover a formula to predict the next price that a corporate bond will trade at, with a mean absolute error of only 0.554. In addition to just making pure predictions, this formula found the relationships within the data, allowing us to understand what factors are truly driving these prices. In this example, we found that the last trading price, the curve based price, and the trade type are the factors that are most important to what price the bond will trade at next. Out of the 61 variables that we began with, Eureqa was able to identify the 3 variables that have the most impact on the future trading price.

Throughout this example, we took a variety of shortcuts to reach an initial assessment quickly. Now that we have a sense for what this data has to offer and what we’re looking for, there are many opportunities to expand this model to reach even greater accuracy by doing additional data preparation, choosing different formula building blocks, or even just letting the search run for a longer amount of time. However, it is important to keep in mind that with this first investigation, we were able to quickly get visually intuitive results at a high level of accuracy without any deep technical knowledge.

Ready to try for yourself? Go ahead and download the Eureqa project file to get started.

Jess

Topics: Eureqa, Making predictions, Tutorial

Predicting Financial Delinquency Using Credit Scoring Data

Posted by Matt Fleming

14.11.2013 08:10 AM

Predicting future financial distress and understanding the factors that cause it are critical to how banks decide who can get financing and on what terms.  Credit scoring algorithms, which predict the probability of default, are the primary method banks use to determine whether or not a loan should be granted to a given applicant.

In this Eureqa tutorial, we’ll examine how Eureqa can be used to predict whether somebody will experience financial distress in the next two years using anonymous credit-scoring data provided by Kaggle.com.

The original competition can be found here.

Let’s get started!

The Training Data

After downloading the training data from Kaggle.com, we can go ahead and take a look at the various variables and values within Excel.  At first glance, we can see that we’re going to be working with characteristics that are commonly used in assessing creditworthiness.

Predicting financial delinquency with credit characteristics

The full list of variables includes:

  • Serious delinquency in 2 years (SeriousDlqin2yrs)
  • Revolving Utilization Of Unsecured Lines
  • Number Of Time 30-59 Days Past Due Not Worse
  • Debt Ratio
  • Monthly Income
  • Number Of Open Credit Lines And Loans
  • Number Of Times 90 Days Late
  • Number of Real Estate Loans Or Lines
  • Number Of Time 60-89 Days Past Due Not Worse
  • Number of Dependents

You might suspect that a lot of these variables interact to affect someone’s likelihood of delinquency; but how exactly and what do we predict? We’ll use Eureqa to answer the question.

Preparing the Data for Modeling

Eureqa includes several data preparation tools for your convenience; however, we’ll try working with the raw data, without any special pre-processing or preparation, just to get started.  Once we complete our first attempt at modeling the data, we can go back and consider options like removing outliers and smoothing.

eureqa data preparation

 

Predicting Financial Delinquency using Credit Scoring Characteristics

We want to predict the variable SeriousDlqin2yrs, which is a variable containing 0’s for no delinquency, and 1’s for serious delinquency. Since this variable has only values of 0 and 1, we’re going to use a special target expression that will provide a similar constraint on the resulting models. We recommend using the logistic function (see below) which squashes values to be between 0 and 1. We choose the logistic function (as opposed to a step function) because it provides a better search gradient.  For more information, see our tutorial on modeling binary values.

Your target expression should now look something like the following:

eureqa target expression

eureqa building blocks

We’re also going to make a few changes to the model building blocks based on some assumptions that can be made about the data.  Since the data is not related to engineering and is not cyclical or seasonal in nature, we can go ahead and uncheck the ‘Sine’ and ‘Cosine’ building blocks.  We also recommend enabling the ‘Logistic’ building block, which when used as a building-block provides an easy ability to threshold input variables or values that may be useful in a model.

Next, we’re going to enable row weighting on our target variable by clicking Row Weight and selecting ‘1/occurences(SeriousDlqin2yrs)’.   We do this because positive occurrences appear sparsely in the training data (roughly 10,000 out of 150,000 records), so we want to weight the outcomes proportionally to the more frequent case (e.g. by their frequency of occurrence, 1 to 15).  For more information on row weighting, please visit our row weighting tutorial.
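As a sketch of what inverse-occurrence weighting does (this is not Eureqa’s internal code, just the arithmetic behind the option):

```python
from collections import Counter

def inverse_occurrence_weights(values):
    """Weight each row by 1 / (count of its target value), so that rare
    outcomes carry as much total weight as common ones."""
    counts = Counter(values)
    return [1.0 / counts[v] for v in values]

labels = [0, 0, 0, 1]          # hypothetical, sparse positives
weights = inverse_occurrence_weights(labels)
```

Here the three 0-rows together weigh 1.0, and the single 1-row also weighs 1.0, so each outcome class contributes equally to the error metric.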

eureqa row weighting

Now that we’ve set our target expression, selected the appropriate model building blocks, and enabled row weighting, we’re ready to start our search.  From within Eureqa, select the ‘Start Search’ tab and click the button marked ‘Run’.

If you are interested in speeding up your searches by leveraging the cloud, please read our tutorial on using Amazon EC2 with Eureqa.  Enabling the cloud can help you search up to 100x faster.

The Results

From the ‘View Results’ tab, we can get a digest view of all the solutions generated by Eureqa thus far.  For this tutorial, we ran Eureqa using a 72 core private cloud for about five hours.  If you’d like to learn more about how to setup a private cloud, please take a look at our Eureqa Dedicated Server product page.

As with our predicting insurance claim payments tutorial, we’re going to judge the predictive accuracy of the solutions using Mean Absolute Error (MAE). This metric is the average error (plus or minus) one can expect with the predictions generated by our models.

Looking at the solutions Eureqa has generated, we can see that the top four models offer similar predictive accuracy, with MAE ranging between .2245 and .2249, while differing substantially in complexity.  Choosing the simplest of the four would result in a .2% decrease in predictive accuracy, while decreasing the number of terms by nearly 50%.  This is clearly illustrated via Eureqa’s built-in Pareto front display.

eureqa view results

Since the outputs and the predictions are nearly all 0’s or 1’s, we can interpret the mean absolute error statistic as the percentage of time the model found in Eureqa will make an incorrect prediction.  In the case of our most accurate solution, Eureqa could correctly predict whether or not someone would experience financial distress 77.55% of the time.
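A small sketch of why MAE reads as an error rate when targets and predictions are near 0 or 1 (the labels below are made up):

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between predictions and truth."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# With 0/1 targets and near-0/1 predictions, each wrong row contributes
# about 1 and each correct row about 0, so MAE approximates the
# misclassification rate.
actual    = [0, 0, 1, 1]   # hypothetical outcomes
predicted = [0, 1, 1, 1]   # one mistake out of four
error = mean_absolute_error(actual, predicted)   # 0.25, i.e. 75% correct
```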

Even more interestingly, because the output of Eureqa is an analytical model, we can easily identify what characteristics are indicative of future financial delinquency.  Our most accurate model includes Revolving Utilization Of Unsecured Lines, Number Of Times 90 Days Late, Number Of Time 60-89 Days Past Due Not Worse, and Number Of Time 30-59 Days Past Due Not Worse as the most important factors in determining future delinquency.  Other variables, like age, monthly income, number of dependents, real estate loans, and debt ratio, while possibly important indirectly, do not significantly improve accuracy and are not used in the best models. The variables related to being overdue appear to drive nearly all delinquency outcomes.

Summary

In just over a few hours, we were able to go from a raw data set containing credit scoring characteristics to a precise analytical model of financial delinquency that predicts delinquency correctly nearly 78% of the time, and we discovered that the variables related to overdue payments drive these outcomes in the best model.

For real world applications, we would likely want to improve on the predictive accuracy of our results by more thoroughly preparing our sample data, adding new building blocks, letting Eureqa search for a longer period of time, and leveraging additional computational resources such as Amazon EC2 or a private cloud using a dedicated Eureqa server. We can also use the models produced by Eureqa almost anywhere, including software like Excel, R, SAS, or Matlab, in order to do additional analysis now that we have a model to work with.

Ready to try for yourself? Go ahead and download the Eureqa project file to get started.

Topics: Eureqa, Tutorial

Predicting Liability Insurance Claim Payments with Vehicle Data

Posted by Matt Fleming

07.11.2013 08:47 AM

Understanding risk is intrinsic to running a successful insurance business.  Developing a deep understanding of the relationships between various risk factors and the likelihood and cost of associated claims enables insurance companies to charge customers appropriately for the risk they represent.

In this tutorial, we’ll examine how you can use Eureqa to predict Bodily Injury Liability insurance claim payments based on the characteristics of the insured’s vehicle, using data provided by Allstate and Kaggle.com.

The original competition and data, hosted by Kaggle, can be found here.

Let’s get started!

The Data

This competition provided millions of rows of claims data, but we are only going to use a few hundred thousand for discovering the model. The rest we’ll set aside for more advanced validation and testing steps should we need them later on. You may get an error attempting to load all the data on the local machine, so we will first extract just the first 200,000 rows using the command line (or ‘Terminal’ for you OS X folks). This can be done using the following command in Linux or OS X: head -n 200000 originalFileNameAndLocation.csv > newFileName.csv

down sample command

This will be our training dataset for identifying the model structure – we can go ahead and open it using Eureqa.  To do this, open up Eureqa and click ‘File’ and then select ‘Import Data’.

import insurance claim data into eureqa

At first glance after import, we can surmise that the insurance data provided by Allstate contains the makes/models of various vehicles, their associated characteristics, and respective claim payments (if made).  It does not appear to contain any time series data.

Preparing Data for Modeling

Eureqa provides some essential data preparation options for your convenience.  In the case of Allstate’s data, you may want to consider removing outliers on some of the columns.  As for missing data, Eureqa will automatically fill it in with the statistical mean (this is also changeable via the Data Preparation tab).

For this exercise, we’ll try working with the raw data, without any special pre-processing or preparation.

Predicting Claim Payments with Eureqa

To predict claim payments, we are going to set our Eureqa target expression so that ‘Claim_Amount’ is modeled as a function of all the other variables.  It should look something like the following:

eureqa target expression for predicting insurance claims

eureqa building blocks insurance

We’re also going to make some changes to the building blocks based on a few basic assumptions that can be made about the data.  First, we’re going to uncheck the Sine and Cosine building blocks because they are advanced building blocks used for engineering datasets, or cyclical or seasonal data. Second, we’re going to enable row weighting on the ‘Claim_Amount’ column by selecting ‘Row Weight: 1/occurences(Claim_Amount)’.

Row weighting enables us to adjust for the fact that ‘Claim_Amount’ appears quite sparsely in the sample data.  Eureqa will weight rows where ‘Claim_Amount’ is greater than 0 relative to its occurrences in the data set.  For more information, you can check out our tutorial on row weighting.

Now that our target expression is setup, we can start the search.  If you are interested in speeding up your searches by leveraging the cloud, please read our tutorial on using Amazon EC2 with Eureqa.

Go ahead and select the ‘Start Search’ tab and click ‘Run’.

The Results

From the ‘View Results’ tab, we can get a summary view of the solutions Eureqa has generated thus far, along with all of the associated error metrics.   In the case of this search, we ran it on 72 cores for the better part of two hours.

eureqa view results

 

To judge the accuracy of the models generated by Eureqa we can look at a few different error metrics, but for this tutorial we are going to look at ‘Mean Absolute Error’.   This metric, since it’s based on ‘Claim Amount’, is the average error (plus or minus) one can expect with the predictions we’ve generated.

eureqa pareto front

Looking at the Pareto Front graph, we can see that two of the leading solutions are very similar in predictive accuracy, yet are substantially more complex than other models. While the claim amounts in the dataset range between $0.00 and $10,000.00, our most accurate solution predicts the claim amount to within plus or minus $123.66.  Eureqa also identifies a few potentially simpler solutions; one comparable solution using only 11 variables predicts the claim amount plus or minus $124.72.

This is where domain expertise proves invaluable.  Ultimately, it’s up to the analyst to determine the appropriate balance between solution accuracy and complexity.

You should see similar models to our results given enough computational effort. Early on in the search, when there has been less computational effort, solutions are more approximate and refine over time. Even early results, while approximate, can often be useful and require minimal computing time.

Summary

In just a few hours we were able to go directly from the Allstate claims dataset to a precise analytical model of Bodily Injury Liability Insurance claim payments, which predicts claim amounts within plus or minus $123.66 (on average), using the characteristics of the insured’s vehicles.  Since this is an analytical model, we can also deduce what factors and variables lead to higher insurance claims, in addition to just predicting future claim values.

In the real world, we may want to expand on these results to improve their performance. This could be achieved by more thoroughly preparing our sample data, adding new variables and features, adding new building blocks, letting Eureqa search for a longer period of time, and using additional computational resources such as Amazon EC2.

Ready to try for yourself? Go ahead and download the Eureqa project file to get started.

Topics: Eureqa, Tutorial
