﻿

# Blog

A question we often encounter is how to use Eureqa to predict future sales. Let’s consider the following example, using real-world data provided by Dunnhumby in their Product Launch Challenge.

This dataset contains week-by-week sales numbers for various products for the first quarter (13 weeks) after the product’s launch. The challenge is then to predict how successful each product will be at the end of the second quarter (26 weeks after its launch), based only on information in the first quarter.

What kind of information was provided? There were 16 week-by-week attributes typically collected by retailers, such as

• The number of stores selling the product
• The total number of units sold that week
• The number of distinct customers who have bought the product (cumulative)
• The number of distinct customers who have bought the product at least twice (cumulative)
• Cumulative units sold to a number of different customer groups, such as Family Focused, Shoppers on a Budget, Price Sensitive shoppers, etc.

The dataset also included actual second quarter performance numbers for some of the products. The challenge was to predict the numbers for the rest. These given Q2 performance numbers are called the “Training Set”. Often, training set numbers are obtained from prior years.

To use Eureqa on this problem, we downloaded the data and reformatted it into a spreadsheet that contained 2,768 rows and 208 columns. That’s one row for each of the 2,768 products. The 208 columns correspond to the 16 weekly attributes described above, one column per attribute for each of the 13 weeks of the first quarter (13×16=208). For example, one column contained the number of stores selling in week 1; the next column contained the number of stores selling in week 2, etc. Another column contained the number of units sold in week 1, then the number of units sold in week 2, etc. Finally, we appended one last column containing the actual sales in week 26, which is what we want Eureqa to find a predictive formula for. The week-26 sales column was available only for the products included in the training set.
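The column layout described above can be sketched in Python. This is only an illustration: the attribute names are placeholders (the text names just a few of the 16 attributes), and `wide_column_names` and `flatten_product` are hypothetical helpers, not part of Eureqa or the Dunnhumby data.

```python
# Placeholder attribute names: the dataset has 16 weekly attributes,
# but only a few are named in the text, so generic labels are used here.
ATTRIBUTES = [f"attr{i}" for i in range(1, 17)]
WEEKS = range(1, 14)  # first quarter: weeks 1..13


def wide_column_names(attributes=ATTRIBUTES, weeks=WEEKS):
    """Return one column name per (attribute, week) pair, grouped by attribute."""
    return [f"{attr}_week{week}" for attr in attributes for week in weeks]


def flatten_product(weekly_rows, attributes=ATTRIBUTES):
    """Turn a product's weekly records (dicts keyed by attribute) into one wide row.

    The values come out in the same order as wide_column_names().
    """
    return [row[attr] for attr in attributes for row in weekly_rows]


columns = wide_column_names()
print(len(columns))  # prints 208 (16 attributes x 13 weeks)
```

The target column (week-26 sales) would then be appended as one extra value at the end of each row for the products in the training set.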

To start the search, we punched a simple query into Eureqa:

#### Units_that_sold26 = f(…)

This query instructs Eureqa to search for a formula f() that models the column Units_that_sold26 based on all the other columns. Of course, you could make other types of queries. For example, we can ask for a formula for the number of units sold in week 26 based only on variables from the first three weeks, or based only on a subset of the variables. The actual Eureqa-formatted file can be found here for a subset of the data pertaining to candy products only.

After punching in the query, we let Eureqa run for about 24 core-hours, something that would take the better part of a workday on your quad-core, but only a few minutes on the cloud. How did the formula pan out? The formula’s prediction was pretty accurate: the mean absolute error on a validation set (not used for training) was about 0.475% (19 items out of 4,000 items sold), and the R² for the validation set was 0.998. The chart to the right shows the actual sales (blue dots) versus predicted sales (red curve) for various candy types.
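If you want to check these two statistics yourself on exported predictions, a minimal sketch looks like the following. The sales numbers here are made up for illustration and are not taken from the challenge data; the function names are also our own.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all validation rows."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up week-26 sales and model predictions for four products.
actual_sales = [100.0, 200.0, 300.0, 400.0]
predicted_sales = [110.0, 190.0, 310.0, 390.0]

print(mean_absolute_error(actual_sales, predicted_sales))
print(r_squared(actual_sales, predicted_sales))
```

Both numbers should be computed on rows that were held out from training, otherwise they overstate how well the formula generalizes.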

Although we could run this process for all the item categories in the supermarket, asking Eureqa to create a model for just the candy products gave us a candy-specific formula that is particularly well suited for predicting candy sales. A separate model could be trained for predicting, say, just frozen pizza or baked goods. We could also try to predict overall sales. Different formulas (or “models” as analysts call them) are useful for different purposes: For financial purposes we may want to predict total sales, but for the purpose of deciding how much shelf space to allocate to each product type, we might want to use product-specific formulae. For example, if candy X is likely to sell twice as much as candy Y in week 26, we might want to increase the shelf space allocated to candy X and reduce the shelf space for candy Y. We might also want to re-run the analysis every week (or every day) to update the forecast based on up-to-date data.

Unlike other modeling techniques, Eureqa doesn’t just give you a prediction. It gives you a simple formula. You can use this formula to make your own predictions, but you can also use it to gain insight into how the prediction is reached, what it is based on, and what affects it. The transparency provided by Eureqa and Symbolic Regression helps you build confidence in the forecast before basing any critical actions on it. You can even play “what if” scenarios, by tweaking various parameters to see what changes might improve future performance: Perhaps sales in the third week are really critical, so marketing budget is best spent around that week?

So what did the formula say? You’ll have to try it out yourself: Here are the files. And if you want to try it on your own business data, let me know if I can help. Often the biggest challenge is reformatting your data so that it can be modeled, so if you have any questions about that, don’t hesitate to ask me.

Topics: hod lipson, Tutorial

Eureqa automatically splits your data into groups: training and validation data sets. The training data is used to optimize models, whereas validation data is used to test how well models generalize to new data. Eureqa also uses the validation data to filter out the best models to display in the Eureqa user interface. This post describes how to use and control these data sets in Eureqa.

### Default Splitting

By default, Eureqa will randomly shuffle your data and then split it into train and validation data sets based on the total size of your data. Eureqa will color these points differently in the user interface, and also provide statistics for each when displaying stats, for example:

All other error metrics shown in Eureqa, like the “Fit” column and “Error” shown in the Accuracy/Complexity plot, use the metric calculated with the validation data set.

### Validation Data Settings

You can modify how Eureqa chooses the training and validation data sets in the Options | Advanced Genetic Program Settings menu, shown below:

Here you can change the portion of the data that is used for the training data, and the portion that goes into the validation data. The two sets are allowed to overlap, but can also be set to be mutually exclusive as shown above.

For very small data sets (under a few hundred points) it is usually best to use almost all of this data for both training and validation. Model selection can be done using the model complexity alone in these cases.

For very large data sets (over 1,000 rows) it is usually best to use a smaller fraction of data for training. It is recommended to choose a fraction such that the size of the training data is approximately 10,000 rows or less. Then, use all the remaining data for validation.

Finally, you can also tell Eureqa whether or not to randomly shuffle the data before splitting. One reason to disable the shuffling is if you want to choose specific rows at the end of the data set to use for validation.
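To make these settings concrete, here is a rough Python sketch of a split with an optional shuffle and optionally overlapping sets. It is not Eureqa's implementation: the function name, its parameters, and the choice to take training rows from the start and validation rows from the end are all assumptions made for illustration.

```python
import random


def split_rows(rows, train_fraction=0.5, validation_fraction=0.5,
               shuffle=True, seed=0):
    """Split rows into (training, validation) lists.

    Training takes the first train_fraction of the rows and validation
    takes the last validation_fraction. If the two fractions sum to more
    than 1.0, the sets overlap; if they sum to 1.0 or less, they are
    mutually exclusive.
    """
    rows = list(rows)
    if shuffle:
        random.Random(seed).shuffle(rows)  # seeded so the split is repeatable
    n = len(rows)
    n_train = round(n * train_fraction)
    n_validation = round(n * validation_fraction)
    return rows[:n_train], rows[n - n_validation:]


# Unshuffled, mutually exclusive split: the last rows become validation data.
train, validation = split_rows(range(100), 0.75, 0.25, shuffle=False)
print(len(train), len(validation), validation[0])
```

With `shuffle=False` the validation rows are exactly the rows at the end of the data set, which is the case described above.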

### Using Validation Data to Test Extrapolating Future Values

If you are using time-series data and are trying to predict future time-series values, you may want to create a validation data split that emphasizes the ability of the models to predict future values that were not used for optimizing the model directly.

To do this, you need to disable the random shuffling in the Options | Advanced Genetic Programming Options menu, and optionally make the training and validation data sets mutually exclusive (as shown in the options above). For example, you could set the first 75% of the data to be used for training, and the last 25% to be used for validation. After starting the search, you will see your data split like below:

Now, the list of best solutions will be filtered by their ability to predict only future values – the last rows in the data set which were not used to optimize the models directly.
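The idea of filtering solutions by how well they predict the held-out future rows can be illustrated with a small Python sketch. The toy series, the two candidate models, and the helper names are all made up for this example; Eureqa's own ranking also considers model complexity, which is left out here.

```python
def chronological_split(rows, train_fraction=0.75):
    """Keep time order: the first rows train, the last rows validate."""
    n_train = int(len(rows) * train_fraction)
    return rows[:n_train], rows[n_train:]


def mean_abs_error(model, rows):
    """Mean absolute error of model(t) against y over (t, y) rows."""
    return sum(abs(y - model(t)) for t, y in rows) / len(rows)


# Toy time series y = 2t + 1, ordered by time (made-up data).
rows = [(t, 2.0 * t + 1.0) for t in range(20)]
train, validation = chronological_split(rows)

# Two hypothetical candidates: a trend model and the training-set mean.
train_mean = sum(y for _, y in train) / len(train)
candidates = {
    "trend": lambda t: 2.0 * t + 1.0,
    "constant": lambda t: train_mean,
}

# Rank candidates by error on the future (validation) rows only.
ranked = sorted(candidates,
                key=lambda name: mean_abs_error(candidates[name], validation))
print(ranked)
```

A constant model can look acceptable on shuffled data but does poorly here, because every validation row lies beyond the range seen in training.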

Several features in Eureqa assume that your data is one continuous series of points by default, such as the smoothing features and numerical derivative operators. This post shows how to tell Eureqa that there are breaks in the data.

### Entering a Discontinuity with a Blank Row

To indicate a break in your data set, simply insert a blank row in the data spreadsheet. You can insert a blank row by right clicking on the first row of the next series of points. For example:

Eureqa will automatically recognize this blank line as a break between continuous series of data for each variable. This allows you to smooth the data correctly for data points between breaks. For example:

Without a break, the smooth would attempt to blur the two distinct series into a single smooth curve.
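The effect of a break on smoothing can be sketched in Python. Here a blank row is represented by `None`, and a simple centered moving average stands in for the smoother; Eureqa's actual smoothing method is not described in this post, so treat this only as an illustration of why breaks matter.

```python
def split_at_breaks(values):
    """Split a column into continuous segments; None marks a blank row."""
    segments, current = [], []
    for value in values:
        if value is None:
            if current:
                segments.append(current)
            current = []
        else:
            current.append(value)
    if current:
        segments.append(current)
    return segments


def moving_average(segment, window=3):
    """Centered moving average that shrinks the window at segment edges."""
    half = window // 2
    smoothed = []
    for i in range(len(segment)):
        lo, hi = max(0, i - half), min(len(segment), i + half + 1)
        smoothed.append(sum(segment[lo:hi]) / (hi - lo))
    return smoothed


def smooth_with_breaks(values, window=3):
    """Smooth each continuous segment on its own, never across a break."""
    return [moving_average(seg, window) for seg in split_at_breaks(values)]


column = [1.0, 1.0, 1.0, None, 10.0, 10.0, 10.0]
print(smooth_with_breaks(column))                          # two flat series
print(moving_average([1.0, 1.0, 1.0, 10.0, 10.0, 10.0]))   # blurred at the join
```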

Eureqa’s Search Relation setting provides quite a bit of flexibility to search for different types of models. This post describes some advanced techniques of using the Search Relation setting to specify custom error metrics for the search to optimize; or more specifically, arbitrary custom loss functions for the fitness calculation.

### Custom Fitness Using Minimize Difference

Eureqa has a built-in fitness metric named “Minimize difference”. This fitness metric minimizes the signed difference between the left- and right-hand sides of the search relationship. For example, specifying:

y = f(x)

with the minimize difference fitness metric selected tells Eureqa to find an f(x) to minimize y – f(x). A trivial solution to this relationship would be f(x) = negative infinity. However, you can enter other relations that are more useful. Consider the following search relation:

(y – f(x))^4 = 0

Here, the minimize difference fitness would minimize the 4th-power error. In Eureqa this setting looks like:

In fact, you can enter any such expression and the f(x) can appear multiple times. For example:

max( abs(y – f(x)), (y – f(x))^2 ) = 0

would minimize the maximum of the absolute error and squared error, at each data point in the data set.
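As a rough Python illustration of how these relations turn into loss functions, the sketch below evaluates "left side minus right side" at each data point and averages the result. Averaging over the data set is our assumption (the post does not say how the per-point differences are aggregated), and all function names are hypothetical.

```python
def minimize_difference(lhs, rhs, xs, ys):
    """Mean signed difference lhs - rhs over the data set.

    Averaging is an assumption made for this sketch.
    """
    return sum(lhs(x, y) - rhs(x, y) for x, y in zip(xs, ys)) / len(xs)


def fourth_power_relation(f):
    """Both sides of the relation (y - f(x))^4 = 0."""
    return (lambda x, y: (y - f(x)) ** 4), (lambda x, y: 0.0)


def max_abs_or_squared_relation(f):
    """Both sides of max(abs(y - f(x)), (y - f(x))^2) = 0."""
    return (lambda x, y: max(abs(y - f(x)), (y - f(x)) ** 2)), (lambda x, y: 0.0)


# Made-up data and a candidate model f(x) = x, giving residuals 0, 1 and 2.
xs = [0.0, 1.0, 2.0]
ys = [0.0, 2.0, 4.0]
candidate = lambda x: x

print(minimize_difference(*fourth_power_relation(candidate), xs, ys))
print(minimize_difference(*max_abs_or_squared_relation(candidate), xs, ys))
```

Because the left side of each relation is never negative and the right side is zero, minimizing the signed difference is the same as minimizing the custom error itself.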

### Other Methods

There are many other possible ways to alter the fitness metric using the search relationship setting. For example, you could use a normal fitness metric (e.g. absolute or squared error) but scale both sides of the relation. For example, you could wrap each side of the search relation with a sigmoid function like tanh:

tanh(y) = tanh( f(x) )

Now, both the left and right sides get squashed down to a tanh function (an s-shaped curve that ranges -1 to 1) before being compared. This effectively caps large errors, reducing their impact on the fitness.
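A small numeric sketch in Python shows the capping effect. Squared error is used here as the "normal" metric, and the numbers are made up; the function names are ours.

```python
import math


def squared_error(y, prediction):
    """Ordinary squared error for one data point."""
    return (y - prediction) ** 2


def squashed_squared_error(y, prediction):
    """Squared error after wrapping both sides in tanh.

    Since tanh lies between -1 and 1, this can never exceed 4.
    """
    return (math.tanh(y) - math.tanh(prediction)) ** 2


# One outlier point: the target is 1000 but the model predicts 1.
print(squared_error(1000.0, 1.0))
print(squashed_squared_error(1000.0, 1.0))
```

The trade-off is that differences between large values become nearly invisible to the search, so this is best suited to data where large magnitudes are mostly noise or outliers.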

### Even More Tricks

You can also use the search relationship to forbid certain values by exploiting NaN values (NaN = Not a Number). For example, consider the following search relation, which forbids models with negative values:

y = f(x) + 0*log( f(x) )

Notice the unusual 0*log(f(x)) term. Whenever f(x) is positive, the log is real-valued, and the multiplication with zero reduces the expression to y = f(x). However, whenever f(x) is negative, log(f(x)) is undefined, and produces a NaN value. Whenever a NaN appears in the fitness calculation, Eureqa automatically assigns the solution infinite error. Therefore, this search relationship tells Eureqa to find an f(x) that models y, but f(x) must be positive on each point in the data set.

This behavior can be used in other ways as well. Any operation that would produce an IEEE floating point NaN, undefined, or infinity will trigger Eureqa to assign infinite error. You can also add multiple terms like this to place multiple different constraints on solutions.
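The NaN trick can be reproduced outside Eureqa with a short Python sketch. Python's `math.log` raises an exception for non-positive inputs rather than returning NaN, so the hypothetical helper `ieee_log` below mimics IEEE floating point behavior. The error function is a stand-in for a fitness calculation, not Eureqa's actual code.

```python
import math


def ieee_log(value):
    """Natural log with IEEE-style results instead of Python exceptions."""
    if value < 0.0:
        return math.nan
    if value == 0.0:
        return -math.inf
    return math.log(value)


def constrained_prediction(f, x):
    """Right-hand side of the relation y = f(x) + 0*log(f(x))."""
    fx = f(x)
    # For positive fx the extra term is exactly 0.0. For negative fx the log
    # is NaN, and for fx == 0 the product 0 * -inf is also NaN.
    return fx + 0.0 * ieee_log(fx)


def mean_abs_error(f, xs, ys):
    """Mean absolute error, or infinity if any point is NaN or infinite."""
    total = 0.0
    for x, y in zip(xs, ys):
        residual = y - constrained_prediction(f, x)
        if math.isnan(residual) or math.isinf(residual):
            return math.inf
        total += abs(residual)
    return total / len(xs)


xs, ys = [0.0, 1.0, 2.0], [1.0, 2.0, 3.0]
print(mean_abs_error(lambda x: x + 1.0, xs, ys))  # positive everywhere
print(mean_abs_error(lambda x: x - 1.0, xs, ys))  # negative at x = 0
```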

A time delay retrieves the value of a variable or expression at a fixed offset in the past, according to the time ordering or index of each data point in the data set. This post describes the time-delay building-blocks available in Eureqa and different modeling techniques with delayed values.

### Time Delay Building Blocks

Eureqa provides the delay(x, c) building block to represent an arbitrary time-delay, where x could be any expression. The expression delay(x, c) returns the value of x at c time units in the past. When used as a building-block, Eureqa can automatically optimize expressions or variables to be delayed and the time-delay amount c.

The figure above plots an arbitrary variable x and a delayed value delay(x, 1.0), where the values are ordered by some time variable t. The delayed version is equal to x at 1.0 time units into the past.

To use time-delay building-blocks, your data must have some notion of time or ordering. You also need to tell Eureqa which variable in your data represents the time or ordering value:

If you don’t specify a time variable, Eureqa will use the row number in the spreadsheet as the time value of each data point.

If a particular delayed time value falls between two points in the data set, the value is linearly interpolated between the two data points using the time value.
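A minimal Python sketch of a delay with linear interpolation is shown below. It follows the description above but is not Eureqa's code: the function signature is our own, and returning NaN where no history exists is an assumption made so that those points can be skipped later.

```python
import math
from bisect import bisect_left


def delay(times, values, c):
    """Return the value of the series at time t - c for every t in times.

    times must be strictly increasing. Lookups that fall between two data
    points are linearly interpolated; lookups outside the recorded time
    range yield NaN.
    """
    delayed = []
    for t in times:
        target = t - c
        if target < times[0] or target > times[-1]:
            delayed.append(math.nan)
            continue
        i = bisect_left(times, target)  # first index with times[i] >= target
        if times[i] == target:
            delayed.append(values[i])
        else:
            t0, t1 = times[i - 1], times[i]
            weight = (target - t0) / (t1 - t0)
            delayed.append(values[i - 1] + weight * (values[i] - values[i - 1]))
    return delayed


t = [0.0, 1.0, 2.0, 3.0]
x = [0.0, 10.0, 20.0, 30.0]
print(delay(t, x, 1.0))  # whole-step delay, no interpolation needed
print(delay(t, x, 1.5))  # falls between samples, so values are interpolated
```

If no time variable is available, passing the row numbers as `times` gives the row-index behavior described above.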

Eureqa also provides the delay_var(x, c) building-block which is identical to delay(x, c), except that it only accepts a variable as input. It’s provided as a special case of the delay(x, c) building-block to allow you to constrain the types of delays used in the solutions. But in the end they are effectively identical.

### Control the Fraction of Data Used for History

Notice that the delayed output plotted above does not have values on the left side of the graph for the first few time points. This is because these points request previous values of x that lie before the first point in our data set. Eureqa will automatically ignore these data points when calculating errors.

However, there is a way to control how much of the data set Eureqa is allowed to ignore – or effectively, specify a maximum delay offset. You can limit the fraction of data used for time-delay history values in the Advanced Solutions Options menu:

The default maximum fraction is 50% of the data. If you find that Eureqa is identifying solutions with very large time delays, perhaps just to avoid modeling difficult features in the first half of the data set, you may want to reduce this fraction.

Additionally, you can control the number of delayed values per variable (including a zero delay of an ordinary variable use) in this dialog.

### Fixed Time-delays

Another way to model a value as a function of its previous values is with fixed delays. You can enter in fixed time-delays, or “lags” of the variable, directly into the Search Relationship option. For example:

x = f( delay(x, 2.1), delay(x, 5.6) )

This search relationship tells Eureqa to find an equation to model the value of x as a function of its value at 2.1 and 5.6 time units in the past.

### Minimum Time-delays

You may also want to specify a minimum time-delay offset. If you entered a search relationship such as x = f(x), Eureqa would find a trivial answer f(x) = x. More likely, you wanted to find a model of x, but as a function of x at least some amount of time in the past. The way to do this is to again use a fixed delay, such as:

x = f(delay(x, 3.21))

Here, 3.21 is the minimum time-delay. Now, if the time-delay building-blocks are enabled, Eureqa can delay this delayed input further if necessary.

### Delay Differential Equations

Another common use for time-delays is modeling with Delay Differential Equations. Finding delay differential equations is just like searching for ordinary differential equations. For example, enter a search relationship like the following, and also enable the time-delay building blocks:

D(y,t) = f(y)

This relationship has a trivial solution, however: Eureqa will return a slope formula such as

f(y) = ( y – delay(y, 0.1) )/0.1

Therefore, you most likely want to limit the total number of delays per variable to one (which includes the zero delay of the normal variable use). You can set this in the Advanced Solution Settings menu. The default is unlimited.
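To see why the slope formula is a trivial solution, note that it is simply a backward finite difference, which approximates the derivative of any smooth signal without capturing anything about the system. A quick Python check, using a sine wave as a made-up example signal:

```python
import math


def backward_difference(y, t, delay_amount=0.1):
    """The trivial 'solution' (y - delay(y, c)) / c, evaluated at time t."""
    return (y(t) - y(t - delay_amount)) / delay_amount


# For y(t) = sin(t) the true derivative D(y,t) is cos(t).
t = 1.0
print(backward_difference(math.sin, t, 0.1), math.cos(t))
print(backward_difference(math.sin, t, 0.001), math.cos(t))
```

The approximation gets better as the delay shrinks, so a search that is allowed to combine y with a delayed copy of y can match D(y,t) closely without learning any dynamics. Limiting each variable to a single delay removes that shortcut.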

### Implementing Delays Outside of Eureqa

In Matlab, you can implement a time delay using the interp1 function. For example, the expression delay(x, 1.23) would be implemented as:

interp1(t, x, t - 1.23, 'linear')

Implementing delays in Excel is a little harder. You need to download an Excel add-on that adds an interpolate function. For example, the package XlXtrFun adds a function “Interpolate” that is just like Matlab’s interp1. There are also other guides for Linear Interpolation with Excel.