﻿

# Blog

Eureqa’s Search Relation setting provides quite a bit of flexibility to search for different types of models. This post describes some advanced techniques of using the Search Relation setting to specify custom error metrics for the search to optimize; or more specifically, arbitrary custom loss functions for the fitness calculation.

Custom Fitness Using Minimize Difference

Eureqa has a built-in fitness metric named “Minimize difference”. This fitness metric minimizes the signed difference between the left- and right-hand sides of the search relationship. For example, specifying:

y = f(x)

with the minimize difference fitness metric selected tells Eureqa to find an f(x) that minimizes y - f(x). A trivial solution to this relationship would be an f(x) that grows without bound toward positive infinity, which drives the signed difference toward negative infinity. However, you can enter other relations that are more useful. Consider the following search relation:

(y - f(x))^4 = 0

Here, the minimize difference fitness would minimize the 4th-power error. In Eureqa this setting looks like:

In fact, you can enter any such expression and the f(x) can appear multiple times. For example:

max( abs(y - f(x)), (y - f(x))^2 ) = 0

would minimize the maximum of the absolute error and squared error, at each data point in the data set.
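
To make the two relations above concrete, here is a minimal Python sketch of the per-point quantities they describe. The data, the candidate model f(x) = 2*x, and the averaging over the data set are all assumptions made for illustration; the post does not say how Eureqa aggregates the per-point values.

```python
def fourth_power_loss(y, pred):
    """Per-point value of (y - f(x))^4."""
    return (y - pred) ** 4


def max_abs_squared_loss(y, pred):
    """Per-point value of max(abs(y - f(x)), (y - f(x))^2)."""
    err = y - pred
    return max(abs(err), err ** 2)


def mean_loss(loss, ys, preds):
    """Average a per-point loss over the data set (averaging is an assumption)."""
    return sum(loss(y, p) for y, p in zip(ys, preds)) / len(ys)


# Hypothetical data and a hypothetical candidate model f(x) = 2*x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.5, 2.0, 3.5, 8.0]
preds = [2.0 * x for x in xs]

print(mean_loss(fourth_power_loss, ys, preds))
print(mean_loss(max_abs_squared_loss, ys, preds))
```

Note that in the second loss the absolute error is the larger term when the error is below 1 in magnitude, and the squared error takes over for larger errors.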

Other Methods

There are many other ways to alter the fitness metric using the search relationship setting. For example, you could use a normal fitness metric (e.g. absolute or squared error) but transform both sides of the relation, such as by wrapping each side with a sigmoid function like tanh:

tanh(y) = tanh( f(x) )

Now, both the left and right sides are squashed by the tanh function (an s-shaped curve that ranges from -1 to 1) before being compared. This effectively caps large errors, reducing their impact on the fitness.
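
The capping effect can be checked with a few lines of Python. This is only a sketch of the arithmetic, assuming an absolute-error comparison of the two squashed sides; the sample values are hypothetical.

```python
import math


def squashed_abs_error(y, pred):
    """Absolute error after passing both sides through tanh."""
    return abs(math.tanh(y) - math.tanh(pred))


# A small miss and a huge miss on the same target value.
print(squashed_abs_error(1.0, 1.2))     # small raw error, small squashed error
print(squashed_abs_error(1.0, 1000.0))  # raw error of 999, still bounded

# Since tanh stays within [-1, 1], the squashed error can never exceed 2.
```

One consequence worth keeping in mind: once both sides are deep in the flat part of the curve, their difference is essentially invisible. For instance, squashed_abs_error(50.0, 1000.0) evaluates to 0.0 in double precision.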

Even More Tricks

You can also use the search relationship to forbid certain values by exploiting NaN values (NaN = Not a Number). For example, consider the following search relation, which forbids models with negative values:

y = f(x) + 0*log( f(x) )

Notice the unusual 0*log(f(x)) term. Whenever f(x) is positive, the log is real-valued, and the multiplication with zero reduces the expression to y = f(x). However, whenever f(x) is negative, log(f(x)) is undefined, and produces a NaN value. Whenever a NaN appears in the fitness calculation, Eureqa automatically assigns the solution infinite error. Therefore, this search relationship tells Eureqa to find an f(x) that models y, but f(x) must be positive on each point in the data set.

This behavior can be used in other ways as well. Any operation that would produce an IEEE floating point NaN, undefined, or infinity will trigger Eureqa to assign infinite error. You can also add multiple terms like this to place multiple different constraints on solutions.
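
The following Python sketch mimics the NaN trick for the relation y = f(x) + 0*log( f(x) ). Python's math.log raises an exception for non-positive inputs rather than returning NaN, so the helper ieee_log (a hypothetical name) emulates IEEE behavior. The mean absolute error and the sample values are assumptions for illustration.

```python
import math


def ieee_log(v):
    """Natural log with IEEE-style results instead of Python exceptions."""
    if v > 0.0:
        return math.log(v)
    if v == 0.0:
        return -math.inf
    return math.nan  # negative (or NaN) input


def constrained_abs_error(ys, preds):
    """Mean absolute error of y = f(x) + 0*log(f(x)); infinite if any point is invalid."""
    total = 0.0
    for y, p in zip(ys, preds):
        rhs = p + 0.0 * ieee_log(p)  # 0 * NaN and 0 * -inf are both NaN
        if math.isnan(rhs) or math.isinf(rhs):
            return math.inf
        total += abs(y - rhs)
    return total / len(ys)


print(constrained_abs_error([1.0, 2.0], [1.5, 2.5]))   # all predictions positive
print(constrained_abs_error([1.0, 2.0], [1.5, -2.5]))  # one negative prediction
```

A prediction of exactly zero is rejected as well, because 0 * log(0) is 0 times negative infinity, which is also NaN.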

A time delay retrieves the value of a variable or expression at a fixed offset in the past, according to the time ordering or index of each data point in the data set. This post describes the time-delay building-blocks available in Eureqa and different modeling techniques with delayed values.

Time Delay Building Blocks:

Eureqa provides the delay(x, c) building block to represent an arbitrary time-delay, where x could be any expression. The expression delay(x, c) returns the value of x at c time units in the past. When used as a building-block, Eureqa can automatically optimize both which expressions or variables are delayed and the time-delay amount c.

The figure above plots an arbitrary variable x and a delayed value delay(x, 1.0), where the values are ordered by some time variable t. The delayed version is equal to x at 1.0 time units into the past.

To use time-delay building-blocks, your data must have some notion of time or ordering. You also need to tell Eureqa which variable in your data represents the time or ordering value:

If you don’t specify a time variable, Eureqa will use the row number in the spreadsheet as the time value of each data point.

If a particular delayed time value falls between two points in the data set, the value is linearly interpolated between the two data points using the time value.
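
As a rough illustration of the interpolation described above, here is a small Python sketch of a delay over sampled data. It assumes the time values are strictly increasing, and it marks requests that fall outside the recorded time range with NaN, which is a choice made for this sketch and not a statement about Eureqa's internals.

```python
import math
from bisect import bisect_right


def delay(t, x, c):
    """Return x evaluated at t[i] - c for every i, using linear interpolation.

    t must be strictly increasing. Requests outside [t[0], t[-1]] yield NaN.
    """
    out = []
    for ti in t:
        target = ti - c
        if target < t[0] or target > t[-1]:
            out.append(math.nan)
            continue
        j = bisect_right(t, target) - 1  # last index with t[j] <= target
        if j == len(t) - 1:              # target is exactly the final time
            out.append(x[-1])
            continue
        frac = (target - t[j]) / (t[j + 1] - t[j])
        out.append(x[j] + frac * (x[j + 1] - x[j]))
    return out


# Hypothetical series sampled at unit time steps.
t = [0.0, 1.0, 2.0, 3.0, 4.0]
x = [0.0, 10.0, 20.0, 15.0, 5.0]
print(delay(t, x, 1.5))  # first two entries are NaN: no history that far back
```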

Eureqa also provides the delay_var(x, c) building-block which is identical to delay(x, c), except that it only accepts a variable as input. It’s provided as a special case of the delay(x, c) building-block to allow you to constrain the types of delays used in the solutions. Otherwise, the two behave identically.

Control the Fraction of Data Used for History

Notice that the delayed output plotted above does not have values on the left side of the graph for the first few time points. This is because these points request previous values of x that lie before the first point in our data set. Eureqa will automatically ignore these data points when calculating errors.

However, there is a way to control how much of the data set Eureqa is allowed to ignore – or effectively, specify a maximum delay offset. You can limit the fraction of data used for time-delay history values in the Advanced Solutions Options menu:

The default maximum fraction is 50% of the data. If you find that Eureqa is identifying solutions with very large time delays, perhaps just to avoid modeling difficult features in the first half of the data set, you may want to reduce this fraction.

Additionally, you can control the number of delayed values per variable (including a zero delay of an ordinary variable use) in this dialog.

Fixed Time-delays:

Another way to model a value as a function of its previous values is with fixed delays. You can enter in fixed time-delays, or “lags” of the variable, directly into the Search Relationship option. For example:

x = f( delay(x, 2.1), delay(x, 5.6) )

This search relationship tells Eureqa to find an equation to model the value of x as a function of its value at 2.1 and 5.6 time units in the past.

Minimum Time-delays:

You may also want to specify a minimum time-delay offset. If you entered a search relationship such as x = f(x), Eureqa would find a trivial answer f(x) = x. More likely, you wanted to find a model of x, but as a function of x at least some amount of time in the past. The way to do this is to again use a fixed delay, such as:

x = f(delay(x, 3.21))

Here, 3.21 is the minimum time-delay. Now, if the time-delay building-blocks are enabled, Eureqa can delay this delayed input further if necessary.

Delay Differential Equations:

Another common use for time-delays is modeling with Delay Differential Equations. Finding delay differential equations is just like searching for ordinary differential equations. For example, you can enter a search relationship like:

D(y,t) = f(y)

and also enable the time-delay building blocks. This relationship has a trivial solution, however: Eureqa will return a slope formula such as

f(y) = ( y - delay(y, 0.1) )/0.1

Therefore, you most likely want to limit the total number of delays per variable to one (which includes the zero delay of the normal variable use). You can set this in the Advanced Solution Settings menu. The default is unlimited.
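
To see why the slope formula is a trivial solution, the Python sketch below compares ( y - delay(y, 0.1) )/0.1 against the true derivative for a hypothetical signal y = sin(t). The delay is evaluated analytically as y(t - h) here, which is a simplification; on real data it would be interpolated.

```python
import math

h = 0.1                                    # the delay used in the slope formula
ts = [0.5 + 0.1 * k for k in range(20)]    # hypothetical sample times


def y(t):
    """Hypothetical signal whose derivative is known exactly: cos(t)."""
    return math.sin(t)


def slope_formula(t):
    """(y - delay(y, h)) / h, with the delay written as y(t - h)."""
    return (y(t) - y(t - h)) / h


max_gap = max(abs(slope_formula(t) - math.cos(t)) for t in ts)
print(max_gap)  # small: the formula tracks D(y,t) without modeling any dynamics
```

The formula is just a backward difference quotient, so it reproduces the derivative of any smooth signal to within roughly h/2 times the second derivative. It fits well but says nothing about the system being modeled.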

Implementing Delays Outside of Eureqa

In Matlab, you can implement a time delay using the interp1 function. For example, the expression delay(x, 1.23) would be implemented as:

interp1(t, x, t - 1.23, 'linear')

Implementing delays in Excel is a little harder. You need to download an Excel add-on that adds an interpolate function. For example, the package XlXtrFun adds a function “Interpolate” that is just like Matlab’s interp1. There are also other guides for Linear Interpolation with Excel.

Eureqa can automatically estimate numerical derivatives in order to model the rates of change of variables in your data. Often derivatives are more natural and simpler for modeling certain types of phenomena, particularly in physics. This post discusses the basics of entering derivatives into the Eureqa search relationship.

The Derivative Operator:

Eureqa provides the derivative operator D(x, y, n) where x and y are any arbitrary expressions and n is an integer representing the order of the derivative to take. This operator can be used in the Search Relationship setting. For example, consider the search relationship:

D(x,t,1) = f(x,t)

This relation tells Eureqa to find a function of x and t that models the first derivative (e.g. a velocity or slope) of x with respect to t. Short-hand for the first derivative is D(x,t). The derivative operator can also appear inside the formula as an input variable, for example:

D(x,t,2) = f( x, D(x,t) )

This relation tells Eureqa to find a model of the second derivative (e.g. an acceleration or curvature) of x with respect to t, as a function of x and the first derivative of x. In Eureqa, this relation will appear as:

Eureqa displays the derivatives in mathematical format after the relationship text is entered.

Alternatively, you could estimate the numerical derivatives ahead of time using another program, and enter these values as a new variable in the data set rather than using Eureqa’s derivative operator.

Starting the Search:

Eureqa will calculate the numerical derivatives that appear in your search relation when you start the search. The following screen will appear after you click start:

Eureqa estimates the numerical derivative using a spline fit to the data. This allows more accurate derivative estimates than basic difference-quotient methods when the data contains noise.

Estimating numerical derivatives accurately is a challenging task when the data is sparse or contains noise. Eureqa’s derivative estimation is an improvement over the most basic methods like Newton’s difference quotient. However, it does not work well in all cases.

One particular problem with spline curves is their accuracy at the head and tail of the data – these points are “surrounded” by fewer data points and thus have higher estimation error. If you can, you might want to ignore these points entirely using a weight variable. Simply add a new column to your data, and set the weight to 1 for all data points but near zero for the first and last 5 to 10 data points.
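
A weight column of this kind is easy to generate before pasting the data into Eureqa. The Python sketch below is one way to do it; the function name, the default of 5 edge points, and the edge weight of 1e-6 are illustrative choices, not values prescribed by Eureqa.

```python
def edge_weights(n_points, n_edge=5, edge_weight=1e-6):
    """Weight 1 everywhere except the first and last n_edge points."""
    n_edge = min(n_edge, n_points // 2)  # never let the two edges overlap
    weights = [1.0] * n_points
    for i in range(n_edge):
        weights[i] = edge_weight
        weights[n_points - 1 - i] = edge_weight
    return weights


# Hypothetical data set with 20 rows.
print(edge_weights(20))
```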

It may also be worth the effort to estimate the numerical derivatives outside of Eureqa using more specialized tools. For example, you may want to compute the derivative values in R or using Matlab’s spline toolbox, and then paste these into Eureqa as a new column variable.
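
If you want a quick baseline to compare against before reaching for a spline package, a plain finite-difference estimate takes only a few lines of Python. To be clear, this is not the spline method Eureqa uses and it is sensitive to noise; it is a sketch of the simplest possible external estimate, with hypothetical data x = t^2.

```python
def central_differences(t, x):
    """Estimate dx/dt: central differences inside, one-sided at both ends."""
    n = len(t)
    if n < 2:
        raise ValueError("need at least two data points")
    d = [0.0] * n
    d[0] = (x[1] - x[0]) / (t[1] - t[0])
    d[-1] = (x[-1] - x[-2]) / (t[-1] - t[-2])
    for i in range(1, n - 1):
        d[i] = (x[i + 1] - x[i - 1]) / (t[i + 1] - t[i - 1])
    return d


# Hypothetical data: x = t^2, whose true derivative is 2*t.
t = [0.0, 0.5, 1.0, 1.5, 2.0]
x = [ti ** 2 for ti in t]
dxdt = central_differences(t, x)
print(dxdt)
```

On this example the interior estimates match 2*t exactly, while the first and last points are each off by 0.5. The same edge effect shows up in any local estimator, and is a reason to down-weight the end points.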

Binary classification attempts to predict a variable that has only two possible outcomes – for example, true or false, or buy or don’t buy. This post describes how Eureqa can be used to model a boolean decision or classification value.

Binary classification is also one of the most widely studied problems in machine learning, and there are many optimized approaches for prediction (e.g. neural nets, support vector machines, etc). Using Eureqa for classification (or symbolic regression in general) has a few advantages:

• finding models requires less data
• models can extrapolate extremely well
• resulting models are simple to analyze, refit, and reuse
• the structure of the models gives insight into the classification problem

The last point is the most important in my opinion – not only can you predict but you can also learn something about how the classification works, as in the example below. This isn’t possible with most other methods, but comes at a cost of increased time to find an analytical solution if one exists. Here’s how to do it in Eureqa.

Squash Method:

The key to this method is to tell Eureqa to search for equations that tend to be negative when the output is false, and positive when true. We then put solutions inside a step function to obtain outputs of either 1 (true) or 0 (false).

Step 1: Eureqa works with numerical values, so define true outcomes to have value 1, and false outcomes to have value 0. Now, enter in the boolean variable into Eureqa as a column of 0 and 1 values.

Step 2: We want to find a formula that predicts 0 and 1 values. One way to do this is to tell Eureqa to search for an equation that goes inside a step function before comparing with the boolean value. For example, we could enter “z = step(f(x,y))” into the search relationship setting, where z is a boolean value we want to model, x and y are other variables in the data set, and f(x,y) is the formula that Eureqa attempts to find. The step function is a built-in function in Eureqa that outputs 1 if the input is positive, and 0 otherwise. In other words, we are telling Eureqa to find equations that tend to be negative when z is 0 (false), and positive when z is 1 (true).

Step 3: Start a Eureqa search as normal. Eureqa reports equations for f(x,y), the expression inside the step function. To use these solutions to predict the boolean value outside of Eureqa, we need to substitute the formula back into the search relationship. In other words, remember to place the reported solutions back into a step function to obtain the final model.

Example:

Let’s say we collected the following data, where x and y are two input variables, and z is a boolean outcome that we want to model (red = true, green = false):

We enter in a search relationship as “z = step( f(x,y) )”:

We then start the Eureqa search. After a few minutes, Eureqa identified a very accurate solution:

f(x,y) = 1.98 + 2.02*x*y - 3.05*y*y - x*x

You may recognize this equation as a tilted ellipse. Plotting this solution on the data makes this clear:

Here, we used Eureqa to identify a boolean model of whether a data point would be red or green based on the 2D location of x and y. The resulting solution shows that the data can be separated by an ellipse.
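
Using the reported solution outside of Eureqa means wrapping it in a step function again. A minimal Python sketch follows; step is defined here as described in the post (1 for positive input, 0 otherwise), and the query points are hypothetical.

```python
def step(v):
    """1 if the input is positive, 0 otherwise."""
    return 1 if v > 0 else 0


def f(x, y):
    """The solution reported in the example above."""
    return 1.98 + 2.02 * x * y - 3.05 * y * y - x * x


def predict(x, y):
    """Final boolean model: the reported formula placed back inside step()."""
    return step(f(x, y))


# Hypothetical query points.
for point in [(0.0, 0.0), (1.0, 0.5), (2.0, 0.0), (0.0, 1.0)]:
    print(point, predict(*point))
```

Points where f(x,y) is positive fall inside the ellipse and are predicted as 1; points outside are predicted as 0.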

Another type of squashing function is the logistic function which varies smoothly between 0 and 1. It provides a better search gradient than the step function which has almost none. For example, we could enter a search relationship instead as:

z = logistic( f(x,y) )

A side effect is that logistic(f(x,y)) can produce intermediate values, such as 0.77 or 0.001. Therefore, we need to threshold this value to get final 0 or 1 outputs. A simple way to threshold at 0.5 is to replace the logistic with a step function when making the final predictions of the boolean value.
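
The reason the swap works is that the logistic function crosses 0.5 exactly where its input crosses zero. The Python sketch below checks this, assuming the standard logistic 1/(1 + exp(-v)); the post does not spell out Eureqa's exact definition, so treat that as an assumption.

```python
import math


def logistic(v):
    """Standard logistic function, written to avoid overflow for large |v|."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def step(v):
    """1 if the input is positive, 0 otherwise."""
    return 1 if v > 0 else 0


def classify(v, threshold=0.5):
    """Threshold the logistic output to get a final 0 or 1 prediction."""
    return 1 if logistic(v) > threshold else 0


# Thresholding the logistic at 0.5 agrees with applying step() directly.
for v in [-3.0, -0.1, 0.0, 0.1, 3.0]:
    print(v, logistic(v), classify(v), step(v))
```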

Often you might want to specify that the output of a model should fall within a certain range rather than match an exact numerical value. This post shows one way to do this with Eureqa. The goal is to find the simplest equations whose outputs always lie between some min and max value for each data point.

Enter Min and Max Values for each Data Point:

Step 1: For each data point where you only have a range of output values (a min and a max value), add two rows for that data point, one with the minimum value and one with the maximum value (keeping all other variables in the row the same).

Step 2: Next, set the fitness metric to the “Mean Absolute Error” option.

Step 3: Start the Eureqa search as usual. Solutions that fall between the min and max values will have identical absolute error.

If a model output lies between the min and max values, the absolute error happens to be indifferent (mathematically) to where exactly this value lies. If the value moves closer to the max value, the error on the max value data point decreases linearly, but the error on the min value data point increases linearly also.
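
This indifference is easy to verify numerically. In the Python sketch below, the range [2, 5] and the candidate outputs are hypothetical; the summed absolute error over the min row and the max row stays equal to the width of the range for any output inside it, and grows once the output leaves the range.

```python
def pair_abs_error(y_min, y_max, pred):
    """Summed absolute error over the min row and the max row of one data point."""
    return abs(y_min - pred) + abs(y_max - pred)


y_min, y_max = 2.0, 5.0  # hypothetical output range for one input
for pred in [1.0, 2.0, 3.0, 4.5, 5.0, 6.0]:
    print(pred, pair_abs_error(y_min, y_max, pred))
```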

Example

In Eureqa, your data view should look similar to:

Here, each input x appears twice, once with the minimum y value and again with the maximum y value.

We can then start the search using the Mean Absolute Error fitness metric, and get various solutions that fall into the min/max ranges:

These solutions may have slightly different fitness values because some min/max data point pairs might get separated between the train and validation data sets. One way to avoid this is to change the train and validation sets to use all data or not shuffle their points in the Advanced Genetic Program Settings menu.

Using Separate Min and Max Values in a Custom Error Metric

Another option is to specify a custom error metric in the Search Relationship, which allows you to enter your min and max range values in different columns. For example, consider the following search relationship:

abs(y_min - f(x)) + abs(y_max - f(x)) = 0

where x is the input, and y_min and y_max are two different variables representing the min and max values of the range of outputs for each input x. The custom error in this relation is equivalent to the previous method.
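
The equivalence can be checked with a short Python sketch. The rows, the candidate model, and the helper names are hypothetical; the two layouts are compared on summed error, since a mean over the two-row layout divides by twice as many rows and so differs by a constant factor that does not change how solutions rank.

```python
def expand_ranges(rows):
    """Turn (x, y_min, y_max) rows into two (x, y) rows per input."""
    out = []
    for x, y_min, y_max in rows:
        out.append((x, y_min))
        out.append((x, y_max))
    return out


def two_row_abs_error(expanded, f):
    """Summed absolute error over the duplicated-row layout."""
    return sum(abs(y - f(x)) for x, y in expanded)


def custom_relation_error(rows, f):
    """Summed value of abs(y_min - f(x)) + abs(y_max - f(x)) per row."""
    return sum(abs(lo - f(x)) + abs(hi - f(x)) for x, lo, hi in rows)


def candidate(x):
    """Hypothetical candidate model."""
    return 2.0 * x + 0.5


rows = [(0.0, 0.0, 1.0), (1.0, 1.5, 3.0), (2.0, 3.0, 6.0)]
print(two_row_abs_error(expand_ranges(rows), candidate))
print(custom_relation_error(rows, candidate))
```

Because each min/max pair lives on a single row in this layout, a pair cannot be split between the training and validation sets.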
