Blog

Setting and using validation data with Eureqa

Posted by Michael Schmidt

6/28/13 4:02 PM

Eureqa automatically splits your data into two groups: a training data set and a validation data set. The training data is used to optimize models, whereas the validation data is used to test how well models generalize to new data. Eureqa also uses the validation data to filter out the best models to display in the Eureqa user interface. This post describes how to use and control these data sets in Eureqa.

Default Splitting
By default, Eureqa will randomly shuffle your data and then split it into train and validation data sets based on the total size of your data. Eureqa will color these points differently in the user interface, and also provide statistics for each when displaying stats, for example:


All other error metrics shown in Eureqa, like the "Fit" column and "Error" shown in the Accuracy/Complexity plot, use the metric calculated with the validation data set.

Validation Data Settings

You can modify how Eureqa chooses the training and validation data sets in the Options | Advanced Genetic Program Settings menu, shown below:


Here you can change the portion of the data that is used for the training data, and the portion that goes into the validation data. The two sets are allowed to overlap, but can also be set to be mutually exclusive as shown above.

For very small data sets (under a few hundred points) it is usually best to use almost all of this data for both training and validation. Model selection can be done using the model complexity alone in these cases.

For very large data sets (over 1,000 rows) it is usually best to use a smaller fraction of data for training. It is recommended to choose a fraction such that the size of the training data is approximately 10,000 rows or less. Then, use all the remaining data for validation.
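
As a rough sketch of this rule of thumb, the training fraction can be derived from the row count. The helper name `training_fraction` and its `max_training_rows` parameter are illustrative only and are not Eureqa settings:

```python
def training_fraction(total_rows, max_training_rows=10_000):
    """Return the fraction of rows to train on so that the training
    set holds at most max_training_rows rows (hypothetical helper)."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return min(1.0, max_training_rows / total_rows)


# A 250,000-row data set would train on 4% and validate on the other 96%.
fraction = training_fraction(250_000)
print(f"train on {fraction:.0%}, validate on {1 - fraction:.0%}")
```

You would then enter the resulting percentage by hand in the validation data settings.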

Finally, you can also tell Eureqa whether to randomly shuffle the data before splitting. One reason to disable the shuffling is if you want to choose specific rows at the end of the data set to use for validation.
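
To make these settings concrete, here is a minimal sketch of such a split in Python. It is not Eureqa's implementation: the function `split_rows` is hypothetical, and it assumes that training takes rows from the start of the (optionally shuffled) data while validation takes rows from the end.

```python
import random


def split_rows(n_rows, train_fraction, validation_fraction,
               shuffle=True, seed=0):
    """Return (train_indices, validation_indices) for n_rows rows.

    Assumption: training takes the first train_fraction of the
    (optionally shuffled) rows and validation takes the last
    validation_fraction, so the two sets overlap whenever the
    fractions sum to more than 1.
    """
    order = list(range(n_rows))
    if shuffle:
        random.Random(seed).shuffle(order)
    n_train = round(n_rows * train_fraction)
    n_validation = round(n_rows * validation_fraction)
    return order[:n_train], order[n_rows - n_validation:]


# Shuffling disabled: the last rows of the data set become validation data.
train, validation = split_rows(8, 0.75, 0.25, shuffle=False)
print(train)       # [0, 1, 2, 3, 4, 5]
print(validation)  # [6, 7]
```

With fractions that sum to 1 or less the two sets are mutually exclusive; setting both to 1.0 uses every row for both training and validation.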

Using Validation Data to Test Extrapolating Future Values

If you are using time-series data and are trying to predict future time-series values, you may want to create a validation data split that emphasizes the ability of the models to predict future values that were not used for optimizing the model directly.

To do this, you need to disable the random shuffling in the Options | Advanced Genetic Programming Options menu, and optionally make the training and validation data sets mutually exclusive (as shown in the options above). For example, you could set the first 75% of the data to be used for training, and the last 25% to be used for validation. After starting the search, you will see your data split like below:


Now, the list of best solutions will be filtered by their ability to predict only future values - the last rows in the data set which were not used to optimize the models directly. 
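
The idea of scoring a model only on future rows can be sketched outside Eureqa as well. In the example below, a straight-line fit stands in for a candidate model (Eureqa searches over far richer models), and the data values are made up for illustration:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, xs, ys):
    """Mean absolute error of the line model on the given rows."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y)
               for x, y in zip(xs, ys)) / len(xs)


# Toy time series (made-up values), already in time order.
t = list(range(8))
y = [1.0, 3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0]

cut = int(len(t) * 0.75)  # first 75% trains, last 25% validates
model = fit_line(t[:cut], y[:cut])
future_error = mean_abs_error(model, t[cut:], y[cut:])
print(f"validation error on future rows: {future_error:.3f}")
```

Because the last rows never influence the fit, a low error here suggests the model extrapolates forward in time rather than merely interpolating.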

Topics: Preparing Data, Advanced Techniques, Eureqa, Tutorial

Working with discontinuous data in Eureqa

Posted by Michael Schmidt

6/28/13 4:02 PM

Several features in Eureqa assume that your data is one continuous series of points by default, such as the smoothing features and numerical derivative operators. This post shows how to tell Eureqa that there are breaks in the data.

Entering discontinuity with a blank row:

Topics: Preparing Data, Advanced Techniques, Eureqa, Tutorial

Using date and time variables

Posted by Michael Schmidt

6/28/13 4:01 PM

This post describes the best way to convert date or time values into numeric time values that can be used in Eureqa.

Time Values in Eureqa:
Eureqa can only store date and time values as numeric values (e.g. total seconds or total days). Therefore, you need to pick a reference point to measure a time duration from, and units to measure the time duration.

For example, you could convert a time value "8:31 am" to 8.52 total hours since midnight. Similarly for dates, you could convert a date like "Dec. 6, 1981 8:31 am" to 81.9 total years since 1900.
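
The two conversions above can be reproduced with Python's standard `datetime` module. This is only an illustration of the arithmetic; the 365.25-day average year is an assumption, and other year-length conventions shift the result slightly:

```python
from datetime import datetime

stamp = datetime(1981, 12, 6, 8, 31)  # "Dec. 6, 1981 8:31 am"

# Hours since midnight of the same day.
midnight = stamp.replace(hour=0, minute=0, second=0, microsecond=0)
hours_since_midnight = (stamp - midnight).total_seconds() / 3600

# Years since the start of 1900, assuming an average year of 365.25 days.
reference = datetime(1900, 1, 1)
years_since_1900 = (stamp - reference).total_seconds() / (365.25 * 86400)

print(round(hours_since_midnight, 2))  # 8.52
print(round(years_since_1900, 1))      # 81.9
```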

You need to convert date and time values to numeric duration values in another program, such as Excel, before entering them into Eureqa (see below for an example).

Pitfalls:

1) Do not concatenate date and time strings to get a numeric value. For example, do not convert a date like "1981-12-06" to 19811206. This representation of time is extremely nonlinear. It can preserve order, but has lost all meaning. Additionally, the values are very large and numerically unstable.

2) Avoid measuring time durations from a very distant reference point. For example, if your data uses time values that span a few days, do not convert these time values to total seconds since the beginning of the century. The numeric values would be enormous and numerically unstable.

Instead, the best practice is to measure each time duration from the first time point in your data set.
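
A short Python sketch shows why concatenated dates are nonlinear. The three dates are made up for illustration and are exactly one day apart:

```python
from datetime import date

days = [date(1981, 12, 30), date(1981, 12, 31), date(1982, 1, 1)]

# Pitfall 1: digits glued together, e.g. 1981-12-31 -> 19811231.
concatenated = [d.year * 10000 + d.month * 100 + d.day for d in days]

# Best practice: days elapsed since the first time point in the data.
elapsed = [(d - days[0]).days for d in days]

# Step between consecutive rows; each real step is exactly one day.
print([b - a for a, b in zip(concatenated, concatenated[1:])])  # [1, 8870]
print([b - a for a, b in zip(elapsed, elapsed[1:])])            # [1, 1]
```

The concatenated values jump by 8870 across the year boundary even though only one day has passed, while the elapsed-day values stay small and evenly spaced.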

Convert in Excel:

Many programs can convert date and time values to numeric time duration values. In Excel, if you subtract two date cells, the result is the fractional number of days between the two dates. You could then convert days to hours or some other unit to get numeric values with reasonable numeric magnitudes. For example:
    =(A1-A$1)*24

and then repeated for all rows, would subtract the first date in cell A1 and convert the resulting day values into hours.

Another useful function is YEARFRAC, which converts the difference between a date and a reference date into the fractional number of years between them. For example:
    =YEARFRAC(A$1, A1)

and repeated for all rows, returns the fractional number of years elapsed since the date in cell A1.
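
If you prepare your data with a script rather than a spreadsheet, the same subtraction can be sketched in Python. The timestamps and the format string below are made up for illustration:

```python
from datetime import datetime

# Made-up timestamps standing in for a date column.
raw = ["2013-06-28 16:00", "2013-06-28 16:30", "2013-06-29 04:00"]
stamps = [datetime.strptime(s, "%Y-%m-%d %H:%M") for s in raw]

# Same idea as the spreadsheet subtraction: take the first timestamp
# away from every row, then express the difference in hours.
first = stamps[0]
hours = [(s - first).total_seconds() / 3600 for s in stamps]
print(hours)  # [0.0, 0.5, 12.0]
```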

Topics: Preparing Data, Advanced Techniques, Eureqa, Techniques, Time Series

Normalizing data variables in Eureqa

Posted by Michael Schmidt

6/28/13 4:00 PM

While normalizing your data variables (rescaling the numeric values) is completely optional, it can greatly improve the performance of Eureqa, and numerical stability of solutions. This post discusses when and how to normalize variables in your data.

When to Normalize:

Eureqa works best when all variables in your data have small to medium magnitudes, on the order of 1 to 100. For example, if you have any variable that ranges over a million, it would be best to rescale the values to larger units.

Additionally, the amount by which a variable varies should be similar in magnitude to its mean or offset. For example, if you have a variable that only varies between 100.0 and 100.5, it would be best to subtract off 100 so that it ranges between 0 and 0.5.
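
These two rules of thumb can be turned into a quick pre-check before loading data. The function below is a hypothetical sketch; its thresholds (`max_magnitude`, `offset_ratio`) are illustrative assumptions and are not the checks Eureqa itself performs:

```python
def normalization_hints(values, max_magnitude=100.0, offset_ratio=10.0):
    """Return rule-of-thumb hints for one variable (hypothetical helper)."""
    spread = max(values) - min(values)
    mean = sum(values) / len(values)
    hints = []
    # The mean dwarfs the variation: measure from an offset instead.
    if abs(mean) > offset_ratio * spread:
        hints.append("subtract an offset")
    # The variation itself is far above the 1-to-100 range.
    if spread > max_magnitude:
        hints.append("rescale to larger units")
    return hints


print(normalization_hints([100.0, 100.2, 100.5]))          # ['subtract an offset']
print(normalization_hints([0.0, 400_000.0, 1_000_000.0]))  # ['rescale to larger units']
print(normalization_hints([2.0, 35.0, 80.0]))              # []
```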

For example, consider the following two variables in some data set:
Notice that both variables look rather flat. You can't see any interesting variation because the variable a has such a large offset. Do variables in your data look like this? Let's try subtracting off an offset of 10,000 from a:
Now, we can see some interesting variation in the variable a, but the variable b still looks flat because a still has a large magnitude relative to b. Next, let's try dividing the values of a by 50:
Now we can see the interesting variation in both variables, as they now have the same relative scale and magnitudes. This is ideally how we want our data to look before entering it into Eureqa. When the variables are reasonably scaled, Eureqa is most likely to utilize their variation to build accurate solutions.
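
The same two steps can be followed numerically. The values of a below are made up to mimic a variable with a large offset near 10,000:

```python
# Made-up values for the variable a: a large offset near 10,000
# and swings of a few hundred around it.
a = [10_000.0, 10_150.0, 9_900.0, 10_250.0, 9_850.0]

offset, scale = 10_000.0, 50.0  # subtract 10,000, then divide by 50
a_normalized = [(value - offset) / scale for value in a]
print(a_normalized)  # [0.0, 3.0, -2.0, 5.0, -3.0]
```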

How to Normalize a Variable:

First, consider changing the units of the data you enter into Eureqa. Could you measure values in meters instead of centimeters? Could you measure currency in millions-of-dollars instead of dollars? Pick units such that the numeric values have a range of approximately 1 to 100.

Second, consider measuring values from an offset. Could you measure time since the time of your first data point, instead of since the beginning of the year or century?

Third, check over your data; look for outliers. Are there any values that are drastically out of proportion with the rest of the values? If so, consider removing this entire row in your data set or giving it a very low weight.

The general formula for normalizing a variable y is:

    y_normalized = (y - offset)/scale

where offset and scale are the normalization parameters. It's recommended that you pick offset and scale manually, so that the numeric values still have an intuitive meaning. However, if you truly don't care what the numeric values mean, a common approach is to set offset equal to the mean of the variable and scale to the standard deviation of the variable.
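
As a minimal sketch, the formula can be applied in Python as follows. The function name `normalize` is illustrative, and falling back to the sample standard deviation (rather than the population one) is an assumption:

```python
import statistics


def normalize(values, offset=None, scale=None):
    """Apply y_normalized = (y - offset) / scale to every value.

    If offset or scale is omitted, fall back to the mean and the
    sample standard deviation of the values.
    """
    if offset is None:
        offset = statistics.mean(values)
    if scale is None:
        scale = statistics.stdev(values)
    if scale == 0:
        raise ValueError("scale must be nonzero")
    return [(v - offset) / scale for v in values]


y = [998.0, 1000.0, 1002.0]
print(normalize(y, offset=1000.0, scale=1.0))  # [-2.0, 0.0, 2.0]
print(normalize(y))                            # [-1.0, 0.0, 1.0]
```

The first call keeps an intuitive meaning (units above or below 1000), while the second trades that meaning for a fully automatic choice.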

It's also recommended that you apply normalization before entering your data into Eureqa. However, you can specify the normalization in the Eureqa Search Relation. For example, consider the search relation:

    y = f( x/1000 )

This tells Eureqa to find a model of y as a function of values of x that are divided by 1000.

Automatic Normalization Checks:

By default, Eureqa will check your data for extreme cases of variables that need to be normalized. When entering or modifying values in the Eureqa data view, you may encounter a message like this:


Here, Eureqa is telling you that the variable y has a large offset. It has a mean value of about 1000, but it only varies by +/- 1.38. Eureqa suggests subtracting 996 from each y value in your data set, but leaving the scale unchanged.

You can also modify this and specify what values to apply. Pick a scale and offset that makes sense and preserves meaning.

Topics: Preparing Data, Eureqa, Tutorial, Techniques, Normalize Data in Eureqa
