Setting and using validation data with Eureqa

Posted by Michael Schmidt

28.06.2013 04:02 PM

Eureqa automatically splits your data into two groups: a training data set and a validation data set. The training data is used to optimize models, whereas the validation data is used to test how well models generalize to new data. Eureqa also uses the validation data to select the best models to display in the Eureqa user interface. This post describes how to use and control these data sets in Eureqa.

Default Splitting
By default, Eureqa will randomly shuffle your data and then split it into training and validation data sets based on the total size of your data. Eureqa will color these points differently in the user interface, and will also provide statistics for each set when displaying stats, for example:

[Image: Eureqa by Nutonian]

All other error metrics shown in Eureqa, like the “Fit” column and “Error” shown in the Accuracy/Complexity plot, use the metric calculated with the validation data set.
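
The default shuffle-then-split behavior described above can be sketched in a few lines of Python. This is only an illustration, not Eureqa's implementation: the function name, the 50% training fraction, and the fixed seed are assumptions made for the example, since Eureqa chooses its own split based on the size of the data.

```python
import random

def shuffle_and_split(rows, train_fraction=0.5, seed=0):
    """Shuffle a copy of rows, then split it into (training, validation).

    train_fraction and seed are illustrative assumptions; Eureqa picks
    its own default split based on the total size of the data.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # leave the caller's data untouched
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, validation = shuffle_and_split(range(10))
```

Every row ends up in exactly one of the two sets, and because the rows are shuffled first, neither set is biased toward the start or end of the original data.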

Validation Data Settings

You can modify how Eureqa chooses the training and validation data sets in the Options | Advanced Genetic Program Settings menu, shown below:

[Image: Eureqa Validation Data Settings]

Here you can change the portion of the data that is used for the training data, and the portion that goes into the validation data. The two sets are allowed to overlap, but can also be set to be mutually exclusive as shown above.
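
To make the difference between overlapping and mutually exclusive sets concrete, here is a minimal Python sketch. It assumes the training portion is taken from the start of the data and the validation portion from the end; `split_by_portion` and its parameters are hypothetical names for this example, not Eureqa settings.

```python
def split_by_portion(rows, train_pct, validation_pct):
    """Take the first train_pct% of rows for training and the last
    validation_pct% for validation.

    The two sets overlap whenever the percentages sum to more than 100,
    and are mutually exclusive otherwise.
    """
    rows = list(rows)
    n = len(rows)
    n_train = n * train_pct // 100
    n_validation = n * validation_pct // 100
    return rows[:n_train], rows[n - n_validation:]

rows = list(range(100))

# Mutually exclusive: 75% + 25% covers the data with no shared rows.
exclusive_train, exclusive_val = split_by_portion(rows, 75, 25)

# Fully overlapping: every row is used for both training and validation.
overlap_train, overlap_val = split_by_portion(rows, 100, 100)
```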

For very small data sets (under a few hundred points) it is usually best to use almost all of the data for both training and validation. In these cases, model selection can be done using model complexity alone.

For very large data sets (over 1,000 rows) it is usually best to use a smaller fraction of the data for training. Choose a fraction such that the training data contains approximately 10,000 rows or fewer, then use all the remaining data for validation.
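
As a worked example of that rule of thumb, the following sketch computes a training fraction that caps the training set at 10,000 rows. The function name and the 250,000-row data set are made up for illustration.

```python
def training_split_sizes(total_rows, max_training_rows=10_000):
    """Return (training_rows, validation_rows, training_fraction) so that
    the training set never exceeds max_training_rows."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    training_rows = min(total_rows, max_training_rows)
    validation_rows = total_rows - training_rows
    return training_rows, validation_rows, training_rows / total_rows

# Hypothetical data set with 250,000 rows.
training_rows, validation_rows, fraction = training_split_sizes(250_000)
```

For 250,000 rows this gives a training fraction of 4% (10,000 rows), leaving the other 240,000 rows for validation.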

Finally, you can also tell Eureqa whether to randomly shuffle the data before splitting. One reason to disable the shuffling is if you want to choose specific rows at the end of the data set to use for validation.

Using Validation Data to Test Extrapolating Future Values

If you are using time-series data and are trying to predict future time-series values, you may want to create a validation data split that emphasizes the ability of the models to predict future values that were not used for optimizing the model directly.

To do this, you need to disable the random shuffling in the Options | Advanced Genetic Programming Options menu, and optionally make the training and validation data sets mutually exclusive (as shown in the options above). For example, you could set the first 75% of the data to be used for training, and the last 25% to be used for validation. After starting the search, you will see your data split as shown below:

[Image: Eureqa Model Prediction]

Now, the list of best solutions will be filtered by their ability to predict only future values – the last rows in the data set which were not used to optimize the models directly.
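
The idea of scoring models only on the unseen tail of a time series can be sketched outside Eureqa as well. The example below is a minimal illustration under stated assumptions: it uses a simple least-squares line as a stand-in for a model (Eureqa searches over far richer model forms), a synthetic series, and hypothetical helper names.

```python
def chronological_split(rows, train_fraction=0.75):
    """Split without shuffling: first part trains, the tail validates."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def fit_line(points):
    """Ordinary least-squares fit of y = slope * t + intercept."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_t

def mean_abs_error(model, points):
    """Mean absolute error of the fitted line on the given points."""
    slope, intercept = model
    return sum(abs(y - (slope * t + intercept)) for t, y in points) / len(points)

# Synthetic time series (t, y), ordered by time.
series = [(t, 2.0 * t + 1.0) for t in range(20)]

train, validation = chronological_split(series)   # first 75%, last 25%
model = fit_line(train)                           # fitted on past values only
validation_error = mean_abs_error(model, validation)  # scored on future values
```

Because the validation rows all come after the training rows in time, a low validation error here reflects the model's ability to extrapolate rather than to interpolate between points it has effectively already seen.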

Topics: Advanced Techniques, Eureqa, Preparing Data, Tutorial

One response to “Setting and using validation data with Eureqa”

  1. Jamal Zaherpour says:

    Hi Dear Michael,

    As a PhD student, I have been using the academic version of Eureqa for about a year and a half. But I have not seen some options, like stats for both the training and validation sets, or Advanced Genetic Programming Options in the Options tab. I would be thankful if you could let me know whether these were options in an old version, or how to activate them.

    Kind Regards,
    Jamal
