By default, Eureqa randomly shuffles your data and then splits it into training and validation data sets, with the split chosen based on the total size of your data. Eureqa colors the two sets of points differently in the user interface and reports statistics for each set separately, for example:
All other error metrics shown in Eureqa, such as the “Fit” column and the “Error” shown in the Accuracy/Complexity plot, are calculated on the validation data set.
Validation Data Settings
You can modify how Eureqa chooses the training and validation data sets in the Options | Advanced Genetic Program Settings menu, shown below:
Here you can change the portion of the data that is used for the training data, and the portion that goes into the validation data. The two sets are allowed to overlap, but can also be set to be mutually exclusive as shown above.
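To make the difference between overlapping and mutually exclusive portions concrete, here is a minimal Python sketch. It is not Eureqa's implementation: the function name `split_indices` and the assumption that training rows are taken from the start of the data and validation rows from the end are illustrative only.

```python
def split_indices(n_rows, train_frac, val_frac):
    """Return (train, validation) lists of row indices.

    Assumption (not confirmed by Eureqa): training rows come from the
    start of the data and validation rows from the end, so the two sets
    overlap whenever train_frac + val_frac > 1.
    """
    if not (0 < train_frac <= 1 and 0 < val_frac <= 1):
        raise ValueError("fractions must be in the interval (0, 1]")
    n_train = round(n_rows * train_frac)
    n_val = round(n_rows * val_frac)
    train = list(range(0, n_train))
    validation = list(range(n_rows - n_val, n_rows))
    return train, validation


def shared_rows(train, validation):
    """Rows that appear in both sets (empty when mutually exclusive)."""
    return sorted(set(train) & set(validation))


# Mutually exclusive: 75% training, 25% validation.
train, validation = split_indices(100, 0.75, 0.25)
exclusive_overlap = shared_rows(train, validation)

# Overlapping: 80% training, 50% validation.
train2, validation2 = split_indices(100, 0.80, 0.50)
overlapping_rows = shared_rows(train2, validation2)
```

With 100 rows, the first pair of portions shares no rows, while the second pair shares every row that falls in both the first 80% and the last 50% of the data.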
For very small data sets (under a few hundred points) it is usually best to use almost all of the data for both training and validation. In these cases, model selection can be done using the model complexity alone.
For very large data sets (over 1,000 rows) it is usually best to use a smaller fraction of the data for training. It is recommended to choose a fraction such that the training data contains approximately 10,000 rows or fewer. Then, use all the remaining data for validation.
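The arithmetic behind the large-data guideline can be sketched in a few lines of Python. The helper name `suggest_split` and its `max_train_rows` parameter are hypothetical; the sketch only turns the "about 10,000 training rows, remainder for validation" rule of thumb into fractions you could enter in the settings.

```python
def suggest_split(n_rows, max_train_rows=10_000):
    """Suggest training/validation sizes for a large data set.

    Hypothetical helper: caps the training set at max_train_rows and
    assigns every remaining row to validation. For data sets at or below
    the cap, no rows remain, and the overlapping-portions advice for
    smaller data sets applies instead.
    """
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    n_train = min(n_rows, max_train_rows)
    n_val = n_rows - n_train
    return {
        "train_rows": n_train,
        "validation_rows": n_val,
        "train_fraction": n_train / n_rows,
        "validation_fraction": n_val / n_rows,
    }


# Example: a data set with 50,000 rows.
plan = suggest_split(50_000)
print(plan["train_rows"], plan["validation_rows"])
```

For 50,000 rows this keeps 10,000 rows for training and leaves the other 40,000 for validation.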
Finally, you can also choose whether Eureqa randomly shuffles the data before splitting. One reason to disable shuffling is to reserve specific rows at the end of the data set for validation.
Using Validation Data to Test Extrapolating Future Values
If you are using time-series data and are trying to predict future time-series values, you may want to create a validation data split that emphasizes the ability of the models to predict future values that were not used for optimizing the model directly.
To do this, disable the random shuffling in the Options | Advanced Genetic Program Settings menu, and optionally make the training and validation data sets mutually exclusive (as shown in the options above). For example, you could set the first 75% of the data to be used for training, and the last 25% to be used for validation. After starting the search, you will see your data split as shown below:
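Independently of Eureqa's interface, the ordered 75%/25% split described here can be illustrated with a short Python sketch. The function name `chronological_split` is hypothetical, and the sketch assumes the rows are already sorted from oldest to newest.

```python
def chronological_split(rows, train_frac=0.75):
    """Split time-ordered rows without shuffling.

    Assumes rows are sorted from oldest to newest. The first train_frac
    of the rows become training data and the rest become validation
    data, so every validation row is later than every training row.
    """
    if not 0 < train_frac < 1:
        raise ValueError("train_frac must be strictly between 0 and 1")
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]


# Example: eight time steps, already in chronological order.
time_steps = list(range(8))
train, validation = chronological_split(time_steps)
print(train, validation)
```

Because the rows are never shuffled, the validation set contains only the most recent observations, which is what makes its error a test of extrapolation rather than interpolation.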