Binary classification is also one of the most widely studied problems in machine learning, and there are many optimized approaches for prediction (e.g., neural nets and support vector machines). Using Eureqa (or symbolic regression in general) for classification has a few advantages:
- finding models requires less data
- models can extrapolate extremely well
- resulting models are simple to analyze, refit, and reuse
- the structure of the models gives insight into the classification problem
The key to this method is to tell Eureqa to search for equations that tend to be negative when the output is false, and positive when true. We then put solutions inside a step function to obtain outputs of either 1 (true) or 0 (false).
Step 1: Eureqa works with numerical values, so define true outcomes to have value 1, and false outcomes to have value 0. Then enter the boolean variable into Eureqa as a column of 0 and 1 values.
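As a minimal sketch of Step 1, assuming the raw outcomes are available as Python booleans (the variable names and sample values below are hypothetical, not taken from the example data), the 0/1 column could be produced like this:

```python
# Hypothetical raw outcomes; in practice these come from your own data set.
outcomes = [True, False, False, True]

# Map true -> 1 and false -> 0 so the column is purely numerical.
z_column = [1 if outcome else 0 for outcome in outcomes]

print(z_column)
```

The resulting column is what would be entered into Eureqa as the boolean variable.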
Step 2: We want to find a formula that predicts 0 and 1 values. One way to do this is to tell Eureqa to search for an equation that goes inside a step function before comparing with the boolean value. For example, we could enter "z = step(f(x,y))" into the search relationship setting, where z is the boolean value we want to model, x and y are other variables in the data set, and f(x,y) is the formula that Eureqa attempts to find. The step function is a built-in function in Eureqa that outputs 1 if the input is positive, and 0 otherwise. In other words, we are telling Eureqa to find equations that tend to be negative when z is 0 (false), and positive when z is 1 (true).
Step 3: Start a Eureqa search as normal. Eureqa reports equations for f(x,y), the expression inside the step function. To use these solutions to predict the boolean value outside of Eureqa, we need to substitute the formula back into the search relationship. In other words, remember to place the reported solutions back into a step function to obtain the final model.
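The substitution in Step 3 can be sketched outside of Eureqa as follows. This is only an illustration: the step function is re-implemented from the description above (1 for positive input, 0 otherwise), and example_f is a hypothetical stand-in for whatever formula a search reports.

```python
def step(value):
    """Return 1 if value is positive and 0 otherwise, following the description above."""
    return 1 if value > 0 else 0


def make_classifier(f):
    """Place a reported formula f(x, y) back inside a step function."""
    def predict(x, y):
        return step(f(x, y))
    return predict


# Hypothetical formula standing in for a solution reported by a search.
def example_f(x, y):
    return 1.0 - x * x - y * y


predict = make_classifier(example_f)
print(predict(0.0, 0.0))  # example_f is positive here
print(predict(2.0, 0.0))  # example_f is negative here
```

Keeping the wrapper separate from the formula makes it easy to swap in a different reported solution without touching the thresholding logic.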
Let's say we collected the following data, where x and y are two input variables, and z is a boolean outcome that we want to model (red = true, green = false):
We enter the search relationship as "z = step( f(x,y) )":
We then start the Eureqa search. After a few minutes, Eureqa identified a very accurate solution:
f(x,y) = 1.98 + 2.02*x*y - 3.05*y*y - x*x
You may recognize the boundary f(x,y) = 0 as a tilted ellipse, with positive values (true) inside and negative values (false) outside. Plotting this solution on the data makes this clear.
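The reported solution can also be checked numerically. The sketch below wraps it in a step function to form the final model and uses the standard conic discriminant to confirm that the boundary f(x,y) = 0 is an ellipse. The test points are hypothetical and chosen only to show both outcomes; the step function is re-implemented from its description in Step 2.

```python
def step(value):
    """Return 1 for positive input and 0 otherwise."""
    return 1 if value > 0 else 0


def f(x, y):
    """Solution reported in the text."""
    return 1.98 + 2.02 * x * y - 3.05 * y * y - x * x


def predict(x, y):
    """Final model: the reported solution placed back inside the step function."""
    return step(f(x, y))


# The boundary f(x, y) = 0 can be rewritten as A*x^2 + B*x*y + C*y^2 = 1.98.
A, B, C = 1.0, -2.02, 3.05
discriminant = B * B - 4.0 * A * C  # a negative value indicates an ellipse
is_tilted = B != 0                  # a nonzero x*y term rotates the axes

print(discriminant, is_tilted)

# Hypothetical test points, chosen only to illustrate both outcomes.
for x, y in [(0.0, 0.0), (1.0, 0.5), (0.0, 1.0), (3.0, 0.0)]:
    print(x, y, predict(x, y))
```

Points where f is positive fall inside the ellipse and are classified as true; points where f is negative fall outside and are classified as false.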
Another type of squashing function is the logistic function, which varies smoothly between 0 and 1. It provides a better search gradient than the step function, which is flat everywhere except at the jump. For example, we could instead enter the search relationship as:
z = logistic( f(x,y) )
A side effect is that logistic(f(x,y)) can produce intermediate values, such as 0.77 or 0.001, so we need to threshold this value to get final 0 or 1 outputs. A simple way to threshold at 0.5 is to replace the logistic with a step function when making final predictions: logistic(f(x,y)) exceeds 0.5 exactly when f(x,y) is positive.
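To illustrate why swapping the logistic for a step function amounts to thresholding at 0.5, here is a small sketch. It assumes the standard logistic definition 1 / (1 + exp(-v)); the helper names and sample values are hypothetical.

```python
import math


def logistic(value):
    """Standard logistic function (assumed definition): 1 / (1 + exp(-value))."""
    return 1.0 / (1.0 + math.exp(-value))


def step(value):
    """Return 1 for positive input and 0 otherwise."""
    return 1 if value > 0 else 0


def threshold(probability, cutoff=0.5):
    """Turn an intermediate logistic output into a final 0 or 1."""
    return 1 if probability > cutoff else 0


# Hypothetical values of f(x, y); moderate magnitudes keep exp() well within range.
for value in (-3.0, -0.1, 0.0, 0.1, 3.0):
    smooth = logistic(value)
    print(value, round(smooth, 3), threshold(smooth), step(value))
```

For every value, thresholding the logistic output at 0.5 gives the same answer as applying the step function directly to f(x,y), so the formula found with the logistic search relationship can be reused unchanged inside a step function.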