
# Blog

In my previous post, we walked through the process of using Eureqa to predict the next price a bond will trade at. Starting with a massive spreadsheet with >760,000 rows and 61 columns, we were able to generate 11 equations to describe the data in 20 minutes. While I focused on just one of the equations, there is still more we can learn from Eureqa.

So let’s review the last page we looked at – the View Results tab of Eureqa:

I walked you through my thought process of how I chose a single equation out of the 11 that Eureqa generated. This equation had a size of 14, with only 4 parameters and 3 terms. Of all the equations, it seemed to best balance both accuracy and complexity, being able to predict the next price a bond will trade at within \$0.55 based on only 3 variables. However, there’s far more information here in this tab – what else can we learn?

First, let’s talk a little more about the equation we chose:

When you click on that specific solution, you will see details about that solution directly below. Eureqa provides details on 8 different error metrics for each solution, ranging from Mean Absolute Error to Hybrid Correlation/Error. I used MAE to judge accuracy in this case, but different data sets may require different error metrics.

While I didn’t touch upon R^2 Goodness of Fit in the previous tutorial, it can provide a meaningful way to evaluate your overall search. What this metric helps you understand is how much of the variance in your target variable is captured by each of your solutions. In this case, the R^2 value is telling us that our solution captures 98.9% of the variance in the predicted trade price. With this equation under our belt, let’s dig a little deeper.
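To make the two metrics above concrete, here is a minimal Python sketch of Mean Absolute Error and R^2. It uses the standard textbook definitions, which I am assuming match what Eureqa reports; the function names and the four sample prices are made up for illustration and do not come from the bond dataset.

```python
from statistics import fmean


def mean_absolute_error(actual, predicted):
    """Average absolute gap between observed and predicted values."""
    return fmean(abs(a - p) for a, p in zip(actual, predicted))


def r_squared(actual, predicted):
    """Fraction of the target's variance captured by the predictions."""
    mean_actual = fmean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up prices, for illustration only (not taken from the bond data).
actual = [100.0, 102.0, 104.0, 106.0]
predicted = [100.5, 101.5, 104.5, 105.5]

print(mean_absolute_error(actual, predicted))  # 0.5
print(round(r_squared(actual, predicted), 4))  # 0.95
```

An R^2 of 0.95 here means the predictions account for 95% of the spread in the sample prices, in the same way that our chosen solution accounts for 98.9% of the variance in trade price.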

For more details on all 8 error metrics, please see our tutorial on error metrics.

Even though we chose this specific equation as the best for now, what can the other equations tell us about this data? There are three different ways of ranking solutions – by size only, by fit only, or by a combination of size and fit. The third is what Eureqa defaults to, but you can still find valuable data by ranking by the other two methods.
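If you export the solution list, the three rankings can be approximated outside Eureqa. The sketch below is only an illustration: the `Solution` type and the size/error values are hypothetical, and a simple Pareto filter stands in for the combined size-and-fit ranking, since Eureqa's exact ranking rule is not described here.

```python
from typing import NamedTuple


class Solution(NamedTuple):
    size: int     # complexity score of the equation
    error: float  # fit, e.g. mean absolute error (lower is better)


def pareto_front(solutions):
    """Keep solutions that no smaller-or-equal solution matches or beats on error."""
    front = []
    best_error = float("inf")
    for s in sorted(solutions, key=lambda s: (s.size, s.error)):
        if s.error < best_error:
            front.append(s)
            best_error = s.error
    return front


# Hypothetical values, not the results of the bond search.
candidates = [
    Solution(1, 0.90),
    Solution(3, 0.70),
    Solution(5, 0.75),
    Solution(8, 0.60),
    Solution(9, 0.60),
]

by_size = sorted(candidates, key=lambda s: s.size)   # simplest first
by_fit = sorted(candidates, key=lambda s: s.error)   # most accurate first
front = pareto_front(candidates)

print([s.size for s in front])  # [1, 3, 8]
```

The size-5 and size-9 candidates drop out because a simpler candidate already fits at least as well, which is the same idea the Pareto front display conveys visually.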

Specifically, let’s look at what happens when you rank by size, looking at the simplest solutions first. By doing this, you can see which single variable Eureqa believes to be the most crucial to understanding the target variable. Then going through each successively more complicated solution, you can see which other variables begin appearing in what order. The simplest solution here is just:

When you look at the R^2 value for this solution, it actually shows us that this one variable captures 98.4% of the variance of the target variable. What does this mean for us? While we can (and did) find more sophisticated models that get us closer to modeling the future trade price, the last price that the bond traded at is by far the best indicator of the future price.

Finally, let’s focus on this trade_price_last1 variable. As we just discovered, it captures 98% of the variance in our target variable – trade_price. It could be interesting to look at what drives differences between the two variables – and Eureqa lets us do that extremely easily. All we need to do is set up a new search, and modify the target expression to find the difference between trade_price and trade_price_last1, as modeled by the rest of the variables:

After running this for almost 7 hours on 72 cores, the most accurate solution I could generate was:

As you can see from the Pareto front display, solutions with much more complexity are being introduced. Keeping in mind that the average difference between trade price and the last price is actually 0.607, our most accurate equation here has a 0.52 MAE. While this solution is the most accurate, you can choose for yourself which solution has a better balance of accuracy and complexity, such as the one with an equation size of 13, using only 2 parameters. Additionally, doing more pre-processing on the dataset or choosing different building blocks can lead you to improved searches.
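To put those two numbers side by side, here is a small Python sketch of the difference target and its baseline. It assumes the 0.607 figure is the mean absolute difference between trade_price and trade_price_last1, which is the error you would get by always predicting "no change". The helper names and the three sample rows are made up for illustration.

```python
from statistics import fmean


def difference_target(rows):
    """New target: how far each trade moved from the previous trade price."""
    return [row["trade_price"] - row["trade_price_last1"] for row in rows]


def baseline_mae(differences):
    """MAE of always predicting 'no change' from the last trade price."""
    return fmean(abs(d) for d in differences)


def relative_improvement(baseline, model_mae):
    """Share of the baseline error that the model removes."""
    return (baseline - model_mae) / baseline


# Made-up rows, for illustration only.
rows = [
    {"trade_price": 101.0, "trade_price_last1": 100.0},
    {"trade_price": 99.5, "trade_price_last1": 100.0},
    {"trade_price": 100.0, "trade_price_last1": 100.0},
]
diffs = difference_target(rows)
print(baseline_mae(diffs))  # 0.5

# Figures quoted in this post: baseline 0.607, best model MAE 0.52.
print(round(relative_improvement(0.607, 0.52), 3))  # 0.143
```

Under that assumption, the most accurate equation removes roughly 14% of the error left over after using the last trade price alone.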

Last week, it was all about showing you how easy and intuitive it is to use Eureqa to quickly come up with incredibly accurate results. Today, I hope I was able to show you some of the hidden power behind Eureqa that allows you to accomplish far more.

Of course, this is still only the tip of the iceberg of Eureqa’s abilities. Using the fxp file I posted last time, go ahead and try it yourself! If you run into any questions, check out our user guides and tutorials, or come visit our forums and see what questions others have asked!

Happy modeling!

Jess


Investors use predicted bond trade prices to inform their trading decisions throughout the day. While bond yield curves can be used to help make decisions, there are many other factors that could help predict a bond’s next trading price.

In this tutorial, we’ll review how Eureqa and the power of symbolic regression can be used to predict the next trading price of a US corporate bond. We use bond price data provided through Benchmark Solutions and Kaggle.com, which includes variables such as current coupon, time to maturity, and details of the previous 10 trades, among others.

The original competition and data, hosted by Kaggle, can be found here.

Understanding the Data

After downloading and viewing the data, we see that this dataset comprises 61 columns of parameters. These parameters include the row ID (can be used for time series analysis), the bond ID (there is data for almost 8,000 different bonds), current coupon, previous trade prices, and more. The most important column for us is the trade_price column, as this is the value that we are trying to solve for.

This dataset also includes over 700,000 rows of data. For this tutorial, we’re only going to take the first 200,000 rows of data for learning the model. Later, you can use the rest of this data for more training or validation, but let’s stick with 200,000 for now. To do this, you can extract the rows you want manually from the data, or create a new, smaller file by running the following command on the command line: `head -n 200000 originalFileNameAndLocation.csv > newFileName.csv`.

For more details, please see a previous tutorial that covered this step.
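If the `head` command is not available on your system (for example, on a default Windows setup), the same subset can be produced with a few lines of Python. This is a minimal sketch; the function name is my own, and like `head`, it counts the header row as one of the copied lines.

```python
from itertools import islice


def copy_first_lines(source_path, dest_path, line_count=200_000):
    """Copy the first `line_count` lines (header included), like `head -n`."""
    # newline="" keeps the file's original line endings untouched.
    with open(source_path, "r", encoding="utf-8", newline="") as src, \
         open(dest_path, "w", encoding="utf-8", newline="") as dst:
        dst.writelines(islice(src, line_count))
```

Calling `copy_first_lines("originalFileNameAndLocation.csv", "newFileName.csv")` would mirror the command above. The file is read line by line, so the full dataset never has to fit in memory.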

Preparing the Data

Once you’ve imported the data and confirmed that it looks as expected in the Enter Data tab, let’s move on to the Prepare Data tab. This tab has options for you to do further pre-processing with your data, such as handling missing values or smoothing the data points. For this initial exploration, we will not choose any of those options, but you can return to these later to improve on the performance of your model. For more information on how to take advantage of these options, see our tutorial on preparing data in Eureqa.

Finally, let’s give our search a target expression and choose its building blocks. Since we want to solve for the trade price of a particular bond given the other variables, the target expression should be set so that the trade_price variable is modeled as a function of all other variables:

trade_price = f(weight, current_coupon, time_to_maturity, …, curve_based_price_last10)

The building blocks define the mathematical operations that Eureqa will attempt to combine in your final model. We prefer using fewer building blocks initially to speed up the search, then later expanding the number of building blocks to add more sophistication to subsequent models. In this case, we are only going to leave the basic building blocks checked and uncheck the two trigonometry building blocks.

For this dataset, we will leave all other options set to their defaults. Eureqa includes many other options to further refine your search which we don’t need to use for this example, but can be very useful in more targeted searches. If you want to read more about your options on setting search targets, please see our tutorial on setting search targets in Eureqa.

Interpreting the Results

At this point, we are finally ready to move on to the Start Search tab and begin to run the formula search. You can run your search on your local computer, just using the cores you have on your laptop or desktop, or you can speed up your searches by using either your own dedicated private cloud or leveraging the cloud with Amazon EC2. For this search, we ran it on 72 cloud cores for just 20 minutes.

Eureqa gives us a few different methods for assessing the progress of your search. On the Start Search tab, you can monitor both the Confidence metrics in the Progress and performance view as well as the Progress over time chart. In conjunction with those two methods, the Pareto front display gives another visual indication of the performance of the generated equations.

In our case, Eureqa went through nearly 250,000 generations of equations in just 20 minutes, resulting in 11 equations. The top 4 most accurate equations (as judged by the Pareto front display) differ widely in terms of complexity, ranging from 14 terms used to 20 terms. The remaining, simpler models show a steep decrease in accuracy, but it is up to you to determine the correct trade off of simplicity and accuracy.

The current most accurate solution has a mean absolute error of 0.547, meaning that this model predicts the future trading price of a bond with an average error of only \$0.55. Using a less complex model with 20% fewer terms gives us a 0.554 mean absolute error. Given that the average future trading price across the entire training dataset is \$105, an average error of roughly \$0.55 shows that both formulas model the data very closely. Trading 20% fewer terms for only a 1% difference in accuracy seems like the ideal tradeoff in this scenario.
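The arithmetic behind that tradeoff can be checked in a few lines of Python. The three input figures are the ones quoted above; the variable and function names are just for illustration.

```python
def relative_change(new, old):
    """Change from `old` to `new`, as a fraction of `old`."""
    return (new - old) / old


most_accurate_mae = 0.547  # most accurate solution
simpler_mae = 0.554        # solution with 20% fewer terms
average_price = 105.0      # average trade price in the training data

accuracy_cost = relative_change(simpler_mae, most_accurate_mae)
error_share = simpler_mae / average_price

print(f"{accuracy_cost:.1%}")  # 1.3%
print(f"{error_share:.2%}")    # 0.53%
```

So the simpler model gives up a little over 1% in accuracy, and its average error is only about half a percent of a typical bond price.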

Summary

In just 20 minutes we were able to discover a formula to predict the next price that a corporate bond will trade at, with a mean absolute error of only 0.554. In addition to just making pure predictions, this formula found the relationships within the data, allowing us to understand what factors are truly driving these prices. In this example, we found that the last trading price, the curve based price, and the trade type are the factors that are most important to what price the bond will trade at next. Out of the 61 variables that we began with, Eureqa was able to identify the 3 variables that have the most impact on the future trading price.

Throughout this example, we took a variety of shortcuts to reach an initial assessment quickly. Now that we have a sense for what this data has to offer and what we’re looking for, there are many opportunities to expand this model to reach even greater accuracy by doing additional data preparation, choosing different formula building blocks, or even just letting the search run for a longer amount of time. However, it is important to keep in mind that with this first investigation, we were able to quickly get visually intuitive results at a high level of accuracy without any deep technical knowledge.

Jess

Topics: Eureqa, Making predictions, Tutorial

We have amazing customers doing even more amazing things with their data. As we hear from our customers with their stories, we will be sharing them with you here on our blog. Hopefully they will help inspire you to think about what more you could be doing with your own data! Contact us if you have a case study using Eureqa you would like to share as well.

Cypress Point Technologies creates tools that allow hedge fund managers to analyze the New York and NASDAQ stock exchanges. Charles Brauer founded Cypress Point Technologies in 2009 after seeing success with his side project, HedgeTools. Two years later, Brauer and his team began to work on expanding HedgeTools’ capabilities with custom models that use early-morning stock data to predict stock price trajectories through the rest of the day. Given the complexity of the task and the number of trends that could turn out to be false positives, it was crucial for the team to develop a sophisticated yet lightning-fast algorithm that was capable of producing accurate predictions.

After seeing a story in Wired Magazine about Eureqa, Brauer decided to download the software and get in contact with Nutonian founder and CEO Michael Schmidt. In his first few trial runs with Eureqa, Brauer took historical, minute-by-minute price data for a batch of stocks, pre-categorized them as “good” or “bad”, and let Eureqa come up with an algorithm that could predict trajectories based on the first 30 minutes of market activity. The results were impressive enough on the initial, minimal dataset, but after adding parameters and refinements, the resulting models have been making increasingly accurate predictions. Most importantly for Brauer, the speed and accuracy of generating and applying models make them easy to deploy within HedgeTools. “I have a test portfolio that’s managed based on recommendations from HedgeTools,” says Brauer, “and it beats the performance of the S&P 500 by a wide margin.”

For more details, read the full case study here!