Investors use predicted bond trade prices to inform their trading decisions throughout the day. While bond yield curves can be used to help make decisions, there are many other factors that could help predict a bond’s next trading price.
In this tutorial, we’ll review how Eureqa and the power of symbolic regression can be used to predict the next trading price of a US corporate bond. We use bond price data provided through Benchmark Solutions and Kaggle.com, which includes variables such as current coupon, time to maturity, and details of the previous 10 trades, among others.
The original competition and data, hosted by Kaggle, can be found here.
Understanding the Data
After downloading and viewing the data, we see that this dataset comprises 61 columns of parameters. These parameters include the row ID (can be used for time series analysis), the bond ID (there is data for almost 8,000 different bonds), current coupon, previous trade prices, and more. The most important column for us is the trade_price column, as this is the value that we are trying to solve for.
This dataset also includes over 700,000 rows of data. For this tutorial, we’re only going to take the first 200,000 rows of data for learning the model. Later, you can use the rest of this data for more training or validation, but let’s stick with 200,000 for now. To do this, you can extract the rows you want manually from the data, or create a new, smaller file by using the command line to run the following command: head –n 200000 originalFileNameAndLocation.csv > newFileName.csv.
For more details, please see a previous tutorial that covered this step.
Preparing the Data
Once you’ve imported the data and confirmed that it looks as expected in the Enter Data tab, let’s move on to the Prepare Data tab. This tab has options for you to do further pre-processing with your data, such as handling missing values or smoothing the data points. For this initial exploration, we will not choose any of those options, but you can return to these later to improve on the performance of your model. For more information on how to take advantage of these options, see our tutorial on preparing data in Eureqa.
Finally, let’s give our search a target expression and choose its building blocks. Since we want to solve for the trade price of a particular bond given the other variables, the target expression should be set so that the trade_price variable is modeled as a function of all other variables:
trade_price = f(weight, current_coupon, time_to_maturity, …, curve_based_price_last10)
With regards to the building blocks, they are used to define the mathematical equation types that Eureqa will attempt to combine in your final model. We prefer using fewer building blocks initially to speed up the search, then later expanding the number of building blocks to add in more sophistication to subsequent models. In this case, we are only going to leave the basic building blocks checked and uncheck the two trigonometry building blocks.
For this dataset, we will leave all other options set to their defaults. Eureqa includes many other options to further refine your search which we don’t need to use for this example, but can be very useful in more targeted searches. If you want to read more about your options on setting search targets, please see our tutorial on setting search targets in Eureqa.
Interpreting the Results
At this point, we are finally ready to move on to the Start Search tab and begin to run the formula search. You can run your search on your local computer, just using the cores you have on your laptop or desktop, or you can speed up your searches by using either your own dedicated private cloud or leveraging the cloud with Amazon EC2. For this search, we ran it on 72 cloud cores for just 20 minutes.
Eureqa gives us a few different methods for assessing the progress of your search. On the Start Search tab, you can monitor both the Confidence metrics in the Progress and performance view as well as the Progress over time chart. In conjunction with those two methods, the Pareto front display gives another visual indication of the performance of the generated equations.
In our case, Eureqa went through nearly 250,000 generations of equations in just 20 minutes, resulting in 11 equations. The top 4 most accurate equations (as judged by the Pareto front display) differ widely in terms of complexity, ranging from 14 terms used to 20 terms. The remaining, simpler models show a steep decrease in accuracy, but it is up to you to determine the correct trade off of simplicity and accuracy.
The current most accurate solution has a 0.547 mean absolute error, signifying that this model can predict the future trading price of a bond with an average error of only $0.55. Using a less complex model with 20% fewer terms gives us a 0.554 mean absolute error. Given that the average future trading price among the entire training dataset is $105, having an average error of only $0.54 or $0.55 shows that both formulas model that data very closely. In this case, trading 20% fewer terms for only a 1% difference in accuracy seems like the ideal tradeoff in this scenario.
trade_price = 0.6964*trade_price_last1 + 0.3026*curve_based_price + 0.1059/(trade_type – 2.759)
In just 20 minutes we were able to discover a formula to predict the next price that a corporate bond will trade at, with a mean absolute error of only 0.554. In addition to just making pure predictions, this formula found the relationships within the data, allowing us to understand what factors are truly driving these prices. In this example, we found that the last trading price, the curve based price, and the trade type are the factors that are most important to what price the bond will trade at next. Out of the 61 variables that we began with, Eureqa was able to identify the 3 variables that have the most impact on the future trading price.
Throughout this example, we took a variety of shortcuts to reach an initial assessment quickly. Now that we have a sense for what this data has to offer and what we’re looking for, there are many opportunities to expand this model to reach even greater accuracy by doing additional data preparation, choosing different formula building blocks, or even just letting the search run for a longer amount of time. However, it is important to keep in mind that with this first investigation, we were able to quickly get visually intuitive results at a high level of accuracy without any deep technical knowledge.
Ready to try for yourself? Go ahead and download the Eureqa project file to get started.