Blog

Nutonian Piercing the Veil of Distortion Over Mission-Critical Images

Posted by Jay Schuren

12.08.2014 10:00 AM

Imaging and advanced characterization are at the heart of industries ranging from aircraft component inspection to medical imaging and CPU manufacturing. As society continues to push the envelope of technological innovation, the demand for the best possible quantitative characterization is ever-present. Distortion, or warping of image data, arises from the imaging equipment itself; familiar examples are fisheye lenses and funhouse mirrors. A little-discussed issue is that changing the instrument settings often changes the distortion – limiting peak characterization performance across fields, reducing accuracy, and escalating the time and effort required for discovery.

The majority of innovation in imaging systems has focused on resolving ever-smaller features, but even expensive high-magnification machines suffer from distortion issues that have gone largely unaddressed. Inspired by work with the Air Force Research Laboratory, Nutonian has developed tools that dynamically eliminate image distortion. Applying Nutonian’s data science automation engine, Eureqa, lets users rapidly identify the specific relationships between instrument settings and environmental conditions and the limiting distortions.


Now, instead of pretending that image distortion either doesn’t exist or remains static across different instrument settings, companies can computationally model, predict and calculate its behavior with a high degree of accuracy based on the measurements they take from an image. Understanding these causal relationships gives users the ability to “unwarp” image distortions, enabling accurate insight and peak performance when it really matters. The implications could mean saved lives and >10x improvements in quantitative measurements for systems such as Scanning Electron Microscopes, realizing peak performance even on outdated equipment.

Whether companies are looking for cracks in aircraft turbine blades or tumors in a mammogram, current limits of characterization systems govern the status quo for early identification. Applying Eureqa to a Scanning Electron Microscope and a mammography detector over a range of conditions resulted in >10x improvements in quantitative measurements. Gain competitive advantage with access to improved detection systems that will save lives, reduce costs and accelerate the development of next generation products.

Topics: Advanced Techniques, Big data, Case study, nutonian, U.S. Air Force

Predicting a Bond’s Next Trade Price with Eureqa: Part 2

Posted by Jess Lin

26.11.2013 10:00 AM

In my previous post, we walked through the process of using Eureqa to predict the next price a bond will trade at. Starting with a massive spreadsheet with >760,000 rows and 61 columns, we were able to generate 11 equations to describe the data in 20 minutes. While I focused on just one of the equations, there is still more we can learn from Eureqa.

So let’s review the last page we looked at – the View Results tab of Eureqa:

View results tab


I walked you through my thought process of how I chose a single equation out of the 11 that Eureqa generated. This equation had a size of 14, with only 4 parameters and 3 terms. Of all the equations, it seemed to best balance both accuracy and complexity, being able to predict the next price a bond will trade at within $0.55 based on only 3 variables. However, there’s far more information here in this tab – what else can we learn?

First, let’s talk a little more about the equation we chose:
trade_price = 0.6964*trade_price_last1 + 0.3026*curve_based_price + 0.1059/(trade_type - 2.759)
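To apply the model outside of Eureqa, the solution can be transcribed into a plain function. A minimal Python sketch (the function name is mine; the coefficients and variable names come from the solution above):

```python
# Hypothetical transcription of the Eureqa solution into plain Python,
# so the model can be applied outside the Eureqa user interface.
def predict_trade_price(trade_price_last1, curve_based_price, trade_type):
    """Predict the next trade price from the three model inputs."""
    return (0.6964 * trade_price_last1
            + 0.3026 * curve_based_price
            + 0.1059 / (trade_type - 2.759))
```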

When you click on that specific solution, you will see details about that solution directly below. Eureqa provides details on 8 different error metrics for each solution, ranging from Mean Absolute Error to Hybrid Correlation/Error. I used MAE to judge accuracy in this case, but different data sets may require different error metrics.

Solution details

While I didn’t touch upon R^2 Goodness of Fit in the previous tutorial, it can provide a meaningful way to evaluate your overall search. What this metric helps you understand is how much of the variance in your target variable is captured by each of your solutions. In this case, the R^2 value is telling us that our solution captures 98.9% of the variance in the predicted trade price. With this equation under our belt, let’s dig a little deeper.
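If you want to reproduce these two numbers outside of Eureqa, both MAE and R^2 are easy to compute from a set of predictions. A small NumPy sketch (the function names are mine):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: the average absolute error of the predictions."""
    return np.mean(np.abs(y_true - y_pred))

def r_squared(y_true, y_pred):
    """R^2 goodness of fit: fraction of variance in y_true captured."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot
```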

For more details on all 8 error metrics, please see our tutorial on error metrics.

Even though we chose this specific equation as the best for now, what can the other equations tell us about this data? There are three different ways of ranking solutions – by size only, by fit only, or by a combination of size and fit. The third is what Eureqa defaults to, but you can still find valuable data by ranking by the other two methods.

Specifically, let’s look at what happens when you rank by size, looking at the simplest solutions first. By doing this, you can see which single variable Eureqa believes to be the most crucial to understanding the target variable. Then going through each successively more complicated solution, you can see which other variables begin appearing in what order. The simplest solution here is just:
trade_price = trade_price_last1

Solution plot


When you look at the R^2 value for this solution, it actually shows us that this one variable captures 98.4% of the variance of the target variable. What does this mean for us? While we can (and did) find more sophisticated models that get us closer to modeling the future trade price, the last price that the bond traded at is by far the best indicator of the future price.

Finally, let’s focus on this trade_price_last1 variable. As we just discovered, it captures 98.4% of the variance in our target variable – trade_price. It could be interesting to look at what drives the differences between the two variables – and Eureqa lets us do that extremely easily. All we need to do is set up a new search and modify the target expression to model the difference between trade_price and trade_price_last1 as a function of the rest of the variables:
trade_price - trade_price_last1 = f(weight, current_coupon, time_to_maturity, …, curve_based_price_last10)
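If you prefer to prepare this residual target outside of Eureqa, it is a single column subtraction. A small pandas sketch (the sample values are made up; the column names match the data set above):

```python
import pandas as pd

# Hypothetical rows matching the bond data set described above.
df = pd.DataFrame({
    "trade_price":       [101.2, 101.5, 101.4],
    "trade_price_last1": [101.0, 101.2, 101.5],
})

# Residual target: how far each trade moved from the previous trade.
df["price_change"] = df["trade_price"] - df["trade_price_last1"]
```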

After running this for almost 7 hours on 72 cores, the most accurate solution I could generate was:
trade_price - trade_price_last1 = (trade_type_last3 + 1.342*time_to_maturity)/(2.819*curve_based_price_last1 - curve_based_price*trade_type_last1) + (trade_type_last3 + 1.342*time_to_maturity)/(trade_type*curve_based_price - 2.819*curve_based_price)

Pareto front display


As you can see from the Pareto front display, solutions with much more complexity are being introduced. Keeping in mind that the average difference between the trade price and the last price is 0.607, our most accurate equation here has an MAE of 0.52. While this solution is the most accurate, you can choose for yourself which solution strikes a better balance of accuracy and complexity, such as the one with an equation size of 13 that uses only 2 parameters. Additionally, doing more pre-processing on the dataset or choosing different building blocks will lead you to improved searches.

Last week, it was all about showing you how easy and intuitive it is to use Eureqa to quickly come up with incredibly accurate results. Today, I hope I was able to show you some of the hidden power behind Eureqa that allows you to accomplish far more.

Of course, this is still only touching the tip of the iceberg of Eureqa’s abilities. Using the fxp file I posted last time, go ahead and try yourself! If you run into any questions, check out our user guides and tutorials, or come visit our forums and see what questions others have asked!

Happy modeling!

Jess

Topics: Advanced Techniques, Eureqa, Making predictions, Tutorial

Setting and using validation data with Eureqa

Posted by Michael Schmidt

28.06.2013 04:02 PM

Eureqa automatically splits your data into two groups: a training data set and a validation data set. The training data is used to optimize models, whereas the validation data is used to test how well models generalize to new data. Eureqa also uses the validation data to filter out the best models to display in the Eureqa user interface. This post describes how to use and control these data sets in Eureqa.

Default Splitting
By default, Eureqa will randomly shuffle your data and then split it into training and validation data sets based on the total size of your data. Eureqa colors these points differently in the user interface and provides statistics for each set, for example:

Eureqa by Nutonian

All other error metrics shown in Eureqa, like the “Fit” column and the “Error” shown in the Accuracy/Complexity plot, are calculated on the validation data set.

Validation Data Settings

You can modify how Eureqa chooses the training and validation data sets in the Options | Advanced Genetic Program Settings menu, shown below:

Eureqa Validation Data Settings

Here you can change the portion of the data that is used for the training data, and the portion that goes into the validation data. The two sets are allowed to overlap, but can also be set to be mutually exclusive as shown above.

For very small data sets (under a few hundred points) it is usually best to use almost all of the data for both training and validation. In these cases, model selection can be done using model complexity alone.

For very large data sets (over 1,000 rows) it is usually best to use a smaller fraction of data for training. It is recommended to choose a fraction such that the size of the training data is approximately 10,000 rows or less. Then, use all the remaining data for validation.

Finally, you can also tell Eureqa to randomly shuffle the data before splitting or not. One reason to disable the shuffling is if you want to choose specific rows at the end of the data set to use for validation.
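The default behavior described above – shuffle the rows, then split by fraction – can be sketched in a few lines of Python. This is an illustration of the idea, not Eureqa’s actual implementation:

```python
import numpy as np

def split_train_validation(n_rows, train_frac=0.75, shuffle=True, seed=0):
    """Shuffle row indices, then split them into mutually exclusive
    training and validation sets (an illustration of Eureqa's default)."""
    idx = np.arange(n_rows)
    if shuffle:
        np.random.default_rng(seed).shuffle(idx)
    n_train = int(n_rows * train_frac)
    return idx[:n_train], idx[n_train:]
```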

Using Validation Data to Test Extrapolating Future Values

If you are using time-series data and are trying to predict future time-series values, you may want to create a validation data split that emphasizes the ability of the models to predict future values that were not used for optimizing the model directly.

To do this, disable the random shuffling in the Options | Advanced Genetic Program Settings menu, and optionally make the training and validation data sets mutually exclusive (as shown in the options above). For example, you could set the first 75% of the data to be used for training and the last 25% for validation. After starting the search, you will see your data split like below:

Eureqa Model Prediction

Now, the list of best solutions will be filtered by their ability to predict only future values – the last rows in the data set which were not used to optimize the models directly.
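The chronological split described above amounts to holding out the tail of the data. A minimal sketch (again an illustration, not Eureqa’s internals):

```python
def time_series_split(rows, train_frac=0.75):
    """Hold out the last rows for validation, so models are scored on
    their ability to extrapolate into the future."""
    n_train = int(len(rows) * train_frac)
    return rows[:n_train], rows[n_train:]
```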

Topics: Advanced Techniques, Eureqa, Preparing Data, Tutorial

Working with discontinuous data in Eureqa

Posted by Michael Schmidt

Several features in Eureqa, such as the smoothing features and the numerical derivative operators, assume by default that your data is one continuous series of points. This post shows how to tell Eureqa that there are breaks in the data.

Entering discontinuity with a blank row:

To indicate a break in your data set, simply insert a blank row in the data spreadsheet. You can insert a blank row by right clicking on the first row of the next series of points. For example:

Eureqa working with discontinuous data

Eureqa will automatically recognize this blank line as a break between continuous series of data for each variable, allowing it to smooth the points within each segment correctly. For example:

Eureqa smooth data points

Without the break, smoothing would attempt to blur the two distinct series into a single smooth curve.
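Conceptually, the blank row acts as a delimiter that splits each variable into independent segments before smoothing. A small Python sketch of that idea (my own helper, not Eureqa code; a blank spreadsheet row is represented as NaN):

```python
import numpy as np

def split_on_breaks(values):
    """Split a 1-D series into continuous segments at NaN rows, the way
    a blank spreadsheet row separates series in Eureqa."""
    segments, current = [], []
    for v in values:
        if np.isnan(v):
            if current:
                segments.append(current)
                current = []
        else:
            current.append(v)
    if current:
        segments.append(current)
    return segments  # each segment can then be smoothed independently
```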

Topics: Advanced Techniques, Eureqa, Preparing Data, Tutorial

Using date and time variables

Posted by Michael Schmidt

This post describes the best way to convert date or time values into numeric time values that can be used in Eureqa.

Time Values in Eureqa:
Eureqa can only store date and time values as numeric values (e.g. total seconds or total days). Therefore, you need to pick a reference point from which to measure a time duration, and units in which to measure it. For example, you could convert the time value “8:31 am” to 8.52 total hours since midnight. Similarly for dates, you could convert a date like “Dec. 6, 1981 8:31 am” to 81.9 total years since 1900.

You need to make date and time conversions to numeric duration values in another program, like Excel, before entering the data into Eureqa (see below for an example).
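If you work in Python instead of Excel, the standard library’s datetime module can perform the same conversions. A sketch reproducing the two examples above:

```python
from datetime import datetime

# "8:31 am" -> total hours since midnight (~8.52)
t = datetime.strptime("8:31 am", "%I:%M %p")
hours_since_midnight = t.hour + t.minute / 60

# "Dec. 6, 1981 8:31 am" -> total years since 1900 (~81.9),
# using an average year length of 365.25 days
d = datetime(1981, 12, 6, 8, 31)
years_since_1900 = (d - datetime(1900, 1, 1)).days / 365.25
```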

Pitfalls:

1) Do not concatenate date and time strings to get a numeric value. For example, do not convert a date like “1981-12-06” to 19811206. This representation of time is extremely nonlinear: it preserves order but loses all metric meaning. Additionally, the values are very large and numerically unstable.

2) Avoid measuring time durations from a very distant reference point. For example, if your data uses time values that span a few days, do not convert these time values to total seconds since the beginning of the century. The numeric values would be enormous and numerically unstable.

Instead, the best practice is to measure time durations from the first time point in your data set.

Convert in Excel:

Many programs can convert date and time values to numeric time duration values. In Excel, if you subtract two date cells, the result is the fractional number of days between the two dates. You can then convert days to hours or some other unit to get numeric values with reasonable magnitudes. For example:

    =(A1-A$1)*24

and then repeated for all rows, would subtract the first date in cell A1 from each date and convert the resulting day values into hours.
Another useful function is YEARFRAC, which returns the difference between a date and a reference date as a fraction of years. For example:
    =YEARFRAC(A$1, A1)

and repeated for all rows, returns the fractional number of years elapsed since the date in cell A1.


Topics: Advanced Techniques, Eureqa, Preparing Data, Techniques, Time Series
