Blog

Customer Spotlight: Radnostics

Posted by Jess Lin

19.11.2013 10:00 AM

We have amazing customers doing even more amazing things with their data. As customers share their stories with us, we will post them here on our blog. We hope they inspire you to think about what more you could be doing with your own data! Contact us if you have a case study using Eureqa that you would like to share as well.

A diagnosis of acute appendicitis calls for immediate surgery when the appendix is still intact. To avoid exposing children to the ionizing radiation of CT scans, ultrasound can be used to examine the appendix and the surrounding area, but it cannot directly reveal whether the appendix has perforated. A team led by Dr. Einat Blumfield set out to discover the specific, associated ultrasound findings that would accurately point to perforation. Given ultrasound images from 161 subjects diagnosed with acute appendicitis, the researchers needed a model that would identify correlations in the data.

Anthony Blumfield, the statistical modeler for the team, received a suggestion from his former classmate Hod Lipson (now an advisor to Nutonian): download Eureqa and use the power of symbolic regression to find a model for the data. After setting the software to work, Anthony Blumfield returned a couple of hours later to find that it had already found a model that predicted perforation as accurately as CT scans. More surprisingly, the model had discovered age categories that better defined which ultrasound findings would predict perforation. “We want to look at clinical findings that are associated with perforation such as duration of symptoms, white blood count, and fever,” Dr. Einat Blumfield explains. “We’ll then use Eureqa to search for a formula that combines ultrasound findings and clinical findings, and we hope to achieve an even higher level of accuracy.”

For more details, read the full case study here!

Jess

Topics: Case study, Eureqa, hod lipson, Modeling Outputs

Using Eureqa to Predict Sales Next Quarter

Posted by Hod Lipson

08.10.2013 09:36 AM

Forecasting sales with Eureqa

A common question we encounter is how to use Eureqa to predict future sales. Let’s consider the following example, using real-world data provided by Dunnhumby in their Product Launch Challenge.

This dataset contains week-by-week sales numbers for various products for the first quarter (13 weeks) after the product’s launch. The challenge is then to predict how successful each product will be at the end of the second quarter (26 weeks after its launch), based only on information in the first quarter.

What kind of information was provided? There were 16 week-by-week attributes typically collected by retailers, such as

  • The number of stores selling the product
  • The total number of units sold that week
  • The number of distinct customers who have bought the product (cumulative)
  • The number of distinct customers who have bought the product at least twice (cumulative)
  • Cumulative units sold to a number of different customer groups, such as Family Focused, Shoppers on a Budget, Price Sensitive shoppers, etc.

The dataset also included actual second-quarter performance numbers for some of the products; the challenge was to predict the numbers for the rest. The products with known Q2 numbers are called the “training set.” Often, training-set numbers are obtained from prior years.

To use Eureqa on this problem, we downloaded the data and reformatted it into a spreadsheet with 2,768 rows and 208 columns: one row for each of the 2,768 products, and one column for each of the 16 attributes listed above in each of the 13 weeks of the first quarter (13×16=208). For example, one column contained the number of stores selling the product in week 1, the next the number of stores selling in week 2, and so on; another run of columns contained the number of units sold in week 1, week 2, and so on. Finally, we appended one last column containing the actual sales in week 26, which is what we want Eureqa to find a predictive formula for. The week-26 sales column was available only for the products included in the training set.
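The reshaping step above can be sketched with pandas. The column names and toy numbers here are made up for illustration (the real dataset has 16 attributes over 13 weeks, not the two-attribute, two-week miniature shown):

```python
import pandas as pd

# Hypothetical long-format data: one row per product per week,
# with the week-by-week attributes as columns.
long_df = pd.DataFrame({
    "product_id": [1, 1, 2, 2],
    "week":       [1, 2, 1, 2],
    "stores_selling": [10, 12, 5, 7],
    "units_sold":     [100, 130, 40, 55],
})

# Pivot to one row per product, one column per (attribute, week) pair.
wide = long_df.pivot(index="product_id", columns="week",
                     values=["stores_selling", "units_sold"])

# Flatten the column index to names like "units_sold_w1".
wide.columns = [f"{attr}_w{week}" for attr, week in wide.columns]
print(wide.columns.tolist())
# With 16 attributes and 13 weeks this step yields 16 * 13 = 208 columns.
```

The last step, appending the week-26 sales column for training-set products, is a simple join on the product identifier.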

To start the search, we punched a simple query into Eureqa:

Units_that_sold26 = f(…)

This query instructs Eureqa to search for a formula f() that models the column Units_that_sold26 based on all the other columns. Of course, you could make other types of queries: for example, you could ask for a formula for the number of units sold in week 26 based only on variables from the first three weeks, or based only on a subset of the variables. The actual Eureqa-formatted file can be found here for a subset of the data pertaining to candy products only.

[Chart: actual versus predicted sales for candy products]

After punching in the query, we let Eureqa run for about 24 core-hours: the better part of a workday on a quad-core machine, but only a few minutes on the cloud. How did the formula pan out? Its predictions were quite accurate: the mean absolute error on a validation set (not used for training) was about 4.75% (19 items out of 4000 items sold), and the R² for the validation set was 0.998. The chart shows the actual sales (blue dots) versus predicted sales (red curve) for various candy types.
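The two validation metrics quoted here can be computed with a few lines of NumPy. The sales numbers below are invented stand-ins, just to show the arithmetic:

```python
import numpy as np

# Hypothetical validation-set numbers (actual vs. model-predicted sales).
actual    = np.array([400.0, 380.0, 4100.0, 950.0])
predicted = np.array([410.0, 371.0, 4050.0, 930.0])

# Mean absolute error, also expressed relative to mean sales.
abs_err = np.abs(actual - predicted)
mae = abs_err.mean()
mae_pct = 100.0 * mae / actual.mean()

# R^2: one minus the ratio of residual to total sum of squares.
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE = {mae:.2f} units ({mae_pct:.2f}%), R^2 = {r2:.4f}")
```

Computing both metrics on data held out from training, as the post does, is what makes them honest estimates of forecast quality.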

Although we could run this process for every item category in the supermarket, asking Eureqa to model just the candy products gave us a candy-specific formula that is particularly well suited to predicting candy sales. A separate model could be trained for predicting, say, just frozen pizza or baked goods. We could also try to predict overall sales. Different formulas (or “models,” as analysts call them) are useful for different purposes: for financial planning we may want to predict total sales, but for deciding how much shelf space to allocate to each product type, we might want product-specific formulae. For example, if candy X is likely to sell twice as much as candy Y in week 26, we might want to increase the shelf space allocated to candy X and reduce the shelf space for candy Y. We might also re-run the analysis every week (or every day) to update the forecast with up-to-date data.

Unlike other modeling techniques, Eureqa doesn’t just give you a prediction; it gives you a simple formula. You can use this formula to make your own predictions, but you can also use it to gain insight into how the prediction is reached, what it is based on, and what affects it. The transparency provided by Eureqa and symbolic regression helps you build confidence in the forecast before basing any critical actions on it. You can even play “what if” scenarios, tweaking various parameters to see what changes might improve future performance: perhaps sales in the third week are really critical, so marketing budget is best spent around that week?

So what did the formula say? You’ll have to try it out yourself: Here are the files. And if you want to try it on your own business data, let me know if I can help. Often the biggest challenge is reformatting your data so that it can be modeled, so if you have any questions about that, don’t hesitate to ask me.

Topics: hod lipson, Tutorial

How does Eureqa Compare to Other Machine Learning Methods?

Posted by Hod Lipson

27.08.2013 09:57 AM


How does Eureqa’s performance, in terms of predictive accuracy and simplicity, compare to other machine learning methods, such as Neural Networks, Support Vector Machines, Decision Trees, and simple Linear Regression?

To answer this question, we ran a simple comparison: Eureqa against four standard machine learning methods on five test cases for which data is publicly available. We used the WEKA implementations of the four methods, with settings optimized for best performance:

  1. Linear Regression: fits a linear equation of the form y = a1x1 + a2x2 + a3x3 + … using the least-squares method. This is the traditional regression approach used in many statistical software packages.
  2. Decision Trees (DT): fit multiple linear regression models, each to a different part of the dataset. The dataset is partitioned using conditions on the input variables.
  3. Neural Networks (NN): a classic multi-layer perceptron learns to predict the output from the input using back-propagation. Early stopping on a validation set is used, with a single hidden layer whose size is optimized automatically.
  4. Support Vector Machines (SVM): model the data as a combination of a few selected data points (called support vectors).
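To get a feel for how these four baselines behave in code, here is a sketch using scikit-learn instead of WEKA (an assumption purely for illustration), fit to a small synthetic dataset with a known nonlinear target:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Hypothetical stand-in data: 200 samples, 3 inputs, a nonlinear target.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 3))
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + 0.5 * X[:, 2]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

models = {
    "linear":        LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=0),
    "svm":           SVR(),
    "neural net":    MLPRegressor(hidden_layer_sizes=(20,),
                                  max_iter=2000, random_state=0),
}

# Fit each model and measure error on the held-out test split.
errors = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    errors[name] = mean_absolute_error(y_te, model.predict(X_te))
    print(f"{name:13s} test MAE = {errors[name]:.3f}")
```

The same fit-then-score loop is what the comparison below does, with each method's test error plotted against the complexity of the model it produced.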

We ran the tests on five datasets obtained from the UCI Machine Learning Repository: the Auto MPG Benchmark, the Challenger O-Ring Benchmark, the Concrete Compressive Strength Benchmark, the Solar Flare Benchmark, and the Coil 2000 Benchmark.

Each algorithm produced a result in a different format: linear regression produced a hyperplane, a neural network produced a matrix of connection weights, and Eureqa produced an analytical expression. One example result can be seen to the right. It is clear that some solutions are more complex than others: the more complex solutions involve more free parameters, or simply take more ink to write down. Some solutions were also more accurate than others, producing less error on a separate test dataset. Of course, we’d like a machine learning algorithm that produces models that are both accurate and simple, but that isn’t always the case.

We plotted the average performance of all five algorithms at a location corresponding to the average complexity and accuracy of the models they produced. In this complexity-versus-accuracy chart, we can see several regions. The top-left region is where we would see algorithms that produce fairly accurate models with many free parameters. The bottom-right region is where we would see algorithms that produce very simple solutions, even if they are somewhat less accurate. The top-right region is the worst place to be: models there are both complicated and not very accurate. And the bottom-left region is where we find algorithms that produce models that are at the same time both simple and accurate.

[Chart: model complexity versus accuracy for the five algorithms]

It appears that Eureqa’s use of symbolic regression produces models that are both more accurate and simpler than other machine learning methods, but what’s the catch?

There is no free lunch: symbolic regression is substantially more computationally intensive than neural networks, SVMs, and linear regression. Luckily, however, while accuracy and simplicity are priceless, computational power can be bought on demand with platforms like Amazon EC2.

Topics: hod lipson, symbolic regression

What is Symbolic Regression (and how does Eureqa use it)?

Posted by Hod Lipson

18.07.2013 04:00 PM


You may be familiar with the term “regression”: the ability of a computer to fit a mathematical equation to data. There are many regression techniques and tools out there. The most common method is “linear regression,” in which a computer fits a straight line (or a flat plane) to the data. This works well if your data generally follows a straight trend and you want to know the slope. Another method is nonlinear regression, in which a computer fits the coefficients of an arbitrary mathematical equation that you provide. This is good when you already know qualitatively how your data behaves, and all you want is quantitative predictions. But what if your data does not seem to follow a linear trend, and you do not know what the trend is, even qualitatively?
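The two familiar methods can be shown side by side with NumPy and SciPy. The exponential dataset below is a made-up example, not from the post:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical data following an exponential trend: y = 2 * exp(0.8 * x).
x = np.linspace(0, 4, 50)
y = 2.0 * np.exp(0.8 * x)

# Linear regression: fit a straight line y = a*x + b by least squares.
a, b = np.polyfit(x, y, 1)

# Nonlinear regression: you supply the equation's form,
# and the computer fits its coefficients c and k.
def model(x, c, k):
    return c * np.exp(k * x)

(c, k), _ = curve_fit(model, x, y, p0=(1.0, 1.0))
print(f"line slope a = {a:.2f}; fitted c = {c:.2f}, k = {k:.2f}")
```

The line fits poorly because the trend is not straight; the nonlinear fit recovers the coefficients exactly, but only because we told it the form in advance. That is the gap symbolic regression fills.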

Eureqa uses a new technique, called Symbolic Regression. Symbolic Regression does not assume a linear trend, nor does it require you to provide a model. Instead, symbolic regression searches for the best model for your data, including linear and nonlinear models. Since some models might be simple but inaccurate, and other models may be very accurate but complex, symbolic regression does not try to give you just a single answer – it gives you a handful of possible models that you can choose from. You can use the model to make predictions, to gain insight, and to find optimal points.

[Chart: symbolic regression versus linear regression]

How does symbolic regression work? We start with a bunch of simple, linear models. If these models fit perfectly, great. If they don’t, we make small variations to them and try again. These variations can include changing the form of the models by adding, removing, or altering mathematical terms. We then keep testing, at a rate of 10 million equations per second, until the search gradually converges. In test cases, we have watched this simple algorithm find models that took human experts decades to discover.
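As a toy illustration of that search loop, here is a single-candidate hill climb over tiny expression trees. Eureqa’s actual algorithm is an evolutionary search over whole populations of equations, so treat this as a sketch of the idea, not the method:

```python
import random

random.seed(1)

# Hidden "law" we want to rediscover from data: y = x^2 + 1.
xs = [i / 10 for i in range(-20, 21)]
target = [x * x + 1.0 for x in xs]

OPS = ["+", "-", "*"]

def evaluate(expr, x):
    """Evaluate an expression tree: 'x', a constant, or (op, left, right)."""
    if expr == "x":
        return x
    if isinstance(expr, float):
        return expr
    op, left, right = expr
    a, b = evaluate(left, x), evaluate(right, x)
    return a + b if op == "+" else a - b if op == "-" else a * b

def error(expr):
    """Sum of squared errors against the target data."""
    return sum((evaluate(expr, x) - y) ** 2 for x, y in zip(xs, target))

def random_leaf():
    return "x" if random.random() < 0.5 else float(random.randint(0, 3))

def mutate(expr):
    """Make a small variation: grow, replace, or alter a subtree."""
    r = random.random()
    if r < 0.3:                       # grow: wrap in a new operation
        return (random.choice(OPS), expr, random_leaf())
    if r < 0.6 or expr == "x" or isinstance(expr, float):
        return random_leaf()          # replace with a fresh leaf
    op, left, right = expr            # descend and mutate one subtree
    if random.random() < 0.5:
        return (op, mutate(left), right)
    return (op, left, mutate(right))

# Start simple; keep any variation that fits at least as well.
best = random_leaf()
for _ in range(20000):
    candidate = mutate(best)
    if error(candidate) <= error(best):
        best = candidate

print(best, error(best))
```

Because variations are only kept when they fit at least as well, the error can only go down; given enough mutations the search typically recovers the x*x + 1 structure from the data alone, without ever being told the form.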

Try it on your own data >>

Topics: hod lipson, linear regression, symbolic regression
