A question we often encounter is how to use Eureqa to predict future sales. Let’s consider the following example, using real-world data provided by Dunnhumby in its Product Launch Challenge.

This dataset contains week-by-week sales numbers for various products for the first quarter (13 weeks) after the product’s launch. The challenge is then to predict how successful each product will be at the end of the second quarter (26 weeks after its launch), based only on information in the first quarter.

What kind of information was provided? There were 16 week-by-week attributes typically collected by retailers, such as

- The number of stores selling the product
- The total number of units sold that week
- The number of distinct customers who have bought the product (cumulative)
- The number of distinct customers who have bought the product at least twice (cumulative)
- Cumulative units sold to a number of different customer groups, such as Family Focused, Shoppers on a Budget, Price Sensitive shoppers, etc.

The dataset also included actual second quarter performance numbers for some of the products. The challenge was to predict the numbers for the rest. These given Q2 performance numbers are called the “Training Set”. Often, training set numbers are obtained from prior years.

To use Eureqa on this problem, we downloaded the data and reformatted it into a spreadsheet with 2,768 rows and 208 columns. That’s one row for each of the 2,768 products, and one column for each of the 16 attributes listed above in each of the 13 weeks of the first quarter (13×16=208). For example, one column contained the number of stores selling in week 1, the next column contained the number of stores selling in week 2, and so on. Another column contained the number of units sold in week 1, the next the number of units sold in week 2, and so on. Finally, we appended one last column containing the actual sales in week 26, which is what we want Eureqa to find a predictive formula for. The week-26 sales column was available only for the products included in the training set.
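The reshaping step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input layout (one record per product, week and attribute), the helper name `to_wide`, and the attribute name `Stores_selling` are hypothetical, and the sample values are made up. Only the target column name `Units_that_sold26` comes from the query used in this post.

```python
from collections import defaultdict


def to_wide(records, target=None):
    """Pivot (product, week, attribute, value) records into one row per product.

    Column names follow an assumed '<attribute><week>' pattern,
    for example 'Units_that_sold2'.
    """
    rows = defaultdict(dict)
    for product_id, week, attribute, value in records:
        rows[product_id][f"{attribute}{week}"] = value
    # Append the week-26 target only where it is known (training-set products).
    for product_id, units in (target or {}).items():
        rows[product_id]["Units_that_sold26"] = units
    return dict(rows)


# Made-up sample records: (product, week, attribute, value).
records = [
    ("A", 1, "Stores_selling", 10),
    ("A", 2, "Stores_selling", 12),
    ("A", 1, "Units_that_sold", 40),
    ("A", 2, "Units_that_sold", 55),
    ("B", 1, "Stores_selling", 3),
    ("B", 1, "Units_that_sold", 7),
]
wide = to_wide(records, target={"A": 61})
```

Here product A ends up with a `Units_that_sold26` entry because it is in the (made-up) training set, while product B does not, mirroring the way the real target column was filled in only for training products.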

To start the search, we punched a simple query into Eureqa:

#### Units_that_sold26 = f(…)

This query instructs Eureqa to search for a formula f() that models the column Units_that_sold26 based on all the other columns. Of course, you could make other types of queries. For example, we can ask for a formula for the number of units sold in week 26 based only on variables from the first three weeks, or based only on a subset of the variables. The actual Eureqa-formatted file can be found here for a subset of the data pertaining to candy products only.
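If you prepare the spreadsheet yourself, one way to restrict the inputs to the first few weeks is to filter the columns before loading the file. The sketch below assumes column names end in the week number (as `Units_that_sold26` does); the helper `columns_for_weeks` and the other column names are hypothetical, and this is not a feature of Eureqa itself.

```python
import re


def columns_for_weeks(column_names, max_week):
    """Keep only columns whose trailing week number is at most max_week."""
    selected = []
    for name in column_names:
        match = re.search(r"(\d+)$", name)  # trailing digits = week number
        if match and int(match.group(1)) <= max_week:
            selected.append(name)
    return selected


# Hypothetical column names following the '<attribute><week>' pattern.
columns = ["Stores_selling1", "Stores_selling4", "Units_that_sold3", "Units_that_sold26"]
first_three_weeks = columns_for_weeks(columns, max_week=3)
```

The target column is dropped by this filter as well, so it would need to be added back before modeling.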

After punching in the query, we let Eureqa run for about 24 core-hours, something that would take the better part of a workday on your quad-core, but only a few minutes on the cloud. How did the formula pan out? The formula’s prediction was pretty accurate: the mean absolute error on a validation set (not used for training) was about 4.75% (19 items out of 4000 items sold), and the R² for the validation set was 0.998. The chart to the right shows the actual sales (blue dots) versus predicted sales (red curve) for various candy types.
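For readers who want to check a formula against their own held-out data, here is a minimal sketch of the two accuracy measures mentioned above. It uses one common definition of R² (one minus the ratio of residual to total sum of squares); Eureqa’s own definition may differ, and the numbers below are made up purely to exercise the functions.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up validation data, not taken from the Dunnhumby dataset.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

mae = mean_absolute_error(actual, predicted)
r2 = r_squared(actual, predicted)
# Express the error relative to typical sales volume (an assumed convention).
relative_mae = mae / (sum(actual) / len(actual))
```

The important part is that `actual` and `predicted` come from products that were held out of training; otherwise both numbers will look better than they should.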

Although we could run this process for all the item categories in the supermarket, asking Eureqa to create a model for just the candy products gave us a candy-specific formula that is particularly well suited to predicting candy sales. A separate model could be trained for predicting, say, just frozen pizza or baked goods. We could also try to predict overall sales. Different formulas (or “models” as analysts call them) are useful for different purposes: for financial purposes we may want to predict total sales, but for deciding how much shelf space to allocate to each product type, we might want to use product-specific formulas. For example, if candy X is likely to sell twice as much as candy Y in week 26, we might want to increase the shelf space allocated to candy X and reduce the shelf space for candy Y. We might also want to re-run the analysis every week (or every day) to update the forecast based on up-to-date data.

Unlike other modeling techniques, Eureqa doesn’t just give you a prediction. It gives you a simple formula. You can use this formula to make your own predictions, but you can also use it to gain insight into how the prediction is reached, what it is based on, and what affects it. The transparency provided by Eureqa and Symbolic Regression helps you build confidence in the forecast before basing any critical actions on it. You can even play out “what if” scenarios by tweaking various parameters to see what changes might improve future performance: perhaps sales in the third week are really critical, so the marketing budget is best spent around that week?
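To make the “what if” idea concrete, here is a small sketch of how you might probe a formula once you have one. The formula below is entirely hypothetical, invented for illustration, and is not the model Eureqa found for this dataset; the variable names and baseline values are also made up.

```python
def predicted_week26_units(units_week3, stores_week13):
    # HYPOTHETICAL formula for illustration only; not the model Eureqa found.
    return 0.8 * units_week3 + 2.5 * stores_week13


def what_if(formula, baseline, variable, relative_change):
    """Change in the forecast when one input is scaled by (1 + relative_change)."""
    scenario = dict(baseline)
    scenario[variable] = baseline[variable] * (1.0 + relative_change)
    return formula(**scenario) - formula(**baseline)


# Made-up baseline inputs for a single product.
baseline = {"units_week3": 100.0, "stores_week13": 20.0}

# How much would the week-26 forecast move if week-3 sales were 10% higher?
delta = what_if(predicted_week26_units, baseline, "units_week3", 0.10)
```

Repeating this for each input gives a rough ranking of which early signals the forecast is most sensitive to, which is the kind of question a black-box prediction cannot easily answer.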

So what did the formula say? You’ll have to try it out yourself: Here are the files. And if you want to try it on your own business data, let me know if I can help. Often the biggest challenge is reformatting your data so that it can be modeled, so if you have any questions about that, don’t hesitate to ask me.
