“Data Science” is a pretty hot topic these days. But what is it exactly? Before I learned about data science, my reaction to it could be best described with the following cartoon:
Data science is often described as the extraction of knowledge from data using statistical techniques. In this post I’m going to attempt to be a little more explicit about what goes into a data science workflow.
Let’s look at a popular modeling technique – linear regression. Regression builds a model that predicts some numerical quantity you’re interested in, e.g., the weekly sales in a retail store. The steps involved in building a model are:
1. Prepare data: Real-world data is often noisy. Some values may be missing, and there may be outliers. The first step in modeling is to prepare a good sample of clean data, by interpolating missing values, removing outliers, etc.
2. Build a model: You might start with building a model that predicts the weekly sales as a function of all other variables in the data. Domain expertise is often useful at this stage. For example, you may know that sales follows a seasonal cycle, so you would include time of year metrics as inputs. You also need to examine each term for statistical significance using the p-value. Roughly, the p-value of a term tells you how likely an effect that strong is to appear by chance alone when the term actually has no effect; the higher the p-value, the weaker the evidence that the term belongs in the model.
3. Test the model: Once you have a model, you need to validate it against data that has been withheld from the modeling process, e.g., via cross-validation. This helps to ensure the model will generalize and work well on future data.
4. Add or remove features: You need to remove statistically insignificant terms to avoid over-fitting the model to the data at hand. After initial modeling, you may also need to add new features. For example, if you think that the prior week’s advertisements may affect this week’s sales, you would create some features (i.e., new variables) for the prior week’s advertisement metrics.
5. Repeat until all terms are significant and you’re satisfied with the model structure.
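As a rough sketch of this loop, here is what the prepare/build/test steps might look like in code. Everything below is illustrative – the simulated “sales” data, the variable names, the 3-sigma outlier rule, and the 75/25 hold-out split are all assumptions, not a prescription:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200

# --- 1. Prepare data (simulated weekly sales; names and numbers are illustrative) ---
ads = rng.normal(10.0, 3.0, n)                    # advertising spend
season = np.sin(2 * np.pi * np.arange(n) / 52.0)  # yearly seasonal cycle
sales = 50.0 + 4.0 * ads + 20.0 * season + rng.normal(0.0, 5.0, n)

ads[5] = np.nan        # inject a missing value
sales[10] += 200.0     # inject an outlier

ads[np.isnan(ads)] = np.nanmean(ads)              # interpolate the missing value
keep = np.abs(sales - np.median(sales)) < 3.0 * np.std(sales)  # drop outliers
ads, season, sales = ads[keep], season[keep], sales[keep]

# --- 2. Build a model: ordinary least squares with an intercept ---
X = np.column_stack([np.ones(len(sales)), ads, season])
split = int(0.75 * len(sales))                    # hold out the last 25%
X_tr, X_te = X[:split], X[split:]
y_tr, y_te = sales[:split], sales[split:]

def fit_ols(X, y):
    """Least-squares fit plus a two-sided t-test p-value per coefficient."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof                  # residual variance estimate
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    pvals = 2.0 * stats.t.sf(np.abs(beta / se), dof)
    return beta, pvals

beta, pvals = fit_ols(X_tr, y_tr)

# --- 3. Test the model against the withheld data ---
mse = np.mean((y_te - X_te @ beta) ** 2)
print("coefficients:", beta)
print("p-values:", pvals)
print("held-out MSE:", mse)
```

Steps 4 and 5 – deciding which features to add or drop based on these p-values and the hold-out error, then refitting – are exactly the part that remains manual.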
There are several problems with this flow:
- Very manual process: Even though there are tools that can help with individual steps in the flow, testing the model and adding/removing features (steps 3 and 4 above) require a lot of experience in the field and are very labor-intensive.
- Easy to misinterpret results: P-values are sensitive to the amount of data you have, so a simple rule of thumb such as “p < 0.05 implies significance” isn’t enough. Again, it takes a lot of experience with the data to understand how much data to use, what p-values make sense for the particular domain, etc.
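A quick simulation shows how sensitive p-values are to sample size. The effect size (a true slope of 0.2) and the sample sizes below are arbitrary choices for illustration; the point is only that the same true effect can look insignificant at small n and overwhelmingly significant at large n:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def slope_pvalue(n):
    """p-value for the slope of y = 0.2*x + noise, fit to n points."""
    x = rng.normal(size=n)
    y = 0.2 * x + rng.normal(size=n)   # small but real effect
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = n - 2
    se = np.sqrt(np.diag((resid @ resid / dof) * np.linalg.inv(X.T @ X)))
    return 2.0 * stats.t.sf(abs(beta[1] / se[1]), dof)

p_small = slope_pvalue(30)
p_large = slope_pvalue(3000)
print("n = 30:   p =", p_small)
print("n = 3000: p =", p_large)
```

With 30 points the same effect will often fail a p < 0.05 cutoff; with 3000 points it is detected easily. This is why a fixed threshold can’t substitute for judgment about the data and the domain.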
Despite all the complexity of the above process, the end goal is really quite simple to express: you want a model in which every term contributes to the accuracy, and you want to be sure you haven’t missed a term that would improve the accuracy significantly. Other modeling algorithms, such as decision trees or neural nets, require different techniques to achieve the same goal, and in general suffer from the same drawbacks – the process is bottlenecked by the data scientist’s ability to add or remove features based on domain expertise.
Eureqa completely automates this process from start to finish. The models generated by the engine vary by complexity (i.e., the number of terms) and each model satisfies the above goal. The user can see how each additional term affects the accuracy. Additionally, Eureqa’s models are easily interpretable by business users, since they can be turned into plain English.
In the next post in this series, I’ll share some examples of how Eureqa helps users get an intuitive understanding of models through innovative UI visualizations.