Demystifying Data Science, Part 3: Scaling data science

Posted by Lakshmikant Shrinivas

15.05.2015 09:45 AM

In my last post in this series, I spoke about what goes into a data science workflow. The current state of the art in data science is not ideal; the value of data is limited by our understanding of it, and the current process to go from data to understanding is pretty tedious. The right tools make all the difference. Imagine cutting a tree with an axe instead of a chainsaw. If you were cutting trees for a living, wouldn’t you prefer the chainsaw? Even if you only had to cut trees occasionally, wouldn’t you prefer a chainsaw, because, well, chainsaw! The key here is automation. Ideally you want as much of the process automated as possible, for the sake of productivity.

With data science, the two major bottlenecks are wrangling with data and wrangling with models. Wrangling with data involves gathering and transforming data into a form suitable for modeling. There are several companies that deal with data wrangling – for example, the entire ETL industry. Wrangling with models involves creating and testing hypotheses, building, testing and refining features and models. Eureqa can help with the model wrangling. It is the chainsaw that completely automates the process of creating, testing, refining and sharing models.

As I mentioned in my last post, the goals of modeling are pretty simple to express. We want to figure out whether a) all the terms in our model are important, and b) we’ve missed any term that would significantly improve accuracy. Eureqa uses evolutionary algorithms to automatically create and test linear as well as non-linear models, sort of like the infinite monkey theorem. Except in our case, with the advances in computation, and of course our secret sauce, the “eventually find models” part translates in practice to a few minutes or hours – much faster than any human could do it.
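
To make the idea concrete, here is a toy sketch of evolutionary model search. It is emphatically not Eureqa’s engine; it is just a minimal illustration, assuming a small hand-picked library of candidate terms, synthetic data, and plain least-squares fitting with a held-out split for scoring.

```python
# Toy evolutionary model search (illustrative only, not Eureqa's algorithm).
# Each candidate "model" is a set of basis terms; candidates are fitted by
# least squares, scored on held-out data plus a complexity penalty, and the
# population evolves by keeping the fittest candidates and mutating them.
import random
import numpy as np

rng = np.random.default_rng(0)
random.seed(0)

# Synthetic data with a hidden non-linear relationship: y = 3*x1^2 - 2*sin(x2) + noise
X = rng.uniform(-3, 3, size=(400, 2))
y = 3 * X[:, 0] ** 2 - 2 * np.sin(X[:, 1]) + rng.normal(0, 0.1, 400)

# Library of candidate terms the search is allowed to combine
TERMS = {
    "x1": lambda X: X[:, 0],
    "x2": lambda X: X[:, 1],
    "x1^2": lambda X: X[:, 0] ** 2,
    "x2^2": lambda X: X[:, 1] ** 2,
    "sin(x1)": lambda X: np.sin(X[:, 0]),
    "sin(x2)": lambda X: np.sin(X[:, 1]),
    "x1*x2": lambda X: X[:, 0] * X[:, 1],
}
train, test = slice(0, 300), slice(300, 400)

def fitness(model):
    """Held-out mean squared error plus a small penalty per term."""
    if not model:
        return float("inf")
    terms = sorted(model)
    A_train = np.column_stack([TERMS[t](X[train]) for t in terms] + [np.ones(300)])
    A_test = np.column_stack([TERMS[t](X[test]) for t in terms] + [np.ones(100)])
    coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
    return np.mean((A_test @ coef - y[test]) ** 2) + 0.01 * len(terms)

def mutate(model):
    """Toggle one randomly chosen term in or out of the model."""
    return frozenset(set(model) ^ {random.choice(list(TERMS))})

population = [mutate(frozenset()) for _ in range(20)]   # random single-term models
for generation in range(50):
    population.sort(key=fitness)
    survivors = population[:10]                         # keep the fittest half
    population = survivors + [mutate(m) for m in survivors]

best = min(population, key=fitness)
print("best terms:", sorted(best), " score:", round(fitness(best), 4))
```

Eureqa’s actual engine searches over the functional forms themselves rather than a fixed term library; the point here is only that the create/test/refine loop runs without a human in it.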

If you pause for a moment to think about it, it’s pretty powerful and liberating. As a would-be data scientist, using such a tool frees up your time to focus on the more creative aspects of data science. For example, what other data could we pull in that might affect our problem? What other types of problems could we model with our data? As a non-data scientist, using such a tool lowers the barrier to entry for modeling. Imagine having a personal army of robotic data scientists at your beck and call.

For me, this is one of the most exciting aspects of Nutonian’s technology. While most of the world is still talking about scaling analytics to ever-growing amounts of data, Eureqa can scale analytics to the most precious resource of all: people.

Topics: Big data, Demystifying data science, Scaling data science

Demystifying Data Science, Part 2: Anatomy of a data science workflow

Posted by Lakshmikant Shrinivas

22.04.2015 10:00 AM

“Data Science” is a pretty hot topic these days. But what is it exactly? Before I learned about data science, my reaction to it could be best described with the following cartoon:

[Cartoon: Data_science]

Data science is often described as the extraction of knowledge from data using statistical techniques. In this post I’m going to attempt to be a little more explicit about what goes into a data science workflow.

Let’s look at a popular modeling technique – linear regression. Regression is used to build a model that predicts some numerical quantity you’re interested in, e.g., the weekly sales in a retail store. The steps involved in building a model are:

  1. Prepare data: Real-world data is often noisy. Some values may be missing, and there may be outliers. The first step in modeling is to prepare a good sample of clean data, by interpolating missing values, removing outliers, etc.
  2. Build a model: You might start with building a model that predicts the weekly sales as a function of all other variables in the data. Domain expertise is often useful at this stage. For example, you may know that sales follows a seasonal cycle, so you would include time-of-year metrics as inputs. You also need to examine each term for statistical significance using its p-value. The p-value of a term tells you how likely you would be to see an effect that large if the term actually had no relationship to the outcome; the higher the p-value, the more likely the term’s apparent contribution is just chance.
  3. Test the model: Once you have a model, you need to cross-validate it against data that has been withheld from the modeling process. This helps to ensure the model will generalize and work well on future data.
  4. Add or remove features: You need to remove statistically insignificant terms to avoid over-fitting the model to the data at hand. After initial modeling, you may also need to add new features. For example, if you think that the prior week’s advertisements may affect this week’s sales, you would create some features (i.e., new variables) for the prior week’s advertisement metrics.
  5. Repeat until all terms are significant and you’re satisfied with the model structure. (A minimal sketch of this loop appears right after the list.)
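
To see what this loop looks like in practice, here is a minimal sketch of steps 1–5 on synthetic weekly-sales data. The column names and effect sizes are made up for illustration, and it assumes the usual Python stack: pandas, statsmodels for p-values, scikit-learn for cross-validation.

```python
# Minimal sketch of the manual workflow above, on synthetic weekly-sales data
# (all column names and coefficients are hypothetical).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "week_of_year": rng.integers(1, 53, n),
    "ad_spend_prior_week": rng.uniform(0, 10_000, n),
    "store_traffic": rng.uniform(1_000, 5_000, n),
    "noise_feature": rng.normal(size=n),                 # irrelevant on purpose
})
df["weekly_sales"] = (                                    # hidden "true" relationship
    500 * np.sin(2 * np.pi * df["week_of_year"] / 52)
    + 0.8 * df["ad_spend_prior_week"]
    + 2.0 * df["store_traffic"]
    + rng.normal(0, 300, n)
)

# Step 1 - prepare data: this synthetic data is already clean; real data would
# need missing-value handling and outlier removal here.

# Step 2 - build a model, adding a seasonality feature based on domain knowledge.
df["season"] = np.sin(2 * np.pi * df["week_of_year"] / 52)
features = ["season", "ad_spend_prior_week", "store_traffic", "noise_feature"]
fit = sm.OLS(df["weekly_sales"], sm.add_constant(df[features])).fit()
print(fit.pvalues.round(4))                               # p-value for each term

# Step 3 - test the model on withheld data via 5-fold cross-validation.
cv_r2 = cross_val_score(LinearRegression(), df[features], df["weekly_sales"], cv=5)
print("cross-validated R^2:", round(cv_r2.mean(), 3))

# Step 4 - drop statistically insignificant terms (noise_feature should go) ...
features = [f for f in features if fit.pvalues[f] < 0.05]

# Step 5 - ... and repeat the fit/test cycle with the reduced feature set.
refit = sm.OLS(df["weekly_sales"], sm.add_constant(df[features])).fit()
print("kept terms:", features)
```

Even in this toy version, steps 2, 4 and 5 are where the judgment calls live: which features to engineer and which terms to keep.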

There are several problems with this flow:

  1. Very manual process: Even though there are tools that can help with individual steps in the flow, testing the model and adding/removing features (steps 3 and 4 above) require a lot of experience in the field, and are very labor intensive.
  2. Easy to misinterpret results: P-values are sensitive to the amount of data you have, so it’s not enough to use a simple rule of thumb, such as p < 0.05 implies significance (the short simulation after this list shows why). Again, this requires a lot of experience with the data before you understand how much data to use, what p-values make sense for the particular domain, etc.
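
As a quick illustration of that second pitfall, the simulation below (assuming statsmodels is available) fits a regression where the true effect of x is practically negligible; with enough rows, its p-value still drops below 0.05.

```python
# p-values shrink as the sample grows: a practically negligible effect will
# typically look "significant" once n is large enough.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
for n in (100, 1_000, 100_000):
    x = rng.normal(size=n)
    y = 0.01 * x + rng.normal(size=n)        # tiny true effect, lots of noise
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    print(f"n = {n:>7,}   p-value of x: {fit.pvalues[1]:.4f}")
```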

Despite all the complexity of the above process, the end goal is really quite simple to express: you want a model in which every term contributes to the accuracy, and you want to be sure you haven’t missed a term that would improve the accuracy significantly. Other modeling algorithms, such as decision trees or neural nets, require the use of different techniques to achieve the same goal, and in general suffer from the same drawbacks – the process is bottlenecked by the data scientist’s ability to add or remove features based on domain expertise.

Eureqa completely automates this process from start to finish. The models generated by the engine vary in complexity (i.e., the number of terms), and each model satisfies the above goal. The user can see how each additional term affects the accuracy. Additionally, Eureqa’s models are easily interpretable by business users, since they can be turned into plain English.

In the next post in this series, I’ll share some examples of how Eureqa helps users get an intuitive understanding of models through innovative UI visualizations.

Topics: Big data, Data science workflow, Demystifying data science

Demystifying Data Science, Part 1: My Transition from Infrastructure to Machine Intelligence

Posted by Lakshmikant Shrinivas

06.04.2015 10:00 AM

A lot of people thought I was crazy for leaving one of the hottest, most innovative big data companies of the last 10 years, to join another start-up.

I graduated from UW-Madison with a PhD in databases, and worked for several years as a systems software engineer at Vertica, as deep down in the guts as you can imagine: C++, multi-threaded programming, distributed systems, process management. Towards the end of my stint there, I was leading the analytics team.

The analytics team was responsible for creating a whole slew of analytic plugins for the Vertica database engine. These plugins provided functionality like geospatial capabilities and data mining algorithms such as linear regression, SVM, etc. In the early stages, I spoke to several customers to get some feedback to guide development. The conversations usually went like this:

Me: “We’re thinking of building a library of data mining functions – things like linear regression and support vector machines – to provide predictive analytics. We were hoping to get your thoughts on which algorithms you’d find most useful.”

Customer: “Predictive analytics sounds wonderful! However, how do we tell what algorithms could be used for our business problems?”

Me: “That would be something your data scientist would know.”

Customer: “Our data-what?”

After a couple of conversations like that, we got better at targeting customers that had data scientists working for them, from whom we got the feedback we were looking for. However, this made me realize that even though tools like Vertica and Tableau have solved the problems of capturing, processing and visualizing huge quantities of data, predictive modeling is currently a very human-intensive activity. In addition, from what I can tell, data scientists are a pretty scarce resource!

Enter Nutonian. The first time I had a conversation with Michael Schmidt (the founder of Nutonian), I was very impressed with Eureqa’s ability to automatically build predictive models that are easily understandable by a non-data scientist, like me. The Eureqa core engine is able to automatically discover non-linear relationships in data: essentially a set of mathematical equations that hold true over the range of the data. I realized that this technology has the potential to really disrupt the market by making predictive analysis accessible to the masses. That’s when I decided to join Nutonian, so I could work on really exciting and impactful technology.

Enabling users without a math background to really understand equations would require some very innovative user interfaces and visualizations. It felt like a great opportunity to learn something new and build a product that can disrupt the market. Besides, there is something very satisfying about being able to visually show what you’ve built. This would be in stark contrast to my prior work at Vertica, which was deep in the core of an analytic database – it’s very difficult to demo a SQL prompt!

Stay tuned for next week, when I share some of the interesting projects we’ve been working on in the advanced analytics team.

Topics: Big data, Demystifying data science, Machine Intelligence
