Demystifying Data Science, Part 3: Scaling data science

Posted by Lakshmikant Shrinivas

15.05.2015 09:45 AM

In my last post in this series, I spoke about what goes into a data science workflow. The current state of the art in data science is not ideal; the value of data is limited by our understanding of it, and the current process to go from data to understanding is pretty tedious. The right tools make all the difference. Imagine cutting a tree with an axe instead of a chainsaw. If you were cutting trees for a living, wouldn’t you prefer the chainsaw? Even if you only had to cut trees occasionally, wouldn’t you prefer a chainsaw, because, well, chainsaw! The key here is automation. Ideally you want as much of a process automated as you can, for the sake of productivity.

With data science, the two major bottlenecks are wrangling with data and wrangling with models. Wrangling with data involves gathering and transforming data into a form suitable for modeling. Several companies deal with data wrangling, including the entire ETL industry. Wrangling with models involves creating and testing hypotheses, and building, testing and refining features and models. Eureqa can help with the model wrangling. It is the chainsaw that automates the process of creating, testing, refining and sharing models.

As I mentioned in my last post, the goals of modeling are pretty simple to express. We want to figure out whether a) all terms in our model are important, and b) we’ve missed any term that would improve the accuracy significantly. Eureqa uses evolutionary algorithms to automatically create and test linear as well as non-linear models, sort of like the infinite monkey theorem. Except in our case, with the advances in computation and, of course, our secret sauce, the “eventually” in “eventually find models” practically translates to a few minutes or hours, much faster than any human could do it.
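To make the idea a little more concrete, here is a toy sketch of an evolutionary search over model terms. It is not Eureqa's algorithm, which this post does not describe; it is a minimal illustration under my own assumptions. The candidate terms, the synthetic data, the complexity penalty of 0.01 per term, and every function name (`fit`, `score`, `evolve`, `make_data`) are hypothetical. Each candidate model is a subset of terms, fitted by ordinary least squares and scored by its error plus a small cost per term, so a term survives only if it earns its keep.

```python
import math
import random

# Hypothetical candidate terms; the names and choices are illustrative only.
TERMS = {
    "x1": lambda x1, x2: x1,
    "x2": lambda x1, x2: x2,
    "x1^2": lambda x1, x2: x1 * x1,
    "x2^2": lambda x1, x2: x2 * x2,
    "x1*x2": lambda x1, x2: x1 * x2,
    "sin(x1)": lambda x1, x2: math.sin(x1),
}
NAMES = list(TERMS)


def solve(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = s / m[i][i]
    return w


def fit(mask, rows, ys):
    """Least-squares fit of the selected terms; returns (mean squared error, weights)."""
    names = [n for n, on in zip(NAMES, mask) if on]
    if not names:
        return sum(y * y for y in ys) / len(ys), []
    feats = [[TERMS[n](*r) for n in names] for r in rows]
    k = len(names)
    # Normal equations, with a tiny ridge term to keep the system well conditioned.
    a = [[sum(f[i] * f[j] for f in feats) + (1e-9 if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    b = [sum(f[i] * y for f, y in zip(feats, ys)) for i in range(k)]
    w = solve(a, b)
    mse = sum((sum(wi * fi for wi, fi in zip(w, f)) - y) ** 2
              for f, y in zip(feats, ys)) / len(ys)
    return mse, w


def score(mask, rows, ys, penalty=0.01):
    """Lower is better: fit error plus an assumed cost per term used."""
    mse, _ = fit(mask, rows, ys)
    return mse + penalty * sum(mask)


def evolve(rows, ys, generations=200, pop_size=20, seed=0):
    """Keep the better half each generation; refill with one-term mutations."""
    rng = random.Random(seed)
    cache = {}

    def cached_score(mask):
        if mask not in cache:
            cache[mask] = score(mask, rows, ys)
        return cache[mask]

    pop = [tuple(rng.random() < 0.5 for _ in NAMES) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cached_score)
        survivors = pop[: pop_size // 2]
        children = []
        for parent in survivors:
            flip = rng.randrange(len(NAMES))  # add or drop one random term
            children.append(tuple(bit != (j == flip) for j, bit in enumerate(parent)))
        pop = survivors + children
    best = min(pop, key=cached_score)
    return [n for n, on in zip(NAMES, best) if on]


def make_data(n=50, seed=1):
    """Synthetic data from a known formula: y = 2*x1 + 3*x2^2."""
    rng = random.Random(seed)
    rows = [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(n)]
    ys = [2 * x1 + 3 * x2 * x2 for x1, x2 in rows]
    return rows, ys


if __name__ == "__main__":
    rows, ys = make_data()
    print(evolve(rows, ys))
```

The per-term penalty is what addresses goal a): an unimportant term costs more than it saves in error, so it gets dropped. Random mutation addresses goal b): terms we never thought to try keep getting proposed, and the ones that cut the error significantly stick. A real system searches a far richer space of expressions than six fixed terms.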

If you pause for a moment to think about it, it’s pretty powerful and liberating. As a would-be data scientist, using such a tool frees up your time to focus on the more creative aspects of data science. For example, what other data could we pull in that might affect our problem? What other types of problems could we model with our data? As a non-data scientist, using such a tool lowers the barrier to entry for modeling. Imagine having a personal army of robotic data scientists at your beck and call.

For me, this is one of the most exciting aspects of Nutonian’s technology. While most of the world is still talking about scaling analytics to ever-growing amounts of data, Eureqa can scale analytics to the most precious resource of all: people.

Topics: Big data, Demystifying data science, Scaling data science
