Blog

The “First Mover’s” Analytics Stack, 2015 vs. 2016

Posted by Jon Millis

01.07.2016 10:00 AM

The irony of data science is the simultaneously glacial and blazing speed at which the industry seems to move. It’s been more than 10 years since the phrase “big data” was coined, and yet what we initially set out to accomplish – extracting valuable answers from data – is still a painstaking process. Some of this can be attributed to what Gartner calls the “Hype Cycle”, which holds that emerging technologies ride a predictable wave of hype, trials and tribulations before they hit full-scale market maturity: technology trigger → peak of inflated expectations → trough of disillusionment → slope of enlightenment → plateau of productivity.

The true skeptics call it all a data science bubble. But answer me this: if we’re in the midst of a bubble, how do we explain the sustained, consistent movement of tech luminaries and innovators into the market, year after year? Sure, a healthy economy is full of new competitors fighting for market share, creative destruction, and eventual consolidation, but take a look at this diagram and ask yourself how so many people could be so wrong about data science. It’s hard to imagine we’re in a bubble when all around us is an ever-growing ecosystem of tools, technologies and investment. Then again, as we’re well aware, nothing bad happened after heaps of money were piled into mortgage-backed securities in the early 2000s, and oil speculators have made a killing off of $5/gallon gas prices in 2016.

We kid, we kid. Of course there are illogical investments and industries that miss, but we maintain our belief that there is astounding value in data. Not all companies have capitalized on it yet, but the problems, the dollars, and the benefits to society as a whole are real. Data science is here to stay.

With an ecosystem now overflowing with tools, approaches and technologies, how can we make sense of general market trends? What kinds of tools and technologies make up a typical company’s analytics “stack”? More importantly, where are the “first movers” investing to capitalize on data? To find out, we share general insights we’ve gleaned from talking with our customers and clients, a mix of Fortune 500 behemoths and data-driven start-ups.

Here’s what the 2015 analytics stack looked like:

[Image: the 2015 data analytics stack]

Let’s take an outside-in approach, beginning with the raw data and getting closer and closer to the end user.

Data preparation – The cleansing layer of the ecosystem, where raw streams of data are prepped for storage or analysis. Ex., Informatica, Pentaho, Tamr, Trifacta

Data management – The data storage and management layer of the ecosystem, where data sits either structured, semi-structured or unstructured. Ex., ArcSight, Cloudera, Greenplum, Hortonworks, MapR, Oracle, Splunk, Sumo Logic, Teradata Aster, Vertica

Visualization – The visualization and dashboarding layer of the ecosystem, where business analysts can interact with, and “see”, their data and track KPIs. Ex., Microstrategy, Qlik, Tableau

Statistical – The statistical layer of the ecosystem, where statisticians and data scientists can build analytical and predictive models to predict future outcomes or dissect how a system/process “works” to make strategic changes. Ex., Cognos, H2O, Python, R, RapidMiner, SAS, SPSS
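
To make the statistical layer a bit more concrete, here is a minimal sketch of the kind of work that happens there, written in Python (one of the tools listed above). The data, column meanings and numbers are invented purely for illustration:

```python
# A minimal sketch of the "statistical layer": fit and evaluate a simple
# predictive model in Python. Data and feature meanings are invented.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))                   # e.g., price, promo spend, seasonality index (toy)
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)  # weekly demand (toy)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

print("coefficients:", model.coef_)
print("holdout R^2: %.3f" % model.score(X_test, y_test))
```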

Simple enough, right? The most data-savvy organizations make it look like a cakewalk. But take a closer look, and you’ll notice there’s a significant difference between the outer two “orbits” and the inner orbit: the inner orbit is fragmented. This does not fit with the smooth flow of the rest of the solar system.

Why are two systems occupying the same space? Because they’re both end-user analyst and data science tools that aim to deliver answers to the business team. Nutonian’s bashfully modest vision is to occupy the entire inner sphere of how people extract answers from data, with the help of “machine intelligence”. While Nutonian’s AI-powered modeling engine, Eureqa, plays nicely with statistical and visualization tools via our API, we’re encouraging companies that are frustrated by their lack of data science productivity, or that have greenfield projects, to invest in Eureqa as their one-size-fits-almost-all answers machine.

Our vision is to empower organizations and individual users to make smart data-driven decisions in minutes. Eureqa automates nearly everything accomplished in the statistical layer and the visualization layer of the analytics stack – with the exception of the domain expert himself, who’s vital to guiding Eureqa in the right direction. The innovative “first movers” in 2016 are putting the data they’ve collected to good use, and consolidating the asteroid belt of tools and technologies banging together in the inner orbit of their solar systems. It’s the simple law of conservation of [data science] energy.

Topics: Analytics stack, Big data, Eureqa, Machine Intelligence

Letter from a Grateful Hobbyist Who’s Predicting the Financial Markets with Eureqa

Posted by Jon Millis

22.06.2015 02:00 PM

Nutonian users aren’t just large corporations. They’re also hobbyist data modelers leveraging Eureqa to predict the popularity of songs, analyze home science experiments, and even determine what makes some Porsches faster than others. The letter below was sent to us by a former real estate investor and manager named Bill Russell, who’s been using Eureqa to anticipate relatively short-term movements in stock prices. Hopefully Bill’s note will not only shed light on Eureqa’s potential, but also encourage our non-commercial fans to start thinking about how they might apply Eureqa to some cool personal projects outside the office.

 

To Michael Schmidt and the team at Nutonian:

Michael, I want to express my deep appreciation for what you have created and shared. I first started following Eureqa in early 2010 when my mathematician brother alerted me to your double pendulum demo and beta download when you were at Cornell.

By way of background, I’m 70 years old, retired from a career in real estate finance and management. My degree was in economics, but I always loved numbers and the numerical analysis side of that business. My serious hobby over the years has been an attempt to predict short-term moves in the financial markets. I never had an impressive level of success, but always a lot of enjoyment with the puzzle of it all. In retrospect, I am sobered by how much time and how many resources I’ve previously put into this hobby.

My attempts at market prediction began with Fourier analysis (thanks to my brother’s programming and math skills) on an HP-85 desktop computer that had 16 KB of RAM. Next, things got more serious with the IBM XT, Lotus, and very large worksheets of pre-processed data obtained from Pinnacle Data and TradeStation. Over the years, I went to seminars given by John Hill, Larry Williams, Bill Williams, Tom DeMark and others. I purchased the Market Logic course for Peter Steidlmayer’s market profile approach and the trading course Relevance III from Maynard Holt in Nashville. There were many ideas here and there for indicators and inter-market relationships, but choosing which to use, and how to use them together, was daunting. Eureqa has changed that. Along the way, I used some programs that were impressive at the time. Brainmaker Professional, a neural network program, took plenty of my time in the search for usable predictions. HNET Professional, a holographic neural network program, was fast and impressive. AbTech’s Statnet was excellent, as was Neuralware’s Neuralsim. Yet despite this prolonged, serious, multi-year effort, I could never find an integrated, consistent pathway to success.

Because Eureqa incorporates so much analytical power in one place and finds relationships that were simply impossible to find previously, I am encouraged as never before. With the opportunity to utilize Eureqa, so much of my past approach is obsolete and elementary. I have left most of my previous analytical programs behind and many of my technical market books have now been donated to the public library. Of great significance for individual traders is that you have diminished the gap between the professional and nonprofessional in approaching the markets. Each group can utilize Eureqa, and Eureqa is equally powerful for each.

In the past, my best insights into what data might be useful came from hundreds of tedious runs of Pearson correlations and trial-and-error runs in the neural networks. I looked for ways to recast and understand the data in S-Plus and now the R language, but I am not a programmer. Trying to smooth data with splines in R was almost an insurmountable task for me. Eureqa is enabling me now to pursue options that were previously impossible. Here are some of the reasons:

1) Power and Speed: I’m able to pursue many more alternatives than were previously within my reach. Because Eureqa is so fast, I am now able to compare runs with a) the raw data; b) the same data recast to binary form; c) the data uniformly redistributed; and d) the data in a de-noised, wavelet-shrunk form. There was simply not enough time to do this before I found Eureqa.

2) Fast Data Processing and Visualization in Eureqa: I had previously done smoothing, normalizing, and rescaling in S-Plus or R. Here Eureqa saves significant time and I have complete confidence that it is being done correctly. I was often uncertain if I was getting it right on my own with the R language.

3) Tighter Selection of Input Variables: I had previously looked for any correlated relationships among a bar’s open, high, low, close, and volume, and relationships with each of those inputs delayed four periods back. I likewise did this for inter-market correlations. There was lots of manual work with Excel. All of this has become moot, since Eureqa does it in a flash. I have been able to substantially reduce the number of input variables.

4) Most importantly, Eureqa is finding predictive relationships that had simply been impossible to find.

Michael, it is a delight to be alive at 70, and see the breathtaking leaps in technology. I programmed a little in college, utilizing punched cards; I bought a cutting-edge four-function electronic calculator before finals in 1971 for $345 (a Sharp EL-8) and thought it was a bargain. And now there is Eureqa…….Wow!!! I can appreciate some of the incredible differences this product will continue to make in so many areas. Thank you so much for what you and your team have created, for sharing it in beta form in the past, and for still keeping it within reach for individuals.

With much appreciation,

Bill Russell

Topics: Big data, Eureqa, Financial Services

Trading Necessitates Speed Along Every Step of the Data Pipeline

Posted by Jon Millis

10.06.2015 01:43 PM

We just returned from Terrapinn’s The Trading Show, a data-driven financial services conference that brings together thought leadership in quant, automated trading, exchange technology, big data and derivatives. With more than 1,000 attendees and 60 exhibitors gathering at Navy Pier in Chicago, this year’s event was an excellent way not only for us to educate the market about using AI to scale data science initiatives, but also for us to learn about the most pressing needs faced by financial services companies.

On the first day, Jay Schuren, our Field CTO, presented to an audience of 50 executives. His demo used publicly available data from Yahoo Finance – such as cash flow, valuation metrics and stock prices – to predict which NYSE companies were the most over- and undervalued compared to the rest of the market. To say the least, Jay’s discoveries, as well as the seamless and automated way in which he created his financial models, spurred heavy booth traffic for the rest of our trip.*
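
We can’t reproduce Jay’s model here, but one simple (and purely hypothetical) way to frame that kind of screen is to regress a valuation measure on company fundamentals and treat the residuals as an over-/under-valuation signal. The input file and column names in the sketch below are invented for illustration; this is not the model Jay demoed:

```python
# Hypothetical sketch: flag potentially over-/under-valued tickers by
# regressing a valuation metric on fundamentals and ranking the residuals.
# The CSV file and column names are invented for illustration.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("nyse_fundamentals.csv")          # assumed columns used below
features = ["earnings_per_share", "book_value", "free_cash_flow", "revenue_growth"]
df = df.dropna(subset=features + ["market_cap"])

X = sm.add_constant(df[features])
y = df["market_cap"]

model = sm.OLS(y, X).fit()
df["residual"] = y - model.predict(X)

# Large positive residuals -> priced above what fundamentals suggest (overvalued);
# large negative residuals -> potentially undervalued.
print(df.sort_values("residual")[["ticker", "residual"]].head(10))   # most undervalued
print(df.sort_values("residual")[["ticker", "residual"]].tail(10))   # most overvalued
```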

Finance is an interesting animal. Many industries have relatively straightforward applications for machine intelligence. Utility companies are often interested in daily demand forecasting. Manufacturers look to optimize processes and design new materials. Retailers want to determine the best locations for new stores, while healthcare providers want to detect and treat diseases preemptively. But finance is a bit different.

Let’s take a timely analogy. As I was walking home last Friday, I saw probably half a dozen limos of Boston high-schoolers posing for photos and heading to prom. Most of our customers purchase Eureqa and just can’t help but gush about how excited they are to go to prom with us. Leading up to the big day, we show off our dance moves (give them a live demo) and take them out on a few dates (send them a free two-week trial), and by the end of our brief tryout, they’re bursting with energy and telling us all about their plans for the big dance. Trading firms, on the other hand, are the stunning mystery girls.** They’re smart, they’re confident, and you don’t think they should be shy, but when you ask them to prom, they shrug their shoulders indifferently and say, yeah, I guess that sounds cool. You raise an eyebrow, unsure if you just got a date or got slapped in the face with a frozen ham. But then she sees you drag racing around the neighborhood, and all of a sudden you’re the biggest heartthrob on the planet. What in the world just happened?

In the trading world, everything is about speed. It’s not only the speed at which a company can execute a trade (though there were plenty of vendors offering to shave off fractions of a second there), but also the time it takes a firm to arrive at an answer about how its market works – whether that’s determining when a currency is undervalued, when an asset is likely to appreciate significantly, or when a large loan is too risky. Everything in the trading game revolves around timing. And everyone. Loves. Speed. Where Eureqa instantly became interesting to attendees was the automation from raw data to an accurate analytical/predictive model – a process Eureqa consolidates, accelerating discovery by orders of magnitude.

A majority of trading technology on display was new hardware and software that incrementally improves time-to-execution. Milliseconds are important, but implementing a trading strategy that no one else has thought of or discovered could be game-changing. Nutonian will never compete with these other products and services directly. But we’re bringing more than one date to prom.

 

* Email us at contact@nutonian.com for a live demo of this particular application. We’d love to share our current use cases in financial services and explore how we might be a fit for others. 

** We’ll ignore the fact that, in reality, it seems like a “trading” prom would be about 95% guys. Woof.

Topics: Big data, Eureqa, Financial Services, Machine Intelligence, The Trading Show

Demystifying Data Science, Part 3: Scaling data science

Posted by Lakshmikant Shrinivas

15.05.2015 09:45 AM

In my last post in this series, I spoke about what goes into a data science workflow. The current state of the art in data science is not ideal: the value of data is limited by our understanding of it, and the process of going from data to understanding is pretty tedious. The right tools make all the difference. Imagine cutting down a tree with an axe instead of a chainsaw. If you cut trees for a living, wouldn’t you prefer the chainsaw? Even if you only had to cut trees occasionally, wouldn’t you prefer a chainsaw, because, well, chainsaw! The key here is automation. Ideally, you want as much of the process automated as possible, for the sake of productivity.

With data science, the two major bottlenecks are wrangling with data and wrangling with models. Data wrangling involves gathering and transforming data into a form suitable for modeling; several companies – indeed, the entire ETL industry – deal with this. Model wrangling involves creating and testing hypotheses, and building, testing and refining features and models. Eureqa helps with the model wrangling. It is the chainsaw that completely automates the process of creating, testing, refining and sharing models.

As I mentioned in my last post, the goals of modeling are pretty simple to express. We want to figure out whether a) all the terms in our model are important, and b) we’ve missed any term that would improve the accuracy significantly. Eureqa uses evolutionary algorithms to automatically create and test linear as well as non-linear models – sort of like the infinite monkey theorem, except that in our case, with modern computation and of course our secret sauce, “eventually finds the models” translates in practice to a few minutes or hours, far faster than any human could do it.
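
Eureqa’s engine is proprietary, but the basic evolutionary idea can be sketched in toy form: propose candidate model structures, score them against the data, keep the fittest, mutate, and repeat. The Python sketch below searches over which (possibly non-linear) terms to include and fits coefficients by least squares; it is a simplified illustration of the general technique, not Eureqa’s actual algorithm:

```python
# Toy illustration of evolutionary model-structure search (NOT Eureqa's
# actual algorithm): each "individual" is a subset of candidate terms;
# coefficients are fit by least squares and the fittest subsets are
# mutated over generations.
import numpy as np

rng = np.random.default_rng(1)

# Toy data with a hidden non-linear relationship.
x = rng.uniform(-3, 3, size=400)
y = 2.5 * np.sin(x) + 0.8 * x**2 + rng.normal(scale=0.2, size=x.size)

# Candidate basis terms the search can choose from.
terms = {"x": x, "x^2": x**2, "x^3": x**3, "sin(x)": np.sin(x),
         "cos(x)": np.cos(x), "exp(x)": np.exp(x)}
names = list(terms)

def fitness(mask):
    """Mean squared error of the least-squares fit using the selected terms,
    plus a small penalty for model complexity."""
    if not mask.any():
        return np.inf
    A = np.column_stack([terms[n] for n, m in zip(names, mask) if m])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.mean((y - A @ coef) ** 2) + 0.01 * mask.sum()

def mutate(mask):
    child = mask.copy()
    child[rng.integers(len(names))] ^= True     # flip inclusion of one random term
    return child

# Evolve a small population of term subsets.
population = [rng.random(len(names)) < 0.5 for _ in range(30)]
for generation in range(50):
    population.sort(key=fitness)
    survivors = population[:10]
    population = survivors + [mutate(p) for p in survivors] + \
                 [rng.random(len(names)) < 0.5 for _ in range(10)]

best = min(population, key=fitness)
print("selected terms:", [n for n, m in zip(names, best) if m])
```

On this toy data, the search typically converges on the sin(x) and x^2 terms – the structure hidden in the data – within a handful of generations.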

If you pause for a moment to think about it, it’s pretty powerful and liberating. As a would-be data scientist, using such a tool frees up your time to focus on the more creative aspects of data science. For example, what other data could we pull in that might affect our problem? What other types of problems could we model with our data? As a non-data scientist, using such a tool lowers the barrier to entry for modeling. Imagine having a personal army of robotic data scientists at your beck and call.

For me, this is one of the most exciting aspects of Nutonian’s technology. While most of the world is still talking about scaling analytics to ever-growing amounts of data, Eureqa can scale analytics to the most precious resource of all: people.

Topics: Big data, Demystifying data science, Scaling data science

Demystifying Data Science, Part 2: Anatomy of a data science workflow

Posted by Lakshmikant Shrinivas

22.04.2015 10:00 AM

“Data science” is a pretty hot topic these days. But what is it, exactly? Before I learned about data science, my reaction to it was best described by the following cartoon:

[Image: data science cartoon]

Data science is often described as the extraction of knowledge from data using statistical techniques. In this post I’m going to attempt to be a little more explicit about what goes into a data science workflow.

Let’s look at a popular modeling technique: linear regression. Regression is used to build a model that predicts some numerical quantity you’re interested in, e.g., the weekly sales in a retail store. The steps involved in building a model are below (a minimal Python sketch of this loop follows the list):

  1. Prepare the data: Real-world data is often noisy. Some values may be missing, and there may be outliers. The first step in modeling is to prepare a good sample of clean data by interpolating missing values, removing outliers, etc.
  2. Build a model: You might start by building a model that predicts weekly sales as a function of all other variables in the data. Domain expertise is often useful at this stage. For example, you may know that sales follow a seasonal cycle, so you would include time-of-year metrics as inputs. You also need to examine each term for statistical significance using its p-value, which tells you roughly how likely it is that the term’s apparent effect arose by chance; the higher the p-value, the more likely the term appears in the model by chance.
  3. Test the model: Once you have a model, you need to cross-validate it against data that was withheld from the modeling process. This helps ensure the model will generalize and work well on future data.
  4. Add or remove features: You need to remove statistically insignificant terms to avoid over-fitting the model to the data at hand. After the initial modeling, you may also need to add new features. For example, if you think the prior week’s advertisements may affect this week’s sales, you would create new features (i.e., new variables) for the prior week’s advertising metrics.
  5. Repeat until all terms are significant and you’re satisfied with the model structure.
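
Here is what that loop might look like in Python with statsmodels, on invented weekly-sales data. It is a minimal sketch of the manual workflow described above, not a complete methodology:

```python
# Minimal sketch of the manual regression workflow on invented weekly-sales
# data: prepare, fit, inspect p-values, cross-validate, drop weak terms.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 156  # three years of weekly data (toy)
df = pd.DataFrame({
    "week": np.arange(n),
    "promo_spend": rng.gamma(2.0, 500.0, size=n),
    "prior_week_ads": rng.poisson(3, size=n),
})
df["seasonality"] = np.sin(2 * np.pi * df["week"] / 52)
df["sales"] = (200 + 0.4 * df["promo_spend"] + 80 * df["seasonality"]
               + rng.normal(scale=40, size=n))   # prior_week_ads has no real effect

# Steps 1-2: prepare the data and fit an initial model with all candidate terms.
train, test = df.iloc[:120], df.iloc[120:]
model = smf.ols("sales ~ promo_spend + seasonality + prior_week_ads", data=train).fit()
print(model.pvalues)                             # prior_week_ads should show a high p-value

# Step 3: test on withheld data.
pred = model.predict(test)
print("holdout RMSE:", np.sqrt(np.mean((test["sales"] - pred) ** 2)))

# Steps 4-5: drop the insignificant term, refit, and repeat as needed.
refit = smf.ols("sales ~ promo_spend + seasonality", data=train).fit()
print(refit.pvalues)
```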

There are several problems with this flow:

  1. It’s a very manual process: Even though there are tools that help with individual steps in the flow, testing the model and adding/removing features (steps 3 and 4 above) require a lot of experience and are very labor-intensive.
  2. It’s easy to misinterpret results: P-values are sensitive to the amount of data you have, so it’s not enough to use a simple rule of thumb such as “p < 0.05 implies significance” (the short demonstration after this list makes the point concrete). Again, it takes a lot of experience with the data to understand how much data to use, what p-values make sense for a particular domain, etc.
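
To make the second point concrete, here is a toy simulation: the same small effect can look insignificant with little data and highly significant with a lot of it, so a fixed p < 0.05 cutoff means different things at different sample sizes. The numbers are simulated purely for illustration:

```python
# Toy demonstration that p-values shrink as the sample size grows, even when
# the underlying effect stays exactly the same (simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

def p_value_for(n, effect=0.02):
    x = rng.normal(size=n)
    y = effect * x + rng.normal(size=n)        # same small effect at every n
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    return fit.pvalues[1]                      # p-value of the slope term

for n in (100, 1_000, 100_000):
    print(f"n={n:>7}: p-value of the same effect = {p_value_for(n):.4f}")
```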

Despite all the complexity of the above process, the end goal is quite simple to express: you want a model in which every term contributes to the accuracy, and you want to be sure you haven’t missed a term that would improve the accuracy significantly. Other modeling algorithms, such as decision trees or neural nets, require different techniques to achieve the same goal, and in general they suffer from the same drawback: the process is bottlenecked by the data scientist’s ability to add or remove features based on domain expertise.

Eureqa completely automates this process from start to finish. The models generated by the engine vary in complexity (i.e., the number of terms), and each model satisfies the goal above. The user can see how each additional term affects the accuracy. Additionally, Eureqa’s models are easily interpretable by business users, since they can be expressed in plain English.

In the next post in this series, I’ll share some examples of how Eureqa helps users get an intuitive understanding of models through innovative UI visualizations.

Topics: Big data, Data science workflow, Demystifying data science
