
Business Analytics: Simple is Better. Always.

Posted by Ben Israelite

31.01.2017 10:35 AM

[Image: Richard Branson tweet]

Occam’s Razor, the well-known principle that the simplest solution to a problem is often the best, has been utilized by businesses for decades to solve their most significant and complicated problems. The integration of data analytics – the pursuit of extracting meaning from raw data – into an enterprise’s decision-making process should aid in this effort. Yet, as organizations ramp up their data analytics capabilities, black box algorithms and highly convoluted predictions have been favored over concise and actionable insights.

The process of developing an analytical expression to drive successful business outcomes is very difficult. Traditionally, individuals with advanced STEM degrees and a strong knowledge of technical tools (MATLAB, SAS, Stata, etc.) and programming languages (Python or R) spend weeks solving problems with data they spent months collecting, aggregating and transforming. The level of effort needed to implement this approach in big businesses is staggering but, seemingly, necessary in order to extract value from the vast amounts of data large companies are paying millions of dollars a year to store. The output generated through this approach, which ranges from simple linear models to black box machine learning algorithms (neural networks, SVMs, etc.), provides a prediction of what will happen but does not provide increased understanding or insight into decisive actions that can drive business results. Predictive accuracy became the most important metric in analytics because being right was prioritized over providing understanding.

The time has come for this to change. Companies must harness simplicity in order to generate significant business value moving forward. Rather than simply learning what will happen (sales tomorrow will be x), companies need to also understand why it will happen (marketing takes two weeks to influence sales, weather impacts in-person purchases, etc.).

The only way to do the latter is to build simple, interpretable (parsimonious!) models. Simple models deliver results that are as accurate as black box approaches but impact the business much more profoundly. It is time for companies to stop hiding behind their initial approach to predictive modeling and jump head first into the future of machine intelligence.
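To make the contrast concrete, here is a minimal, hypothetical sketch (Python with scikit-learn, on synthetic "sales" data invented purely for illustration) of the trade-off described above: on a well-behaved problem, a simple linear model matches a black-box model's accuracy while also exposing why sales move the way they do.

```python
# A minimal, hypothetical sketch: a simple linear model can be as accurate as
# a black-box model while remaining readable. The "sales" data is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 2000
marketing_spend = rng.uniform(0, 100, n)   # assumed driver of sales
temperature = rng.uniform(-5, 30, n)       # weather effect on in-person purchases
sales = 3.0 * marketing_spend + 1.5 * temperature + 50 + rng.normal(0, 5, n)

X = np.column_stack([marketing_spend, temperature])
X_train, X_test, y_train, y_test = train_test_split(X, sales, random_state=0)

linear = LinearRegression().fit(X_train, y_train)
black_box = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

print("linear R^2:   ", r2_score(y_test, linear.predict(X_test)))
print("black-box R^2:", r2_score(y_test, black_box.predict(X_test)))
# The linear model also tells us *why*: each coefficient is a readable effect.
print("effect of marketing, weather:", linear.coef_)
```

On messier, nonlinear problems the black box may edge ahead on raw accuracy, but only the interpretable model hands the business a readable effect per driver, which is the point of the argument above.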

Topics: Machine Intelligence, Occam's Razor, Parsimonious models

Eureqa Hits Wall Street; Automatically Identifies Key Predictive Relationships

Posted by Jason Kutarnia

01.12.2016 10:46 AM

As a team of data scientists, analysts and software developers, we didn’t expect to be praised as financial gurus. But in an industry of ever-present uncertainty, with huge financial gains and losses at stake, Eureqa, the dynamic modeling engine, offers a unique competitive advantage in the technology stack: the ability to quickly derive extremely accurate and simple-to-understand models that predict what will happen in the future, and why.

Typically, Wall Street employs elite squadrons of quants and analysts to build models that forecast where individual stocks and other financial instruments are headed. Some firms, such as the consistently elite hedge funds, make delightful profits by “beating the market,” i.e., outperforming an industry-standard index like the S&P 500. Other financial institutions make their money simply off the commissions and fees they charge. The laggards have significant room for improvement: instead of leveraging only industry news and well-known metrics like return on equity, price/earnings ratio and idiosyncratic volatility, they could use their stockpiles of data to search for signals and early indicators that an investment is primed to tumble or soar. Hunches and over-simplified metrics should be a thing of the past, and the proof should be in the pudding (the data). Some things, like natural disasters and leadership changes, are not always part of the data, but for everything else…there’s Mastercard. Err, Eureqa.

And for those overachievers – the hedge funds, the private wealth management firms, the day traders – who think they have mastered their own domain, we’re here to tell you: there’s a lot of room for improvement. Financial models are time-consuming to build, often taking weeks or months to refine…and meanwhile, the markets, whether moving up or down, are making people money while you’re on the sidelines crafting your models. In addition to the time sink, models built by hand in tools like R and SAS are neither as accurate as they could be nor easy to interpret. The result is that firms are leaving millions on the table and not understanding why the markets or assets behave as they do. It’s one thing to predict that real estate will beat the market in 2017, based on an algorithm that contains 2,000 variables and mind-numbingly complex transformations of those variables. But what if I could accurately predict that real estate in the Northeast U.S. will appreciate 10-12%, that I should leave the Midwest untouched, and that the “drivers” of this growth will be four truly impactful variables: demographic growth of Millennials moving into the cities, wage increases, job growth, and a slowing of new construction permits? I could not only make more money, but I could justify all of my investments beforehand with a comprehensive understanding of “how things work.”

In order to validate Eureqa’s approach to a major investment firm, I built a simple trading strategy using the stocks in the S&P 500. The goal was to forecast whether a stock’s excess monthly return – the difference between the stock’s return and the overall S&P 500 return – would be positive or negative. In our strategy, we bought a stock if Eureqa predicted its excess return would be positive, and we shorted any stocks Eureqa thought would be negative.
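The mechanics are straightforward; here is a hedged sketch in Python with pandas. The column names (stock_return, sp500_return, predicted_positive) are hypothetical stand-ins for the actual dataset and for the model's positive/negative call.

```python
# A hedged sketch of the long/short strategy described above. Column names
# are hypothetical; "predicted_positive" stands in for the model's output.
import numpy as np
import pandas as pd

def build_signals(monthly: pd.DataFrame) -> pd.DataFrame:
    """Go long stocks predicted to beat the index, short those predicted to lag."""
    out = monthly.copy()
    # Excess return = the stock's monthly return minus the S&P 500's return.
    out["excess_return"] = out["stock_return"] - out["sp500_return"]
    # +1 = buy, -1 = short, driven by the model's predicted sign.
    out["position"] = np.where(out["predicted_positive"], 1, -1)
    # Realized strategy return is positive whenever the predicted sign was right.
    out["strategy_return"] = out["position"] * out["excess_return"]
    return out
```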

Immediately, the client saw the enormous value of Eureqa. Leveraging publicly available data sets through 2014, in a matter of a few hours Eureqa created classification models unique to each industry (retail, finance, technology, etc.), and we plugged individual companies into the models to predict whether the stock would achieve excess return for 2015. We then hypothetically created a simple, equal-weighted portfolio of the predicted “overachievers”. Remarkably, Eureqa’s anticipated winning portfolio achieved a compound excess return of 14.1% for the following year, compared with the S&P 500’s disappointing -0.7%. Not only was our portfolio’s performance exceptional, but so was our fundamental understanding of the causes of its success. We could convey to our hypothetical clients, bosses and others that not only did our strategy work this year, but it’s likely to work again next year, because some of the key drivers of excess returns for stock X are variables Q, R, S, T, U and V, and this is how the stock is likely to move in the context of the current economy. In a matter of hours, with Eureqa at my side, a graduate student in tissue motion modeling transformed into a powerful financial analyst with a theoretical market-beating investment portfolio. Now, imagine what this application could do with even more data, and in the hands of a true industry expert…
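For the evaluation step, a minimal sketch of how an equal-weighted portfolio's excess returns compound over the year; the toy numbers below are placeholders, not the actual 2015 results.

```python
# A minimal sketch of the back-of-the-envelope evaluation above: an
# equal-weighted portfolio of predicted "overachievers", with monthly excess
# returns compounded over the year. Toy numbers only.
import pandas as pd

def compound_excess_return(monthly_excess: pd.DataFrame) -> float:
    """Rows = months, columns = stocks in the portfolio, values = excess returns."""
    # Equal weighting: the portfolio's monthly excess return is the simple
    # average across the stocks held that month.
    portfolio_monthly = monthly_excess.mean(axis=1)
    # Compound the monthly figures into one annual excess return.
    return float((1.0 + portfolio_monthly).prod() - 1.0)

# Example with made-up numbers for three stocks over three months.
toy = pd.DataFrame({"AAA": [0.02, -0.01, 0.03],
                    "BBB": [0.01, 0.00, 0.02],
                    "CCC": [-0.02, 0.04, 0.01]})
print(compound_excess_return(toy))
```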

Topics: Eureqa, Financial Analysis, Machine Intelligence

Machine Intelligence with Michael Schmidt: Searching data for causation

Posted by Michael Schmidt

27.07.2016 10:03 AM

The holy grail of data analytics is finding “causation” in data: identifying which variables, inputs, and processes are driving the outcome of a problem. The entire field of econometrics, for example, is dedicated to studying and characterizing where causation exists. Actually proving causation, however, is extremely difficult, typically involving carefully controlled experiments. To even get started, analysts need to know which variables are important to include in the evaluation, which need to be controlled for, and which to ignore. From there, they can build a model, design an experiment to test its causal predictions, and iterate until they arrive at a conclusion.

Proving causation relies heavily on these smart assumptions. What if you forgot to control for age, demographics, or socioeconomic conditions? It’s difficult to figure out how to start framing the problem to analyze causal impact. But this is a task that machines were born to solve.

There are two important steps required to identify causation: 1) among many possible variables, finding the few that are actually relevant, and 2) given a limited set of variables, executing the transformations needed to reveal the extent of each variable’s impact.
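As a generic illustration of those two steps (using off-the-shelf Python tools, not Eureqa's engine): rank the candidate variables by relevance, then fit a small model with assumed transformations to estimate the impact of each survivor.

```python
# A generic sketch of the two steps above, not Eureqa's method:
# (1) find the few variables that carry signal, (2) estimate each one's impact.
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))                      # 10 candidate variables
y = 2.0 * X[:, 3] - 0.5 * X[:, 7] ** 2 + rng.normal(0, 0.1, 500)

# Step 1: rank candidate variables; only a few are actually relevant.
relevance = mutual_info_regression(X, y)
top_two = np.argsort(relevance)[-2:]
print("most relevant variables:", sorted(top_two))  # expect [3, 7]

# Step 2: apply candidate transformations to the shortlisted variables and
# estimate the size of each effect (a quadratic term is assumed for x7 here).
features = np.column_stack([X[:, 3], X[:, 7] ** 2])
impact = LinearRegression().fit(features, y)
print("estimated impacts:", impact.coef_)           # roughly [2.0, -0.5]
```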

[Image: Determining causation with machine intelligence]

For the first time, there exists software that helps companies reliably determine causation from raw, seemingly chaotic data.

People often use Eureqa for its ability to start from the ground up and “think like a scientist,” sifting through billions of potential models, structures, and nonlinear operations from scratch to create the ideal analytical model for your unique dataset – without needing to know the important variables or model algorithm ahead of time. Eureqa’s modeling engine effectively generates theories of causation through this process of building analytical models from a dataset. Eureqa doesn’t attempt to prove causality on its own, but instead yields a very special form of model that can be interpreted physically for causal effects.
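To give a flavor of the idea (and only the idea; this is a toy sketch, not Eureqa's actual algorithm), here is a tiny search over a handful of candidate model structures, scored on both fit and complexity so that simpler explanations are preferred.

```python
# A toy illustration of searching over candidate model structures from scratch
# and preferring accurate *and* simple explanations. Not Eureqa's algorithm.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, 200)
y = np.sin(x) + 0.05 * rng.normal(size=200)     # unknown system to "explain"

# Candidate model structures, each paired with a rough complexity score.
candidates = {
    "a*x":      (lambda x, a: a * x,         1),
    "a*x**2":   (lambda x, a: a * x ** 2,    2),
    "a*sin(x)": (lambda x, a: a * np.sin(x), 2),
    "a*exp(x)": (lambda x, a: a * np.exp(x), 3),
}

results = []
for name, (f, complexity) in candidates.items():
    # Fit the single free parameter by least squares.
    basis = f(x, 1.0)
    a = float(basis @ y / (basis @ basis))
    error = float(np.mean((y - f(x, a)) ** 2))
    results.append((error, complexity, f"{a:.2f}*{name[2:]}"))

# Rank by error first, then by complexity (a crude Occam's Razor).
for error, complexity, expr in sorted(results):
    print(f"{expr:<16} error={error:.4f} complexity={complexity}")
```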

One of the biggest open problems in machine learning (and analytics in general) is avoiding spurious correlations and similar non-causal effects. In fact, there’s likely no perfect solution despite the advances we’ve made; ultimately a person needs to interpret the findings and provide context not contained in the data alone. Two of the most-used features in Eureqa are the covariates window and the ability to block and replace variables in a model – features we’ve added specifically so users can work with Eureqa as they model complex systems.

There is some exciting research taking place, however, connecting Eureqa to live biological experiments to automatically guide experimentation and test predictions. While this research is still ongoing, perhaps a physical robot scientist is around the corner.

Topics: Causation, Eureqa, Machine Intelligence

The “First Mover’s” Analytics Stack, 2015 vs. 2016

Posted by Jon Millis

01.07.2016 10:00 AM

The irony of data science is the simultaneously glacial and blazing speed at which the industry seems to move. It’s been more than 10 years since the origin of the phrase “big data”, and yet what we initially set out to accomplish – extracting valuable answers from data – is still a painstaking process. Some of this could be attributed to what Gartner refers to as the “Hype Cycle”, which hypothesizes that emerging technologies experience a predictable wave of hype, trials and tribulations before they hit full-scale market maturity: technology trigger → peak of inflated expectations → trough of disillusionment → slope of enlightenment → plateau of productivity.

The true skeptics call it all a data science bubble. But answer me this: if we’re in the midst of a bubble, how can we explain the sustained, consistent movement of tech luminaries and innovators into the market year after year? Sure, a healthy economy is full of new competitors vying for market share, creative destruction, and eventual consolidation, but take a look at this diagram and try to explain how so many people could be so wrong about data science. It’s hard to imagine we’re in a bubble when all around us is an indefinitely growing ecosystem of tools, technologies and investment. Then again, as we’re well aware, nothing bad happened after heaps of money were piled into mortgage-backed securities in the early 2000s, and oil speculators have made a killing off of $5/gallon gas prices in 2016.

We kid, we kid. Of course there are illogical investments and industries that miss, but we maintain our belief that there is astounding value in data. Not all companies have capitalized on it yet, but the problems, the dollars, and the benefits to society as a whole are real. Data science is here to stay.

With an ecosystem now overflowing with tools, approaches and technologies, how can we understand general market trends? What kinds of tools and technologies make up a typical company’s analytics “stack”? More importantly, where are the “first movers” making investments to capitalize on data? To find out, we share general insights we’ve gleaned from talking with our customers and clients, a mix of Fortune 500 behemoths and data-driven start-ups.

Here’s what the 2015 analytics stack looked like:

[Image: The 2015 data analytics stack]

Let’s take an outside-in approach, beginning with the raw data and getting closer and closer to the end user.

Data preparation – The cleansing layer of the ecosystem, where raw streams of data are prepped for storage or analysis. Ex., Informatica, Pentaho, Tamr, Trifacta

Data management – The data storage and management layer of the ecosystem, where data sits either structured, semi-structured or unstructured. Ex., ArcSight, Cloudera, Greenplum, Hortonworks, MapR, Oracle, Splunk, Sumo Logic, Teradata Aster, Vertica

Visualization – The visualization and dashboarding layer of the ecosystem, where business analysts can interact with, and “see”, their data and track KPIs. Ex., Microstrategy, Qlik, Tableau

Statistical – The statistical layer of the ecosystem, where statisticians and data scientists can build analytical and predictive models to predict future outcomes or dissect how a system/process “works” to make strategic changes. Ex., Cognos, H2O, Python, R, RapidMiner, SAS, SPSS

Simple enough, right? The most data-savvy organizations make it look like a cakewalk. But take a closer look, and you’ll notice there’s a significant difference between the outer two “orbits” and the inner orbit: the inner orbit is fragmented. This does not fit with the smooth flow of the rest of the solar system.

Why are two systems occupying the same space? Because they’re both end-user analyst and data science tools that aim to deliver answers to the business team. Nutonian’s bashfully modest vision is to occupy the entire inner sphere of how people extract answers from data, with the help of “machine intelligence”. While Nutonian’s AI-powered modeling engine, Eureqa, plays nicely with statistical and visualization tools via our API, we’re encouraging companies who are either frustrated by their lack of data science productivity or who have greenfield projects to invest in Eureqa as their one-size-fits-almost-all answers machine.

Our vision is to empower organizations and individual users to make smart data-driven decisions in minutes. Eureqa automates nearly everything accomplished in the statistical layer and the visualization layer of the analytics stack – with the exception of the domain expert himself, who’s vital to guiding Eureqa in the right direction. The innovative “first movers” in 2016 are putting the data they’ve collected to good use, and consolidating the asteroid belt of tools and technologies banging together in the inner orbit of their solar systems. It’s the simple law of conservation of [data science] energy.

Topics: Analytics stack, Big data, Eureqa, Machine Intelligence

Machine Intelligence with Michael Schmidt: IBM’s Watson, Eureqa, and the race for smart machines

Posted by Michael Schmidt

16.05.2016 11:12 AM

Three months ago I spoke at a conference affectionately titled “Datapalooza” sponsored by IBM. My talk covered how modern AI can infer the features and transformations that make raw data predictive. I’m not sure exactly how many IBM people were in the crowd, but two IBM database and analytics leads grabbed me after the talk:

“We love what you’re doing. The Watson team is attempting to do things like this internally but is nowhere near this yet.” – [names withheld]

What’s interesting is that Watson has been coming up more and more recently when I speak to customers. The billions of dollars IBM has invested in marketing Watson have created an air of mystery and hype around what artificial intelligence can do. In fact, IBM expects Watson to grow to over $1B per year in revenue in the next 18 months. Yet we haven’t seen any prospect choose Watson over Eureqa to date. So what’s going on?

[Image: Michael Schmidt presenting on machine intelligence at IBM Datapalooza]

Speaking at IBM’s Datapalooza (2016) in Seattle, WA.

I remember the excitement in the AI lab (CCSL) at Cornell University when IBM’s Watson computer competed in the game show Jeopardy in 2011. A group of us watched live as the computer beat the show’s top player, Ken Jennings.

IBM had pioneered one of the most interactive AI systems in history. Instead of simulating chess moves faster than ever before (as its predecessor Deep Blue had done), Watson appeared to actually “think.” It interpreted natural language and searched data sources for a relevant response. It inspired similar technology, like Apple’s Siri and Microsoft’s Cortana, which came out over the next few years.

Unlike Apple, Google, Facebook, and others, however, IBM recognized an enormous opportunity in the enterprise market. Every business in the world today stockpiles data faster than it can be analyzed. Literally hundreds of billions of dollars in value lie in the applications of this data. Perhaps the technology that could win quiz competitions like Jeopardy could unlock some of this value as well. IBM decided to step out of the safe confines of a specific application and attempt to tackle business data and real-world problems with commercial deployments of Watson.

[Image: Google searches for IBM Watson]

Google search interest in IBM Watson, tied to its Jeopardy appearance, over 10 years.

Coincidentally, I began working on the preliminary technology behind Eureqa around the same time. Eureqa was focused on a broader challenge; instead of trying to interpret sentences and look up responses, Eureqa was tasked with deducing how any arbitrary system behaved – just provide the data/observations. It became the first AI that could think like a scientist and produce new explanations for how any system worked.

The similarity, and the power, of both Eureqa and Watson is that they are examples of Machine Intelligence – meaning the answers they output can be meaningfully interpreted and understood, as opposed to some statistical prediction or data visualization. But this is where the similarities end.

Watson’s great challenge has been adapting its technology for answering trivia questions to real business problems. Despite the prevalence of unstructured text data, very few new business problems appear to be blocked by the ability to look up relevant information in text. From the WSJ: “According to a review of internal IBM documents and interviews with Watson’s first customers, Watson is having more trouble solving real-life problems.”

The data that most businesses have today consists of log data, event data, sensor data, sales data, or other numeric data – data where Watson’s core technology doesn’t apply. The major questions they need answered relate to what causes other things to happen, what triggers or blocks certain outcomes, or simply what’s possible with the data they have and where to even begin. To me, the key interest and success behind Eureqa has come from its applicability to real business data and problems. It finds physical relationships and interpretable models, which answer these types of questions directly.

Earlier this year, IBM announced they are splitting the different components inside Watson into individual services instead of trying to map a complete solution for customers. They may no longer be a pioneer in the space, but perhaps they’re starting to acknowledge what businesses need most today.

 

Topics: Artificial intelligence, IBM Watson, Machine Intelligence
