# Blog

Models are the foundation for predicting outcomes and forming business decisions from data. But not all models are created equal. Models range from simple trend analysis to deep, complex predictors and precise descriptions of how variables behave. One of the most powerful forms of model is an “analytical model” – that is, a model that can be analyzed, interpreted, and understood. In the past, analytical models have remained the most challenging type of model to obtain, requiring incredible skill and knowledge to create. However, modern AI can infer these models directly from data.

A mathematical model (or analytical model) is a “description of a system using mathematical concepts and language. Mathematical models are used not only in the natural sciences (such as physics, biology, earth science, meteorology) and engineering disciplines (e.g. computer science, artificial intelligence), but also in the social sciences (such as economics, psychology, sociology and political science); physicists, engineers, statisticians, operations research analysts and economists use mathematical models most extensively. A model may help to explain a system and to study the effects of different components, and to make predictions about behaviour.”

An example analytical model inferred by AI from data (Eureqa).

There’s a reason why every field of science uses math to describe and communicate the intrinsic patterns and concepts in nature, and why business analysts design mathematical models to analyze business outcomes. Essentially, these models give the most accurate predictions and the most concise explanation behind them. They allow us to forecast into the future and understand how things will react under entirely new circumstances.

Other forms of models are easier to create, but are less powerful to use. For example, linear models, polynomial models, and spline models can be used to fit curves and quantify past trends. They can estimate rates of change or interpolate between past values. Unfortunately, they are poor at extrapolating and predicting future data because they are too simple to capture general relationships. Similarly, these models often need to be quite large and high-dimensional in order to capture global variation in the data at all, which subsequently often makes them difficult or impossible to interpret.
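To make the extrapolation problem concrete, here is a minimal sketch in plain Python. It assumes a made-up "true" system (a saturating curve that is not from any real data set) and fits a straight line to samples taken only between 0 and 1. The function names `true_system`, `fit_line`, and `predict` are ours, chosen for illustration.

```python
def true_system(x):
    # Hypothetical ground truth: growth that saturates as x increases.
    return 10.0 * x / (1.0 + x)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Training data covers only the range 0.0 .. 1.0.
train_x = [i / 10 for i in range(11)]
train_y = [true_system(x) for x in train_x]
a, b = fit_line(train_x, train_y)


def predict(x):
    return a + b * x


# Error inside the observed range versus far outside it.
interp_error = abs(predict(0.55) - true_system(0.55))
extrap_error = abs(predict(5.0) - true_system(5.0))
print(f"interpolation error: {interp_error:.3f}")
print(f"extrapolation error: {extrap_error:.3f}")
```

Inside the sampled range the line tracks the curve reasonably well. Outside it, the line keeps climbing while the real system levels off, so the error grows by more than an order of magnitude.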

Many open-source algorithms in machine learning attempt to improve the predictive accuracy over standard linear or nonparametric methods. Decision trees, neural networks, ensembles, and the like contain more complex and nonlinear operations that are more efficient at encoding trends in the data. In order to improve accuracy, they generally apply the same nonlinear transformation over and over (such as a logistic function or split average), regardless of the actual underlying system. This makes these models almost impossible to interpret meaningfully. They also require significant expertise from those using them; entire competitions are held for experts to tune and control parameters of these algorithms and models to prevent them from overfitting the data, which limits where they can be applied.

Deep learning methods in machine learning can be viewed as an extreme, producing enormously large, complex models. These models perform extremely well on a few particular types of problems that have dense data, like images and text, where there are thousands of equally important inputs. Deep neural networks typically will use every input available, even when completely irrelevant or spurious, which makes them difficult to use where the important variables and inputs are unknown ahead of time.

The power of analytical models is that they use the least amount of complexity possible in order to achieve the same accuracy. Instead of reapplying the same transformation over and over, the structure of the model is specific to the system being modeled. This makes the model’s structure special – it is by definition the absolute best structure for the data, and the simplest and most elegant hypothesis on how the system works. The drawback to analytical models is that finding them requires a significant amount of computational effort.
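One way to picture "the least complexity for the same accuracy" is the small sketch below. It is not Eureqa's search algorithm. It only illustrates the selection rule, using a hand-written list of candidate formulas, a made-up complexity score for each, and data from a hypothetical system y = 3x². All names here are our own assumptions.

```python
def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical observations from a system that follows y = 3 * x^2.
xs = [0.5 * i for i in range(1, 9)]
ys = [3.0 * x * x for x in xs]

# (formula text, illustrative complexity score, callable model)
candidates = [
    ("2*x", 2, lambda x: 2.0 * x),
    ("3*x^2", 3, lambda x: 3.0 * x * x),
    ("3*x^2 + 0*x^3 + 0*x^4", 9,
     lambda x: 3.0 * x * x + 0.0 * x ** 3 + 0.0 * x ** 4),
]


def simplest_accurate(candidates, xs, ys, tolerance=1e-9):
    """Among models whose error is within `tolerance` of the best error,
    return the (formula, complexity, error) tuple with the lowest complexity."""
    scored = [(name, size, mean_squared_error(f, xs, ys))
              for name, size, f in candidates]
    best_error = min(err for _, _, err in scored)
    accurate = [s for s in scored if s[2] <= best_error + tolerance]
    return min(accurate, key=lambda s: s[1])


chosen = simplest_accurate(candidates, xs, ys)
print(chosen[0])
```

Two of the candidates fit the data equally well, but only one of them does so without carrying extra terms. The hard part in practice is searching the space of possible formulas, which this sketch sidesteps by listing the candidates by hand.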

Our mission with Eureqa has been to solve this challenge at scale, and we’ve already seen major impacts in both science/research and business/enterprise. For me personally, I’m most excited by the prospect of using machine intelligence for analytical modeling, where instead of completely automating a task or simply fitting data, the machines are making discoveries in the data we collect and interpreting them back to us automatically. Automation has never been so beneficial.

Speaking at the Open Data Science Conference (ODSC) last week, I discussed where artificial intelligence is going, what it will automate, and what its impact will be on science, business, and jobs. While the impact from Eureqa has been overwhelmingly positive, many are warning about a darker future:

“With artificial intelligence we are summoning the demon. In all those stories where there’s the guy with the pentagram and the holy water, he’s sure he can control the demon [but it] doesn’t work out.” –Elon Musk

Elon Musk, in the above quote, is worried about a very specific area of AI research – the sentient autonomous AI and robotics research as popularized in movies.

In fact, the press has characterized Eureqa as a “Robot Scientist” as well, speculating that advanced tasks like scientific inquiry may one day become automated by machines. However, Eureqa was born out of the challenge to accelerate and scale the complexity of problems that we can tackle and solve – not simply mimic human behavior.

The areas of AI focused on simply learning tasks and replicating human behavior (e.g. IBM Watson or Google AlphaGo) are much hazier. It’s not clear what type of impact this trajectory will have.

The research group OpenAI, founded to support “beneficial” AI research, signaled last month that they are focused entirely on this type of AI. Their platform OpenAI Gym enables researchers to develop reinforcement learning algorithms. Reinforcement learning is a class of machine learning algorithms used for tasks like chat bots, video games, and robots. Interestingly, it doesn’t typically start with data or try to learn from an existing data set; it attempts to learn to control an agent (like a robot) based purely on a set of actions it can take and its current state.
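For readers who have not seen reinforcement learning up close, here is a minimal sketch of one classic method, tabular Q-learning. It does not use OpenAI Gym. The environment is a toy corridor we made up, and the function names and parameter values (`step`, `train`, `alpha`, `gamma`, `epsilon`) are illustrative assumptions. The agent starts with no data, only its current position and two possible actions.

```python
import random

N_STATES = 5        # positions 0..4; position 4 is the goal
ACTIONS = (-1, +1)  # index 0 = step left, index 1 = step right


def step(state, action):
    """Toy environment: move along the corridor; reward 1 on reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    # q[state][action_index] estimates the long-run value of an action.
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                a = rng.randrange(2)  # explore: random action
            else:
                best = max(q[state])  # exploit: greedy, ties broken at random
                a = rng.choice([i for i in range(2) if q[state][i] == best])
            next_state, reward, done = step(state, ACTIONS[a])
            future = 0.0 if done else gamma * max(q[next_state])
            q[state][a] += alpha * (reward + future - q[state][a])
            state = next_state
    return q


q_table = train()
# Greedy policy for every non-goal position.
policy = ["left" if row[0] > row[1] else "right" for row in q_table[:-1]]
print(policy)
```

Nothing in the loop looks at a historical data set. Everything the agent knows comes from trying actions and observing the reward, which is part of why this style of learning maps awkwardly onto problems that begin with a large table of existing records.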

The downside of reinforcement learning is that it is not immediately applicable or natural for most business problems that I observe today. That is, businesses are not clamouring for chat bots or interactive agents; they tend to have more data than they can analyze and are invested in putting it to work instead.

Of all areas of machine learning and AI, reinforcement learning may be the furthest out. But early research is producing some exciting results, for example learning to play video games like Mario from trial and error.

It’s important to keep in mind, however, how far there is to go before the sentient AI systems that Musk and OpenAI are alluding to may arise. Last week the White House Science and Technology Office concluded that despite improvement in areas like machine vision and speech understanding, AI research is still far from matching the flexibility and learning capability of the human mind. That said, I’ll be rooting for OpenAI to keep this area of AI beneficial for us as it matures.

Three months ago I spoke at a conference affectionately titled “Datapalooza” sponsored by IBM. My talk covered how modern AI can infer the features and transformations that make raw data predictive. I’m not sure exactly how many IBM people were in the crowd, but two IBM database and analytics leads grabbed me after the talk:

“We love what you’re doing. The Watson team is attempting to do things like this internally but is nowhere near this yet.” – [names withheld]

What’s interesting is that Watson has been coming up more and more recently when I speak to customers. The billions of dollars IBM has invested to market Watson have created an air of mystery and hype around what artificial intelligence can do. In fact, IBM is expecting Watson to grow to over \$1B per year in revenue in the next 18 months. Yet we haven’t seen any prospect choose Watson over Eureqa to date. So what’s going on?

Speaking at IBM’s Datapalooza (2016) in Seattle, WA.

I remember the excitement in the AI lab (CCSL) at Cornell University when IBM’s Watson computer competed in the game show Jeopardy in 2011. A group of us watched live as the computer beat the show’s top player, Ken Jennings.

IBM had pioneered one of the most interactive AI systems in history. Instead of simply simulating more chess moves than before (as its predecessor Deep Blue had done), Watson appeared to actually “think.” It interpreted speech and searched data sources for a relevant response. It inspired similar technology, like Apple’s Siri and Microsoft’s Cortana, which came out over the next few years.

Unlike Apple, Google, Facebook, and others, however, IBM recognized an enormous opportunity in the market. Every business in the world today stockpiles data faster than it can be analyzed. Literally hundreds of billions of dollars in value lies in the applications of this data. Perhaps the technology that could win quiz competitions like Jeopardy could unlock some of this value as well. IBM decided to step out of the safe confines of a specific application, and attempted to work with business data and real-world problems with commercial deployments of Watson.

The interest in IBM Watson tied to Jeopardy over 10 years.

Coincidentally, I began working on the preliminary technology behind Eureqa around the same time. Eureqa was focused on a broader challenge; instead of trying to interpret sentences and look up responses, Eureqa was tasked with deducing how any arbitrary system behaved – just provide the data/observations. It became the first AI that could think like a scientist and produce new explanations for how any system worked.

The similarity, and the power, of both Eureqa and Watson is that they are examples of Machine Intelligence – meaning the answers they output can be meaningfully interpreted and understood, as opposed to some statistical prediction or data visualization. But this is where the similarities end.

Watson’s great challenge has been adapting its technology for answering trivia questions to real business problems. Despite the prevalence of unstructured text data, very few new business problems appear to be blocked by the ability to look up relevant information in text. From the WSJ: “According to a review of internal IBM documents and interviews with Watson’s first customers, Watson is having more trouble solving real-life problems.”

The data that most businesses have today consists of log data, event data, sensor data, sales data, or other numeric data (data where Watson’s core technology doesn’t apply). The major questions they need answered relate to what causes other things to happen, what triggers or blocks certain outcomes, or simply: what’s possible with the data I have, and how do I even begin? To me, the key interest and success behind Eureqa has come from its applicability to real business data and problems. It finds physical relationships and interpretable models, which answer these types of questions directly.

Earlier this year, IBM announced they are splitting the different components inside Watson into individual services instead of trying to map a complete solution for customers. They may no longer be a pioneer in the space, but perhaps they’re starting to acknowledge what businesses need most today.

After 2 years in a row of coming up roses, we’ve got our sights set on a 3rd year of success with the Kentucky Derby. We’ve got our handicapping data from Brisnet.com and we’ve prepped with plenty of mint juleps (drinks help you bet smarter, right?). Now we’ve spent the past couple days combining Eureqa’s data discovery horsepower with the raw horse power on the track to find out who’ll be in the winner’s circle for the 142nd running of the Kentucky Derby.

Rather than skimming through the daily racing form before madly rushing the tellers with our bets, we turned to our tame A.I.-powered modeling engine, Eureqa, to automatically build, evaluate, and analyze billions of models to discover the most predictive factors. Eureqa’s machine intelligence lets us read and interpret the models, helping us steer the engine towards fruitful paths and away from red herrings. In the end, we found a model that combined these 5 key factors:

• Standardized live odds probability
• Speed over the past two races
• Post position
• Racing style
• Track conditions

So where does that leave us?

Eureqa’s Top 5:

1. Nyquist
2. Gun Runner
3. Exaggerator
4. Creator
5. Mohaymen

Want to try your paces with your own data against Eureqa? Come talk to us — and in the meantime, check back with us after the race to see how our predictions have panned out. With Eureqa at the wheel, we’re sure we’ll be riding “derby”.

In March, the US Department of Education released its latest College Scorecard to “provide insights into the performance of schools eligible to receive federal financial aid, and offer a look at the outcomes of students at those schools.” Fortunately for us data-driven strategists (read: nerds) at Nutonian, the government also released the raw data it used to arrive at its summary results and findings.

While Washington’s number-crunchers did a nice job increasing transparency about each college’s strengths and lifetime earnings ROI, there was one angle that was noticeably absent given the election cycle: a deep-dive into loan repayment rates. With so many students and families adamant that the current loan structure is broken and leads to a blatant poverty trap, why haven’t more analysts dug into this question? How flawed are current loan costs, if at all, and what leads to students being unable to pay them off?

We put machine intelligence to the test to automatically sift through the College Scorecard data set to highlight the most important relationships and predictive variables that influence loan repayment. It’s important to note that we only used the quantitative inputs available to us, and consequently, variables like motivational drive, professional network, etc. will not show up in our models, despite potentially playing a significant role in a student’s ability to repay his/her loans.

Our focus will be on the Scorecard variable “Repayment Rate 7 Years from Graduation”, or the percentage of students able to make any contribution to their loans 7 years after graduating college. The first step in using Eureqa is simply formulating a question. In this case, we’ll ask: “What causes low repayment rates?”

Using Eureqa, we built a model to predict the likelihood a school will have a repayment rate below 80% after 7 years. More interestingly, we were able to quickly identify a few of the drivers (“features”) of repayment.

After running Eureqa for five minutes, we found that repayment rate is:

• Positively correlated with parent/guardian income – The higher the family’s income, the more likely the student is to repay his or her loans.
• Negatively correlated with a school’s percentage of students on loans – The higher the proportion of a school’s students that are on loan programs, the less likely a student is to repay his or her loans.
• Negatively correlated with a school’s percentage of non-white students – The higher a school’s proportion of non-white students, the less likely a student is to repay his or her loans.
• Negatively correlated with a school’s acceptance rate – The higher a school’s acceptance rate, the less likely a student is to repay his or her loans.
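For anyone who wants to sanity-check this kind of finding by hand, the sketch below shows the two basic operations involved: flagging schools whose repayment rate falls below 80%, and measuring the sign of a correlation. The five rows are invented purely for illustration and are not College Scorecard values. The column meanings and the `pearson` helper are our own assumptions, and this is not how Eureqa builds its models.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# INVENTED toy rows, one per hypothetical school:
# (family income in $1000s, share of students on loans, 7-year repayment rate)
schools = [
    (35, 0.80, 0.62),
    (48, 0.71, 0.70),
    (60, 0.66, 0.78),
    (75, 0.55, 0.85),
    (98, 0.41, 0.92),
]
income = [row[0] for row in schools]
on_loans = [row[1] for row in schools]
repayment = [row[2] for row in schools]

# Binary target matching the question asked: is repayment below 80%?
low_repayment = [rate < 0.80 for rate in repayment]

r_income = pearson(income, repayment)    # sign shows direction of relationship
r_loans = pearson(on_loans, repayment)
print(low_repayment)
print(r_income > 0, r_loans < 0)
```

A correlation coefficient only describes direction and strength of a linear relationship. It says nothing about cause, which is why the list above should be read as drivers worth investigating rather than explanations.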

Figure 1, below, shows the likelihood that students will default on their loans (y-axis) plotted against that student’s family income (x-axis). Default rate is remarkably high until family income hits about \$60,000, and then it plummets. Let’s think about that for a second. If a family is making less than \$50,000 per year, it’s more likely than not that their child will default on a loan payment and incur even more expenses as a penalty. For a lower or middle-class family hoping to send its child to school to climb the economic ladder, the system, to put it mildly, is not doing them any favors.

So what steps could the government take in addition to reexamining the pricing structure of their loan rates? Economists agree that successful completion of a college degree trends with better outcomes not just for an individual, but for society as a whole. College degrees generally spell higher incomes and intellectual capital, both of which college graduates use to enrich other people around them. One way the US government tries to “nudge” more people in the direction of a college degree is by issuing Pell Grants, financial assistance packages for students that do not need to be repaid. Most Pell Grants sit between \$3,700 and \$5,700 per year.

Unfortunately, there’s a positive correlation between the percentage of a school’s students receiving Pell Grants and students’ likelihood of default. Schools with a higher ratio of Pell Grant recipients tend to experience higher rates of default on their loans, even though Pell Grants are intended as a direct subsidy to chip away at student expenses. This suggests that Pell Grants may not do enough to help students escape their debt.

Here is another interesting finding. What role does faculty quality play in student success? There’s a linear relationship between faculty salary and graduation rates: the higher the average monthly salary of a school’s professors, the higher the percentage of students that graduate within six years. This could indicate that the highest-paying schools draw the best professors, who pass on a stronger work ethic and more knowledge to their students. Or, of course, these variables could simply be correlated, and students who are more likely to graduate college from the very beginning follow the schools with the pricier professors.

The results of the College Scorecard won’t rattle the earth with their insights, but they do bring to light potential problems inherent in the US college system. Ideally, we’d like to see the Department of Education collect more data about quantifiable loan rates, students and their characteristics so we can go deeper into the causes of loan default, and rely less on one-dimensional data like family income and ethnicity. From there, we may be able to determine an “optimal” loan rate that considers the trade-off between student/societal value achieved from an affordable education, and the government’s ability to keep its loaning sustainable.