Predicting future financial distress and understanding the factors that cause it are critical to how banks decide who can get financing and on what terms. Credit scoring algorithms, which predict the probability of default, are the primary method banks use to determine whether or not a loan should be granted to a given applicant.
In this Eureqa tutorial, we’ll examine how Eureqa can be used to predict whether somebody will experience financial distress in the next two years using anonymous credit-scoring data provided by Kaggle.com.
The original competition can be found here.
Let’s get started!
The Training Data
After downloading the training data from Kaggle.com, we can take a look at the variables and their values in Excel. At first glance, we can see that we’re going to be working with characteristics that are commonly used in assessing creditworthiness.
The full list of variables includes:
- Serious delinquency in 2 years (SeriousDlqin2yrs)
- Revolving Utilization Of Unsecured Lines
- Age
- Number Of Time 30-59 Days Past Due Not Worse
- Debt Ratio
- Monthly Income
- Number Of Open Credit Lines And Loans
- Number Of Times 90 Days Late
- Number of Real Estate Loans Or Lines
- Number Of Time 60-89 Days Past Due Not Worse
- Number of Dependents
You might suspect that many of these variables interact to affect someone’s likelihood of delinquency, but exactly how do they interact, and what do they predict? We’ll use Eureqa to answer these questions.
Preparing the Data for Modeling
Eureqa includes several data preparation tools for your convenience. To get started, however, we’ll work with the raw data, without any special pre-processing or preparation. Once we complete our first attempt at modeling the data, we can go back and consider options like removing outliers and smoothing.
Predicting Financial Delinquency using Credit Scoring Characteristics
We want to predict the variable SeriousDlqin2yrs, which contains 0 for no delinquency and 1 for serious delinquency. Since this variable takes only the values 0 and 1, we’re going to use a special target expression that places a similar constraint on the resulting models. We recommend using the logistic function (see below), which squashes values to lie between 0 and 1. We choose the logistic function (as opposed to a step function) because it is smooth and therefore provides a better search gradient. For more information, see our tutorial on modeling binary values.
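As a rough illustration of why the logistic function is preferred over a step function, here is a minimal Python sketch. It is not Eureqa code; the names `logistic`, `step`, and `raw_score` are our own, and `raw_score` simply stands in for the value of some candidate model before squashing.

```python
import math


def logistic(x: float) -> float:
    """Squash any real number into the range 0 to 1."""
    # Use a numerically stable form so exp() never overflows for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def step(x: float) -> float:
    """Hard threshold at zero, shown only for comparison."""
    return 1.0 if x >= 0 else 0.0


# raw_score is a hypothetical stand-in for the output of a candidate model.
for raw_score in (-4.0, -0.5, 0.0, 0.5, 4.0):
    print(raw_score, round(logistic(raw_score), 4), step(raw_score))
```

The logistic output changes a little whenever the raw score changes, so a search can tell whether a small adjustment moved a prediction in the right direction. The step output stays flat until the score crosses zero, which gives the search nothing to follow.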
Your target expression should now look something like the following:
We’re also going to make a few changes to the model building blocks based on some assumptions we can make about the data. Since the data is not related to engineering and is not cyclical or seasonal in nature, we can go ahead and uncheck the ‘Sine’ and ‘Cosine’ building blocks. We also recommend enabling the ‘Logistic’ building block, which gives the search an easy way to threshold input variables or intermediate values that may be useful in a model.
Next, we’re going to enable row weighting on our target variable by clicking Row Weight and selecting ‘1/occurences(SeriousDlqin2yrs)’. We do this because positive occurrences appear sparsely in the training data (roughly 10,000 out of 150,000 records, or about 1 in 15). Weighting each row by the inverse of how often its outcome occurs keeps the far more frequent negative case from dominating the search. For more information on row weighting, please visit our row weighting tutorial.
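To make the idea of inverse-frequency row weighting concrete, here is a small Python sketch. It assumes the weighting option means "one divided by the number of rows that share the same target value"; that reading, the function name `inverse_frequency_weights`, and the toy labels are our own assumptions, not a description of Eureqa's internals.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each row by 1 / (number of rows sharing its label)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


# Toy labels with the same 1-in-15 ratio described in the text.
# These are illustrative values, not the Kaggle data.
labels = [1] * 10 + [0] * 140
weights = inverse_frequency_weights(labels)

total_pos = sum(w for w, y in zip(weights, labels) if y == 1)
total_neg = sum(w for w, y in zip(weights, labels) if y == 0)
print(round(total_pos, 6), round(total_neg, 6))
```

Under this scheme each class contributes the same total weight, so a model cannot score well simply by predicting "no delinquency" for everyone.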
Now that we’ve set our target expression, selected the appropriate model building blocks, and enabled row weighting, we’re ready to start our search. From within Eureqa, select the ‘Start Search’ tab and click the button marked ‘Run’.
If you are interested in speeding up your searches by leveraging the cloud, please read our tutorial on using Amazon EC2 with Eureqa. Running in the cloud can make your searches up to 100x faster.
From the ‘View Results’ tab, we can get a digest view of all the solutions generated by Eureqa thus far. For this tutorial, we ran Eureqa on a 72-core private cloud for about five hours. If you’d like to learn more about how to set up a private cloud, please take a look at our Eureqa Dedicated Server product page.
As in our tutorial on predicting insurance claim payments, we’re going to judge the predictive accuracy of the solutions using Mean Absolute Error (MAE). This metric is the average size of the error (whether above or below the true value) that one can expect from the predictions generated by our models.
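For readers who want to compute the same metric outside Eureqa, here is a minimal Python sketch of MAE. The optional `weights` argument is our own addition to show how row weights could enter the calculation; whether Eureqa's reported MAE applies row weights is not stated here, so treat that part as an assumption. The sample values are made up.

```python
def mean_absolute_error(actual, predicted, weights=None):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if weights is None:
        # Unweighted case: every row counts equally.
        weights = [1.0] * len(actual)
    total_weight = sum(weights)
    weighted_error = sum(
        w * abs(a - p) for a, p, w in zip(actual, predicted, weights)
    )
    return weighted_error / total_weight


# Illustrative values only: true 0/1 outcomes and model outputs in [0, 1].
actual = [0, 0, 1, 1]
predicted = [0.1, 0.4, 0.8, 0.3]
print(round(mean_absolute_error(actual, predicted), 4))
```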
Looking at the solutions Eureqa has generated, we can see that the top four models offer similar predictive accuracy, with MAE ranging between 0.2245 and 0.2249, while differing substantially in complexity. Choosing the simplest of the four would cost about 0.2% in predictive accuracy while reducing the number of terms by nearly 50%. This trade-off is clearly illustrated by Eureqa’s built-in Pareto Front display.
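The idea behind a Pareto front can be sketched in a few lines of Python: keep only the models for which no other model is both simpler and more accurate. The candidate list below is entirely hypothetical (the complexity scores and errors are made up for illustration, not taken from the Eureqa run), and `pareto_front` is our own helper name.

```python
def pareto_front(models):
    """Return the (complexity, error) pairs that no other pair dominates."""
    front = []
    for complexity, error in models:
        dominated = any(
            other_c <= complexity
            and other_e <= error
            and (other_c < complexity or other_e < error)
            for other_c, other_e in models
        )
        if not dominated:
            front.append((complexity, error))
    # Sort from simplest to most complex for easy reading.
    return sorted(front)


# Hypothetical candidates as (complexity, error) pairs.
candidates = [(3, 0.30), (7, 0.25), (10, 0.26), (15, 0.24), (20, 0.245)]
print(pareto_front(candidates))
```

Reading along the front from simple to complex shows how much error each extra bit of complexity buys, which is the trade-off described above.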
Since the outputs and the predictions are nearly all 0’s or 1’s, we can interpret the mean absolute error statistic as the fraction of the time that the model found by Eureqa will make an incorrect prediction. In the case of our most accurate solution (MAE of 0.2245), Eureqa could correctly predict whether or not someone would have financial distress 77.55% of the time.
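The link between MAE and the error rate can be checked with a short Python sketch. When both the actual values and the predictions are exactly 0 or 1, every absolute error is either 0 (correct) or 1 (wrong), so the mean absolute error equals the share of wrong predictions. The sample labels below are made up for illustration.

```python
def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def error_rate(actual, predicted_labels):
    """Fraction of predictions that differ from the actual label."""
    wrong = sum(1 for a, p in zip(actual, predicted_labels) if a != p)
    return wrong / len(actual)


# Illustrative 0/1 outcomes and 0/1 predictions (2 of 8 are wrong).
actual = [0, 0, 0, 1, 1, 0, 1, 0]
predicted = [0, 1, 0, 1, 0, 0, 1, 0]
print(mae(actual, predicted), error_rate(actual, predicted))

# Accuracy implied by the best MAE quoted in the text.
implied_accuracy = round(1 - 0.2245, 4)
print(implied_accuracy)
```

The equivalence is only approximate when some predictions fall strictly between 0 and 1, which is why the text says "nearly all".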
Even more interestingly, because the output of Eureqa is an analytical model, we can easily identify which characteristics are indicative of future financial delinquency. Our most accurate model includes Revolving Utilization Of Unsecured Lines, Number Of Times 90 Days Late, Number Of Time 60-89 Days Past Due Not Worse, and Number Of Time 30-59 Days Past Due Not Worse as the most important factors in determining future delinquency. Once these variables are included, others such as age, monthly income, number of dependents, real estate loans, and debt ratio, while possibly important indirectly, do not significantly improve accuracy and are not used in the best models. The variables related to being overdue appear to drive nearly all delinquency outcomes.
In just a few hours, we were able to go from a raw data set containing credit scoring characteristics to a precise analytical model of financial delinquency that predicts the outcome correctly nearly 78% of the time. We also discovered that, in the best models, the variables related to overdue payments drive these outcomes.
For real-world applications, we would likely want to improve on the predictive accuracy of our results by preparing our sample data more thoroughly, adding new building blocks, letting Eureqa search for a longer period of time, and leveraging additional computational resources such as Amazon EC2 or a private cloud using a dedicated Eureqa server. Now that we have a model to work with, we can also use it almost anywhere, including software such as Excel, R, SAS, or Matlab, to do additional analysis.
Ready to try for yourself? Go ahead and download the Eureqa project file to get started.