Understanding risk is intrinsic to running a successful insurance business. Developing a deep understanding of the relationships between various risk factors and the likelihood and cost of associated claims enables insurance companies to charge customers appropriately for the risk they represent.
In this tutorial, we’ll examine how you can use Eureqa to predict Bodily Injury Liability insurance claim payments based on the characteristics of the insured’s vehicle, using data provided by Allstate and Kaggle.com.
The original competition and data, hosted by Kaggle, can be found here.
Let’s get started!
This competition provided millions of rows of claims data, but we are only going to use a few hundred thousand for discovering the model. The rest we’ll set aside for more advanced validation and testing steps should we need them later on. You may get an error attempting to load all of the data on a local machine, so we will first extract just the first 200,000 lines using the command line (or ‘Terminal’ for you OS X folks). This can be done using the following command in Linux or OS X (note the plain hyphen before n):
head -n 200000 originalFileNameAndLocation.csv > newFileName.csv
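If the head command is not available (on Windows, for example), the same extraction can be sketched in Python. This is a minimal sketch, not part of the original workflow: the function name copy_first_lines is hypothetical, and the file names in the usage comment are the same placeholders used above. Like head, it counts physical lines, so if the file has a header row, that row is one of the 200,000 lines.

```python
from itertools import islice


def copy_first_lines(src_path, dst_path, n_lines=200_000):
    """Copy the first n_lines lines of src_path to dst_path.

    Reads one line at a time, so the full file never has to fit in memory.
    Returns the number of lines actually written.
    """
    written = 0
    with open(src_path, "r", encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in islice(src, n_lines):
            dst.write(line)
            written += 1
    return written


# Usage (placeholder file names, as in the head command above):
# copy_first_lines("originalFileNameAndLocation.csv", "newFileName.csv")
```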
This will be our training dataset for identifying the model structure – we can go ahead and open it using Eureqa. To do this, open up Eureqa and click ‘File’ and then select ‘Import Data’.
At first glance after import, we can surmise that the insurance data provided by Allstate contains the makes/models of various vehicles, their associated characteristics, and respective claim payments (if made). It does not appear to contain any time series data.
Preparing Data for Modeling
Eureqa provides some essential data preparation options for your convenience. In the case of Allstate’s data, you may want to consider removing outliers on some of the columns. As for missing data, Eureqa will automatically fill it in with the statistical mean (this is also changeable via the Data Preparation tab).
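To make the missing-data behavior concrete, here is a minimal Python sketch of mean imputation on a single column. It only illustrates the idea; it is not Eureqa's implementation, and the function name and the use of None to mark a missing value are assumptions made for the example.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Return a copy of values with None entries replaced by the column mean.

    The mean is computed over the observed (non-None) entries only.
    If every entry is missing, the column is returned unchanged.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)
    column_mean = mean(observed)
    return [column_mean if v is None else v for v in values]


# Illustrative column with one missing value (made-up numbers).
print(fill_missing_with_mean([1.0, None, 3.0]))
```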
For this exercise, we’ll try working with the raw data, without any special pre-processing or preparation.
Predicting Claim Payments with Eureqa
To predict claim payments, we are going to set our Eureqa target expression so that ‘Claim_Amount’ is modeled as a function of all the other variables. It should look something like the following:
We’re also going to make some changes to the building blocks based on a few basic assumptions that can be made about the data. First, we’re going to uncheck the Sine and Cosine building blocks, because they are advanced building blocks suited mainly to engineering datasets or to cyclical or seasonal data, and this dataset does not appear to contain any time series. Second, we’re going to enable row weighting on the ‘Claim_Amount’ column by selecting ‘Row Weight: 1/occurences(Claim_Amount)’.
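Row weighting enables us to adjust for the fact that nonzero values of ‘Claim_Amount’ appear quite sparsely in the sample data. As the option’s name suggests, Eureqa weights each row by the inverse of how often its ‘Claim_Amount’ value occurs in the data set, so the comparatively rare rows where ‘Claim_Amount’ is greater than 0 are not drowned out by the many rows with no claim. For more information, you can check out our tutorial on row weighting.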
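One way to read the ‘1/occurences’ weighting is sketched below in Python. This is an interpretation of the option's label, not Eureqa's actual code: the function name is hypothetical, the claim values are made up, and Eureqa may count occurrences differently (for example, by grouping nearby values).

```python
from collections import Counter


def inverse_occurrence_weights(values):
    """Weight each row by 1 / (number of rows sharing the same value)."""
    counts = Counter(values)
    return [1.0 / counts[v] for v in values]


# Made-up example: four rows with no claim and one row with a claim.
claims = [0.0, 0.0, 0.0, 0.0, 250.0]
weights = inverse_occurrence_weights(claims)
print(weights)
```

Under this reading, the four zero-claim rows together carry the same total weight as the single row with a claim, which is the balancing effect the option is meant to provide.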
Now that our target expression is set up, we can start the search. If you are interested in speeding up your searches by leveraging the cloud, please read our tutorial on using Amazon EC2 with Eureqa.
Go ahead and select the ‘Start Search’ tab and click ‘Run’.
From the ‘View Results’ tab, we can get a summary view of the solutions Eureqa has generated thus far, along with all of the associated error metrics. In the case of this search, we ran it on 72 cores for the better part of two hours.
To judge the accuracy of the models generated by Eureqa we can look at a few different error metrics, but for this tutorial we are going to look at ‘Mean Absolute Error’. Because this metric is computed on ‘Claim_Amount’, it is expressed in dollars: it is the average size of the gap, in either direction, between a predicted claim amount and the actual one.
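For reference, Mean Absolute Error can be computed by hand as in the Python sketch below. The function is a plain implementation of the standard definition; the sample numbers are invented for illustration and are not taken from the Allstate data.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all rows."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one row is required")
    total = sum(abs(a - p) for a, p in zip(actual, predicted))
    return total / len(actual)


# Invented claim amounts and predictions, in dollars.
actual = [0.0, 100.0, 0.0, 300.0]
predicted = [10.0, 90.0, 20.0, 260.0]
print(mean_absolute_error(actual, predicted))  # (10 + 10 + 20 + 40) / 4 = 20.0
```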
Looking at the Pareto Front graph, we can see that two of the leading solutions are very similar in predictive accuracy, yet are substantially more complex than other models. While the claim amounts in the dataset range between $0.00 and $10,000.00, our most accurate solution predicts the claim amount to within plus or minus $123.66 on average. Eureqa also identifies a few potentially simpler solutions; one comparable solution using only 11 variables predicts the claim amount to within plus or minus $124.72.
This is where domain expertise proves invaluable. Ultimately, it’s up to the analyst to determine the appropriate balance between solution accuracy and complexity.
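The accuracy-versus-complexity trade-off behind a Pareto front can be sketched in a few lines of Python. This is a generic illustration of Pareto dominance, not Eureqa's internal algorithm; the function name and the (complexity, error) pairs are invented for the example.

```python
def pareto_front(models):
    """Return the (complexity, error) pairs that no other model dominates.

    A model is dominated if another model is at least as simple and at
    least as accurate, and strictly better on one of the two measures.
    """
    front = []
    for complexity, error in models:
        dominated = any(
            other_c <= complexity and other_e <= error
            and (other_c < complexity or other_e < error)
            for other_c, other_e in models
        )
        if not dominated:
            front.append((complexity, error))
    return sorted(front)


# Invented candidates: (complexity score, mean absolute error in dollars).
candidates = [(3, 140.0), (5, 130.0), (5, 135.0),
              (11, 125.0), (20, 125.0), (30, 124.0)]
print(pareto_front(candidates))
```

Every model left on the front is a defensible choice; picking among them is the judgment call described above.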
Given enough computational effort, you should see models similar to our results. Early in the search, when less computation has been spent, solutions are more approximate; they are refined as the search continues. Even these early results can often be useful, and they require minimal computing time.
In just a few hours we were able to go directly from the Allstate claims dataset to a precise analytical model of Bodily Injury Liability Insurance claim payments, which predicts claim amounts within plus or minus $123.66 (on average), using the characteristics of the insured’s vehicles. Since this is an analytical model, we can also deduce what factors and variables lead to higher insurance claims, in addition to just predicting future claim values.
In the real world, we may want to build on these results to improve the model’s performance. This could be achieved by preparing our sample data more thoroughly, adding new variables and features, adding new building blocks, letting Eureqa search for a longer period of time, and using additional computing resources such as Amazon EC2.
Ready to try for yourself? Go ahead and download the Eureqa project file to get started.