Managing risk is all about knowing your data.
While many insurance companies understand the importance of collecting data, it is equally important, if not more so, to integrate that data into organizational decision-making practices. The daunting part is understanding what data to collect and how to organize the data supply chain effectively to support analytics that can positively impact business outcomes.
Consider the claims journey for a case-managed product such as disability insurance. Over the lifetime of a claim, more than 100 data points are collected: some structured, some unstructured. Robust machine learning (ML) techniques examine the base dataset and generate additional composite features, so the dataset itself evolves as business processes and analytic tools are deployed to infer and impact business outcomes. A strong data supply platform is key to managing what is essentially living, breathing data.
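To make the idea of composite features concrete, here is a minimal pandas sketch; the column names and derived features are illustrative assumptions, not fields from any particular system.

```python
import pandas as pd

# A handful of base data points from the claim record; names are illustrative.
claims = pd.DataFrame({
    "claim_id": [101, 102],
    "date_of_disability": pd.to_datetime(["2022-12-20", "2023-02-01"]),
    "date_reported": pd.to_datetime(["2023-01-05", "2023-02-10"]),
    "monthly_benefit": [2400.0, 3100.0],
    "annual_salary": [60000.0, 52000.0],
})

# Composite features derived from the base dataset, extending it over time.
claims["report_lag_days"] = (
    claims["date_reported"] - claims["date_of_disability"]
).dt.days
claims["replacement_ratio"] = claims["monthly_benefit"] * 12 / claims["annual_salary"]
```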
Beyond managing the dynamic nature of the data, two aspects are important to all data analytics: the quantity and the quality profile of the data.
Techniques to Gather Data
Analytic techniques, especially ML, require statistically significant quantities of data. Unlike internet-facing companies such as Amazon, insurance carriers generate transactions in the hundreds and thousands, but very rarely in the millions. Any transfer of ML algorithms and techniques to the insurance domain therefore requires data science teams to overcome some unique challenges.
First and foremost among these challenges is the limited availability of annotated datasets that align business outcomes to target variables. Overcoming it requires a data-first culture to permeate both the business and technology organizations. A data-first culture codifies business outcomes and institutional decision-making knowledge in rules, which makes it possible to align the data collected with business goals and metrics. The difference is stark: imagine starting your fraud ML journey knowing all 500 claims you have identified as fraud over the last couple of years, versus not knowing which claims your business process has identified as fraud. It is important to have this information regardless of whether it is housed in a spreadsheet, a report, or some other system.
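As a minimal sketch of what that codification can look like, here the known fraud outcomes become an annotated target variable; the claim IDs, values, and column names are hypothetical.

```python
import pandas as pd

# Claim IDs your business process has already identified as fraud,
# e.g. exported from a spreadsheet or report (values are hypothetical).
fraud_ids = {1007, 1042, 1193}

# Full claims extract from the data supply platform (columns are illustrative).
claims = pd.DataFrame({
    "claim_id": [1001, 1007, 1042, 1100],
    "monthly_benefit": [2400.0, 3100.0, 1800.0, 2650.0],
})

# Codify the institutional knowledge as an annotated ML target variable.
claims["is_fraud"] = claims["claim_id"].isin(fraud_ids).astype(int)
```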
Another nuance, especially when predicting potential fraud, is the low occurrence rate of the event itself. The generally accepted industry average for fraudulent disability claims is less than 2%. This imbalance can make it difficult for data science teams to build an accurate model from the available data alone. One technique commonly used to overcome this class imbalance is the Synthetic Minority Oversampling Technique (SMOTE), which generates synthetic examples of the underrepresented class, in this case fraudulent claims. The use of such techniques makes fraud detection even more of a journey than a point-in-time solution: experience has shown that the initial models tend to be over- or under-fitted, and it is by collecting user feedback and using that data to retrain the models that an optimal prediction outcome is achieved.
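Here is a minimal sketch of applying SMOTE with the open source imbalanced-learn library, using synthetic stand-in data at roughly the 2% fraud rate discussed above; the sample sizes and parameters are illustrative assumptions.

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Stand-in for numeric claim features with ~2% fraud, mirroring the imbalance above.
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.98], random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

# Generate synthetic minority-class (fraud) examples in the training split only.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)

print(Counter(y_train))      # e.g. {0: 3920, 1: 80}
print(Counter(y_resampled))  # e.g. {0: 3920, 1: 3920}
```

Oversampling only the training split keeps the test set at the true class balance, so the measured performance reflects the imbalance the model will actually face.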
Optimizing Data Quality
In a data-first culture, gathering the data is only the initial step toward effectively using it to improve business performance and offerings. While you want to capture as much data as possible, you also want to ensure you are collecting quality data that can be used for multiple purposes and by multiple teams across your organization.
Data quality is an iterative process that starts with two fundamental questions: what level of data quality does a specific use case expect, and what is the impact of poor data quality on its outcome? Data that scores high on every attribute (syntactic and semantic accuracy, timeliness, completeness, consistency, and uniqueness) is always the organizational goal. But the reality is that data lakes often pool data from different business domains and systems with varying levels of data quality checks and balances. Because ML is itself an iterative process, it does not require that data sourced from data lakes be pristine. It is a journey of a thousand miles that needs to start with the dirty data at hand; over time, through targeted data quality improvements, ML outcomes will improve.
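One way to ground those expectations is to profile the data at hand before cleansing it. A minimal sketch, assuming a hypothetical claims extract, of measuring two of the attributes above, completeness and uniqueness:

```python
import pandas as pd

# Illustrative pool of data-lake records with mixed quality.
claims = pd.DataFrame({
    "claim_id": [1, 2, 2, 3],
    "date_of_hire": pd.to_datetime(["2020-01-06", None, None, "2019-03-04"]),
    "age": [34, 52, 52, 101],
})

completeness = claims.notna().mean()                           # share of non-null values per column
duplicate_rows = claims.duplicated(subset=["claim_id"]).sum()  # uniqueness violations

print(completeness.sort_values())  # worst-populated columns first
print(f"{duplicate_rows} duplicate claim_id row(s)")
```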
Data science teams working with business partners are well equipped to start this journey with dirty data. Common data quality challenges that can be overcome effectively include:

- Syntactic checks of fields such as dates, Social Security numbers, and ages.
- Improving semantic accuracy by filtering out records outside the business value threshold, e.g., age less than 16 or greater than 90.
- Populating datasets with low tuple and column completeness via scripts based on business input, e.g., substituting a missing date of hire with the first paycheck date.
- Combining a wide range of business values into a manageable number of buckets, e.g., mapping distinct medical diagnosis codes to diagnosis categories such as neoplasms or musculoskeletal disorders.

These data-cleansing techniques shrink the pool of data available to experiment and test with, but they still yield models that can start the ML journey. As user feedback accumulates over time and data quality improves, the result is a clean dataset for retraining and optimizing the models.
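The sketch below walks through those four steps in pandas; all column names, thresholds, and the simplified diagnosis mapping are assumptions for illustration.

```python
import pandas as pd

# Illustrative dirty extract; all column names and values are assumptions.
claims = pd.DataFrame({
    "claim_id": [1, 2, 3, 4],
    "ssn": ["123456789", "12-3456", "987654321", "555443333"],
    "age": [42, 35, 101, 58],
    "date_of_hire": pd.to_datetime(["2018-05-01", None, "2015-09-15", None]),
    "first_paycheck_date": pd.to_datetime(
        ["2018-05-15", "2021-02-12", "2015-09-30", "2019-07-01"]
    ),
    "diagnosis_code": ["C50", "M54", "J45", "M17"],
})

# 1. Syntactic check: keep only well-formed nine-digit Social Security numbers.
valid_ssn = claims["ssn"].str.fullmatch(r"\d{9}")

# 2. Semantic accuracy: filter out records outside the business value threshold.
in_range = claims["age"].between(16, 90)
claims = claims[valid_ssn & in_range].copy()

# 3. Completeness: substitute a missing date of hire with the first paycheck date.
claims["date_of_hire"] = claims["date_of_hire"].fillna(claims["first_paycheck_date"])

# 4. Bucketing: map distinct diagnosis codes to broad categories
#    (simplified ICD-10 chapter prefixes, for illustration only).
categories = {"C": "neoplasms", "D": "neoplasms", "M": "musculoskeletal"}
claims["diagnosis_category"] = (
    claims["diagnosis_code"].str[0].map(categories).fillna("other")
)
print(claims)
```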
Beyond providing product teams with inspiration to design new features, leveraging the data you have in its current quantity and quality profile for predicting outcomes such as fraud helps to manage risk. Better risk management reduces associated spending, enables fine-tuning of risk pricing, and allows you to build up a better reserve and capital deployment strategy. By itself, data has no value. Let your data work for you while you work on making it better in terms of quantity, quality, and managing it as a living asset.
For more information on how to ensure the appropriate use of your data for machine learning to achieve desired business outcomes, contact FINEOS.