Marlan Crosier, Senior Data Scientist
Norm Preston, Manager of Data Science team
Jing Xie, Data Scientist
Adam Welly, Data Scientist
Jim Davis, Statistician
Greg Smith, Healthcare Data Analyst
Gene Prather, Dataiku System Admin
Premera Blue Cross
Premera Blue Cross is a not-for-profit, independent licensee of the BCBS Association. The company provides health, life, vision, dental, stop-loss, and disability benefits to 1.8 million people in the Pacific Northwest.
The feature store is an emergent concept in data science. It consists of a storehouse for features, which can be used in a variety of Machine Learning models. It helps streamline the process of building machine learning models and make it overall much more efficient, thanks to hundreds to thousands of features easily available.
Before the development of our feature store, we had to build features for each new model from scratch. Building a feature from scratch can take several days or even weeks. Besides the significant additional time required to build new features, such one-off features were often not as well tested and so models were more likely to be impacted by errors. The other big impact was that often we were unable to test as many features as we might have liked to. The result is that our models were not as accurate as they could have been.
Our feature store currently includes 283 features. As a health insurance company, members are foundational entities in our business and all features are currently linked to a member.
Our features are built from data in a SQL-based data warehouse. All features are all pre-calculated (vs. calculated on the fly). All processing runs in-database via SQL. In other words, we used SQL to build our features - as with the amount of data we are working with, using Python would not be practical.
Given the pre-calculated approach, the resulting feature tables are fairly large since, for many of our features, we store daily values for our entire member base. Most features are thus updated daily (that is, new values are calculated daily). Day-level feature values are sufficient for the vast majority of our use cases.
Our feature store includes a core table and then several tables for specific types of features. The data in these other tables is of a particular type or source, and is available in a particular timing.
The benefits of this approach are multiple:
The data types for each feature store table have been carefully selected to minimize storage requirements, and more importantly to minimize the memory footprint when data is read into Python-based machine learning processes.
2. Development & Deployment
We currently use a team approach for developing new features, and Dataiku’s collaboration features have been very helpful here. Each developer is provided with a current copy of the relevant feature store project, and then use either version control tracking to identify changes and additions, or git-based project merging to facilitate integrating the changes back into the main project.
We deploy updates to our feature store using Dataiku’s automation instances. Develop and test takes place on the Design instance, then updates are deployed to a Test instance, and finally to a Production one. We have incorporated a variety of design time and run time checks (via Dataiku’s Metrics and Checks) to assure data accuracy and reliability.
Additionally, we developed Python recipes that largely automate the feature table update process - for instance, copy data from existing table, drop existing table, create new table, copy existing data back in and then add new feature data.
3. Metadata & Discoverability
Each of our feature projects includes a metadata component. This metadata is entered via a Dataiku Editable dataset and includes attributes like name, data type, definition, a standard indicator to use for missing values, and feature effective date.
Since we were building it for a small team, and we wanted to try all store features in our models, we have initially been focusing on building features rather than discoverability.
We are now building a fairly simple webapp in Dataiku to provide data discoverability and exploration, in preparation for rolling out the feature store to more teams in our company. This discoverability tool utilizes the feature store metadata described above.
Data Scientists can incorporate feature store features in their projects using the Dataiku macro feature. The macro enables selection of subject areas to include and specification of how to join feature data to project data. The macro handles missing value logic and maintenance of data types to minimize memory demands in Python-based machine learning processes.
The overriding benefit of our feature store is of course how much more quickly we can develop machine learning models. Developing an initial model often takes just a few hours, whereas without our feature store that same model may have taken days, weeks, or even months. In some cases, our final ML models only use features from our feature store - although more commonly we supplement these features with features designed for the particular problem at hand.
Additional benefits include:
We've also experienced a bit of a virtuous cycle effect with our feature store. As the store expands and the value increases, it's easier to justify investing the resources to develop new features, test them thoroughly, assure that leakage is not occurring, etc. This in turn further increases the value of the store which makes it even easier to invest in further enhancements. And so on!
At the company level, our ability to develop more accurate models more quickly also enables more areas in the organization to benefit from our data science investments.