Royal Bank of Canada - Automated Data Quality Checks to Save Time and Increase Output Confidence
Vincent Huang, Director, Data Science
Preet Kanwal Singh, Data Scientist
Susy Lee, Data Engineer
Mark Subryann, Director, Data Engineering
Kanika Vij, Sr. Director, Innovation
Organization: Royal Bank of Canada
Royal Bank of Canada (RY on TSX and NYSE) and its subsidiaries operate under the master brand name RBC. We are one of Canada's biggest banks, and among the largest in the world based on market capitalization. We have 97,000+ full- and part-time employees who serve 17 million clients in Canada, the U.S. and 27 other countries. We are one of North America's leading diversified financial services companies, and provide personal and commercial banking, wealth management, insurance, investor services and capital markets products and services on a global basis.
Best Acceleration Use Case
Best Approach for Building Trust in AI
Within the Internal Audit department at a large financial institution, we monitor key risk indicators (KRIs) and areas of emerging risks. For our various data initiatives, we collect, analyze, and aggregate over 1,200 features collected from various sources throughout the enterprise.
As such, data arrives in different forms and at different frequencies. Some are through databases and others are through manually curated reports and all of these may have wildly different data quality (DQ) problems that need to be addressed before analyses and modeling can continue.
When we began our various initiatives, we started with fewer than 10 KRIs, and evaluating data quality was trivial. As the projects scaled, the number of KRIs and the number of data quality checks grew rapidly. Despite the importance of data quality checks, the work quickly became intractable and started to consume the capacity of our data teams, leaving them with less time to work on other projects.
Thus, we needed a solution that could be easily scaled and produced standardized outputs for monitoring purposes.
To effectively monitor the quality of incoming data, we built a solution that leverages the “Metrics and Checks” module that exists for each dataset within Dataiku.
Our data teams met to generalize the existing checks and ensure they aligned with enterprise requirements. Each check was implemented as a Python function with standardized arguments and input data types where possible.
The library can be loaded and its functions called as a probe in the dataset “Metrics” module; “Checks” can then be run on the resulting metrics after each dataset build to ensure they are within expected parameters.
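As an illustration, a custom Python probe in Dataiku is a `process(dataset, partition_id)` function that returns a dict of metric names to values. The sketch below shows the general shape; the column name `client_id` and the inline blank-count check are hypothetical, not the actual DQCheck library code.

```python
# Minimal sketch of a custom Python probe for the Dataiku "Metrics" module.
# Dataiku calls process(dataset, partition_id) and records the returned
# dict as metric values. The column name "client_id" is illustrative.

def count_blanks(values):
    """Count entries that are None or empty/whitespace-only strings."""
    return sum(
        1 for v in values
        if v is None or (isinstance(v, str) and not v.strip())
    )

def process(dataset, partition_id):
    # dataset.get_dataframe() returns the dataset contents; here we only
    # rely on column access yielding an iterable of values.
    df = dataset.get_dataframe()
    return {"client_id_blanks": count_blanks(df["client_id"])}
```

Because the probe returns plain numbers, a standard Dataiku Check (e.g. "metric must equal 0") can be attached to the metric without any further code.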
The following is a list of several commonly used metrics of varying complexity in our DQCheck library:
1. count_blanks: Counts the number of blanks. In some datasets, we expect certain columns, such as IDs, to never have NULL entries.
2. has_condition: Counts the number of elements in the column that are not within the expected range. This is useful for datasets where we expect a positive percentage value between 0 and 100%. When data reports are manually curated, analysts may sometimes input 50 instead of 0.5. A type check will pass since the value is still numeric, and downstream analyses can proceed but with wild and unexpected results. A range check like this, however, easily captures such a subtle inconsistency and throws warnings.
3. within_expected: This function captures elements in a column that do not match a list of possible values provided as an argument. It is useful for categorical columns where users may input values outside the expected set. Occasionally, new categories are legitimately added, but more often an error has occurred at data entry.
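The three checks above can be sketched in plain Python as follows. The real DQCheck library operates on Dataiku dataset columns; the signatures here are illustrative, with each function returning a count of offending entries so a Check can compare the metric against a threshold (typically 0).

```python
# Illustrative versions of the three DQCheck functions described above.

def count_blanks(values):
    """Count NULL (None) or empty-string entries, e.g. in an ID column."""
    return sum(1 for v in values if v is None or v == "")

def has_condition(values, lower, upper):
    """Count numeric entries outside the inclusive [lower, upper] range,
    e.g. a percentage mistakenly entered as 50 instead of 0.5."""
    return sum(1 for v in values if not (lower <= v <= upper))

def within_expected(values, allowed):
    """Count entries not in the allowed set of categorical values."""
    allowed = set(allowed)
    return sum(1 for v in values if v not in allowed)
```

For example, `has_condition([0.2, 50, 0.9], 0, 1)` returns 1, flagging the 50 that an analyst entered instead of 0.5.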
The combination of our custom Python DQCheck library and the “Metrics and Checks” within Dataiku has streamlined our data intake and refresh process. We are able to capture data issues and address them before they impact downstream models and inferences.
Using the Dataiku API, we are also able to collect and track the Metrics over time to determine if certain datasets are particularly problematic with underlying issues that need remediation before we integrate them into our analyses.
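Once metric values have been pulled via the Dataiku API (the fetch itself is environment-specific and omitted here), the history can be summarized to flag chronically problematic datasets. A minimal sketch, assuming the retrieved history has already been shaped into `(dataset_name, run_date, failure_count)` tuples:

```python
from collections import defaultdict

def flag_problem_datasets(records, min_failing_runs=3):
    """Given metric history records as (dataset_name, run_date, failures)
    tuples, return the names of datasets whose check failed (failures > 0)
    in at least min_failing_runs runs -- candidates for upstream
    remediation before they are integrated into analyses."""
    failing_runs = defaultdict(int)
    for dataset_name, _run_date, failures in records:
        if failures > 0:
            failing_runs[dataset_name] += 1
    return sorted(
        name for name, n in failing_runs.items() if n >= min_failing_runs
    )
```

Raising or lowering `min_failing_runs` trades sensitivity against noise: a dataset that fails once may have had a one-off upstream glitch, while repeated failures point to a systemic data-quality issue.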
Business Area Enhanced: Financial Services Specific
Use Case Stage: In Production
By significantly reducing the time spent developing bespoke data quality checks on incoming datasets, we were able to speed up our KRI onboarding and increase our confidence in the data flowing through our analyses.
The quarterly data refreshes used to be painful and time-consuming, as it was often difficult to pinpoint the origin of a data problem. We incorporated these custom data quality checks into the flows at strategic locations to stop scenarios from proceeding when a DQ problem arises. This led to a significant reduction in time spent troubleshooting and rerunning entire pipelines.
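The gating pattern above can be sketched as a small helper that a scenario step calls after the checks run; raising an exception aborts the run before downstream steps rebuild on bad data. The check names and the failure format are illustrative, not the team's actual code.

```python
class DataQualityError(Exception):
    """Raised to abort a pipeline run when a DQ check fails."""

def gate_on_checks(check_results):
    """check_results maps check name -> count of offending rows.
    Raise if any check found problems, so the scenario stops here
    instead of propagating bad data to downstream models."""
    failures = {name: n for name, n in check_results.items() if n > 0}
    if failures:
        raise DataQualityError(f"DQ checks failed: {failures}")
```

Placing this gate immediately after each data intake step localizes the failure: the error message names the offending checks at the point of entry, rather than surfacing as a puzzling model result several steps later.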
The application was therefore more robust, and our stakeholders had increased confidence in the outputs. We were also able to free up a significant amount of the data team's time, which allowed it to work on other innovative products.
Value Brought by Dataiku:
Dataiku was built to democratize analytics and data science by making it easy for everybody to use and putting AI at people’s fingertips. In the hands of a curious data professional with strong coding skills, it also considerably speeds up the pace at which projects can be completed, with confidence in the quality of the results.
This is an example where the features within Dataiku enabled a team to streamline and improve a process that was previously unscalable. Onboarding over a thousand KRIs seemed virtually impossible when we were initially struggling to ingest and set up data quality checks for just 70; the speed at which the work is now completed was unfathomable at the beginning.
The features in Dataiku enabled our data teams to easily monitor and tackle issues in the data while allowing us to build custom libraries that can be easily integrated. We were able to scale our processes and work rapidly without fears of deviating from our internal data governance guidance.
Dataiku is a great product for everyone, and it really shines when we can combine the utility of visual components with robust and sound coding practices.