Masood Ali (Senior Director, Data Strategy & Governance)
Vincent Huang (Director Data Science)
Mark Subryan (Director Data Engineering)
YuShing Law (Director Analytics Ecosystem)
Kanika Vij (Sr. Director Data Science and Automation)
Royal Bank of Canada
Royal Bank of Canada (RY on TSX and NYSE) and its subsidiaries operate under the master brand name RBC.
Internal Audit’s annual audit planning exercise comprises two key components: 1) risk assessment and 2) compilation of the audit plan. The risk assessment process results in the risk rating of auditable entities (organizational units). Internal Audit conducts risk assessments on over 400 auditable entities annually. The outcome of the risk assessment forms the basis of the audit plan.
The annual audit planning process is subjective and manually intensive, comprising several non-standardized offline processes that gather data points for risk assessment from different sources and compile the audit plan. As a result, compiling the annual audit plan takes many months.
Our objective was to build a continuous risk assessment tool that automates the monitoring of risk status and trends to provide a comprehensive and dynamic view of risk for an audit entity at any given time and automate the compilation of the audit plan.
The above challenge required a platform that could perform extensive ETL functions, such as ingesting and processing data from various systems and sources across the Enterprise, and that could also build and productionize machine learning models, all in one place.
The scale of our project is enterprise-wide and the impact is department-wide, i.e., across Internal Audit. Dataiku provided the ability to perform extensive ETL and machine learning in one platform.
Where did Dataiku fit into the picture?
To enable a data-driven, automated risk assessment across the entire department, Dataiku has helped in the following key areas:
i. Data Acquisition – Setting up connections to source systems across the Enterprise. Currently, there are 96 connections to databases throughout the enterprise with only 2 platforms partially on-boarded. We anticipate the final number to be approximately 400 database connections when all platforms have been on-boarded.
ii. Data Pre-processing – All transformations to each dataset are captured within their own project. The visualization of the pipelines reduces the need for manual documentation of workflows and execution instructions, and reduces the risk of key people dependencies. When data is refreshed or new data arrives, pipelines can be easily executed to re-perform the calculations. We currently have over 700 intermediate datasets between the raw inputs and the final staging dataset, encompassing a wide range of transformations and calculations. Manual maintenance of these workflows would have been challenging.
iii. Automated Productionized Workflows - Dataiku enables IA to put workflows into production with a fraction of the staff and effort required for custom coded or bespoke applications. At the moment, we have 21 scenarios set up, 6 of which execute on a weekly or daily basis. The team receives email notifications of scenario executions and promptly addresses failed runs. This fits our agile approach because we can respond to user enhancement requests faster. The entire process is also de-risked, as we can roll back changes easily.
iv. Computations - Raptor in its current form consumes approximately 7.58 million rows of data and performs over 174 million computations. Without a complete and dedicated development team, setting up a large-scale project like this would have been impossible. Dataiku provided the piping and basic infrastructure, which makes it easy for small teams such as ours to put together large projects.
v. Machine Learning Models – Through Dataiku, we were able to easily set up a pipeline to consume data from an API, engineer features, prototype two different models with Dataiku’s Lab and deploy it with minimal friction. The model outputs were integrated with additional Enterprise data to derive additional insights. Dataiku was instrumental in this as it allowed us to monitor model performance and schedule model retraining and executions.
vi. Workflow Management - If Dataiku wasn’t there, there would be a lot of spaghetti code to deal with on people’s laptops given the number of individuals involved in the project. Dataiku facilitates the organization and visualization of the workflows, which makes for an easier review as well as reduces key people dependency.
vii. Scheduling workflows and adding dependencies – The risk assessments are to be updated on a quarterly basis. This entails a number of upstream and downstream dependencies. Dataiku makes it easier to schedule workflows and take into account the dependencies.
viii. Dataiku visual recipes – Dataiku’s visual recipes helped in joining and pre-processing datasets in an efficient manner. This avoided time being spent on writing long and cumbersome Spark/SQL code.
ix. Freedom to focus on the problem – Dataiku has enabled IA to reduce the coding footprint to one-tenth of what it would be in a custom coded application. It gives Data Scientists, Engineers and Analysts the freedom to focus on the problem they are trying to solve rather than having to wade through the overhead of handling miscellaneous IT issues, for example code environment issues where the code works on one person’s desktop but not on another’s. Also, data scientists do not need a strong understanding of the details of how the system is being solutioned, which allows them to focus on their task.
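To illustrate the scheduling point in item vii, the following is a minimal sketch of dependency-aware ordering using Python's standard-library graphlib module (Python 3.9 or later). It is not Dataiku's scenario mechanism or API, and the step names are hypothetical placeholders rather than the actual Raptor workflow steps.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow steps, each mapped to the steps it depends on.
# These names are illustrative assumptions, not taken from the project.
DEPENDENCIES = {
    "ingest_sources": set(),
    "run_quality_checks": {"ingest_sources"},
    "compute_kris": {"run_quality_checks"},
    "score_model": {"ingest_sources"},
    "update_risk_assessment": {"compute_kris", "score_model"},
}


def execution_order(dependencies):
    """Return the steps in an order where each step follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
for position, step in enumerate(order, start=1):
    print(position, step)
```

With a quarterly refresh, a scheduler would run the steps in this order, so the risk assessment is only updated once every upstream step has completed.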
Due to the project scope, data is being sourced and processed from various source systems and teams across the Enterprise. This raises key concerns around Data Governance that Dataiku has helped address, such as:
i. Data Lineage – Automating data lineage allows us to accurately capture what is actually happening to the data, not what employees believe is happening. An in-house solution leverages the Dataiku API to scan metadata and establish a catalogue of data assets and their associated lineage at the data element level. This insight helped identify that, at IA, 407 datasets are reused 310 times, and that 16,273 datasets and 840,000 data elements are consumed across analytics projects.
ii. Dataiku Metadata Integration with Collibra – Lineage results are then integrated, through APIs, with Collibra, the Enterprise Data Governance Platform. Dataiku helped speed up documenting the lineage of Raptor-related KRIs to instill transparency in the data consumed to risk assess audit entities. Without the Dataiku/Collibra integration, it would have been 75% more costly and 66% more time consuming, and perhaps not feasible, to contribute an inventory of 1 million data assets for lineage and keep it up to date on a daily basis.
iii. Data Quality – The Raptor application derives hundreds of Key Risk Indicators (KRIs) using thousands of critical data elements from a variety of enterprise data sources. Knowing the quality of the critical data elements that inform KRIs for audit planning decisions is very important. Dataiku’s data profiling, tagging, recipe sharing and Python integration capabilities provided the framework through which data quality checks were easily built and embedded in-line with the data ingestion process. Results are harvested automatically using Dataiku APIs and integrated with Collibra on a regular basis, avoiding a great deal of manual effort.
iv. Adherence to coding practices and version control – It would be simply impossible to adhere to coding practices and version control in a project of such a large scale if code were maintained offline on team members’ laptops. Dataiku has a feature for modularizing code into libraries that team members can access, so the same function can be applied across different datasets. For example, to apply the same data quality (DQ) checks across all datasets, we built a library of DQ checks that the various data analysts on the project team can leverage in a standardized manner.
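As an illustration of the shared DQ library idea in item iv, here is a minimal sketch of reusable checks in plain Python. The function names, the column name and the sample rows are hypothetical assumptions for illustration; they are not the team's actual library and do not use the Dataiku API.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def null_rate(rows: List[Row], column: str) -> float:
    """Fraction of rows where the column is missing or empty."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return missing / len(rows)


def duplicate_count(rows: List[Row], column: str) -> int:
    """Number of non-missing values that repeat an earlier value."""
    seen = set()
    duplicates = 0
    for row in rows:
        value = row.get(column)
        if value in (None, ""):
            continue
        if value in seen:
            duplicates += 1
        else:
            seen.add(value)
    return duplicates


def run_checks(rows: List[Row], column: str, max_null_rate: float = 0.0) -> dict:
    """Apply the same checks to any dataset and return a small report."""
    rate = null_rate(rows, column)
    dups = duplicate_count(rows, column)
    return {
        "column": column,
        "null_rate": rate,
        "duplicates": dups,
        "passed": rate <= max_null_rate and dups == 0,
    }


# Hypothetical sample rows; "entity_id" is an assumed column name.
sample = [
    {"entity_id": "AE-001"},
    {"entity_id": "AE-002"},
    {"entity_id": "AE-002"},
    {"entity_id": None},
]
report = run_checks(sample, "entity_id")
print(report)
```

Because every analyst calls the same run_checks function, each dataset is assessed against identical rules, and the reports share one structure that could be collected for a governance catalogue.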
Benefits are multi-faceted, and most impactful on two major areas:
1. Operational Efficiencies Department wide
i. Time savings from automating the continuous risk assessment process by streamlining administrative processes related to data sourcing, processing and risk calculations related to risk assessment.
ii. Reduction in manual processes and various end user computing tools such as Excel files.
iii. Flexibility to divert resources to platforms with elevated areas of risk and highest impact.
iv. Increase in consistency and repeatability of risk assessment process.
2. Quicker adjustments to the audit plan
i. Enterprise audit plan coverage can be aligned to areas of elevated risk.
ii. Visibility into emerging and changing risks on a continuous basis, which will help audit teams respond to changes in the risk environment by pivoting on the audit plan.