Aviva - Machine Learning Approach for Log Analytics
Mitesh Chandorkar with
Aviva Team - Simon Sinfield (Manager), Robert Tullis, David Pearce
Wipro Team - Richardson Jebasundar, Narinder Saini, Rishi Khowala
Title: Data Science Architect
Country: United Kingdom
Aviva plc is a British multinational insurance company headquartered in London, England. It has customers across its core markets of the United Kingdom, Ireland, and Canada. In the United Kingdom, Aviva is the largest general insurer and a leading life and pensions provider. Aviva is also the second largest general insurer in Canada.
Most Impactful Transformation Story
Large organizations with digital capabilities have significant IT operational challenges, dealing with huge IT assets, numerous processes, and multiple support teams. Businesses are evolving at a rapid pace, and there is a boom in digital and IT estates to support them. Today's organizations are faced with keeping up with an ever-expanding IT estate, massive amounts of data, customer demands, and process inefficiencies.
The Digital Operations Management Engine was set up to address these IT operational challenges and help streamline the existing processes. The historic incident management data and log details are analysed to understand the current IT operations and challenges. Wipro’s AI Solutions leverages its industry and business-focused solutions to create a seamless connection throughout the IT enterprise value chain.
Wipro partnered with Aviva on this journey with the goals of:
● Enhancing incident management to decrease the average time to recovery.
● Improving Service Availability through proactive action on the assets as required.
● Acting ahead of major disruption of services by being more proactive in identifying IT problems through logs.
● Increasing operational efficiency by identifying the root cause.
● Enhancing customer experience by identifying unreported customer incidents.
Dataiku is the backbone of this solution. Using Dataiku, we produced numerous use cases, many of which are functional and in production, while others are being worked on as proofs of concept. Key use cases include:
● Incident Classification (Supervised) – Leveraging NLP, the classifier assigns each incoming incident to a probable root cause with 97% accuracy.
● Anomaly Detection Using Log Data (Unsupervised) – It is difficult to find abnormalities manually in an estate of more than 300 bots created for automation. The solution fetches statistics from the logs every 15 minutes and alerts the support team about anomalies.
● Prediction of Cluster Failure (Semi-Supervised) – The model, trained on 23 different infrastructure metrics, was able to foresee server cluster failures around two to three hours in advance.
● Event Sequence Modelling (Unsupervised) – The LSTM-based deep learning model learns the normal pattern and predicts anomalies from ingested log data. This is a proven concept, and we are planning to scale this solution.
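As an illustration of the anomaly detection use case above, the following is a minimal sketch of how a per-bot statistic collected every 15 minutes could be screened for outliers. It is not the model used in the solution, which the source does not describe. It substitutes a simple rolling modified z-score (based on the median absolute deviation), and the function name, window size, threshold, and sample counts are all assumptions made for the example.

```python
from statistics import median


def mad_anomalies(values, window=8, threshold=3.5):
    """Return indices whose value deviates strongly from the preceding window.

    Illustrative stand-in only: uses the modified z-score
    0.6745 * |x - median| / MAD over the previous `window` readings.
    """
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        med = median(history)
        mad = median(abs(v - med) for v in history)
        if mad == 0:
            # Flat history: treat any change at all as anomalous.
            if values[i] != med:
                flagged.append(i)
            continue
        score = 0.6745 * abs(values[i] - med) / mad
        if score > threshold:
            flagged.append(i)
    return flagged


# Hypothetical error counts for one bot, one value per 15-minute interval.
error_counts = [2, 3, 2, 4, 3, 2, 3, 3, 2, 41, 3, 2]
print(mad_anomalies(error_counts))  # -> [9]
```

A median-based score is used here because a single spike barely shifts the median, so one bad interval does not mask the next one. In practice the flagged indices would feed whatever alerting channel the support team uses.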
The NLP preparation and processing required for operational analytics was done in Dataiku (DSS v9.x) using built-in recipes and custom Python code. Dataiku enabled us to:
● Ingest data from various sources – Incident and log data were ingested for processing from MySQL and PostgreSQL databases and from log sources via API.
● Low code/no code data processing – The existing visual recipes aided in completing data transformations quickly. For complex transformations, custom Python recipes were used.
● Inbuilt machine learning algorithms – Multiple models were trained with very little configuration, and several machine learning and deep learning models were developed.
● Data exploration made easy – Charts, dashboards, and statistics helped with quick data exploration.
● Orchestration layer for an end-to-end solution – Job scheduling and scenarios were used as the orchestration layer to control the flow of the data pipelines.
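To make the custom Python preparation step mentioned above more concrete, here is a minimal sketch of the kind of text normalisation that often precedes NLP modelling of incident descriptions. The source does not describe the actual cleaning rules, so the patterns, placeholder tokens, and function name below are assumptions chosen for illustration.

```python
import re

# Hypothetical patterns; a real project would tune these to its own ticket format.
_TICKET_ID = re.compile(r"\b(?:inc|chg|req)\d+\b")
_IP_ADDR = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
_NON_ALNUM = re.compile(r"[^a-z0-9\s]")
_SPACES = re.compile(r"\s+")


def normalise_incident_text(text):
    """Lower-case the text and replace volatile identifiers with fixed tokens."""
    text = text.lower()
    # Replace values that vary per ticket so they do not bloat the vocabulary.
    text = _IP_ADDR.sub(" ipaddr ", text)
    text = _TICKET_ID.sub(" ticketid ", text)
    # Drop punctuation, then collapse runs of whitespace.
    text = _NON_ALNUM.sub(" ", text)
    return _SPACES.sub(" ", text).strip()


print(normalise_incident_text("INC0012345: Server 10.1.2.3 NOT responding!!"))
# -> ticketid server ipaddr not responding
```

Inside a Dataiku Python recipe, a function like this would typically be applied to a text column of a dataframe read from the input dataset, with the result written to the output dataset. The dataset and column names involved would be specific to the project.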
Business Area: IT/Cybersecurity/Data
Use Case Stage: In Production
The engine enables the support staff to consider why any given issue keeps occurring and fix it permanently.
The solution increased operational and resource efficiency through root cause labelling for the incoming incident. It provided the benefit of circa 200 hours in effort per month (Resource Efficiency) and circa 700 hours MTTR (Mean Time to Resolution) savings per month (Operational Efficiency).
What’s more, the solution enhanced customer experience by providing near real-time alerts to the support team whenever anomalies appeared in the bots' log data. This enabled a proactive approach to monitoring and fixing the estate as necessary.
By predicting cluster failures in advance, the solution allowed the run team to take preventative measures that increased service availability, decreased incident rates, and reduced associated work.
When a cluster of abnormal sequences is noticed in a short period of time, an event sequence model helps in identifying the customer journey and flagging potentially detrimental incidents in advance. Additionally, it can help identify unreported incidents (not raised by users) and get them fixed, thus improving customer experience.
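The idea of flagging abnormal event sequences can be sketched without a deep learning stack. The example below is a deliberately simplified stand-in for the LSTM-based model mentioned earlier: it learns first-order transition frequencies from normal sequences and flags steps that were rarely or never seen. The event names, function names, and the probability threshold are hypothetical.

```python
from collections import Counter, defaultdict


def fit_transitions(sequences):
    """Count how often each event is followed by each other event."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return counts


def unusual_steps(seq, counts, min_prob=0.1):
    """Return (position, event, next_event) for transitions rarely seen in training."""
    flagged = []
    for pos, (current, nxt) in enumerate(zip(seq, seq[1:]), start=1):
        seen = counts.get(current, Counter())
        total = sum(seen.values())
        # Unknown events have no history, so their transitions score 0.
        prob = seen[nxt] / total if total else 0.0
        if prob < min_prob:
            flagged.append((pos, current, nxt))
    return flagged


# Hypothetical customer-journey events parsed from logs.
normal = [["login", "quote", "pay", "logout"]] * 3 + [["login", "quote", "logout"]]
model = fit_transitions(normal)
print(unusual_steps(["login", "quote", "pay", "error", "logout"], model))
# -> [(3, 'pay', 'error'), (4, 'error', 'logout')]
```

An LSTM plays the same role as the transition table here but conditions on a longer history of events, which lets it catch anomalies that only show up across several steps.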
Several other NLP clustering use cases still in development are assisting different IT teams in providing MI analytics reports with additional dimension(s) based on the clustering.
Value Brought by Dataiku:
Dataiku is the platform for Enterprise AI, systemizing the use of data for exceptional business results. It improved speed and agility through increased team efficiency, enhanced tech stack efficiency, improved risk management and governance through transparency and explainability, upskilling, and networking with resources such as the Dataiku Academy and Dataiku Community.
Key elements that generated value include:
● Dataiku’s MLOps capability, which helped in delivering a working model quickly.
● Dataiku’s role in improving team productivity and speed. The visual recipes aided in the rapid construction of the ETL procedures, and model building frequently took only a few hours when built-in algorithms were employed.
● The orchestration layer in the form of scenarios, which has served us well in building end-to-end solutions with less dependency on other orchestration tools.
● The new plugins and features that are constantly added, which are very useful in PoCs when we need quick wins.
● The Dataiku Knowledge Base and the aforementioned Dataiku Academy, which are excellent sources of learning. With the information available in one place, the teams were able to self-learn and contribute to projects within a few weeks.