High Availability for Dataiku Application

ratnesmo
Level 1

I would like to know how to achieve high availability for the Dataiku application on the AWS platform. As far as I know, it can only be installed on an EC2 instance; installing Dataiku is not supported on an ECS/EKS/load balancer kind of setup.

3 Replies
Turribeach

Dataiku is a complex beast since it integrates with pretty much every database, compute engine and cloud data platform. Answering your question is therefore hard without knowing exactly what your setup and environment look like, and once you consider your data layer the questions around high availability become more complicated still. We also need to consider each of the main node types Dataiku provides individually.

Generally speaking, there is not much you can do for the Design and Automation nodes. You can get some high availability and elastic compute by running flow workloads on Kubernetes, but this doesn't cover the DSS GUI. Given that all the project data is stored locally on the Design and Automation nodes, the opportunities for high availability are limited.

On cloud, however, you have access to tools that let you reduce your recovery time, even though the node is still not highly available. For instance, you could take disk snapshots of your Dataiku local storage, say every 10 minutes, which should allow you to recreate the Design and Automation node VMs very quickly in another cloud zone or region should the current one become unavailable. Even with this approach you will have to DIY, since Dataiku doesn't provide any capabilities to handle such scenarios, and you will have to be careful that you don't end up with two DSS nodes executing the same flow against the same data layer.
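To make the snapshot idea a bit more concrete, here is a minimal sketch in Python of what such a DIY process might look like. It is not a Dataiku feature and not a tested recovery procedure. It assumes boto3 as the AWS client, that the DSS data directory sits on a single EBS volume, and that the replacement instance already exists in another zone of the same region (a cross-region restore would need a snapshot copy first). The tag name, device name and all IDs are hypothetical. EBS snapshots of a running node are only crash-consistent, so treat the restored data accordingly.

```python
from datetime import datetime, timezone

TAG_KEY = "dss-node"  # hypothetical tag used to find the snapshots of one node


def snapshot_volume(ec2, volume_id, node_name):
    """Take one snapshot of the DSS data volume (schedule this, e.g. every 10 minutes)."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    resp = ec2.create_snapshot(
        VolumeId=volume_id,
        Description=f"DSS data dir {node_name} {stamp}",
        TagSpecifications=[{
            "ResourceType": "snapshot",
            "Tags": [{"Key": TAG_KEY, "Value": node_name}],
        }],
    )
    return resp["SnapshotId"]


def latest_completed_snapshot(ec2, node_name):
    """Return the ID of the newest completed snapshot for a node (pagination ignored)."""
    resp = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[
            {"Name": f"tag:{TAG_KEY}", "Values": [node_name]},
            {"Name": "status", "Values": ["completed"]},
        ],
    )
    snapshots = resp["Snapshots"]
    if not snapshots:
        raise RuntimeError(f"no completed snapshot found for {node_name}")
    return max(snapshots, key=lambda s: s["StartTime"])["SnapshotId"]


def old_node_is_down(ec2, instance_id):
    """Guard against two DSS nodes running the same flows at once."""
    resp = ec2.describe_instances(InstanceIds=[instance_id])
    state = resp["Reservations"][0]["Instances"][0]["State"]["Name"]
    return state in ("stopped", "terminated")


def restore_volume(ec2, node_name, old_instance_id, new_instance_id,
                   target_az, device="/dev/sdf"):
    """Create a volume from the latest snapshot and attach it to the new instance."""
    if not old_node_is_down(ec2, old_instance_id):
        raise RuntimeError("old node is still running; refusing to start a second one")
    snapshot_id = latest_completed_snapshot(ec2, node_name)
    volume = ec2.create_volume(SnapshotId=snapshot_id, AvailabilityZone=target_az)
    volume_id = volume["VolumeId"]
    # Wait until the new volume is usable before attaching it.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    ec2.attach_volume(VolumeId=volume_id, InstanceId=new_instance_id, Device=device)
    return volume_id
```

In real use `ec2` would be `boto3.client("ec2")`. The client is passed in so the logic can be exercised with a stub. The `old_node_is_down` check is only a simple illustration of the "two nodes" concern: if a whole zone is unreachable the old instance's state may not be reliable, so some other form of fencing would be needed.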

The API node is a different beast since it can be deployed entirely on Kubernetes, meaning that you can have proper high availability, auto-scaling and auto-healing. See this for more info: https://doc.dataiku.com/dss/latest/apinode/kubernetes/index.html

So all in all there isn't much you can do for the Design and Automation nodes. Having said that, I would argue that this doesn't impact the Dataiku architecture that much. The Design node is really a development environment, so the need for high availability is somewhat less critical. The API node can be highly available, as stated above. That leaves us with the Automation node. Here high availability matters more, as this is where you will run your production flows. In our experience we haven't found any availability issues with the Automation node. This is mostly because our users don't usually log in to the Automation node, and all the projects that get deployed to the Production Automation node are always tested and validated in our Test Automation node first. So the chances of the Production Automation node displaying any stability issues are very low in our case.

Ultimately, Dataiku being a third-party product, you are limited by what the vendor supports and by the product's features.

ratnesmo
Level 1
Author

Thanks for the detailed update. From the computational workload perspective, we have already integrated with EKS, so that is taken care of. The Design node is for development, so HA is not an issue there, as you mentioned. But since my production Dataiku application/GUI runs on the Automation node, I need high availability for it. Understood that we can use a workaround of spinning up a new instance and attaching the EBS volume to bring down the mean time to recovery. Even if it takes 30 minutes to recover, that is an acceptable downtime.

The stability of Dataiku and the testing of the application are not in question here. There could be multiple reasons why the EC2 instance hosting the GUI itself becomes unavailable. If we have a huge number of use cases and users, with more than 50 EC2 instances (Automation nodes), the recovery process becomes very difficult to manage. It would always remain a workaround, which cannot be called high availability, the industry practice for business-critical systems.
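As a rough illustration of the scale concern, the arithmetic can be sketched as below. The 30-minute recovery time and the 50 nodes come from this thread; the failure rate of two instance failures per node per year is purely an assumed figure for the example, not a measured one.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def availability(mttr_minutes, failures_per_year):
    """Fraction of the year a single node is up, given its recovery time per failure."""
    downtime = mttr_minutes * failures_per_year
    return 1 - downtime / MINUTES_PER_YEAR


def fleet_recoveries_per_year(nodes, failures_per_node_per_year):
    """Expected number of manual recoveries across the whole fleet."""
    return nodes * failures_per_node_per_year


ASSUMED_FAILURES = 2  # hypothetical: instance failures per node per year

print(f"{availability(30, ASSUMED_FAILURES) * 100:.3f}%")  # 99.989%
print(fleet_recoveries_per_year(50, ASSUMED_FAILURES))     # 100
```

Under that assumption each node still looks fine on paper, but the team would be running a manual recovery roughly twice a week somewhere in the fleet, which is the manageability problem rather than a per-node downtime problem.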

Moaï
Level 2

I agree with @ratnesmo, and would very much like an HA feature for the Design and Automation nodes. I understand DSS is an orchestrator of big data engines, but it still performs computation itself, and vertical scaling is limited...

Since duplicating instances and segregating access is not always tractable or desirable, would it be possible to add this to the product backlog? Or is it already being taken care of?

Regards,
Moaï
