How does hybrid storage impact performance and scalability in Dataiku?

Charlesdevis

Hello everyone,

I'm currently working with Dataiku and I'm curious about how hybrid storage impacts performance and scalability. I've done some research and found some information that might be helpful to share.

Firstly, according to the Dataiku documentation, the DSS API node is natively highly available and scalable. This is achieved by deploying multiple instances of the API node.
In addition, the Dataiku Knowledge Base offers many ways to efficiently rebuild datasets so that downstream outputs reflect the latest upstream data. One way to do this is by enabling virtualization for intermediate datasets in the Flow. This can prevent Dataiku from writing the data of an intermediate dataset when executing the SQL pipeline, which can improve performance.
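To make the virtualization idea concrete, here is a minimal sketch using plain SQLite (not Dataiku's own API; the table names are made up for illustration). The point is the same as in the Flow: without virtualization, each step writes its output table; with virtualization, the intermediate step is inlined as a subquery so nothing intermediate is ever written.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 10.0, "ok"), (2, 5.0, "cancelled"), (3, 7.5, "ok")])

# Without virtualization: each step materializes its output dataset.
conn.execute("CREATE TABLE filtered AS SELECT * FROM orders WHERE status = 'ok'")
conn.execute("CREATE TABLE totals AS SELECT SUM(amount) AS total FROM filtered")
materialized = conn.execute("SELECT total FROM totals").fetchone()[0]

# With virtualization: the intermediate dataset is inlined as a subquery,
# so the 'filtered' table is never written to storage.
virtualized = conn.execute(
    "SELECT SUM(amount) AS total FROM "
    "(SELECT * FROM orders WHERE status = 'ok')"
).fetchone()[0]

assert materialized == virtualized  # same result, one less write
print(materialized)  # 17.5
```

The saving is exactly the skipped write (and re-read) of the intermediate table, which is where the performance gain of a SQL pipeline comes from.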

I also came across an article on Snowflake Data Warehouse, which is a cloud-based data warehousing solution. The article mentions that Snowflake has great performance and scalability thanks to the separated storage and compute. This allows Snowflake to run a virtually unlimited number of workloads without any performance degradation.

I'm still interested in learning more about how hybrid storage specifically impacts performance and scalability in Dataiku. If anyone has any insights or experiences to share, I would love to hear them.

Thank you!


Operating system used: linux

Turribeach

Hi,

The link you posted is in French and talks about hybrid cloud, not hybrid storage. In general, hybrid cloud means a mix of private and public hosting, while hybrid storage is a mix of SSDs and HDDs. There is also hybrid cloud storage, which is a mix of on-premises and cloud storage. "Hybrid" is an abstract concept that can mean different things to different people, so I suggest you clarify exactly what you mean and state your proposed design or your architecture question.

"The DSS API node is natively highly available and scalable" => In my view this statement is incorrect. The API node does support deployments that can be made highly available and scalable, but the default installation is neither highly available nor scalable. So yes, while highly available and scalable deployments are supported, they don't come by default and you need to integrate with other technologies like Kubernetes to make that happen. So in my mind that's not really "native".
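To illustrate why this isn't "native", here is a toy sketch of the piece that external infrastructure (Kubernetes, a load balancer) has to provide in front of several API node replicas. The endpoint names are hypothetical, and this is just the concept, not anything Dataiku ships:

```python
import itertools

class RoundRobinBalancer:
    """Toy load balancer distributing requests across API node replicas.
    A single default API node installation has no equivalent of this,
    which is why HA requires integrating other technologies."""

    def __init__(self, endpoints):
        self.endpoints = list(endpoints)
        self._cycle = itertools.cycle(self.endpoints)

    def pick(self, healthy):
        # Skip replicas whose health check is failing; with only one
        # node, a single failure means a total outage.
        for _ in range(len(self.endpoints)):
            ep = next(self._cycle)
            if healthy.get(ep, False):
                return ep
        raise RuntimeError("no healthy API node replica available")

lb = RoundRobinBalancer(["apinode-1:12000", "apinode-2:12000"])
health = {"apinode-1:12000": False, "apinode-2:12000": True}
print(lb.pick(health))  # apinode-2:12000 (the failed replica is skipped)
```

Scalability works the same way: you add replicas to the pool and the balancer spreads the load, which is exactly the layer Kubernetes gives you.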

You should consider that neither the Designer node nor the Automation node supports deployments that are highly available and scalable. The Designer node must exist in order to deploy models and flows to the Automation node (for batch scoring) and APIs to the API node (for real-time scoring). In my experience with Dataiku, most solutions end up using batch scoring models, so the Automation node is more important than the API node, but obviously each solution is different, so it depends on the use case.

SQL pipelines can indeed improve performance. But the beauty of Dataiku is the ability to combine different technologies and data from many different sources, and to use either code recipes as a coder or visual recipes as a clicker. For SQL pipelines to work effectively you will need to use visual recipes on the same database, using the same Dataiku connection, with no intermediate code recipes or recipes in other technologies or connections. You will also lose the ability to preview the intermediate datasets and "debug" your flow/data pipeline to see how the data is being transformed step by step, which is another big advantage of Dataiku. So yes, SQL pipelines can give you some performance improvement, but it's not "free" and it doesn't work in all flows/cases.
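Here is a small sketch of why an intermediate code recipe breaks the pipeline. Again this is plain SQLite and generic Python, not Dataiku's API, with made-up table names: as long as every step is SQL on the same connection, the whole chain can be fused into one statement; the moment a code step pulls rows out into Python, everything around it must be materialized.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 3), ("b", 5), ("a", 2)])

# Visual recipes on the same connection stay in SQL and can be fused
# into a single statement that runs entirely inside the database:
fused = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# A code recipe in the middle pulls the rows out of the database into
# Python, so its input and output must be materialized -- the chain can
# no longer run as one SQL statement:
rows = conn.execute("SELECT * FROM events").fetchall()  # data leaves the DB
totals = {}
for user, clicks in rows:                               # Python-side step
    totals[user] = totals.get(user, 0) + clicks
print(sorted(totals.items()))  # [('a', 5), ('b', 5)]
```

Both paths give the same answer, but only the first one keeps the work inside the database, which is the condition for a SQL pipeline to pay off.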

Snowflake separates compute from storage, and so does GCP's BigQuery, so it's not a unique concept. Snowflake "can run a virtually unlimited number of workloads without any performance degradation" => correct, but you pay for the allocated compute, so having all this power at your disposal 24x7 will be very costly indeed. Whether Snowflake is better for your use case will depend on many factors. Nothing comes for free, so I wouldn't buy into marketing statements that don't give much context.

While the above gives you some more detail on the points in your post, it doesn't give you any direction, because you haven't stated any of your requirements or goals. So if you are looking for help designing a system, it would be great if you shared them.

Thanks
