What are the recommendations from Dataiku to run visual recipes on the cloud stack?

Ignacio_Toledo

A few weeks ago we started a journey to test the use of the DSS Cloud Stack, specifically in AWS. We have had excellent support from both the Dataiku and AWS teams, and while we still have a long road ahead of us, we have deployed a stack in the cloud in record time.

We are now in the process of deciding on and setting best practices for the different users who will access the platform. We also already know that we would like to offload most of the compute to EKS, to take advantage of the scaling features that DSS already handles when it manages its own EKS clusters.

This solution is great for Spark workloads and code recipes, but we are scratching our heads over how to handle the needs of the users who mostly use "visual recipes", on datasets that are small enough to run without a Spark engine.

One option would be to run those recipes (mainly preparation steps and data transformations for dashboards or reports, nothing related to AI or ML) on the DSS instance's own resources. But in that case we would lose the elasticity that EKS provides, and we would need to size our DSS instances for the hours with the highest user demand. In our on-premises infrastructure we already have design nodes with 64 GB of memory and 12 CPUs that can handle the rush-hour load, but there are hours when those machines are almost idle.
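To make the sizing trade-off concrete, here is a back-of-envelope sketch (all the numbers below are hypothetical placeholders, not measurements from our nodes): sizing a fixed instance for the peak means paying for whatever the average load leaves idle.

    # Back-of-envelope: fixed sizing for peak demand vs. actual average load.
    # The demand profile below is a made-up illustration, not real data.
    peak_mem_gb = 64
    hourly_mem_gb = [8, 8, 8, 16, 48, 64, 64, 40, 16, 8, 8, 8]  # sample working day

    avg_mem_gb = sum(hourly_mem_gb) / len(hourly_mem_gb)
    utilization = avg_mem_gb / peak_mem_gb
    print(f"average demand: {avg_mem_gb:.0f} GB "
          f"-> utilization of a peak-sized node: {utilization:.0%}")
    # ~39% here: the remaining capacity is paid for but idle, which is
    # exactly the slack that EKS autoscaling would reclaim.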

So, apparently, the only other way to run those visual recipes in containers is to select the Spark engine, keeping in mind that we have not provisioned a data warehouse solution in the cloud and are mostly using S3, EKS, EC2 and Athena.

For small jobs, the overhead of spinning up a Spark job sometimes outweighs the work itself. Also, most of these users are not Spark experts (and to be honest, those of us with a bit more experience are not experts either), so it would require them to manually set the DSS engine to Spark for each visual recipe, and then select the right Spark configuration from a list that we would need to create.
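If we do go the Spark route, one mitigation we have been considering is scripting the engine switch instead of asking users to do it by hand. Here is an untested sketch using the dataikuapi public client; the URL, API key, project key and recipe names are placeholders, and the exact location of the engine setting can differ between recipe types and DSS versions:

    import dataikuapi

    # Placeholders: point these at your own instance, API key and project.
    client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")
    project = client.get_project("MY_PROJECT")

    # Hypothetical visual recipes we want to move off the local DSS engine:
    for recipe_name in ["prepare_orders", "group_sales_by_day"]:
        settings = project.get_recipe(recipe_name).get_settings()
        # For visual recipes the engine is stored in the recipe params;
        # verify the schema for each recipe type before running this in bulk.
        settings.get_recipe_params()["engineType"] = "SPARK"
        settings.save()
        print(f"{recipe_name}: engine set to SPARK")

That would at least take the manual clicking out of the equation, even if users still need to understand which Spark configuration to pick.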

The question is: what would Dataiku recommend in this situation?

  • Should we create DSS instances big enough to run some of the smaller visual recipes on the instance's own resources?
  • Should we instead train our "visual recipe" users to become familiar with the Spark engine? (A possible starter configuration is sketched after this list.)
  • Should we look at one of the many data warehouse solutions that DSS supports, such as Snowflake, and offload these workloads to a different system?
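For context on option 2, the "list of Spark configurations" could start as a single conservative preset for small jobs. Below is a rough sketch of what such a preset might contain, using standard Spark-on-Kubernetes properties; every size and the container image reference are assumptions to adapt, not values from our setup:

    # Hypothetical "small visual jobs" Spark preset. All values are placeholders
    # to tune; the properties themselves are standard Spark-on-Kubernetes settings.
    SMALL_JOBS_SPARK_CONF = {
        "spark.executor.instances": "2",   # start tiny
        "spark.executor.cores": "1",
        "spark.executor.memory": "2g",
        "spark.driver.memory": "1g",
        # Let Spark scale executors down when idle and up to a small ceiling:
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.maxExecutors": "4",
        # Placeholder image reference for the executors:
        "spark.kubernetes.container.image": "<registry>/dss-spark:latest",
    }

The idea would be that a preset like this keeps the per-job overhead bounded while still landing the work on the autoscaled EKS nodes rather than on the design node itself.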

Any other tips from the community are welcome. For those of you in the cloud, what has your experience been?

Thanks!


Operating system used: DSS Cloud Stack

Answers

  • Ignacio_Toledo

    Hi @CoreyS,

    I'm reaching out to you (I hope I'm not bothering you) to find out how I could draw more attention to this post.

    Or perhaps this is the kind of question that should be forwarded to a support specialist?

    Thanks in advance!

    Ignacio

  • CoreyS

    Hi @Ignacio_Toledo, I just sent you a private message. I hope that helps!
