External Compute in Dataiku

sj0071992 Partner, Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2022, Neuron 2023 Posts: 131 Neuron

Hi Team,

I need your help understanding how computation works in Dataiku.

If I am using an external compute engine (e.g. EMR), will Dataiku process the data directly in EMR when I use a visual recipe, or will it first process the data in a local stream and then copy it to EMR?

Thanks in Advance

Best Answer

  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker
    Answer ✓

    Hi

    If the source is SQL, the compute is in-database, and the output is SQL in the same database, then DSS only crafts the SQL statement and passes it to the database. In that case, increasing the cluster size is the only option. If the output is not SQL, or is not in the same database, the compute is still in-database, but the result is streamed via DSS to be written to the output dataset. In that case, depending on the output type, it can be worth replacing the output with a dataset in the same database and then synchronizing that dataset to the original recipe output.

    For example, Snowflake -> group recipe -> S3 would be slow because the result is streamed via DSS to S3, but Snowflake -> group recipe -> Snowflake -> sync recipe -> S3 would be faster, because the sync recipe issues a "copy into..." statement to Snowflake, and that is fast.
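
The rule described in this answer can be sketched as a small decision function. This is only an illustrative model of the reasoning, not a Dataiku API: the `Dataset` class, its `kind` and `connection` fields, and the function name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical stand-in for a dataset; not a Dataiku class."""
    kind: str        # assumed labels, e.g. "sql" or "s3"
    connection: str  # name of the database or storage connection


def describe_in_database_run(src: Dataset, out: Dataset) -> str:
    """Say how the result reaches the output when the recipe runs in-database."""
    if src.kind != "sql":
        raise ValueError("this sketch only covers SQL sources")
    if out.kind == "sql" and out.connection == src.connection:
        # Input and output live in the same database: only SQL is sent.
        return "DSS sends the SQL statement; data stays in the database"
    # Output is elsewhere: compute is still in-database, result goes via DSS.
    return "computed in-database; result streamed via DSS to the output"


snowflake_in = Dataset("sql", "snowflake")
snowflake_out = Dataset("sql", "snowflake")
s3_out = Dataset("s3", "s3-bucket")

print(describe_in_database_run(snowflake_in, s3_out))
print(describe_in_database_run(snowflake_in, snowflake_out))
```

Under this model, the first call corresponds to the slow Snowflake -> group recipe -> S3 case, and the second to the intermediate Snowflake dataset that a sync recipe would then export.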

Answers

  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker

    Hi,

    As the doc explains, DSS can leverage EMR for computing, which means that the EMR cluster will do the data processing, provided you choose an engine for your recipes that runs on the cluster (Spark or Hive).

  • sj0071992
    sj0071992 Partner, Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2022, Neuron 2023 Posts: 131 Neuron

    Hi,

    Thanks for your response. Could you please also confirm whether recipes first load the source data onto the Dataiku server, or load it directly into EMR or another external compute engine?

  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker

    Hi,

    Recipes work off the storage of their input and output datasets. If your datasets are of type Filesystem, then the data is on the DSS server. For an EMR cluster, the datasets would need to be S3 datasets.

  • sj0071992
    sj0071992 Partner, Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2022, Neuron 2023 Posts: 131 Neuron

    Hi,

    I got your point, but I still have a question.

    If I enable EKS and read source data from Snowflake (500 GB of data), will it be read directly in EKS, or will it first be loaded into Dataiku and then copied to EKS?

    Thanks in Advance

  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker

    Hi

    SQL recipes run directly in Snowflake, so I guess they're not the subject of your question. For other recipes like Prepare or Sync, if they're not set to use their SQL engine, the data will be streamed via the DSS server. The workaround in this case is to use the Spark engine, so that the spark-snowflake integration is used (see https://doc.dataiku.com/dss/latest/connecting/sql/snowflake.html#spark-integration, but for an EKS cluster that would be done by default).
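
As a rough summary of the three cases in this answer, the data path for a Snowflake source can be modelled as below. The function and the engine labels (`"sql"`, `"spark"`, `"dss"`) are hypothetical names chosen for this sketch; they are not Dataiku settings or API calls.

```python
def data_path(recipe: str, engine: str) -> str:
    """Describe where data flows for a recipe reading from Snowflake (sketch only)."""
    if recipe == "sql" or engine == "sql":
        # SQL recipes, or visual recipes set to their SQL engine.
        return "runs directly in Snowflake"
    if engine == "spark":
        # Spark engine: the cluster uses the spark-snowflake integration.
        return "read by the Spark cluster via the spark-snowflake integration"
    # Any other engine: rows pass through the DSS server.
    return "streamed via the DSS server"


for recipe, engine in [("sql", "sql"), ("prepare", "dss"), ("sync", "spark")]:
    print(f"{recipe} recipe, {engine} engine: {data_path(recipe, engine)}")
```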

  • sj0071992
    sj0071992 Partner, Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2022, Neuron 2023 Posts: 131 Neuron

    Hi,

    I got your point.

    So how can I tackle the situation if my source is a SQL database and I am using in-database computing?

    Will increasing the SQL cluster size help? It is taking almost 2.5 hours to read 10 billion records.
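
For scale, the figures quoted in this question work out to roughly 1.1 million records per second. The short calculation below only restates that arithmetic, treating "almost 2.5 hours" as exactly 2.5 hours; it says nothing about where the time is actually spent.

```python
records = 10_000_000_000  # 10 billion records, as stated in the question
hours = 2.5               # "almost 2.5 hours", taken as exactly 2.5 here

seconds = hours * 3600    # 9000 seconds
rate = records / seconds  # average records read per second

print(f"{rate:,.0f} records per second")  # prints "1,111,111 records per second"
```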
