External Compute in Dataiku

Solved!
sj0071992

Hi Team,

I need your help in understanding how computation works in Dataiku.

 

If I am using an external compute engine (e.g. EMR), will Dataiku process the data directly in EMR when I use a visual recipe, or will it process the data in the local stream first and then copy it to EMR?

Thanks in advance.

7 Replies
fchataigner2
Dataiker

Hi,

As the documentation explains, DSS can leverage EMR for compute, which means the EMR cluster will do the data processing, provided you choose an engine for your recipes that runs on the cluster (Spark or Hive).
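To make that concrete, here is a minimal sketch of a PySpark recipe using Dataiku's Spark integration (the dataset names are hypothetical). When the recipe runs with the Spark engine against an EMR cluster, the transformations execute on the cluster's executors rather than on the DSS server:

    # Minimal PySpark recipe sketch; "transactions_s3" and
    # "transactions_by_customer" are hypothetical dataset names.
    import dataiku
    import dataiku.spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    # Read the input dataset as a Spark DataFrame; with the Spark engine the
    # processing happens on the cluster, not on the DSS server.
    input_ds = dataiku.Dataset("transactions_s3")
    df = dkuspark.get_dataframe(sqlContext, input_ds)

    # Example transformation, executed by the cluster's executors.
    result = df.groupBy("customer_id").count()

    # Write back through the Dataiku Spark integration.
    output_ds = dataiku.Dataset("transactions_by_customer")
    dkuspark.write_with_schema(output_ds, result)

The same recipe run with the local (DSS) engine would instead pull the data through the DSS server, which is what you want to avoid for large datasets.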

sj0071992
Author

Hi,

Thanks for your response. Could you please also confirm whether recipes load the source data into the Dataiku server first, or load it directly into EMR (or any other external compute)?

fchataigner2
Dataiker

Hi,

Recipes work off the storage of their input and output datasets. If your datasets are of type Filesystem, then the data is on the DSS server. For an EMR cluster, the datasets would need to be S3 datasets.
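If you want to verify where a given dataset's data actually lives, you can inspect its storage type through the Dataiku public API client. This is only a sketch: the host, API key, project key and dataset name are placeholders, and the exact settings accessor can vary by DSS version.

    # Sketch using the Dataiku public API client; all names are placeholders.
    import dataikuapi

    client = dataikuapi.DSSClient("https://dss.example.com:11200", "API_KEY")
    dataset = client.get_project("MYPROJECT").get_dataset("my_dataset")

    # The dataset settings include the storage type, e.g. "Filesystem", "S3",
    # "Snowflake", ...
    settings = dataset.get_settings()
    print(settings.get_raw()["type"])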

sj0071992
Author

Hi,

 

I got your point, but I still have a question.

If I enable EKS and I am reading source data from Snowflake (around 500 GB), will the data be read directly into EKS, or will it first be loaded into Dataiku and then copied to EKS?

 

Thanks in advance.

fchataigner2
Dataiker

Hi

SQL recipes will run directly in Snowflake, so I guess they're not the subject of your question. For other recipes like Prepare or Sync, if they're not set to use the SQL engine, the data will be streamed via the DSS server. The workaround in this case is to use the Spark engine, so that the spark-snowflake integration is used (see https://doc.dataiku.com/dss/latest/connecting/sql/snowflake.html#spark-integration; for an EKS cluster this would be the default).
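For reference, the spark-snowflake integration that the Spark engine relies on is the Snowflake Spark connector, so the executors read from Snowflake in parallel instead of streaming the data through DSS. Outside of DSS, the equivalent read looks roughly like the sketch below; all connection values are placeholders, and inside DSS this is configured on the Snowflake connection rather than written by hand.

    # Rough sketch of what the spark-snowflake integration does; all
    # connection values and the table name are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    sf_options = {
        "sfUrl": "myaccount.snowflakecomputing.com",
        "sfUser": "dss_user",
        "sfPassword": "********",
        "sfDatabase": "ANALYTICS",
        "sfSchema": "PUBLIC",
        "sfWarehouse": "COMPUTE_WH",
    }

    # The Spark executors pull data from Snowflake in parallel; nothing is
    # streamed through the DSS server.
    df = (spark.read
          .format("net.snowflake.spark.snowflake")
          .options(**sf_options)
          .option("dbtable", "SOURCE_TABLE")
          .load())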

sj0071992
Author

Hi,

 

I got your point.

So how can I tackle the situation if my source is a SQL database and I am using in-database computing?

Will increasing the SQL cluster size help? It is taking almost 2.5 hours to read 10 billion records.

fchataigner2
Dataiker

Hi

If the source is SQL, the compute is in-database, and the output is SQL in the same database, then DSS only crafts the SQL statement and passes it to the database. In that case, increasing the cluster size is the only option. If the output is not SQL, or not in the same database, the compute is still in-database, but the output is streamed via DSS to be written to the output dataset. In that case, depending on the output type, it can be worth replacing the output with a dataset in the same database and then synchronizing it to the original recipe output.

For example, Snowflake -> group recipe -> S3 would be slow because the result is streamed via DSS to S3, but Snowflake -> group recipe -> Snowflake -> sync recipe -> S3 would be faster (because the sync recipe will issue a "COPY INTO ..." statement to Snowflake, and that is fast).
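To illustrate what that faster flow amounts to, here is a rough sketch of the SQL work involved, written with the Snowflake Python connector. The table, stage and connection names are placeholders; in practice DSS generates these statements itself when the group recipe uses the SQL engine and the Sync recipe targets S3.

    # Illustrative sketch only: DSS generates these statements for you.
    # Table, stage and connection values are placeholders.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="myaccount", user="dss_user", password="********",
        database="ANALYTICS", schema="PUBLIC", warehouse="COMPUTE_WH",
    )
    cur = conn.cursor()

    # In-database aggregation (what a group recipe with the SQL engine pushes down):
    cur.execute("""
        CREATE OR REPLACE TABLE GROUPED_OUTPUT AS
        SELECT customer_id, COUNT(*) AS nb_rows
        FROM SOURCE_TABLE
        GROUP BY customer_id
    """)

    # Fast export to S3 (what the Sync recipe issues instead of streaming via DSS):
    cur.execute("""
        COPY INTO @my_s3_stage/grouped_output/
        FROM GROUPED_OUTPUT
        FILE_FORMAT = (TYPE = CSV)
    """)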
