Seeking Optimization Tips for DSS Flow and Spark Configuration

Registered Posts: 9 ✭✭✭

Hello everyone,

I am working on optimizing my DSS flow. I have a scenario that currently takes 20 minutes to execute, and I would like to bring that down to about 5 minutes. I would greatly appreciate any tips or strategies for optimization.

Additionally, I am interested in understanding how to configure Spark settings to ensure optimal resource allocation. Specifically, I am looking for guidance on configuring parameters such as spark.driver.cores, spark.dynamicAllocation.initialExecutors, spark.executor.cores, spark.dynamicAllocation.enabled, spark.executor.instances, and spark.driver.memory.
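To make it concrete, below is a minimal PySpark sketch with placeholder values (not recommendations) showing the knobs I mean; as I understand it, in DSS these key/value pairs would normally live in a named Spark configuration attached to the recipe rather than in code:

    from pyspark.sql import SparkSession

    # Placeholder values for illustration only -- size these to your cluster.
    spark = (
        SparkSession.builder
        .appName("flow-tuning-sketch")
        .master("local[*]")  # on a real DSS install the master comes from the Spark config (e.g. YARN or Kubernetes)
        .config("spark.driver.cores", "2")
        .config("spark.driver.memory", "4g")
        .config("spark.executor.cores", "4")
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.initialExecutors", "2")
        .config("spark.dynamicAllocation.maxExecutors", "10")
        # Dynamic allocation needs shuffle tracking or an external shuffle service.
        .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
        # spark.executor.instances mainly matters when dynamic allocation is off,
        # so it is usually left unset when spark.dynamicAllocation.enabled=true.
        .getOrCreate()
    )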

Any advice or insights you could share on these topics would be greatly appreciated!

Thank you in advance for your help.


Answers

  • Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,399 Neuron

You have given us your requirement (to reduce the flow execution time from 20 minutes to 5 minutes), but you haven't given us any additional information to go on. Please post a picture of your flow, give detailed timings of each recipe, give us information about row counts, etc.

With regards to the Spark question, I would suggest you start a separate thread. These are separate subjects, so it's difficult to cover multiple questions in a single thread.

  • Registered Posts: 9 ✭✭✭
    edited March 5

Thanks for your response. The scenario runs when new rows are added via a webapp in DSS. The join and self-join recipes, which are used to retrieve historical data, are particularly costly; each takes around 2 minutes.

    Initially, there are not many rows, but by the end of the flow, I am dealing with approximately 17 million rows.

  • Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,399 Neuron

So the first thing you need to realise is that while Dataiku allows you to build a complex data pipeline visually without writing any code, this is never going to be the fastest way to load and prepare large datasets. The fact that DSS persists all the intermediate datasets is both a big advantage (explainability, debugging, etc.) and a big disadvantage (lots of redundant data, lots of reads and writes). Depending on the recipes and connections you use, you may be able to enable SQL pipelines in part of your flow, which should make those recipes run faster.

You should switch the flow view to Recipe engines. Any recipe showing the DSS engine should be reviewed, because it means the data has to be moved to the DSS server for processing and then back to the database to write the output. This tends to be slower than the SQL engine, where execution happens entirely in the database without data moving to Dataiku.

Finally, you should review your SQL database and make sure it is appropriately sized and tuned. Once you get into millions of rows, traditional RDBMS databases start to struggle, so moving to a technology that can handle billions of rows at speed (Databricks, Snowflake, BigQuery, etc.) will help.
