Seeking Optimization Tips for DSS Flow and Spark Configuration

HAFEDH

Hello everyone,

I am currently working on optimizing my DSS flow. I have a scenario that currently takes 20 minutes to execute, and I am looking to reduce this time to just 5 minutes. I would greatly appreciate any tips or strategies for optimization.

Additionally, I am interested in understanding how to configure Spark settings to ensure optimal resource allocation. Specifically, I am looking for guidance on configuring parameters like spark.driver.cores, spark.dynamicAllocation.initialExecutors, spark.executor.cores, spark.dynamicAllocation.enabled, spark.executor.instances, and spark.driver.memory.
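
To make the question concrete: these properties are plain key/value pairs. Below is a minimal Python sketch with placeholder values (not tuned recommendations) showing how they fit together. The helper names are made up for illustration. The startup rule it encodes is my understanding of Spark's documented behavior: with dynamic allocation enabled, the initial executor count is the largest of minExecutors, initialExecutors, and spark.executor.instances.

```python
from typing import Dict, List

# Placeholder values for illustration only; these are not tuned recommendations.
SPARK_CONF: Dict[str, str] = {
    "spark.driver.cores": "1",
    "spark.driver.memory": "2g",
    "spark.executor.cores": "2",
    "spark.executor.instances": "4",
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.initialExecutors": "2",
}


def initial_executor_count(conf: Dict[str, str]) -> int:
    """Number of executors requested at startup for this configuration."""
    dynamic = conf.get("spark.dynamicAllocation.enabled", "false").lower() == "true"
    if not dynamic:
        # Static allocation: this sketch assumes the count is set explicitly.
        return int(conf["spark.executor.instances"])
    # Dynamic allocation: the largest of these three settings is used.
    return max(
        int(conf.get("spark.dynamicAllocation.minExecutors", "0")),
        int(conf.get("spark.dynamicAllocation.initialExecutors", "0")),
        int(conf.get("spark.executor.instances", "0")),
    )


def startup_executor_cores(conf: Dict[str, str]) -> int:
    """Total executor cores requested at startup (executors x cores each)."""
    return initial_executor_count(conf) * int(conf["spark.executor.cores"])


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render the properties as spark-submit style '--conf key=value' pairs."""
    args: List[str] = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    print(initial_executor_count(SPARK_CONF))   # 4
    print(startup_executor_cores(SPARK_CONF))   # 8
    print(" ".join(to_submit_args(SPARK_CONF)))
```

Note that in this example spark.executor.instances (4) overrides the smaller initialExecutors (2) at startup, which is one reason setting both alongside dynamic allocation can be confusing.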

Any advice or insights you could share on these topics would be greatly appreciated!

Thank you in advance for your help.

Answers

  • Turribeach

    You have given us your requirement (to reduce the flow execution time from 20 minutes to 5 minutes), but you haven’t given us any additional information to go on. Please post a picture of your flow, give detailed timings for each recipe, give us information about row counts, etc.

    With regard to the Spark question, I suggest you start a separate thread. These are separate subjects, so it’s difficult to cover multiple questions in a single thread.

  • HAFEDH
    edited March 5
    [Attached image: image.png]

    Thanks for your response. The scenario runs when new rows are added via a webapp in DSS. The join and self-join recipes, which are used to retrieve historical data, are particularly costly: each takes around 2 minutes.

    Initially, there are not many rows, but by the end of the flow, I am dealing with approximately 17 million rows.

  • Turribeach

    So the first thing you need to realise is that while Dataiku allows you to build a complex data pipeline visually without writing any code, this is never going to be the fastest way of loading and preparing large datasets. The fact that DSS persists all the intermediate datasets is both a big advantage (explainability, debugging, etc.) and a big disadvantage (lots of redundant data, lots of reads and writes). Depending on the recipes and connections that you use, you may be able to enable SQL pipelines in part of your flow, which should make those recipes run faster.
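
    To illustrate why skipping persisted intermediates can help, here is a minimal Python sketch using the standard-library sqlite3 module as a stand-in for a real database. The table names (raw, cleaned, summary) are made up for the example, and this is not how DSS implements SQL pipelines internally. It only contrasts writing an intermediate table with folding the same step into a single query.

```python
import sqlite3


def setup() -> sqlite3.Connection:
    """Create an in-memory database with a tiny, made-up input table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw (id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO raw VALUES (?, ?)",
        [(1, 10.0), (2, -3.0), (3, 5.0)],
    )
    return conn


def with_intermediate(conn: sqlite3.Connection) -> float:
    """Two steps; the intermediate 'cleaned' table is written, then read back."""
    conn.execute("CREATE TABLE cleaned AS SELECT id, amount FROM raw WHERE amount > 0")
    conn.execute("CREATE TABLE summary AS SELECT SUM(amount) AS total FROM cleaned")
    return conn.execute("SELECT total FROM summary").fetchone()[0]


def pipelined(conn: sqlite3.Connection) -> float:
    """Same logic in one statement; 'cleaned' exists only inside the query."""
    conn.execute(
        """
        CREATE TABLE summary AS
        WITH cleaned AS (SELECT id, amount FROM raw WHERE amount > 0)
        SELECT SUM(amount) AS total FROM cleaned
        """
    )
    return conn.execute("SELECT total FROM summary").fetchone()[0]


def table_names(conn: sqlite3.Connection) -> list:
    """List the tables that were actually materialized."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    a, b = setup(), setup()
    print(with_intermediate(a), table_names(a))
    print(pipelined(b), table_names(b))
```

    Both variants produce the same total, but only the first one writes and re-reads the intermediate table. On large datasets that extra write and read is the cost being avoided.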

    You should change the flow view to Recipe engines. Any recipe showing the DSS engine should be reviewed, because this means the data has to be moved to the DSS server for processing and then back to the database to write the output. This tends to be slower than the SQL engine, where execution happens entirely in the database without any data moving to Dataiku.
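
    The difference between the two execution styles can be sketched in plain Python, again with sqlite3 standing in for the database. This is only an analogy under made-up table names (events, customers, joined), not DSS code: the first function keeps the join inside the database, the second pulls the rows into the client process, joins them there, and writes the result back.

```python
import sqlite3

JOIN_SQL = """
    SELECT e.id, e.customer, e.amount, c.region
    FROM events e
    JOIN customers c ON c.customer = e.customer
"""


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with two small, made-up input tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE events (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
        CREATE TABLE customers (customer TEXT PRIMARY KEY, region TEXT);
        CREATE TABLE joined (id INTEGER, customer TEXT, amount REAL, region TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [(1, "a", 10.0), (2, "b", 5.0), (3, "a", 7.5)],
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [("a", "EU"), ("b", "US")],
    )
    conn.commit()
    return conn


def join_in_database(conn: sqlite3.Connection) -> None:
    """In-database style: the rows never leave the database engine."""
    conn.execute("INSERT INTO joined " + JOIN_SQL)
    conn.commit()


def join_via_client(conn: sqlite3.Connection) -> None:
    """Client-side style: read both inputs out, join in Python, write back."""
    events = conn.execute("SELECT id, customer, amount FROM events").fetchall()
    regions = dict(conn.execute("SELECT customer, region FROM customers").fetchall())
    rows = [(i, c, amt, regions[c]) for (i, c, amt) in events if c in regions]
    conn.executemany("INSERT INTO joined VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def result(conn: sqlite3.Connection) -> list:
    """Return the output table in a stable order for comparison."""
    return conn.execute("SELECT * FROM joined ORDER BY id").fetchall()


if __name__ == "__main__":
    a, b = make_db(), make_db()
    join_in_database(a)
    join_via_client(b)
    print(result(a) == result(b))
```

    The outputs are identical; what changes is how much data crosses the boundary between the database and the processing server, which is what dominates once the tables reach millions of rows.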

    Finally, you should review your SQL database and make sure it is sized and tuned appropriately. Once you get into millions of rows, traditional RDBMSs start to struggle, so moving to technologies that can handle billions of rows at speed (such as Databricks, Snowflake, or BigQuery) will help.
