Survey banner
Switching to Dataiku - a new area to help users who are transitioning from other tools and diving into Dataiku! CHECK IT OUT

Optimize Data read in time of the recipe in the Flow

Level 1
Optimize Data read in time of the recipe in the Flow


I have a recipe flow which reads data from s3 and does some filtration, calculation on the input data and produces the final output. If the input dataset is huge, the DSS recipe takes a lot of time to read it, is there any parallel processing methodologies to read the data at higher speed?

2 Replies

Hi @sanjay_1197 ,

The built-in parallelism in Spark should help.

If you can, use Spark engine for the visual recipe or PySpark for code recipes.

To allow direct reads from spark, you would need to ensure you are using STS Assume Role or Access Key/Secret on your S3 connection and that details are readable by allowed for the user running the recipe.

Also, when dealing with spark using parquet is preferred if you don't have your data in parquet currently. The fastest way is usually to use a sync recipe in DSS from your CSV to Parquet format and then use the newly created parquet dataset as input for rest of your flow.


Level 2

To speed up reading data from S3 in your DSS recipe, consider using parallel reads and partitioning your data. Switch to columnar file formats like Parquet for efficient reads. Also, look into optimizing DSS settings like memory and worker configurations to better handle large datasets.

Mia SmithDevOps Course
0 Kudos