Optimize data read time of a recipe in the Flow

sanjay_1197 · August 2023

Hi,

I have a recipe flow which reads data from S3, applies some filtering and calculations to the input data, and produces the final output. If the input dataset is huge, the DSS recipe takes a long time to read it. Are there any parallel processing methods to read the data faster?

Alexandru · August 2023

Hi @sanjay_1197,

The built-in parallelism in Spark should help.

https://doc.dataiku.com/dss/latest/spark/index.html

If you can, use the Spark engine for visual recipes or PySpark for code recipes.
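
For a code recipe, a PySpark version could look roughly like the sketch below. It uses the `dataiku.spark` helpers `get_dataframe` and `write_with_schema`; the dataset names you pass in and the `amount` column are made up for illustration, so adapt them to your own flow. The imports sit inside the function only because `dataiku` and `pyspark` are available inside a DSS Spark environment and not elsewhere.

```python
def run_recipe(input_name, output_name):
    """Sketch of a DSS PySpark recipe body: read, filter, compute, write."""
    # These modules are only importable inside a DSS Spark environment.
    import dataiku
    import dataiku.spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql import functions as F

    sc = SparkContext.getOrCreate()
    sql_context = SQLContext(sc)

    # Load the input dataset as a Spark DataFrame so Spark can read it in parallel.
    df = dkuspark.get_dataframe(sql_context, dataiku.Dataset(input_name))

    # Hypothetical filter and calculation; "amount" is an assumed column name.
    result = (
        df.filter(F.col("amount") > 0)
          .withColumn("amount_with_tax", F.col("amount") * 1.2)
    )

    # Write the result and its schema to the output dataset.
    dkuspark.write_with_schema(dataiku.Dataset(output_name), result)
```

In the recipe itself you would end with a call such as `run_recipe("my_input", "my_output")`, using the input and output dataset names of that recipe.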

To allow direct reads from Spark, make sure your S3 connection uses either STS AssumeRole or an access key/secret pair, and that the connection details are readable by the user running the recipe.

Also, Parquet is the preferred format when working with Spark. If your data is not already in Parquet, the fastest way is usually to use a Sync recipe in DSS to convert your CSV dataset to Parquet, and then use the newly created Parquet dataset as the input for the rest of your flow.

Thanks,

Miasm1 · August 2023

To speed up reading data from S3 in your DSS recipe, consider using parallel reads and partitioning your data. Switch to columnar file formats like Parquet for efficient reads. Also, look into optimizing DSS settings like memory and worker configurations to better handle large datasets.
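
To make the "parallel reads plus partitioning" idea concrete, here is a small standard-library sketch. It is not how DSS or Spark is implemented; it only illustrates the pattern of splitting data into part files, processing each part in its own worker, and combining the partial results. The file layout, column names, and threshold are all invented for the example, and local temporary files stand in for objects on S3.

```python
import csv
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_parts(folder, n_parts=4, rows_per_part=5):
    """Create small CSV part files standing in for a partitioned dataset."""
    paths = []
    for p in range(n_parts):
        path = Path(folder) / f"part-{p:04d}.csv"
        with path.open("w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(["id", "amount"])
            for i in range(rows_per_part):
                # Every row in part p gets amount (p + 1) * 10.
                writer.writerow([p * rows_per_part + i, (p + 1) * 10])
        paths.append(path)
    return paths


def sum_large_amounts(path, threshold=20):
    """Read one part, keep rows with amount >= threshold, return their sum."""
    with Path(path).open(newline="") as fh:
        amounts = (int(row["amount"]) for row in csv.DictReader(fh))
        return sum(a for a in amounts if a >= threshold)


def parallel_total(paths, workers=4):
    """Process each part in its own thread and combine the partial sums."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_large_amounts, paths))


with tempfile.TemporaryDirectory() as tmp:
    parts = write_parts(tmp)
    total = parallel_total(parts)
    print(total)
```

Threads are used here because reading files is I/O-bound. Spark applies the same split, process, and combine pattern across executors, which is why partitioned inputs help it read faster.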
