Optimize the data read time of a recipe in the Flow
Hi,
I have a recipe in a Flow that reads data from S3, filters it, runs some calculations on it, and produces the final output. When the input dataset is huge, the DSS recipe takes a long time to read it. Are there any parallel processing methods to read the data faster?
Answers
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker
Hi @sanjay_1197,
The built-in parallelism in Spark should help.
https://doc.dataiku.com/dss/latest/spark/index.html
If you can, use the Spark engine for the visual recipe or PySpark for code recipes.
To allow Spark to read directly from S3, make sure your S3 connection uses STS Assume Role or an Access Key/Secret, and that the connection details are readable by the user running the recipe.
Also, Parquet is the preferred format when working with Spark. If your data is not in Parquet yet, the fastest way is usually to use a Sync recipe in DSS to convert your CSV dataset to Parquet, and then use the newly created Parquet dataset as the input for the rest of your Flow.
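For a code recipe, a minimal PySpark sketch of the approach above could look like the following. The dataset names `s3_input_parquet` and `filtered_output`, the `amount` column, and the `run_recipe` wrapper are hypothetical placeholders, and the filter only stands in for your real logic. The imports sit inside the function because the `dataiku` and `pyspark` modules are assumed to be available only when the code runs as a PySpark recipe in DSS.

```python
def run_recipe(input_name="s3_input_parquet", output_name="filtered_output"):
    """Read a DSS dataset as a Spark DataFrame, filter it, and write the result."""
    # Imported lazily: these modules are provided by the DSS PySpark recipe environment.
    import dataiku
    import dataiku.spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sql_context = SQLContext(sc)

    # Spark distributes the read of the underlying files across its executors.
    df = dkuspark.get_dataframe(sql_context, dataiku.Dataset(input_name))

    # Hypothetical filter on an assumed "amount" column; replace with your own logic.
    filtered = df.filter("amount > 0")

    # Write the result, with its schema, to the recipe's output dataset.
    dkuspark.write_with_schema(dataiku.Dataset(output_name), filtered)
```

In the recipe itself you would call `run_recipe()` on the last line, with the names of your own input and output datasets.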
Thanks,
-
To speed up reading data from S3 in your DSS recipe, consider partitioning your data so that it can be read in parallel and each run reads only the partitions it needs. Switch to a columnar file format such as Parquet for more efficient reads. Also look into tuning DSS settings, such as memory and worker configuration, to better handle large datasets.
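To illustrate why partitioning and parallel reads help, here is a small plain-Python sketch of the idea. It is not DSS or Spark code (Spark handles this for you when the recipe runs on the Spark engine). The `day=...` folder layout, the `amount` column, and both function names are assumptions made up for the example.

```python
import csv
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_partition(path):
    """Read one partition file and keep only rows with a positive amount."""
    with open(path, newline="") as handle:
        return [row for row in csv.DictReader(handle) if float(row["amount"]) > 0]


def read_partitions_in_parallel(root, wanted_days, max_workers=4):
    """Read only the requested day partitions, using a pool of worker threads."""
    # Partition pruning: folders for other days are never opened.
    paths = [
        path
        for day in wanted_days
        for path in sorted(Path(root, f"day={day}").glob("*.csv"))
    ]
    # Parallel read: each file is handled by a worker thread; map keeps input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        chunks = pool.map(read_partition, paths)
        return [row for chunk in chunks for row in chunk]
```

Calling `read_partitions_in_parallel("/data/sales", ["2024-01-01", "2024-01-02"])` on such a layout would read only those two day folders, which is the same reason a partitioned dataset is cheaper to read than one large file.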