Optimize data read time of a recipe in the Flow

sanjay_1197 Registered Posts: 1

Hi,

I have a flow whose recipe reads data from S3, filters it, runs some calculations on the input data, and produces the final output. When the input dataset is huge, the DSS recipe takes a long time to read it. Is there any parallel processing method to read the data faster?

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker

    Hi @sanjay_1197,

    The built-in parallelism in Spark should help.

    https://doc.dataiku.com/dss/latest/spark/index.html

    If you can, use the Spark engine for visual recipes or PySpark for code recipes.

    To allow direct reads from Spark, make sure the S3 connection uses either STS Assume Role or an access key/secret, and that the connection details are readable by the user running the recipe.

    Also, Parquet is the preferred format when working with Spark. If your data is not already in Parquet, the fastest way is usually to use a Sync recipe in DSS to convert the CSV dataset to Parquet, and then use the new Parquet dataset as the input for the rest of your flow.

    Thanks,

  • Miasm1
    Miasm1 Registered Posts: 7

    To speed up reading data from S3 in your DSS recipe, consider using parallel reads and partitioning your data. Switch to columnar file formats like Parquet for efficient reads. Also, look into optimizing DSS settings like memory and worker configurations to better handle large datasets.
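
    To make the idea of parallel reads over partitioned data concrete, here is a minimal sketch in plain Python. It is not the DSS or Spark API: the function names (write_partitions, read_and_filter, parallel_read, demo) are hypothetical, and local temporary CSV files stand in for S3 objects. It only illustrates the principle that a dataset split into several part files can be read and filtered by several workers at once, which is roughly what a Spark engine does for you at scale.

```python
import csv
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_partitions(directory, n_parts, rows_per_part):
    """Create small CSV part files that stand in for a partitioned dataset."""
    paths = []
    for part in range(n_parts):
        path = Path(directory) / f"part-{part:05d}.csv"
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["id", "amount"])
            for i in range(rows_per_part):
                writer.writerow([part * rows_per_part + i, i % 10])
        paths.append(path)
    return paths


def read_and_filter(path, min_amount):
    """Read one part file and keep only rows with amount >= min_amount."""
    with open(path, newline="") as f:
        return [row for row in csv.DictReader(f) if int(row["amount"]) >= min_amount]


def parallel_read(paths, min_amount, max_workers=4):
    """Read all part files concurrently and concatenate the filtered rows."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the input order, so results come back part by part.
        chunks = pool.map(lambda p: read_and_filter(p, min_amount), paths)
        return [row for chunk in chunks for row in chunk]


def demo():
    """Build a 4-part dataset in a temp directory and read it in parallel."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = write_partitions(tmp, n_parts=4, rows_per_part=100)
        return parallel_read(paths, min_amount=8)


if __name__ == "__main__":
    rows = demo()
    print(len(rows), "rows kept after filtering")
```

    Threads are used here because reading from remote storage is mostly I/O-bound. With a single large file there is nothing to split across workers, which is why partitioning the data comes first.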
