
Partition Redispatch S3 parquet dataset using column - how to run optimally?

gt
Level 1

Hi,
I am working with data in S3 that is partitioned by a timestamp in the filename. I need to repartition the data using a column value instead, because the files contain timestamp data that does not match the filename partitioning. I tried redispatch partitioning with a Sync recipe as described in https://knowledge.dataiku.com/latest/courses/advanced-partitioning/hands-on-file-based-partitioning..... However, this approach runs out of memory even for a 2-hour dataset (~3 GB). Is there a way to run the redispatch on Spark or Snowflake?
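For clarity, this is the kind of column-based redispatch I am trying to achieve. A minimal PySpark sketch is below; the bucket paths and the column name event_ts are placeholders, not my actual schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("redispatch-by-column").getOrCreate()

# Read the source files, whose filenames (not contents) carry the timestamp.
df = spark.read.parquet("s3://my-bucket/raw/")

# Derive the partition key from a column value rather than the filename.
df = df.withColumn("day", F.date_format(F.col("event_ts"), "yyyy-MM-dd"))

# Write one Hive-style partition per day; Spark shuffles rows to the
# correct partition across executors instead of loading everything
# into a single process's memory.
df.write.mode("overwrite").partitionBy("day").parquet("s3://my-bucket/partitioned/")
```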
