
Ideas for using partitioned and non-partitioned datasets in parallel

ptktmsz
Level 2

Hi!

So I have this case in which I have a dataset that receives files every 30 minutes. I have to process the files as soon as they arrive, so I partitioned the dataset by hour (to be able to process only the latest hour). But I also need an option to run a "reprocess" on the whole dataset. Unfortunately, when I try to do this on the partitioned flow, it takes way too long (there are about 6,000 partitions, and at most 5 partitions are processed at a time). That's why I created a copy of the flow with non-partitioned datasets. The problem is that both flows should write to the same table, and I cannot have two objects with the same name in my project.
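To make the scale of the problem concrete, here is a minimal sketch of the two build modes described above. It assumes Dataiku's default hourly partition identifier format (`%Y-%m-%d-%H`); the function name and the `"latest"`/`"all"` modes are purely illustrative, not a Dataiku API.

```python
from datetime import datetime, timedelta

def partition_spec(mode, start, end=None):
    """Return a partition spec string for a dataset build.

    mode="latest" -> only the most recent hour (end, or now)
    mode="all"    -> comma-separated list of every hour in [start, end]
    """
    fmt = "%Y-%m-%d-%H"  # assumed hourly partition identifier format
    if mode == "latest":
        return (end or datetime.now()).strftime(fmt)
    hours = []
    t = start.replace(minute=0, second=0, microsecond=0)
    while t <= end:
        hours.append(t.strftime(fmt))
        t += timedelta(hours=1)
    return ",".join(hours)

# A "reprocess" over ~6,000 hourly partitions at 5 concurrent builds
# means roughly 1,200 sequential waves, which is why the full rebuild
# is so slow on the partitioned flow.
spec = partition_spec("all",
                      datetime(2023, 1, 1, 0),
                      datetime(2023, 1, 1, 3))
print(spec)  # 2023-01-01-00,2023-01-01-01,2023-01-01-02,2023-01-01-03
```
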

So, my question is - do you have any idea how to implement running either the partitioned or the non-partitioned job in my flow, depending on what the user chooses?

Thanks!

1 Reply

You need to explain your flow a bit better. What exactly do you do with the files? Is it a long flow? How long does it take to process a file? What exactly does a "reprocess" mean? Do you need to reload all files again? There are ways around the single dataset output, but it will depend on your flow design.
