Ideas for using partitioned and non-partitioned datasets in parallel
Hi!
So I have this case in which I have a dataset that receives files every 30 minutes. I have to process the files as soon as they arrive, so I partitioned the dataset by hour (to be able to process only the latest hour). But I also need an option to run a "reprocess" on the whole dataset. Unfortunately, when I try to do this on the partitioned flow, it takes way too long (there are around 6,000 partitions, and at most 5 partitions are processed at a time). So I created a copy of the flow with unpartitioned datasets. The problem is that both flows should write to the same table, and I cannot have two objects with the same name in my project.
So, my question is: do you have any idea how to implement running either the partitioned or the non-partitioned job in my flow, whichever the user chooses?
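For the "reprocess" side, one idea worth exploring is driving the partitioned flow itself with an explicit partition list instead of maintaining a duplicate unpartitioned flow. Below is a minimal plain-Python sketch (not Dataiku-specific API calls) that generates such a list; the `%Y-%m-%d-%H` identifier format and the comma-joined spec are assumptions about how an hourly time dimension might be configured, so adjust to match your project:

```python
from datetime import datetime, timedelta


def hourly_partitions(start, end, fmt="%Y-%m-%d-%H"):
    """Yield hourly partition identifiers from start to end, inclusive.

    fmt is an assumed identifier pattern; replace it with whatever
    your hourly partitioning dimension actually uses.
    """
    current = start
    while current <= end:
        yield current.strftime(fmt)
        current += timedelta(hours=1)


# Example: a comma-joined partition spec covering one full day,
# which could be pasted into a build step that accepts a list.
spec = ",".join(hourly_partitions(
    datetime(2024, 1, 1, 0), datetime(2024, 1, 1, 23)))
print(spec)
```

A list like this can be fed to a scenario step or a build dialog that takes multiple partitions, letting one flow serve both the incremental and the full-reprocess cases.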
Thanks!
Answers
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,126 Neuron
You need to explain your flow a bit better. What exactly do you do with the files? Is it a long flow? How long does it take to process a file? What does a "reprocess" mean exactly? Do you need to reload all files again? There are ways around the single-dataset output, but it will depend on your flow design.