Ideas for using partitioned and non-partitioned datasets in parallel

Tomasz · June 2024

Hi!

So I have a case where a dataset receives new files every 30 minutes. I have to process the files as soon as they arrive, so I partitioned the dataset by hour (to be able to process only the latest hour). But I also need an option to run a "reprocess" on the whole dataset. Unfortunately, when I try to do this on the partitioned flow, it takes way too long (because there are something like 6k partitions, and at most 5 partitions are processed at a time). That's why I created a copy of the flow with unpartitioned datasets. The problem, though, is that both flows should write to the same table, and I cannot have two objects with the same name in my project.

So, my question is: do you guys have any ideas on how to run both partitioned and non-partitioned jobs in my flow, depending on what the user chooses?

Thanks!

Turribeach · June 2024

You need to explain your flow a bit better. What exactly do you do with the files? Is it a long flow? How long does it take to process a file? What exactly does a "reprocess" mean? Do you need to reload all the files again? There are ways around the single-dataset output, but it will depend on your flow design.
