Ideas for using partitioned and non-partitioned datasets in parallel

Tomasz · Dataiku DSS Core Designer, Registered · Posts: 8 ✭✭

Hi!

So I have this case: I have a dataset that receives files every 30 minutes. I have to process the files as soon as they arrive, so I partitioned the dataset by hour (to be able to process only the latest hour). But I also need an option to reprocess the whole dataset. Unfortunately, running that on the partitioned flow takes way too long (there are around 6k partitions, and at most 5 partitions are processed at a time). That's why I created a copy of the flow with unpartitioned datasets. The problem, though, is that both flows should write to the same table, and I cannot have two objects with the same name in my project.

So, my question is: do you have any ideas on how to run either the partitioned or the non-partitioned job in my flow, whichever the user chooses?

Thanks!

Answers

  • Turribeach · Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 · Posts: 2,126

    You need to explain your flow a bit better. What exactly do you do with the files? Is it a long flow? How long does it take to process a file? What exactly does a "reprocess" mean? Do you need to reload all files again? There are ways around the single-dataset-output constraint, but they will depend on your flow design.
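
    One way people often handle the "latest hour vs. full rebuild" choice is to drive the build from a scenario with a Python step, picking the partition spec at runtime. Below is a minimal sketch of that idea; the helper `partition_spec`, the dataset name `my_output`, and the use of DSS's hourly partition id format (`YYYY-MM-DD-HH`) are assumptions for illustration, not your actual flow:

    ```python
    from datetime import datetime, timezone
    from typing import Optional

    def partition_spec(reprocess_all: bool, now: datetime) -> Optional[str]:
        """Return the partition id to build, or None for a full rebuild."""
        if reprocess_all:
            # No partition filter: the scenario would build everything.
            return None
        # DSS hourly partitions use the YYYY-MM-DD-HH id format.
        return now.strftime("%Y-%m-%d-%H")

    # Hypothetical usage inside a DSS scenario Python step (not runnable
    # outside DSS); 'reprocess_all' could come from a scenario variable:
    #
    # from dataiku.scenario import Scenario
    # spec = partition_spec(reprocess_all=False,
    #                       now=datetime.now(timezone.utc))
    # Scenario().build_dataset("my_output", partitions=spec)
    ```

    This keeps a single partitioned flow writing to one table: the "reprocess" case is just the same flow built over all partitions, rather than a duplicate unpartitioned flow.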
