
Building discrete time-partitioned datasets

Hello,

I have an input dataset coming from files that are created weekly, each named after the day it was created:

2017-01-01.csv

2017-01-08.csv

2017-01-15.csv

etc.

I have therefore partitioned this dataset with time-based partitioning on a "day" period.
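(For reference, the partitioning pattern on the dataset is along the lines of `%Y-%M-%D.csv`, using Dataiku's day-period tokens, if I have those right.)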

QUESTION 1: When I apply a recipe to this dataset and run it over a date range, it tries to build every day in that range, i.e. 2017-01-01 succeeds, but 2017-01-02, 2017-01-03, ..., 2017-01-07 fail because no source file exists for those days. It's not a major problem since the run keeps going until the end, but because the overall status of the run is "failed", it's not ideal for scheduling and reporting. Is there a way around this? One workaround I considered is listing only the existing weekly partitions explicitly, as in the sketch below.
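Here is a quick Python sketch of what I mean (the end date is just an example); my assumption is that the "partitions to build" field accepts a comma-separated list of explicit partition identifiers:

```python
from datetime import date, timedelta

def weekly_partitions(start, end):
    """Yield YYYY-MM-DD partition identifiers at 7-day intervals,
    matching the weekly file naming (2017-01-01, 2017-01-08, ...)."""
    current = start
    while current <= end:
        yield current.strftime("%Y-%m-%d")
        current += timedelta(days=7)

# Comma-separated partition spec, e.g. to paste into the build dialog
spec = ",".join(weekly_partitions(date(2017, 1, 1), date(2017, 1, 15)))
print(spec)  # -> 2017-01-01,2017-01-08,2017-01-15
```

That avoids the failing in-between days, but it means maintaining the list by hand, which is why I'm asking whether there is a cleaner way.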

QUESTION 2: Since my initial dataset is quite heavy and grows every week, what I want to do is build it in full once, and then have a weekly scheduler build only the newly created partitions and append them to my output dataset. From my reading of the documentation, the way to do this would be to build all my recipes and partitions first (to create the initial up-to-date dataset), and then edit each recipe to run only on a "D-7" time range with the "append instead of overwrite" option checked. This is not my preferred option, because a global rebuild of the data (for instance after a recipe is modified) would then require re-editing every recipe to restart the whole process. Is there a different way to do this, perhaps along the lines of the scenario sketch below?
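What I have in mind is something like the following custom scenario step, which would build only the most recent weekly partition while leaving the recipes themselves untouched. This is a hypothetical sketch, assuming I read the dataiku.scenario API correctly; "my_output" is a placeholder dataset name:

```python
# Custom Python step in a weekly scenario: build only the newest partition.
from datetime import date, timedelta
from dataiku.scenario import Scenario  # assumed available inside a scenario step

scenario = Scenario()

# Files land every 7 days starting 2017-01-01, so snap "today" back to
# the most recent date on that weekly grid.
anchor = date(2017, 1, 1)
today = date.today()
latest = today - timedelta(days=(today - anchor).days % 7)

# Build just that one partition; the recipes stay unedited, so a full
# manual rebuild over any date range remains possible.
scenario.build_dataset("my_output", partitions=latest.strftime("%Y-%m-%d"))
```

If something like this is the intended approach, I'd happily go that route instead of hard-coding "D-7" ranges into every recipe.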

Thanks in advance, and sorry for the long post. 🙂

Julien