I have finished my first meaningfully complex set of recipes and am proud of myself. 🙂
Now I need to figure out how to manage weekly updates that will flow in and need to feed the recipes to produce new insights/reports/etc.
Here's what I'm dealing with; the question is probably simple:
Right now, each week I'm uploading a completely new dataset, and resetting the "Input" dataset for the first recipe that uses the data, then rerunning the whole chain of recipes to produce my outputs. Is there a better way?
Operating system used: macOS Monterey v12.4
Thanks for the words of encouragement @Dataiku! While you wait for a more detailed response, I would suggest that you look into utilizing scenarios:
In this case, assuming your database is being updated regularly, using the force-rebuild datasets and dependencies option will trigger a recursive build on all the upstream flow elements. That would refresh the inputs and run all the recipes required to reach the output.
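To make the idea concrete, here is a small Python sketch of what a recursive forced rebuild does conceptually. This is not Dataiku code; the toy flow, dataset names, and `build_order` helper are all made up for illustration. The point is just that the scenario walks the upstream dependencies of your output and rebuilds everything in order, starting from the inputs:

```python
# Hypothetical sketch of a recursive forced rebuild (not Dataiku's actual code):
# walk the upstream dependencies of the target dataset and rebuild inputs first.

def build_order(target, upstream):
    """Return datasets in the order they must be rebuilt to refresh `target`.

    `upstream` maps each dataset to the list of datasets it is built from.
    """
    order = []
    seen = set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dep in upstream.get(node, []):  # rebuild inputs before outputs
            visit(dep)
        order.append(node)

    visit(target)
    return order

# A toy flow: raw_upload -> cleaned -> joined -> weekly_report
flow = {
    "weekly_report": ["joined"],
    "joined": ["cleaned"],
    "cleaned": ["raw_upload"],
    "raw_upload": [],
}

print(build_order("weekly_report", flow))
# ['raw_upload', 'cleaned', 'joined', 'weekly_report']
```

A scenario with a force-rebuild step on `weekly_report` would effectively do this traversal for you, so you would not need to rerun each recipe by hand.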
I hope this helps!
I notice that you list your operating system as macOS, which suggests that you might be using the free version of Dataiku DSS. Unfortunately, the last time I checked, the free version of Dataiku DSS did not support Scenarios. If you are using a licensed version of DSS, then Scenarios could be a good answer for you.
The question then is where you want to run this process repeatedly. Most folks moving a Scenario into production consider running such processes on servers, often Linux servers in the cloud or in corporate data centers.
If you are open to sharing, please let us know a bit more.
As to the data: managing added, deleted, and modified records incrementally can be a significant challenge to do without introducing errors like duplicate records. If your dataset is not too big and is guaranteed to be complete each time you receive it, then the wipe-and-reload process you describe is often the most reliable and effective. However, if the data is big (whatever that means to you), you will need to handle added, deleted, and modified records explicitly. Having a reliable key in every record that is unique, never duplicated, and never reused is very helpful for this process. If you don't have such a key, this again becomes harder.
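As a rough illustration of the keyed approach, here is a minimal Python sketch. The record layout, the `"op"` field, and the `apply_changes` helper are assumptions made up for this example, not anything specific to your data or to Dataiku:

```python
# Minimal sketch of an incremental update keyed on a unique, never-reused ID.
# Each change in the weekly feed carries an "op": add, modify, or delete.

def apply_changes(current, changes):
    """Merge a weekly change feed into the current records, keyed by 'id'."""
    merged = {rec["id"]: rec for rec in current}
    for change in changes:
        if change["op"] == "delete":
            merged.pop(change["id"], None)  # ignore deletes for unknown IDs
        else:  # "add" and "modify" are both upserts when the key is reliable
            merged[change["id"]] = {"id": change["id"], **change["data"]}
    return list(merged.values())

current = [
    {"id": 1, "name": "alpha"},
    {"id": 2, "name": "beta"},
]
changes = [
    {"op": "modify", "id": 2, "data": {"name": "beta-v2"}},
    {"op": "add", "id": 3, "data": {"name": "gamma"}},
    {"op": "delete", "id": 1},
]

print(apply_changes(current, changes))
# [{'id': 2, 'name': 'beta-v2'}, {'id': 3, 'name': 'gamma'}]
```

Notice that "add" and "modify" collapse into a single upsert precisely because the key is unique and never reused; without that guarantee you would have to detect duplicates some other way.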
I don't have time to look into this any further, but there may be some Python libraries that can be used to link records for add, delete, and modify style updates. Others here who do this kind of thing regularly may have it more on the "tip of their tongue." All, please jump in.
Good luck, and I hope this helps at least a bit.
I am using the free version right now, and will be for the foreseeable future. Sounds like the method I've developed is probably about as efficient as I'll get.
No worries -- I deeply appreciate the power of the Dataiku system (especially in comparison to how hard it is to manage similar analysis in an Excel environment), and if things go well we'll be in a position to afford a more advanced version that could support more efficient automations.
Thanks for the input!