Designing for regular data updates

wjkelly
wjkelly Registered Posts: 22 ✭✭✭✭

Hi,

I have finished my first meaningfully complex set of recipes and am proud of myself.

Now I need to figure out how to manage the weekly updates that will flow in and feed the recipes to produce new insights, reports, etc.

Here's what I'm dealing with:

  • Item sales reports that show items purchased by customer, with lots of detail about the items that I use for further analysis. I need to update recent records (i.e., those within the last 3 months) because the sales report also provides updates on payment and fulfillment status: the same sales record that was in "Fulfilled" status when I last uploaded it might be in "Completed" status now. Point being: I can't just append new records with each upload -- I have to "overlap" the data by several thousand records each week (see the sketch just after this list).
  • Customer lists -- we're not getting a lot of new records, so I'm updating this data once per month.
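
To make the "overlap" step concrete, here is a minimal sketch of what I mean, in pandas; the file names and the order_id key column are stand-ins for illustration:

    import pandas as pd

    # Stand-in files and key: order_id uniquely identifies a sales record.
    existing = pd.read_csv("sales_history.csv")   # everything loaded so far
    weekly = pd.read_csv("weekly_dump.csv")       # new dump, reaching ~3 months back

    # Drop existing rows that reappear in the weekly dump, so the refreshed
    # payment/fulfillment statuses win, then append the whole dump.
    refreshed = pd.concat(
        [existing[~existing["order_id"].isin(weekly["order_id"])], weekly],
        ignore_index=True,
    )
    refreshed.to_csv("sales_history.csv", index=False)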

The question is probably simple:

  • What is the most efficient way to introduce a new data dump into the system?

Right now, each week I upload a completely new dataset, reset the "Input" dataset for the first recipe that uses the data, and then rerun the whole chain of recipes to produce my outputs. Is there a better way?


Operating system used: macOS Monterey v12.4

Answers

  • Dataiku
    Dataiku Administrator, Dataiker, Alpha Tester Posts: 88 Administrator

    @wjkelly, this bird is just flying in to quickly say we are proud of you too!

  • CoreyS
    CoreyS Dataiker Alumni, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Registered Posts: 1,150 ✭✭✭✭✭✭✭✭✭

    Thanks for the words of encouragement @Dataiku! While you wait for a more detailed response, I would suggest that you look into utilizing scenarios:

    In this case, assuming your database is being updated regularly, using the force-rebuild datasets and dependencies option will trigger a recursive build of all the upstream flow elements. That will refresh the inputs and run every recipe required to reach the output.
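
    If it helps down the road, such a scenario can also be kicked off from outside DSS with the public Python API. A minimal sketch, assuming a licensed node with an API key; the host URL, project key, and scenario id below are placeholders:

        import dataikuapi

        # Placeholders: host URL, API key, project key, and scenario id.
        client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
        scenario = client.get_project("SALES").get_scenario("WEEKLY_REFRESH")

        # Start the scenario and block until the run completes.
        scenario.run_and_wait()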


    I hope this helps!

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron

    @wjkelly,

    I notice that you list your operating system as macOS, which suggests you might be using the free version of Dataiku DSS. Unfortunately, the last time I checked, the free version of Dataiku DSS does not support scenarios. If you are using a licensed version of DSS, then scenarios could be a good answer for you.

    The question then is where you want to run your process repeatedly. Most folks moving a scenario into production consider it a good idea to run such processes on servers, often Linux servers in the cloud or in corporate computer centers.

    If you are open to sharing, please let us know a bit more.

    As to the data: managing added, deleted, and modified records incrementally, without introducing errors such as duplicate records, can be a significant challenge. If your dataset is not too big and is guaranteed to be complete each time you receive it, then a wipe-and-reload process is often the most reliable and effective approach. However, if the data is big (whatever that means to you), you will need to deal with added, deleted, and modified records explicitly. Having a reliable key in every record -- one that is unique and will never be reused -- is very helpful here. If you don't have such a key, this is again going to be harder.
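
    To make that concrete, here's a minimal sketch in plain pandas of splitting a new dump into added, deleted, and modified records; the file names and the customer_id key are made up, and it assumes both files share the same columns:

        import pandas as pd

        # Made-up files and key; assumes customer_id is unique and never reused.
        old = pd.read_csv("customers_last_month.csv").set_index("customer_id")
        new = pd.read_csv("customers_this_month.csv").set_index("customer_id")

        added = new.loc[new.index.difference(old.index)]
        deleted = old.loc[old.index.difference(new.index)]

        # Among keys present in both files, keep rows whose contents changed.
        # (Caveat: NaN != NaN, so blank fields count as changes here.)
        common = new.index.intersection(old.index)
        changed = (new.loc[common] != old.loc[common, new.columns]).any(axis=1)
        modified = new.loc[common][changed]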

    I don't have time to look into this any further, but there may be some Python libraries that can be used to link records for add/delete/modify updates. Others here who do this kind of thing regularly may have it more on the "tip of their tongue" -- please jump in.

    Good luck -- I hope this helps at least a bit.

  • CoreyS
    CoreyS Dataiker Alumni, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Registered Posts: 1,150 ✭✭✭✭✭✭✭✭✭

    Good point @tgb417 -- as listed on the website, pipeline scheduling is not a supported feature of the Free Edition. However, you can still run the forced recursive build manually.

  • wjkelly
    wjkelly Registered Posts: 22 ✭✭✭✭

    I am using the free version right now, and will be for the foreseeable future. It sounds like the method I've developed is probably about as efficient as it's going to get.

    No worries -- I deeply appreciate the power of the Dataiku system (especially compared to how hard it is to manage similar analysis in an Excel environment), and if things go well, we'll be in a position to afford a more advanced edition that can support more efficient automation.

    Thanks for the input!
