Designing for regular data updates

wjkelly
wjkelly Registered Posts: 22 ✭✭✭✭

Hi,

I have finished my first meaningfully complex set of recipes and am proud of myself.

Now I need to figure out how to manage the weekly updates that will flow in and feed the recipes to produce new insights, reports, etc.

Here's what I'm dealing with:

  • Item sales reports that show items purchased by customer, with lots of detail about the items that I use for further analysis. I need to update recent records (i.e., those within the last 3 months) because the sales report also provides updates on payment and fulfillment status: the same sales record that was in "Fulfilled" status when I last uploaded it might be in "Completed" status now. Point being: I can't just append new records with each upload -- I have to "overlap" the data by several thousand records each week (see the sketch just after this list).
  • Customer lists -- we're not getting a lot of new records, so I'm updating this data once per month.
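
To make the "overlap" step concrete, here is a minimal sketch of what I mean, in pandas; the file names and the order_id key column are stand-ins for illustration:

    import pandas as pd

    # Stand-in files and key: order_id uniquely identifies a sales record.
    existing = pd.read_csv("sales_history.csv")   # everything loaded so far
    weekly = pd.read_csv("weekly_dump.csv")       # new dump, reaching ~3 months back

    # Drop existing rows that reappear in the weekly dump, so the refreshed
    # payment/fulfillment statuses win, then append the whole dump.
    refreshed = pd.concat(
        [existing[~existing["order_id"].isin(weekly["order_id"])], weekly],
        ignore_index=True,
    )
    refreshed.to_csv("sales_history.csv", index=False)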

The question is probably simple:

  • What is the most efficient way to introduce a new data dump into the system?

Right now, each week I upload a completely new dataset, reset the "Input" dataset for the first recipe that uses the data, and then rerun the whole chain of recipes to produce my outputs. Is there a better way?


Operating system used: macOS Monterey v12.4

Answers

  • Dataiku
    Dataiku Administrator, Dataiker, Alpha Tester Posts: 88 Administrator

    @wjkelly, this bird is just flying in to quickly say we are proud of you too!

  • CoreyS
    CoreyS Dataiker Alumni, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Registered Posts: 1,150 ✭✭✭✭✭✭✭✭✭

    Thanks for the words of encouragement @Dataiku! While you wait for a more detailed response, I would suggest that you look into utilizing scenarios:

    In this case, assuming your database is being updated regularly, using the force-rebuild datasets and dependencies option will trigger a recursive build of all the upstream flow elements. That will refresh the inputs and run every recipe required to reach the output.
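
    If it helps down the road, such a scenario can also be kicked off from outside DSS with the public Python API. A minimal sketch, assuming a licensed node with an API key; the host URL, project key, and scenario id below are placeholders:

        import dataikuapi

        # Placeholders: host URL, API key, project key, and scenario id.
        client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
        scenario = client.get_project("SALES").get_scenario("WEEKLY_REFRESH")

        # Start the scenario and block until the run completes.
        scenario.run_and_wait()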


    I hope this helps!

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron

    @wjkelly,

    I notice that you list your operating system as macOS, which suggests you might be using the free version of Dataiku DSS. Unfortunately, the last time I checked, the free version of Dataiku DSS does not support scenarios. If you are using a licensed version of DSS, then scenarios could be a good answer for you.

    The question then is where you want to run your process repeatedly. Most folks moving a scenario into production consider it a good idea to run such processes on servers, often Linux servers in the cloud or in corporate computer centers.

    If you are open to sharing, please let us know a bit more.

    As to the data: managing added, deleted, and modified records incrementally, without introducing errors such as duplicate records, can be a significant challenge. If your dataset is not too big and is guaranteed to be complete each time you receive it, then a wipe-and-reload process is often the most reliable and effective approach. However, if the data is big (whatever that means to you), you will need to deal with added, deleted, and modified records explicitly. Having a reliable key in every record -- one that is unique and will never be reused -- is very helpful here. If you don't have such a key, this is again going to be harder.
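
    To make that concrete, here's a minimal sketch in plain pandas of splitting a new dump into added, deleted, and modified records; the file names and the customer_id key are made up, and it assumes both files share the same columns:

        import pandas as pd

        # Made-up files and key; assumes customer_id is unique and never reused.
        old = pd.read_csv("customers_last_month.csv").set_index("customer_id")
        new = pd.read_csv("customers_this_month.csv").set_index("customer_id")

        added = new.loc[new.index.difference(old.index)]
        deleted = old.loc[old.index.difference(new.index)]

        # Among keys present in both files, keep rows whose contents changed.
        # (Caveat: NaN != NaN, so blank fields count as changes here.)
        common = new.index.intersection(old.index)
        changed = (new.loc[common] != old.loc[common, new.columns]).any(axis=1)
        modified = new.loc[common][changed]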

    I don't have time to look into this any further, but there may be some Python libraries that can be used to link records for add/delete/modify updates. Others here who do this kind of thing regularly may have it more on the "tip of their tongue" -- please jump in.

    Good luck -- I hope this helps at least a bit.

  • CoreyS
    CoreyS Dataiker Alumni, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Registered Posts: 1,150 ✭✭✭✭✭✭✭✭✭

    Good point @tgb417 -- as listed on the website, pipeline scheduling is not a supported feature of the Free Edition. However, you can still run the forced recursive build manually.

  • wjkelly
    wjkelly Registered Posts: 22 ✭✭✭✭

    I am using the free version right now, and will be for the foreseeable future. It sounds like the method I've developed is probably about as efficient as it's going to get.

    No worries -- I deeply appreciate the power of the Dataiku system (especially compared to how hard it is to manage similar analysis in an Excel environment), and if things go well, we'll be in a position to afford a more advanced edition that can support more efficient automation.

    Thanks for the input!
