Maintaining Sync with Slow Datasets (Adds and Modifies)

tgb417

How are folks maintaining a data set that is slow to gather and that changes slowly over time?

I have a dataset that is available to me over a REST API. However, pulling the full dataset currently takes ~30 hours, and the dataset is growing continually, so this will only get slower over time.

The dataset has Create Date and Update Date fields, so I can find the information needed to do an incremental update as of the last time I pulled the data. Records are not deleted from this dataset as far as we know. I can create a flow that:

  • Knows the latest create and update dates
  • Pulls new and changed records since those dates
  • Combines those records into a complete master record set.

However, I've not figured out how to update the first dataset so that I can re-run this process. This would be a flow where a later node in the flow refers back to an earlier dataset, which I don't know whether DSS can do. And if it can, how do I set that up?

Has anyone done this and if so how did you proceed?

One idea I have for this is:

One-time setup

  • Set a project variable, Latest Update Date, to a very early date
  • Create an empty dataset

Each time the process runs

  • Read the project variable Latest Update Date
  • Append all records updated since Latest Update Date to the dataset.
    • The first time, this would be all of the current records.
    • Subsequent runs would append only the records changed since the last run.
  • Update the project variable Latest Update Date with the latest record update date received.
  • From the long appended dataset, keep just the latest record for each Record ID (rough code sketch below).
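
In rough code form, I'm imagining something like the sketch below for each run (untested; the dataset name, variable name, column names, and API endpoint are just placeholders I made up):

    import dataiku
    import pandas as pd
    import requests

    # Read the Latest Update Date project variable
    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())
    variables = project.get_variables()
    last_update = variables["standard"].get("latest_update_date", "1970-01-01")

    # Pull everything created/updated since then (hypothetical API shape)
    resp = requests.get("https://api.example.com/records",      # placeholder URL
                        params={"updated_since": last_update})
    resp.raise_for_status()
    new_records = pd.DataFrame(resp.json())

    if not new_records.empty:
        # Append to the long dataset (assumes the dataset is configured to
        # append rather than overwrite)
        dataiku.Dataset("all_record_versions").write_dataframe(new_records)

        # Store the newest update date actually received, for the next run
        # ("update_date" is a placeholder column name)
        variables["standard"]["latest_update_date"] = str(new_records["update_date"].max())
        project.set_variables(variables)

    # A downstream recipe (Group / Distinct / window) then keeps only the
    # latest row per Record ID from "all_record_versions"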

I may give that a try. However, I'd love to know if anyone else has had this challenge. If so, how have you solved it? Or am I missing a really cool feature of DSS that I don't already know about that makes this easy? Or is there a Python library that makes this easy? Thanks for any help you can provide.


Operating system used: macOS Monterey 12.6


Answers

  • Jean-Yves (Dataiker)

    Hi Tom,

    If I'm understanding your use case correctly, I think your idea is valid.

    If your records are appended to a SQL dataset, I would even get rid of the variable-updating part and get the Latest Update Date from the dataset itself. You would essentially have a scenario that is triggered as often as desired and that runs a custom Python step. The Python step would leverage our SQLExecutor2 to retrieve the Latest Update Date from the dataset and then use that date to generate the API request (i.e. to get all records that were updated after the retrieved date). Next, you would use the SQLExecutor2 once again to insert the results of the request into the dataset.
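
    Roughly, the Python step could look like this (an untested sketch; the dataset, table, column, and endpoint names are placeholders to adapt to your project):

        import dataiku
        import pandas as pd
        import requests
        from dataiku import SQLExecutor2

        master = dataiku.Dataset("master_records")        # placeholder dataset name

        # 1. Retrieve the Latest Update Date directly from the SQL dataset
        #    (assuming the underlying table is also named "master_records")
        executor = SQLExecutor2(dataset=master)
        last_df = executor.query_to_df(
            'SELECT MAX("update_date") AS last_update FROM "master_records"'
        )
        last_update = last_df["last_update"].iloc[0]

        # 2. Request only the records updated after that date (hypothetical API)
        resp = requests.get("https://api.example.com/records",   # placeholder URL
                            params={"updated_since": str(last_update)})
        resp.raise_for_status()
        new_records = pd.DataFrame(resp.json())

        # 3. Insert the new rows into the dataset (here via the dataset writer;
        #    a hand-written SQL INSERT through the executor would also work)
        if not new_records.empty:
            master.write_dataframe(new_records)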

    A second alternative would be to use partitions (see this article). You would partition your dataset by day, and every day a scenario would create a "current day" partition containing the data whose Latest Update Date corresponds to that day. That data would be pulled from the API in a Python step of the scenario or a Python recipe in the flow. This alternative has the advantage that it does not have to pull the previous date from your dataset, so it would work efficiently on any kind of dataset (not necessarily SQL). However, a drawback I can think of: if your scenario fails to run on some days (e.g. because the API service or your server is down), you would risk missing data.
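
    For illustration, the partitioned Python recipe could look roughly like this (untested; it assumes a time dimension named "day", and the dataset name and endpoint are placeholders):

        import dataiku
        import pandas as pd
        import requests

        # DSS exposes the partition currently being built as a flow variable
        # named after the dimension ("day" here)
        day = dataiku.dku_flow_variables["DKU_DST_day"]       # e.g. "2022-11-03"

        # Pull only the records whose Latest Update Date falls on that day
        resp = requests.get("https://api.example.com/records",    # placeholder URL
                            params={"updated_on": day})
        resp.raise_for_status()

        # Write the result into the "day" partition of the output dataset
        out = dataiku.Dataset("records_by_day")                # placeholder name
        out.write_with_schema(pd.DataFrame(resp.json()))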

    I hope this helps!

    Best,

    Jean-Yves

  • tgb417

    @Jean-Yves,

    Right now I'm working with the Free Community Edition. We are waiting for our license keys so Scenarios and Partitions are not yet working for me on this node.

    I'm also concerned about a scenario that would accidentally drop records because of a problem that might occur with a single day's run. (Stuff happens as we all know.) Having a robust solution is very helpful.

    Looking forward to getting our license keys so we can give this a whirl. Thanks for the input. May have a few more questions in a week or two.

  • tgb417

    @Jean-Yves

    When we do get the license keys, would this work with the API Connect plugin? Both as a dataset and as a recipe used to update current values in an earlier dataset?

    Or is this something that I'm likely to need to code myself using the Python requests library to pull the data? Some of my end users might find that to be more challenging. Thoughts?

  • Jean-Yves (Dataiker)

    Hi Tom,

    Designing a robust data pipeline is definitely the goal here, but as you can imagine, that's not an easy task. Server downtimes are rare, but you are still at their mercy, whether it's the server hosting DSS or the one that serves the data.

    The first alternative I suggested (checking the Latest Update Date in the table you want to update to determine which records should be inserted into it) seems like a good first step, but it is not a panacea. For example, if the service from which you retrieve data makes some records available only after you have already updated your dataset with later records, you will miss those earlier records. There might also be other problems that I'm overlooking. I would need a better understanding of your use case to advise further.

    Regarding the API Connect plugin, I believe you should be able to make it work, but it will need to be part of a scenario that first updates the date range used to generate the request, and then inserts the newly retrieved records into the dataset.

    Finally, in the event you need to code something up, you always have the option of wrapping that code up in a plugin so that your users can run it via an interface.
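
    For reference, the recipe.py of such a plugin recipe would have roughly this shape (a sketch only; the plugin also needs its JSON descriptors, and the endpoint, parameter, and role names here are placeholders):

        import dataiku
        import pandas as pd
        import requests
        from dataiku.customrecipe import get_recipe_config, get_output_names_for_role

        # Parameters the user fills in through the plugin's UI form
        config = get_recipe_config()
        since = config.get("updated_since")                # e.g. a date picker value

        resp = requests.get("https://api.example.com/records",    # placeholder URL
                            params={"updated_since": since})
        resp.raise_for_status()

        # Write the pulled records to the recipe's output dataset
        output_name = get_output_names_for_role("main_output")[0]
        dataiku.Dataset(output_name).write_with_schema(pd.DataFrame(resp.json()))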

    Best,

    Jean-Yves
