Data Upsert

Options
Emiel_Veersma
Emiel_Veersma Registered, Frontrunner 2022 Finalist, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant Posts: 20 ✭✭✭✭✭

Currently, Dataiku offers the choice to either overwrite or append data during dataset updates, yet it lacks the capability for a user to perform an upsert on their data. An upsert operation, which merges the functions of updating and inserting, enables users to harmonize their existing dataset with new or modified data. Unlike the current practice of users opting for complete dataset overwrites an upsert could present a more efficient alternative.

Existing workarounds involve resorting to custom SQL code or utilizing a combination of an append recipe and a window recipe. However, these methods are not user-friendly for those seeking simpler solutions, often leading them to default to a complete overwrite.

It's worth noting that other users have also expressed the need for this feature in different posts, although it hasn't been specifically addressed here. [1] [2] [3] [4] [5]

5
5 votes

In the Backlog · Last Updated

Comments

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron
    Options

    @Emiel_Veersma
    ,

    Thanks for posting this idea. Although, I’ve never heard this described as UpSert before. I like the term. Thank you for sharing.

    As I think about this I think of the three steps that are needed for synchronizations of data.

    • ADD
    • DELETE
    • MODIFY

    Your UpSert seems to deal with Add and Modify. What about the case where a record is deleted from the source data? Do you envision that the UpSert process would also Delete the record from the down stream data set?

    P.S. This would be super helpful to me. I’m trying keep a cache of data from a very slow upstream data set, and going through all sorts of steps to try to do this in Dataiku, It is very slow and time consuming.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,708 Neuron
    Options

    In the UpSert lingo a delete is an Update since you all you do is expire/mark as inactive the row. Deletes are more expensive than updates so in a lot of cases updating a row as inactive is preferable. Also in a lot of DWH you want to support As Point in Time queries which means you need to have two dates, from and to, to define when the row was active.

    My personal opinion is that while Dataiku can do ETL (or ELT) it is much better to push that upstream to a proper ETL/ELT tool and bring the "clean" data to Dataiku to do the advanced analytics / ML modelling. Dataiku will struggle to do handle any sort of historical table in an efficient and safe way. For that reason I don't see this feature as something Dataiku will be interested to implement. But obviously that's just my opinion and I could be wrong.

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron
    Options

    @Turribeach
    ,

    The idea of marking a delete the act of marking a as record no longer present in the source data, rather than removing the record all together is interesting, and deals with many of the "As of" data reporting challenges occur when systems delete records. However, that still leaves the GDPR issues of the right to be forgotten. to some extent. I'm aware that some approaches keep the record but obfuscate PII values, to maintain database referential integrity.

    I for one would appreciate the support of a feature like this from the Dataiku Team.

  • Katie
    Katie Dataiker, Registered, Product Ideas Manager Posts: 105 Dataiker
    Options

    Thanks for the feedback, we have added this to our backlog and will let you know if we move forward with development.

    Katie

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,708 Neuron
    Options

    I heard this might be coming very soon...

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron
    Options

    That would be SOOOO wonderful. I'm right now trying to figure out a way to do this without this feature.

Setup Info
    Tags
      Help me…