Data Upsert
Currently, Dataiku offers the choice to either overwrite or append data during dataset updates, yet it lacks the capability for a user to perform an upsert on their data. An upsert operation, which merges the functions of updating and inserting, enables users to harmonize their existing dataset with new or modified data. Unlike the current practice of users opting for complete dataset overwrites an upsert could present a more efficient alternative.
Existing workarounds involve resorting to custom SQL code or utilizing a combination of an append recipe and a window recipe. However, these methods are not user-friendly for those seeking simpler solutions, often leading them to default to a complete overwrite.
It's worth noting that other users have also expressed the need for this feature in different posts, although it hasn't been specifically addressed here. [1] [2] [3] [4] [5]
Comments
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,601 Neuron
Thanks for posting this idea. Although, I’ve never heard this described as UpSert before. I like the term. Thank you for sharing.
As I think about this I think of the three steps that are needed for synchronizations of data.
- ADD
- DELETE
- MODIFY
Your UpSert seems to deal with Add and Modify. What about the case where a record is deleted from the source data? Do you envision that the UpSert process would also Delete the record from the down stream data set?
P.S. This would be super helpful to me. I’m trying keep a cache of data from a very slow upstream data set, and going through all sorts of steps to try to do this in Dataiku, It is very slow and time consuming.
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,160 Neuron
In the UpSert lingo a delete is an Update since you all you do is expire/mark as inactive the row. Deletes are more expensive than updates so in a lot of cases updating a row as inactive is preferable. Also in a lot of DWH you want to support As Point in Time queries which means you need to have two dates, from and to, to define when the row was active.
My personal opinion is that while Dataiku can do ETL (or ELT) it is much better to push that upstream to a proper ETL/ELT tool and bring the "clean" data to Dataiku to do the advanced analytics / ML modelling. Dataiku will struggle to do handle any sort of historical table in an efficient and safe way. For that reason I don't see this feature as something Dataiku will be interested to implement. But obviously that's just my opinion and I could be wrong.
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,601 Neuron
The idea of marking a delete the act of marking a as record no longer present in the source data, rather than removing the record all together is interesting, and deals with many of the "As of" data reporting challenges occur when systems delete records. However, that still leaves the GDPR issues of the right to be forgotten. to some extent. I'm aware that some approaches keep the record but obfuscate PII values, to maintain database referential integrity.
I for one would appreciate the support of a feature like this from the Dataiku Team.
-
Thanks for the feedback, we have added this to our backlog and will let you know if we move forward with development.
Katie
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,160 Neuron
I heard this might be coming very soon...
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,601 Neuron
That would be SOOOO wonderful. I'm right now trying to figure out a way to do this without this feature.
-
Hi Katie,
Do you have any new information regarding Data Upsert ?