Data Upsert

Currently, Dataiku offers the choice to either overwrite or append data during dataset updates, but it lacks the ability for a user to perform an upsert on their data. An upsert operation, which combines updating and inserting, enables users to reconcile their existing dataset with new or modified data. Compared with the current practice of complete dataset overwrites, an upsert could offer a more efficient alternative.

Existing workarounds involve resorting to custom SQL code or combining an append recipe with a window recipe. However, these methods are not user-friendly for those seeking simpler solutions, who often default to a complete overwrite.
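To illustrate what the custom-SQL workaround boils down to, here is a minimal sketch using SQLite's `INSERT ... ON CONFLICT` upsert syntax (the table and column names are made up for the example; the actual SQL dialect would depend on the dataset's backend):

```python
import sqlite3

# In-memory database standing in for the dataset's SQL backend.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
con.execute("INSERT INTO customers VALUES (1, 'Alice', 'Paris'), (2, 'Bob', 'Lyon')")

# Incoming batch: id 2 is a modified row, id 3 is brand new.
incoming = [(2, "Bob", "Nice"), (3, "Carol", "Lille")]

# Upsert: insert new rows, update existing ones on primary-key conflict.
con.executemany(
    """
    INSERT INTO customers (id, name, city) VALUES (?, ?, ?)
    ON CONFLICT(id) DO UPDATE SET name = excluded.name, city = excluded.city
    """,
    incoming,
)

print(con.execute("SELECT * FROM customers ORDER BY id").fetchall())
# → [(1, 'Alice', 'Paris'), (2, 'Bob', 'Nice'), (3, 'Carol', 'Lille')]
```

A native recipe doing the equivalent of this, keyed on user-selected columns, is essentially what this idea asks for.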

It's worth noting that other users have also expressed the need for this feature in different posts, although it hasn't been specifically addressed here.  [1] [2] [3] [4] [5]

6 Comments

@Emiel_Veersma ,

Thanks for posting this idea. I've never heard this described as UpSert before, but I like the term. Thank you for sharing.

As I think about this, I think of the three steps needed to synchronize data:

  • ADD
  • DELETE
  • MODIFY

Your UpSert seems to deal with Add and Modify. What about the case where a record is deleted from the source data? Do you envision that the UpSert process would also delete the record from the downstream dataset?
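For concreteness, those three synchronization steps can be sketched as a key-based diff between source and target (purely illustrative; the keys and values here are made up):

```python
def sync_diff(source: dict, target: dict):
    """Classify records by key into ADD / DELETE / MODIFY sets.

    source and target each map a record key to its row payload.
    """
    adds = {k: v for k, v in source.items() if k not in target}
    deletes = {k: v for k, v in target.items() if k not in source}
    modifies = {k: v for k, v in source.items() if k in target and target[k] != v}
    return adds, deletes, modifies

source = {1: "Alice", 2: "Bob-updated", 4: "Dana"}
target = {1: "Alice", 2: "Bob", 3: "Carol"}
adds, deletes, modifies = sync_diff(source, target)
print(adds)      # {4: 'Dana'}
print(deletes)   # {3: 'Carol'}
print(modifies)  # {2: 'Bob-updated'}
```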

P.S. This would be super helpful to me. I'm trying to keep a cache of data from a very slow upstream dataset, and going through all sorts of steps to do this in Dataiku. It is very slow and time-consuming.

--Tom

In the UpSert lingo a delete is an update, since all you do is expire the row / mark it as inactive. Deletes are more expensive than updates, so in a lot of cases updating a row as inactive is preferable. Also, a lot of DWHs need to support point-in-time ("as of") queries, which means you need two dates, from and to, to define when the row was active.

My personal opinion is that while Dataiku can do ETL (or ELT), it is much better to push that upstream to a proper ETL/ELT tool and bring the "clean" data into Dataiku for the advanced analytics / ML modelling. Dataiku will struggle to handle any sort of historical table in an efficient and safe way. For that reason I don't see this feature as something Dataiku will be interested in implementing. But obviously that's just my opinion and I could be wrong.
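A minimal sketch of the expire-instead-of-delete pattern described above, with `valid_from` / `valid_to` dates supporting point-in-time queries (all field names and data are hypothetical):

```python
from datetime import date

# Each row version carries valid_from / valid_to; valid_to=None means "currently active".
history = [
    {"id": 1, "city": "Paris", "valid_from": date(2020, 1, 1), "valid_to": date(2022, 6, 1)},
    {"id": 1, "city": "Lyon",  "valid_from": date(2022, 6, 1), "valid_to": None},
]

def soft_delete(rows, key, when):
    """'Delete' by expiring the active row instead of removing it."""
    for row in rows:
        if row["id"] == key and row["valid_to"] is None:
            row["valid_to"] = when

def as_of(rows, key, when):
    """Point-in-time query: return the row version active on a given date."""
    for row in rows:
        if (row["id"] == key and row["valid_from"] <= when
                and (row["valid_to"] is None or when < row["valid_to"])):
            return row
    return None

print(as_of(history, 1, date(2021, 1, 1))["city"])  # Paris
soft_delete(history, 1, date(2023, 1, 1))
print(as_of(history, 1, date(2024, 1, 1)))          # None: expired, but history is kept
```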

@Turribeach,

The idea of marking a record as no longer present in the source data, rather than removing it altogether, is interesting, and deals with many of the "as of" data-reporting challenges that occur when systems delete records. However, that still leaves, to some extent, the GDPR issue of the right to be forgotten. I'm aware that some approaches keep the record but obfuscate PII values to maintain database referential integrity.
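That keep-the-record-but-obfuscate-PII approach might look something like the following sketch (field names are made up; note that plain hashing of a known value can still be re-identifiable, so real implementations typically use salted hashing or tokenization):

```python
import hashlib

def forget_pii(record, pii_fields=("name", "email")):
    """Keep the record (and its key, for referential integrity),
    but irreversibly obfuscate PII fields with a one-way hash."""
    cleaned = dict(record)
    for field in pii_fields:
        if field in cleaned:
            digest = hashlib.sha256(str(cleaned[field]).encode()).hexdigest()
            cleaned[field] = digest[:12]  # truncated hash as a placeholder value
    return cleaned

rec = {"id": 42, "name": "Alice", "email": "alice@example.com", "orders": 7}
print(forget_pii(rec))  # id and orders survive; name and email are hashed
```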

I for one would appreciate the support of a feature like this from the Dataiku Team.

--Tom

ktgross15
Dataiker

Thanks for the feedback, we have added this to our backlog and will let you know if we move forward with development.

Katie

Status changed to: In the Backlog

I heard this might be coming very soon...

That would be SOOOO wonderful. I'm right now trying to figure out a way to do this without this feature.

--Tom
