Overwrite only changed data set without scanning or reading all data set

Solved!
dave
Level 2
Overwrite only changed data set without scanning or reading all data set

i have scenario as below.

Every month we have a updated data set for ~500k rows, based on which we have like 10-15 rules[rules are based on the column condition like null or not null/or data format] & results.

Every month due to the functional reason multiple time data provider keep updating the data set & uploading the data set to data iku.I want to set up the condition if any change for those particular rows happens the rules based on those change data set  run & the results gets stored.  we have not time stamp in the table while data set changes.       

                                                                                Table                                                                                                                                                                                    

Change Data set impact[column value changes]-->Rule 1 would run----> Change of  Result

                  Rule 2---------------------------------------> As it is result no read/scanning/or overwrite by dataiku

                  Rule 3-------------------------------------->As it is result no read/scanning/or overwrite  by dataiku

We are not looking for any appending of rows as well. Is it possible?

0 Kudos
1 Solution
Liev
Dataiker Alumni

If you had a timestamp then you would need to build a delta pattern that might look a little like this:

1- Full sync into a master dataset

2 - Store the latest timestamp that is in your dataset into a project variable.

3 - In the following run only retrieve records that have a timestamp above the last stored project variable.

4 - Merge the dataset from 3 into the master dataset.

Repeat 2-4 with a certain frequency.

Good luck!

View solution in original post

11 Replies
Liev
Dataiker Alumni

Hi @dave 

The type of design pattern you might use will depend on your underlying infrastructure and the external services that publish the data.

It would be good to determine if you're storing the data into a structured database and/or filesystem, also whether timestamps could indeed be added to the data. In general, having timestamps will help you keep track of what has been updated in order to avoid entire dataset sync.

It's unclear what the rules are in your context either, maybe it's worth expanding a little on that, as well as the actual chain of changes triggered by them. 

Good luck!

0 Kudos
dave
Level 2
Author

Hi@Liev, Good day.Thanks for prompt reply.

Using Hadoop to store the data.But by default from the data provider we don't have the timestamp i.e. time & date while changes of any data occurs  in the table i.e. the real problem. And we have no clue  if in Data iku do we have sort of functionality where changed rows can be tracked, so that  only those rows  can be overwritten if no time stamp available.Rule are like if values are null or not null sort of for particular column values.

0 Kudos
Liev
Dataiker Alumni

A couple of things in this case:

- DSS will not be able to track individual rows, since DSS contacts and interacts with the infrastructure when you run a recipe (or notebook), hence it would only be the underlying data layer that might be aware of changes.

- The way data is inserted by your data provider could be a full overwrite, and not selected rows, in which case the entire dataset might appear as newly written (because they are) even to your data provider.

- Without timestamps it's hard to conceive a simple delta pattern in your flow, I would try to add them if possible.

- What is the performance of a full sync/override in DSS? If done not too often perhaps this is something you can live with.

- What engines are available to you? If you have Spark you may look for variations of UPSERT or MERGE type operations.

Good luck!

 

0 Kudos
dave
Level 2
Author

Hi@Liev,

Yes data provider does the overwritten. Hive, Impala & Spark engine which is available to us. Yes, time stamp is the easiest way to adopt, but is it possible we can have the update  of the timestamp[ column if we create customized column] for those rows where rows have been changed/updated/overwritten?

You mentioned as below:

"Without timestamps it's hard to conceive a simple delta pattern in your flow, I would try to add them if possible." where would you add them?

0 Kudos
Liev
Dataiker Alumni

Ideally, the timestamp is added by the provider indicating when a record has been updated.

In your case, I would say you could explore operations that perform UPSERT or MERGE in Spark as a way to update your own datasets.

However, it may be that there's no alternative beyond full sync if all else fails. 

0 Kudos
dave
Level 2
Author

ok so that mean in dataiku we don't have an option for overwrite the only changed row function right?

0 Kudos
Liev
Dataiker Alumni

If the row that has changed cannot be identified, then indeed there's no alternative other than full sync.

At the moment there are two things that prevent you from doing something smarter than full sync:

- data has no timestamps

- data is being overwritten each time by the ETL process.

0 Kudos
dave
Level 2
Author

Hi@Liev,

FInally ,if we introduced a timestamp which function of the dataiku can help  us to identify to check on the changed row or i ask you other way around which function of the dataiku help us to create the timestamp & track the no. of changed  row

0 Kudos
Liev
Dataiker Alumni

If you had a timestamp then you would need to build a delta pattern that might look a little like this:

1- Full sync into a master dataset

2 - Store the latest timestamp that is in your dataset into a project variable.

3 - In the following run only retrieve records that have a timestamp above the last stored project variable.

4 - Merge the dataset from 3 into the master dataset.

Repeat 2-4 with a certain frequency.

Good luck!

dave
Level 2
Author

Hi,Seems ok except merge since am not looking to merge the record but overwrite if the latest date is above than the last stored date -which what function i can use to overwrite the entire row 

0 Kudos
Liev
Dataiker Alumni

That is what a MERGE or UPSERT does, as described here.

DSS will not be able to perform operations that your underlying infrastructure or execution engines don't support. So I recommend to check the documentation for those and then ask DSS to issue those commands.

0 Kudos