Overwrite only changed data set without scanning or reading all data set

dave · August 2020

i have scenario as below.

Every month we have a updated data set for ~500k rows, based on which we have like 10-15 rules[rules are based on the column condition like null or not null/or data format] & results.

Every month due to the functional reason multiple time data provider keep updating the data set & uploading the data set to data iku.I want to set up the condition if any change for those particular rows happens the rules based on those change data set run & the results gets stored. we have not time stamp in the table while data set changes.

Table

Change Data set impact[column value changes]-->Rule 1 would run----> Change of Result

Rule 2---------------------------------------> As it is result no read/scanning/or overwrite by dataiku

Rule 3-------------------------------------->As it is result no read/scanning/or overwrite by dataiku

We are not looking for any appending of rows as well. Is it possible?

Liev · August 2020

If you had a timestamp then you would need to build a delta pattern that might look a little like this:

1- Full sync into a master dataset

2 - Store the latest timestamp that is in your dataset into a project variable.

3 - In the following run only retrieve records that have a timestamp above the last stored project variable.

4 - Merge the dataset from 3 into the master dataset.

Repeat 2-4 with a certain frequency.

Good luck!

Liev · August 2020

Hi @dave

The type of design pattern you might use will depend on your underlying infrastructure and the external services that publish the data.

It would be good to determine if you're storing the data into a structured database and/or filesystem, also whether timestamps could indeed be added to the data. In general, having timestamps will help you keep track of what has been updated in order to avoid entire dataset sync.

It's unclear what the rules are in your context either, maybe it's worth expanding a little on that, as well as the actual chain of changes triggered by them.

Good luck!

dave · August 2020

Hi@Liev, Good day.Thanks for prompt reply.

Using Hadoop to store the data.But by default from the data provider we don't have the timestamp i.e. time & date while changes of any data occurs in the table i.e. the real problem. And we have no clue if in Data iku do we have sort of functionality where changed rows can be tracked, so that only those rows can be overwritten if no time stamp available.Rule are like if values are null or not null sort of for particular column values.

Liev · August 2020

A couple of things in this case:

- DSS will not be able to track individual rows, since DSS contacts and interacts with the infrastructure when you run a recipe (or notebook), hence it would only be the underlying data layer that might be aware of changes.

- The way data is inserted by your data provider could be a full overwrite, and not selected rows, in which case the entire dataset might appear as newly written (because they are) even to your data provider.

- Without timestamps it's hard to conceive a simple delta pattern in your flow, I would try to add them if possible.

- What is the performance of a full sync/override in DSS? If done not too often perhaps this is something you can live with.

- What engines are available to you? If you have Spark you may look for variations of UPSERT or MERGE type operations.

Good luck!

dave · August 2020

Hi@Liev,

Yes data provider does the overwritten. Hive, Impala & Spark engine which is available to us. Yes, time stamp is the easiest way to adopt, but is it possible we can have the update of the timestamp[ column if we create customized column] for those rows where rows have been changed/updated/overwritten?

You mentioned as below:

"Without timestamps it's hard to conceive a simple delta pattern in your flow, I would try to add them if possible." where would you add them?

Liev · August 2020

Ideally, the timestamp is added by the provider indicating when a record has been updated.

In your case, I would say you could explore operations that perform UPSERT or MERGE in Spark as a way to update your own datasets.

However, it may be that there's no alternative beyond full sync if all else fails.

dave · August 2020

ok so that mean in dataiku we don't have an option for overwrite the only changed row function right?

Liev · August 2020

If the row that has changed cannot be identified, then indeed there's no alternative other than full sync.

At the moment there are two things that prevent you from doing something smarter than full sync:

- data has no timestamps

- data is being overwritten each time by the ETL process.

dave · August 2020

Hi@Liev,

FInally ,if we introduced a timestamp which function of the dataiku can help us to identify to check on the changed row or i ask you other way around which function of the dataiku help us to create the timestamp & track the no. of changed row

dave · August 2020

Hi,Seems ok except merge since am not looking to merge the record but overwrite if the latest date is above than the last stored date -which what function i can use to overwrite the entire row

Liev · August 2020

That is what a MERGE or UPSERT does, as described here.

DSS will not be able to perform operations that your underlying infrastructure or execution engines don't support. So I recommend to check the documentation for those and then ask DSS to issue those commands.

Overwrite only changed data set without scanning or reading all data set

Best Answer

Answers

Categories

Setup Info

Tags