Can changes in HDFS datasets be automatically tracked?

UserBird
Dataiker
Hi,

I am using HDFS datasets in my workflow that are updated daily, and I would like to know whether DSS can track these daily changes and save them in a separate "delta" file through a scenario or some other automation capability.

Thanks!
1 Reply
Clément_Stenac
Hi,

This needs some work but can be achieved using scenarios and partitioning.

You would have your "stock" dataset (not partitioned) and a "changes" dataset partitioned by day. You then need to create a code recipe that takes the changes dataset as both input and output, with a partition dependency that says:

"To compute day N of the changes dataset, I use the stock dataset and day N-1 of the changes dataset" (use the "Time range" dependency).
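As a rough illustration of that dependency, here is a minimal sketch in plain Python of which inputs are read for a given target day. The function name `input_partitions` is hypothetical and not part of DSS, and the sketch assumes day partitions are identified as YYYY-MM-DD; in DSS itself the mapping is configured on the recipe, not coded.

```python
from datetime import date, timedelta


def input_partitions(target_day: str) -> dict:
    """Return the inputs read when building `target_day` (YYYY-MM-DD).

    Hypothetical helper for illustration only; it mirrors the rule
    "day N uses the stock dataset and day N-1 of the changes dataset".
    """
    day = date.fromisoformat(target_day)
    previous = day - timedelta(days=1)
    return {
        "stock": None,                    # not partitioned: read in full
        "changes": previous.isoformat(),  # day N-1 of the changes dataset
    }


print(input_partitions("2024-03-01"))
# {'stock': None, 'changes': '2024-02-29'}
```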

Your recipe then does the actual computation.
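What that computation looks like depends on your data. As one possible sketch, assuming each row has a unique key column (called `id` here, an assumption) and that the recipe can obtain yesterday's state and today's stock as lists of rows, a delta could be computed like this. The function `compute_delta` and the `change` column are hypothetical names, not DSS features.

```python
def compute_delta(previous_rows, current_rows, key="id"):
    """Compare two snapshots (lists of dicts) and list what changed.

    Returns rows tagged with a "change" field: "added", "modified",
    or "removed". The key column name is an assumption.
    """
    previous = {row[key]: row for row in previous_rows}
    current = {row[key]: row for row in current_rows}

    delta = []
    for k, row in current.items():
        if k not in previous:
            delta.append({**row, "change": "added"})
        elif row != previous[k]:
            delta.append({**row, "change": "modified"})
    for k, row in previous.items():
        if k not in current:
            delta.append({**row, "change": "removed"})
    return delta


yesterday = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
today = [{"id": 1, "value": "a"}, {"id": 2, "value": "c"}, {"id": 3, "value": "d"}]
for change in compute_delta(yesterday, today):
    print(change)
```

In an actual DSS Python recipe, the rows would come from the recipe's input datasets rather than from literals, and the result would be written to the day N partition of the changes dataset.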

An important point: do not run this recipe in "recursive" mode, because the build would recurse back to the big bang (computing day N-1 needs day N-2, which needs day N-3, and so on).

You can then automate this with a scenario using a time-based trigger, since you expect your files to change daily (note that this requires a professional version of DSS).