Can changes in HDFS datasets be automatically tracked?

UserBird
Hi,

My workflow uses HDFS datasets that are updated daily. Can DSS track these daily changes and save them in a separate "delta" file, through a scenario or some other automation capability?

Thanks!

Answers

  • Clément_Stenac
    Hi,

    This needs some work but can be achieved using scenarios and partitioning.

    You would have your "stock" dataset (not partitioned) and a "changes" dataset, partitioned by day. You will need to create a code recipe that takes the changes dataset as both input and output, with a partition dependency that says:

    "to compute day N of the changes dataset, I use the stock dataset and day N-1 of the changes dataset" (use the "Time range" dependency).

    Then your recipe does the actual computation.

    One important point: do not run this recipe in "recursive" mode, because it would recurse until the big bang (to compute day N-1 you need day N-2, which needs day N-3, ...).

    This can then be automated with a time-based trigger, since you expect your files to change daily (note that this requires a professional version of DSS).
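
To make "the actual computation" a bit more concrete, here is a minimal sketch in plain Python of what the recipe body could do once it has both inputs in hand. It is not DSS-specific: reading the stock dataset and the day N-1 partition, and writing the day N partition, are left out. The function name `compute_delta`, the `change_type` column and the `id` key column are all hypothetical, and the sketch assumes that yesterday's state of the rows can be obtained from the day N-1 partition and that keys are unique.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def compute_delta(current: Iterable[Row], previous: Iterable[Row], key: str) -> List[Row]:
    """Compare two snapshots and return the added, modified and removed rows.

    `current` stands for today's rows (the stock dataset) and `previous`
    for yesterday's rows. Both are assumed to have a unique `key` column.
    """
    cur = {row[key]: row for row in current}
    prev = {row[key]: row for row in previous}

    delta: List[Row] = []

    # Rows present today: either new, or changed since yesterday.
    for k, row in cur.items():
        if k not in prev:
            delta.append({**row, "change_type": "added"})
        elif row != prev[k]:
            delta.append({**row, "change_type": "modified"})

    # Rows that existed yesterday but are gone today.
    for k, row in prev.items():
        if k not in cur:
            delta.append({**row, "change_type": "removed"})

    return delta


# Hypothetical sample data, for illustration only.
yesterday = [
    {"id": 1, "value": 10},
    {"id": 2, "value": 20},
    {"id": 4, "value": 40},
]
today = [
    {"id": 1, "value": 10},
    {"id": 2, "value": 25},
    {"id": 3, "value": 30},
]

delta = compute_delta(today, yesterday, key="id")
```

With the sample data, the unchanged row (`id` 1) is left out of the delta, and the other three rows each get a `change_type`. Whether the delta should also carry the old values of modified rows depends on what you want to do with it afterwards.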