Remove managed/local dataset in flow by partition

nv Registered Posts: 11 ✭✭✭✭
edited July 16 in Using Dataiku

Hi,

In a flow, I want to delete a managed/local dataset that is created somewhere in the middle of the workflow. However, I want to delete the dataset only at the end of the flow.

So far I have tried adding a shell script at the end of the flow (as input it has an HDFS dataset, though...). As a command I tried:


rm -rf /var/app/dss3/managed_datasets/DPP_DIGITAL_AGGREGATES.DM_AGGREGATED_V2/$DKU_DST_load_date

(load_date is the name of the partitioning.)

The script runs successfully; however, if I look at the local directory, the files (and the partition) are still there.

How can I delete a local dataset? Or why doesn't this 'rm -rf path' work in the shell script?

Kind regards,

Nicolas


Best Answer

  • cperdigou
    cperdigou Alpha Tester, Dataiker Alumni Posts: 115 ✭✭✭✭✭✭✭
    Answer ✓

    Hello,

    You can use a scenario for this purpose, with a "Clear" step.

    As to why the shell recipe does not remove the file, you should have a look at the logs to see if the file was found. If you're unsure you can use the -v option of rm to make it verbose.
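The checks suggested above can be sketched as a small shell function. This is only a sketch under assumptions: `remove_partition` is a hypothetical helper name, and whether `$DKU_DST_load_date` is actually set in a given shell recipe is not confirmed by this thread. If that variable were empty, the original command would target the whole dataset directory, so the sketch refuses to run in that case and reports a missing directory instead of failing silently.

```shell
#!/bin/sh
# Hypothetical helper: remove one partition directory of a managed dataset.
# The base directory and the partition value are passed explicitly so that
# an empty or wrong value is caught before anything is deleted.
remove_partition() (
    base_dir=$1
    partition=$2

    # An empty partition value would make rm target the whole dataset directory.
    if [ -z "$partition" ]; then
        echo "partition value is empty, refusing to delete" >&2
        exit 1
    fi

    target="$base_dir/$partition"
    if [ ! -d "$target" ]; then
        echo "directory not found: $target" >&2
        exit 2
    fi

    # -v prints every removed file, so the result is visible in the recipe log.
    rm -rv -- "$target"
)

# Possible call from the shell recipe (path taken from the question):
# remove_partition /var/app/dss3/managed_datasets/DPP_DIGITAL_AGGREGATES.DM_AGGREGATED_V2 "$DKU_DST_load_date"
```

A non-zero exit status together with the message on stderr should then show in the logs whether the variable or the path was the problem.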

Answers

  • nv
    nv Registered Posts: 11 ✭✭✭✭
    I initially tried to do that, but I had some difficulty with the custom parameters. The reference documentation on custom params isn't very clear to me about what the JSON format should look like with a date range, and I didn't find a clear example on the Dataiku site itself. :-/
  • cperdigou
    cperdigou Alpha Tester, Dataiker Alumni Posts: 115 ✭✭✭✭✭✭✭
    If you select a partitioned dataset in your clear step, an information bubble explaining how to express ranges is available. Does that help?

    It says:

    For time partitioning, in addition to the format indicated, you can also select ranges or use special keywords. To select all partitions between two dates, use the character / between the two dates to separate the beginning from the end.

    You can use the special keyword CURRENT_DAY if your partitioning maps to days : it will be respectively replaced by the current date when the scheduler will run the task. Similarly, you can use PREVIOUS_DAY. For time partitioning on months and hours, CURRENT_MONTH, PREVIOUS_MONTH, CURRENT_HOUR, PREVIOUS_HOUR can be used in the same fashion.

    Although, you cannot use these keywords with the / notation for date ranges.

    Examples :
    - 2014-03-01/2014-04-15 for a daily scheduled task will build every partition between 1st March 2014 and 15th April 2014 every day.

    - PREVIOUS_MONTH for a monthly scheduled task will build the partition corresponding to the previous month.
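To make the `/` range notation concrete, here is a rough shell sketch that lists the daily partition identifiers a range such as 2014-03-01/2014-04-15 would cover. It is an illustration only, not how Dataiku expands ranges internally: `expand_day_range` is a hypothetical helper, it relies on GNU `date` (the `-d` option), and it assumes both ends of the range are included.

```shell
#!/bin/sh
# Hypothetical helper: print one YYYY-MM-DD identifier per day in START/END.
# Assumes GNU date (for -d) and treats both ends of the range as included.
expand_day_range() (
    range=$1
    current=${range%%/*}   # text before the slash
    end=${range#*/}        # text after the slash

    # Compare as integers (e.g. 20140301 <= 20140415) after dropping the dashes.
    while [ "$(printf '%s' "$current" | tr -d '-')" -le "$(printf '%s' "$end" | tr -d '-')" ]; do
        echo "$current"
        # -u avoids daylight-saving surprises when adding one day.
        current=$(date -u -d "$current +1 day" +%F)
    done
)

# Example: list the days covered by the range from the help text.
# expand_day_range 2014-03-01/2014-04-15
```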
  • nv
    nv Registered Posts: 11 ✭✭✭✭
    Ah, OK, the '2014-03-01/2014-04-15' notation is pretty straightforward and seems to work (so far; I actually have to wait until the flow is completely over)...
    My problem was that I thought that for a custom date range I had to fill in the range like in this window: https://imgur.com/a/WgWuh