Remove managed/local dataset in flow by partition
Hi,
In a flow I want to delete a managed/local dataset that is created somewhere in the middle of the workflow, but I only want to delete it at the end of the flow.
So far I have tried adding a shell script at the end of the flow (its input is an HDFS dataset, though...). As the command I tried:
rm -rf /var/app/dss3/managed_datasets/DPP_DIGITAL_AGGREGATES.DM_AGGREGATED_V2/$DKU_DST_load_date (==> load_date is the name of the partitioning).
The script runs; however, when I look at the local directory, the files (and the partition) are still there.
How can I delete a local dataset? Or why doesn't this 'rm -rf path' work in the shell script?
Kind regards,
Nicolas
Best Answer
-
Hello,
You can use a scenario for this purpose, using a "Clear" step.
As to why the shell recipe does not remove the file, you should have a look at the logs to see if the file was found. If you're unsure you can use the -v option of rm to make it verbose.
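As a minimal sketch of that check (not an official Dataiku snippet), the shell recipe could refuse to run when the partition value is empty, report when the directory is missing, and delete verbosely otherwise. The function name remove_partition is hypothetical, and the sketch assumes an rm that supports the -v option:

```shell
# Hypothetical helper: delete one partition directory and log what happens.
# $1 = dataset directory, $2 = partition value (e.g. the load_date value).
remove_partition() {
  base_dir="$1"
  partition="$2"

  # An empty partition value would make rm target the whole dataset directory.
  if [ -z "$partition" ]; then
    echo "partition value is empty; refusing to delete" >&2
    return 1
  fi

  target="$base_dir/$partition"

  # Report a missing path instead of letting 'rm -f' stay silent about it.
  if [ ! -d "$target" ]; then
    echo "partition directory not found: $target" >&2
    return 2
  fi

  # -v prints every removed file, so the recipe log shows what was deleted.
  rm -rfv -- "$target"
}
```

In the recipe from the question this would be called as: remove_partition /var/app/dss3/managed_datasets/DPP_DIGITAL_AGGREGATES.DM_AGGREGATED_V2 "$DKU_DST_load_date". If the log shows the "empty" or "not found" message, the variable or the path is not what the script expects.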
Answers
-
I initially tried to do it that way, but I had some difficulty with the custom parameters. The reference documentation on custom parameters doesn't make clear to me what the JSON format should look like for a date range, and I didn't find a clear example on the Dataiku site itself. :-/
-
If you select a partitioned dataset in your clear step, an information bubble explaining how to express ranges is available. Does that help?
It says:
For time partitioning, in addition to the format indicated, you can also select ranges or use special keywords. To select all partitions between two dates, use the character / between the two dates to separate the beginning from the end.
You can use the special keyword CURRENT_DAY if your partitioning maps to days : it will be respectively replaced by the current date when the scheduler will run the task. Similarly, you can use PREVIOUS_DAY. For time partitioning on months and hours, CURRENT_MONTH, PREVIOUS_MONTH, CURRENT_HOUR, PREVIOUS_HOUR can be used in the same fashion.
Although, you cannot use these keywords with the / notation for date ranges.
Examples :
- 2014-03-01/2014-04-15 for a daily scheduled task will build every partition between 1st March 2014 and 15th April 2014 every day.
- PREVIOUS_MONTH for a monthly scheduled task will build the partition corresponding to the previous month.
-
Ah, OK, the '2014-03-01/2014-04-15' notation is pretty straightforward and seems to work (so far; I actually have to wait until the flow has completely finished)...
My problem was that I thought that for a custom date range I had to fill in the range like in this window. ==> https://imgur.com/a/WgWuh
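To make the '/' range notation from the information bubble concrete, here is a small sketch that lists the daily partitions a range such as 2014-03-01/2014-04-15 would cover. It only illustrates the notation and is not how DSS itself expands ranges; the function name expand_day_range is hypothetical, both end dates are assumed to be included, and GNU date (with the -d option) is assumed to be available:

```shell
# Hypothetical helper: print every day in a START/END range, one per line.
# Assumes GNU date and treats both ends of the range as included.
expand_day_range() {
  range="$1"
  start="${range%%/*}"   # text before the "/"
  end="${range##*/}"     # text after the "/"

  # Convert the end date to epoch seconds once; fail on an invalid date.
  end_epoch="$(date -u -d "$end" +%s)" || return 1

  current="$start"
  while [ "$(date -u -d "$current" +%s)" -le "$end_epoch" ]; do
    echo "$current"
    # Step forward by one calendar day (UTC avoids daylight-saving shifts).
    current="$(date -u -d "$current +1 day" +%Y-%m-%d)"
  done
}
```

Calling expand_day_range 2014-03-01/2014-04-15 prints one line per day from 2014-03-01 to 2014-04-15, which is a quick way to check which partitions a range would touch before using it in a "Clear" step.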