
Remove managed/local dataset in flow by partition

Level 2

Hi,



In a flow, I want to delete a managed/local dataset that is created somewhere in the middle of the workflow, but I want to delete it only at the end of the flow.



So far I have tried adding a shell script at the end of the flow (although it has an HDFS dataset as input). The command I tried is:




rm -rf /var/app/dss3/managed_datasets/DPP_DIGITAL_AGGREGATES.DM_AGGREGATED_V2/$DKU_DST_load_date

(load_date is the name of the partitioning.)


The script runs without errors; however, when I look at the local directory, the files (and the partition) are still there.



How can I delete a local dataset? Or why doesn't this 'rm -rf path' work in the shell script?



Kind regards,



Nicolas





 

4 Replies
Dataiker

Hello,



You can use a scenario for this purpose, with a "Clear" step.





 



 



As for why the shell recipe does not remove the files, you should have a look at the logs to see whether the path was found. If you're unsure, you can use the -v option of rm to make it verbose.
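To make the log easier to read, the check could be wrapped in a small function. This is only a sketch: remove_partition is a hypothetical helper name, and whether $DKU_DST_load_date is actually set in your recipe is something to verify (it is taken from your command, not confirmed here). If that variable were empty, the path would point at the whole dataset folder, so the sketch refuses to run in that case.

```shell
# Hypothetical helper: remove one partition directory and log what happened.
remove_partition() {
  local base_dir="$1" partition="$2"
  # An empty partition id would make the path point at the whole dataset folder.
  if [ -z "$partition" ]; then
    echo "partition id is empty; nothing removed" >&2
    return 2
  fi
  local target="$base_dir/$partition"
  # Report a missing directory instead of failing silently like 'rm -f' does.
  if [ ! -d "$target" ]; then
    echo "not found: $target" >&2
    return 1
  fi
  # -v prints every file as it is deleted, so the recipe log shows the result.
  rm -rv -- "$target"
}

# Example call with the path and variable from the question:
# remove_partition /var/app/dss3/managed_datasets/DPP_DIGITAL_AGGREGATES.DM_AGGREGATED_V2 "$DKU_DST_load_date"
```

If the log shows "partition id is empty" or "not found", the problem is in the path that the recipe builds rather than in rm itself.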

Level 2
Author
I initially tried to do that, but I had some difficulty with the custom parameters. The reference documentation on custom parameters does not make it clear to me what the JSON format should look like with a date range, and I couldn't find a clear example on the Dataiku site itself. 😕
Dataiker
If you select a partitioned dataset in your Clear step, an information bubble explaining how to express ranges is available. Does that help?

It says:

For time partitioning, in addition to the format indicated, you can also select ranges or use special keywords. To select all partitions between two dates, use the character / between the two dates to separate the beginning from the end.

You can use the special keyword CURRENT_DAY if your partitioning maps to days: it will be replaced by the current date when the scheduler runs the task. Similarly, you can use PREVIOUS_DAY. For time partitioning on months and hours, CURRENT_MONTH, PREVIOUS_MONTH, CURRENT_HOUR, and PREVIOUS_HOUR can be used in the same fashion.

However, you cannot use these keywords with the / notation for date ranges.

Examples:
- 2014-03-01/2014-04-15 for a daily scheduled task will build every partition between 1st March 2014 and 15th April 2014 every day.

- PREVIOUS_MONTH for a monthly scheduled task will build the partition corresponding to the previous month.
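To see which daily partitions a range such as 2014-03-01/2014-04-15 refers to, the range can be expanded by hand. The sketch below is not part of DSS: expand_day_range is a hypothetical helper, it relies on GNU date (the -d option), and it assumes that both endpoints of the range are included.

```shell
# Hypothetical helper: print every daily partition id in a START/END range.
# Assumes GNU date and that both endpoints belong to the range.
expand_day_range() {
  local start="${1%%/*}"   # text before the slash
  local end="${1##*/}"     # text after the slash
  local end_s
  end_s=$(date -u -d "$end" +%s) || return 1
  local current="$start"
  # Walk forward one day at a time until the end date has been printed.
  while [ "$(date -u -d "$current" +%s)" -le "$end_s" ]; do
    echo "$current"
    current=$(date -u -d "$current +1 day" +%F)
  done
}

expand_day_range 2014-03-01/2014-04-15 | wc -l   # 46 (31 days of March + 15 of April)
```

Under that assumption, a Clear step given this range would target 46 daily partitions.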
Level 2
Author
Ah, OK, the '2014-03-01/2014-04-15' format is pretty straightforward and seems to work (so far; I actually have to wait until the flow is completely over)...
My problem was that I thought that, for a custom date range, I had to fill in the range as in this window: https://imgur.com/a/WgWuh