Remove managed/local dataset in flow by partition

nv · ‎10-10-2017

Hi,

In a flow I want to delete a managed/local dataset which is created somewhere in the middle of the workflow, however I want to delete the dataset only at the end of the flow.

Currently I tried to add a shell script at the end of the flow (as input it has an HDFS dataset, though...). As a command I tried:


rm -rf /var/app/dss3/managed_datasets/DPP_DIGITAL_AGGREGATES.DM_AGGREGATED_V2/$DKU_DST_load_date (==> load_date is the name of the partitioning).

The script runs without error; however, when I look at the local directory, the files (and the partition) are still there.

How can I delete a local dataset? Or why doesn't this 'rm -rf path' work in the shell script?

Kind regards,

Nicolas

cperdigou · ‎10-10-2017

Hello,

You can use a scenario for this purpose, using a "Clear" step.

As to why the shell recipe does not remove the file, you should have a look at the logs to see if the file was found. If you're unsure you can use the -v option of rm to make it verbose.
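A minimal sketch of that kind of check, for illustration only. `remove_partition` is a hypothetical helper name, and whether `$DKU_DST_load_date` is actually set inside the shell recipe is an assumption you would need to verify in the logs. The function refuses to run when the partition value is empty (which would otherwise point `rm -rf` at the dataset root), reports when the target path does not exist, and uses `-v` so every removed file shows up in the recipe log.

```shell
# Hypothetical helper: remove one partition directory under a dataset root.
# $1 = dataset root directory, $2 = partition identifier (e.g. a load_date value)
remove_partition() {
    dataset_root="$1"
    partition="$2"

    # An empty partition id would make the target equal to the dataset root.
    if [ -z "$partition" ]; then
        echo "partition id is empty, refusing to run rm on $dataset_root" >&2
        return 1
    fi

    target="$dataset_root/$partition"

    # Report a wrong path instead of letting rm -f silently do nothing.
    if [ ! -e "$target" ]; then
        echo "nothing to remove: $target does not exist" >&2
        return 1
    fi

    # -v prints each removed file, so the log shows what was deleted.
    rm -rfv -- "$target"
}

# Usage with the path from the question (assumes DKU_DST_load_date is set):
# remove_partition /var/app/dss3/managed_datasets/DPP_DIGITAL_AGGREGATES.DM_AGGREGATED_V2 "$DKU_DST_load_date"
```

If the log shows "nothing to remove", the path built from the variable is not the one where the partition actually lives.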

nv · ‎10-10-2017

I initially tried to do that, but I had some difficulty with the custom parameters. The reference documentation isn't very clear to me on what the JSON format for custom params should look like with a date range, and I didn't find a clear example on the Dataiku site itself. 😕

cperdigou · ‎10-10-2017

If you select a partitioned dataset in your clear step, an information bubble explaining how to express ranges is available. Does that help?

It says:

For time partitioning, in addition to the format indicated, you can also select ranges or use special keywords. To select all partitions between two dates, use the character / between the two dates to separate the beginning from the end.

You can use the special keyword CURRENT_DAY if your partitioning maps to days : it will be respectively replaced by the current date when the scheduler will run the task. Similarly, you can use PREVIOUS_DAY. For time partitioning on months and hours, CURRENT_MONTH, PREVIOUS_MONTH, CURRENT_HOUR, PREVIOUS_HOUR can be used in the same fashion.

Although, you cannot use these keywords with the / notation for date ranges.

Examples :
- 2014-03-01/2014-04-15 for a daily scheduled task will build every partition between 1st March 2014 and 15th April 2014 every day.

- PREVIOUS_MONTH for a monthly scheduled task will build the partition corresponding to the previous month.
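To make the `/` range notation concrete, here is a small sketch that lists the daily partition identifiers a range such as `2014-03-01/2014-04-15` would cover. It is an illustration only, not how DSS resolves ranges internally: `expand_day_range` is a hypothetical name, the sketch assumes both end dates are included, and it relies on GNU `date` (the `-d` option).

```shell
# Hypothetical helper: print every day in a START/END range, one per line.
# Assumes YYYY-MM-DD dates, inclusive endpoints, and GNU date.
# The body runs in a subshell, so its variables do not leak to the caller.
expand_day_range() (
    range="$1"
    start="${range%%/*}"   # text before the first "/"
    end="${range##*/}"     # text after the last "/"
    day="$start"

    # Compare dates as integers (20140301 <= 20140415) after dropping dashes.
    while [ "$(echo "$day" | tr -d '-')" -le "$(echo "$end" | tr -d '-')" ]; do
        echo "$day"
        # -u avoids daylight-saving surprises when stepping one day forward.
        day="$(date -u -d "$day + 1 day" +%F)" || exit 1
    done
)

# Usage:
# expand_day_range "2014-03-01/2014-04-15"
```

A value without a `/` is treated as a single day, so the same helper also covers the plain `2014-03-01` form.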

nv · ‎10-10-2017

Ah, OK, the '2014-03-01/2014-04-15' format was pretty straightforward and seems to work (so far; I actually have to wait until the flow has completely finished)...
My problem was that I thought that for a custom date range I had to fill in the range like in this window: https://imgur.com/a/WgWuh
