Drop Data

Erlebacher
Level 4

When I delete a dataset, Dataiku always asks whether I wish to delete the data. I understand that the question makes sense if the dataset was built from data I uploaded.

But if I am deleting a dataset produced by a recipe, what is the meaning of the "drop data" option? What data is being deleted? Stated differently, where is the data that would not be deleted if the option is not selected?

Thank you.


Operating system used: macOS Ventura

3 Replies
RoyE
Dataiker

Hi!

Take, for example, a Flow that runs some visual recipes and saves the output datasets to a SQL database. If you choose not to drop the data when deleting those datasets, you might be left with several unused tables in your database over time.

This example is one reason a user would select to drop data.

Alternatively, if you want to keep the table in your SQL database, for example to use it in another project or at a later time, you can explicitly choose not to drop the data.
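To make the SQL case concrete, here is a minimal sketch of the idea. It is a toy model, not Dataiku's implementation: the `catalog` dictionary stands in for the dataset definitions in a Flow, an in-memory SQLite database stands in for the SQL connection, and `delete_dataset`, `list_tables`, and the table names are all hypothetical.

```python
import sqlite3


def delete_dataset(conn, catalog, name, drop_data):
    """Remove the dataset definition; drop the underlying table only if asked."""
    table = catalog.pop(name)  # the definition always goes away
    if drop_data:
        conn.execute(f'DROP TABLE IF EXISTS "{table}"')


def list_tables(conn):
    """Return the names of the tables that physically exist in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
catalog = {}  # hypothetical stand-in for the Flow's dataset definitions

# Two recipe outputs, each backed by its own table.
for name in ("orders_prepared", "orders_joined"):
    conn.execute(f'CREATE TABLE "{name}" (id INTEGER)')
    catalog[name] = name

delete_dataset(conn, catalog, "orders_prepared", drop_data=False)
delete_dataset(conn, catalog, "orders_joined", drop_data=True)

remaining_tables = list_tables(conn)
print(remaining_tables)  # ['orders_prepared']
print(catalog)           # {}
```

Both definitions are gone, but only the table deleted with `drop_data=True` was removed from the database. The other table is the kind of unused leftover described above.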

Although it highly depends on the contents of your Flow, I hope this better explains the option.

Erlebacher
Level 4
Author

Thanks. Still, one point of confusion. Suppose I have a regular dataset that contains data generated by some recipe. If I disconnect the recipe and delete the dataset, but do not drop the data, does the data still exist somewhere?

RoyE
Dataiker

Indeed.

Let's say you sync an uploaded dataset to a managed dataset. By default, its data will be stored at <DATA_DIR>/managed_datasets/<PROJECT_ID>/<DATASET_ID>/out-s0.csv.gz

Screen Shot 2022-11-26 at 13.24.17.png

If you choose to delete the dataset from the GUI, but not drop the data, this csv file will not be deleted and will stay on your system. A SQL dataset behaves similarly: the table will still exist at the location (schema, table, etc.) defined by the connection settings.

Screen Shot 2022-11-26 at 13.24.35.png

However, if you choose to drop data, you will notice that the dataset folder and csv file are also removed from your data directory.
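As a rough illustration, the sketch below scans a data directory for dataset folders that no longer correspond to a dataset in the project. It assumes the default layout quoted above (<DATA_DIR>/managed_datasets/<PROJECT_ID>/<DATASET_ID>/) and runs against a temporary directory rather than a real DSS install. The helper names, the `MYPROJECT` key, and the `still_in_flow` dataset are hypothetical.

```python
import tempfile
from pathlib import Path


def managed_dataset_dir(data_dir, project_id, dataset_id):
    """Build the folder path for a managed dataset, following the layout above."""
    return Path(data_dir) / "managed_datasets" / project_id / dataset_id


def leftover_dataset_dirs(data_dir, project_id, known_dataset_ids):
    """List dataset folders on disk that are not among the known dataset IDs."""
    project_dir = Path(data_dir) / "managed_datasets" / project_id
    if not project_dir.is_dir():
        return []
    return sorted(
        p.name
        for p in project_dir.iterdir()
        if p.is_dir() and p.name not in known_dataset_ids
    )


# Simulate a data directory holding two managed datasets.
with tempfile.TemporaryDirectory() as data_dir:
    for dataset_id in ("test_copy", "still_in_flow"):
        folder = managed_dataset_dir(data_dir, "MYPROJECT", dataset_id)
        folder.mkdir(parents=True)
        (folder / "out-s0.csv.gz").write_bytes(b"")  # placeholder data file

    # Pretend "test_copy" was deleted from the GUI without dropping its data,
    # so only "still_in_flow" is still known to the project.
    leftovers = leftover_dataset_dirs(data_dir, "MYPROJECT", {"still_in_flow"})

print(leftovers)  # ['test_copy']
```

A folder reported here is data that survived a delete without "drop data". It can either be reconnected to or removed by hand.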

Note that if I do not drop the data in the GUI and make a new connection to the test_copy folder location, I will have access to this data.

Screen Shot 2022-11-26 at 13.28.06.png

For data that you will no longer need, it is recommended that you drop the data so that dataset IDs that are no longer in use are not left floating around on your DSS server.

Hope this helps!
