Drop Data
When I delete a dataset, Dataiku always asks whether I wish to delete the data. I understand that the question makes sense if the dataset was built from data I uploaded.
But if I am deleting a dataset produced by a recipe, what does the "drop data" option mean? What data is being deleted? Stated differently, where is the data that would remain if the option is not selected?
Thank you.
Operating system used: Mac Ventura
Answers
-
RoyE Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 31 Dataiker
Hi!
Take, for example, a Flow that runs some visual recipes and saves the output datasets to a SQL database. If you choose not to drop the data, you might be left with several unused tables in your database over time.
This is one reason a user would choose to drop data.
Alternatively, if you want to keep the table in your SQL database to use it in another project or at another time, you can explicitly choose not to drop the data.
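To make the difference concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is only a simulation and not how DSS is implemented: the flow_datasets dictionary, the delete_dataset helper, and the orders_prepared table are all hypothetical names made up for illustration.

```python
import sqlite3

# Hypothetical stand-in for a SQL connection used by a Flow.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_prepared (id INTEGER, amount REAL)")
conn.execute("INSERT INTO orders_prepared VALUES (1, 9.5)")

# Hypothetical stand-in for the Flow: dataset name -> backing table name.
flow_datasets = {"orders_prepared": "orders_prepared"}


def delete_dataset(name, drop_data):
    """Remove the dataset from the simulated Flow; optionally drop its table."""
    table = flow_datasets.pop(name)
    if drop_data:
        # The table name comes from our own dictionary, not from user input.
        conn.execute(f'DROP TABLE IF EXISTS "{table}"')


def existing_tables():
    """Return the names of the tables currently present in the database."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return sorted(row[0] for row in rows)


# Delete the dataset without dropping data: the Flow forgets it,
# but the table is still sitting in the database.
delete_dataset("orders_prepared", drop_data=False)
remaining_without_drop = existing_tables()
```

With drop_data=False the table survives as an unused table; with drop_data=True it would be removed along with the dataset entry.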
Although it highly depends on the contents of your Flow, I hope this better explains the option.
-
Thanks. Still, one point of confusion: suppose I have a regular dataset that contains data generated by some recipe. If I disconnect the recipe and delete the dataset but do not drop the data, does the data still exist somewhere?
-
RoyE Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 31 Dataiker
Indeed.
Let's say you sync an uploaded dataset to a managed dataset. By default, its data will be stored at <DATA_DIR>/managed_datasets/<PROJECT_ID>/<DATASET_ID>/out-s0.csv.gz
If you choose to delete the dataset from the GUI but not drop the data, this CSV file will not be deleted and will stay on your system. This is similar to how it works with a SQL database: the table will continue to exist in the location (schema, table, etc.) defined by the connection.
However, if you choose to drop data, you will notice that the dataset folder and csv file are also removed from your data directory.
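If you want to check this yourself, a small script along these lines can list what is left on disk for a given dataset. It is only a sketch that assumes the default managed_datasets layout described above; the function names and the example values for the data directory, project ID, and dataset ID are hypothetical.

```python
from pathlib import Path


def managed_dataset_dir(data_dir, project_id, dataset_id):
    """Build the default folder path for a managed filesystem dataset."""
    return Path(data_dir) / "managed_datasets" / project_id / dataset_id


def leftover_files(data_dir, project_id, dataset_id):
    """Return the file names still present for a dataset, or [] if the folder is gone."""
    folder = managed_dataset_dir(data_dir, project_id, dataset_id)
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.iterdir() if p.is_file())


# Hypothetical values: replace them with your own data directory and IDs.
print(leftover_files("/path/to/DATA_DIR", "MYPROJECT", "my_dataset"))
```

An empty list means the dataset folder is no longer there (for example, because the data was dropped); a non-empty list means the files are still on disk.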
Note that if I do not drop the data in the GUI and make a new connection to the test_copy folder location, I will have access to this data.
For data that you will no longer need, it is recommended that you drop the data so that you do not leave the data of deleted datasets floating around on your DSS server.
Hope this helps!