Drop Data

Erlebacher
Erlebacher Registered Posts: 82 ✭✭

When I delete a dataset, Dataiku always asks whether I wish to delete the data. I understand that the question makes sense if the dataset was built from data I uploaded.

But if I am deleting a dataset produced by a recipe, what is the meaning of the "drop data" option? What data is being deleted? Stated differently, where is the data that would remain if the option is not selected?

Thank you.


Operating system used: Mac Ventura


Answers

  • RoyE
    RoyE Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 31 Dataiker

    Hi!

    Take, for example, a Flow that runs some visual recipes and writes the output datasets to a SQL database. If you choose not to drop data when deleting those datasets, you might be left with several unused tables in your database over time.

    This is one reason a user would choose to drop data.

    Alternatively, if you want to keep the table in your SQL database so that you can use it in another project or at another time, you can explicitly choose not to drop the data.

    Although the right choice depends heavily on the contents of your Flow, I hope this better explains the option.

  • Erlebacher
    Erlebacher Registered Posts: 82 ✭✭

    Thanks. Still, one point of confusion. Suppose I have a regular dataset that contains data generated by some recipe. If I disconnect the recipe and delete the dataset, but do not drop the data, does the data still exist somewhere?

  • RoyE
    RoyE Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 31 Dataiker

    Indeed.

    Let's say you sync an uploaded dataset to a managed dataset. By default, the output will be stored at <DATA_DIR>/managed_datasets/<PROJECT_ID>/<DATASET_ID>/out-s0.csv.gz

    Screen Shot 2022-11-26 at 13.24.17.png

    If you choose to delete the dataset from the GUI but not drop the data, this csv file will not be deleted and will stay on your system. A SQL database works similarly: the table will continue to exist at the location (schema, table, etc.) defined by the connection the dataset used.

    Screen Shot 2022-11-26 at 13.24.35.png

    However, if you choose to drop data, you will notice that the dataset folder and csv file are also removed from your data directory.

    Note that if I do not drop the data in the GUI and make a new connection to the test_copy folder location, I will have access to this data.

    Screen Shot 2022-11-26 at 13.28.06.png

    For data that you will no longer need, it is recommended that you drop the data so that you do not end up with leftover data from deleted datasets floating around on your DSS server.

    Hope this helps!
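
The difference between the two choices described in this thread can be illustrated with a small, self-contained simulation. This is only a sketch and does not call Dataiku: the `flow` set, the `delete_dataset` helper, and the `MYPROJECT` project key are hypothetical stand-ins, and the folder layout simply mirrors the default path quoted above.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def managed_dataset_dir(data_dir: Path, project_id: str, dataset_id: str) -> Path:
    """Folder layout quoted in the thread: <DATA_DIR>/managed_datasets/<PROJECT_ID>/<DATASET_ID>."""
    return data_dir / "managed_datasets" / project_id / dataset_id


def delete_dataset(flow: set, data_dir: Path, project_id: str,
                   dataset_id: str, drop_data: bool) -> None:
    """Hypothetical stand-in for deleting a dataset from the Flow (not a Dataiku function)."""
    # The dataset always disappears from the Flow.
    flow.discard(dataset_id)
    # The stored files are removed only when the data is dropped.
    if drop_data:
        shutil.rmtree(managed_dataset_dir(data_dir, project_id, dataset_id),
                      ignore_errors=True)


def demo() -> dict:
    """Run both variants against a temporary directory and report what remains."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp)
        for drop_data in (False, True):
            flow = {"test_copy"}  # stand-in for the datasets visible in the Flow
            folder = managed_dataset_dir(data_dir, "MYPROJECT", "test_copy")
            folder.mkdir(parents=True, exist_ok=True)
            with gzip.open(folder / "out-s0.csv.gz", "wt") as fh:
                fh.write("id,value\n1,a\n")
            delete_dataset(flow, data_dir, "MYPROJECT", "test_copy", drop_data)
            results[drop_data] = {
                "in_flow": "test_copy" in flow,
                "files_on_disk": folder.exists(),
            }
    return results


if __name__ == "__main__":
    for drop_data, state in demo().items():
        print(f"drop_data={drop_data}: {state}")
```

In both cases the dataset is gone from the Flow; only the `files_on_disk` value differs. For scripted clean-up, the Dataiku public API client (`dataikuapi`) appears to expose the same choice as a `drop_data` argument on the dataset's `delete` method; check the API documentation for your DSS version before relying on it.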
