How to correctly duplicate a project
I have a Dataiku project whose uploaded CSV files are stored in a folder, from which datasets are created downstream, and the flow grows more complex from there. I want to duplicate the project, so I chose this option:
Uploaded files
Duplicate data of uploaded datasets and managed folders.
I assumed that my CSV files would be included, since I consider them "Uploaded files." However, they were not duplicated, so I do not understand the option above. I also assumed that the schemas of the various datasets would be duplicated without the data, and that is fine.
The second advanced option:
Required inputs
Duplicate data of required (uploaded and input) datasets and managed folders.
worked as I expected. The CSV files in the input folder were duplicated, as were the datasets connected to these input files. But none of the other 15 datasets were duplicated, since I can recreate them from the original dataset.
So I would like to better understand the first option, which would be the least memory-costly. Thanks.
Gordon
Operating system used: macOS Ventura
Answers
Miguel Angel Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 118 Dataiker
Hello,
The "Uploaded files" option refers to datasets created with the 'Upload your files' option of the +Dataset menu. The data of all other datasets and managed folders will not be duplicated using the first option.
Note that files in a folder are not uploaded, but added, which explains this behaviour.
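To make the difference between the two options concrete, here is a toy Python model of the behaviour described in this thread. It is only a sketch, not the DSS API: the `FlowItem` class, the `data_is_duplicated` function, the option keys and the item names are all made up for illustration, and the rules encode only what was reported above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowItem:
    """Hypothetical stand-in for an item in a Flow (not a DSS class)."""
    name: str
    kind: str       # "uploaded_dataset", "managed_folder" or "dataset"
    is_input: bool  # True if nothing upstream in the Flow builds it


def data_is_duplicated(item: FlowItem, option: str) -> bool:
    """Return True if the item's data is copied, per the behaviour in this thread."""
    if option == "uploaded_files":
        # Only datasets created through 'Upload your files' keep their data.
        return item.kind == "uploaded_dataset"
    if option == "required_inputs":
        # Uploaded datasets plus input datasets and input folders keep their data.
        return item.kind == "uploaded_dataset" or item.is_input
    raise ValueError(f"unknown option: {option}")


# Illustrative Flow: a folder of added CSV files, an uploaded dataset,
# and a downstream dataset that can be rebuilt from its inputs.
flow = [
    FlowItem("csv_folder", "managed_folder", is_input=True),
    FlowItem("uploaded_table", "uploaded_dataset", is_input=True),
    FlowItem("prepared", "dataset", is_input=False),
]


def copied(option: str) -> list[str]:
    """Names of the items whose data would be duplicated under the option."""
    return [item.name for item in flow if data_is_duplicated(item, option)]


print(copied("uploaded_files"))
print(copied("required_inputs"))
```

In this model the folder of CSV files is only copied under "Required inputs", which matches what Gordon observed, and the downstream dataset is copied under neither option because it can be rebuilt.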