I have a Dataiku project in which my uploaded csv files sit in a folder; datasets are created from them downstream, and the flow grows more complex from there. I want to duplicate the project, so I chose the option:
Uploaded files
Duplicate data of uploaded datasets and managed folders.
I assumed that my csv files would be included, since I consider them "Uploaded files." However, they were not duplicated, so I do not understand the option above. I also assumed that the schemas of the various datasets would be duplicated without the data, and that is fine.
The second advanced option:
Required inputs
Duplicate data of required (uploaded and input) datasets and managed folders.
worked as I expected. The csv files in the input folder were duplicated, as were the datasets connected to these input files. But none of the other 15 datasets were duplicated, since I can recreate them from the original dataset.
So I would like to better understand the first option, which would be the least costly in memory. Thanks.
Gordon
Operating system used: macOS Ventura
Hello,
The "Uploaded files" option refers to datasets created with the 'Upload your files' option under +Dataset. With the first option, the data of all other datasets and managed folders is not duplicated.
Note that files in a folder are not uploaded but added, which explains this behaviour.