Having the same dataset across projects as output for more than one recipe

Chandra_Mouli_R
Chandra_Mouli_R Registered Posts: 4 ✭✭✭

Hi All,

I have been working with Dataiku for more than a year now and have a couple of questions. Please read through them below and share your solution.

What I Know:

There is an Exposed objects / Share option for using a dataset across projects. It has quite a few limitations; the main one is that we cannot write output to the shared dataset from Project B.

What I don't know:

Is it possible to write or save the output rows of more than one flow zone into the same dataset without any workarounds? [I ask because I currently rely on a workaround, owing to the way the dataset is built.]

Is it possible to write or save the output rows of more than one project into the same dataset? Currently we can share a dataset across projects to use it as an input in Project B, but can we also use it as an output dataset?

Thanks,
Chandra Mouli R


Operating system used: Windows

Best Answer

  • Manuel
    Manuel Alpha Tester, Dataiker Alumni, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Dataiku DSS Adv Designer, Registered Posts: 193 ✭✭✭✭✭✭✭
    Answer ✓

    Hi,

    No problem. You have two options:

    • Either do the same as you did in Project A: create the recipe with a new output dataset, then edit that dataset's settings afterwards so that they match those in Project A;
    • Or, in Project B, before creating the recipe, use +dataset to add your shared dataset manually, with the same settings. Then, when creating the recipe, you should see the shared dataset listed as an existing dataset (I have not tested this).
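To make "so that they match" concrete, here is a minimal sketch in plain Python that compares the storage-related settings of two dataset definitions. The dictionary keys (`connection`, `schema`, `table`) and the sample values are assumptions chosen for illustration; they are not read from DSS, and the real settings screen may expose different fields depending on the connection type.

```python
# Illustrative only: these keys are assumed stand-ins for the storage
# settings shown on a dataset's Settings tab; nothing is read from DSS.
STORAGE_KEYS = ("connection", "schema", "table")


def storage_mismatches(settings_a, settings_b):
    """Return {key: (value_a, value_b)} for every storage key that differs."""
    return {
        key: (settings_a.get(key), settings_b.get(key))
        for key in STORAGE_KEYS
        if settings_a.get(key) != settings_b.get(key)
    }


# Hypothetical settings for the output dataset in each project.
project_a = {"connection": "my_sql_conn", "schema": "public", "table": "Shared_dataset"}
project_b = {"connection": "my_sql_conn", "schema": "public", "table": "PROJECTB_Shared_dataset"}

# Before editing, the table names differ.
print(storage_mismatches(project_a, project_b))

# After editing Project B's table name to match Project A, nothing differs.
project_b["table"] = "Shared_dataset"
print(storage_mismatches(project_a, project_b))
```

An empty result means both dataset definitions point at the same underlying storage, which is the condition both options above aim for.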

    I hope this helps.

    Best regards

Answers

  • Manuel

    Hi,

    Did you try the append option? You can find this in the input/output settings of a recipe. See attached image.

    I hope this helps.

  • Chandra_Mouli_R

    Hi @Manuel

    That won't help across different projects, or even across more than one recipe: we are not able to use the same dataset as output in more than one project or recipe.
    Moreover, append is for inserting new records alongside the old ones without replacing them.
    Thanks for trying!

  • Manuel

    Hi,

    I assumed you wanted projects to collaborate on a dataset (appending), but instead it seems you want projects to compete (overwrite).

    Is your challenge simply about defining an existing dataset as output? If that is the case, this is also possible:

    • Datasets are not stored in DSS, but in the underlying data platforms (db tables, blob storage, files);
    • The dataset icon in the flow is a “pointer” to a dataset stored somewhere;
    • You can edit two datasets to make sure they point to the same underlying table;

    See my attached examples, in which two dummy projects overwrite the same dataset:

    • By default, DSS prefixes the underlying table with the project key (making the table unique for that project);
    • By editing the dataset settings and removing the project key from the table names, you have two projects writing to the same dataset;
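The effect of the prefix can be sketched in a few lines of plain Python. This is only a model of the naming idea, not DSS code: the `${projectKey}` placeholder syntax, the `physical_table` helper, and the project keys `PROJECTA` and `PROJECTB` are assumptions made for illustration.

```python
from string import Template


def physical_table(table_setting, project_key):
    """Expand an assumed ${projectKey} placeholder in a table-name setting."""
    # safe_substitute leaves the string untouched if it has no placeholder.
    return Template(table_setting).safe_substitute(projectKey=project_key)


# Default-style naming (assumed): each project resolves to its own table.
default_setting = "${projectKey}_Shared_dataset"
print(physical_table(default_setting, "PROJECTA"))
print(physical_table(default_setting, "PROJECTB"))

# Prefix removed in both projects: both resolve to the same table.
edited_setting = "Shared_dataset"
same = physical_table(edited_setting, "PROJECTA") == physical_table(edited_setting, "PROJECTB")
print(same)
```

With the prefix in place the two projects can never collide; once it is removed, both dataset icons point at one table, so in overwrite mode whichever recipe runs last determines its contents.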

    I hope this helps.

  • Chandra_Mouli_R

    Hi @Manuel

    I will explain step by step where the problem arises when implementing your solution. Please correct me where I go wrong!

    In Project A, I tried using a recipe on a dataset to create an output, "Shared_dataset".

    In this process, I get a window asking for a new dataset name. I entered "Shared_dataset"; the storage location is selected by default. After I click Create Recipe, the window moves on to the recipe details.

    While filling out the recipe details, there is no setting for the output dataset before the recipe is created. After the recipe is created, we return to the flow, click Explore on "Shared_dataset" [the output of the recipe], and go to Settings, where a window resembling your screenshot appears.

    By default, the table name carries the $project key prefix, as in "$project_Shared_dataset". I removed the prefix and saved the dataset settings, as you described in your solution.

    In Project B, creating a recipe asks for an output. When I click on "Existing Dataset", Shared_dataset is not listed there.

    I tried exposing the dataset, that is, sharing it across projects; Shared_dataset then appeared in black in Project B. Even after that, it does not come up in the list of available datasets.

    Thanks,
    Chandra Mouli R

  • Chandra_Mouli_R

    Hi @Manuel,

    Option 1 worked perfectly.

    Thanks,

    Chandra Mouli R
