Dataiku Application

mayur_garg Registered Posts: 5 ✭✭✭

Hi, we need to understand what happens when an instance is built from an application. In our case, building the instance takes a long time, and during the build it shows "Importing/exporting dataset" while reading many GB of data.

The second question, which may be related to the first, is when we need to add a dataset to the Included content of an application. Our application is based on a pipeline. On some of the datasets in this pipeline we have created charts and published them to a dashboard, and now we want to see this dashboard in the application. Do we need to add these datasets to Included content before we expose the dashboard? Basically, what is the significance of including a dataset in Included content? Is instance creation taking a long time because the included datasets hold many GB of data?

Lastly, is there any option in an application to add a section where the user can click a link and view the underlying dataset of the Dataiku pipeline? I can find a "download dataset" option but cannot find a "view dataset" one. There is one option that says "select dataset file". Is this what we should use, and with what tile behavior for our purpose?


  • CoreyS
    CoreyS Dataiker Alumni, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Registered Posts: 1,150 ✭✭✭✭✭✭✭✭✭

    Hi @mayur_garg,
    While you wait for a more detailed and complete response, I wanted to point out a few resources for learning more about Dataiku Applications:

    1. Dataiku Applications (Documentation)
    2. Dataiku Applications (Knowledge Base)
    3. Dataiku Applications Tutorials (Academy)
    4. Converting your Dataiku DSS Project into a Reusable Application - Watch on Demand

    I hope this helps!

  • Manuel
    Manuel Alpha Tester, Dataiker Alumni, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Dataiku DSS Adv Designer, Registered Posts: 193 ✭✭✭✭✭✭✭


    Applications allow the parallel execution of the same data pipeline with different inputs, but this means that the flow is duplicated for every instance. I suspect that in your case there is a lot of data to instantiate, so building each instance takes a long time and a lot of space.

    You can design your application to minimise this risk. See the attached example of an application pattern:

    • Minimise the flow that will be instantiated, minimising the size of each instance;
    • Make use of shared datasets, sharing instead of copying data.
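
To make the trade-off in the two points above concrete, here is a rough back-of-the-envelope sketch in Python. It is not a Dataiku API and does not measure anything in DSS: it only assumes, as described above, that included datasets are copied into every instance while shared datasets are stored once. The dataset names and sizes are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowDataset:
    """One dataset of the template flow (names and sizes are made up)."""
    name: str
    size_gb: float
    included: bool  # True: copied into each instance; False: shared, stored once


def instance_footprint_gb(datasets):
    """Data copied every time one instance is created."""
    return sum(d.size_gb for d in datasets if d.included)


def total_footprint_gb(datasets, n_instances):
    """Shared data counted once, plus the copied data for each instance."""
    shared = sum(d.size_gb for d in datasets if not d.included)
    return shared + n_instances * instance_footprint_gb(datasets)


# Hypothetical flow: the large history table is shared, the small ones are included.
flow = [
    FlowDataset("raw_history", 40.0, included=False),
    FlowDataset("user_input", 0.5, included=True),
    FlowDataset("report_table", 1.5, included=True),
]

print(instance_footprint_gb(flow))   # data copied per instance
print(total_footprint_gb(flow, 3))   # storage for three instances
```

Under these assumptions, marking `raw_history` as included instead would add its 40 GB to every instance, which is the kind of cost that shows up as a long "Importing/exporting dataset" step.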

    Concerning publishing charts to the dashboard, you can probably get away with sharing the underlying dataset with the application template project and defining the charts on that shared object.

    Concerning showing the underlying dataset, you can publish a dataset as a dashboard insight, which allows the user to review the data without accessing the flow per se.

    I hope this helps.
