How to rebuild a dataset that is built from an Excel file located in a managed folder?

Pim
Pim Registered Posts: 3
edited December 22 in Using Dataiku

After successfully refreshing the Excel file in the managed folder, I can't figure out how to refresh the downstream dataset.

The Excel file in the managed Folder is successfully refreshed by dragging the updated Excel file 'data_cleansing.xls' to the Managed Folder and overwriting the old version of this file.

image.png

But the Upstream Build of the dataset 'data_cleansing' doesn't result in an update of the data. The data in the dataset 'data_cleansing' remains unchanged. How can the data in this dataset be updated with the newly uploaded data?

image.png image.png

Operating system used: Windows

Best Answer

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023, Circle Member Posts: 2,641 Neuron
    edited December 22 Answer ✓

    Indeed, that's the problem. Uploaded datasets are just that: uploaded datasets. If you want to update an uploaded dataset you need to re-upload the file. In other words, the data_cleansing dataset in the first project is a completely different dataset from the uploaded one. Creating datasets from files in Managed Folders is much better than using uploaded datasets, since the files can be updated directly in the managed folder without doing anything in Dataiku. Furthermore, your Managed Folder could be on network storage, NAS storage, cloud buckets, etc., meaning that the file update can be triggered completely independently by another system which has write access to the same storage folder.

    To solve your issue you should "share" the data_cleansing dataset from the Project 1 Managed Folder into Project 2. Then edit the sync recipe, add the newly shared data_cleansing dataset as an input, and remove the uploaded data_cleansing dataset from the inputs. Finally, delete the uploaded data_cleansing dataset (it should now be left orphaned in the flow). If you don't want to share the dataset across projects you will need to add the Managed Folder to Project 2.

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023, Circle Member Posts: 2,641 Neuron
    edited December 22

    Dataiku works on a simple concept: recipes have inputs and outputs. To update an output dataset you "build" it, which in effect runs the recipe that outputs it. The only exception to this rule is [flow] "input" datasets (not to be confused with recipe input datasets). These are external datasets that are not built (in Dataiku's words, "managed") by Dataiku. Typically these are at the beginning of your flow. data_cleansing is such an input dataset. It cannot be rebuilt. If you want new data from the managed folder / data_cleansing dataset to be loaded, you need to run (aka build) the recipe that uses data_cleansing as an input.

    Finally, what you see on the Explore tab of the data_cleansing dataset is a data sample, which may not reflect the latest data. You can update the sample if needed, but this will not push the data downstream; it's for display purposes only. The sample issue applies to all datasets, whether external inputs, inputs or outputs. You nearly never see all the data (aside from tiny datasets) and nearly never see "live" data.
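For readers who prefer to trigger the build from code rather than from the flow, here is a minimal sketch using the Dataiku public Python API. It assumes the code runs inside DSS (for example in a notebook), and the helper names `get_default_project` and `rebuild_dataset` as well as the dataset name `data_cleansing_prepared` are hypothetical, not taken from the thread. The idea is the same as above: the input dataset itself is not built; instead, the output of the recipe that reads it is force-built.

```python
def get_default_project():
    """Return a handle on the current project (only works inside DSS)."""
    import dataiku  # imported lazily: the package only exists in a DSS environment
    return dataiku.api_client().get_default_project()


def rebuild_dataset(project, dataset_name, job_type="NON_RECURSIVE_FORCED_BUILD"):
    """Force a build of `dataset_name`, i.e. rerun the recipe that outputs it.

    `project` is a project handle such as the one returned by
    get_default_project(). With NON_RECURSIVE_FORCED_BUILD only the recipe
    that directly produces the dataset is run, even if DSS considers the
    dataset up to date.
    """
    dataset = project.get_dataset(dataset_name)
    # build() starts a job and, by default, waits for it to finish.
    return dataset.build(job_type=job_type)


# Hypothetical usage inside DSS; "data_cleansing_prepared" stands for whatever
# dataset your recipe produces from data_cleansing:
#
#   project = get_default_project()
#   rebuild_dataset(project, "data_cleansing_prepared")
```

The target passed to `rebuild_dataset` is the recipe's output, not `data_cleansing` itself, since flow input datasets have no recipe to run.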

  • Pim
    Pim Registered Posts: 3

    Thank you for your fast response, Turribeach.
    A run of the recipe that uses 'data_cleansing' as an input didn't result in an update of the datasets that it builds.

    Maybe this has to do with my setup of projects and flow zones, and with the use of an uploaded dataset.

    image.png


    The managed folder and the dataset 'data_cleansing' are located in a project (let's call it 'Project 1'), while the Upload and the recipe that uses Upload are located in another project (let's call it 'Project 2').

    Running the recipe in the red circle (in Project 2) doesn't result in an update of the dataset it builds.

    image.png
  • Pim
    Pim Registered Posts: 3

    That solved the problem, Turribeach.
    Thank you very much!
