Scaling Exposed Datasets

yinkit Registered Posts: 5 ✭✭✭✭


The feature to expose Datasets seems good, but I can't think of a way to scale it.

Raw Data
- stored in s3
- high volume; ETL outputs will be partitioned by %Y/%M/%D/%H
- recipe engine used: Spark
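As a side note on the partitioning scheme: in DSS-style time partitioning patterns, %M and %D denote month and day (unlike strftime, where %M is minutes). A minimal sketch of how such hourly partition paths could be derived, assuming that convention:

```python
from datetime import datetime

def partition_path(ts: datetime) -> str:
    """Build an hourly partition path in the %Y/%M/%D/%H style above.

    In strftime syntax the equivalent is %Y/%m/%d/%H -- the %M/%D in
    the DSS-style pattern mean month and day, not minute / weekday.
    """
    return ts.strftime("%Y/%m/%d/%H")

# One folder per hour, zero-padded:
print(partition_path(datetime(2020, 3, 7, 9)))  # → 2020/03/07/09
```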

Use case
I have a project with multiple critical ETLs that output the organization's main Datasets.
I would like to expose these Datasets to my end users (business projects) so they can re-use them easily.

From the docs, Datasets can be exposed to other projects, but for each Dataset I have to expose it to every other project individually.
In the long term, with N business projects and N Datasets, maintaining these permissions is going to be painful.

Is there a way to change the permission control, e.g. to grant access per user group rather than per project?
(i.e. for each Dataset, I give access to groups of users)

Workaround
I can still output the datasets to S3 and let the end users create their own datasets in their projects.
Each project will have to create its own Datasets.
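If the workaround is used at scale, the consumer-side datasets only need to repeat the producer's storage settings: the same S3 connection, path, and format. A hedged sketch of the kind of settings involved (connection name and path are placeholders, and the exact field names may differ by DSS version):

```json
{
  "type": "S3",
  "params": {
    "connection": "my-s3-connection",
    "path": "/warehouse/main_dataset"
  },
  "formatType": "parquet"
}
```

Since the settings are identical in every consuming project, this is the part that lends itself to scripting rather than manual clicks.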

Best Answer

  • Clément_Stenac Dataiker, Dataiku DSS Core Designer, Posts: 753
    Answer ✓


    Sharing is indeed only based on projects and not groups. We'll be taking your feedback into account.

    Your workaround is likely the best solution indeed (i.e. recreate new datasets targeting the same location, in your other projects).

    We recommend using the Parquet format. When outputting CSV files, DSS defaults to writing them without column headers, which makes them more painful to reopen. While you can of course choose to create CSV files with headers, that option is incompatible with some of the optimized processing engines.
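    To illustrate the header issue in isolation (plain Python, not DSS-specific): a headerless CSV carries no column names, so every consumer must re-declare the schema out of band, whereas Parquet embeds the schema in the file itself.

    ```python
    import csv
    import io

    rows = [["alice", "30"], ["bob", "25"]]

    # Headerless CSV, as DSS writes by default: the column names are lost.
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)

    # Reading it back yields anonymous columns -- the consumer must know
    # separately that column 0 is "name" and column 1 is "age".
    reread = list(csv.reader(io.StringIO(buf.getvalue())))
    print(reread[0])  # → ['alice', '30']
    ```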

