Scaling Exposed Datasets

Solved!
yinkit
Level 1

Hi,

The feature to expose Datasets looks good, but I can't think of a way to scale it.

Raw Data
- Stored in S3
- High volume; the ETL outputs will be partitioned by %Y/%M/%D/%H (see the sketch below)
- Recipe engine used: Spark
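
To illustrate what that layout means for a consumer, here is a purely hypothetical PySpark sketch of reading one hourly partition (bucket, prefix and file format are made up; inside DSS the Spark engine resolves the partition path itself):

```python
# Illustrative only: reading one hourly partition of the %Y/%M/%D/%H layout
# with plain PySpark. Bucket, prefix and format are hypothetical; in DSS the
# Spark engine resolves the partition path on its own.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-one-partition").getOrCreate()

# Partition for 2024/01/15, hour 03, of the main dataset
path = "s3a://my-data-lake/main_dataset/2024/01/15/03/"
df = spark.read.parquet(path)  # or spark.read.csv(path), depending on the output format
df.show(5)
```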

Use case
I have a project with multiple critical ETLs that output the organization's main Datasets.
I would like to expose these Datasets to my end users (business projects) so they can re-use them easily.

Solution
According to the doc (https://doc.dataiku.com/dss/latest/security/exposed-objects.html), Datasets can be exposed to other projects, but each Dataset has to be exposed to every consuming project individually.
In the long term, with N business projects and N Datasets, these permissions are going to be painful to maintain.

Question
Is there a way to change the permission control, e.g. to grant access on a user group rather than per project?
(i.e. for each Dataset, I grant access to groups of users)

Workaround
I can still output the Datasets to S3 and let the end users create their own Datasets in their projects.
Each project would have to create its own Datasets pointing at the same location.
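
For illustration, here is a rough, untested sketch of how I imagine scripting that with the public Python API (dataikuapi). The host, API key, project keys, connection name, path and the exact params keys are placeholders/assumptions to adapt:

```python
# Rough sketch (not tested): create, in each consumer project, a dataset that
# points at the same S3 location the ETL writes to. All names and the exact
# "params" keys are assumptions -- the safest way to get them right is to
# create one dataset by hand and copy its settings, e.g.
# project.get_dataset("main_dataset").get_settings().get_raw().
import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com:11200", "MY_API_KEY")

for project_key in ["BIZ_SALES", "BIZ_MARKETING"]:  # hypothetical consumer projects
    project = client.get_project(project_key)
    project.create_dataset(
        "main_dataset",
        type="S3",                        # assumed dataset type string
        params={
            "connection": "s3-datalake",  # hypothetical S3 connection name
            "path": "/main_dataset/",     # same prefix the ETL writes to
        },
        formatType="parquet",
        formatParams={},
    )
```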
WDYT?

Clément_Stenac

Hi,

Sharing is indeed only based on projects and not groups. We'll be taking your feedback into account.

Your workaround is indeed likely the best solution (i.e. create new datasets targeting the same location in your other projects).

We recommend that you use the Parquet format. When outputting CSV files, DSS defaults to writing them without column headers, which makes them more painful to reopen. You can of course choose to create CSV files with headers, but that is incompatible with some of the optimized processing engines.
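
For illustration only, a rough PySpark sketch of the difference (paths and names are made up; DSS writes the files itself):

```python
# Rough illustration (paths are hypothetical): why Parquet is easier to re-open
# than headerless CSV when other projects point datasets at the same files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-vs-csv").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# Parquet stores the column names and types in the files themselves.
df.write.mode("overwrite").parquet("s3a://my-data-lake/demo_parquet/")
spark.read.parquet("s3a://my-data-lake/demo_parquet/").printSchema()  # id: long, label: string

# Headerless CSV comes back as _c0, _c1 (all strings), so every consumer
# has to redeclare the schema by hand.
df.write.mode("overwrite").csv("s3a://my-data-lake/demo_csv/")
spark.read.csv("s3a://my-data-lake/demo_csv/").printSchema()          # _c0: string, _c1: string
```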

yinkit
Level 1
Author

Hi Clément,

Thank you for your reply and for the remark about the Parquet format; I was outputting CSV files.
I will go with the workaround for now.
