Discover this year's submissions to the Dataiku Frontrunner Awards and give kudos to your favorite use cases and success stories!READ MORE

Preventing users from accessing managed datasets as input

Solved!
VickeyC
Level 3
Preventing users from accessing managed datasets as input

We are planning on using SQL Server tables as part of our managed datasets, setting the "schema name" field in the connection to the project key.   This works well so far for managed datasets -- it's easy to identify all of the tables belonging to a particular project, and it prevents the users from inadvertently overwriting another project's dataset.  The problem is, users who have access permissions to multiple projects can pull in managed datasets from other projects as input datasets.

Is there any way to prevent this?  (e.g. a hook or a plugin?)  


Operating system used: RHEL 7.9

0 Kudos
1 Solution
Ignacio_Toledo
and I manually created the schema on SQL Server because the connection seems to assume that it already exists. 

 Yes, as far as I'm aware, you can't create schemas from a connection definition.

I've to be honest, and I'm kind of lost on how this will help, at the end, to control the permissions of the tables created through a dataiku managed dataset. Even if you define a schema to write in the connection settings, this will not limit the capability of a user of the connection to read and access the tables from other schemas, if you only have one set of credentials to connect to SQL server.

View solution in original post

0 Kudos
11 Replies
Ignacio_Toledo

Hello @VickeyC.

I'm uncertain if I understand correctly the problem. What does it mean that other users can "pull in managed datasets from other projects as input datasets"?

Is this related to the feature that allows to share datasets (managed datasets) between projects? Or users creating a new dataset in their projects that points to the managed dataset stored in the SQL Server?

Cheers,

Ignacio

0 Kudos
VickeyC
Level 3
Author

@Ignacio_Toledo let's say, for example, we have two projects (project A and project B), and that we have a data scientist (AKA Joe), who has read/write access to both of those projects.  Inside of a project, Joe can select the managed dataset connection as an input dataset and he will be presented with a list of all the datasets created with that connection that he has permissions to use, which includes datasets from both projects A and B.  

Is there any way to prevent this from happening?   I noticed that connections have places to add things like Post-connect commands and custom properties -- could I use one of those to limit a dataset's availability inside of projects?  I haven't been able to find any documentation on how those fields can be used.

0 Kudos
Ignacio_Toledo

@VickeyC, I think there is not a single solution. I think it will depend on the connection's types: SQL, local folders or files, HDFS, etc.

If the connection is to a SQL kind of database, as long as a user have access to the connection, the user will have access to all the tables within that database connection, even if within Dataiku your data scientist has only access to project A and not B, or vice versa. The SQL connections handle their own permissions and credentials.

Let me focus on this case, assuming that your dataset connections are to SQL dbs, what is your main concern if Joe uses as input dataset a table that is created by other user employing a managed dataset?

Is it a security problem? Or is it a concern with the possibility that in the future a workflow can be broken because of changes in the table created with a managed dataset of a different project? Or that datasets might be overwritten?

If it is a security problem, I think this documentation could help you: https://doc.dataiku.com/dss/10.0/security/connections.html#securing-access-to-connections

But let me know if I'm going in the appropriate direction, or I completely missed your use case.

Best,

Ignacio

 

VickeyC
Level 3
Author

@Ignacio_Toledo Here is my concern:  In the scenario I described, users can select an existing managed dataset belonging to another project as a source dataset (AKA an external dataset), exposing the data to other project members who otherwise may or may not have permission to view the data.   

The connection I'm using for managed datasets is for SQL Server.  As part of the setup, I created a database schema name on the SQL Server database to match the project key, and I added ${projectKey} to the to the schema field in the connection.  Whenever someone uses the connection to create a new managed dataset, the table is created as ${projectKey}.<managed dataset name>.   I had hoped that the connection would also only display datasets where the schema name on the database matches the ${projectKey}, but that's not the case.   Any idea on how I could make that happen? 

0 Kudos
AlexT
Dataiker
Dataiker

Hi @VickeyC ,

Just to add to what @Ignacio_Toledo mentioned if a user has access to a connection they will be able to read/use any tables as input datasets if credentials set on the connection can access those tables. 

You may be able to opt to use per-user credentials:  https://doc.dataiku.com/dss/10.0/security/connections.html#per-user-credentials-for-connections

 Per-user login allows the database to know the exact identity of the user performing the request thus it can enforce granular permission on tables/schema set in the database. 

Settings permissions would be done directly in your database and not by DSS when creating the managed datasets. 

Thanks,

VickeyC
Level 3
Author

@AlexT Thanks for the clarification.  So if I created a separate userid for each project on SQL Server, (and a separate connection for each project), would that solve my problem?  I really don't want to use per-user connections for the managed datasets.

0 Kudos
Ignacio_Toledo

@VickeyC, I was going to post more or less the same information @AlexT already shared with you. With that solution, you still have a single connection in Dataiku, but your SQL Server administrator will need to create the users in their side, and then keep track of what schemas or tables a user will have access to.

I was thinking that this might look as a shortcoming on the Dataiku side, but Dataiku just makes easy to connect to the data sources, but it doesn't provide its own managed database. What you need could only be fully achieved if somehow dataiku would become the administrator or provider of its own database solution.

But that is just what, I think, is happening. From my point of view as administration, I'd love to control all the security setting of our different data sources and data warehouse from a single place in dataiku!

Hopefully other users or dataikers will have some other options to offer.

Cheers

0 Kudos
VickeyC
Level 3
Author

@Ignacio_Toledo @AlexT  Here's where I'm at right now:   I added the ${projectKey} to the schema name in the connection, and I manually created the schema on SQL Server because the connection seems to assume that it already exists.   Then I set up an application role in SQL Server using the project key as both the role name and the default schema name.   If you're unfamiliar with the application role, you can read about it here:  https://docs.microsoft.com/en-us/sql/relational-databases/security/authentication-access/application...

Next I gave the application role some database permissions (e.g. create table, insert, update, etc).  I also added this SQL Server code to the "Post Connect Commands" section in my connection: EXEC sp_setapprole '<a hard-coded application role name>', '<the password assigned to the  application role>';  This setup allowed me to create the table in SQL Server with the ${projectKey} as the schema name.  

Now for the next challenge:  How can I use a Dataiku variable in the "Post Connect Commands" section?   I would really like to replace the hard-coded application name with the ${projectKey} name if possible.

0 Kudos
Ignacio_Toledo
and I manually created the schema on SQL Server because the connection seems to assume that it already exists. 

 Yes, as far as I'm aware, you can't create schemas from a connection definition.

I've to be honest, and I'm kind of lost on how this will help, at the end, to control the permissions of the tables created through a dataiku managed dataset. Even if you define a schema to write in the connection settings, this will not limit the capability of a user of the connection to read and access the tables from other schemas, if you only have one set of credentials to connect to SQL server.

0 Kudos
AlexT
Dataiker
Dataiker

​ For variable expansion within the post-connect command, the variables must be global variables defined under Administration > Settings > Variables. Project level variables will not work, so in your instance ${projectKey} would not be available.

 

0 Kudos
VickeyC
Level 3
Author

@AlexT Dang!  I wish I could use a project variable.

Thanks for all your help

0 Kudos

Labels

?
Labels (1)

Setup info

?
A banner prompting to get Dataiku