Preventing users from accessing managed datasets as input

Options
VickeyC
VickeyC Registered Posts: 27 ✭✭✭✭

We are planning on using SQL Server tables as part of our managed datasets, setting the "schema name" field in the connection to the project key. This works well so far for managed datasets -- it's easy to identify all of the tables belonging to a particular project, and it prevents the users from inadvertently overwriting another project's dataset. The problem is, users who have access permissions to multiple projects can pull in managed datasets from other projects as input datasets.

Is there any way to prevent this? (e.g. a hook or a plugin?)


Operating system used: RHEL 7.9

Best Answer

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 411 Neuron
    Answer ✓
    Options
    and I manually created the schema on SQL Server because the connection seems to assume that it already exists.

    Yes, as far as I'm aware, you can't create schemas from a connection definition.

    I've to be honest, and I'm kind of lost on how this will help, at the end, to control the permissions of the tables created through a dataiku managed dataset. Even if you define a schema to write in the connection settings, this will not limit the capability of a user of the connection to read and access the tables from other schemas, if you only have one set of credentials to connect to SQL server.

Answers

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 411 Neuron
    Options

    Hello @VickeyC
    .

    I'm uncertain if I understand correctly the problem. What does it mean that other users can "pull in managed datasets from other projects as input datasets"?

    Is this related to the feature that allows to share datasets (managed datasets) between projects? Or users creating a new dataset in their projects that points to the managed dataset stored in the SQL Server?

    Cheers,

    Ignacio

  • VickeyC
    VickeyC Registered Posts: 27 ✭✭✭✭
    Options

    @Ignacio_Toledo
    let's say, for example, we have two projects (project A and project B), and that we have a data scientist (AKA Joe), who has read/write access to both of those projects. Inside of a project, Joe can select the managed dataset connection as an input dataset and he will be presented with a list of all the datasets created with that connection that he has permissions to use, which includes datasets from both projects A and B.

    Is there any way to prevent this from happening? I noticed that connections have places to add things like Post-connect commands and custom properties -- could I use one of those to limit a dataset's availability inside of projects? I haven't been able to find any documentation on how those fields can be used.

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 411 Neuron
    Options

    @VickeyC
    , I think there is not a single solution. I think it will depend on the connection's types: SQL, local folders or files, HDFS, etc.

    If the connection is to a SQL kind of database, as long as a user have access to the connection, the user will have access to all the tables within that database connection, even if within Dataiku your data scientist has only access to project A and not B, or vice versa. The SQL connections handle their own permissions and credentials.

    Let me focus on this case, assuming that your dataset connections are to SQL dbs, what is your main concern if Joe uses as input dataset a table that is created by other user employing a managed dataset?

    Is it a security problem? Or is it a concern with the possibility that in the future a workflow can be broken because of changes in the table created with a managed dataset of a different project? Or that datasets might be overwritten?

    If it is a security problem, I think this documentation could help you: https://doc.dataiku.com/dss/10.0/security/connections.html#securing-access-to-connections

    But let me know if I'm going in the appropriate direction, or I completely missed your use case.

    Best,

    Ignacio

  • VickeyC
    VickeyC Registered Posts: 27 ✭✭✭✭
    Options

    @Ignacio_Toledo
    Here is my concern: In the scenario I described, users can select an existing managed dataset belonging to another project as a source dataset (AKA an external dataset), exposing the data to other project members who otherwise may or may not have permission to view the data.

    The connection I'm using for managed datasets is for SQL Server. As part of the setup, I created a database schema name on the SQL Server database to match the project key, and I added ${projectKey} to the to the schema field in the connection. Whenever someone uses the connection to create a new managed dataset, the table is created as ${projectKey}.<managed dataset name>. I had hoped that the connection would also only display datasets where the schema name on the database matches the ${projectKey}, but that's not the case. Any idea on how I could make that happen?

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker
    Options

    Hi @VickeyC
    ,

    Just to add to what @Ignacio_Toledo
    mentioned if a user has access to a connection they will be able to read/use any tables as input datasets if credentials set on the connection can access those tables.

    You may be able to opt to use per-user credentials: https://doc.dataiku.com/dss/10.0/security/connections.html#per-user-credentials-for-connections

    Per-user login allows the database to know the exact identity of the user performing the request thus it can enforce granular permission on tables/schema set in the database.

    Settings permissions would be done directly in your database and not by DSS when creating the managed datasets.

    Thanks,

  • VickeyC
    VickeyC Registered Posts: 27 ✭✭✭✭
    Options

    @AlexT
    Thanks for the clarification. So if I created a separate userid for each project on SQL Server, (and a separate connection for each project), would that solve my problem? I really don't want to use per-user connections for the managed datasets.

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 411 Neuron
    Options

    @VickeyC
    , I was going to post more or less the same information @AlexT
    already shared with you. With that solution, you still have a single connection in Dataiku, but your SQL Server administrator will need to create the users in their side, and then keep track of what schemas or tables a user will have access to.

    I was thinking that this might look as a shortcoming on the Dataiku side, but Dataiku just makes easy to connect to the data sources, but it doesn't provide its own managed database. What you need could only be fully achieved if somehow dataiku would become the administrator or provider of its own database solution.

    But that is just what, I think, is happening. From my point of view as administration, I'd love to control all the security setting of our different data sources and data warehouse from a single place in dataiku!

    Hopefully other users or dataikers will have some other options to offer.

    Cheers

  • VickeyC
    VickeyC Registered Posts: 27 ✭✭✭✭
    Options

    @Ignacio_Toledo
    @AlexT
    Here's where I'm at right now: I added the ${projectKey} to the schema name in the connection, and I manually created the schema on SQL Server because the connection seems to assume that it already exists. Then I set up an application role in SQL Server using the project key as both the role name and the default schema name. If you're unfamiliar with the application role, you can read about it here: https://docs.microsoft.com/en-us/sql/relational-databases/security/authentication-access/application-roles?view=sql-server-ver15

    Next I gave the application role some database permissions (e.g. create table, insert, update, etc). I also added this SQL Server code to the "Post Connect Commands" section in my connection: EXEC sp_setapprole '<a hard-coded application role name>', '<the password assigned to the application role>'; This setup allowed me to create the table in SQL Server with the ${projectKey} as the schema name.

    Now for the next challenge: How can I use a Dataiku variable in the "Post Connect Commands" section? I would really like to replace the hard-coded application name with the ${projectKey} name if possible.

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker
    Options

    ​ For variable expansion within the post-connect command, the variables must be global variables defined under Administration > Settings > Variables. Project level variables will not work, so in your instance ${projectKey} would not be available.

  • VickeyC
    VickeyC Registered Posts: 27 ✭✭✭✭
    Options

    @AlexT
    Dang! I wish I could use a project variable.

    Thanks for all your help

Setup Info
    Tags
      Help me…