With many enterprise companies moving applications and data to the cloud, security becomes paramount. One of the unique principles of working with data in Dataiku is that the processing logic that acts upon a dataset is decoupled from its underlying storage infrastructure. This is accomplished through the fundamental concept of Connections. Admin users can manage connections on an instance from a centralized location. From there, they can control credentials, security settings, naming rules, and usage parameters. Connections can be created to establish data access to various sources such as SQL databases, cloud storage, and NoSQL databases.
Dataiku is an AWS advanced partner that offers many integration points with AWS products, including Amazon S3 for storing and accessing large amounts of structured or unstructured data. If you come from the world of AWS, then you are probably familiar with Identity and Access Management (IAM) roles and the part they play in limiting data access.
So you may be wondering, how does Dataiku leverage IAM roles? And what if you have users with multiple roles that need access to data within a bucket? In this article, I am going to break down how Dataiku implements IAM role-based access to your S3 data.
How IAM Roles Are Used in Dataiku S3 Connections
To get started, Dataiku needs an instance profile created within your AWS account. The Dataiku instance acts on behalf of a Dataiku user when calling the associated AWS service. In this case, the Dataiku instance will assume a role that has access to the S3 bucket.
The instance profile then requests temporary credentials for the target role from AWS Security Token Service (STS). Once STS verifies that your instance profile is allowed to perform the AssumeRole action and is trusted by the target role, you’re in business! STS returns temporary security credentials, and you can now interact with the S3 bucket using these credentials and their associated permissions.
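The AssumeRole exchange described above can be sketched outside of Dataiku with boto3. This is only an illustration of the AWS mechanism, not Dataiku's internal implementation: the helper name, session name, account ID, and bucket name are hypothetical, and the STS client is passed in as an argument so the helper can be tried without AWS access.

```python
from typing import Any, Dict


def assume_role_credentials(sts_client: Any, role_arn: str, session_name: str) -> Dict[str, str]:
    """Ask STS for temporary credentials for role_arn.

    Returns keyword arguments that can be passed to boto3.client("s3", **kwargs).
    """
    response = sts_client.assume_role(RoleArn=role_arn, RoleSessionName=session_name)
    creds = response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }


# Usage on a machine whose instance profile is trusted by the target role
# (requires boto3; the account ID and bucket name are placeholders):
#
#   import boto3
#   kwargs = assume_role_credentials(
#       boto3.client("sts"),
#       "arn:aws:iam::123456789012:role/se-sandbox-s3-access-role",
#       "dataiku-demo-session",
#   )
#   s3 = boto3.client("s3", **kwargs)
#   s3.list_objects_v2(Bucket="my-example-bucket")
```

If the instance profile is not allowed to call AssumeRole, or the target role does not trust it, the assume_role call fails with an access-denied error instead of returning credentials.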
Figure 1. Diagram on how Dataiku leverages the AWS AssumeRole mechanism
When setting up an S3 Connection in Dataiku, you have three methods to pass your AWS credentials to Dataiku:
AWS Keypair: AccessId + SecretId.
Environment: Use credentials from Environment variables.
STS with AssumeRole: Assume a role, with master credentials coming from the environment.
(Since we’re focusing on IAM Roles in this article, let’s focus on the STS with AssumeRole method.)
Figure 2. S3 Connection Configuration via Dataiku STS with AssumeRole method
Now you just need to create your S3 connection by selecting STS with AssumeRole as the credential type. You’ll need to enter the Amazon Resource Name (ARN) of the role that will interact with your S3 buckets (se-sandbox-s3-access-role).
Figure 3. se-sandbox-s3-access-role Permissions Policies via AWS IAM Identity Center
In AWS IAM Identity Center, we can see that the role has an IAM policy that defines the authorization to the S3 bucket. This can be combined with native AWS resource controls such as a resource policy. By attaching additional resource policies to your S3 buckets, you can enforce tighter access permissions, which carry through to Dataiku when you set up your connections.
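As a rough illustration of what such a permissions policy can look like, the sketch below builds an IAM identity policy document in Python. The bucket name, the optional prefix, and the chosen list of actions are assumptions made for the example; the actions your Dataiku connection actually needs depend on how it is used, so treat this as a starting point rather than the policy shown in the figure.

```python
import json
from typing import Any, Dict


def s3_bucket_policy(bucket: str, prefix: str = "") -> Dict[str, Any]:
    """Build an identity policy allowing listing and read/write under bucket/prefix."""
    list_statement: Dict[str, Any] = {
        "Effect": "Allow",
        "Action": ["s3:ListBucket"],
        "Resource": [f"arn:aws:s3:::{bucket}"],
    }
    if prefix:
        # Restrict listing to keys under the prefix only.
        list_statement["Condition"] = {"StringLike": {"s3:prefix": [f"{prefix}*"]}}
    return {
        "Version": "2012-10-17",
        "Statement": [
            list_statement,
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            },
        ],
    }


# "my-example-bucket" is a placeholder bucket name.
print(json.dumps(s3_bucket_policy("my-example-bucket"), indent=2))
```

Passing a prefix such as "sales/" narrows both the listing and the object permissions to that part of the bucket.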
Added Security Through Dataiku Groups
So what if you only want to make a connection usable by a select group of Dataiku users? Maybe you want to ensure that only Dataiku users with the associated IAM role can see and use your Dataiku S3 Connection. User groups are a great feature within Dataiku for securing connections. They allow Dataiku Administrators to control what actions a set of users can perform within Dataiku. When creating a connection, you can grant permissions to a select group of users, allowing them to use the connection inside their projects to create new datasets and, more generally, to “browse” the connection.
Figure 4. Setting group level access at the connection level
This is a great way to lock down connections and prevent a Dataiku user from leveraging a connection they shouldn’t have access to. Still, there may be circumstances where you want users to be able to pass their own IAM roles to connections in Dataiku. Suppose you are a user with multiple IAM roles (a very common practice for AWS customers), each granting access to specific files within a bucket, or to another bucket altogether. The solution is to dynamically read in the Admin Properties associated with a user.
Dynamically Leverage IAM Roles for a Dataiku User
Let’s create a scenario and use the following IAM roles in our example: sales and marketing. Suppose you have a bucket that can be accessed by users associated with either a sales or a marketing IAM role. Within that bucket, some files may be accessible ONLY by sales, and others only by marketing. Some users are associated with just one of these IAM roles, and some are associated with both. How might you set up your connections in Dataiku?
Step 1. Define the IAM roles associated with the Dataiku user in their Admin Property Settings.
Step 2. Create an S3 Connection for EACH IAM Role to be assumed.
Step 3. Leverage the Admin Property variable as the input for your “STS role to assume.”
Let’s take a look at this below:
Figure 5. Define the variable at the user level Admin Properties
Here we can see that we’ve associated two separate IAM Roles with the above user by setting a variable for each. This Dataiku user can now pass their IAM Role to an associated Dataiku Connection that leverages the corresponding variable.
Figure 6. Map the Admin Property variable to the connection level
In this example, we have created the connection s3-managed-sales using the STS with AssumeRole method. You can now pass the salesrole variable as the role for the Dataiku instance profile to assume. Note that if a user does not have salesrole defined in their Admin Properties, the user will run into a “Permission Denied” error when attempting to use the connection within Dataiku.
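To make the per-user behavior concrete, here is a small Python simulation of how a role placeholder on a connection could resolve differently for each user. It is not Dataiku's code: the `${adminProperty:salesrole}` notation is an assumption modeled on Dataiku's connection variable syntax, and the user properties, ARNs, and exception name are all hypothetical.

```python
import re
from typing import Dict

# Assumed placeholder notation; check your connection settings for the exact form.
_PLACEHOLDER = re.compile(r"\$\{adminProperty:([A-Za-z0-9_]+)\}")


class MissingAdminProperty(KeyError):
    """Raised when a connection references a property the user does not have."""


def resolve_role(template: str, admin_properties: Dict[str, str]) -> str:
    """Replace each placeholder in template with the user's admin property value."""

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in admin_properties:
            raise MissingAdminProperty(name)
        return admin_properties[name]

    return _PLACEHOLDER.sub(substitute, template)


# Hypothetical user with both roles defined in their Admin Properties.
alice = {
    "salesrole": "arn:aws:iam::123456789012:role/sales",
    "marketingrole": "arn:aws:iam::123456789012:role/marketing",
}
print(resolve_role("${adminProperty:salesrole}", alice))
```

A user whose properties contain only marketingrole would raise MissingAdminProperty for the same template, which mirrors the “Permission Denied” behavior described above.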
By defining your users’ IAM roles at the Admin Property level, you can rely more firmly on your IAM policies to control access to your S3 buckets. While Dataiku is a platform that cultivates collaboration, you can rest easy knowing that your existing governance policies will be respected by Dataiku. The best part of all this is that you can use our Python APIs to script and automate your configurations!
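As one possible starting point for that automation, the sketch below writes role ARNs into a user's Admin Properties. It assumes the dataikuapi client's get_user, get_settings, admin_properties, and save calls; verify them against the API documentation for your Dataiku version. The helper name, host, API key, login, and ARN are placeholders.

```python
from typing import Any, Dict


def set_user_role_properties(client: Any, login: str, roles: Dict[str, str]) -> Dict[str, str]:
    """Write role ARNs into a user's Admin Properties and save the settings."""
    settings = client.get_user(login).get_settings()
    # admin_properties is modified in place, then persisted with save().
    settings.admin_properties.update(roles)
    settings.save()
    return dict(settings.admin_properties)


# Usage with an admin API key (requires the dataikuapi package):
#
#   import dataikuapi
#   client = dataikuapi.DSSClient("https://dss.example.com", "ADMIN_API_KEY")
#   set_user_role_properties(
#       client,
#       "alice",
#       {"salesrole": "arn:aws:iam::123456789012:role/sales"},
#   )
```

Because the client is passed in, the same helper can be looped over a list of users exported from your identity provider.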