Hide API Keys in Project Library Editor
Hi, I have the following code in my project's Library Editor. It contains manually defined keys, which I do not want to be shared. Let me give more context.
I have a Python script that calls the following function and extracts my files from Confluence using
from langchain.document_loaders import ConfluenceLoader
It then creates a DataFrame of the extracted documents with their content and metadata, which I am using for an LLM chatbot.
My main issue is that I don't know where to declare these API keys, as we do not want them accessible to everyone. Any insight will be appreciated!
import os
import sys

# Env variables
OPEN_AI_API_KEY = 'KEY'
CONFLUENCE_SPACE_NAME = 'https://company-team-nvs.atlassian.net/wiki'
CONFLUENCE_API_KEY = 'KEY'
CONFLUENCE_SPACE_KEY = 'ChatbotTest'
CONFLUENCE_USERNAME = 'user@company.com'
PATH_NAME_SPLITTER = './splitted_docs.jsonl'
PERSIST_DIRECTORY = './db/chroma/'
EVALUATION_DATASET = '../data/evaluation_dataset.tsv'
Answers
-
louisbarjon Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 9 Dataiker
Hello,
You can define variables for the project: in the top-left menu, click on the three dots and choose Variables.
There are two sections, global variables and local variables.
They are used in almost the same way, except that local variables are not exported when you bundle the project, which is probably what you want here since this is a confidential API key.
To define a variable, use JSON:
{ "confluence_api_key": "key" }
Then you can use it in your Python code:
import dataiku

client = dataiku.api_client()
project = client.get_default_project()

# Retrieve the project variables as a dict whose two top-level keys are
# 'standard' (the project's global variables) and 'local' (the project's
# local variables)
project_variables = project.get_variables()

# You can now use the variable in Python code
CONFLUENCE_API_KEY = project_variables['local']["confluence_api_key"]
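As a rough sketch of how this could replace the hard-coded values from the question, the helper below builds the Confluence settings from the project variables. The variable names (confluence_url, confluence_username, confluence_api_key) and both function names are hypothetical, and it assumes the non-secret values live in the global section while the key lives in the local one.

```python
def confluence_settings(project_variables):
    """Build Confluence connection settings from DSS project variables.

    Non-secret values are read from the 'standard' (global) section and the
    API key from the 'local' section. All variable names are hypothetical.
    """
    standard = project_variables.get("standard", {})
    local = project_variables.get("local", {})
    if "confluence_api_key" not in local:
        raise KeyError(
            "Define 'confluence_api_key' in the project's local variables"
        )
    return {
        "url": standard["confluence_url"],
        "username": standard["confluence_username"],
        "api_key": local["confluence_api_key"],
    }


def load_settings_from_dss():
    # Imported lazily so confluence_settings() can be exercised outside DSS.
    import dataiku

    project = dataiku.api_client().get_default_project()
    return confluence_settings(project.get_variables())
```

Inside DSS, the result could then be passed on with something like ConfluenceLoader(**load_settings_from_dss()), assuming the loader accepts url, username and api_key keyword arguments in your LangChain version.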
More information in this documentation
However, project variables can still be read by other users who have access to the project. If that is a concern, you should have a look at the documentation about user secrets.
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,160 Neuron
So this is an area that I think Dataiku needs to improve. The built-in way to hide these secrets is using User Secrets. The issue with user secrets in Dataiku is that they can't be shared. This means you either need to enter your secrets for every user who needs them (which is an admin/user overhead) or use dedicated generic "runner" user accounts that hold the secrets (which requires a separate license for each runner account). There is an enhancement request to add shared secrets in the Product Ideas part of the community site. You should vote for it if that's something you want to see added to Dataiku.
But even user secrets don't fully protect the secret. As shown by the Python code sample in the documentation I linked above, user secrets can be retrieved unencrypted by the user who has them stored in their account when that user runs the Python code. If you have a Scenario set to run as user A, a malicious user B with write access to the scenario could easily add a Python scenario step to retrieve user A's secrets, as the scenario will run as user A. Furthermore, admin users can impersonate any other user when calling the Python API, so they can also see any other user's secrets.
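For reference, retrieving a user secret looks roughly like the sketch below, which follows the pattern from the user secrets documentation. The function names are hypothetical, and the 'secrets' list of key/value entries is the structure I expect get_auth_info(with_secrets=True) to return. Note that the value comes back in clear text for whichever user the code runs as.

```python
def find_secret(auth_info, secret_key):
    """Return the clear-text value of a user secret from an auth-info dict."""
    for secret in auth_info.get("secrets", []):
        if secret["key"] == secret_key:
            return secret["value"]
    raise KeyError("No user secret named %r for the current user" % secret_key)


def get_user_secret(secret_key):
    # Imported lazily so find_secret() can be exercised outside DSS.
    import dataiku

    # with_secrets=True asks DSS to include the calling user's secrets.
    auth_info = dataiku.api_client().get_auth_info(with_secrets=True)
    return find_secret(auth_info, secret_key)
```

In a scenario that runs as user A, any step that calls get_user_secret() receives user A's secrets, which is exactly the exposure described above.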
Another way you could approach this is by storing your secrets in a secrets manager. All the major clouds provide such a service, and there are also third-party tools that do this (see HashiCorp Vault). Having said that, using a third-party service moves the problem outside of Dataiku but still leaves you with the question of how to authenticate securely to the secrets service. Using an API key to authenticate to the secrets manager obviously leaves you in the same situation you are in now: how do you secure that key? In cloud environments it's sometimes possible to authenticate using "cloud default credentials" without an actual key or password. This works well when it's available in your environment, but in general it doesn't allow sharing or segmenting different kinds of secrets, because it is based on the identity/role of the VM running the Dataiku instance, which means you only have a single identity.
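To make the "cloud default credentials" idea concrete, here is a minimal sketch that assumes AWS Secrets Manager and the boto3 library, neither of which is implied by the original question. The fetch_secret helper and the secret name are hypothetical. With no explicit key passed, boto3 falls back to its default credential chain, such as the role attached to the VM running Dataiku, and it still needs a region to be configured.

```python
def fetch_secret(secret_id, client=None):
    """Fetch a string secret from AWS Secrets Manager.

    If no client is given, one is created with boto3's default credential
    chain, so no API key appears in the code.
    """
    if client is None:
        # Imported lazily; boto3 must be installed in the code environment.
        import boto3

        client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    # Assumes the secret was stored as text rather than binary.
    return response["SecretString"]
```

A call such as fetch_secret("chatbot/confluence_api_key") would then work for any code running on that VM, which illustrates the single-identity limitation: everyone who can run code on the instance can read the same secrets.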
Another approach you can take is to store your secrets in a database. To follow best practice you would need to encrypt the secrets rather than store them in clear text. In that setup you keep the decryption key in code and set permissions on the database connection so that only people who need access to the secret can use it. This option has the advantage that you can use Dataiku's security groups to control access and "share" the secrets.
Ultimately, what you need to understand is that these secrets have to be available in clear text during code execution. This means users can capture them if they are able to modify the code that runs in Dataiku and uses these secrets. An analogy is an application that is developed by a team of developers, with the code committed to Git. Any malicious developer can commit code that extracts the secrets used by the application. So if you don't trust your developers, or your Dataiku users in this case, you have a bigger problem.