List of all projects and their respective tables in Dataiku
Hi Team,
Is there any way to extract the list of projects and their respective tables from Dataiku?
Thanks in Advance
Best Answer
-
Hi,
The exception "Dataset cannot be used: Declare it as input or output of your recipe" means that you are trying to write to a dataset that was not declared as an output of the recipe. To fix it, add the dataset as an output of the recipe; please refer to the example below:
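As a minimal sketch of what that looks like in the recipe code (the dataset name "my_output" is hypothetical; the dataset must also be added in the recipe's Inputs/Outputs settings in the DSS UI):

```python
import dataiku
import pandas as pd

# Build some result to write out
df = pd.DataFrame({"project": ["PROJ_A"], "dataset": ["customers"]})

# "my_output" must be declared as an output of this recipe in the
# recipe's Inputs/Outputs tab, otherwise DSS raises the
# "Dataset cannot be used" exception shown above.
output_dataset = dataiku.Dataset("my_output")
output_dataset.write_with_schema(df)
```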
Additionally, some datasets may have no "versionTag", so to handle these cases I would recommend adding a try/except to your code, like below:
import dataiku, json, datetime
import re
from dataiku import pandasutils as pdu
import pandas as pd

client = dataiku.api_client()
projects = client.list_projects()
df = pd.DataFrame(columns=['project', 'dataset', 'table_name', 'lastModifiedBy', 'lastModifiedOn'])
for p in projects:
    proj = client.get_project(p["projectKey"])
    all_datasets = proj.list_datasets()
    for dataset in all_datasets:
        versionTag = dataset.get('versionTag')
        try:
            lastModifiedBy = versionTag['lastModifiedBy'].get('login')
            lastModifiedOn = datetime.datetime.fromtimestamp(versionTag['lastModifiedOn'] / 1e3)
        except (TypeError, KeyError):
            lastModifiedBy = ""
            lastModifiedOn = ""
        table_name = dataset['params'].get('table')
        projectKey = dataset.get('projectKey')
        if table_name is not None:
            table_name = (lambda a, b: re.sub(r'\$\{projectKey\}', a, b))(projectKey, table_name)
        df = df.append({'project': p["projectKey"], 'dataset': dataset.name, 'table_name': table_name,
                        'lastModifiedBy': lastModifiedBy, 'lastModifiedOn': lastModifiedOn}, ignore_index=True)
df = df.drop_duplicates()
Hope this helps.
-Best
Answers
-
Hi, if by tables you mean datasets, you can use the Dataiku API. Something like the following:
import dataiku, json
from dataiku import pandasutils as pdu
import pandas as pd

client = dataiku.api_client()
projects = client.list_projects()
df = pd.DataFrame(columns=['project', 'dataset'])
for p in projects:
    proj = client.get_project(p["projectKey"])
    all_datasets = proj.list_datasets()
    for dataset in all_datasets:
        df = df.append({'project': p["projectKey"], 'dataset': dataset.name}, ignore_index=True)
df = df.drop_duplicates()
print(df)
The above will create a data frame that you can save to a dataset if required.
Hope the above is what you were looking for.
-Best
-
sj0071992 Partner, Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2022, Neuron 2023 Posts: 131 Neuron
Hi,
This is really helpful.
But instead of the dataset name, can we get the table name behind that dataset? I need it for a join, and in some cases the project key is appended to the dataset name (table name).
Is there any way to extract that?
Thanks in Advance
-
Hi,
If you need to display table names for SQL-like datasets, you can get them from the dataset params. However, if you use variables in the name, it is saved as a string like "${projectKey}_dataset_name", so you will need to replace the variable yourself. I modified the code I shared earlier; please see below:
import dataiku, json
import re
from dataiku import pandasutils as pdu
import pandas as pd

client = dataiku.api_client()
projects = client.list_projects()
df = pd.DataFrame(columns=['project', 'dataset', 'table_name'])
for p in projects:
    proj = client.get_project(p["projectKey"])
    all_datasets = proj.list_datasets()
    for dataset in all_datasets:
        table_name = dataset['params'].get('table')
        projectKey = dataset.get('projectKey')
        if table_name is not None:
            table_name = (lambda a, b: re.sub(r'\$\{projectKey\}', a, b))(projectKey, table_name)
        df = df.append({'project': p["projectKey"], 'dataset': dataset.name, 'table_name': table_name}, ignore_index=True)
df = df.drop_duplicates()
print(df)
Hope this helps.
-Best
-
Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 319 Neuron
Hi @sj0071992,
Below is an alternative approach to the one @VitaliyD shared. It has the advantage that you don't have to replace variables yourself: variables other than ${projectKey} can appear in a table name, so it is nice to have such cases taken care of for you. It does involve two additional classes available through the Python APIs, dataikuapi.dss.dataset.DSSDataset and dataiku.Dataset, which may be a bit confusing. Documentation for these two classes is here: https://doc.dataiku.com/dss/latest/python-api/datasets-reference.html
The get_location_info method used here returns a number of other pieces of information as well (illustrated in the code sample). This information is not unique to this method, but the simplicity of the returned data structure makes it a bit more convenient than the other options.
import dataiku

client = dataiku.api_client()
proj = client.get_project('DKUTESTING')

# Returns instances of the dataikuapi.dss.dataset.DSSDataset class
dssdataset_objects = proj.list_datasets(as_type='objects')

for dssdataset_object in dssdataset_objects:
    dataset_name = dssdataset_object.name

    # Get the corresponding instance of the dataiku.Dataset class for the current dataset.
    # This class offers the get_location_info method, which returns the table name
    # with variables replaced.
    dataset_object = dssdataset_object.get_as_core_dataset()
    location_info = dataset_object.get_location_info()

    # Get table name, etc. for SQL-type datasets
    if location_info.get('locationInfoType') == 'SQL':
        table_name = location_info['info'].get('table')
        connection_name = location_info['info'].get('connectionName')
        database_type = location_info['info'].get('databaseType')

        print('Dataset name = {}'.format(dataset_name))
        print('  Table name = {}'.format(table_name))
        print('  Connection name = {}'.format(connection_name))
        print('  Database Type = {}'.format(database_type))
        print('')
Marlan
-
sj0071992
Hi,
This is really helpful.
Just an add-on question: can we also extract the last modified by (user name) along with the table name?
Thanks in Advance
-
Marlan
Hi @sj0071992,
The following snippet can be added to the script to get both last modified by and last modified on:
version_info = dssdataset_object.get_settings().get_raw()['versionTag']
last_modified_by = version_info['lastModifiedBy']['login']
last_modified_on = datetime.datetime.fromtimestamp(version_info['lastModifiedOn'] / 1e3)
You'll need to add an import for the datetime package if you want last modified on. Most of the rest of the available information about datasets is available via dssdataset_object.get_settings().get_raw(). You can check the returned results to see what's there.
Marlan
-
sj0071992
Hi,
This is perfect, but there is one issue.
When I run it in a Python notebook it works perfectly, but when I convert it into a Python recipe it throws an exception:
"Dataset cannot be used: Declare it as input or output of your recipe"
Could you please help here as well.
Thanks in Advance
-
sj0071992
Hi @Marlan
The code is failing on the line below:
dataset_object = dssdataset_object.get_as_core_dataset()
It runs fine in a Jupyter notebook but fails when run as a Python recipe.
Error: "Dataset cannot be used: Declare it as input or output of your recipe"
Thanks in Advance
-
Marlan
Hi @sj0071992,
It makes sense that you can't run the provided script in a recipe: DSS requires that any use of the dataiku.Dataset class in a recipe matches the recipe's inputs and outputs.
See https://doc.dataiku.com/dss/latest/python-api/datasets.html for more information on the two classes you can use to interact with datasets programmatically. The solution I proposed used both classes because the dataiku.Dataset class has this nice method that returns the table name with all variables resolved. As you found, this works fine in a notebook but not in a recipe.
Since you can't meet that requirement for the dataiku.Dataset class in a recipe, you'll need to go back to using just the DSSDataset class from the dataikuapi package.
Thus you could go back to @VitaliyD's earlier solution without change. One small enhancement would be to handle replacement of any variable used in the table name, rather than just the projectKey variable.
Here is one way to do that which works nicely:
import dataiku
from string import Template

table_name_with_variables = '${projectKey}_${tier}_DATASETNAME'

# Get all project variables (including projectKey)
variables = dataiku.get_custom_variables()

# Replace any variables used with their values
table_name = Template(table_name_with_variables).substitute(variables)
The advantage of this approach is that it replaces any variables used with their values, and the values themselves are obtained automatically via the API. It uses the Template class from Python's built-in string package, which happens to use the same syntax for variables as DSS. I also use this a lot for replacing variables in SQL queries run through the API.
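To illustrate the SQL use case mentioned above (the query, variable names, and values here are made up for illustration; in DSS the variables dict would come from dataiku.get_custom_variables()):

```python
from string import Template

# A query template using DSS-style ${name} placeholders
sql_with_variables = (
    "SELECT * FROM ${projectKey}_${tier}_CUSTOMERS "
    "WHERE load_date > '${cutoff}'"
)

# In a real recipe these would come from dataiku.get_custom_variables()
variables = {"projectKey": "SALES", "tier": "PROD", "cutoff": "2022-01-01"}

# string.Template uses the same ${name} syntax as DSS variables
sql = Template(sql_with_variables).substitute(variables)
print(sql)
# SELECT * FROM SALES_PROD_CUSTOMERS WHERE load_date > '2022-01-01'
```

Template.substitute raises a KeyError if a placeholder has no corresponding variable, which is a useful safety check; use safe_substitute to leave unknown placeholders untouched instead.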
Marlan
-
Marlan
Hi @sj0071992,
Yes, the code I suggested earlier can be used with @VitaliyD's solution, as it uses the DSSDataset class.
Marlan
-
sj0071992
Hi @Marlan,
I am getting the same exception when I tried modifying @VitaliyD's code; below is the code for your reference. It would be great if you could have a look at it.

client = dataiku.api_client()
projects = client.list_projects()
df = pd.DataFrame(columns=['project', 'dataset', 'table_name', 'lastModifiedBy', 'lastModifiedOn'])
for p in projects:
    proj = client.get_project(p["projectKey"])
    all_datasets = proj.list_datasets()
    for dataset in all_datasets:
        versionTag = dataset.get('versionTag')
        lastModifiedBy = versionTag['lastModifiedBy'].get('login')
        lastModifiedOn = datetime.datetime.fromtimestamp(versionTag['lastModifiedOn'] / 1e3)
        table_name = dataset['params'].get('table')
        projectKey = dataset.get('projectKey')
        if table_name is not None:
            table_name = (lambda a, b: re.sub(r'\$\{projectKey\}', a, b))(projectKey, table_name)
        df = df.append({'project': p["projectKey"], 'dataset': dataset.name, 'table_name': table_name,
                        'lastModifiedBy': lastModifiedBy, 'lastModifiedOn': lastModifiedOn}, ignore_index=True)
df = df.drop_duplicates()
-
sj0071992
Hi @VitaliyD,
Could you please also help with extracting the schema name using the same snippet? I tried it myself, but the projects use a variable for the schema, so when I extracted the details I got '${schema}' rather than the actual name.
I am wondering how to extract these kinds of details.
Thanks in Advance
-
Hi @sj0071992,
If I understood the requirement correctly, you need to read the project variable and then use a lambda function similar to the one in my code to replace "${schema}" in the string with the variable's value. Something like below:
# Get the project variable
schemaVar = proj.get_variables()['standard']['schema']

# Replace the placeholder with the value
table_name = (lambda a, b: re.sub(r'\$\{schema\}', a, b))(schemaVar, table_name)
Hope this helps.
Best.
-
sj0071992
Hi @VitaliyD,
I get your point, but what if we don't know the variable name used in a Dataiku project for the schema?
How can we extract the details in that case?
Thanks in Advance
-
Hi @sj0071992,
In that case, you can extract the variable name using regex groups with something like the code below, and then, once you have the variable name, get its value and replace the placeholder.
import re

table_name = '${some_variable}/asdf/asdf'
result = re.search(r'(\$\{)(.*)(\})', table_name)
print(result.group(2))  # some_variable
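Building on that, the extraction and replacement can be done in one pass: re.sub accepts a function, so each ${...} placeholder can be looked up in a dict of variables. The variable names and values below are hypothetical; in DSS the dict would come from the project variables (e.g. proj.get_variables()['standard']).

```python
import re

# Hypothetical table name and project variables for illustration
table_name = "${projectKey}_${schema}_ORDERS"
project_variables = {"projectKey": "SALES", "schema": "ANALYTICS"}

# re.sub with a function: each ${name} match is replaced by its value
# from the variables dict, whatever the variable is called
resolved = re.sub(r"\$\{(\w+)\}",
                  lambda m: project_variables[m.group(1)],
                  table_name)
print(resolved)
# SALES_ANALYTICS_ORDERS
```

This way you don't need to know the variable names in advance, as long as every placeholder has an entry in the variables dict.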
Best,
Vitaliy
-
This works well.
Actually, my problem is to delete some of the tables zone by zone. Can you help me get all the tables from the respective flow zones?
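One possible starting point, sketched under the assumption that your dataikuapi version exposes the flow-zone API (get_flow(), list_zones(), and each zone's items attribute; check the documentation for your release). The project key is hypothetical:

```python
import dataiku

client = dataiku.api_client()
proj = client.get_project("MYPROJECT")  # hypothetical project key

# Walk the project flow zone by zone and list the items each zone contains
# (zone API assumed from recent dataikuapi versions)
flow = proj.get_flow()
for zone in flow.list_zones():
    print("Zone:", zone.name)
    for item in zone.items:
        print("  item:", item)
```

From there you could keep only the SQL datasets in the zones you care about and look up their table names with the params-based approach shown earlier in the thread.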