How to quickly mass-change the database type of datasets for a project?
Thanks for your time firstly.
I am currently changing the type of database from 'Snowflake' to 'Redshift' (except for source node). While I could use 'Change connection' function at the bottom-right border of DSS, but it's quite robotic. Is there any efficient approach or python API that could modify the connection in a short time?
def mass_change_connection(project, orig_conn, dest_conn):
"""Mass change dataset connections in a project (filesystem connections only)"""
for dataset in project.list_datasets(as_type='objects'):
ds_settings = dataset.get_settings()
if ds_settings.type == 'Snowflake':
ds_settings
params = ds_settings.get_raw().get('params')
print(params)
current_connection = params.get('connection')
if current_connection == orig_conn:
params['connection'] = dest_conn
ds_settings.save()
I found the above function from previous Q&A. However it only change the name of the connection but not the type
of the database.
Thanks for your help.
Operating system used: Windows11
Best Answers
-
Ashley Dataiker, Alpha Tester, Dataiku DSS Core Designer, Registered, Product Ideas Manager Posts: 166 DataikerHi @Haoran ,
You might be able to use the 'Connections' Flow view to do this. Here's how:- On your Flow, click on 'Apply a view'. It's on the top left corner.
- Select Connections: this is going to color each dataset in your Flow according to the connection that it uses
- You can further filter the connections by clicking on any of the connection facets
- 'Select items': this is going to select all the datasets based on the connections highlighted in the view
- Right-click > Change Connection and move the datasets to a different connection. You'll need to rebuild them.
Cheers,
Ashley
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023, Circle Member Posts: 2,617 NeuronChanging the type can also be done although it requires more work since different connection types may have different parameters. Type can be changed here:
ds_settings.get_raw()['type']
The best way to do this is to look at two datasets created by Dataiku in the source and destination connections. Then compare the get_raw() output of each of them and figure out all the things you need to change. Then code and test and you should be good. Note each connection type will have different properties so the code will be source/destination specific.
-
It real make sense. Thanks for your share!😆
-
Here is my final solution code:
import dataiku
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
def get_project(client, project_key):
return client.get_project(project_key)
def get_datasets(client, project_key):
project = get_project(client, project_key)
return project.list_datasets()
def change_connection(project, orig_conn, dest_conn):
i = 0
for dataset in project.list_datasets(as_type='objects'):
ds_settings = dataset.get_settings()
raw = ds_settings.get_raw()
if raw.get("type") == "Snowflake" and raw.get("managed") == True:
params = raw.get('params')
current_connection = params.get('connection')
if current_connection == orig_conn:
raw['type'] = 'Redshift'
params['connection'] = dest_conn
ds_settings.save()
i += 1
print(f"Changed connections: {i}")
if __name__ == "__main__":
# DSS
url = "xxx"
api_key = "xxx"
# connections
orig_conn = "CONNECTION_ORIG"
dest_conn = "CONNECTION_DEST"
dataiku.set_remote_dss(url, api_key, no_check_certificate=True)
client = dataiku.api_client()
# project key
project_key = "HELLOWORLD"
print("project_key: " + project_key)
# project
project = client.get_project(project_key) # function call
change_connection(project, orig_conn, dest_conn)
Answers
-
That's pretty cool. Thank u Ashley!
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023, Circle Member Posts: 2,617 NeuronThanks for posting the solution, should help others!
PS: In Dataiku you don't really need to use "if __name__ == "__main__" in Recipes or Jupyter Notebooks since you never going to import them as modules so they will always have a __name__ = __main__.


