Change source dataset in flow project.

Grixis
Level 2

While migrating several projects from an Oracle data warehouse to PostgreSQL, we have to import the projects from one instance to another and organize the connection remapping to make it as painless as possible.

First we downloaded a zipped export of each project from our Oracle-compatible DSS instance and imported it onto our PostgreSQL-compatible DSS instance.

For your information, we created some placeholder Oracle connections on the target instance so the flows don't break on import, and so we can select datasets and change connections as we see fit.


We're finding, however, that remapping the input datasets is a real difficulty: the Oracle-type input tables of our projects have to switch to PostgreSQL-type connections. We've remapped all the other elements without a hitch (intermediate datasets, etc.), but for this one we know a direct remap is not possible.

This leads to a series of steps that can quickly become tedious for migration teams (a scripted sketch of step 2 follows the list). For example:

Say you have an Oracle input table named "ALIM" and you want to replace it with its new PostgreSQL equivalent.
That takes four steps:
1) Import the equivalent Postgrel table as a new dataset with a name other than "ALIM", such as "ALIM_BIS".
2) Go to every recipe that depends on "ALIM" and substitute "ALIM_BIS" as its input.
3) Go to all scenarios and other objects calling the "ALIM" input table and replace it with "ALIM_BIS".
4) Drop the Oracle dataset "ALIM", whose underlying table no longer exists.
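As an illustration, step 2 at least can be scripted with the public dataikuapi package rather than clicked through. This is a minimal sketch, assuming a placeholder host, API key and project key; the method names (list_recipes, get_settings, get_flat_input_refs, replace_input) come from the dataikuapi recipe API, but verify them against your DSS version:

```python
import dataikuapi

# Placeholder host, API key and project key
client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
project = client.get_project("MY_PROJECT")

OLD, NEW = "ALIM", "ALIM_BIS"

# Step 2: rewire every recipe that consumes ALIM so it reads ALIM_BIS.
# Within a single project, input refs are plain dataset names.
for item in project.list_recipes():
    recipe = project.get_recipe(item["name"])
    settings = recipe.get_settings()
    if OLD in settings.get_flat_input_refs():
        settings.replace_input(OLD, NEW)
        settings.save()
```

Step 3 (scenarios and other objects) would still need its own pass, but the same client/project handles apply.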

 

Why can't we simply switch the "ALIM" dataset from the Oracle connection to the PostgreSQL connection, at least nominally, to avoid these steps, and then adjust the dataset configuration and resynchronize to get the right schema?

 

We welcome any recommendations or advice on bypassing these four steps (at minimum) or on improving this process.

Turribeach

Different connections lead to different properties in the dataset, so switching the connection is not directly supported. One way to solve this is what you did, in several manual steps. But it's certainly possible to do this via the Dataiku Python API if you are willing to get your hands dirty. In a nutshell: create a dataset in the first connection technology and extract its JSON, then create the same dataset in the second connection technology and extract its JSON as well. Do a file compare to determine all the properties that need to change from one connection to the other, then write some API code that does the work for you.
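A minimal sketch of the dump-and-compare part, assuming two hand-created datasets named ALIM_ORACLE and ALIM_POSTGRES sitting on their respective connections (host, key and names are placeholders):

```python
import json
import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
project = client.get_project("MY_PROJECT")

# Dump the raw settings of the two hand-created datasets so a plain
# file diff reveals exactly which properties differ between the two
# connection technologies
for name in ("ALIM_ORACLE", "ALIM_POSTGRES"):
    raw = project.get_dataset(name).get_settings().get_raw()
    with open(f"{name}.json", "w") as fh:
        json.dump(raw, fh, indent=2, sort_keys=True)
```

Once the diff is known, the same get_settings()/save() pair can patch the original dataset in place.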

Grixis
Level 2
Author

Hello, thank you very much for your reply. That's what I suspected; I still have to explore the Python API to possibly build a remapping plugin, but I already know it won't be possible to retrieve datasets from an SSH connection to a filesystem this way.

I'll come back and comment on this post if I come up with something conclusive.

Turribeach

Why do you need to use SSH? It would be much better to use variables in connections than to try to manipulate datasets. For instance, if you mount your SSH connection using SSHFS, you can easily parametrise the Path prefix of the connection using a variable, which can be set at project level, at runtime, anywhere you want.
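For instance, a project-level variable can be set via the same Python API and then referenced in the connection's Path prefix as ${env_root}. A minimal sketch, where the variable name and value are made up:

```python
import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
project = client.get_project("MY_PROJECT")

# Project variables are plain JSON under "standard" (and "local");
# "env_root" is a hypothetical variable that the connection's
# Path prefix would reference as ${env_root}
variables = project.get_variables()
variables["standard"]["env_root"] = "/mnt/sshfs/prod"
project.set_variables(variables)
```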
