What is the difference between the dataiku and the dataikuapi packages?
From what I could see, both seem to offer the same functionality when I used them outside the Dataiku web interface. I see documentation for both packages, but the two sets of documentation seem to conflict.
For example, in this page section, the illustrative code imports dataiku, but just below it the text says that a dataikuapi.DSSClient object is created.
This page says the dataiku package is used to create a client from inside DSS, while dataikuapi is used to create a client from outside DSS. But using this method I am able to use the dataiku package outside DSS too. So which one should I prefer?
I need to create modules outside the DSS to make Snowflake connections and run an entire ML cycle from outside the DSS.
Operating system used: Windows
Best Answer
From outside DSS, you should favor the dataikuapi package, which is designed for outside usage first.
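As a rough illustration of what that looks like from a machine outside DSS, here is a minimal sketch. The host URL and API key are placeholders, and the helper names `connect` and `sorted_project_keys` are made up for this example; only `dataikuapi.DSSClient` and its `list_project_keys()` method come from the package itself.

```python
def connect(host, api_key):
    """Return a dataikuapi.DSSClient for a DSS instance reached over HTTP(S).

    `connect` is a hypothetical helper for this sketch, not part of dataikuapi.
    """
    # Imported lazily so the module loads even where the package is absent.
    # The package is published on PyPI as `dataiku-api-client`.
    import dataikuapi

    return dataikuapi.DSSClient(host, api_key)


def sorted_project_keys(client):
    """List the project keys visible to the API key, sorted for stable output."""
    return sorted(client.list_project_keys())


# Usage (placeholder values, replace with your own instance and key):
# client = connect("https://dss.example.com:11200", "YOUR_API_KEY")
# print(sorted_project_keys(client))
```

The same client object is the entry point for projects, datasets, connections and ML tasks, so an external module would normally build on it.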
Answers
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,165 Neuron
The answer to this question is like setting your Facebook relationship status to "it's complicated". In general terms, the advice from the client page you linked should be followed: from inside DSS use the internal client, and from outside DSS use the dataikuapi package. But the devil is in the detail. While the two clients attempt to be similar, they are certainly not the same. Some APIs are available on only one client. Some APIs differ slightly or return different data structures. This page covers some of those differences, but you can always see the differences by looking at the full Python API documentation and checking each of the methods in the two sets of classes: internal is dataiku.* and external is dataikuapi.*. Have a go and see how many differences you can find!
A good example of this is the dataiku.Dataset.get_dataframe() method, which only exists on the dataiku internal API. So in some cases you are forced to use the internal API with the method you linked. Sometimes the internal client is preferable since it's easier to install offline. Be very careful with the dataiku internal client when calling multiple different servers and instantiating multiple dataiku clients, as it is not multi-client safe. You must call dataiku.clear_remote_dss() before trying to instantiate a new client/DSS server. So if you are connecting to multiple DSS environments to do things like migrations or environment comparisons, always try to use the dataikuapi package (if possible), even if you are running the code from a Dataiku recipe/scenario/notebook.
A good reason to use the outside DSS client is when you have DSS instances on multiple versions and you need to make sure you use a compatible dataikuapi version. Using Python code environments, either outside DSS or inside DSS, allows you to easily maintain multiple versions of the dataikuapi package and connect to each environment using the correct one.
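The clear-before-reconnect rule can be sketched as a small context manager. This is only one way to wrap it: `remote_dss` and its `dku` parameter are hypothetical names for this example, and the URLs, keys and dataset names are placeholders. The calls it relies on, `dataiku.set_remote_dss()`, `dataiku.clear_remote_dss()` and `dataiku.Dataset(...).get_dataframe()`, are the internal-client functions discussed above.

```python
from contextlib import contextmanager


@contextmanager
def remote_dss(url, api_key, dku=None):
    """Point the internal `dataiku` client at one remote DSS, then reset it.

    `dku` lets a caller pass the module (or a stand-in) explicitly; by default
    the real `dataiku` package is imported on first use.
    """
    if dku is None:
        import dataiku as dku

    dku.set_remote_dss(url, api_key)
    try:
        yield dku
    finally:
        # Always clear, even on error, so the next connection starts clean.
        dku.clear_remote_dss()


# Usage (placeholder values): read the same dataset from two instances in turn.
# with remote_dss("https://dss-dev.example.com:11200", "DEV_KEY") as dku:
#     dev_df = dku.Dataset("my_dataset", project_key="MYPROJECT").get_dataframe()
# with remote_dss("https://dss-prod.example.com:11200", "PROD_KEY") as dku:
#     prod_df = dku.Dataset("my_dataset", project_key="MYPROJECT").get_dataframe()
```

Because the remote settings are module-level state, this pattern only keeps sequential connections tidy; it does not make the internal client safe to use against two servers at the same time.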
Also keep in mind there might be performance considerations between the internal and external clients. See this thread.
Finally, since dataikuapi is a client for Dataiku's public REST API, this brings a third Dataiku API into the mix: the raw outside-DSS Dataiku REST API. This API comes in handy when trying to integrate with other systems that don't support Python but can use plain old REST calls. It's not the most user-friendly API, but it can certainly get you there when the other side only supports REST.
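To give a feel for the raw REST route, here is a minimal sketch using only the Python standard library. Treat the details as assumptions to verify against the public API reference for your DSS version: it assumes the project list lives at `/public/api/projects/` and that the API key is sent as the HTTP Basic username with an empty password. The function names and the example URL are placeholders invented for this sketch.

```python
import base64
import json
import urllib.request


def build_list_projects_request(base_url, api_key):
    """Build (but do not send) a GET request for the project list.

    Assumption: endpoint path and Basic-auth scheme as described in the lead-in.
    """
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    url = base_url.rstrip("/") + "/public/api/projects/"
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def list_projects(base_url, api_key):
    """Send the request and decode the JSON response body."""
    request = build_list_projects_request(base_url, api_key)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage (placeholder values):
# projects = list_projects("https://dss.example.com:11200", "YOUR_API_KEY")
```

Any system that can issue an authenticated HTTP GET can do the equivalent, which is the whole appeal of this third option.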
Bonus track! Fourth API...
Did you think 3 API stacks were enough? Of course not! There is also the "private" server-side REST API that the browser uses. This API is private and not meant to be used by developers, but we have used it before when there was no public API available. This API is obviously undocumented and risky to use, since it could change at any time, and it is also unsupported. To use it you need to "log in" to DSS to get a cookie, then pass it along with the XSRF token, and you can issue calls like your browser does. A sample API endpoint is: /dip/api/admin/connections/list.