Added on September 19, 2023 11:58AM
Not a question but a knowledge-sharing post. It can be confusing at first to understand all the different APIs that you can use to interact with Dataiku. There is the "internal" API ("import dataiku"), which is meant to be used when you are running code inside Dataiku. There is also the "external" API ("import dataikuapi"), which is meant to be used from outside Dataiku and is a wrapper around the REST API. So, counting the REST API itself, we have 3 APIs in all. Crucially, though, the "internal" API can also be used to connect to an external DSS instance.
So I decided to do a quick test to see which API is fastest, although of course I expected it to be the internal API:
    import time

    import dataiku
    import dataikuapi

    dataiku_url = 'https://dss_url/'
    dataiku_api_key = 'API key'

    # External client: reaches DSS through its URL with an API key
    external_client = dataikuapi.DSSClient(dataiku_url, dataiku_api_key)
    external_client._session.verify = False  # skips TLS certificate verification

    # Internal client: uses the connection of the DSS instance running this code
    internal_client = dataiku.api_client()
    dssVersion = 'v' + internal_client.get_instance_info().raw['dssVersion']

    def benchmark_dataiku_client(client):
        # List users and projects, then list the main object types of every project
        project_list = client.list_project_keys()
        users = client.list_users()
        users_list = [i['login'] for i in users]
        for project_key in project_list:
            project = client.get_project(project_key)
            all_scenarios = project.list_scenarios()
            all_datasets = project.list_datasets()
            all_folders = project.list_managed_folders()
            all_models = project.list_saved_models()
            all_recipes = project.list_recipes()

    start_time = time.time()
    benchmark_dataiku_client(internal_client)
    end_time = time.time()
    print(dssVersion + " internal API client: " + str(round(end_time - start_time, 2)))

    start_time = time.time()
    benchmark_dataiku_client(external_client)
    end_time = time.time()
    print(dssVersion + " external API client: " + str(round(end_time - start_time, 2)))

    # Point the internal API at the same URL and API key as the external client
    dataiku.set_remote_dss(dataiku_url, dataiku_api_key)
    internal_client_with_url = dataiku.api_client()
    start_time = time.time()
    benchmark_dataiku_client(internal_client_with_url)
    end_time = time.time()
    print(dssVersion + " internal API client using URL: " + str(round(end_time - start_time, 2)))
And here is the result:
    v10.0.7 internal API client: 7.87
    v10.0.7 external API client: 11.8
    v10.0.7 internal API client using URL: 10.06
So no surprises there: the internal API is about 50% faster than the external one (7.87 s vs. 11.8 s). But it's interesting to see that the internal API pointed at a URL is still about 17% faster than the external API (10.06 s vs. 11.8 s).
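As a quick check of the arithmetic, here is a small sketch that derives the percentages from the timings printed above. The helper names (percent_faster, percent_less_time) are made up for this illustration. Note that "X% faster" (ratio of speeds) and "X% less time" (ratio of durations) give different numbers for the same pair of measurements.

```python
# Timings in seconds, copied from the v10.0.7 results above
timings = {
    "internal": 7.87,
    "external": 11.8,
    "internal_url": 10.06,
}


def percent_faster(fast, slow):
    """Speed gain of the faster run over the slower one, in percent."""
    return (slow / fast - 1) * 100


def percent_less_time(fast, slow):
    """Reduction in elapsed time of the faster run versus the slower one, in percent."""
    return (1 - fast / slow) * 100


for name in ("internal", "internal_url"):
    fast, slow = timings[name], timings["external"]
    print(
        f"{name} vs external: "
        f"{percent_faster(fast, slow):.1f}% faster, "
        f"{percent_less_time(fast, slow):.1f}% less time"
    )
```

With these numbers the internal client comes out at roughly 50% faster (33% less time), and the internal client using a URL at roughly 17% faster (15% less time).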
Hi @Turribeach,
From the DSS perspective, the clients are all the same. If you print the type of the internal and external clients, you can see they are all <class 'dataikuapi.dssclient.DSSClient'>.
The difference in timing is likely due to the network overhead of going to localhost vs. an external hostname, which may not have a direct path (going through a proxy, etc.).
Testing on the Dataiku Cloud instance, the timings are very similar for me for all 3 tests:
    v12.1.3 internal API client: 6.26
    v12.1.3 external API client: 6.73
    v12.1.3 internal API client using URL: 6.91
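Because network conditions can skew a single run, one option is to repeat each benchmark several times and compare medians rather than one-off timings. The sketch below is only an illustration and is not part of the original test: the benchmark helper and its repeats parameter are hypothetical names, and it uses only the Python standard library.

```python
import statistics
import time


def benchmark(fn, repeats=5):
    """Call fn() `repeats` times and summarise the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


# Stand-in workload so the sketch runs anywhere; inside DSS you would pass
# something like: lambda: benchmark_dataiku_client(internal_client)
summary = benchmark(lambda: sum(range(100_000)), repeats=5)
print(summary)
```

A large gap between min and max for the same client would suggest that the spread comes from the network path rather than from the client itself.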
Hi Alex, thanks for that follow-up. We use nginx and an HTTPS reverse proxy, so that may account for the differences between internal and external calls. Not sure what you guys have in the Dataiku Cloud.
It's true that the methods I chose all belong to dataikuapi, but I wonder if there is also a difference with the dataiku API methods. For instance, dataiku.Dataset is dataiku.core.dataset.Dataset, and I could get a dataset list using dataiku.Dataset.list() instead of project.list_datasets() from dataikuapi. So I wonder what difference there is between those methods.
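One way to explore that question would be to time both calls side by side. The following is an untested sketch, not a measured result: best_time and compare_dataset_listing are hypothetical helper names, and it assumes the code runs inside DSS (or after dataiku.set_remote_dss) so that dataiku.api_client() and dataiku.default_project_key() resolve to a project. The dataiku import is kept inside the function so the timing helper can be loaded anywhere.

```python
import timeit


def best_time(fn, repeats=5):
    """Best wall-clock time in seconds over `repeats` single calls of fn()."""
    return min(timeit.repeat(fn, number=1, repeat=repeats))


def compare_dataset_listing(repeats=5):
    """Time the two dataset-listing calls mentioned above (assumes a DSS context)."""
    import dataiku  # imported lazily: only available where the dataiku package is installed

    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())
    candidates = {
        "dataiku.Dataset.list()": dataiku.Dataset.list,
        "project.list_datasets()": project.list_datasets,
    }
    return {name: best_time(fn, repeats) for name, fn in candidates.items()}


# Inside DSS:
# for name, seconds in compare_dataset_listing().items():
#     print(f"{name}: {seconds:.3f}s")
```

Keep in mind that the two calls do not necessarily return the same thing (names versus richer dataset descriptions), so any timing gap may partly reflect the amount of data returned.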