
Which API is faster?

Turribeach

Not a question but a knowledge-sharing post. It can be confusing at first to understand all the different APIs that you can use to interact with Dataiku. We have the "internal" API ("import dataiku"), which is supposed to be used when you are running things inside Dataiku. We also have the "external" API ("import dataikuapi"), which is meant to be used from outside Dataiku and is a wrapper around the REST API. So all in all we have 3 APIs: the internal one, the external one and the REST API itself. But crucially, the "internal" API can also be used to connect to an external DSS instance.

So I decided to do a quick test to see which API will be fastest, although of course I was always expecting the internal API to be the one:

 

 

import dataiku
import dataikuapi
import time

dataiku_url = 'https://dss_url/'
dataiku_api_key = 'API key'

external_client = dataikuapi.DSSClient(dataiku_url, dataiku_api_key)
# Skips TLS certificate verification; only do this against an instance you trust
external_client._session.verify = False

internal_client = dataiku.api_client()
dssVersion = 'v' + internal_client.get_instance_info().raw['dssVersion']

def benchmark_dataiku_client(client):

    project_list = client.list_project_keys()
    users = client.list_users()
    users_list = [i['login'] for i in users]

    for project_key in project_list:
        project = client.get_project(project_key)
        all_scenarios = project.list_scenarios()
        all_datasets = project.list_datasets()
        all_folders = project.list_managed_folders()
        all_models = project.list_saved_models()
        all_recipes = project.list_recipes()

start_time = time.time()
benchmark_dataiku_client(internal_client)
end_time = time.time()
print(dssVersion + " internal API client: " + str(round(end_time - start_time, 2)))

start_time = time.time()
benchmark_dataiku_client(external_client)
end_time = time.time()
print(dssVersion + " external API client: " + str(round(end_time - start_time, 2)))

dataiku.set_remote_dss(dataiku_url, dataiku_api_key)
internal_client_with_url = dataiku.api_client()

start_time = time.time()
benchmark_dataiku_client(internal_client_with_url)
end_time = time.time()
print(dssVersion + " internal API client using URL: " + str(round(end_time - start_time, 2)))

 

 

And here is the result:

 

 

v10.0.7 internal API client: 7.87
v10.0.7 external API client: 11.8
v10.0.7 internal API client using URL: 10.06

 

 

So no surprises there: the external API took about 50% longer than the internal one (11.8 s vs. 7.87 s). But it's interesting to see that the internal API pointed at a URL is still about 15% faster than the external API (10.06 s vs. 11.8 s).
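For anyone who wants to check the percentages, here is a small sketch that computes them from the three timings above. The helper names `pct_longer` and `pct_time_saved` are just illustrative, not part of any Dataiku API. The point is that "X% faster" depends on which run you take as the baseline.

```python
# Timings in seconds, copied from the results above.
internal = 7.87
external = 11.8
internal_with_url = 10.06


def pct_longer(slow, fast):
    """Extra time taken by the slow run, as a percentage of the fast run."""
    return (slow - fast) / fast * 100


def pct_time_saved(slow, fast):
    """Time saved by the fast run, as a percentage of the slow run."""
    return (slow - fast) / slow * 100


for label, fast in [("internal", internal), ("internal using URL", internal_with_url)]:
    print(
        f"external vs {label}: "
        f"{pct_longer(external, fast):.1f}% longer, "
        f"{pct_time_saved(external, fast):.1f}% time saved"
    )
```

Both figures describe the same pair of runs; quoting the baseline alongside the percentage avoids ambiguity.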

 

 

 

AlexT
Dataiker

Hi @Turribeach ,
From the DSS perspective the clients are all the same. If you print the type of the internal and external clients, you can see they are all <class 'dataikuapi.dssclient.DSSClient'>.

The difference in timing is likely due to the network overhead of going to localhost vs. an external hostname, which may not have a direct path and may go through a proxy, etc.
Testing on the Dataiku Cloud instance, the timings are very similar for me for all 3 tests:

v12.1.3 internal API client: 6.26
v12.1.3 external API client: 6.73
v12.1.3 internal API client using URL: 6.91
So the variance is quite insignificant. In some runs of my test the external client was even slightly faster, which is just pure coincidence based on the load at the time each of the parts was executed:

v12.1.3 internal API client: 8.09
v12.1.3 external API client: 7.43
v12.1.3 internal API client using URL: 8.0
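Since single runs are this sensitive to load, one way to reduce the noise is to repeat each measurement and compare the minimum or median rather than a single number. The sketch below is a generic harness and not part of the original test; `benchmark`, `summarize` and `fake_workload` are hypothetical names, and `fake_workload` only stands in for a real call such as `benchmark_dataiku_client(internal_client)`.

```python
import statistics
import time


def benchmark(fn, repeats=5):
    """Call fn() `repeats` times and return the elapsed time of each call in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(label, timings):
    """Format the min, median and max of a list of timings."""
    return (
        f"{label}: min={min(timings):.2f}s "
        f"median={statistics.median(timings):.2f}s "
        f"max={max(timings):.2f}s"
    )


def fake_workload():
    # Hypothetical stand-in for the real API calls being measured.
    time.sleep(0.01)


timings = benchmark(fake_workload, repeats=3)
print(summarize("stand-in workload", timings))
```

Inside DSS you would pass something like `lambda: benchmark_dataiku_client(internal_client)` as `fn`. Interleaving the clients across repeats, rather than running each one in a block, would also help spread any load spikes evenly.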

Turribeach
Author

Hi Alex, thanks for that follow up. We use nginx and an HTTPS reverse proxy so that may account for the differences between internal and external calls. Not sure what you guys have in the Dataiku Cloud. 

It's true that all the methods I chose belong to dataikuapi, but I wonder if there is also a difference with the dataiku API methods. For instance, dataiku.Dataset is dataiku.core.dataset.Dataset, and I could get a dataset list using dataiku.Dataset.list() instead of project.list_datasets() from dataikuapi. So I wonder what difference there is between those methods.

 
