Which API is faster?

Turribeach
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,160 Neuron

Not a question but a knowledge-sharing post. The different APIs you can use to interact with Dataiku can be confusing at first. There is the "internal" API ("import dataiku"), which is meant to be used when you are running code inside Dataiku. There is also the "external" API ("import dataikuapi"), which is meant to be used from outside Dataiku and is a wrapper around the REST API. Counting the REST API itself, that makes 3 APIs. Crucially, though, the "internal" API can also be used to connect to an external DSS instance.

So I decided to run a quick test to see which API is fastest, although of course I was expecting the internal API to win:

import dataiku
import dataikuapi
import time

dataiku_url = 'https://dss_url/'
dataiku_api_key = 'API key'

external_client = dataikuapi.DSSClient(dataiku_url, dataiku_api_key)
# Disables TLS certificate verification for this client (test setup only)
external_client._session.verify = False

internal_client = dataiku.api_client()
dssVersion = 'v' + internal_client.get_instance_info().raw['dssVersion']

def benchmark_dataiku_client(client):
    """Run the same listing calls through `client`; results are discarded, only elapsed time matters."""
    # Instance-level calls
    project_list = client.list_project_keys()
    users = client.list_users()
    users_list = [i['login'] for i in users]

    # Project-level calls, repeated for every project on the instance
    for project_key in project_list:
        project = client.get_project(project_key)
        all_scenarios = project.list_scenarios()
        all_datasets = project.list_datasets()
        all_folders = project.list_managed_folders()
        all_models = project.list_saved_models()
        all_recipes = project.list_recipes()

start_time = time.time()
benchmark_dataiku_client(internal_client)
end_time = time.time()
print(dssVersion + " internal API client: " + str(round(end_time - start_time, 2)))

start_time = time.time()
benchmark_dataiku_client(external_client)
end_time = time.time()
print(dssVersion + " external API client: " + str(round(end_time - start_time, 2)))

# Point the internal package at the instance by URL, then request a new client
dataiku.set_remote_dss(dataiku_url, dataiku_api_key)
internal_client_with_url = dataiku.api_client()

start_time = time.time()
benchmark_dataiku_client(internal_client_with_url)
end_time = time.time()
print(dssVersion + " internal API client using URL: " + str(round(end_time - start_time, 2)))

And here is the result:

v10.0.7 internal API client: 7.87
v10.0.7 external API client: 11.8
v10.0.7 internal API client using URL: 10.06

So no surprises there: the external API took about 50% longer than the internal one (11.8 s vs. 7.87 s). But it's interesting to see that the internal API pointed at a URL still took about 15% less time than the external API (10.06 s vs. 11.8 s).
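The percentages depend on which run is taken as the baseline, so here is a small sketch of the arithmetic using the three v10.0.7 timings reported above. The helper names are made up for illustration; only the numbers come from the output.

```python
# Timings in seconds, copied from the v10.0.7 output above.
timings = {
    "internal": 7.87,
    "external": 11.8,
    "internal_url": 10.06,
}


def pct_longer(slow, fast):
    """How much longer `slow` took, as a percentage of `fast`."""
    return (slow - fast) / fast * 100


def pct_less_time(fast, slow):
    """How much less time `fast` took, as a percentage of `slow`."""
    return (slow - fast) / slow * 100


# External vs. internal
print(round(pct_longer(timings["external"], timings["internal"]), 1))     # 49.9
print(round(pct_less_time(timings["internal"], timings["external"]), 1))  # 33.3

# External vs. internal pointed at a URL
print(round(pct_longer(timings["external"], timings["internal_url"]), 1))     # 17.3
print(round(pct_less_time(timings["internal_url"], timings["external"]), 1))  # 14.7
```

So "50% faster" holds when the internal run is the baseline, while measured against the external run the internal client took about a third less time.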

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
    edited July 17

    Hi @Turribeach,
    From the DSS perspective the clients are all the same. If you print the type of the internal and external clients, you can see they are all <class 'dataikuapi.dssclient.DSSClient'>.

    The difference in timing is likely due to the network overhead of going to localhost vs. an external hostname, which may not have a direct path and may go through a proxy, etc.
    Testing on the Dataiku Cloud instance, the timings are very similar for me for all 3 tests:

    v12.1.3 internal API client: 6.26
    v12.1.3 external API client: 6.73
    v12.1.3 internal API client using URL: 6.91
    So the variance is quite insignificant. In some runs of my test the external client was slightly faster, which is pure coincidence based on the load at the time each part was executed:

    v12.1.3 internal API client: 8.09
    v12.1.3 external API client: 7.43
    v12.1.3 internal API client using URL: 8.0

    Kind Regards,
  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,160 Neuron

    Hi Alex, thanks for that follow up. We use nginx and an HTTPS reverse proxy so that may account for the differences between internal and external calls. Not sure what you guys have in the Dataiku Cloud.

    It's true that the methods I chose all belong to dataikuapi, but I wonder if there is also a difference with the dataiku API methods. For instance, dataiku.Dataset is dataiku.core.dataset.Dataset, and I could get a dataset list using dataiku.Dataset.list() instead of project.list_datasets() from dataikuapi. So I wonder what difference there is between those methods.
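One way to explore that question while reducing the load-related noise mentioned above is to run each candidate several times, interleaved, and compare medians. The following is a minimal sketch, not a definitive benchmark: `compare` is a hypothetical helper, the two stand-in callables exist only so the snippet runs anywhere, and the commented DSS usage is untested and assumes the code runs inside a DSS project.

```python
import statistics
import time


def compare(candidates, repeats=5):
    """Time each callable `repeats` times, interleaved, and return median seconds per name."""
    durations = {name: [] for name in candidates}
    for _ in range(repeats):
        # Interleave the candidates so a load spike does not hit only one of them.
        for name, func in candidates.items():
            start = time.perf_counter()
            func()
            durations[name].append(time.perf_counter() - start)
    return {name: statistics.median(values) for name, values in durations.items()}


# Stand-in workloads so the sketch is runnable outside DSS.
medians = compare({
    "slow_stand_in": lambda: time.sleep(0.01),
    "fast_stand_in": lambda: sum(range(1000)),
})
for name, seconds in medians.items():
    print(f"{name}: {seconds:.4f}s")

# Inside DSS, the comparison asked about above might look like this (assumption, untested):
# import dataiku
# project = dataiku.api_client().get_default_project()
# medians = compare({
#     "dataiku.Dataset.list": lambda: dataiku.Dataset.list(),
#     "project.list_datasets": lambda: project.list_datasets(),
# })
```

The median is used rather than the mean so that a single slow run caused by instance load does not dominate the result.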
