Python API - iter_rows - no data

soft_eng
soft_eng Registered Posts: 5
edited July 16 in Using Dataiku

Hello,

I am quite new to Dataiku and I am currently working on an ETL with a custom dashboard to expose our data at the end. Our dashboard was built using React, so I am building an API (using FastAPI) to expose our final dataset. Note that the dataset has 5 columns and 5 million rows.

I am now working on my API and I am having trouble retrieving the data stored in the last dataset of my Dataiku project, called transformed_data. I am able to get the schema, but no rows.

import dataikuapi


class DataikuAPIWrapper:
    def __init__(self, host, api_key, project_id):
        """
        Initialize the Dataiku API wrapper.

        Args:
            host (str): The URL of your Dataiku DSS instance (e.g., "https://your-dss-instance-url").
            api_key (str): Your Dataiku DSS API key.
            project_id (str): The key of the Dataiku project to open.
        """
        self.client = dataikuapi.DSSClient(host, api_key)
        # Disables TLS certificate verification (e.g. for a self-signed certificate)
        self.client._session.verify = False
        self.project = self.client.get_project(project_id)


# DATAIKU_HOST_URL, DATAIKU_API_KEY_SECRET and DATAIKU_PROJECT_ID come from my configuration
dataiku_api = DataikuAPIWrapper(
    DATAIKU_HOST_URL, DATAIKU_API_KEY_SECRET, DATAIKU_PROJECT_ID
)
dataset = dataiku_api.project.get_dataset("transformed_data")
columns = [column["name"] for column in dataset.get_schema()["columns"]]
for row in dataset.iter_rows():
    print(row)

This prints 0 rows, and the iterator object seems to be empty.

I am able to print the schema doing this:

dataset = dataiku_api.project.get_dataset("transformed_data")
columns = [column["name"] for column in dataset.get_schema()["columns"]]
print(columns)

Note that the dataset variable is a DSSDataset instance and, based on the GitHub repository, iter_rows() seems to be the only method that actually gets the data: https://github.com/dataiku/dataiku-api-client-python/blob/master/dataikuapi/dss/dataset.py

What are my options?

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,726 Neuron
    edited July 17

    Where is your code running? You say you are building an API to access your "final dataset", but you seem to be using the Dataiku API. Also, why build an API on another framework when Dataiku already has an API node for deploying APIs? Please describe exactly what you are trying to achieve, as it's not clear to me why you need a separate API.

    In a Python recipe or Jupyter Notebook inside Dataiku you can simply do this:

    import dataiku

    customers = dataiku.Dataset("customers")
    customers_df = customers.get_dataframe()
    customers_df.head()

    Iterating over 5m rows in a for loop will be very slow.
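
If rows do have to be streamed rather than loaded into a DataFrame, handling them in fixed-size batches at least keeps memory bounded. This is only a minimal sketch: `iter_batches` and `fake_rows` are hypothetical names and not part of any Dataiku API, and the generator stands in for whatever row iterator you have (for example `dataset.iter_rows()`).

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most batch_size rows from any row iterator."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(rows)
    while True:
        # islice pulls the next batch_size items without loading everything.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical stand-in for a real row iterator; each row is a list of values.
fake_rows = ([i, i * 2] for i in range(10))
batches = list(iter_batches(fake_rows, batch_size=4))
```

Each batch can then be serialized and sent on its own instead of building one 5-million-row response.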

  • soft_eng
    soft_eng Registered Posts: 5

    Thanks for the reply.

    I am building a custom dashboard using React (+ Nivo) to display time-series charts with millions of data points.

    The data that needs to be exposed is currently stored in a Dataiku dataset.

    If there is a way to directly expose the final table to a React app, that would be awesome.

    Additionally, due to the large volume of data, I may not want to load all the data at once, but rather load it in intervals based on the user's zoom level.
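
One way to picture the zoom-level idea is to average the points inside the visible time window into a fixed number of buckets, so the chart never receives more points than it can draw. The sketch below is an assumption about how this could look, not something from Dataiku or Nivo; `downsample`, `max_points` and the sample data are all hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def downsample(
    points: Iterable[Tuple[float, float]],
    start: float,
    end: float,
    max_points: int,
) -> List[Tuple[float, float]]:
    """Average (timestamp, value) points in [start, end) into at most max_points buckets."""
    if max_points <= 0 or end <= start:
        raise ValueError("need max_points > 0 and end > start")
    width = (end - start) / max_points
    sums = defaultdict(lambda: [0.0, 0])  # bucket index -> [total, count]
    for ts, value in points:
        if start <= ts < end:
            idx = min(int((ts - start) / width), max_points - 1)
            sums[idx][0] += value
            sums[idx][1] += 1
    # One (bucket_start, average) pair per non-empty bucket, in time order.
    return [
        (start + idx * width, total / count)
        for idx, (total, count) in sorted(sums.items())
    ]


# Hypothetical series: 100 points where the value equals the timestamp.
points = [(float(t), float(t)) for t in range(100)]
zoomed_out = downsample(points, 0.0, 100.0, max_points=4)
zoomed_in = downsample(points, 0.0, 10.0, max_points=5)
```

Zooming in narrows `start` and `end`, so the same `max_points` budget covers a shorter interval at finer resolution.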

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,726 Neuron
    edited July 17

    OK, so you are not building an API; you are just calling the Dataiku API from your React app, which is why I was confused.

    There are different Dataiku APIs, see here:

    https://developer.dataiku.com/latest/getting-started/dataiku-python-apis/index.html

    The dataikuapi package won't give you direct access to the dataset other than through iter_rows, which is going to be slow:

    https://developer.dataiku.com/latest/concepts-and-examples/datasets/index.html

    So I suggest you install the internal API package as shown in this section:

    https://developer.dataiku.com/latest/tutorials/devtools/python-client/index.html#building-your-local-virtual-environment

    Then use the internal API as follows to load the dataset into a DataFrame:

    import dataiku
    
    dataiku.set_remote_dss("https://your_dss_URL", "YOURAPIKEY")
    client = dataiku.api_client()
    
    # Uncomment this if your instance has a self-signed certificate
    # client._session.verify = False
    
    dataset = dataiku.Dataset("YOURPROJECTKEY.dataset_name")
    dataset_df = dataset.get_dataframe()
    dataset_df.head()

  • soft_eng
    soft_eng Registered Posts: 5

    You are telling me there is a way to call the Dataiku API from my React app. Can you show me how you would do it (directly from React, of course, without using Python)? Thanks!

    I have been searching for it since yesterday, after you told me it was possible, but I haven't seen any JS code that allows it.

  • soft_eng
    soft_eng Registered Posts: 5
    edited July 17

    ModuleNotFoundError: No module named 'dataiku'

    The only library that seems to be available for Python is this one: https://pypi.org/project/dataiku-api-client/

    That's already the one that I am using (dataikuapi).

    Does anyone know how to import that dataiku library?

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,726 Neuron

    @soft_eng
    wrote:

    You are telling me there is a way to call the Dataiku API from my React app. Can you show me how you would do it (directly from React, of course, without using Python)? Thanks!

    I have been searching for it since yesterday, after you told me it was possible, but I haven't seen any JS code that allows it.


    I never said that. I said that you are not building an API but calling one. That's a fundamentally different problem. How you call Python code from React, and how you pass the data from Python to React, is up to you. Another option could be to use the Dataiku REST API directly, which might be easier to access from a JS webapp:

    https://doc.dataiku.com/dss/api/12/rest/#datasets-dataset-get
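
To make the REST option more concrete, here is a rough sketch that only builds the HTTP request, without sending it. It is written in Python to match the rest of the thread; the same URL and header could be used from any HTTP client. The path and the `format` value are assumptions based on how the `dataikuapi` client appears to fetch rows, and the API key is assumed to go in HTTP Basic auth as the user name with an empty password, so please check all of this against the REST documentation for your DSS version. `build_dataset_data_request` and the host name are hypothetical.

```python
import base64
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_dataset_data_request(host, api_key, project_key, dataset_name):
    """Build (but do not send) a GET request for a dataset's rows."""
    # Assumed endpoint layout: /public/api/projects/<key>/datasets/<name>/data/
    path = "/public/api/projects/{}/datasets/{}/data/".format(
        quote(project_key, safe=""), quote(dataset_name, safe="")
    )
    # Assumed output format name; verify it in the REST API reference.
    query = urlencode({"format": "tsv-excel-header"})
    url = host.rstrip("/") + path + "?" + query
    # Assumed auth scheme: API key as Basic auth user name, empty password.
    token = base64.b64encode((api_key + ":").encode("utf-8")).decode("ascii")
    return Request(url, headers={"Authorization": "Basic " + token}, method="GET")


request = build_dataset_data_request(
    "https://dss.example.com", "YOURAPIKEY", "YOURPROJECTKEY", "transformed_data"
)
```

Passing the resulting object to `urllib.request.urlopen` would perform the call.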

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,726 Neuron

    The link in my post explains how to install the internal API package.

  • soft_eng
    soft_eng Registered Posts: 5

    Honestly, I have never seen so much confusion in documentation. That's why I am not a huge fan of mixing "no-code" tools and normal ETL tools. I think Dataiku is a great no-code tool but not really made for building robust ETLs. Still, I have to use this tool given my organisation's requirements. I work as an independent consultant for large companies, and every time there is something to do related to Dataiku, there are issues.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,726 Neuron

    I think it's worth considering that Dataiku is NOT an ETL tool, although it can do a lot of ETL. So if you come from an ETL background and try to fit Dataiku into an ETL mould, you are not going to get the best out of the tool. Personally, I find both the code and no-code implementations in Dataiku extremely powerful. I have yet to see a tool that blends no-code visual design so well with so many backend technologies and languages (Python, Shell, SQL, R, Spark). On top of this, the Dataiku APIs are very powerful, but I do agree it is a bit confusing at the beginning, with the different APIs available and which one to use in each case.

    The other thing you should consider is that if you write the data back to a database from Dataiku, you can bypass Dataiku altogether and read the output data directly from the database without having to deal with any Dataiku APIs.
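
As a rough illustration of reading the output straight from a database, the sketch below uses an in-memory SQLite table as a stand-in for whatever database the flow would write to. The table name `transformed_data`, its `ts` and `value` columns, and `fetch_window` are all assumptions made for the example; the point is only that the time window and the aggregation can be pushed into SQL so the backend returns a small result.

```python
import sqlite3

# In-memory SQLite database standing in for the real output database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transformed_data (ts INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO transformed_data VALUES (?, ?)",
    [(t, float(t)) for t in range(1000)],
)


def fetch_window(connection, start, end, bucket):
    """Return (bucket_start, avg_value) rows for ts in [start, end)."""
    sql = """
        SELECT (ts / ?) * ? AS bucket_start, AVG(value)
        FROM transformed_data
        WHERE ts >= ? AND ts < ?
        GROUP BY bucket_start
        ORDER BY bucket_start
    """
    # Integer division groups timestamps into buckets of `bucket` units.
    return connection.execute(sql, (bucket, bucket, start, end)).fetchall()


rows = fetch_window(conn, 0, 100, 50)
```

With a real database, only the connection and the SQL dialect details would change.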
