Read a file, outside the API folder, from a DSS API

rona Registered Posts: 52 ✭✭✭✭✭

Hello,

We would like to implement a DSS API with a Python function that reads data files stored on a remote server (not the DSS API node).

The data file name will be an input parameter of the API.
The server where the data file is stored is known.
The data files are maintained by business users, which is why we can't use a managed folder deployed with the API.

Please, could you advise on the best way to read this data file from the DSS API?

Annie

Best Answer

  • Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,225 Dataiker
    edited July 17 Answer ✓

    Hi,

    Thanks for clarifying. That makes sense now. Indeed, you are correct: by default, when a managed folder is used in an API endpoint, its contents are copied over when the endpoint is deployed.

    However, you should be able to use the Dataiku Public API from the API endpoint.

    Here is an example for a dataset: https://community.dataiku.com/t5/Using-Dataiku/DSS-API-Designer-Read-dataset-from-DSS-flow-in-R-Python-API/m-p/7543

    Here is an example that reads a file from a managed folder, loads it into a pandas dataframe, and returns JSON:

    import io

    import dataikuapi
    import pandas as pd


    def api_py_function(project_key, folder_id):
        # Connect to the design node with the Dataiku public API client
        client = dataikuapi.DSSClient("http(s)://my_hostname:port", "apiKey")
        folder = dataikuapi.dss.managedfolder.DSSManagedFolder(client, project_key, folder_id)
        contents = folder.list_contents()

        # Read the first file in the folder as CSV and return it as JSON
        for item in contents["items"]:
            file_data = folder.get_file(item["path"]).content
            raw_data = pd.read_csv(io.StringIO(file_data.decode("utf-8")))
            return raw_data.to_json(orient="table")
    
    


    Let me know if this helps!

Answers

  • Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,225 Dataiker

    Hi Annie,

    Could you please elaborate a bit on what you mean by "distant server"? How will this remote server serve these data files? Will they be accessible over HTTP/S or an API, or do you need to use SCP/SFTP?

    You can use python-requests for HTTP or REST APIs.
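
    For the HTTP/S case, a minimal sketch could look like the following. The server URL, endpoint layout, and helper name are placeholders for illustration, not part of any real deployment:

    ```python
    # Hypothetical sketch: fetching a business-maintained data file over HTTP
    # from inside the API endpoint. The base URL below is a placeholder.
    import requests


    def build_file_url(base_url, file_name):
        """Join the server base URL and the file name passed to the endpoint."""
        return base_url.rstrip("/") + "/" + file_name


    def api_py_function(file_name):
        # file_name is the input parameter of the API endpoint
        url = build_file_url("https://files.example.com/data", file_name)
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # fail fast if the file is missing
        return response.text
    ```

    Because the file is fetched at request time, the endpoint always sees the latest version the business users have put on the server.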

    For SFTP, you can use something like paramiko.
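
    For the SFTP case, a minimal sketch with paramiko could look like this. The hostname, credentials, and remote path are all placeholders; in practice you would load them from a secure configuration rather than hard-coding them:

    ```python
    # Hypothetical sketch: reading a remote file over SFTP with paramiko.
    # Hostname, credentials, and remote_path are placeholders.
    import paramiko


    def read_remote_file(hostname, username, password, remote_path):
        """Fetch a file from a remote server over SFTP and return its text."""
        ssh = paramiko.SSHClient()
        ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        ssh.connect(hostname, username=username, password=password)
        try:
            with ssh.open_sftp() as sftp:
                with sftp.open(remote_path, "r") as remote_file:
                    return remote_file.read().decode("utf-8")
        finally:
            ssh.close()
    ```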

  • rona Registered Posts: 52 ✭✭✭✭✭

    Hello Alex,

    I was thinking of an approach like the following:

    - Create a DSS connection to this server/folder location to access the files stored there.

    - Question: can we use such a DSS connection in an API Python function executed on the API node?

    - Then access the file content from the DSS API using this DSS connection: is that possible?

    It's just an idea... I would like to understand the best way to address this need.

  • Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,225 Dataiker

    Hi Rona,

    In your case, a managed folder should work.

    https://doc.dataiku.com/dss/latest/apinode/endpoint-python-function.html#using-managed-folders

    You mentioned you can't use a managed folder. Could you elaborate a bit on why not, given that you would only be using it as input and not writing back to it?

  • rona Registered Posts: 52 ✭✭✭✭✭

    Hi Alex,

    The files in the folder are managed by the business users. They can delete, update or add files in this folder at any moment.

    With the API node, my understanding is that the managed folder is defined when we define the endpoint that uses it. This managed folder is then copied to the API node so that the files are available there when we deploy the API endpoint. Is that correct?

    If yes, it means the content of the folder is limited to what it contained at the time we deployed the API endpoint to the API node, so we can't dynamically pick up the business users' updates to this folder.

    Please, let me know if something is wrong with my understanding.

    Thanks

  • vaishnavi Registered Posts: 40 ✭✭✭✭
    edited July 17
    client = dataikuapi.DSSClient("http(s)://my_hostname:port", "apiKey")
    folder = dataikuapi.dss.managedfolder.DSSManagedFolder(client, project_key, folder_id)
    contents = folder.list_contents()

    @AlexT
    In the above code I am getting an error while executing the line "contents = folder.list_contents()": there is no such attribute "list_contents" on the DSS managed folder.

  • Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,225 Dataiker

    Hi @vaishnavi,
    What DSS version are you on? list_contents was added starting with DSS 8.

    https://doc.dataiku.com/dss/8.0/python-api/managed_folders.html

    Thanks,
