How to access data within S3 folder using directory paths

gagaoreo
gagaoreo Registered Posts: 2

Hello all,

I am working on a project where I have to access images and files from an S3 folder. The folder is in my flow, paired with a Python recipe that performs the computation.

Ideally, I would like to use a directory path to access these files, the same way I would in a project on my local machine.

Any help is much appreciated!

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,024 Neuron

    I am not sure what you are asking here. You can create a Dataiku managed folder in an S3 bucket and then access that managed folder via Python. Is that what you want?

  • gagaoreo
    gagaoreo Registered Posts: 2
    edited July 17

    Apologies for the vagueness. I already have a Dataiku managed folder set up within the S3 bucket, and a Python recipe in the flow that uses that folder. My current roadblock is a package that requires the path of a file within the folder as a parameter.

    I printed the current working directory, which is:

    /data/dataiku/dss_data/jupyter-run/dku-workdirs/[PROJ_NAME]/notebook_editor_for_[FORMULA_NAME]/ipythondir/profile_default/db

    The directory of the S3 bucket within AWS is:

    AmazonS3/Buckets/[dept.]/dataiku/[PROJ_NAME]/[*folder*]

    I'm just confused regarding the file structure of Dataiku, and how to access this folder.

    Hope that cleared things up, thanks!

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,024 Neuron
    edited July 17

    To interact with a Dataiku managed folder, you need to use the Dataiku API. Also, because this code may run outside of the DSS server, you should use the external API (the dataikuapi package). Here is some sample code:

    from datetime import datetime

    import dataikuapi

    # Placeholder connection details: replace with your own DSS URL and API key
    host = "http://localhost:11200"
    apiKey = "some_key"
    client = dataikuapi.DSSClient(host, apiKey)
    project = client.get_project('MY_PROJECT')
    folder = project.get_managed_folder("my_folder_id")

    # Print the size, last-modified time and path of every file in the folder
    for content in folder.list_contents()['items']:
        # lastModified is expressed in milliseconds since the epoch
        last_modified_seconds = content["lastModified"] / 1000
        last_modified_str = datetime.fromtimestamp(last_modified_seconds).strftime("%Y-%m-%d %H:%M:%S")
        print("size=%s mtime=%s %s" % (content["size"], last_modified_str, content["path"]))

    Full API method list here: https://developer.dataiku.com/latest/api-reference/python/managed-folders.html#dataikuapi.dss.managedfolder.DSSManagedFolder.list_contents
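
For the original problem (a package that insists on a file path), one possible approach is to copy the file out of the managed folder into a local temporary file and hand that path to the package. The sketch below is not from the thread. It assumes the code runs inside DSS, where the internal `dataiku` package and its `Folder.get_download_stream()` method are available. The helper names `copy_stream_to_temp_file` and `local_copy_from_managed_folder` are hypothetical, and "my_folder_id" is a placeholder.

```python
import os
import shutil
import tempfile


def copy_stream_to_temp_file(stream, file_name):
    """Copy a readable binary stream to a local temp file and return its path.

    Hypothetical helper: it works with any file-like object opened in binary mode.
    """
    # Keep the original extension, since some packages infer the format from it
    suffix = os.path.splitext(file_name)[1]
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        shutil.copyfileobj(stream, tmp)
    return tmp.name


def local_copy_from_managed_folder(folder_id, path_in_folder):
    """Download one file from a managed folder and return a local path to the copy.

    Assumption: this runs in a DSS recipe or notebook, so `dataiku` can be imported.
    """
    import dataiku  # imported lazily because it only exists inside DSS

    folder = dataiku.Folder(folder_id)
    with folder.get_download_stream(path_in_folder) as stream:
        return copy_stream_to_temp_file(stream, path_in_folder)


# Example usage inside DSS (placeholder folder id and file path):
# local_path = local_copy_from_managed_folder("my_folder_id", "/images/photo.png")
# some_package.load(local_path)  # hypothetical package call that needs a path
# os.remove(local_path)          # clean up the temporary copy when done
```

The temporary file is created with `delete=False`, so it stays on disk until it is removed explicitly. This matters when many or large files are processed in a loop.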
