How to access data within S3 folder using directory paths

gagaoreo Registered Posts: 2

Hello all,

I am working on a project where I have to access images and files from an S3 folder. I have the folder within my flow paired with a Python recipe which performs the computation.

Ideally, I would like to access these files through a directory path, the same way I would in a project on my local machine.

Any help is much appreciated!

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,708 Neuron

    I am not sure what you are asking here. You can create a Dataiku managed folder in an S3 bucket and then access that managed folder via Python. Is that what you want?

  • gagaoreo
    gagaoreo Registered Posts: 2

    Apologies for the vagueness. I already have a Dataiku managed folder set up within the S3 bucket, and I currently have a Python recipe that takes that folder as input in the flow. My current roadblock is a package that requires, as a parameter, the path of a file within the folder.

    I printed the current working directory, which is:

    /data/dataiku/dss_data/jupyter-run/dku-workdirs/[PROJ_NAME]/notebook_editor_for_[FORMULA_NAME]/ipythondir/profile_default/db

    The directory of the S3 bucket within AWS is:

    AmazonS3/Buckets/[dept.]/dataiku/[PROJ_NAME]/[*folder*]

    I'm just confused about the file structure of Dataiku and how to access this folder.

    Hope that cleared things up, thanks!

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,708 Neuron

    To interact with a Dataiku managed folder, you need to use the Dataiku API. Also, because this code may run outside of the DSS server, you should use the external API (the dataikuapi package). Here is some sample code:

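    from datetime import datetime

    import dataikuapi

    # Placeholder connection details: replace with your DSS URL and API key.
    host = "http://localhost:11200"
    apiKey = "some_key"

    client = dataikuapi.DSSClient(host, apiKey)
    project = client.get_project('MY_PROJECT')
    folder = project.get_managed_folder("my_folder_id")

    # List every file in the managed folder with its size and modification time.
    for content in folder.list_contents()['items']:
        # lastModified is expressed in milliseconds since the epoch.
        last_modified_seconds = content["lastModified"] / 1000
        last_modified_str = datetime.fromtimestamp(last_modified_seconds).strftime("%Y-%m-%d %H:%M:%S")
        print("size=%s mtime=%s %s" % (content["size"], last_modified_str, content["path"]))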

    Full API method list here: https://developer.dataiku.com/latest/api-reference/python/managed-folders.html#dataikuapi.dss.managedfolder.DSSManagedFolder.list_contents
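
For the original problem (a package that needs a file path), one possible approach is to copy the file from the managed folder to a local temporary directory and pass that local path to the package, since an S3-backed managed folder is generally not a directory on the DSS server's local disk. The sketch below is only an illustration under some assumptions: it uses the internal `dataiku` package (`dataiku.Folder` and its `get_download_stream` method), which is available when the code runs inside DSS, such as in a recipe or notebook. The helper names `stream_to_local_file` and `fetch_from_managed_folder`, the folder id, and the file path are hypothetical placeholders.

```python
import os
import shutil


def stream_to_local_file(stream, remote_path, local_dir):
    """Copy a readable binary stream into local_dir and return the local path."""
    # Managed-folder paths look like "/sub/dir/image.png"; keep only the file name.
    local_path = os.path.join(local_dir, os.path.basename(remote_path))
    with open(local_path, "wb") as out:
        shutil.copyfileobj(stream, out)
    return local_path


def fetch_from_managed_folder(folder_id, remote_path, local_dir):
    """Download one file from a DSS managed folder into local_dir (hypothetical helper)."""
    # Assumption: this runs inside DSS, where the internal `dataiku` package exists.
    import dataiku

    folder = dataiku.Folder(folder_id)
    with folder.get_download_stream(remote_path) as stream:
        return stream_to_local_file(stream, remote_path, local_dir)


# Hypothetical usage inside a DSS Python recipe or notebook:
#
#   import tempfile
#
#   with tempfile.TemporaryDirectory() as tmp_dir:
#       local_path = fetch_from_managed_folder("my_folder_id", "/images/cat.png", tmp_dir)
#       some_package.process(local_path)  # placeholder for the package that needs a path
```

The local copy only lives as long as the temporary directory, so any output that should persist would need to be written back to a managed folder rather than left on local disk.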
