How to access data within S3 folder using directory paths
Hello all,
I am working on a project where I have to access images and files from an S3 folder. I have the folder within my flow paired with a Python recipe which performs the computation.
Ideally, I would like to access these files through a directory path, the same way I would in a project on my local machine.
Any help is much appreciated!
Answers
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,984 Neuron
I am not sure what you are asking here. You can create a Dataiku managed folder in an S3 bucket and then access that managed folder via Python. Is that what you want?
-
Apologies for the vagueness. I already have a Dataiku managed folder set up within the S3 bucket, and I currently have a Python recipe attached to that folder in the flow. My current roadblock is a package that requires, as a parameter, the path of a file within the folder.
I printed the current working directory, which is:
/data/dataiku/dss_data/jupyter-run/dku-workdirs/[PROJ_NAME]/notebook_editor_for_[FORMULA_NAME]/ipythondir/profile_default/db
The directory of the S3 bucket within AWS is:
AmazonS3/Buckets/[dept.]/dataiku/[PROJ_NAME]/[*folder*]
I'm just confused regarding the file structure of Dataiku, and how to access this folder.
Hope that cleared things up, thanks!
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,984 Neuron
To interact with a Dataiku managed folder, you need to use the Dataiku API. Also, because this code may run outside of the DSS server, you should use the external API. Here is some sample code:
import dataikuapi
from datetime import datetime

# Connection details for the DSS instance (placeholders)
host = "http://localhost:11200"
apiKey = "some_key"
client = dataikuapi.DSSClient(host, apiKey)

project = client.get_project('MY_PROJECT')
folder = project.get_managed_folder("my_folder_id")

# List every file in the managed folder with its size and modification time
for content in folder.list_contents()['items']:
    # lastModified is expressed in milliseconds since the epoch
    last_modified_seconds = content["lastModified"] / 1000
    last_modified_str = datetime.fromtimestamp(last_modified_seconds).strftime("%Y-%m-%d %H:%M:%S")
    print("size=%s mtime=%s %s" % (content["size"], last_modified_str, content["path"]))
Full API method list here: https://developer.dataiku.com/latest/api-reference/python/managed-folders.html#dataikuapi.dss.managedfolder.DSSManagedFolder.list_contents
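For the specific case of a package that needs a file path, here is a minimal sketch of one possible workaround. It assumes the code runs inside a DSS Python recipe or notebook where the internal `dataiku` package is available. As far as I know, an S3-backed managed folder has no directory on the DSS server, so the idea is to stream each file to a local temporary directory and hand that local path to the package. The helper name `download_to_local_path`, the function `process_folder`, and the folder id "my_folder_id" are placeholders of mine, not part of the Dataiku API.

```python
import os
import shutil
import tempfile


def download_to_local_path(folder, remote_path, local_dir):
    """Copy one file from a managed folder into local_dir; return the local path.

    `folder` is assumed to expose get_download_stream(path), as dataiku.Folder does.
    """
    # Managed folder paths start with "/", so strip it before joining
    local_path = os.path.join(local_dir, remote_path.lstrip("/"))
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    with folder.get_download_stream(remote_path) as stream:
        with open(local_path, "wb") as out:
            shutil.copyfileobj(stream, out)
    return local_path


def process_folder(folder_id):
    """Hypothetical driver: copy every file locally and print its local path."""
    import dataiku  # only importable inside DSS, hence the local import

    folder = dataiku.Folder(folder_id)
    with tempfile.TemporaryDirectory() as tmp_dir:
        for remote_path in folder.list_paths_in_partition():
            local_path = download_to_local_path(folder, remote_path, tmp_dir)
            # Pass local_path to the package that requires a file path here
            print(local_path)
```

Calling `process_folder("my_folder_id")` from the recipe would copy each file in turn. The temporary directory is deleted when the `with` block exits, so any call that needs the local path has to happen inside the loop.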