How to work with managed folders with EKS compute

Solved!
Peter_R_Knight
Level 2

I am running a Python recipe in DSS version 10.0.2.

I want to read from and write to managed folders, which I currently do using the command:

config_path = dataiku.Folder("config files").get_path()

but I get the following error:

[10:41:24] [INFO] [dku.utils]  - *************** Recipe code failed **************
[10:41:24] [INFO] [dku.utils]  - Begin Python stack
[10:41:24] [INFO] [dku.utils]  - Traceback (most recent call last):
[10:41:24] [INFO] [dku.utils]  -   File "/opt/dataiku/python/dataiku/container/exec_py_recipe.py", line 19, in <module>
[10:41:24] [INFO] [dku.utils]  -     exec(fd.read())
[10:41:24] [INFO] [dku.utils]  -   File "<string>", line 16, in <module>
[10:41:24] [INFO] [dku.utils]  -   File "/opt/dataiku/python/dataiku/core/managed_folder.py", line 151, in get_path
[10:41:24] [INFO] [dku.utils]  -     self._ensure_and_check_direct_access()
[10:41:24] [INFO] [dku.utils]  -   File "/opt/dataiku/python/dataiku/core/managed_folder.py", line 132, in _ensure_and_check_direct_access
[10:41:24] [INFO] [dku.utils]  -     raise Exception('Python process is running remotely, direct access to folder is not possible')
[10:41:24] [INFO] [dku.utils]  - Exception: Python process is running remotely, direct access to folder is not possible

 

Is there a workaround you can recommend?

Thanks

 

6 Replies
AlexT
Dataiker

Hi @Peter_R_Knight ,

Since you are running with containerized execution, you will need to use get_download_stream(), as explained here: https://doc.dataiku.com/dss/latest/connecting/managed_folders.html#local-vs-non-local

get_path() only works for a local folder, i.e. a folder hosted on the DSS server filesystem, when the job is not running in a container:

import dataiku

folder_handle = dataiku.Folder("FOLDER_NAME")
with folder_handle.get_download_stream("/path/to/file/in/folder") as f:
    my_file = f.read()

Let me know if that helps! 

Peter_R_Knight
Level 2
Author

Many thanks for the pointers. 

The issue I'm going to face is that I'm calling GitHub code that also needs to run locally, so I will end up littering the GitHub code with conditionals: if running in Dataiku, read/write one way; otherwise, do it another way. I'm also calling other libraries that I believe can only save to a file path.

I wondered if there might be a way to copy input folders to somewhere accessible to EKS (perhaps S3), and write output to a temp location on S3, then at the end of the code copy it back to the managed folder. 

AlexT
Dataiker

@Peter_R_Knight ,

You can create the folder in DSS so that it is stored on S3, and interact with the remote managed folder in the same manner, with get_download_stream() and upload_stream() or upload_data().

Reference doc is available here: https://doc.dataiku.com/dss/latest/python-api/managed_folders.html

You can use local storage on the container, or file-like objects such as StringIO/BytesIO if needed, and then upload either the files or the file-like objects to the S3-backed managed folder.
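As a minimal sketch of the file-like-object approach (the helper name here is mine, not part of the Dataiku API; in a real recipe the folder handle would come from dataiku.Folder("FOLDER_NAME")):

```python
import io

def upload_text(folder, path, text):
    # Wrap the string in an in-memory file-like object and push it to the
    # managed folder with upload_stream(), which works both locally and in
    # containerized execution, unlike get_path().
    data = text.encode("utf-8")
    folder.upload_stream(path, io.BytesIO(data))
    return len(data)  # bytes uploaded
```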

Let us know if you have questions. 

ordinarydssuser
Level 1

Hi @AlexT ,

I have a follow-up question. I'm trying to write a variable number of files to a DSS filesystem managed folder from a Python recipe running on EKS, after reading them from S3 using the boto3 package.

The client object has been initialized with S3 credentials.

If I run the code using the local execution engine, the code saves the files automatically to the managed folder, which is the Python recipe's output.

However, during container execution it does not. I'm unsure of what to tweak here. The documentation barely provides any usable examples.

Can you please suggest a starting point?

Let me know if you need any other context.

Thanks!

 

for i in range(1,len(file_list)):
    file_name =  file_dir + file_list[i]
    print(file_dir)
    print(file_name) 
    print(file_list[i])
    client.download_file(BUCKET_ID, Key = file_name, Filename = file_list[i])

 

Scobbyy2k3
Level 3

I am having similar problems. How do I use a glob command with this?
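Since get_path() is unavailable remotely, glob.glob() can't be pointed at the folder. One possible workaround (a sketch, not an official pattern): list the folder's paths and filter them with fnmatch. The hard-coded list below stands in for a call like dataiku.Folder("FOLDER_NAME").list_paths_in_partition():

```python
import fnmatch

# In a real recipe this list would come from the managed-folder API, e.g.:
#   paths = dataiku.Folder("FOLDER_NAME").list_paths_in_partition()
paths = ["/config/a.json", "/config/b.yaml", "/data/x.csv", "/data/y.csv"]

# fnmatch.filter applies Unix shell-style wildcards, like glob would
csv_files = fnmatch.filter(paths, "*.csv")
print(csv_files)  # ['/data/x.csv', '/data/y.csv']
```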

AlexT
Dataiker

@ordinarydssuser  We would need more info to understand why it fails.

How are you initializing the boto3 client? Are you using an Access Key/Secret or environment credentials? If you are using environment credentials (an instance profile), this will not work in containers by default.

Additionally, if the paths you defined point to the local managed folder path, these can't be accessed from the container: the files would be written inside the container's filesystem, which is destroyed when the job ends.

If you want to use the boto3 client, you would need to download the files to a temp file or local storage first, and then copy them to the managed folder using the read/write API (upload_stream/upload_file).

See example here : https://community.dataiku.com/t5/Plugins-Extending-Dataiku/Uploading-a-file-to-a-folder-using-the-Py...


But using boto3 in general adds complexity; what is the exact goal here? Perhaps you can use a visual recipe, like a Merge folder recipe, to copy objects from S3 to a managed folder.

The code below should work both locally and in a container; all it does is sync the contents of a bucket to a managed folder in DSS. You can do this more easily in DSS using a Merge folder recipe directly.



import dataiku
import boto3
import os
import tempfile

out_folder = dataiku.Folder("OjmNay4X")
access_key = 'xxx'
secret_key = 'xxx'
bucket_name = 'bucket-name'
subdirectory = 'sub-path'

session = boto3.Session(aws_access_key_id=access_key, aws_secret_access_key=secret_key)
s3_client = session.client('s3')

response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=subdirectory)

# 'Contents' is absent from the response when the prefix matches no objects
for obj in response.get('Contents', []):
    file_key = obj['Key']
    if file_key.endswith('/'):
        # skip "directory" placeholder keys
        continue
    # download to a container-local temp file, then push it to the managed folder
    temp_file = tempfile.NamedTemporaryFile(delete=False)
    temp_file.close()
    try:
        s3_client.download_file(bucket_name, file_key, temp_file.name)
        with out_folder.get_writer(file_key) as output_file:
            with open(temp_file.name, 'rb') as input_file:
                output_file.write(input_file.read())
    finally:
        os.remove(temp_file.name)

print('All files processed.')



