How to work with managed folders with EKS compute
I am running a Python recipe in DSS version 10.0.2.
I want to read from and write to managed folders, which I currently do using the command:
config_path = dataiku.Folder("config files").get_path()
but I get the following error:
[10:41:24] [INFO] [dku.utils] - *************** Recipe code failed **************
[10:41:24] [INFO] [dku.utils] - Begin Python stack
[10:41:24] [INFO] [dku.utils] - Traceback (most recent call last):
[10:41:24] [INFO] [dku.utils] - File "/opt/dataiku/python/dataiku/container/exec_py_recipe.py", line 19, in <module>
[10:41:24] [INFO] [dku.utils] - exec(fd.read())
[10:41:24] [INFO] [dku.utils] - File "<string>", line 16, in <module>
[10:41:24] [INFO] [dku.utils] - File "/opt/dataiku/python/dataiku/core/managed_folder.py", line 151, in get_path
[10:41:24] [INFO] [dku.utils] - self._ensure_and_check_direct_access()
[10:41:24] [INFO] [dku.utils] - File "/opt/dataiku/python/dataiku/core/managed_folder.py", line 132, in _ensure_and_check_direct_access
[10:41:24] [INFO] [dku.utils] - raise Exception('Python process is running remotely, direct access to folder is not possible')
[10:41:24] [INFO] [dku.utils] - Exception: Python process is running remotely, direct access to folder is not possible
Is there a way around this you can recommend?
Thanks
Best Answer
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
Hi @Peter_R_Knight,
Since you are running in containerized execution, you will need to use get_download_stream(), as explained here: https://doc.dataiku.com/dss/latest/connecting/managed_folders.html#local-vs-non-local
get_path() only works for a local folder (i.e. a folder hosted on the filesystem, when the job is not running in a container). For example:
folder_handle = dataiku.Folder("FOLDER_NAME")
with folder_handle.get_download_stream("/path/to/file/in/folder") as f:
    my_file = f.read()
Let me know if that helps!
Answers
-
Many thanks for the pointers.
The issue I'm going to face is that I'm calling GitHub code that needs to also be able to run locally and so I will end up having to litter the GitHub code with if dataiku_flag then read/write this way, else do it another way. I'm also calling other libraries that I believe can only save to a file path.
I wondered if there might be a way to copy input folders to somewhere accessible to EKS (perhaps S3), and write output to a temp location on S3, then at the end of the code copy it back to the managed folder.
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
You can create the managed folder in DSS so that it is stored in S3, and interact with this remote managed folder in the same manner, using get_download_stream() for reads and upload_stream() or upload_data() for writes.
Reference doc is available here: https://doc.dataiku.com/dss/latest/python-api/managed_folders.html
You can use local storage on the container, or in-memory objects such as StringIO or BytesIO if needed, and then upload either the files or the file-like objects to the S3-backed managed folder.
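For libraries that can only read from or write to a file path, one possible pattern is to stage the folder contents in a temporary directory on the container and upload the results afterwards. The sketch below is only an illustration under stated assumptions: the helper names download_folder_to_dir and upload_dir_to_folder are hypothetical, and the folder argument is assumed to be a dataiku.Folder handle exposing list_paths_in_partition(), get_download_stream() and upload_stream().

```python
import os
import shutil
import tempfile


def download_folder_to_dir(folder, local_dir):
    """Copy every file of a managed folder into local_dir; return the local paths."""
    local_paths = []
    for remote_path in folder.list_paths_in_partition():
        # Paths inside a managed folder start with "/", so strip it before joining.
        local_path = os.path.join(local_dir, remote_path.lstrip("/"))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        with folder.get_download_stream(remote_path) as stream, open(local_path, "wb") as out:
            shutil.copyfileobj(stream, out)
        local_paths.append(local_path)
    return local_paths


def upload_dir_to_folder(local_dir, folder):
    """Upload every file under local_dir to the managed folder, keeping relative paths."""
    uploaded = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            rel_path = os.path.relpath(local_path, local_dir).replace(os.sep, "/")
            remote_path = "/" + rel_path
            with open(local_path, "rb") as f:
                folder.upload_stream(remote_path, f)
            uploaded.append(remote_path)
    return sorted(uploaded)
```

In a recipe, the idea would be to create a working directory with tempfile.TemporaryDirectory(), call download_folder_to_dir on the input folder, run the path-based code against that directory, and finally call upload_dir_to_folder on the output directory. The path-based code itself then needs no Dataiku-specific branches.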
Let us know if you have questions.
-
I am having similar problems. How do I use a glob command with this?
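One possible approach, not confirmed in this thread, is to list the paths in the managed folder and filter them with a shell-style pattern, since glob itself needs a local filesystem path. This is a minimal sketch: glob_folder is a hypothetical helper name, and folder is assumed to be a dataiku.Folder handle whose list_paths_in_partition() returns paths starting with "/".

```python
import fnmatch


def glob_folder(folder, pattern):
    """Return the managed-folder paths that match a shell-style pattern."""
    # fnmatchcase is case-sensitive on every platform.
    # Unlike glob, "*" here also matches "/" and so crosses sub-folder boundaries.
    return sorted(
        path
        for path in folder.list_paths_in_partition()
        if fnmatch.fnmatchcase(path, pattern)
    )
```

Each returned path can then be passed to get_download_stream() to read the matching file.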
-
Hi @AlexT,
I have a follow-up question. I'm trying to write a variable number of files to a DSS filesystem managed folder from a Python recipe running on EKS, after reading them from S3 using the boto3 package.
The client object has been initialized with S3 credentials.
If I run the code using the local execution engine, the files are saved automatically to the managed folder that is the Python recipe's output.
However, during container execution they are not. I'm unsure what to tweak here. The documentation barely provides any usable example.
Can you please suggest a starting point?
Let me know if you need any other context.
Thanks!
for i in range(1, len(file_list)):
    file_name = file_dir + file_list[i]
    print(file_dir)
    print(file_name)
    print(file_list[i])
    client.download_file(BUCKET_ID, Key=file_name, Filename=file_list[i])
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
@ordinarydssuser
We would need more info to understand why it fails.
How are you initializing the boto3 client? Are you using an access key/secret or environment credentials?
If you are using environment credentials (an instance profile), this will not work in containers by default.
Additionally, if the paths you defined point to the local managed folder path, they can't be accessed from the container. The file would be written, but inside the container, which is then destroyed.
If you want to use the boto3 client, you would need to download the files to a temp file or local storage first, and then copy them to the managed folder using the read/write API (upload_stream/upload_file).
See an example here: https://community.dataiku.com/t5/Plugins-Extending-Dataiku/Uploading-a-file-to-a-folder-using-the-Python-API/td-p/10500
But using boto3 in general adds more complexity. What is the exact goal here? Perhaps you can use a visual recipe, such as a merge folder recipe, to copy objects from S3 to a managed folder.
The code below should work both locally and in a container. All it does is sync the contents of a bucket to a managed folder in DSS. You can easily do this in DSS directly using a merge folder recipe.
import os
import tempfile

import boto3
import dataiku

out_folder = dataiku.Folder("OjmNay4X")

access_key = 'xxx'
secret_key = 'xxx'
bucket_name = 'bucket-name'
subdirectory = 'sub-path'

session = boto3.Session(aws_access_key_id=access_key, aws_secret_access_key=secret_key)
s3_client = session.client('s3')

response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=subdirectory)

# 'Contents' is absent when no object matches the prefix
for obj in response.get('Contents', []):
    file_key = obj['Key']
    # Download to a temporary file on the local (or container) disk
    temp_file = tempfile.NamedTemporaryFile(delete=False)
    temp_file.close()
    s3_client.download_file(bucket_name, file_key, temp_file.name)
    # Copy the temporary file into the managed folder through the API
    with out_folder.get_writer(file_key) as output_file:
        with open(temp_file.name, 'rb') as input_file:
            output_file.write(input_file.read())
    # Remove the temporary file once it has been uploaded
    os.remove(temp_file.name)

print('All files processed.')
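A single list_objects_v2 call returns at most 1,000 keys, so the sync above would miss objects in larger prefixes. The following is a hedged sketch of one way to handle that with a boto3 paginator. The function name sync_prefix_to_folder is hypothetical, s3_client is assumed to be a boto3 S3 client, and folder is assumed to be a dataiku.Folder handle with upload_stream().

```python
import tempfile


def sync_prefix_to_folder(s3_client, bucket_name, prefix, folder):
    """Copy every object under prefix into the managed folder; return the copied keys."""
    copied = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # "Contents" is absent from a page that holds no objects.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip "directory" placeholder objects
            # The temporary file is deleted automatically when the block exits.
            with tempfile.TemporaryFile() as tmp:
                s3_client.download_fileobj(bucket_name, key, tmp)
                tmp.seek(0)
                folder.upload_stream(key, tmp)
            copied.append(key)
    return copied
```

Passing the client and the folder as arguments keeps the function usable both locally and in a container, as long as the credentials given to the client are valid in both environments.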