How to work with managed folders with EKS compute

Peter_R_Knight
Peter_R_Knight Registered Posts: 31 ✭✭✭✭
edited July 16 in Using Dataiku

I am running a Python recipe in DSS version 10.0.2.

I want to read and write to managed folders, which I currently do using the command:

config_path = dataiku.Folder("config files").get_path()

but I get the following error:

[10:41:24] [INFO] [dku.utils]  - *************** Recipe code failed **************
[10:41:24] [INFO] [dku.utils]  - Begin Python stack
[10:41:24] [INFO] [dku.utils]  - Traceback (most recent call last):
[10:41:24] [INFO] [dku.utils]  -   File "/opt/dataiku/python/dataiku/container/exec_py_recipe.py", line 19, in <module>
[10:41:24] [INFO] [dku.utils]  -     exec(fd.read())
[10:41:24] [INFO] [dku.utils]  -   File "<string>", line 16, in <module>
[10:41:24] [INFO] [dku.utils]  -   File "/opt/dataiku/python/dataiku/core/managed_folder.py", line 151, in get_path
[10:41:24] [INFO] [dku.utils]  -     self._ensure_and_check_direct_access()
[10:41:24] [INFO] [dku.utils]  -   File "/opt/dataiku/python/dataiku/core/managed_folder.py", line 132, in _ensure_and_check_direct_access
[10:41:24] [INFO] [dku.utils]  -     raise Exception('Python process is running remotely, direct access to folder is not possible')
[10:41:24] [INFO] [dku.utils]  - Exception: Python process is running remotely, direct access to folder is not possible

Is there a way around this you can recommend?

Thanks

Answers

  • Peter_R_Knight
    Peter_R_Knight Registered Posts: 31 ✭✭✭✭

    Many thanks for the pointers.

    The issue I'm going to face is that I'm calling GitHub code that needs to also be able to run locally and so I will end up having to litter the GitHub code with if dataiku_flag then read/write this way, else do it another way. I'm also calling other libraries that I believe can only save to a file path.

    I wondered if there might be a way to copy input folders to somewhere accessible to EKS (perhaps S3), and write output to a temp location on S3, then at the end of the code copy it back to the managed folder.

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,211 Dataiker

    @Peter_R_Knight,

    You can create the managed folder in DSS so that it is stored in S3, and interact with the remote managed folder in the same manner: get_download_stream() to read, and upload_stream() or upload_data() to write.

    Reference doc is available here: https://doc.dataiku.com/dss/latest/python-api/managed_folders.html
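
As a minimal sketch of the streaming calls mentioned above: the two helpers below take a folder handle as a parameter, so the same code works with any dataiku.Folder whether the recipe runs locally or in a container. The helper names and the file names in the usage comment are hypothetical; only get_download_stream() and upload_stream() come from the managed folder API.

```python
import io


def read_bytes(folder, path):
    """Read a whole file from a managed folder through the streaming API."""
    with folder.get_download_stream(path) as stream:
        return stream.read()


def write_bytes(folder, path, data):
    """Write bytes to a managed folder by uploading a file-like object."""
    folder.upload_stream(path, io.BytesIO(data))


# Usage inside a recipe (folder name taken from the question,
# file names are hypothetical):
#   import dataiku
#   folder = dataiku.Folder("config files")
#   settings = read_bytes(folder, "/settings.json")
#   write_bytes(folder, "/settings_copy.json", settings)
```

Because nothing here calls get_path(), the "direct access to folder is not possible" error should not be triggered.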

    You can use local storage on the container, or in-memory objects such as StringIO or BytesIO if needed, and then upload either the files or the file-like objects to the S3-backed managed folder.
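
For libraries that can only work with file paths, one possible pattern is to stage the folder contents on the container's local disk, run the path-based code there, and upload the results at the end. The sketch below assumes the folder handle is a dataiku.Folder; the helper names are hypothetical, and the whole folder is copied, which may be slow for large folders.

```python
import os
import tempfile


def stage_folder_locally(folder, local_dir):
    """Copy every file of a managed folder into local_dir; return the local paths."""
    local_paths = []
    for path in folder.list_paths_in_partition():
        # Folder paths start with "/", so strip it before joining.
        target = os.path.join(local_dir, path.lstrip("/"))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with folder.get_download_stream(path) as stream, open(target, "wb") as out:
            out.write(stream.read())
        local_paths.append(target)
    return local_paths


def upload_directory(folder, local_dir):
    """Upload every file under local_dir to the folder, keeping relative paths."""
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            rel = os.path.relpath(local_path, local_dir).replace(os.sep, "/")
            with open(local_path, "rb") as f:
                folder.upload_stream("/" + rel, f)


# Usage inside a recipe (folder names are hypothetical):
#   import dataiku
#   with tempfile.TemporaryDirectory() as work_dir:
#       stage_folder_locally(dataiku.Folder("config files"), work_dir)
#       ...  # call the path-based library on files under work_dir
#       upload_directory(dataiku.Folder("results"), work_dir)
```

The path-based code itself then needs no Dataiku-specific branches; only the staging and upload steps at the start and end of the recipe touch the folder API.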

    Let us know if you have questions.

  • Scobbyy2k3
    Scobbyy2k3 Partner, Registered Posts: 26 Partner

    I am having similar problems. How do I use a glob command with this?
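
One way to approximate glob without a local path, sketched here under the assumption that the folder handle is a dataiku.Folder: list the folder's paths with list_paths_in_partition() and filter them with the standard fnmatch module. The helper name glob_folder is hypothetical. Note that, unlike glob, the `*` wildcard in fnmatch also matches "/" and so reaches into subfolders.

```python
import fnmatch


def glob_folder(folder, pattern):
    """Return the managed-folder paths that match a shell-style pattern."""
    return [
        path
        for path in folder.list_paths_in_partition()
        # fnmatchcase keeps matching case-sensitive on every platform.
        if fnmatch.fnmatchcase(path, pattern)
    ]


# Usage inside a recipe (folder name is hypothetical):
#   import dataiku
#   csv_paths = glob_folder(dataiku.Folder("my_folder"), "/*.csv")
```

Each returned path can then be read with get_download_stream().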

  • ordinarydssuser
    ordinarydssuser Registered Posts: 1
    edited July 17

    Hi @AlexT,

    I have a follow-up question. I'm trying to write a variable number of files to a DSS filesystem managed folder from a Python recipe running on EKS, after reading them from S3 using the boto3 package.

    The client object has been initialized with S3 credentials.

    If I run the code using the local execution engine, it saves the files automatically to the managed folder that is the Python recipe's output.

    However, during container execution it does not. I'm unsure what to tweak here. The documentation barely provides any usable examples.

    Can you please suggest a starting point?

    Let me know if you need any other context.

    Thanks!

    for i in range(1,len(file_list)):
        file_name =  file_dir + file_list[i]
        print(file_dir)
        print(file_name) 
        print(file_list[i])
        client.download_file(BUCKET_ID, Key = file_name, Filename = file_list[i])

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,211 Dataiker
    edited July 17

    @ordinarydssuser
    We would need more info to understand why it fails.


    How are you initializing the boto3 client? Are you using an access key/secret or environment credentials?
    If you are using environment credentials (an instance profile), this will not work in containers by default.

    Additionally, if the paths you defined point to the local managed folder path, they can't be accessed from the container. The file would be written, but inside the container, which is then destroyed.

    If you want to use the boto3 client, you would need to download the files to a temp file or to local storage first, and then copy them to the managed folder with the read/write API, using upload_stream/upload_file.

    See example here : https://community.dataiku.com/t5/Plugins-Extending-Dataiku/Uploading-a-file-to-a-folder-using-the-Python-API/td-p/10500


    Using boto3 adds complexity in general, though. What is the exact goal here? Perhaps you can use a visual recipe, such as a merge folder recipe, to copy objects from S3 to a managed folder.

    The code below should work both locally and in a container. All it does is sync the contents of a bucket to a managed folder in DSS, which you can easily do directly in DSS with a merge folder recipe.



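    import os
    import tempfile

    import boto3
    import dataiku

    out_folder = dataiku.Folder("OjmNay4X")
    access_key = 'xxx'
    secret_key = 'xxx'
    bucket_name = 'bucket-name'
    subdirectory = 'sub-path'

    session = boto3.Session(aws_access_key_id=access_key, aws_secret_access_key=secret_key)
    s3_client = session.client('s3')

    # Note: a single list_objects_v2 call returns at most 1,000 keys.
    response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=subdirectory)

    for obj in response.get('Contents', []):
        file_key = obj['Key']
        if file_key.endswith('/'):
            continue  # skip "directory" placeholder keys
        # Reserve a temp file name on the local disk, then download into it.
        temp_file = tempfile.NamedTemporaryFile(delete=False)
        temp_file.close()
        try:
            s3_client.download_file(bucket_name, file_key, temp_file.name)
            # Copy the temp file into the managed folder through the folder API.
            with out_folder.get_writer(file_key) as output_file:
                with open(temp_file.name, 'rb') as input_file:
                    output_file.write(input_file.read())
        finally:
            os.remove(temp_file.name)  # do not leave temp files behind

    print('All files processed.')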
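
A single list_objects_v2 call returns at most 1,000 keys, so for larger prefixes the listing step needs pagination. A minimal sketch using the boto3 paginator follows; the helper name iter_s3_keys is hypothetical, and it assumes an S3 client created as in the code above.

```python
def iter_s3_keys(s3_client, bucket_name, prefix):
    """Yield every object key under prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # Pages without matching objects have no "Contents" entry.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if not key.endswith("/"):  # skip "directory" placeholder keys
                yield key


# Usage (names as in the code above):
#   for file_key in iter_s3_keys(s3_client, bucket_name, subdirectory):
#       ...  # download and upload each file as before
```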



