Managed folder on S3

NikMishin — Registered, Frontrunner 2022 Participant, Posts: 8

Hi,

We are migrating Dataiku from an on-premise server to AWS. One project currently uses a managed folder containing a few CSV files as the input to an R recipe. Post-migration, we want to use an S3 location (accessed via an HDFS connection) for all storage. When I repoint the managed folder to the S3/HDFS connection and try to run the recipe, I get an error (see attached screenshot).

Please let me know whether it is at all possible to use S3 in the above scenario. If not, could you point me in the right direction for the "read/write API" mentioned in the error?

Thank you.

Best Answer

  • Andrey — Dataiker Alumni, Posts: 119
    edited July 17 · Answer ✓

    My bad, I missed the fact that you're using R and proposed a Python solution.

    In the case of R, the equivalent solution is to use

    dkuManagedFolderDownloadPath

    Here you rely on DSS to read the data from the file storage behind the managed folder and return it in the form determined by the "as" parameter.
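
    For example, something along these lines (an untested sketch; the folder id and file name are the ones from your snippet, and I am assuming as = "raw" returns the file contents as a raw vector — check the R API reference for the exact values):

    library(dataiku)

    # Let DSS stream the file back, so this also works when the folder
    # is backed by a non-local connection such as S3.
    # Assumption: as = "raw" yields a raw vector of the file contents.
    bytes <- dkuManagedFolderDownloadPath("WCrIUW3D", "MyFile.csv", as = "raw")
    MyDataset <- read.csv(text = rawToChar(bytes), stringsAsFactors = FALSE, header = FALSE)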

    Regards

Answers

  • NikMishin — Registered, Frontrunner 2022 Participant, Posts: 8

    Just to clarify: the current managed folder uses a Filesystem connection.

  • Andrey — Dataiker Alumni, Posts: 119

    Hi,

    Are you trying to read/write data from/to a managed folder manually by constructing a path with "get_path" or "file_path"?

    If yes, you'd need to use "get_download_stream" and "upload_stream" for reading and writing operations:

    https://doc.dataiku.com/dss/latest/python-api/managed_folders.html#dataiku.Folder.get_download_stream
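
    For example (a minimal sketch of the stream-based pattern from the doc above; the folder id is the one from this thread):

    import dataiku

    # Read through DSS, so this also works when the folder is backed
    # by a non-local connection such as S3.
    folder = dataiku.Folder("WCrIUW3D")
    with folder.get_download_stream("MyFile.csv") as stream:
        content = stream.read()

    # Writing: hand a file-like object to upload_stream.
    with open("/tmp/out.csv", "rb") as f:
        folder.upload_stream("out.csv", f)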

    Regards

  • NikMishin — Registered, Frontrunner 2022 Participant, Posts: 8

    Hi @Andrey, thanks for this. Our data scientists have a large R script that picks up files from a managed folder. The code looks something like this:

    MyManagedFolder <- dkuManagedFolderPath("WCrIUW3D")

    MyDataset <- read.csv(paste0(MyManagedFolder, "/MyFile.csv"), stringsAsFactors = FALSE, header = FALSE)

    Note there are many more files in such a folder, so ideally we would find an R-based solution if at all possible.

  • Pascal_B — Registered, Posts: 10
    edited July 17

    Hello,

    I found this thread, which answers how to "read" a file in a managed folder on S3 from an R notebook. What would be the procedure to "write" a file to a managed folder on S3 from an R notebook?

    I get the following error when trying to access it directly with dkuManagedFolderPath.

    Error in dkuManagedFolderPath("Suivi_Campagne_RT"): Folder is not on the local filesystem, cannot perform direct filesystem access. Use the read/write API instead.
    Traceback:
    
    1. dkuManagedFolderPath("Suivi_Campagne_RT")
    2. stop("Folder is not on the local filesystem, cannot perform direct filesystem access. Use the read/write API instead.")

    I also do not understand how to use the dkuManagedFolderUploadPath function I found in the DSS help: how can I specify the write parameters (file format, sep, dec, etc.)?

    dkuManagedFolderUploadPath("folder_name", "path_in_folder", data)
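
    The only pattern I can think of (an untested sketch — I am assuming the data argument accepts a raw vector, please correct me if not) is to do the formatting locally with write.table, which controls sep, dec, and so on, and then upload the resulting bytes:

    library(dataiku)

    # Untested sketch: write.table handles the formatting (sep, dec, ...),
    # then the raw bytes of the temp file are handed to DSS to upload.
    # Assumptions: "data" accepts a raw vector; my_data is your data frame.
    tmp <- tempfile(fileext = ".csv")
    write.table(my_data, tmp, sep = ";", dec = ",", row.names = FALSE)
    dkuManagedFolderUploadPath("Suivi_Campagne_RT", "output.csv",
                               readBin(tmp, "raw", n = file.size(tmp)))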
    

    Thanks for your help

    Pascal

  • Ankur30 — Partner, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Dataiku DSS Adv Designer, Posts: 40

    Hi @Andrey,

    Kindly let me know what the Python code would be to load multiple CSV files from S3 in Dataiku.

    I am getting an error; the screenshot is attached.

    Regards,

    Ankur
