We are migrating Dataiku from an on-prem server to AWS. There is a project that currently uses a managed folder containing a few CSV files as an input to an R recipe. Post-migration we wish to use an S3 location (via an HDFS connection) for all storage. When I repoint the above managed folder to the S3 HDFS connection and try to run the recipe, I get an error (see attached screenshot).
Please let me know if it is at all possible to use S3 in the above scenario. If not, could you please point me in the right direction for the "read/write API" mentioned in the error?
Are you trying to read/write data from/to a managed folder manually by constructing a path with "get_path" or "file_path"?
If yes, you'd need to use "get_download_stream" and "upload_stream" for reading and writing operations instead.
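For example, a minimal sketch assuming the dataiku Python package, with the folder id taken from this thread; "MyFile.csv" and "MyFile_out.csv" are hypothetical file names:

import io
import dataiku

folder = dataiku.Folder("WCrIUW3D")

# Read: stream the file contents through DSS instead of building a filesystem path
with folder.get_download_stream("MyFile.csv") as stream:
    data = stream.read()

# Write: push bytes back through the streaming API
folder.upload_stream("MyFile_out.csv", io.BytesIO(data))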
Hi @Andrey, thanks for this. Our data scientists have a huge R script that picks up files from a managed folder. The code looks something like this:
library(dataiku)

MyManagedFolder <- dkuManagedFolderPath("WCrIUW3D")
MyDataset <- read.csv(paste0(MyManagedFolder, "/MyFile.csv"), stringsAsFactors = FALSE, header = FALSE)
Note there are many more files in such a folder, so ideally we would like an R-based solution if at all possible.
My bad, I missed the fact that you're using R and proposed a Python solution.
In the case of R, the equivalent solution would be to use dkuManagedFolderDownloadPath.
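A minimal sketch, assuming the dataiku R package and the folder id and file name from your script (the read.csv options are carried over from the original):

library(dataiku)

# Stream the file contents through DSS rather than resolving a local path;
# as = "text" returns the whole file as a single character string
contents <- dkuManagedFolderDownloadPath("WCrIUW3D", "MyFile.csv", as = "text")

# Parse the CSV text in memory with the same options as before
MyDataset <- read.csv(text = contents, stringsAsFactors = FALSE, header = FALSE)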
In this case you rely on DSS to read the data from the file storage behind the managed folder and return the result, depending on what you pass as the "as" parameter. Since the read goes through DSS rather than the local filesystem, it works the same whether the folder is backed by local storage or by your S3/HDFS connection.