Managed folder on S3

Solved!
NikMishin
Level 1

Hi,

We are migrating Dataiku from an on-prem server to AWS. One of our projects currently uses a managed folder containing a few CSV files as the input to an R recipe. After the migration we want to use an S3 location, accessed through an HDFS connection, for all storage. When I repoint the managed folder to the S3/HDFS connection and try to run the recipe, I get an error (see attached screenshot).

Please let me know whether it is possible to use S3 in this scenario at all. If not, could you point me in the right direction for the "read/write API" mentioned in the error?

Thank you.
6 Replies
NikMishin
Level 1
Author

Just to clarify: the current managed folder uses a Filesystem connection.

Andrey
Dataiker Alumni

Hi,

Are you trying to read/write data from/to a managed folder manually by constructing a path with "get_path" or "file_path"?

If yes, you'd need to use "get_download_stream" and "upload_stream" for reading and writing operations:

https://doc.dataiku.com/dss/latest/python-api/managed_folders.html#dataiku.Folder.get_download_strea...

 

Regards

Andrey Avtomonov
R&D Engineer @ Dataiku
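A minimal sketch of the stream-based read pattern Andrey describes, assuming the DSS Python API from the linked docs; here io.BytesIO stands in for the managed-folder stream so the snippet runs outside DSS, and "my_folder" is a placeholder name:

```python
# Inside DSS you would obtain the stream from the managed folder, e.g.:
#
#     import dataiku
#     folder = dataiku.Folder("my_folder")  # placeholder folder name
#     with folder.get_download_stream("MyFile.csv") as stream:
#         rows = read_csv_rows(stream)
#
# The same parsing logic, runnable anywhere with an in-memory stand-in stream:
import csv
import io

def read_csv_rows(stream):
    """Parse CSV rows out of a binary stream, such as one returned by get_download_stream."""
    text = io.TextIOWrapper(stream, encoding="utf-8")
    return list(csv.reader(text))

fake_stream = io.BytesIO(b"a,b\n1,2\n3,4\n")
print(read_csv_rows(fake_stream))  # [['a', 'b'], ['1', '2'], ['3', '4']]
```

The point of the API is that you never touch a filesystem path: DSS hands you a stream regardless of whether the folder sits on local disk or S3.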
NikMishin
Level 1
Author

Hi @Andrey , thanks for this. Our Data scientists have a huge R script that picks up files from a managed folder. The code looks something like this:

MyManagedFolder <- dkuManagedFolderPath("WCrIUW3D")

MyDataset <- read.csv(paste0(MyManagedFolder, "/MyFile.csv"), stringsAsFactors = FALSE, header = FALSE)

Note that there are many more files in the folder, so ideally we would like an R-based solution if at all possible.

Andrey
Dataiker Alumni

My bad, I missed the fact that you're using R and proposed a Python solution.

In R, the equivalent would be to use

dkuManagedFolderDownloadPath

With this function you rely on DSS to read the data from the file storage behind the managed folder; the form of the result depends on what you pass as the "as" parameter.

Regards

Andrey Avtomonov
R&D Engineer @ Dataiku
Pascal_B
Level 2

Hello,

I found this thread, which answers how to read a file from a managed folder on S3 in an R notebook. What would be the procedure to write a file to a managed folder on S3 from an R notebook?

I get the following error when trying to access it directly with dkuManagedFolderPath.

Error in dkuManagedFolderPath("Suivi_Campagne_RT"): Folder is not on the local filesystem, cannot perform direct filesystem access. Use the read/write API instead.
Traceback:

1. dkuManagedFolderPath("Suivi_Campagne_RT")
2. stop("Folder is not on the local filesystem, cannot perform direct filesystem access. Use the read/write API instead.")

and I do not understand how to use the dkuManagedFolderUploadPath function I found in the DSS help: how can I specify the write parameters (file format, sep, dec, etc.)?

dkuManagedFolderUploadPath("folder_name", "path_in_folder", data)

Thanks for your help

Pascal
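One way to handle Pascal's formatting question, shown here as a hedged Python sketch (the same idea applies in R): the upload call takes already-formatted content, so separator, decimal mark, and the like are applied when you serialize the data yourself before uploading. The `to_csv_bytes` helper below is illustrative, not part of any DSS API; inside DSS the final step would be something like `folder.upload_stream("path/in/folder.csv", io.BytesIO(payload))`.

```python
import csv
import io

def to_csv_bytes(rows, sep=";", decimal=","):
    """Serialize rows to CSV bytes, applying a custom field separator and decimal mark."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=sep, lineterminator="\n")
    for row in rows:
        # Apply the decimal mark to floats before writing; other values pass through.
        writer.writerow(
            [str(v).replace(".", decimal) if isinstance(v, float) else v for v in row]
        )
    return buf.getvalue().encode("utf-8")

payload = to_csv_bytes([["id", "value"], [1, 3.14]])
print(payload)  # b'id;value\n1;3,14\n'
```

The design point: formatting decisions live in your serialization step, and the managed-folder API only moves the resulting bytes.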

Ankur30
Level 3

Hi @Andrey ,

Could you let me know what Python code to use to load multiple CSV files from a managed folder on S3 in Dataiku?

I am getting an error; a screenshot is attached.

Regards,

Ankur
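A sketch of loading several CSV files from one managed folder, assuming the stream API discussed earlier in the thread. Inside DSS the loop would use the folder object, along the lines of: `folder = dataiku.Folder("my_s3_folder")` (placeholder name), then `for path in folder.list_paths_in_partition():` open each path with `folder.get_download_stream(path)`. Below, a dict of in-memory files stands in for the folder so the snippet runs anywhere:

```python
import csv
import io

def load_all_csvs(files):
    """Read every CSV stream in `files` and concatenate the data rows, skipping each header."""
    all_rows = []
    for name in sorted(files):
        text = io.TextIOWrapper(io.BytesIO(files[name]), encoding="utf-8")
        rows = list(csv.reader(text))
        all_rows.extend(rows[1:])  # drop each file's header row
    return all_rows

fake_folder = {
    "a.csv": b"id,val\n1,x\n",
    "b.csv": b"id,val\n2,y\n",
}
print(load_all_csvs(fake_folder))  # [['1', 'x'], ['2', 'y']]
```

The error in the screenshot is not reproduced here, but the common cause in this thread is path-based access (get_path-style) against a non-local folder; the stream-based loop above avoids that.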
