Copying data from local managed folder to S3 managed folder
Hi,
I have some model files in a managed folder stored on DSS. I want to copy them to a new folder in S3; is there a way to do this using a Python recipe?
Best Answer
JordanB Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 296 Dataiker
Hi @harsha_dataiku,
Yes, you can do so with the managed folder read/write APIs: https://doc.dataiku.com/dss/latest/connecting/managed_folders.html#usage-in-python
The following code should work for transferring files from a local folder to a remote folder:
import dataiku

# Replace with your folder names or folder IDs (the ID is visible in the folder URL)
input_folder = dataiku.Folder("lE3JuuYn")
output_folder = dataiku.Folder("Pxyks4jt")

for path in input_folder.list_paths_in_partition():
    # Read the file from the local managed folder
    with input_folder.get_download_stream(path) as f:
        data = f.read()
    # Write it under the same file name in the output folder
    output_path = path.split('/')[-1]
    with output_folder.get_writer(output_path) as w:
        w.write(data)
    print("Successfully transferred {}".format(output_path))
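If your files are large, reading each one fully into memory may not be ideal. A minimal variation that streams the content in chunks should also work with the same read/write APIs (the folder IDs and chunk size here are placeholders to adjust for your project):

import dataiku

input_folder = dataiku.Folder("lE3JuuYn")   # replace with your input folder ID
output_folder = dataiku.Folder("Pxyks4jt")  # replace with your S3 folder ID
CHUNK_SIZE = 1024 * 1024  # 1 MB per read; tune as needed

for path in input_folder.list_paths_in_partition():
    output_path = path.split('/')[-1]
    # Copy the file chunk by chunk instead of loading it all at once
    with input_folder.get_download_stream(path) as f, output_folder.get_writer(output_path) as w:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            w.write(chunk)
    print("Successfully transferred {}".format(output_path))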
Best,
Jordan
Answers
harsha_dataiku Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 4 ✭
Hi, it throws an error when I try to copy PDF files. Is there a way to copy them? Thanks in advance.
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,984 Neuron
An error? What error?
harsha_dataiku Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 4 ✭
I am trying to copy a local PDF to an S3 folder and I am getting the error below.
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,984 Neuron
There is a much better way of doing this, and it doesn't require any code. First use a Files in Folder dataset to expose your files as a dataset. Then simply use a Sync recipe to move them from the Files in Folder dataset to a bucket in any cloud. Below is my sample flow using an Azure bucket, but S3 works the same way. And if you use this hidden feature of the Files in Folder dataset, you can even get full traceability of where each record came from. This solution will be far faster than Python.
And if you need to keep the files in their original format, either use s3fs-fuse or this solution to mount the S3 bucket on your current DSS machine and copy the files manually from each directory.
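Once the bucket is mounted, the copy itself is just ordinary file operations. A minimal sketch, assuming the bucket is mounted at a hypothetical /mnt/s3bucket and your model files live in a hypothetical /data/models directory:

import shutil
from pathlib import Path

src_dir = Path("/data/models")          # hypothetical local directory with the files
dst_dir = Path("/mnt/s3bucket/models")  # hypothetical mount point of the S3 bucket
dst_dir.mkdir(parents=True, exist_ok=True)

for src in src_dir.iterdir():
    if src.is_file():
        # copy2 preserves file content and metadata such as timestamps
        shutil.copy2(src, dst_dir / src.name)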