Copy files from bucket to bucket
Hi !!
I have two folders, both connected to the same S3 bucket but pointing to different directories.
I want to copy a subset of the first folder into the second folder with a recipe (Python if possible).
I've already tried to do this with:
f = Folder_A.get_download_stream(filename)
Folder_B.upload_stream(file_copy, f)
But it's very slow (5 min to copy a 20 MB file).
Is there a better method to copy a file from bucket to bucket?
Thank you !!
Best Answer
The get_download_stream and upload_stream methods stream the data out of S3, through the DSS server, and back into S3.
You can achieve higher efficiency by using a cloud-specific API, which lets S3 copy the object on its own side.
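As a rough illustration of the cloud-specific route, here is a minimal sketch using boto3's managed copy, which asks S3 to copy the object itself instead of routing the bytes through the DSS server. The function name server_side_copy and the bucket and key values in the usage comment are hypothetical, and the client is assumed to be created elsewhere with credentials that can read the source and write the target.

```python
def server_side_copy(s3_client, bucket, source_key, target_key):
    """Copy one object inside `bucket` without downloading it.

    `s3_client` is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    copy_source = {"Bucket": bucket, "Key": source_key}
    # Managed copy: S3 performs the copy, and boto3 switches to a
    # multipart copy when the object is large enough to need it.
    s3_client.copy(copy_source, bucket, target_key)
    return target_key


# Hypothetical usage (bucket and key names are placeholders):
#   import boto3
#   server_side_copy(boto3.client("s3"), "my_bucket_name",
#                    "dir_a/data.csv", "dir_b/data.csv")
```

Passing the client in as an argument keeps the credential handling separate from the copy itself, so the same function should work with whatever way you obtain credentials in DSS.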
Tanguy
Here is a code snippet we use to copy S3 objects within a single bucket. I suppose you could adapt it to copy objects across buckets (but your message suggests you are working on a single bucket).
import boto3
import dataiku

BUCKET_NAME = "my_bucket_name"


def get_aws_credentials(connector):
    """Get AWS credentials from a Dataiku S3 connection."""
    client = dataiku.api_client()
    connection = client.get_connection(connector)
    connection_info = connection.get_info()
    return connection_info.get_aws_credential()


def get_bucket_handler(connector):
    aws_credentials = get_aws_credentials(connector)
    session = boto3.Session(
        aws_access_key_id=aws_credentials["accessKey"],
        # The secret key is required alongside the access key
        aws_secret_access_key=aws_credentials["secretKey"],
        # Only present for temporary (STS) credentials
        aws_session_token=aws_credentials.get("sessionToken"),
    )
    s3 = session.resource("s3")
    return s3.Bucket(BUCKET_NAME)


def copy_files_s3(source_key, target_key, connector):
    bucket = get_bucket_handler(connector)
    old_source = {"Bucket": BUCKET_NAME, "Key": source_key}
    # Server-side copy: the data never leaves S3
    new_obj = bucket.Object(target_key)
    new_obj.copy(old_source)

Be careful that your keys do not start with '/', as boto3 will not find your object (or will not be able to write it).
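Since the question is about copying a subset of one directory into another, here is a small sketch of how the source and target keys could be prepared, including the leading '/' cleanup mentioned above. The helper plan_copies, its parameters, and the example paths are hypothetical. It assumes non-empty prefixes and that the input paths may carry a leading '/', as folder-style paths often do.

```python
import posixpath


def plan_copies(keys, source_prefix, target_prefix, keep=lambda key: True):
    """Return (source_key, target_key) pairs for keys under source_prefix."""
    # Normalize prefixes so that neither starts with "/".
    src = source_prefix.strip("/") + "/"
    dst = target_prefix.strip("/")
    pairs = []
    for key in keys:
        clean = key.lstrip("/")  # drop the leading "/" before talking to S3
        if not clean.startswith(src) or not keep(clean):
            continue
        relative = clean[len(src):]
        if not relative:
            continue  # skip the directory placeholder itself
        pairs.append((clean, posixpath.join(dst, relative)))
    return pairs


# Hypothetical example: keep only CSV files from dir_a and map them to dir_b.
pairs = plan_copies(
    ["/dir_a/x.csv", "dir_a/sub/y.csv", "dir_c/z.csv", "dir_a/skip.txt"],
    "/dir_a",
    "dir_b/",
    keep=lambda key: key.endswith(".csv"),
)
```

Each resulting pair could then be passed to copy_files_s3(source_key, target_key, connector), so that only the selected files are copied and every copy stays on the S3 side.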