
copy files from bucket to bucket

Solved!
mass84
Level 1

Hi !!

I have two folders, both connected to the same S3 bucket but pointing to different directories.

I want to copy a subset of the first folder into the second folder with a recipe (Python if possible).

I've already tried doing this with:

f = Folder_A.get_download_stream(filename) and Folder_B.upload_stream(file_copy, f)

But it's veeeery slow (5 minutes to copy a 20 MB file).

Is there a better method to copy a file from bucket to bucket?

Thank you !!

1 Solution
Alex_Combessie
Dataiker Alumni

Hi,

The get_download_stream and upload_stream methods stream the data out of S3, through the DSS server, and back into S3.

You can achieve higher efficiency by using a cloud-specific API: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client

Cheers,

Alex
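Alex's suggestion of a cloud-specific API can be sketched with boto3's server-side copy, which asks S3 to duplicate the object internally instead of streaming bytes through DSS. The function name, parameters, and key names below are illustrative assumptions, not part of the Dataiku API:

```python
def server_side_copy(s3_client, bucket_name, source_key, target_key):
    """Ask S3 to copy the object internally, so no bytes transit the DSS server.

    s3_client is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    copy_source = {"Bucket": bucket_name, "Key": source_key}
    # CopySource identifies the existing object; Bucket/Key name the destination.
    s3_client.copy_object(CopySource=copy_source, Bucket=bucket_name, Key=target_key)
    return copy_source

# Hypothetical usage (keys must not start with '/'):
# s3_client = boto3.client("s3")
# server_side_copy(s3_client, "my_bucket", "folder_A/file.csv", "folder_B/file.csv")
```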


2 Replies

tanguy

Here is a code snippet we use to copy S3 objects within a single bucket. I suppose you could adapt it to copy objects across buckets (though your message suggests you are working with a single bucket).

import dataiku
import boto3

BUCKET_NAME = "my_bucket_name"

def get_aws_credentials(connector):
    """
    Get AWS credentials from Dataiku S3 connector
    """
    client = dataiku.api_client()
    connection = client.get_connection(connector)
    connection_info = connection.get_info()
    aws_credentials = connection_info.get_aws_credential()

    return aws_credentials


def get_bucket_handler(connector):
    aws_credentials = get_aws_credentials(connector)
    session = boto3.Session(aws_access_key_id=aws_credentials["accessKey"],
                            aws_secret_access_key=aws_credentials["secretKey"],
                            aws_session_token=aws_credentials["sessionToken"])

    s3 = session.resource('s3')
    bucket = s3.Bucket(BUCKET_NAME)

    return bucket


def copy_files_s3(source_key, target_key, connector):
    bucket = get_bucket_handler(connector)
    old_source = {'Bucket': BUCKET_NAME,
                  'Key': source_key}
    new_obj = bucket.Object(target_key)
    new_obj.copy(old_source)

 Be careful that your keys do not start with '/' as boto3 will not find your object (or will not be able to write it).
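To copy a whole subset of one directory into another, a small key-mapping helper could be paired with the snippet above. The prefixes and the usage lines are hypothetical; only the helper itself is shown runnable:

```python
def map_key(source_key, source_prefix, target_prefix):
    """Rewrite an object key from the source directory to the target one.

    As cautioned above, prefixes should not start with '/'.
    """
    if not source_key.startswith(source_prefix):
        raise ValueError("%r is not under %r" % (source_key, source_prefix))
    return target_prefix + source_key[len(source_prefix):]

# Hypothetical usage with the functions defined above:
# bucket = get_bucket_handler("my_s3_connection")
# for obj in bucket.objects.filter(Prefix="folder_A/subset/"):
#     copy_files_s3(obj.key, map_key(obj.key, "folder_A/", "folder_B/"),
#                   "my_s3_connection")
```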

