
Python code to read S3 files

sj0071992
Level 2

Hi,

 

How can we read S3 files using a Python recipe in Dataiku?

 

Thanks in Advance

AlexT
Dataiker

Hi @sj0071992 ,

Reading files from S3 with the Python APIs can be achieved by creating a managed folder pointing to the S3 bucket and then reading from it with the relevant managed folder calls, such as get_download_stream().

This article has a code sample you can reference:

https://knowledge.dataiku.com/latest/courses/folders/managed-folders-hands-on.html
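For illustration, here is a minimal sketch of reading a file through such a managed folder (the folder name "my_s3_folder" is just a placeholder):

import dataiku

# Assumes a managed folder named "my_s3_folder" already exists in the Flow
# and points at the S3 connection
folder = dataiku.Folder("my_s3_folder")

# List the files DSS sees in the folder, then stream the first one
paths = folder.list_paths_in_partition()
with folder.get_download_stream(paths[0]) as stream:
    data = stream.read()

print("Read %d bytes from %s" % (len(data), paths[0]))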

Thanks,

sj0071992
Level 2
Author

Hi Alex,

 

Could you help with creating a managed folder pointing to the S3 bucket using the relevant managed folder call? Specifically, where do I pass the bucket path, and how can I automate that?

AlexT
Dataiker

Hi,

Perhaps you can provide a bit more context around what you want to achieve exactly. In most cases a managed folder can be created in the UI and does not need to be automated.

If you do wish to use the API to create the managed folder and specify the path, you can use this:

 

import dataiku
import dataikuapi

client = dataiku.api_client()

# Assuming this runs in a notebook/scenario within the project
project = client.get_default_project()
project_key = project.get_summary()['projectKey']

managed_folder_name = "my_s3_managed_folder"
s3_connection_name = "s3-test"

# The connection to S3 should already exist; create the managed folder if it does not exist yet
try:
    folder_id = dataiku.Folder(managed_folder_name).get_id()
    fld = dataikuapi.dss.managedfolder.DSSManagedFolder(client, project_key, folder_id)
except Exception:
    print("creating folder")
    fld = project.create_managed_folder(managed_folder_name, connection_name=s3_connection_name)

# Modify the path within the root path of the connection - the default is '/${projectKey}/${odbId}'
fld_def = fld.get_definition()
# Replace the path, relative to the root of the S3 connection
fld_def['path'] = '/${projectKey}/${odbId}/testing'
fld.set_definition(fld_def)
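To verify the result, something along these lines should list what DSS now sees under the new path (a sketch, reusing fld from the snippet above):

# Sketch: list the folder contents as DSS sees them after the path change
contents = fld.list_contents()
for item in contents.get('items', []):
    print(item['path'], item.get('size'))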

 

sj0071992
Level 2
Author

Hi Alex,

 

Is there any way to achieve this without creating the managed folder? Can't we create a direct connection to S3 in a Python recipe and read the file using the bucket name and path?

 

Thanks in Advance

AlexT
Dataiker

Hi,

Yes, if you wish to create a direct connection in Python recipes you typically need to use the boto3 Python SDK. This means you need to manage the connection details/credentials yourself, along with things like multi-part download, and so on. It would definitely add complexity compared to using a managed folder or S3 dataset in DSS directly.

If you want to create an S3 dataset directly from Python code (instead of a managed folder), all you need to run is:

dataset = project.create_s3_dataset(dataset_name, connection, path_in_connection, bucket=None)
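For example, a minimal sketch (the dataset name, connection name, and path below are placeholders; bucket=None means the bucket configured on the connection itself is used):

import dataiku

client = dataiku.api_client()
project = client.get_default_project()

# Placeholders: dataset name, existing S3 connection name, and path inside the connection
dataset = project.create_s3_dataset("my_s3_dataset",
                                    "s3-test",
                                    "/path/in/connection",
                                    bucket=None)

The resulting dataset then shows up in the Flow like any other S3 dataset and can be read with dataiku.Dataset as usual.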

Let me know if you have any questions. 

sj0071992
Level 2
Author

Hi Alex,

 

Thanks for the reply.

 

Now I am able to connect to S3, but the issue I am facing now is reading the .gz file from S3. Below is the code I am using:

 

import dataiku
from dataiku import pandasutils as pdu
import pandas as pd
import boto3
import io
import gzip

s3_client = boto3.client('s3')

bucket = "bucket"
source_file_path = "file/path/_tmp.out-s0-2021-10-01-00-25-14-501.gz"

# Download the object and wrap the raw bytes in an in-memory buffer
s3_file = s3_client.get_object(Bucket=bucket, Key=source_file_path)
s3_file_data = s3_file['Body'].read()
s3_file_data = io.BytesIO(s3_file_data)

# Read the gzipped CSV into a DataFrame
s3_file_data_df = pd.read_csv(s3_file_data, compression='gzip', header=0, sep=',', quotechar='"')

 

The read via pd.read_csv is failing. Could you please help here?

 

Thanks in advance

 

AlexT
Dataiker

Hi,

What's the exact error stack trace when reading the file? Likely the file is corrupted or unreadable.

Try manually downloading the file and checking.
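For example, a quick local check along these lines (with "local_copy.gz" as a placeholder for wherever you downloaded the file) would tell you whether the archive itself is readable:

import gzip

try:
    # "local_copy.gz" is a placeholder for the locally downloaded file
    with gzip.open("local_copy.gz", "rt") as f:
        print(f.readline())  # the first line prints only if the archive is valid
except (OSError, EOFError) as e:
    print("Not a readable gzip file:", e)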

The filename here starts with _tmp; do you have any non-_tmp files you can try to read instead?

Are these files generated via the Event Server? 

 

sj0071992
Level 2
Author

Hi Alex,

 

I have attached the error message.

 

Also, these are the CRU logs generated by the Event Server, and all the files follow the "_tmp" naming format.

I am able to see the content up to s3_file_data = io.BytesIO(s3_file_data) (the bytes data), but after that the error below appears.

Can we change the format of the files to .csv from the Event Server?

log_error.png

AlexT
Dataiker

If all the files in the CRU logs are _tmp files, they are failed files and cannot be loaded, hence the exception you see.
This suggests an issue with your Event Server configuration, as it was unable to generate any valid files.

 

AlexT
Dataiker

You can check DATADIR/run/eventserver.log for possible issues. It may also be worth reviewing the steps described in this hands-on guide:

https://knowledge.dataiku.com/latest/kb/setup-admin/cru/index.html

 

 

sj0071992
Level 2
Author

Hi Alex,

 

When I tried creating the connection by creating a Cloud Storage dataset, it works fine and I am able to see the data there. But if you are saying that all the "_tmp" files are the failed ones and the valid files should start with "out-", then we can re-configure our Event Server to produce valid log files.

Please let me know if my understanding is correct.

 

Thanks

AlexT
Dataiker

Yes, that's correct, valid files should be out-*.

If it worked in a regular DSS dataset, that would mean at least some of the files were non-_tmp files.
