Python code to read S3 files

Solved!
sj0071992

Hi,

 

How can we read S3 files using a Python recipe in Dataiku?

 

Thanks in Advance

21 Replies
AlexT
Dataiker

Hi @sj0071992 ,

Reading files from S3 with the Python APIs can be achieved by:

1) Creating a managed folder pointing to the S3 bucket, then using the relevant managed folder calls such as get_download_stream.

This article has a code sample you can reference:

https://knowledge.dataiku.com/latest/courses/folders/managed-folders-hands-on.html

Thanks,

sj0071992
Author

Hi Alex,

 

Could you help with creating a managed folder pointing to the S3 bucket using the relevant managed folder call? Specifically, where do I pass the path of the bucket, and how can that be automated?

AlexT
Dataiker

Hi,

Perhaps you can provide a bit more context around what you want to achieve. Creating a managed folder can in most cases be done in the UI and does not need to be automated.

If you do wish to use the API to create the managed folder and specify the path, you can use this:

 

import dataiku
import dataikuapi

client = dataiku.api_client()

# Assuming this runs in a notebook/scenario within the project
project = client.get_default_project()
project_key = project.get_summary()['projectKey']

managed_folder_name = "my_s3_managed_folder"
s3_connection_name = "s3-test"

# The S3 connection should already exist; create the managed folder if it does not
try:
    folder_id = dataiku.Folder(managed_folder_name).get_id()
except Exception:
    print("creating folder")
    created = project.create_managed_folder(managed_folder_name,
                                            connection_name=s3_connection_name)
    folder_id = created.id

# Modify the path within the root path of the connection - default is /${projectKey}/${odbId}
fld = dataikuapi.dss.managedfolder.DSSManagedFolder(client, project_key, folder_id)
fld_def = fld.get_definition()
# Replace with a path relative to the root of the S3 connection
fld_def['path'] = '/${projectKey}/${odbId}/testing'
fld.set_definition(fld_def)

 

sj0071992
Author

Hi Alex,

 

Is there any way to achieve this without creating the managed folder? Can't we create a direct connection to S3 in a Python recipe and read the file using the bucket name and path?

 

Thanks in Advance

AlexT
Dataiker

Hi,

Yes, if you wish to create a direct connection in a Python recipe you typically need to use the boto3 Python SDK. This means you have to manage the connection details/credentials yourself, along with things like multi-part downloads. It definitely adds complexity versus using a managed folder or an S3 dataset in DSS directly.

If you want to create an S3 dataset directly from Python code (instead of a managed folder), all you need to run is:

dataset = project.create_s3_dataset(dataset_name, connection, path_in_connection, bucket=None)

Let me know if you have any questions. 

sj0071992
Author

Hi Alex,

 

Thanks for the reply.

 

Now I am able to make a connection with S3, but the issue I am facing is reading a .gz file from S3. Below is the code I am using:

 

import io

import boto3
import pandas as pd

s3_client = boto3.client('s3')

bucket = "bucket"
source_file_path = "file/path/_tmp.out-s0-2021-10-01-00-25-14-501.gz"

s3_file = s3_client.get_object(Bucket=bucket, Key=source_file_path)
s3_file_data = io.BytesIO(s3_file['Body'].read())

s3_file_data_df = pd.read_csv(s3_file_data, compression='gzip', header=0, sep=',', quotechar='"')

 

The pd.read_csv call is failing. Could you please help here?

 

Thanks in advance

 

AlexT
Dataiker

Hi,

What's the exact error stack trace when reading the file? Likely the file is corrupted or unreadable.

Try manually downloading the file and checking it.

The filename here starts with _tmp; do you have any non-_tmp files you can try to read instead?

Are these files generated via the Event Server?
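One quick way to check whether downloaded bytes really are gzip data before handing them to pandas is to inspect the gzip magic number (the first two bytes, 0x1f 0x8b). A self-contained sketch with in-memory data:

```python
import gzip
import io

import pandas as pd

def is_gzip(data: bytes) -> bool:
    """A gzip stream always starts with the two magic bytes 0x1f 0x8b."""
    return data[:2] == b"\x1f\x8b"

# Simulate a downloaded S3 object: a small CSV compressed with gzip
raw_csv = b"a,b\n1,2\n3,4\n"
gz_bytes = gzip.compress(raw_csv)

assert is_gzip(gz_bytes)      # compressed payload has the magic bytes
assert not is_gzip(raw_csv)   # plain CSV does not

# Only attempt the gzip read when the magic bytes match
df = pd.read_csv(io.BytesIO(gz_bytes), compression="gzip")
print(df.shape)  # (2, 2)
```

A truncated or failed upload (such as a leftover _tmp file) will typically fail this check or raise inside the gzip decoder.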

 

sj0071992
Author

Hi Alex,

 

I have attached the error message.

 

Also, these are the CRU logs generated by the Event Server, and all the files have the "_tmp_" prefix.

I am able to see the content up to " s3_file_data = io.BytesIO(s3_file_data) " -- the bytes data. But after that, the error below appears.

Can we change the format of the files to .csv from the Event Server?

log_error.png

AlexT
Dataiker

If all the _tmp files in the CRU logs are failed files, they cannot be loaded, hence the exception you see.
This suggests an issue with your Event Server configuration, as it was unable to generate any valid files.

 

dhaouadi
Level 1

Hi AlexT, I would like to move a project from Dataiku to S3 directly, without downloading it locally. Can I do this?

AlexT
Dataiker

Hi @dhaouadi ,

Are you looking to import/export projects directly from S3?

This is possible by using a managed folder and the DSS API. You can store an export in, and read an export from, S3 without copying it locally.

https://doc.dataiku.com/dss/latest/python-api/projects.html#exporting

Is that what you are looking to do? 

 

dhaouadi
Level 1

Thank you for your response @AlexT 

No, the inverse: I don't want my projects downloaded locally. I am searching for a way to send the project directly to S3.

Take a project from Dataiku and store it in S3 - is that possible?
 
AlexT
Dataiker

Still not very clear to me what you are trying to achieve.

Are you trying to export a full DSS instance to an S3 bucket?

Are you trying to export a dataset from DSS to S3? This can be done with a Sync recipe, an Export to Folder recipe, or a code recipe.

To export actual projects to S3, you can simply create a managed folder with the S3 destination you wish, then run a notebook or a recipe/scenario with something like:

 

import dataiku
import time

today_folder = time.strftime('%Y%m%d')

client = dataiku.api_client()

# Replace with your folder id
output_folder = dataiku.Folder("5l2INQoq")
# Replace with your project key
pk = "PROJECTKEY"

# Other options available:
# exportUploads (boolean): Exports the data of Uploaded datasets - default False
# exportManagedFS (boolean): Exports the data of managed Filesystem datasets - default False
# exportAnalysisModels (boolean): Exports the models trained in analysis - default False
# exportSavedModels (boolean): Exports the models trained in saved models - default False
# exportManagedFolders (boolean): Exports the data of managed folders - default False
# exportAllInputDatasets (boolean): Exports the data of all input datasets - default False
# exportAllDatasets (boolean): Exports the data of all datasets - default False
# exportAllInputManagedFolders (boolean): Exports the data of all input managed folders - default False
# exportGitRepository (boolean): Exports the Git repository history - default False
# exportInsightsData (boolean): Exports the data of static insights - default False

project = client.get_project(pk)
with project.get_export_stream({'exportAnalysisModels': True, 'exportSavedModels': True,
                                'exportGitRepository': True,
                                'exportInsightsData': True}) as s:
    output_folder.upload_stream(today_folder + "/" + pk + '_' + today_folder + '.zip', s)

 

This will store the project.zip in your S3 bucket. 

dhaouadi
Level 1

Yes, just the project, not the full DSS.

dhaouadi
Level 1

When I open a project, for example, and click to export the project, I don't want it downloaded as a .zip to my local machine; I want to store it directly on S3. That's the goal.

AlexT
Dataiker

The only way to do that is via the API; the code above should achieve what you are looking for.

dhaouadi
Level 1

Thank you @AlexT 

AlexT
Dataiker

You can check DATADIR/run/eventserver.log for possible issues. It may also be worth reviewing the steps and the hands-on example described here:

https://knowledge.dataiku.com/latest/kb/setup-admin/cru/index.html

 

 

sj0071992
Author

Hi Alex,

 

When I tried creating the connection by creating a Cloud storage dataset, it worked fine and I was able to see the data. But if, as you say, all "_tmp" files are failed ones and the valid files should start with "out-", then we can re-configure our Event Server to produce valid log files.

Please let me know if my understanding is correct.

 

Thanks

AlexT
Dataiker

Yes, that's correct: valid files should be out-*.

If it worked in a regular DSS dataset, that would mean at least some of the files were non-_tmp files.
