Python code to read S3 files
Hi,
How can we read S3 files using a Python recipe in Dataiku?
Thanks in Advance
Best Answer
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker
Hi,
Perhaps you can provide a bit more context around what you want to achieve exactly. Creating a managed folder in most cases can be done in the UI and does not need to be automated.
If you do wish to use the API to create the managed folder and specify the path, you can use this:
import dataiku

client = dataiku.api_client()
# assuming this runs in a notebook/scenario within the project
project = client.get_default_project()

managed_folder_name = "my_s3_managed_folder"
s3_connection_name = "s3-test"

# The connection to S3 should already exist; create the managed folder if it does not exist
try:
    folder_id = dataiku.Folder(managed_folder_name).get_id()
    fld = project.get_managed_folder(folder_id)
except Exception:
    print("creating folder")
    fld = project.create_managed_folder(managed_folder_name, connection_name=s3_connection_name)

# Modify the path within the root path of the connection - default is '/${projectKey}/${odbId}'
fld_def = fld.get_definition()
# replace with a path relative to the root of the S3 connection
fld_def['path'] = '/${projectKey}/${odbId}/testing'
fld.set_definition(fld_def)
Answers
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker
Hi @sj0071992,
Reading files from S3 with the Python APIs can be achieved by:
1) Creating a managed folder pointing to the S3 bucket.
2) Using the relevant managed folder call, like get_download_stream, to read the files.
This article has a code sample you can reference:
https://knowledge.dataiku.com/latest/courses/folders/managed-folders-hands-on.html
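For illustration only, here is a minimal sketch of reading a CSV file through get_download_stream. The helper name read_csv_rows, the folder name and the file path are placeholders I made up, not part of the Dataiku API. The helper only assumes that the folder object offers get_download_stream(path), as dataiku.Folder does.

```python
import csv
import gzip
import io


def read_csv_rows(folder, path, gzipped=False, encoding="utf-8"):
    """Return the rows of a CSV file stored in a managed folder.

    `folder` is expected to behave like dataiku.Folder, i.e. to offer
    get_download_stream(path) returning a readable binary stream.
    """
    with folder.get_download_stream(path) as stream:
        raw = stream.read()
    if gzipped:
        # Decompress in memory; fine for small files, not for very large ones
        raw = gzip.decompress(raw)
    return list(csv.reader(io.StringIO(raw.decode(encoding))))


# Inside a DSS Python recipe or notebook (names below are placeholders):
#   import dataiku
#   folder = dataiku.Folder("my_s3_managed_folder")
#   rows = read_csv_rows(folder, "/file/path/data.csv.gz", gzipped=True)
```

With this approach DSS handles the S3 credentials through the connection, so the recipe itself never needs the bucket keys.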
Thanks,
-
sj0071992 Partner, Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2022, Neuron 2023 Posts: 131 Neuron
Hi Alex,
Could you help with creating a managed folder pointing to the S3 bucket using the relevant managed folder call?
Specifically, where do I pass the path of the bucket, and how can I automate that?
-
sj0071992 Partner, Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2022, Neuron 2023 Posts: 131 Neuron
Hi Alex,
Is there any way to achieve this without creating the managed folder? Can't we create a direct connection to S3 in a Python recipe and read the file using the bucket name and path?
Thanks in Advance
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker
Hi,
Yes, if you wish to create a direct connection in Python recipes, you typically need to use the boto3 Python SDK. This means you need to manage the connection details/credentials yourself, along with things like multi-part downloads. It would definitely add complexity compared with using a managed folder or an S3 dataset in DSS directly.
If you want to create an S3 dataset directly from Python code (instead of a managed folder), all you need is to run:
dataset = project.create_s3_dataset(dataset_name, connection, path_in_connection, bucket=None)
Let me know if you have any questions.
-
sj0071992 Partner, Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2022, Neuron 2023 Posts: 131 Neuron
Hi Alex,
Thanks for the reply.
I am now able to connect to S3, but the issue I am facing is reading the .gz file from S3. Below is the code I am using:
import dataiku
from dataiku import pandasutils as pdu
import pandas as pd
import boto3
import io
import gzip
s3_client = boto3.client('s3')
bucket ="bucket"
source_file_path = "file/path/_tmp.out-s0-2021-10-01-00-25-14-501.gz"
s3_file = s3_client.get_object(Bucket=bucket, Key=source_file_path)
s3_file_data = s3_file['Body'].read()
s3_file_data = io.BytesIO(s3_file_data)
s3_file_data_df = pd.read_csv(s3_file_data, compression='gzip',header=0, sep=',', quotechar='"')
It fails at the pd.read_csv step. Could you please help here?
Thanks in advance
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker
Hi,
What's the exact error stack trace when reading the file? The file is likely corrupted or unreadable.
Try manually downloading the file and checking it.
The filename here starts with _tmp. Do you have any non-_tmp files you can try to read instead?
Are these files generated via the Event Server?
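One way to check the file from code, before handing it to pd.read_csv, is to test whether the downloaded bytes are a complete gzip payload. This is only a sketch; check_gzip is a hypothetical helper name, and it decompresses the whole payload in memory, so it suits small files.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def check_gzip(data):
    """Return (ok, message) describing whether `data` is a complete gzip payload."""
    if not data:
        return False, "file is empty"
    if data[:2] != GZIP_MAGIC:
        return False, "missing gzip magic bytes; not a gzip file"
    try:
        gzip.decompress(data)
    except (OSError, EOFError, zlib.error) as exc:
        return False, "gzip stream is truncated or corrupted: %s" % exc
    return True, "valid gzip payload"


# With the boto3 code from the post above, this could be used as:
#   ok, message = check_gzip(s3_file['Body'].read())
#   print(ok, message)
```

A file that was still being written when it was picked up would typically show up here as truncated.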
-
sj0071992 Partner, Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2022, Neuron 2023 Posts: 131 Neuron
Hi Alex,
I have attached the error message.
Also, these are the CRU logs generated by the Event Server, and all the files have names of the form "_tmp_".
I am able to see the content up to " s3_file_data = io.BytesIO(s3_file_data) " (bytes data), but after that I get the attached error.
Can we change the format of the files generated by the Event Server to .csv?
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker
If all the files in the CRU logs are _tmp files, they are failed files and cannot be loaded, hence the exception you see.
This suggests an issue with your Event Server configuration, as it was unable to generate any valid files.
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker
You can check DATADIR/run/eventserver.log for possible issues. It may also be worth reviewing the steps described here:
https://knowledge.dataiku.com/latest/kb/setup-admin/cru/index.html
And this hands-on example here:
https://knowledge.dataiku.com/latest/kb/setup-admin/cru/index.html
-
sj0071992 Partner, Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2022, Neuron 2023 Posts: 131 Neuron
Hi Alex,
When I tried creating the connection by creating a Cloud storage dataset, it worked fine, and I am able to see the data there. But if you are saying that all "_tmp" files are failed ones and that valid files should start with "out-", then we can reconfigure our Event Server to produce valid log files.
Please let me know if my understanding is correct.
Thanks
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker
Yes, that's correct: valid files should be named out-*.
If it worked in a regular DSS dataset, that would mean at least some of the files were actually non-_tmp files.
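If you keep reading with boto3, a simple way to skip the failed files is to filter the object keys on the file name before downloading. The sketch below is an illustration only: completed_log_keys is a made-up helper, and the second key in the example is hypothetical.

```python
import posixpath


def completed_log_keys(keys, prefix="out-"):
    """Keep only the keys whose file name starts with `prefix` (skips _tmp files)."""
    return [key for key in keys if posixpath.basename(key).startswith(prefix)]


keys = [
    "file/path/_tmp.out-s0-2021-10-01-00-25-14-501.gz",
    "file/path/out-s0-2021-10-01-00-25-14-501.gz",  # hypothetical valid file
]
print(completed_log_keys(keys))

# With boto3, the keys could come from a listing call, for example:
#   response = s3_client.list_objects_v2(Bucket=bucket, Prefix="file/path/")
#   keys = [obj["Key"] for obj in response.get("Contents", [])]
```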
-
Hi AlexT, I would like to import a project from Dataiku to S3 directly, without going through my local machine. Can I do this?
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker
Hi @dhaouadi,
Are you looking to import/export projects directly from S3?
This is possible by using a managed folder and the DSS API. You can store an export in S3, and read an export from S3, without copying it locally.
https://doc.dataiku.com/dss/latest/python-api/projects.html#exporting
Is that what you are looking to do?
-
Thank you for your response @AlexT.
No, the inverse: I don't want to import my projects locally. I am looking for a solution to import the project directly into S3.
That is, import a project from Dataiku and store it in S3. Is that possible?
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker
It is still not very clear to me what you are trying to achieve.
Are you trying to export a full DSS instance to an S3 bucket?
Are you trying to export a dataset from DSS to S3? This can be done with a Sync recipe, an Export to Folder recipe, or a code recipe.
To export actual projects to S3, you can simply create a managed folder with the S3 destination you wish, then run a notebook or recipe/scenario with something like:
import dataiku
import time

today_folder = time.strftime('%Y%m%d')
client = dataiku.api_client()

# replace with your folder id
output_folder = dataiku.Folder("5l2INQoq")
# replace with your project key
pk = "PROJECTKEY"

# Other export options available:
# exportUploads (boolean): Exports the data of Uploaded datasets - default False
# exportManagedFS (boolean): Exports the data of managed Filesystem datasets - default False
# exportAnalysisModels (boolean): Exports the models trained in analysis - default False
# exportSavedModels (boolean): Exports the models trained in saved models - default False
# exportManagedFolders (boolean): Exports the data of managed folders - default False
# exportAllInputDatasets (boolean): Exports the data of all input datasets - default False
# exportAllDatasets (boolean): Exports the data of all datasets - default False
# exportAllInputManagedFolders (boolean): Exports the data of all input managed folders - default False
# exportGitRepositoy (boolean): Exports the Git repository history - default False
# exportInsightsData (boolean): Exports the data of static insights - default False
project = client.get_project(pk)

# project.export_to_file('exported_project.zip') would instead write the archive to the local filesystem

# Stream the export straight into the managed folder
with project.get_export_stream({'exportAnalysisModels': True, 'exportSavedModels': True, 'exportGitRepositoy': True, 'exportInsightsData': True}) as s:
    output_folder.upload_stream("today_folder/" + pk + '_' + today_folder + '.zip', s)
This will store the project.zip in your S3 bucket.
-
Yes, just a project, not the full DSS.
-
For example, when I open a project and click on import the project, I don't want it to download as a .zip to my local machine. Instead, I want to store it directly on S3. That's the goal.
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker
The only way to do that is via the API. See the code above; it should achieve what you are looking for.
-
thank you @AlexT
-