Hi,
How can we read S3 files using a Python recipe in Dataiku?
Thanks in Advance
Hi @sj0071992 ,
Reading files from S3 with the Python APIs can be achieved by creating a managed folder pointing to the S3 bucket, then reading the files with the relevant managed folder calls, such as get_download_stream.
This article has a code sample you can reference:
https://knowledge.dataiku.com/latest/courses/folders/managed-folders-hands-on.html
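As a minimal sketch of that pattern (the folder name below is a placeholder, not a value from this thread; it assumes a managed folder already pointing at the S3 connection):

```python
import io

import dataiku
import pandas as pd

# Hypothetical managed folder name -- replace with your own
folder = dataiku.Folder("my_s3_managed_folder")

# List the files available under the folder's path in the bucket
paths = folder.list_paths_in_partition()

# Stream one file from S3 into pandas without writing a local copy
with folder.get_download_stream(paths[0]) as stream:
    df = pd.read_csv(io.BytesIO(stream.read()))
```

This runs inside a DSS notebook or recipe, where dataiku.Folder can resolve the folder by name.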
Thanks,
Hi Alex,
Could you help with creating a managed folder pointing to the S3 bucket using the relevant managed folder call? Specifically, where do I pass the path of the bucket, and how can I automate that?
Hi,
Perhaps you can provide a bit more context around what exactly you want to achieve. Creating a managed folder can in most cases be done in the UI and does not need to be automated.
If you do wish to use the API to create the managed folder and specify the path, you can use this:
import dataiku
import dataikuapi

client = dataiku.api_client()
# Assuming this runs in a notebook/scenario within the project
project = client.get_default_project()
project_key = project.get_summary()['projectKey']
managed_folder_name = "my_s3_managed_folder"
s3_connection_name = "s3-test"
# The connection to S3 should already exist; create the managed folder if it does not exist
folder = dataiku.Folder(managed_folder_name)
try:
    folder_id = folder.get_id()
except Exception:
    print("creating folder")
    new_folder = project.create_managed_folder(managed_folder_name, connection_name=s3_connection_name)
    folder_id = new_folder.get_definition()['id']
# Modify the path within the root path of the connection - default is /${projectKey}/${odbId}
fld = dataikuapi.dss.managedfolder.DSSManagedFolder(client, project_key, folder_id)
fld_def = fld.get_definition()
# Replace the path, relative to the root of the S3 connection
fld_def['path'] = '/${projectKey}/${odbId}/testing'
fld.set_definition(fld_def)
Hi Alex,
Is there any way to achieve this without creating the managed folder? Can't we create a direct connection to S3 in a Python recipe and read the file using the bucket name and path?
Thanks in Advance
Hi,
Yes, if you wish to create a direct connection in a Python recipe, you typically need to use the boto3 Python SDK. This means you need to manage the connection details/credentials yourself, as well as things like multi-part downloads. It definitely adds complexity versus using a managed folder or an S3 dataset in DSS directly.
If you want to create an S3 dataset directly from Python code (instead of a managed folder), all you need to run is:
dataset = project.create_s3_dataset(dataset_name, connection, path_in_connection, bucket=None)
Let me know if you have any questions.
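As a minimal sketch of how that call might be used (the dataset name, connection name, and path below are placeholders, not values from this thread):

```python
import dataiku

client = dataiku.api_client()
project = client.get_default_project()

# Hypothetical names -- replace with your own connection and bucket layout
dataset = project.create_s3_dataset(
    "my_s3_dataset",        # name of the new dataset in the Flow
    "s3-test",              # existing S3 connection in DSS
    "/path/in/connection",  # path relative to the connection root
)
# Detect the file format and schema so the dataset is immediately readable
settings = dataset.autodetect_settings()
settings.save()

# The dataset can then be read like any other DSS dataset
df = dataiku.Dataset("my_s3_dataset").get_dataframe()
```

Without the autodetect step, the new dataset has no format or schema configured, so reading it would fail.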
Hi Alex,
Thanks for the reply.
Now I am able to make the connection to S3, but the issue I am facing now is reading the .gz file from S3. Below is the code I am using:
import io

import boto3
import pandas as pd

s3_client = boto3.client('s3')
bucket = "bucket"
source_file_path = "file/path/_tmp.out-s0-2021-10-01-00-25-14-501.gz"
s3_file = s3_client.get_object(Bucket=bucket, Key=source_file_path)
s3_file_data = s3_file['Body'].read()
s3_file_data = io.BytesIO(s3_file_data)
s3_file_data_df = pd.read_csv(s3_file_data, compression='gzip', header=0, sep=',', quotechar='"')
Reading it as CSV fails at this point. Could you please help here?
Thanks in advance
Hi,
What's the exact error stack trace when reading the file? The file is likely corrupted or unreadable.
Try manually downloading the file and checking it.
The filename here starts with _tmp; do you have any non-_tmp files you can try to read instead?
Are these files generated via the Event Server?
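One quick way to confirm whether a downloaded .gz object is actually a valid gzip file (rather than a truncated _tmp fragment) is to check the gzip magic bytes and attempt a full decompression before handing it to pandas. A small sketch (the helper name is mine, not from this thread):

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"

def is_valid_gzip(data: bytes) -> bool:
    """Return True if `data` starts with the gzip magic bytes and decompresses fully."""
    if not data.startswith(GZIP_MAGIC):
        return False
    try:
        gzip.decompress(data)
        return True
    except (OSError, EOFError):
        return False

# A well-formed gzip payload passes; arbitrary or truncated bytes do not
good = gzip.compress(b"col1,col2\n1,2\n")
print(is_valid_gzip(good))                 # True
print(is_valid_gzip(b"not a gzip stream")) # False
```

Running this check on the raw bytes from get_object would tell you whether the problem is the file itself or the pandas call.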
Hi Alex,
I have attached the error message.
Also, these are the CRU logs generated by the Event Server, and all the files have the "_tmp" prefix.
I am able to see the content up to s3_file_data = io.BytesIO(s3_file_data) -- the bytes data. But after that, the error below appears.
Can we change the format of the files to .csv from the Event Server?
If all the files in the CRU logs are _tmp files, they are failed files and cannot be loaded, hence the exception you see.
This suggests an issue with your Event Server configuration, as it was unable to generate any valid files.
Hi AlexT, I would like to import a project from Dataiku to S3 directly, without uploading it locally. Can I do this?
Hi @dhaouadi ,
Are you looking to import/export projects directly from S3?
This is possible by using a managed folder and using the DSS API. You can store an export and read an export from S3 without copying to local.
https://doc.dataiku.com/dss/latest/python-api/projects.html#exporting
Is that what you are looking to do?
Thank you for your response @AlexT.
No, the inverse: I don't want to import my projects locally. I am searching for a solution to import the project directly to S3.
Still not very clear to me what you are trying to achieve.
Are you trying to export a full DSS to S3 bucket?
Are you trying to export a dataset from DSS to S3? This can be done with a Sync recipe, an Export to Folder recipe, or a code recipe.
To export actual projects to S3, you can simply create a managed folder with the S3 destination you wish, then run a notebook or a recipe/scenario with something like:
import time

import dataiku

today_folder = time.strftime('%Y%m%d')
client = dataiku.api_client()
# Replace with your folder id
output_folder = dataiku.Folder("5l2INQoq")
# Replace with your project key
pk = "PROJECTKEY"
# Other options available:
# exportUploads (boolean): Exports the data of Uploaded datasets - default False
# exportManagedFS (boolean): Exports the data of managed Filesystem datasets - default False
# exportAnalysisModels (boolean): Exports the models trained in analysis - default False
# exportSavedModels (boolean): Exports the models trained in saved models - default False
# exportManagedFolders (boolean): Exports the data of managed folders - default False
# exportAllInputDatasets (boolean): Exports the data of all input datasets - default False
# exportAllDatasets (boolean): Exports the data of all datasets - default False
# exportAllInputManagedFolders (boolean): Exports the data of all input managed folders - default False
# exportGitRepository (boolean): Exports the Git repository history - default False
# exportInsightsData (boolean): Exports the data of static insights - default False
project = client.get_project(pk)
with project.get_export_stream({'exportAnalysisModels': True, 'exportSavedModels': True,
                                'exportGitRepository': True, 'exportInsightsData': True}) as s:
    output_folder.upload_stream(today_folder + "/" + pk + '_' + today_folder + '.zip', s)
This will store the project.zip in your S3 bucket.
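Conversely, if you later want to restore that project from S3 without a local copy, you can stream the archive back out of the managed folder into a project import. A hedged sketch (the folder id, archive path, and project key are placeholders matching the export example above, not confirmed values):

```python
import dataiku

client = dataiku.api_client()
# The S3 managed folder holding the exported archive
folder = dataiku.Folder("5l2INQoq")

# Stream the zip straight from S3 into the import, with no local file
with folder.get_download_stream("20211001/PROJECTKEY_20211001.zip") as s:
    import_handle = client.prepare_project_import(s)
    import_handle.execute()
```

This relies on prepare_project_import accepting a file-like object, so the archive never touches the local disk.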
Yes, just the project, not the full DSS.
When I open a project, for example, and I export it, I don't want it to download as a .zip to my local machine; instead, I want to store it directly on S3. That's the goal.
The only way to do that is via the API; the code above should achieve what you are looking for.
thank you @AlexT
You can check DATADIR/run/eventserver.log for possible issues. It may also be worth reviewing the steps described here:
https://knowledge.dataiku.com/latest/kb/setup-admin/cru/index.html
Hi Alex,
When I tried creating the connection through a Cloud Storage dataset, it worked fine and I was able to see the data. But if, as you say, all the "_tmp" files are the failed ones and the valid files should start with "out-", then we can re-configure our Event Server to produce valid log files.
Please let me know if my understanding is correct.
Thanks
Yes, that's correct; valid files should be named out-*.
If it worked in a regular DSS dataset, that would mean at least some of the files were non-_tmp files.