Python code to read S3 files

sj0071992
sj0071992 Partner, Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2022, Neuron 2023 Posts: 131 Neuron

Hi,

How can we read S3 files using a Python recipe in Dataiku?

Thanks in Advance

Best Answer

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,225 Dataiker
    Answer ✓

    Hi,

    Perhaps you can provide a bit more context around what you want to achieve exactly. Creating a managed folder in most cases can be done in the UI and does not need to be automated.

    If you do wish to use the API to create the managed folder and specify its path, you can use this:

    import dataiku
    import dataikuapi
    
    client = dataiku.api_client()
    
    # Assuming this runs in a notebook/scenario within the project
    project = client.get_default_project()
    project_key = project.get_summary()['projectKey']
    
    managed_folder_name = "my_s3_managed_folder"
    s3_connection_name = "s3-test"
    
    # The S3 connection should already exist; create the managed folder if it does not
    try:
        folder_id = dataiku.Folder(managed_folder_name).get_id()
    except Exception:
        print("creating folder")
        project.create_managed_folder(managed_folder_name,
                                      connection_name=s3_connection_name)
        folder_id = dataiku.Folder(managed_folder_name).get_id()
    
    # Modify the path within the root path of the connection - the default is '/${projectKey}/${odbId}'
    fld = dataikuapi.dss.managedfolder.DSSManagedFolder(client, project_key, folder_id)
    fld_def = fld.get_definition()
    # Replace with a path relative to the root of the S3 connection
    fld_def['path'] = '/${projectKey}/${odbId}/testing'
    fld.set_definition(fld_def)
    

Answers

  • Alexandru

    Hi @sj0071992,

    Reading files from S3 with the Python APIs can be achieved by creating a managed folder pointing to the S3 bucket and then using the relevant managed folder calls, like get_download_stream.

    This article has a code sample you can reference:

    https://knowledge.dataiku.com/latest/courses/folders/managed-folders-hands-on.html
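
    As a minimal sketch of that pattern (assuming a managed folder named "my_s3_managed_folder" already exists and points at your S3 connection; the name and the CSV format are placeholders), reading one of its files into pandas could look like:

```python
import dataiku
import pandas as pd

# Hypothetical folder name; replace with your own managed folder
folder = dataiku.Folder("my_s3_managed_folder")

# List the files stored in the folder, then stream one into pandas
paths = folder.list_paths_in_partition()
with folder.get_download_stream(paths[0]) as stream:
    df = pd.read_csv(stream)
```

    This needs to run inside DSS (notebook, recipe, or scenario) so that the folder can be resolved.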

    Thanks,

  • sj0071992

    Hi Alex,

    Could you help with creating a managed folder pointing to the S3 bucket using the relevant managed folder call? Specifically, where do I pass the path of the bucket, and how can I automate that?

  • sj0071992

    Hi Alex,

    Is there any way to achieve this without creating the managed folder? Can't we create a direct connection to S3 in a Python recipe and read the file using the bucket name and path?

    Thanks in Advance

  • Alexandru

    Hi,

    Yes, if you wish to create a direct connection in a Python recipe, you would typically use the boto3 Python SDK. This means you need to manage the connection details/credentials yourself, as well as things like multi-part downloads. It definitely adds complexity versus using a managed folder or an S3 dataset in DSS directly.

    If you want to create an S3 dataset directly from Python code (instead of a managed folder), all you need to run is:

    dataset = project.create_s3_dataset(dataset_name, connection, path_in_connection, bucket=None)
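
    For instance (all the names below are placeholders; this assumes an existing S3 connection in DSS):

```python
import dataiku

client = dataiku.api_client()
project = client.get_default_project()

# Placeholder values; replace with your own dataset name,
# existing S3 connection, and path within that connection
dataset = project.create_s3_dataset(
    "my_s3_dataset",
    "s3-test",
    "/path/in/connection",
)
```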

    Let me know if you have any questions.

  • sj0071992

    Hi Alex,

    Thanks for the reply.

    Now I am able to make the connection with S3, but the issue I am now facing is reading a .gz file from S3. Below is the code I am using:

    import dataiku
    from dataiku import pandasutils as pdu
    import pandas as pd
    import boto3
    import io
    import gzip
    
    s3_client = boto3.client('s3')
    
    bucket = "bucket"
    source_file_path = "file/path/_tmp.out-s0-2021-10-01-00-25-14-501.gz"
    
    s3_file = s3_client.get_object(Bucket=bucket, Key=source_file_path)
    s3_file_data = s3_file['Body'].read()
    s3_file_data = io.BytesIO(s3_file_data)
    s3_file_data_df = pd.read_csv(s3_file_data, compression='gzip', header=0, sep=',', quotechar='"')

    Reading it through read_csv is failing. Could you please help here?

    Thanks in advance

  • Alexandru

    Hi,

    What's the exact error stack trace when reading the file? Likely the file is corrupted or unreadable.

    Try manually downloading the file and checking.

    The filename here starts with _tmp. Do you have any non-_tmp files you can try to read instead?

    Are these files generated via the Event Server?
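
    A quick, stdlib-only way to check whether the downloaded bytes are a valid, complete gzip archive before handing them to pandas (the sample content below is made up):

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"

def check_gzip_bytes(data: bytes) -> bool:
    """Return True if data looks like a complete, readable gzip stream."""
    if not data.startswith(GZIP_MAGIC):
        return False
    try:
        # Force full decompression; truncated or corrupted streams raise here
        with gzip.GzipFile(fileobj=io.BytesIO(data)) as gz:
            gz.read()
        return True
    except (OSError, EOFError):
        return False

# Example with made-up event-log content
good = gzip.compress(b"ts,event\n2021-10-01,click\n")
print(check_gzip_bytes(good))       # True: valid archive
print(check_gzip_bytes(good[:-5]))  # False: truncated archive
```

    Running this on one of the failing _tmp objects (the bytes you already read from S3) will tell you whether the file itself is broken, as opposed to a parsing problem in read_csv.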

  • sj0071992

    Hi Alex,

    I have attached the error message.

    Also, these are the CRU logs generated by the Event Server, and all the files have the "_tmp_" prefix.

    I can see the content up to "s3_file_data = io.BytesIO(s3_file_data)" -- the bytes data. But after that, the error below occurs.

    Can we change the format of the files to .csv from the Event Server?

    (Attachment: log_error.png)

  • Alexandru

    If all the _tmp files in the CRU logs are failed files, they cannot be loaded, hence the exception you see.
    This suggests an issue with your Event Server configuration, as it was unable to generate any valid files.

  • Alexandru

    You can check DATADIR/run/eventserver.log for possible issues. It may also be worth reviewing the steps described in:

    https://knowledge.dataiku.com/latest/kb/setup-admin/cru/index.html

  • sj0071992

    Hi Alex,

    When I tried creating the connection by creating a Cloud Storage dataset, it worked fine and I was able to see the data. But if, as you say, all "_tmp" files are the failed ones and the valid files should start with "out-", then we can re-configure our Event Server to produce valid log files.

    Please let me know if my understanding is correct.

    Thanks

  • Alexandru

    Yes, that's correct: valid files should be out-*.

    If it worked in a regular DSS dataset, that would mean at least some of the files were non-_tmp files.

  • dhaouadi
    dhaouadi Registered Posts: 5 ✭✭✭

    Hi AlexT, I would like to export a project from Dataiku to S3 directly, without downloading it locally. Can I do this?

  • Alexandru

    Hi @dhaouadi,

    Are you looking to import/export projects directly from S3?

    This is possible by using a managed folder and the DSS API. You can store an export in, and read an export from, S3 without copying it locally.

    https://doc.dataiku.com/dss/latest/python-api/projects.html#exporting

    Is that what you are looking to do?

  • dhaouadi

    Thank you for your response @AlexT.

    No, the inverse: I don't want to import my projects locally. I am searching for a solution to export the project directly to S3.

    Export a project from Dataiku and store it in S3. Is that possible?
  • Alexandru

    Still not very clear to me what you are trying to achieve.

    Are you trying to export a full DSS to S3 bucket?

    Are you trying to export a dataset from DSS to S3? This can be done with Sync recipe or Export to Folder recipe or code recipe.

    To export actual projects to S3, you can simply create a managed folder with the S3 destination you wish, then run a notebook or recipe/scenario with something like:

    import dataiku
    import time
    
    today_folder = time.strftime('%Y%m%d')
    
    client = dataiku.api_client()
    
    # Replace with your folder id
    output_folder = dataiku.Folder("5l2INQoq")
    # Replace with your project key
    pk = "PROJECTKEY"
    
    # Other options available:
    # exportUploads (boolean): Exports the data of Uploaded datasets - default False
    # exportManagedFS (boolean): Exports the data of managed Filesystem datasets - default False
    # exportAnalysisModels (boolean): Exports the models trained in analyses - default False
    # exportSavedModels (boolean): Exports the models trained in saved models - default False
    # exportManagedFolders (boolean): Exports the data of managed folders - default False
    # exportAllInputDatasets (boolean): Exports the data of all input datasets - default False
    # exportAllDatasets (boolean): Exports the data of all datasets - default False
    # exportAllInputManagedFolders (boolean): Exports the data of all input managed folders - default False
    # exportGitRepository (boolean): Exports the Git repository history - default False
    # exportInsightsData (boolean): Exports the data of static insights - default False
    
    project = client.get_project(pk)
    with project.get_export_stream({'exportAnalysisModels': True, 'exportSavedModels': True,
                                    'exportGitRepository': True,
                                    'exportInsightsData': True}) as s:
        output_folder.upload_stream(today_folder + "/" + pk + '_' + today_folder + '.zip', s)

    This will store the project.zip in your S3 bucket.

  • dhaouadi

    yes just project not the full DSS

  • dhaouadi

    When I open a project, for example, and click to export it, I don't want it to download as a .zip to my local machine; instead, I want to store it directly on S3. That's the goal.

  • Alexandru

    The only way to do that is via the API; the code above should achieve what you are looking for.
