S3 path details whenever a new file is received

sj0071992 Partner, Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2022, Neuron 2023 Posts: 131 Neuron

Hi Team,

I am creating a process where I have to process S3 files. Is there any way to get the complete S3 path whenever a new file is added?

We can create a managed folder that contains the S3 files, and in a scenario we can trigger the process on a managed folder change, but can we also get the file path details?

Is there any way to do this?

Thanks in Advance

Answers

  • Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,225 Dataiker

    Hi @sj0071992,

    If I understand correctly, you are looking for the scenario trigger to provide the path of the file that triggered the scenario run?

    The full path of the file or files that changed since the last trigger is not directly available to the scenario, given how this is currently tracked; see the explanation here.

    Even if this information were available directly, how would you use it?

    The parameters/variables within the scenario can be retrieved like so:

    from dataiku.scenario import Scenario

    # Get all variables visible to the running scenario, including trigger parameters
    s = Scenario()
    trigger_params = s.get_all_variables()
    print(trigger_params)

    This will contain the project/managed folder ID that was modified, but not which files or folders within it were changed.

    'scenarioTriggerParam_modified': '["TESTING_S3_NEW.livm8UXE.NP"]'
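
    Assuming the trigger parameter keeps the JSON-encoded format shown above, a minimal sketch of reading it back in a scenario Python step could look like this (the key name and format are taken from the example output rather than from a documented contract):

    import json
    from dataiku.scenario import Scenario

    s = Scenario()
    trigger_params = s.get_all_variables()

    # 'scenarioTriggerParam_modified' holds a JSON-encoded list such as '["TESTING_S3_NEW.livm8UXE.NP"]'
    modified_items = json.loads(trigger_params.get("scenarioTriggerParam_modified", "[]"))
    print(modified_items)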

    To actually get the full paths of the files added since the last trigger, you could save a project variable each time with the timestamp (e.g. epoch) of the last file that was processed, then compare it against the modified dates of the files in the folder and retrieve the full paths of those created afterwards.

    Here is a code snippet that retrieves the information you are looking for; you can adapt it to your needs and use it in a scenario Python step.

    import dataiku
    import time

    folder_id = "G8glecnp"
    input_folder = dataiku.Folder(folder_id)

    # Current time in milliseconds, to match the folder's 'lastModified' values
    current_epoch = int(time.time()) * 1000

    # List each item in the managed folder together with its last-modified timestamp
    for item in input_folder.get_path_details()["children"]:
        print(item)
        print(item['lastModified'])

    print(current_epoch)
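
    To tie this back to the project-variable idea above, here is a sketch of how the comparison could work. It assumes each child entry also exposes a 'fullPath' key alongside 'lastModified', and uses a hypothetical project variable named last_processed_ts:

    import time
    import dataiku

    FOLDER_ID = "G8glecnp"          # managed folder ID from the snippet above
    VAR_NAME = "last_processed_ts"  # hypothetical project variable remembering the last run

    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())
    variables = project.get_variables()
    last_ts = int(variables["standard"].get(VAR_NAME, 0))

    input_folder = dataiku.Folder(FOLDER_ID)

    # Keep the full paths of files modified after the previously saved timestamp
    # ('fullPath' is assumed to be present on each child entry, like 'lastModified' above)
    new_paths = [
        item["fullPath"]
        for item in input_folder.get_path_details()["children"]
        if item["lastModified"] > last_ts
    ]
    print(new_paths)

    # Store the current time in milliseconds (matching 'lastModified') for the next scenario run
    variables["standard"][VAR_NAME] = int(time.time()) * 1000
    project.set_variables(variables)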

    Hope this helps!

  • degananda264 Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 5

    Hi @AlexT

    How can I keep monitoring a OneDrive folder?

    Whenever a file is uploaded to the OneDrive folder, I want to read it in a Python recipe and do some data engineering.

    Could you please help me with this?

    Thanks in advance

    Degananda
