
S3 path Details whenever a new file is received

sj0071992
Level 2

Hi Team,

 

I am creating a process where I have to process S3 files. Is there any way to get the complete S3 path whenever a new file is added?

 

We can create a managed folder that holds the S3 files, and in a scenario we can trigger the process on a managed folder change, but can we get the file path details?

Is there any way to do this?

 

Thanks in Advance

AlexT
Dataiker

Hi @sj0071992 ,

If I understand correctly, you are looking for the scenario trigger to provide the path of the file that triggered the scenario run?

The full path of the file or files that changed since the last trigger is not available to the scenario directly, given how changes are currently tracked (see the explanation here).

Even if this information were available directly, how would you use it?

The parameters/variables within the scenario can be retrieved like so:

from dataiku.scenario import Scenario
s = Scenario()
trigger_params = s.get_all_variables()
print(trigger_params)

This will contain the project/managed folder ID that was modified, but not which files or folders within it were changed.

'scenarioTriggerParam_modified': '["TESTING_S3_NEW.livm8UXE.NP"]'
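As a minimal sketch, the value above can be unpacked with plain Python. It assumes the value is a JSON-encoded list and that each entry has the form PROJECT_KEY.FOLDER_ID.PARTITION, which is only inferred from the single example shown here. The helper name `parse_modified_param` is hypothetical.

```python
import json


def parse_modified_param(raw):
    """Split each 'PROJECT_KEY.FOLDER_ID.PARTITION' entry into its parts.

    Assumption: the entry layout is inferred from the example value in
    this thread, not taken from a documented format.
    """
    parsed = []
    for entry in json.loads(raw):
        # Split on the first two dots only, in case the last part contains dots
        project_key, folder_id, partition = entry.split(".", 2)
        parsed.append(
            {"project": project_key, "folder_id": folder_id, "partition": partition}
        )
    return parsed


# Value copied from the trigger variables printed above
raw_value = '["TESTING_S3_NEW.livm8UXE.NP"]'
modified = parse_modified_param(raw_value)
print(modified)
```

In a scenario Python step, `raw_value` would come from the dictionary returned by `get_all_variables()` rather than from a hard-coded string.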

 

To actually get the full paths since the last trigger, you could save a project variable on every run with a timestamp (e.g. epoch) of the last file that was processed, then compare it against the modified date of the files in the folder and retrieve the full path of those created afterwards.

Here is a code snippet that retrieves the information you are looking for; you can adapt it to your needs and use it in a scenario Python step.

 

import time

import dataiku

folder_id = "G8glecnp"  # managed folder ID; replace with your own
input_folder = dataiku.Folder(folder_id)

# Current time in epoch milliseconds, to compare against 'lastModified'
current_epoch = int(time.time()) * 1000

# Print every item at the root of the folder with its last-modified time
for item in input_folder.get_path_details()["children"]:
    print(item)
    print(item['lastModified'])

print(current_epoch)
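The timestamp comparison described above could be sketched roughly as follows. This is only an illustration: the function name `files_modified_since` and the sample data are hypothetical, and it assumes each child entry carries `fullPath`, `directory` and `lastModified` (epoch milliseconds) keys, which you should confirm against the output printed by the snippet above.

```python
def files_modified_since(children, last_processed_epoch_ms):
    """Return file entries modified after the given epoch (ms), oldest first."""
    new_files = [
        child
        for child in children
        # Skip sub-folders; keep only files newer than the stored timestamp
        if not child.get("directory", False)
        and child["lastModified"] > last_processed_epoch_ms
    ]
    return sorted(new_files, key=lambda child: child["lastModified"])


# Hypothetical entries standing in for get_path_details()["children"]
sample_children = [
    {"fullPath": "/old.csv", "directory": False, "lastModified": 1000},
    {"fullPath": "/new.csv", "directory": False, "lastModified": 3000},
    {"fullPath": "/sub", "directory": True, "lastModified": 4000},
]

new_files = files_modified_since(sample_children, 2000)
new_paths = [child["fullPath"] for child in new_files]
print(new_paths)

# The newest timestamp seen becomes the value to store for the next run
if new_files:
    next_watermark = new_files[-1]["lastModified"]
    print(next_watermark)
```

In the scenario step you would pass `input_folder.get_path_details()["children"]` instead of the sample list, read the previous timestamp from a project variable, and write the new one back once the files are processed.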

 

Hope this helps!
