Dataiku Automation | Reading files from S3 bucket

JT97
Level 1

Hi,

I have files in an S3 bucket. I have also set up an S3 connection to read the files from the S3 bucket and import them into a Dataiku zone.

So now I want to automate this process, as it is currently manual.

I need some help and guidance on how we can do this.

I am assuming this can be done from a scenario and by writing a custom Python script.

Please share solutions for this use case.

3 Replies
AlexT
Dataiker

Hi @JT97 ,

Indeed, you can use a scenario for this case and build recursively by selecting the right-most dataset in the flow and building that dataset as a scenario step. You can also do this directly from the Actions menu when selecting the dataset: Other actions > Add to a scenario.

When the build is performed, it will read any input files you have defined from your S3 bucket.
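
If you do want to drive the build from a custom Python scenario step instead (as you mentioned), a minimal sketch using the scenario API could look like the following; "s3_import_dataset" is just a placeholder for the name of the right-most dataset in your flow:

# Custom Python step inside a Dataiku scenario
from dataiku.scenario import Scenario

scenario = Scenario()
# Build the right-most dataset of the flow as part of the scenario run
scenario.build_dataset("s3_import_dataset")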

Reading the actual files can be done using either Python code or a Files in Folder dataset. Both approaches are described here: https://community.dataiku.com/t5/Using-Dataiku/Listing-and-Reading-all-the-files-in-a-Managed-Folder...
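
For the code route, a rough sketch along the lines of that article, assuming a managed folder named "s3_input" pointing at your S3 connection:

# Python recipe or notebook: list and read all files in a managed folder
import dataiku

folder = dataiku.Folder("s3_input")  # placeholder folder name
for path in folder.list_paths_in_partition():  # every file path in the folder
    with folder.get_download_stream(path) as stream:
        data = stream.read()  # raw bytes of the file
        print(path, len(data))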

Depending on how your files are structured and updated, you may be able to use a partitioned Folder/Dataset: https://doc.dataiku.com/dss/latest/partitions/fs_datasets.html

 

Hope that helps, 

JT97
Level 1
Author

Hi @AlexT 

In the S3 bucket I have to read from, we get a new XLS file every Monday.

The file name also contains a date. So I need to pick up the latest file, read it from my S3 bucket directly into my zone, and then apply a Sync recipe on it to read the data.

Can you please help me in setting this up?

Thanks

AlexT
Dataiker

Hi,

One approach without code would be to use a partitioned folder (by day), where you define a pattern such as .*%Y/%M/%D/.*, or adapt the pattern to the date format in your file names.

From the Advanced tab of the folder, you should enable the option "Missing partitions as empty", since you have gaps in your days because the file is only added on Mondays, for example.


In your scenario, you can use a trigger on folder change (so the scenario triggers when new files are detected) or a time-based trigger if you know what time the file is uploaded. In either case, build your dataset (the output of the Sync recipe) using the CURRENT_DAY special keyword.
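
If you prefer the code approach instead of partitioning, here is a rough sketch of a Python recipe that picks the newest XLS by the date in its file name; the folder name "s3_input", the file-name date format, and the output dataset "latest_weekly_file" are assumptions you would need to adapt:

# Python recipe: read the newest XLS (by date in the file name) from a
# managed folder on S3 and write it to an output dataset
import re
from datetime import datetime

import dataiku
import pandas as pd

folder = dataiku.Folder("s3_input")  # placeholder managed folder on S3
paths = folder.list_paths_in_partition()

def file_date(path):
    # Assumes file names like "report_2022-08-29.xls"; adjust the regex/format
    match = re.search(r"(\d{4}-\d{2}-\d{2})", path)
    return datetime.strptime(match.group(1), "%Y-%m-%d") if match else datetime.min

latest = max(paths, key=file_date)

with folder.get_download_stream(latest) as stream:
    df = pd.read_excel(stream)  # needs xlrd (xls) or openpyxl (xlsx) installed

output = dataiku.Dataset("latest_weekly_file")  # placeholder output dataset
output.write_with_schema(df)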

