Dataiku Automation | Reading files from S3 bucket

JT97
Registered Posts: 2 ✭✭✭

Hi,

I have files present in an S3 bucket. I have also set up an S3 connection to read the files from the bucket and import them into a Dataiku zone.

So now I want to automate this process, as it is currently manual.

I need some help and assistance on how we can do this.

I am assuming this can be done from a scenario by writing a custom Python script.

Please share possible solutions for this use case.


Answers

  • Alexandru
    Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226

    Hi @JT97,

    Indeed, you can use a scenario for this: select the right-most dataset in the flow and build it as a scenario step, using a recursive build so the upstream datasets are rebuilt as well. You can also do this directly from the Actions menu when the dataset is selected: Other actions > Add to a scenario.
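
    For reference, the same build can also be launched from a custom Python scenario step instead of a visual build step. A minimal sketch, assuming a dataset named my_s3_dataset (a placeholder, replace it with your own):

        # Custom Python step inside a scenario
        from dataiku.scenario import Scenario

        scenario = Scenario()

        # Building the dataset re-runs the upstream recipes, which re-reads
        # the input files from the S3 connection
        scenario.build_dataset("my_s3_dataset")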

    When the build is performed, it will read any input files you have defined from your S3 bucket.

    Reading the actual files can be done using either Python code or a Files in Folder dataset. Both approaches are described here: https://community.dataiku.com/t5/Using-Dataiku/Listing-and-Reading-all-the-files-in-a-Managed-Folder/m-p/8241/highlight/true#M4280
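
    To illustrate the Python route, the snippet below lists and reads every file in a managed folder pointing at S3. The folder name s3_input_folder is a placeholder:

        import dataiku

        # Managed folder whose connection points at the S3 bucket
        folder = dataiku.Folder("s3_input_folder")

        # List every file path in the folder (relative to the folder root)
        for path in folder.list_paths_in_partition():
            # Stream the file contents instead of assuming a local filesystem
            with folder.get_download_stream(path) as stream:
                data = stream.read()
                print(path, len(data), "bytes")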

    Depending on how your files are structured and updated, you may be able to use a partitioned folder/dataset: https://doc.dataiku.com/dss/latest/partitions/fs_datasets.html

    Hope that helps,

  • JT97
    Registered Posts: 2 ✭✭✭

    Hi @AlexT,

    In the S3 bucket from which I have to read the XLS file, we get a new file every Monday.

    The file name also includes a date. So I need to pick up the latest file, read it from my S3 bucket directly into my zone, and then apply a Sync recipe on it to read the data.

    Can you please help me in setting this up?

    Thanks

  • Alexandru
    Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226

    Hi,

    One approach without code would be to use a partitioned folder (partitioned by day), where you define a pattern such as .*%Y/%M/%D/.*, or adapt the pattern based on the date format in your file name.

    From the Advanced tab of the folder, you should enable the option "Missing partitions as empty", since there will be gaps in your days (the file is only added on, e.g., Mondays).


    In your scenario, you can use a trigger on folder change (so the scenario runs when new files are detected) or a time-based trigger if you know what time the file is uploaded. In either case, build your dataset (the output of the Sync recipe) using the CURRENT_DAY special keyword.

    (Screenshots attached: Screenshot 2022-08-29 at 12.55.24.png, Screenshot 2022-08-29 at 12.58.27.png)
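
    If the partition pattern cannot express your naming convention, a code-based alternative is a scenario Python step that parses the date embedded in each file name and keeps only the newest file. The sketch below assumes ISO-style dates such as report_2022-08-29.xls and a folder named s3_input_folder, both hypothetical:

        import re
        import dataiku

        # Date portion of the file name, e.g. "report_2022-08-29.xls"
        FILE_PATTERN = re.compile(r"(\d{4}-\d{2}-\d{2})")

        folder = dataiku.Folder("s3_input_folder")  # placeholder name

        # Collect (date, path) pairs for every file that carries a date
        dated = []
        for path in folder.list_paths_in_partition():
            match = FILE_PATTERN.search(path)
            if match:
                dated.append((match.group(1), path))

        # ISO dates sort lexicographically, so max() yields the newest file
        latest_date, latest_path = max(dated)
        print("Latest file:", latest_path)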
