Read from S3 connection basis last modified date

pnaik1 · July 2022

Hi all, I have an S3 connection through which I can access data stored on S3. However, the data is huge and I want to read only the data that was modified recently. I also can't filter on end_date, because another team does a full refresh and uploads all the data on a daily basis.

So in simple terms, I want to access only the yellow highlighted files (these are the most recently modified files). Is there a direct way to achieve this, or a way through a Python recipe?

Thanks in advance!

Turribeach · July 2022

Hi, this is a hacky approach, but it works as a one-off without having to write a Python recipe. Use the Files in Folder dataset, set Files Included to Only Select, and in the expression use something like *202207*. If you need to automate this, you will need to write some Python code. You can use the S3 folder as the input for your recipe and then list all the files in the folder, so you can choose programmatically which files to load.
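A minimal sketch of the programmatic route, assuming the S3 location is exposed to the recipe as a managed folder. The folder name passed in is whatever yours is called, the helper names are made up for illustration, and the default pattern is just the example above. The listing call (`list_paths_in_partition`) is from the `dataiku.Folder` API; the selection itself is plain Python:

```python
from fnmatch import fnmatch


def select_paths(paths, pattern="*202207*"):
    """Return the paths that match a glob pattern, sorted by name."""
    return sorted(p for p in paths if fnmatch(p, pattern))


def list_matching_files(folder_name, pattern="*202207*"):
    """List files in a Dataiku managed folder that match the pattern.

    Only runs inside Dataiku. folder_name is a placeholder for your
    own managed folder pointing at the S3 connection.
    """
    import dataiku  # imported lazily so select_paths works anywhere

    folder = dataiku.Folder(folder_name)
    return select_paths(folder.list_paths_in_partition(), pattern)
```

Each selected path can then be opened with `folder.get_download_stream(path)` and loaded however suits the file format.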

pnaik1 · July 2022

Hi @Turribeach
Thanks for the response. However, I can't filter on end_date (the partitioning column); I need to filter on the Last Modified column. So I am looking for a way to access this Last Modified column in Dataiku, or to read only the files that were updated during the last S3 upload.

Turribeach · July 2022

BTW I have seen this asked a few times already, and we have also wanted to do this without code, so I raised this idea in the Product Ideas section:

https://community.dataiku.com/t5/Product-Ideas/Enhance-the-Files-in-Folder-dataset-to-allow-filtering-for-the/idi-p/27039#M691

Feel free to upvote it by clicking on the up arrow.

Turribeach · July 2022

This should help:

https://community.dataiku.com/t5/Using-Dataiku/Listing-and-Reading-all-the-files-in-a-Managed-Folder/m-p/8140
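For the Last Modified part specifically, here is a rough sketch rather than a tested recipe. It assumes that `dataiku.Folder.get_path_details(path)` returns a dict with a `lastModified` value in epoch milliseconds; check this on your own instance. The helper names and the one-hour window are made up for illustration: the idea is to treat every file modified within that window of the newest file as belonging to the last upload.

```python
from datetime import timedelta


def latest_batch(details, window=timedelta(hours=1)):
    """Return paths modified within `window` of the newest file.

    `details` maps path -> last-modified time in epoch milliseconds.
    """
    if not details:
        return []
    newest = max(details.values())
    cutoff = newest - int(window.total_seconds() * 1000)
    return sorted(p for p, ts in details.items() if ts >= cutoff)


def latest_files_in_folder(folder_name, window=timedelta(hours=1)):
    """Collect last-modified times from a managed folder, keep the newest batch.

    Only runs inside Dataiku. folder_name is a placeholder for your
    own managed folder pointing at the S3 connection.
    """
    import dataiku  # imported lazily so latest_batch works anywhere

    folder = dataiku.Folder(folder_name)
    details = {}
    for path in folder.list_paths_in_partition():
        info = folder.get_path_details(path)
        details[path] = info["lastModified"]  # assumed: epoch milliseconds
    return latest_batch(details, window)
```

If the uploads happen at a known time, a fixed cutoff timestamp would work just as well as a window relative to the newest file.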
