Read from S3 connection basis last modified date

pnaik1
Level 3
Hi all, I have an S3 connection through which I can access data stored on S3. However, the data is huge and I want to read only the data that was modified most recently. Also, I can't filter using end_date, as another team does a full refresh and uploads all the data on a daily basis.

pic.png

So, in simple terms, I want to access only the yellow-highlighted files (these are the most recently modified files). Is there a direct way, or a way through a Python recipe, to achieve this?

Thanks in advance!

4 Replies
Turribeach
Level 5

Hi, this is a hacky way, but it would work as a one-off without having to write a Python recipe. Use the Files in Folder dataset, set Files Included to Only Select, and in the expression use something like *202207*. If you need to automate this, you will need to write some Python code. You can use the S3 folder as the input for your recipe and then list all the files in the folder so you can choose which files to load programmatically.
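
As a rough sketch of the Python route (not tested against your setup): list the paths in the managed folder, look up each file's details, and keep only the files modified close to the newest one. The folder name "s3_input" and the one-hour window are placeholders I made up, and I am assuming that get_path_details() returns a dict with a lastModified value in epoch milliseconds, so please check that on your DSS version.

```python
from datetime import timedelta


def select_recent(details, window=timedelta(hours=1)):
    """Return the paths modified within `window` of the newest file.

    `details` maps each path to its last-modified time in epoch milliseconds.
    """
    if not details:
        return []
    newest = max(details.values())
    cutoff = newest - int(window.total_seconds() * 1000)
    return sorted(path for path, ts in details.items() if ts >= cutoff)


def recent_paths_in_folder(folder_name, window=timedelta(hours=1)):
    """Collect last-modified times from a DSS managed folder and filter them."""
    # Imported here so select_recent() can also be used outside DSS.
    import dataiku

    folder = dataiku.Folder(folder_name)  # e.g. "s3_input" (placeholder name)
    details = {}
    for path in folder.list_paths_in_partition():
        info = folder.get_path_details(path)
        # Assumption: "lastModified" is present and expressed in epoch ms.
        details[path] = info["lastModified"]
    return select_recent(details, window)
```

Inside a recipe you would call recent_paths_in_folder("s3_input") and then read each returned path with folder.get_download_stream(path). Note that this makes one details call per file, so it may be slow on a folder with a very large number of files.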

pnaik1
Level 3
Author

Hi @Turribeach, thanks for the response. However, I can't filter on end_date (the partitioned column); I am trying to find a way to filter on the Last Modified column. So I am looking for a way to access this Last Modified column in Dataiku, or to read only the files that were updated during the last S3 upload.

Turribeach
Level 5

BTW, I have seen this asked a few times already, and we have also wanted to do this without code, so I raised this idea in the Product Ideas section:

https://community.dataiku.com/t5/Product-Ideas/Enhance-the-Files-in-Folder-dataset-to-allow-filterin...

Feel free to upvote it by clicking the up arrow.
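
Until something like that exists, one possible workaround is to read the S3 Last Modified value directly with boto3 instead of going through the Dataiku folder API. This is only a sketch under assumptions: it presumes boto3 is installed in your code env and that credentials are available to it, and the bucket and prefix in the usage comment are made-up placeholders. The listing call returns LastModified as a timezone-aware datetime, so the cutoff must be timezone-aware too.

```python
from datetime import datetime, timezone


def keys_modified_since(s3_client, bucket, prefix, since):
    """Return the keys under `prefix` whose LastModified is at or after `since`.

    `since` must be a timezone-aware datetime.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is missing when a page holds no objects.
        for obj in page.get("Contents", []):
            if obj["LastModified"] >= since:
                keys.append(obj["Key"])
    return keys


# Example call (placeholder bucket and prefix, not taken from this thread):
# import boto3
# cutoff = datetime(2022, 7, 1, tzinfo=timezone.utc)
# keys = keys_modified_since(boto3.client("s3"), "my-bucket", "my/prefix/", cutoff)
```

The client is passed in as an argument so the function itself does not depend on how you obtain credentials.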