Hi all, I have an S3 connection through which I can access data stored on S3. However, the data is huge and I want to read only the data that was modified recently. I also can't filter on end_date, because another team does a full refresh and uploads all the data on a daily basis.
So in simple terms, I want to access only the files highlighted in yellow (these are the most recently modified files). Is there a direct way to achieve this, or can it be done through a Python recipe?
Thanks in advance!
Hi, this is a hacky approach, but it works as a one-off without having to write a Python recipe. Use the Files in Folder dataset, set the Files Included to Only Select, and in the expression use something like *202207*. If you need to automate this, you will need to write some Python code. You can use the S3 folder as the input for your recipe and then list all the files in the folder, so you can choose programmatically which files to load.
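For the programmatic version of the same name-based selection, here is a minimal sketch in plain Python. It assumes you already have the list of file paths from the input folder; the function name select_by_pattern and the example paths are made up for illustration, and only the *202207* pattern comes from the suggestion above.

```python
from fnmatch import fnmatch


def select_by_pattern(paths, pattern="*202207*"):
    """Return the paths that match a glob-style pattern."""
    return [p for p in paths if fnmatch(p, pattern)]


# Hypothetical listing; in a recipe this would come from the input folder.
paths = [
    "/sales/part_20220630.csv",
    "/sales/part_20220701.csv",
    "/sales/part_20220715.csv",
]

selected = select_by_pattern(paths)
print(selected)
```

The selected paths can then be read one by one in the recipe instead of loading the whole folder.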
Hi @Turribeach, thanks for the response. However, I can't filter on end_date (the partition column); I am trying to find a way to filter on the Last Modified column. So I am looking for a way to access this Last Modified column in Dataiku, or to read only the files that were updated during the last S3 upload.
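One way to filter on Last Modified in a Python recipe is to list each file together with its modification timestamp and keep only those close to the newest one. The following is a rough sketch, not a tested Dataiku recipe: select_latest_upload, from_epoch_ms, the one-hour window and the example paths are all hypothetical. It also assumes the listing has already been built, for instance from a managed folder's file details or from the LastModified field that boto3's list_objects_v2 returns; check the relevant API documentation for how to obtain it in your setup.

```python
from datetime import datetime, timedelta, timezone


def from_epoch_ms(ms):
    """Convert an epoch timestamp in milliseconds to an aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def select_latest_upload(files, window=timedelta(hours=1)):
    """Keep the paths modified within `window` of the newest file.

    `files` is an iterable of (path, last_modified) pairs, where
    last_modified is a datetime. The window is an assumption: it should
    be at least as long as one upload run takes.
    """
    files = list(files)
    if not files:
        return []
    newest = max(ts for _, ts in files)
    return sorted(path for path, ts in files if newest - ts <= window)


# Hypothetical listing of (path, last modified) pairs.
utc = timezone.utc
files = [
    ("/data/end_date=2022-07-01/part-0.parquet", datetime(2022, 7, 10, 2, 0, tzinfo=utc)),
    ("/data/end_date=2022-07-02/part-0.parquet", datetime(2022, 7, 12, 2, 5, tzinfo=utc)),
    ("/data/end_date=2022-07-03/part-0.parquet", datetime(2022, 7, 12, 2, 20, tzinfo=utc)),
]

recent = select_latest_upload(files)
print(recent)
```

Comparing against the newest timestamp rather than against "now" means the result does not depend on when the recipe runs, only on when the last upload happened.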
BTW, I have seen this asked a few times already, and we have also wanted to do this without code, so I raised this idea in the Product Ideas section:
Feel free to upvote it by clicking the up arrow.