Read from S3 connection based on last modified date
Hi all, I have an S3 connection through which I can access data stored on S3. However, the data is huge and I want to read only the data that was modified recently. Also, I can't filter using end_date because another team does a full refresh and uploads all the data daily.
So in simple terms, I want to access only the yellow-highlighted files (these are the most recently modified files). Is there a direct way to do this, or a way through a Python recipe?
Thanks in advance!
Answers
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,124 Neuron
Hi, this is a hacky way, but it would work as a one-off without having to write a Python recipe. Use the Files in Folder dataset, set Files Included to Only Select, and in the expression use something like *202207*. If you need to automate this, you will need to write some Python code. You can use the S3 folder as the input for your recipe and then list all the files in the folder so you can choose which files to load programmatically.
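A minimal sketch of the Python route, for illustration only. It assumes the S3 location is exposed as a managed folder, and that `Folder.get_path_details()` returns a dict with a `lastModified` value in epoch milliseconds; check this against your DSS version. The helper names `select_recent` and `list_recent_paths`, and the one-hour window, are hypothetical choices, not anything from this thread.

```python
from datetime import timedelta


def select_recent(files, window=timedelta(hours=1)):
    """Keep paths modified within `window` of the newest file.

    `files` is an iterable of (path, last_modified_ms) pairs, with the
    timestamp assumed to be epoch milliseconds.
    """
    files = list(files)
    if not files:
        return []
    newest = max(ts for _, ts in files)
    cutoff = newest - int(window.total_seconds() * 1000)
    return sorted(path for path, ts in files if ts >= cutoff)


def list_recent_paths(folder_name, window=timedelta(hours=1)):
    """List paths in a DSS managed folder that belong to the latest upload.

    Assumption: get_path_details() exposes a 'lastModified' key.
    """
    import dataiku  # only importable inside a DSS recipe or notebook

    folder = dataiku.Folder(folder_name)
    pairs = (
        (path, folder.get_path_details(path)["lastModified"])
        for path in folder.list_paths_in_partition()
    )
    return select_recent(pairs, window)
```

Inside a recipe, something like `list_recent_paths("my_s3_folder")` (a placeholder folder name) would return the paths to load, and each one could then be read with `folder.get_download_stream(path)`. Comparing against the newest file rather than the current time means the filter still works if the recipe runs some hours after the upload.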
-
Hi @Turribeach
Thanks for the response. However, I can't filter on end_date (the partition column); I am trying to find a way to filter on the Last Modified column. Hence, I am looking for a way to access this Last Modified column in Dataiku, or to read only the files that were updated during the last S3 upload.
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,124 Neuron
BTW, I have seen this asked a few times already, and we have also wanted to do this without code, so I raised this idea in the Product Ideas section:
Feel free to upvote it by clicking the up arrow.