
Accessing older files in S3 from python


I would like to check my S3 folder, find which is the oldest file in it, and get that file name.

Similarly, I would like to rename and delete S3 files from DSS Python code.

 

Requirement:

I need to process the file "XXXXXX_0.txt" whenever it is placed in S3. Sometimes there may be multiple files, but I have to process them one by one.

So my plan is:

1. Whenever the process needs to be initiated, a "process_start.txt" file will be placed in folder1. I will use this file for my auto trigger (the "Data folder modify" option).

2. I will then look (in a different folder) for files matching the pattern XXXXXX_0.txt and process them.

We may have two files, XXXXXX_0.txt and YYYYY_0.txt. I want to check which file is older and process that one alone. At the end I will remove/rename that file, so that next time the latest file can be processed by the corresponding trigger.

For this reason I am looking for a way to find the oldest file and rename it.

 

Or is there a better solution, such as a custom trigger? That is, could the trigger itself find the oldest file and identify which file needs to be processed? Is that possible?

 

Thanks,

Vinothkumar M

1 Reply
Neuron

@Vinothkumar I apologize, but I do not fully understand your use case. There are a few ways to interact with S3 from Python: you can use Boto3 to interact with objects and inspect their metadata (which you will likely need), or you can use pandas (if you have the s3fs package installed) to read from and write to S3 as if it were an ordinary file system. I suspect you will need Boto3, since you want some logic around processing the oldest files.

I have included some code that shows how you can manipulate files in S3 from Python (including checking for the oldest file, reading it into Python, writing to S3, copying/renaming, and deleting). Another option, instead of deleting older files, would be to update their metadata in S3: look for the oldest file, process it, and once it is processed, add a tag to the file like "processed": "true". In the future you would then process the oldest file in the bucket that does not have that tag.

Anyway, here's some helper code in case it's useful (general Python/S3 stuff):

import pandas as pd
import boto3
import io
import os

s3_resource = boto3.resource('s3',
                             aws_access_key_id='<key>',
                             aws_secret_access_key='<secret>')

my_bucket = s3_resource.Bucket('twright-sflots')  # substitute your S3 bucket name

files = list(my_bucket.objects.filter(Prefix='errors/'))  # object summaries under the prefix

# Get the oldest file in S3
oldest_file = min(files, key=lambda file: file.last_modified)  # oldest object summary
oldest_file_obj = s3_resource.Object(my_bucket.name, oldest_file.key)  # get the full object

# Do something to process the file (presumably in Python, since you've asked about Python from DSS):
df = pd.read_csv(io.BytesIO(oldest_file_obj.get()['Body'].read()))
df['new_column'] = 'new_value'

# write df back to S3
csv_buffer = io.StringIO()
df.to_csv(csv_buffer)
s3_resource.Object(my_bucket.name, 'errors/some_new_file.csv').put(Body=csv_buffer.getvalue())

# Copy and Rename the oldest_file (not sure you'd need this)
fname, extension = os.path.splitext(oldest_file.key)
s3_resource.Object(my_bucket.name, fname + '_old' + extension)\
.copy_from(CopySource={'Bucket': oldest_file.bucket_name, 'Key': oldest_file.key})

# Delete the original file
oldest_file_obj.delete() 

 
