Selecting the latest file added in a folder using Python

pafj
Level 2
Hi All,

I would appreciate it if someone could provide me with a Python script to select (read_csv) the latest CSV file in an SFTP folder.

Currently I am using the following script to read CSV files from an SFTP folder. However, when multiple files have been added on different dates, I would like to select only the most recently added one.

 

import dataiku
import pandas as pd
import numpy as np
from dataiku import pandasutils as pdu

FOLDER_NAME = 'folder_1'
FILE_NAME = 'file_1.csv'
DATASET_NAME = 'dataset_1'

folder = dataiku.Folder(FOLDER_NAME)
with folder.get_download_stream(FILE_NAME) as f:
    df = pd.read_csv(f)
dataiku.Dataset(DATASET_NAME).write_with_schema(df)

 

Let's assume I have two files in the folder, file_1.csv and file_2.csv. file_2.csv was added to the folder today and file_1.csv was added last month. How can I select file_2.csv?

2 Replies
tgb417
Neuron

@pafj 

What you are trying to do is achievable. Several posts in the Dataiku community cover this kind of thing:

https://community.dataiku.com/t5/Using-Dataiku/S3-path-Details-whenever-a-new-file-is-received/td-p/... 

https://community.dataiku.com/t5/Using-Dataiku/quot-Finger-Printing-quot-files-in-a-Managed-Folder/t... 

There is also documentation like this:

https://doc.dataiku.com/dss/latest/python-api/managed_folders.html 

https://doc.dataiku.com/dss/latest/connecting/managed_folders.html

In general, you might do something like this:

#...

input_folder = dataiku.Folder("AAAAAAAA")
paths = input_folder.list_paths_in_partition()

#...

path_details = []
for path in paths:
    path_details.append(input_folder.get_path_details(path=path))
#...

Note that "AAAAAAAA" stands for whatever name you gave the FTP-connected folder.

In path_details you then have a bunch of data about your files, including modification times.
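As a rough sketch of how the modification times could be used to pick the newest file: the helper below takes a dict mapping each path to its details and returns the path with the largest 'lastModified' value. The helper name latest_path, the 'directory' key, and the sample timestamps are assumptions for illustration only; inside DSS you would build the dict from get_path_details() rather than by hand.

```python
def latest_path(details_by_path):
    """Return the path whose details have the largest 'lastModified' value.

    details_by_path maps a path string to a details dict. The dict is assumed
    to contain a numeric 'lastModified' entry and, optionally, a 'directory'
    flag (an assumption about the shape of the details).
    """
    # Skip anything flagged as a directory; treat a missing flag as a file.
    files = {
        path: details
        for path, details in details_by_path.items()
        if not details.get("directory", False)
    }
    if not files:
        raise ValueError("no files found")
    return max(files, key=lambda path: files[path]["lastModified"])


# Stand-in data with made-up timestamps, in place of real folder details.
sample = {
    "/file_1.csv": {"lastModified": 1640995200000},
    "/file_2.csv": {"lastModified": 1643673600000},
}
print(latest_path(sample))  # /file_2.csv
```

In a recipe, the dict could be built with something like {p: input_folder.get_path_details(path=p) for p in paths}, and the returned path passed to get_download_stream.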

This seems to work OK for moderately sized repositories of fewer than 100,000 files. Beyond that size things start to break down, and in my experience moving to shell scripts is faster and more reliable.

I hope this helps.   Let us know how you get on with your project.  

--Tom
pafj
Level 2
Author

Hi @tgb417 ,

Thank you for taking the time and answering my question. 

It was really helpful and I'm using the following script now: 

import dataiku
from dataiku import pandasutils as pdu
import pandas as pd
import time

FOLDER_NAME = "AAAAA"
input_folder = dataiku.Folder(FOLDER_NAME)

# Current time in epoch milliseconds
current_epoch = int(time.time()) * 1000

# Print the details and last-modified value of each item in the folder root
for item in input_folder.get_path_details()["children"]:
    print(item)
    print(item['lastModified'])

print(current_epoch)

This script lists all the files along with their last-modified timestamps, which is perfect. Now how can I select the file with the maximum timestamp (the latest file)?
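One possible approach, sketched under assumptions: take the "children" list printed above, keep only the CSV files, and use max() on 'lastModified'. The helper name latest_csv_child is hypothetical, and the 'fullPath' and 'directory' keys are assumed to be present in each child dict (check the printed items to confirm). The sample data is made up so the snippet runs outside DSS.

```python
def latest_csv_child(children):
    """Return the child dict of the most recently modified CSV file, or None.

    Each child is assumed to carry 'fullPath' and 'lastModified' keys and,
    optionally, a 'directory' flag.
    """
    csv_items = [
        child
        for child in children
        if not child.get("directory", False)
        and child["fullPath"].lower().endswith(".csv")
    ]
    if not csv_items:
        return None
    return max(csv_items, key=lambda child: child["lastModified"])


# Made-up listing standing in for input_folder.get_path_details()["children"].
children = [
    {"fullPath": "/file_1.csv", "lastModified": 1640995200000},
    {"fullPath": "/file_2.csv", "lastModified": 1643673600000},
    {"fullPath": "/notes.txt", "lastModified": 1650000000000},
]
newest = latest_csv_child(children)
print(newest["fullPath"])  # /file_2.csv

# Inside a DSS recipe or notebook (not executed here), the result could be read with:
#   newest = latest_csv_child(input_folder.get_path_details()["children"])
#   with input_folder.get_download_stream(newest["fullPath"]) as f:
#       df = pd.read_csv(f)
```

Filtering on the .csv suffix keeps a newer non-CSV file from being picked by mistake; drop that condition if the folder only ever holds CSV files.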
