I would appreciate it if someone could provide a Python script to select (read_csv) the latest CSV file in an SFTP folder.
Currently I am using the following script to read CSV files from an SFTP folder. However, when multiple files have been added on different dates, I would like to select only the latest one added.
import dataiku
import pandas as pd
import numpy as np
from dataiku import pandasutils as pdu

FOLDER_NAME = 'folder_1'
FILE_NAME = 'file_1.csv'
DATASET_NAME = 'dataset_1'

folder = dataiku.Folder(FOLDER_NAME)
with folder.get_download_stream(FILE_NAME) as f:
    df = pd.read_csv(f)
dataiku.Dataset(DATASET_NAME).write_with_schema(df)
Let's assume I have two files in the folder, file_1.csv and file_2.csv: file_2.csv was added to the folder today and file_1.csv was added last month. How can I select file_2.csv?
What you are trying to do is achievable. There are a number of posts in the Dataiku community where folks are doing this kind of thing.
There is also documentation like this:
In general you might do something like this:
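import dataiku

input_folder = dataiku.Folder("AAAAAAAA")
paths = input_folder.list_paths_in_partition()
path_details = []
for path in paths:
    # Collect the metadata (including modify time) for each file
    path_details.append(input_folder.get_path_details(path))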
Note that "AAAAAAAA" is whatever name you gave the FTP-connected folder.
In path_details you then have a bunch of data about your files, including modify times.
This seems to work OK for small to moderately sized repositories of fewer than 100,000 files. Beyond that size things start breaking down, and moving to shell scripts is faster and more reliable in my experience.
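As a rough illustration of working with the modify times in path_details: the sketch below assumes each entry is a dict with a fullPath string and a lastModified value in epoch milliseconds, which is what get_path_details appears to return. The helper name describe_paths and the sample entries are made up for illustration and stand in for real folder contents.

```python
from datetime import datetime, timezone


def describe_paths(path_details):
    """Return (fullPath, UTC datetime) pairs, oldest first."""
    rows = []
    for entry in path_details:
        # Assumption: lastModified is expressed in epoch milliseconds
        modified = datetime.fromtimestamp(entry["lastModified"] / 1000, tz=timezone.utc)
        rows.append((entry["fullPath"], modified))
    return sorted(rows, key=lambda row: row[1])


# Hypothetical sample data, not taken from a real folder
sample = [
    {"fullPath": "/file_2.csv", "lastModified": 1700000000000},
    {"fullPath": "/file_1.csv", "lastModified": 1690000000000},
]

for path, modified in describe_paths(sample):
    print(path, modified.isoformat())
```

Sorting oldest first means the last row is the most recently modified file.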
I hope this helps. Let us know how you get on with your project.
Hi @tgb417 ,
Thank you for taking the time and answering my question.
It was really helpful and I'm using the following script now:
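import dataiku
import time
from dataiku import pandasutils as pdu
import pandas as pd

FOLDER_NAME = "AAAAA"
input_folder = dataiku.Folder(FOLDER_NAME)
current_epoch = int(time.time()) * 1000  # current time in epoch milliseconds
for item in input_folder.get_path_details()["children"]:
    # Print each file's path and its last-modified time
    print(item["fullPath"], item["lastModified"])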
This script lists all the files along with their last modified dates, which is perfect. Now how can I select the latest file (the one with the maximum last modified date)?
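One possible way to pick the latest file is to take the maximum over lastModified. The sketch below is a suggestion rather than a tested Dataiku recipe. It assumes each child entry carries fullPath, lastModified (epoch milliseconds) and directory keys. The name latest_csv is a hypothetical helper, and the sample list stands in for input_folder.get_path_details()["children"].

```python
def latest_csv(children):
    """Return the entry of the most recently modified .csv file, or None."""
    csv_files = [
        item for item in children
        # Skip sub-directories and anything that is not a CSV file
        if not item.get("directory", False)
        and item["fullPath"].lower().endswith(".csv")
    ]
    # max() with a key picks the entry with the largest lastModified value;
    # default=None covers the case where no CSV file is present
    return max(csv_files, key=lambda item: item["lastModified"], default=None)


# Hypothetical sample data mirroring the two-file scenario in the question
children = [
    {"fullPath": "/file_1.csv", "lastModified": 1690000000000, "directory": False},
    {"fullPath": "/file_2.csv", "lastModified": 1700000000000, "directory": False},
    {"fullPath": "/archive", "lastModified": 1710000000000, "directory": True},
]

newest = latest_csv(children)
print(newest["fullPath"])
```

Inside DSS, passing input_folder.get_path_details()["children"] to the helper, then handing newest["fullPath"] to get_download_stream and pd.read_csv as in the first script of this thread, should load only the newest file.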