Selecting the latest file added in a folder using Python
Hi All,
I would appreciate it if someone could provide me with a Python script to select (read_csv) the latest CSV file in an SFTP folder.
Currently I am using the following script to read CSV files from an SFTP folder. However, when multiple files have been added on different dates, I would like to select only the latest one.
import dataiku
import pandas as pd
import numpy as np
from dataiku import pandasutils as pdu

FOLDER_NAME = 'folder_1'
FILE_NAME = 'file_1.csv'
DATASET_NAME = 'dataset_1'

folder = dataiku.Folder(FOLDER_NAME)
with folder.get_download_stream(FILE_NAME) as f:
    df = pd.read_csv(f)
dataiku.Dataset(DATASET_NAME).write_with_schema(df)
Let's assume I have two files in the folder, file_1.csv and file_2.csv, where file_2.csv was added today and file_1.csv was added last month. How can I select file_2.csv?
Answers
-
tgb417
What you are trying to do is achievable. There are a number of posts in the Dataiku community where folks are doing this kind of thing.
There is also documentation, such as:
https://doc.dataiku.com/dss/latest/python-api/managed_folders.html
https://doc.dataiku.com/dss/latest/connecting/managed_folders.html
In general, you might do something like this:
#...
input_folder = dataiku.Folder("AAAAAAAA")
paths = input_folder.list_paths_in_partition()
#...
path_details = []
for path in paths:
    path_details.append(input_folder.get_path_details(path=path))
#...
Note that the folder "AAAAAAAA" is whatever folder name you gave the FTP-connected folder.
In path_details you have a bunch of data about your files, including modify times.
This seems to work OK for moderately sized data repositories of fewer than 100,000 files. Beyond that size things start breaking down, and in my experience moving to shell scripts is faster and more reliable.
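To pick the newest file out of path_details, one option is to take the entry with the largest modify time. This is only a sketch: it assumes each entry is a dict carrying 'fullPath', 'directory', and 'lastModified' (epoch milliseconds) keys, so check the actual keys in your own output first. The helper name newest_entry is hypothetical and the sample data is made up.

```python
def newest_entry(path_details):
    """Return the file entry with the largest 'lastModified', or None if there are no files."""
    # Skip sub-directories; only plain files are candidates (assumed 'directory' flag).
    files = [d for d in path_details if not d.get("directory", False)]
    if not files:
        return None
    # 'lastModified' is assumed to be an epoch timestamp, so the largest value is the newest.
    return max(files, key=lambda d: d["lastModified"])


# Made-up sample standing in for the path_details list built above.
sample = [
    {"fullPath": "/file_1.csv", "directory": False, "lastModified": 1700000000000},
    {"fullPath": "/file_2.csv", "directory": False, "lastModified": 1702600000000},
]

latest = newest_entry(sample)
print(latest["fullPath"])
```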
I hope this helps. Let us know how you get on with your project.
—Tom
-
Hi @tgb417,
Thank you for taking the time to answer my question.
It was really helpful and I'm using the following script now:
import dataiku
from dataiku import pandasutils as pdu
import pandas as pd
import time

FOLDER_NAME = "AAAAA"
input_folder = dataiku.Folder(FOLDER_NAME)
current_epoch = int(time.time())*1000

for item in input_folder.get_path_details()["children"]:
    print(item)
    print(item['lastModified'])

print(current_epoch)
This script lists all the files along with their last-modified dates, which is perfect. Now how can I select the latest file (the one with the maximum last-modified value)?
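One possible way to select the latest file from that listing is to filter the children down to CSV files and take the one with the highest 'lastModified' value. The sketch below is hedged: latest_csv_path is a hypothetical helper, the 'fullPath' and 'directory' keys are assumptions about what get_path_details() returns, and FakeFolder is a made-up stand-in so the example can run outside DSS.

```python
def latest_csv_path(folder):
    """Return the path of the most recently modified .csv under the folder root, or None."""
    children = folder.get_path_details().get("children", [])
    # Keep plain files whose path ends in .csv (key names are assumptions).
    csv_files = [
        c for c in children
        if not c.get("directory", False) and c["fullPath"].lower().endswith(".csv")
    ]
    if not csv_files:
        return None
    # The largest 'lastModified' value corresponds to the newest file.
    return max(csv_files, key=lambda c: c["lastModified"])["fullPath"]


class FakeFolder:
    """Made-up stand-in for dataiku.Folder, exposing only get_path_details()."""

    def get_path_details(self, path="/"):
        return {
            "directory": True,
            "children": [
                {"fullPath": "/file_1.csv", "directory": False, "lastModified": 1700000000000},
                {"fullPath": "/file_2.csv", "directory": False, "lastModified": 1702600000000},
                {"fullPath": "/notes.txt", "directory": False, "lastModified": 1702700000000},
            ],
        }


print(latest_csv_path(FakeFolder()))
```

Inside DSS, the returned path could then be passed to input_folder.get_download_stream(...) and on to pd.read_csv, in the same way the original script reads a fixed file name.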
-
Hi,
Were you able to find a solution for this, where Dataiku reads the most recently uploaded file out of all the uploaded files?