Selecting the latest file added in a folder using Python

pafj Dataiku DSS Core Designer, Registered Posts: 13 ✭✭✭
edited July 16 in Using Dataiku

Hi All,

I would appreciate it if someone could provide a Python script that selects (and reads with read_csv) the latest CSV file in an SFTP folder.

Currently I am using the following script to read CSV files from an SFTP folder. However, when multiple files have been added on different dates, I would like to select only the most recently added one.

import dataiku
import pandas as pd
import numpy as np
from dataiku import pandasutils as pdu

FOLDER_NAME = 'folder_1'
FILE_NAME = 'file_1.csv'
DATASET_NAME = 'dataset_1'

folder = dataiku.Folder(FOLDER_NAME)
with folder.get_download_stream(FILE_NAME) as f:
    df = pd.read_csv(f)
dataiku.Dataset(DATASET_NAME).write_with_schema(df)

Let's assume I have two files in the folder, file_1.csv and file_2.csv. file_2.csv was added to the folder today and file_1.csv was added last month. How can I select file_2.csv?


Answers

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron
    edited July 17

    @pafj

    What you are trying to do is achievable. There are a number of posts in the Dataiku community where folks are doing this kind of thing.

    https://community.dataiku.com/t5/Using-Dataiku/S3-path-Details-whenever-a-new-file-is-received/td-p/21214

    https://community.dataiku.com/t5/Using-Dataiku/quot-Finger-Printing-quot-files-in-a-Managed-Folder/td-p/21914

    There is also documentation like this:

    https://doc.dataiku.com/dss/latest/python-api/managed_folders.html

    https://doc.dataiku.com/dss/latest/connecting/managed_folders.html

    In general, you might do something like this:

    #...

    input_folder = dataiku.Folder("AAAAAAAA")
    paths = input_folder.list_paths_in_partition()

    #...

    path_details = []
    for path in paths:
        path_details.append(input_folder.get_path_details(path=path))
    #...

    Note that "AAAAAAAA" is whatever name you gave the FTP-connected folder.

    After the loop, path_details holds metadata about each of your files, including its last-modified time.

    This seems to work OK for small to moderately sized repositories of fewer than 100,000 files. Beyond that size, things start to break down, and in my experience moving to shell scripts is faster and more reliable.

    I hope this helps. Let us know how you get on with your project.

    —Tom
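
To make the path_details idea above concrete, here is a minimal sketch that orders such entries newest first and prints readable timestamps. It assumes each entry is a dict with 'fullPath' and 'lastModified' keys, that 'lastModified' is in epoch milliseconds, and that directories carry a 'directory' flag; those key names and the sample values are assumptions for illustration, so print one real entry from your own folder to confirm. sort_newest_first and to_utc are hypothetical helper names.

```python
from datetime import datetime, timezone


def sort_newest_first(path_details):
    """Return file entries sorted by 'lastModified', newest first.

    Entries flagged as directories are skipped (the 'directory' key is an
    assumption; entries without it are treated as files).
    """
    files = [d for d in path_details if not d.get("directory", False)]
    return sorted(files, key=lambda d: d["lastModified"], reverse=True)


def to_utc(epoch_ms):
    """Convert epoch milliseconds (assumed unit) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


# Made-up sample entries standing in for the path_details list built in the loop.
sample = [
    {"fullPath": "/file_1.csv", "lastModified": 1718000000000, "directory": False},
    {"fullPath": "/archive", "lastModified": 1722000000000, "directory": True},
    {"fullPath": "/file_2.csv", "lastModified": 1721000000000, "directory": False},
]

ordered = sort_newest_first(sample)
for entry in ordered:
    print(entry["fullPath"], to_utc(entry["lastModified"]).isoformat())
```

In a recipe you would pass the path_details list from the loop above instead of the sample data. The first element of the result is then the most recently modified file.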

  • pafj
    pafj Dataiku DSS Core Designer, Registered Posts: 13 ✭✭✭

    Hi @tgb417,

    Thank you for taking the time to answer my question.

    It was really helpful and I'm using the following script now:

    import dataiku
    from dataiku import pandasutils as pdu
    import pandas as pd
    import time

    FOLDER_NAME = "AAAAA"
    input_folder = dataiku.Folder(FOLDER_NAME)


    current_epoch = int(time.time())*1000

    for item in input_folder.get_path_details()["children"]:
        print(item)
        print(item['lastModified'])

    print(current_epoch)

    This script lists all the files along with their last-modified timestamps, which is perfect. Now, how can I select the file with the maximum timestamp (the latest file)?
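
One possible way to pick the latest file from the listing above is Python's built-in max() with 'lastModified' as the key. This is only a sketch under assumptions: latest_csv is a hypothetical helper, each child is assumed to be a dict with 'fullPath' and 'lastModified' keys, and the sample values are made up. Note that 'lastModified' reflects modification time, which may not always equal the time a file was uploaded.

```python
def latest_csv(children):
    """Return the entry with the largest 'lastModified' among the .csv files.

    Raises ValueError if the listing contains no CSV file.
    """
    csvs = [
        c for c in children
        if str(c.get("fullPath", "")).lower().endswith(".csv")
    ]
    if not csvs:
        raise ValueError("no CSV files found in the folder listing")
    return max(csvs, key=lambda c: c["lastModified"])


# Made-up entries standing in for input_folder.get_path_details()["children"].
sample_children = [
    {"fullPath": "/file_1.csv", "lastModified": 1718000000000},
    {"fullPath": "/file_2.csv", "lastModified": 1721000000000},
    {"fullPath": "/notes.txt", "lastModified": 1723000000000},
]

newest = latest_csv(sample_children)
print(newest["fullPath"])

# Inside a DSS recipe, the same helper could be wired up like this
# (left commented out so the sketch also runs outside DSS):
#
# import dataiku
# import pandas as pd
#
# input_folder = dataiku.Folder("AAAAA")
# newest = latest_csv(input_folder.get_path_details()["children"])
# with input_folder.get_download_stream(newest["fullPath"]) as f:
#     df = pd.read_csv(f)
```

Filtering on the .csv extension keeps other files in the folder from being picked up by accident. Drop the filter if the folder only ever contains CSV files.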

  • PrathameshPatil
    PrathameshPatil Dataiku DSS Core Designer, Registered Posts: 2

    Hi,

    Were you able to find a solution for having Dataiku read the most recently uploaded file out of all the uploaded files?
