Sequential processing of files from filesystem_root folder

abhayt
abhayt Registered Posts: 3 ✭✭✭

I built a Dataiku flow using the Dataset -> Upload your File option (a single CSV file).

Now I have replaced the single file with a filesystem_root folder that contains multiple files of the same type.

I want the files to be picked up from this folder one at a time. With the current folder behavior, all the files are first merged (appended) and then processed in the flow, which causes unexpected output.

How do I handle the iterative picking & processing of the files from this folder?

Regards.

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,237 Dataiker
    edited July 2024

    Hi,

    Perhaps you can elaborate a bit on what you mean by unexpected output? If the files are of the same type and have the same schema, stacking them should not be problematic.

    If it is, then you would want to use a code recipe to read each file and add it to a dataset.

    Here is an example. It gives you more flexibility to process each file with pandas, for example to exclude lines. It works as written if you are using a local filesystem managed folder; if you are using a remote folder (e.g. S3), you will need to use get_download_stream() instead.

    import glob
    import os

    import dataiku
    import pandas as pd

    # Get the managed folder by its ID
    managed_folder = dataiku.Folder("27jITy2p")
    # Get the path to the managed folder on disk (local filesystem folders only)
    path = managed_folder.get_path()

    # Read each CSV file into its own DataFrame, in a stable (sorted) order;
    # any per-file processing can be applied here before the files are combined
    csv_files = sorted(glob.glob(os.path.join(path, "*.csv")))
    frames = [pd.read_csv(filename) for filename in csv_files]
    df = pd.concat(frames, axis=0, ignore_index=True)

    # Write recipe outputs
    merged_orders = dataiku.Dataset("merged_orders")
    merged_orders.write_with_schema(df)
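
For a remote folder, or when each file needs its own processing before being combined, a rough sketch along the following lines may help. It assumes the folder object offers list_paths_in_partition() and get_download_stream(), as dataiku.Folder does. The helper names process_file and read_folder_csvs, the source_file column, and the UTF-8 encoding are illustrative assumptions, not part of the original answer. The standard-library csv module is used only to keep the sketch free of other dependencies; pd.read_csv on the stream would be the more likely choice in a recipe.

```python
import csv
import io


def process_file(rows, source):
    """Hypothetical per-file step: tag every row with the file it came from."""
    return [dict(row, source_file=source) for row in rows]


def read_folder_csvs(folder):
    """Read each CSV in a managed folder separately, then combine the results."""
    combined = []
    # Sort the paths so the files are handled in a predictable order.
    for path in sorted(folder.list_paths_in_partition()):
        if not path.lower().endswith(".csv"):
            continue  # skip anything that is not a CSV file
        # Stream the file content instead of relying on a local disk path.
        with folder.get_download_stream(path) as stream:
            text = stream.read().decode("utf-8")  # assumes UTF-8 encoded files
        rows = list(csv.DictReader(io.StringIO(text)))
        # Each file is processed on its own before being appended.
        combined.extend(process_file(rows, path))
    return combined
```

In a recipe, folder would be something like dataiku.Folder("27jITy2p"), and the combined rows could be turned into a DataFrame and written with write_with_schema() as in the example above. Replacing the body of process_file is where any file-specific logic would go.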
