Sequential processing of files from filesystem_root folder

abhayt Registered Posts: 3 ✭✭✭

I built a Dataiku flow using the Dataset -> Upload your file option (a single CSV file).

Now I have replaced the single file with a filesystem_root folder that contains multiple files of the same type.

I want the files to be picked up one at a time from this folder, because with the current behavior of folders all the files are merged (appended) first and then processed in the flow, which causes unexpected output.

How do I handle picking up and processing the files from this folder iteratively?

Regards.

Answers

  • Alexandru (Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209)
    edited July 17

    Hi,

    Perhaps you can elaborate a bit on what you mean by unexpected output? If the files are of the same type and have the same schema, stacking them should not be a problem.

    If it is, you can use a code recipe to read each file individually and append it to a dataset.

    Here is an example; it gives you more flexibility to process or exclude lines, etc., from each file using pandas. This assumes a local filesystem managed folder; if you are using a remote folder (e.g. S3), you will need to use get_download_stream() instead (see the sketch after the code below).

    import dataiku
    import pandas as pd
    import glob
    import os
    
    # Get a handle on the managed folder (replace the id with your own)
    managed_folder = dataiku.Folder("27jITy2p")
    # Get the path to the managed folder on the local filesystem
    path = managed_folder.get_path()
    
    # Read every CSV in the folder into its own DataFrame, then stack them
    dfs = [pd.read_csv(filename) for filename in glob.glob(os.path.join(path, "*.csv"))]
    df = pd.concat(dfs, axis=0, ignore_index=True)
    
    # Write recipe output
    merged_orders = dataiku.Dataset("merged_orders")
    merged_orders.write_with_schema(df)
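
    For the remote-folder case, here is a minimal sketch of the same logic using get_download_stream() (assuming the same folder id and output dataset as above; the commented-out filter line is a hypothetical placeholder for whatever per-file processing you need):
    
    import dataiku
    import pandas as pd
    
    # Managed folder handle (same folder id as above)
    managed_folder = dataiku.Folder("27jITy2p")
    
    dfs = []
    # list_paths_in_partition() lists the paths of all files in the folder
    for file_path in managed_folder.list_paths_in_partition():
        if not file_path.endswith(".csv"):
            continue
        # get_download_stream() yields a file-like object pandas can read directly
        with managed_folder.get_download_stream(file_path) as stream:
            file_df = pd.read_csv(stream)
        # Hypothetical per-file step, e.g. dropping unwanted lines:
        # file_df = file_df[file_df["status"] != "invalid"]
        dfs.append(file_df)
    
    df = pd.concat(dfs, axis=0, ignore_index=True)
    
    # Write recipe output
    merged_orders = dataiku.Dataset("merged_orders")
    merged_orders.write_with_schema(df)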
