Sequential processing of files from filesystem_root folder
I built a Dataiku flow using the Dataset -> Upload your File option (a single CSV file).
Now I have replaced the single file with a filesystem_root folder that contains multiple files of the same type.
I want the files to be picked up from this folder one at a time. With the current folder behavior, all the files are merged (appended) first and only then processed in the flow, which causes unexpected output.
How do I pick up and process the files from this folder iteratively?
Regards.
Answers
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,237 Dataiker
Hi,
Perhaps you can elaborate a bit on what you mean by unexpected output? If the files are of the same type and have the same schema, stacking them should not be problematic.
If it is, then you would want to use a code recipe to read each file and add it to a dataset.
Here is an example. It gives you more flexibility to process each file with pandas, for instance to exclude lines. It works as written if you are using a local filesystem managed folder. If you are using a remote folder (e.g. S3), you will need to use get_download_stream() instead.
import dataiku
from dataiku import pandasutils as pdu
import pandas as pd
import glob

# Get DSS client
client = dataiku.api_client()

# Get managed folder
project = client.get_project(dataiku.default_project_key())
managed_folder = dataiku.Folder("27jITy2p")

# Get path to managed folder on disk
path = managed_folder.get_path()

l = [pd.read_csv(filename) for filename in glob.glob(path + "/*.csv")]
df = pd.concat(l, axis=0)

# Write recipe outputs
merged_orders = dataiku.Dataset("merged_orders")
merged_orders.write_with_schema(df)
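For the remote-folder case, and for handling files strictly one at a time, a minimal sketch could look like the following. It assumes the folder object exposes list_paths_in_folder() and get_download_stream(path), as dataiku.Folder does. The helper names (iter_folder_files, process_one_file, process_folder) and the source_file column are hypothetical and only there for illustration. The standard csv module is used so the sketch has no extra dependencies; pd.read_csv would fit in the same place.

```python
import csv
import io


def iter_folder_files(folder, suffix=".csv"):
    """Yield (path, binary stream) for each matching file, one at a time, in sorted order."""
    for path in sorted(folder.list_paths_in_folder()):
        if path.lower().endswith(suffix):
            # The stream is closed as soon as the caller moves on to the next file.
            with folder.get_download_stream(path) as stream:
                yield path, stream


def process_one_file(path, stream):
    """Hypothetical per-file step: parse the rows and tag each one with its source file."""
    text = stream.read().decode("utf-8")  # assumes UTF-8 encoded CSV files
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["source_file"] = path
    return rows


def process_folder(folder):
    """Process every CSV file independently, then collect the per-file results."""
    results = []
    for path, stream in iter_folder_files(folder):
        results.extend(process_one_file(path, stream))
    return results
```

In a recipe, this would presumably be called as process_folder(dataiku.Folder("27jITy2p")), with the returned rows turned into a DataFrame and written out with write_with_schema as in the example above. Any per-file logic, such as skipping lines or validating a file before it is added, would go in process_one_file.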