Filtering files in a folder based on an external list

mejiaks
Level 1

OK, I am beyond (or behind) a newbie on Dataiku, so bear with me on this.

 

I have a folder containing CSV files. It currently holds 3,000 files, each probably 100 KB at most, but all together they add up to maybe 15MM rows.

I've created a dataset based on this folder and used a recipe to filter only the rows I need. The dataset has a single column containing a structured string with a specific suffix, and I filter on that suffix, which leaves 10MM+ rows. From there I parse the dataset and split it into more columns based on some specific criteria.

At the end I create a table in Snowflake with all the detail, which ends up as a 1.3MM-row dataset; let's call it "DetailedDataset".

Here is my issue: every time there are new files to add to the folder, I have to run the flow again over all 3,000 files and 15MM rows. This takes more time every day and is now at 35 minutes per run. Keep in mind that the first recipe, where I filter only the rows I need, takes the majority of that time, 26 minutes or more. Some of you would argue 30 minutes isn't that bad, but what will happen when instead of 3,000 files we have 6,000, or 9,000?

So I am looking for a solution, probably relying on Python (which I do not know how to use, but as a programmer I can figure it out), that filters out the files that have already been processed, based on a dataset that selects the list of files from "DetailedDataset", so that the first recipe, the one that takes half an hour, only processes the new files.

Is this possible? Can the performance of the flow be improved?
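
Roughly, what I picture is something like the sketch below (I am only guessing at the Dataiku Python API here; the folder name, the dataset read and the source_file column that would hold each row's original file name are all assumptions on my part):

import dataiku

# Managed folder that holds the raw CSV files (folder name is a guess)
folder = dataiku.Folder("raw_csv_files")

# "DetailedDataset" already holds the processed rows; "source_file" is an
# assumed column containing the original file name for each row
detailed = dataiku.Dataset("DetailedDataset")
processed_files = set(
    detailed.get_dataframe(columns=["source_file"])["source_file"].unique()
)

# Keep only the files in the folder that have not been processed yet
new_files = [p for p in folder.list_paths_in_partition()
             if p.lstrip("/") not in processed_files]

print("Files still to process: %d" % len(new_files))

Only the files in new_files would then be fed into that first half-hour recipe.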

 

 

2 Replies
Turribeach

Hi, first of all I am not sure what MM means. Is that millions of millions? Please clarify.

Secondly, your post title says "Filtering files in a folder based on an external list", but your post seems to be about avoiding reloading all the CSV files every time new CSV files arrive. This is a common pattern in data pipelines. I agree that a Python recipe will give you the best capabilities, but it can be achieved with other recipe types too (like the Shell recipe).

You need to redesign your flow to load these CSV files as they come and, once loaded, move them to another folder, so that you know they have been loaded successfully and you avoid ever loading them again when the flow is started again. You also need to start maintaining a "historical" table that you keep appending to as needed. Dataiku recipes support "Append instead of overwrite", but you need to be careful, since Dataiku will always drop a dataset when it detects schema changes. So if you want to use a historical table, you are better off managing it outside Dataiku so that it won't be dropped when you need to make column changes.
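
As a rough illustration, a Python recipe for that "load new files, then move them aside" pattern could look something like the sketch below (the folder names, the staging dataset name and reading everything with pandas are assumptions you would adapt to your own flow):

import dataiku
import pandas as pd

# Folder with incoming CSV files and an "archive" folder for files that
# have already been loaded (both folder names are made up for this example)
incoming = dataiku.Folder("incoming_csv")
archive = dataiku.Folder("processed_csv")

# Output dataset; in the recipe settings you would tick "Append instead of overwrite"
output = dataiku.Dataset("csv_staging")

frames = []
loaded_paths = []
for path in incoming.list_paths_in_partition():
    with incoming.get_download_stream(path) as stream:
        frames.append(pd.read_csv(stream))
    loaded_paths.append(path)

if frames:
    # Write the newly loaded rows to the output dataset
    output.write_with_schema(pd.concat(frames, ignore_index=True))

    # Move the loaded files to the archive folder so they are never
    # picked up again on the next run
    for path in loaded_paths:
        with incoming.get_download_stream(path) as stream:
            archive.upload_stream(path, stream)
        incoming.delete_path(path)

The move at the end is what makes each run pay only for the new files: once a file sits in the archive folder it is never read again.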

Finally, you can use this "hidden" feature of the Files in Folder dataset to load the original file name, so you can trace every record back to the original file it came from:

https://community.dataiku.com/t5/Using-Dataiku/Using-the-quot-Files-in-folder-quot-dataset/m-p/33214

 

 

mejiaks
Level 1
Author

Thank you

That was helpful.

Regarding Dataiku dropping the table when the schema changes, I found that out the hard way last night.

I implemented the scenario you described.

And MM means millions.
