Best approach to appending daily data from files?
Hi everyone!
I'd like your input on the best approach to the following:
I have daily CSV files that contain no date column, but whose values change daily. I want to create a dataset based on these files, and I will be adding files every day. Each file needs a date column reflecting the date it was added.
I created a SharePoint folder where Dataiku will pull the files.
I was able to get Dataiku to add a date column using an expression recipe with the now() expression - this adds today's date. But then tomorrow it will add tomorrow's date and overwrite the previous one.
Can I create a Python recipe to do this? Or do you have any other suggestions?
Thanks,
Operating system used: Windows
Best Answer
-
Turribeach
Use the Files in Folder dataset and the File_Name column to get your date:
https://community.dataiku.com/t5/Using-Dataiku/Using-the-quot-Files-in-folder-quot-dataset/m-p/33214
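For illustration, a minimal pandas sketch of that idea, assuming the File_Name enrichment column is enabled and the file names carry a date token like the "FILENAME_02272024" pattern described further down this thread (both the column name and pattern are assumptions to verify):

```python
import pandas as pd

# Hypothetical rows as read from a "Files in folder" dataset with the
# File_Name enrichment column enabled (column name is an assumption).
df = pd.DataFrame({
    "value": [1.0, 2.0],
    "File_Name": ["FILENAME_02272024.csv", "FILENAME_02282024.csv"],
})

# Extract the MMDDYYYY token from the file name and parse it as a date.
token = df["File_Name"].str.extract(r"_(\d{8})", expand=False)
df["date"] = pd.to_datetime(token, format="%m%d%Y")
print(df)
```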
Answers
-
tgb417
Yes, you can create a Python recipe to do almost anything you want. You can do that from a Python recipe node that reads a folder rather than a CSV- or database-based dataset.
You say that "I will be add[ing] files every day." You don't say whether these files will overwrite the existing files or be added to the same directory alongside the files that are already there. You also don't say whether you have any control over the creation of the original files. For example, you could create the files with names that reflect the current date. Or you could look at the timestamp of each file in the file system and use those dates. You also did not mention anything about the relative size of these files.
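As a minimal sketch of the timestamp idea, a Python recipe over a managed folder could stamp each row with its file's modification date. The folder and dataset names here are hypothetical, and whether get_path_details exposes lastModified as epoch milliseconds should be verified against your DSS version:

```python
import datetime

import dataiku
import pandas as pd

# Read every file in the managed folder and stamp each row with the
# file's modification date (folder/dataset names are hypothetical).
folder = dataiku.Folder("daily_csv_files")

frames = []
for path in folder.list_paths_in_partition():
    details = folder.get_path_details(path)
    # lastModified is assumed to be epoch milliseconds here.
    file_date = datetime.datetime.fromtimestamp(
        details["lastModified"] / 1000.0
    ).date()
    with folder.get_download_stream(path) as stream:
        df = pd.read_csv(stream)
    df["date"] = file_date
    frames.append(df)

output = dataiku.Dataset("daily_data_with_date")
output.write_with_schema(pd.concat(frames, ignore_index=True))
```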
That said, depending on what exactly you are trying to do, you might find one or more of these things helpful:
- Partitioned datasets, to manage multiple files as if they were one large dataset. There are several ways to set up partitioned datasets.
- From a single managed folder, you can treat all of the files in the folder as one dataset.
- There is the Shell flow node, which can capture any file details available from your Linux shell. (This method can be particularly quick when working with file-system-based data at scale.)
- A Stack recipe followed by a Window recipe to de-dupe records, using a sort and a row number within a partition keyed on some useful field. This is a common data-processing pattern for me; a rough sketch of the idea follows this list.
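A rough pandas equivalent of that stack-then-window pattern, with hypothetical column names: sort within each key and keep only the most recent row.

```python
import pandas as pd

# Hypothetical stacked frame: the same key can appear on several dates.
stacked = pd.DataFrame({
    "key":   ["a", "a", "b"],
    "value": [1, 2, 3],
    "date":  pd.to_datetime(["2024-02-26", "2024-02-27", "2024-02-27"]),
})

# Window-recipe equivalent: sort within each key and keep the latest row.
deduped = (
    stacked.sort_values(["key", "date"])
           .drop_duplicates(subset="key", keep="last")
)
print(deduped)
```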
I'm sure there are a bunch of other options that might be helpful, but I'm not clear enough about your use case to provide a more specific set of suggestions.
-
Thanks for your fast response.
To give you more detail:
1 - Files will be added to the same directory alongside the files that are already there.
2 - I download the file from a third-party source and rename it with the current date, e.g. "FILENAME_02272024" (Feb 27, 2024).
3 - Each file contains on average 1,500 rows and 25 columns, about 202 KB in size.
In other scenarios I have created datasets with this type of structure, where the dataset is built from all the files within a directory. The difference is that those files have a date column.
Please advise.
Thanks!
-
@Turribeach
Thanks for sharing!