Using the "Files in folder" dataset
If you load lots of files of the same type into Dataiku, you should look at using the Files in folder dataset. It's a great built-in feature for automating the ingestion of files that share the same format.
You can create a new Files in folder dataset by going to: Dataset => New Dataset => All dataset types => DSS => Files in folder.
But this tip is not about the Files in folder dataset itself; it is about a "hidden" built-in feature of this dataset. When using the Files in folder dataset, is it possible to have the filename and row ID of the records imported into Dataiku added as columns? Yes it is! What you need to do is:
- Add a Prepare recipe using the Files in folder dataset as its input, and add a step using the “Enrich record with context information” processor (you can search for it).
- Then, in the processor, define the output columns for filepath, filename and row ID.
- Finally, run your recipe to populate the new columns. You will now have the filepath, filename and row ID of records added to your data automatically!
- You can also add the “Enrich record with build information” processor as a step to get the Build Date and the Job ID of when your files were last loaded into DSS.
- With both built-in processors added, every record from every file gets the extra columns: filepath, filename, row ID, Build Date and Job ID.
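If it helps to picture what the context columns contain, here is a rough plain-Python sketch of the same idea outside of DSS. It is not how Dataiku implements the processor; the function name `read_folder_with_context`, the column names `file_path`, `file_name` and `row_id`, and the sample files are all made up for illustration, and numbering rows from 0 within each file is an assumption.

```python
import csv
import tempfile
from pathlib import Path


def read_folder_with_context(folder, pattern="*.csv"):
    """Read every matching CSV file and tag each record with where it came from."""
    records = []
    for path in sorted(Path(folder).glob(pattern)):
        with path.open(newline="", encoding="utf-8") as handle:
            for row_id, row in enumerate(csv.DictReader(handle)):
                # Hypothetical column names; in the Prepare recipe you pick your own.
                row["file_path"] = str(path)
                row["file_name"] = path.name
                # Assumption: row ID is the record's position within its own file, from 0.
                row["row_id"] = row_id
                records.append(row)
    return records


def demo():
    """Build two small sample files in a temporary folder and read them back."""
    with tempfile.TemporaryDirectory() as folder:
        Path(folder, "sales_jan.csv").write_text("sku,qty\nA,1\nB,2\n", encoding="utf-8")
        Path(folder, "sales_feb.csv").write_text("sku,qty\nC,3\n", encoding="utf-8")
        return read_folder_with_context(folder)


if __name__ == "__main__":
    for record in demo():
        print(record["file_name"], record["row_id"], record["sku"])
```

The point is simply that each record keeps a pointer back to its source file and its position in that file, which is handy when you need to trace a bad row back to the file it came from.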
Answers
Alexandru (Dataiker)
Thanks for sharing, @Turribeach!