Using the "Files in folder" dataset

Turribeach
If you load lots of files of the same type into Dataiku, you should look at using the Files in folder dataset. It's a great built-in feature for automating the ingestion of files that share the same format.

You can create a new Files in folder dataset by going to: Dataset => New Dataset => All dataset types => DSS => Files in folder. This is how it looks:

 Capture2.PNG

But this tip is not about the Files in folder dataset itself; it is about a "hidden" built-in feature of it. When using the Files in folder dataset, is it possible to have the filename and row ID of each imported record added as columns? Yes it is! What you need to do is:

  1. Add a Prepare recipe using the Files in folder dataset as an input, and add a step using the “Enrich record with context information” processor (you can search for it).
  2. Then define the output columns for filepath, filename and row ID as follows:
     d82752dd-0de4-482c-9ba1-87da2d253acc.png
  3. Finally, run your recipe to populate the new columns. You will now have the filepath, filename and row ID of each record added to your data automatically!
  4. You can also add the “Enrich record with build information” processor as a step to get the Build Date and the Job ID of when your files were last loaded into DSS.
  5. Here is a sample output of all the columns that will be added to all your files if you add the two built-in processors:
     Capture3.PNG
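
If it helps to picture what the context columns contain, here is a minimal plain-Python sketch that does something similar outside of DSS. It is only an illustration, not how Dataiku implements the processor: the function name `read_folder_with_context` and the column names `source_path`, `source_file` and `row_id` are hypothetical, and numbering rows from 0 within each file is an assumption.

```python
import csv
from pathlib import Path


def read_folder_with_context(folder):
    """Read every CSV file in `folder` and tag each record with its origin.

    Hypothetical helper for illustration only; it does not call any Dataiku API.
    """
    records = []
    # Sort the paths so the output order is stable across runs.
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            # Assumption: row IDs restart at 0 for each file.
            for row_id, row in enumerate(csv.DictReader(handle)):
                enriched = dict(row)
                enriched["source_path"] = str(path)   # full path of the file
                enriched["source_file"] = path.name   # file name only
                enriched["row_id"] = row_id           # position within the file
                records.append(enriched)
    return records
```

Calling `read_folder_with_context("some_folder")` returns one dictionary per record, with the original columns plus the three extra ones, which is roughly the shape of data you get from the Prepare recipe step described above.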
1 Reply
AlexT
Dataiker

Thanks for sharing! @Turribeach
