How to parse filename for file based datasets/partitions

Level 3

Hi - I know the pieces exist to do the following, but after reading the documentation, discussions, etc., I have not been able to connect them.

I have a folder on AWS S3 that will contain multiple files whose names start with a date, for example YYYYMMDD-xxxxx.csv.

I can set up a new dataset, choose the folder, and activate partitioning on date, and it finds the files as separate partitions.

What I want to do is end up with a combined dataset that contains the date in the filename as one of the columns. I then want to automate this to pick up and process new files as they are put into the folder.

I really don't know Python, but I am sure this could be done just by adding a Python script to the initial dataset.

Also - I don't necessarily need to do this with partitions. I could just process each file as it comes in and keep appending it to an existing dataset, but it looked like using partitions would be the best solution.

Any help is really appreciated on setting this up and thank you in advance!

7 Replies
Dataiker

Hi @aw30 

In order to do this, I would suggest the following:

- From your partitioned dataset, create a Prepare recipe. The output dataset should NOT be partitioned.

- In your prepare recipe script, add a new step. The step is under Misc > Enrich record with context information. 

- Give a name to the output partition column inside of the step.

- Run your recipe.
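For anyone who would rather see the same idea in code, here is a minimal sketch in plain Python. It is not the Prepare step and uses no DSS API: it only assumes that filenames follow the `YYYYMMDD-xxxxx.csv` pattern from the question, and the names `date_from_filename`, `rows_with_file_date` and the `file_date` column are made up for illustration.

```python
import csv
import io
import re
from datetime import date, datetime

# Assumed naming convention from the question: eight digits, a hyphen,
# any suffix, then ".csv" (e.g. "20201029-sales.csv").
FILENAME_RE = re.compile(r"^(\d{8})-.*\.csv$")


def date_from_filename(path: str) -> date:
    """Extract the leading YYYYMMDD date from a file path."""
    name = path.rsplit("/", 1)[-1]  # drop any folder prefix
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected filename: {name!r}")
    # strptime also rejects impossible dates such as 20201332
    return datetime.strptime(match.group(1), "%Y%m%d").date()


def rows_with_file_date(path: str, text: str) -> list:
    """Parse CSV text and add a hypothetical 'file_date' column to every row."""
    file_date = date_from_filename(path).isoformat()
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row, file_date=file_date) for row in reader]


if __name__ == "__main__":
    sample = "id,amount\n1,10\n2,20\n"
    for row in rows_with_file_date("incoming/20201029-sales.csv", sample):
        print(row)
```

Concatenating the lists returned for each file gives one combined table in which every row carries the date of the file it came from.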

Good luck!

Level 3
Author

Thank you, this does the trick: it pulls in the filename and puts everything into one dataset!

Level 5

Thanks for the suggestion @Liev! One question: why should the output dataset not be partitioned? The answer is probably obvious (which is a way of saying that my question comes from complete ignorance).

Dataiker

I understood from the question that the output was not going to be partitioned, but you are right that it does not have to be unpartitioned 🙂

Level 3
Author

Hi - I am having a hard time automating the flow above because the output sits on HDFS/Hive and there is no option to append. I tried adjusting my recipe to pull in only the latest file and added a sync step to a partitioned dataset, which I thought would keep the existing output and add the new partition, but it overwrote the final dataset instead. I can't seem to get this working correctly once I try to automate it. Any additional help is really appreciated!

Level 5

Could you share some more information about what your flow looks like right now? Or at least, what does your partition dependencies configuration look like? (https://doc.dataiku.com/dss/latest/partitions/dependencies.html)

Level 3
Author

Hi - I still do not have this working. In the interim I am rebuilding the entire dataset from the folder, but I will post something once we get it working.

For now I added a prep step that enriches the data with the filename (which has the date in it). The prep step just combines all the data together, and the enrich function lets me identify which file each row came from.

Eventually, we want to figure this out so that only the new file that comes in is processed and added to the resulting dataset, but for now this is how we are doing it.
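The "only process the new file" idea can be sketched outside DSS as simple bookkeeping: remember which files have already been handled and skip them on the next run. This is a rough illustration in plain Python, not a DSS feature or API. The function names, the `processed` set and the `handle` callback are all hypothetical, and in DSS itself the equivalent would presumably be expressed through partition dependencies or a scenario.

```python
import re

# Assumed naming convention from this thread: YYYYMMDD-xxxxx.csv
DATED_CSV = re.compile(r"^\d{8}-.*\.csv$")


def _basename(path):
    """Return the filename without any folder prefix."""
    return path.rsplit("/", 1)[-1]


def select_new_files(all_paths, processed):
    """Return dated CSV paths not seen before, oldest filename first."""
    new = [
        p for p in all_paths
        if p not in processed and DATED_CSV.match(_basename(p))
    ]
    # YYYYMMDD sorts chronologically when compared as plain text
    return sorted(new, key=_basename)


def process_incrementally(all_paths, processed, handle):
    """Call handle(path) for each new file, then record it as processed."""
    for path in select_new_files(all_paths, processed):
        handle(path)          # e.g. parse the file and append its rows
        processed.add(path)   # recorded only after handle() succeeded
    return processed


if __name__ == "__main__":
    already_done = {"20201027-a.csv"}
    listing = ["20201028-b.csv", "20201027-a.csv", "notes.txt", "20201029-c.csv"]
    process_incrementally(listing, already_done, print)
```

For this to survive between runs, the `processed` set would have to be stored somewhere persistent, such as a small file or table. That storage is left out of the sketch.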