How to parse filename for file based datasets/partitions

aw30 (Dataiku DSS & SQL, Registered, Posts: 49 ✭✭✭✭✭)

Hi - I know pieces exist to do the following but after reading documentation, discussions, etc. I am not connecting the pieces.

I have a folder on AWS S3 that will contain multiple files whose names start with a date, for example YYYYMMDD-xxxxx.csv.

I can set up a new dataset and choose the folder and activate partitioning on date and it finds the files as separate partitions.
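
(If it helps, the partitioning pattern I set on the dataset is along the lines of %Y%M%D-.*, so the date prefix of each filename becomes the day-level partition identifier. I'm going from memory on the exact pattern syntax.)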

What I want to do is end up with a combined dataset that contains the date in the filename as one of the columns. I then want to automate this to pick up and process new files as they are put into the folder.

I really don't know Python, but I am sure this could be done by just adding a Python script to the initial dataset.
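
For illustration, here is roughly what I have in mind, pieced together from the docs (the folder id "incoming_files", the output dataset "combined", and the column name "file_date" are all made up):

    import dataiku
    import pandas as pd

    # "incoming_files" is a made-up id for the managed folder on S3
    folder = dataiku.Folder("incoming_files")

    frames = []
    for path in folder.list_paths_in_partition():  # e.g. "/20200131-xxxxx.csv"
        with folder.get_download_stream(path) as stream:
            df = pd.read_csv(stream)
        # the first 8 characters of the filename are the YYYYMMDD date
        df["file_date"] = path.lstrip("/")[:8]
        frames.append(df)

    # "combined" is a made-up name for the output dataset
    dataiku.Dataset("combined").write_with_schema(pd.concat(frames, ignore_index=True))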

Also - I don't necessarily need to do this as partitions - that is I could just process each file as it comes in and keep appending it to an existing data set but it looked like using partitions would be the best solution.

Any help is really appreciated on setting this up and thank you in advance!

Best Answer

  • Liev (Dataiker Alumni, Posts: 176 ✭✭✭✭✭✭✭✭)
    Answer ✓

    Hi @aw30

    In order to do this, I would suggest the following:

    - From your partitioned dataset, create a Prepare recipe; the output dataset should NOT be partitioned.

    - In your Prepare recipe script, add a new step, found under Misc > Enrich record with context information.

    - Inside the step, give a name to the output partition column.

    - Run your recipe.
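
    If you prefer code to a visual step, a rough equivalent from a notebook would be something along these lines (dataset names are placeholders, and add_read_partitions only works outside a recipe, e.g. in a notebook):

        import dataiku
        import pandas as pd

        frames = []
        # "partitioned_input" is a placeholder for your partitioned dataset
        for p in dataiku.Dataset("partitioned_input").list_partitions():
            ds = dataiku.Dataset("partitioned_input")
            ds.add_read_partitions(p)  # restrict the read to this partition
            df = ds.get_dataframe()
            df["file_date"] = p        # same column the visual step would add
            frames.append(df)

        # "combined" is a placeholder for the non-partitioned output
        dataiku.Dataset("combined").write_with_schema(pd.concat(frames, ignore_index=True))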

    Good luck!

Answers

  • Ignacio_Toledo (Neuron, Posts: 411)

    Thanks for the suggestion @Liev! One question: why should the output dataset not be partitioned? Most probably the reason is obvious (which is a way of saying that my question comes from complete ignorance).

  • aw30 (Dataiku DSS & SQL, Registered, Posts: 49 ✭✭✭✭✭)

    Thank you! This does the trick, pulling in the filename and putting everything into one data set!

  • Liev (Dataiker Alumni, Posts: 176 ✭✭✭✭✭✭✭✭)

    I think I understood from the question that the output was not going to be partitioned, but indeed it doesn't need to be.

  • aw30 (Dataiku DSS & SQL, Registered, Posts: 49 ✭✭✭✭✭)

    Hi - I am having a hard time automating the flow above because the output sits on HDFS/Hive, where there is no option to append. I tried adjusting my recipe to pull in the latest file and added a sync step to a partitioned dataset, thinking it would keep the existing output and add the new partition, but it overwrote the final data set. I can't seem to get this working correctly once I try to automate it. Any additional help is really appreciated!

  • Ignacio_Toledo (Neuron, Posts: 411)

    Is it possible for you to share some more information about what your flow looks like right now? Or at least, what your partition dependency configuration looks like? (https://doc.dataiku.com/dss/latest/partitions/dependencies.html)

  • aw30 (Dataiku DSS & SQL, Registered, Posts: 49 ✭✭✭✭✭)

    Hi - I still do not have this working. In the interim I am rebuilding the entire set in the folder but will post something once we get it working.

    For now I added a prep step that enriches the data with the filename (which has the date in it). The prep step just combines all the data together, and the enrich function lets me identify which file each row came from.

    Eventually we want to figure this out so that only the new file that comes in is processed and added to the resulting data set, but for now this is how we are doing it.
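
    Roughly what we're aiming for is a sketch like the one below, run from a scenario Python step or a notebook rather than a recipe (as far as I know a recipe can't read and write the same dataset); the folder id, dataset name, and column name are all placeholders:

        import dataiku
        import pandas as pd

        folder = dataiku.Folder("incoming_files")  # placeholder folder id
        combined = dataiku.Dataset("combined")     # placeholder dataset name

        # file dates already present in the output, so we only process new files
        existing = combined.get_dataframe()
        seen = set(existing["file_date"].astype(str))

        frames = [existing]
        for path in folder.list_paths_in_partition():
            file_date = path.lstrip("/")[:8]  # YYYYMMDD prefix of the filename
            if file_date in seen:
                continue
            with folder.get_download_stream(path) as stream:
                df = pd.read_csv(stream)
            df["file_date"] = file_date
            frames.append(df)

        # rewriting the whole dataset sidesteps the missing append option on HDFS
        combined.write_with_schema(pd.concat(frames, ignore_index=True))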

  • Kiran (Registered, Posts: 6 ✭✭✭✭)

    Hello,

    I came across this while looking for something similar, but couldn't find the option ("Misc > Enrich record with context information") in DSS 6.0.4.

    [screenshot of the Prepare recipe step list attached]
