How to parse filename for file based datasets/partitions

aw30 (Dataiku DSS & SQL, Registered, Posts: 49 ✭✭✭✭✭)

Hi - I know pieces exist to do the following but after reading documentation, discussions, etc. I am not connecting the pieces.

I have a folder on AWS S3 that will contain multiple files whose names start with a date, for example YYYYMMDD-xxxxx.csv.

I can set up a new dataset and choose the folder and activate partitioning on date and it finds the files as separate partitions.
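
(If it helps, the partitioning pattern I set on the dataset is along the lines of %Y%M%D-.*, so the date prefix of each filename becomes the day-level partition identifier. I'm going from memory on the exact pattern syntax.)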

What I want to do is end up with a combined dataset that contains the date in the filename as one of the columns. I then want to automate this to pick up and process new files as they are put into the folder.

I really don't know Python, but I am sure this could be done by just adding a Python script to the initial dataset.
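
For illustration, here is roughly what I have in mind, pieced together from the docs (the folder id "incoming_files", the output dataset "combined", and the column name "file_date" are all made up):

    import dataiku
    import pandas as pd

    # "incoming_files" is a made-up id for the managed folder on S3
    folder = dataiku.Folder("incoming_files")

    frames = []
    for path in folder.list_paths_in_partition():  # e.g. "/20200131-xxxxx.csv"
        with folder.get_download_stream(path) as stream:
            df = pd.read_csv(stream)
        # the first 8 characters of the filename are the YYYYMMDD date
        df["file_date"] = path.lstrip("/")[:8]
        frames.append(df)

    # "combined" is a made-up name for the output dataset
    dataiku.Dataset("combined").write_with_schema(pd.concat(frames, ignore_index=True))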

Also - I don't necessarily need to do this as partitions - that is I could just process each file as it comes in and keep appending it to an existing data set but it looked like using partitions would be the best solution.

Any help is really appreciated on setting this up and thank you in advance!

Best Answer

  • Liev (Dataiker Alumni, Posts: 176 ✭✭✭✭✭✭✭✭)
    Answer ✓

    Hi @aw30

    In order to do this, I would suggest the following:

    - From your partitioned dataset, create a Prepare recipe; the output dataset should NOT be partitioned.

    - In your Prepare recipe script, add a new step, found under Misc > Enrich record with context information.

    - Inside the step, give a name to the output partition column.

    - Run your recipe.
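
    If you prefer code to a visual step, a rough equivalent from a notebook would be something along these lines (dataset names are placeholders, and add_read_partitions only works outside a recipe, e.g. in a notebook):

        import dataiku
        import pandas as pd

        frames = []
        # "partitioned_input" is a placeholder for your partitioned dataset
        for p in dataiku.Dataset("partitioned_input").list_partitions():
            ds = dataiku.Dataset("partitioned_input")
            ds.add_read_partitions(p)  # restrict the read to this partition
            df = ds.get_dataframe()
            df["file_date"] = p        # same column the visual step would add
            frames.append(df)

        # "combined" is a placeholder for the non-partitioned output
        dataiku.Dataset("combined").write_with_schema(pd.concat(frames, ignore_index=True))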

    Good luck!

Answers

  • Ignacio_Toledo (Neuron, Posts: 411)

    Thanks for the suggestion @Liev! One question: why should the output dataset not be partitioned? Most probably the reason is obvious (which is a way of saying that my question comes from complete ignorance).

  • aw30 (Dataiku DSS & SQL, Registered, Posts: 49 ✭✭✭✭✭)

    Thank you! This does the trick, pulling in the filename and putting everything into one data set!

  • Liev (Dataiker Alumni, Posts: 176 ✭✭✭✭✭✭✭✭)

    I think I understood from the question that the output was not going to be partitioned, but indeed it doesn't need to be.

  • aw30 (Dataiku DSS & SQL, Registered, Posts: 49 ✭✭✭✭✭)

    Hi - I am having a hard time automating the flow above because the output sits on HDFS/Hive, where there is no option to append. I tried adjusting my recipe to pull in the latest file and added a sync step to a partitioned dataset, thinking it would keep the existing output and add the new partition, but it overwrote the final data set. I can't seem to get this working correctly once I try to automate it. Any additional help is really appreciated!

  • Ignacio_Toledo (Neuron, Posts: 411)

    Is it possible for you to share some more information about what your flow looks like right now? Or at least, what your partition dependency configuration looks like? (https://doc.dataiku.com/dss/latest/partitions/dependencies.html)

  • aw30 (Dataiku DSS & SQL, Registered, Posts: 49 ✭✭✭✭✭)

    Hi - I still do not have this working. In the interim I am rebuilding the entire set in the folder but will post something once we get it working.

    For now I added a prep step that enriches the data with the filename (which has the date in it). The prep step just combines all the data together, and the enrich function lets me identify which file each row came from.

    Eventually we want to figure this out so that only the new file that comes in is processed and added to the resulting data set, but for now this is how we are doing it.
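
    Roughly what we're aiming for is a sketch like the one below, run from a scenario Python step or a notebook rather than a recipe (as far as I know a recipe can't read and write the same dataset); the folder id, dataset name, and column name are all placeholders:

        import dataiku
        import pandas as pd

        folder = dataiku.Folder("incoming_files")  # placeholder folder id
        combined = dataiku.Dataset("combined")     # placeholder dataset name

        # file dates already present in the output, so we only process new files
        existing = combined.get_dataframe()
        seen = set(existing["file_date"].astype(str))

        frames = [existing]
        for path in folder.list_paths_in_partition():
            file_date = path.lstrip("/")[:8]  # YYYYMMDD prefix of the filename
            if file_date in seen:
                continue
            with folder.get_download_stream(path) as stream:
                df = pd.read_csv(stream)
            df["file_date"] = file_date
            frames.append(df)

        # rewriting the whole dataset sidesteps the missing append option on HDFS
        combined.write_with_schema(pd.concat(frames, ignore_index=True))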

  • Kiran (Registered, Posts: 6 ✭✭✭✭)

    Hello,

    I came across this while looking for something similar, but couldn't find the option ("Misc > Enrich record with context information") in DSS 6.0.4.

    [screenshot of the Prepare recipe step list attached]
