Splitting datasets into multiple outputs by rules.

DavidRAMBEAU Registered Posts: 1

Hello!

I'm new to Dataiku and still discovering this powerful platform.

I want to know how to split a dataset (Data2) into multiple outputs. To drive the split, I have another dataset (Data1) that contains a list of elements (UUIDs). For each of these elements, I want to create an output containing only the matching part of Data2. I've attached a diagram to make this clearer.

[Attached image: tryDSS.jpg]

For example, I have a list of people with some characteristics, and I want one output per person.

Thanks for your support.

Best regards, David.


Operating system used: macOS

Answers

  • Sarina
    Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 317 Dataiker
    edited July 17

    Hi @DavidRAMBEAU,

    If I understand correctly, your input has various "Person" rows, and you want your output to be split out by each individual "Person" record. Is that correct?

    Creating a unique output dataset for each "Person" sounds like it would generate too many output datasets. I think your best option would be to partition the output by person (i.e. by "person_id"). Each partition of the output dataset then contains just the relevant information for one person. If you are using a file-based output dataset, your final output data could then have the form:

    <person1>/output.csv.gz
    <person2>/output.csv.gz
    <person3>/output.csv.gz


    Your output would then be logically separated, and you can also process it partition by partition in DSS. You can do the same thing with SQL-based partitioning. You might also find this academy course on partitioning useful.
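
To make the file layout above concrete, here is a minimal sketch in plain Python of the same idea outside DSS: rows are grouped by a key and each group is written to `<key>/output.csv.gz`. It does not use the Dataiku API or DSS partitioning itself. The function name `split_by_key`, the column names, and the sample rows are all hypothetical, standing in for Data1 (the list of keys) and Data2 (the rows to split).

```python
import csv
import gzip
import tempfile
from collections import defaultdict
from pathlib import Path


def split_by_key(rows, wanted_keys, key_column, out_dir):
    """Write out_dir/<key>/output.csv.gz for every wanted key found in rows."""
    wanted = set(wanted_keys)
    groups = defaultdict(list)
    # Keep only rows whose key is in the wanted list, grouped by that key.
    for row in rows:
        if row[key_column] in wanted:
            groups[row[key_column]].append(row)

    written = {}
    for key, part in groups.items():
        target = Path(out_dir) / str(key) / "output.csv.gz"
        target.parent.mkdir(parents=True, exist_ok=True)
        # "wt" opens the gzip stream in text mode so csv can write to it.
        with gzip.open(target, "wt", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(part[0].keys()))
            writer.writeheader()
            writer.writerows(part)
        written[key] = target
    return written


# Hypothetical sample data: data1 lists the keys, data2 holds the rows to split.
data1 = ["p1", "p2"]
data2 = [
    {"person_id": "p1", "attribute": "a"},
    {"person_id": "p1", "attribute": "b"},
    {"person_id": "p2", "attribute": "c"},
    {"person_id": "p3", "attribute": "d"},  # not listed in data1, so skipped
]

out_root = tempfile.mkdtemp()
paths = split_by_key(data2, data1, "person_id", out_root)
```

Inside DSS, a partitioned output dataset gives you this kind of separation without hand-written file handling, so the sketch is only meant to show what "one partition per person" looks like on disk.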

    I hope this information is helpful. Please let us know if you are still working on this use case and whether we can offer any other suggestions!

    Thanks,
    Sarina
