Splitting datasets into multiple outputs by rules.

DavidRAMBEAU · April 2023

Hello!

I'm new to Dataiku, a powerful platform that I'm still discovering.

I want to know how to split a dataset (Data2) into multiple outputs. To drive the split, I have another dataset (Data1) that contains a list of elements (UUIDs). For each of these elements, I want to create an output containing only the matching part of Data2. I've attached a diagram to make this clearer.

For example, I have a list of people with some characteristics, and I want one output per person.

Thanks for your support.

Best regards, David.

Operating system used: macOS

Sarina · May 2023

Hi @DavidRAMBEAU,

If I understand correctly, your input has various "Person" rows, and you want your output to be split out by each individual "Person" record. Is that correct?

Creating a unique output dataset for each "Person" sounds like it would generate too many output datasets. I think your best option would be to partition by person (i.e. by "person_id"). Each partition of your output dataset then contains just the relevant information for one person. If you are using a file-based output dataset, your final output data could then take the form:

<person1>/output.csv.gz
<person2>/output.csv.gz
<person3>/output.csv.gz

Your output would then be logically separated, and you can also process it partition by partition in DSS. You can do the same thing with SQL-based partitioning. You might also find this Academy course on partitioning useful.
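To make the target layout concrete, here is a minimal sketch in plain Python (standard library only) that writes one `<person_id>/output.csv.gz` file per person. It is not the DSS partitioning feature or the Dataiku API, just an illustration of the same idea. The function name `split_by_key`, the column name `person_id`, and the sample rows are assumptions made for the example, with `DATA1` standing in for the list of UUIDs and `DATA2` for the dataset to split.

```python
import csv
import gzip
from collections import defaultdict
from pathlib import Path

# Hypothetical sample data: DATA1 is the list of ids to split on,
# DATA2 is the dataset to split (one dict per row).
DATA1 = ["p1", "p2"]
DATA2 = [
    {"person_id": "p1", "trait": "tall"},
    {"person_id": "p2", "trait": "short"},
    {"person_id": "p1", "trait": "left-handed"},
    {"person_id": "p3", "trait": "not in DATA1, so skipped"},
]


def split_by_key(rows, wanted_ids, key, out_dir):
    """Write <out_dir>/<id>/output.csv.gz for each wanted id that has rows.

    Assumes every id is safe to use as a directory name (true for UUIDs).
    Returns a dict mapping each id to the path of the file written for it.
    """
    wanted = set(wanted_ids)

    # Group the rows by key value, keeping only the ids listed in wanted_ids.
    groups = defaultdict(list)
    for row in rows:
        if row[key] in wanted:
            groups[row[key]].append(row)

    written = {}
    for value, group in groups.items():
        path = Path(out_dir) / str(value) / "output.csv.gz"
        path.parent.mkdir(parents=True, exist_ok=True)
        # "wt" opens the gzip stream in text mode so csv can write to it.
        with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(group[0].keys()))
            writer.writeheader()
            writer.writerows(group)
        written[value] = path
    return written


if __name__ == "__main__":
    for person, path in split_by_key(DATA2, DATA1, "person_id", "out").items():
        print(person, "->", path)
```

Within DSS itself, a partitioned dataset gives you this kind of per-person separation without a custom script, so the sketch is mainly useful for picturing what each partition would contain.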

I hope that information is helpful. Please let us know if you are still working on this use case and if we can provide any other thoughts!

Thanks,
Sarina
