When working with a partitioned dataset, is there a way to determine which partition a record is in?
I'm working with several partitioned datasets.
I've run into a problem where the data in one of the partitions is partially corrupt (lots of extra spaces added to a field).
Going forward, I can put steps into my recipes to correct this before the data is put into the partitioned dataset.
However, I would like to clear and rebuild the particular partition where the offending data item is stored. Other than filtering and hand-walking through the partitions to find the right one, is there a way to tell which partition a particular record is coming from?
Thanks for any help you can provide.
Operating system used: macOS Ventura 13.0.1
Best Answer
tgb417
Looks like I found an answer.
One can use the Enrich records with files info processor as a step in a visual (Prepare) recipe.
However, this processor does not seem to work in a visual analysis in the Lab.
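For anyone who wants to do the same lookup in code, here is a minimal sketch using the dataiku Python API in a notebook. The dataset and column names are placeholders, and the "extra spaces" test stands in for whatever corruption you are hunting:

```python
import dataiku

# Hypothetical dataset and column names -- replace with your own.
DATASET_NAME = "my_partitioned_dataset"
SUSPECT_COLUMN = "my_field"

partitions = dataiku.Dataset(DATASET_NAME).list_partitions()

for partition_id in partitions:
    ds = dataiku.Dataset(DATASET_NAME)
    # Restrict the next read to a single partition. As far as I know,
    # add_read_partitions() is meant for notebooks; in recipes the
    # partitions to read come from the recipe's configuration.
    ds.add_read_partitions(partition_id)
    df = ds.get_dataframe()

    # Stand-in test: flag values containing runs of two or more spaces.
    bad = df[SUSPECT_COLUMN].str.contains(r"\s{2,}", regex=True, na=False)
    if bad.any():
        print(f"{partition_id}: {int(bad.sum())} suspicious record(s)")
```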
Answers
Ignacio_Toledo
Yes, that processor only takes effect once you run the recipe and the output is written to a dataset.
Have you explored the possibility of (a) creating a "Check" on the dataset, or (b) writing a Python script that could help you detect those cases and identify the partitions with problems?
Of course, this would only make sense if you are thinking of establishing a repeatable flow, not if you only want to quickly check the problem once.
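For illustration, a minimal sketch of such a check, assuming the process(last_values, dataset, partition_id) signature that the custom Python check template generates (worth verifying against the template in your DSS version); the column name and the "extra spaces" test are placeholders:

```python
# Sketch of a custom Python check -- verify the expected signature
# against the template your DSS version generates for you.
def process(last_values, dataset, partition_id):
    # 'dataset' is a dataiku.Dataset handle; 'partition_id' identifies
    # the partition being checked (empty for non-partitioned datasets).
    df = dataset.get_dataframe()

    # Placeholder test: count records with runs of two or more spaces.
    n_bad = int(df["my_field"].str.contains(r"\s{2,}", regex=True, na=False).sum())

    if n_bad > 0:
        return "ERROR", "%d corrupted record(s) in partition %s" % (n_bad, partition_id)
    return "OK", "no extra whitespace found"
```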
Cheers!
Ignacio
tgb417
Hope you are doing well.
Yes, I'm working on creating a repeatable flow with Scenarios.
What did you have in mind for a check step?
Ignacio_Toledo
I think I would create a check using a Custom Python script.
But this would only be good for detecting which partitions contain the corrupted or offending data, not for fixing it.
Sadly, I don't have access right now to a partitioned dataset I could use to run some tests, but off the top of my head, I think a Python recipe could be the solution.
Best of luck, Tom. Sorry I can't help more right now.
tgb417
Thanks for the insights.
I ended up using a visual recipe to populate a column with the names of the dataset's partitions. I knew that the offending problem was with a really long record, so adding a length() formula to a visual recipe got me that information. I sorted, found the partition with the problem, then went to the Status tab of the partitioned dataset to delete the offending partition and rebuild it.
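For what it's worth, the delete-and-rebuild can also be scripted through the Dataiku public API, which is handy from a scenario step. A minimal sketch, with hypothetical names, and assuming your dataikuapi version exposes DSSDataset.build() (check the API docs for your release):

```python
import dataiku

client = dataiku.api_client()
project = client.get_default_project()
dataset = project.get_dataset("my_partitioned_dataset")  # hypothetical name

BAD_PARTITION = "2022-11"  # the partition identified as corrupt

# Clear only the offending partition...
dataset.clear(partitions=BAD_PARTITION)

# ...then rebuild it; wait=True blocks until the job finishes.
dataset.build(partitions=BAD_PARTITION, wait=True)
```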
To "solve" the problem, I've added a precautionary recipe step before the data goes into the partitioned dataset that looks for all of the added white space and removes them.
This approach is discussed here. https://community.dataiku.com/t5/Using-Dataiku/Removing-Multiple-Spaces-from-Data-in-all-columns/m-p/29295
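In case it helps anyone else, here is a rough sketch of that precautionary step written as a Python recipe rather than a Prepare step; the input and output dataset names are placeholders:

```python
import dataiku

# Hypothetical recipe input/output names -- replace with your own.
input_ds = dataiku.Dataset("raw_records")
output_ds = dataiku.Dataset("clean_records")

df = input_ds.get_dataframe()

# Collapse runs of whitespace to a single space and trim the ends,
# on every string-typed column.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.replace(r"\s+", " ", regex=True).str.strip()

output_ds.write_with_schema(df)
```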
Hope you are having a great early summer.