When working with a partitioned dataset, is there a way to determine which partition a record is in?

tgb417 · November 2022

I'm working with several partitioned datasets.

I've run into a problem: the data in one of the partitions is partially corrupt. (Lots of extra spaces were added to a field.)

Going forward, I can put steps into my recipes to correct this before the data is put into the partitioned data set.

However, I would like to clear and rebuild the particular partition where the offending data item is stored. Other than filtering and hand-walking through the partitions to find the right one, is there a way to tell which partition a particular record is coming from?

Thanks for any help you can provide.

Operating system used: Mac OS Ventura 13.0.1

tgb417 · November 2022

Looks like I found an answer.

One can use the "enrich record with file info" processor as a step in a visual recipe.

However, this processor does not seem to work in a "lab" visual analysis.

Ignacio_Toledo · November 2022

Yes, that processor only works once the recipe is run and the output is written to a dataset.

Have you explored the possibility of a) creating a "Check" step for a dataset, or b) creating a Python script that could help you detect those cases and identify the partitions with problems?

Of course, this would only make sense if you are thinking of establishing a repeatable flow, not if you only want to quickly check the problem once.

Cheers!

Ignacio

tgb417 · November 2022

@Ignacio_Toledo

Hope you are doing well.

Yes, I'm working on creating a repeatable flow with Scenarios.

What did you have in mind for a check step?

Ignacio_Toledo · November 2022

I think I would create a check using a Custom Python script.

But this would only be good for detecting which partitions hold the corrupted or offending data, not for fixing it.

Sadly, I don't have access right now to a partitioned dataset I could use to run some tests, but just off the top of my head, I think a Python recipe could be the solution.

Best of luck, Tom. Sorry I can't help more right now.
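The Python approach suggested above could look roughly like the sketch below. It is not a tested solution: the function names and the `max_len` threshold are made up for illustration, and the `load_rows_by_partition` helper assumes the `dataiku.Dataset` methods `list_partitions`, `add_read_partitions` and `get_dataframe`, called from a notebook or scenario step where reading an explicit partition is allowed.

```python
import re

# Two or more consecutive whitespace characters are treated as "extra" spacing.
EXTRA_SPACE = re.compile(r"\s{2,}")


def find_suspect_partitions(rows_by_partition, column, max_len=200):
    """Return {partition_id: number_of_suspect_rows}.

    rows_by_partition maps a partition id to a list of dict records.
    A row is suspect if `column` is longer than max_len (an arbitrary,
    illustrative threshold) or contains a run of whitespace.
    """
    suspects = {}
    for partition_id, rows in rows_by_partition.items():
        bad = 0
        for row in rows:
            value = str(row.get(column) or "")
            if len(value) > max_len or EXTRA_SPACE.search(value):
                bad += 1
        if bad:
            suspects[partition_id] = bad
    return suspects


def load_rows_by_partition(dataset_name):
    """Hypothetical loader; only runs inside Dataiku DSS."""
    import dataiku  # imported lazily so the rest works outside DSS

    rows_by_partition = {}
    for partition_id in dataiku.Dataset(dataset_name).list_partitions():
        ds = dataiku.Dataset(dataset_name)
        ds.add_read_partitions(partition_id)  # read only this partition
        rows_by_partition[partition_id] = ds.get_dataframe().to_dict("records")
    return rows_by_partition
```

Inside DSS, something like `find_suspect_partitions(load_rows_by_partition("my_dataset"), "my_column")` would then list the partitions worth rebuilding, where both names are placeholders.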

tgb417 · December 2022

@Ignacio_Toledo,

Thanks for the insights.

I ended up using a visual recipe to add the partition name to each record of the dataset. I knew that the offending problem was a really long record, so adding a length formula to the visual recipe got me that information. I sorted by length and found the partition with the problem. Then I went to the Status tab of the partitioned dataset, deleted the offending partition, and rebuilt it.
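For anyone who prefers code to the visual steps, the same "length, then sort" idea can be sketched in plain Python. This assumes the records have already been enriched with a partition column; the column names `partition` and `note` below are hypothetical.

```python
def longest_value_per_partition(rows, partition_col, text_col):
    """Return (partition_id, max_length) pairs, longest first."""
    longest = {}
    for row in rows:
        partition_id = row[partition_col]
        length = len(str(row.get(text_col) or ""))
        longest[partition_id] = max(longest.get(partition_id, 0), length)
    # Sort so the partition holding the longest value comes first.
    return sorted(longest.items(), key=lambda item: item[1], reverse=True)


rows = [
    {"partition": "A", "note": "ok"},
    {"partition": "B", "note": "x" + " " * 50},  # padded with stray spaces
    {"partition": "A", "note": "fine"},
]
ranking = longest_value_per_partition(rows, "partition", "note")
```

The first entry of `ranking` points at the partition to inspect and rebuild.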

To "solve" the problem, I've added a precautionary recipe step before the data goes into the partitioned dataset that looks for all of the added whitespace and removes it.

This approach is discussed here. https://community.dataiku.com/t5/Using-Dataiku/Removing-Multiple-Spaces-from-Data-in-all-columns/m-p/29295
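If that cleanup were done in a Python step rather than a visual one, a minimal sketch could look like this. The helper names are illustrative and are not taken from the linked thread. Note that it collapses any run of whitespace, including tabs and newlines, into a single space.

```python
import re

_WHITESPACE_RUN = re.compile(r"\s+")


def collapse_whitespace(value):
    """Collapse whitespace runs to one space and trim the ends.

    Non-string values (numbers, None, ...) are returned unchanged.
    """
    if not isinstance(value, str):
        return value
    return _WHITESPACE_RUN.sub(" ", value).strip()


def clean_record(record):
    """Apply collapse_whitespace to every column of a dict record."""
    return {key: collapse_whitespace(val) for key, val in record.items()}
```

For example, `clean_record({"name": "  Bob     Smith ", "age": 42})` leaves `age` untouched and normalizes `name`.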

Hope you are having a great early summer.
