Zip Codes 5 and 9 digits

Solved!
calhos4
Level 1
Zip Codes 5 and 9 digits

I have a dataset with a mixture of US zip codes (both 5 and 9 digits long) and some zip codes from Canada, the UK, and other countries. Is there a formula or another way that Dataiku will add in the hyphen for 9 digit US zip codes without impacting the other zip codes?

0 Kudos
1 Solution
tgb417

@calhos4 

Welcome to the Dataiku Community.  We are glad to have you among us.

Postal Codes can be a challenged, particularly when they come from many different countries.

It is my understanding that the rules around postal codes are widely varying by country and some times regions within a country.  Do your have a Country, City, State / Provence / Division for each of your addresses?

It sounds like you only want to change 9 digit US postal codes (aka zipcodes) from the united states.  Without a country designation it will be difficult to isolate those postal codes that need to be changed.

If you have budget / time, one of the better ways is to pay a services to do address cleanup.  They often have database with all of the valid postal addresses in a country, and they can tell if this is a real address and the exact postal code that should be used. This data changes constantly.

There are some geocoding plugins that are available for Dataiku DSS that can be used to "Normalize" Postal Addresses, which may have the byproduct of correcting 9 digits without the hyphen to 9 digets with the hyphen.

If that is beyond the scope of your project.  And you want to try to do this on your own, I'd start to look for a Python or other library that might help. Geopy might be such a python library for you https://geopy.readthedocs.io/en/stable/ 

Finally if you are simply trying to find 9 number codes and add a "-" between the 5th and 6th characters.  You might look at this python code, https://marcusrauhut.com/validating-cleaning-zip-codes-in-python/ and working with a dataiku python recipe.

Finally in visual recipes there is an option to do find and replace visual step.  The find a replace step will take a regular expressions that can find various patterns.  You could create a pattern to find either 5 or nine character postal codes (just for the US records) and simply add a hyphen in the right place

Overall there are a lot of ways to work on this problem.  None will be 100% perfect because the data is likely to be "dirty" in ways that will completely surprise you, because they were added by real human beings who are very "creative".  Your challenge will be to decide what is good enough for your specific use case.

Others here may have code that they have used and might be able to share.

Hope this helps a little bit.  My recommendation is start and see which method most closely meets your desired accuracy and cost in time and other expenses.

Have fun.  Let us know how you get along.

--Tom

View solution in original post

0 Kudos
3 Replies
tgb417

@calhos4 

Welcome to the Dataiku Community.  We are glad to have you among us.

Postal Codes can be a challenged, particularly when they come from many different countries.

It is my understanding that the rules around postal codes are widely varying by country and some times regions within a country.  Do your have a Country, City, State / Provence / Division for each of your addresses?

It sounds like you only want to change 9 digit US postal codes (aka zipcodes) from the united states.  Without a country designation it will be difficult to isolate those postal codes that need to be changed.

If you have budget / time, one of the better ways is to pay a services to do address cleanup.  They often have database with all of the valid postal addresses in a country, and they can tell if this is a real address and the exact postal code that should be used. This data changes constantly.

There are some geocoding plugins that are available for Dataiku DSS that can be used to "Normalize" Postal Addresses, which may have the byproduct of correcting 9 digits without the hyphen to 9 digets with the hyphen.

If that is beyond the scope of your project.  And you want to try to do this on your own, I'd start to look for a Python or other library that might help. Geopy might be such a python library for you https://geopy.readthedocs.io/en/stable/ 

Finally if you are simply trying to find 9 number codes and add a "-" between the 5th and 6th characters.  You might look at this python code, https://marcusrauhut.com/validating-cleaning-zip-codes-in-python/ and working with a dataiku python recipe.

Finally in visual recipes there is an option to do find and replace visual step.  The find a replace step will take a regular expressions that can find various patterns.  You could create a pattern to find either 5 or nine character postal codes (just for the US records) and simply add a hyphen in the right place

Overall there are a lot of ways to work on this problem.  None will be 100% perfect because the data is likely to be "dirty" in ways that will completely surprise you, because they were added by real human beings who are very "creative".  Your challenge will be to decide what is good enough for your specific use case.

Others here may have code that they have used and might be able to share.

Hope this helps a little bit.  My recommendation is start and see which method most closely meets your desired accuracy and cost in time and other expenses.

Have fun.  Let us know how you get along.

--Tom
0 Kudos
calhos4
Level 1
Author

Thank you so much, Tom!! 

In this particular dataset there is not a field specifically for city, but I do have a state, country, and zip code field. I attached a screenshot of the column titles as an example. I was initially thinking that I may have to simply export the most cleaned version of this data to Excel and then manually go through and add in the hyphens, but this is just one of many mailing files I will have, so I was hoping to eliminate as much manual work as possible.

I looked into the Geocoder install, but unfortunately I am not able to install plugins. I couldn't get the Python code to work yet, but this is my first time even seeing Python so I think I just need to play around a bit more! I will look into the find and replace as well, as that seems like one of the easier methods to follow. 

Thank you very much, @tgb417  - I appreciate all of the thought and work you put into this answer!
Samantha

tgb417

@calhos4 

What did you end up doing with you postal code challange?  Knowing what you did might help others as well.

 

--Tom
0 Kudos