Zip Codes 5 and 9 digits

calhos4
calhos4 Dataiku DSS Core Designer, Registered Posts: 2

I have a dataset with a mixture of US zip codes (both 5 and 9 digits long) and some zip codes from Canada, the UK, and other countries. Is there a formula or another way that Dataiku will add in the hyphen for 9 digit US zip codes without impacting the other zip codes?

Best Answer

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron
    Answer ✓

    @calhos4

    Welcome to the Dataiku Community. We are glad to have you among us.

    Postal Codes can be a challenged, particularly when they come from many different countries.

    It is my understanding that the rules around postal codes are widely varying by country and some times regions within a country. Do your have a Country, City, State / Provence / Division for each of your addresses?

    It sounds like you only want to change 9 digit US postal codes (aka zipcodes) from the united states. Without a country designation it will be difficult to isolate those postal codes that need to be changed.

    If you have budget / time, one of the better ways is to pay a services to do address cleanup. They often have database with all of the valid postal addresses in a country, and they can tell if this is a real address and the exact postal code that should be used. This data changes constantly.

    There are some geocoding plugins that are available for Dataiku DSS that can be used to "Normalize" Postal Addresses, which may have the byproduct of correcting 9 digits without the hyphen to 9 digets with the hyphen.

    If that is beyond the scope of your project. And you want to try to do this on your own, I'd start to look for a Python or other library that might help. Geopy might be such a python library for you https://geopy.readthedocs.io/en/stable/

    Finally if you are simply trying to find 9 number codes and add a "-" between the 5th and 6th characters. You might look at this python code, https://marcusrauhut.com/validating-cleaning-zip-codes-in-python/ and working with a dataiku python recipe.

    Finally in visual recipes there is an option to do find and replace visual step. The find a replace step will take a regular expressions that can find various patterns. You could create a pattern to find either 5 or nine character postal codes (just for the US records) and simply add a hyphen in the right place

    Overall there are a lot of ways to work on this problem. None will be 100% perfect because the data is likely to be "dirty" in ways that will completely surprise you, because they were added by real human beings who are very "creative". Your challenge will be to decide what is good enough for your specific use case.

    Others here may have code that they have used and might be able to share.

    Hope this helps a little bit. My recommendation is start and see which method most closely meets your desired accuracy and cost in time and other expenses.

    Have fun. Let us know how you get along.

Answers

  • calhos4
    calhos4 Dataiku DSS Core Designer, Registered Posts: 2

    Thank you so much, Tom!!

    In this particular dataset there is not a field specifically for city, but I do have a state, country, and zip code field. I attached a screenshot of the column titles as an example. I was initially thinking that I may have to simply export the most cleaned version of this data to Excel and then manually go through and add in the hyphens, but this is just one of many mailing files I will have, so I was hoping to eliminate as much manual work as possible.

    I looked into the Geocoder install, but unfortunately I am not able to install plugins. I couldn't get the Python code to work yet, but this is my first time even seeing Python so I think I just need to play around a bit more! I will look into the find and replace as well, as that seems like one of the easier methods to follow.

    Thank you very much, @tgb417
    - I appreciate all of the thought and work you put into this answer!
    Samantha

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    @calhos4

    What did you end up doing with you postal code challange? Knowing what you did might help others as well.

Setup Info
    Tags
      Help me…