How to Validate Mailing Addresses (US)

CASS Certified Verification

Unfortunately, the only “real” way to validate addresses is to use a verification service or program. Simple validation checks cannot guarantee that an address is correct or deliverable. In the United States, proper address verification requires the use of CASS certified software. Several online services offer address verification but to use one you must write code to interact with that service’s API. Implementing such a solution is beyond the scope of this document.

Heuristic Evaluation

Sometimes the benefits of comprehensive address verification are not enough to justify the work required to interface with a third-party service or the possible cost of a subscription fee. Simple checks for well-formedness and set membership can catch many obvious errors and omissions. This weaker form of verification can be useful in many situations.

Load Data as Text

To start, we will load our example addresses into a pandas DataFrame. It’s important to specify dtype=str to prevent pandas’ type inference from loading certain columns using a numeric dtype. In some data sets, ZIP Codes could be misidentified as numeric data and loading them into a numeric column would strip any leading zeros—corrupting the data you’re testing:

1
2
3
4
5
6
import pandas as pd
from datatest import validate

df = pd.read_csv('addresses.csv', dtype=str)

...

Our address data will look something like the following:

street

city

state

zipcode

1600 Pennsylvania Avenue NW

Washington

DC

20500

30 Rockefeller Plaza

New York

NY

10112

350 Fifth Avenue, 34th Floor

New York

NY

10118-3299

1060 W Addison St

Chicago

IL

60613

15 Central Park W Apt 7P

New York

NY

10023-7711

11 Wall St

New York

NY

10005

2400 Fulton St

San Francisco

CA

94118-4107

351 Farmington Ave

Hartford

CT

06105-6400

Street Address

Street addresses are difficult to validate with a simple check. The US Postal Service publishes addressing standards designed to account for a majority of address styles (see delivery address line). But these standards do not account for all situations.

You could build a function to check that “street” values contain commonly used suffixes, but such a test could give misleading results when checking hyphenated address ranges, grid-style addresses, and rural routes. If you are not using a third-party verification service, it may be best to simply check that the field is not empty.

The example below uses a regular expression, \w+, to match one or more letters or numbers:

7
8
9
...

validate.regex(df['street'], r'\w+')

City Name

The US Postal Service sells a regularly updated City State Product file. For paying customers who purchase the USPS file or for users of third-party services, “city” values can be matched against a controlled vocabulary of approved city names. As with street validation, when such resources are unavailable it’s probably best to check that the field is not empty.

The example below uses a regular expression, [A-Za-z]+, to match one or more letters:

10
11
12
...

validate.regex(df['city'], r'[A-Za-z]+')

State Abbreviation

Unlike the previous fields, the set of possible state abbreviations is small and easy to check against. The set includes codes for all 50 states, the District of Columbia, US territories, associate states, and armed forces delivery codes.

In this example, we use validate.subset() to check that the values in the “state” column are members of the state_codes set:

13
14
15
16
17
18
19
20
21
22
23
24
...

state_codes = {
    'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA',
    'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD',
    'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ',
    'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC',
    'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY',
    'DC', 'AS', 'GU', 'MP', 'PR', 'VI', 'FM', 'MH', 'PW',
    'AA', 'AE', 'AP',
}
validate.subset(df['state'], state_codes)

ZIP Code

The set of valid ZIP Codes is very large but they can be easily checked for well-formedness. Basic ZIP Codes are five digits and extended ZIP+4 Codes are nine digits (e.g., 20500 and 20500-0005).

This example uses a regex, ^\d{5}(-\d{4})?$, to match the two possible formats:

25
26
27
...

validate.regex(df['zipcode'], r'^\d{5}(-\d{4})?$')

State and ZIP Code Consistency

The first digit of a ZIP Code is associated with a specific region of the country (a group of states). For example, ZIP Codes beginning with “4” only occur in Indiana, Kentucky, Michigan, and Ohio. We can use these regional associations as a sanity check to make sure that our “state” and “zipcode” values are plausible and consistent.

The following example defines a helper function, state_zip_consistency(), to check the first digit of a ZIP Code against a set of associated state codes:

28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
...

def state_zip_consistency(state_zipcode):
    """ZIP Code should be consistent with state."""
    lookup = {
        '0': {'CT', 'MA', 'ME', 'NH', 'NJ', 'NY', 'PR', 'RI', 'VT', 'VI', 'AE'},
        '1': {'DE', 'NY', 'PA'},
        '2': {'DC', 'MD', 'NC', 'SC', 'VA', 'WV'},
        '3': {'AL', 'FL', 'GA', 'MS', 'TN', 'AA'},
        '4': {'IN', 'KY', 'MI', 'OH'},
        '5': {'IA', 'MN', 'MT', 'ND', 'SD', 'WI'},
        '6': {'IL', 'KS', 'MO', 'NE'},
        '7': {'AR', 'LA', 'OK', 'TX'},
        '8': {'AZ', 'CO', 'ID', 'NM', 'NV', 'UT', 'WY'},
        '9': {'AK', 'AS', 'CA', 'GU', 'HI', 'MH', 'FM', 'MP', 'OR', 'PW', 'WA', 'AP'},
    }
    state, zipcode = state_zipcode
    first_digit = zipcode[0]
    return state in lookup[first_digit]

validate(df[['state', 'zipcode']], state_zip_consistency)

This check works well to detect data processing errors that might mis-align or otherwise damage “state” and “zipcode” values. But it cannot detect if ZIP Codes are assigned to the wrong states within in the same region—for example, it wouldn’t be able to determine if an Indiana ZIP Code was used on an Kentucky address (since the ZIP Codes in both of these states begin with “4”).