How to Validate Mailing Addresses (US)
CASS Certified Verification
Unfortunately, the only “real” way to validate addresses is to use a verification service or program. Simple validation checks cannot guarantee that an address is correct or deliverable. In the United States, proper address verification requires the use of CASS certified software. Several online services offer address verification, but to use one you must write code that interacts with that service’s API. Implementing such a solution is beyond the scope of this document.
Heuristic Evaluation
Sometimes the benefits of comprehensive address verification are not enough to justify the work required to interface with a third-party service or the possible cost of a subscription fee. Simple checks for well-formedness and set membership can catch many obvious errors and omissions. This weaker form of verification can be useful in many situations.
Load Data as Text
To start, we will load our example addresses into a pandas `DataFrame`. It’s important to specify `dtype=str` to prevent pandas’ type inference from loading certain columns using a numeric dtype. In some data sets, ZIP Codes could be misidentified as numeric data, and loading them into a numeric column would strip any leading zeros, corrupting the data you’re testing:

    import pandas as pd
    from datatest import validate

    df = pd.read_csv('addresses.csv', dtype=str)

    ...

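To see why loading as text matters, the sketch below uses only the standard library to mimic what a numeric conversion does to a ZIP Code. The variable names are illustrative and not part of the original example.

```python
# A ZIP Code as it appears in the source file (Hartford, CT).
raw_zipcode = '06105'

# Treating the value as a number amounts to a conversion like this one,
# which silently drops the leading zero.
as_number = int(raw_zipcode)
round_tripped = str(as_number)

print(round_tripped)                 # 6105
print(round_tripped == raw_zipcode)  # False
```

Once the zero is gone, the value no longer looks like a five-digit ZIP Code, so a later format check would flag a problem that the loading step introduced.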
Our address data will look something like the following:
| street | city | state | zipcode |
|---|---|---|---|
| 1600 Pennsylvania Avenue NW | Washington | DC | 20500 |
| 30 Rockefeller Plaza | New York | NY | 10112 |
| 350 Fifth Avenue, 34th Floor | New York | NY | 10118-3299 |
| 1060 W Addison St | Chicago | IL | 60613 |
| 15 Central Park W Apt 7P | New York | NY | 10023-7711 |
| 11 Wall St | New York | NY | 10005 |
| 2400 Fulton St | San Francisco | CA | 94118-4107 |
| 351 Farmington Ave | Hartford | CT | 06105-6400 |
Street Address
Street addresses are difficult to validate with a simple check. The US Postal Service publishes addressing standards designed to account for a majority of address styles (see delivery address line). But these standards do not account for all situations.
You could build a function to check that “street” values contain commonly used suffixes, but such a test could give misleading results when checking hyphenated address ranges, grid-style addresses, and rural routes. If you are not using a third-party verification service, it may be best to simply check that the field is not empty.
The example below uses a regular expression, `\w+`, to match one or more word characters (letters, digits, or underscores):

    ...

    validate.regex(df['street'], r'\w+')

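As a rough illustration of what this pattern accepts and rejects, the sketch below applies it with the standard `re` module. It assumes the pattern may match anywhere in the value; the helper `has_word_character()` and the sample list are hypothetical.

```python
import re

def has_word_character(value):
    """Return True if value contains at least one word character."""
    return re.search(r'\w+', value) is not None

samples = ['11 Wall St', '', '   ', '--']

# Values with no letters, digits, or underscores are flagged.
flagged = [s for s in samples if not has_word_character(s)]
print(flagged)  # ['', '   ', '--']
```

The pattern catches empty, whitespace-only, and punctuation-only values, but it says nothing about whether a non-empty street value is a real address.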
City Name
The US Postal Service sells a regularly updated City State Product file. For paying customers who purchase the USPS file or for users of third-party services, “city” values can be matched against a controlled vocabulary of approved city names. As with street validation, when such resources are unavailable it’s probably best to check that the field is not empty.
The example below uses a regular expression, `[A-Za-z]+`, to match one or more letters:

    ...

    validate.regex(df['city'], r'[A-Za-z]+')

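When a controlled vocabulary of city names is available, the check becomes a set-membership test. The sketch below is a minimal illustration under that assumption: `approved_cities` is a tiny hypothetical stand-in for a licensed list, and `is_approved_city()` is not part of any library.

```python
# Hypothetical stand-in for a licensed city list; a real vocabulary
# would come from the purchased file or a third-party service.
approved_cities = {
    'CHICAGO', 'HARTFORD', 'NEW YORK', 'SAN FRANCISCO', 'WASHINGTON',
}

def is_approved_city(city):
    """Compare case-insensitively, ignoring surrounding whitespace."""
    return city.strip().upper() in approved_cities

cities = ['New York', 'chicago ', 'Gotham']
unapproved = [c for c in cities if not is_approved_city(c)]
print(unapproved)  # ['Gotham']
```

Normalizing case and whitespace before the lookup is a design choice; a stricter check could compare values exactly as they appear in the data.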
State Abbreviation
Unlike the previous fields, the set of possible state abbreviations is small and easy to check against. The set includes codes for all 50 states, the District of Columbia, US territories, associate states, and armed forces delivery codes.
In this example, we use `validate.subset()` to check that the values in the “state” column are members of the `state_codes` set:

    ...

    state_codes = {
        'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA',
        'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD',
        'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ',
        'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC',
        'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY',
        'DC', 'AS', 'GU', 'MP', 'PR', 'VI', 'FM', 'MH', 'PW',
        'AA', 'AE', 'AP',
    }
    validate.subset(df['state'], state_codes)

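The underlying idea is an ordinary set comparison. As a sketch outside of datatest, the plain-Python example below lists the values that fall outside the set; the `observed` list is made up for illustration.

```python
state_codes = {
    'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA',
    'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD',
    'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ',
    'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC',
    'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY',
    'DC', 'AS', 'GU', 'MP', 'PR', 'VI', 'FM', 'MH', 'PW',
    'AA', 'AE', 'AP',
}

# Made-up values, including some common data-entry variants.
observed = ['DC', 'NY', 'ny', 'XX', 'IL', 'N.Y.']

# Anything left after removing the known codes is unexpected.
unexpected = sorted(set(observed) - state_codes)
print(unexpected)  # ['N.Y.', 'XX', 'ny']
```

Note that the comparison is exact: lowercase or punctuated variants such as `ny` and `N.Y.` are treated as errors unless the data is normalized first.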
ZIP Code
The set of valid ZIP Codes is very large but they can be easily checked for well-formedness. Basic ZIP Codes are five digits and extended ZIP+4 Codes are nine digits (e.g., 20500 and 20500-0005).
This example uses a regex, `^\d{5}(-\d{4})?$`, to match the two possible formats:

    ...

    validate.regex(df['zipcode'], r'^\d{5}(-\d{4})?$')

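To get a feel for what this pattern accepts, the sketch below runs it against a few sample strings using the standard `re` module. The sample values are invented for illustration.

```python
import re

zip_pattern = re.compile(r'^\d{5}(-\d{4})?$')

samples = [
    '20500',        # basic five-digit form
    '20500-0005',   # ZIP+4 form
    '2050',         # too short
    '205000005',    # nine digits without the hyphen
    '20500-005',    # extension too short
    '06105-6400 ',  # trailing space
]

well_formed = [s for s in samples if zip_pattern.search(s)]
malformed = [s for s in samples if not zip_pattern.search(s)]

print(well_formed)  # ['20500', '20500-0005']
print(malformed)    # ['2050', '205000005', '20500-005', '06105-6400 ']
```

The `^` and `$` anchors are what make the check strict; without them, any string containing five consecutive digits would pass.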
State and ZIP Code Consistency
The first digit of a ZIP Code is associated with a specific region of the country (a group of states). For example, ZIP Codes beginning with “4” only occur in Indiana, Kentucky, Michigan, and Ohio. We can use these regional associations as a sanity check to make sure that our “state” and “zipcode” values are plausible and consistent.
The following example defines a helper function, `state_zip_consistency()`, to check the first digit of a ZIP Code against a set of associated state codes:

    ...

    def state_zip_consistency(state_zipcode):
        """ZIP Code should be consistent with state."""
        lookup = {
            '0': {'CT', 'MA', 'ME', 'NH', 'NJ', 'NY', 'PR', 'RI', 'VT', 'VI', 'AE'},
            '1': {'DE', 'NY', 'PA'},
            '2': {'DC', 'MD', 'NC', 'SC', 'VA', 'WV'},
            '3': {'AL', 'FL', 'GA', 'MS', 'TN', 'AA'},
            '4': {'IN', 'KY', 'MI', 'OH'},
            '5': {'IA', 'MN', 'MT', 'ND', 'SD', 'WI'},
            '6': {'IL', 'KS', 'MO', 'NE'},
            '7': {'AR', 'LA', 'OK', 'TX'},
            '8': {'AZ', 'CO', 'ID', 'NM', 'NV', 'UT', 'WY'},
            '9': {'AK', 'AS', 'CA', 'GU', 'HI', 'MH', 'FM', 'MP', 'OR', 'PW', 'WA', 'AP'},
        }
        state, zipcode = state_zipcode
        first_digit = zipcode[0]
        return state in lookup[first_digit]

    validate(df[['state', 'zipcode']], state_zip_consistency)

This check works well for detecting data processing errors that misalign or otherwise damage “state” and “zipcode” values. But it cannot detect ZIP Codes that are assigned to the wrong state within the same region. For example, it cannot tell that an Indiana ZIP Code was used on a Kentucky address, since ZIP Codes in both of these states begin with “4”.
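The helper above assumes every ZIP Code is a non-empty string that starts with a digit. If the data may contain blank, missing, or malformed values, a more defensive variant might look like the sketch below. The name `state_zip_consistency_safe()` is hypothetical, and returning `False` for unusable values (rather than raising an error) is an assumption about the desired behavior.

```python
LOOKUP = {
    '0': {'CT', 'MA', 'ME', 'NH', 'NJ', 'NY', 'PR', 'RI', 'VT', 'VI', 'AE'},
    '1': {'DE', 'NY', 'PA'},
    '2': {'DC', 'MD', 'NC', 'SC', 'VA', 'WV'},
    '3': {'AL', 'FL', 'GA', 'MS', 'TN', 'AA'},
    '4': {'IN', 'KY', 'MI', 'OH'},
    '5': {'IA', 'MN', 'MT', 'ND', 'SD', 'WI'},
    '6': {'IL', 'KS', 'MO', 'NE'},
    '7': {'AR', 'LA', 'OK', 'TX'},
    '8': {'AZ', 'CO', 'ID', 'NM', 'NV', 'UT', 'WY'},
    '9': {'AK', 'AS', 'CA', 'GU', 'HI', 'MH', 'FM', 'MP', 'OR', 'PW', 'WA', 'AP'},
}

def state_zip_consistency_safe(state_zipcode):
    """Like state_zip_consistency(), but False for unusable ZIP Codes."""
    state, zipcode = state_zipcode
    # Missing cells may arrive as non-string values (for example, NaN).
    if not isinstance(zipcode, str):
        return False
    # zipcode[:1] is '' for an empty string, so .get() falls back
    # to an empty set instead of raising an error.
    return state in LOOKUP.get(zipcode[:1], set())

print(state_zip_consistency_safe(('CT', '06105-6400')))  # True
print(state_zip_consistency_safe(('NY', '94118')))       # False
print(state_zip_consistency_safe(('NY', '')))            # False
print(state_zip_consistency_safe(('IN', '40202')))       # True
```

The last call shows the limitation described above: a ZIP Code beginning with “4” passes for Indiana regardless of which state in that region it actually belongs to.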