How to Validate Fuzzy Matches

When comparing strings of text, it is sometimes useful to check that values are similar rather than asserting that they are exactly the same. Datatest provides options for approximate string matching (also called “fuzzy matching”).

When checking mappings or sequences of values, you can accept approximate matches with the accepted.fuzzy() acceptance:

from datatest import validate, accepted

linked_record = {
    'id165': 'Saint Louis',
    'id382': 'Raliegh',
    'id592': 'Austin',
    'id720': 'Cincinatti',
    'id826': 'Philadelphia',
}

master_record = {
    'id165': 'St. Louis',
    'id382': 'Raleigh',
    'id592': 'Austin',
    'id720': 'Cincinnati',
    'id826': 'Philadelphia',
}

with accepted.fuzzy(cutoff=0.6):
    validate(linked_record, master_record)
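
To get a feel for what a cutoff of 0.6 means, it can help to score the pairs by hand. The sketch below is not datatest code: it assumes the similarity measure behaves like the standard library's difflib.SequenceMatcher.ratio(), which returns a value between 0.0 and 1.0. The similarity() helper and the pairs dictionary are hypothetical names introduced only for this illustration; consult the datatest reference for the measure it actually uses.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a score from 0.0 (nothing in common) to 1.0 (identical).

    Assumption: this mirrors the kind of ratio a fuzzy cutoff is
    compared against; it is not taken from datatest itself.
    """
    return SequenceMatcher(a=a, b=b).ratio()


# The (linked, master) value pairs from the example above.
pairs = {
    'id165': ('Saint Louis', 'St. Louis'),
    'id382': ('Raliegh', 'Raleigh'),
    'id592': ('Austin', 'Austin'),
    'id720': ('Cincinatti', 'Cincinnati'),
    'id826': ('Philadelphia', 'Philadelphia'),
}

cutoff = 0.6

# Score every pair and report whether it reaches the cutoff.
scores = {key: similarity(a, b) for key, (a, b) in pairs.items()}
for key, score in scores.items():
    status = 'within cutoff' if score >= cutoff else 'below cutoff'
    print(f'{key}: {score:.2f} ({status})')
```

Under this assumption, raising the cutoff toward 1.0 demands closer matches, while lowering it tolerates larger differences between the two strings.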

If variation is an inherent, natural feature of the data and does not necessarily represent a defect, it may be appropriate to use validate.fuzzy() instead of the acceptance shown previously:

from datatest import validate

linked_record = {
    'id165': 'Saint Louis',
    'id382': 'Raliegh',
    'id592': 'Austin',
    'id720': 'Cincinatti',
    'id826': 'Philadelphia',
}

master_record = {
    'id165': 'St. Louis',
    'id382': 'Raleigh',
    'id592': 'Austin',
    'id720': 'Cincinnati',
    'id826': 'Philadelphia',
}

validate.fuzzy(linked_record, master_record, cutoff=0.6)

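That said, an acceptance is probably the better fit for this specific example: spellings such as 'Raliegh' and 'Cincinatti' are errors in the linked record rather than natural variation in the data.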