Datatest Core API Reference

Validation

datatest.validate(data, requirement, msg=None)

Raise a ValidationError if data does not satisfy requirement, or pass without error if data is valid.

This is a rich comparison function—the given data and requirement arguments can be mappings, iterables, or other objects (including objects from pandas, numpy, database cursors, and squint). An optional msg string can be provided to describe the validation.

Predicate Validation:

When requirement is a callable, tuple, string, or non-iterable object, it is used to construct a Predicate for testing elements in data:

from datatest import validate

data = [2, 4, 6, 8]

def is_even(x):
    return x % 2 == 0

validate(data, is_even)  # <- callable used as predicate

If the predicate returns False, then an Invalid or Deviation difference is generated. If the predicate returns a difference object, that object is used in place of a generated difference (see Differences). When the predicate returns any other truthy value, an element is considered valid.
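
The three outcomes described above can be sketched in plain Python. This is an illustration only, not datatest's implementation: the Difference class and collect_differences() helper are hypothetical stand-ins, every falsy result is treated like False (an assumption), and the Deviation case for numeric values is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Difference:
    """Hypothetical stand-in for a difference object (not a datatest class)."""
    kind: str
    value: object


def collect_differences(data, predicate):
    """Apply predicate to each element and gather the resulting differences."""
    differences = []
    for element in data:
        result = predicate(element)
        if isinstance(result, Difference):
            # The predicate returned its own difference: use it as-is.
            differences.append(result)
        elif not result:
            # Falsy result: generate a difference for the element.
            differences.append(Difference('Invalid', element))
        # Any other truthy result means the element is valid.
    return differences


def is_even(x):
    return x % 2 == 0


def even_or_report(x):
    # Returns True for valid values, or a custom difference otherwise.
    return True if x % 2 == 0 else Difference('Odd', x)


generated = collect_differences([2, 4, 6, 9], is_even)
custom = collect_differences([2, 9], even_or_report)
```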

Set Validation:

When requirement is a set, the elements in data are checked for membership in the set:

from datatest import validate

data = ['a', 'a', 'b', 'b', 'c', 'c']

required_set = {'a', 'b', 'c'}

validate(data, required_set)  # <- tests for set membership

If the elements in data do not match the required set, then Missing and Extra differences are generated.
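
As a rough illustration of how the two kinds of differences relate to the data, the membership check can be modeled with ordinary set operations. The set_differences() helper is hypothetical and only mirrors the description above; it is not how datatest is implemented.

```python
def set_differences(data, required_set):
    """Return (missing, extra) elements relative to required_set."""
    observed = set(data)                # duplicates in data do not matter
    missing = required_set - observed   # required but never seen
    extra = observed - required_set     # seen but not required
    return missing, extra


# Data from the example above: every required element is present.
ok_missing, ok_extra = set_differences(
    ['a', 'a', 'b', 'b', 'c', 'c'], {'a', 'b', 'c'}
)

# Hypothetical data lacking 'c' and containing an unexpected 'd'.
missing, extra = set_differences(['a', 'b', 'd'], {'a', 'b', 'c'})
```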

Sequence Validation:

When requirement is an iterable type other than a set, mapping, tuple or string, then data is validated by index position. Elements are checked for predicate matches against required objects of the same index position (both data and requirement should yield values in a predictable order):

from datatest import validate

data = ['A', 'B', 'C', ...]

sequence = ['A', 'B', 'C', ...]

validate(data, sequence)  # <- compare elements by position

For details on predicate matching, see Predicate.

Mapping Validation:

When requirement is a dictionary or other mapping, the values in data are checked against required objects of the same key (data must also be a mapping):

from datatest import validate

data = {'A': 1, 'B': 2, 'C': ...}

required_dict = {'A': 1, 'B': 2, 'C': ...}

validate(data, required_dict)  # <- compares values

If values do not satisfy the corresponding required object, then differences are generated according to each object type. If an object itself is a nested mapping, it is treated as a predicate object.

Requirement Object Validation:

When requirement is a subclass of BaseRequirement, then validation and difference generation are delegated to the requirement itself.

In addition to validate()’s default behavior, the following methods can be used to specify additional validation behaviors.

predicate(data, requirement, msg=None)

Use requirement to construct a Predicate and check elements in data for matches (see predicate validation for more details).

regex(data, requirement, flags=0, msg=None)

Require that string values match a given regular expression (also see Regular Expression Syntax):

from datatest import validate

data = ['46532', '43206', '60632']

validate.regex(data, r'^\d{5}$')

The example above is roughly equivalent to:

import re
from datatest import validate

data = ['46532', '43206', '60632']

validate(data, re.compile(r'^\d{5}$'))
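
To preview which values such a pattern would reject, the same expression can be applied with the standard re module. This sketch assumes the pattern is applied with search(), and the last value in data is a hypothetical bad record added for illustration.

```python
import re

pattern = re.compile(r'^\d{5}$')

# The first three values come from the example above; '4653' is hypothetical.
data = ['46532', '43206', '60632', '4653']

# Collect the values that the pattern does not match.
non_matching = [value for value in data if not pattern.search(value)]
```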

approx(data, requirement, places=None, msg=None, delta=None)

Require that numeric values are approximately equal. The given requirement can be a single element or a mapping.

Values compare as equal if their difference rounded to the given number of decimal places (default 7) equals zero, or if the difference between values is less than or equal to a given delta:

from datatest import validate

data = {'A': 1.3125, 'B': 8.6875}

requirement = {'A': 1.31, 'B': 8.69}

validate.approx(data, requirement, places=2)

It is appropriate to use validate.approx() when checking for nominal values, where some deviation is considered an intrinsic feature of the data. But when deviations represent an undesired-but-acceptable variation, accepted.tolerance() would be more fitting.
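
The comparison rule described above (round the difference to a number of places, or compare it against a delta) can be sketched as follows. The approx_equal() helper is hypothetical and assumes that delta, when given, takes the place of the rounding check.

```python
def approx_equal(a, b, places=None, delta=None):
    """Return True if a and b are approximately equal."""
    diff = abs(a - b)
    if delta is not None:
        # Delta-based comparison: difference must not exceed delta.
        return diff <= delta
    if places is None:
        places = 7  # default number of decimal places
    # Places-based comparison: rounded difference must be zero.
    return round(diff, places) == 0


data = {'A': 1.3125, 'B': 8.6875}
requirement = {'A': 1.31, 'B': 8.69}

# Both values differ from their requirement by about 0.0025.
results = {key: approx_equal(data[key], requirement[key], places=2)
           for key in data}
```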

fuzzy(data, requirement, cutoff=0.6, msg=None)

Require that strings match with a similarity greater than or equal to cutoff (default 0.6).

Similarity measures are determined using SequenceMatcher.ratio() from the Standard Library’s difflib module. The values range from 1.0 (exactly the same) to 0.0 (completely different).

from datatest import validate

data = {
    'MO': 'Saint Louis',
    'NY': 'New York',  # <- does not meet cutoff
    'OH': 'Cincinatti',
}

requirement = {
    'MO': 'St. Louis',
    'NY': 'New York City',
    'OH': 'Cincinnati',
}

validate.fuzzy(data, requirement, cutoff=0.8)
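
The similarity scores behind this example can be inspected directly with difflib. The sketch below assumes each data value is compared against its requirement as SequenceMatcher(a=value, b=required); the argument order used internally by datatest is not stated here.

```python
from difflib import SequenceMatcher


def similarity(value, required):
    """Similarity ratio between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(a=value, b=required).ratio()


data = {
    'MO': 'Saint Louis',
    'NY': 'New York',
    'OH': 'Cincinatti',
}

requirement = {
    'MO': 'St. Louis',
    'NY': 'New York City',
    'OH': 'Cincinnati',
}

# Keys whose similarity falls below the cutoff used in the example.
below_cutoff = [key for key in data
                if similarity(data[key], requirement[key]) < 0.8]
```

Only 'NY' falls below the cutoff: 'New York' shares 8 characters with 'New York City', giving a ratio of 16/21.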

interval(data, min=None, max=None, msg=None)

Require that values are within the defined interval:

from datatest import validate

data = [5, 10, 15, 20]  # <- 20 outside of interval

validate.interval(data, 5, 15)

Require that values are greater than or equal to min (omitting max creates a left-bounded interval):

from datatest import validate

data = [5, 10, 15, 20]

validate.interval(data, min=5)

Require that values are less than or equal to max (omitting min creates a right-bounded interval):

from datatest import validate

data = [5, 10, 15, 20]

validate.interval(data, max=20)
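
The three interval forms above share one rule: a bound is only checked when it is given, and both bounds are inclusive. A minimal sketch, using a hypothetical outside_interval() helper whose parameter names are chosen to avoid shadowing the min() and max() built-ins:

```python
def outside_interval(data, minimum=None, maximum=None):
    """Return the values that fall outside the (inclusive) interval."""
    return [
        x for x in data
        if (minimum is not None and x < minimum)
        or (maximum is not None and x > maximum)
    ]


data = [5, 10, 15, 20]

closed = outside_interval(data, 5, 15)              # both bounds
left_bounded = outside_interval(data, minimum=5)    # no upper bound
right_bounded = outside_interval(data, maximum=20)  # no lower bound
```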

set(data, requirement, msg=None)

Check that the set of elements in data matches the set of elements in requirement (applies set validation using a requirement of any iterable type).

subset(data, requirement, msg=None)

Check that the set of elements in data is a subset of the set of elements in requirement (i.e., that every element of data is also a member of requirement).

from datatest import validate

data = ['A', 'B', 'C']

requirement = {'A', 'B', 'C', 'D'}

validate.subset(data, requirement)

Attention

Since version 0.10.0, the semantics of subset() have been inverted. To mitigate problems for users upgrading from 0.9.6, this method issues a warning.

To ignore this warning you can add the following lines to your code:

import warnings
warnings.filterwarnings('ignore', message='subset and superset warning')

And for pytest users, you can add the following to the beginning of a test script:

import pytest

pytestmark = pytest.mark.filterwarnings('ignore:subset and superset warning')

superset(data, requirement, msg=None)

Check that the set of elements in data is a superset of the set of elements in requirement (i.e., that members of data include all elements of requirement).

from datatest import validate

data = ['A', 'B', 'C', 'D']

requirement = {'A', 'B', 'C'}

validate.superset(data, requirement)

Attention

Since version 0.10.0, the semantics of superset() have been inverted. To mitigate problems for users upgrading from 0.9.6, this method issues a warning.

To ignore this warning you can add the following lines to your code:

import warnings
warnings.filterwarnings('ignore', message='subset and superset warning')

And for pytest users, you can add the following to the beginning of a test script:

import pytest

pytestmark = pytest.mark.filterwarnings('ignore:subset and superset warning')

unique(data, msg=None)

Require that elements in data are unique:

from datatest import validate

data = [1, 2, 3, ...]

validate.unique(data)

order(data, requirement, msg=None)

Check that elements in data match the relative order of elements in requirement:

from datatest import validate

data = ['A', 'C', 'D', 'E', 'F', ...]

required_order = ['A', 'B', 'C', 'D', 'E', ...]

validate.order(data, required_order)

If elements do not match the required order, Missing and Extra differences are raised. Each difference will contain a two-tuple whose first value is the index of the position in data where the difference occurs and whose second value is the non-matching element itself.

In the given example, data is missing 'B' at index 1 and contains an extra 'F' at index 4:

                     extra
                       ↓
data:         A C D  E F ...
requirement:  A B C  D E ...
                ↑
             missing

The validation fails with the following error:

ValidationError: does not match required order (2 differences): [
    Missing((1, 'B')),
    Extra((4, 'F')),
]

Notice there are no differences for 'C', 'D', and 'E' because their relative order matches the requirement—even though their index positions are different.
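
One way to reproduce this kind of relative-order comparison is with difflib.SequenceMatcher, which aligns two sequences and reports where elements were inserted or deleted. The sketch below is an assumption about how such differences could be derived, not a description of datatest's internals; order_differences() is a hypothetical helper and the sequences are truncated versions of the example data.

```python
from difflib import SequenceMatcher


def order_differences(data, requirement):
    """Return (kind, (index_in_data, element)) pairs for order mismatches."""
    matcher = SequenceMatcher(a=data, b=requirement, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ('insert', 'replace'):
            # Elements required at this position but absent from data.
            differences.extend(
                ('Missing', (i1, element)) for element in requirement[j1:j2]
            )
        if tag in ('delete', 'replace'):
            # Elements present in data but not expected at this position.
            differences.extend(
                ('Extra', (i, data[i])) for i in range(i1, i2)
            )
    return differences


data = ['A', 'C', 'D', 'E', 'F']
required_order = ['A', 'B', 'C', 'D', 'E']

differences = order_differences(data, required_order)
```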

Note

Calling validate() or its methods will either raise an exception or pass without error. To get an explicit True/False return value, use the valid() function instead.

datatest.valid(data, requirement)

Return True if data satisfies requirement; otherwise return False.

See validate() for supported data and requirement values and detailed validation behavior.

exception datatest.ValidationError(differences, description=None)

This exception is raised when data validation fails.

differences

A collection of “difference” objects to describe elements in the data under test that do not satisfy the requirement.

description

An optional description of the failed requirement.

Differences

class datatest.BaseDifference

The base class for “difference” objects—all other difference classes are derived from this base.

args

The tuple of arguments given to the difference constructor. Some differences (like Deviation) expect a certain number of arguments and assign a special meaning to the elements of this tuple, while others are called with only a single value.

class datatest.Missing(value)

Created when value is missing from the data under test.

In the following example, the required value 'A' is missing from the data under test:

import datatest

data = ['B', 'C']

requirement = {'A', 'B', 'C'}

datatest.validate(data, requirement)

Running this example raises the following error:

ValidationError: does not satisfy set membership (1 difference): [
    Missing('A'),
]

class datatest.Extra(value)

Created when value is unexpectedly found in the data under test.

In the following example, the value 'C' is found in the data under test but it’s not part of the required values:

import datatest

data = ['A', 'B', 'C']

requirement = {'A', 'B'}

datatest.validate(data, requirement)

Running this example raises the following error:

ValidationError: does not satisfy set membership (1 difference): [
    Extra('C'),
]

class datatest.Invalid(invalid, expected=<no value>)

Created when a value does not satisfy a function, equality, or regular expression requirement.

In the following example, the value 9 does not satisfy the required function:

import datatest

data = [2, 4, 6, 9]

def is_even(x):
    return x % 2 == 0

datatest.validate(data, is_even)

Running this example raises the following error:

ValidationError: does not satisfy is_even() (1 difference): [
    Invalid(9),
]

invalid

The invalid value under test.

expected

The expected value (optional).

class datatest.Deviation(deviation, expected)

Created when a quantitative value deviates from its expected value.

In the following example, the dictionary item 'C': 33 does not satisfy the required item 'C': 30:

import datatest

data = {'A': 10, 'B': 20, 'C': 33}

requirement = {'A': 10, 'B': 20, 'C': 30}

datatest.validate(data, requirement)

Running this example raises the following error:

ValidationError: does not satisfy mapping requirement (1 difference): {
    'C': Deviation(+3, 30),
}

deviation

Quantitative deviation from expected value.

expected

The expected value.
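
The two arguments in the example above follow from simple arithmetic: the deviation is the value under test minus the expected value (33 - 30 = +3). A small sketch, assuming data and requirement share the same keys and representing each result as a plain (deviation, expected) tuple rather than a Deviation object:

```python
data = {'A': 10, 'B': 20, 'C': 33}
requirement = {'A': 10, 'B': 20, 'C': 30}

# For each key whose value differs, record (deviation, expected).
deviations = {
    key: (data[key] - expected, expected)
    for key, expected in requirement.items()
    if data[key] != expected
}
```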

Acceptances

Acceptances are context managers that operate on a ValidationError’s collection of differences.

datatest.accepted(obj, msg=None, scope=None)

Returns a context manager that accepts differences that match obj without triggering a test failure. The given obj can be a difference class, a difference instance, or a collection of instances.

When obj is a difference class, differences are accepted if they are instances of the class. When obj is a difference instance or collection of instances, then differences are accepted if they compare as equal to one of the accepted instances.

If given, the scope can be 'element', 'group', or 'whole'. An element-wise scope will accept all differences that have a match in obj. A group-wise scope will accept one difference per match in obj per group. A whole-error scope will accept one difference per match in obj over the ValidationError as a whole.

If unspecified, scope will default to 'element' if obj is a single element and 'group' if obj is a collection of elements. If obj is a mapping, the scope is limited to the group of differences associated with a given key (which effectively treats whole-error scopes the same as group-wise scopes).

Accepted Type:

When obj is a class (Missing, Extra, Deviation, Invalid, etc.), differences are accepted if they are instances of the class.

The following example accepts all instances of the Missing class:

from datatest import validate, accepted, Missing

data = ['A', 'B']

requirement = {'A', 'B', 'C'}

with accepted(Missing):
    validate(data, requirement)

Without this acceptance, the validation would have failed with the following error:

ValidationError: does not satisfy set membership (1 difference): [
    Missing('C'),
]

Accepted Difference:

When obj is an instance, differences are accepted if they match the instance exactly.

The following example accepts all differences that match Extra('D'):

from datatest import validate, accepted, Extra

data = ['A', 'B', 'C', 'D']

requirement = {'A', 'B', 'C'}

with accepted(Extra('D')):
    validate(data, requirement)

Without this acceptance, the validation would have failed with the following error:

ValidationError: does not satisfy set membership (1 difference): [
    Extra('D'),
]

Accepted Collection:

When obj is a collection of difference instances, then an error’s differences are accepted if they match an instance in the given collection:

from datatest import validate, accepted, Missing, Extra

data = ['x', 'y', 'q']

requirement = {'x', 'y', 'z'}

known_issues = accepted([
    Extra('q'),
    Missing('z'),
])

with known_issues:
    validate(data, requirement)

A dictionary of acceptances can accept groups of differences by matching key:

from datatest import validate, accepted, Missing, Extra

data = {
    'A': ['x', 'y', 'q'],
    'B': ['x', 'y'],
}

requirement = {'x', 'y', 'z'}

known_issues = accepted({
    'A': [Extra('q'), Missing('z')],
    'B': [Missing('z')],
})

with known_issues:
    validate(data, requirement)

keys(predicate, msg=None)

Returns a context manager that accepts differences whose associated keys satisfy the given predicate (see Predicates for details).

The following example accepts differences associated with the key 'B':

from datatest import validate, accepted

data = {'A': 'x', 'B': 'y'}

requirement = 'x'

with accepted.keys('B'):
    validate(data, requirement)

Without this acceptance, the validation would have failed with the following error:

ValidationError: does not satisfy 'x' (1 difference): {
    'B': Invalid('y'),
}

args(predicate, msg=None)

Returns a context manager that accepts differences whose args satisfy the given predicate (see Predicates for details).

The example below accepts differences that contain the value 'y':

from datatest import validate, accepted

data = {'A': 'x', 'B': 'y'}

requirement = 'x'

with accepted.args('y'):
    validate(data, requirement)

Without this acceptance, the validation would have failed with the following error:

ValidationError: does not satisfy 'x' (1 difference): {
    'B': Invalid('y'),
}

tolerance(tolerance, /, msg=None)
tolerance(lower, upper, msg=None)

Accepts quantitative differences within a given tolerance without triggering a test failure:

from datatest import validate, accepted

data = {'A': 45, 'B': 205}

requirement = {'A': 50, 'B': 200}

with accepted.tolerance(5):
    validate(data, requirement)

The example above accepts differences within a tolerance of ±5. Without this acceptance, the validation would have failed with the following error:

ValidationError: does not satisfy mapping requirements (2 differences): {
    'A': Deviation(-5, 50),
    'B': Deviation(+5, 200),
}

Specifying different lower and upper bounds:

with accepted.tolerance(-2, 7):  # <- tolerance from -2 to +7
    validate(..., ...)

Deviations within the given range are suppressed while those outside the range will trigger a test failure.
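
The range check can be sketched as below. The within_tolerance() helper is hypothetical; it treats a single argument as a symmetric bound and treats both bounds as inclusive, which is consistent with the ±5 example above accepting deviations of exactly -5 and +5.

```python
def within_tolerance(deviation, lower, upper=None):
    """Return True if deviation lies within the accepted range."""
    if upper is None:
        # Single-argument form: symmetric tolerance of plus or minus `lower`.
        lower, upper = -abs(lower), abs(lower)
    return lower <= deviation <= upper


# Deviations from the example above, checked against a tolerance of 5.
symmetric = [within_tolerance(d, 5) for d in (-5, +5)]

# Asymmetric bounds from -2 to +7.
asymmetric = [within_tolerance(d, -2, 7) for d in (-3, -2, 7, 8)]
```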

percent(tolerance, /, msg=None)
percent(lower, upper, msg=None)

Accepts percentages of error within a given tolerance without triggering a test failure:

from datatest import validate, accepted

data = {'A': 47, 'B': 318}

requirement = {'A': 50, 'B': 300}

with accepted.percent(0.06):
    validate(data, requirement)

The example above accepts differences within a tolerance of ±6%. Without this acceptance, the validation would have failed with the following error:

ValidationError: does not satisfy mapping requirements (2 differences): {
    'A': Deviation(-3, 50),
    'B': Deviation(+18, 300),
}

Specifying different lower and upper bounds:

with accepted.percent(-0.02, 0.01):  # <- tolerance from -2% to +1%
    validate(..., ...)

Deviations within the given range are suppressed while those outside the range will trigger a test failure.
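
For the example above, the percentage of error is the deviation divided by the expected value: -3 / 50 = -6% and +18 / 300 = +6%. The sketch below works through that arithmetic; percent_error() is a hypothetical helper and it assumes a nonzero expected value.

```python
def percent_error(deviation, expected):
    """Deviation expressed as a fraction of the expected value."""
    return deviation / expected  # assumes expected != 0


# (deviation, expected) pairs taken from the error shown above.
deviations = {'A': (-3, 50), 'B': (18, 300)}

errors = {key: percent_error(dev, exp) for key, (dev, exp) in deviations.items()}

# Symmetric tolerance of 6 percent: both deviations sit on the boundary.
accepted_symmetric = [k for k, e in errors.items() if -0.06 <= e <= 0.06]

# Asymmetric tolerance from -2 percent to +1 percent: neither is accepted.
accepted_asymmetric = [k for k, e in errors.items() if -0.02 <= e <= 0.01]
```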

fuzzy(cutoff=0.6, msg=None)

Returns a context manager that accepts invalid strings that match their expected value with a similarity greater than or equal to cutoff (default 0.6). Similarity measures are determined using SequenceMatcher.ratio() from the Standard Library’s difflib module. The values range from 1.0 (exactly the same) to 0.0 (completely different).

The following example accepts string differences that match with a ratio of 0.6 or greater:

from datatest import validate, accepted

data = {'A': 'aax', 'B': 'bbx'}

requirement = {'A': 'aaa', 'B': 'bbb'}

with accepted.fuzzy(cutoff=0.6):
    validate(data, requirement)

Without this acceptance, the validation would have failed with the following error:

ValidationError: does not satisfy mapping requirements (2 differences): {
    'A': Invalid('aax', expected='aaa'),
    'B': Invalid('bbx', expected='bbb'),
}

count(number, msg=None, scope=None)

Returns a context manager that accepts up to a given number of differences without triggering a test failure. If the count of differences exceeds the given number, the test case will fail with a ValidationError containing the remaining differences.

The following example accepts up to 2 differences:

from datatest import validate, accepted

data = ['A', 'B', 'A', 'C']

requirement = 'A'

with accepted.count(2):
    validate(data, requirement)

Without this acceptance, the validation would have failed with the following error:

ValidationError: does not satisfy 'A' (2 differences): [
    Invalid('B'),
    Invalid('C'),
]

Composability

Acceptances can be combined to create new acceptances with modified behavior.

The & operator can be used to create an intersection of acceptance criteria. In the following example, accepted(Missing) and accepted.count(5) are combined into a single acceptance that accepts up to five Missing differences:

from datatest import validate, accepted, Missing

with accepted(Missing) & accepted.count(5):
    validate(..., ...)

The | operator can be used to create a union of acceptance criteria. In the following example, accepted.tolerance() and accepted.percent() are combined into a single acceptance that accepts Deviations of ±10 as well as Deviations of ±5%:

from datatest import validate, accepted

with accepted.tolerance(10) | accepted.percent(0.05):
    validate(..., ...)

And composed acceptances, themselves, can be composed to define increasingly specific criteria:

from datatest import validate, accepted, Missing

five_missing = accepted(Missing) & accepted.count(5)

minor_deviations = accepted.tolerance(10) | accepted.percent(0.05)

with five_missing | minor_deviations:
    validate(..., ...)

Order of Operations

Acceptance composition uses the following order of operations—shown from highest precedence to lowest precedence. Operations with the same precedence level (appearing in the same cell) are evaluated from left to right.

Order  Operation  Description
-----  ---------  --------------------------
1      ()         Parentheses
2      &          Bitwise AND (intersection)
3      |          Bitwise OR (union)
4                 Element-wise acceptances
5                 Group-wise acceptances
6                 Whole-error acceptances

Predicates

Datatest can use Predicate objects for validation and to define certain acceptances.

class datatest.Predicate(obj, name=None)

A Predicate is used like a function of one argument that returns True when applied to a matching value and False when applied to a non-matching value. The criteria for matching are determined by the obj type used to define the predicate:

obj type                matches when
----------------------  --------------------------------------------
function                the result of function(value) tests as True
type                    value is an instance of the type
re.compile(pattern)     value matches the regular expression pattern
True                    value is truthy (bool(value) returns True)
False                   value is falsy (bool(value) returns False)
str or non-container    value is equal to the object
set                     value is a member of the set
tuple of predicates     tuple of values satisfies corresponding tuple of
                        predicates, each according to their type
... (Ellipsis literal)  (used as a wildcard, matches any value)

Example matches:

obj example                value        matches
-------------------------  -----------  -------
def is_even(x):            4            Yes
    return x % 2 == 0      9            No
float                      1.0          Yes
                           1            No
re.compile('[bc]ake')      'bake'       Yes
                           'cake'       Yes
                           'fake'       No
True                       'x'          Yes
                           ''           No
False                      ''           Yes
                           'x'          No
'foo'                      'foo'        Yes
                           'bar'        No
{'A', 'B'}                 'A'          Yes
                           'C'          No
('A', float)               ('A', 1.0)   Yes
                           ('A', 2)     No
('A', ...)                 ('A', 'X')   Yes
(uses ellipsis wildcard)   ('A', 'Y')   Yes
                           ('B', 'X')   No

Example code:

>>> from datatest import Predicate
>>> pred = Predicate({'A', 'B'})
>>> pred('A')
True
>>> pred('C')
False
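
The matching rules in the tables above can be approximated with a short dispatch function. This is a simplified sketch for illustration and not the Predicate class itself: matches() is a hypothetical helper, regular expressions are applied with search() (an assumption), and only the object types listed in the table are handled.

```python
import re


def matches(obj, value):
    """Return True if value matches obj under the rules in the table."""
    if obj is Ellipsis:
        return True                       # wildcard matches anything
    if obj is True:
        return bool(value)                # truthy check
    if obj is False:
        return not bool(value)            # falsy check
    if isinstance(obj, type):
        return isinstance(value, obj)     # instance-of check
    if isinstance(obj, re.Pattern):
        return isinstance(value, str) and obj.search(value) is not None
    if isinstance(obj, (set, frozenset)):
        return value in obj               # membership check
    if isinstance(obj, tuple):
        # Each position is matched according to its own type.
        return (isinstance(value, tuple)
                and len(value) == len(obj)
                and all(matches(o, v) for o, v in zip(obj, value)))
    if callable(obj):
        return bool(obj(value))           # function result tests as True
    return value == obj                   # str or non-container: equality


float_results = [matches(float, 1.0), matches(float, 1)]
tuple_results = [matches(('A', ...), ('A', 'X')), matches(('A', ...), ('B', 'X'))]
```

Types are tested before the general callable case because classes are themselves callable; checking them later would call float(value) instead of testing isinstance(value, float).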

Predicate matching behavior can also be inverted with the inversion operator (~). Inverted Predicates return False when applied to a matching value and True when applied to a non-matching value:

>>> pred = ~Predicate({'A', 'B'})
>>> pred('A')
False
>>> pred('C')
True

If the name argument is given, a __name__ attribute is defined using the given value:

>>> pred = Predicate({'A', 'B'}, name='a_or_b')
>>> pred.__name__
'a_or_b'

If the name argument is omitted, the object will not have a __name__ attribute:

>>> pred = Predicate({'A', 'B'})
>>> pred.__name__
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    pred.__name__
AttributeError: 'Predicate' object has no attribute '__name__'