Datatest Core API Reference
Validation
-
datatest.validate(data, requirement, msg=None)
Raise a ValidationError if data does not satisfy requirement, or pass without error if data is valid.

This is a rich comparison function: the given data and requirement arguments can be mappings, iterables, or other objects (including objects from pandas, numpy, database cursors, and squint). An optional msg string can be provided to describe the validation.

Predicate Validation:
When requirement is a callable, tuple, string, or non-iterable object, it is used to construct a Predicate for testing elements in data:

    from datatest import validate

    data = [2, 4, 6, 8]

    def is_even(x):
        return x % 2 == 0

    validate(data, is_even)  # <- callable used as predicate

If the predicate returns False, then an Invalid or Deviation difference is generated. If the predicate returns a difference object, that object is used in place of a generated difference (see Differences). When the predicate returns any other truthy value, an element is considered valid.

Set Validation:
When requirement is a set, the elements in data are checked for membership in the set:

    from datatest import validate

    data = ['a', 'a', 'b', 'b', 'c', 'c']
    required_set = {'a', 'b', 'c'}
    validate(data, required_set)  # <- tests for set membership

If the elements in data do not match the required set, then Missing and Extra differences are generated.

Sequence Validation:
When requirement is an iterable type other than a set, mapping, tuple or string, then data is validated by index position. Elements are checked for predicate matches against required objects of the same index position (both data and requirement should yield values in a predictable order):

    from datatest import validate

    data = ['A', 'B', 'C', ...]
    sequence = ['A', 'B', 'C', ...]
    validate(data, sequence)  # <- compare elements by position

For details on predicate matching, see Predicate.

Mapping Validation:
When requirement is a dictionary or other mapping, the values in data are checked against required objects of the same key (data must also be a mapping):

    from datatest import validate

    data = {'A': 1, 'B': 2, 'C': ...}
    required_dict = {'A': 1, 'B': 2, 'C': ...}
    validate(data, required_dict)  # <- compares values

If values do not satisfy the corresponding required object, then differences are generated according to each object type. If an object itself is a nested mapping, it is treated as a predicate object.
Requirement Object Validation:
When requirement is a subclass of BaseRequirement, then validation and difference generation are delegated to the requirement itself.

In addition to validate()’s default behavior, the following methods can be used to specify additional validation behaviors.
-
predicate(data, requirement, msg=None)
Use requirement to construct a Predicate and check elements in data for matches (see predicate validation for more details).
-
regex(data, requirement, flags=0, msg=None)
Require that string values match a given regular expression (also see Regular Expression Syntax):

    from datatest import validate

    data = ['46532', '43206', '60632']
    validate.regex(data, r'^\d{5}$')

The example above is roughly equivalent to:

    import re
    from datatest import validate

    data = ['46532', '43206', '60632']
    validate(data, re.compile(r'^\d{5}$'))

-
approx(data, requirement, places=None, msg=None, delta=None)
Require that numeric values are approximately equal. The given requirement can be a single element or a mapping.
Values compare as equal if their difference rounded to the given number of decimal places (default 7) equals zero, or if the difference between values is less than or equal to a given delta:

    from datatest import validate

    data = {'A': 1.3125, 'B': 8.6875}
    requirement = {'A': 1.31, 'B': 8.69}
    validate.approx(data, requirement, places=2)

It is appropriate to use validate.approx() when checking for nominal values, where some deviation is considered an intrinsic feature of the data. But when deviations represent an undesired-but-acceptable variation, accepted.tolerance() would be more fitting.
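The comparison rule described for validate.approx() can be sketched in plain Python. This only illustrates the stated rule and is not datatest's implementation; the helper name approx_equal is hypothetical, and letting delta take precedence over places is an assumption.

```python
def approx_equal(value, required, places=None, delta=None):
    """Hypothetical helper mirroring the rule described for validate.approx()."""
    difference = abs(value - required)
    if delta is not None:
        # Assumption: when delta is given, it is used instead of places.
        return difference <= delta
    if places is None:
        places = 7  # the documented default
    # Equal when the difference, rounded to `places` decimals, is zero.
    return round(difference, places) == 0


data = {'A': 1.3125, 'B': 8.6875}
requirement = {'A': 1.31, 'B': 8.69}
results = {key: approx_equal(data[key], requirement[key], places=2) for key in data}
print(results)
```

With places=2 both values compare as equal, while the default of 7 places would treat the same pairs as different.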
-
fuzzy(data, requirement, cutoff=0.6, msg=None)
Require that strings match with a similarity greater than or equal to cutoff (default 0.6).
Similarity measures are determined using SequenceMatcher.ratio() from the Standard Library’s difflib module. The values range from 1.0 (exactly the same) to 0.0 (completely different).

    from datatest import validate

    data = {
        'MO': 'Saint Louis',
        'NY': 'New York',  # <- does not meet cutoff
        'OH': 'Cincinatti',
    }
    requirement = {
        'MO': 'St. Louis',
        'NY': 'New York City',
        'OH': 'Cincinnati',
    }
    validate.fuzzy(data, requirement, cutoff=0.8)

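To see why only the 'NY' item falls short of the cutoff, the similarity ratios can be computed directly with difflib. This is a sketch that uses the Standard Library only; the helper name similarity and the order in which the two strings are passed to SequenceMatcher are assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return the SequenceMatcher ratio for two strings (1.0 means identical)."""
    return SequenceMatcher(None, a, b).ratio()


data = {'MO': 'Saint Louis', 'NY': 'New York', 'OH': 'Cincinatti'}
requirement = {'MO': 'St. Louis', 'NY': 'New York City', 'OH': 'Cincinnati'}
cutoff = 0.8

# Collect the keys whose strings are less similar than the cutoff.
below_cutoff = [
    key for key in data
    if similarity(data[key], requirement[key]) < cutoff
]
print(below_cutoff)
```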
-
interval(data, min=None, max=None, msg=None)
Require that values are within the defined interval:

    from datatest import validate

    data = [5, 10, 15, 20]  # <- 20 outside of interval
    validate.interval(data, 5, 15)

Require that values are greater than or equal to min (omitting max creates a left-bounded interval):

    from datatest import validate

    data = [5, 10, 15, 20]
    validate.interval(data, min=5)

Require that values are less than or equal to max (omitting min creates a right-bounded interval):

    from datatest import validate

    data = [5, 10, 15, 20]
    validate.interval(data, max=20)

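The three cases above share one rule: both bounds are inclusive and either may be omitted. A minimal sketch of that rule, assuming a hypothetical helper in_interval (the parameters are named lower and upper here to avoid shadowing the built-in min and max):

```python
def in_interval(value, lower=None, upper=None):
    """Hypothetical helper: inclusive bounds, either of which may be omitted."""
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True


data = [5, 10, 15, 20]
outside_closed = [x for x in data if not in_interval(x, 5, 15)]
outside_left_bounded = [x for x in data if not in_interval(x, lower=5)]
outside_right_bounded = [x for x in data if not in_interval(x, upper=20)]
print(outside_closed, outside_left_bounded, outside_right_bounded)
```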
-
set(data, requirement, msg=None)
Check that the set of elements in data matches the set of elements in requirement (applies set validation using a requirement of any iterable type).
-
subset(data, requirement, msg=None)
Check that the set of elements in data is a subset of the set of elements in requirement (i.e., that every element of data is also a member of requirement).

    from datatest import validate

    data = ['A', 'B', 'C']
    requirement = {'A', 'B', 'C', 'D'}
    validate.subset(data, requirement)

Attention
Since version 0.10.0, the semantics of subset() have been inverted. To mitigate problems for users upgrading from 0.9.6, this method issues a warning.
To ignore this warning you can add the following lines to your code:

    import warnings
    warnings.filterwarnings('ignore', message='subset and superset warning')

And for pytest users, you can add the following to the beginning of a test script:

    import pytest
    pytestmark = pytest.mark.filterwarnings('ignore:subset and superset warning')

-
superset(data, requirement, msg=None)
Check that the set of elements in data is a superset of the set of elements in requirement (i.e., that members of data include all elements of requirement).

    from datatest import validate

    data = ['A', 'B', 'C', 'D']
    requirement = {'A', 'B', 'C'}
    validate.superset(data, requirement)

Attention
Since version 0.10.0, the semantics of superset() have been inverted. To mitigate problems for users upgrading from 0.9.6, this method issues a warning.
To ignore this warning you can add the following lines to your code:

    import warnings
    warnings.filterwarnings('ignore', message='subset and superset warning')

And for pytest users, you can add the following to the beginning of a test script:

    import pytest
    pytestmark = pytest.mark.filterwarnings('ignore:subset and superset warning')

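Because the direction of subset() and superset() is easy to confuse, the two relationships (as defined since version 0.10.0) can be checked with Python's built-in set operators. This sketch uses plain sets only and does not call datatest; the variable names are illustrative.

```python
data = ['A', 'B', 'C']
requirement = {'A', 'B', 'C', 'D'}

# subset: every element of data is also a member of requirement.
is_subset = set(data) <= set(requirement)

# superset: the members of data include all elements of requirement.
is_superset = set(data) >= set(requirement)

# Elements that would break each relationship.
not_in_requirement = set(data) - set(requirement)  # would violate subset
not_in_data = set(requirement) - set(data)         # would violate superset

print(is_subset, is_superset, not_in_requirement, not_in_data)
```

Here the data is a subset of the requirement but not a superset, because 'D' is required and absent from data.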
-
unique(data, msg=None)
Require that elements in data are unique:

    from datatest import validate

    data = [1, 2, 3, ...]
    validate.unique(data)

-
order(data, requirement, msg=None)
Check that elements in data match the relative order of elements in requirement:

    from datatest import validate

    data = ['A', 'C', 'D', 'E', 'F', ...]
    required_order = ['A', 'B', 'C', 'D', 'E', ...]
    validate.order(data, required_order)

If elements do not match the required order, Missing and Extra differences are raised. Each difference will contain a two-tuple whose first value is the index of the position in data where the difference occurs and whose second value is the non-matching element itself.
In the given example, data is missing 'B' at index 1 and contains an extra 'F' at index 4:

\[\begin{split}\begin{array}{cc}
\begin{array}{r} \textrm{data:} \\ \textrm{requirement:} \end{array} &
\begin{array}{c}
\begin{array}{cc}
 & extra \\
 & \downarrow \\
\begin{array}{ccc} \textbf{A} & \textbf{C} & \textbf{D} \end{array} & \begin{array}{ccc} \textbf{E} & \textbf{F} & ... \end{array} \\
\begin{array}{ccc} \textbf{A} & \textbf{B} & \textbf{C} \end{array} & \begin{array}{ccc} \textbf{D} & \textbf{E} & ... \end{array} \\
\uparrow & \\
missing & \\
\end{array}
\end{array}
\end{array}\end{split}\]

The validation fails with the following error:

    ValidationError: does not match required order (2 differences): [
        Missing((1, 'B')),
        Extra((4, 'F')),
    ]

Notice there are no differences for 'C', 'D', and 'E' because their relative order matches the requirement, even though their index positions are different.
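One way to arrive at comparable (index, element) pairs is to align the two sequences with difflib. The sketch below is an assumption about how such a comparison could be done, not a description of datatest's own algorithm; the helper name order_differences is hypothetical, the differences are represented as plain tuples rather than Missing and Extra objects, and the elided lists from the example are cut down to five concrete elements.

```python
from difflib import SequenceMatcher


def order_differences(data, requirement):
    """Hypothetical helper: return (kind, (index, element)) tuples."""
    differences = []
    matcher = SequenceMatcher(None, data, requirement, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ('delete', 'replace'):
            # Elements found in data that the requirement does not expect here.
            differences.extend(('Extra', (i, data[i])) for i in range(i1, i2))
        if tag in ('insert', 'replace'):
            # Required elements absent from data, reported at the data index
            # where they would have to appear.
            differences.extend(('Missing', (i1, requirement[j])) for j in range(j1, j2))
    return differences


data = ['A', 'C', 'D', 'E', 'F']
required_order = ['A', 'B', 'C', 'D', 'E']
print(order_differences(data, required_order))
```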
Note
Calling validate() or its methods will either raise an exception or pass without error. To get an explicit True/False return value, use the valid() function instead.
-
datatest.valid(data, requirement)
Return True if data satisfies requirement, otherwise return False.
See validate() for supported data and requirement values and detailed validation behavior.
-
exception datatest.ValidationError(differences, description=None)
This exception is raised when data validation fails.
-
differences
A collection of “difference” objects to describe elements in the data under test that do not satisfy the requirement.
-
description
An optional description of the failed requirement.
Differences
-
class datatest.BaseDifference
The base class for “difference” objects. All other difference classes are derived from this base.
-
class datatest.Missing(value)
Created when value is missing from the data under test.
In the following example, the required value 'A' is missing from the data under test:

    data = ['B', 'C']
    requirement = {'A', 'B', 'C'}
    datatest.validate(data, requirement)

Running this example raises the following error:

    ValidationError: does not satisfy set membership (1 difference): [
        Missing('A'),
    ]

-
class datatest.Extra(value)
Created when value is unexpectedly found in the data under test.
In the following example, the value 'C' is found in the data under test but it’s not part of the required values:

    data = ['A', 'B', 'C']
    requirement = {'A', 'B'}
    datatest.validate(data, requirement)

Running this example raises the following error:

    ValidationError: does not satisfy set membership (1 difference): [
        Extra('C'),
    ]

-
class datatest.Invalid(invalid, expected=<no value>)
Created when a value does not satisfy a function, equality, or regular expression requirement.
In the following example, the value 9 does not satisfy the required function:

    data = [2, 4, 6, 9]

    def is_even(x):
        return x % 2 == 0

    datatest.validate(data, is_even)

Running this example raises the following error:

    ValidationError: does not satisfy is_even() (1 difference): [
        Invalid(9),
    ]

-
invalid
The invalid value under test.
-
expected
The expected value (optional).
-
class datatest.Deviation(deviation, expected)
Created when a quantitative value deviates from its expected value.
In the following example, the dictionary item 'C': 33 does not satisfy the required item 'C': 30:

    data = {'A': 10, 'B': 20, 'C': 33}
    requirement = {'A': 10, 'B': 20, 'C': 30}
    datatest.validate(data, requirement)

Running this example raises the following error:

    ValidationError: does not satisfy mapping requirement (1 difference): {
        'C': Deviation(+3, 30),
    }

-
deviation
Quantitative deviation from expected value.
-
expected
The expected value.
Acceptances
Acceptances are context managers that operate on a ValidationError’s collection of differences.
-
datatest.accepted(obj, msg=None, scope=None)
Returns a context manager that accepts differences that match obj without triggering a test failure. The given obj can be a difference class, a difference instance, or a collection of instances.
When obj is a difference class, differences are accepted if they are instances of the class. When obj is a difference instance or collection of instances, then differences are accepted if they compare as equal to one of the accepted instances.
If given, the scope can be 'element', 'group', or 'whole'. An element-wise scope will accept all differences that have a match in obj. A group-wise scope will accept one difference per match in obj per group. A whole-error scope will accept one difference per match in obj over the ValidationError as a whole.
If unspecified, scope will default to 'element' if obj is a single element and 'group' if obj is a collection of elements. If obj is a mapping, the scope is limited to the group of differences associated with a given key (which effectively treats whole-error scopes the same as group-wise scopes).

Accepted Type:
When obj is a class (Missing, Extra, Deviation, Invalid, etc.), differences are accepted if they are instances of the class.
The following example accepts all instances of the Missing class:

    from datatest import validate, accepted, Missing

    data = ['A', 'B']
    requirement = {'A', 'B', 'C'}
    with accepted(Missing):
        validate(data, requirement)

Without this acceptance, the validation would have failed with the following error:

    ValidationError: does not satisfy set membership (1 difference): [
        Missing('C'),
    ]

Accepted Difference:
When obj is an instance, differences are accepted if they match the instance exactly.
The following example accepts all differences that match Extra('D'):

    from datatest import validate, accepted, Extra

    data = ['A', 'B', 'C', 'D']
    requirement = {'A', 'B', 'C'}
    with accepted(Extra('D')):
        validate(data, requirement)

Without this acceptance, the validation would have failed with the following error:

    ValidationError: does not satisfy set membership (1 difference): [
        Extra('D'),
    ]

Accepted Collection:
When obj is a collection of difference instances, then an error’s differences are accepted if they match an instance in the given collection:

    from datatest import validate, accepted, Missing, Extra

    data = ['x', 'y', 'q']
    requirement = {'x', 'y', 'z'}
    known_issues = accepted([
        Extra('q'),
        Missing('z'),
    ])
    with known_issues:
        validate(data, requirement)

A dictionary of acceptances can accept groups of differences by matching key:

    from datatest import validate, accepted, Missing, Extra

    data = {
        'A': ['x', 'y', 'q'],
        'B': ['x', 'y'],
    }
    requirement = {'x', 'y', 'z'}
    known_issues = accepted({
        'A': [Extra('q'), Missing('z')],
        'B': [Missing('z')],
    })
    with known_issues:
        validate(data, requirement)

-
keys(predicate, msg=None)
Returns a context manager that accepts differences whose associated keys satisfy the given predicate (see Predicates for details).
The following example accepts differences associated with the key 'B':

    from datatest import validate, accepted

    data = {'A': 'x', 'B': 'y'}
    requirement = 'x'
    with accepted.keys('B'):
        validate(data, requirement)

Without this acceptance, the validation would have failed with the following error:

    ValidationError: does not satisfy 'x' (1 difference): {
        'B': Invalid('y'),
    }

-
args(predicate, msg=None)
Returns a context manager that accepts differences whose args satisfy the given predicate (see Predicates for details).
The example below accepts differences that contain the value 'y':

    from datatest import validate, accepted

    data = {'A': 'x', 'B': 'y'}
    requirement = 'x'
    with accepted.args('y'):
        validate(data, requirement)

Without this acceptance, the validation would have failed with the following error:

    ValidationError: does not satisfy 'x' (1 difference): {
        'B': Invalid('y'),
    }

-
tolerance(tolerance, /, msg=None)
tolerance(lower, upper, msg=None)
Accepts quantitative differences within a given tolerance without triggering a test failure:

    from datatest import validate, accepted

    data = {'A': 45, 'B': 205}
    requirement = {'A': 50, 'B': 200}
    with accepted.tolerance(5):
        validate(data, requirement)

The example above accepts differences within a tolerance of ±5. Without this acceptance, the validation would have failed with the following error:

    ValidationError: does not satisfy mapping requirements (2 differences): {
        'A': Deviation(-5, 50),
        'B': Deviation(+5, 200),
    }

Specifying different lower and upper bounds:

    with accepted.tolerance(-2, 7):  # <- tolerance from -2 to +7
        validate(..., ...)

Deviations within the given range are suppressed while those outside the range will trigger a test failure.
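The arithmetic behind the example can be reproduced without datatest. In this sketch the helper name within_tolerance is hypothetical; it treats a single argument as a symmetric ±tolerance and treats the bounds as inclusive, which is consistent with the ±5 example above accepting deviations of exactly -5 and +5.

```python
def within_tolerance(deviation, lower, upper=None):
    """Hypothetical helper: one bound means a symmetric +/- tolerance."""
    if upper is None:
        lower, upper = -abs(lower), abs(lower)
    return lower <= deviation <= upper


data = {'A': 45, 'B': 205}
requirement = {'A': 50, 'B': 200}

# Deviation is the observed value minus the expected value.
deviations = {key: data[key] - requirement[key] for key in data}
accepted_keys = [key for key, dev in deviations.items() if within_tolerance(dev, 5)]
print(deviations, accepted_keys)
```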
-
percent(tolerance, /, msg=None)
percent(lower, upper, msg=None)
Accepts percentages of error within a given tolerance without triggering a test failure:

    from datatest import validate, accepted

    data = {'A': 47, 'B': 318}
    requirement = {'A': 50, 'B': 300}
    with accepted.percent(0.06):
        validate(data, requirement)

The example above accepts differences within a tolerance of ±6%. Without this acceptance, the validation would have failed with the following error:

    ValidationError: does not satisfy mapping requirements (2 differences): {
        'A': Deviation(-3, 50),
        'B': Deviation(+18, 300),
    }

Specifying different lower and upper bounds:

    with accepted.percent(-0.02, 0.01):  # <- tolerance from -2% to +1%
        validate(..., ...)

Deviations within the given range are suppressed while those outside the range will trigger a test failure.
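The ±6% figure follows from dividing each deviation by its expected value. A small sketch of that calculation, assuming a hypothetical helper percent_error (an expected value of zero is not handled here):

```python
def percent_error(value, expected):
    """Hypothetical helper: deviation expressed as a fraction of expected."""
    return (value - expected) / expected


data = {'A': 47, 'B': 318}
requirement = {'A': 50, 'B': 300}

errors = {key: percent_error(data[key], requirement[key]) for key in data}
# Check each error against a symmetric 6% tolerance (bounds inclusive).
within_six_percent = {key: -0.06 <= err <= 0.06 for key, err in errors.items()}
print(errors, within_six_percent)
```

Item 'A' deviates by -3 out of 50 and item 'B' by +18 out of 300, so both sit exactly at the 6% boundary.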
-
fuzzy(cutoff=0.6, msg=None)
Returns a context manager that accepts invalid strings that match their expected value with a similarity greater than or equal to cutoff (default 0.6). Similarity measures are determined using SequenceMatcher.ratio() from the Standard Library’s difflib module. The values range from 1.0 (exactly the same) to 0.0 (completely different).
The following example accepts string differences that match with a ratio of 0.6 or greater:

    from datatest import validate, accepted

    data = {'A': 'aax', 'B': 'bbx'}
    requirement = {'A': 'aaa', 'B': 'bbb'}
    with accepted.fuzzy(cutoff=0.6):
        validate(data, requirement)

Without this acceptance, the validation would have failed with the following error:

    ValidationError: does not satisfy mapping requirements (2 differences): {
        'A': Invalid('aax', expected='aaa'),
        'B': Invalid('bbx', expected='bbb'),
    }

-
count(number, msg=None, scope=None)
Returns a context manager that accepts up to a given number of differences without triggering a test failure. If the count of differences exceeds the given number, the test case will fail with a ValidationError containing the remaining differences.
The following example accepts up to 2 differences:

    from datatest import validate, accepted

    data = ['A', 'B', 'A', 'C']
    requirement = 'A'
    with accepted.count(2):
        validate(data, requirement)

Without this acceptance, the validation would have failed with the following error:

    ValidationError: does not satisfy 'A' (2 differences): [
        Invalid('B'),
        Invalid('C'),
    ]

Composability
Acceptances can be combined to create new acceptances with modified behavior.
The & operator can be used to create an intersection of acceptance criteria. In the following example, accepted(Missing) and accepted.count(5) are combined into a single acceptance that accepts up to five Missing differences:

    from datatest import validate, accepted, Missing

    with accepted(Missing) & accepted.count(5):
        validate(..., ...)

The | operator can be used to create a union of acceptance criteria. In the following example, accepted.tolerance() and accepted.percent() are combined into a single acceptance that accepts Deviations of ±10 as well as Deviations of ±5%:

    from datatest import validate, accepted

    with accepted.tolerance(10) | accepted.percent(0.05):
        validate(..., ...)

And composed acceptances, themselves, can be composed to define increasingly specific criteria:

    from datatest import validate, accepted, Missing

    five_missing = accepted(Missing) & accepted.count(5)
    minor_deviations = accepted.tolerance(10) | accepted.percent(0.05)
    with five_missing | minor_deviations:
        validate(..., ...)

Order of Operations
Acceptance composition uses the following order of operations, shown from highest precedence to lowest precedence. Operations with the same precedence level (appearing in the same cell) are evaluated from left to right.

Order | Operation | Description |
---|---|---|
1 | () | Parentheses |
2 | & | Bitwise AND (intersection) |
3 | \| | Bitwise OR (union) |
4 | | Element-wise acceptances |
5 | | Group-wise acceptances |
6 | | Whole-error acceptances |

Predicates
Datatest can use Predicate objects for validation and to define certain acceptances.
-
class datatest.Predicate(obj, name=None)
A Predicate is used like a function of one argument that returns True when applied to a matching value and False when applied to a non-matching value. The criteria for matching are determined by the obj type used to define the predicate:

obj type | matches when |
---|---|
function | the result of function(value) tests as True |
type | value is an instance of the type |
re.compile(pattern) | value matches the regular expression pattern |
True | value is truthy (bool(value) returns True) |
False | value is falsy (bool(value) returns False) |
str or non-container | value is equal to the object |
set | value is a member of the set |
tuple of predicates | tuple of values satisfies corresponding tuple of predicates, each according to their type |
... (Ellipsis literal) | (used as a wildcard, matches any value) |

Example matches:

obj example | value | matches |
---|---|---|
def is_even(x): return x % 2 == 0 | 4 | Yes |
 | 9 | No |
float | 1.0 | Yes |
 | 1 | No |
re.compile('[bc]ake') | 'bake' | Yes |
 | 'cake' | Yes |
 | 'fake' | No |
True | 'x' | Yes |
 | '' | No |
False | '' | Yes |
 | 'x' | No |
'foo' | 'foo' | Yes |
 | 'bar' | No |
{'A', 'B'} | 'A' | Yes |
 | 'C' | No |
('A', float) | ('A', 1.0) | Yes |
 | ('A', 2) | No |
('A', ...) Uses ellipsis wildcard. | ('A', 'X') | Yes |
 | ('A', 'Y') | Yes |
 | ('B', 'X') | No |

Example code:

    >>> pred = Predicate({'A', 'B'})
    >>> pred('A')
    True
    >>> pred('C')
    False

Predicate matching behavior can also be inverted with the inversion operator (~). Inverted Predicates return False when applied to a matching value and True when applied to a non-matching value:

    >>> pred = ~Predicate({'A', 'B'})
    >>> pred('A')
    False
    >>> pred('C')
    True

If the name argument is given, a __name__ attribute is defined using the given value:

    >>> pred = Predicate({'A', 'B'}, name='a_or_b')
    >>> pred.__name__
    'a_or_b'

If the name argument is omitted, the object will not have a __name__ attribute:

    >>> pred = Predicate({'A', 'B'})
    >>> pred.__name__
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
        pred.__name__
    AttributeError: 'Predicate' object has no attribute '__name__'

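The matching rules in the tables above can be approximated by a small dispatch function. This is a sketch of the documented rules in plain Python, not datatest's Predicate implementation; the function name matches is hypothetical, and using search() for regular expressions is an assumption.

```python
import re

PATTERN_TYPE = type(re.compile(''))


def matches(obj, value):
    """Hypothetical helper approximating the documented Predicate rules."""
    if obj is ...:
        return True                      # Ellipsis acts as a wildcard
    if obj is True:
        return bool(value)               # value is truthy
    if obj is False:
        return not bool(value)           # value is falsy
    if isinstance(obj, type):
        return isinstance(value, obj)    # value is an instance of the type
    if isinstance(obj, PATTERN_TYPE):
        # Assumption: search() is used rather than match() or fullmatch().
        return obj.search(value) is not None
    if isinstance(obj, (set, frozenset)):
        return value in obj              # set membership
    if isinstance(obj, tuple):
        # Each position is checked according to its own predicate type.
        return (isinstance(value, tuple)
                and len(obj) == len(value)
                and all(matches(o, v) for o, v in zip(obj, value)))
    if callable(obj):
        return bool(obj(value))          # function result tests as True
    return obj == value                  # str or non-container: equality


def is_even(x):
    return x % 2 == 0


print(matches(('A', ...), ('A', 'X')), matches(('A', float), ('A', 2)))
```

Types are tested before generic callables because a class such as float is itself callable.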