A Tour of Datatest
This document introduces datatest’s support for validation, error reporting, and acceptance declarations.
Validation
The validation process works by comparing some data to a given
requirement. If the requirement is satisfied, the data is considered
valid. But if the requirement is not satisfied, a ValidationError
is raised.
The validate() function checks that the data under test satisfies a given requirement:
from datatest import validate

data = ...
requirement = ...
validate(data, requirement)
Smart Comparisons
The validate() function implements smart comparisons and will use different validation methods for different requirement types.
For example, when requirement is a set, validation checks that elements in data are members of that set:
from datatest import validate

data = ['A', 'B', 'A']
requirement = {'A', 'B'}
validate(data, requirement)
When requirement is a function, validation checks that the function returns True when applied to each element in data:
from datatest import validate

data = [2, 4, 6, 8]

def is_even(x):
    return x % 2 == 0

validate(data, requirement=is_even)
When requirement is a type, validation checks that the elements in data are instances of that type:
from datatest import validate

data = [2, 4, 6, 8]
requirement = int
validate(data, requirement)
And when requirement is a tuple, validation checks for tuple elements in data using multiple methods at the same time, one method for each item in the required tuple:
from datatest import validate

data = [('a', 2), ('b', 4), ('c', 6)]

def is_even(x):
    return x % 2 == 0

requirement = (str, is_even)
validate(data, requirement)
In addition to the examples above, several other validation behaviors are available. For a complete listing with detailed examples, see Validation.
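The dispatch described above can be pictured with a small plain-Python sketch. This is only an illustration of the idea, not datatest's implementation: the helper name satisfies() is hypothetical, it checks one element at a time, and it returns a boolean instead of collecting differences.

```python
def satisfies(value, requirement):
    """Return True if a single value meets the requirement.

    Simplified illustration only; the real library supports more
    requirement types and reports differences instead of booleans.
    """
    if isinstance(requirement, (set, frozenset)):
        return value in requirement              # set -> membership test
    if isinstance(requirement, type):
        return isinstance(value, requirement)    # type -> isinstance check
    if isinstance(requirement, tuple):
        # tuple -> apply one sub-requirement per position
        return (isinstance(value, tuple)
                and len(value) == len(requirement)
                and all(satisfies(v, r) for v, r in zip(value, requirement)))
    if callable(requirement):
        return bool(requirement(value))          # function -> predicate
    return value == requirement                  # fallback: plain equality


def is_even(x):
    return x % 2 == 0


results = [
    all(satisfies(x, {'A', 'B'}) for x in ['A', 'B', 'A']),
    all(satisfies(x, is_even) for x in [2, 4, 6, 8]),
    all(satisfies(x, int) for x in [2, 4, 6, 8]),
    all(satisfies(x, (str, is_even)) for x in [('a', 2), ('b', 4), ('c', 6)]),
]
print(results)  # [True, True, True, True]
```

The type check is placed before the callable check because classes are themselves callable; testing callable() first would try to use int as a predicate.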
Automatic Data Handling
Along with the smart comparison behavior, validation can apply a given requirement to data objects of different formats.
The following examples perform type-checking to see if elements are int values. Notice that the same requirement (requirement = int) works for all of the different data formats:
An individual element:
from datatest import validate

data = 42
requirement = int  # <- Same for all formats.
validate(data, requirement)
A data value is treated as a single element if it’s a string, tuple, or non-iterable object.
A group of elements:
from datatest import validate

data = [1, 2, 3]
requirement = int  # <- Same for all formats.
validate(data, requirement)
A data value is treated as a group of elements if it’s any iterable other than a string, tuple, or mapping (in this case, a list).
A mapping of elements:
from datatest import validate

data = {'A': 1, 'B': 2, 'C': 3}
requirement = int  # <- Same for all formats.
validate(data, requirement)
When data is a mapping, its values are checked as individual elements if they are strings, tuples, non-iterable objects, or nested mappings.
A mapping of groups of elements:
from datatest import validate

data = {'X': [1, 2, 3], 'Y': [4, 5, 6], 'Z': [7, 8, 9]}
requirement = int  # <- Same for all formats.
validate(data, requirement)
A mapping’s values are treated as a group of individual elements when they are any iterable other than a string, tuple, or another nested mapping.
Of course, not all formats are comparable. When requirement is itself a mapping, there’s no clear way to handle validation if data is a single element or a non-mapping container. In cases like this, the validation process will error-out before the data elements can be checked.
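The four data formats described above follow a few simple rules, which can be restated as a short plain-Python sketch. The names is_single_element() and classify() are hypothetical helpers written for this illustration; they mirror the rules as stated in this section and are not part of datatest.

```python
from collections.abc import Iterable, Mapping


def is_single_element(obj):
    """Strings, tuples, and non-iterable objects count as one element."""
    return isinstance(obj, (str, tuple)) or not isinstance(obj, Iterable)


def classify(data):
    """Describe how the rules above would treat `data` (illustrative only)."""
    if isinstance(data, Mapping):
        values = list(data.values())
        # Nested mappings are treated as individual elements too.
        if all(is_single_element(v) or isinstance(v, Mapping) for v in values):
            return 'mapping of elements'
        # Simplification: any group-like value marks the whole mapping.
        return 'mapping of groups'
    if is_single_element(data):
        return 'element'
    return 'group'


print(classify(42))                        # element
print(classify([1, 2, 3]))                 # group
print(classify({'A': 1, 'B': 2}))          # mapping of elements
print(classify({'X': [1, 2], 'Y': [3]}))   # mapping of groups
```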
In addition to built-in generic types, Datatest also provides automatic handling for several third-party data types.
Datatest can work with pandas DataFrame, Series, Index, and MultiIndex objects:
import pandas as pd
import datatest as dt

df = pd.DataFrame([('x', 1, 12.25),
                   ('y', 2, 33.75),
                   ('z', 3, 101.5)],
                  columns=['A', 'B', 'C'])

dt.validate(df[['A', 'B']], (str, int))
For users who prefer a more tightly integrated API, Datatest provides the validate accessor for testing pandas objects:
import pandas as pd
import datatest as dt

dt.register_accessors()

df = pd.DataFrame([('x', 1, 12.25),
                   ('y', 2, 33.75),
                   ('z', 3, 101.5)],
                  columns=['A', 'B', 'C'])

df[['A', 'B']].validate((str, int))
After calling the register_accessors() function, you can use validate() as a method of your existing DataFrame, Series, Index, and MultiIndex objects.
Handling is also supported for numpy objects, including one- or two-dimensional array, recarray, and structured array objects:
import numpy as np
import datatest as dt

a = np.array([('x', 1, 12.25),
              ('y', 2, 33.75),
              ('z', 3, 101.5)],
             dtype='U10, int32, float32')

dt.validate(a[['f0', 'f1']], (str, int))
Datatest also works well with squint Select, Query, and Result objects:
from squint import Select
from datatest import validate

select = Select([('A', 'B', 'C'),
                 ('x', 1, 12.25),
                 ('y', 2, 33.75),
                 ('z', 3, 101.5)])

validate(select(('A', 'B')), (str, int))
Origins
Squint was originally part of Datatest itself; it grew out of Datatest’s old validation API. But as Datatest matured, the need for a built-in query interface stopped making sense. This simple query interface was named “Squint” and the code was moved into its own project.
Database queries can also be validated directly:
import sqlite3
from datatest import validate

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE mydata(A, B, C);
    INSERT INTO mydata VALUES('x', 1, 12.25);
    INSERT INTO mydata VALUES('y', 2, 33.75);
    INSERT INTO mydata VALUES('z', 3, 101.5);
''')
cursor = conn.cursor()

cursor.execute('SELECT A, B FROM mydata;')
validate(cursor, (str, int))
This requires a cursor object to conform to Python’s DBAPI2 specification (see PEP 249). Most of Python’s database packages support this interface.
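To see why a cursor fits the (str, int) requirement, it can help to look at what a DBAPI2 cursor actually produces. The sketch below uses only the standard library's sqlite3 module and checks each row by hand; the manual isinstance() test stands in for the validation step and is not how datatest itself is implemented.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE mydata(A, B, C);
    INSERT INTO mydata VALUES('x', 1, 12.25);
    INSERT INTO mydata VALUES('y', 2, 33.75);
    INSERT INTO mydata VALUES('z', 3, 101.5);
''')

cursor = conn.cursor()
cursor.execute('SELECT A, B FROM mydata ORDER BY A;')
rows = cursor.fetchall()  # a DBAPI2 cursor yields one tuple per row
conn.close()

# Check each row by hand against the (str, int) pattern.
all_match = all(
    isinstance(a, str) and isinstance(b, int) for a, b in rows
)
print(rows)       # [('x', 1), ('y', 2), ('z', 3)]
print(all_match)  # True
```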
Errors
When validation fails, a ValidationError is raised. A ValidationError contains a collection of difference objects, one difference for each element in data that fails to satisfy the requirement.
Difference objects can be one of four types: Missing, Extra, Deviation, or Invalid.
“Missing” Differences
In this example, we check that the list ['A', 'B'] contains members of the set {'A', 'B', 'C', 'D'}:
from datatest import validate

data = ['A', 'B']
requirement = {'A', 'B', 'C', 'D'}
validate(data, requirement)
This fails because the elements 'C' and 'D' are not present in data. They appear below as Missing differences:
Traceback (most recent call last):
  File "example.py", line 5, in <module>
    validate(data, requirement)
datatest.ValidationError: does not satisfy set membership (2 differences): [
    Missing('C'),
    Missing('D'),
]
“Extra” Differences
In this next example, we will reverse the previous situation by checking that elements in the list ['A', 'B', 'C', 'D'] are members of the set {'A', 'B'}:
from datatest import validate

data = ['A', 'B', 'C', 'D']
requirement = {'A', 'B'}
validate(data, requirement)
Of course, this validation fails because the elements 'C' and 'D' are not members of the requirement set. They appear below as Extra differences:
Traceback (most recent call last):
  File "example.py", line 5, in <module>
    validate(data, requirement)
datatest.ValidationError: does not satisfy set membership (2 differences): [
    Extra('C'),
    Extra('D'),
]
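The two set-membership failures above can be reproduced with ordinary set arithmetic. The following is a minimal sketch, assuming a hypothetical helper named set_differences(); it returns plain lists where datatest would report Missing and Extra difference objects.

```python
def set_differences(data, requirement):
    """Return (missing, extra) as sorted lists, using plain set arithmetic."""
    observed = set(data)
    missing = sorted(requirement - observed)  # required but absent from data
    extra = sorted(observed - requirement)    # present in data but not required
    return missing, extra


print(set_differences(['A', 'B'], {'A', 'B', 'C', 'D'}))  # (['C', 'D'], [])
print(set_differences(['A', 'B', 'C', 'D'], {'A', 'B'}))  # ([], ['C', 'D'])
```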
“Invalid” Differences
In this next example, the requirement is a tuple, (str, is_even). It checks for tuple elements where the first value is a string and the second value is an even number:
from datatest import validate

data = [('a', 2), ('b', 4), ('c', 6), (1.25, 8), ('e', 9)]

def is_even(x):
    return x % 2 == 0

requirement = (str, is_even)
validate(data, requirement)
Two of the elements in data fail to satisfy the requirement: (1.25, 8) fails because 1.25 is not a string and ('e', 9) fails because 9 is not an even number. These are represented in the error as Invalid differences:
Traceback (most recent call last):
  File "example.py", line 9, in <module>
    validate(data, requirement)
datatest.ValidationError: does not satisfy `(str, is_even())` (2 differences): [
    Invalid((1.25, 8)),
    Invalid(('e', 9)),
]
“Deviation” Differences
In the following example, the requirement is a dictionary of numbers. The data elements are checked against requirement elements of the same key:
from datatest import validate

data = {
    'A': 100,
    'B': 200,
    'C': 299,
    'D': 405,
}

requirement = {
    'A': 100,
    'B': 200,
    'C': 300,
    'D': 400,
}

validate(data, requirement)
This validation fails because some of the values don’t match (C: 299 ≠ 300 and D: 405 ≠ 400). Failed quantitative comparisons raise Deviation differences:
Traceback (most recent call last):
  File "example.py", line 17, in <module>
    validate(data, requirement)
datatest.ValidationError: does not satisfy mapping requirements (2 differences): {
    'C': Deviation(-1, 300),
    'D': Deviation(+5, 400),
}
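The numbers in each Deviation above are the signed offset from the expected value, followed by the expected value itself. A minimal plain-Python sketch of that arithmetic follows; the function name deviations() is hypothetical, and it assumes every required key is present in the data.

```python
def deviations(data, requirement):
    """Map each mismatched key to a (deviation, expected) pair."""
    result = {}
    for key, expected in requirement.items():
        actual = data[key]  # assumes every required key exists in data
        if actual != expected:
            result[key] = (actual - expected, expected)
    return result


data = {'A': 100, 'B': 200, 'C': 299, 'D': 405}
requirement = {'A': 100, 'B': 200, 'C': 300, 'D': 400}
print(deviations(data, requirement))  # {'C': (-1, 300), 'D': (5, 400)}
```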
Acceptances
Sometimes a failing test cannot be addressed by changing the data itself. Perhaps two equally-authoritative sources disagree, perhaps it’s important to keep the original data unchanged, or perhaps a lack of information makes correction impossible. For cases like these, datatest can accept certain discrepancies when users judge that doing so is appropriate.
The accepted() function returns a context manager that operates on a ValidationError’s collection of differences.
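The general mechanism, a context manager that intercepts an error, filters its differences, and re-raises whatever is left, can be sketched in plain Python. Everything named here (DifferencesError, accept_if, and the string "differences") is a hypothetical stand-in chosen for illustration; it is not datatest's implementation.

```python
from contextlib import contextmanager


class DifferencesError(AssertionError):
    """Hypothetical stand-in for an error that carries differences."""

    def __init__(self, differences):
        super().__init__(f'{len(differences)} difference(s)')
        self.differences = list(differences)


@contextmanager
def accept_if(predicate):
    """Suppress differences for which predicate(diff) is true."""
    try:
        yield
    except DifferencesError as err:
        remaining = [d for d in err.differences if not predicate(d)]
        if remaining:
            # Re-raise with only the differences that were not accepted.
            raise DifferencesError(remaining) from None


# Every difference is accepted, so no error escapes the block.
with accept_if(lambda d: d.startswith('extra')):
    raise DifferencesError(['extra C', 'extra D'])

# Only one difference is accepted, so the other is re-raised.
try:
    with accept_if(lambda d: d == 'extra C'):
        raise DifferencesError(['extra C', 'extra D'])
except DifferencesError as err:
    leftover = err.differences

print(leftover)  # ['extra D']
```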
Accepted Type
Without an acceptance, the following validation would fail because the values 'C' and 'D' are not members of the set (see below). But if we decide that Extra differences are acceptable, we can use accepted(Extra):
from datatest import (
    validate,
    accepted,
    Extra,
)

data = ['A', 'B', 'C', 'D']
requirement = {'A', 'B'}
with accepted(Extra):
    validate(data, requirement)
Without the acceptance, the same validation fails:
from datatest import (
    validate,
    accepted,
    Extra,
)

data = ['A', 'B', 'C', 'D']
requirement = {'A', 'B'}
validate(data, requirement)
Traceback (most recent call last):
  File "example.py", line 9, in <module>
    validate(data, requirement)
datatest.ValidationError: does not satisfy set membership (2 differences): [
    Extra('C'),
    Extra('D'),
]
Using the acceptance, we suppress the error caused by all of the Extra differences. But without the acceptance, the ValidationError is raised.
Accepted Instance
If we want more precision, we can accept a specific difference rather than all differences of a given type. For example, if the difference Extra('C') is acceptable, we can use accepted(Extra('C')):
from datatest import (
    validate,
    accepted,
    Extra,
)

data = ['A', 'B', 'C', 'D']
requirement = {'A', 'B'}
with accepted(Extra('C')):
    validate(data, requirement)
Traceback (most recent call last):
  File "example.py", line 10, in <module>
    validate(data, requirement)
datatest.ValidationError: does not satisfy set membership (1 difference): [
    Extra('D'),
]
This acceptance suppresses the extra 'C' but does not address the extra 'D', so the ValidationError is still raised. This remaining error can be addressed by correcting the data, modifying the requirement, or altering the acceptance.
Accepted Container of Instances
We can also accept multiple specific differences by defining a container of difference objects. To build on the previous example, we can use accepted([Extra('C'), Extra('D')]) to accept the two differences explicitly:
from datatest import (
    validate,
    accepted,
    Extra,
)

data = ['A', 'B', 'C', 'D']
requirement = {'A', 'B'}
with accepted([Extra('C'), Extra('D')]):
    validate(data, requirement)
Accepted Tolerance
When comparing quantitative values, you may decide that deviations of a certain magnitude are acceptable. Calling accepted.tolerance(5) returns a context manager that accepts differences within a tolerance of plus-or-minus five without triggering a test failure:
from datatest import validate
from datatest import accepted

data = {
    'A': 100,
    'B': 200,
    'C': 299,
    'D': 405,
}
requirement = {
    'A': 100,
    'B': 200,
    'C': 300,
    'D': 400,
}
with accepted.tolerance(5):  # accepts ±5
    validate(data, requirement)
Without the acceptance, the same validation fails:
from datatest import validate
from datatest import accepted

data = {
    'A': 100,
    'B': 200,
    'C': 299,
    'D': 405,
}
requirement = {
    'A': 100,
    'B': 200,
    'C': 300,
    'D': 400,
}
validate(data, requirement)
Traceback (most recent call last):
  File "example.py", line 16, in <module>
    validate(data, requirement)
datatest.ValidationError: does not satisfy mapping requirements (2 differences): {
    'C': Deviation(-1, 300),
    'D': Deviation(+5, 400),
}
Other Acceptances
In addition to the previous examples, there are other acceptances available for specific cases: accepted.keys(), accepted.args(), accepted.percent(), etc. For a list of all possible acceptances, see Acceptances.
Combining Acceptances
Acceptances can also be combined using the operators & and | to define more complex criteria:
from datatest import (
    validate,
    accepted,
    Missing,
)

# Accept up to five missing differences.
with accepted(Missing) & accepted.count(5):
    validate(..., ...)

# Accept differences of ±10 or ±5%.
with accepted.tolerance(10) | accepted.percent(0.05):
    validate(..., ...)
To learn more about these features, see Composability and Order of Operations.
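As a rough mental model, combining two criteria with | behaves like a union (either criterion may accept a difference) and & behaves like an intersection (both must accept it). The sketch below expresses that idea with plain predicate functions. All four helper names are hypothetical, and the sketch deliberately ignores group-level acceptances such as accepted.count() and the ordering rules covered in the linked reference.

```python
def within_tolerance(limit):
    """Accept deviations whose absolute size is at most `limit`."""
    return lambda deviation, expected: abs(deviation) <= limit


def within_percent(fraction):
    """Accept deviations of at most `fraction` of the expected value."""
    return lambda deviation, expected: abs(deviation) <= abs(expected) * fraction


def either(first, second):
    """Union of two criteria, loosely analogous to `|`."""
    return lambda deviation, expected: (
        first(deviation, expected) or second(deviation, expected)
    )


def both(first, second):
    """Intersection of two criteria, loosely analogous to `&`."""
    return lambda deviation, expected: (
        first(deviation, expected) and second(deviation, expected)
    )


loose = either(within_tolerance(10), within_percent(0.05))
strict = both(within_tolerance(10), within_percent(0.05))

print(loose(8, 100))    # True: within ±10
print(loose(40, 1000))  # True: within 5% of 1000
print(loose(40, 100))   # False: fails both criteria
print(strict(8, 100))   # False: within ±10 but more than 5% of 100
print(strict(4, 100))   # True: satisfies both criteria
```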
Data Handling Tools
Working Directory
You can use working_directory (a context manager and decorator) to ensure that relative file paths behave consistently:
import pandas as pd
from datatest import working_directory

with working_directory(__file__):
    my_df = pd.read_csv('myfile.csv')
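The effect of such a context manager can be approximated with the standard library alone. The following is a minimal sketch, assuming a hypothetical helper named cd_to_file_dir(); it covers only the context-manager use (not the decorator form) and is not datatest's implementation. The temporary directory and file names exist only for the demonstration.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def cd_to_file_dir(path):
    """Temporarily switch to the directory that contains `path`.

    If `path` is itself a directory, switch to it directly.
    """
    if os.path.isdir(path):
        target = path
    else:
        target = os.path.dirname(os.path.abspath(path))
    original = os.getcwd()
    os.chdir(target)
    try:
        yield
    finally:
        os.chdir(original)  # always restore, even if the block raises


with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, 'script.py')  # hypothetical script location
    with open(os.path.join(tmp, 'myfile.csv'), 'w') as f:
        f.write('A,B\n1,2\n')

    before = os.getcwd()
    with cd_to_file_dir(script):
        with open('myfile.csv') as f:  # relative path now resolves
            header = f.readline().strip()
    restored = os.getcwd() == before

print(header)    # A,B
print(restored)  # True
```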
Repeating Container
You can use a RepeatingContainer to operate on multiple objects at the same time rather than duplicating the same operation for each object:
import pandas as pd
from datatest import RepeatingContainer

repeating = RepeatingContainer([
    pd.read_csv('file1.csv'),
    pd.read_csv('file2.csv'),
])

counted1, counted2 = repeating['C'].count()
filled1, filled2 = repeating.bfill()
summed1, summed2 = repeating[['A', 'C']].groupby('A').sum()
In the three statements above, operations are performed on multiple pandas DataFrames using single lines of code. Results are then unpacked into individual variable names. Compare this example with the equivalent code below, which does not use a RepeatingContainer:
import pandas as pd

df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')

counted1 = df1['C'].count()
counted2 = df2['C'].count()
filled1 = df1.bfill()
filled2 = df2.bfill()
summed1 = df1[['A', 'C']].groupby('A').sum()
summed2 = df2[['A', 'C']].groupby('A').sum()
Without a RepeatingContainer, operations are duplicated for each individual DataFrame.
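The forwarding behavior can be illustrated with a few lines of plain Python. MiniRepeating below is a hypothetical toy class written for this sketch; it shows the general idea of relaying attribute access, indexing, and calls to every wrapped object, and it is not datatest's RepeatingContainer. Strings are used in place of DataFrames so the example has no dependencies.

```python
class MiniRepeating:
    """Forward attribute access, indexing, and calls to every object."""

    def __init__(self, objects):
        self._objects = list(objects)

    def __iter__(self):
        return iter(self._objects)  # enables tuple unpacking

    def __getattr__(self, name):
        if name.startswith('_'):
            raise AttributeError(name)  # avoid forwarding private lookups
        return MiniRepeating(getattr(obj, name) for obj in self._objects)

    def __getitem__(self, key):
        return MiniRepeating(obj[key] for obj in self._objects)

    def __call__(self, *args, **kwargs):
        return MiniRepeating(obj(*args, **kwargs) for obj in self._objects)


repeating = MiniRepeating(['foo', 'bar'])
upper1, upper2 = repeating.upper()  # same method call on each object
first1, first2 = repeating[0]       # same index on each object
print(upper1, upper2)  # FOO BAR
print(first1, first2)  # f b
```

Each operation returns another wrapper, which is what allows calls to be chained before the results are finally unpacked.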