A Tour of Datatest

This document introduces datatest’s support for validation, error reporting, and acceptance declarations.

Validation

The validation process works by comparing some data to a given requirement. If the requirement is satisfied, the data is considered valid. But if the requirement is not satisfied, a ValidationError is raised.

The validate() function checks that the data under test satisfies a given requirement:

from datatest import validate

data = ...
requirement = ...
validate(data, requirement)

Smart Comparisons

The validate() function implements smart comparisons and will use different validation methods for different requirement types.

For example, when requirement is a set, validation checks that elements in data are members of that set:

from datatest import validate

data = ['A', 'B', 'A']
requirement = {'A', 'B'}
validate(data, requirement)

When requirement is a function, validation checks that the function returns True when applied to each element in data:

from datatest import validate

data = [2, 4, 6, 8]

def is_even(x):
    return x % 2 == 0

validate(data, requirement=is_even)

When requirement is a type, validation checks that the elements in data are instances of that type:

from datatest import validate

data = [2, 4, 6, 8]
requirement = int
validate(data, requirement)

And when requirement is a tuple, validation checks for tuple elements in data using multiple methods at the same time—one method for each item in the required tuple:

from datatest import validate

data = [('a', 2), ('b', 4), ('c', 6)]

def is_even(x):
    return x % 2 == 0

requirement = (str, is_even)
validate(data, requirement)
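
As a rough mental model of the four cases above, the sketch below picks a check based on the requirement's type. The names check_element and find_failures are hypothetical, and this is not datatest's implementation: it only returns the failing elements, and for sets it ignores the reverse check that reports Missing elements.

```python
def check_element(element, requirement):
    """Return True if one element satisfies the requirement (simplified)."""
    if isinstance(requirement, (set, frozenset)):
        return element in requirement               # set: membership test
    if isinstance(requirement, type):
        return isinstance(element, requirement)     # type: instance check
    if isinstance(requirement, tuple):
        # tuple: one check per position; lengths must match
        return (isinstance(element, tuple)
                and len(element) == len(requirement)
                and all(check_element(e, r)
                        for e, r in zip(element, requirement)))
    if callable(requirement):
        return bool(requirement(element))           # function: predicate
    return element == requirement                   # fallback: equality


def find_failures(data, requirement):
    """Collect the elements of data that do not satisfy the requirement."""
    return [e for e in data if not check_element(e, requirement)]


def is_even(x):
    return x % 2 == 0


print(find_failures(['A', 'B', 'A'], {'A', 'B'}))                       # []
print(find_failures([2, 4, 6, 8], is_even))                             # []
print(find_failures([('a', 2), (1.25, 8), ('e', 9)], (str, is_even)))   # [(1.25, 8), ('e', 9)]
```

The type check comes before the callable check because Python classes are themselves callable.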

In addition to the examples above, several other validation behaviors are available. For a complete listing with detailed examples, see Validation.

Automatic Data Handling

Along with the smart comparison behavior, validation can apply a given requirement to data objects of different formats.

The following example performs type-checking to see whether elements are int values. The same requirement (requirement = int) works for all of the different data formats:

An individual element:

from datatest import validate

data = 42
requirement = int  # <- Same for all formats.
validate(data, requirement)

A data value is treated as a single element if it’s a string, tuple, or non-iterable object.

Of course, not all formats are comparable. When requirement is itself a mapping, there’s no clear way to handle validation if data is a single element or a non-mapping container. In cases like this, validation raises an error before the data elements can be checked.
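
The single-element rule above can be illustrated with a short sketch. The helper names iter_elements and all_ints are hypothetical, and the code is a simplification rather than datatest's own logic: it treats every mapping value as one element and does not model the error raised for incomparable formats.

```python
from collections.abc import Iterable, Mapping


def iter_elements(data):
    """Yield (key, element) pairs; key is None for non-mapping data."""
    if isinstance(data, Mapping):
        yield from data.items()                  # mapping: one element per key
    elif isinstance(data, (str, tuple)) or not isinstance(data, Iterable):
        yield (None, data)                       # single element
    else:
        for element in data:                     # any other iterable
            yield (None, element)


def all_ints(data):
    """Apply the same requirement (int) to any supported data format."""
    return all(isinstance(element, int) for _, element in iter_elements(data))


print(all_ints(42))                  # True
print(all_ints([2, 4, 6, 8]))        # True
print(all_ints({'A': 1, 'B': 2}))    # True
print(all_ints('42'))                # False
```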

In addition to built-in generic types, Datatest also provides automatic handling for several third-party data types.

Datatest can work with pandas DataFrame, Series, Index, and MultiIndex objects:

import pandas as pd
import datatest as dt

df = pd.DataFrame([('x', 1, 12.25),
                   ('y', 2, 33.75),
                   ('z', 3, 101.5)],
                  columns=['A', 'B', 'C'])

dt.validate(df[['A', 'B']], (str, int))

Errors

When validation fails, a ValidationError is raised. A ValidationError contains a collection of difference objects—one difference for each element in data that fails to satisfy the requirement.

Difference objects can be one of four types: Missing, Extra, Deviation or Invalid.

“Missing” Differences

In this example, we check that the list ['A', 'B'] contains members of the set {'A', 'B', 'C', 'D'}:

from datatest import validate

data = ['A', 'B']
requirement = {'A', 'B', 'C', 'D'}
validate(data, requirement)

This fails because the elements 'C' and 'D' are not present in data. They appear below as Missing differences:

Traceback (most recent call last):
  File "example.py", line 5, in <module>
    validate(data, requirement)
datatest.ValidationError: does not satisfy set membership (2 differences): [
    Missing('C'),
    Missing('D'),
]

“Extra” Differences

In this next example, we will reverse the previous situation by checking that elements in the list ['A', 'B', 'C', 'D'] are members of the set {'A', 'B'}:

from datatest import validate

data = ['A', 'B', 'C', 'D']
requirement = {'A', 'B'}
validate(data, requirement)

Of course, this validation fails because the elements 'C' and 'D' are not members of the requirement set. They appear below as Extra differences:

Traceback (most recent call last):
  File "example.py", line 5, in <module>
    validate(data, requirement)
datatest.ValidationError: does not satisfy set membership (2 differences): [
    Extra('C'),
    Extra('D'),
]
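
The Missing and Extra results in the two examples above correspond to the two directions of a set difference. The sketch below reproduces them with plain Python sets; set_differences is a hypothetical helper that returns sorted lists, not datatest difference objects.

```python
def set_differences(data, requirement):
    """Return (missing, extra) values for a set-membership check."""
    observed = set(data)
    missing = sorted(requirement - observed)   # required but not in data
    extra = sorted(observed - requirement)     # in data but not required
    return missing, extra


print(set_differences(['A', 'B'], {'A', 'B', 'C', 'D'}))   # (['C', 'D'], [])
print(set_differences(['A', 'B', 'C', 'D'], {'A', 'B'}))   # ([], ['C', 'D'])
```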

“Invalid” Differences

In this next example, the requirement is a tuple, (str, is_even). It checks for tuple elements where the first value is a string and the second value is an even number:

from datatest import validate

data = [('a', 2), ('b', 4), ('c', 6), (1.25, 8), ('e', 9)]

def is_even(x):
    return x % 2 == 0

requirement = (str, is_even)
validate(data, requirement)

Two of the elements in data fail to satisfy the requirement: (1.25, 8) fails because 1.25 is not a string and ('e', 9) fails because 9 is not an even number. These are represented in the error as Invalid differences:

Traceback (most recent call last):
  File "example.py", line 9, in <module>
    validate(data, requirement)
datatest.ValidationError: does not satisfy `(str, is_even())` (2 differences): [
    Invalid((1.25, 8)),
    Invalid(('e', 9)),
]

“Deviation” Differences

In the following example, the requirement is a dictionary of numbers. The data elements are checked against requirement elements of the same key:

from datatest import validate

data = {
    'A': 100,
    'B': 200,
    'C': 299,
    'D': 405,
}

requirement = {
    'A': 100,
    'B': 200,
    'C': 300,
    'D': 400,
}

validate(data, requirement)

This validation fails because some of the values don’t match (C: 299 ≠ 300 and D: 405 ≠ 400). Failed quantitative comparisons raise Deviation differences:

Traceback (most recent call last):
  File "example.py", line 17, in <module>
    validate(data, requirement)
datatest.ValidationError: does not satisfy mapping requirements (2 differences): {
    'C': Deviation(-1, 300),
    'D': Deviation(+5, 400),
}
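
The two numbers in each Deviation above are the signed difference (observed minus expected) and the expected value. The arithmetic can be checked with a small sketch; the deviations helper is hypothetical, returns plain tuples rather than Deviation objects, and assumes both dictionaries share the same keys.

```python
data = {'A': 100, 'B': 200, 'C': 299, 'D': 405}
requirement = {'A': 100, 'B': 200, 'C': 300, 'D': 400}


def deviations(data, requirement):
    """Map each mismatched key to (observed - expected, expected)."""
    result = {}
    for key, expected in requirement.items():
        observed = data[key]          # assumes the key exists in data
        if observed != expected:
            result[key] = (observed - expected, expected)
    return result


print(deviations(data, requirement))   # {'C': (-1, 300), 'D': (5, 400)}
```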

Acceptances

Sometimes a failing test cannot be addressed by changing the data itself. Perhaps two equally-authoritative sources disagree, perhaps it’s important to keep the original data unchanged, or perhaps a lack of information makes correction impossible. For cases like these, datatest can accept certain discrepancies when users judge that doing so is appropriate.

The accepted() function returns a context manager that operates on a ValidationError’s collection of differences.

Accepted Type

Without an acceptance, the following validation would fail because the values 'C' and 'D' are not members of the set (see below). But if we decide that Extra differences are acceptable, we can use accepted(Extra):

from datatest import (
    validate,
    accepted,
    Extra,
)

data = ['A', 'B', 'C', 'D']
requirement = {'A', 'B'}
with accepted(Extra):
    validate(data, requirement)

Using the acceptance, we suppress the error caused by all of the Extra differences. But without the acceptance, the ValidationError is raised.

Accepted Instance

If we want more precision, we can accept a specific difference rather than all differences of a given type. For example, if the difference Extra('C') is acceptable, we can use accepted(Extra('C')):

from datatest import (
    validate,
    accepted,
    Extra,
)

data = ['A', 'B', 'C', 'D']
requirement = {'A', 'B'}
with accepted(Extra('C')):
    validate(data, requirement)

Traceback (most recent call last):
  File "example.py", line 10, in <module>
    validate(data, requirement)
datatest.ValidationError: does not satisfy set membership (1 difference): [
    Extra('D'),
]

This acceptance suppresses the extra 'C' but does not address the extra 'D' so the ValidationError is still raised. This remaining error can be addressed by correcting the data, modifying the requirement, or altering the acceptance.
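
To make the filtering behavior concrete, here is a minimal sketch of a context manager that removes accepted differences and re-raises only when some remain. SketchError, accept_if, and the tuple representation of differences are all hypothetical stand-ins; this is not how datatest implements accepted().

```python
from contextlib import contextmanager


class SketchError(AssertionError):
    """Hypothetical stand-in for ValidationError; holds a list of differences."""

    def __init__(self, differences):
        self.differences = list(differences)
        super().__init__(f'{len(self.differences)} differences')


@contextmanager
def accept_if(predicate):
    """Suppress every difference for which predicate(difference) is true."""
    try:
        yield
    except SketchError as err:
        remaining = [d for d in err.differences if not predicate(d)]
        if remaining:
            raise SketchError(remaining) from None   # still failing
        # nothing remains: the error is swallowed


def check():
    # Stands in for a failing validate() call.
    raise SketchError([('Extra', 'C'), ('Extra', 'D')])


# Accept by kind: both differences are accepted, so nothing is raised.
with accept_if(lambda d: d[0] == 'Extra'):
    check()

# Accept one instance: ('Extra', 'D') remains and is re-raised.
try:
    with accept_if(lambda d: d == ('Extra', 'C')):
        check()
except SketchError as err:
    leftover = err.differences

print(leftover)   # [('Extra', 'D')]
```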

Accepted Container of Instances

We can also accept multiple specific differences by defining a container of difference objects. To build on the previous example, we can use accepted([Extra('C'), Extra('D')]) to accept the two differences explicitly:

from datatest import (
    validate,
    accepted,
    Extra,
)

data = ['A', 'B', 'C', 'D']
requirement = {'A', 'B'}
with accepted([Extra('C'), Extra('D')]):
    validate(data, requirement)

Accepted Tolerance

When comparing quantitative values, you may decide that deviations of a certain magnitude are acceptable. Calling accepted.tolerance(5) returns a context manager that accepts differences within a tolerance of plus-or-minus five without triggering a test failure:

from datatest import validate
from datatest import accepted

data = {
    'A': 100,
    'B': 200,
    'C': 299,
    'D': 405,
}
requirement = {
    'A': 100,
    'B': 200,
    'C': 300,
    'D': 400,
}
with accepted.tolerance(5):  # accepts ±5
    validate(data, requirement)
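
The effect of the tolerance on this data can be worked through by hand. The sketch below uses a hypothetical unaccepted helper and plain integers in place of Deviation objects; it treats the bound as inclusive, which matches the example above where a deviation of +5 is accepted by a tolerance of 5.

```python
# Deviations from the example above: observed minus expected.
deviations = {'C': 299 - 300, 'D': 405 - 400}


def unaccepted(deviations, tolerance):
    """Keep only the deviations whose magnitude exceeds the tolerance."""
    return {key: dev for key, dev in deviations.items()
            if abs(dev) > tolerance}


print(unaccepted(deviations, 5))   # {}
print(unaccepted(deviations, 4))   # {'D': 5}
```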

Other Acceptances

In addition to the previous examples, there are other acceptances available for specific cases, such as accepted.keys(), accepted.args(), and accepted.percent(). For a list of all possible acceptances, see Acceptances.

Combining Acceptances

Acceptances can also be combined using the operators & and | to define more complex criteria:

from datatest import (
    validate,
    accepted,
    Missing,
)

# Accept up to five missing differences.
with accepted(Missing) & accepted.count(5):
    validate(..., ...)

# Accept differences of ±10 or ±5%.
with accepted.tolerance(10) | accepted.percent(0.05):
    validate(..., ...)
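
The "±10 or ±5%" union can be read as a per-difference test that passes when either criterion passes. The following sketch illustrates only that reading; the function names are hypothetical, the bounds are assumed to be inclusive, and the & combination with accepted.count() is not modeled here.

```python
def within_tolerance(limit):
    """Accept when the absolute deviation is at most `limit`."""
    return lambda deviation, expected: abs(deviation) <= limit


def within_percent(fraction):
    """Accept when the deviation is at most `fraction` of the expected value."""
    return lambda deviation, expected: abs(deviation) <= abs(expected) * fraction


def either(first, second):
    """Union of two criteria: accept when at least one of them accepts."""
    return lambda deviation, expected: (first(deviation, expected)
                                        or second(deviation, expected))


accept = either(within_tolerance(10), within_percent(0.05))

print(accept(12, 400))   # True: 12 exceeds 10 but is within 5% of 400
print(accept(12, 100))   # False: 12 exceeds both 10 and 5% of 100
print(accept(-8, 100))   # True: magnitude 8 is within the tolerance of 10
```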

To learn more about these features, see Composability and Order of Operations.

Data Handling Tools

Working Directory

You can use working_directory (a context manager and decorator) to ensure that relative file paths behave consistently:

import pandas as pd
from datatest import working_directory

with working_directory(__file__):
    my_df = pd.read_csv('myfile.csv')
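
For intuition, a context manager with a similar effect can be written with the standard library alone. This is a sketch under the assumption that the path given is a file whose parent directory should become the working directory; cd_to_directory_of is a hypothetical name and not part of datatest.

```python
import os
from contextlib import contextmanager


@contextmanager
def cd_to_directory_of(path):
    """Temporarily switch to the directory that contains `path`."""
    target = os.path.dirname(os.path.abspath(path))
    original = os.getcwd()
    os.chdir(target)
    try:
        yield
    finally:
        os.chdir(original)   # always restore, even if the body raises


# Usage inside a script (relative paths resolve next to the script):
#
#     with cd_to_directory_of(__file__):
#         with open('myfile.csv') as f:
#             ...
```

Restoring the original directory in a finally clause keeps later code unaffected when the body raises.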

Repeating Container

You can use a RepeatingContainer to operate on multiple objects at the same time rather than duplicating the same operation for each object:

import pandas as pd
from datatest import RepeatingContainer

repeating = RepeatingContainer([
    pd.read_csv('file1.csv'),
    pd.read_csv('file2.csv'),
])

counted1, counted2 = repeating['C'].count()

filled1, filled2 = repeating.bfill()  # backfill; fillna(method=...) is deprecated in recent pandas

summed1, summed2 = repeating[['A', 'C']].groupby('A').sum()

In the three statements above, operations are performed on multiple pandas DataFrames using single lines of code. Results are then unpacked into individual variable names. Without a RepeatingContainer, each operation would have to be written out once per DataFrame.
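
The forwarding idea can be sketched in a few lines of plain Python. The Repeating class below is a hypothetical, heavily simplified stand-in that forwards attribute access, indexing, and calls to each contained object; the real RepeatingContainer may differ in its details. Dictionaries and lists are used here in place of DataFrames so the example runs without pandas.

```python
class Repeating:
    """Minimal stand-in that repeats each operation on every object."""

    def __init__(self, objects):
        self._objects = tuple(objects)

    def __iter__(self):
        return iter(self._objects)          # enables tuple unpacking

    def __getattr__(self, name):
        # Called only for names not found on Repeating itself.
        return Repeating(getattr(obj, name) for obj in self._objects)

    def __getitem__(self, key):
        return Repeating(obj[key] for obj in self._objects)

    def __call__(self, *args, **kwargs):
        return Repeating(obj(*args, **kwargs) for obj in self._objects)


rows = Repeating([
    {'A': 'x', 'C': [1, 2, 3]},
    {'A': 'y', 'C': [4, 5]},
])

count1, count2 = rows['C'].count(2)      # list.count(2) on each 'C' list
upper1, upper2 = rows['A'].upper()       # str.upper() on each 'A' value

print(count1, count2, upper1, upper2)    # 1 0 X Y
```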