Tips and Tricks for Data Testing

This document is intended for users who are already familiar with datatest and its features. It’s a grab-bag of ideas and patterns you can use in your own code.

Using methodcaller()

It’s common to have helper functions that simply call a method on an object. For example:

from datatest import validate

data = [...]

def is_upper(x):  # <- Helper function.
    return x.isupper()

validate(data, is_upper)

In the case above, our helper function calls the isupper() method. But instead of defining a function for this, we can simply use operator.methodcaller():

from datatest import validate
from operator import methodcaller

data = [...]
validate(data, methodcaller('isupper'))
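
methodcaller() can also forward arguments to the method it calls, which covers helpers that are slightly more involved than isupper(). The following sketch uses only the standard library; the sample values and variable names are made up for illustration:

```python
from operator import methodcaller

# Builds a callable equivalent to: lambda x: x.isupper()
is_upper = methodcaller('isupper')

# Extra positional arguments are passed along to the method, so this
# is equivalent to: lambda x: x.startswith('ID-')
has_id_prefix = methodcaller('startswith', 'ID-')

sample = ['ID-001', 'ID-002', 'id-003']  # Hypothetical sample values.
upper_results = [is_upper(x) for x in sample]
prefix_results = [has_id_prefix(x) for x in sample]
```

Either callable could then be passed to validate() as the requirement, in place of a hand-written helper function.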

RepeatingContainer and Argument Unpacking

The RepeatingContainer class is useful for comparing similarly shaped data sources:

from datatest import validate, working_directory, RepeatingContainer
import pandas as pd

with working_directory(__file__):
    compare = RepeatingContainer([
        pd.read_csv('data_under_test.csv'),
        pd.read_csv('reference_data.csv'),
    ])

You can use iterable unpacking to get the individual results for validation:

...

result = compare[['A', 'C']].groupby('A').sum()
data, requirement = result  # Unpack result items.

validate(data, requirement)

But you can make this even more succinct by unpacking the arguments directly inside the validate() call itself—via the asterisk prefix, *:

...

validate(*compare[['A', 'C']].groupby('A').sum())
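
The asterisk prefix is ordinary Python argument unpacking and is not specific to datatest. As a rough illustration, the sketch below uses a hypothetical check() function and a plain two-item list as stand-ins for validate() and the RepeatingContainer result:

```python
def check(data, requirement):
    """Hypothetical stand-in for a two-argument call such as validate()."""
    return data == requirement

# A two-item iterable standing in for the RepeatingContainer result.
result = [{'x': 10, 'y': 20}, {'x': 10, 'y': 20}]

# Explicit unpacking into two names...
data, requirement = result
explicit = check(data, requirement)

# ...is equivalent to unpacking directly inside the call.
direct = check(*result)
```

Both forms pass the first item as data and the second as requirement, so the order of the sources in the container matters.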

Failure Message Handling

If validation fails and you’ve provided a msg argument, its value is used in the error message. But if no msg is given, then validate() will automatically generate its own message.

This example includes a msg value, so the error message reads “Values should be even numbers.”:

from datatest import validate

data = [2, 4, 5, 6, 8, 10, 11]

def is_even(x):
    """Should be even."""
    return x % 2 == 0

msg = 'Values should be even numbers.'
validate(data, is_even, msg=msg)

Traceback (most recent call last):
  File "example.py", line 10, in <module>
    validate(data, is_even, msg=msg)
datatest.ValidationError: Values should be even numbers. (2 differences): [
    Invalid(5),
    Invalid(11),
]

1. Docstring Message

If validation fails but no msg was given, then the first line of the requirement’s docstring is used. For this reason, it’s useful to start your docstrings with normative language, e.g., “must be…”, “should have…”, “needs to…”, “requires…”, etc.

In the following example, the docstring “Should be even.” is used in the error:

from datatest import validate

def is_even(x):
    """Should be even."""
    return x % 2 == 0

data = [2, 4, 5, 6, 8, 10, 11]
validate(data, is_even)

Traceback (most recent call last):
  File "example.py", line 8, in <module>
    validate(data, is_even)
datatest.ValidationError: Should be even. (2 differences): [
    Invalid(5),
    Invalid(11),
]

2. __name__ Message

If validation fails but there’s no msg and no docstring, then the requirement’s __name__ will be used to construct a message.

In this example, the function’s name, is_even, is used in the error:

from datatest import validate

def is_even(x):
    return x % 2 == 0

data = [2, 4, 5, 6, 8, 10, 11]
validate(data, is_even)

Traceback (most recent call last):
  File "example.py", line 7, in <module>
    validate(data, is_even)
datatest.ValidationError: does not satisfy is_even() (2 differences): [
    Invalid(5),
    Invalid(11),
]

3. __repr__() Message

If validation fails but there’s no msg, no docstring, and no __name__, then the requirement’s representation is used (i.e., the result of its __repr__() method).
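
The fallback order can be pictured with the following sketch. The describe() helper and the DivisibleBy class are hypothetical: they only mimic the documented order (docstring, then __name__, then __repr__()) and are not datatest’s actual implementation. In particular, the wording of the __repr__() message is an assumption.

```python
def describe(requirement):
    # Hypothetical helper that mimics the documented fallback order.
    doc = getattr(requirement, '__doc__', None)
    if doc:
        return doc.strip().splitlines()[0]  # First line of the docstring.
    name = getattr(requirement, '__name__', None)
    if name:
        return f'does not satisfy {name}()'
    # Assumed wording for the final fallback.
    return f'does not satisfy {requirement!r}'

def is_even(x):
    """Should be even."""
    return x % 2 == 0

def is_odd(x):  # No docstring, so __name__ is used.
    return x % 2 == 1

class DivisibleBy:
    # Instances have no docstring and no __name__ attribute.
    def __init__(self, divisor):
        self.divisor = divisor

    def __call__(self, x):
        return x % self.divisor == 0

    def __repr__(self):
        return f'DivisibleBy({self.divisor})'

from_docstring = describe(is_even)
from_name = describe(is_odd)
from_repr = describe(DivisibleBy(3))
```

A callable object like DivisibleBy(3) is one case where neither a docstring nor a __name__ is available, so a descriptive __repr__() is worth defining on such classes.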

Template Tests

If you need to integrate multiple published datasets, you might want to prepare a template test suite. The template can serve as a starting point for validating each dataset. In the template scripts, you can prompt users for action by explicitly failing with an instruction message.

For example, pytest.fail() fails a test with a message for the user:

...

def test_state(data):
    pytest.fail('Set requirement to appropriate state abbreviation.')
    validate(data['state'], requirement='XX')

=================================== FAILURES ===================================
__________________________________ test_state __________________________________

    def test_state(data):
>       pytest.fail('Set requirement to appropriate state abbreviation.')
E       Failed: Set requirement to appropriate state abbreviation.

example.py:24: Failed
=========================== short test summary info ============================
FAILED example.py::test_state - Failed: Set requirement to appropriate state...
========================= 1 failed, 41 passed in 0.14s =========================

When you see this failure, you can remove the pytest.fail() line and add the needed state abbreviation:

...

def test_state(data):
    validate(data['state'], requirement='CA')

Using a Function Factory

If you find yourself writing multiple helper functions that perform similar actions, consider writing a function factory instead. A function factory is a function that makes other functions.

In the following example, the ends_with() function makes helper functions:

from datatest import validate

def ends_with(suffix):  # <- Helper function factory.
    suffix = suffix.lower()
    def helper(x):
        return x.lower().endswith(suffix)
    helper.__doc__ = f'should end with {suffix!r}'
    return helper

data1 = [...]
validate(data1, ends_with('.csv'))

data2 = [...]
validate(data2, ends_with('.txt'))

data3 = [...]
validate(data3, ends_with('.ini'))
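
To see what the factory produces, the helpers can be inspected and called directly, without validate(). This sketch repeats the ends_with() definition so that it runs on its own; the file names and variable names are made up:

```python
def ends_with(suffix):  # Same factory as in the example above.
    suffix = suffix.lower()
    def helper(x):
        return x.lower().endswith(suffix)
    helper.__doc__ = f'should end with {suffix!r}'
    return helper

is_csv = ends_with('.CSV')  # The suffix is lowercased inside the factory.
is_txt = ends_with('.txt')

names = ['a.csv', 'B.CSV', 'c.txt']  # Hypothetical file names.
csv_doc = is_csv.__doc__
csv_results = [is_csv(name) for name in names]
txt_results = [is_txt(name) for name in names]
```

Because each helper carries its own docstring, a failed validation without a msg argument should report the specific suffix, following the docstring rule described under Failure Message Handling.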

Lambda Expressions

It’s common to use simple helper functions like the one below:

from datatest import validate

def is_even(n):  # <- Helper function.
    return n % 2 == 0

data = [...]
validate(data, is_even)

If your helper function consists of a single return expression, you could also write it as a lambda expression:

from datatest import validate

data = [...]
validate(data, lambda n: n % 2 == 0)

But you should be careful with lambdas because they don’t have names or docstrings. If the validation fails without an explicit msg value, the default message can’t provide any useful context—it would read “does not satisfy <lambda>”.

So if you use a lambda, it’s good practice to provide a msg argument, too:

from datatest import validate

data = [...]
validate(data, lambda n: n % 2 == 0, msg='should be even')
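
The reason lambdas produce an unhelpful default message is visible on the function objects themselves. This standard-library sketch compares the attributes of a named helper with those of a lambda; the variable names are chosen only for illustration:

```python
def is_even(n):
    """Should be even."""
    return n % 2 == 0

is_even_lambda = lambda n: n % 2 == 0

named_name = is_even.__name__          # 'is_even'
named_doc = is_even.__doc__            # 'Should be even.'

# Assigning a lambda to a variable does not give it a name or docstring.
lambda_name = is_even_lambda.__name__  # '<lambda>'
lambda_doc = is_even_lambda.__doc__    # None
```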

Skip Missing Test Fixtures (pytest)

Usually, you want a test to fail if a required fixture is unavailable. But in some cases, you may want to skip tests that rely on one fixture while continuing to run tests that rely on other fixtures.

To do this, you can call pytest.skip() from within the fixture itself. Tests that rely on this fixture will be automatically skipped when the data source is unavailable:

import pytest
import pandas as pd
from datatest import working_directory

@pytest.fixture(scope='module')
@working_directory(__file__)
def data_source1():
    file_path = 'data_source1.csv'
    try:
        return pd.read_csv(file_path)
    except FileNotFoundError:
        pytest.skip(f'cannot find {file_path}')

...

Test Configuration With conftest.py (pytest)

To share the same fixture with multiple test modules, you can move the fixture function into a separate file named conftest.py. See more from the pytest docs: Sharing Fixture Functions.