Developer Interface

DataTestCase

class datatest.DataTestCase(methodName='runTest')

This class wraps and extends unittest.TestCase and implements additional properties and methods for testing data quality. When a data assertion fails, it raises a DataAssertionError that contains a list of detected differences.

In addition to the new functionality, the familiar TestCase methods (like setUp, assertEqual, etc.) are still available.

subjectData

A data source containing the data under test—the subject of the tests. The subjectData can be defined at the class-level or at the module-level. To use DataTestCase, you must define subjectData.

When subjectData is defined at the module-level, this property reaches into its parent scopes and returns the subjectData from the nearest enclosing scope:

def setUpModule():
    global subjectData
    subjectData = datatest.CsvSource('myfile.csv')

Class-level declaration:

class TestMyFile(datatest.DataTestCase):
    @classmethod
    def setUpClass(cls):
        cls.subjectData = datatest.CsvSource('myfile.csv')

referenceData

An optional data source containing data that is trusted to be correct. Like subjectData, referenceData can be defined at the class-level or at the module-level, and this property will return the referenceData from the nearest enclosing scope. Unlike subjectData, defining referenceData is completely optional.

Module-level declaration:

def setUpModule():
    global subjectData, referenceData
    subjectData = datatest.CsvSource('myfile.csv')
    referenceData = datatest.CsvSource('myreference.csv')

Class-level declaration:

class TestMyFile(datatest.DataTestCase):
    @classmethod
    def setUpClass(cls):
        cls.subjectData = datatest.CsvSource('myfile.csv')
        cls.referenceData = datatest.CsvSource('myreference.csv')

Method                                        Checks that
assertDataColumns(required=None)              column names match required
assertDataSet(columns, required=None)         one or more columns contain required
assertDataSum(column, keys, required=None)    sums of column values, grouped by keys, match required dict
assertDataCount(column, keys, required=None)  counts of column values, grouped by keys, match required dict
assertDataRegex(column, required)             required.search(val) for each val in column
assertDataNotRegex(column, required)          not required.search(val) for each val in column

assertDataColumns(required=None, msg=None)

Test that the column names in subjectData match the required values. The required argument can be a collection, callable, data source, or None:

def test_columns(self):
    required_names = {'col1', 'col2'}
    self.assertDataColumns(required_names)

If required is omitted, the column names from referenceData are used in its place:

def test_columns(self):
    self.assertDataColumns()

assertDataSet(columns, required=None, msg=None, **kwds_filter)

Test that the column or columns in subjectData contain the required values:

def test_column1(self):
    required_values = {'a', 'b'}
    self.assertDataSet('col1', required_values)

If columns is a sequence of strings, we can check for distinct groups of values:

def test_column1and2(self):
    required_groups = {('a', 'x'), ('a', 'y'), ('b', 'x'), ('b', 'y')}
    self.assertDataSet(['col1', 'col2'], required_groups)

If the required argument is a helper function (or other callable), it is used as a key function which must return True for acceptable values:

def test_column1(self):
    def length_of_one(x):  # <- Helper function.
        return len(str(x)) == 1
    self.assertDataSet('col1', length_of_one)

If the required argument is omitted, then values from referenceData will be used in its place:

def test_column1(self):
    self.assertDataSet('col1')

def test_column1and2(self):
    self.assertDataSet(['col1', 'col2'])

assertDataSum(column, keys, required=None, msg=None, **kwds_filter)

Test that the sum of column in subjectData, when grouped by keys, matches a dict of required values:

per_dept = {'finance': 146564,
            'marketing': 152530,
            'research': 158397}
self.assertDataSum('budget', 'department', per_dept)

Grouping by multiple keys:

dept_quarter = {('finance', 'q1'): 85008,
                ('finance', 'q2'): 61556,
                ('marketing', 'q1'): 86941,
                ('marketing', 'q2'): 65589,
                ('research', 'q1'): 93454,
                ('research', 'q2'): 64943}
self.assertDataSum('budget', ['department', 'quarter'], dept_quarter)

If the required argument is omitted, then values from referenceData are used in its place:

self.assertDataSum('budget', ['department', 'quarter'])

assertDataCount(column, keys, required=None, msg=None, **kwds_filter)

Test that the count of non-empty values in the subjectData column, grouped by keys, matches the required values dict. If required is omitted, the sum of values in the referenceData column (not the count) is used in its place.

The required argument can be a dict, callable, data source, or None. See assertDataSet for more details.
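
For example, a sketch assuming hypothetical 'employee_id' and 'department' columns:

def test_employee_counts(self):
    required_counts = {'finance': 12,
                       'marketing': 8,
                       'research': 5}
    self.assertDataCount('employee_id', 'department', required_counts)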

assertDataRegex(column, required, msg=None, **kwds_filter)

Test that column in subjectData contains values that match a required regular expression:

def test_date(self):
    wellformed = r'\d\d\d\d-\d\d-\d\d'  # Matches YYYY-MM-DD.
    self.assertDataRegex('date', wellformed)

The required argument must be a string or a compiled regular expression object (it cannot be omitted).
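
A compiled pattern works the same way as a string pattern:

import re

def test_date(self):
    wellformed = re.compile(r'\d\d\d\d-\d\d-\d\d')  # Matches YYYY-MM-DD.
    self.assertDataRegex('date', wellformed)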

assertDataNotRegex(column, required, msg=None, **kwds_filter)

Test that column in subjectData contains values that do not match a required regular expression:

def test_name(self):
    bad_whitespace = r'^\s|\s$'  # Leading or trailing whitespace.
    self.assertDataNotRegex('name', bad_whitespace)

The required argument must be a string or a compiled regular expression object (it cannot be omitted).

Context Manager                   Allows
allowOnly(differences)            only specified differences
allowAny(number)                  given number of differences of any class
allowAny(**kwds_filter)           unlimited number of differences of any class that match given keyword filters
allowExtra(number=None)           given number of Extra differences, or an unlimited number if None
allowMissing(number=None)         given number of Missing differences, or an unlimited number if None
allowDeviation(tolerance)         Deviations of plus or minus given tolerance
allowPercentDeviation(tolerance)  Deviations of plus or minus given tolerance percentage

allowOnly(differences, msg=None)

Context manager to allow specific differences without triggering a test failure:

differences = [
    Extra('foo'),
    Missing('bar'),
]
with self.allowOnly(differences):
    self.assertDataSet('column1')

If the observed differences do not match the allowed differences, the test will fail with a DataAssertionError containing the remaining differences.

In the above example, differences is a list but it is also possible to pass a single difference or a dictionary.

Using a single difference:

with self.allowOnly(Extra('foo')):
    self.assertDataSet('column2')

When using a dictionary, the keys are strings that provide context (for future reference and derived reports) and the values are the individual difference objects themselves:

differences = {
    'Totals from state do not match totals from county.': [
        Deviation(+436, 38032, town='Springfield'),
        Deviation(-83, 8631, town='Union')
    ],
    'Some small towns were omitted from county report.': [
        Deviation(-102, 102, town='Anderson'),
        Deviation(-177, 177, town='Westfield')
    ]
}
with self.allowOnly(differences):
    self.assertDataSum('population', ['town'])

allowAny(number=None, msg=None, **kwds_filter)

Context manager to allow a given number of differences (of any class) without triggering a test failure:

with self.allowAny(10):  # Allows up to ten differences.
    self.assertDataSet('city_name')

If number is omitted, allows an unlimited number of differences as long as they match a given keyword filter:

with self.allowAny(city_name='not a city'):
    self.assertDataSum('population', ['city_name'])

If the count of differences exceeds the given number, the test case will fail with a DataAssertionError containing all observed differences.

allowMissing(number=None, msg=None)

Context manager to allow for missing values without triggering a test failure:

with self.allowMissing():  # Allows Missing differences.
    self.assertDataSet('column1')

allowExtra(number=None, msg=None)

Context manager to allow for extra values without triggering a test failure:

with self.allowExtra():  # Allows Extra differences.
    self.assertDataSet('column1')

allowDeviation(tolerance, /, msg=None, **kwds_filter)
allowDeviation(lower, upper, msg=None, **kwds_filter)

Context manager to allow for deviations from required numeric values without triggering a test failure.

Allowing deviations of plus-or-minus a given tolerance:

with self.allowDeviation(5):  # tolerance of +/- 5
    self.assertDataSum('column2', keys=['column1'])

Specifying different lower and upper bounds:

with self.allowDeviation(-2, 3):  # tolerance from -2 to +3
    self.assertDataSum('column2', keys=['column1'])

All deviations within the accepted tolerance range are suppressed but those that exceed the range will trigger a test failure.

Note

The “tolerance, /,” part of this method’s signature means that tolerance is a positional-only parameter—it cannot be specified using keyword syntax.

allowPercentDeviation(deviation, msg=None, **kwds_filter)

Context manager to allow positive or negative numeric differences of less than or equal to the given deviation as a percentage of the matching reference value:

with self.allowPercentDeviation(0.02):  # Allows +/- 2%
    self.assertDataSum('column2', keys=['column1'])

If differences exceed deviation, the test case will fail with a DataAssertionError containing the excessive differences.

Filter by Keywords (using **kwds_filter)

Many datatest methods support optional keyword arguments to quickly filter the data being tested. For example, adding state='Ohio' to a data assertion would limit the test to those records where the “state” column contains the value “Ohio”:

self.assertDataSet('postal_code', state='Ohio')

Multiple keywords can be used to further specify the data being tested. The keyword filter state='Ohio', city='Columbus' limits the test to records where the “state” column contains the value “Ohio” and the “city” column contains the value “Columbus”:

self.assertDataSet('postal_code', state='Ohio', city='Columbus')

Keyword arguments can also contain multiple values. Using state=['Indiana', 'Ohio'] limits the test to records where the “state” column contains the value “Indiana” or the value “Ohio”:

self.assertDataSet('postal_code', state=['Indiana', 'Ohio'])

Data Sources

Data source objects are used to access data in various formats.

Class                              Loads
CsvSource(file)                    CSV data from the path or file-like object file
SqliteSource(connection, table)    table from a SQLite connection
ExcelSource(path, worksheet=None)  Excel worksheet from an XLSX or XLS path, defaulting to the first worksheet if None (requires xlrd)
PandasSource(df)                   pandas DataFrame df (requires pandas)
MultiSource(*sources)              multiple data sources treated as a single data source

CsvSource

class datatest.CsvSource(file, encoding=None, in_memory=False)

Loads CSV data from file (path or file-like object):

subjectData = datatest.CsvSource('mydata.csv')

create_index(*columns)

Creating an index for certain columns can speed up data testing in some cases.

See SqliteSource.create_index for more details.

SqliteSource

class datatest.SqliteSource(connection, table)

Loads table data from the given SQLite connection:

conn = sqlite3.connect('mydatabase.sqlite3')
subjectData = datatest.SqliteSource(conn, 'mytable')

classmethod from_records(data, columns=None)

Alternate constructor to load an existing collection of records. Loads data (an iterable of lists, tuples, or dicts) into a new SQLite database with the given columns:

subjectData = datatest.SqliteSource.from_records(records, columns)

The columns argument can be omitted if data contains dict or namedtuple records:

dict_rows = [
    { ... },
    { ... },
]
subjectData = datatest.SqliteSource.from_records(dict_rows)

create_index(*columns)

Creating an index for certain columns can speed up data testing in some cases.

Indexes should be added one at a time to tune a test suite's overall performance. Creating several indexes before testing even begins could lead to worse performance, so use them with discretion.

For example: If you’re using “town” to group aggregation tests (like self.assertDataSum('population', ['town'])), then you might be able to improve performance by adding an index for the “town” column:

subjectData.create_index('town')

Using two or more columns creates a multi-column index:

subjectData.create_index('town', 'zipcode')

Calling the function multiple times will create multiple indexes:

subjectData.create_index('town')
subjectData.create_index('zipcode')

If you have the appropriate optional dependencies installed, datatest provides additional data source options:

ExcelSource

class datatest.ExcelSource(path, worksheet=None, in_memory=False)

Loads the first worksheet from an XLSX or XLS file path:

subjectData = datatest.ExcelSource('mydata.xlsx')

Specific worksheets can be accessed by name:

subjectData = datatest.ExcelSource('mydata.xlsx', 'Sheet 2')

This is an optional data source that requires the third-party library xlrd.

create_index(*columns)

Creating an index for certain columns can speed up data testing in some cases.

See SqliteSource.create_index for more details.

PandasSource

class datatest.PandasSource(df)

Loads a pandas DataFrame as a data source:

subjectData = datatest.PandasSource(df)

This is an optional data source that requires the third-party library pandas.

Todo

Optimize. PandasSource is not yet optimized for speed (although it will be in the future). Testing large DataFrames will be slow—it is faster to use CsvSource or SqliteSource.


MultiSource

class datatest.MultiSource(*sources, missing='')

A wrapper class that allows multiple data sources to be treated as a single, composite data source:

subjectData = datatest.MultiSource(
    datatest.CsvSource('file1.csv'),
    datatest.CsvSource('file2.csv'),
    datatest.CsvSource('file3.csv')
)

The original sources are stored in the __wrapped__ attribute.

Data is aligned by column name, and missing values are filled with empty strings.
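
For example, a sketch with hypothetical files where file1.csv has the columns 'town' and 'population' and file2.csv has the columns 'town' and 'zipcode':

composite = datatest.MultiSource(
    datatest.CsvSource('file1.csv'),  # Hypothetical columns: town, population
    datatest.CsvSource('file2.csv'),  # Hypothetical columns: town, zipcode
)

# The composite exposes all three columns; rows from file1.csv
# contain '' for 'zipcode' and rows from file2.csv contain ''
# for 'population'.
for row in composite:
    print(row)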

AdapterSource

class datatest.AdapterSource(source, interface, missing='')

A wrapper class that adapts a data source to an interface of column names. The interface should be a sequence of 2-tuples where the first item is the existing column name and the second item is the desired column name. If column order is not important, the interface can, alternatively, be a dict.

For example, a CSV file that contains the columns ‘old_1’, ‘old_2’, and ‘old_4’ can be adapted to behave as if it has the columns ‘new_1’, ‘new_2’, ‘new_3’ and ‘new_4’ with the following:

source = CsvSource('mydata.csv')
interface = [
    ('old_1', 'new_1'),
    ('old_2', 'new_2'),
    (None, 'new_3'),
    ('old_4', 'new_4'),
]
subjectData = AdapterSource(source, interface)

An AdapterSource can be thought of as a virtual source that renames, reorders, adds, or removes columns of the original source. To add a column that does not exist in the original, use None in place of a column name (see 'new_3', above). Columns mapped to None will contain missing values (defaults to an empty string).

The original source can be accessed via the __wrapped__ property.
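
When column order is unimportant, the interface from the example above can also be written as a dict (a sketch, assuming the keys are the existing names and the values are the desired names, mirroring the tuple order):

interface = {
    'old_1': 'new_1',
    'old_2': 'new_2',
    None: 'new_3',    # No original column; 'new_3' will hold missing values.
    'old_4': 'new_4',
}
subjectData = AdapterSource(source, interface)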

Common Methods

All data sources implement a set of common methods which are inherited from a common base class, BaseSource. Typically, these methods are used indirectly via DataTestCase, but it is also possible to call them directly:

class datatest.BaseSource

Common base class for all data sources. Custom sources can be created by subclassing BaseSource and implementing __init__(), __repr__(), columns() and __iter__(). Optionally, performance can be improved by implementing filter_rows(), distinct(), sum(), count(), and mapreduce().
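
As a minimal sketch (hypothetical; not part of datatest itself), a custom source wrapping an in-memory list of dictionary rows could be implemented like this:

class ListSource(datatest.BaseSource):
    """Custom source wrapping a list of dict rows."""
    def __init__(self, data, columns):
        self._data = data          # List of dict rows.
        self._columns = columns    # Sequence of column names.

    def __repr__(self):
        return 'ListSource(<{0} rows>)'.format(len(self._data))

    def columns(self):
        return list(self._columns)  # List of column names.

    def __iter__(self):
        return iter(self._data)     # Dict rows (like csv.DictReader).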

columns()

Returns list of column names.

__iter__()

Returns iterable of dictionary rows (like csv.DictReader).

filter_rows(**kwds)

Returns iterable of dictionary rows (like csv.DictReader) filtered by keywords. E.g., where column1=value1, column2=value2, etc. (uses slow __iter__).

distinct(columns, **kwds_filter)

Returns CompareSet of distinct values or distinct tuples of values if given multiple columns (uses slow __iter__).

sum(column, keys=None, **kwds_filter)

Returns CompareDict containing sums of column values grouped by keys.

count(column, keys=None, **kwds_filter)

Returns CompareDict containing count of non-empty column values grouped by keys.
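
Called directly (rather than through a DataTestCase assertion), these methods return CompareDict objects; a sketch with hypothetical columns:

source = datatest.CsvSource('mydata.csv')
sums = source.sum('population', keys='town')      # CompareDict of sums.
counts = source.count('population', keys='town')  # CompareDict of counts.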

mapreduce(mapper, reducer, columns, keys=None, **kwds_filter)

Apply mapper to the values in columns (which are grouped by keys and filtered by kwds_filter), then apply reducer of two arguments cumulatively to the mapped values, from left to right, so as to reduce the values to a single result per group of keys. If keys is omitted, a single result is returned; otherwise a CompareDict object is returned.

mapper (function or other callable):
Should accept column values from a single row and return a single computed result. Mapper always receives a single argument; if columns is a sequence, mapper will receive a tuple of the values contained in the specified columns.
reducer (function or other callable):
Should accept two arguments that are applied cumulatively to the intermediate mapped results (produced by mapper), from left to right, so as to reduce them to a single value for each group of keys.
columns (string or sequence):
Name of column or columns that are passed into mapper.
keys (None, string, or sequence):
Name of key or keys used to group column values.
kwds_filter:
Keywords used to filter rows.
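
A sketch of a direct mapreduce() call that finds the largest 'population' per 'town' (hypothetical columns):

def mapper(value):
    return int(value)  # Convert each row's 'population' value to an int.

def reducer(x, y):
    return x if x > y else y  # Keep the larger of two mapped values.

result = source.mapreduce(mapper, reducer, 'population', keys='town')
# result is a CompareDict with one maximum value per town.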

Comparison Objects

Querying a data source with various methods will return a CompareSet or a CompareDict.

CompareSet

class datatest.CompareSet(data)

DataSource query result set.

make_rows(names)

Return an iterable of dictionary rows (like csv.DictReader) using names to construct dictionary keys.

compare(other, op='==')

Compare self to other and return a list of difference objects. If other is callable, constructs a list of Invalid objects for values where other returns False. If other is a CompareSet or other collection, differences are compiled as a list of Extra and Missing objects.
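
For example, a sketch comparing a queried set of values against a required set:

required = {'a', 'b', 'c'}
result = source.distinct('col1')        # Returns a CompareSet.
differences = result.compare(required)  # List of Extra and Missing objects.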

CompareDict

class datatest.CompareDict(data, key_names)

DataSource query result mapping.

key_names

Column names for result keys.

make_rows(names)

Return an iterable of dictionary rows (like csv.DictReader) using names to construct dictionary keys.

compare(other)

Compare self to other and return a list of difference objects. If other is callable, constructs a list of Invalid objects for values where other returns False. If other is a CompareDict or other mapping object (like a dict), differences are compiled as a list of Deviation and Invalid objects.
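
For example, a sketch comparing grouped sums against a required mapping:

required = {'finance': 146564, 'marketing': 152530}
result = source.sum('budget', keys='department')  # Returns a CompareDict.
differences = result.compare(required)  # List of Deviation and Invalid objects.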

Error and Differences

class datatest.DataAssertionError(msg, differences, subject=None, required=None)

Data assertion failure—includes a list of differences.

class datatest.Extra(value, **kwds)

Additional value that is not part of a required set.

class datatest.Missing(value, **kwds)

Missing value that is part of a required set.

class datatest.Invalid(value, required=None, **kwds)

Invalid item that does not match a required check.

class datatest.Deviation(value, required, **kwds)

Deviation from a required numeric value.
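
For example, each difference carries the value in question plus optional keyword context:

datatest.Extra('foo')                      # Unexpected value found in subject.
datatest.Missing('bar')                    # Required value absent from subject.
datatest.Invalid('baz', required='bat')    # Value that fails a required check.
datatest.Deviation(+436, 38032, town='Springfield')  # +436 off a required 38032.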

Test Runner Program

@datatest.mandatory

A decorator to mark whole test cases or individual methods as mandatory. If a mandatory test fails, DataTestRunner will stop immediately (this is similar to the --failfast command line argument behavior):

@datatest.mandatory
class TestFileFormat(datatest.DataTestCase):
    def test_columns(self):
        ...
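
The decorator can also mark an individual method as mandatory:

class TestFileFormat(datatest.DataTestCase):
    @datatest.mandatory
    def test_columns(self):
        ...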

@datatest.skip(reason)

A decorator to unconditionally skip a test:

@datatest.skip('Not finished collecting raw data.')
class TestSumTotals(datatest.DataTestCase):
    def test_totals(self):
        ...

@datatest.skipIf(condition, reason)

A decorator to skip a test if the condition is true.

@datatest.skipUnless(condition, reason)

A decorator to skip a test unless the condition is true.

class datatest.DataTestRunner(stream=None, descriptions=True, verbosity=1, failfast=False, buffer=False, resultclass=None, ignore=False)

A data test runner (wraps unittest.TextTestRunner) that displays results in textual form.

resultclass

alias of DataTestResult

run(test)

Run the given tests in order of their line numbers in the source file.

class datatest.DataTestProgram(module='__main__', defaultTest=None, argv=None, testRunner=datatest.DataTestRunner, testLoader=unittest.TestLoader, exit=True, verbosity=1, failfast=None, catchbreak=None, buffer=None, warnings=None)

datatest.main

alias of DataTestProgram