Data Handling API Reference

working_directory

class datatest.working_directory(path)

A context manager to temporarily set the working directory to a given path. If path specifies a file, the file’s directory is used. When exiting the with-block, the working directory is automatically changed back to its previous location.

You can use Python’s __file__ attribute to load data relative to the current file’s directory:

from datatest import working_directory
import pandas as pd

with working_directory(__file__):
    my_df = pd.read_csv('myfile.csv')

This context manager can also be used as a decorator:

from datatest import working_directory
import pandas as pd

@working_directory(__file__)
def my_df():
    return pd.read_csv('myfile.csv')

In some cases, you may want to forgo the use of a context manager or decorator. You can explicitly control directory switching with the change() and revert() methods:

from datatest import working_directory

work_dir = working_directory(__file__)
work_dir.change()

...

work_dir.revert()

Tip

Take care when using pytest’s fixture finalization in combination with “session” or “module” level fixtures. In these cases, you should use working_directory() as a context manager—not as a decorator.

In the first example below, the original working directory is restored immediately when the with statement ends. But in the second example, the original directory isn’t restored until after the entire session is finished (not usually what you want):

# Correct:

@pytest.fixture(scope='session')
def connection():
    with working_directory(__file__):
        conn = ...  # Establish database connection.
    yield conn
    conn.close()

# Wrong:

@pytest.fixture(scope='session')
@working_directory(__file__)
def connection():
    conn = ...  # Establish database connection.
    yield conn
    conn.close()

When a fixture does not require finalization, or when it is short-lived (e.g., a function-level fixture), either form is acceptable.

Pandas Accessors

Datatest provides an optional extension accessor for integrating validation directly with pandas objects.

datatest.register_accessors()

Register the validate accessor for tighter pandas integration. This provides an alternate syntax for validating DataFrame, Series, Index, and MultiIndex objects.

After calling register_accessors(), you can use “validate” as a method:

import pandas as pd
import datatest as dt

df = pd.read_csv('example.csv')

dt.validate(df['A'], {'x', 'y', 'z'})  # <- Validate column 'A'.

dt.register_accessors()
df['A'].validate({'x', 'y', 'z'})  # <- Validate 'A' using accessor syntax.

Accessor Equivalencies

Below, you can compare the accessor syntax against the equivalent non-accessor syntax:

import datatest as dt
dt.register_accessors()
...

df.columns.validate({'A', 'B', 'C'})      # Index

df['A'].validate({'x', 'y', 'z'})         # Series

df['C'].validate.interval(10, 30)         # Series

df[['A', 'C']].validate((str, int))       # DataFrame

Here is the full list of accessor equivalencies:

Accessor Expression                  Equivalent Non-accessor Expression

obj.validate(requirement)            validate(obj, requirement)
obj.validate.predicate(requirement)  validate.predicate(obj, requirement)
obj.validate.regex(requirement)      validate.regex(obj, requirement)
obj.validate.approx(requirement)     validate.approx(obj, requirement)
obj.validate.fuzzy(requirement)      validate.fuzzy(obj, requirement)
obj.validate.interval(min, max)      validate.interval(obj, min, max)
obj.validate.set(requirement)        validate.set(obj, requirement)
obj.validate.subset(requirement)     validate.subset(obj, requirement)
obj.validate.superset(requirement)   validate.superset(obj, requirement)
obj.validate.unique()                validate.unique(obj)
obj.validate.order(requirement)      validate.order(obj, requirement)
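For context, register_accessors() builds on pandas’ own extension-accessor machinery. The sketch below demonstrates that underlying mechanism with a hypothetical “check” accessor (the accessor name and its is_subset method are invented for illustration and are not part of datatest):

```python
import pandas as pd

# pandas lets libraries attach a custom namespace to Series objects.
# Registering an accessor class makes it available on every Series
# as an attribute with the given name.
@pd.api.extensions.register_series_accessor("check")
class CheckAccessor:
    def __init__(self, series):
        self._series = series

    def is_subset(self, allowed):
        # True if every value in the Series appears in `allowed`.
        return set(self._series).issubset(allowed)

s = pd.Series(['x', 'y', 'x'])
print(s.check.is_subset({'x', 'y', 'z'}))  # True
```

datatest’s register_accessors() uses the same registration approach (for DataFrame, Series, Index, and MultiIndex) to expose validate as a method on pandas objects.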

RepeatingContainer

class datatest.RepeatingContainer(iterable)

A container that repeats attribute lookups, method calls, operations, and expressions on the objects it contains. When an action is performed, it is forwarded to each object in the container and a new RepeatingContainer is returned with the resulting values.

In the following example, a RepeatingContainer with two strings is created. A method call to upper() is forwarded to the individual strings and a new RepeatingContainer is returned that contains the uppercase values:

>>> from datatest import RepeatingContainer
>>> repeating = RepeatingContainer(['foo', 'bar'])
>>> repeating.upper()
RepeatingContainer(['FOO', 'BAR'])

A RepeatingContainer is an iterable and its individual items can be accessed through sequence unpacking or iteration. Below, the individual objects are unpacked into the variables x and y:

>>> repeating = RepeatingContainer(['foo', 'bar'])
>>> repeating = repeating.upper()
>>> x, y = repeating  # <- Unpack values.
>>> x
'FOO'
>>> y
'BAR'

If the RepeatingContainer was created with a dict (or other mapping), iterating over it yields (key, value) tuples. This sequence can be used as-is or used to create another dict:

>>> repeating = RepeatingContainer({'a': 'foo', 'b': 'bar'})
>>> repeating = repeating.upper()
>>> list(repeating)
[('a', 'FOO'), ('b', 'BAR')]
>>> dict(repeating)
{'a': 'FOO', 'b': 'BAR'}
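The forwarding behavior can be approximated in a few lines of plain Python. This is a hypothetical sketch of the idea, not datatest’s actual implementation (it omits operator forwarding and the mapping behavior shown above):

```python
class Repeating:
    """Sketch of a container that forwards lookups and calls."""

    def __init__(self, iterable):
        self._objs = list(iterable)

    def __iter__(self):
        return iter(self._objs)

    def __getattr__(self, name):
        # Look up `name` on each contained object and wrap the results.
        return Repeating(getattr(obj, name) for obj in self._objs)

    def __call__(self, *args, **kwargs):
        # Call each contained object (e.g., bound methods) and wrap
        # the return values in a new container.
        return Repeating(obj(*args, **kwargs) for obj in self._objs)

r = Repeating(['foo', 'bar'])
x, y = r.upper()  # Attribute lookup, then call, both forwarded.
print(x, y)  # FOO BAR
```

Because __getattr__ returns a new container of bound methods and __call__ invokes them, chained expressions like r.upper() propagate through every element automatically.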

Validating RepeatingContainer Results

When comparing the data under test against a set of similarly-shaped reference data, it’s common to perform the same operations on both data sources. As queries and selections grow more complex, this duplication becomes cumbersome. A RepeatingContainer mitigates it by performing each operation on both sources at once.

A RepeatingContainer is compatible with many types of objects—pandas.DataFrame, squint.Select, etc.

In the following example, a RepeatingContainer is created with two pandas.DataFrame objects. The indexing and method calls ...[['A', 'C']].groupby('A').sum() are forwarded to each DataFrame and the results are returned inside a new RepeatingContainer. Finally, the results are unpacked and validated:

import datatest as dt
import pandas as pd

compare = dt.RepeatingContainer([
    pd.read_csv('data_under_test.csv'),
    pd.read_csv('reference_data.csv'),
])

result = compare[['A', 'C']].groupby('A').sum()

data, requirement = result
dt.validate(data, requirement)

The example above can be expressed even more concisely using Python’s asterisk unpacking (*) to unpack the values directly inside the validate() call itself:

import datatest as dt
import pandas as pd

compare = dt.RepeatingContainer([
    pd.read_csv('data_under_test.csv'),
    pd.read_csv('reference_data.csv'),
])

dt.validate(*compare[['A', 'C']].groupby('A').sum())