Data Handling API Reference
working_directory
class datatest.working_directory(path)

    A context manager to temporarily set the working directory to a given path. If path specifies a file, the file's directory is used. When exiting the with-block, the working directory is automatically changed back to its previous location.
You can use Python's __file__ constant to load data relative to a file's current directory:

    from datatest import working_directory
    import pandas as pd

    with working_directory(__file__):
        my_df = pd.read_csv('myfile.csv')
This context manager can also be used as a decorator:

    from datatest import working_directory
    import pandas as pd

    @working_directory(__file__)
    def my_df():
        return pd.read_csv('myfile.csv')
In some cases, you may want to forgo the use of a context manager or decorator. You can explicitly control directory switching with the change() and revert() methods:

    from datatest import working_directory

    work_dir = working_directory(__file__)
    work_dir.change()
    ...
    work_dir.revert()
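The change/revert mechanics can be illustrated with a minimal stand-in built from the standard library (this is a hypothetical sketch for illustration, not datatest's implementation):

```python
import os
import tempfile
from pathlib import Path

class dir_switcher:
    """Simplified, hypothetical stand-in for datatest's working_directory."""
    def __init__(self, path):
        path = Path(path)
        # If path points to a file, use the file's directory instead.
        self._new_dir = path.parent if path.is_file() else path
        self._old_dir = None

    def change(self):
        self._old_dir = os.getcwd()   # Remember the current directory.
        os.chdir(self._new_dir)       # Switch to the target directory.

    def revert(self):
        os.chdir(self._old_dir)       # Restore the remembered directory.
        self._old_dir = None

    # The context-manager protocol builds on change()/revert().
    def __enter__(self):
        self.change()
        return self

    def __exit__(self, *exc_info):
        self.revert()

with tempfile.TemporaryDirectory() as tmp:
    original = os.getcwd()
    with dir_switcher(tmp):
        assert os.path.samefile(os.getcwd(), tmp)  # Inside the target dir.
    assert os.getcwd() == original                 # Restored on exit.
```

The explicit change()/revert() pair is what the context-manager form calls under the hood, which is why either style leaves the directory restored.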
Tip

Take care when using pytest's fixture finalization in combination with "session" or "module" level fixtures. In these cases, you should use working_directory() as a context manager, not as a decorator.

In the first example below, the original working directory is restored immediately when the with statement ends. But in the second example, the original directory isn't restored until after the entire session is finished (not usually what you want):

    # Correct:
    @pytest.fixture(scope='session')
    def connection():
        with working_directory(__file__):
            conn = ...  # Establish database connection.
        yield conn
        conn.close()

    # Wrong:
    @pytest.fixture(scope='session')
    @working_directory(__file__)
    def connection():
        conn = ...  # Establish database connection.
        yield conn
        conn.close()

When a fixture does not require finalization, or if the fixture is short-lived (e.g., a function-level fixture), then either form is acceptable.
Pandas Accessors
Datatest provides an optional extension accessor for integrating validation directly with pandas objects.
datatest.register_accessors()

    Register the validate accessor for tighter pandas integration. This provides an alternate syntax for validating DataFrame, Series, Index, and MultiIndex objects.

    After calling register_accessors(), you can use "validate" as a method:

        import pandas as pd
        import datatest as dt

        df = pd.read_csv('example.csv')

        dt.validate(df['A'], {'x', 'y', 'z'})  # <- Validate column 'A'.

        dt.register_accessors()
        df['A'].validate({'x', 'y', 'z'})  # <- Validate 'A' using accessor syntax.
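For context, pandas attaches accessors through a descriptor that builds the accessor namespace on first access and caches it on the instance. The sketch below illustrates that pattern with plain Python objects (all names here are hypothetical; this is neither datatest's nor pandas' actual code):

```python
class CachedAccessor:
    """Descriptor that attaches an accessor namespace to a class.
    Simplified sketch of the pattern; not datatest's implementation."""
    def __init__(self, name, accessor_cls):
        self._name = name
        self._accessor_cls = accessor_cls

    def __get__(self, obj, cls):
        if obj is None:
            return self._accessor_cls  # Accessed on the class itself.
        accessor = self._accessor_cls(obj)
        # Cache on the instance so the namespace is only built once.
        object.__setattr__(obj, self._name, accessor)
        return accessor

class ValidateAccessor:
    """Hypothetical accessor: wraps the object it is attached to."""
    def __init__(self, obj):
        self._obj = obj

    def __call__(self, requirement):
        # Stand-in check: every element belongs to the requirement set.
        return all(x in requirement for x in self._obj.data)

class Column:
    """Minimal stand-in for a pandas Series."""
    def __init__(self, data):
        self.data = list(data)

# "Registering" the accessor attaches the descriptor to the class.
Column.validate = CachedAccessor('validate', ValidateAccessor)

col = Column(['x', 'y', 'x'])
print(col.validate({'x', 'y', 'z'}))  # -> True
```

Registering the accessor on the class (rather than copying a method onto every instance) is what lets a single register_accessors() call affect all existing and future pandas objects.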
Accessor Equivalencies
Below, you can compare the accessor syntax against the equivalent non-accessor syntax.

Accessor syntax:

    import datatest as dt

    dt.register_accessors()
    ...

    df.columns.validate({'A', 'B', 'C'})  # Index
    df['A'].validate({'x', 'y', 'z'})     # Series
    df['C'].validate.interval(10, 30)     # Series
    df[['A', 'C']].validate((str, int))   # DataFrame

Non-accessor syntax:

    import datatest as dt

    ...

    dt.validate(df.columns, {'A', 'B', 'C'})  # Index
    dt.validate(df['A'], {'x', 'y', 'z'})     # Series
    dt.validate.interval(df['C'], 10, 30)     # Series
    dt.validate(df[['A', 'C']], (str, int))   # DataFrame
The accessor equivalencies follow a general pattern:

    Accessor Expression             | Equivalent Non-accessor Expression
    --------------------------------|------------------------------------
    obj.validate(requirement)       | dt.validate(obj, requirement)
    obj.validate.interval(min, max) | dt.validate.interval(obj, min, max)

Here, obj can be a DataFrame, Series, Index, or MultiIndex object.
RepeatingContainer
class datatest.RepeatingContainer(iterable)

    A container that repeats attribute lookups, method calls, operations, and expressions on the objects it contains. When an action is performed, it is forwarded to each object in the container and a new RepeatingContainer is returned with the resulting values.
In the following example, a RepeatingContainer with two strings is created. A method call to upper() is forwarded to the individual strings, and a new RepeatingContainer is returned that contains the uppercase values:

    >>> repeating = RepeatingContainer(['foo', 'bar'])
    >>> repeating.upper()
    RepeatingContainer(['FOO', 'BAR'])
A RepeatingContainer is an iterable, and its individual items can be accessed through sequence unpacking or iteration. Below, the individual objects are unpacked into the variables x and y:

    >>> repeating = RepeatingContainer(['foo', 'bar'])
    >>> repeating = repeating.upper()
    >>> x, y = repeating  # <- Unpack values.
    >>> x
    'FOO'
    >>> y
    'BAR'
If the RepeatingContainer was created with a dict (or other mapping), then iterating over it will return a sequence of (key, value) tuples. This sequence can be used as-is or used to create another dict:

    >>> repeating = RepeatingContainer({'a': 'foo', 'b': 'bar'})
    >>> repeating = repeating.upper()
    >>> list(repeating)
    [('a', 'FOO'), ('b', 'BAR')]
    >>> dict(repeating)
    {'a': 'FOO', 'b': 'BAR'}
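The forwarding behavior itself can be sketched in plain Python. The class below is a simplified, hypothetical stand-in (not datatest's implementation) that forwards attribute lookups and calls to each contained object:

```python
class MiniRepeater:
    """Simplified stand-in for datatest's RepeatingContainer
    (illustration only; forwards attribute lookups and calls)."""
    def __init__(self, iterable):
        self._objs = list(iterable)

    def __getattr__(self, name):
        # Look the attribute up on each contained object.
        return MiniRepeater(getattr(obj, name) for obj in self._objs)

    def __call__(self, *args, **kwargs):
        # Call each contained object and collect the results.
        return MiniRepeater(obj(*args, **kwargs) for obj in self._objs)

    def __iter__(self):
        return iter(self._objs)

    def __repr__(self):
        return f'MiniRepeater({self._objs!r})'

repeating = MiniRepeater(['foo', 'bar'])
x, y = repeating.upper()  # .upper is looked up on each string, then called.
print(x, y)  # -> FOO BAR
```

The real RepeatingContainer also forwards operators, indexing, and other expressions, but the same dispatch-and-collect idea applies throughout.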
Validating RepeatingContainer Results
When comparing the data under test against a set of similarly-shaped reference data, it's common to perform the same operations on both data sources. As queries and selections become more complex, this duplication can grow cumbersome, but it can be mitigated by using a RepeatingContainer object.
A RepeatingContainer is compatible with many types of objects: pandas.DataFrame, squint.Select, etc.
In the following example, a RepeatingContainer is created with two pandas.DataFrame objects. The indexing and method calls ...[['A', 'C']].groupby('A').sum() are forwarded to each DataFrame, and the results are returned inside a new RepeatingContainer. Finally, the results are unpacked and validated:
    import datatest as dt
    import pandas as pd

    compare = dt.RepeatingContainer([
        pd.read_csv('data_under_test.csv'),
        pd.read_csv('reference_data.csv'),
    ])

    result = compare[['A', 'C']].groupby('A').sum()

    data, requirement = result
    dt.validate(data, requirement)
Below, the method calls ...({'A': 'C'}).sum() are forwarded to each squint.Select, and the results are returned inside a new RepeatingContainer:
    from datatest import validate, RepeatingContainer
    from squint import Select

    compare = RepeatingContainer([
        Select('data_under_test.csv'),
        Select('reference_data.csv'),
    ])

    result = compare({'A': 'C'}).sum()

    data, requirement = result
    validate(data, requirement)
The examples above can be expressed even more concisely using Python's asterisk unpacking (*) to unpack the values directly inside the validate() call itself:
    import datatest as dt
    import pandas as pd

    compare = dt.RepeatingContainer([
        pd.read_csv('data_under_test.csv'),
        pd.read_csv('reference_data.csv'),
    ])

    dt.validate(*compare[['A', 'C']].groupby('A').sum())
    from datatest import validate, RepeatingContainer
    from squint import Select

    compare = RepeatingContainer([
        Select('data_under_test.csv'),
        Select('reference_data.csv'),
    ])

    validate(*compare({'A': 'C'}).sum())