Data Handling

working_directory

class datatest.working_directory(path)

A context manager to temporarily set the working directory to a given path. If path specifies a file, the file’s directory is used. When exiting the with-block, the working directory is automatically changed back to its previous location.

Use the global __file__ variable to load data relative to the test file’s own directory:

with datatest.working_directory(__file__):
    source = datatest.DataSource.from_csv('myfile.csv')

This context manager can also be used as a decorator.
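The behavior described above can be pictured with a minimal standard-library sketch. This is not datatest's implementation; the class below is a hypothetical stand-in built on contextlib.ContextDecorator, which is one common way to make a context manager usable as a decorator too.

```python
import os
import tempfile
from contextlib import ContextDecorator


class working_directory(ContextDecorator):
    """Hypothetical stand-in: temporarily change the working directory."""

    def __init__(self, path):
        # If path names a file, use the directory that contains it.
        if os.path.isfile(path):
            path = os.path.dirname(path)
        self._target = os.path.abspath(path)
        self._previous = None

    def __enter__(self):
        self._previous = os.getcwd()
        os.chdir(self._target)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        os.chdir(self._previous)  # Restore even if the block raised.
        return False  # Do not suppress exceptions.


@working_directory(tempfile.gettempdir())  # Decorator form.
def current_dir():
    return os.getcwd()
```

Calling current_dir() reports the temporary directory, while os.getcwd() called afterwards reports the original location again.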

DataSource

class datatest.DataSource(data, fieldnames=None)

A basic data source to quickly load and query data.

The given data should be an iterable of rows. The rows themselves can be lists (as below), dictionaries, or other sequences or mappings. fieldnames must be a sequence of strings to use when referencing data by field:

data = [
    ['x', 100],
    ['y', 200],
    ['z', 300],
]
fieldnames = ['A', 'B']
source = datatest.DataSource(data, fieldnames)
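To illustrate what pairing rows with fieldnames makes possible, here is a small sketch that uses only built-ins. The helper rows_as_dicts is hypothetical and says nothing about how DataSource stores data internally.

```python
def rows_as_dicts(data, fieldnames):
    """Pair each row's values with the field names (hypothetical helper)."""
    return [dict(zip(fieldnames, row)) for row in data]


data = [
    ['x', 100],
    ['y', 200],
    ['z', 300],
]
records = rows_as_dicts(data, ['A', 'B'])

# Once rows are keyed by field name, data can be referenced by field.
column_a = [record['A'] for record in records]
```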

If data is an iterable of dict or namedtuple rows, then fieldnames can be omitted:

data = [
    {'A': 'x', 'B': 100},
    {'A': 'y', 'B': 200},
    {'A': 'z', 'B': 300},
]
source = datatest.DataSource(data)

classmethod from_csv(file, encoding=None, **fmtparams)

Create a DataSource from a CSV file (a path or file-like object):

source = datatest.DataSource.from_csv('mydata.csv')

If file is an iterable of files, data will be loaded and aligned by column name:

files = ['mydata1.csv', 'mydata2.csv']
source = datatest.DataSource.from_csv(files)
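Alignment by column name can be sketched with the standard csv module. The function load_aligned is a hypothetical illustration, not the library's loader, and filling a missing column with an empty string is an assumption made here for the sake of the example.

```python
import csv
import io


def load_aligned(files):
    """Read several CSV files and align their rows by column name."""
    fieldnames, rows = [], []
    for f in files:
        reader = csv.DictReader(f)
        for name in reader.fieldnames or []:
            if name not in fieldnames:
                fieldnames.append(name)  # Keep first-seen column order.
        rows.extend(reader)
    # Assumption: a column missing from one file is filled with ''.
    aligned = [[row.get(name, '') for name in fieldnames] for row in rows]
    return fieldnames, aligned


# In-memory files stand in for 'mydata1.csv' and 'mydata2.csv'.
first = io.StringIO('A,B\nx,100\n')
second = io.StringIO('B,A,C\n200,y,extra\n')
fieldnames, rows = load_aligned([first, second])
```

Although the second file lists its columns in a different order, each value ends up under the column it was labeled with.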

classmethod from_excel(path, worksheet=0)

Create a DataSource from an Excel worksheet. The path must point to an XLSX or XLS file, and worksheet must give the index or name of the worksheet to load (defaults to the first worksheet). This constructor requires the optional third-party library xlrd.

Load first worksheet:

source = datatest.DataSource.from_excel('mydata.xlsx')

Specific worksheets can be loaded by name (a string) or index (an integer):

source = datatest.DataSource.from_excel('mydata.xlsx', 'Sheet 2')

fieldnames

A list of field names used by the data source.

__call__(select, **where)

Calling a DataSource like a function returns a DataQuery object that is automatically associated with the source (see DataQuery for select and where syntax):

query = source(['A'])

This is a shorthand for:

query = DataQuery(source, ['A'])
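The shorthand relies on Python's __call__ protocol. The sketch below uses two hypothetical classes, MiniSource and MiniQuery, to show the general pattern of a callable object that hands itself to the query it creates; it is not the library's code.

```python
class MiniQuery:
    """Hypothetical query that remembers where it came from."""

    def __init__(self, defaultsource, select):
        self.defaultsource = defaultsource
        self.select = select


class MiniSource:
    """Hypothetical source whose instances can be called like functions."""

    def __init__(self, rows):
        self.rows = rows

    def __call__(self, select):
        # Pass self along so the new query is associated with this source.
        return MiniQuery(self, select)


source = MiniSource([{'A': 'x'}])
query = source(['A'])             # Shorthand form.
same = MiniQuery(source, ['A'])   # Equivalent direct form.
```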

DataQuery

class datatest.DataQuery(select, **where)
class datatest.DataQuery(defaultsource, select, **where)

A class to query data from a DataSource object. Queries can be created, modified, and passed around without actually computing the result; computation doesn’t occur until the execute() method is called.

The select argument must be a container holding either one field name (a string) or one inner container of multiple field names. The optional where keywords can narrow a selection to rows where fields match specified values. A defaultsource can be provided to associate the query with a specific DataSource object.
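As a rough picture of the select and where semantics, the sketch below applies them to a plain list of dictionaries. The function select_values is hypothetical, and returning the result in the same container type as select is an assumption of this sketch rather than something stated above.

```python
def select_values(rows, select, **where):
    """Select one field, or one group of fields, from matching rows."""
    (inner,) = select  # The container must hold exactly one item.
    matching = (
        row for row in rows
        if all(row[key] == value for key, value in where.items())
    )
    if isinstance(inner, str):
        values = (row[inner] for row in matching)
    else:
        # Multiple field names: build one inner container per row.
        values = (type(inner)(row[name] for name in inner) for row in matching)
    # Assumption: the outer container type of select shapes the result.
    return type(select)(values)


rows = [
    {'A': 'x', 'B': 100},
    {'A': 'y', 'B': 200},
    {'A': 'x', 'B': 300},
]
single = select_values(rows, ['B'], A='x')   # One field, narrowed by where.
pairs = select_values(rows, [('A', 'B')])    # Inner container of two fields.
unique = select_values(rows, {'A'})          # A set as the outer container.
```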

Queries are usually created from an existing source (the originating source is automatically associated with the new query):

source = DataSource(...)
query = source(['A'])  # <- DataQuery created from source.

Queries can be created directly as well:

source = DataSource(...)
query = DataQuery(source, ['A'])  # <- Direct initialization.

Queries can also be created independent of any single data source:

query = DataQuery(['A'])

defaultsource

A property for setting a predetermined DataSource to use when execute() is called without a source argument.

When a query is created from a DataSource call, this property is assigned automatically. When a query is created directly, the value can be passed explicitly or it can be omitted.

sum()

Get the sum of non-None elements.

count()

Get the count of non-None elements.

avg()

Get the average of non-None elements. Strings and other objects that do not look like numbers are interpreted as 0.
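The treatment of None and of non-numeric values in sum(), count(), and avg() can be sketched in plain Python. The helper names are hypothetical. Using float() to decide what "looks like a number", and returning None for an empty input, are assumptions of this sketch.

```python
def as_number(value):
    """Assumption: values that float() accepts count; all others are 0."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.0


def sketch_sum(values):
    return sum(v for v in values if v is not None)


def sketch_count(values):
    return sum(1 for v in values if v is not None)


def sketch_avg(values):
    present = [as_number(v) for v in values if v is not None]
    # Assumption: an input with no non-None elements averages to None.
    return sum(present) / len(present) if present else None


total = sketch_sum([10, None, 30])
count = sketch_count([10, None, '20', 'n/a'])
average = sketch_avg([10, None, '20', 'n/a'])  # 'n/a' contributes 0.
```

Note that a non-numeric string still counts toward the denominator of the average, which pulls the result down.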

min()

Get the minimum value from elements.

max()

Get the maximum value from elements.

distinct()

Filter elements, removing duplicate values.
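One way to picture duplicate removal is the generator below. It is a hypothetical sketch that assumes hashable values and keeps the first occurrence of each; whether the library preserves order in this way is not stated here.

```python
def distinct(values):
    """Yield each value once, in order of first appearance."""
    seen = set()
    for value in values:
        if value not in seen:  # Requires hashable values.
            seen.add(value)
            yield value


unique = list(distinct(['x', 'y', 'x', 'z', 'y']))
```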

map(function)

Apply function to each element keeping the resulting data.

filter(function=None)

Filter elements, keeping only those values for which function returns True. If function is None, this method keeps all elements for which bool returns True.

reduce(function)

Reduce elements to a single value by applying a function of two arguments cumulatively to all elements from left to right.
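The descriptions of map(), filter(), and reduce() closely mirror Python's built-in map and filter and functools.reduce. The sketch below uses those standard functions on a plain list to show the intended semantics; it does not call the DataQuery methods themselves.

```python
from functools import reduce

elements = [3, 0, 4, None, 5]

# With no function given, falsy values such as 0 and None are dropped.
kept = list(filter(None, elements))

# Apply a function to each element and keep the results.
doubled = list(map(lambda value: value * 2, kept))

# Combine elements cumulatively from left to right: ((6 + 8) + 10).
total = reduce(lambda left, right: left + right, doubled)
```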

execute(source=None, *, evaluate=True, optimize=True)

Execute the query and return its result. The source should be a DataSource on which the query will operate. If source is omitted, the defaultsource is used.

By default, results are eagerly evaluated (and loaded into memory). For lazy evaluation, set evaluate to False to return a DataResult iterator instead.

Setting optimize to False turns off query optimization.

__call__(source=None)

A DataQuery can be called like a function to execute it and return a DataResult appropriate for lazy evaluation:

query = source(['A'])
result = query()  # <- Returns DataResult (iterator)

This is a shorthand for calling the execute() method with evaluate set to False.
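The relationship between execute(), defaultsource, and the call shorthand can be sketched with a hypothetical MiniQuery class that operates on a list of dictionaries. Raising ValueError when no source is available is an assumption of this sketch, and the optimize flag is left out entirely.

```python
class MiniQuery:
    """Hypothetical stand-in for a query over rows stored as dicts."""

    def __init__(self, field, defaultsource=None):
        self.field = field
        self.defaultsource = defaultsource

    def execute(self, source=None, *, evaluate=True):
        if source is None:
            source = self.defaultsource  # Fall back to the default.
        if source is None:
            # Assumption: a missing source is reported as an error.
            raise ValueError('no source given and no defaultsource set')
        values = (row[self.field] for row in source)  # Lazy generator.
        return list(values) if evaluate else values

    def __call__(self, source=None):
        return self.execute(source, evaluate=False)


rows = [{'A': 'x'}, {'A': 'y'}]
query = MiniQuery('A', defaultsource=rows)
eager = query.execute()   # A list, fully loaded into memory.
lazy = query()            # A generator; nothing is computed yet.
first = next(lazy)        # Values are produced one at a time.
```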

DataResult

class datatest.DataResult(iterable, evaluation_type)

A simple iterator that wraps the results of DataQuery execution. This iterator is used to facilitate the lazy evaluation of data objects (where possible) when asserting data validity.

Although DataResult objects are usually constructed automatically, it’s possible to create them directly:

iterable = iter([...])
result = DataResult(iterable, evaluation_type=list)

When iterated over, the iterable must yield only those values necessary for constructing an object of the given evaluation_type and no more. When the evaluation_type is a set, the iterable must not contain duplicate values. When the evaluation_type is a dict or other mapping, the iterable must contain suitable key-value pairs or a mapping.
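A minimal sketch of an iterator that carries an evaluation type may help make these requirements concrete. MiniResult is a hypothetical class, not the library's DataResult, and it simply passes the remaining items to the type's constructor.

```python
class MiniResult:
    """Hypothetical iterator that knows which type it should become."""

    def __init__(self, iterable, evaluation_type):
        self.__wrapped__ = iter(iterable)
        self.evaluation_type = evaluation_type

    def __iter__(self):
        return self

    def __next__(self):
        return next(self.__wrapped__)

    def evaluate(self):
        # Consume the remaining items and build the target object.
        return self.evaluation_type(self.__wrapped__)


as_list = MiniResult(iter(['x', 'y', 'x']), evaluation_type=list).evaluate()
as_set = MiniResult(iter(['x', 'y']), evaluation_type=set).evaluate()
# For a dict, the iterable supplies key-value pairs.
as_dict = MiniResult(iter([('a', 1), ('b', 2)]), evaluation_type=dict).evaluate()
```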

evaluation_type

The type of instance returned by the evaluate method.

evaluate()

Evaluate the entire iterator and return its result:

result = DataResult(iter([...]), evaluation_type=set)
result_set = result.evaluate()  # <- Returns a set of values.

When evaluating a dict or other mapping type, any values that are, themselves, DataResult objects will also be evaluated.
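Nested evaluation can be sketched as a recursive step over the values of a mapping. The MiniResult class below is hypothetical and assumes that, for mapping types, the wrapped iterator yields key-value pairs.

```python
from collections.abc import Mapping


class MiniResult:
    """Hypothetical result iterator with recursive evaluation."""

    def __init__(self, iterable, evaluation_type):
        self.__wrapped__ = iter(iterable)
        self.evaluation_type = evaluation_type

    def evaluate(self):
        if issubclass(self.evaluation_type, Mapping):
            # Assumption: the iterator yields (key, value) pairs.
            pairs = (
                (key, value.evaluate() if isinstance(value, MiniResult) else value)
                for key, value in self.__wrapped__
            )
            return self.evaluation_type(pairs)
        return self.evaluation_type(self.__wrapped__)


nested = MiniResult(
    iter([
        ('x', MiniResult(iter([1, 2]), evaluation_type=list)),
        ('y', MiniResult(iter([3]), evaluation_type=list)),
    ]),
    evaluation_type=dict,
)
evaluated = nested.evaluate()  # Inner results are evaluated as well.
```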

__wrapped__

The underlying iterator, which is useful when introspecting or rewrapping.