Validating Pandas Objects

The pandas package is widely used for data analysis. This page explains how datatest validates DataFrame, Series, Index, and MultiIndex objects.

Accessor Syntax

Examples on this page use the validate accessor, which is available after calling dt.register_accessors():

# Accessor syntax:

df['A'].validate({'x', 'y', 'z'})

We could also use the equivalent non-accessor syntax:

# Basic syntax:

dt.validate(df['A'], {'x', 'y', 'z'})

DataFrame

For validation, DataFrame objects using the default index type are treated as sequences. DataFrames using an index of any other type are treated as mappings:

import pandas as pd
import datatest as dt

dt.register_accessors()

df = pd.DataFrame(data={'A': ['foo', 'bar', 'baz', 'qux'],
                        'B': [10, 20, 'x', 'y']})


requirement = [
    ('foo', 10),
    ('bar', 20),
    ('baz', 'x'),
    ('qux', 'y'),
]

df.validate(requirement)

Since no index was specified, df uses the default RangeIndex type, which tells validate() to treat the DataFrame as a sequence.
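The sequence-or-mapping decision can be anticipated by inspecting the index type before validating. The sketch below uses pandas only. The helper `uses_default_index` is hypothetical, and treating the check as an `isinstance` test against `pd.RangeIndex` is an assumption, not something this page states about datatest's internals.

```python
import pandas as pd


def uses_default_index(obj):
    """Return True if obj has pandas' default RangeIndex (hypothetical helper)."""
    return isinstance(obj.index, pd.RangeIndex)


data = {'A': ['foo', 'bar', 'baz', 'qux'],
        'B': [10, 20, 'x', 'y']}

implicit = pd.DataFrame(data=data)  # no index given
explicit = pd.DataFrame(data=data, index=['I', 'II', 'III', 'IV'])  # assumed labels

print(uses_default_index(implicit))  # True: expected to be handled as a sequence
print(uses_default_index(explicit))  # False: expected to be handled as a mapping
```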

The distinction between implicit and explicit indexing is also apparent in error reporting. The example below shows the report for a DataFrame that uses the default index:

import pandas as pd
import datatest as dt

dt.register_accessors()

df = pd.DataFrame(data={'A': ['foo', 'bar', 'baz', 'qux'],
                        'B': [10, 20, 'x', 'y']})


df.validate((str, int))
Traceback (most recent call last):
  File "example.py", line 10, in <module>
    df.validate((str, int))
datatest.ValidationError: does not satisfy `(str, int)` (2 differences): [
    Invalid(('baz', 'x')),
    Invalid(('qux', 'y')),
]

Since the DataFrame was treated as a sequence, the error includes a sequence of differences.
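For contrast, a DataFrame with an explicit index is treated as a mapping, so its differences would be expected to be keyed by index label rather than listed in order. The following pandas-only sketch illustrates the two shapes by checking each row against `(str, int)` by hand. It does not call datatest or reproduce its implementation. The helper `row_matches` and the index labels are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(data={'A': ['foo', 'bar', 'baz', 'qux'],
                        'B': [10, 20, 'x', 'y']},
                  index=['I', 'II', 'III', 'IV'])  # explicit index (assumed labels)


def row_matches(row, types):
    """Check a row tuple against a tuple of types (hypothetical helper)."""
    return len(row) == len(types) and all(
        isinstance(value, t) for value, t in zip(row, types))


# Plain row tuples, without the index.
rows = list(df.itertuples(index=False, name=None))

# Sequence-style view: failing rows listed in order.
as_sequence = [row for row in rows if not row_matches(row, (str, int))]

# Mapping-style view: failing rows keyed by index label.
as_mapping = {label: row for label, row in zip(df.index, rows)
              if not row_matches(row, (str, int))}

print(as_sequence)
print(as_mapping)
```

The list form mirrors the sequence-style report shown above, while the dict form shows how the same failures look once each row is tied to its label.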

Series

Series objects are handled the same way as DataFrames. Series with a default index are treated as sequences and Series with explicitly defined indexes are treated as mappings:

import pandas as pd
import datatest as dt

dt.register_accessors()

s = pd.Series(data=[10, 20, 'x', 'y'])


requirement = [10, 20, 'x', 'y']

s.validate(requirement)

As before, the sequence or mapping handling is also apparent in the error report:

import pandas as pd
import datatest as dt

dt.register_accessors()

s = pd.Series(data=[10, 20, 'x', 'y'])


s.validate(int)
Traceback (most recent call last):
  File "example.py", line 9, in <module>
    s.validate(int)
datatest.ValidationError: does not satisfy `int` (2 differences): [
    Invalid('x'),
    Invalid('y'),
]

Index and MultiIndex

Index and MultiIndex objects are always treated as sequences:

import pandas as pd
import datatest as dt

dt.register_accessors()

index = pd.Index(['I', 'II', 'III', 'IV'])
requirement = ['I', 'II', 'III', 'IV']
index.validate(requirement)

multi = pd.MultiIndex.from_tuples([
    ('I', 'a'),
    ('II', 'b'),
    ('III', 'c'),
    ('IV', 'd'),
])
requirement = [('I', 'a'), ('II', 'b'), ('III', 'c'), ('IV', 'd')]
multi.validate(requirement)
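A flat Index iterates as scalar labels, while a MultiIndex iterates as one tuple per entry, which is consistent with the list-of-tuples requirement used above. This can be confirmed with pandas alone, in a minimal sketch that does not call datatest:

```python
import pandas as pd

index = pd.Index(['I', 'II', 'III', 'IV'])
multi = pd.MultiIndex.from_tuples([
    ('I', 'a'),
    ('II', 'b'),
    ('III', 'c'),
    ('IV', 'd'),
])

flat_labels = list(index)   # scalar labels
multi_labels = list(multi)  # one tuple per entry

print(flat_labels)
print(multi_labels)
```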