Validating Pandas Objects
The pandas data analysis package is commonly used for data work. This page explains how datatest handles the validation of DataFrame, Series, Index, and MultiIndex objects.
Accessor Syntax
Examples on this page use the validate accessor:
# Accessor syntax:
df['A'].validate({'x', 'y', 'z'})
We could also use the equivalent non-accessor syntax:
# Basic syntax:
dt.validate(df['A'], {'x', 'y', 'z'})
DataFrame
For validation, DataFrame objects using the default index type are treated as sequences. DataFrames using an index of any other type are treated as mappings:
import pandas as pd
import datatest as dt

# Register the validate accessor on pandas objects.
dt.register_accessors()

df = pd.DataFrame(data={'A': ['foo', 'bar', 'baz', 'qux'],
                        'B': [10, 20, 'x', 'y']})

# Default index: the requirement is a sequence of rows.
requirement = [
    ('foo', 10),
    ('bar', 20),
    ('baz', 'x'),
    ('qux', 'y'),
]

df.validate(requirement)
Since no index was specified, df uses the default RangeIndex type, which tells validate() to treat the DataFrame as a sequence.
import pandas as pd
import datatest as dt

# Register the validate accessor on pandas objects.
dt.register_accessors()

df = pd.DataFrame(data={'A': ['foo', 'bar', 'baz', 'qux'],
                        'B': [10, 20, 'x', 'y']},
                  index=['I', 'II', 'III', 'IV'])

# Explicit index: the requirement is a mapping keyed by index label.
requirement = {
    'I': ('foo', 10),
    'II': ('bar', 20),
    'III': ('baz', 'x'),
    'IV': ('qux', 'y'),
}

df.validate(requirement)
In this example, we’ve specified an index and therefore df is handled as a mapping.
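The rule described above can be approximated with a plain pandas check. This is only a sketch: expected_handling() is a hypothetical helper written for illustration, not part of datatest, and it assumes the decision depends solely on whether the index is a RangeIndex, as the text above describes.

```python
import pandas as pd


def expected_handling(obj):
    """Return 'sequence' or 'mapping' following the rule described above.

    Hypothetical helper for illustration; it is not part of datatest.
    """
    if isinstance(obj.index, pd.RangeIndex):
        return 'sequence'
    return 'mapping'


data = {'A': ['foo', 'bar', 'baz', 'qux'],
        'B': [10, 20, 'x', 'y']}

df_default = pd.DataFrame(data=data)  # no index given: RangeIndex
df_labeled = pd.DataFrame(data=data, index=['I', 'II', 'III', 'IV'])

print(expected_handling(df_default))
print(expected_handling(df_labeled))
```

A check like this can help when deciding whether a requirement should be written as a list or as a dict.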
The distinction between implicit and explicit indexing is also apparent in error reporting. Compare the two examples below, the first using the default index and the second using an explicit index:
import pandas as pd
import datatest as dt

dt.register_accessors()

df = pd.DataFrame(data={'A': ['foo', 'bar', 'baz', 'qux'],
                        'B': [10, 20, 'x', 'y']})

df.validate((str, int))

Traceback (most recent call last):
  File "example.py", line 10, in <module>
    df.validate((str, int))
datatest.ValidationError: does not satisfy `(str, int)` (2 differences): [
    Invalid(('baz', 'x')),
    Invalid(('qux', 'y')),
]
Since the DataFrame was treated as a sequence, the error includes a sequence of differences.
import pandas as pd
import datatest as dt

dt.register_accessors()

df = pd.DataFrame(data={'A': ['foo', 'bar', 'baz', 'qux'],
                        'B': [10, 20, 'x', 'y']},
                  index=['I', 'II', 'III', 'IV'])

df.validate((str, int))

Traceback (most recent call last):
  File "example.py", line 10, in <module>
    df.validate((str, int))
datatest.ValidationError: does not satisfy `(str, int)` (2 differences): {
    'III': Invalid(('baz', 'x')),
    'IV': Invalid(('qux', 'y')),
}
In this example, the DataFrame was treated as a mapping, so the error includes a mapping of differences.
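To make the difference in shape concrete, here is a small pure-Python sketch that mirrors the two error reports above. The functions matches() and collect_invalid() are hypothetical stand-ins written for this illustration; they are not datatest's implementation, and they return the failing values directly rather than Invalid objects.

```python
from collections.abc import Mapping


def matches(value, requirement):
    """Check a value against a type, or a row against a tuple of types."""
    if isinstance(requirement, tuple):
        return (isinstance(value, tuple)
                and len(value) == len(requirement)
                and all(isinstance(v, t) for v, t in zip(value, requirement)))
    return isinstance(value, requirement)


def collect_invalid(data, requirement):
    """Return failing items in the same shape as the input data."""
    if isinstance(data, Mapping):
        # Mapping in, mapping out: failures stay attached to their keys.
        return {k: v for k, v in data.items() if not matches(v, requirement)}
    # Sequence in, sequence out: failures are listed in order.
    return [v for v in data if not matches(v, requirement)]


rows = [('foo', 10), ('bar', 20), ('baz', 'x'), ('qux', 'y')]
keyed_rows = dict(zip(['I', 'II', 'III', 'IV'], rows))

seq_failures = collect_invalid(rows, (str, int))
map_failures = collect_invalid(keyed_rows, (str, int))

print(seq_failures)
print(map_failures)
```

The keyed form is usually easier to act on, since each failure can be traced back to its index label.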
Series
Series objects are handled the same way as DataFrames. Series with a default index are treated as sequences and Series with explicitly defined indexes are treated as mappings:
import pandas as pd
import datatest as dt

dt.register_accessors()

# Default index: handled as a sequence.
s = pd.Series(data=[10, 20, 'x', 'y'])

requirement = [10, 20, 'x', 'y']

s.validate(requirement)

import pandas as pd
import datatest as dt

dt.register_accessors()

# Explicit index: handled as a mapping.
s = pd.Series(data=[10, 20, 'x', 'y'],
              index=['I', 'II', 'III', 'IV'])

requirement = {'I': 10, 'II': 20, 'III': 'x', 'IV': 'y'}

s.validate(requirement)
As before, the sequence or mapping handling is also reflected in the error reporting, shown first for the default index and then for an explicit index:
import pandas as pd
import datatest as dt

dt.register_accessors()

s = pd.Series(data=[10, 20, 'x', 'y'])

s.validate(int)

Traceback (most recent call last):
  File "example.py", line 9, in <module>
    s.validate(int)
datatest.ValidationError: does not satisfy `int` (2 differences): [
    Invalid('x'),
    Invalid('y'),
]
import pandas as pd
import datatest as dt

dt.register_accessors()

s = pd.Series(data=[10, 20, 'x', 'y'],
              index=['I', 'II', 'III', 'IV'])

s.validate(int)

Traceback (most recent call last):
  File "example.py", line 9, in <module>
    s.validate(int)
datatest.ValidationError: does not satisfy `int` (2 differences): {
    'III': Invalid('x'),
    'IV': Invalid('y'),
}
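If the other style of handling is wanted, the index can be changed with ordinary pandas calls before validating. The sketch below uses only pandas and assumes the RangeIndex rule described on this page; the variable names are illustrative.

```python
import pandas as pd

labeled = pd.Series(data=[10, 20, 'x', 'y'],
                    index=['I', 'II', 'III', 'IV'])

# Dropping the labels restores the default RangeIndex, so the Series
# would be expected to be handled as a sequence.
as_sequence = labeled.reset_index(drop=True)

# Going the other way: attach labels to a default-indexed Series, so it
# would be expected to be handled as a mapping.
as_mapping = pd.Series(data=[10, 20, 'x', 'y']).copy()
as_mapping.index = pd.Index(['I', 'II', 'III', 'IV'])

print(type(as_sequence.index).__name__)
print(as_mapping.to_dict())
```

Dropping the index discards the labels, so this is appropriate only when row order alone identifies each value.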
Index and MultiIndex
Index and MultiIndex objects are all treated as sequences:
import pandas as pd
import datatest as dt

# Register the validate accessor on pandas objects.
dt.register_accessors()

# A flat Index is validated against a sequence of labels.
index = pd.Index(['I', 'II', 'III', 'IV'])
requirement = ['I', 'II', 'III', 'IV']
index.validate(requirement)

# A MultiIndex is validated against a sequence of tuples.
multi = pd.MultiIndex.from_tuples([
    ('I', 'a'),
    ('II', 'b'),
    ('III', 'c'),
    ('IV', 'd'),
])
requirement = [('I', 'a'), ('II', 'b'), ('III', 'c'), ('IV', 'd')]
multi.validate(requirement)
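The shape of each requirement above follows from what the objects yield when iterated. The following pandas-only sketch shows this; it does not call datatest, and the variable names are illustrative.

```python
import pandas as pd

index = pd.Index(['I', 'II', 'III', 'IV'])
multi = pd.MultiIndex.from_tuples([
    ('I', 'a'),
    ('II', 'b'),
    ('III', 'c'),
    ('IV', 'd'),
])

# Iterating a flat Index yields one label per element.
index_items = list(index)

# Iterating a MultiIndex yields one tuple per element, one entry per level.
multi_items = list(multi)

print(index_items)
print(multi_items)
```

This is why the MultiIndex requirement is written as a list of tuples rather than a list of single labels.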