How to Validate Data Types¶
To check that data is of a particular type, call validate()
with a type as the requirement argument (see Predicates).
Simple Type Checking¶
In the following example, we use the float type as the requirement. The elements in data are considered valid if they are float instances:
from datatest import validate

data = [0.0, 1.0, 2.0]
validate(data, float)
In this example, we use the str type as the requirement. The elements in data are considered valid if they are strings:
from datatest import validate

data = ['a', 'b', 'c']
validate(data, str)
Using a Tuple of Types¶
You can also use a predicate tuple to test the types contained in tuples. The elements in data are considered valid if the tuples contain a number followed by a string:
from numbers import Number
from datatest import validate

data = [(0.0, 'a'), (1.0, 'b'), (2, 'c'), (3, 'd')]
validate(data, (Number, str))
In the example above, the Number base class is used to check for numbers of any type (int, float, complex, Decimal, etc.).
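Because Python's standard numeric classes all register with the numbers tower, a Number requirement accepts any of them. A minimal sketch (the specific values here are illustrative, not part of the example above):

from decimal import Decimal
from fractions import Fraction
from numbers import Number
from datatest import validate

# Each element is a different numeric class, but all are
# instances of Number, so validation passes.
data = [1, 2.0, complex(3, 0), Decimal('4'), Fraction(1, 2)]
validate(data, Number)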
Checking Pandas Types¶
Type Inference and Conversion: A Quick Refresher
Import the pandas package:
>>> import pandas as pd
INFERENCE
When a column’s values are all integers (1, 2, and 3), Pandas infers an integer dtype:
>>> pd.Series([1, 2, 3])
0 1
1 2
2 3
dtype: int64
When a column’s values are a mix of integers (1 and 3) and floating point numbers (2.0), Pandas will infer a floating point dtype. Notice that the original integers have been coerced into float values:
>>> pd.Series([1, 2.0, 3])
0 1.0
1 2.0
2 3.0
dtype: float64
When certain non-numeric values are present, such as the string 'three', Pandas will use a generic “object” dtype:
>>> pd.Series([1, 2.0, 'three'])
0 1
1 2
2 three
dtype: object
CONVERSION
When a dtype is specified, e.g. dtype=float, Pandas will attempt to convert values into the given type. Here, the integers are explicitly converted into float values:
>>> pd.Series([1, 2, 3], dtype=float)
0 1.0
1 2.0
2 3.0
dtype: float64
In this example, integers and floating point numbers are converted into string values using dtype=str:
>>> pd.Series([1, 2.0, 3], dtype=str)
0 1
1 2.0
2 3
dtype: object
When a value cannot be converted into a specified type, an error is raised:
>>> pd.Series([1, 2.0, 'three'], dtype=int)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "~/myproject/venv/lib64/python3.8/site-packages/pandas/core/series.py", line
327, in __init__
data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True)
File "~/myproject/venv/lib64/python3.8/site-packages/pandas/core/construction.py",
line 447, in sanitize_array
subarr = _try_cast(data, dtype, copy, raise_cast_failure)
File "~/myproject/venv/lib64/python3.8/site-packages/pandas/core/construction.py",
line 555, in _try_cast
maybe_cast_to_integer_array(arr, dtype)
File "~/myproject/venv/lib64/python3.8/site-packages/pandas/core/dtypes/cast.py",
line 1674, in maybe_cast_to_integer_array
casted = np.array(arr, dtype=dtype, copy=copy)
ValueError: invalid literal for int() with base 10: 'three'
SEE ALSO
For more details, see the Pandas documentation regarding object conversion.
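As a related illustration, conversion can also be performed on an existing Series. A minimal sketch using standard Pandas methods (these calls are not part of datatest):

import pandas as pd

# A Series of strings.
s = pd.Series(['1', '2', '3'])

# astype() converts existing values, much like the dtype argument above.
converted = s.astype(float)

# pd.to_numeric() can coerce unconvertible values to NaN instead of raising.
numeric = pd.to_numeric(pd.Series(['1', '2', 'three']), errors='coerce')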
Check the types for each row of elements within a DataFrame:
import pandas as pd
import datatest as dt

dt.register_accessors()

df = pd.DataFrame(data={'A': ['foo', 'bar', 'baz', 'qux'],
                        'B': [10, 20, 30, 40]})

df.validate((str, int))
If some rows do not satisfy the requirement, validation fails:

import pandas as pd
import datatest as dt

dt.register_accessors()

df = pd.DataFrame(data={'A': ['foo', 'bar', 'baz', 'qux'],
                        'B': [10, 20, 'x', 'y']})

df.validate((str, int))
Traceback (most recent call last):
File "example.py", line 9, in <module>
df.validate((str, int))
datatest.ValidationError: does not satisfy `(str, int)` (2 differences): [
Invalid(('baz', 'x')),
Invalid(('qux', 'y')),
]
Check the type of each element, one column at a time:
import pandas as pd
import datatest as dt

dt.register_accessors()

df = pd.DataFrame(data={'A': ['foo', 'bar', 'baz', 'qux'],
                        'B': [10, 20, 30, 40]})

df['A'].validate(str)
df['B'].validate(int)
If a column contains elements of the wrong type, validation fails:

import pandas as pd
import datatest as dt

dt.register_accessors()

df = pd.DataFrame(data={'A': ['foo', 'bar', 'baz', 'qux'],
                        'B': [10, 20, 'x', 'y']})

df['A'].validate(str)
df['B'].validate(int)
Traceback (most recent call last):
File "example.py", line 10, in <module>
df['B'].validate(int)
datatest.ValidationError: does not satisfy `int` (2 differences): [
Invalid('x'),
Invalid('y'),
]
Check the dtypes of the columns themselves (not the elements they contain):
import pandas as pd
import numpy as np
import datatest as dt

dt.register_accessors()

df = pd.DataFrame(data={'A': ['foo', 'bar', 'baz', 'qux'],
                        'B': [10, 20, 30, 40]})

required = {
    'A': np.dtype(object),
    'B': np.dtype(int),
}

df.dtypes.validate(required)
When a column's dtype does not match the requirement, validation fails:

import pandas as pd
import numpy as np
import datatest as dt

dt.register_accessors()

df = pd.DataFrame(data={'A': ['foo', 'bar', 'baz', 'qux'],
                        'B': [10, 20, 'x', 'y']})

required = {
    'A': np.dtype(object),
    'B': np.dtype(int),
}

df.dtypes.validate(required)
Traceback (most recent call last):
File "example.py", line 14, in <module>
df.dtypes.validate(required)
datatest.ValidationError: does not satisfy `dtype('int64')` (1 difference): {
'B': Invalid(dtype('O'), expected=dtype('int64')),
}
Checking NumPy Types¶
Type Inference and Conversion: A Quick Refresher
Import the numpy package:
>>> import numpy as np
INFERENCE
When an array’s values are all integers (1, 2, and 3), NumPy infers an integer dtype:
>>> a = np.array([1, 2, 3])
>>> a
array([1, 2, 3])
>>> a.dtype
dtype('int64')
When an array’s values are a mix of integers (1 and 3) and floating point numbers (2.0), NumPy will infer a floating point dtype. Notice that the original integers have been coerced into float values:
>>> a = np.array([1, 2.0, 3])
>>> a
array([1., 2., 3.])
>>> a.dtype
dtype('float64')
When given a string, such as 'three', NumPy will infer a unicode text dtype. This is different from how Pandas handles the situation. Notice that all of the values are converted to text:
>>> a = np.array([1, 2.0, 'three'])
>>> a
array(['1', '2.0', 'three'], dtype='<U32')
>>> a.dtype
dtype('<U32')
When certain non-numeric types are present, e.g. the set {4}, NumPy will use a generic “object” dtype. In this case, the values keep their original types; no conversion takes place:
>>> a = np.array([1, 2.0, 'three', {4}])
>>> a
array([1, 2.0, 'three', {4}], dtype=object)
>>> a.dtype
dtype('O')
CONVERSION
When a dtype is specified, e.g. dtype=float, NumPy will attempt to convert values into the given type. Here, the integers are explicitly converted into float values:
>>> a = np.array([1, 2, 3], dtype=float)
>>> a
array([1., 2., 3.])
>>> a.dtype
dtype('float64')
In this example, integers and floating point numbers are converted into unicode text values using dtype=str:
>>> a = np.array([1, 2.0, 3], dtype=str)
>>> a
array(['1', '2.0', '3'], dtype='<U3')
>>> a.dtype
dtype('<U3')
When a value cannot be converted into a specified type, an error is raised:
>>> a = np.array([1, 2.0, 'three'], dtype=int)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: 'three'
For more details on NumPy types, see the NumPy documentation on data types.
With Predicate matching, you can use Python’s built-in str, int, float, and complex to validate types in NumPy arrays.
Check the type of each element in a one-dimensional array:
import numpy as np
import datatest as dt

a = np.array([1.0, 2.0, 3.0])

dt.validate(a, float)
If an element is not a float value, validation fails:

import numpy as np
import datatest as dt

a = np.array([1.0, 2.0, frozenset({3})])

dt.validate(a, float)
Traceback (most recent call last):
File "example.py", line 6, in <module>
dt.validate(a, float)
datatest.ValidationError: does not satisfy `float` (1 difference): [
Invalid(frozenset({3})),
]
Check the types for each row of elements within a two-dimensional array:
import numpy as np
import datatest as dt

a = np.array([(1.0, 12.25),
              (2.0, 33.75),
              (3.0, 101.5)])

dt.validate(a, (float, float))
If a row contains an element of the wrong type, validation fails:

import numpy as np
import datatest as dt

a = np.array([(1.0, 12.25),
              (2.0, 33.75),
              (frozenset({3}), 101.5)])

dt.validate(a, (float, float))
Traceback (most recent call last):
File "example.py", line 8, in <module>
dt.validate(a, (float, float))
datatest.ValidationError: does not satisfy `(float, float)` (1 difference): [
Invalid((frozenset({3}), 101.5)),
]
Check the dtype of an array itself (not the elements it contains):
import numpy as np
import datatest as dt

a = np.array([(1.0, 12.25),
              (2.0, 33.75),
              (3.0, 101.5)])

dt.validate(a.dtype, np.dtype(float))
If the array's dtype does not match the requirement, validation fails:

import numpy as np
import datatest as dt

a = np.array([(1.0, 12.25),
              (2.0, 33.75),
              (frozenset({3}), 101.5)])

dt.validate(a.dtype, np.dtype(float))
Traceback (most recent call last):
File "example.py", line 8, in <module>
dt.validate(a.dtype, np.dtype(float))
datatest.ValidationError: does not satisfy `dtype('float64')` (1 difference): [
Invalid(dtype('O')),
]
Structured Arrays¶
If you can define your structured array directly, there’s little need to validate the types it contains (unless it’s an “object” dtype that could contain multiple types). But you may want to check the types in a structured array if it was constructed indirectly or was passed in from another source.
Check the types for each row of elements within a two-dimensional structured array:
import numpy as np
import datatest as dt

a = np.array([(1, 'x'),
              (2, 'y'),
              (3, 'z')],
             dtype='int, object')

dt.validate(a, (int, str))
If a row contains an element of the wrong type, validation fails:

import numpy as np
import datatest as dt

a = np.array([(1, 'x'),
              (2, 'y'),
              (3, 4.0)],
             dtype='int, object')

dt.validate(a, (int, str))
Traceback (most recent call last):
File "example.py", line 9, in <module>
dt.validate(a, (int, str))
datatest.ValidationError: does not satisfy `(int, str)` (1 difference): [
Invalid((3, 4.0)),
]
You can also validate types with greater precision using NumPy’s very specific dtypes (np.uint32, np.float64, etc.), or you can use NumPy’s broader, generic types, like np.character, np.integer, np.floating, etc.:
import numpy as np
import datatest as dt

a = np.array([(1, 12.25),
              (2, 33.75),
              (3, 101.5)],
             dtype='int32, float32')

dt.validate(a, (np.integer, np.floating))
In the following example, the second column is given an “object” dtype instead:

import numpy as np
import datatest as dt

a = np.array([(1, 12.25),
              (2, 33.75),
              (3, 101.5)],
             dtype='int32, object')

dt.validate(a, (np.integer, np.floating))
Since the “object” dtype was used for the second column of elements, the original types are unchanged. And although the values are float objects, they aren’t NumPy floating point objects, so all of the rows fail validation:
Traceback (most recent call last):
File "example.py", line 9, in <module>
dt.validate(a, (np.integer, np.floating))
datatest.ValidationError: does not satisfy `(integer, floating)` (3 differences): [
Invalid((1, 12.25)),
Invalid((2, 33.75)),
Invalid((3, 101.5)),
]
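If keeping the “object” column is intentional, one workaround (a sketch, not part of the example above) is to pair NumPy's generic integer type with Python's built-in float, which does match the Python float objects stored in the “object” column:

import numpy as np
import datatest as dt

a = np.array([(1, 12.25),
              (2, 33.75),
              (3, 101.5)],
             dtype='int32, object')

# np.integer matches the int32 elements; the built-in float
# matches the Python float objects in the "object" column.
dt.validate(a, (np.integer, float))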
Check the dtype values of a structured array itself (not the elements it contains):
import numpy as np
import datatest as dt

a = np.array([(1, 'x'),
              (2, 'y'),
              (3, 'z')],
             dtype='int, object')

data = [a.dtype[x] for x in a.dtype.names]
requirement = [np.dtype(int), np.dtype(object)]

dt.validate(data, requirement)
If a field's dtype does not match the requirement, validation fails (here the second field is declared as str, which NumPy stores as a unicode dtype rather than an object dtype):

import numpy as np
import datatest as dt

a = np.array([(1, 'x'),
              (2, 'y'),
              (3, 'z')],
             dtype='int, str')

data = [a.dtype[x] for x in a.dtype.names]
requirement = [np.dtype(int), np.dtype(object)]

dt.validate(data, requirement)
Traceback (most recent call last):
File "example.py", line 11, in <module>
dt.validate(data, requirement)
datatest.ValidationError: does not match required sequence (1 difference): [
Invalid(dtype('<U'), expected=dtype('O')),
]
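For a more readable check, you could validate the same dtypes as a mapping keyed by field name, similar to the earlier DataFrame dtypes example. A sketch, using the default field names 'f0' and 'f1' that NumPy assigns to unnamed fields:

import numpy as np
import datatest as dt

a = np.array([(1, 'x'),
              (2, 'y'),
              (3, 'z')],
             dtype='int, object')

# Build a {field-name: dtype} mapping from the structured dtype.
data = {name: a.dtype[name] for name in a.dtype.names}

requirement = {
    'f0': np.dtype(int),     # default name of the first field
    'f1': np.dtype(object),  # default name of the second field
}

dt.validate(data, requirement)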