Testing With Pandas

Datatest can validate pandas objects (DataFrame, Series, and Index) the same way it does with built-in types.

Some Examples

This example uses a DataFrame to load and inspect data from a CSV file (movies.csv). The CSV file uses the following format:

title               rating  year  runtime
------------------  ------  ----  -------
Almost Famous       R       2000  122
American Pie        R       1999  95
Back to the Future  PG      1985  116
Blade Runner        R       1982  117
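To follow along, a movies.csv file has to exist next to the test script. The following is a minimal standard-library sketch that assumes the file holds only the four sample rows shown above; the helper name write_movies_csv is hypothetical and not part of datatest or pandas.

```python
import csv

# Column names and sample rows copied from the format shown above.
HEADER = ['title', 'rating', 'year', 'runtime']
ROWS = [
    ['Almost Famous', 'R', 2000, 122],
    ['American Pie', 'R', 1999, 95],
    ['Back to the Future', 'PG', 1985, 116],
    ['Blade Runner', 'R', 1982, 117],
]


def write_movies_csv(path='movies.csv'):
    """Write the sample rows to *path* (hypothetical helper)."""
    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(ROWS)
```

Calling write_movies_csv() with no arguments writes movies.csv to the current directory, so it should be run from the directory that will contain the test script.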

The test_movies_df.py script demonstrates pytest-style tests:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pytest
import pandas as pd
import datatest as dt


@pytest.fixture(scope='module')
@dt.working_directory(__file__)
def df():
    return pd.read_csv('movies.csv')


@pytest.mark.mandatory
def test_columns(df):
    dt.validate(
        df.columns,
        {'title', 'rating', 'year', 'runtime'},
    )


def test_title(df):
    dt.validate.regex(df['title'], r'^[A-Z]')


def test_rating(df):
    dt.validate.superset(
        df['rating'],
        {'G', 'PG', 'PG-13', 'R', 'NC-17', 'Not Rated'},
    )


def test_year(df):
    dt.validate(df['year'], int)


def test_runtime(df):
    dt.validate(df['runtime'], int)

To run these tests, use the following command:

pytest test_movies_df.py

Step by Step Explanation

1. Define a test fixture

Define a test fixture that loads the CSV file into a DataFrame:

@pytest.fixture(scope='module')
@dt.working_directory(__file__)
def df():
    return pd.read_csv('movies.csv')

2. Check column names

Check that the data includes the expected column names:

@pytest.mark.mandatory
def test_columns(df):
    dt.validate(
        df.columns,
        {'title', 'rating', 'year', 'runtime'},
    )

This validation requires that the set of values in df.columns matches the required set. The df.columns attribute is an Index object, which datatest treats the same as any other sequence of values.

This test is marked mandatory because it’s a prerequisite that must be satisfied before any of the other tests can pass. When a mandatory test fails, the test suite stops immediately and no more tests are run.
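Because the requirement is a set, the comparison ignores column order, and both a missing column and an unexpected column count as failures. The following plain-Python sketch illustrates that comparison; it is not datatest's actual implementation, the function name column_differences is hypothetical, and a plain list stands in for df.columns.

```python
def column_differences(columns, required):
    """Return (missing, extra) column names as sorted lists."""
    actual = set(columns)
    missing = sorted(required - actual)  # required but not present
    extra = sorted(actual - required)    # present but not required
    return missing, extra


required = {'title', 'rating', 'year', 'runtime'}

# Column order does not matter, so a reordered header still matches.
print(column_differences(['year', 'title', 'runtime', 'rating'], required))
# ([], [])

# A misspelled header shows up as one missing and one extra name.
print(column_differences(['title', 'rating', 'year', 'run_time'], required))
# (['runtime'], ['run_time'])
```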

3. Check ‘title’ values

Check that values in the title column begin with an upper-case letter:

def test_title(df):
    dt.validate.regex(df['title'], r'^[A-Z]')

This validation checks that each value in df['title'] matches the regular expression ^[A-Z].
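The pattern ^[A-Z] constrains only the first character, so it accepts any title that starts with an ASCII upper-case letter and rejects titles that start with a digit, a lower-case letter, or whitespace. The sketch below uses the standard re module to show which values would fail such a pattern. The helper name and the last three sample titles are made up for illustration, and this is not how datatest itself is implemented.

```python
import re

STARTS_UPPER = re.compile(r'^[A-Z]')


def titles_failing_pattern(titles):
    """Return the titles that do not start with an ASCII upper-case letter."""
    return [t for t in titles if not STARTS_UPPER.search(t)]


sample = ['Blade Runner', 'american pie', ' Almost Famous', '2001: A Space Odyssey']
print(titles_failing_pattern(sample))
# ['american pie', ' Almost Famous', '2001: A Space Odyssey']
```

A legitimate title that starts with a digit would fail this pattern, so a real data set may need a looser expression.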

4. Check ‘rating’ values

Check that values in the rating column match one of the allowed codes:

def test_rating(df):
    dt.validate.superset(
        df['rating'],
        {'G', 'PG', 'PG-13', 'R', 'NC-17', 'Not Rated'},
    )

This validation checks that the values in df['rating'] are also contained in the given set.
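Unlike the column check, this requirement is one-directional: every rating found in the data must appear in the allowed set, but the data does not have to use every allowed code. The sample rows, for instance, contain only R and PG. A plain-Python sketch of the same idea follows; the helper name unexpected_ratings is hypothetical and the code is not datatest's implementation.

```python
ALLOWED_RATINGS = {'G', 'PG', 'PG-13', 'R', 'NC-17', 'Not Rated'}


def unexpected_ratings(ratings, allowed=ALLOWED_RATINGS):
    """Return the distinct rating values that are not in *allowed*, sorted."""
    return sorted(set(ratings) - allowed)


# Ratings from the four sample rows: all are allowed codes.
print(unexpected_ratings(['R', 'R', 'PG', 'R']))
# []

# Wrong case and unknown codes are reported once each.
print(unexpected_ratings(['R', 'pg', 'PG', 'Unrated']))
# ['Unrated', 'pg']
```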

5. Check ‘year’ and ‘runtime’ types

Check that values in the year and runtime columns are integers:

def test_year(df):
    dt.validate(df['year'], int)


def test_runtime(df):
    dt.validate(df['runtime'], int)
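A common reason for an integer check to fail is a blank cell: with default settings, pandas.read_csv stores a numeric column that contains missing values as floating point, so 95 becomes 95.0 and the blank becomes NaN. The sketch below shows the kind of values a type check would flag. Plain Python values stand in for the column, the helper name non_integer_values is hypothetical, and how datatest maps pandas integer types onto int is not covered here.

```python
def non_integer_values(values):
    """Return (position, value) pairs for values that are not int instances."""
    return [(i, v) for i, v in enumerate(values) if not isinstance(v, int)]


# The 'year' values from the sample rows are all integers.
print(non_integer_values([2000, 1999, 1985, 1982]))
# []

# Floats, numeric strings, and missing values are all flagged.
print(non_integer_values([122, 95.0, '116', None]))
# [(1, 95.0), (2, '116'), (3, None)]
```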

More Information

See also

See the Validating Pandas Objects introduction docs for more information and examples.

See Pandas Accessors to learn about the alternate validation syntax provided by pandas accessor extensions.