Testing With Pandas
Datatest can validate pandas objects (DataFrame, Series, and
Index) the same way it does with
built-in types.
Some Examples
This example uses a DataFrame to
load and inspect data from a CSV file (movies.csv). The CSV file uses the
following format:
| title | rating | year | runtime |
|---|---|---|---|
| Almost Famous | R | 2000 | 122 |
| American Pie | R | 1999 | 95 |
| Back to the Future | PG | 1985 | 116 |
| Blade Runner | R | 1982 | 117 |
| … | … | … | … |
The test_movies_df.py
script demonstrates pytest-style tests:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import pytest
    import pandas as pd
    import datatest as dt


    @pytest.fixture(scope='module')
    @dt.working_directory(__file__)
    def df():
        return pd.read_csv('movies.csv')


    @pytest.mark.mandatory
    def test_columns(df):
        dt.validate(
            df.columns,
            {'title', 'rating', 'year', 'runtime'},
        )


    def test_title(df):
        dt.validate.regex(df['title'], r'^[A-Z]')


    def test_rating(df):
        dt.validate.superset(
            df['rating'],
            {'G', 'PG', 'PG-13', 'R', 'NC-17', 'Not Rated'},
        )


    def test_year(df):
        dt.validate(df['year'], int)


    def test_runtime(df):
        dt.validate(df['runtime'], int)

The test_movies_df_unit.py
script demonstrates unittest-style tests:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import pandas as pd
    import datatest as dt


    def setUpModule():
        global df
        with dt.working_directory(__file__):
            df = pd.read_csv('movies.csv')


    class TestMovies(dt.DataTestCase):
        @dt.mandatory
        def test_columns(self):
            self.assertValid(
                df.columns,
                {'title', 'rating', 'year', 'runtime'},
            )

        def test_title(self):
            self.assertValidRegex(df['title'], r'^[A-Z]')

        def test_rating(self):
            self.assertValidSuperset(
                df['rating'],
                {'G', 'PG', 'PG-13', 'R', 'NC-17', 'Not Rated'},
            )

        def test_year(self):
            self.assertValid(df['year'], int)

        def test_runtime(self):
            self.assertValid(df['runtime'], int)

To run these tests, use the following commands:

    pytest test_movies_df.py
    python -m datatest test_movies_df_unit.py

Step by Step Explanation
1. Define a test fixture
Define a test fixture that loads the CSV file into a
DataFrame:

pytest-style:

    @pytest.fixture(scope='module')
    @dt.working_directory(__file__)
    def df():
        return pd.read_csv('movies.csv')

unittest-style:

    def setUpModule():
        global df
        with dt.working_directory(__file__):
            df = pd.read_csv('movies.csv')

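Both variants read 'movies.csv' by a relative path, so the load only succeeds when the current working directory is the folder that holds the test script. As a rough illustration of the idea behind dt.working_directory (a standard-library sketch, not datatest's actual implementation), the helper below temporarily switches to a file's directory and restores the previous directory afterwards. The name in_directory_of is hypothetical.

```python
import contextlib
import os


@contextlib.contextmanager
def in_directory_of(path):
    """Temporarily switch to the directory that contains *path*.

    Hypothetical helper for illustration only; it is not part of datatest.
    """
    target = os.path.dirname(os.path.abspath(path))
    previous = os.getcwd()
    os.chdir(target)
    try:
        yield target
    finally:
        # Always restore the original directory, even if the body raises.
        os.chdir(previous)
```

Inside a test module, a block such as "with in_directory_of(__file__):" would let relative file names resolve next to the script, no matter where the test runner was started.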
2. Check column names
Check that the data includes the expected column names:

pytest-style:

    @pytest.mark.mandatory
    def test_columns(df):
        dt.validate(
            df.columns,
            {'title', 'rating', 'year', 'runtime'},
        )

unittest-style:

        @dt.mandatory
        def test_columns(self):
            self.assertValid(
                df.columns,
                {'title', 'rating', 'year', 'runtime'},
            )

This validation requires that the set of values in df.columns
matches the required set. The df.columns attribute is
an Index object, which datatest treats the same
as any other sequence of values.
This test is marked mandatory because it’s a prerequisite that must
be satisfied before any of the other tests can pass. When a mandatory
test fails, the test suite stops immediately and no more tests are run.
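To make the set comparison concrete, here is a plain-Python sketch of the logic, written without datatest or pandas. It only illustrates which names would count as missing or extra; datatest reports differences in its own format. The function name compare_columns and the misspelled header are made up for the example.

```python
required = {'title', 'rating', 'year', 'runtime'}


def compare_columns(columns, required):
    """Return (missing, extra) column names relative to *required*."""
    actual = set(columns)
    missing = required - actual  # required names absent from the data
    extra = actual - required    # names in the data that are not required
    return missing, extra


# Hypothetical header row with one misspelled column name.
columns = ['title', 'rating', 'yr', 'runtime']
missing, extra = compare_columns(columns, required)
print(sorted(missing))  # ['year']
print(sorted(extra))    # ['yr']
```

Because the requirement is a set, the order of the columns does not matter in this comparison; only the names themselves are checked.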
3. Check ‘title’ values
Check that values in the title column begin with an upper-case letter:

pytest-style:

    def test_title(df):
        dt.validate.regex(df['title'], r'^[A-Z]')

unittest-style:

        def test_title(self):
            self.assertValidRegex(df['title'], r'^[A-Z]')

This validation checks that each value in df['title'] matches
the regular expression ^[A-Z].
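As a minimal sketch of what this pattern accepts and rejects, the following standard-library example applies the same regular expression to a few titles and collects the ones that do not match. It is an illustration of the rule only, not of how datatest evaluates it; the helper non_matching and the lower-case title are invented for the example.

```python
import re

pattern = re.compile(r'^[A-Z]')


def non_matching(values, pattern):
    """Return the values that the pattern does not match."""
    return [value for value in values if not pattern.search(value)]


# The last title is a made-up bad value.
titles = ['Almost Famous', 'American Pie', 'blade runner']
print(non_matching(titles, pattern))  # ['blade runner']
```

Note that under this pattern a title beginning with a digit or punctuation mark would also be reported, since only the ASCII letters A through Z are accepted as the first character.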
4. Check ‘rating’ values
Check that values in the rating column match one of the allowed codes:

pytest-style:

    def test_rating(df):
        dt.validate.superset(
            df['rating'],
            {'G', 'PG', 'PG-13', 'R', 'NC-17', 'Not Rated'},
        )

unittest-style:

        def test_rating(self):
            self.assertValidSuperset(
                df['rating'],
                {'G', 'PG', 'PG-13', 'R', 'NC-17', 'Not Rated'},
            )

This validation checks that the values in df['rating'] are also
contained in the given set.
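The membership rule described above can be sketched in plain Python as a set difference. This is an illustration of the check's logic under the description given here, not datatest's implementation; the helper not_allowed and the lower-case rating are hypothetical.

```python
allowed = {'G', 'PG', 'PG-13', 'R', 'NC-17', 'Not Rated'}


def not_allowed(values, allowed):
    """Return the distinct values that are not members of *allowed*."""
    return sorted(set(values) - allowed)


# The last rating is a made-up bad value.
ratings = ['R', 'R', 'PG', 'R', 'pg-13']
print(not_allowed(ratings, allowed))  # ['pg-13']
```

The data does not need to use every allowed code: in this sketch, a column containing only 'R' and 'PG' produces no unexpected values.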
5. Check ‘year’ and ‘runtime’ types
Check that values in the year and runtime columns are integers:

pytest-style:

    def test_year(df):
        dt.validate(df['year'], int)


    def test_runtime(df):
        dt.validate(df['runtime'], int)

unittest-style:

        def test_year(self):
            self.assertValid(df['year'], int)

        def test_runtime(self):
            self.assertValid(df['runtime'], int)

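The idea of a type requirement can be sketched with a plain isinstance test over ordinary Python values. This is only an illustration of the concept; it makes no claim about how datatest or pandas treat NumPy integer types internally. The helper non_integers and the bad values are invented for the example.

```python
def non_integers(values):
    """Return the values that are not instances of int."""
    return [value for value in values if not isinstance(value, int)]


# The string and the float are made-up bad values.
years = [2000, 1999, '1985', 1982.5]
print(non_integers(years))  # ['1985', 1982.5]
```

A check like this catches a year stored as text or as a fractional number, which are the kinds of problems a type requirement is meant to surface.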
More Information
See also
See the Validating Pandas Objects introduction docs for more information and examples.
See Pandas Accessors to learn about the alternate validation syntax provided by pandas accessor extensions.