Datatest: Test driven data-wrangling and data validation
Datatest helps to speed up and formalize data-wrangling and data validation tasks. It was designed to work with poorly formatted data by detecting and describing validation failures.
- Validate the format, type, set membership, and more from a variety of data sources including pandas DataFrames and Series, NumPy ndarrays, built-in data structures, etc.
- Smart comparison behavior applies the appropriate validation method for a given data requirement.
- Automatic data handling manages the validation of single elements, sequences, sets, dictionaries, and other containers of elements.
- Difference objects characterize the discrepancies and deviations between a dataset and its requirements.
- Acceptance managers distinguish between ideal criteria and acceptable differences.
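The ideas of difference objects and acceptances can be illustrated with a small standard-library sketch. This is a conceptual illustration only: `Difference`, `find_differences`, and `check` are hypothetical names invented for this example and are not Datatest's API. See the Reference section for the actual functions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Difference:
    """Hypothetical stand-in describing one discrepancy (not Datatest's class)."""
    kind: str      # "missing" or "extra"
    value: object


def find_differences(data, requirement):
    """Compare the elements of `data` against a required set of values."""
    seen = set(data)
    missing = [Difference("missing", v) for v in sorted(requirement - seen)]
    extra = [Difference("extra", v) for v in sorted(seen - requirement)]
    return missing + extra


def check(data, requirement, accepted_kinds=()):
    """Return only the differences that are not covered by an acceptance."""
    return [
        d for d in find_differences(data, requirement)
        if d.kind not in accepted_kinds
    ]


data = ["A", "B", "x"]
requirement = {"A", "B", "C"}

# Ideal criteria: every discrepancy is reported.
strict = check(data, requirement)

# Acceptable differences: extra values are tolerated, missing ones are not.
lenient = check(data, requirement, accepted_kinds={"extra"})

print(strict)
print(lenient)
```

Here the strict check reports both the missing `"C"` and the extra `"x"`, while the lenient check reports only the missing value. This mirrors the distinction between ideal criteria and acceptable differences.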
Test driven data-wrangling is a process for taking data from a source of unverified quality or format and producing a verified, well-formatted dataset. It repurposes software testing practices for data preparation and quality assurance projects.
Pipeline validation monitors the status and quality of data as it passes through a pipeline and identifies where in a pipeline an error occurs.
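As a rough sketch of what pipeline validation means in practice, the following plain-Python example runs a check after every stage and reports the first stage whose output fails. The function `run_pipeline`, the stage names, and the sample values are all assumptions made for illustration. None of them come from Datatest.

```python
def run_pipeline(rows, stages):
    """Apply each (name, transform, check) stage in order.

    Returns (failed_stage_name, offending_rows) at the first failed check,
    or (None, final_rows) when every stage passes.
    """
    for name, transform, check in stages:
        rows = [transform(row) for row in rows]
        offending = [row for row in rows if not check(row)]
        if offending:
            return name, offending
    return None, rows


# Hypothetical stages: clean up raw state codes, validating after each step.
stages = [
    ("strip_whitespace", str.strip, lambda s: len(s) > 0),
    ("normalize_case", str.upper, lambda s: len(s) == 2 and s.isalpha()),
]

raw = [" ny ", "ca", " 12"]
failed_stage, result = run_pipeline(raw, stages)

print(failed_stage, result)
```

Because each stage is validated as soon as it runs, the failure is attributed to the stage where the bad value first becomes detectable rather than surfacing later in the pipeline.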
See the project README file for full details regarding supported versions, backward compatibility, and more.
Table of Contents
- Introduction
- How-to Guide
  - Install Datatest
  - Get Started Testing
  - Run Tests
  - Column Names
  - Customize Differences
  - Data Types
  - Date and Time Strings
  - Date and Time Objects
  - File Names
  - Test File Properties
  - Excel Auto-Formatting
  - Mailing Addresses
  - Fuzzy Matching
  - NaN Values
  - Negative Matches
  - Outliers
  - Phone Numbers
  - Re-order Acceptances
  - Sequences
- Reference
- Discussion