Datatest: Test driven data-wrangling and data validation

Apache 2.0 License Supported Python Versions Installation Requirements Full Version Development Repository

Datatest helps to speed up and formalize data-wrangling and data validation tasks. It was designed to work with poorly formatted data by detecting and describing validation failures.

  • Validate the format, type, set membership, and more from a variety of data sources including pandas DataFrames and Series, NumPy ndarrays, built-in data structures, etc.

  • Smart comparison behavior applies the appropriate validation method for a given data requirement.

  • Automatic data handling manages the validation of single elements, sequences, sets, dictionaries, and other containers of elements.

  • Difference objects characterize the discrepancies and deviations between a dataset and its requirements.

  • Acceptance managers distinguish between ideal criteria and acceptable differences.

Test driven data-wrangling is a process for taking data from a source of unverified quality or format and producing a verified, well-formatted dataset. It repurposes software testing practices for data preparation and quality assurance projects. Pipeline validation monitors the status and quality of data as it passes through a pipeline and identifies where in a pipeline an error occurs.

See the project README file for full details regarding supported versions, backward compatibility, and more.