Data Preparation

In the practice of data science, data preparation is a huge part of the job. Practitioners often spend 50 to 80 percent of their time wrangling data [1][2][3][4]. This critically important phase is time-consuming, unglamorous, and often poorly structured.

The datatest package was created to support test-driven data-wrangling and to provide a disciplined approach to an otherwise messy process.

A datatest suite can facilitate quick edit-test cycles to help guide the selection, cleaning, integration, and formatting of data. Data tests can also help to automate check-lists, measure progress, and promote best practices.

Test Driven Data-Wrangling

When data is messy, poorly structured, or in an incompatible format, it is often not possible to prepare it using an automated process. There are a multitude of ways for messy data to confound a processing system or schema. Dealing with data like this requires a data-wrangling approach in which users are actively involved in making decisions and judgment calls about cleaning and formatting the data.

A well-structured suite of data tests can serve as a template to guide the data-wrangling process. Using a quick edit-test cycle, users can:

  1. focus on a failing test

  2. make a change to the data or the test

  3. re-run the suite to check that the test now passes

  4. then, move on to the next failing test

The work of cleaning and formatting data takes place outside of the datatest package itself. Users can work with the tools they find most productive (Excel, pandas, R, sed, etc.).
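
The edit-test cycle described above can be illustrated with a small sketch. It uses only the Python standard library rather than datatest's own API, so that it runs anywhere. The records, the two check functions, and the first_failure() helper are all hypothetical names and made-up values chosen for illustration; a real suite would express these checks as datatest tests and load the data from the file being wrangled.

```python
# Made-up records standing in for a file that is being wrangled.
rows = [
    {"state": "OH", "population": "1200"},
    {"state": "ohio", "population": "1,350"},
]


def check_state_codes(rows):
    """Return 'state' values that are not two uppercase letters."""
    return [
        r["state"] for r in rows
        if not (len(r["state"]) == 2 and r["state"].isalpha() and r["state"].isupper())
    ]


def check_population_digits(rows):
    """Return 'population' values that contain anything other than digits."""
    return [r["population"] for r in rows if not r["population"].isdigit()]


# The "suite": checks are run in order, like tests in a test file.
CHECKS = [check_state_codes, check_population_digits]


def first_failure(rows):
    """Return (check name, offending values) for the first failing check, or None."""
    for check in CHECKS:
        bad = check(rows)
        if bad:
            return check.__name__, bad
    return None


# Step 1: focus on the first failing check.
print(first_failure(rows))

# Step 2: change the data. This assignment stands in for an edit made
# in whatever external tool the user prefers.
rows[1]["state"] = "OH"

# Steps 3 and 4: re-run the suite, then move on to the next failure.
print(first_failure(rows))
rows[1]["population"] = "1350"

# With both problems fixed, the suite reports no failures.
print(first_failure(rows))
```

Reporting only the first failing check keeps attention on one problem at a time, which mirrors the numbered steps above. The check functions never modify the data themselves, in keeping with the point that cleaning happens outside the test suite.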

Footnotes

1

“Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data…” Steve Lohr in For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights. Retrieved from http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html

2

“This [data preparation step] has historically taken the largest part of the overall time in the data mining solution process, which in some cases can approach 80% of the time.” Dynamic Warehousing: Data Mining Made Easy (p. 19)

3

Online poll of data mining practitioners: See image, Data preparation (Oct 2003). Retrieved from http://www.kdnuggets.com/polls/2003/data_preparation.htm [While this poll is quite old, the situation has not changed drastically.]

4

“As much as 80% of KDD is about preparing data, and the remaining 20% is about mining.” Data Mining for Design and Manufacturing (p. 44)