Data Validation With data.validator: An Open-Source Package from Appsilon

13 July 2021

Why Data Validation

Data validation is a crucial step in any data science project. It ensures that data is clean and well formatted before it enters the pipelines that feed ML models and dashboards, which minimizes errors further down the line. Functions and model training pipelines often throw errors when presented with missing values, incorrect data types, or out-of-range data. Validation checks that run before the data reaches the program help avoid the resulting loss of time and money.

It’s also important to note that data validation is not a one-off occurrence. Updating ML models requires new data, and the volume of input will likely change. Scalable, automated validation needs to run in the workflow with every update. The question now becomes: how do you achieve all of this?

Enter: data.validator

Why Data Validation
Data Validation With data.validator
Getting Started
Pipeline
Custom HTML Reporting
Example of data.validator in Production
Conclusion

Data Validation With data.validator

Today we will look at data.validator, an R package that offers scalable and reproducible data validation in a user-friendly way. It handles validation beyond simple structure and format checks, and its reporting tools support preventative maintenance and make it easier to identify and track the story behind the data. Some features of data.validator include:

Validation in %>% pipelines with functions: validate_if(), validate_cols(), and validate_rows()
Support for predicate functions from the assertr package like: in_set(), within_bounds(), etc.
Functions for creating user-friendly reports that can be sent by email, stored in a logs folder, or generated automatically with RStudio Connect
Customizable HTML reports

Getting Started

There are two options to install the package:

CRAN
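
A minimal sketch of the CRAN route, assuming the package is published there under the name data.validator:

```r
# Install the released version from CRAN
install.packages("data.validator")
```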

Latest Development Version
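
A minimal sketch of the development route. It assumes the remotes package is available and that the source lives in the Appsilon/data.validator repository on GitHub:

```r
# Install the helper package first if it is missing:
# install.packages("remotes")

# Install the development version straight from GitHub
remotes::install_github("Appsilon/data.validator")
```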

Pipeline

Step 1. First, create a blank report object:
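
A minimal sketch of this step, assuming the package's data_validation_report() constructor; the variable name report is our own choice:

```r
library(data.validator)

# Create an empty report object; validation results get added to it later
report <- data_validation_report()
```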

Step 2. Next, load your data set and prepare it for data validation. We will use the standard mtcars data set for this demonstration.

After creating the empty report object above, we can now start using the validate() function to perform the required validations on the dataset. We add the dataset and the name as arguments to the validate() function.
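
A minimal sketch of the call, using the built-in mtcars data. The label passed to name is an arbitrary string chosen for illustration:

```r
library(data.validator)

report <- data_validation_report()

# Wrap the dataset for validation; `name` is the label shown in the report
cars_validation <- validate(mtcars, name = "Verifying cars dataset")
```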

Step 3. After the validate() function, we can use the validate_*() functions and predicates to validate the data with the %>% operator.
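
A sketch of such a chain on mtcars, using the assertr predicates in_set() and within_bounds(). The specific rules and descriptions are illustrative assumptions, not rules from a real project:

```r
library(assertr)
library(magrittr)
library(data.validator)

checked <- validate(mtcars, name = "Verifying cars dataset") %>%
  # Expression-based check evaluated against the data frame
  validate_if(drat > 0, description = "Column drat has only positive values") %>%
  # Column-wise checks: apply one predicate to each selected column
  validate_cols(in_set(c(0, 1)), vs, am, description = "vs and am are 0 or 1") %>%
  validate_cols(within_bounds(10, 35), mpg, description = "mpg between 10 and 35") %>%
  # Row-wise check: reduce each row with num_row_NAs, then test the result
  validate_rows(num_row_NAs, within_bounds(0, 2), vs, am, mpg,
                description = "At most 2 missing values per row")
```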

Step 4. We can also add custom predicates by first defining a function and then using it inside validate_*() functions.
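
A sketch of a custom predicate. The helper between() is our own hypothetical function, not part of data.validator or assertr; it returns a function that tests whether values fall inside a closed interval:

```r
library(magrittr)
library(data.validator)

# Hypothetical predicate factory: TRUE where a <= x <= b
between <- function(a, b) {
  function(x) a <= x & x <= b
}

checked <- validate(mtcars, name = "Verifying cars dataset") %>%
  validate_cols(between(0, 4), cyl, description = "cyl between 0 and 4")
```

Since mtcars contains cars with 6 and 8 cylinders, this rule is expected to record violations rather than pass.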

Step 5. Once all the validations are done, we call add_results(report_name) to add the validation results to the report created in Step 1.
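
A sketch of a complete chain that ends with add_results(). The single validation rule is only an illustration:

```r
library(magrittr)
library(data.validator)

report <- data_validation_report()

validate(mtcars, name = "Verifying cars dataset") %>%
  validate_if(drat > 0, description = "Column drat has only positive values") %>%
  # Store the outcome of the chain in the report object
  add_results(report)
```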

Step 6. Finally, we print the report or generate an HTML document.
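
A sketch of both options. It assumes save_report() writes validation_report.html to the working directory by default, and that rmarkdown and pandoc are available for rendering:

```r
library(magrittr)
library(data.validator)

report <- data_validation_report()
validate(mtcars, name = "Verifying cars dataset") %>%
  validate_if(drat > 0, description = "Column drat has only positive values") %>%
  add_results(report)

# Print a text summary to the console
print(report)

# Render the HTML report (assumed default file: validation_report.html)
save_report(report)
# browseURL("validation_report.html")  # uncomment to open it in a browser
```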

We can turn off certain parts of the report like this:
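
A sketch, assuming save_report() forwards a success flag to the report renderer so that passed checks are left out of the HTML output:

```r
library(magrittr)
library(data.validator)

report <- data_validation_report()
validate(mtcars, name = "Verifying cars dataset") %>%
  validate_if(drat > 0, description = "Column drat has only positive values") %>%
  add_results(report)

# Assumption: success = FALSE hides the section listing passed validations
save_report(report, success = FALSE)
```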

We can also view the raw report like this:
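
A sketch using get_results(), which we assume returns the collected results as a data frame:

```r
library(magrittr)
library(data.validator)

report <- data_validation_report()
validate(mtcars, name = "Verifying cars dataset") %>%
  validate_if(drat > 0, description = "Column drat has only positive values") %>%
  add_results(report)

# Retrieve the raw validation results for inspection or further processing
raw_results <- get_results(report)
print(raw_results)
```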

data.validator provides other ways of saving the report:
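
A sketch of two of them, save_results() and save_summary(). The file names below are arbitrary choices for illustration:

```r
library(magrittr)
library(data.validator)

report <- data_validation_report()
validate(mtcars, name = "Verifying cars dataset") %>%
  validate_if(drat > 0, description = "Column drat has only positive values") %>%
  add_results(report)

# Write the raw results table to a CSV file
save_results(report, "results.csv")

# Write a plain-text summary, for example for a logs folder
save_summary(report, "validation_log.txt")
```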

Custom HTML Reporting

data.validator also supports custom report templates. Results can be shown with various interactive elements (e.g., a leaflet map). For example, a report can show the validation results of a predicate that checks whether Polish district populations fall within 3 standard deviations of the mean: assertr::within_n_sds(3).

You may find a predefined report template here. To use the template as a base, load the package in RStudio and go to File > New File > R Markdown > From template > Simple structure for HTML report summary. Here you can modify the template with custom titles and graphics.

Example of data.validator in Production

Workflow for data.validator can be implemented as follows:

Running RStudio Connect Scheduler (daily)
Scheduler sources the data from PostgreSQL table and validates it based on predefined rules.
Based on validation results, a new data.validator report is created
Data Response and Action
- Violation occurrence:
  - data provider and person responsible for data quality receive a report via email
  - thanks to assertr functionality, the report is easily understandable both for technical and non-technical personnel
  - data provider makes required data fixes
- Passes inspection:
  - a specific trigger is sent in order to reload Shiny data
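
The workflow above could be sketched as a single scheduled script. Everything outside the data.validator calls is a hypothetical stand-in: fetch_data() replaces the PostgreSQL query, while send_report_email() and trigger_shiny_reload() only print a message. The sketch also assumes that failed checks appear in the results table with type equal to "error".

```r
library(magrittr)
library(data.validator)

# Hypothetical stand-ins for the real integrations
fetch_data <- function() mtcars  # in production: query the PostgreSQL table
send_report_email <- function(path) message("Would email report: ", path)
trigger_shiny_reload <- function() message("Would trigger Shiny data reload")

run_daily_validation <- function() {
  report <- data_validation_report()

  validate(fetch_data(), name = "Daily data check") %>%
    validate_if(drat > 0, description = "Column drat has only positive values") %>%
    add_results(report)

  results <- get_results(report)
  # Assumption: violated rules are marked with type == "error"
  has_violations <- any(results$type == "error")

  if (has_violations) {
    # Violation occurrence: render the report and notify the responsible people
    save_report(report)
    send_report_email("validation_report.html")
  } else {
    # Passes inspection: signal the Shiny app to reload its data
    trigger_shiny_reload()
  }
  invisible(has_violations)
}

run_daily_validation()
```

Keeping the integrations behind small functions means the validation rules can be tested locally before the script is scheduled.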

Conclusion

Whether your dataset was built internally or pulled from external sources, you need to check that it meets the expectations you have defined. Detecting incomplete, duplicate, corrupt, or irrelevant data can be a huge undertaking, but leaving it unaddressed can negatively impact your analysis. That’s why Appsilon developed the data.validator package: to easily compose and integrate validation rules, scale for fluctuating volumes of data, and deliver clear, customizable reports.

If you need assistance with your project, consider reaching out to the Appsilon Data Science Machine Learning team. Our data science professionals deliver modern ML and computer vision solutions for Fortune 500 companies. If you are a public sector institution, NGO, academic institution, or public benefit corporation working on ML projects to solve climate change and environmental degradation issues, please reach out to us through our Data for Good initiative.

Contact Us

Maria Grycuk

Project Manager
