
5 Steps to Achieve High Data Quality by Cleansing and Tuning With AI

Data Quality Statistics

According to Experian's Global Data Management Benchmark Report, 77 percent of C-level executives say data issues have significantly disrupted their operations in the last year. Across all industries, the report finds that:

  • 99% have a strategy to maintain high quality data
  • 98% believe they have inaccurate data
  • 83% see data as an integral part of forming a business strategy
  • 69% say inaccurate data undermines customer experience efforts

The report also zeroes in on three major industries and shows us the following:

01 Retail industry

  • 77% have seen a return on investment on their data quality solutions
  • 65% believe inaccurate data undermines customer experience efforts
  • 78% trust their data to make key business decisions
  • 89% struggle to implement a data governance program

02 Finance industry

  • 74% have seen a return on investment on their data quality solutions
  • 86% see data as an integral part of a business strategy
  • 74% believe data quality issues impact customer trust and perception
  • 97% have a data management project planned in the next 12 months

03 IT industry

  • 77% have seen a return on investment on their data quality solutions
  • 85% trust their data to make key business decisions
  • 96% struggle to implement a data governance program
  • 99% have a data management project planned in the next 12 months

How can it be that 99 percent of all organizations have a strategy to maintain high-quality data, yet 98 percent of them believe they have inaccurate data and 69 percent blame inaccurate data for undermining customer experience efforts? The answer is twofold.

  • Cleansing data is difficult and time-consuming, even when you have a strategy in place.
  • Data will never be 100 percent accurate, but data that is 99.9 percent accurate is far more valuable than data that is 80 percent accurate.

Fortunately, Artificial Intelligence (AI) and Machine Learning (ML) — a subset of AI — make the job of cleansing and tuning data much easier, faster and more accurate than ever before.

Why Data Quality Matters

Inaccurate, incomplete and duplicate data are extremely wasteful and lead to less than optimal results when running campaigns based on that data.

10 Revealing Data Quality Stats

These data quality stats from multiple sources paint an ugly picture of the data quality landscape.

  1. Bad data costs sales and marketing departments around 550 hours and as much as $32,000 per sales rep (DiscoverOrg).
  2. 25-30 percent of data becomes inaccurate every year, making sales and marketing campaigns less effective (MarketingSherpa).
  3. Poor data quality costs businesses up to 20 percent of revenue each year (Kissmetrics).
  4. Duplicate records cost an average of $1 to prevent, $10 to correct and $100 if left untreated (SiriusDecisions).
  5. 50 percent of employee time is wasted dealing with commonplace data quality tasks (MIT Sloan).
  6. 40 percent of all leads contain inaccurate data (Integrate).
  7. Inconsistent data across technologies (CRMs, marketing automation systems, etc.) is considered the biggest challenge by 41 percent of companies (Dun & Bradstreet).
  8. Only 16 percent of companies say their data is “very good” (ChiefMarketing).
  9. Every hour, 59 business addresses change, 11 companies change their names and 41 new businesses open (ReachForce). These figures don’t include the companies that close every hour.
  10. 15 percent of leads contain duplicate records (Integrate).

In Kaggle’s 2017 survey of data scientists, 7,376 responded to the question, “What barriers are faced at work?” The number one answer was, “Dirty data.”

How Duplicates Impact Your Business

Duplicates are one of the biggest problems with data, especially if that data comes from multiple sources. HubSpot uncovered a number of ways that duplicate data impacts your business:

  • Costs

    After an outside firm lowered their percentage of duplicate data from 22 percent to 0.2 percent, Children’s Medical Center Dallas learned that each duplicate record cost them $96. Poor data quality costs U.S. businesses more than $600 billion each year (Data Warehouse Institute).

  • Productivity

    A CSR trying to help a customer for whom there are multiple records wastes their own time and the customer’s. The customer’s frustration only grows if they have to call back several times about the same issue because of duplicate records.

  • Waste

    Sending multiple direct mail pieces — especially catalogs — to one customer because they are in your database more than once is a complete waste of marketing dollars.

  • Brand

    Marketing campaigns that send the same materials to a customer several times because of duplicate records make your company look incompetent and can lead to a 25 percent drop in revenue gains.

  • Confidence

    When employees run into the same duplication issues again and again, they may lose confidence in the database, which encourages cutting corners and sloppiness (human error).

What Is Bad Data?

There is no single, precise definition of bad data. All of the following are descriptors of it.

  • Inaccurate

    Erroneous data is often the fault of poor manual data entry. These errors are often worse than having no data at all, as they use up valuable resources and can be difficult to catch.

  • Incomplete

    Because of the distorted picture it creates, incomplete data leads to faulty decision making.

  • Inconsistent

    Data sourced from multiple locations and platforms is usually formatted differently, which can lead to faulty interpretations.

  • Repetitive

    Duplicate data presents many problems, as the section above discusses.

Analyzing bad data leads to making bad decisions.
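
The last three descriptors can be surfaced programmatically; inaccuracy usually needs an outside source of truth to check against. As a rough, made-up illustration rather than anything from the report, here is a small pandas sketch that flags incomplete, inconsistent and repetitive records in a toy customer table:

```python
import pandas as pd

# Tiny, made-up customer table showing typical bad-data symptoms.
customers = pd.DataFrame({
    "name":  ["Acme Corp", "acme corp ", "Bolt LLC", None],
    "phone": ["555-0100", "(555) 0100", "555-0147", "555-0199"],
    "state": ["TX", "Texas", "CA", "CA"],
})

# Incomplete: rows with any missing field.
incomplete = customers[customers.isna().any(axis=1)]

# Inconsistent: state values that are not two-letter codes.
inconsistent = customers[~customers["state"].str.fullmatch(r"[A-Z]{2}", na=False)]

# Repetitive: duplicates that only appear once casing and whitespace are normalized.
normalized = customers["name"].str.lower().str.strip()
repetitive = customers[normalized.duplicated(keep=False) & normalized.notna()]

print(incomplete, inconsistent, repetitive, sep="\n\n")
```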

What Is High-Quality Data?

Any one of the descriptors above is enough to make data bad. High-quality data, by contrast, must satisfy all of the following descriptors together.

  • Valid

    Meets defined business rules or constraints for data

  • Accurate

    Corrected for all errors

  • Complete

    Thorough within the bounds of all available data

  • Consistent

    Formatted to avoid ambiguous interpretation

  • Uniform

    Using the same units and measures for all data

  • Traceable

    Able to find and use the source data

  • Timely

    Recently updated data

Some of these points are about cleansing data, while others speak to tuning its structure to make it more useful and reliable in AI projects.
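
To make a few of these dimensions concrete, here is a hedged sketch of validity, uniformity and timeliness checks on a hypothetical orders table; the column names and business rules are assumptions made for the example, not part of any report:

```python
import pandas as pd

# Hypothetical orders table; the columns and rules are assumptions for this example.
orders = pd.DataFrame({
    "order_id":   [1001, 1002, 1003],
    "amount_usd": [250.0, -40.0, 125.5],
    "currency":   ["USD", "USD", "EUR"],
    "updated_at": pd.to_datetime(["2024-01-05", "2022-06-01", "2024-02-10"]),
})

# Valid: meets a defined business rule (order amounts must be positive).
invalid = orders[orders["amount_usd"] <= 0]

# Uniform: every record uses the same unit (all amounts stored in USD).
non_uniform = orders[orders["currency"] != "USD"]

# Timely: updated within the last year relative to a chosen reference date.
reference = pd.Timestamp("2024-03-01")
stale = orders[orders["updated_at"] < reference - pd.Timedelta(days=365)]

print(f"{len(invalid)} invalid, {len(non_uniform)} non-uniform, {len(stale)} stale records")
```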

What Are Data Cleansing and Tuning?

Data cleansing and tuning are the two essential processes that turn bad data into high-quality data. Data cleansing fixes errors, removes duplicates and adds the data necessary to complete records. Data tuning structures the data to be consistent and usable regardless of the source.

It is not unusual for data cleansing (or data cleaning) to be used as an umbrella term for both processes, and the distinctions can be confusing. Here’s a simple way to look at how the terms used to describe data evolve as it is cleansed and tuned (a short code sketch of the progression follows the list).

  • Data that has not been cleansed is raw data

  • Once raw data is cleansed it becomes technically accurate data

  • Technically accurate data that has been tuned is uniform data
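
As a rough illustration of that progression (the fields and rules below are assumptions for the example, not a prescribed method), cleansing normalizes and deduplicates the raw records, and tuning then standardizes the survivors into one uniform format:

```python
import pandas as pd

# Raw data: a duplicate record, mixed date formats, inconsistent country values.
raw = pd.DataFrame({
    "email":   ["ANA@EXAMPLE.COM", "ana@example.com", "leo@example.com"],
    "signup":  ["2024-01-15", "01/15/2024", "March 3, 2024"],
    "country": ["us", "US", "United States"],
})

# Cleansing -> technically accurate data: normalize the key and drop duplicates.
cleansed = raw.assign(email=raw["email"].str.strip().str.lower())
cleansed = cleansed.drop_duplicates(subset="email", keep="first")

# Tuning -> uniform data: one date format and one country code for every record.
# (format="mixed" parses each value individually; it requires pandas 2.x.)
uniform = cleansed.assign(
    signup=pd.to_datetime(cleansed["signup"], format="mixed").dt.date,
    country=cleansed["country"].str.upper().replace({"UNITED STATES": "US"}),
)

print(uniform)
```

The intermediate `cleansed` frame corresponds to technically accurate data, and `uniform` is the tuned result an analyst or a model would actually consume.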

While data scientists are best known for creating highly accurate predictive models using uniform data, it’s not unusual for those scientists to spend half their time (or more) cleaning and tuning the data.

Some of the challenges of data cleansing include:

  • Lack of clarity as to what is causing anomalies in the system

  • Records that have been partially deleted and cannot be accurately completed

  • Time and expense of ongoing data maintenance

  • Difficulty of mapping out the cleansing process ahead of time to help direct the work

But the excessive costs of bad data and the rewards of high-quality data more than justify the time and resources devoted to cleansing and tuning data.

Data Cleansing Best Practices

Before you get started with a data-cleansing project, you have to be sure of what you want to accomplish and how you will accomplish it. Keeping these best practices in mind will give you a good conceptual footing.

  • Take a holistic view of your data.

    Think about who will be using the results from the data, not just the person doing the analysis.

  • Increase controls on inputs.

    Make sure that only the cleanest data is used in the system.

  • Identify and resolve bad data.

    Stop bad data before it becomes problematic by choosing software that gives you that capability.

  • Limit the sample size.

    With large datasets, large samples unnecessarily increase prep time and slow performance.

  • Run spot checks.

    Catch errors before they can be replicated throughout the data (a brief sketch of input controls and spot checks follows this list).
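
Two of these practices, increasing controls on inputs and running spot checks, are easy to picture in code. The sketch below is only illustrative; the validation rules and field names are assumptions, not a required implementation.

```python
import random
import re

def accept_record(record: dict) -> bool:
    """Input control: admit a record only if it passes basic business rules."""
    has_email = bool(re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", record.get("email", "")))
    has_name = bool(record.get("name", "").strip())
    return has_email and has_name

def spot_check(records: list, k: int = 5) -> list:
    """Spot check: pull a small random sample for manual review."""
    return random.sample(records, min(k, len(records)))

incoming = [
    {"name": "Ana Diaz", "email": "ana@example.com"},
    {"name": "", "email": "not-an-email"},
]

accepted = [r for r in incoming if accept_record(r)]
print(f"accepted {len(accepted)} of {len(incoming)}; review sample: {spot_check(accepted)}")
```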

Here is a second set of best practices, this one a step-by-step checklist for your cleansing project; a code sketch of these steps follows the list.

  • Set up a quality plan before you begin

  • Fill out missing values

  • Remove rows with missing values

  • Fix errors in the structure

  • Reduce data for proper data handling
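
Here is one way those five steps might look in practice. The table, thresholds and column names are illustrative assumptions rather than fixed rules:

```python
import pandas as pd

# Small made-up leads table standing in for a real export.
df = pd.DataFrame({
    "email":      [" Ana@Example.com ", None, "leo@example.com", "leo@example.com"],
    "company":    ["Acme", "Acme", None, "Bolt"],
    "country":    ["US", None, "US", "US"],
    "created_at": ["2024-01-05", "2024-02-01", "not a date", "2024-02-10"],
})

# 1. Quality plan: decide up front which fields are required and what "too sparse" means.
required = ["email", "company"]
max_missing_ratio = 0.5

# 2. Fill out missing values where a sensible default exists.
df["country"] = df["country"].fillna("unknown")

# 3. Remove rows still missing required fields.
df = df.dropna(subset=required)

# 4. Fix errors in the structure (normalize formats and types).
df["email"] = df["email"].str.strip().str.lower()
df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")

# 5. Reduce the data: drop overly sparse columns and exact duplicates.
df = df.loc[:, df.isna().mean() <= max_missing_ratio]
df = df.drop_duplicates()
print(df)
```

Deduplication comes last so that it runs on normalized values rather than raw ones.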

Let AI Do the Hard Work

Most large data cleansing projects are now performed with Artificial Intelligence — or, to be more precise, Machine Learning. Here’s why.

The amount of data that organizations collect increases significantly every year. When cleansing data without the aid of Machine Learning, the data scientist or data analyst has to write a new rule telling the system what to do with each kind of bad data they find. Because new data (including bad data) enters the system at an ever-increasing rate, the data scientist is constantly playing catch-up.

Consider the scope of the problem. Data creates patterns, and bad data is found by recognizing anomalies in those patterns. The more data there is, the more complex the patterns become and the harder they are for humans to analyze. Some anomalies simply go unfound unless the data scientist spends a massive amount of time hunting for them among increasingly complex patterns.
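
One common way to let software surface such anomalies, shown here purely as an illustrative sketch rather than a description of any particular product, is an unsupervised outlier detector such as scikit-learn's IsolationForest:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Made-up numeric features for ~1,000 records (say, order amount and item count),
# with a handful of implausible rows mixed in.
normal = rng.normal(loc=[100.0, 3.0], scale=[20.0, 1.0], size=(995, 2))
implausible = np.array([[9000.0, 1.0], [-50.0, 2.0], [120.0, 400.0], [0.0, 0.0], [7500.0, 250.0]])
records = np.vstack([normal, implausible])

# Fit an unsupervised detector; "contamination" is a guess at the share of bad rows.
detector = IsolationForest(contamination=0.01, random_state=0)
flags = detector.fit_predict(records)  # -1 = flagged as anomalous, 1 = looks normal

print(f"{(flags == -1).sum()} of {len(records)} records flagged for review")
```

Records flagged as anomalous can then be routed to a person or a downstream cleansing rule, so analyst time goes only to the suspicious minority.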

Every minute the data scientist spends cleaning up data is a minute they are not using the data to make the organization more productive. And if they can’t catch up with the bad data, they are effectively caught in a trap like a hamster on a wheel. This is not as uncommon a problem as you might think.

Machine Learning frees the data scientist from the trap. The data scientist creates a learning model that predicts matches, and as more data enters the system, Machine Learning continues to fine-tune the model.
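
A simplified, hypothetical sketch of such a match-prediction model (not any vendor's actual method) turns pairs of records into similarity features and trains a small classifier on examples labeled duplicate or not:

```python
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def features(a: dict, b: dict) -> list:
    """Similarity features for a pair of records (name and email similarity)."""
    return [
        SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio(),
        SequenceMatcher(None, a["email"].lower(), b["email"].lower()).ratio(),
    ]

# Tiny hand-labeled training set: 1 = same entity, 0 = different entities.
pairs = [
    ({"name": "Ana Diaz",  "email": "ana@example.com"},
     {"name": "Anna Diaz", "email": "ana@example.com"}, 1),
    ({"name": "Acme Corp", "email": "info@acme.com"},
     {"name": "ACME Corporation", "email": "info@acme.com"}, 1),
    ({"name": "Leo Chen",  "email": "leo@example.com"},
     {"name": "Mia Park",  "email": "mia@example.com"}, 0),
    ({"name": "Bolt LLC",  "email": "sales@bolt.io"},
     {"name": "Volt Inc",  "email": "hello@volt.com"}, 0),
]

X = [features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]
model = LogisticRegression().fit(X, y)

# As new record pairs arrive, the model scores how likely they are to be duplicates.
candidate = features({"name": "Ana M. Diaz", "email": "ana@example.com"},
                     {"name": "Ana Diaz",    "email": "ana@example.com"})
print(f"match probability: {model.predict_proba([candidate])[0][1]:.2f}")
```

In practice the training pairs would come from duplicates that have already been resolved, and the model keeps improving as more labeled matches accumulate, which is the fine-tuning described above.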

With manual data cleaning using standard computer programs, the problem gets worse as more data is added. With Machine Learning, the problem shrinks as more data is added. The key is to focus Machine Learning on systematically improving how it analyzes, rates and uses data so that it becomes the expert at correcting, updating, repairing and improving data.

That puts the data scientist back to doing what they were hired to do.

Benefits of Data Cleaning and Tuning With AI

Clean and tuned data is a necessity for numerous parts of an organization, including:

  • Marketing

    Reach the right people with the right messages at the right time

  • Sales

    Rely on a complete and accurate view of your customer

  • Compliance

    Be confident that you are complying with all regulations for every customer

  • Operations

    Better data = better decisions

Moreover, tuned data is needed to conduct any type of analysis with confidence.

The fastest and most accurate way to cleanse and tune data is by incorporating AI. In less than a day, Machine Learning can produce data models more accurate than those a data scientist armed only with standard computer programs could produce in weeks. KNIME is one example of a tool often used to build automated data cleaning processes that integrate AI; for other tools, see the Technology section.

There is simply no way to be sure that uncleansed data is accurate enough to use in any type of campaign. Data cleansing and tuning are necessities, and the most effective and efficient way to accomplish them is with Artificial Intelligence.