According to Experian's Global Data Management Benchmark Report, 77 percent of C-level executives say their operations have been greatly disrupted by data issues in the last year. Data quality statistics for all industries report that on a percentage basis for all organizations:
The report also zeroes in on three major industries and shows us the following:
How can it be that 99 percent of all organizations have a strategy to maintain high quality data, yet 98 percent of them believe they have inaccurate data and 69 percent blame inaccurate data for undermining customer experience efforts? The answer is twofold.
Fortunately, Artificial Intelligence (AI) and Machine Learning (ML) — a subset of AI — make the job of cleansing and tuning data much easier, faster and more accurate than ever before.
Inaccurate, incomplete and duplicate data are extremely wasteful and lead to less than optimal results when running campaigns based on that data.
These data quality stats from multiple sources paint an ugly picture of the data quality landscape.
In Kaggle’s 2017 survey of data scientists, 7,376 responded to the question, “What barriers are faced at work?” The number one answer was, “Dirty data.”
Duplicates are one of the biggest problems with data, especially if that data comes from multiple sources. HubSpot uncovered a number of ways that duplicate data impacts your business:
Children's Medical Center Dallas learned that each duplicate record cost them $96 after an outside firm lowered its percentage of duplicate data from 22 percent to 0.2 percent. Poor data quality costs U.S. businesses more than $600 billion each year (The Data Warehousing Institute).
A CSR trying to help a customer for whom there are multiple records wastes both their own time and the customer's. The customer's frustration is exacerbated if they have to call back multiple times about the same issue due to duplicate records.
Sending multiple direct mail pieces — especially catalogs — to one customer because they are in your database more than once is a complete waste of marketing dollars.
Marketing campaigns that send the same materials several times to customers with duplicate records make your company look incompetent and can lead to a 25 percent drop in revenue gains.
When employees run into the same duplication issues on an ongoing basis, they may lose confidence in the database, which promotes corner-cutting and sloppiness (human error).
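The duplicate problems above can be illustrated in a few lines of code. The following is a minimal sketch that flags exact duplicates by normalized email address; the records, field names and `find_duplicates` helper are all hypothetical, for illustration only:

```python
# Flag exact duplicates in a contact list by normalizing the email field.
# Records, field names and this helper are hypothetical examples.

def find_duplicates(records):
    """Group records that share the same normalized email address."""
    seen = {}
    for rec in records:
        key = rec["email"].strip().lower()
        seen.setdefault(key, []).append(rec["name"])
    # Keep only the email addresses that appear more than once.
    return {email: names for email, names in seen.items() if len(names) > 1}

contacts = [
    {"name": "Pat Jones", "email": "pat.jones@example.com"},
    {"name": "P. Jones",  "email": "Pat.Jones@example.com "},  # same person
    {"name": "Ana Ruiz",  "email": "ana.ruiz@example.com"},
]

print(find_duplicates(contacts))
# {'pat.jones@example.com': ['Pat Jones', 'P. Jones']}
```

Note that even this toy example needs normalization (trimming whitespace, lowercasing) before duplicates become visible — real systems contend with far messier variation.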
There is no single, accurate definition for bad data. All of the following factors are descriptors of bad data.
Erroneous data is often the fault of poor manual data entry. These errors are often worse than having no data at all, as they use up valuable resources and can be difficult to catch.
Because it paints a distorted picture, incomplete data leads to faulty decision making.
Data sourced from multiple locations and platforms is usually formatted differently, which can lead to faulty interpretations.
Duplicate data presents many problems, as the section above discusses.
Analyzing bad data leads to making bad decisions.
Any one of the descriptors for bad data suffices to consider it bad. High-quality data, by contrast, exhibits all of the following qualities together:
Meets defined business rules or constraints for data
Corrected for all errors
Thorough within the bounds of all available data
Formatted to avoid ambiguous interpretation
Using the same units and measures for all data
Able to find and use the source data
Recently updated data
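Several of these qualities can be expressed as automated checks. The sketch below validates a record against three of them (completeness, a validity rule and timeliness); the field names, rules, thresholds and `quality_issues` helper are hypothetical assumptions, not a standard:

```python
from datetime import date

# Validate a record against a few sample quality criteria.
# Field names, business rules and thresholds are hypothetical.

REQUIRED_FIELDS = {"name", "email", "revenue_usd", "updated"}

def quality_issues(rec):
    issues = []
    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not rec.get(field):
            issues.append(f"missing {field}")
    # Validity: an example business rule — revenue must be non-negative.
    if rec.get("revenue_usd", 0) < 0:
        issues.append("negative revenue")
    # Timeliness: flag records not updated within the last year.
    updated = rec.get("updated")
    if updated and (date.today() - updated).days > 365:
        issues.append("stale record")
    return issues

rec = {"name": "Acme", "email": "", "revenue_usd": -5, "updated": date(2000, 1, 1)}
print(quality_issues(rec))
```

Checks like these can run automatically at the point of entry, which is how software stops bad data before it becomes problematic.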
Some of these points are about cleansing data, while others speak to tuning its structure to make it more useful and reliable in AI projects.
Data cleansing and tuning are the two essential processes that turn bad data into high-quality data. Data cleansing fixes errors, removes duplicates and adds the data necessary to complete records. Data tuning structures the data to be consistent and usable regardless of the source.
It is not unusual for data cleansing (or data cleaning) to be used as an umbrella term for both processes. The distinctions can be confusing. Here’s a simple way to look at how the terms used to describe data evolve as it is cleaned and tuned.
Data that has not been cleansed is raw data
Once raw data is cleansed it becomes technically accurate data
Technically accurate data that has been tuned is uniform data
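That progression can be sketched as a two-stage pipeline: a cleansing pass removes duplicates and incomplete rows (producing technically accurate data), then a tuning pass normalizes formats across sources (producing uniform data). The records, field names and helpers below are illustrative assumptions:

```python
# Two-stage pipeline: cleanse raw records, then tune them into a uniform
# format. Records, field names and formats are hypothetical.

def cleanse(raw):
    """Remove duplicates and drop records with missing phone numbers."""
    seen, out = set(), []
    for rec in raw:
        key = rec["name"].strip().lower()
        if key in seen or not rec["phone"]:
            continue
        seen.add(key)
        out.append(rec)
    return out

def tune(cleansed):
    """Normalize phone numbers from mixed source formats to digits only."""
    for rec in cleansed:
        rec["phone"] = "".join(ch for ch in rec["phone"] if ch.isdigit())
    return cleansed

raw = [
    {"name": "Sam Lee",  "phone": "(555) 010-2000"},
    {"name": "sam lee ", "phone": "555.010.2000"},   # duplicate
    {"name": "Kim Park", "phone": ""},               # incomplete
]

uniform = tune(cleanse(raw))
print(uniform)  # [{'name': 'Sam Lee', 'phone': '5550102000'}]
```

The separation matters: cleansing decides which records survive, while tuning decides what shape the survivors take, and each stage can be tested independently.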
While data scientists are best known for creating highly accurate predictive models using uniform data, it’s not unusual for those scientists to spend half their time (or more) cleaning and tuning the data.
Some of the challenges of data cleansing include:
Lack of clarity as to what is causing anomalies in the system
Records that have been partially deleted and cannot be accurately completed
Time and expense of ongoing data maintenance
Difficulty in building a data cleansing graph ahead of time to help direct the process
But the excessive costs of bad data and the rewards of high-quality data more than justify the time and resources devoted to cleansing and tuning data.
Before you get started with a data-cleansing project, you have to be sure of what you want to accomplish and how you will accomplish it. Keeping these best practices in mind will give you a good conceptual footing.
Think about who will be using the results from the data, not just the person doing the analysis.
Make sure that only the cleanest data is used in the system.
Stop bad data before it becomes problematic by choosing software that gives you that capability.
When working with large datasets, use small, representative samples for exploration and testing; large samples unnecessarily increase prep time and slow performance.
Catch errors before they can be replicated throughout the data.
This is another list of best practices, but this one is a step-by-step checklist for your cleansing project.
Most large data cleansing projects are now performed with Artificial Intelligence — or, to be more precise, Machine Learning. Here’s why.
The amount of data that organizations collect increases significantly every year. When cleansing data without the aid of Machine Learning, whenever bad data is found, the data scientist or data analyst has to write a new rule to tell the system what to do with it. Because new data (including bad data) enters the system at an increasingly rapid rate, the data scientist is constantly playing catch-up.

Consider the scope of the problem. Data creates patterns, and bad data is found by recognizing anomalies in those patterns. The more data there is, the more complex the patterns become, making them more difficult for humans to analyze. That means some anomalies go unfound unless the data scientist spends a massive amount of time looking for them among increasingly complex patterns.
Every minute the data scientist spends cleaning up data is a minute they are not using the data to make the organization more productive. And if they can’t catch up with the bad data, they are effectively caught in a trap like a hamster on a wheel. This is not as uncommon a problem as you might think.
Machine Learning frees the data scientist from the trap. The data scientist creates a learning model to predict matches. As more data enters the system, Machine Learning further fine-tunes the model.
With manual data cleaning using standard computer programs, the problem worsens as more data is added. With Machine Learning, the problem shrinks as more data is added. The key is to focus Machine Learning on systematically improving how it analyzes, rates and utilizes data so that it becomes the expert at correcting, updating, repairing and improving data.
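As a toy illustration of the idea, the sketch below replaces hand-written matching rules with a threshold learned from labeled example pairs, using string similarity as the single matching feature. Real systems use far richer features and models; the names, labels and `learn_threshold` helper are hypothetical:

```python
from difflib import SequenceMatcher

# Toy learned duplicate matcher: instead of hand-writing rules, pick the
# similarity threshold that best separates labeled example pairs.
# Names, labels and the single-feature "model" are illustrative only.

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def learn_threshold(labeled_pairs):
    """Choose the threshold with the fewest errors on the labeled pairs."""
    candidates = [round(t / 20, 2) for t in range(1, 20)]  # 0.05 .. 0.95
    def errors(t):
        return sum((similarity(a, b) >= t) != is_match
                   for a, b, is_match in labeled_pairs)
    return min(candidates, key=errors)

training = [
    ("Jon Smith", "John Smith", True),
    ("Jon Smith", "J. Smith",   True),
    ("Jon Smith", "Mary Adams", False),
    ("Acme Corp", "ACME Corp.", True),
    ("Acme Corp", "Zenith Ltd", False),
]

t = learn_threshold(training)
print(similarity("Jon Smyth", "John Smith") >= t)  # True: flagged as a match
```

The point of the sketch is the workflow, not the model: as more labeled pairs arrive, rerunning `learn_threshold` refines the decision boundary automatically, instead of requiring a human to write yet another rule.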
That puts the data scientist back to doing what they were hired to do.
Clean and tuned data is a necessity for numerous parts of an organization, including:
Reach the right people with the right messages at the right time
Rely on a complete and accurate view of your customer
Be confident that you are complying with all regulations for every customer
Better data = better decisions
Moreover, tuned data is needed to conduct any type of analysis with confidence.
The fastest and most accurate way to cleanse and tune data is by incorporating AI. In less than a day, Machine Learning can produce data models more accurate than those a data scientist armed only with standard computer programs could produce in weeks. KNIME is one example of a tool often used to build automated data cleaning processes that integrate AI; for other tools, see the technology section.
There is simply no way to be sure that uncleansed data is accurate enough to use in any type of campaign. Data cleansing and tuning are necessities, and the more effective and efficient way to accomplish them is with Artificial Intelligence.