
Demystifying deduplication and merge/purge

Posted on 03/12/2019 at 10:30 by Corpdata


All organisations build up databases as they go about their business.


This data is one of the organisation's key assets; it's precious. Unfortunately, though, its value can be eroded over time. We recently published an article to help demystify data cleansing; this one sets out to do the same for deduplication and merge/purge.


But first we need to consider the problems duplication causes.

How do duplicates harm the value of data?

Duplicate entries in some systems are an obvious problem: few organisations would accept duplication in their payroll systems because of the potential for direct financial cost (double paying). The same applies to the purchase ledger and other similar data stores. The risks of duplication are just too high. Organisations have evolved strategies to cope, such as dedicating a team, or even a department, to these functions. Unique ID values, such as National Insurance numbers and approved supplier numbers, all help reduce the potential for errors and fraud.


Other data resources can superficially seem to pose less risk, perhaps those where money flows IN rather than OUT.


Customer data is often seen as a by-product of the sales process and prospect data is merely the fuel for the sales and marketing operations.


There is a natural tendency to accumulate duplicates within this data. The customer who bought once before is very likely to have similar requirements again. The prospect who was targeted previously is likely to be targeted again in the future.


Without deliberate action, duplicates will probably creep in.


Whilst we can understand this tendency within organisations, the wider world can see duplication as a lack of care or respect at all sorts of levels, from environmental impact to individual preferences. In the modern world, sending multiple mailers to the same recipient may be more than simply wasteful. Failing to manage relationships with your customers correctly can be seen as wrong, or even rude.


Furthermore, recent changes to the data protection landscape, in particular GDPR, magnify the risks, meaning duplicates may expose organisations to significant financial costs.

GDPR and duplication

To correctly appreciate the risks, it is important to align our thinking with the law. GDPR changes a few things, but most fundamental is the idea:


you are guilty unless you prove you are innocent


The law also says data subjects have an absolute right to opt out of direct marketing. It places the duty for keeping data accurate and up to date in your lap. Taken together, these things significantly increase the risks posed by duplicates.


Imagine this scenario. Your database contains two records for the same person. You are taking reasonable steps to manage opt-outs, but only one record is marked as opted out and you contact the other. You could be hard pressed to prove you are not responsible.


What is more, the recipient knows they said to stop contacting them, and now it's happened again!


It's almost tailor made to aggravate them:
  • they have an opinion
  • they exercised their right
  • it wasn't respected!

If there are multiple data stores, you can immediately see how the complexity grows. Finding all the duplicates in one file is difficult, but you might need to consider data held in many locations, or by diverse teams or individuals.


Eventually, you are likely to need to deal with duplicates in different TYPES of data store, such as customers and prospects. The decision about which record to remove is straightforward in this case, but what about duplicates between ‘Equipment Customers’ and ‘Service Contract Customers’?


These challenges have led many of the largest organisations towards complex approaches and systems for ‘Master Data Management’, the holy grail: a single, authoritative, ‘official’ repository for the organisation’s data.


MDM solutions are well beyond the reach, and the needs, of most organisations, who nevertheless still need to manage duplication.

What are deduplication and merge/purge?

Deduplication, or deduping, is the process of removing duplicates from a set of data. When more than one set of data is being combined, or merged, it is a good time to undertake a ‘purge’ of duplicates. Merge/purge, therefore, is the operation of bringing more than one data set together and in the process ensuring duplicates between the sources are removed. Purge may also be used to describe the removal of records from one dataset, where they appear in another.
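
If your data is already in a machine-readable form, the basic idea is easy to sketch in a scripting language for exact repeats; as the rest of this section explains, the hard part is spotting duplicates that are not exact. The following is only a minimal illustration using Python's pandas library; the file names (customers.csv, new_leads.csv) and the 'email' column used for matching are assumptions, not a recommendation to match on email alone.

```python
import pandas as pd

# Hypothetical file and column names -- adapt to your own layout
customers = pd.read_csv("customers.csv")   # the list you want to keep intact
new_leads = pd.read_csv("new_leads.csv")   # the data being merged in

# Deduplication: remove exact repeats within a single dataset
customers = customers.drop_duplicates(subset="email")

# Purge: drop incoming records that already exist in the master list
new_leads = new_leads[~new_leads["email"].isin(customers["email"])]

# Merge: combine whatever remains into one dataset
combined = pd.concat([customers, new_leads], ignore_index=True)
```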


These tasks seem well suited to automation using computers. It's data, after all, and computers are really fast at these things, right?


An amazing truth about duplicates is they are often really obvious to a human eye, yet it is remarkably difficult to explain how that judgement is arrived at. Humans are extremely good at classifying things, and we do this by spotting similarities. We are so good at it, we can see similarities even when the data is badly formatted, incomplete or has errors on one, or even both, sides. The fact that we cannot easily explain how we know data is duplicated makes the task challenging to automate.


In fact, just like matching, deduping is something of an art form. Because we find it so difficult to describe (or codify) how we identify a duplicate, we tend to use complex techniques, all of which contribute to a ‘degree of certainty’. As we mentioned in our earlier article demystifying data cleansing, the process of ‘Matching’ is very similar to the process of ‘Deduping’.


The difference tends to be in the ‘degree of certainty’ needed for two records to be considered the same entity. The purpose of the matching or deduping will also affect the certainty required. Most organisations are much more anxious when removing records from their customer data than from their prospect data, for example. However, as a guide for simple circumstances, we tend to find that in the direct marketing arena the certainty required for matching is much higher than that required for deduplication.


For more complex operations, such as merging and purging multiple sales and prospecting databases, a hierarchical approach can be used: treat the files in descending order of importance, remove duplication where it is found in the less important sources, and often merge details from the lower files up into the surviving record to minimise any loss of ‘richness’.
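
For readers comfortable with a little scripting, here is a hedged sketch of that hierarchical idea in Python/pandas. The file names and the exact 'email' key are placeholders; real merge/purge uses far more sophisticated matching, but the priority logic is the point here.

```python
import pandas as pd

# Hypothetical file names, listed in descending order of importance
sources = ["customers.csv", "service_contracts.csv", "prospects.csv"]

frames = []
for rank, filename in enumerate(sources):
    df = pd.read_csv(filename)
    df["source"] = filename
    df["priority"] = rank          # 0 = most important
    frames.append(df)

combined = pd.concat(frames, ignore_index=True).sort_values("priority")

# For each duplicate cluster (naively keyed on an exact email match here),
# keep one record and fill its gaps from the less important sources.
# GroupBy.first() takes the first non-null value per column, so details
# 'merge up' towards the surviving, highest-priority record.
# (Records with no email at all are ignored by this naive sketch.)
merged = combined.groupby("email", as_index=False).first()
```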


PRO TIP

For the very highest value data, it is often wise to have the final decision on data removal taken by the people involved, often the sales team. The process then identifies ‘clusters’ of probable duplicates and presents them to staff members, so merge/purge decisions are made with human-level understanding.
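
As a rough illustration of how such clusters might be produced for review, here is a sketch in Python/pandas. The 'postcode' and 'company' columns and the crude blocking key are assumptions; the aim is simply to group probable duplicates and export them for a human decision, not to delete anything automatically.

```python
import pandas as pd

df = pd.read_csv("customers.csv")

# A deliberately crude 'blocking' key: postcode plus the start of the company
# name. Real systems use several such keys and fuzzier comparisons.
df["cluster_key"] = (
    df["postcode"].fillna("").str.replace(" ", "").str.upper()
    + "|"
    + df["company"].fillna("").str.upper().str[:6]
)

# keep=False marks every member of a duplicate cluster, not just the repeats
review = df[df.duplicated("cluster_key", keep=False)].sort_values("cluster_key")

# Hand the clusters to the people who know the accounts best
review.to_csv("possible_duplicates_for_review.csv", index=False)
```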


For all the advances in machine learning and AI, software still cannot reliably manage this feat. But you can improve data quality without it.

How to remove duplicates yourself

There are steps you can take yourself. We assume you will be using a spreadsheet or similar tool.


If you are working with multiple sets of data, your first task will be to create a suitable column layout to accommodate all the data from the various sources. This normally means adding extra columns to your master spreadsheet to accommodate each column you need from the others.


Experience tells us to always create a new, blank workbook, because even if you make mistakes you will be able to start again – time-consuming, but at least you don't lose anything!


... and save your work often; you will make mistakes, but at least you won't have to go back too far!


Import each dataset by adding and removing columns appropriately until the column layouts agree, then copy and paste the data from each spreadsheet into the master.


PRO TIP

Don’t forget to add a column to tell you which spreadsheet the data came from. You could also add a ‘record ID’ to allow you to put the data back in its original order.
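
If you prefer to script these steps rather than copy and paste, the same workflow might look something like this in Python/pandas. The file names and master column list are hypothetical; the point is the common layout, the 'source' label and the 'record_id'.

```python
import pandas as pd

# Hypothetical source files; the master layout lists every column you need
master_columns = ["company", "contact", "postcode", "telephone", "email"]
source_files = ["customers.xlsx", "prospects.xlsx", "exhibition_leads.xlsx"]

frames = []
for filename in source_files:
    df = pd.read_excel(filename)
    df = df.reindex(columns=master_columns)   # add missing columns, drop extras
    df["source"] = filename                   # which spreadsheet it came from
    df["record_id"] = range(1, len(df) + 1)   # original row order, for later
    frames.append(df)

master = pd.concat(frames, ignore_index=True)
master.to_excel("master.xlsx", index=False)
```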


Once all the data is in one place, basic duplication can be identified by sorting your data by various items, such as telephone number, email address, postcode and company name. This will help you see groups of similar records and highlight the more obvious challenges.


Data standardisation often helps to make similar data more comparable. For example, it doesn't matter how you store phone numbers, as long as you do it the same way every time. 01626777400 is the same as (01626) 777400 or +44 (1626) 777 400, or many other layouts, but managing your data, and especially identifying duplication, is much more difficult where no standards, or many standards, are applied.
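
As an illustration, a small normalising function can reduce all of those layouts to one comparable form. This is only a sketch for UK-style numbers; it ignores extensions and international edge cases.

```python
def standardise_uk_phone(raw: str) -> str:
    """Reduce a UK phone number to a single, comparable format.

    A rough sketch only: keep the digits and convert a leading 44 to 0.
    """
    digits = "".join(ch for ch in str(raw) if ch.isdigit())
    if digits.startswith("44"):
        digits = "0" + digits[2:]
    return digits

# All of these collapse to '01626777400'
for number in ["01626 777400", "(01626) 777400", "+44 (1626) 777 400"]:
    print(standardise_uk_phone(number))
```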


You may choose to standardise existing data, but beware of damaging the data quality. For example, replacing every occurrence of "St." with "Street" will leave you with some odd data, such as "Street Michael's Street".


PRO TIP

If you want to standardise your data without damaging it, you can create modified data in extra columns. We call these ‘keys’ and they are only used in the matching/deduping process, to help you identify similarity more easily.


All spreadsheet tools allow you to perform 'functions' on data in one column and put the output in another. This is a good way to create 'keys'.


BEWARE: if you change data in the 'source' field, it will affect data in the 'key' field.
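
The same ‘key column’ idea can be scripted. The sketch below (assuming a 'company' column and a short, hypothetical list of common suffixes) derives a comparison key while leaving the original data untouched; unlike a spreadsheet formula, though, the key is computed once, so it must be regenerated if the source column changes.

```python
import pandas as pd

df = pd.read_excel("master.xlsx")   # hypothetical combined workbook

def company_key(name: str) -> str:
    """A crude comparison key: uppercase, strip punctuation and common suffixes."""
    key = "".join(ch for ch in str(name).upper() if ch.isalnum() or ch == " ")
    for suffix in (" LIMITED", " LTD", " PLC", " LLP"):
        if key.endswith(suffix):
            key = key[: -len(suffix)]
    return key.strip()

# The key lives in its own column; the 'company' column itself is never altered.
# Note: it will NOT update automatically if 'company' is later changed.
df["company_key"] = df["company"].apply(company_key)
df = df.sort_values("company_key")
```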



Having identified potential duplicates, you are left with a manual process of tidying up the data. As mentioned above, this is probably the safest method. If you find this all a bit intimidating, you can always get help from a data bureau or a data specialist like Corpdata.


Bear in mind though, cost should only be one of the factors in selecting your supplier.

Choosing a merge/purge or deduping supplier

Here are three important considerations to help you choose a suitable third party for data operations (under GDPR, they are classed as a ‘data processor’):

  1. Supplier trustworthiness
  You will be passing this very precious resource to a third party. Ask yourself how valuable that would be to your greatest competitor, or how damaging it would be for data subjects if their data were released. You need to ensure your data will be secure and not passed to anyone else, or used for any purpose other than cleaning. Our ‘Due Diligence Questions to Ask Data Suppliers’ are an excellent place to start.

  2. GDPR compliance
  If your database contains personal data, it is important to conduct suitable due diligence on your prospective data partner. The key issues are about their undertakings as a data processor to you as data controller. This is covered by Article 28 of GDPR. Once again, our ‘Due Diligence Questions to Ask Data Suppliers’ are a great start.

  3. Quality of matching
  Deduping is an ‘artistic science’. It is far from simple to create a good deduping tool, and a great tool is one which is tuned to return (almost) no ‘false positives’ (see ‘About Corpdata Matching Systems’ in our previous article). Suppliers claiming high strike rates typically have a significant proportion of false positives in their results.


Details, Details

False Positive Strikes

A false positive is where a ‘duplicate’ is reported, but it isn’t really.


Examples of this often arise from overly simplistic matching, such as matching on telephone numbers or company names alone.

This can lead to every record with the same number being treated as the same entity. Sometimes this is correct, but often a centralised 0800 number masks a large, geographically diverse organisation.


This is important because you might remove records which appear to be duplicates and risk seriously damaging the value of your data.

Whilst a few records lost from a prospecting list may be forgivable, indiscriminate deletion from your customer data based on a bad match could cost a fortune and be very embarrassing.

How to ensure good results

We suggest you shouldn’t take too much for granted. You should understand, in broad terms, what is being done and why. Any good service provider will talk you through what they are going to do. You should always undertake a test of the data treatment process on a subset of your data.


  • Take a subset of your data which you know or suspect contains duplicates, choosing a postcode or region is often good.
  • If you are using multiple datasets, take the same region from each.
  • If you are using more than one dataset, decide upon a priority.
  • Ask for a report on what they would recommend for each record.
  • Evaluate how the proposed actions would impact your data.
  • Only proceed if you are happy with any quality trade-offs you find.

We are happy to help

If you have any questions or feel you need a bit more guidance, please feel free to call us on (01626) 777400, we are always happy to help. You have nothing to lose.

Summary

Some simple steps to help you merge and purge data:


  • Create a single column layout suited to all data sets
  • Add 'source dataset' labels
  • Add 'source record number' labels if helpful (for returning data as it was originally supplied)
  • Standardise data - perhaps create 'keys' in additional columns
  • Order your data by the standardised values to highlight blocks of similar data
  • Decide how to deal with duplicated data you find

  • ... or contact Corpdata: we can do it for you, save you lots of hassle and provide excellent results.