Posted on 03/12/2019 at 10:30 by Corpdata
All organisations build up databases as they go about their business.
The data is one of the key assets of the organisation; it's precious. Unfortunately, though, its value can be eroded over time. We recently published an article to help demystify data cleansing; this item sets out to do the same for de-duplication and merge/purge.
But first we need to consider the problems duplication causes.
Duplicate entries in some systems are an obvious problem: few organisations would accept duplication in their payroll systems because of the possibility of direct financial costs (double paying). The same applies to the purchase ledger and other similar data stores. The risks of duplication are just too high. Organisations have evolved strategies to cope, such as having a team, or even a department, given over to these functions. Unique ID values, such as National Insurance numbers and approved supplier numbers, all help reduce the potential for errors and fraud.
Other data resources superficially seem to pose less risk, perhaps those where money flows IN rather than OUT.
Customer data is often seen as a by-product of the sales process and prospect data is merely the fuel for the sales and marketing operations.
There is a natural tendency to accumulate duplicates within this data. The customer who bought once before is very likely to have similar requirements again. The prospect who was a target previously remains likely to be within the targeting in the future.
Without deliberate action, duplicates will probably creep in.
Whilst we can understand this tendency within organisations, duplication can be seen by the wider world as a lack of care or respect at all sorts of levels, from the environmental to individual preferences. In the modern world, sending multiple mailers to the same recipient may be more than simply wasteful. Failing to manage relationships with your customers correctly can be seen as wrong, or even rude.
Furthermore, recent changes to the data protection landscape, in particular GDPR, magnify the risks, meaning duplicates may expose organisations to significant financial costs.
To correctly appreciate the risks, it is important to align our thinking with the law. GDPR changes a few things, but most fundamental is the idea:
you are guilty unless you prove you are innocent
The law also says data subjects have an absolute right to opt out of direct marketing. It places the duty for keeping data accurate and up to date in your lap. Taken together, these things significantly increase the risks posed by duplicates.
Imagine this scenario. Your database contains two records for the same person. You are taking reasonable steps to manage opt-outs. But if only one record is marked as an opt-out and you contact the person via the other, you could be hard pressed to prove you are not responsible.
What is more, the recipient knows they said to stop contacting them, and now it's happened again!
It's almost tailor made to aggravate them:
- they have an opinion
- they exercised their right
- it wasn't respected!
If there are multiple data stores, you can immediately see how the complexity grows. Finding all the duplicates in one file is difficult, but you might need to consider data held in many locations, or by diverse teams or individuals.
Eventually, you are likely to need to deal with duplicates in different TYPES of data store, such as customers and prospects. The decision about which record to remove is straightforward in this case, but what about duplicates between ‘Equipment Customers’ and ‘Service Contract Customers’?
These challenges have led many of the largest organisations towards complex approaches and systems to provide ‘Master Data Management’, the holy grail! A single, authoritative, ‘official’ repository for the organisation’s data.
MDM solutions are well beyond the reach, and the needs, of most organisations, who nonetheless still need to manage duplication.
Deduplication, or deduping, is the process of removing duplicates from a set of data. When more than one set of data is being combined, or merged, it is a good time to undertake a ‘purge’ of duplicates. Merge/purge, therefore, is the operation of bringing more than one data set together and in the process ensuring duplicates between the sources are removed. Purge may also be used to describe the removal of records from one dataset, where they appear in another.
These tasks seem well suited to automation using computers. It’s data after all, and computers are really fast at these things right?
An amazing truth about duplicates is they are often really obvious to a human eye, yet it is remarkably difficult to explain how that judgement is arrived at. Humans are extremely good at classifying things, and we do this by spotting similarities. We are so good at it, we can see similarities even when the data is badly formatted, incomplete or has errors on one, or even both, sides. The fact that we cannot easily explain how we know data is duplicated makes the task challenging to automate.
In fact, just like matching, deduping is something of an art form. Because we find it so difficult to describe (or codify) how we identify a duplicate, we tend to use complex techniques, all of which contribute to a 'degree of certainty'. As we mentioned in our earlier article demystifying data cleansing, the process of 'Matching' is very similar to the process of 'Deduping'.
The difference tends to be in the ‘degree of certainty’ needed for two records to be considered the same entity. The purposes of the matching or deduping will also affect the certainty required. Most organisations are much more anxious when removing records from their customer data than they are from their prospect data for example. However, as a guide for simple circumstances, we tend to find in the direct marketing arena, the certainty requirement for matching is much higher than required for deduplication.
For more complex operations, such as the merging and purging of multiple sales and prospecting databases, a hierarchical approach can be used: treat the files in descending order of importance, remove duplication wherever it is found in the less important sources, and often merge details from the lower files up towards the surviving record, to minimise any loss of 'richness'.
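The hierarchical approach described above can be sketched in a few lines of code. This is a minimal illustration, not a production deduper: it assumes each record is a simple dictionary and that a crudely normalised company name plus postcode is a good enough match key (real matching, as discussed later, is far fuzzier than this).

```python
def match_key(record):
    """Build a crude duplicate-detection key (an illustrative assumption:
    normalised company name + postcode identifies an organisation)."""
    name = "".join(ch for ch in record.get("company", "").lower() if ch.isalnum())
    postcode = record.get("postcode", "").replace(" ", "").upper()
    return (name, postcode)

def merge_purge(sources):
    """Merge data sets given in descending order of importance.

    The first sighting of a record survives; duplicates found in later
    (less important) sources are purged, but any fields missing from the
    surviving record are merged up from the discarded duplicate, to
    preserve 'richness'.
    """
    survivors = {}
    for source in sources:                      # most important file first
        for record in source:
            key = match_key(record)
            if key not in survivors:
                survivors[key] = dict(record)   # first sighting survives
            else:
                kept = survivors[key]
                for field, value in record.items():
                    if not kept.get(field) and value:
                        kept[field] = value     # fill gaps from the lower file
    return list(survivors.values())
```

For example, merging a customer file (most important) with a prospect file would purge the prospect copy of a shared record but keep, say, an email address that only the prospect record held.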
For all the advances in machine learning and AI, software still cannot reliably manage this feat. But you can improve data quality without it.
There are steps you can take yourself. We assume you will be using a spreadsheet or similar tool.
If you are working with multiple sets of data, your first task will be to create a suitable column layout to accommodate all the data from the various sources. This normally means adding extra columns to your master spreadsheet to accommodate each column you need from the others.
Experience tells us always to work in a new, blank workbook, so that even if you make mistakes you can start again from the untouched originals – time consuming, but at least you don't lose anything!
... and save your work often; you will make mistakes, but at least you won't have to go back too far!
Import each dataset by adding and removing columns appropriately until the column layouts agree, then copy and paste the data from each spreadsheet into the master.
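If you are comfortable with a little scripting, the same column-alignment step can be automated. The sketch below, using only Python's standard csv module, assumes your sources are CSV exports and that you have already decided on a master column layout; the file and column names are purely illustrative.

```python
import csv

def combine(filenames, master_columns, out_path):
    """Write one master CSV whose header is master_columns.

    Fields a source lacks are padded with blanks, and columns not in
    the master layout are dropped, so every source fits one layout.
    """
    with open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=master_columns)
        writer.writeheader()
        for name in filenames:
            with open(name, newline="") as src:
                for row in csv.DictReader(src):
                    # keep only the master columns, blank where missing
                    writer.writerow({c: row.get(c, "") for c in master_columns})
```

The dictionary lookup by column name, rather than by position, means the sources' columns can appear in any order.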
Once all the data is in one place, basic duplication can be identified by sorting your data by various items, such as telephone number, email address, postcode and company name. This will help you see groups of similar records and highlight the more obvious challenges.
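The same sort-and-scan idea can be expressed programmatically. This is a quick sketch, assuming rows are dictionaries and that sorting by postcode then company name is a sensible first pass; the field names are illustrative and you would try several combinations in practice.

```python
from itertools import groupby

def likely_duplicate_groups(rows, keys=("postcode", "company")):
    """Sort rows by the chosen fields and return any groups of two or
    more records sharing the same values - candidates for manual review."""
    sort_key = lambda r: tuple(str(r.get(k, "")).lower() for k in keys)
    ordered = sorted(rows, key=sort_key)        # sorting brings lookalikes together
    groups = []
    for _, grouped in groupby(ordered, key=sort_key):
        grouped = list(grouped)
        if len(grouped) > 1:                    # only flag actual clusters
            groups.append(grouped)
    return groups
```

Like sorting in a spreadsheet, this only surfaces exact matches on the chosen fields, which is why standardising the data first (below) matters so much.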
Data standardisation often helps to make similar data more comparable. For example, it doesn't matter how you store phone numbers, but you should do the same every time. 01626777400 is the same as (01626) 777400 or +44 (1626) 777 400, or many other layouts, but managing your data, especially identifying duplication, is much more difficult where no standards, or many standards, are applied.
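As a concrete example, here is one possible normalisation for UK phone numbers: strip everything except digits and rewrite a leading 44 country code as 0. The choice of target format is an assumption; the point, as above, is simply to pick one convention and apply it everywhere.

```python
import re

def normalise_phone(raw):
    """Reduce a phone number to one canonical layout: digits only,
    with a leading +44 rewritten as 0 (a UK-centric assumption)."""
    digits = re.sub(r"\D", "", raw)     # drop spaces, brackets, + signs
    if digits.startswith("44"):
        digits = "0" + digits[2:]       # +44 1626 ... -> 01626 ...
    return digits
```

All three layouts from the example above now compare equal, so a simple sort will place them side by side.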
You may choose to standardise existing data, but beware of damaging the data quality. For example, replacing every occurrence of "St." with "Street" will leave you with some odd data, such as "Street Michael's Street".
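To illustrate a safer approach than blind find-and-replace, the sketch below only expands "St" or "St." when it falls at the end of the address line, leaving saint names earlier in the string untouched. The trailing-position rule is a simplification that will not cover every address, but it avoids the "Street Michael's Street" trap.

```python
import re

def expand_street(address):
    """Expand a trailing 'St' / 'St.' to 'Street'.

    A naive address.replace("St.", "Street") would turn
    "St. Michael's St." into "Street Michael's Street"; anchoring the
    substitution to the end of the line avoids that.
    """
    return re.sub(r"\bSt\.?$", "Street", address.strip())
```

The same end-anchoring idea applies to other abbreviations ("Rd", "Ave") that can also begin a proper name.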
Having identified potential duplicates, you are left with a manual process of tidying up the data. As mentioned above, this is probably the safest method. If you find this all a bit intimidating, you can always get help from a data bureau or a data specialist like Corpdata.
Bear in mind though, cost should only be one of the factors in selecting your supplier.
Here are three important considerations to help you choose a suitable third party for data operations (under GDPR, they are classed as a 'data processor'):
You will be passing this very precious resource to a third party. Ask yourself how valuable that would be to your greatest competitor, or how damaging it would be for data subjects if their data were released. You need to ensure your data will be secure and not passed to anyone else, or used for other purposes than cleaning. Our ‘Due Diligence Questions to Ask Data Suppliers’ are an excellent place to start.
If your database contains personal data, it is important to conduct suitable due diligence on your prospective data partner. The key issues are about their undertakings as a data processor, to you as data controller. This is covered by article 28 of GDPR. Once again our ‘Due Diligence Questions to Ask Data Suppliers’ are a great start.
Deduping is an ‘artistic science’. It is far from simple to create a good deduping tool, and a great tool is one which is tuned to return (almost) no ‘false positives’ (see ‘About Corpdata Matching Systems’ in our previous article). Suppliers claiming high strike rates typically have a significant proportion of false positives in the results.
We suggest you shouldn’t take too much for granted. You should understand, in broad terms, what is being done and why. Any good service provider will talk you through what they are going to do. You should always undertake a test of the data treatment process on a subset of your data.
If you have any questions or feel you need a bit more guidance, please feel free to call us on (01626) 777400, we are always happy to help. You have nothing to lose.