Keep up to date with us

We update our blog with regular posts to keep you up to speed on the world of B2B data.

Pro-Tips for Calc - Preserving Row Order When Cleaning Data

Posted on 27/01/2020 at 10:44By Corpdata

Pro-tips: Spreadsheets and your data

In this article we look at one issue which our customers often have trouble with. We mentioned the issue of duplication and it's risks in our recent articles…
Data management you can trust

Pro-tips: Spreadsheets and your data



In this article we look at one issue which our customers often have trouble with. We mentioned the issue of duplication and it's risks in our recent articles 'Demystifying Data Cleansing' and 'Demystifying deduplication and merge/purge'.

Having decided upon a suitable order, people often work through data in spreadsheets methodically. Changing the order of the rows can be an inhibitor in cleaning up data. Here we provide simple techniques to help you overcome these challenges.

We will provide instructions for two of the most commonly used spreadsheets, Microsoft Excel, we will use 2019 Professional Plus edition, and LibreOffice Calc (5.3.6.12 to be precise). Other spreadsheets have similar methods of working but these instructions may need interpreting. We are working on PCs, but similar instructions will apply to other platforms such as Mac.

These instructions are for LibreOffice Calc. To see the instructions for Excel, click here.

Here is a view of the data we will be working with.Starting SpreadsheetIt's a fairly well populated and orderly example data set. Of course, it's fictitious to prevent personal data privacy issues! (Some of you may recognise where the inspiration came from.)

You can download this dataset to work with.

Keeping order


Data sometimes contains 'natural' orders, things like postcode, or name. On other occasions, there is no intrinsic ordering key within data. Then we need to create one. There is no 'right place' to create an index key column, though for simplicity, creating a new column A on the left works well.

Right-click on the column header for column A, and from the context menu select 'Insert Columns Left'. This will leave your spreadsheet looking like this:Added Index Column

Adding row IDs


Now we have a column into which we can put a sequential number. This is where we can use the accounting tools within spreadsheets to our advantage, but there are a few 'gotchas' to watch out for.

You can include the title row, row 1 in the index numbering or not. I will include it.

  1. Click in cell A1 and type the number 1.

  2. Click in cell A2 and type: =A1+1 and press Enter. This should now show the number 2 in the cell A2, like this:Added Row Number


In case you didn't get what you just did, you created a 'formula' which says "take the value from cell A1, add 1 to it and show the result here. You can see the formula if you click in cell A2, if shows in the 'input line' box like this:Added Row NumberFormulae always start with =. We won't go into this here, but there are an awful lot of possibilities when you become familiar with using formulae.

Now we just need to copy the formula down through all the other rows we want populated.

  1. Click in cell A2 then press the 'Control' and and the 'C' key together. This is often written as Control + C, and it copies the contents you have highlighted, in this case cell A2.

  2. Now click in cell A3, and holding down the 'shift' key, press the 'down-arrow' until the highlighting covers down to A14, like this:Added Row Number

  3. Now press Control + V to paste the formula into all the highlighted cells. Your spreadsheet should look like this:Added Row Number


BEWARE: GOTHCA ALERT
The numbers in the first column are dynamic, they are all based off the number in cell A1, so if we re-order the rows at this point, the IDs will change! We need to make them 'static'.

  1. Click in cell A1 and highlight down to A14. You can do this by pressing Control + Shift + Down-Arrow, now press Control + C to copy the contents.

  2. We now need to do a 'special' paste operation, NOT Control + V.

  3. Right click on cell A1 then choose 'Paste Special' from the context menu. Make sure 'Text' is selected and 'Formulas' is not selected. Press OK.Added Row Number


We can check this has worked, by clicking in cell A2, we can see in the 'input line' the value in the cell is simply '2', and no longer contains the formula. This number is now fixed.

Sorting to highlight issues


Now we have a way to put our data back in the right order, we can start to manipulate it. With our small data set this is easy, we can see all the data on-screen at once. The techniques we are using here come into their own when we have much more data, but this is just a learning example.

We will look for duplicates by sorting the data by different columns.

BEWARE: GOTHCA ALERT

We need to make sure we select all the data, from the first column to the last column, and from the first row to the last row. If we only select part, we will only sort that part, thus scrambling the data.

Remember, UNDO is a very useful function. To undo your last action, press Control + Z

Let's see if we have any duplicates by email address.

  1. Click the small blank box above 1 and to the left of A, here:Sorting data to find duplicates

  2. This will highlight, or 'select' the entire spreadsheet, like this:Sorting data to find duplicates

  3. SUPER PRO-TIP: Unselect the title row to leave it at the top and not include it in the sort, holding down Control and clicking the row number 1 at the left, here:Sorting data to find duplicates

  4. The spreadsheet should now look like this:Sorting data to find duplicates

  5. Now, on the top menu bar, click 'Data' -> 'Sort...' and from the context menu choose 'Column H' (which contains email addresses), as 'Sort Key 1'. Click OK.Sorting data to find duplicates

  6. Click on the column header H, and drag the right-hand side of the column header to the right to display the full width of the data.Sorting data to find duplicates

We have a range of different email addresses, in fact we have 3 at the bottom that look quite similar, but are actually different.
It's tricky to identify duplicates, but we can use formulae again!
  1. Right click on column Header I and choose 'Insert Column Left' from the context menu.

  2. Click in cell I2 and type =IF(H2=H3,"DUPE","OK") (Spreadsheets are picky, you must use double quotes around the DUPE and OK text.) This should show OK in I2.

  3. Now, just like before, copy the formula down from cell I2 to cells I3 to I14:
    Click I2, press Control + C

  4. Press Down-arrow

  5. Press Control + Shift + Down-arrow

  6. Press Control + V

Your spreadsheet should now look like this:Sorting data to find duplicates
WHOA! We have a dupe! What's more, this would be tricky to spot another way, we have Gollum and Smeagol, we have different postcodes, we have different addresses too!

Naturally you would now want to do something with the data at this point, but that can be considered homework. We will simply remove the row marked DUPE.

Click the row number 10 marker at the left to highlight the single row with DUPE in the I column. Right click, and choose 'Delete Rows'.

Obviously you can sort by any columns, such as postcode, company name, surname and even combinations of them. Having done any work you need to do, you will want to reset your data to it's original order.

Resetting to the original order


You have probably guessed this is just another sort operation, but we will also do some tidying up.

First, let's remove column I. Right click in the Column Header I, and select 'Delete Columns' from the context menu.

Now let's restore the data to it's original order:
Select the entire spreadsheet by clicking the top left blank square, then de-select the title row if you want, though it doesn't matter.

Next select 'Data' -> 'Sort...' and choose 'Column A' for 'Sort Key 1' and click OK.Tidying UpYour data is now back in it's original order. You will notice ID 2 and the row that was alongside it is now missing. That is the row we deleted earlier.

Now to delete the ID column, click the Column A letter marker at the top to highlight the single column with the numbers we created in. Right click, and choose 'Delete Columns'.Tidying Up