Highlight Duplicate Records in an Excel List

With Excel's conditional formatting, you can highlight cells based on specific rules. There are some built-in rules available, and you can use formulas to create your own formatting rules.

Highlight Duplicates

In this example, we want highlight the duplicate rows in a table. There is a built-in rule for highlighting duplicate values in a single column, but nothing that will check an entire row.

highlightduplicatesribbon

So, we’ll create our own rule, and it will require a new column on the worksheet, before we add the conditional formatting.

Concatenate the Data

In the sample data, there are two identical rows, and these should be highlighted after we apply our conditional formatting.

highlightduplicaterows03

The first step is to use the CONCATENATE function to combine all the data into one cell in each row. Add a new heading in cell G1 – AllData – and in cell G2, enter this formula, to combine the data from all the cells in that row.

=CONCATENATE(A2,B2,C2,D2,E2,F2)

highlightduplicaterows04

Next, copy the formula down to the last row of data.

Apply the Conditional Formatting

Then, a conditional formatting rule is set, to color the rows that are duplicate records. We’ll use the COUNTIF function to check for duplicates in the AllData column.

=COUNTIF($G$2:$G$8,$G2)>1

highlightduplicaterows01

If there is more than one instance of a data combination, that indicates a duplicate row, and the cells in columns A:F will be coloured. The two rows with duplicate records are highlighted, so our conditional formatting formula worked!

highlightduplicaterows05

Download the Sample File

For detailed instructions, and to download the sample file, please visit my Contextures website: Highlight Duplicate Records in a List


Watch the Video

To see the steps for setting up the conditional formatting, please watch this video tutorial.

__

_________________________

17 comments to Highlight Duplicate Records in an Excel List

  • Jeff Weir

    Cool!

    What's also cool about tables is that you can filter by color. So using a bit of VBA, you can then filter to show only duplicate rows, and then do something with them.

  • John

    One very rare bug that may happen with this is if the data you have in two columns, when concatenated, looks the same. For example:


    Col A Col B Concatenated
    123 456 123456
    1234 56 123456

    This will show the rows above as duplicate when they are not.

  • Jeff Weir

    On the subject of delimiters, you can't beat the pipe character | (Shift + \)

  • Great Tutorial. But is there any way to highlight only a single instance of duplicate records? I mean if I could do so then it would be easier to just sort out the highlighted records and later delete them.

    • @Ankit Mac, if you only want to highlight the duplicate records, and not the first instance of a duplicated record, use this formula: =COUNTIF($G$2:$G2,$G2)>1

      That checks the rows down to the current row for duplicates, so for the first instance, the count would be 1, and that row wouldn't be highlighted.

  • Rick

    I'm working on a project where I need to find records where certain fields are close matches. For example in the Company field one entry may be "SC Johnson Wax" and another "S.C. Johnson Wax" or "S C Johnson Wax". Another would be where someone has misspelled a word so maybe they typed "Johnston" instead of "Johnson". These are all things that stick out to the human eye but how to look for them programmatically?

  • Jason M

    @Rick,

    Trying looking at using multiple wildcards, such as:

    =COUNTIF(A1,"*s*c*john*wax*")

    This will get you a hit on most entries, but of course if the user leaves out the word "wax", it won't work.

  • Steve

    @Rick

    I had to do something similar, and used VBA. Depending on your application, you might try something like this:
    http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

    It has sample code written in C#, but I was able to convert this to VBA without too much trouble and use it very effectively.

    My application was to select the correct employee (from an employee roster) even if the input name was spelled incorrectly.

    • Steve

      Here is the code I came up with:

      Public Function iDLD_EditDistance(ByVal sSource As Variant, ByVal sTarget As Variant) As Integer
      ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
      ' Comments: Determines the Damerau–Levenshtein distance
      ' (Edit Distance) of two strings. More information
      ' can be found at:
      ' http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
      '
      ' Arguments: sSource - Source String
      ' sTarget - Target String
      '
      ' Returns: Integer - This should be an integer value used
      ' by calling routine to determine a matching
      ' string based on lowest edit distance
      '
      ' -------------------------------------------------------------
      '
      '
      Dim m As Integer, n As Integer
      Dim i As Integer, j As Integer
      Dim i1 As Integer, j1 As Integer
      Dim xc As Integer
      Dim INF As Integer
      Dim DB As Integer
      Dim H()
      Dim sd()
      Dim sdlen As Long
      Dim tsd, tchr, tchrS, tchrT As String
      ' Quick check to resolve 0 lenthg stings
      If Len(sSource) = 0 Then
      iDLD_EditDistance = Len(sTarget)
      Exit Function
      Else
      If Len(sTarget) = 0 Then
      iDLD_EditDistance = Len(sSource)
      Exit Function
      End If
      End If

      ' Prevent inappropriately high Edit Distance values with short source strings.
      ' Otherwise, the Edit Distance for a string like "Weinb" would actually be smaller with "Way"
      ' rather than "Weinbrenner", given a false 'match'. It will also speed things up.
      sTarget = Left(sTarget, Len(sSource) + 1) ' note that if Len(sSource) + 1 > Len(sTarget) then it will return the whole string

      ' Remove Capital letters
      sSource = LCase(sSource)
      sTarget = LCase(sTarget)

      ' Initialize Variables
      m = Len(sSource)
      n = Len(sTarget)
      ReDim H(m + 2, n + 2)
      INF = m + n
      'Initialize Matrix
      H(0, 0) = INF
      For i = 0 To m
      H(i + 1, 1) = i
      H(i + 1, 0) = INF
      Next i
      For i = 0 To n
      H(1, i + 1) = i
      H(0, i + 1) = INF
      Next i
      'Initialize sd array
      sdlen = 1 ' Initialize sd with first char using tsd
      tsd = Mid(sSource, 1, 1) ' Enter first char of sSource into tsd
      For i = 2 To m ' pick up iteration with 2nd char of sSource
      tchr = Mid(sSource, i, 1)
      If InStr(1, tsd, tchr) = 0 Then ' If the current Char is not in tsd then add it
      sdlen = sdlen + 1
      tsd = tsd & tchr
      End If
      Next i
      For j = 1 To n ' Now do same for chars from sTarget
      tchr = Mid(sTarget, j, 1)
      If InStr(1, tsd, tchr) = 0 Then
      sdlen = sdlen + 1
      tsd = tsd & tchr
      End If
      Next j
      ReDim sd(1 To Len(tsd), 1 To 2) ' ReDim sd() to actual deminsions
      For i = 1 To sdlen ' Move tsd into sd()
      sd(i, 1) = Mid(tsd, i, 1)
      sd(i, 2) = 0
      Next i

      'Begin processing strings
      ' Not really sure how it all works, but it does! :D
      ' It uses a matrix, H(), in conjuction with sd()
      ' By flood filling the matrix, it will determine Edit Distance
      For i = 1 To m
      DB = 0
      For j = 1 To n
      For xc = 1 To sdlen
      tsd = sd(xc, 1)
      tchr = Mid(sTarget, j, 1)
      If tsd = tchr Then
      i1 = sd(xc, 2)
      Exit For
      End If
      Next xc
      j1 = DB
      tchrT = Mid(sTarget, j, 1)
      tchrS = Mid(sSource, i, 1)
      If tchrS = tchrT Then ' Filling matrix with new value
      H(i + 1, j + 1) = H(i, j) ' If = then previous low value
      DB = j
      Else
      H(i + 1, j + 1) = H(i, j) + 1 ' if then Min of adjacent matrix cells
      If H(i + 1, j + 1) > H(i + 1, j) + 1 Then H(i + 1, j + 1) = H(i + 1, j) + 1
      If H(i + 1, j + 1) > H(i, j + 1) + 1 Then H(i + 1, j + 1) = H(i, j + 1) + 1
      End If

      If H(i + 1, j + 1) > H(i1, j1) + (i - i1 - 1) + 1 + (j - j1 - 1) Then ' Magic formula that is used for the matrix
      H(i + 1, j + 1) = H(i1, j1) + (i - i1 - 1) + 1 + (j - j1 - 1) ' in the case of transposition
      End If
      Next j
      For xc = 1 To sdlen ' More magic that helps with transpostion
      tsd = sd(xc, 1)
      tchr = Mid(sSource, i, 1)
      If tsd = tchr Then
      sd(xc, 2) = i
      Exit For
      End If
      Next xc
      Next i ' Matrix is now complete

      iDLD_EditDistance = H(m + 1, n + 1)

      End Function

  • Hugo

    Thanks, Steve.
    Great to see your implementation.
    But how this could be used to color-mark cells having a short DLD?

    • Steve

      Hi Hugo,
      I will say upfront that my suggestion was beyond the original post on using the Conditional Formatting feature. Because of that, the implementation in more complicated, and required the use of VBA. However, the question poised in the comment I replied to was a more complicated problem: “I need to find records where certain fields are close matches” How close is close enough? :)

      First, here is the very basics of how I used it.
      1) I started with a roster of names to check against. This is kept up to date with all persons possible correct names (employees in my case).
      2) When a name is entered into a sheet (in designated name cells), the name is checked using a VBA call from a change event.
      2.1) If there is an match with the roster, then do nothing
      2.2) If no match, find the name with lowest edit distance by iterating through the roster. Replace entered text with the name found.

      So, if I were to color mark cells with potential name matches, here is how I would do it.

      Assume I’m starting with a known good name.

      1) Parse the good name so there are no spaces or punctuation, just letters.
      2) Iterate through the cells in question, parsing the cell contents to remove everything but the letters. At this point, things can be done to skip the cell of it has certain features. i.e. empty, only numbers, etc.
      2.1) If there is an exact match, change the color of the cell
      2.2) If not, find the edit distance. Change the color of any name that is within a certain of edit distance. This is where things can get fuzzy. There could be a weighted set of criteria. Edit Distance, number of letters, first and last letters the same, etc. Each of these criteria can be used to help produce more accurate results.

      The nature of the problem is such that there will likely to be some false positives (cells with names that are highlighted that really aren’t matches) and some missed matches. I would think in this case the false positives would be preferred over missing a name that should be highlighted, but wasn’t. In the end, the person looking at the results will have to verify that any name highlighted is a correct ‘close match.’

  • Hugo

    Thanx, Steve.
    When operating only one name at a time, then your approach will work perfectly.
    Would be fine to have a solution showing the minimum of DLD distances of a name in line to all the others.
    When having this, you can mark all the lines showing a very low DLD value, so indicating a possible doubled name.
    No idea how to implement this.

  • Brian B

    I am curious as to how you could assign different colors to different sets of duplicates.

  • You can avoid the need for concatenation if you use the COUNTIFS function instead of the COUNTIF function:

    =COUNTIFS($A$2:$A$8,A2,$B$2:$B$8,B2,$C$2:$C$8,C2,$D$2:$D$8,D2,$E$2:$E$8,E2,$F$2:$F$8,F2,$G$2:$G$8,G2)

    This will look at all of the values in the row and count how many other rows have the same set of values.

    You could then use a modified version of this to create your conditional formatting rule without the need to create a separate column in the spreadsheet to count the duplicates:

    =COUNTIFS($A$2:$A$8,A2,$B$2:$B$8,B2,$C$2:$C$8,C2,$D$2:$D$8,D2,$E$2:$E$8,E2,$F$2:$F$8,F2,$G$2:$G$8,G2)>1

    You can read more about this approach here: http://fiveminutelessons.com/learn-microsoft-excel/find-duplicate-rows-excel-across-multiple-columns

    David

Leave a Reply

  

  

  

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>