Thursday, February 5, 2015

Removing Duplicate Data

Introduction

A common problem almost everyone runs into at one time or another is duplicated data in a table. If you haven't encountered this yet you are one, very lucky and two, your day is coming. Good news! We have a solution in SQL that makes sorting through this a much easier task! The ROW_NUMBER() function can make very short work of eliminating those pesky extra rows. First you'll want to identify how those rows made it into your table and close the hole so it doesn't happen again. Now let's take a look at how we can use this function to make our problems go away.


Creating the Problem

Lets create a table in our TestDB. We'll call it Employees. This table is for tracking the progression of our employees careers over time.

USE [TestDB]
GO
--Create a test table
CREATE TABLE [dbo].[Employees](
[ID] [int] IDENTITY(1,1) NOT NULL,
[FirstName] [varchar](20) NOT NULL,
[LastName] [varchar](20) NOT NULL,
[EmployeeNumber] [varchar](10) NOT NULL,
[Title] [varchar](20) NOT NULL,
 CONSTRAINT [PK__Employee__ID] PRIMARY KEY CLUSTERED ([ID] ASC)
) ON [PRIMARY]

Now lets fill it with some bogus data and make up a few career progressions along the way so we have some rows on all our employees. Any similarity to persons either living or dead is just coincidence.

--Fill it with some data
INSERT INTO [dbo].[Employees] ([FirstName],[LastName],[EmployeeNumber],[Title])
VALUES
('Daniel','Slingblade','1001101111','Troll')
,('Peter','PumpkinEater','0110010000','Troll')
,('Bob','UpandDown','4242424242','Troll')
,('Dan','TheMan','ffffffffff','Troll')
,('Peter','PumpkinEater','0110010000','Lesser-Deity')
,('Daniel','Slingblade','1001101111','Gate Keeper')
,('Bob','UpandDown','4242424242','Key Master')
,('Dan','TheMan','ffffffffff','Stream Crosser')
,('Bob','UpandDown','4242424242','Lesser-Deity')
,('Dan','TheMan','ffffffffff','Lesser-Deity')
,('Peter','PumpkinEater','0110010000','Demigod')
,('Daniel','Slingblade','1001101111','Wizard')
,('Guy','Redshirt','0000000000','Expendable')
,('Buffy','Bendy','4572475047','Slayer')

Now that we have some data in our table, we have to create a duplication situation in order to resolve it. Take this query and run it a few times. Really mess things up.

INSERT INTO [dbo].[Employees] ([FirstName],[LastName],[EmployeeNumber],[Title])
SELECT [FirstName],[LastName],[EmployeeNumber],[Title]
FROM [dbo].[Employees] [Emp] WITH (NOLOCK)
WHERE Firstname LIKE 'Dan%'

INSERT INTO [dbo].[Employees] ([FirstName],[LastName],[EmployeeNumber],[Title])
SELECT [FirstName],[LastName],[EmployeeNumber],[Title]
FROM [dbo].[Employees] [Emp] WITH (NOLOCK)
WHERE Firstname LIKE 'P%'


Solving the Problem

I now have 41 rows of duplicated data in my test table. We're going to attempt to clean the table so that we only have a single row of data for any given condition.

The SQL function ROW_NUMBER() adds an integer column to a selection of data. This allows you to easily number a set of rows for sorting, paging or in this case, identifying duplicate rows.

Step one, we'll decide what identifies a row. In this case it will be [FirstName], [LastName], [EmployeeNumber] and [Title]. You can change your criteria to meet the needs of the situation you're in but the end game is to get a unique key. So let's do a SELECT DISTINCT on these fields and see what we come up with. With the query below we have our original 14 rows so this plan will work.



How do we split them off? In this case a SELECT INTO... would solve your problem with a quick truncate table, but we're going to pretend there are millions of rows here. There are several different ways but one of the easiest is to just delete the duplicate entries.

Using this query we can easily number our rows 1...X by our key we identified earlier.

SELECT ROW_NUMBER() OVER ( PARTITION BY [FirstName],[LastName],[Title] ORDER BY [LastName] ) [ROW]
      ,[FirstName]
      ,[LastName]
      ,[EmployeeNumber]
      ,[Title]
FROM [TestDB].[dbo].[Employees] 

You should see something similar to the table listing below. Based on this, we can now delete anything with [ROW] > 1 and remove all the duplicates.



Using this statement inside a transaction we can test the operation before we actually commit it. What we do is select the rows numbered as [Rows] deleting any of them with [Row] > 1.

BEGIN TRANSACTION
DELETE [ROWS] FROM
(
SELECT ROW_NUMBER() OVER ( PARTITION BY [FirstName],[LastName],[Title] ORDER BY [LastName] ) [ROW]
 ,[FirstName]
 ,[LastName]
 ,[EmployeeNumber]
 ,[Title]
FROM [TestDB].[dbo].[Employees]
) [ROWS]
WHERE [ROW] > 1
ROLLBACK

Now just change the ROLLBACK to a COMMIT and run the statement again and the duplicate rows will be gone. .

Conclusion

I hope you can use this at some point in time to save yourself a little bit of work. This is just one of the possible uses of ROW_NUMBER(). What other possible uses can you think of?

No comments:

Post a Comment