Friday, June 5, 2009

Delete or remove duplicate rows from SQL Server table using TSQL




Please visit my new Web Site WWW.Codedisplay.com



In most of the SQL Server forum i found that they are asking to delete or eliminate duplicate rows form their table. The most common reasons are import twice, Front-end application save same data for each refresh etc. Here i want to share with you how we can remove duplicate rows data from a database table. One thing keep in mind that for different scenario you need different TSQL query or script to remove or eliminate dulicate data. Scenarios like:

1. Table has primary key(PK) but others column data duplicated.
2. Table has no primary key.


For Primary key:
If your table contains primary key but other column data is duplicated then it will more easy to remove than if the table has no primary key. Since your table has primary key then you can write a query in such a way that first group data(column or column list which is duplicated) then collect the Max or Min ID & then start deletion. For example i have the below table structure & dataset:



Look at the table, let you want to delete or eliminate data where both col & col2 were same. So the query should be:

SQL Server 2000:

DELETE FROM DUPLICATE WHERE ID NOT IN
(
SELECT MIN(ID) FROM DUPLICATE GROUP BY COL,COL2)

--OR

DELETE
FROM DUPLICATE WHERE ID IN
(
SELECT MAX(ID) FROM DUPLICATE GROUP BY COL,COL2 HAVING COUNT(*)>1
* If you want to remove duplicate value for only COL then remove COL2 from group clause.
* For DateTime column, must format first like CONVERT(...) in Group By Clause.
Click Here for Format details.


SQL Server 2005:
WITH tbl_online AS (
SELECT *,RANK () OVER (
PARTITION BY COL,COL2 ORDER BY ID ) AS Rnum
FROM Duplicate)
DELETE FROM tbl_online WHERE Rnum > 1

WITHOUT Primary key:
For exactly same data rows just use DISTINCT Like: SELECT DISTINCT * FROM Table. Otherwise its a bit complex. Since your table contains 4 columns where you want to remove or delete duplicate rows for COL & COL2 then the solution will be varried from SQL Server 2000 to 2005.

SQL Server 2000:
Since we have no built in function so we have to use temporary table. Filter distinct columns first then merge first or top value for duplicate set & store into the temporary table. Now delete the main table & reinsert from temporary table. Before executing the below query remove ID column first.

SELECT DISTINCT T1.Col,T1.COL2(
SELECT TOP 1 Col3 FROM DUPLICATE
WHERE COL=T1.COL AND COL2=T1.COL2) COL3 INTO #tab2
FROM Duplicate T1
TRUNCATE TABLE DUPLICATE
INSERT INTO DUPLICATE SELECT * FROM #tab2
DROP TABLE #tab2

FOR SQL SERVER 2005:
SQL Server provide us few built in function like RANK(), ROW_NUMBER() which will ease our lives:

WITH tbl_online AS (
SELECT *,ROW_NUMBER () OVER (
PARTITION BY COL,COL2 ORDER BY COL ) AS Rnum
FROM Duplicate)
DELETE FROM tbl_online WHERE Rnum > 1

Since ordering may degrade the query performence, for alternative read the below post:
http://www.sql-server-performance.com/articles/per/ranking_functions_performance_p1.aspx

2 comments:

susheel said...

U r doing good and useful.. The codes u gave are really simple and i think they are for everyday .net applications

Anonymous said...

Dude, thanks for the remove dups script in 2005. Nice and elegant.

Cheers,
Jon

Want to say something?
I WOULD BE DELIGHTED TO HEAR FROM YOU

Want To Search More?
Google Search on Internet
Subscribe RSS Subscribe RSS
Article Categories
  • Asp.net
  • Gridview
  • Javascript
  • AJAX
  • Sql server
  • XML
  • CSS
  • Free Web Site Templates
  • Free Desktop Wallpapers
  • TopOfBlogs
     
    Free ASP.NET articles,C#.NET,VB.NET tutorials and Examples,Ajax,SQL Server,Javascript,Jquery,XML,GridView Articles and code examples -- by Shawpnendu Bikash