I need to remove duplicate rows from a fairly large SQL Server table (300,000+ rows).

Because of the RowID identity column, the rows are of course not exact duplicates.

MyTable

RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null

How can I do this?


Current answer

Use this:

WITH tblTemp as
(
SELECT ROW_NUMBER() Over(PARTITION BY Name,Department ORDER BY Name)
   As RowNumber,* FROM <table_name>
)
DELETE FROM tblTemp where RowNumber >1

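Other answers

alter table MyTable add sno int identity(1,1)
GO
    -- The GO above starts a new batch so that the DELETE can see the new sno column.
    -- Partition by the duplicated columns, not RowID (RowID is unique, so it would match nothing).
    delete from MyTable where sno in
    (
    select sno from (
    select sno,
    RANK() OVER ( PARTITION BY Col1,Col2,Col3 ORDER BY sno DESC ) as rnk
    From MyTable
    )T
    where rnk>1
    )

    alter table MyTable 
    drop  column sno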

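I prefer a CTE for deleting duplicate rows from a SQL Server table.

I strongly recommend reading this article: http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/

Keeping the original (one row from each group of duplicates survives):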

WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)

DELETE FROM CTE WHERE RN<>1

Without keeping the original (every row that has a duplicate is removed):

WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
 
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)

Another possible way of doing this is:

; 

--Ensure that any immediately preceding statement is terminated with a semicolon above
WITH cte
     AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3 
                                       ORDER BY ( SELECT 0)) RN
         FROM   #MyTable)
DELETE FROM cte
WHERE  RN > 1;

I use ORDER BY (SELECT 0) above because it is arbitrary which row is preserved in the event of a tie.

To preserve the latest row in RowID order, for example, you could use ORDER BY RowID DESC.
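
As a minimal sketch of that variant, assuming the MyTable definition from the question (rather than the #MyTable temp table used above), the following would keep the row with the highest RowID in each group of duplicates:

```sql
-- Sketch only: assumes the MyTable definition from the question.
-- Keeps the row with the highest RowID in each (Col1, Col2, Col3) group.
;WITH cte AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
                              ORDER BY RowID DESC) AS RN
    FROM   MyTable
)
DELETE FROM cte
WHERE  RN > 1;
```

Changing DESC to ASC would keep the earliest row in each group instead.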

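Execution plans

The execution plan for this is often simpler and more efficient than the one for the accepted answer, because it does not require a self join.

This is not always the case, however. The GROUP BY solution may be preferable in situations where a hash aggregate would be chosen over a stream aggregate.

The ROW_NUMBER solution will always give pretty much the same plan, whereas the GROUP BY strategy is more flexible.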

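Factors that might favour the hash aggregate approach are:

- No useful index on the partitioning columns
- Relatively few groups, with relatively many duplicates in each group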
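
The accepted answer is not reproduced in this thread, so for comparison here is a sketch of what a GROUP BY based delete with a self join typically looks like. It assumes the MyTable definition from the question and keeps the lowest RowID in each group; the alias KeepRows is only illustrative.

```sql
-- Sketch only: one common shape of a GROUP BY / self-join delete.
-- The derived table holds the single RowID to keep for each group;
-- rows of MyTable that find no match in it are the duplicates.
DELETE t
FROM   MyTable AS t
       LEFT JOIN (SELECT MIN(RowID) AS RowID
                  FROM   MyTable
                  GROUP  BY Col1, Col2, Col3) AS KeepRows
              ON t.RowID = KeepRows.RowID
WHERE  KeepRows.RowID IS NULL;
```

With this form the optimizer is free to pick either a hash or a stream aggregate for the GROUP BY, which is the flexibility described above.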

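In extreme versions of this second case (very few groups, with many duplicates in each), one could also consider simply inserting the rows to keep into a new table, then truncating the original and copying them back. Compared with deleting a very high proportion of the rows, this minimises logging.

For the table structure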
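
A rough sketch of that copy, truncate and reload approach follows. It assumes the MyTable definition from the question, and MyTable_Keep is a hypothetical staging table name. Note that TRUNCATE TABLE fails if the table is referenced by a foreign key, so treat this as an outline to adapt rather than a finished script.

```sql
-- Sketch only: MyTable_Keep is a hypothetical staging table name.
BEGIN TRANSACTION;

-- 1. Copy one row per (Col1, Col2, Col3) group into the staging table.
SELECT MIN(RowID) AS RowID, Col1, Col2, Col3
INTO   MyTable_Keep
FROM   MyTable
GROUP  BY Col1, Col2, Col3;

-- 2. Empty the original table.
TRUNCATE TABLE MyTable;

-- 3. Copy the kept rows back, preserving their original RowID values.
SET IDENTITY_INSERT MyTable ON;

INSERT INTO MyTable (RowID, Col1, Col2, Col3)
SELECT RowID, Col1, Col2, Col3
FROM   MyTable_Keep;

SET IDENTITY_INSERT MyTable OFF;

-- 4. Remove the staging table.
DROP TABLE MyTable_Keep;

COMMIT TRANSACTION;
```

Wrapping the steps in a transaction means a failure part way through can be rolled back instead of leaving MyTable empty.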

MyTable

RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null

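The query to remove duplicates:

DELETE t1
FROM MyTable t1
INNER JOIN MyTable t2
  -- SQL Server requires an ON clause for INNER JOIN
  ON  t1.Col1 = t2.Col1
  AND t1.Col2 = t2.Col2
  AND t1.Col3 = t2.Col3
WHERE t1.RowID > t2.RowID;

I am assuming that RowID is an auto-incrementing column and that the remaining columns hold the duplicated values.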

This is the easiest way to delete duplicate records:

 DELETE FROM tblemp WHERE id NOT IN 
 (
  -- keep the lowest id for each title; every other row with that title is a duplicate
  SELECT MIN(id) FROM tblemp
   GROUP BY  title
 )