我需要从一个相当大的SQL Server表(即300,000+行)中删除重复的行。

当然,由于RowID标识字段的存在,这些行不会完全重复。

MyTable

RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null

我该怎么做呢?


当前回答

我更喜欢CTE从sql server表中删除重复的行

强烈推荐阅读本文::http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/

保持原创性

WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)

DELETE FROM CTE WHERE RN<>1

不保留原创

WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
 
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)

其他回答

我更喜欢CTE从sql server表中删除重复的行

强烈推荐阅读本文::http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/

保持原创性

WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)

DELETE FROM CTE WHERE RN<>1

不保留原创

WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
 
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)

这个查询为我展示了非常好的性能:

DELETE tbl
FROM
    MyTable tbl
WHERE
    EXISTS (
        SELECT
            *
        FROM
            MyTable tbl2
        WHERE
            tbl2.SameValue = tbl.SameValue
        AND tbl.IdUniqueValue < tbl2.IdUniqueValue
    )

它在30秒多一点的时间内从2M的表中删除了1M行(50%重复)

这是另一篇关于删除重复的好文章。

它讨论了为什么它很难:“SQL是基于关系代数的,在关系代数中不能出现重复,因为在一个集合中不允许重复。”

临时表解决方案,以及两个mysql示例。

将来,您是打算在数据库级别还是从应用程序的角度防止它?我建议数据库级别,因为你的数据库应该负责维护引用完整性,开发人员只会造成问题;)

通过使用下面的查询,我们可以基于单列或多列删除重复的记录。下面的查询是基于两列进行删除。表名为:testing,列名为empno,empname

DELETE FROM testing WHERE empno not IN (SELECT empno FROM (SELECT empno, ROW_NUMBER() OVER (PARTITION BY empno ORDER BY empno) 
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)
or empname not in
(select empname from (select empname,row_number() over(PARTITION BY empno ORDER BY empno) 
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)

对于表结构

MyTable

RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null

删除重复项的查询:

DELETE t1
FROM MyTable t1
INNER JOIN MyTable t2
WHERE t1.RowID > t2.RowID
  AND t1.Col1 = t2.Col1
  AND t1.Col2=t2.Col2
  AND t1.Col3=t2.Col3;

我假设RowID是一种自动递增,其余列有重复的值。