我需要从一个相当大的SQL Server表(即300,000+行)中删除重复的行。

当然,由于RowID标识字段的存在,这些行不会完全重复。

MyTable

RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null

我该怎么做呢?


当前回答

这个查询为我展示了非常好的性能:

DELETE tbl
FROM
    MyTable tbl
WHERE
    EXISTS (
        SELECT
            *
        FROM
            MyTable tbl2
        WHERE
            tbl2.SameValue = tbl.SameValue
        AND tbl.IdUniqueValue < tbl2.IdUniqueValue
    )

它在30秒多一点的时间内从2M的表中删除了1M行(50%重复)

其他回答

这是删除重复记录最简单的方法

 DELETE FROM tblemp WHERE id IN 
 (
  SELECT MIN(id) FROM tblemp
   GROUP BY  title HAVING COUNT(id)>1
 )

我更喜欢子查询\having count(*) > 1解决方案内部连接,因为我发现它更容易阅读,它很容易变成一个SELECT语句来验证什么将被删除,然后再运行它。

--DELETE FROM table1 
--WHERE id IN ( 
     SELECT MIN(id) FROM table1 
     GROUP BY col1, col2, col3 
     -- could add a WHERE clause here to further filter
     HAVING count(*) > 1
--)

通过使用下面的查询,我们可以基于单列或多列删除重复的记录。下面的查询是基于两列进行删除。表名为:testing,列名为empno,empname

DELETE FROM testing WHERE empno not IN (SELECT empno FROM (SELECT empno, ROW_NUMBER() OVER (PARTITION BY empno ORDER BY empno) 
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)
or empname not in
(select empname from (select empname,row_number() over(PARTITION BY empno ORDER BY empno) 
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)

有时使用软删除机制,其中记录日期以指示删除的日期。在这种情况下,可以使用UPDATE语句根据重复的条目更新该字段。

UPDATE MY_TABLE
   SET DELETED = getDate()
 WHERE TABLE_ID IN (
    SELECT x.TABLE_ID
      FROM MY_TABLE x
      JOIN (SELECT min(TABLE_ID) id, COL_1, COL_2, COL_3
              FROM MY_TABLE d
             GROUP BY d.COL_1, d.COL_2, d.COL_3
            HAVING count(*) > 1) AS d ON d.COL_1 = x.COL_1
                                     AND d.COL_2 = x.COL_2
                                     AND d.COL_3 = x.COL_3
                                     AND d.TABLE_ID <> x.TABLE_ID
             /*WHERE x.COL_4 <> 'D' -- Additional filter*/)

对于包含大约3000万行、重复量有高有低的中等数据表,这种方法非常适用。

我知道这个问题已经回答了,但我已经创建了非常有用的sp,它将为表副本创建一个动态删除语句:

    CREATE PROCEDURE sp_DeleteDuplicate @tableName varchar(100), @DebugMode int =1
AS 
BEGIN
SET NOCOUNT ON;

IF(OBJECT_ID('tempdb..#tableMatrix') is not null) DROP TABLE #tableMatrix;

SELECT ROW_NUMBER() OVER(ORDER BY name) as rn,name into #tableMatrix FROM sys.columns where [object_id] = object_id(@tableName) ORDER BY name

DECLARE @MaxRow int = (SELECT MAX(rn) from #tableMatrix)
IF(@MaxRow is null)
    RAISERROR  ('I wasn''t able to find any columns for this table!',16,1)
ELSE 
    BEGIN
DECLARE @i int =1 
DECLARE @Columns Varchar(max) ='';

WHILE (@i <= @MaxRow)
BEGIN 
    SET @Columns=@Columns+(SELECT '['+name+'],' from #tableMatrix where rn = @i)

    SET @i = @i+1;
END

---DELETE LAST comma
SET @Columns = LEFT(@Columns,LEN(@Columns)-1)

DECLARE @Sql nvarchar(max) = '
WITH cteRowsToDelte
     AS (
SELECT ROW_NUMBER() OVER (PARTITION BY '+@Columns+' ORDER BY ( SELECT 0)) as rowNumber,* FROM '+@tableName
+')

DELETE FROM cteRowsToDelte
WHERE  rowNumber > 1;
'
SET NOCOUNT OFF;
    IF(@DebugMode = 1)
       SELECT @Sql
    ELSE
       EXEC sp_executesql @Sql
    END
END

如果你创建这样的表格

IF(OBJECT_ID('MyLitleTable') is not null)
    DROP TABLE MyLitleTable 


CREATE TABLE MyLitleTable
(
    A Varchar(10),
    B money,
    C int
)
---------------------------------------------------------

    INSERT INTO MyLitleTable VALUES
    ('ABC',100,1),
    ('ABC',100,1), -- only this row should be deleted
    ('ABC',101,1),
    ('ABC',100,2),
    ('ABCD',100,1)

    -----------------------------------------------------------

     exec sp_DeleteDuplicate 'MyLitleTable',0

它将从表中删除所有重复项。如果运行它时不带第二个参数,它将返回一条SQL语句来运行。

如果您需要排除任何列,只需在调试模式下运行它,获取代码并按照您的喜好修改它。