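I need to remove duplicate rows from a fairly large SQL Server table (300,000+ rows).

The rows will not be exact duplicates, of course, because of the RowID identity field.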

MyTable

RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null

How can I do this?


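Current answer

The following query deletes duplicate rows. The table in this example has ID as its identity column, and the columns holding the duplicated data are Column1, Column2, and Column3. For each group of identical values, the row with the highest ID is kept.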

DELETE FROM TableName
WHERE  ID NOT IN (SELECT MAX(ID)
                  FROM   TableName
                  GROUP  BY Column1,
                            Column2,
                            Column3
                  /*Even if ID is not null-able SQL Server treats MAX(ID) as potentially
                    nullable. Because of semantics of NOT IN (NULL) including the clause
                    below can simplify the plan*/
                  HAVING MAX(ID) IS NOT NULL) 

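The following script shows GROUP BY, HAVING, and ORDER BY used together in one query. It returns each duplicated column value along with the number of times it occurs.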

SELECT YourColumnName,
       COUNT(*) TotalCount
FROM   YourTableName
GROUP  BY YourColumnName
HAVING COUNT(*) > 1
ORDER  BY COUNT(*) DESC 
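
As a minimal sketch, the same check can be pointed at the table from the question before deleting anything. It assumes that a "duplicate" means rows sharing the same Col1, Col2, and Col3; that definition is an assumption drawn from the question's column list, not something the answer states.

```sql
-- List each duplicated (Col1, Col2, Col3) combination in MyTable
-- together with how many rows share it.
SELECT Col1,
       Col2,
       Col3,
       COUNT(*) AS TotalCount
FROM   MyTable
GROUP  BY Col1,
          Col2,
          Col3
HAVING COUNT(*) > 1
ORDER  BY COUNT(*) DESC;
```

Running this first shows how many groups the DELETE would touch.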

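Other answers

I know this question has already been answered, but I have written a handy stored procedure that builds a dynamic DELETE statement for the duplicates in a table: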

CREATE PROCEDURE sp_DeleteDuplicate @tableName varchar(100), @DebugMode int = 1
AS
BEGIN
    SET NOCOUNT ON;

    IF (OBJECT_ID('tempdb..#tableMatrix') IS NOT NULL) DROP TABLE #tableMatrix;

    -- Number the table's columns so they can be walked one at a time
    SELECT ROW_NUMBER() OVER (ORDER BY name) AS rn, name
    INTO   #tableMatrix
    FROM   sys.columns
    WHERE  [object_id] = OBJECT_ID(@tableName);

    DECLARE @MaxRow int = (SELECT MAX(rn) FROM #tableMatrix);

    IF (@MaxRow IS NULL)
        RAISERROR ('I wasn''t able to find any columns for this table!', 16, 1);
    ELSE
    BEGIN
        DECLARE @i int = 1;
        DECLARE @Columns nvarchar(max) = '';

        -- Build a comma-separated list of bracket-quoted column names
        WHILE (@i <= @MaxRow)
        BEGIN
            SET @Columns = @Columns + (SELECT QUOTENAME(name) + ',' FROM #tableMatrix WHERE rn = @i);
            SET @i = @i + 1;
        END

        -- Remove the trailing comma
        SET @Columns = LEFT(@Columns, LEN(@Columns) - 1);

        -- Partition by every column; rows numbered above 1 are duplicates
        DECLARE @Sql nvarchar(max) = '
WITH cteRowsToDelete
     AS (
SELECT ROW_NUMBER() OVER (PARTITION BY ' + @Columns + ' ORDER BY (SELECT 0)) AS rowNumber, * FROM ' + @tableName
+ ')

DELETE FROM cteRowsToDelete
WHERE  rowNumber > 1;
';
        SET NOCOUNT OFF;

        -- Debug mode returns the statement instead of running it
        IF (@DebugMode = 1)
            SELECT @Sql;
        ELSE
            EXEC sp_executesql @Sql;
    END
END

So if you create a table like this:

IF(OBJECT_ID('MyLitleTable') is not null)
    DROP TABLE MyLitleTable 


CREATE TABLE MyLitleTable
(
    A Varchar(10),
    B money,
    C int
)
---------------------------------------------------------

    INSERT INTO MyLitleTable VALUES
    ('ABC',100,1),
    ('ABC',100,1), -- only this row should be deleted
    ('ABC',101,1),
    ('ABC',100,2),
    ('ABCD',100,1)

    -----------------------------------------------------------

     exec sp_DeleteDuplicate 'MyLitleTable',0

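It will delete all duplicates from the table. If you run it without the second parameter, it returns the SQL statement to run instead of executing it.

If you need to exclude any columns (an identity column, for example, which would otherwise make every row unique), just run it in debug mode, take the generated code, and modify it however you like.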

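I prefer a CTE for deleting duplicate rows from a SQL Server table.

I strongly recommend reading this article: http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/

Keeping the original (one row from each group of duplicates survives)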

WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)

DELETE FROM CTE WHERE RN<>1

Without keeping the original (every row that has a duplicate is deleted)

WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
 
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)

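I think this will be helpful. Here, ROW_NUMBER() OVER (PARTITION BY res1.Title ORDER BY res1.Id) AS num is used to tell the duplicate rows apart.

-- T-SQL cannot delete straight from an unnamed derived table,
-- so the alias res2 is named as the delete target.
DELETE res2
FROM (SELECT res1.*, ROW_NUMBER() OVER (PARTITION BY res1.Title ORDER BY res1.Id) AS num
      FROM [dbo].[tbl_countries] AS res1
     ) AS res2
WHERE res2.num > 1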

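Another approach is to create a new table with the same fields and a unique index, then move all the data from the old table to the new one. SQL Server then ignores the duplicate values automatically, provided the index is set to ignore duplicate keys (there is an option that says what to do when a duplicate value is found: ignore, interrupt, or ...). The result is the same table without the duplicate rows. If you don't want the unique index, you can drop it once the data has been transferred.

For larger tables in particular, you can use DTS (an SSIS package for importing/exporting data) to move all the data quickly into the new uniquely indexed table. 7 million rows take only a few minutes.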
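
The new-table approach described above might look roughly like the following sketch. The table names SourceTable and DedupTable and the index name UX_DedupTable are hypothetical, and the columns are kept deliberately narrow: index keys have a maximum size, so a wide column such as the question's Col2 varchar(2048) may not fit in a unique index as-is. The sketch uses the IGNORE_DUP_KEY index option, which makes SQL Server skip duplicate keys with a warning instead of failing the insert.

```sql
-- Hypothetical source table containing one duplicated row
CREATE TABLE SourceTable
(
    Col1 varchar(20)  NOT NULL,
    Col2 varchar(100) NOT NULL,
    Col3 tinyint      NOT NULL
);

INSERT INTO SourceTable (Col1, Col2, Col3) VALUES
('a', 'x', 1),
('a', 'x', 1),   -- duplicate of the row above
('b', 'y', 2);

-- New table with the same columns
CREATE TABLE DedupTable
(
    Col1 varchar(20)  NOT NULL,
    Col2 varchar(100) NOT NULL,
    Col3 tinyint      NOT NULL
);

-- Unique index that discards duplicate keys rather than raising an error
CREATE UNIQUE INDEX UX_DedupTable
    ON DedupTable (Col1, Col2, Col3)
    WITH (IGNORE_DUP_KEY = ON);

-- Duplicate rows are skipped while the rest are copied
INSERT INTO DedupTable (Col1, Col2, Col3)
SELECT Col1, Col2, Col3
FROM   SourceTable;

-- Optional: drop the index once the data has been transferred
DROP INDEX UX_DedupTable ON DedupTable;
```

Without IGNORE_DUP_KEY = ON, the INSERT would fail on the first duplicate key instead of skipping it.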

Oh, sure. Use a temp table. If you want a single, not-very-performant statement that "works", you can go with:

DELETE FROM MyTable WHERE NOT RowID IN
    (SELECT 
        (SELECT TOP 1 RowID FROM MyTable mt2 
        WHERE mt2.Col1 = mt.Col1 
        AND mt2.Col2 = mt.Col2 
        AND mt2.Col3 = mt.Col3
        ORDER BY mt2.RowID) -- makes TOP 1 deterministic: keep the lowest RowID
    FROM MyTable mt)

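Basically, for each row in the table, the sub-select finds the top RowID among all rows that are exactly like the row under consideration. So you end up with a list of RowIDs that represent the "original" non-duplicated rows.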
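
The answer mentions a temp table without showing one. One possible reading, offered only as a sketch, is to collect the RowIDs to keep in a temp table and then delete everything else. The temp table name #RowsToKeep is hypothetical, and keeping the lowest RowID of each group is an assumption.

```sql
-- Collect one RowID to keep per (Col1, Col2, Col3) group
SELECT MIN(RowID) AS RowID
INTO   #RowsToKeep
FROM   MyTable
GROUP  BY Col1, Col2, Col3;

-- Delete every row whose RowID was not kept
DELETE mt
FROM   MyTable AS mt
WHERE  NOT EXISTS (SELECT 1
                   FROM   #RowsToKeep AS k
                   WHERE  k.RowID = mt.RowID);

DROP TABLE #RowsToKeep;
```

Splitting the work this way lets you inspect #RowsToKeep before running the DELETE.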