我需要从一个相当大的SQL Server表(即300,000+行)中删除重复的行。

当然,由于RowID标识字段的存在,这些行不会完全重复。

MyTable

RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null

我该怎么做呢?


当前回答

delete t1
from table t1, table t2
where t1.columnA = t2.columnA
and t1.rowid>t2.rowid

邮政:

delete
from table t1
using table t2
where t1.columnA = t2.columnA
and t1.rowid > t2.rowid

其他回答

使用CTE。这个想法是连接一个或多个列,形成一个重复的记录,然后删除你喜欢的:

;with cte as (
    select 
        min(PrimaryKey) as PrimaryKey
        UniqueColumn1,
        UniqueColumn2
    from dbo.DuplicatesTable 
    group by
        UniqueColumn1, UniqueColumn1
    having count(*) > 1
)
delete d
from dbo.DuplicatesTable d 
inner join cte on 
    d.PrimaryKey > cte.PrimaryKey and
    d.UniqueColumn1 = cte.UniqueColumn1 and 
    d.UniqueColumn2 = cte.UniqueColumn2;

哦,当然。使用临时表。如果你想要一个“工作”的单一的、性能不太好的语句,你可以使用:

DELETE FROM MyTable WHERE NOT RowID IN
    (SELECT 
        (SELECT TOP 1 RowID FROM MyTable mt2 
        WHERE mt2.Col1 = mt.Col1 
        AND mt2.Col2 = mt.Col2 
        AND mt2.Col3 = mt.Col3) 
    FROM MyTable mt)

基本上,对于表中的每一行,子选择将查找与所考虑行的完全相同的所有行的顶部RowID。因此,您最终会得到一个表示“原始”非重复行的RowIDs列表。

alter table MyTable add sno int identity(1,1)
    delete from MyTable where sno in
    (
    select sno from (
    select *,
    RANK() OVER ( PARTITION BY RowID,Col3 ORDER BY sno DESC )rank
    From MyTable
    )T
    where rank>1
    )

    alter table MyTable 
    drop  column sno

获取重复的行:

SELECT
name, email, COUNT(*)
FROM 
users
GROUP BY
name, email
HAVING COUNT(*) > 1

删除重复的行。

DELETE users 
WHERE rowid NOT IN 
(SELECT MIN(rowid)
FROM users
GROUP BY name, email);      

在postgresql中删除重复行的一个非常简单的方法。

DELETE FROM table1 a
USING table1 b
WHERE a.id < b.id
AND a.column1 = b.column1
AND a.column2 = b.column2;