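I need to remove duplicate rows from a fairly large SQL Server table (i.e. 300,000+ rows).
The rows will not be exact duplicates, of course, because of the RowID identity field.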
MyTable
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
How can I do this?
Current answer
This query performed very well for me:
DELETE tbl
FROM
MyTable tbl
WHERE
EXISTS (
SELECT
*
FROM
MyTable tbl2
WHERE
tbl2.SameValue = tbl.SameValue
AND tbl.IdUniqueValue < tbl2.IdUniqueValue
)
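It deleted 1M rows from a 2M-row table (50% duplicates) in a little over 30 seconds. Within each group of rows sharing the same SameValue, it keeps the row with the highest IdUniqueValue.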
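To see which rows the EXISTS pattern keeps, here is a minimal sketch that runs the same idea against an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for SQL Server here, the sample rows are made up, and the helper name delete_older_duplicates is hypothetical. SQLite does not accept the `DELETE tbl FROM ...` form, so the outer table is referenced by name instead of by alias.

```python
import sqlite3


def delete_older_duplicates(conn):
    """Delete every row that has a same-valued row with a higher id.

    Hypothetical helper: SQLite stands in for SQL Server here.
    """
    cur = conn.execute(
        """
        DELETE FROM MyTable
        WHERE EXISTS (
            SELECT 1
            FROM MyTable AS tbl2
            WHERE tbl2.SameValue = MyTable.SameValue
              AND MyTable.IdUniqueValue < tbl2.IdUniqueValue
        )
        """
    )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable ("
    "IdUniqueValue INTEGER PRIMARY KEY, SameValue TEXT NOT NULL)"
)
# Made-up sample data: ids 1..6 are assigned in insertion order.
conn.executemany(
    "INSERT INTO MyTable (SameValue) VALUES (?)",
    [("a",), ("b",), ("a",), ("a",), ("c",), ("b",)],
)

deleted = delete_older_duplicates(conn)
remaining = conn.execute(
    "SELECT IdUniqueValue, SameValue FROM MyTable ORDER BY IdUniqueValue"
).fetchall()
print(deleted, remaining)
```

Flipping the comparison to `>` would keep the lowest id in each group instead of the highest.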
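Other answers
This is the simplest way to delete duplicate records. Note that each run removes only the lowest id in every duplicated group, so a group with more than two copies needs the statement run again until no rows are affected.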
DELETE FROM tblemp WHERE id IN
(
SELECT MIN(id) FROM tblemp
GROUP BY title HAVING COUNT(id)>1
)
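I prefer the subquery with HAVING COUNT(*) > 1 to the inner join solution, because I find it easier to read and it is easy to turn into a SELECT statement to verify what will be deleted before you run it.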
--DELETE FROM table1
--WHERE id IN (
SELECT MIN(id) FROM table1
GROUP BY col1, col2, col3
-- could add a WHERE clause here to further filter
HAVING count(*) > 1
--)
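The preview-then-delete workflow can be tried out with a small sketch like the one below. It uses Python's sqlite3 module with an in-memory database as a stand-in for SQL Server; the sample rows and the helper names preview and delete_until_unique are assumptions for illustration. Because the subquery returns only one id per duplicated group, the delete is repeated until nothing is left to remove.

```python
import sqlite3

GROUP_COLS = "col1, col2, col3"

# The same subquery serves as the preview and as the DELETE filter.
PREVIEW_SQL = f"""
    SELECT MIN(id) FROM table1
    GROUP BY {GROUP_COLS}
    HAVING COUNT(*) > 1
"""


def preview(conn):
    """Return the ids the next DELETE pass would remove."""
    return sorted(row[0] for row in conn.execute(PREVIEW_SQL))


def delete_until_unique(conn):
    """Run the DELETE repeatedly; return how many passes removed rows."""
    passes = 0
    while True:
        cur = conn.execute(f"DELETE FROM table1 WHERE id IN ({PREVIEW_SQL})")
        if cur.rowcount == 0:
            return passes
        passes += 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 ("
    "id INTEGER PRIMARY KEY, col1 TEXT, col2 INTEGER, col3 INTEGER)"
)
# Made-up sample data: "x" appears three times, "y" twice, "z" once.
conn.executemany(
    "INSERT INTO table1 (col1, col2, col3) VALUES (?, ?, ?)",
    [("x", 1, 1), ("y", 2, 2), ("x", 1, 1), ("z", 3, 3), ("y", 2, 2), ("x", 1, 1)],
)

first_preview = preview(conn)
passes = delete_until_unique(conn)
remaining_ids = [r[0] for r in conn.execute("SELECT id FROM table1 ORDER BY id")]
print(first_preview, passes, remaining_ids)
```

The triple "x" group is why more than one pass is needed: the first pass removes only its lowest id.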
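With the query below we can delete duplicate records based on a single column or on multiple columns. This one deletes based on two columns. The table name is testing and the column names are empno and empname.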
-- Number the rows inside each (empno, empname) group, then delete every row after the first.
WITH numbered AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY empno, empname ORDER BY empno) AS [ItemNumber]
    FROM testing
)
DELETE FROM numbered
WHERE ItemNumber > 1
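As a rough way to check the ROW_NUMBER idea, the sketch below numbers rows within each (empno, empname) group and deletes everything numbered above 1. It runs on SQLite through Python's sqlite3 module rather than on SQL Server, and it assumes SQLite 3.25 or later for window functions. The sample rows are invented, and `rowid` is SQLite's implicit row identifier, used here only because the testing table has no key of its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testing (empno INTEGER, empname TEXT)")
# Made-up sample data with repeated (empno, empname) pairs.
conn.executemany(
    "INSERT INTO testing VALUES (?, ?)",
    [(1, "Ann"), (1, "Ann"), (2, "Bob"), (1, "Ann"), (2, "Bea"), (2, "Bob")],
)

# Rows numbered 2 and up within a group are the duplicates to remove.
DEDUP_SQL = """
    DELETE FROM testing
    WHERE rowid IN (
        SELECT rid
        FROM (
            SELECT rowid AS rid,
                   ROW_NUMBER() OVER (
                       PARTITION BY empno, empname
                       ORDER BY rowid
                   ) AS item_number
            FROM testing
        ) AS numbered
        WHERE item_number > 1
    )
"""

deleted = conn.execute(DEDUP_SQL).rowcount
rows = conn.execute(
    "SELECT empno, empname FROM testing ORDER BY empno, empname"
).fetchall()
print(deleted, rows)
```

Partitioning by a single column instead (for example only empno) would treat rows as duplicates whenever that one column matches.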
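Sometimes a soft-delete mechanism is used, where a date is recorded to indicate when a row was deleted. In that case, an UPDATE statement can set that field for the duplicate entries.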
UPDATE MY_TABLE
SET DELETED = getDate()
WHERE TABLE_ID IN (
SELECT x.TABLE_ID
FROM MY_TABLE x
JOIN (SELECT min(TABLE_ID) id, COL_1, COL_2, COL_3
FROM MY_TABLE d
GROUP BY d.COL_1, d.COL_2, d.COL_3
HAVING count(*) > 1) AS d ON d.COL_1 = x.COL_1
AND d.COL_2 = x.COL_2
AND d.COL_3 = x.COL_3
AND d.id <> x.TABLE_ID
/*WHERE x.COL_4 <> 'D' -- Additional filter*/)
This method has served me well for moderately sized tables of around 30 million rows, with both high and low amounts of duplication.
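A small sketch of the soft-delete variant follows, again on an in-memory SQLite database via Python's sqlite3 module as a stand-in for SQL Server. The sample rows are made up, CURRENT_TIMESTAMP replaces getDate(), and the derived table exposes its kept key under the alias id. The lowest TABLE_ID in each duplicated group stays live and the others get a DELETED timestamp.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MY_TABLE ("
    "TABLE_ID INTEGER PRIMARY KEY, COL_1 TEXT, COL_2 TEXT, "
    "COL_3 INTEGER, DELETED TEXT)"
)
# Made-up sample data: ("a","b",1) three times, ("e","f",3) twice.
conn.executemany(
    "INSERT INTO MY_TABLE (COL_1, COL_2, COL_3) VALUES (?, ?, ?)",
    [("a", "b", 1), ("c", "d", 2), ("a", "b", 1),
     ("e", "f", 3), ("a", "b", 1), ("e", "f", 3)],
)

# Mark every row in a duplicated group except the one with the lowest id.
SOFT_DELETE_SQL = """
    UPDATE MY_TABLE
    SET DELETED = CURRENT_TIMESTAMP
    WHERE TABLE_ID IN (
        SELECT x.TABLE_ID
        FROM MY_TABLE AS x
        JOIN (
            SELECT MIN(TABLE_ID) AS id, COL_1, COL_2, COL_3
            FROM MY_TABLE
            GROUP BY COL_1, COL_2, COL_3
            HAVING COUNT(*) > 1
        ) AS d
          ON d.COL_1 = x.COL_1
         AND d.COL_2 = x.COL_2
         AND d.COL_3 = x.COL_3
         AND d.id <> x.TABLE_ID
    )
"""

marked_count = conn.execute(SOFT_DELETE_SQL).rowcount
marked = [r[0] for r in conn.execute(
    "SELECT TABLE_ID FROM MY_TABLE WHERE DELETED IS NOT NULL ORDER BY TABLE_ID")]
live = [r[0] for r in conn.execute(
    "SELECT TABLE_ID FROM MY_TABLE WHERE DELETED IS NULL ORDER BY TABLE_ID")]
print(marked_count, marked, live)
```

If the statement may run more than once, a filter such as `DELETED IS NULL` on the inner queries would keep already-marked rows out of the grouping; that is a design choice the answer leaves open.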
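I know this question has already been answered, but I wrote a handy stored procedure that builds a dynamic DELETE statement for a table's duplicates: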
CREATE PROCEDURE sp_DeleteDuplicate @tableName varchar(100), @DebugMode int =1
AS
BEGIN
SET NOCOUNT ON;
IF(OBJECT_ID('tempdb..#tableMatrix') is not null) DROP TABLE #tableMatrix;
SELECT ROW_NUMBER() OVER(ORDER BY name) as rn,name into #tableMatrix FROM sys.columns where [object_id] = object_id(@tableName) ORDER BY name
DECLARE @MaxRow int = (SELECT MAX(rn) from #tableMatrix)
IF(@MaxRow is null)
RAISERROR ('I wasn''t able to find any columns for this table!',16,1)
ELSE
BEGIN
DECLARE @i int =1
DECLARE @Columns Varchar(max) ='';
WHILE (@i <= @MaxRow)
BEGIN
SET @Columns=@Columns+(SELECT '['+name+'],' from #tableMatrix where rn = @i)
SET @i = @i+1;
END
---DELETE LAST comma
SET @Columns = LEFT(@Columns,LEN(@Columns)-1)
DECLARE @Sql nvarchar(max) = '
WITH cteRowsToDelte
AS (
SELECT ROW_NUMBER() OVER (PARTITION BY '+@Columns+' ORDER BY ( SELECT 0)) as rowNumber,* FROM '+@tableName
+')
DELETE FROM cteRowsToDelte
WHERE rowNumber > 1;
'
SET NOCOUNT OFF;
IF(@DebugMode = 1)
SELECT @Sql
ELSE
EXEC sp_executesql @Sql
END
END
So if you create a table like this:
IF(OBJECT_ID('MyLitleTable') is not null)
DROP TABLE MyLitleTable
CREATE TABLE MyLitleTable
(
A Varchar(10),
B money,
C int
)
---------------------------------------------------------
INSERT INTO MyLitleTable VALUES
('ABC',100,1),
('ABC',100,1), -- only this row should be deleted
('ABC',101,1),
('ABC',100,2),
('ABCD',100,1)
-----------------------------------------------------------
exec sp_DeleteDuplicate 'MyLitleTable',0
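It will delete all duplicates from your table. If you run it without the second parameter, it returns the SQL statement for you to run instead of executing it.
If you need to exclude any columns, just run it in debug mode, take the generated code, and modify it however you like.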
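The same "build the statement from the column list" idea can be sketched outside SQL Server. The version below is only an illustration on SQLite via Python's sqlite3 module: the function names, the debug_mode flag, and the exclude parameter are hypothetical, column names come from PRAGMA table_info, and SQLite's implicit rowid is used to pick one survivor per group instead of a deletable CTE.

```python
import sqlite3


def quote_ident(name):
    """Quote an identifier so odd names cannot break the generated SQL."""
    return '"' + name.replace('"', '""') + '"'


def build_dedup_sql(conn, table_name, exclude=()):
    """Build a DELETE that keeps the lowest rowid per group of equal rows."""
    info = conn.execute(f"PRAGMA table_info({quote_ident(table_name)})")
    cols = [row[1] for row in info if row[1] not in exclude]
    if not cols:
        raise ValueError(f"no columns found for table {table_name!r}")
    col_list = ", ".join(quote_ident(c) for c in cols)
    table = quote_ident(table_name)
    return (
        f"DELETE FROM {table} WHERE rowid NOT IN "
        f"(SELECT MIN(rowid) FROM {table} GROUP BY {col_list})"
    )


def delete_duplicates(conn, table_name, debug_mode=True, exclude=()):
    """Return the SQL in debug mode, otherwise run it and return the row count."""
    sql = build_dedup_sql(conn, table_name, exclude)
    if debug_mode:
        return sql
    return conn.execute(sql).rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyLitleTable (A TEXT, B NUMERIC, C INTEGER)")
conn.executemany(
    "INSERT INTO MyLitleTable VALUES (?, ?, ?)",
    [("ABC", 100, 1), ("ABC", 100, 1), ("ABC", 101, 1),
     ("ABC", 100, 2), ("ABCD", 100, 1)],
)

sql_preview = delete_duplicates(conn, "MyLitleTable")            # debug mode
removed = delete_duplicates(conn, "MyLitleTable", debug_mode=False)
row_count = conn.execute("SELECT COUNT(*) FROM MyLitleTable").fetchone()[0]
print(sql_preview, removed, row_count)
```

Passing a column name in exclude drops it from the GROUP BY, which is the scripted equivalent of editing the generated statement by hand.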