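I need to remove duplicate rows from a fairly large SQL Server table (i.e. 300,000+ rows).
The rows, of course, will not be perfect duplicates because of the existence of the RowID identity field.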
MyTable
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
How can I do this?
Current answer
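The query below deletes duplicate rows. The table in this example has ID as its identity column, and the columns containing duplicate data are Column1, Column2 and Column3.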
DELETE FROM TableName
WHERE ID NOT IN (SELECT MAX(ID)
FROM TableName
GROUP BY Column1,
Column2,
Column3
/*Even if ID is not null-able SQL Server treats MAX(ID) as potentially
nullable. Because of semantics of NOT IN (NULL) including the clause
below can simplify the plan*/
HAVING MAX(ID) IS NOT NULL)
The script below shows GROUP BY, HAVING and ORDER BY used together in one query; it returns each duplicated column value along with its count.
SELECT YourColumnName,
COUNT(*) TotalCount
FROM YourTableName
GROUP BY YourColumnName
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
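Adapted to the table from the question, the same check could group on all three data columns at once. This is only a sketch: the table and column names come from the question, and DuplicateCount is an illustrative alias.

```sql
-- List every combination of values that occurs more than once in MyTable,
-- most frequent first. Nothing is deleted here.
SELECT Col1,
       Col2,
       Col3,
       COUNT(*) AS DuplicateCount
FROM MyTable
GROUP BY Col1, Col2, Col3
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC;
```

Running a check like this before and after a delete is a cheap way to confirm that the duplicates are really gone.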
Other answers
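I know this question has already been answered, but I have created a rather useful stored procedure that builds a dynamic delete statement for a table's duplicates: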
CREATE PROCEDURE sp_DeleteDuplicate @tableName varchar(100), @DebugMode int =1
AS
BEGIN
SET NOCOUNT ON;
IF(OBJECT_ID('tempdb..#tableMatrix') is not null) DROP TABLE #tableMatrix;
SELECT ROW_NUMBER() OVER(ORDER BY name) as rn,name into #tableMatrix FROM sys.columns where [object_id] = object_id(@tableName) ORDER BY name
DECLARE @MaxRow int = (SELECT MAX(rn) from #tableMatrix)
IF(@MaxRow is null)
RAISERROR ('I wasn''t able to find any columns for this table!',16,1)
ELSE
BEGIN
DECLARE @i int =1
DECLARE @Columns Varchar(max) ='';
WHILE (@i <= @MaxRow)
BEGIN
SET @Columns=@Columns+(SELECT '['+name+'],' from #tableMatrix where rn = @i)
SET @i = @i+1;
END
---DELETE LAST comma
SET @Columns = LEFT(@Columns,LEN(@Columns)-1)
DECLARE @Sql nvarchar(max) = '
WITH cteRowsToDelte
AS (
SELECT ROW_NUMBER() OVER (PARTITION BY '+@Columns+' ORDER BY ( SELECT 0)) as rowNumber,* FROM '+@tableName
+')
DELETE FROM cteRowsToDelte
WHERE rowNumber > 1;
'
SET NOCOUNT OFF;
IF(@DebugMode = 1)
SELECT @Sql
ELSE
EXEC sp_executesql @Sql
END
END
So if you create a table like this:
IF(OBJECT_ID('MyLitleTable') is not null)
DROP TABLE MyLitleTable
CREATE TABLE MyLitleTable
(
A Varchar(10),
B money,
C int
)
---------------------------------------------------------
INSERT INTO MyLitleTable VALUES
('ABC',100,1),
('ABC',100,1), -- only this row should be deleted
('ABC',101,1),
('ABC',100,2),
('ABCD',100,1)
-----------------------------------------------------------
exec sp_DeleteDuplicate 'MyLitleTable',0
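It will delete all duplicates from the table. If you run it without the second parameter, it returns the SQL statement for you to run instead of executing it.
If you need to exclude any of the columns, just run it in debug mode, take the generated code and modify it however you like.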
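As an illustration of that last step: for the sample table above, debug mode should return a statement that partitions by [A],[B],[C]. The sketch below is a hand-edited copy of it that leaves column C out of the comparison. Dropping C is an assumption made purely for the example, and the whitespace may differ from what the procedure prints.

```sql
-- Hand-edited from the generated statement: the original PARTITION BY list
-- was [A],[B],[C]; here [C] is removed, so rows that agree on A and B are
-- treated as duplicates even when C differs.
WITH cteRowsToDelte
AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY [A],[B] ORDER BY (SELECT 0)) AS rowNumber, *
    FROM MyLitleTable
)
DELETE FROM cteRowsToDelte
WHERE rowNumber > 1;
```

Because the ordering inside each partition is ORDER BY (SELECT 0), which row of a group survives is arbitrary.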
I prefer a CTE for deleting duplicate rows from a SQL Server table.
I strongly recommend reading this article: http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/
Keeping the original
WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)
DELETE FROM CTE WHERE RN<>1
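Before running a delete like this, it can help to preview the affected rows. The sketch below uses the same CTE shape but selects instead of deleting. It orders by RowID (an assumption taken from the question's table, not from this answer), so the surviving row in each group is always the one with the lowest RowID.

```sql
-- Preview only: lists the rows that would be removed if SELECT * were
-- replaced by DELETE. RN = 1 marks the row kept in each duplicate group.
WITH CTE AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3 ORDER BY RowID) AS RN
    FROM MyTable
)
SELECT *
FROM CTE
WHERE RN > 1;
```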
Without keeping the original
WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)
I think this will be helpful. Here, ROW_NUMBER() OVER(PARTITION BY res1.Title ORDER BY res1.Id) AS num is used to tell the duplicate rows apart.
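-- T-SQL expects the target alias right after DELETE when deleting through a derived table
DELETE res2 FROM
(SELECT res1.*, ROW_NUMBER() OVER(PARTITION BY res1.Title ORDER BY res1.Id) AS num
FROM
(SELECT * FROM [dbo].[tbl_countries]) AS res1
) AS res2
WHERE res2.num > 1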
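Another way is to create a new table with the same fields and a unique index, then move all the data from the old table to the new one. SQL Server automatically ignores the duplicate values (there is also an option for what to do when a duplicate value is found: ignore, interrupt, or …). The result is the same table without the duplicate rows. If you don't want the unique index, you can drop it after transferring the data.
Especially for larger tables, you can use DTS (an SSIS package to import/export data) to transfer all the data quickly into the new uniquely indexed table. For 7 million rows it takes just a few minutes.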
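A minimal sketch of the unique-index idea, assuming the "ignore" behaviour is switched on with the IGNORE_DUP_KEY index option. The tables OldTable and NewTable, their columns, and the index name UX_NewTable are all hypothetical and chosen only to keep the example small.

```sql
-- Hypothetical source and target tables with identical columns.
CREATE TABLE OldTable (Code varchar(20) NOT NULL, Qty int NOT NULL);
CREATE TABLE NewTable (Code varchar(20) NOT NULL, Qty int NOT NULL);

-- With IGNORE_DUP_KEY = ON, an INSERT skips rows that would violate the
-- unique index (raising a warning) instead of failing the whole statement.
CREATE UNIQUE INDEX UX_NewTable
    ON NewTable (Code, Qty)
    WITH (IGNORE_DUP_KEY = ON);

INSERT INTO OldTable VALUES ('A', 1), ('A', 1), ('B', 2);

-- Only one copy of each (Code, Qty) pair reaches NewTable.
INSERT INTO NewTable (Code, Qty)
SELECT Code, Qty
FROM OldTable;

-- Optional: drop the index once the data has been transferred.
DROP INDEX UX_NewTable ON NewTable;
```

Index keys have a size limit in SQL Server, so a wide column such as the question's Col2 varchar(2048) may not fit into a unique index key as-is; that is worth checking before relying on this approach.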
Oh, sure. Use a temp table. If you want a single, not very performant statement that "works", you can go with:
DELETE FROM MyTable WHERE NOT RowID IN
(SELECT
(SELECT TOP 1 RowID FROM MyTable mt2
WHERE mt2.Col1 = mt.Col1
AND mt2.Col2 = mt.Col2
AND mt2.Col3 = mt.Col3
-- ORDER BY makes TOP 1 deterministic: always the lowest RowID of the group
ORDER BY mt2.RowID)
FROM MyTable mt)
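Basically, for each row in the table, the sub-select finds the top RowID of all rows that are exactly like the row under consideration. So you end up with a list of RowIDs that represent the "original" non-duplicated rows.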
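The temp-table route mentioned at the start of this answer is not spelled out, so here is one way it might look. This is a sketch under assumptions: #Keep is a hypothetical temp table name, and keeping the lowest RowID per group is a choice made for the example.

```sql
-- Step 1: collect the RowID to keep for each distinct
-- (Col1, Col2, Col3) combination.
SELECT MIN(RowID) AS RowID
INTO #Keep
FROM MyTable
GROUP BY Col1, Col2, Col3;

-- Step 2: remove every row whose RowID is not in the keep list.
DELETE mt
FROM MyTable AS mt
WHERE NOT EXISTS (SELECT 1 FROM #Keep AS k WHERE k.RowID = mt.RowID);

-- Step 3: clean up.
DROP TABLE #Keep;
```

Splitting the work this way means the grouping is computed once, rather than once per row as in the single-statement version.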