我需要从一个相当大的SQL Server表(即300,000+行)中删除重复的行。
当然,由于RowID标识字段的存在,这些行不会完全重复。
MyTable
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
我该怎么做呢?
我需要从一个相当大的SQL Server表(即300,000+行)中删除重复的行。
当然,由于RowID标识字段的存在,这些行不会完全重复。
MyTable
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
我该怎么做呢?
当前回答
下面的查询用于删除重复的行。本例中的表以ID作为标识列,具有重复数据的列是Column1、Column2和Column3。
DELETE FROM TableName
WHERE ID NOT IN (SELECT MAX(ID)
FROM TableName
GROUP BY Column1,
Column2,
Column3
/*Even if ID is not null-able SQL Server treats MAX(ID) as potentially
nullable. Because of semantics of NOT IN (NULL) including the clause
below can simplify the plan*/
HAVING MAX(ID) IS NOT NULL)
下面的脚本显示GROUP BY、HAVING、ORDER BY在一个查询中的用法,并返回带有重复列及其计数的结果。
SELECT YourColumnName,
COUNT(*) TotalCount
FROM YourTableName
GROUP BY YourColumnName
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
其他回答
使用这个
WITH tblTemp as
(
SELECT ROW_NUMBER() Over(PARTITION BY Name,Department ORDER BY Name)
As RowNumber,* FROM <table_name>
)
DELETE FROM tblTemp where RowNumber >1
从应用程序级别(不幸的是)。我同意防止重复的正确方法是在数据库级别上使用唯一的索引,但是在SQL Server 2005中,一个索引只能是900字节,而我的varchar(2048)字段打破了这一规定。
我不知道它的性能如何,但我认为您可以编写一个触发器来强制执行这一点,即使您不能直接使用索引。喜欢的东西:
-- given a table stories(story_id int not null primary key, story varchar(max) not null)
CREATE TRIGGER prevent_plagiarism
ON stories
after INSERT, UPDATE
AS
DECLARE @cnt AS INT
SELECT @cnt = Count(*)
FROM stories
INNER JOIN inserted
ON ( stories.story = inserted.story
AND stories.story_id != inserted.story_id )
IF @cnt > 0
BEGIN
RAISERROR('plagiarism detected',16,1)
ROLLBACK TRANSACTION
END
另外,varchar(2048)对我来说听起来很可疑(生活中有些东西是2048字节,但这很少见);它真的应该不是varchar(max)吗?
CREATE TABLE car(Id int identity(1,1), PersonId int, CarId int)
INSERT INTO car(PersonId,CarId)
VALUES(1,2),(1,3),(1,2),(2,4)
--SELECT * FROM car
;WITH CTE as(
SELECT ROW_NUMBER() over (PARTITION BY personid,carid order by personid,carid) as rn,Id,PersonID,CarId from car)
DELETE FROM car where Id in(SELECT Id FROM CTE WHERE rn>1)
DELETE
FROM
table_name T1
WHERE
rowid > (
SELECT
min(rowid)
FROM
table_name T2
WHERE
T1.column_name = T2.column_name
);
我更喜欢子查询\having count(*) > 1解决方案内部连接,因为我发现它更容易阅读,它很容易变成一个SELECT语句来验证什么将被删除,然后再运行它。
--DELETE FROM table1
--WHERE id IN (
SELECT MIN(id) FROM table1
GROUP BY col1, col2, col3
-- could add a WHERE clause here to further filter
HAVING count(*) > 1
--)