I have a table with a very large number of rows. Duplicates are not allowed, but due to a problem with how the rows were created, I know there are some duplicates in this table.
I need to eliminate the extra rows from the perspective of the key columns. Some other columns may have slightly different data but I do not care about that. I still need to keep one of these rows however. SELECT DISTINCT won't work because it operates on all columns and I need to suppress duplicates based on the key columns.
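How can I delete the extra rows efficiently while still keeping one of them?
Here's my twist on it, with a runnable example. Note that this only works when Id is unique and the duplicate values are in the other columns.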
DECLARE @SampleData AS TABLE (Id int, Duplicate varchar(20))
INSERT INTO @SampleData
SELECT 1, 'ABC' UNION ALL
SELECT 2, 'ABC' UNION ALL
SELECT 3, 'LMN' UNION ALL
SELECT 4, 'XYZ' UNION ALL
SELECT 5, 'XYZ'
DELETE FROM @SampleData WHERE Id IN (
    SELECT Id FROM (
        SELECT
            Id
            ,ROW_NUMBER() OVER (PARTITION BY [Duplicate] ORDER BY Id) AS [ItemNumber]
            -- Change the partition columns to include the ones that make the row distinct
        FROM
            @SampleData
    ) a WHERE ItemNumber > 1 -- Keep only the first unique item
)
SELECT * FROM @SampleData
Results:
Id          Duplicate
----------- ---------
1           ABC
3           LMN
4           XYZ
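Not sure why that's what I thought of first... it's definitely not the simplest way to go, but it works.
You didn't say what version you're using, but in SQL 2005 and above, you can use a common table expression with the OVER clause. It goes a little something like this: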
WITH cte AS (
    SELECT [foo], [bar],
        ROW_NUMBER() OVER (PARTITION BY foo, bar ORDER BY baz) AS [rn]
    FROM [TABLE] -- placeholder: substitute your own table name
)
DELETE cte WHERE [rn] > 1
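Play around with it and see what you get.
(Edit: In an attempt to be helpful, someone edited the ORDER BY clause within the CTE. To be clear, you can order by anything you want here; it needn't be one of the columns returned by the cte. In fact, a common use case is that "foo, bar" are the group identifier and "baz" is some sort of timestamp. To keep the latest row, you'd use ORDER BY baz DESC.)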
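For what it's worth, here is a minimal sketch of the "keep the latest" variant described in the edit above. The temp table #Events, its payload column, and the sample rows are made up for illustration; foo, bar and baz are the placeholder names from the answer, and I'm assuming baz is a datetime.

```sql
-- Hypothetical sample data: (foo, bar) is the key, baz is a timestamp.
CREATE TABLE #Events (foo int, bar int, baz datetime, payload varchar(20));

INSERT INTO #Events (foo, bar, baz, payload)
SELECT 1, 1, '20200101', 'older' UNION ALL
SELECT 1, 1, '20200201', 'newer' UNION ALL
SELECT 2, 1, '20200115', 'only one';

-- Number the rows in each (foo, bar) group, newest first,
-- then delete everything except row 1 of each group.
WITH cte AS (
    SELECT foo, bar,
        ROW_NUMBER() OVER (PARTITION BY foo, bar ORDER BY baz DESC) AS rn
    FROM #Events
)
DELETE FROM cte WHERE rn > 1;

-- Inspect what is left.
SELECT foo, bar, baz, payload FROM #Events ORDER BY foo, bar;

DROP TABLE #Events;
```

This should leave the 'newer' row for (1, 1) and the single row for (2, 1). If two rows in a group tie on baz, which one survives is arbitrary, so add a tie-breaker column to the ORDER BY if that matters.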