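I need to remove duplicate rows from a fairly large SQL Server table (300,000+ rows).

The rows will not be exact duplicates, of course, because of the RowID identity field.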

MyTable

RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null

How can I do this?


Current answer

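-- SQL Server does not accept an alias directly after DELETE FROM,
-- so the alias is named as the delete target and declared in FROM.
DELETE T1
FROM
    table_name AS T1
WHERE
    T1.rowid > (
        -- keep the row with the lowest rowid in each group of duplicates
        SELECT
            MIN(T2.rowid)
        FROM
            table_name AS T2
        WHERE
            T1.column_name = T2.column_name
    );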
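Applied to the table from the question, the same idea could look roughly like the sketch below. The temporary table #MyTable and its sample rows are my own assumptions, used only so the script can be tried without touching real data; the comparison covers all three non-identity columns.

```sql
-- Hypothetical scratch copy of the question's table (assumption, not real data).
CREATE TABLE #MyTable (
    RowID int           NOT NULL IDENTITY(1,1) PRIMARY KEY,
    Col1  varchar(20)   NOT NULL,
    Col2  varchar(2048) NOT NULL,
    Col3  tinyint       NOT NULL
);

INSERT INTO #MyTable (Col1, Col2, Col3)
VALUES ('a', 'x', 1),
       ('a', 'x', 1),   -- duplicate of the first row
       ('b', 'y', 2),
       ('a', 'x', 1);   -- another duplicate of the first row

-- Delete every row whose RowID is greater than the lowest RowID in its group.
DELETE T1
FROM #MyTable AS T1
WHERE T1.RowID > (
    SELECT MIN(T2.RowID)
    FROM #MyTable AS T2
    WHERE T2.Col1 = T1.Col1
      AND T2.Col2 = T1.Col2
      AND T2.Col3 = T1.Col3
);

-- Inspect what is left: one row per distinct (Col1, Col2, Col3).
SELECT RowID, Col1, Col2, Col3
FROM #MyTable
ORDER BY RowID;

DROP TABLE #MyTable;
```

With this sample data the rows with RowID 1 and 3 remain. On the real table, replace #MyTable with MyTable and leave out the CREATE, INSERT and DROP statements.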

Other answers

-- table_name stands in for the real table (TABLE itself is a reserved word)
delete t1
from table_name t1, table_name t2
where t1.columnA = t2.columnA
and t1.rowid > t2.rowid

Postgres:

delete
from table_name t1
using table_name t2
where t1.columnA = t2.columnA
and t1.rowid > t2.rowid

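Now let's look at the elasticalsearch table, which has duplicated rows; Id is its unique identity field. If we know which Id to keep for each group defined by some grouping criteria, we can delete all the rows that fall outside that set. My approach is built on this criterion.

Many of the cases in this thread are similar to mine. Just change the grouping criteria to match your own case in order to delete the repeated (duplicated) rows.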

DELETE 
FROM elasticalsearch
WHERE Id NOT IN 
               (SELECT min(Id)
                     FROM elasticalsearch
                     GROUP BY FirmId,FilterSearchString
                     ) 

Cheers
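Since a DELETE like this is hard to undo, one cautious way to try it is inside a transaction that is rolled back after checking the result. This is only a sketch that reuses the table and column names from the answer above; the sanity-check query is my own addition.

```sql
BEGIN TRANSACTION;

-- Trial run of the delete: keep the lowest Id per (FirmId, FilterSearchString).
DELETE FROM elasticalsearch
WHERE Id NOT IN (SELECT MIN(Id)
                 FROM elasticalsearch
                 GROUP BY FirmId, FilterSearchString);

-- Sanity check: this should return no rows if every group is now unique.
SELECT FirmId, FilterSearchString, COUNT(*) AS RemainingRows
FROM elasticalsearch
GROUP BY FirmId, FilterSearchString
HAVING COUNT(*) > 1;

-- Undo the trial run; switch to COMMIT TRANSACTION once the result looks right.
ROLLBACK TRANSACTION;
```

Keep in mind that the affected rows stay locked until the transaction ends, so on a busy table the check should be kept short.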

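The following query can be used to delete duplicate rows. The table in this example has ID as its identity column, and the columns that contain duplicate data are Column1, Column2 and Column3.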

DELETE FROM TableName
WHERE  ID NOT IN (SELECT MAX(ID)
                  FROM   TableName
                  GROUP  BY Column1,
                            Column2,
                            Column3
                  /*Even if ID is not null-able SQL Server treats MAX(ID) as potentially
                    nullable. Because of semantics of NOT IN (NULL) including the clause
                    below can simplify the plan*/
                  HAVING MAX(ID) IS NOT NULL) 
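The remark about NOT IN (NULL) in the comment above refers to SQL's three-valued logic: as soon as the list can contain a NULL, NOT IN can no longer evaluate to TRUE. A minimal illustration, assuming the default ANSI_NULLS ON setting (the literal values are arbitrary):

```sql
-- 1 <> 2 and 1 <> 3 are both TRUE, so the row is returned.
SELECT 'returned' AS result WHERE 1 NOT IN (2, 3);

-- 1 <> NULL is UNKNOWN, so the whole predicate is UNKNOWN and no row is returned.
SELECT 'returned' AS result WHERE 1 NOT IN (2, NULL);
```

Ruling out NULL explicitly, as the HAVING clause does, tells the optimizer that this case cannot occur.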

The following script shows GROUP BY, HAVING and ORDER BY used together in one query; it returns each duplicated column value along with its count.

SELECT YourColumnName,
       COUNT(*) TotalCount
FROM   YourTableName
GROUP  BY YourColumnName
HAVING COUNT(*) > 1
ORDER  BY COUNT(*) DESC 
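For the table in the question, where a duplicate is defined by three columns rather than one, the same check might be written as follows. This is a sketch using the MyTable definition from the question; the FirstRowID column is my own addition to show which row a MIN-based delete would keep.

```sql
SELECT Col1,
       Col2,
       Col3,
       COUNT(*)   AS TotalCount,   -- how many rows share these values
       MIN(RowID) AS FirstRowID    -- the row a MIN(RowID)-based delete would keep
FROM   MyTable
GROUP  BY Col1, Col2, Col3
HAVING COUNT(*) > 1
ORDER  BY COUNT(*) DESC;
```

Running this before and after a delete is a cheap way to confirm that the duplicates are really gone.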

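Sometimes a soft-delete mechanism is used, where a date is recorded to indicate when a row was deleted. In that case an UPDATE statement can be used to set this field on the duplicate entries.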

UPDATE MY_TABLE
   SET DELETED = getDate()
 WHERE TABLE_ID IN (
    SELECT x.TABLE_ID
      FROM MY_TABLE x
      -- d holds one row per duplicated group; id is the lowest TABLE_ID, which is kept
      JOIN (SELECT min(TABLE_ID) AS id, COL_1, COL_2, COL_3
              FROM MY_TABLE d
             GROUP BY d.COL_1, d.COL_2, d.COL_3
            HAVING count(*) > 1) AS d ON d.COL_1 = x.COL_1
                                     AND d.COL_2 = x.COL_2
                                     AND d.COL_3 = x.COL_3
                                     AND d.id <> x.TABLE_ID
             /*WHERE x.COL_4 <> 'D' -- Additional filter*/)

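This method has served me well on moderately sized tables of about 30 million rows, with both high and low proportions of duplicates.

For the table structure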

MyTable

RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null

The query for removing duplicates:

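-- SQL Server requires an ON clause for INNER JOIN, so the equality
-- conditions go there and the RowID comparison stays in WHERE.
DELETE t1
FROM MyTable t1
INNER JOIN MyTable t2
   ON t1.Col1 = t2.Col1
  AND t1.Col2 = t2.Col2
  AND t1.Col3 = t2.Col3
WHERE t1.RowID > t2.RowID;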

I am assuming that RowID is some kind of auto-increment column and that the remaining columns hold the duplicated values.
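A different technique that is often used on SQL Server for the same job, and which is not taken from any of the answers above, is to number the rows inside each group with ROW_NUMBER() and delete through a common table expression. A sketch against the MyTable definition from the question; the names Ranked and rn are made up for illustration:

```sql
-- Number the rows within each (Col1, Col2, Col3) group, lowest RowID first.
WITH Ranked AS (
    SELECT RowID,
           ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
                              ORDER BY RowID) AS rn
    FROM MyTable
)
-- Deleting from the CTE removes the underlying MyTable rows;
-- rn = 1 (the lowest RowID in each group) is kept.
DELETE FROM Ranked
WHERE rn > 1;
```

This reads the table once instead of joining it to itself. Whether it is actually faster depends on the indexes and the data, so it is worth comparing execution plans on a copy first.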