我有一个表与以下字段:

id (Unique)
url (Unique)
title
company
site_id

现在,我需要删除具有相同标题、company和site_id的行。一种方法是使用下面的SQL和脚本(PHP):

SELECT title, site_id, location, id, count( * ) 
FROM jobs
GROUP BY site_id, company, title, location
HAVING count( * ) >1

运行此查询后,可以使用服务器端脚本删除重复项。

但是,我想知道这是否只能使用SQL查询。


当前回答

更快的方法是在临时表中插入不同的行。使用delete,我花了几个小时从一个800万行的表中删除重复项。使用insert和distinct,只花了13分钟。

CREATE TABLE tempTableName LIKE tableName;  
CREATE INDEX ix_all_id ON tableName(cellId,attributeId,entityRowId,value);  
INSERT INTO tempTableName(cellId,attributeId,entityRowId,value) SELECT DISTINCT cellId,attributeId,entityRowId,value FROM tableName;  
TRUNCATE TABLE tableName;
INSERT INTO tableName SELECT * FROM tempTableName; 
DROP TABLE tempTableName;  

其他回答

一种简单易懂且不需要主键的解决方案:

add a new boolean column alter table mytable add tokeep boolean; add a constraint on the duplicated columns AND the new column alter table mytable add constraint preventdupe unique (mycol1, mycol2, tokeep); set the boolean column to true. This will succeed only on one of the duplicated rows because of the new constraint update ignore mytable set tokeep = true; delete rows that have not been marked as tokeep delete from mytable where tokeep is null; drop the added column alter table mytable drop tokeep;

我建议您保留您添加的约束,以便将来防止出现新的重复。

我有这个查询片段的SQLServer,但我认为它可以用在其他DBMS与小的变化:

DELETE
FROM Table
WHERE Table.idTable IN  (  
    SELECT MAX(idTable)
    FROM idTable
    GROUP BY field1, field2, field3
    HAVING COUNT(*) > 1)

我忘了告诉您,这个查询不会删除重复行中id最低的行。如果这对你有用,试试这个查询:

DELETE
FROM jobs
WHERE jobs.id IN  (  
    SELECT MAX(id)
    FROM jobs
    GROUP BY site_id, company, title, location
    HAVING COUNT(*) > 1)

为了复制具有唯一列的记录,例如COL1,COL2, COL3不应该被复制(假设我们在表结构中遗漏了3个唯一列,并且在表中添加了多个重复项)

DROP TABLE TABLE_NAME_copy;
CREATE TABLE TABLE_NAME_copy LIKE TABLE_NAME;
INSERT INTO TABLE_NAME_copy
SELECT * FROM TABLE_NAME
GROUP BY COLUMN1, COLUMN2, COLUMN3; 
DROP TABLE TABLE_NAME;
ALTER TABLE TABLE_NAME_copy RENAME TO TABLE_NAME;

希望将有助于开发。

使用Delete JOIN语句删除重复的行:

DELETE t1 FROM table_name t1
JOIN table_name t2
WHERE
    t1.id < t2.id AND
    t1.title = t2.title AND t1.company = t2.company AND t1.site_id = t2.site_id;

要做到这一点,一个非常简单的方法是在3列上添加UNIQUE索引。在编写ALTER语句时,请包含IGNORE关键字。像这样:

ALTER IGNORE TABLE jobs
ADD UNIQUE INDEX idx_name (site_id, title, company);

这将删除所有重复的行。作为一个额外的好处,将来重复的insert将出错。像往常一样,在运行这样的程序之前,您可能想要进行备份…

编辑:不再工作在MySQL 5.7+

这个特性在MySQL 5.6中已经被弃用,在MySQL 5.7中被移除,所以它不起作用。