I have a table with the following fields:

id (Unique)
url (Unique)
title
company
site_id

Now I need to delete the rows that have the same title, company and site_id. One way to do it is with the following SQL plus a script (PHP):

SELECT title, site_id, location, id, count( * ) 
FROM jobs
GROUP BY site_id, company, title, location
HAVING count( * ) >1

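After running this query, I can delete the duplicates with a server-side script.

But I would like to know whether this can be done using only an SQL query.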


Current answer

MySQL has restrictions on referring to the table you are deleting from. You can work around that with a temporary table, like:

create temporary table tmpTable (id int);

insert  into tmpTable
        (id)
select  id
from    YourTable yt
where   exists
        (
        select  *
        from    YourTable yt2
        where   yt2.title = yt.title
                and yt2.company = yt.company
                and yt2.site_id = yt.site_id
                and yt2.id > yt.id
        );

delete  
from    YourTable
where   ID in (select id from tmpTable);

From Kostanos' suggestion in the comments: the only slow query above is the DELETE, in cases where the database is very large. This query could be faster:

DELETE FROM YourTable USING YourTable, tmpTable WHERE YourTable.id=tmpTable.id
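
To see which rows the temporary-table approach removes, here is a minimal sketch that runs it on a tiny data set. It uses Python's built-in sqlite3 module as a stand-in for MySQL so it can be run without a server. The jobs table name, the sample rows and the helper names are assumptions made for illustration. SQLite does not have MySQL's restriction, so this only demonstrates the logic: for each (title, company, site_id) group, every row that has a later row with a higher id is deleted, and the row with the highest id survives.

```python
import sqlite3

def make_sample_db():
    """Build an in-memory jobs table with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, url TEXT UNIQUE, "
        "title TEXT, company TEXT, site_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?, ?, ?, ?)",
        [
            (1, "u1", "Dev", "Acme", 10),
            (2, "u2", "Dev", "Acme", 10),  # duplicate of id 1
            (3, "u3", "Dev", "Acme", 11),  # different site_id, not a duplicate
            (4, "u4", "Ops", "Acme", 10),  # different title, not a duplicate
            (5, "u5", "Dev", "Acme", 10),  # duplicate of id 1 and 2
        ],
    )
    conn.commit()
    return conn

def delete_duplicates_via_temp_table(conn):
    """Delete every row for which a same-key row with a higher id exists."""
    cur = conn.cursor()
    cur.execute("CREATE TEMPORARY TABLE tmpTable (id INTEGER)")
    # Step 1: collect the ids to delete in the temporary table.
    cur.execute(
        """
        INSERT INTO tmpTable (id)
        SELECT yt.id
        FROM jobs yt
        WHERE EXISTS (
            SELECT 1
            FROM jobs yt2
            WHERE yt2.title = yt.title
              AND yt2.company = yt.company
              AND yt2.site_id = yt.site_id
              AND yt2.id > yt.id
        )
        """
    )
    # Step 2: delete by id, reading only from the temporary table.
    cur.execute("DELETE FROM jobs WHERE id IN (SELECT id FROM tmpTable)")
    deleted = cur.rowcount
    cur.execute("DROP TABLE tmpTable")
    conn.commit()
    return deleted
```

Calling delete_duplicates_via_temp_table(make_sample_db()) removes the two older copies and leaves one row per group. Note that rows with NULL in any of the compared columns are never matched by the equality tests, so they would be left alone.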

Other answers

This solution will move the duplicates into one table and the uniques into another.

-- speed up creating uniques table if dealing with many rows
CREATE INDEX temp_idx ON jobs(site_id, company, title, location);

-- create the table with unique rows
INSERT jobs_uniques SELECT * FROM
    (
    SELECT * 
    FROM jobs
    GROUP BY site_id, company, title, location
    HAVING count(1) > 1
    UNION
    SELECT *
    FROM jobs
    GROUP BY site_id, company, title, location
    HAVING count(1) = 1
) x;

-- create the table with duplicate rows
INSERT jobs_dupes 
SELECT * 
FROM jobs
WHERE id NOT IN
(SELECT id FROM jobs_uniques);

-- confirm the difference between uniques and dupes tables
SELECT COUNT(1)
AS jobs, 
(SELECT COUNT(1) FROM jobs_dupes) + (SELECT COUNT(1) FROM jobs_uniques)
AS sum
FROM jobs;

A simple, easy-to-understand solution that does not require a primary key:

Add a new boolean column:

alter table mytable add tokeep boolean;

Add a constraint on the duplicated columns AND the new column:

alter table mytable add constraint preventdupe unique (mycol1, mycol2, tokeep);

Set the boolean column to true. This will succeed on only one of the duplicated rows because of the new constraint:

update ignore mytable set tokeep = true;

Delete the rows that have not been marked as tokeep:

delete from mytable where tokeep is null;

Drop the added column:

alter table mytable drop tokeep;

I suggest that you keep the constraint you added, so that new duplicates are prevented in the future.
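
The tokeep trick can be tried out with the following sketch, which uses Python's built-in sqlite3 module as a stand-in for MySQL. The table contents and the function name are hypothetical, and two things differ from the MySQL statements: SQLite spells the update as UPDATE OR IGNORE, and the constraint is created as a unique index. The final "drop the column" step is left out here. The trick works because NULL values never collide in a unique index, so all rows start out legal, but only one row per (mycol1, mycol2) pair can be switched to a non-NULL tokeep.

```python
import sqlite3

def keep_one_row_per_group(conn):
    """Mark one row per (mycol1, mycol2) pair as kept, then delete the rest."""
    cur = conn.cursor()
    # New column defaults to NULL for every existing row.
    cur.execute("ALTER TABLE mytable ADD COLUMN tokeep BOOLEAN")
    # NULLs are distinct in a unique index, so existing duplicates do not violate it yet.
    cur.execute(
        "CREATE UNIQUE INDEX preventdupe ON mytable (mycol1, mycol2, tokeep)"
    )
    # Rows whose update would violate the index are skipped instead of raising an error.
    cur.execute("UPDATE OR IGNORE mytable SET tokeep = 1")
    # Whatever could not be marked is a duplicate.
    cur.execute("DELETE FROM mytable WHERE tokeep IS NULL")
    conn.commit()

def make_sample_table():
    """Hypothetical table without a primary key, containing one duplicated pair."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (mycol1 TEXT, mycol2 TEXT, note TEXT)")
    conn.executemany(
        "INSERT INTO mytable VALUES (?, ?, ?)",
        [
            ("a", "x", "first copy"),
            ("a", "x", "second copy"),
            ("b", "y", "unique"),
            ("a", "z", "unique"),
        ],
    )
    conn.commit()
    return conn
```

Which of the duplicated rows survives depends on the order in which the engine visits them, so this approach fits best when the copies are interchangeable.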

I wanted to be more specific about which records I delete, so here is my solution:

delete
from jobs c1
where not c1.location = 'Paris'
and  c1.site_id > 64218
and exists 
(  
select * from jobs c2 
where c2.site_id = c1.site_id
and   c2.company = c1.company
and   c2.location = c1.location
and   c2.title = c1.title
and   c2.site_id > 63412
and   c2.site_id < 64219
)

I found a simple way. (It keeps the latest row.)

DELETE t1 FROM table_name t1 INNER JOIN table_name t2 
WHERE t1.primary_id < t2.primary_id 
AND t1.check_duplicate_col_1 = t2.check_duplicate_col_1 
AND t1.check_duplicate_col_2 = t2.check_duplicate_col_2
...

Deleting duplicates from MySQL tables is a common issue, generally the result of a missing constraint that would have prevented those duplicates beforehand. But this common issue usually comes with specific needs that require specific approaches. The approach should differ depending on, for example, the size of the data, which duplicated entry should be kept (generally the first or the last one), whether there are indexes to be kept, or whether we want to perform any additional action on the duplicated data.

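MySQL itself also has some peculiarities, such as not being able to reference the same table in a FROM clause when performing a table UPDATE (it raises MySQL error #1093). This limitation can be overcome by using an inner query with a temporary table (as suggested in some of the approaches above). But this inner query does not perform especially well on big data sources.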
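
For reference, one common shape of that inner-query workaround wraps the subquery in a derived table, so the statement never reads directly from the table it is modifying. The sketch below is an illustration, not taken from the answers above: the jobs table, the sample rows and the names keepers and dedupe_keep_lowest_id are assumptions. It runs against Python's built-in sqlite3 module as a stand-in for MySQL. SQLite never raises error #1093, so the sketch only shows the shape and the result of the query.

```python
import sqlite3

# The inner SELECT computes the id to keep for each group. Wrapping it in a
# derived table ("keepers") is what is commonly used on MySQL to avoid
# reading from the table being modified.
DELETE_KEEP_LOWEST_ID = """
DELETE FROM jobs
WHERE id NOT IN (
    SELECT keep_id FROM (
        SELECT MIN(id) AS keep_id
        FROM jobs
        GROUP BY site_id, company, title
    ) AS keepers
)
"""

def dedupe_keep_lowest_id(conn):
    """Delete all rows except the lowest id of each group; return rows deleted."""
    cur = conn.execute(DELETE_KEEP_LOWEST_ID)
    conn.commit()
    return cur.rowcount

def make_jobs_db():
    """In-memory jobs table with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, url TEXT UNIQUE, "
        "title TEXT, company TEXT, site_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?, ?, ?, ?)",
        [
            (1, "u1", "Dev", "Acme", 10),
            (2, "u2", "Dev", "Acme", 10),
            (3, "u3", "Dev", "Acme", 11),
            (4, "u4", "Ops", "Acme", 10),
            (5, "u5", "Dev", "Acme", 10),
        ],
    )
    conn.commit()
    return conn
```

Because the grouping has to be computed over the whole table before anything is deleted, the cost grows with table size, which is the performance concern raised above.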

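However, a better approach for removing duplicates does exist, one that is both efficient and reliable and that can easily be adapted to different needs.

The general idea is to create a new temporary table, usually adding a unique constraint to avoid further duplicates, and to INSERT the data from the former table into the new one while taking care of the duplicates. This approach relies on simple MySQL INSERT queries, creates a new constraint to avoid further duplicates, and skips the need for an inner query to search for duplicates and for a temporary table that has to be kept in memory (so it suits big data sources too).

This is how it can be done. Suppose we have a table employee with the following columns: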

employee (id, first_name, last_name, start_date, ssn)

To delete the rows with a duplicated ssn column, keeping only the first entry found, the following process can be followed:

-- create a new tmp_employee table
CREATE TABLE tmp_employee LIKE employee;

-- add a unique constraint
ALTER TABLE tmp_employee ADD UNIQUE(ssn);

-- scan over the employee table to insert employee entries
INSERT IGNORE INTO tmp_employee SELECT * FROM employee ORDER BY id;

-- rename tables
RENAME TABLE employee TO backup_employee, tmp_employee TO employee;

Technical explanation

Line #1 creates a new tmp_employee table with exactly the same structure as the employee table.
Line #2 adds a UNIQUE constraint to the new tmp_employee table to avoid any further duplicates.
Line #3 scans over the original employee table by id, inserting new employee entries into the new tmp_employee table while ignoring duplicated entries.
Line #4 renames the tables, so that the new employee table holds all the entries without the duplicates, and a backup copy of the former data is kept in the backup_employee table.
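
The four steps can be exercised on a handful of rows with the sketch below. It uses Python's built-in sqlite3 module as a stand-in for MySQL, which forces some substitutions: SQLite has no CREATE TABLE ... LIKE, so the new table is spelled out with its UNIQUE constraint; INSERT IGNORE becomes INSERT OR IGNORE; and the rename is done with two ALTER TABLE statements. The sample data, the function names and the keep_last flag are assumptions added for illustration.

```python
import sqlite3

def make_employee_db():
    """In-memory employee table with hypothetical rows; ssn 111 and 222 are duplicated."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employee (id INTEGER PRIMARY KEY, first_name TEXT, "
        "last_name TEXT, start_date TEXT, ssn TEXT)"
    )
    conn.executemany(
        "INSERT INTO employee VALUES (?, ?, ?, ?, ?)",
        [
            (1, "Ann", "Lee", "2020-01-01", "111"),
            (2, "Bob", "Ray", "2020-02-01", "222"),
            (3, "Ann", "Lee", "2021-01-01", "111"),
            (4, "Cy", "Fox", "2020-03-01", "333"),
            (5, "Bob", "Ray", "2022-02-01", "222"),
        ],
    )
    conn.commit()
    return conn

def dedupe_into_new_table(conn, keep_last=False):
    """Copy employee into a table with UNIQUE(ssn), skipping duplicates, then swap."""
    order = "DESC" if keep_last else "ASC"
    with conn:  # commits on success, rolls back on error
        # Steps 1 and 2: same columns as employee, plus the unique constraint.
        conn.execute(
            "CREATE TABLE tmp_employee (id INTEGER PRIMARY KEY, first_name TEXT, "
            "last_name TEXT, start_date TEXT, ssn TEXT UNIQUE)"
        )
        # Step 3: the first row seen for each ssn is inserted, later ones are skipped.
        conn.execute(
            "INSERT OR IGNORE INTO tmp_employee "
            f"SELECT * FROM employee ORDER BY id {order}"
        )
        # Step 4: keep the old data as a backup and promote the new table.
        conn.execute("ALTER TABLE employee RENAME TO backup_employee")
        conn.execute("ALTER TABLE tmp_employee RENAME TO employee")
```

With the default arguments the lowest id of each ssn survives. Passing keep_last=True reverses the scan order, so the highest id wins instead.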

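Using this approach, 1.6M rows were reduced to 6k in less than 200 seconds.

Chetan, following this process, you could quickly and easily remove all your duplicates and create a UNIQUE constraint by running: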

CREATE TABLE tmp_jobs LIKE jobs;

ALTER TABLE tmp_jobs ADD UNIQUE(site_id, title, company);

INSERT IGNORE INTO tmp_jobs SELECT * FROM jobs ORDER BY id;

RENAME TABLE jobs TO backup_jobs, tmp_jobs TO jobs;

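Of course, this process can be modified further to fit different needs when deleting duplicates. Some examples follow.

✔ Variation for keeping the last entry instead of the first one

Sometimes we need to keep the last duplicated entry instead of the first one.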

CREATE TABLE tmp_employee LIKE employee;

ALTER TABLE tmp_employee ADD UNIQUE(ssn);

INSERT IGNORE INTO tmp_employee SELECT * FROM employee ORDER BY id DESC;

RENAME TABLE employee TO backup_employee, tmp_employee TO employee;

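On line #3, the ORDER BY id DESC clause makes the last ids take priority over the rest.

✔ Variation for performing some task on the duplicates, for example keeping a count of the duplicates found

Sometimes we need to perform some further processing on the duplicated entries that are found (such as keeping a count of the duplicates).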

CREATE TABLE tmp_employee LIKE employee;

ALTER TABLE tmp_employee ADD UNIQUE(ssn);

ALTER TABLE tmp_employee ADD COLUMN n_duplicates INT DEFAULT 0;

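-- list the columns explicitly: tmp_employee now has one more column (n_duplicates) than employee
INSERT INTO tmp_employee (id, first_name, last_name, start_date, ssn) SELECT id, first_name, last_name, start_date, ssn FROM employee ORDER BY id ON DUPLICATE KEY UPDATE n_duplicates=n_duplicates+1;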

RENAME TABLE employee TO backup_employee, tmp_employee TO employee;

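On line #3, a new column n_duplicates is created.
On line #4, the INSERT INTO ... ON DUPLICATE KEY UPDATE query is used to perform an additional update when a duplicate is found (in this case, increasing a counter).
The INSERT INTO ... ON DUPLICATE KEY UPDATE query can be used to perform different types of updates on the duplicates found.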
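
The counting variation can be checked on a small data set with the sketch below, again using Python's built-in sqlite3 module as a stand-in for MySQL. SQLite has no ON DUPLICATE KEY UPDATE. Its closest equivalent is the ON CONFLICT ... DO UPDATE clause, which needs SQLite 3.24 or later, so the statement is an analogue rather than the MySQL syntax. The sample rows and function names are hypothetical.

```python
import sqlite3

def make_employee_db():
    """In-memory employee table with hypothetical rows; ssn 111 and 222 appear twice."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employee (id INTEGER PRIMARY KEY, first_name TEXT, "
        "last_name TEXT, start_date TEXT, ssn TEXT)"
    )
    conn.executemany(
        "INSERT INTO employee VALUES (?, ?, ?, ?, ?)",
        [
            (1, "Ann", "Lee", "2020-01-01", "111"),
            (2, "Bob", "Ray", "2020-02-01", "222"),
            (3, "Ann", "Lee", "2021-01-01", "111"),
            (4, "Cy", "Fox", "2020-03-01", "333"),
            (5, "Bob", "Ray", "2022-02-01", "222"),
        ],
    )
    conn.commit()
    return conn

def dedupe_with_counter(conn):
    """Keep the first row per ssn and count how many duplicates were skipped."""
    with conn:
        conn.execute(
            "CREATE TABLE tmp_employee (id INTEGER PRIMARY KEY, first_name TEXT, "
            "last_name TEXT, start_date TEXT, ssn TEXT UNIQUE, "
            "n_duplicates INTEGER DEFAULT 0)"
        )
        # The column list is explicit because tmp_employee has an extra column.
        # "WHERE 1" is there because SQLite asks for a WHERE clause on the SELECT
        # when an upsert clause follows it, to keep the parse unambiguous.
        conn.execute(
            """
            INSERT INTO tmp_employee (id, first_name, last_name, start_date, ssn)
            SELECT id, first_name, last_name, start_date, ssn
            FROM employee
            WHERE 1
            ORDER BY id
            ON CONFLICT(ssn) DO UPDATE SET n_duplicates = n_duplicates + 1
            """
        )
        conn.execute("ALTER TABLE employee RENAME TO backup_employee")
        conn.execute("ALTER TABLE tmp_employee RENAME TO employee")
```

After the call, each surviving row carries the number of later rows that shared its ssn, and rows that were never duplicated keep the default of 0.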

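✔ Variation for regenerating the auto-incremental field id

Sometimes we use an auto-incremental field and, to keep the index as compact as possible, we can take advantage of the deletion of the duplicates to regenerate the auto-incremental field in the new temporary table.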

CREATE TABLE tmp_employee LIKE employee;

ALTER TABLE tmp_employee ADD UNIQUE(ssn);

INSERT IGNORE INTO tmp_employee (first_name, last_name, start_date, ssn) SELECT first_name, last_name, start_date, ssn FROM employee ORDER BY id;

RENAME TABLE employee TO backup_employee, tmp_employee TO employee;

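On line #3, instead of selecting all the fields of the table, the id field is skipped so that the DB engine generates a new one automatically.

✔ Further variations

Many further modifications are possible, depending on the desired behavior. As an example, the following queries use a second temporary table in order to, besides 1) keeping the last entry instead of the first one and 2) increasing a counter on the duplicates found, also 3) regenerate the auto-incremental field id while keeping the entry order as it was in the former data.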

CREATE TABLE tmp_employee LIKE employee;

ALTER TABLE tmp_employee ADD UNIQUE(ssn);

ALTER TABLE tmp_employee ADD COLUMN n_duplicates INT DEFAULT 0;

INSERT INTO tmp_employee (id, first_name, last_name, start_date, ssn) SELECT id, first_name, last_name, start_date, ssn FROM employee ORDER BY id DESC ON DUPLICATE KEY UPDATE n_duplicates=n_duplicates+1;

CREATE TABLE tmp_employee2 LIKE tmp_employee;

INSERT INTO tmp_employee2 (first_name, last_name, start_date, ssn, n_duplicates) SELECT first_name, last_name, start_date, ssn, n_duplicates FROM tmp_employee ORDER BY id;

DROP TABLE tmp_employee;

RENAME TABLE employee TO backup_employee, tmp_employee2 TO employee;