我有一个大约有500k行的表;varchar(255) UTF8列文件名包含一个文件名;

我试图从文件名中剥离出各种奇怪的字符-我想我会使用字符类:[^a- za - z0 -9()_ .\-]

现在,MySQL中是否有一个函数允许你通过正则表达式进行替换?我正在寻找一个类似的功能REPLACE()函数-简化的例子如下:

SELECT REPLACE('stackowerflow', 'ower', 'over');

Output: "stackoverflow"

/* does something like this exist? */
SELECT X_REG_REPLACE('Stackoverflow','/[A-Zf]/','-'); 

Output: "-tackover-low"

我知道REGEXP/RLIKE,但它们只检查是否有匹配,而不检查匹配是什么。

(我可以做一个“SELECT pkey_id,filename FROM foo WHERE filename RLIKE '[^a- za - z0 -9()_ .\-]'”从一个PHP脚本,做一个preg_replace,然后“更新foo…WHERE pkey_id=…”,但这看起来像一个最后的手段缓慢和丑陋的黑客)


MySQL 8.0 +:

您可以使用本机REGEXP_REPLACE函数。

旧版本:

您可以使用用户定义函数(UDF),如mysql-udf-regexp。


我最近写了一个MySQL函数来使用正则表达式替换字符串。你可以在以下地点找到我的职位:

http://techras.wordpress.com/2011/06/02/regex-replace-for-mysql/

下面是函数代码:

DELIMITER $$

CREATE FUNCTION  `regex_replace`(pattern VARCHAR(1000),replacement VARCHAR(1000),original VARCHAR(1000))
RETURNS VARCHAR(1000)
DETERMINISTIC
BEGIN 
 DECLARE temp VARCHAR(1000); 
 DECLARE ch VARCHAR(1); 
 DECLARE i INT;
 SET i = 1;
 SET temp = '';
 IF original REGEXP pattern THEN 
  loop_label: LOOP 
   IF i>CHAR_LENGTH(original) THEN
    LEAVE loop_label;  
   END IF;
   SET ch = SUBSTRING(original,i,1);
   IF NOT ch REGEXP pattern THEN
    SET temp = CONCAT(temp,ch);
   ELSE
    SET temp = CONCAT(temp,replacement);
   END IF;
   SET i=i+1;
  END LOOP;
 ELSE
  SET temp = original;
 END IF;
 RETURN temp;
END$$

DELIMITER ;

示例执行:

mysql> select regex_replace('[^a-zA-Z0-9\-]','','2my test3_text-to. check \\ my- sql (regular) ,expressions ._,');

我使用的蛮力方法是:

转储表- mysqldump -u user -p数据库表> Dump .sql 查找并替换两个模式——查找/path/to/dump。sql -type f -exec sed -i 's/old_string/new_string/g'{} \;,显然还有其他perl正则表达式可以在文件上执行。 导入表- mysqlimport -u user -p database table < dump.sql

如果您希望确保字符串不在数据集中的其他地方,请运行一些正则表达式以确保它们都出现在类似的环境中。在运行替换之前创建备份也不是那么困难,以防您意外地破坏了一些失去深度信息的东西。


你“能”做到……但这不太明智……这是我最大胆的尝试……至于完整的RegEx支持,你最好使用perl或类似的。

UPDATE db.tbl
SET column = 
CASE 
WHEN column REGEXP '[[:<:]]WORD_TO_REPLACE[[:>:]]' 
THEN REPLACE(column,'WORD_TO_REPLACE','REPLACEMENT')
END 
WHERE column REGEXP '[[:<:]]WORD_TO_REPLACE[[:>:]]'

我很高兴地告诉大家,既然问了这个问题,现在有了一个令人满意的答案!看看这个很棒的套餐吧:

https://github.com/mysqludf/lib_mysqludf_preg

示例SQL:

SELECT PREG_REPLACE('/(.*?)(fox)/' , 'dog' , 'the quick brown fox' ) AS demo;

我从这篇博客文章中找到了这个问题的链接。


如果您使用的是MariaDB或MySQL 8.0,它们有一个功能

REGEXP_REPLACE(col, regexp, replace)

参见MariaDB文档和PCRE正则表达式增强

注意,您也可以使用regexp分组(我发现这非常有用):

SELECT REGEXP_REPLACE("stackoverflow", "(stack)(over)(flow)", '\\2 - \\1 - \\3')

返回

over - stack - flow

我们可以在SELECT查询中使用IF条件,如下:

假设对于包含“ABC”,“ABC1”,“ABC2”,“ABC3”的任何东西,…,我们想用“ABC”替换,然后在SELECT查询中使用REGEXP和IF()条件,我们可以实现这一点。

语法:

SELECT IF(column_name REGEXP 'ABC[0-9]$','ABC',column_name)
FROM table1 
WHERE column_name LIKE 'ABC%';

例子:

SELECT IF('ABC1' REGEXP 'ABC[0-9]$','ABC','ABC1');

我们不用正则表达式来解决这个问题 此查询只替换精确匹配字符串。

update employee set
employee_firstname = 
trim(REPLACE(concat(" ",employee_firstname," "),' jay ',' abc '))

例子:

emp_id employee_firstname 1杰 2 jay ajay 三杰

执行查询结果后:

emp_id employee_firstname 1美国广播公司 2 ABC ajay 3 abc


更新2:MySQL 8.0中提供了一组有用的regex函数,包括REGEXP_REPLACE。这将使阅读变得不必要,除非您必须使用较早的版本。


更新1:现在已经把这变成了一篇博客:http://stevettt.blogspot.co.uk/2018/02/a-mysql-regular-expression-replace.html


下面扩展了Rasika Godawatte提供的函数,但它涵盖了所有必要的子字符串,而不仅仅是测试单个字符:

-- ------------------------------------------------------------------------------------
-- USAGE
-- ------------------------------------------------------------------------------------
-- SELECT reg_replace(<subject>,
--                    <pattern>,
--                    <replacement>,
--                    <greedy>,
--                    <minMatchLen>,
--                    <maxMatchLen>);
-- where:
-- <subject> is the string to look in for doing the replacements
-- <pattern> is the regular expression to match against
-- <replacement> is the replacement string
-- <greedy> is TRUE for greedy matching or FALSE for non-greedy matching
-- <minMatchLen> specifies the minimum match length
-- <maxMatchLen> specifies the maximum match length
-- (minMatchLen and maxMatchLen are used to improve efficiency but are
--  optional and can be set to 0 or NULL if not known/required)
-- Example:
-- SELECT reg_replace(txt, '^[Tt][^ ]* ', 'a', TRUE, 2, 0) FROM tbl;
DROP FUNCTION IF EXISTS reg_replace;
DELIMITER //
CREATE FUNCTION reg_replace(subject VARCHAR(21845), pattern VARCHAR(21845),
  replacement VARCHAR(21845), greedy BOOLEAN, minMatchLen INT, maxMatchLen INT)
RETURNS VARCHAR(21845) DETERMINISTIC BEGIN 
  DECLARE result, subStr, usePattern VARCHAR(21845); 
  DECLARE startPos, prevStartPos, startInc, len, lenInc INT;
  IF subject REGEXP pattern THEN
    SET result = '';
    -- Sanitize input parameter values
    SET minMatchLen = IF(minMatchLen IS NULL OR minMatchLen < 1, 1, minMatchLen);
    SET maxMatchLen = IF(maxMatchLen IS NULL OR maxMatchLen < 1
                         OR maxMatchLen > CHAR_LENGTH(subject),
                         CHAR_LENGTH(subject), maxMatchLen);
    -- Set the pattern to use to match an entire string rather than part of a string
    SET usePattern = IF (LEFT(pattern, 1) = '^', pattern, CONCAT('^', pattern));
    SET usePattern = IF (RIGHT(pattern, 1) = '$', usePattern, CONCAT(usePattern, '$'));
    -- Set start position to 1 if pattern starts with ^ or doesn't end with $.
    IF LEFT(pattern, 1) = '^' OR RIGHT(pattern, 1) <> '$' THEN
      SET startPos = 1, startInc = 1;
    -- Otherwise (i.e. pattern ends with $ but doesn't start with ^): Set start pos
    -- to the min or max match length from the end (depending on "greedy" flag).
    ELSEIF greedy THEN
      SET startPos = CHAR_LENGTH(subject) - maxMatchLen + 1, startInc = 1;
    ELSE
      SET startPos = CHAR_LENGTH(subject) - minMatchLen + 1, startInc = -1;
    END IF;
    WHILE startPos >= 1 AND startPos <= CHAR_LENGTH(subject)
      AND startPos + minMatchLen - 1 <= CHAR_LENGTH(subject)
      AND !(LEFT(pattern, 1) = '^' AND startPos <> 1)
      AND !(RIGHT(pattern, 1) = '$'
            AND startPos + maxMatchLen - 1 < CHAR_LENGTH(subject)) DO
      -- Set start length to maximum if matching greedily or pattern ends with $.
      -- Otherwise set starting length to the minimum match length.
      IF greedy OR RIGHT(pattern, 1) = '$' THEN
        SET len = LEAST(CHAR_LENGTH(subject) - startPos + 1, maxMatchLen), lenInc = -1;
      ELSE
        SET len = minMatchLen, lenInc = 1;
      END IF;
      SET prevStartPos = startPos;
      lenLoop: WHILE len >= 1 AND len <= maxMatchLen
                 AND startPos + len - 1 <= CHAR_LENGTH(subject)
                 AND !(RIGHT(pattern, 1) = '$' 
                       AND startPos + len - 1 <> CHAR_LENGTH(subject)) DO
        SET subStr = SUBSTRING(subject, startPos, len);
        IF subStr REGEXP usePattern THEN
          SET result = IF(startInc = 1,
                          CONCAT(result, replacement), CONCAT(replacement, result));
          SET startPos = startPos + startInc * len;
          LEAVE lenLoop;
        END IF;
        SET len = len + lenInc;
      END WHILE;
      IF (startPos = prevStartPos) THEN
        SET result = IF(startInc = 1, CONCAT(result, SUBSTRING(subject, startPos, 1)),
                        CONCAT(SUBSTRING(subject, startPos, 1), result));
        SET startPos = startPos + startInc;
      END IF;
    END WHILE;
    IF startInc = 1 AND startPos <= CHAR_LENGTH(subject) THEN
      SET result = CONCAT(result, RIGHT(subject, CHAR_LENGTH(subject) + 1 - startPos));
    ELSEIF startInc = -1 AND startPos >= 1 THEN
      SET result = CONCAT(LEFT(subject, startPos), result);
    END IF;
  ELSE
    SET result = subject;
  END IF;
  RETURN result;
END//
DELIMITER ;

Demo

Rextester演示

限制

This method is of course going to take a while when the subject string is large. Update: Have now added minimum and maximum match length parameters for improved efficiency when these are known (zero = unknown/unlimited). It won't allow substitution of backreferences (e.g. \1, \2 etc.) to replace capturing groups. If this functionality is needed, please see this answer which attempts to provide a workaround by updating the function to allow a secondary find and replace within each found match (at the expense of increased complexity). If ^and/or $ is used in the pattern, they must be at the very start and very end respectively - e.g. patterns such as (^start|end$) are not supported. There is a "greedy" flag to specify whether the overall matching should be greedy or non-greedy. Combining greedy and lazy matching within a single regular expression (e.g. a.*?b.*) is not supported.

用法示例

该函数用于回答以下StackOverflow问题:

如何计数单词在MySQL /正则表达式 替代者? 如何提取第n个词和计数词出现在一个MySQL 字符串? 如何从文本字段中提取两个连续的数字 MySQL吗? 如何从一个字符串中删除所有非字母数字字符 MySQL吗? 如何替换MySQL中一个特定字符的每一个其他实例 字符串? 如何从MySQL表中的多个列中获得指定的最小长度的所有不同的单词?


在MySQL 8.0+中,可以使用本地REGEXP_REPLACE函数。

12.5.2正则表达式:

REGEXP_REPLACE(expr, pat, repl[, pos[, occurrence[, match_type]]]) 将字符串expr中与模式pat指定的正则表达式匹配的事件替换为替换字符串repl,并返回结果字符串。如果expr、pat或repl为NULL,则返回值为NULL。

和正则表达式支持:

Previously, MySQL used the Henry Spencer regular expression library to support regular expression operators (REGEXP, RLIKE). Regular expression support has been reimplemented using International Components for Unicode (ICU), which provides full Unicode support and is multibyte safe. The REGEXP_LIKE() function performs regular expression matching in the manner of the REGEXP and RLIKE operators, which now are synonyms for that function. In addition, the REGEXP_INSTR(), REGEXP_REPLACE(), and REGEXP_SUBSTR() functions are available to find match positions and perform substring substitution and extraction, respectively.

SELECT REGEXP_REPLACE('Stackoverflow','[A-Zf]','-',1,0,'c'); 
-- Output:
-tackover-low

DBFiddle演示


下面的一个基本上从左边找到第一个匹配,然后替换它的所有出现(在mysql-5.6中测试)。

用法:

SELECT REGEX_REPLACE('dis ambiguity', 'dis[[:space:]]*ambiguity', 'disambiguity');

实现:

DELIMITER $$
CREATE FUNCTION REGEX_REPLACE(
  var_original VARCHAR(1000),
  var_pattern VARCHAR(1000),
  var_replacement VARCHAR(1000)
  ) RETURNS
    VARCHAR(1000)
  COMMENT 'Based on https://techras.wordpress.com/2011/06/02/regex-replace-for-mysql/'
BEGIN
  DECLARE var_replaced VARCHAR(1000) DEFAULT var_original;
  DECLARE var_leftmost_match VARCHAR(1000) DEFAULT
    REGEX_CAPTURE_LEFTMOST(var_original, var_pattern);
    WHILE var_leftmost_match IS NOT NULL DO
      IF var_replacement <> var_leftmost_match THEN
        SET var_replaced = REPLACE(var_replaced, var_leftmost_match, var_replacement);
        SET var_leftmost_match = REGEX_CAPTURE_LEFTMOST(var_replaced, var_pattern);
        ELSE
          SET var_leftmost_match = NULL;
        END IF;
      END WHILE;
  RETURN var_replaced;
END $$
DELIMITER ;

DELIMITER $$
CREATE FUNCTION REGEX_CAPTURE_LEFTMOST(
  var_original VARCHAR(1000),
  var_pattern VARCHAR(1000)
  ) RETURNS
    VARCHAR(1000)
  COMMENT '
  Captures the leftmost substring that matches the [var_pattern]
  IN [var_original], OR NULL if no match.
  '
BEGIN
  DECLARE var_temp_l VARCHAR(1000);
  DECLARE var_temp_r VARCHAR(1000);
  DECLARE var_left_trim_index INT;
  DECLARE var_right_trim_index INT;
  SET var_left_trim_index = 1;
  SET var_right_trim_index = 1;
  SET var_temp_l = '';
  SET var_temp_r = '';
  WHILE (CHAR_LENGTH(var_original) >= var_left_trim_index) DO
    SET var_temp_l = LEFT(var_original, var_left_trim_index);
    IF var_temp_l REGEXP var_pattern THEN
      WHILE (CHAR_LENGTH(var_temp_l) >= var_right_trim_index) DO
        SET var_temp_r = RIGHT(var_temp_l, var_right_trim_index);
        IF var_temp_r REGEXP var_pattern THEN
          RETURN var_temp_r;
          END IF;
        SET var_right_trim_index = var_right_trim_index + 1;
        END WHILE;
      END IF;
    SET var_left_trim_index = var_left_trim_index + 1;
    END WHILE;
  RETURN NULL;
END $$
DELIMITER ;

我认为有一个简单的方法来实现这一点,这对我来说很有效。

使用REGEX选择行

SELECT * FROM `table_name` WHERE `column_name_to_find` REGEXP 'string-to-find'

使用REGEX更新行

UPDATE `table_name` SET column_name_to_find=REGEXP_REPLACE(column_name_to_find, 'string-to-find', 'string-to-replace') WHERE column_name_to_find REGEXP 'string-to-find'

REGEXP参考: https://www.geeksforgeeks.org/mysql-regular-expressions-regexp/


是的,你可以。

UPDATE table_name 
  SET column_name = 'seach_str_name'
  WHERE column_name REGEXP '[^a-zA-Z0-9()_ .\-]';