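I'm reading lots of texts from various RSS feeds and inserting them into my database.

Of course, the feeds use several different character encodings, e.g. UTF-8 and ISO 8859-1.

Unfortunately, there are sometimes problems with the encodings of the texts. Example: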

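1. The "ß" in "Fußball" should look like this in my database: "Ÿ". If it is a "Ÿ", it is displayed correctly.
2. Sometimes the "ß" in "Fußball" looks like this in my database: "ß". Then it is displayed incorrectly, of course.
3. In other cases, the "ß" is saved as a "ß", so without any change. Then it is also displayed incorrectly.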

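What can I do to avoid cases 2 and 3?

How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it's clear what the effect is, but when must I use the functions?), and when must I do nothing with the input?

How do I make everything the same encoding? Perhaps with the function mb_detect_encoding()? Can I write a function for this? So my problems are:

1. How do I find out what encoding the text uses?
2. How do I convert it to UTF-8, whatever the old encoding is?

Would a function like this work?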

function correct_encoding($text) {
    $current_encoding = mb_detect_encoding($text, 'auto');
    $text = iconv($current_encoding, 'UTF-8', $text);
    return $text;
}

I've tested it, but it doesn't work. What's wrong with it?


Current answer

Try it without 'auto'.

That is:

mb_detect_encoding($text)

instead of:

mb_detect_encoding($text, 'auto')

More information can be found here: mb_detect_encoding

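Other answers

I know this is an older question, but I figure a useful answer never hurts. I was having issues with encodings between a desktop application, SQLite, and GET/POST variables. Some would be in UTF-8, some would be in ASCII, and basically everything got screwed up as soon as foreign characters were involved.

Here is my solution. It scrubs your GET/POST/REQUEST variables (I omitted cookies, but you could add them if desired) on each page load before processing. It works well in a header. PHP will throw warnings if it can't detect the source encoding automatically, so these warnings are suppressed with @'s.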

//Convert everything in our vars to UTF-8 for playing nice with the database...
//Use some auto detection here to help us not double-encode...
//Suppress possible warnings with @'s for when encoding cannot be detected
try
{
    $process = array(&$_GET, &$_POST, &$_REQUEST);
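    // each() was removed in PHP 8; an index loop also reaches the entries appended below
    for ($key = 0; $key < count($process); $key++) {
        $val = $process[$key];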
        foreach ($val as $k => $v) {
            unset($process[$key][$k]);
            if (is_array($v)) {
                $process[$key][@mb_convert_encoding($k,'UTF-8','auto')] = $v;
                $process[] = &$process[$key][@mb_convert_encoding($k,'UTF-8','auto')];
            } else {
                $process[$key][@mb_convert_encoding($k,'UTF-8','auto')] = @mb_convert_encoding($v,'UTF-8','auto');
            }
        }
    }
    unset($process);
}
catch(Exception $ex){}

Detecting encoding is hard.

mb_detect_encoding works by guessing, based on a number of candidates that you pass it. In some encodings, certain byte sequences are invalid, and therefore it can distinguish between various candidates. Unfortunately, there are a lot of encodings where the same bytes are valid (but mean different things). In these cases, there is no way to determine the encoding; you can implement your own logic to make guesses. For example, data coming from a Japanese site is more likely to have a Japanese encoding.

As long as you only deal with Western European languages, the three major encodings to consider are utf-8, iso-8859-1 and cp-1252. Since these are defaults for many platforms, they are also the most likely to be reported incorrectly. For example, people who use other encodings tend to declare them honestly, since otherwise their software would break very often. Therefore, a good strategy is to trust the provider unless the encoding is reported as one of those three. You should still double-check that the input is indeed valid, using mb_check_encoding (note that valid is not the same as being correct; the same input may be valid for many encodings). If it is one of those three, you can then use mb_detect_encoding to distinguish between them. Luckily that is fairly deterministic; you just need to use the proper detect sequence, which is UTF-8,ISO-8859-1,WINDOWS-1252.
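The strategy above could be sketched roughly as follows. This is only an illustration under a few assumptions: pickSourceEncoding is a made-up name, $declared stands for whatever encoding the feed claims (or null), and the mbstring extension is loaded. Note that for the three common encodings this sketch swaps mb_detect_encoding for a plain UTF-8 validity check with Windows-1252 as the fallback, purely to keep the result deterministic.

```php
<?php
// Hypothetical helper: choose the encoding to convert FROM.
// $declared is whatever the feed claims (HTTP header, XML prolog), or null.
function pickSourceEncoding(string $bytes, ?string $declared): string
{
    // The three "default" encodings whose declarations are least trustworthy.
    $ambiguous = ['UTF-8', 'ISO-8859-1', 'WINDOWS-1252', 'CP1252'];

    if ($declared !== null && !in_array(strtoupper($declared), $ambiguous, true)) {
        try {
            // Trust an unusual declaration only if the bytes are valid in it.
            if (mb_check_encoding($bytes, $declared)) {
                return $declared;
            }
        } catch (\ValueError $e) {
            // PHP 8 throws for encoding names mbstring does not know.
        }
    }

    // Assumed fallback for Western European text: valid UTF-8 wins,
    // anything else is treated as Windows-1252.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }
    return 'Windows-1252';
}

// A feed that claims ISO-8859-1 but actually sends UTF-8 bytes:
echo pickSourceEncoding("Fu\xC3\x9Fball", 'ISO-8859-1'), "\n";
```

The returned name is meant to be passed on as the source encoding of a conversion; it is still a guess, just a more disciplined one.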

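Once you've detected the encoding, you need to convert the text to your internal representation (UTF-8 is the only sane choice). The function utf8_encode transforms ISO-8859-1 to UTF-8, so it can only be used for that particular input type. For other encodings, use mb_convert_encoding.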
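A small sketch of that conversion step, assuming the mbstring extension is available and that the source encoding has already been determined (the variable names are just for illustration):

```php
<?php
// "Fußball" as it would arrive from an ISO-8859-1 feed: ß is the single byte 0xDF.
$latin1 = "Fu\xDFball";

// Name the source encoding explicitly instead of letting mbstring guess.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// In UTF-8 the ß becomes the two bytes 0xC3 0x9F.
var_dump($utf8 === "Fu\xC3\x9Fball");

// The same call handles encodings utf8_encode() cannot, e.g. Windows-1252,
// where the byte 0x80 is the euro sign (E2 82 AC in UTF-8).
$euro = mb_convert_encoding("\x80", 'UTF-8', 'Windows-1252');
var_dump($euro === "\xE2\x82\xAC");
```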

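This cheatsheet lists some common caveats related to UTF-8 handling in PHP: http://developer.loftdigital.com/blog/php-utf-8-cheatsheet

This function for detecting multibyte characters in a string might also prove helpful (source):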

function detectUTF8($string)
{
    // The /x flag makes "#" start a comment that runs to the end of the line,
    // so each alternative has to stay on its own line.
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}         # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )+%xs', $string);
}

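If you apply utf8_encode() to a string that is already UTF-8, it will return garbled UTF-8 output.

I made a function that addresses all these issues. It's called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can contain a mix of them. Encoding::toUTF8() will convert everything to UTF-8.

I did it because a service was giving me a data feed that was all messed up, mixing UTF-8 and Latin1 in the same string.

Usage: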
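To make the double-encoding effect concrete, here is a minimal sketch. It uses mb_convert_encoding from ISO-8859-1 to UTF-8 as a stand-in for utf8_encode() (the same conversion, and utf8_encode() itself is deprecated as of PHP 8.2); mbstring is assumed to be loaded.

```php
<?php
// "ß" correctly encoded as UTF-8 is the two bytes C3 9F.
$utf8 = "\xC3\x9F";

// ISO-8859-1 -> UTF-8, applied by mistake to bytes that are already UTF-8:
// each of the two bytes is re-encoded as a character of its own.
$twice = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($twice), "\n"; // c383c29f

// Converting back once (the direction utf8_decode() works in) removes one layer.
$fixed = mb_convert_encoding($twice, 'ISO-8859-1', 'UTF-8');
echo bin2hex($fixed), "\n"; // c39f
```

The four bytes c3 83 c2 9f are what ends up displayed as two junk characters instead of one "ß".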

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

Download:

https://github.com/neitanod/forceutf8

I've also included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled.

Usage:

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football

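I've turned the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().

Try using this… every text that is not UTF-8 will be converted.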

function is_utf8($str) {
    return (bool) preg_match('//u', $str);
}

$myString = "Fußball";

if(!is_utf8($myString)){
    $myString = utf8_encode($myString);
}

// or 1 line version ;) 
$myString = !is_utf8($myString) ? utf8_encode($myString) : trim($myString);
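Since utf8_encode() is deprecated as of PHP 8.2, the same idea can be written with mbstring instead. The following is only a sketch: ensureUtf8 is a hypothetical name, and ISO-8859-1 as the fallback source encoding is an assumption you would adjust to your feeds.

```php
<?php
// Hypothetical helper: return $text unchanged if it is already valid UTF-8,
// otherwise convert it from $fallback (an assumed single-byte source encoding).
function ensureUtf8(string $text, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

$fromLatin1 = ensureUtf8("Fu\xDFball");     // ISO-8859-1 input, gets converted
$unchanged  = ensureUtf8("Fu\xC3\x9Fball"); // already UTF-8, left as is
var_dump($fromLatin1 === $unchanged);
```

Because of the validity check, calling the helper twice on the same string does not double-encode it. It can still guess wrong for a Latin-1 string whose bytes happen to form valid UTF-8, which is rare in natural text but possible.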