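What is the best way to do a case-insensitive string comparison in C++ without converting the strings to all uppercase or all lowercase?
Please indicate whether the methods are Unicode-friendly and how portable they are.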
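Current answer
FYI, strcmp() and stricmp() are vulnerable to buffer overruns: they keep reading until they hit a null terminator, so a string that is not null-terminated makes them read past the end of its buffer. The length-bounded strncmp() and _strnicmp() (strncasecmp() on POSIX systems) are safer. Note that stricmp() and _strnicmp() are not part of standard C or C++.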
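For illustration, here is a minimal sketch of a length-bounded, case-insensitive comparison written with only the standard library, so it does not depend on _strnicmp() or strncasecmp() being available. The name bounded_iequals is made up for this example; it relies on std::tolower, so it only folds single-byte characters according to the current C locale and is not Unicode-aware.

```cpp
#include <cctype>
#include <cstddef>

// Hypothetical helper (not a standard function): compares at most n
// characters of a and b, ignoring case as defined by the current C locale.
// Like strncmp, it stops early at a null terminator, but it never reads
// more than n characters from either buffer.
bool bounded_iequals(const char* a, const char* b, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        // Convert to unsigned char first: passing a negative char value
        // to std::tolower is undefined behavior.
        const unsigned char ca = static_cast<unsigned char>(a[i]);
        const unsigned char cb = static_cast<unsigned char>(b[i]);
        if (std::tolower(ca) != std::tolower(cb))
            return false;
        if (ca == '\0')   // both strings ended at the same position
            return true;
    }
    return true;          // the first n characters matched
}
```

The caller passes the size of the smaller buffer as n, which is what keeps the comparison inside both buffers even when neither is null-terminated.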
Other answers
If you have to compare one source string against other strings repeatedly, an elegant solution is to use a regular expression.
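#include <regex>
#include <string>
std::wstring first = L"Test";
std::wstring second = L"TEST";
// Note: regex metacharacters in `first` (such as '.' or '*') are treated as pattern syntax, not literal text.
std::wregex pattern(first, std::wregex::icase);
bool isEqual = std::regex_match(second, pattern);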
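One caveat with the regex approach is that the source string is interpreted as a pattern. The sketch below, under the assumption that the default ECMAScript grammar is used, escapes metacharacters first and builds the pattern once so it can be reused for many comparisons. The helper names escape_regex and make_icase_pattern are invented for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: backslash-escapes ECMAScript regex metacharacters
// so that the text is matched literally.
std::wstring escape_regex(const std::wstring& text)
{
    static const std::wregex special(LR"([.^$|()\[\]{}*+?\\])");
    // "$&" in the replacement stands for the matched metacharacter.
    return std::regex_replace(text, special, std::wstring(L"\\$&"));
}

// Hypothetical helper: builds a case-insensitive pattern once; the result
// can then be passed to std::regex_match for each candidate string.
std::wregex make_icase_pattern(const std::wstring& text)
{
    return std::wregex(escape_regex(text), std::regex_constants::icase);
}
```

Case folding here is whatever the regex traits of the implementation provide, so treat it as reliable for ASCII letters and verify anything beyond that on your platform.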
If you don't want to use the Boost library, here is a solution that uses only standard C++ headers.
#include <algorithm>
#include <cctype>
#include <iostream>
#include <string>

struct iequal
{
    // Taking unsigned char avoids undefined behavior in std::toupper
    // when plain char is signed and the value is negative.
    bool operator()(unsigned char c1, unsigned char c2) const
    {
        // case-insensitive comparison of two characters
        return std::toupper(c1) == std::toupper(c2);
    }
};

bool iequals(const std::string& str1, const std::string& str2)
{
    // The three-iterator std::equal does not check the length of the
    // second range, so compare the sizes first.
    return str1.size() == str2.size()
        && std::equal(str1.begin(), str1.end(), str2.begin(), iequal());
}

int main()
{
    std::string str_1 = "HELLO";
    std::string str_2 = "hello";
    if (iequals(str_1, str_2))
    {
        std::cout << "Strings are equal" << std::endl;
    }
    else
    {
        std::cout << "Strings are not equal" << std::endl;
    }
    return 0;
}
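If an ordering is needed rather than just equality (for example as the comparator of a std::map), the same per-character idea can be sketched with std::lexicographical_compare. The name iless is made up for this example, and like the code above it only handles single-byte characters in the current C locale.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical comparator: orders strings while ignoring the case of
// single-byte characters. Not Unicode-aware.
struct iless
{
    bool operator()(const std::string& a, const std::string& b) const
    {
        return std::lexicographical_compare(
            a.begin(), a.end(), b.begin(), b.end(),
            // unsigned char parameters keep std::toupper well defined
            [](unsigned char x, unsigned char y) {
                return std::toupper(x) < std::toupper(y);
            });
    }
};
```

With this comparator, std::map<std::string, int, iless> would treat "Key" and "KEY" as the same key.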
bool insensitive_c_compare(char A, char B) {
    /// the offset from a lowercase ASCII letter to its uppercase form
    const char lo2up = 'A' - 'a';
    /// only fold characters that are in fact lowercase letters;
    /// digits and punctuation are compared unchanged
    /// (trying to turn a 3 into an E would not be pretty!)
    if ('a' <= A && A <= 'z')
        A = static_cast<char>(A + lo2up); /// convert lowercase to uppercase
    if ('a' <= B && B <= 'z')
        B = static_cast<char>(B + lo2up);
    return A == B;
}
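This could probably be made much more efficient, but here is a bulky version with all its bits laid bare.
It is not all that portable (it assumes an ASCII-compatible character set), but it works well with whatever is on my computer (no idea, I am a pictures person, not a words person).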
boost::iequals is not UTF-8 compatible when used with std::string. You can use boost::locale instead.
comparator<char,collator_base::secondary> cmpr;
cout << (cmpr(str1, str2) ? "str1 < str2" : "str1 >= str2") << endl;
Primary -- ignore accents and character case, comparing base letters only. For example, "facade" and "Façade" are the same.
Secondary -- ignore character case but consider accents. "facade" and "façade" are different, but "Façade" and "façade" are the same.
Tertiary -- consider both case and accents: "Façade" and "façade" are different. Punctuation is ignored.
Quaternary -- consider all case, accents, and punctuation. The words must be identical in terms of Unicode representation.
Identical -- as quaternary, but compare code points as well.
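Whichever method you end up choosing, if it happens to involve strcmp, as some answers suggest, note the following:
strcmp does not handle Unicode data in general. It does not even work reliably with byte-oriented Unicode encodings such as UTF-8, because strcmp only compares byte by byte, while a Unicode code point encoded in UTF-8 can take more than one byte. The only Unicode case it handles properly is a string in a byte-oriented encoding that contains only code points that fit in a single byte (below U+0080 in UTF-8, i.e. plain ASCII); only then is a byte-by-byte comparison sufficient.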
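To make the multi-byte point concrete, here is a small sketch showing why byte-wise case folding cannot match "É" with "é" in UTF-8. The helpers ascii_lower and bytewise_iequals are invented for this example, and the two letters are written as explicit UTF-8 bytes so the result does not depend on the encoding of the source file.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: folds only ASCII 'A'..'Z' to lowercase and leaves
// every other byte unchanged.
char ascii_lower(char c)
{
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Hypothetical helper: byte-wise, ASCII-only case-insensitive equality.
bool bytewise_iequals(const std::string& a, const std::string& b)
{
    if (a.size() != b.size())
        return false;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (ascii_lower(a[i]) != ascii_lower(b[i]))
            return false;
    return true;
}

// Each of these is a single code point that occupies two bytes in UTF-8.
const std::string upper_e_acute = "\xC3\x89"; // U+00C9, 'É'
const std::string lower_e_acute = "\xC3\xA9"; // U+00E9, 'é'
```

Both strings have size() == 2, and bytewise_iequals(upper_e_acute, lower_e_acute) returns false because the second bytes differ and neither is an ASCII letter. Matching them requires decoding the code points and applying Unicode case folding, which is what libraries such as Boost.Locale or ICU provide.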