我正在使用一些包含字符串的XML:
<node>This is a string</node>
我传递给节点的一些字符串将有&,#,$等字符:
<node>This is a string & so is this</node>
由于&,这是无效的。
我不能在CDATA中包装这些字符串,因为它们需要这样。我尝试寻找一个字符列表,这些字符不能放在XML节点中,而不能放在CDATA中。
有人能给我指个方向或者给我一份非法字符的列表吗?
我正在使用一些包含字符串的XML:
<node>This is a string</node>
我传递给节点的一些字符串将有&,#,$等字符:
<node>This is a string & so is this</node>
由于&,这是无效的。
我不能在CDATA中包装这些字符串,因为它们需要这样。我尝试寻找一个字符列表,这些字符不能放在XML节点中,而不能放在CDATA中。
有人能给我指个方向或者给我一份非法字符的列表吗?
当前回答
在Woodstox XML处理器中,无效字符由以下代码分类:
if (c == 0) {
throw new IOException("Invalid null character in text to output");
}
if (c < ' ' || (c >= 0x7F && c <= 0x9F)) {
String msg = "Invalid white space character (0x" + Integer.toHexString(c) + ") in text to output";
if (mXml11) {
msg += " (can only be output using character entity)";
}
throw new IOException(msg);
}
if (c > 0x10FFFF) {
throw new IOException("Illegal unicode character point (0x" + Integer.toHexString(c) + ") to output; max is 0x10FFFF as per RFC");
}
/*
* Surrogate pair in non-quotable (not text or attribute value) content, and non-unicode encoding (ISO-8859-x,
* Ascii)?
*/
if (c >= SURR1_FIRST && c <= SURR2_LAST) {
throw new IOException("Illegal surrogate pair -- can only be output via character entities, which are not allowed in this content");
}
throw new IOException("Invalid XML character (0x"+Integer.toHexString(c)+") in text to output");
来自这里
其他回答
这是一个c#代码,用于从字符串中删除XML无效字符并返回一个新的有效字符串。
public static string CleanInvalidXmlChars(string text)
{
// From xml spec valid chars:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
// any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
string re = @"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u10000-\u10FFFF]";
return Regex.Replace(text, re, "");
}
ampersand (&) is escaped to &
double quotes (") are escaped to "
single quotes (') are escaped to '
less than (<) is escaped to <
greater than (>) is escaped to >
在c#中,使用system . security . securyelement.escape或System.Net.WebUtility.HtmlEncode来转义这些非法字符。
string xml = "<node>it's my \"node\" & i like it 0x12 x09 x0A 0x09 0x0A <node>";
string encodedXml1 = System.Security.SecurityElement.Escape(xml);
string encodedXml2= System.Net.WebUtility.HtmlEncode(xml);
encodedXml1
"<node>it's my "node" & i like it 0x12 x09 x0A 0x09 0x0A <node>"
encodedXml2
"<node>it's my "node" & i like it 0x12 x09 x0A 0x09 0x0A <node>"
“XmlWriter和低ASCII字符”对我很有用
string code = Regex.Replace(item.Code, @"[\u0000-\u0008,\u000B,\u000C,\u000E-\u001F]", "");
有效字符的列表在XML规范中:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
另一个简单的方法是在c#中转义可能不需要的XML / XHTML字符:
WebUtility.HtmlEncode(stringWithStrangeChars)