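Which characters must be escaped in an XML document, or where can I find such a list?
Current answer
A new, simplified answer to an old question...
Simplified XML escaping (prioritized, 100% complete)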
Always (90% important to remember)
- Escape < as &lt; unless < is starting a <tag/> or other markup.
- Escape & as &amp; unless & is starting an &entity;.

Attribute Values (9% important to remember)
- attr=" 'Single quotes' are ok within double quotes."
- attr=' "Double quotes" are ok within single quotes.'
- Escape " as &quot; and ' as &apos; otherwise.

Comments, CDATA, and Processing Instructions (0.9% important to remember)
- <!-- Within comments --> nothing has to be escaped, but no -- strings are allowed.
- <![CDATA[ Within CDATA ]]> nothing has to be escaped, but no ]]> strings are allowed.
- <?PITarget Within PIs ?> nothing has to be escaped, but no ?> strings are allowed.

Esoterica (0.1% important to remember)
- Escape control codes in XML 1.1 via Base64 or Numeric Character References.
- Escape ]]> as ]]&gt; unless ]]> is ending a CDATA section. (This rule applies to character data in general, even outside a CDATA section.)
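As a rough illustration of the first two groups of rules, here is a minimal Python sketch built on the standard library's `xml.sax.saxutils`. The helper name `make_element` and the sample strings are made up for this example. Note that `escape()` also replaces `>`, which the rules above do not strictly require but which is harmless.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def make_element(tag: str, attr_name: str, attr_value: str, text: str) -> str:
    """Build <tag attr=...>text</tag> with escaped attribute value and text.

    Hypothetical helper for illustration only; tag and attr_name are
    assumed to be valid XML names already.
    """
    # escape() replaces &, < and > in character data.
    # quoteattr() escapes the value and chooses a quote style that avoids
    # escaping quotes when only one kind of quote is present.
    return f"<{tag} {attr_name}={quoteattr(attr_value)}>{escape(text)}</{tag}>"


xml_text = make_element("note", "title", 'He said "hi" & left', "1 < 2 && 3 > 2")
print(xml_text)

# Round trip: a parser should give back the original strings.
parsed = ET.fromstring(xml_text)
print(parsed.get("title"))
print(parsed.text)
```

Because the attribute value contains double quotes but no single quotes, `quoteattr()` wraps it in single quotes instead of escaping the double quotes, which matches the "quotes are ok within the other kind of quotes" rule.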
Other answers
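It depends on the context. For content, it is < and &, and also ]]> (although that is a three-character string rather than a single character).
For attribute values, it is <, &, " and '.
For CDATA, it is ]]>.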
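For the CDATA case, one common workaround (not something the answer above prescribes) is to split the text so that `]]` ends one CDATA section and `>` begins the next. The sketch below assumes that approach; `to_cdata` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting any "]]>" across two sections."""
    # "]]>" becomes "]]" + end of section + start of new section + ">".
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


raw = "if (a[b[0]]> 1 && x < y) {}"
doc = f"<code>{to_cdata(raw)}</code>"
print(doc)

# The parser joins adjacent CDATA sections back into the original text.
recovered = ET.fromstring(doc).text
print(recovered == raw)
```

Inside the sections, `<` and `&` stay literal; only the `]]>` sequence needs this special handling.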
The accepted answer is not correct. It is best to use a library for escaping XML.
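As one example of letting a library do the escaping, the following sketch uses Python's standard `xml.etree.ElementTree` to build and serialize an element. The function name `serialize_item` and the element and attribute names are invented for illustration.

```python
import xml.etree.ElementTree as ET


def serialize_item(label: str, text: str) -> str:
    """Serialize <item label=...>text</item>, leaving escaping to ElementTree."""
    item = ET.Element("item", {"label": label})
    item.text = text
    # encoding="unicode" returns a str instead of bytes.
    return ET.tostring(item, encoding="unicode")


label = 'Fish & "Chips"'
text = "price < 10 & weight > 2"
xml_text = serialize_item(label, text)
print(xml_text)

# Parsing the result should return the unescaped originals.
parsed = ET.fromstring(xml_text)
print(parsed.get("label") == label, parsed.text == text)
```

The caller never writes an entity by hand here; the serializer decides what needs escaping in text and in attribute values.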
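As mentioned in another question:
"Basically, control characters and characters outside the Unicode ranges are not allowed. This also means that, for example, referring to such a character through a character entity is forbidden."
If you escape only these five characters, you may still run into errors such as: An invalid XML character (Unicode: 0xc) was found.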
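To avoid errors like the one above, the input can be filtered before escaping. This is a minimal sketch that assumes XML 1.0 and simply drops every character outside the specification's `Char` production; whether to drop, replace, or reject such characters is a design choice this answer does not settle. `strip_invalid_xml_chars` is a hypothetical helper name.

```python
import re

# Characters allowed by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# The pattern matches everything that is NOT in that set.
_INVALID_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Remove characters that an XML 1.0 document may not contain."""
    return _INVALID_XML10.sub("", text)


# 0x0c (form feed) is the character from the error message above.
dirty = "page1\x0cpage2\x00\ttab"
clean = strip_invalid_xml_chars(dirty)
print(repr(clean))
```

Tab, line feed, and carriage return are kept because XML 1.0 permits them; the other C0 control codes are removed.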
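According to the specifications of the World Wide Web Consortium (W3C), there are 5 characters that must not appear in their literal form in an XML document, except when used as markup delimiters or within a comment, a processing instruction, or a CDATA section. In all other cases, these characters must be replaced either by the corresponding entity or by the numeric reference according to the following table: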
Original character   XML entity replacement   XML numeric replacement
<                    &lt;                     &#60;
>                    &gt;                     &#62;
"                    &quot;                   &#34;
&                    &amp;                    &#38;
'                    &apos;                   &#39;
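Note that the entities above can also be used in HTML, with the exception of &apos;, which was introduced with XHTML 1.0 and is not declared in HTML 4. For this reason, and to ensure backward compatibility, the XHTML specification recommends using &#39; instead.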
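The five replacements can be expressed as a single translation table. The sketch below is only an illustration: `XML_ESCAPES` and `escape_all` are hypothetical names, and the apostrophe is mapped to the numeric reference `&#39;` rather than `&apos;`, following the compatibility advice above.

```python
# One-pass translation table for the five special characters.
XML_ESCAPES = str.maketrans({
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "&": "&amp;",
    "'": "&#39;",  # numeric form, since &apos; is not declared in HTML 4
})


def escape_all(text: str) -> str:
    """Escape all five characters regardless of context."""
    # str.translate() looks at each input character exactly once, so the
    # "&" in a freshly produced "&lt;" is never escaped a second time.
    return text.translate(XML_ESCAPES)


escaped = escape_all("Tom & 'Jerry' <b>")
print(escaped)
```

Escaping all five characters everywhere is more than XML strictly demands in any single context, but it yields output that is safe in both element content and attribute values.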
The characters to escape differ between tags and attributes.
Tags:
< &lt;
> &gt; (only for compatibility, read below)
& &amp;
Attributes:
" &quot;
' &apos;
From Character Data and Markup:
The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings " &amp; " and " &lt; " respectively. The right angle bracket (>) may be represented using the string " &gt; ", and must, for compatibility, be escaped using either " &gt; " or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section. To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as " &apos; ", and the double-quote character (") as " &quot; ".
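The difference between "must" and "may" in that passage can be checked against a real parser. This is a small sketch using Python's `xml.etree.ElementTree`; `is_well_formed` is a hypothetical helper, and the result reflects how that particular parser behaves rather than a formal conformance test.

```python
import xml.etree.ElementTree as ET


def is_well_formed(doc: str) -> bool:
    """Return True if ElementTree's parser accepts the document."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


samples = [
    "<a>1 > 0, \"quoted\" and 'quoted'</a>",  # literal >, " and ' in content
    "<a>1 < 2</a>",                           # literal < in content
    "<a>R&D</a>",                             # literal & in content
    "<a>1 &lt; 2 and R&amp;D</a>",            # escaped forms
]
for doc in samples:
    print(is_well_formed(doc), doc)
```

Only the documents with a bare `<` or `&` in content are rejected, which is consistent with the passage: those two must be escaped, while `>` and the quote characters merely may be.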