What is the recommended way to escape HTML symbols in plain Java?

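Is there a recommended way to escape the <, >, " and & characters when outputting HTML from plain Java code? (Other than doing the following manually, that is.)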

String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
// '&' has to be replaced first, otherwise the '&' of "&lt;" gets escaped again
String escaped = source.replace("&", "&amp;").replace("<", "&lt;"); // ...

Current answer

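Most libraries escape everything they can, including hundreds of symbols and thousands of non-ASCII characters, which is not what you want in a UTF-8 world.

Also, as Jeff Williams noted, there is no single "escape HTML" option; there are several contexts.

Assuming you never use unquoted attributes, and keeping in mind that different contexts exist, I have written my own version: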

import java.io.IOException;
import java.io.UncheckedIOException;

private static final long TEXT_ESCAPE =
        1L << '&' | 1L << '<';
private static final long DOUBLE_QUOTED_ATTR_ESCAPE =
        TEXT_ESCAPE | 1L << '"';
private static final long SINGLE_QUOTED_ATTR_ESCAPE =
        TEXT_ESCAPE | 1L << '\'';
private static final long ESCAPES =
        DOUBLE_QUOTED_ATTR_ESCAPE | SINGLE_QUOTED_ATTR_ESCAPE;

// 'quot' and 'apos' are 1 char longer than '#34' and '#39'
// which I've decided to use
private static final String REPLACEMENTS = "&#34;&amp;&#39;&lt;";
private static final int REPL_SLICES = /*  [0,   5,   10,  15, 19) */
        5<<5 | 10<<10 | 15<<15 | 19<<20;
// These 5-bit numbers packed into a single int
// are indices within REPLACEMENTS which is a 'flat' String[]

private static void appendEscaped(
        Appendable builder, CharSequence content, long escapes) {
    try {
        int startIdx = 0, len = content.length();
        for (int i = 0; i < len; i++) {
            char c = content.charAt(i);
            long one;
            if (((c & 63) == c) && ((one = 1L << c) & escapes) != 0) {
            // -^^^^^^^^^^^^^^^   -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            // |                  | take only dangerous characters
            // | Java uses only the 6 least significant bits of a long shift
            // | distance, e.g. << 0b110111111 is the same as << 0b111111,
            // | so filter out characters above 63 first

                int index = Long.bitCount(ESCAPES & (one - 1));
                builder.append(content, startIdx, i /* exclusive */).append(
                        REPLACEMENTS,
                        REPL_SLICES >>> (5 * index) & 31,
                        REPL_SLICES >>> (5 * (index + 1)) & 31
                );
                startIdx = i + 1;
            }
        }
        builder.append(content, startIdx, len);
    } catch (IOException e) {
        // typically, our Appendable is StringBuilder which does not throw;
        // also, there's no way to declare 'if A#append() throws E,
        // then appendEscaped() throws E, too'
        throw new UncheckedIOException(e);
    }
}

Consider copy-pasting it from the Gist, which has no line length limit.
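
The packed lookup in appendEscaped can be hard to follow, so here is a small standalone sketch of just that step. The class name EscapeIndexDemo and the helper replacementFor are made up for illustration; the constants are copied from the code above. The idea is that the rank of a character among the set bits of ESCAPES selects its slice of REPLACEMENTS.

```java
final class EscapeIndexDemo {
    // One bit per character that may need escaping (all are below 64).
    static final long ESCAPES = 1L << '"' | 1L << '&' | 1L << '\'' | 1L << '<';
    static final String REPLACEMENTS = "&#34;&amp;&#39;&lt;";
    // Slice boundaries 0, 5, 10, 15, 19 packed as consecutive 5-bit fields.
    static final int REPL_SLICES = 5 << 5 | 10 << 10 | 15 << 15 | 19 << 20;

    /** Returns the replacement for c, or null if c is not in the mask. */
    static String replacementFor(char c) {
        if (c >= 64 || (1L << c & ESCAPES) == 0) {
            return null;
        }
        // Rank of c among escaped characters: count the mask bits below it.
        int index = Long.bitCount(ESCAPES & ((1L << c) - 1));
        int from = REPL_SLICES >>> (5 * index) & 31;
        int to = REPL_SLICES >>> (5 * (index + 1)) & 31;
        return REPLACEMENTS.substring(from, to);
    }

    public static void main(String[] args) {
        for (char c : new char[] {'"', '&', '\'', '<', '>'}) {
            System.out.println(c + " -> " + replacementFor(c));
        }
    }
}
```

Because the escaped characters sort as " (34), & (38), ' (39), < (60), their ranks are 0 to 3, which is why REPLACEMENTS lists the entities in that order.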

UPD: As another answer suggests, escaping > is not necessary; likewise, " within attr='…' is allowed. I have updated the code accordingly.

You can check it yourself:

<!DOCTYPE html>
<html lang="en">
<head><title>Test</title></head>
<body>

<p title="&lt;&#34;I'm double-quoted!&#34;>">&lt;"Hello!"></p>
<p title='&lt;"I&#39;m single-quoted!">'>&lt;"Goodbye!"></p>

</body>
</html>
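
For readers who want the same per-context rules without the bit tricks, a plainer sketch might look like the following. It is not the code from this answer: the class HtmlEscapeSketch, the Context enum and the escape method are hypothetical names, and it trades speed for readability. Like the answer, it assumes attributes are always quoted.

```java
final class HtmlEscapeSketch {
    /** The three contexts the answer distinguishes. */
    enum Context { TEXT, DOUBLE_QUOTED_ATTR, SINGLE_QUOTED_ATTR }

    static String escape(CharSequence content, Context context) {
        StringBuilder out = new StringBuilder(content.length() + 16);
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (c == '&') {
                out.append("&amp;");
            } else if (c == '<') {
                out.append("&lt;");
            } else if (c == '"' && context == Context.DOUBLE_QUOTED_ATTR) {
                // Only the quote that delimits the attribute needs escaping.
                out.append("&#34;");
            } else if (c == '\'' && context == Context.SINGLE_QUOTED_ATTR) {
                out.append("&#39;");
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String title = "<\"I'm double-quoted!\">";
        System.out.println("<p title=\""
                + escape(title, Context.DOUBLE_QUOTED_ATTR) + "\">"
                + escape("<\"Hello!\">", Context.TEXT) + "</p>");
    }
}
```

Running main should reproduce the first paragraph of the test page above.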

2020-04-14 19:45:12

Other answers

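There is a newer version of the Apache Commons Lang library, and it uses a different package name (org.apache.commons.lang3). StringEscapeUtils now has different static methods for escaping different types of documents (http://commons.apache.org/proper/commons-lang/javadocs/api-3.0/index.html). So, to escape an HTML 4.0 string: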

import static org.apache.commons.lang3.StringEscapeUtils.escapeHtml4;

String output = escapeHtml4("The less than sign (<) and ampersand (&) must be escaped before using them in HTML");

2011-07-19 14:58:06

An alternative to Apache Commons: use Spring's HtmlUtils.htmlEscape(String input) method.

2009-08-12 10:22:49

StringEscapeUtils from Apache Commons Lang:

import static org.apache.commons.lang.StringEscapeUtils.escapeHtml;
// ...
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = escapeHtml(source);

For version 3:

import static org.apache.commons.lang3.StringEscapeUtils.escapeHtml4;
// ...
String escaped = escapeHtml4(source);

2009-08-12 10:00:06

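While @dfa's answer of org.apache.commons.lang.StringEscapeUtils.escapeHtml is nice and I have used it in the past, it should not be used for escaping HTML (or XML) attributes; otherwise the whitespace will be normalized (meaning all adjacent whitespace characters become a single space).

I know this because I have had bugs filed against my library (JATL) for attributes where whitespace was not preserved. So I have a drop-in (copy n' paste) class (part of which I stole from JDOM) that differentiates between escaping attributes and escaping element content.

While this (proper attribute escaping) may not have mattered as much in the past, it is becoming increasingly interesting given the use of HTML5's data- attributes.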
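
To illustrate the kind of distinction meant here, the following is a minimal sketch of an attribute escaper, not the actual JATL or JDOM class. It assumes a consumer that applies XML-style attribute value normalization, where literal tabs and line breaks are turned into spaces but numeric character references are kept as the characters they denote. The names AttributeEscapeSketch and escapeAttribute are hypothetical.

```java
final class AttributeEscapeSketch {
    /** Escapes a value for use inside a double-quoted attribute. */
    static String escapeAttribute(CharSequence value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;"); break;
                case '<':  out.append("&lt;");  break;
                case '"':  out.append("&#34;"); break;
                // Assumption: writing these as character references keeps
                // a normalizing parser from replacing them with spaces.
                case '\t': out.append("&#9;");  break;
                case '\n': out.append("&#10;"); break;
                case '\r': out.append("&#13;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeAttribute("line one\nline\ttwo"));
    }
}
```

Element content would use a separate method that leaves tabs and line breaks untouched, since they are not normalized there.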

2013-08-07 20:26:10