在纯Java代码中输出HTML时,是否有一种推荐的方法来转义<,>,"和&字符?(除了手动执行以下操作之外)。

String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = source.replace("<", "&lt;").replace("&", "&amp;"); // ...

当前回答

大多数库都提供转义,包括数百个符号和数千个非ascii字符,这在UTF-8世界中不是你想要的。

而且,正如Jeff Williams所指出的,没有单一的“转义HTML”选项,有几个上下文。

假设你从未使用过不带引号的属性,并记住存在不同的上下文,它写了我自己的版本:

private static final long TEXT_ESCAPE =
        1L << '&' | 1L << '<';
private static final long DOUBLE_QUOTED_ATTR_ESCAPE =
        TEXT_ESCAPE | 1L << '"';
private static final long SINGLE_QUOTED_ATTR_ESCAPE =
        TEXT_ESCAPE | 1L << '\'';
private static final long ESCAPES =
        DOUBLE_QUOTED_ATTR_ESCAPE | SINGLE_QUOTED_ATTR_ESCAPE;

// 'quot' and 'apos' are 1 char longer than '#34' and '#39'
// which I've decided to use
private static final String REPLACEMENTS = "&#34;&amp;&#39;&lt;";
private static final int REPL_SLICES = /*  [0,   5,   10,  15, 19) */
        5<<5 | 10<<10 | 15<<15 | 19<<20;
// These 5-bit numbers packed into a single int
// are indices within REPLACEMENTS which is a 'flat' String[]

private static void appendEscaped(
        Appendable builder, CharSequence content, long escapes) {
    try {
        int startIdx = 0, len = content.length();
        for (int i = 0; i < len; i++) {
            char c = content.charAt(i);
            long one;
            if (((c & 63) == c) && ((one = 1L << c) & escapes) != 0) {
            // -^^^^^^^^^^^^^^^   -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            // |                  | take only dangerous characters
            // | java shifts longs by 6 least significant bits,
            // | e. g. << 0b110111111 is same as >> 0b111111.
            // | Filter out bigger characters

                int index = Long.bitCount(ESCAPES & (one - 1));
                builder.append(content, startIdx, i /* exclusive */).append(
                        REPLACEMENTS,
                        REPL_SLICES >>> (5 * index) & 31,
                        REPL_SLICES >>> (5 * (index + 1)) & 31
                );
                startIdx = i + 1;
            }
        }
        builder.append(content, startIdx, len);
    } catch (IOException e) {
        // typically, our Appendable is StringBuilder which does not throw;
        // also, there's no way to declare 'if A#append() throws E,
        // then appendEscaped() throws E, too'
        throw new UncheckedIOException(e);
    }
}

考虑从Gist复制粘贴,没有行长限制。

UPD:正如另一个答案所暗示的,>转义是不必要的;同样,“within attr='…'也是允许的。我已经相应地更新了代码。

你可以自己去看看:

<!DOCTYPE html>
<html lang="en">
<head><title>Test</title></head>
<body>

<p title="&lt;&#34;I'm double-quoted!&#34;>">&lt;"Hello!"></p>
<p title='&lt;"I&#39;m single-quoted!">'>&lt;"Goodbye!"></p>

</body>
</html>

其他回答

虽然@dfa答案的org.apache.commons.lang.StringEscapeUtils.escapeHtml是很好的,我过去使用过它,它不应该用于转义HTML(或XML)属性,否则空白将被规范化(意味着所有相邻的空白字符成为一个单独的空格)。

我知道这一点,因为我的库(JATL)中有一些没有保留空白的属性的bug。因此,我有一个drop in (copy n’paste)类(其中一些是从JDOM中偷来的)来区分属性和元素内容的转义。

虽然这在过去可能没有那么重要(适当的属性转义),但考虑到HTML5的数据属性使用,它变得越来越有趣。

StringEscapeUtils from Apache Commons Lang:

import static org.apache.commons.lang.StringEscapeUtils.escapeHtml;
// ...
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = escapeHtml(source);

版本3:

import static org.apache.commons.lang3.StringEscapeUtils.escapeHtml4;
// ...
String escaped = escapeHtml4(source);

Java 8+解决方案:

public static String escapeHTML(String str) {
    return str.chars().mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1 ?
       "&#" + c + ";" : String.valueOf((char) c)).collect(Collectors.joining());
}

String#chars返回String中的char值的IntStream。然后,我们可以使用mapToObj来转义字符代码大于127的字符(非ascii字符)以及双引号(")、单引号(')、左尖括号(<)、右尖括号(>)和&号(&)。收藏家。join将字符串连接在一起。

为了更好地处理Unicode字符,可以使用String#codePoints代替。

public static String escapeHTML(String str) {
    return str.codePoints().mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1 ?
            "&#" + c + ";" : new String(Character.toChars(c)))
       .collect(Collectors.joining());
}

Be careful with this. There are a number of different 'contexts' within an HTML document: Inside an element, quoted attribute value, unquoted attribute value, URL attribute, javascript, CSS, etc... You'll need to use a different encoding method for each of these to prevent Cross-Site Scripting (XSS). Check the OWASP XSS Prevention Cheat Sheet for details on each of these contexts. You can find escaping methods for each of these contexts in the OWASP ESAPI library -- https://github.com/ESAPI/esapi-java-legacy.

Apache Commons的替代方案:使用Spring的htmltils。htmlEscape(字符串输入)方法。