Remove HTML tags from a String

Is there a good way to remove HTML from a Java string? A simple regex like

replaceAll("\\<.*?>", "")

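will work, but things like &amp; won't be converted correctly, and non-HTML content between two angle brackets will be removed as well (that is, whatever the .*? in the regex matches will disappear).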
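Here is a minimal sketch of the two failure modes described in the question. The class name `NaiveStripDemo`, the method name `naiveStrip`, and the sample strings are made up for illustration; only the regex comes from the question.

```java
final class NaiveStripDemo {
    // The regex from the question: lazily matches from any '<' to the next '>'.
    static String naiveStrip(String html) {
        return html.replaceAll("\\<.*?>", "");
    }

    public static void main(String[] args) {
        // Entities are left untouched, so "&amp;" is not turned back into "&".
        System.out.println(naiveStrip("<p>Tom &amp; Jerry</p>"));
        // Text between '<' and '>' that is not a tag is removed as well.
        System.out.println(naiveStrip("if a < b and c > d"));
    }
}
```

The first call leaves the entity in place (`Tom &amp; Jerry`), and the second swallows `< b and c >` even though it is not markup.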

Current answer

To get plain text that keeps its line breaks as HTML <br/> tags, you can do this with jsoup:

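import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

String BR_ESCAPED = "&lt;br/&gt;";
// select() returns Elements (a list of matches), not a single Element
Elements el = Jsoup.parse(html).select("body");
// Append an escaped <br/> marker to each br, p and heading element
el.select("br").append(BR_ESCAPED);
el.select("p").append(BR_ESCAPED + BR_ESCAPED);
el.select("h1").append(BR_ESCAPED + BR_ESCAPED);
el.select("h2").append(BR_ESCAPED + BR_ESCAPED);
el.select("h3").append(BR_ESCAPED + BR_ESCAPED);
el.select("h4").append(BR_ESCAPED + BR_ESCAPED);
el.select("h5").append(BR_ESCAPED + BR_ESCAPED);
// text() drops the tags; the appended markers survive as text
String nodeValue = el.text();
nodeValue = nodeValue.replaceAll(BR_ESCAPED, "<br/>");
// Collapse runs of three or more breaks into two
nodeValue = nodeValue.replaceAll("(\\s*<br[^>]*>){3,}", "<br/><br/>");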

To get formatted plain text instead, change <br/> to \n and change the last line to:

nodeValue = nodeValue.replaceAll("(\\s*\n){3,}", "\n\n");

2013-04-25 16:57:13

Other answers

You can use this method to remove HTML tags from a string:

public static String stripHtmlTags(String html) {
    return html.replaceAll("<.*?>", "");
}
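One caveat worth sketching: in Java regexes `.` does not match line terminators by default, so `<.*?>` skips a tag whose attributes wrap onto a second line. The variant below uses a negated character class instead. The class and method names are hypothetical, and this is only a sketch of that one difference, not a complete HTML stripper.

```java
final class MultiLineTagDemo {
    // Same regex as the answer above: '.' stops at a line break.
    static String stripWithDot(String html) {
        return html.replaceAll("<.*?>", "");
    }

    // "[^>]" matches any character except '>', including line breaks.
    static String stripWithNegatedClass(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<a\nhref=\"x\">link</a>";
        // Leaves the opening tag in place because it spans two lines.
        System.out.println(stripWithDot(html));
        // Removes both the opening and the closing tag.
        System.out.println(stripWithNegatedClass(html));
    }
}
```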

2021-03-01 15:44:46

I think the simplest way to filter out HTML tags is:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static final Pattern REMOVE_TAGS = Pattern.compile("<.+?>");

public static String removeTags(String string) {
    if (string == null || string.length() == 0) {
        return string;
    }

    Matcher m = REMOVE_TAGS.matcher(string);
    return m.replaceAll("");
}

2010-11-04 10:13:09

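You may want to replace <br/> and </p> tags with newlines before stripping the HTML, to prevent it from becoming an illegible mess, as Tim suggests.

The only way I can think of to remove HTML tags while leaving non-HTML between angle brackets intact is to check against a list of HTML tags. Something along these lines...

replaceAll("\\<[\\s]*tag[^>]*>","")

Then HTML-decode special characters such as &amp;. The result should not be considered sanitized.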
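The steps in this answer could be sketched roughly as follows. The class name `KnownTagStripper`, the short tag list, and the handful of decoded entities are all assumptions made for illustration. A real tag list would be much longer, and the output is still not sanitized.

```java
import java.util.regex.Pattern;

final class KnownTagStripper {
    // Hypothetical, deliberately short tag list; extend it for real input.
    private static final String TAGS = "p|br|b|i|a|div|span";

    // <br>, <br/> and </p> become newlines before anything is stripped.
    private static final Pattern BREAKS =
            Pattern.compile("<\\s*(br\\s*/?|/\\s*p)\\s*>", Pattern.CASE_INSENSITIVE);

    // Opening or closing form of a listed tag, with optional attributes.
    private static final Pattern KNOWN_TAGS =
            Pattern.compile("<\\s*/?\\s*(?:" + TAGS + ")\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    static String strip(String html) {
        String text = BREAKS.matcher(html).replaceAll("\n");
        text = KNOWN_TAGS.matcher(text).replaceAll("");
        // Decode a few common entities; "&amp;" goes last so that
        // "&amp;lt;" ends up as "&lt;" rather than "<".
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>a &lt; b &amp; c</p><br/>x <notatag> y"));
    }
}
```

Angle-bracketed text that does not start with a listed tag name, such as `<notatag>` or `1 < 2 and 3 > 2`, is left alone. Text that happens to start with a listed name (for example `<b and c>`) would still be removed, so this remains a heuristic.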

2008-10-27 23:52:37

You can use this JavaScript code to remove HTML tags while preserving line breaks.

function remove_html_tags(html) {
    html = html.replace(/<div>/g, "").replace(/<\/div>/g, "<br>");
    html = html.replace(/<br>/g, "$br$");
    html = html.replace(/(?:\r\n|\r|\n)/g, '$br$');
    var tmp = document.createElement("DIV");
    tmp.innerHTML = html;
    html = tmp.textContent || tmp.innerText;
    html = html.replace(/\$br\$/g, "\n");
    return html;
}

2021-09-06 11:00:27

On Android, try this:

String result = Html.fromHtml(html).toString();

2015-05-04 04:29:30
