Is there a good way to remove HTML from a Java string? A simple regex such as

replaceAll("\\<.*?>", "") 
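To see where the plain regex falls short, here is a minimal sketch using only the JDK. The class name RegexStripDemo and the helper stripTags are made up for illustration, and the inputs are just sample strings.

```java
class RegexStripDemo {
    // Naive approach: delete everything from a '<' up to the next '>'.
    static String stripTags(String html) {
        return html.replaceAll("<.*?>", "");
    }

    public static void main(String[] args) {
        // Simple markup is handled as expected.
        System.out.println(stripTags("The <b>quick</b> brown fox")); // The quick brown fox
        // Entities are left untouched, not decoded.
        System.out.println(stripTags("Tom &amp; Jerry"));            // Tom &amp; Jerry
        // Plain text containing '<' and '>' is mistaken for a tag and lost.
        System.out.println(stripTags("a < b or b > c"));             // a  c
    }
}
```

So the regex is fine for trusted, simple markup, but it does not decode entities and it can eat ordinary text that happens to sit between angle brackets.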



It sounds like you want to convert HTML to plain text. If so, take a look at www.htmlparser.org. Below is an example that strips all tags from the HTML file found at a URL. It uses org.htmlparser.beans.StringBean.

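// Requires: import org.htmlparser.beans.StringBean;
static public String getUrlContentsAsText(String url) {
    String content = "";
    StringBean stringBean = new StringBean();
    stringBean.setURL(url); // point the bean at the page to fetch and parse
    content = stringBean.getStrings(); // text of the page with tags stripped
    return content;
}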


It's worth noting that if you are trying to do this in a ServiceStack project, it is already available as a built-in string extension:

using ServiceStack.Text;
// ...
"The <b>quick</b> brown <p> fox </p> jumps over the lazy dog".StripHtml();



  text.replaceAll("<.*?>", " ") -> replaces every HTML tag with a space.

  text.replaceAll("&.*?;", "") -> removes every entity that starts with "&" and ends with ";", such as &nbsp;, &amp;, &gt; etc. Note that the entities are deleted, not decoded.
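One caveat with the entity regex: it throws the entity away instead of turning it back into a character, and a stray "&" can make it swallow text up to the next ";". Below is a rough sketch of decoding a few named entities instead. The class EntityDemo, the method decodeEntities and the small lookup table are my own assumptions for illustration (Java 9+), not a complete entity list.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityDemo {
    // A handful of named entities only; nbsp is simplified to a plain space here.
    private static final Map<String, String> NAMED = Map.of(
        "amp", "&", "lt", "<", "gt", ">", "quot", "\"", "nbsp", " ");

    // Only matches "&name;" where name is purely alphabetic.
    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z]+);");

    static String decodeEntities(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // Unknown names are copied through unchanged.
            String replacement = NAMED.getOrDefault(m.group(1), m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeEntities("Tom &amp; Jerry"));                // Tom & Jerry
        System.out.println("AT&T rocks; really".replaceAll("&.*?;", ""));     // AT really
        System.out.println(decodeEntities("AT&T rocks; really"));             // AT&T rocks; really
    }
}
```

For anything beyond a few entities, a real HTML library is the safer choice since it knows the full entity table, including numeric references.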


On Android, android.text.Html can do the conversion:

String result = Html.fromHtml(html).toString();

The accepted answer did not work for me on the test case I pointed out: the result for "a < b or b > c" is "a b or b > c".


import java.io.IOException;
import java.io.StringReader;
import java.util.logging.Logger;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/**
 * Take HTML and give back the text part while dropping the HTML tags.
 *
 * There is some risk that using TagSoup means we'll permute non-HTML text.
 * However, it seems to work the best so far in test cases.
 *
 * @author dan
 * @see <a href="http://home.ccil.org/~cowan/XML/tagsoup/">TagSoup</a>
 */
public class Html2Text2 implements ContentHandler {
    private StringBuffer sb;

    public Html2Text2() {
    }

    public void parse(String str) throws IOException, SAXException {
        XMLReader reader = new Parser();
        // Register this object so the parser delivers its SAX events here.
        reader.setContentHandler(this);
        sb = new StringBuffer();
        reader.parse(new InputSource(new StringReader(str)));
    }

    public String getText() {
        return sb.toString();
    }

    @Override
    public void characters(char[] ch, int start, int length)
        throws SAXException {
        // Copy only the slice of the buffer that belongs to this event.
        for (int idx = 0; idx < length; idx++) {
            sb.append(ch[idx + start]);
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length)
        throws SAXException {
        sb.append(ch, start, length);
    }

    // The methods below do not contribute to the text
    public void endDocument() throws SAXException {
    }

    public void endElement(String uri, String localName, String qName)
        throws SAXException {
    }

    public void endPrefixMapping(String prefix) throws SAXException {
    }

    public void processingInstruction(String target, String data)
        throws SAXException {
    }

    public void setDocumentLocator(Locator locator) {
    }

    public void skippedEntity(String name) throws SAXException {
    }

    public void startDocument() throws SAXException {
    }

    public void startElement(String uri, String localName, String qName,
        Attributes atts) throws SAXException {
    }

    public void startPrefixMapping(String prefix, String uri)
        throws SAXException {
    }
}
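If you just want to see the "collect characters() events" idea in isolation, here is a small sketch that uses the SAX parser bundled with the JDK instead of TagSoup. That is a deliberate swap: the JDK parser only accepts well-formed XML/XHTML, so this is not a replacement for TagSoup on real-world HTML. The class SaxTextDemo and method textOf are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxTextDemo {
    // Returns the concatenated character data of a well-formed XHTML fragment.
    static String textOf(String xhtml) {
        StringBuilder sb = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Tags never reach this callback, only the text between them.
                sb.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            // Malformed input, I/O problems and parser setup errors all end up here.
            throw new IllegalArgumentException("could not parse input", e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: Tom & Jerry are quick
        System.out.println(textOf("<p>Tom &amp; Jerry are <b>quick</b></p>"));
    }
}
```

Because the parser decodes the predefined XML entities itself, &amp; comes back as a plain "&" without any extra regex pass.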


Document doc = Jsoup.parse(htmlstrl);
// Whitelist.none() permits no tags at all; newer jsoup releases name this class Safelist.
Whitelist wl = Whitelist.none();
String plain = Jsoup.clean(doc.text(), wl);
