是否有一个好方法从Java字符串中删除HTML ?一个简单的正则表达式
replaceAll("\\<.*?>", "")
会起作用,但有些东西像&将不能正确地转换,并且两个尖括号之间的非html将被删除(即。*?在正则表达式中将消失)。
是否有一个好方法从Java字符串中删除HTML ?一个简单的正则表达式
replaceAll("\\<.*?>", "")
会起作用,但有些东西像&将不能正确地转换,并且两个尖括号之间的非html将被删除(即。*?在正则表达式中将消失)。
当前回答
接受的答案并不适用于我所指出的测试用例:“a < b or b > c”的结果是“a b or b > c”。
所以,我用TagSoup代替。下面是一个对我的测试用例(以及其他一些测试用例)有效的示例:
import java.io.IOException;
import java.io.StringReader;
import java.util.logging.Logger;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
/**
* Take HTML and give back the text part while dropping the HTML tags.
*
* There is some risk that using TagSoup means we'll permute non-HTML text.
* However, it seems to work the best so far in test cases.
*
* @author dan
* @see <a href="http://home.ccil.org/~cowan/XML/tagsoup/">TagSoup</a>
*/
public class Html2Text2 implements ContentHandler {
private StringBuffer sb;
public Html2Text2() {
}
public void parse(String str) throws IOException, SAXException {
XMLReader reader = new Parser();
reader.setContentHandler(this);
sb = new StringBuffer();
reader.parse(new InputSource(new StringReader(str)));
}
public String getText() {
return sb.toString();
}
@Override
public void characters(char[] ch, int start, int length)
throws SAXException {
for (int idx = 0; idx < length; idx++) {
sb.append(ch[idx+start]);
}
}
@Override
public void ignorableWhitespace(char[] ch, int start, int length)
throws SAXException {
sb.append(ch);
}
// The methods below do not contribute to the text
@Override
public void endDocument() throws SAXException {
}
@Override
public void endElement(String uri, String localName, String qName)
throws SAXException {
}
@Override
public void endPrefixMapping(String prefix) throws SAXException {
}
@Override
public void processingInstruction(String target, String data)
throws SAXException {
}
@Override
public void setDocumentLocator(Locator locator) {
}
@Override
public void skippedEntity(String name) throws SAXException {
}
@Override
public void startDocument() throws SAXException {
}
@Override
public void startElement(String uri, String localName, String qName,
Attributes atts) throws SAXException {
}
@Override
public void startPrefixMapping(String prefix, String uri)
throws SAXException {
}
}
其他回答
接受的答案并不适用于我所指出的测试用例:“a < b or b > c”的结果是“a b or b > c”。
所以,我用TagSoup代替。下面是一个对我的测试用例(以及其他一些测试用例)有效的示例:
import java.io.IOException;
import java.io.StringReader;
import java.util.logging.Logger;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
/**
* Take HTML and give back the text part while dropping the HTML tags.
*
* There is some risk that using TagSoup means we'll permute non-HTML text.
* However, it seems to work the best so far in test cases.
*
* @author dan
* @see <a href="http://home.ccil.org/~cowan/XML/tagsoup/">TagSoup</a>
*/
public class Html2Text2 implements ContentHandler {
private StringBuffer sb;
public Html2Text2() {
}
public void parse(String str) throws IOException, SAXException {
XMLReader reader = new Parser();
reader.setContentHandler(this);
sb = new StringBuffer();
reader.parse(new InputSource(new StringReader(str)));
}
public String getText() {
return sb.toString();
}
@Override
public void characters(char[] ch, int start, int length)
throws SAXException {
for (int idx = 0; idx < length; idx++) {
sb.append(ch[idx+start]);
}
}
@Override
public void ignorableWhitespace(char[] ch, int start, int length)
throws SAXException {
sb.append(ch);
}
// The methods below do not contribute to the text
@Override
public void endDocument() throws SAXException {
}
@Override
public void endElement(String uri, String localName, String qName)
throws SAXException {
}
@Override
public void endPrefixMapping(String prefix) throws SAXException {
}
@Override
public void processingInstruction(String target, String data)
throws SAXException {
}
@Override
public void setDocumentLocator(Locator locator) {
}
@Override
public void skippedEntity(String name) throws SAXException {
}
@Override
public void startDocument() throws SAXException {
}
@Override
public void startElement(String uri, String localName, String qName,
Attributes atts) throws SAXException {
}
@Override
public void startPrefixMapping(String prefix, String uri)
throws SAXException {
}
}
听起来好像您想从HTML转换为纯文本。 如果是这样的话,请查看www.htmlparser.org。下面是一个示例,它从URL中找到的html文件中剥离所有标记。 它使用org.htmlparser.beans.StringBean。
static public String getUrlContentsAsText(String url) {
String content = "";
StringBean stringBean = new StringBean();
stringBean.setURL(url);
content = stringBean.getStrings();
return content;
}
你可以使用这个方法从字符串中删除HTML标签,
public static String stripHtmlTags(String html) {
return html.replaceAll("<.*?>", "");
}
〇应该可以
使用这个
text.replaceAll('<.*?>' , " ") -> This will replace all the html tags with a space.
这
text.replaceAll('&.*?;' , "")-> this will replace all the tags which starts with "&" and ends with ";" like , &, > etc.
从字符串中删除HTML标签。在某个地方,我们需要解析一些字符串,这些字符串是由服务器端的Httpresponse等响应接收到的。
所以我们需要解析它。
在这里,我将展示如何从字符串中删除html标签。
// sample text with tags
string str = "<html><head>sdfkashf sdf</head><body>sdfasdf</body></html>";
// regex which match tags
System.Text.RegularExpressions.Regex rx = new System.Text.RegularExpressions.Regex("<[^>]*>");
// replace all matches with empty strin
str = rx.Replace(str, "");
//now str contains string without html tags