有没有一种简单的方法可以在JavaScript中获取一个html字符串并去掉html?
当前回答
简单的2行jquery去掉html。
var content = "<p>checking the html source </p><p>
</p><p>with </p><p>all</p><p>the html </p><p>content</p>";
var text = $(content).text();//It gets you the plain text
console.log(text);//check the data in your console
cj("#text_area_id").val(text);//set your content to text area using text_area_id
其他回答
一个非常好的库是净化html,它是一个纯JavaScript函数,可以在任何环境中使用。
我的案例是React Native,我需要从给定文本中删除所有HTML标记。所以我创建了这个包装函数:
import sanitizer from 'sanitize-html';
const textSanitizer = (textWithHTML: string): string =>
sanitizer(textWithHTML, {
allowedTags: [],
});
export default textSanitizer;
现在,通过使用textSanitizer,我可以获得纯文本内容。
const-htmlParser=new DOMParser().parseFromString(“<h6>用户<p>名称</p></h6>”,'text/html');const textString=htmlParser.body.textContent;console.log(textString)
我修改了Jibberboy2000的答案,加入了几种<BR/>标记格式,删除了<SCRIPT>和<STYLE>标记中的所有内容,通过删除多个换行符和空格来格式化生成的HTML,并将一些HTML编码代码转换为普通代码。经过一些测试后,您似乎可以将大部分完整网页转换为保留页面标题和内容的简单文本。
在简单示例中,
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<!--comment-->
<head>
<title>This is my title</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style>
body {margin-top: 15px;}
a { color: #D80C1F; font-weight:bold; text-decoration:none; }
</style>
</head>
<body>
<center>
This string has <i>html</i> code i want to <b>remove</b><br>
In this line <a href="http://www.bbc.co.uk">BBC</a> with link is mentioned.<br/>Now back to "normal text" and stuff using <html encoding>
</center>
</body>
</html>
变成
这是我的头衔此字符串包含我要删除的html代码在这一行中,BBC(http://www.bbc.co.uk)其中提到了链接。现在回到“普通文本”和使用
JavaScript函数和测试页面如下所示:
function convertHtmlToText() {
var inputText = document.getElementById("input").value;
var returnText = "" + inputText;
//-- remove BR tags and replace them with line break
returnText=returnText.replace(/<br>/gi, "\n");
returnText=returnText.replace(/<br\s\/>/gi, "\n");
returnText=returnText.replace(/<br\/>/gi, "\n");
//-- remove P and A tags but preserve what's inside of them
returnText=returnText.replace(/<p.*>/gi, "\n");
returnText=returnText.replace(/<a.*href="(.*?)".*>(.*?)<\/a>/gi, " $2 ($1)");
//-- remove all inside SCRIPT and STYLE tags
returnText=returnText.replace(/<script.*>[\w\W]{1,}(.*?)[\w\W]{1,}<\/script>/gi, "");
returnText=returnText.replace(/<style.*>[\w\W]{1,}(.*?)[\w\W]{1,}<\/style>/gi, "");
//-- remove all else
returnText=returnText.replace(/<(?:.|\s)*?>/g, "");
//-- get rid of more than 2 multiple line breaks:
returnText=returnText.replace(/(?:(?:\r\n|\r|\n)\s*){2,}/gim, "\n\n");
//-- get rid of more than 2 spaces:
returnText = returnText.replace(/ +(?= )/g,'');
//-- get rid of html-encoded characters:
returnText=returnText.replace(/ /gi," ");
returnText=returnText.replace(/&/gi,"&");
returnText=returnText.replace(/"/gi,'"');
returnText=returnText.replace(/</gi,'<');
returnText=returnText.replace(/>/gi,'>');
//-- return
document.getElementById("output").value = returnText;
}
它与此HTML一起使用:
<textarea id="input" style="width: 400px; height: 300px;"></textarea><br />
<button onclick="convertHtmlToText()">CONVERT</button><br />
<textarea id="output" style="width: 400px; height: 300px;"></textarea><br />
https://developer.mozilla.org/en-US/docs/Web/API/Element/insertAdjacentHTML
var div = document.getElementsByTagName('div');
for (var i=0; i<div.length; i++) {
div[i].insertAdjacentHTML('afterend', div[i].innerHTML);
document.body.removeChild(div[i]);
}
这应该可以在任何Javascript环境(包括NodeJS)上完成工作。
const text = `
<html lang="en">
<head>
<style type="text/css">*{color:red}</style>
<script>alert('hello')</script>
</head>
<body><b>This is some text</b><br/><body>
</html>`;
// Remove style tags and content
text.replace(/<style[^>]*>.*<\/style>/gm, '')
// Remove script tags and content
.replace(/<script[^>]*>.*<\/script>/gm, '')
// Remove all opening, closing and orphan HTML tags
.replace(/<[^>]+>/gm, '')
// Remove leading spaces and repeated CR/LF
.replace(/([\r\n]+ +)+/gm, '');