A solution based on lxml.html (lxml is a native library, so it performs better than a pure-Python solution).
To install the lxml module, run pip install lxml
Remove all tags
from lxml import html
## from file-like object or URL
## parse() returns an ElementTree; getroot() gives the element that has text_content()
tree = html.parse(file_like_object_or_url).getroot()
## from string
tree = html.fromstring('safe <script>unsafe</script> safe')
print(tree.text_content().strip())
### OUTPUT: 'safe unsafe safe'
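As the output above shows, text_content() also returns the text inside <script> (and <style>) elements. If that text is unwanted, one option is to drop those elements before extracting the text. A minimal sketch, assuming lxml.etree.strip_elements suits your input; visible_text is a hypothetical helper name, not part of lxml:

```python
from lxml import etree, html


def visible_text(markup):
    """Return the text of an HTML fragment without <script>/<style> contents."""
    tree = html.fromstring(markup)
    # Remove <script> and <style> elements together with their contents;
    # with_tail=False keeps the text that follows each removed element.
    etree.strip_elements(tree, 'script', 'style', with_tail=False)
    # Collapse the whitespace left behind by the removed elements.
    return ' '.join(tree.text_content().split())


print(visible_text('<div>safe <script>unsafe</script> safe</div>'))
```

This only hides script and style text; it does not sanitize the markup.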
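Remove all tags after pre-sanitizing the HTML (dropping some tags entirely)
from lxml import html
## since lxml 5.2, lxml.html.clean requires the separate lxml_html_clean package
from lxml.html.clean import clean_html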
tree = html.fromstring("""<script>dangerous</script><span class="item-summary">
Detailed answers to any questions you might have
</span>""")
## text only
print(clean_html(tree).text_content().strip())
### OUTPUT: 'Detailed answers to any questions you might have'
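See also http://lxml.de/lxmlhtml.html#cleaning-up-html for what exactly lxml.html.clean does.
If you need more control over which specific tags are removed before converting to text, create a custom lxml Cleaner with the desired options, for example: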
from lxml.html.clean import Cleaner
cleaner = Cleaner(page_structure=True,
                  meta=True,
                  embedded=True,
                  links=True,
                  style=True,
                  processing_instructions=True,
                  inline_style=True,
                  scripts=True,
                  javascript=True,
                  comments=True,
                  frames=True,
                  forms=True,
                  annoying_tags=True,
                  remove_unknown_tags=True,
                  safe_attrs_only=True,
                  safe_attrs=frozenset(['src', 'color', 'href', 'title', 'class', 'name', 'id']),
                  remove_tags=('span', 'font', 'div')
                  )
## unsafe_html is the untrusted HTML (a string or an lxml element) to sanitize
sanitized_html = cleaner.clean_html(unsafe_html)
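A short end-to-end sketch of a custom Cleaner followed by text extraction. It uses kill_tags, which removes an element together with its contents, whereas remove_tags only drops the tag and keeps the text inside it. The sample markup and the choice of options are assumptions for illustration, and the import assumes lxml.html.clean is available in your installation:

```python
from lxml import html
from lxml.html.clean import Cleaner

# kill_tags drops the whole subtree; scripts/style/comments are killed as well.
cleaner = Cleaner(scripts=True, style=True, comments=True, kill_tags=('ul',))

tree = html.fromstring(
    '<div><ul><li>Home</li><li>About</li></ul>'
    '<script>track()</script>'
    '<p>Main <b>content</b></p><!-- note --></div>'
)

# clean_html() on an element returns a cleaned copy; the input tree is untouched.
cleaned = cleaner.clean_html(tree)
text = ' '.join(cleaned.text_content().split())
print(text)
```

Here the menu list and the script vanish entirely, so only the paragraph text remains.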
To customize how the plain text is generated, you can use lxml.etree.tostring instead of text_content():
from lxml.etree import tostring
print(tostring(tree, method='text', encoding=str))
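One possible customization, sketched here under the assumption that you want line breaks preserved: add a newline to the tail of each <p> and <br> before serializing with method='text'. The sample markup is made up for illustration:

```python
from lxml import html
from lxml.etree import tostring

tree = html.fromstring('<div><p>First paragraph</p><p>Second<br>line</p></div>')

# Prefix the tail of every <p> and <br> with a newline so that
# paragraph and line breaks survive the conversion to plain text.
for el in tree.iter('p', 'br'):
    el.tail = '\n' + (el.tail or '')

text = tostring(tree, method='text', encoding=str)
print(text)
```

Without the loop, the same call would run the three pieces of text together with no separators.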