from mechanize import Browser
br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
  print line

当在HTML文件中打印一行时,我试图找到一种方法,只显示每个HTML元素的内容,而不是格式本身。如果它发现'<a href="等等。例如">some text</a>',它只会打印'some text', '<b>hello</b>'打印'hello',等等。该怎么做呢?


当前回答

如果你需要保留HTML实体(即&),我在Eloff的答案中添加了“handle_entityref”方法。

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def handle_entityref(self, name):
        self.fed.append('&%s;' % name)
    def get_data(self):
        return ''.join(self.fed)

def html_to_text(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

其他回答

如果你想去掉所有HTML标签,我发现最简单的方法是使用BeautifulSoup:

from bs4 import BeautifulSoup  # Or from BeautifulSoup import BeautifulSoup

def stripHtmlTags(htmlTxt):
    if htmlTxt is None:
            return None
        else:
            return ''.join(BeautifulSoup(htmlTxt).findAll(text=True)) 

我尝试了接受的答案的代码,但我得到了“RuntimeError:最大递归深度超出”,这没有发生在上面的代码块。

这是一个类似于目前接受的答案(https://stackoverflow.com/a/925630/95989)的解决方案,除了它直接使用内部HTMLParser类(即没有子类化),从而使它显着更简洁:

def strip_html(text):
    parts = []                                                                      
    parser = HTMLParser()                                                           
    parser.handle_data = parts.append                                               
    parser.feed(text)                                                               
    return ''.join(parts)

我正在解析Github自述,我发现下面的工作真的很好:

import re
import lxml.html

def strip_markdown(x):
    links_sub = re.sub(r'\[(.+)\]\([^\)]+\)', r'\1', x)
    bold_sub = re.sub(r'\*\*([^*]+)\*\*', r'\1', links_sub)
    emph_sub = re.sub(r'\*([^*]+)\*', r'\1', bold_sub)
    return emph_sub

def strip_html(x):
    return lxml.html.fromstring(x).text_content() if x else ''

然后

readme = """<img src="https://raw.githubusercontent.com/kootenpv/sky/master/resources/skylogo.png" />

            sky is a web scraping framework, implemented with the latest python versions in mind (3.4+). 
            It uses the asynchronous `asyncio` framework, as well as many popular modules 
            and extensions.

            Most importantly, it aims for **next generation** web crawling where machine intelligence 
            is used to speed up the development/maintainance/reliability of crawling.

            It mainly does this by considering the user to be interested in content 
            from *domains*, not just a collection of *single pages*
            ([templating approach](#templating-approach))."""

strip_markdown(strip_html(readme))

正确移除所有markdown和html。

我就是这么做的,但我不知道我在做什么。我通过剥离HTML标记从HTML表中获取数据。

它接受字符串“name”并返回不带HTML标记的字符串“name1”。

x = 0
anglebrackets = 0
name1 = ""
while x < len(name):
    
    if name[x] == "<":
        anglebrackets = anglebrackets + 1
    if name[x] == ">":
        anglebrackets = anglebrackets - 1
    if anglebrackets == 0:
        if name[x] != ">":
            name1 = name1 + name[x]
    x = x + 1

2020年更新

使用Mozilla Bleach库,它确实允许您自定义保留哪些标记和保留哪些属性,并且还可以根据值过滤属性

这里有两个例子来说明

1)不允许任何HTML标签或属性

获取原始文本样本

raw_text = """
<p><img width="696" height="392" src="https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-768x432.jpg" class="attachment-medium_large size-medium_large wp-post-image" alt="Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC" style="float:left; margin:0 15px 15px 0;" srcset="https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-768x432.jpg 768w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-300x169.jpg 300w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1024x576.jpg 1024w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-696x392.jpg 696w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1068x601.jpg 1068w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-747x420.jpg 747w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-190x107.jpg 190w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-380x214.jpg 380w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-760x428.jpg 760w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc.jpg 1280w" sizes="(max-width: 696px) 100vw, 696px" />Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform&#8217;s users. Also as part [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://news.bitcoin.com/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc/">Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC</a> appeared first on <a rel="nofollow" href="https://news.bitcoin.com">Bitcoin News</a>.</p> 
"""

2)从原始文本中删除所有HTML标签和属性

# DO NOT ALLOW any tags or any attributes
from bleach.sanitizer import Cleaner
cleaner = Cleaner(tags=[], attributes={}, styles=[], protocols=[], strip=True, strip_comments=True, filters=None)
print(cleaner.clean(raw_text))

输出

Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform&#8217;s users. Also as part [&#8230;]
The post Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC appeared first on Bitcoin News. 

3只允许带有srcset属性的img标签

from bleach.sanitizer import Cleaner
# ALLOW ONLY img tags with src attribute
cleaner = Cleaner(tags=['img'], attributes={'img': ['srcset']}, styles=[], protocols=[], strip=True, strip_comments=True, filters=None)
print(cleaner.clean(raw_text))

输出

<img srcset="https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-768x432.jpg 768w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-300x169.jpg 300w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1024x576.jpg 1024w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-696x392.jpg 696w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1068x601.jpg 1068w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-747x420.jpg 747w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-190x107.jpg 190w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-380x214.jpg 380w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-760x428.jpg 760w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc.jpg 1280w">Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform&#8217;s users. Also as part [&#8230;]
The post Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC appeared first on Bitcoin News.