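I'm trying to develop a simple web scraper. I want to extract the text without the HTML code. It works on plain HTML, but not on some pages where JavaScript code adds text.

For example, if some JavaScript code adds some text, I can't see it, because when I call:

response = urllib2.urlopen(request)

I get the original text without the added text (because JavaScript is executed in the client).

So, I'm looking for some ideas to solve this problem.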


Current answer

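EDIT Sept 2021: phantomjs isn't maintained any more, either

EDIT 30/Dec/2017: This answer appears in the top results of Google searches, so I decided to update it. The old answer is still at the end.

dryscrape isn't maintained anymore, and the library the dryscrape developers recommend is Python 2 only. I have found that using Selenium's Python library with PhantomJS as the web driver is fast enough and gets the work done easily.

Once you have installed PhantomJS, make sure the phantomjs binary is available in the current path: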

phantomjs --version
# result:
2.1.1

#Example

To give an example, I created a sample page with the following HTML code. (link):

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Javascript scraping test</title>
</head>
<body>
  <p id='intro-text'>No javascript support</p>
  <script>
     document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
  </script> 
</body>
</html>

Without javascript it says: No javascript support, and with javascript: Yay! Supports javascript

#Scraping without JS support:

import requests
from bs4 import BeautifulSoup
response = requests.get(my_url)
soup = BeautifulSoup(response.text, "html.parser")  # name the parser explicitly to avoid bs4's "no parser specified" warning
soup.find(id="intro-text")
# Result:
<p id="intro-text">No javascript support</p>
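
To see the same effect without any network access, here is a minimal sketch that parses the sample page from a string using only the standard library. The names SAMPLE_HTML, TextById and text_by_id are made up for this illustration; they are not part of any library. A static parser never runs the script, so it only sees the text that is in the markup.

```python
from html.parser import HTMLParser

# The sample page from above, embedded as a string (hypothetical name).
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<body>
  <p id='intro-text'>No javascript support</p>
  <script>
     document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
  </script>
</body>
</html>"""


class TextById(HTMLParser):
    """Collect the text inside the first element that has the given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0       # > 0 while we are inside the target element
        self.done = False    # True once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1  # nested tag inside the target element
        elif not self.done and dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            self.done = self.depth == 0

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_by_id(markup, element_id):
    """Return the text of the element with the given id, as written in the markup."""
    parser = TextById(element_id)
    parser.feed(markup)
    return "".join(parser.chunks)


static_text = text_by_id(SAMPLE_HTML, "intro-text")
print(static_text)  # No javascript support
```

The script body is just text to the parser, so the assignment to innerHTML never takes effect. This simplified parser assumes the target element contains no void tags such as br.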

#Scraping with JS support:

from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get(my_url)
p_element = driver.find_element_by_id(id_='intro-text')
print(p_element.text)
# result:
'Yay! Supports javascript'
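
Since PhantomJS is no longer maintained and Selenium 4 dropped the find_element_by_* helpers, the same idea can be sketched with headless Chrome. This is only a sketch under assumptions: it assumes Selenium 4 and a recent Chrome are installed, and fetch_rendered_text is a hypothetical helper name, not something from the original answer.

```python
def fetch_rendered_text(url, element_id):
    """Load url in headless Chrome and return the text of the element with element_id."""
    # Imported inside the function so it can be defined even where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without opening a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)  # returns after the page has loaded, inline scripts included
        return driver.find_element(By.ID, element_id).text
    finally:
        driver.quit()    # always shut the browser down, even on errors
```

Calling fetch_rendered_text(my_url, 'intro-text') on the sample page should give the JavaScript-modified text rather than the original one.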

You can also use the Python library dryscrape to scrape javascript driven websites.

#Scraping with JS support:

import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
session.visit(my_url)
response = session.body()
soup = BeautifulSoup(response, "html.parser")
soup.find(id="intro-text")
# Result:
<p id="intro-text">Yay! Supports javascript</p>

Other answers

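If you have ever used the Requests module for Python before, I recently found out that the developer created a new module named Requests-HTML, which now also has the ability to render JavaScript.

You can also visit https://html.python-requests.org/ to learn more about this module, or if you're only interested in rendering JavaScript, you can visit https://html.python-requests.org/?#javascript-support to learn directly how to use the module to render JavaScript using Python.

Essentially, once you have correctly installed the Requests-HTML module, the following example, which is shown at the above link, shows how you can use this module to scrape a website and render the JavaScript contained within it: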

from requests_html import HTMLSession
session = HTMLSession()

r = session.get('http://python-requests.org/')

r.html.render()

r.html.search('Python 2 will retire in only {months} months!')['months']
# Result:
'<time>25</time>'

I recently learnt about this from a YouTube video. Click here! to watch the YouTube video, which demonstrates how the module works.

You'll want to use urllib, requests, BeautifulSoup and the Selenium web driver in your script for different parts of the page (to name a few). Sometimes one of these modules is enough; sometimes you'll need two, three, or all of them. Sometimes you'll need to switch off JS in your browser, and sometimes you'll need header info in your script. No two websites can be scraped the same way, and no website can be scraped the same way forever: you will usually have to modify your crawler after a few months. But they can all be scraped! Where there's a will there's a way. If you need scraped data continuously into the future, scrape everything you need and store it in .dat files with pickle. Keep searching for how to do what with these modules, and paste your error messages into Google.
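
For the "store it in .dat files with pickle" part, a minimal sketch could look like this. The record layout, the file name scraped.dat and the helper names save_scraped and load_scraped are all made up for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def save_scraped(records, path):
    """Serialize scraped records to disk with pickle."""
    with open(path, "wb") as fh:
        pickle.dump(records, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_scraped(path):
    """Load records back. Only unpickle files you wrote yourself."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Hypothetical records, standing in for whatever your scraper collected.
records = [
    {"url": "http://example.com/a", "text": "first page"},
    {"url": "http://example.com/b", "text": "second page"},
]

# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "scraped.dat"
    save_scraped(records, store)
    restored = load_scraped(store)

print(restored == records)  # True
```

Pickle is convenient for your own data, but unpickling a file from an untrusted source can execute arbitrary code, so keep it to files your own crawler produced.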

This seems to be a good solution as well, taken from a great blog post

import sys  
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *  
from lxml import html 

# Take this class for granted. Just use the result of rendering.
class Render(QWebPage):  
  def __init__(self, url):  
    self.app = QApplication(sys.argv)  
    QWebPage.__init__(self)  
    self.loadFinished.connect(self._loadFinished)  
    self.mainFrame().load(QUrl(url))  
    self.app.exec_()  

  def _loadFinished(self, result):  
    self.frame = self.mainFrame()  
    self.app.quit()  

url = 'http://pycoders.com/archive/'  
r = Render(url)  
result = r.frame.toHtml()
# This step is important: convert the QString to ASCII so lxml can process it.
# The following returns an lxml element tree
archive_links = html.fromstring(str(result.toAscii()))
print archive_links

# The following returns an array containing the URLs
raw_links = archive_links.xpath('//div[@class="campaign"]/a/@href')
print raw_links

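We are not getting the correct results because any javascript generated content needs to be rendered on the DOM. When we fetch an HTML page, we fetch the initial DOM, unmodified by javascript.

Therefore we need to render the javascript content before we crawl the page.

As selenium has already been mentioned many times in this thread (along with how slow it sometimes gets), I will list two other possible solutions.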


Solution 1: This is a very nice tutorial on how to use Scrapy to crawl javascript generated content, and we are going to follow just that.

What we will need:

1. Docker installed in our machine. This is a plus over other solutions until this point, as it utilizes an OS-independent platform.

2. Install Splash following the instruction listed for our corresponding OS. Quoting from the Splash documentation:

"Splash is a javascript rendering service. It's a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and QT5."

Essentially we are going to use Splash to render Javascript generated content.

3. Run the splash server:

sudo docker run -p 8050:8050 scrapinghub/splash

4. Install the scrapy-splash plugin:

pip install scrapy-splash

5. Assuming that we already have a Scrapy project created (if not, let's make one), we will follow the guide and update the settings.py:

"Then go to your scrapy project's settings.py and set these middlewares:"

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

"The URL of the Splash server (if you're using Win or OSX this should be the URL of the docker machine: How to get a Docker container's IP address from the host?):"

SPLASH_URL = 'http://localhost:8050'

"And finally you need to set these values too:"

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

6. Finally, we can use a SplashRequest:

"In a normal spider you have Request objects which you can use to open URLs. If the page you want to open contains JS generated data you have to use SplashRequest (or SplashFormRequest) to render the page. Here's a simple example:"

import scrapy
from scrapy_splash import SplashRequest

# QuoteItem is assumed to be a scrapy.Item with "author" and "quote" fields,
# defined elsewhere in the project.

class MySpider(scrapy.Spider):
    name = "jsscraper"
    start_urls = ["http://quotes.toscrape.com/js/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url=url, callback=self.parse, endpoint='render.html'
            )

    def parse(self, response):
        for q in response.css("div.quote"):
            quote = QuoteItem()
            quote["author"] = q.css(".author::text").extract_first()
            quote["quote"] = q.css(".text::text").extract_first()
            yield quote

"SplashRequest renders the URL as html and returns the response which you can use in the callback (parse) method."
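
Under the hood, the render.html endpoint used above is a plain HTTP API, so the running Splash server can also be queried without Scrapy. A rough sketch using only the standard library follows; build_render_url and fetch_rendered_html are hypothetical helper names, and the wait value of 0.5 seconds is an arbitrary choice.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_render_url(splash_url, page_url, wait=0.5):
    """Build a request URL for Splash's render.html endpoint."""
    # "url" is the page to render, "wait" is how long Splash lets scripts run.
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_url.rstrip('/')}/render.html?{query}"


def fetch_rendered_html(splash_url, page_url, wait=0.5, timeout=30):
    """Ask a running Splash server for the rendered HTML of page_url."""
    with urlopen(build_render_url(splash_url, page_url, wait), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)


render_url = build_render_url("http://localhost:8050", "http://quotes.toscrape.com/js/")
print(render_url)
# http://localhost:8050/render.html?url=http%3A%2F%2Fquotes.toscrape.com%2Fjs%2F&wait=0.5
```

With the Docker container from step 3 running, fetch_rendered_html("http://localhost:8050", "http://quotes.toscrape.com/js/") should return the page after its scripts have run, which can then be handed to any HTML parser.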


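Solution 2: Let's call this experimental at the moment (May 2018)... This solution is for Python version 3.6 only (at the moment).

Do you know the requests module (well, who doesn't)? Now it has a web crawling little sibling: requests-HTML:

This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.

1. Install requests-html:

pipenv install requests-html

2. Make a request to the page's url:

from requests_html import HTMLSession
session = HTMLSession()
r = session.get(a_page_url)

3. Render the response to get the Javascript generated bits:

r.html.render()

Finally, the module seems to offer scraping capabilities. Alternatively, we can try using BeautifulSoup with the r.html object we just rendered.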
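
The BeautifulSoup route can be sketched roughly as follows, assuming both requests-html and beautifulsoup4 are installed. rendered_soup is a hypothetical helper name and the 20 second timeout is an arbitrary choice.

```python
def rendered_soup(url, timeout=20):
    """Fetch url, let requests-html run its JavaScript, and return a BeautifulSoup tree."""
    # Imported inside the function so it can be defined even where these packages are missing.
    from bs4 import BeautifulSoup
    from requests_html import HTMLSession

    session = HTMLSession()
    try:
        r = session.get(url)
        r.html.render(timeout=timeout)  # runs the page's scripts in headless Chromium
        # r.html.html holds the markup as text; after render() it is the rendered version.
        return BeautifulSoup(r.html.html, "html.parser")
    finally:
        session.close()  # also closes the browser that render() started
```

For the sample page from the first answer, rendered_soup(my_url).find(id="intro-text") should then contain the text set by the script. Note that the first call to render() downloads Chromium, so it can take a while.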