Web-scraping JavaScript pages with Python

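I'm trying to develop a simple web scraper. I want to extract the text without the HTML code. It works on plain HTML, but not on some pages where JavaScript code adds text.

For example, if some JavaScript code adds some text, I can't see it, because when I call: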

response = urllib2.urlopen(request)

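I get the original text without the added text (because JavaScript is executed on the client).

So, I'm looking for some ideas to solve this problem.

Current answer

We are not getting the correct results because any JavaScript-generated content needs to be rendered on the DOM. When we fetch an HTML page, we fetch the initial DOM, unmodified by JavaScript.

Therefore we need to render the JavaScript content before we scrape the page.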
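To make the problem concrete, here is a minimal sketch using only the standard library. The HTML string, the `VisibleTextParser` class and the `extract_text` helper are assumptions made up for illustration; they stand in for whatever a plain HTTP fetch returns. Because nothing executes the script, the text it would add never shows up in the extracted output.

```python
from html.parser import HTMLParser

# Hypothetical page source, as a plain HTTP fetch would return it.
STATIC_HTML = """
<html><body>
<p id="static">Text present in the HTML source</p>
<script>
  document.getElementById("static").textContent += " plus text added by JavaScript";
</script>
</body></html>
"""


class VisibleTextParser(HTMLParser):
    """Collects text nodes while skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


print(extract_text(STATIC_HTML))  # only the static paragraph text
```

The script is never run, so only the text that was already in the markup is recovered. Getting the rest requires something that actually executes JavaScript.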

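Since selenium has already been mentioned many times in this thread (sometimes along with how slow it can get), I will list two other possible solutions.

Solution 1: This is a very nice tutorial on how to use Scrapy to crawl JavaScript-generated content, and we are going to follow just that.

What we will need: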

1. Docker installed on our machine. This is a plus over other solutions until this point, as it utilizes an OS-independent platform.

2. Install Splash following the instructions listed for our corresponding OS. Quoting from the Splash documentation:

Splash is a javascript rendering service. It’s a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and QT5.

Essentially we are going to use Splash to render JavaScript-generated content.

3. Run the Splash server:

sudo docker run -p 8050:8050 scrapinghub/splash

4. Install the scrapy-splash plugin:

pip install scrapy-splash

5. Assuming that we already have a Scrapy project created (if not, let's make one), we will follow the guide and update the settings.py:

Then go to your scrapy project’s settings.py and set these middlewares:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

The URL of the Splash server (if you’re using Win or OSX this should be the URL of the docker machine: How to get a Docker container's IP address from the host?):

SPLASH_URL = 'http://localhost:8050'

And finally you need to set these values too:

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

6. Finally, we can use a SplashRequest:

In a normal spider you have Request objects which you can use to open URLs. If the page you want to open contains JS generated data you have to use SplashRequest (or SplashFormRequest) to render the page.

Here’s a simple example:

import scrapy
from scrapy_splash import SplashRequest

# QuoteItem is assumed to be a scrapy.Item defined in the project's items.py
class MySpider(scrapy.Spider):
    name = "jsscraper"
    start_urls = ["http://quotes.toscrape.com/js/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url=url, callback=self.parse, endpoint='render.html'
            )

    def parse(self, response):
        for q in response.css("div.quote"):
            quote = QuoteItem()
            quote["author"] = q.css(".author::text").extract_first()
            quote["quote"] = q.css(".text::text").extract_first()
            yield quote

SplashRequest renders the URL as html and returns the response which you can use in the callback (parse) method.
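Since Splash exposes an HTTP API, the same render.html endpoint can also be reached without Scrapy. The sketch below only builds the request URL; the helper name `splash_render_url` and the `wait` value of 0.5 seconds are assumptions for illustration, and it presumes a Splash server listening on localhost:8050 as started above.

```python
from urllib.parse import urlencode


def splash_render_url(page_url, splash_url="http://localhost:8050", wait=0.5):
    """Build a render.html URL asking Splash to load page_url and run its JS.

    `wait` is the number of seconds Splash waits after the page loads, which
    gives scripts time to finish; 0.5 is an arbitrary choice here.
    """
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_url.rstrip('/')}/render.html?{query}"


print(splash_render_url("http://quotes.toscrape.com/js/"))
# http://localhost:8050/render.html?url=http%3A%2F%2Fquotes.toscrape.com%2Fjs%2F&wait=0.5
```

Fetching the resulting URL with any HTTP client should return the rendered HTML, which can then be parsed like a static page.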

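Solution 2: Let's call this one experimental for the moment (May 2018)... This solution is for Python version 3.6 only (at the moment).

Do you know the requests module (well, who doesn't)? Now it has a web-crawling little sibling: requests-HTML:

This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.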

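1. Install requests-html: pipenv install requests-html

2. Make a request to the page's url:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get(a_page_url)

3. Render the response to get the JavaScript-generated bits:

r.html.render()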

Finally, the module seems to offer scraping capabilities. Alternatively, we can try using BeautifulSoup with the r.html object we just rendered.

2018-05-30 19:52:45

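Other answers

Try accessing the API directly

A common scenario in scraping is that the data is requested asynchronously from an API endpoint by the webpage. A minimal example of this is the following site: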

<body>
<script>
fetch("https://jsonplaceholder.typicode.com/posts/1")
  .then(res => {
    if (!res.ok) throw Error(res.status);
    return res.json();
  })
  .then(data => {
    // inject data dynamically via JS after page load
    document.body.innerText = data.title;
  })
  .catch(err => console.error(err))
;
</script>
</body>

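In many cases, the API will be protected by CORS or an access token, or will be prohibitively rate limited, but in other cases it's publicly accessible and you can bypass the website entirely. For CORS issues, you might try cors-anywhere.

The general procedure is to use the network tab of your browser's developer tools to search the requests made by the page for keywords/substrings of the data you want to scrape. Often, you'll see an unprotected API request endpoint with a JSON payload that you can access directly with the urllib or requests modules. That's the case with the runnable snippet above, which you can use to practice: after running it, the request to the endpoint shows up in the network tab.

This example is contrived; the endpoint URL will likely be non-obvious from looking at the static markup because it could be dynamically assembled, minified and buried under dozens of other requests and endpoints. The network request will also show any relevant request payload details, such as an access token, that you may need.

After obtaining the endpoint URL and relevant details, build a request in Python using a standard HTTP library and request the data: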

>>> import requests
>>> res = requests.get("https://jsonplaceholder.typicode.com/posts/1")
>>> data = res.json()
>>> data["title"]
'sunt aut facere repellat provident occaecati excepturi optio reprehenderit'
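When the network tab shows that the endpoint expects extra headers, the same request can be assembled with the standard library's urllib. This is a sketch under assumptions: the helper names `build_api_request` and `fetch_json` are invented for illustration, and the `Bearer` authorization scheme is only one common convention; copy whatever headers the real request actually carries.

```python
import json
import urllib.request


def build_api_request(url, token=None):
    """Prepare a GET request for a JSON endpoint, optionally with a token."""
    headers = {"Accept": "application/json"}
    if token is not None:
        # Assumption: the API uses a Bearer token; adjust to what the
        # browser's network tab shows for the real request.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)


def fetch_json(request, timeout=10):
    """Send the request and decode the JSON body (performs network I/O)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_api_request("https://jsonplaceholder.typicode.com/posts/1")
print(request.get_method(), request.full_url)
# data = fetch_json(request)  # uncomment to actually hit the endpoint
```

Building the request separately from sending it makes it easy to inspect the headers before any network traffic happens.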

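When you can get away with it, this tends to be much easier, faster and more reliable than scraping the page with Selenium, Pyppeteer, Scrapy or whatever other scraping libraries are popular.

If you're unlucky and the data hasn't arrived via an API request that returns it in a nice format, it could be part of the original browser's payload in a <script> tag, either as a JSON string or (more likely) a JS object. For example: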

<body>
<script>
  var someHardcodedData = {
    userId: 1,
    id: 1,
    title: 'sunt aut facere repellat provident occaecati excepturi optio reprehenderit',
    body: 'quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto'
  };
  document.body.textContent = someHardcodedData.title;
</script>
</body>

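There's no one-size-fits-all way to obtain this data. The basic technique is to use BeautifulSoup to access the <script> tag text, then apply a regex or a parse to extract the object structure, JSON string, or whatever format the data might be in. Here's a proof of concept on the sample structure shown above: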

import json
import re
from bs4 import BeautifulSoup

# pretend we've already used requests to retrieve the data, 
# so we hardcode it for the purposes of this example
text = """
<body>
<script>
  var someHardcodedData = {
    userId: 1,
    id: 1,
    title: 'sunt aut facere repellat provident occaecati excepturi optio reprehenderit', 
    body: 'quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto'
  };
  document.body.textContent = someHardcodedData.title;
</script>
</body>
"""
soup = BeautifulSoup(text, "lxml")
script_text = str(soup.select_one("script"))
pattern = r"title: '(.*?)'"
print(re.search(pattern, script_text, re.S).group(1))

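Check out these resources for parsing JS objects that aren't quite valid JSON:

How to convert raw javascript object to python dictionary?
How to Fix JSON Key Values without double-quotes?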
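For simple object literals, a couple of regex substitutions can be enough to turn the JS into valid JSON. The sketch below is deliberately naive and rests on stated assumptions: the helper name `js_object_to_dict` is made up, string values are single-quoted and contain no backslash escapes, and no string contains text that looks like `, key:`. Anything more complex calls for a real JS-aware parser.

```python
import json
import re

# Shortened version of the script text from the example above.
script_text = """
var someHardcodedData = {
  userId: 1,
  id: 1,
  title: 'sunt aut facere',
  body: 'quia et suscipit'
};
"""


def js_object_to_dict(js_source, var_name):
    """Naively convert `var <var_name> = {...};` into a Python dict."""
    pattern = r"var\s+" + re.escape(var_name) + r"\s*=\s*(\{.*?\})\s*;"
    match = re.search(pattern, js_source, re.S)
    if match is None:
        raise ValueError(f"no object literal found for {var_name!r}")
    literal = match.group(1)
    # 1. Re-quote single-quoted strings (no backslash escapes supported).
    literal = re.sub(r"'([^'\\]*)'", lambda m: json.dumps(m.group(1)), literal)
    # 2. Quote bare keys that follow `{` or `,`, e.g. userId: -> "userId":
    literal = re.sub(r'([{,]\s*)([A-Za-z_$][\w$]*)\s*:', r'\1"\2":', literal)
    return json.loads(literal)


data = js_object_to_dict(script_text, "someHardcodedData")
print(data["title"])  # sunt aut facere
```

If `json.loads` raises, the literal uses a feature this sketch does not handle, which is a safer outcome than silently returning wrong data.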

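Here are some additional case studies/proofs of concept where scraping was bypassed using an API:

How can I scrape yelp reviews and star ratings into CSV using Python beautifulsoup
Beautiful Soup returns None on existing element
Extract data from BeautifulSoup Python
Scraping Bandcamp fan collections via POST (uses a hybrid approach where an initial request is made to the website to extract a token from the markup using BeautifulSoup, which is then used in a second request to a JSON endpoint)

If all else fails, try one of the many dynamic scraping libraries listed in this thread.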

2021-03-30 21:30:58

As mentioned, Selenium is a good choice for rendering the results of the JavaScript:

from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True
browser = Firefox(executable_path="/usr/local/bin/geckodriver", options=options)

url = "https://www.example.com"
browser.get(url)

gazpacho is a really easy library for parsing the rendered html:

from gazpacho import Soup

soup = Soup(browser.page_source)
soup.find("a").attrs['href']
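If an extra dependency isn't wanted, the rendered markup can also be walked with the standard library. This is a rough sketch, not a replacement for a full parser: `LinkCollector` and `collect_links` are names invented here, and the `rendered` string is a stand-in for what `browser.page_source` would return.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Records the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def collect_links(page_source):
    collector = LinkCollector()
    collector.feed(page_source)
    collector.close()
    return collector.hrefs


# Hypothetical rendered markup standing in for browser.page_source.
rendered = (
    '<body><a href="/first">one</a>'
    '<p><a href="https://www.example.com/second">two</a></p>'
    "<a>no href</a></body>"
)
links = collect_links(rendered)
print(links)  # ['/first', 'https://www.example.com/second']
```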

2020-10-09 19:48:31

Pyppeteer

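You might consider Pyppeteer, a Python port of the Chrome/Chromium driver front-end Puppeteer.

Here's a simple example showing how you can use Pyppeteer to access data that was injected into the page dynamically: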

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch({"headless": True})
    [page] = await browser.pages()

    # normally, you go to a live site...
    #await page.goto("http://www.example.com")
    # but for this example, just set the HTML directly:
    await page.setContent("""
    <body>
    <script>
    // inject content dynamically with JS, not part of the static HTML!
    document.body.innerHTML = `<p>hello world</p>`; 
    </script>
    </body>
    """)
    print(await page.content()) # shows that the `<p>` was inserted

    # evaluate a JS expression in browser context and scrape the data
    expr = "document.querySelector('p').textContent"
    print(await page.evaluate(expr, force_expr=True)) # => hello world

    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

See Pyppeteer's reference docs.

2021-03-30 21:32:38

This also seems to be a good solution, taken from a great blog post:

import sys  
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *  
from lxml import html 

#Take this class for granted.Just use result of rendering.
class Render(QWebPage):  
  def __init__(self, url):  
    self.app = QApplication(sys.argv)  
    QWebPage.__init__(self)  
    self.loadFinished.connect(self._loadFinished)  
    self.mainFrame().load(QUrl(url))  
    self.app.exec_()  

  def _loadFinished(self, result):  
    self.frame = self.mainFrame()  
    self.app.quit()  

url = 'http://pycoders.com/archive/'  
r = Render(url)  
result = r.frame.toHtml()
# This step is important.Converting QString to Ascii for lxml to process

# The following returns an lxml element tree
archive_links = html.fromstring(str(result.toAscii()))
print archive_links

# The following returns an array containing the URLs
raw_links = archive_links.xpath('//div[@class="campaign"]/a/@href')
print raw_links

2016-04-10 20:12:07

You'll want to use urllib, requests, BeautifulSoup and the Selenium web driver in your script for different parts of the page (to name a few). Sometimes you'll get what you need with just one of these modules. Sometimes you'll need two, three, or all of them. Sometimes you'll need to switch off the JS in your browser. Sometimes you'll need header info in your script. No two websites can be scraped the same way, and no website can be scraped the same way forever without having to modify your crawler, usually after a few months. But they can all be scraped! Where there's a will there's a way, for sure. If you need scraped data continuously into the future, just scrape everything you need and store it in .dat files with pickle. Just keep searching for how to try what with these modules, and copy and paste your errors into Google.
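Storing scraped results with pickle, as suggested above, could look roughly like this. The helper names, the `scraped.dat` filename and the sample record are all hypothetical; a temporary directory is used here only so the sketch cleans up after itself.

```python
import pickle
import tempfile
from pathlib import Path


def save_scraped(records, path):
    """Serialize scraped records to a binary file with pickle."""
    with open(path, "wb") as f:
        pickle.dump(records, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_scraped(path):
    """Load records back. Only unpickle files you created yourself."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical scraped records, for illustration only.
records = [{"url": "https://www.example.com", "title": "some scraped title"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    dat_path = Path(tmp_dir) / "scraped.dat"
    save_scraped(records, dat_path)
    restored = load_scraped(dat_path)

print(restored == records)  # True
```

Pickle can execute arbitrary code when loading, so it suits data you wrote yourself; for files shared with others, a format such as JSON is the safer choice.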

2017-03-28 16:59:07
