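I'm trying to develop a simple web scraper. I want to extract the text without the HTML markup. It works on plain HTML, but not on pages where JavaScript code adds some of the text.

For example, if some JavaScript code adds text, I can't see it, because when I call: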

response = urllib2.urlopen(request)

I get the original text without the added text (because JavaScript is executed on the client).

So I'm looking for ideas on how to solve this problem.
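
To make the limitation concrete, here is a minimal sketch using only the standard library. The page content and the names TextExtractor and extract_text are made up for illustration. It strips markup from static HTML the way a simple scraper might, and shows that text a script would add never appears, because nothing executes the script:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical page: one static paragraph, plus text that only exists
# after a browser runs the script.
STATIC_HTML = """
<html><body>
<p>static text</p>
<script>document.body.insertAdjacentText("beforeend", "added by JavaScript");</script>
</body></html>
"""

text = extract_text(STATIC_HTML)
print(text)  # static text
```

Getting the added text requires something that actually runs the JavaScript, such as a headless browser.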


Current answer

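I recently used the requests_html library to solve this problem.

Their expanded documentation at readthedocs.io is pretty good (skip the annotated version at pypi.org). If your use case is basic, you are likely to have some success.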

from requests_html import HTMLSession
session = HTMLSession()
response = session.request(method="get", url="https://www.google.com/")  # the URL needs a scheme
response.html.render()

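If you have trouble getting the data you need rendered with response.html.render(), you can pass some JavaScript to the render function to return the particular JS object you need. This is copied from their docs, but it might be just what you need:

If script is specified, it will execute the provided JavaScript at runtime. Example: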

script = """
    () => {
        return {
            width: document.documentElement.clientWidth,
            height: document.documentElement.clientHeight,
            deviceScaleFactor: window.devicePixelRatio,
        }
    } 
"""

Returns the return value of the executed script, if any is provided:

>>> response.html.render(script=script)
{'width': 800, 'height': 600, 'deviceScaleFactor': 1}

In my case, the data I wanted were the arrays that populated a JavaScript plot, but the data wasn't rendered as text anywhere in the HTML. When data is populated dynamically, it is sometimes not at all clear what the objects holding it are called. If you can't track down the JS objects directly from view source or inspect, you can type "window" followed by ENTER in the browser's debugger console (Chrome) to pull up a full list of objects rendered by the browser. If you make a few educated guesses about where the data is stored, you might have some luck finding it there. My graph data was under window.view.data in the console, so in the "script" variable passed to the .render() method quoted above, I used:

return {
    data: window.view.data
}
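
Put together, the full "script" string has the same arrow-function shape as the example from the docs. A small sketch follows; make_extract_script is a hypothetical helper, not part of requests_html, and window.view.data is specific to the page I was scraping:

```python
def make_extract_script(js_expression):
    """Wrap a JavaScript expression in the arrow-function form shown above."""
    return "() => { return { data: %s } }" % js_expression


script = make_extract_script("window.view.data")
print(script)

# Running it needs requests_html and a Chromium download, so it is left
# commented out here:
# data = response.html.render(script=script)["data"]
```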

Other answers

You'll want to use urllib, requests, BeautifulSoup and the Selenium web driver in your script for different parts of the page (to name a few). Sometimes you'll get what you need with just one of these modules. Sometimes you'll need two, three, or all of them. Sometimes you'll need to switch off JS in your browser. Sometimes you'll need header info in your script. No two websites can be scraped the same way, and no website can be scraped the same way forever; you will usually have to modify your crawler after a few months. But they can all be scraped! Where there's a will there's a way. If you need scraped data continuously into the future, just scrape everything you need and store it in .dat files with pickle. Keep searching for how to do things with these modules, and copy and paste your errors into Google.
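
As a rough sketch of the pickle suggestion, the snippet below writes some scraped records to a .dat file and reads them back. The records and the file name scraped.dat are made up for illustration; a real crawler would build the records from parsed pages:

```python
import os
import pickle
import tempfile

# Hypothetical scraped records.
scraped = [{"url": "http://www.example.com", "text": "hello world"}]

# A temporary directory keeps the example self-contained.
path = os.path.join(tempfile.mkdtemp(), "scraped.dat")

# Store the data...
with open(path, "wb") as fh:
    pickle.dump(scraped, fh)

# ...and load it again in a later run.
with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(restored)
```

Only unpickle files you wrote yourself, since loading a pickle can execute arbitrary code.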

Pyppeteer

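You might consider Pyppeteer, a Python port of the Chrome/Chromium driver front-end Puppeteer.

Here's a simple example showing how you can use Pyppeteer to access data that was injected into the page dynamically: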

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch({"headless": True})
    [page] = await browser.pages()

    # normally, you go to a live site...
    #await page.goto("http://www.example.com")
    # but for this example, just set the HTML directly:
    await page.setContent("""
    <body>
    <script>
    // inject content dynamically with JS, not part of the static HTML!
    document.body.innerHTML = `<p>hello world</p>`; 
    </script>
    </body>
    """)
    print(await page.content()) # shows that the `<p>` was inserted

    # evaluate a JS expression in browser context and scrape the data
    expr = "document.querySelector('p').textContent"
    print(await page.evaluate(expr, force_expr=True)) # => hello world

    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

See Pyppeteer's reference docs.

You can also execute JavaScript using webdriver.

from selenium import webdriver

driver = webdriver.Firefox()
driver.get(url)
driver.execute_script('return document.title')  # without "return" the call yields None

Or store the value in a variable:

result = driver.execute_script('var text = document.title ; return text')

Maybe Selenium can do it.

from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get(url)
time.sleep(5)
htmlSource = driver.page_source
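
A fixed time.sleep(5) is either longer than necessary or too short on a slow page. One alternative is to poll for a condition until a timeout expires. The sketch below is a generic standard-library version; wait_until is a hypothetical helper, and the simulated "page" is just a timer standing in for a browser:

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.1):
    """Call predicate() repeatedly; return its first truthy result.

    Raises TimeoutError if the deadline passes without a truthy result.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


# Stand-in for a page that finishes rendering after about 0.3 seconds.
ready_at = time.monotonic() + 0.3
started = time.monotonic()
wait_until(lambda: time.monotonic() >= ready_at)
elapsed = time.monotonic() - started
print("waited %.2f s instead of a fixed 5 s" % elapsed)
```

With the driver above, the predicate might be something like lambda: "some expected text" in driver.page_source, where the marker text is an assumption about the page. Selenium also ships its own WebDriverWait class in selenium.webdriver.support.ui for the same purpose.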