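I'm trying to develop a simple web scraper. I want to extract the text without the HTML code. It works on plain HTML, but not on pages where JavaScript code adds text.

For example, if some JavaScript code adds some text, I can't see it, because when I call:

response = urllib2.urlopen(request)

I get the original text without the added text (because JavaScript is executed in the client).

So, I'm looking for some ideas to solve this problem.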


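Current answer

EDIT Sept 2021: phantomjs isn't maintained any more, either.

EDIT 30/Dec/2017: This answer appears in the top results of Google searches, so I decided to update it. The old answer is still at the end.

dryscrape isn't maintained anymore, and the library the dryscrape developers recommend is Python 2 only. I have found that using Selenium's Python library with Phantom JS as a web driver is fast enough and gets the work done easily.

Once you have installed Phantom JS, make sure the phantomjs binary is available in the current path: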

phantomjs --version
# result:
2.1.1

#Example

To give an example, I created a sample page with the following HTML code (link):

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Javascript scraping test</title>
</head>
<body>
  <p id='intro-text'>No javascript support</p>
  <script>
     document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
  </script> 
</body>
</html>

Without JavaScript the page says "No javascript support"; with JavaScript it says "Yay! Supports javascript".

#Scraping without JS support:

import requests
from bs4 import BeautifulSoup
response = requests.get(my_url)
soup = BeautifulSoup(response.text, "html.parser")  # name the parser explicitly to avoid a bs4 warning
soup.find(id="intro-text")
# Result:
<p id="intro-text">No javascript support</p>
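
The same result can be reproduced offline. The following is a minimal sketch, not part of the original answer, that swaps requests and BeautifulSoup for Python's standard-library html.parser. The names SAMPLE_HTML, TextById and static_text are made up for illustration. It parses the sample page as a static document, which shows that the script body is never executed and the paragraph keeps its original text.

```python
from html.parser import HTMLParser

# The sample page from this answer, embedded so no network access is needed.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Javascript scraping test</title>
</head>
<body>
  <p id='intro-text'>No javascript support</p>
  <script>
     document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
  </script>
</body>
</html>"""


class TextById(HTMLParser):
    """Collects the text inside the first element that has a given id.

    Simplification: void elements such as <br> inside the target are not
    handled, which is fine for this sample page.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0       # > 0 while the parser is inside the target element
        self.done = False    # set once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def static_text(html, element_id):
    """Return the text of the element as a non-JavaScript client sees it."""
    parser = TextById(element_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


print(static_text(SAMPLE_HTML, "intro-text"))  # prints: No javascript support
```

Any client that only downloads and parses the markup behaves this way, which is why a real browser engine is needed for the second half of the example.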

#Scraping with JS support:

from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get(my_url)
p_element = driver.find_element_by_id(id_='intro-text')
print(p_element.text)
# result:
'Yay! Supports javascript'

You can also use the Python library dryscrape to scrape JavaScript-driven websites.

#Scraping with JS support:

import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
session.visit(my_url)
response = session.body()
soup = BeautifulSoup(response, "html.parser")  # name the parser explicitly to avoid a bs4 warning
soup.find(id="intro-text")
# Result:
<p id="intro-text">Yay! Supports javascript</p>
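
Once the rendered HTML is available as a string, the question's actual goal (text without the HTML code) is a separate, simple step. Below is a minimal sketch using only the standard-library html.parser rather than BeautifulSoup. The names VisibleText, visible_text and RENDERED_HTML are hypothetical, and RENDERED_HTML is a hand-written stand-in for what a JavaScript-capable session would return for the sample page.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text nodes, ignoring anything inside <head>, <script> or <style>."""

    SKIPPED = {"head", "script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def visible_text(html):
    """Return the body text of an HTML string, one text node per line."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.text()


# Assumed stand-in for the rendered page source (after the script has run).
RENDERED_HTML = """<html><head><title>Javascript scraping test</title></head>
<body>
  <p id="intro-text">Yay! Supports javascript</p>
  <script>
     document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
  </script>
</body></html>"""

print(visible_text(RENDERED_HTML))  # prints: Yay! Supports javascript
```

With BeautifulSoup already imported, soup.get_text() is the more usual route; the sketch only shows that nothing beyond the standard library is strictly required for this step.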

Other answers

As mentioned, Selenium is a good choice for rendering the results of the JavaScript:

from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True
browser = Firefox(executable_path="/usr/local/bin/geckodriver", options=options)

url = "https://www.example.com"
browser.get(url)

gazpacho is a really easy library for parsing the rendered HTML:

from gazpacho import Soup

soup = Soup(browser.page_source)
soup.find("a").attrs['href']

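I personally prefer using scrapy and selenium and dockerizing both in separate containers. This way you can install both with minimal hassle and crawl modern websites, almost all of which contain JavaScript in one form or another. Here's an example:

Use scrapy startproject to create your scraper and write your spider. The skeleton can be as simple as this: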

import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['https://somewhere.com']

    def start_requests(self):
        yield scrapy.Request(url=self.start_urls[0])


    def parse(self, response):

        # do stuff with results, scrape items etc.
        # for now we're just checking that everything worked

        print(response.body)

The real magic happens in middlewares.py. Overwrite two methods in the downloader middleware, __init__ and process_request, in the following way:

# import some additional modules that we need
import os
from copy import deepcopy
from time import sleep

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver

class SampleProjectDownloaderMiddleware(object):

    def __init__(self):
        SELENIUM_LOCATION = os.environ.get('SELENIUM_LOCATION', 'NOT_HERE')
        SELENIUM_URL = f'http://{SELENIUM_LOCATION}:4444/wd/hub'
        chrome_options = webdriver.ChromeOptions()

        # chrome_options.add_experimental_option("mobileEmulation", mobile_emulation)
        self.driver = webdriver.Remote(command_executor=SELENIUM_URL,
                                       desired_capabilities=chrome_options.to_capabilities())

    def process_request(self, request, spider):

        self.driver.get(request.url)

        # sleep a bit so the page has time to load
        # or monitor items on page to continue as soon as page ready
        sleep(4)

        # if you need to manipulate the page content like clicking and scrolling, you do it here
        # self.driver.find_element_by_css_selector('.my-class').click()

        # you only need the now properly and completely rendered html from your page to get results
        body = deepcopy(self.driver.page_source)

        # copy the current url in case of redirects
        url = deepcopy(self.driver.current_url)

        return HtmlResponse(url, body=body, encoding='utf-8', request=request)
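
The comment in process_request mentions monitoring the page instead of sleeping for a fixed four seconds. One way that could look is a small polling helper; this is a sketch under assumptions, not part of the original middleware, and the name wait_until and its timeout and interval defaults are made up for illustration.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.25):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-alone demonstration with a fake "page" that becomes ready on the third check.
checks = []

def fake_page_ready():
    checks.append(1)
    return len(checks) >= 3

wait_until(fake_page_ready, timeout=2.0, interval=0.01)
print(len(checks))  # prints: 3
```

Inside the middleware, sleep(4) could then be replaced by something like wait_until(lambda: self.driver.execute_script("return document.readyState") == "complete"), although a page can report "complete" before late scripts have finished, so waiting for a specific element is often the safer condition. Selenium's own WebDriverWait covers the same ground.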

Don't forget to enable this middleware by uncommenting the following lines in the settings.py file:

DOWNLOADER_MIDDLEWARES = {
    'sample_project.middlewares.SampleProjectDownloaderMiddleware': 543,
}

Next comes dockerization. Create your Dockerfile from a lightweight image (I'm using Python Alpine here), copy your project directory to it, and install the requirements:

# Use an official Python runtime as a parent image
FROM python:3.6-alpine

# install some packages necessary to scrapy and then curl because it's  handy for debugging
RUN apk --update add linux-headers libffi-dev openssl-dev build-base libxslt-dev libxml2-dev curl python-dev

WORKDIR /my_scraper

ADD requirements.txt /my_scraper/

RUN pip install -r requirements.txt

ADD . /my_scraper

Finally, bring it all together in docker-compose.yaml:

version: '2'
services:
  selenium:
    image: selenium/standalone-chrome
    ports:
      - "4444:4444"
    shm_size: 1G

  my_scraper:
    build: .
    depends_on:
      - "selenium"
    environment:
      - SELENIUM_LOCATION=samplecrawler_selenium_1
    volumes:
      - .:/my_scraper
    # use this command to keep the container running
    command: tail -f /dev/null

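Run docker-compose up -d. If you're doing this for the first time, it will take a while to fetch the latest selenium/standalone-chrome image and to build your scraper image as well.

Once it's done, you can check that your containers are running with docker ps, and also check that the name of the selenium container matches the environment variable that we passed to our scraper container (here, it was SELENIUM_LOCATION=samplecrawler_selenium_1).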
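
If the container name and the environment variable do not match, the middleware only fails later with a connection error against the placeholder host NOT_HERE. A minimal sketch of an earlier, clearer check follows; the helper name selenium_hub_url is hypothetical, and the port and /wd/hub path are simply the ones used in the middleware above.

```python
import os


def selenium_hub_url(environ=None, port=4444):
    """Build the remote WebDriver URL from SELENIUM_LOCATION.

    Raises RuntimeError when the variable is missing or empty, instead of
    silently falling back to a placeholder host name.
    """
    environ = os.environ if environ is None else environ
    host = environ.get("SELENIUM_LOCATION", "").strip()
    if not host:
        raise RuntimeError(
            "SELENIUM_LOCATION is not set; it should hold the selenium "
            "container name shown by `docker ps`"
        )
    return f"http://{host}:{port}/wd/hub"


# Example with the value from the docker-compose file in this answer.
print(selenium_hub_url({"SELENIUM_LOCATION": "samplecrawler_selenium_1"}))
# prints: http://samplecrawler_selenium_1:4444/wd/hub
```

Passing the mapping as an argument keeps the function easy to try outside the container; inside the scraper it would be called with no arguments so that it reads os.environ.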

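Enter your scraper container with docker exec -ti YOUR_CONTAINER_NAME sh (for me the command was docker exec -ti samplecrawler_my_scraper_1 sh), cd into the right directory, and run your scraper with scrapy crawl my_spider.

The entire thing is on my GitHub page, and you can get it from here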

Using PyQt5

from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
from PyQt5.QtWebEngineWidgets import QWebEnginePage
import sys


class Client(QWebEnginePage):
    def __init__(self, url):
        # A QApplication must exist before any QWebEnginePage is created
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ""
        self.loadFinished.connect(self.on_load_finished)
        self.load(QUrl(url))
        # Blocks until Callable() calls self.app.quit()
        self.app.exec_()

    def on_load_finished(self):
        # toHtml() is asynchronous: it returns nothing and hands the HTML to the callback
        self.toHtml(self.Callable)
        print("Load Finished")

    def Callable(self, data):
        self.html = data
        self.app.quit()

# url = ""
# client_response = Client(url)
# print(client_response.html)

Mixing BeautifulSoup and Selenium works very well for me.

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as bs

driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    # Wait up to 10 seconds for the element to appear. Other wait conditions exist,
    # such as visibility_of_element_located or text_to_be_present_in_element.
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement")))

    html = driver.page_source
    soup = bs(html, "lxml")
    dynamic_text = soup.find_all("p", {"class": "class_name"})  # or other attributes, optional
except TimeoutException:
    print("Couldn't locate element")
finally:
    driver.quit()

P.S. You can find more wait conditions here