我想刮取无限滚动实现的页面的所有数据。下面的python代码可以工作。

for i in range(100):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)

这意味着每当我向下滚动到底部时,我都需要等待5秒,这通常足以让页面完成加载新生成的内容。但是,这可能并不省时。页面可能在5秒内完成新内容的加载。如何在每次向下滚动时检测页面是否完成了新内容的加载?如果我能检测到这一点,一旦我知道页面完成加载,我就可以再次向下滚动以查看更多内容。这样更节省时间。


当前回答

另外,您可以检查DOM是否没有更多的修改,而不是向下滚动100次(在页面底部是AJAX惰性加载的情况下)

def scrollDown(driver, value):
    driver.execute_script("window.scrollBy(0,"+str(value)+")")

# Scroll down the page
def scrollDownAllTheWay(driver):
    old_page = driver.page_source
    while True:
        logging.debug("Scrolling loop")
        for i in range(2):
            scrollDown(driver, 500)
            time.sleep(2)
        new_page = driver.page_source
        if new_page != old_page:
            old_page = new_page
        else:
            break
    return True

其他回答

你试过driver.implicitly_wait吗?它就像驱动程序的一个设置,所以你只在会话中调用它一次,它基本上告诉驱动程序等待给定的时间,直到每个命令都可以执行。

driver = webdriver.Chrome()
driver.implicitly_wait(10)

因此,如果您设置等待时间为10秒,它将尽快执行命令,等待10秒后才放弃。我在类似的滚动场景中使用过这个,所以我不明白为什么它在您的情况下不起作用。希望这对你有帮助。

为了能够修复这个答案,我必须添加新的文本。确保在implicitly_wait中使用小写“w”。

webdriver将在默认情况下通过.get()方法等待页面加载。

正如你可能正在寻找一些特定的元素@user227215所说的,你应该使用WebDriverWait来等待位于你页面中的元素:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

browser = webdriver.Firefox()
browser.get("url")
delay = 3 # seconds
try:
    myElem = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'IdOfMyElement')))
    print "Page is ready!"
except TimeoutException:
    print "Loading took too much time!"

我用它来检查提醒。您可以使用任何其他类型方法来查找定位器。

编辑1:

I should mention that the webdriver will wait for a page to load by default. It does not wait for loading inside frames or for ajax requests. It means when you use .get('url'), your browser will wait until the page is completely loaded and then go to the next command in the code. But when you are posting an ajax request, webdriver does not wait and it's your responsibility to wait an appropriate amount of time for the page or a part of page to load; so there is a module named expected_conditions.

我挣扎了一点,让这个工作,因为它没有为我工作的预期。任何还在努力让它工作的人,可以检查一下。

我想等待一个元素出现在网页上,然后再继续我的操作。

我们可以使用WebDriverWait(driver, 10,1).until(),但catch是until()期望一个函数,它可以执行一段时间的超时提供(在我们的情况下是10)每1秒。所以保持它如下对我有用。

element_found = wait_for_element.until(lambda x: x.find_element_by_class_name("MY_ELEMENT_CLASS_NAME").is_displayed())

下面是until()在幕后所做的事情

def until(self, method, message=''):
        """Calls the method provided with the driver as an argument until the \
        return value is not False."""
        screen = None
        stacktrace = None

        end_time = time.time() + self._timeout
        while True:
            try:
                value = method(self._driver)
                if value:
                    return value
            except self._ignored_exceptions as exc:
                screen = getattr(exc, 'screen', None)
                stacktrace = getattr(exc, 'stacktrace', None)
            time.sleep(self._poll)
            if time.time() > end_time:
                break
        raise TimeoutException(message, screen, stacktrace)

你可以通过这个函数简单地做到这一点:

def page_is_loading(driver):
    while True:
        x = driver.execute_script("return document.readyState")
        if x == "complete":
            return True
        else:
            yield False

当你想在页面加载完成后做一些事情时,你可以使用:

Driver = webdriver.Firefox(options=Options, executable_path='geckodriver.exe')
Driver.get("https://www.google.com/")

while not page_is_loading(Driver):
    continue

Driver.execute_script("alert('page is loaded')")

ajax页面连续加载数据的解决方案。所述的预览方法无效。我们可以做的是抓取页面dom并对其进行哈希,并在一段时间内比较新旧哈希值。

import time
from selenium import webdriver

def page_has_loaded(driver, sleep_time = 2):
    '''
    Waits for page to completely load by comparing current page hash values.
    '''

    def get_page_hash(driver):
        '''
        Returns html dom hash
        '''
        # can find element by either 'html' tag or by the html 'root' id
        dom = driver.find_element_by_tag_name('html').get_attribute('innerHTML')
        # dom = driver.find_element_by_id('root').get_attribute('innerHTML')
        dom_hash = hash(dom.encode('utf-8'))
        return dom_hash

    page_hash = 'empty'
    page_hash_new = ''
    
    # comparing old and new page DOM hash together to verify the page is fully loaded
    while page_hash != page_hash_new: 
        page_hash = get_page_hash(driver)
        time.sleep(sleep_time)
        page_hash_new = get_page_hash(driver)
        print('<page_has_loaded> - page not loaded')

    print('<page_has_loaded> - page loaded: {}'.format(driver.current_url))