我想刮取无限滚动实现的页面的所有数据。下面的python代码可以工作。

for i in range(100):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)

这意味着每当我向下滚动到底部时,我都需要等待5秒,这通常足以让页面完成加载新生成的内容。但是,这可能并不省时。页面可能在5秒内完成新内容的加载。如何在每次向下滚动时检测页面是否完成了新内容的加载?如果我能检测到这一点,一旦我知道页面完成加载,我就可以再次向下滚动以查看更多内容。这样更节省时间。


当前回答

如果您试图滚动并找到页面上的所有项目。您可以考虑使用以下方法。这是其他人在这里提到的一些方法的组合。它帮我完成了任务:

while True:
    try:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        driver.implicitly_wait(30)
        time.sleep(4)
        elem1 = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "element-name")))
        len_elem_1 = len(elem1)
        print(f"A list Length {len_elem_1}")
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        driver.implicitly_wait(30)
        time.sleep(4)
        elem2 = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "element-name")))
        len_elem_2 = len(elem2)
        print(f"B list Length {len_elem_2}")
        if len_elem_1 == len_elem_2:
            print(f"final length = {len_elem_1}")
            break
    except TimeoutException:
            print("Loading took too much time!")

其他回答

webdriver将在默认情况下通过.get()方法等待页面加载。

正如你可能正在寻找一些特定的元素@user227215所说的,你应该使用WebDriverWait来等待位于你页面中的元素:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

browser = webdriver.Firefox()
browser.get("url")
delay = 3 # seconds
try:
    myElem = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'IdOfMyElement')))
    print "Page is ready!"
except TimeoutException:
    print "Loading took too much time!"

我用它来检查提醒。您可以使用任何其他类型方法来查找定位器。

编辑1:

I should mention that the webdriver will wait for a page to load by default. It does not wait for loading inside frames or for ajax requests. It means when you use .get('url'), your browser will wait until the page is completely loaded and then go to the next command in the code. But when you are posting an ajax request, webdriver does not wait and it's your responsibility to wait an appropriate amount of time for the page or a part of page to load; so there is a module named expected_conditions.

找到以下3种方法:

请求处理

检查页面readyState(不可靠):

def page_has_loaded(self):
    self.log.info("Checking if {} page is loaded.".format(self.driver.current_url))
    page_state = self.driver.execute_script('return document.readyState;')
    return page_state == 'complete'

wait_for helper函数很好,但不幸的是,click_through_to_new_page处于竞速条件,在浏览器开始处理单击之前,我们设法在旧页面中执行脚本,而page_has_loaded直接返回true。

id

比较新页面id和旧页面id:

def page_has_loaded_id(self):
    self.log.info("Checking if {} page is loaded.".format(self.driver.current_url))
    try:
        new_page = browser.find_element_by_tag_name('html')
        return new_page.id != old_page.id
    except NoSuchElementException:
        return False

比较id可能不如等待过时的引用异常有效。

staleness_of

使用staleness_of方法:

@contextlib.contextmanager
def wait_for_page_load(self, timeout=10):
    self.log.debug("Waiting for page to load at {}.".format(self.driver.current_url))
    old_page = self.find_element_by_tag_name('html')
    yield
    WebDriverWait(self, timeout).until(staleness_of(old_page))

要了解更多细节,请查看Harry的博客。

另外,您可以检查DOM是否没有更多的修改,而不是向下滚动100次(在页面底部是AJAX惰性加载的情况下)

def scrollDown(driver, value):
    driver.execute_script("window.scrollBy(0,"+str(value)+")")

# Scroll down the page
def scrollDownAllTheWay(driver):
    old_page = driver.page_source
    while True:
        logging.debug("Scrolling loop")
        for i in range(2):
            scrollDown(driver, 500)
            time.sleep(2)
        new_page = driver.page_source
        if new_page != old_page:
            old_page = new_page
        else:
            break
    return True

正如David Cullen的回答中提到的,我总是看到这样的建议:

element_present = EC.presence_of_element_located((By.ID, 'element_id'))
WebDriverWait(driver, timeout).until(element_present)

对于我来说,很难找到所有可以与By一起使用的定位器,所以我认为在这里提供列表会很有用。 根据Ryan Mitchell的Web Scraping with Python:

ID Used in the example; finds elements by their HTML id attribute CLASS_NAME Used to find elements by their HTML class attribute. Why is this function CLASS_NAME not simply CLASS? Using the form object.CLASS would create problems for Selenium's Java library, where .class is a reserved method. In order to keep the Selenium syntax consistent between different languages, CLASS_NAME was used instead. CSS_SELECTOR Finds elements by their class, id, or tag name, using the #idName, .className, tagName convention. LINK_TEXT Finds HTML tags by the text they contain. For example, a link that says "Next" can be selected using (By.LINK_TEXT, "Next"). PARTIAL_LINK_TEXT Similar to LINK_TEXT, but matches on a partial string. NAME Finds HTML tags by their name attribute. This is handy for HTML forms. TAG_NAME Finds HTML tags by their tag name. XPATH Uses an XPath expression ... to select matching elements.

从硒/ webdriver /支持/ wait.py

driver = ...
from selenium.webdriver.support.wait import WebDriverWait
element = WebDriverWait(driver, 10).until(
    lambda x: x.find_element_by_id("someId"))