I want to scrape all the data from a page that uses infinite scroll. The following Python code works.

for i in range(100):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)

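This means that every time I scroll to the bottom I have to wait 5 seconds, which is usually enough for the page to finish loading the newly generated content. But this may not be time-efficient: the page may finish loading the new content well within 5 seconds. How can I detect, each time I scroll down, whether the page has finished loading the new content? If I could detect that, I could scroll down again as soon as I know the page has finished loading, which would save time.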


Current answer

As mentioned in David Cullen's answer, I have always seen this pattern recommended:

element_present = EC.presence_of_element_located((By.ID, 'element_id'))
WebDriverWait(driver, timeout).until(element_present)

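I found it hard to find a list of all the locators that can be used with By, so I thought it would be useful to provide one here. According to Web Scraping with Python by Ryan Mitchell: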

ID: Used in the example; finds elements by their HTML id attribute.

CLASS_NAME: Used to find elements by their HTML class attribute. Why is this function CLASS_NAME not simply CLASS? Using the form object.CLASS would create problems for Selenium's Java library, where .class is a reserved method. In order to keep the Selenium syntax consistent between different languages, CLASS_NAME was used instead.

CSS_SELECTOR: Finds elements by their class, id, or tag name, using the #idName, .className, tagName convention.

LINK_TEXT: Finds HTML tags by the text they contain. For example, a link that says "Next" can be selected using (By.LINK_TEXT, "Next").

PARTIAL_LINK_TEXT: Similar to LINK_TEXT, but matches on a partial string.

NAME: Finds HTML tags by their name attribute. This is handy for HTML forms.

TAG_NAME: Finds HTML tags by their tag name.

XPATH: Uses an XPath expression ... to select matching elements.

Other answers

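If you are trying to scroll and find all the items on the page, you could consider the following approach. It combines several of the methods others have mentioned here, and it did the job for me: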

while True:
    try:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        driver.implicitly_wait(30)
        time.sleep(4)
        elem1 = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "element-name")))
        len_elem_1 = len(elem1)
        print(f"A list Length {len_elem_1}")
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        driver.implicitly_wait(30)
        time.sleep(4)
        elem2 = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "element-name")))
        len_elem_2 = len(elem2)
        print(f"B list Length {len_elem_2}")
        if len_elem_1 == len_elem_2:
            print(f"final length = {len_elem_1}")
            break
    except TimeoutException:
        print("Loading took too much time!")
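
A variant of the loop above that avoids the fixed sleeps: poll document.body.scrollHeight after each scroll and move on as soon as it grows, treating the page as exhausted if the height has not grown by the end of a timeout. This is only a minimal sketch, not a recipe tested against any particular site. The function name scroll_until_stable and the parameters load_timeout, poll_interval, and max_scrolls are made up for illustration, and it assumes that newly loaded content always increases the page height.

```python
import time


def scroll_until_stable(driver, load_timeout=5.0, poll_interval=0.2, max_scrolls=100):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls that were followed by new content.
    All parameter names here are illustrative assumptions.
    """
    get_height = "return document.body.scrollHeight"
    loaded = 0
    last_height = driver.execute_script(get_height)
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        deadline = time.monotonic() + load_timeout
        new_height = last_height
        # Poll the height instead of sleeping for the whole timeout.
        while time.monotonic() < deadline:
            new_height = driver.execute_script(get_height)
            if new_height > last_height:
                break
            time.sleep(poll_interval)
        if new_height <= last_height:
            # The height did not grow within load_timeout: assume the end was reached.
            break
        last_height = new_height
        loaded += 1
    return loaded
```

The only cost of a slow page is the final scroll, which waits for the full load_timeout before the loop gives up; every earlier scroll continues as soon as the height changes.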

Trying to pass find_element_by_id to the constructor of presence_of_element_located (as shown in the accepted answer) raised a NoSuchElementException. I had to use the syntax from fragles' comment:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('url')
timeout = 5
try:
    element_present = EC.presence_of_element_located((By.ID, 'element_id'))
    WebDriverWait(driver, timeout).until(element_present)
except TimeoutException:
    print("Timed out waiting for page to load")

This matches the example in the documentation. Here is a link to the documentation for By.
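
For the infinite-scroll case in the question, the same WebDriverWait pattern can wait on a count of elements rather than on a single element, because until accepts any callable that takes the driver and returns a truthy value. The sketch below is one possible way to write such a callable. The class name item_count_exceeds and the selector "div.item" are hypothetical, and it assumes the driver.find_elements(by, value) call, where the string "css selector" is the value behind By.CSS_SELECTOR.

```python
class item_count_exceeds:
    """Condition for WebDriverWait.until: truthy once more than `count` elements match.

    The class name and its arguments are illustrative, not part of Selenium.
    """

    def __init__(self, css_selector, count):
        self.css_selector = css_selector
        self.count = count

    def __call__(self, driver):
        # "css selector" is the string value of By.CSS_SELECTOR.
        elements = driver.find_elements("css selector", self.css_selector)
        # Return the elements (truthy) once new ones have appeared; otherwise
        # return False so that WebDriverWait keeps polling until its timeout.
        return elements if len(elements) > self.count else False


# Usage with a real driver (not executed here; "div.item" is a placeholder):
#   seen = len(driver.find_elements("css selector", "div.item"))
#   driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
#   items = WebDriverWait(driver, 10).until(item_count_exceeds("div.item", seen))
```

If no new items arrive before the timeout, until raises TimeoutException, which can serve as the signal that the end of the list has been reached.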

Great answer. Here is a quick example of waiting on an XPATH.

# wait for sizes to load - 2s timeout
try:
    WebDriverWait(driver, 2).until(expected_conditions.presence_of_element_located(
        (By.XPATH, "//div[@id='stockSizes']//a")))
except TimeoutException:
    pass

Find below 3 methods:

readyState

Checking the page's readyState (not reliable):

def page_has_loaded(self):
    self.log.info("Checking if {} page is loaded.".format(self.driver.current_url))
    page_state = self.driver.execute_script('return document.readyState;')
    return page_state == 'complete'

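The wait_for helper function is good, but unfortunately click_through_to_new_page is open to a race condition: we can manage to execute the script in the old page before the browser has started processing the click, in which case page_has_loaded returns true straight away.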
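
The wait_for helper itself is not shown in this answer. A helper of that kind is typically a small polling loop like the sketch below. The signature, the timeout and poll_interval parameters, and the choice of TimeoutError are assumptions made for illustration, not the original author's code.

```python
import time


def wait_for(condition, timeout=3.0, poll_interval=0.1):
    """Call `condition()` repeatedly until it returns a truthy value or time runs out.

    Hypothetical helper: the parameter names and defaults are assumptions.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            name = getattr(condition, "__name__", repr(condition))
            raise TimeoutError("Timed out waiting for {}".format(name))
        time.sleep(poll_interval)
```

With a page object like the one above, it would be called as wait_for(page.page_has_loaded). The race still applies: if the old page is still displayed when the first check runs, its readyState is already 'complete' and the helper returns at once.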

id

Comparing the new page's id with the old one:

def page_has_loaded_id(self):
    self.log.info("Checking if {} page is loaded.".format(self.driver.current_url))
    try:
        # old_page is the <html> element captured before the click (not shown in this snippet)
        new_page = self.driver.find_element_by_tag_name('html')
        return new_page.id != old_page.id
    except NoSuchElementException:
        return False

Comparing ids may not be as effective as waiting for a stale reference exception.

staleness_of

Using the staleness_of method:

@contextlib.contextmanager
def wait_for_page_load(self, timeout=10):
    self.log.debug("Waiting for page to load at {}.".format(self.driver.current_url))
    # Grab the current <html> element; it goes stale once the new page replaces it
    old_page = self.driver.find_element_by_tag_name('html')
    yield
    WebDriverWait(self.driver, timeout).until(staleness_of(old_page))

For more details, see Harry's blog.

You can do this simply with this function:

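def page_is_loading(driver):
    # Despite its name, this returns True once the document has finished loading.
    # A plain return (rather than yield) keeps this a normal function, so the
    # caller gets a bool instead of an always-truthy generator object.
    x = driver.execute_script("return document.readyState")
    return x == "complete"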

When you want to do something after the page has finished loading, you can use:

Driver = webdriver.Firefox(options=Options, executable_path='geckodriver.exe')
Driver.get("https://www.google.com/")

while not page_is_loading(Driver):
    continue

Driver.execute_script("alert('page is loaded')")