我想刮取无限滚动实现的页面的所有数据。下面的python代码可以工作。

for i in range(100):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)

这意味着每当我向下滚动到底部时,我都需要等待5秒,这通常足以让页面完成加载新生成的内容。但是,这可能并不省时。页面可能在5秒内完成新内容的加载。如何在每次向下滚动时检测页面是否完成了新内容的加载?如果我能检测到这一点,一旦我知道页面完成加载,我就可以再次向下滚动以查看更多内容。这样更节省时间。


当前回答

把WebDriverWait放在While循环中并捕获异常如何?

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

browser = webdriver.Firefox()
browser.get("url")
delay = 3 # seconds
while True:
    try:
        WebDriverWait(browser, delay).until(EC.presence_of_element_located(browser.find_element_by_id('IdOfMyElement')))
        print "Page is ready!"
        break # it will break from the loop once the specific element will be present. 
    except TimeoutException:
        print "Loading took too much time!-Try again"

其他回答

这里我使用了一个相当简单的形式:

from selenium import webdriver
browser = webdriver.Firefox()
browser.get("url")
searchTxt=''
while not searchTxt:
    try:    
      searchTxt=browser.find_element_by_name('NAME OF ELEMENT')
      searchTxt.send_keys("USERNAME")
    except:continue

正如David Cullen的回答中提到的,我总是看到这样的建议:

element_present = EC.presence_of_element_located((By.ID, 'element_id'))
WebDriverWait(driver, timeout).until(element_present)

对于我来说,很难找到所有可以与By一起使用的定位器,所以我认为在这里提供列表会很有用。 根据Ryan Mitchell的Web Scraping with Python:

ID Used in the example; finds elements by their HTML id attribute CLASS_NAME Used to find elements by their HTML class attribute. Why is this function CLASS_NAME not simply CLASS? Using the form object.CLASS would create problems for Selenium's Java library, where .class is a reserved method. In order to keep the Selenium syntax consistent between different languages, CLASS_NAME was used instead. CSS_SELECTOR Finds elements by their class, id, or tag name, using the #idName, .className, tagName convention. LINK_TEXT Finds HTML tags by the text they contain. For example, a link that says "Next" can be selected using (By.LINK_TEXT, "Next"). PARTIAL_LINK_TEXT Similar to LINK_TEXT, but matches on a partial string. NAME Finds HTML tags by their name attribute. This is handy for HTML forms. TAG_NAME Finds HTML tags by their tag name. XPATH Uses an XPath expression ... to select matching elements.

如果您试图滚动并找到页面上的所有项目。您可以考虑使用以下方法。这是其他人在这里提到的一些方法的组合。它帮我完成了任务:

while True:
    try:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        driver.implicitly_wait(30)
        time.sleep(4)
        elem1 = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "element-name")))
        len_elem_1 = len(elem1)
        print(f"A list Length {len_elem_1}")
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        driver.implicitly_wait(30)
        time.sleep(4)
        elem2 = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "element-name")))
        len_elem_2 = len(elem2)
        print(f"B list Length {len_elem_2}")
        if len_elem_1 == len_elem_2:
            print(f"final length = {len_elem_1}")
            break
    except TimeoutException:
            print("Loading took too much time!")

你可以通过这个函数简单地做到这一点:

def page_is_loading(driver):
    while True:
        x = driver.execute_script("return document.readyState")
        if x == "complete":
            return True
        else:
            yield False

当你想在页面加载完成后做一些事情时,你可以使用:

Driver = webdriver.Firefox(options=Options, executable_path='geckodriver.exe')
Driver.get("https://www.google.com/")

while not page_is_loading(Driver):
    continue

Driver.execute_script("alert('page is loaded')")

webdriver将在默认情况下通过.get()方法等待页面加载。

正如你可能正在寻找一些特定的元素@user227215所说的,你应该使用WebDriverWait来等待位于你页面中的元素:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

browser = webdriver.Firefox()
browser.get("url")
delay = 3 # seconds
try:
    myElem = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'IdOfMyElement')))
    print "Page is ready!"
except TimeoutException:
    print "Loading took too much time!"

我用它来检查提醒。您可以使用任何其他类型方法来查找定位器。

编辑1:

I should mention that the webdriver will wait for a page to load by default. It does not wait for loading inside frames or for ajax requests. It means when you use .get('url'), your browser will wait until the page is completely loaded and then go to the next command in the code. But when you are posting an ajax request, webdriver does not wait and it's your responsibility to wait an appropriate amount of time for the page or a part of page to load; so there is a module named expected_conditions.