I'm collecting statistics on a list of websites, and I'm using requests for it for simplicity. Here is my code:

import requests

data = []
websites = ['http://google.com', 'http://bbc.co.uk']
for w in websites:
    r = requests.get(w, verify=False)
    data.append((r.url, len(r.content), r.elapsed.total_seconds(), str([(l.status_code, l.url) for l in r.history]), str(r.headers.items()), str(r.cookies.items())))

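Now, I want requests.get to time out after 10 seconds so the loop doesn't get stuck.

This question has come up before, but none of the answers are clean.

I hear that not using requests at all might be a good idea, but then how should I get the nice things requests offers (the ones in the tuple)?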


Current answer

If you use the option stream=True, you can do this:

import time

import requests

r = requests.get(
    'http://url_to_large_file',
    timeout=1,  # relevant only for underlying socket
    stream=True)

with open('/tmp/out_file.txt', 'wb') as f:
    start_time = time.time()
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:  # filter out keep-alive new chunks
            f.write(chunk)
        if time.time() - start_time > 8:
            raise Exception('Request took longer than 8s')

This solution needs neither signals nor multiprocessing.
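The elapsed-time check in the loop above can be pulled out into a small generator so the download loop stays clean. This is only a sketch: iter_with_deadline is a hypothetical helper name (it is not part of requests), and the clock parameter exists just to make it easy to test. The check runs only when a chunk arrives, so each individual wait is still bounded by the timeout= passed to requests.get, and the time spent waiting for the response headers is not covered at all.

```python
import time


def iter_with_deadline(chunks, max_seconds, clock=time.monotonic):
    """Yield items from `chunks`; raise TimeoutError once `max_seconds` have elapsed.

    Hypothetical helper, not part of requests. The countdown starts when the
    first item is requested, because generator bodies run lazily.
    """
    deadline = clock() + max_seconds
    for chunk in chunks:
        # Checked once per received chunk, so a stalled read is only noticed
        # after the per-read socket timeout has let the read return or fail.
        if clock() > deadline:
            raise TimeoutError('Request took longer than %ss' % max_seconds)
        yield chunk
```

With the response r from above, usage would look like `for chunk in iter_with_deadline(r.iter_content(chunk_size=1024), 8): f.write(chunk)`.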

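Other answers

Most of the other answers are incorrect

Despite all the answers, I believe this thread still lacks a proper solution, and none of the existing answers offers a reasonable way to do something that should be simple and obvious.

Let's start by saying that as of 2022, there is still absolutely no way to do this properly with requests alone. It is a conscious design decision by the library's developers.

Solutions that rely on the timeout parameter simply do not accomplish what they intend to do. The fact that it "seems" to work at first glance is purely incidental:

The timeout parameter has absolutely nothing to do with the total execution time of the request. It merely controls the maximum amount of time that can pass before the underlying socket receives any data. With a timeout of 5 seconds, for example, the server can just as well send 1 byte of data every 4 seconds, and that is perfectly fine as far as the timeout is concerned, but it won't help you very much.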
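To put numbers on that, here is a tiny simulation of the slow-drip case. It is a sketch only: simulate_slow_drip is a made-up name, and the model assumes the read timeout is checked against the gap before each piece of data and nothing else.

```python
def simulate_slow_drip(gaps_s, read_timeout_s, total_deadline_s):
    """Model a per-read timeout against a list of gaps between received bytes.

    Returns (read_timeout_fired, elapsed_seconds, total_deadline_exceeded).
    Simplified, hypothetical model: the timeout fires only if a single gap
    is longer than read_timeout_s.
    """
    elapsed = 0.0
    for gap in gaps_s:
        if gap > read_timeout_s:
            # The socket waits read_timeout_s, then gives up.
            elapsed += read_timeout_s
            return True, elapsed, elapsed > total_deadline_s
        elapsed += gap
    return False, elapsed, elapsed > total_deadline_s


# 100 bytes arriving 4 seconds apart with timeout=5: the timeout never fires,
# yet the transfer takes 400 seconds, far beyond a 10 second budget.
print(simulate_slow_drip([4.0] * 100, 5.0, 10.0))  # (False, 400.0, True)
```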

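Answers using stream and iter_content are somewhat better, but they still do not cover everything in a request. You do not actually receive anything from iter_content until after the response headers have been sent, which falls under the same problem: even if you use 1 byte as the chunk size for iter_content, reading the full response headers could take a totally arbitrary amount of time, and you may never actually get to the point where you read any response body from iter_content.

Here are some examples that completely break both the timeout-based and the stream-based approaches. Try them all. They all hang indefinitely, no matter which method you use.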

server.py

import socket
import time

server = socket.socket()

server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, True)
server.bind(('127.0.0.1', 8080))

server.listen()

while True:
    try:
        sock, addr = server.accept()
        print('Connection from', addr)
        sock.send(b'HTTP/1.1 200 OK\r\n')

        # Send some garbage headers very slowly but steadily.
        # Never actually complete the response.

        while True:
            sock.send(b'a')
            time.sleep(1)
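    except OSError:
        # The client went away; go back to waiting for the next connection.
        # (A bare except would also swallow Ctrl+C.)
        pass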

demo1.py

import requests

requests.get('http://localhost:8080')

demo2.py

import requests

requests.get('http://localhost:8080', timeout=5)

demo3.py

import requests

requests.get('http://localhost:8080', timeout=(5, 5))

demo4.py

import requests

with requests.get('http://localhost:8080', timeout=(5, 5), stream=True) as res:
    for chunk in res.iter_content(1):
        break

The proper solution

My approach utilizes Python's sys.settrace function. It is dead simple. You do not need to use any external libraries or turn your code upside down. Unlike most other answers, this actually guarantees that the code executes in specified time. Be aware that you still need to specify the timeout parameter, as settrace only concerns Python code. Actual socket reads are external syscalls which are not covered by settrace, but are covered by the timeout parameter. Due to this fact, the exact time limit is not TOTAL_TIMEOUT, but a value which is explained in comments below.

import requests
import sys
import time

# This function serves as a "hook" that executes for each Python statement
# down the road. There may be some performance penalty, but as downloading
# a webpage is mostly I/O bound, it's not going to be significant.

def trace_function(frame, event, arg):
    if time.time() - start > TOTAL_TIMEOUT:
        raise Exception('Timed out!') # Use whatever exception you consider appropriate.

    return trace_function

# The following code will terminate at most after TOTAL_TIMEOUT + the highest
# value specified in `timeout` parameter of `requests.get`.
# In this case 10 + 6 = 16 seconds.
# For most cases though, it's gonna terminate no later than TOTAL_TIMEOUT.

TOTAL_TIMEOUT = 10

start = time.time()

sys.settrace(trace_function)

try:
    res = requests.get('http://localhost:8080', timeout=(3, 6)) # Use whatever timeout values you consider appropriate.
finally:
    sys.settrace(None) # Remove the time constraint and continue normally.

# Do something with the response

Condensed

import requests, sys, time

TOTAL_TIMEOUT = 10

def trace_function(frame, event, arg):
    if time.time() - start > TOTAL_TIMEOUT:
        raise Exception('Timed out!')

    return trace_function

start = time.time()
sys.settrace(trace_function)

try:
    res = requests.get('http://localhost:8080', timeout=(3, 6))
finally:
    sys.settrace(None)

That's it!
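If you need this in more than one place, the same sys.settrace trick can be wrapped in a function that keeps the deadline local instead of in module globals. A minimal sketch, assuming the hypothetical name run_with_time_limit; the same caveat applies as above: only Python-level code is checked, so a blocking socket read is still bounded by the timeout parameter rather than by the tracer.

```python
import sys
import time


def run_with_time_limit(func, total_timeout, *args, **kwargs):
    """Call func(*args, **kwargs); raise TimeoutError once total_timeout seconds pass.

    Hypothetical helper built on sys.settrace. Only Python frames entered
    after this point are traced; time spent inside a blocking C call or
    syscall is noticed only after that call returns.
    """
    deadline = time.monotonic() + total_timeout

    def trace_function(frame, event, arg):
        # Invoked for call and line events in the traced frames.
        if time.monotonic() > deadline:
            raise TimeoutError('exceeded %s seconds' % total_timeout)
        return trace_function

    previous = sys.gettrace()
    sys.settrace(trace_function)
    try:
        return func(*args, **kwargs)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
```

For the request above that would be `res = run_with_time_limit(requests.get, 10, 'http://localhost:8080', timeout=(3, 6))`.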

Although the question is about requests, I find this very easy to do with pycurl's CURLOPT_TIMEOUT or CURLOPT_TIMEOUT_MS.

No threading or signaling required:

import traceback
from io import BytesIO

import pycurl

url = 'http://www.example.com/example.zip'
timeout_ms = 1000
raw = BytesIO()  # pycurl passes the body to the write callback as bytes
c = pycurl.Curl()
c.setopt(pycurl.TIMEOUT_MS, timeout_ms)  # total timeout in milliseconds
c.setopt(pycurl.WRITEFUNCTION, raw.write)
c.setopt(pycurl.NOSIGNAL, 1)
c.setopt(pycurl.URL, url)
c.setopt(pycurl.HTTPGET, 1)
try:
    c.perform()
except pycurl.error:
    traceback.print_exc()  # error generated on timeout; drop this line to ignore it silently
finally:
    c.close()

This code works for socket errors 11004 and 10060...

# -*- encoding:UTF-8 -*-
__author__ = 'ACE'
import requests
from PyQt4.QtCore import *
from PyQt4.QtGui import *


class TimeOutModel(QThread):
    Existed = pyqtSignal(bool)
    TimeOut = pyqtSignal()

    def __init__(self, fun, timeout=500, parent=None):
        """
        @param fun: function or lambda
        @param timeout: ms
        """
        super(TimeOutModel, self).__init__(parent)
        self.fun = fun

        self.timeer = QTimer(self)
        self.timeer.setInterval(timeout)
        self.timeer.timeout.connect(self.time_timeout)
        self.Existed.connect(self.timeer.stop)
        self.timeer.start()

        self.setTerminationEnabled(True)

    def time_timeout(self):
        self.timeer.stop()
        self.TimeOut.emit()
        self.quit()
        self.terminate()

    def run(self):
        self.fun()


bb = lambda: requests.get("http://ipv4.download.thinkbroadband.com/1GB.zip")

a = QApplication([])

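def on_timeout():
    print('timeout')
    a.quit()  # leave the event loop once the timer has fired

z = TimeOutModel(bb, 500)
z.TimeOut.connect(on_timeout)
z.start()  # actually run the request in the worker thread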

a.exec_()

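If it comes to that, create a watchdog thread that messes up requests' internal state after 10 seconds, e.g.:

closes the underlying socket, and ideally triggers an exception if requests retries the operation

Note that, depending on the system libraries, you may be unable to set a deadline on DNS resolution.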
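Reaching the socket inside requests means touching private attributes, so the sketch below shows the watchdog idea on a plain socket instead. Everything here is an assumption for illustration: read_with_watchdog is a made-up name, and it relies on shutdown() waking up a blocked recv(), which is how TCP sockets behave on Linux but is worth verifying on your platform. The connect step has its own timeout, so the worst case is roughly twice deadline_s, and name resolution is not bounded at all.

```python
import socket
import threading


def read_with_watchdog(host, port, request_bytes, deadline_s):
    """Send request_bytes and read until EOF; a watchdog cuts the socket after deadline_s.

    Hypothetical sketch on a raw socket, not a requests API.
    """
    sock = socket.create_connection((host, port), timeout=deadline_s)
    sock.settimeout(None)  # blocking reads; only the watchdog bounds the read phase
    fired = threading.Event()

    def cut_connection():
        fired.set()
        try:
            sock.shutdown(socket.SHUT_RDWR)  # makes a blocked recv() return
        except OSError:
            pass  # socket was already closed

    watchdog = threading.Timer(deadline_s, cut_connection)
    watchdog.daemon = True
    watchdog.start()
    chunks = []
    try:
        sock.sendall(request_bytes)
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # EOF from the peer, or our own shutdown()
                break
            chunks.append(chunk)
    except OSError:
        if not fired.is_set():
            raise  # a genuine network error, not the watchdog
    finally:
        watchdog.cancel()
        sock.close()
    if fired.is_set():
        raise TimeoutError('no complete response within %s seconds' % deadline_s)
    return b''.join(chunks)
```

Against the server.py shown earlier, `read_with_watchdog('127.0.0.1', 8080, b'GET / HTTP/1.0\r\n\r\n', 10)` should raise TimeoutError after about 10 seconds instead of hanging.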

Set the timeout parameter:

r = requests.get(w, verify=False, timeout=10) # 10 seconds

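Changes for version 2.25.1

The code above will cause the call to requests.get() to time out if the connection, or the delay between reads, takes more than 10 seconds. See: https://requests.readthedocs.io/en/stable/user/advanced/#timeouts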