I'm collecting statistics on a list of websites, and for simplicity I'm using requests. Here is my code:

import requests

data = []
websites = ['http://google.com', 'http://bbc.co.uk']
for w in websites:
    r = requests.get(w, verify=False)
    data.append((r.url, len(r.content), r.elapsed.total_seconds(),
                 str([(l.status_code, l.url) for l in r.history]),
                 str(r.headers.items()), str(r.cookies.items())))

Now, I want requests.get to time out after 10 seconds so that the loop doesn't get stuck.

This question has come up before, but none of the answers are clean.

I've heard that perhaps not using requests is a good idea, but then how should I get the nice things requests provides (the ones in the tuple)?


Current answer

Most of the other answers are incorrect

Despite all the answers, I believe this thread still lacks a proper solution, and no existing answer presents a reasonable way to do something that should be simple and obvious.

Let's start by saying that, as of 2022, there is still absolutely no way to do this correctly with requests alone. It is a conscious design decision by the library's developers.

Solutions that use the timeout parameter simply do not accomplish what they intend to do. The fact that they "seem" to work at first glance is purely incidental:

The timeout parameter has absolutely nothing to do with the total execution time of the request. It merely caps the maximum amount of time that can pass before the underlying socket receives any data. With a timeout of 5 seconds, for example, a server can just as well send 1 byte of data every 4 seconds and that is perfectly fine, even though it doesn't help you very much.
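
As an aside for readers unfamiliar with the parameter: timeout also accepts a (connect, read) tuple, which the demos below use. A minimal sketch of both forms (example.com is just a placeholder URL); note that neither form bounds the total request time:

import requests

# Single value: used for both the connect and the read timeout.
requests.get('https://example.com', timeout=5)

# Tuple: 3 seconds to establish the connection, then at most 6 seconds
# of socket silence between received chunks - NOT 6 seconds in total.
requests.get('https://example.com', timeout=(3, 6))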

Answers that use stream and iter_content are somewhat better, but they still do not cover everything in a request. You do not actually receive anything from iter_content until after the response headers have been sent, which falls under the same problem: even if you use a chunk size of 1 byte for iter_content, reading the complete response headers can take a completely arbitrary amount of time, and you may never actually get to the point of reading any of the response body from iter_content.

Here are some examples that completely break both the timeout-based and the stream-based approaches. Try them all. They hang indefinitely, no matter which method you use.

server.py

import socket
import time

server = socket.socket()

server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, True)
server.bind(('127.0.0.1', 8080))

server.listen()

while True:
    try:
        sock, addr = server.accept()
        print('Connection from', addr)
        sock.send(b'HTTP/1.1 200 OK\r\n')

        # Send some garbage headers very slowly but steadily.
        # Never actually complete the response.

        while True:
            sock.send(b'a')
            time.sleep(1)
    except:
        pass

demo1.py

import requests

requests.get('http://localhost:8080')

demo2.py

import requests

requests.get('http://localhost:8080', timeout=5)

demo3.py

import requests

requests.get('http://localhost:8080', timeout=(5, 5))

demo4.py

import requests

with requests.get('http://localhost:8080', timeout=(5, 5), stream=True) as res:
    for chunk in res.iter_content(1):
        break

The correct solution

My approach utilizes Python's sys.settrace function. It is dead simple. You do not need to use any external libraries or turn your code upside down. Unlike most other answers, this actually guarantees that the code executes within the specified time. Be aware that you still need to specify the timeout parameter, as settrace only concerns Python code. Actual socket reads are external syscalls which are not covered by settrace, but are covered by the timeout parameter. Due to this fact, the exact time limit is not TOTAL_TIMEOUT, but a value which is explained in the comments below.

import requests
import sys
import time

# This function serves as a "hook" that executes for each Python statement
# down the road. There may be some performance penalty, but as downloading
# a webpage is mostly I/O bound, it's not going to be significant.

def trace_function(frame, event, arg):
    if time.time() - start > TOTAL_TIMEOUT:
        raise Exception('Timed out!') # Use whatever exception you consider appropriate.

    return trace_function

# The following code will terminate at most after TOTAL_TIMEOUT + the highest
# value specified in `timeout` parameter of `requests.get`.
# In this case 10 + 6 = 16 seconds.
# For most cases though, it's gonna terminate no later than TOTAL_TIMEOUT.

TOTAL_TIMEOUT = 10

start = time.time()

sys.settrace(trace_function)

try:
    res = requests.get('http://localhost:8080', timeout=(3, 6)) # Use whatever timeout values you consider appropriate.
except:
    raise
finally:
    sys.settrace(None) # Remove the time constraint and continue normally.

# Do something with the response

Condensed

import requests, sys, time

TOTAL_TIMEOUT = 10

def trace_function(frame, event, arg):
    if time.time() - start > TOTAL_TIMEOUT:
        raise Exception('Timed out!')

    return trace_function

start = time.time()
sys.settrace(trace_function)

try:
    res = requests.get('http://localhost:8080', timeout=(3, 6))
except:
    raise
finally:
    sys.settrace(None)

That's it!
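
Applied to the loop in the question, a hedged sketch might look like this (the helper name fetch_with_deadline is made up here, and only a few of the original tuple fields are kept for brevity):

import sys
import time
import requests

TOTAL_TIMEOUT = 10

def fetch_with_deadline(url):
    # Hypothetical wrapper around the settrace trick above: the whole
    # requests.get call is aborted once TOTAL_TIMEOUT seconds have passed
    # (plus, at worst, the highest value in the timeout tuple).
    start = time.time()

    def trace_function(frame, event, arg):
        if time.time() - start > TOTAL_TIMEOUT:
            raise Exception('Timed out!')
        return trace_function

    sys.settrace(trace_function)
    try:
        return requests.get(url, timeout=(3, 6), verify=False)
    finally:
        sys.settrace(None)

data = []
for w in ['http://google.com', 'http://bbc.co.uk']:
    try:
        r = fetch_with_deadline(w)
    except Exception as e:
        print(w, 'failed:', e)
        continue
    data.append((r.url, len(r.content), r.elapsed.total_seconds()))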

Other answers

I came up with a more direct solution which, although ugly, solves the real problem. It goes like this:

resp = requests.get(some_url, stream=True)
resp.raw._fp.fp._sock.settimeout(read_timeout)
# This will load the entire response even though stream is set
content = resp.content

You can read the full explanation here.
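
If you use this hack, bear in mind that it relies on private urllib3 attributes and that the exact exception raised when the socket stalls depends on library internals. A hedged sketch of wrapping it in a helper (get_with_socket_timeout is a made-up name):

import socket
import requests

def get_with_socket_timeout(url, read_timeout=10):
    # Hypothetical helper built on the hack above; the private attribute
    # chain may change between requests/urllib3 versions.
    resp = requests.get(url, stream=True)
    resp.raw._fp.fp._sock.settimeout(read_timeout)
    try:
        _ = resp.content  # forces the full body to be read
    except (socket.timeout, requests.exceptions.RequestException) as exc:
        raise TimeoutError('read stalled for more than '
                           '{} seconds'.format(read_timeout)) from exc
    return resp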

The biggest problem is that if a connection cannot be established, the requests package waits too long and blocks the rest of the program.

There are several ways to tackle this problem, but when I looked for a one-liner similar to requests, I couldn't find anything. That's why I built a wrapper around requests called reqto ("requests timeout"), which supports proper timeouts for all the standard methods from requests.

pip install reqto

The syntax is identical to requests:

import reqto

response = reqto.get(f'https://pypi.org/pypi/reqto/json',timeout=1)
# Will raise an exception on Timeout
print(response)

In addition, you can set up a custom timeout function:

def custom_function(parameter):
    print(parameter)


response = reqto.get(f'https://pypi.org/pypi/reqto/json',timeout=5,timeout_function=custom_function,timeout_args="Timeout custom function called")
#Will call timeout_function instead of raising an exception on Timeout
print(response)

The important caveat is the import line

import reqto

Because of the monkey patching going on in the background, it needs to be imported earlier than all other imports such as requests, threading, etc.
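
In other words (this just restates the import-order requirement; the modules other than reqto are only examples):

import reqto      # must come before anything that imports requests

import threading
import requests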

Set stream=True and use r.iter_content(1024). Yes, I know about eventlet; its timeout just didn't work for me.

from time import time
from requests import get, exceptions

# `config` is an external dict holding the online URL and a local fallback file.
try:
    start = time()
    timeout = 5
    with get(config['source']['online'], stream=True, timeout=timeout) as r:
        r.raise_for_status()
        content = bytes()
        content_gen = r.iter_content(1024)
        while True:
            if time() - start > timeout:
                raise TimeoutError('Time out! ({} seconds)'.format(timeout))
            try:
                content += next(content_gen)
            except StopIteration:
                break
        data = content.decode().split('\n')
        if len(data) in [0, 1]:
            raise ValueError('Bad requests data')
except (exceptions.RequestException, ValueError, IndexError, KeyboardInterrupt,
        TimeoutError) as e:
    print(e)
    with open(config['source']['local']) as f:
        data = [line.strip() for line in f.readlines()]

The discussion is here: https://redd.it/80kp1h

I believe you can use multiprocessing without relying on a third-party package:

import multiprocessing
import requests

def call_with_timeout(func, args, kwargs, timeout):
    manager = multiprocessing.Manager()
    return_dict = manager.dict()

    # Define a wrapper that stores the result of `func` in `return_dict`.
    # Note: this nested function relies on the fork start method; it cannot
    # be pickled under the spawn start method (e.g. on Windows).
    def function(return_dict):
        return_dict['value'] = func(*args, **kwargs)

    p = multiprocessing.Process(target=function, args=(return_dict,))
    p.start()

    # Force a max. `timeout` or wait for the process to finish
    p.join(timeout)

    # If the process is still alive, it didn't finish: raise TimeoutError
    if p.is_alive():
        p.terminate()
        p.join()
        raise TimeoutError
    else:
        return return_dict['value']

call_with_timeout(requests.get, args=(url,), kwargs={'timeout': 10}, timeout=60)

The timeout passed in kwargs is the timeout for getting any response from the server, while the timeout argument is the timeout for getting the complete response.
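
A hedged usage sketch applying call_with_timeout (defined above) to the question's loop; only a few of the original tuple fields are kept, and the __main__ guard is there because multiprocessing starts child processes:

import requests

if __name__ == '__main__':
    websites = ['http://google.com', 'http://bbc.co.uk']
    data = []
    for url in websites:
        try:
            # Inner timeout: 10 s per socket read; outer timeout: 60 s for the
            # whole requests.get call, enforced by terminating the process.
            r = call_with_timeout(requests.get, args=(url,),
                                  kwargs={'timeout': 10}, timeout=60)
        except TimeoutError:
            print('Skipping', url, ': no complete response within 60 seconds')
            continue
        data.append((r.url, len(r.content), r.elapsed.total_seconds()))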