我正在收集网站列表上的统计数据,为了简单起见,我正在使用请求。这是我的代码:

data=[]
websites=['http://google.com', 'http://bbc.co.uk']
for w in websites:
    r= requests.get(w, verify=False)
    data.append( (r.url, len(r.content), r.elapsed.total_seconds(), str([(l.status_code, l.url) for l in r.history]), str(r.headers.items()), str(r.cookies.items())) )
 

现在,我想要请求。10秒后进入超时,这样循环就不会卡住。

这个问题以前也很有趣,但没有一个答案是干净的。

我听说可能不使用请求是一个好主意,但我应该如何得到请求提供的好东西(元组中的那些)。


当前回答

更新:https://requests.readthedocs.io/en/master/user/advanced/超时

在新版本的请求:

如果你为超时指定一个单独的值,像这样:

r = requests.get('https://github.com', timeout=5)

超时值将应用于连接超时和读取超时。如果你想分别设置值,请指定一个元组:

r = requests.get('https://github.com', timeout=(3.05, 27))

如果远程服务器非常慢,您可以告诉Requests永远等待响应,方法是将None作为超时值,然后检索一杯咖啡。

r = requests.get('https://github.com', timeout=None)

我以前的答案(可能已经过时了)(很久以前贴出来的):

还有其他方法可以克服这个问题:

1. 使用TimeoutSauce内部类

来自:https://github.com/kennethreitz/requests/issues/1928 # issuecomment - 35811896

import requests from requests.adapters import TimeoutSauce class MyTimeout(TimeoutSauce): def __init__(self, *args, **kwargs): connect = kwargs.get('connect', 5) read = kwargs.get('read', connect) super(MyTimeout, self).__init__(connect=connect, read=read) requests.adapters.TimeoutSauce = MyTimeout This code should cause us to set the read timeout as equal to the connect timeout, which is the timeout value you pass on your Session.get() call. (Note that I haven't actually tested this code, so it may need some quick debugging, I just wrote it straight into the GitHub window.)

2. 使用kevinburke请求的分支:https://github.com/kevinburke/requests/tree/connect-timeout

从它的文档:https://github.com/kevinburke/requests/blob/connect-timeout/docs/user/advanced.rst

如果你为超时指定一个单独的值,像这样: R = requests.get('https://github.com', timeout=5) 超时值将应用于连接和读取 超时。如果要设置值,请指定一个元组 另外: R = requests.get('https://github.com', timeout=(3.05, 27))

Kevinburke已请求将其合并到主要请求项目中,但尚未被接受。

其他回答

我想到了一个更直接的解决方案,虽然很难看,但能解决真正的问题。它是这样的:

resp = requests.get(some_url, stream=True)
resp.raw._fp.fp._sock.settimeout(read_timeout)
# This will load the entire response even though stream is set
content = resp.content

你可以在这里阅读完整的解释

如果你使用选项stream=True,你可以这样做:

r = requests.get(
    'http://url_to_large_file',
    timeout=1,  # relevant only for underlying socket
    stream=True)

with open('/tmp/out_file.txt'), 'wb') as f:
    start_time = time.time()
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:  # filter out keep-alive new chunks
            f.write(chunk)
        if time.time() - start_time > 8:
            raise Exception('Request took longer than 8s')

该解决方案不需要信号或多处理。

其他答案大多不正确

尽管有这么多的答案,我相信这个帖子仍然缺乏一个合适的解决方案,而且没有现有的答案可以提供一个合理的方法来做一些简单而明显的事情。

我们首先要说的是,截至2022年,仅凭请求仍然绝对无法正确地做到这一点。这是库开发人员有意识的设计决定。

利用超时参数的解决方案根本不能完成它们想要做的事情。事实上,乍一看,它“似乎”起作用纯粹是偶然的:

timeout参数与请求的总执行时间完全没有关系。它只是控制底层套接字接收任何数据之前可以通过的最大时间量。以5秒的超时为例,服务器也可以每4秒发送1字节的数据,这完全没问题,但对您的帮助不大。

带有stream和iter_content的答案稍好一些,但它们仍然不能覆盖请求中的所有内容。在发送响应头之前,您实际上不会从iter_content中接收到任何内容,这也属于相同的问题——即使您使用1字节作为iter_content的块大小,读取完整的响应头可能需要完全任意的时间,并且您永远无法实际到达从iter_content中读取任何响应体的位置。

下面是一些完全打破超时和基于流的方法的示例。都试试。不管你使用哪种方法,它们都是无限期地挂着的。

server.py

import socket
import time

server = socket.socket()

server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, True)
server.bind(('127.0.0.1', 8080))

server.listen()

while True:
    try:
        sock, addr = server.accept()
        print('Connection from', addr)
        sock.send(b'HTTP/1.1 200 OK\r\n')

        # Send some garbage headers very slowly but steadily.
        # Never actually complete the response.

        while True:
            sock.send(b'a')
            time.sleep(1)
    except:
        pass

demo1.py

import requests

requests.get('http://localhost:8080')

demo2.py

import requests

requests.get('http://localhost:8080', timeout=5)

demo3.py

import requests

requests.get('http://localhost:8080', timeout=(5, 5))

demo4.py

import requests

with requests.get('http://localhost:8080', timeout=(5, 5), stream=True) as res:
    for chunk in res.iter_content(1):
        break

正确的解决方法

My approach utilizes Python's sys.settrace function. It is dead simple. You do not need to use any external libraries or turn your code upside down. Unlike most other answers, this actually guarantees that the code executes in specified time. Be aware that you still need to specify the timeout parameter, as settrace only concerns Python code. Actual socket reads are external syscalls which are not covered by settrace, but are covered by the timeout parameter. Due to this fact, the exact time limit is not TOTAL_TIMEOUT, but a value which is explained in comments below.

import requests
import sys
import time

# This function serves as a "hook" that executes for each Python statement
# down the road. There may be some performance penalty, but as downloading
# a webpage is mostly I/O bound, it's not going to be significant.

def trace_function(frame, event, arg):
    if time.time() - start > TOTAL_TIMEOUT:
        raise Exception('Timed out!') # Use whatever exception you consider appropriate.

    return trace_function

# The following code will terminate at most after TOTAL_TIMEOUT + the highest
# value specified in `timeout` parameter of `requests.get`.
# In this case 10 + 6 = 16 seconds.
# For most cases though, it's gonna terminate no later than TOTAL_TIMEOUT.

TOTAL_TIMEOUT = 10

start = time.time()

sys.settrace(trace_function)

try:
    res = requests.get('http://localhost:8080', timeout=(3, 6)) # Use whatever timeout values you consider appropriate.
except:
    raise
finally:
    sys.settrace(None) # Remove the time constraint and continue normally.

# Do something with the response

浓缩

import requests, sys, time

TOTAL_TIMEOUT = 10

def trace_function(frame, event, arg):
    if time.time() - start > TOTAL_TIMEOUT:
        raise Exception('Timed out!')

    return trace_function

start = time.time()
sys.settrace(trace_function)

try:
    res = requests.get('http://localhost:8080', timeout=(3, 6))
except:
    raise
finally:
    sys.settrace(None)

就是这样!

设置stream=True并使用r.iter_content(1024)。是的,eventlet。我就是不喜欢超时。

try:
    start = time()
    timeout = 5
    with get(config['source']['online'], stream=True, timeout=timeout) as r:
        r.raise_for_status()
        content = bytes()
        content_gen = r.iter_content(1024)
        while True:
            if time()-start > timeout:
                raise TimeoutError('Time out! ({} seconds)'.format(timeout))
            try:
                content += next(content_gen)
            except StopIteration:
                break
        data = content.decode().split('\n')
        if len(data) in [0, 1]:
            raise ValueError('Bad requests data')
except (exceptions.RequestException, ValueError, IndexError, KeyboardInterrupt,
        TimeoutError) as e:
    print(e)
    with open(config['source']['local']) as f:
        data = [line.strip() for line in f.readlines()]

讨论在这里https://redd.it/80kp1h

嗯,我尝试了这个页面上的许多解决方案,仍然面临不稳定,随机挂起,连接性能差。

我现在正在使用Curl,我对它的“max time”功能和全局性能非常满意,即使实现如此糟糕:

content=commands.getoutput('curl -m6 -Ss "http://mywebsite.xyz"')

这里,我定义了一个最大6秒的时间参数,包括连接时间和传输时间。

我相信Curl有一个很好的python绑定,如果你更喜欢坚持python语法:)