用Python发送100,000个HTTP请求的最快方法是什么?

I am opening a file which has 100,000 URL's. I need to send an HTTP request to each URL and print the status code. I am using Python 2.6, and so far looked at the many confusing ways Python implements threading/concurrency. I have even looked at the python concurrence library, but cannot figure out how to write this program correctly. Has anyone come across a similar problem? I guess generally I need to know how to perform thousands of tasks in Python as fast as possible - I suppose that means 'concurrently'.

当前回答

使用grequests，它是requests + Gevent模块的组合。

GRequests允许您使用带有Gevent的Requests来轻松地生成异步HTTP请求。

用法很简单:

import grequests

urls = [
   'http://www.heroku.com',
   'http://tablib.org',
   'http://httpbin.org',
   'http://python-requests.org',
   'http://kennethreitz.com'
]

创建一组未发送的请求:

>>> rs = (grequests.get(u) for u in urls)

同时发送:

>>> grequests.map(rs)
[<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>]

2014-07-14 07:41:25

其他回答

考虑使用风车，虽然风车可能不能做那么多线程。

您可以在5台机器上使用手卷Python脚本，每台机器使用端口40000-60000连接出站，打开100,000个端口连接。

另外，使用一个线程良好的QA应用程序(如OpenSTA)做一个示例测试可能会有所帮助，以了解每个服务器可以处理多少。

另外，试着在LWP::ConnCache类中使用简单的Perl。这样您可能会获得更好的性能(更多的连接)。

2010-04-13 20:20:02

解决这个问题的一个好方法是首先编写获得一个结果所需的代码，然后合并线程代码来并行化应用程序。

In a perfect world this would simply mean simultaneously starting 100,000 threads which output their results into a dictionary or list for later processing, but in practice you are limited in how many parallel HTTP requests you can issue in this fashion. Locally, you have limits in how many sockets you can open concurrently, how many threads of execution your Python interpreter will allow. Remotely, you may be limited in the number of simultaneous connections if all the requests are against one server, or many. These limitations will probably necessitate that you write the script in such a way as to only poll a small fraction of the URLs at any one time (100, as another poster mentioned, is probably a decent thread pool size, although you may find that you can successfully deploy many more).

您可以遵循以下设计模式来解决上述问题:

Start a thread which launches new request threads until the number of currently running threads (you can track them via threading.active_count() or by pushing the thread objects into a data structure) is >= your maximum number of simultaneous requests (say 100), then sleeps for a short timeout. This thread should terminate when there is are no more URLs to process. Thus, the thread will keep waking up, launching new threads, and sleeping until your are finished. Have the request threads store their results in some data structure for later retrieval and output. If the structure you are storing the results in is a list or dict in CPython, you can safely append or insert unique items from your threads without locks, but if you write to a file or require in more complex cross-thread data interaction you should use a mutual exclusion lock to protect this state from corruption.

我建议您使用threading模块。您可以使用它来启动和跟踪正在运行的线程。Python的线程支持是完全的，但是对问题的描述表明它完全满足了您的需求。

最后，如果您希望看到用Python编写的并行网络应用程序的相当简单的应用程序，请查看ssh.py。它是一个小型库，使用Python线程并行处理许多SSH连接。该设计非常接近您的需求，您可能会发现它是一个很好的资源。

2010-04-13 20:09:49

使用grequests，它是requests + Gevent模块的组合。

GRequests允许您使用带有Gevent的Requests来轻松地生成异步HTTP请求。

用法很简单:

import grequests

urls = [
   'http://www.heroku.com',
   'http://tablib.org',
   'http://httpbin.org',
   'http://python-requests.org',
   'http://kennethreitz.com'
]

创建一组未发送的请求:

>>> rs = (grequests.get(u) for u in urls)

同时发送:

>>> grequests.map(rs)
[<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>]

2014-07-14 07:41:25

最简单的方法是使用Python的内置线程库。它们不是“真正的”/内核线程。它们有问题(比如序列化)，但足够好了。你需要一个队列和线程池。这里有一个选项，但是编写自己的选项很简单。您无法并行处理所有100,000个调用，但可以同时发出100个(或左右)调用。

2010-04-13 19:30:25

使用线程池是一个很好的选择，这将使这相当容易。不幸的是，python并没有一个标准库来简化线程池。但这里有一个不错的图书馆，你应该开始: http://www.chrisarndt.de/projects/threadpool/

来自他们网站的代码示例:

pool = ThreadPool(poolsize)
requests = makeRequests(some_callable, list_of_args, callback)
[pool.putRequest(req) for req in requests]
pool.wait()

希望这能有所帮助。

2010-04-13 19:42:41

用Python发送100,000个HTTP请求的最快方法是什么?

推荐文章

最新文章

标签