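How is set() implemented?

I've seen people say that set objects in Python have O(1) membership checking. How are they implemented internally to allow this? What sort of data structure do they use? What other implications does that implementation have?

Every answer here was really enlightening, but I can only accept one, so I'll go with the answer closest to my original question. Thanks all for the info!

Current answer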

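To emphasize the difference between sets and dicts a little more, here is an excerpt from the comment section of setobject.c that clarifies the main difference between the two.

Use cases for sets differ considerably from dictionaries where looked-up keys are more likely to be present. In contrast, sets are primarily about membership testing where the presence of an element is not known in advance. Accordingly, the set implementation needs to optimize for both the found and not-found case.

Source: GitHub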

2017-11-15 08:58:10

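Other answers

According to this thread:

Indeed, CPython's sets are implemented as something like dictionaries with dummy values (the keys being the members of the set), with some optimization(s) that exploit this lack of values

So basically a set uses a hash table as its underlying data structure. This explains the O(1) membership checking, since looking up an item in a hash table is an O(1) operation on average.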
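One practical consequence of the hash-table design is that membership depends on an element's `__hash__` and `__eq__`, so members must be hashable, and two objects that hash and compare equal count as the same member. The snippet below is a small illustration of that behavior; the variable names are made up for the example.

```python
# A set locates members by hash, then confirms with an equality check.
members = {1, "a", (2, 3)}

# 1.0 hashes the same as 1 and compares equal to it, so it is "in" the set.
float_is_member = 1.0 in members

# Lists are mutable and define no hash, so they cannot be stored in a set.
try:
    members.add([4, 5])
    add_error = None
except TypeError as exc:
    add_error = exc
```

A tuple of hashable items works as a member, which is why `(2, 3)` is accepted while `[4, 5]` is rejected.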

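If you are so inclined, you can even browse the CPython source code for set which, according to Achim Domma, was originally mostly a cut-and-paste from the dict implementation.

Note: Nowadays, the set and dict implementations have diverged significantly, so the precise behaviors (e.g. arbitrary order vs. insertion order) and the performance in various use cases differ. Both are still implemented as hash tables, so average-case lookup and insertion remain O(1), but set is no longer just "dict, but with dummy/omitted values".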
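The ordering difference mentioned in the note can be seen directly. The sketch below assumes Python 3.7 or later, where dict insertion order is part of the language; the word list is invented for the example.

```python
words = ["pear", "apple", "fig", "apple"]

# dict keys keep first-insertion order, so this is an order-preserving dedupe.
deduped_in_order = list(dict.fromkeys(words))

# A set also removes the duplicate, but its iteration order is arbitrary
# and should not be relied on; sort it if a stable order is needed.
deduped_set = set(words)
stable_view = sorted(deduped_set)
```

If the order of first appearance matters, `dict.fromkeys` is the safer choice; if only membership matters, the set is the natural fit.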

2010-10-16 14:47:43

I think it's a common mistake: set lookup (or hash table lookup, for that matter) is not O(1) in every case. From Wikipedia:

In the simplest model, the hash function is completely unspecified and the table does not resize. For the best possible choice of hash function, a table of size n with open addressing has no collisions and holds up to n elements, with a single comparison for successful lookup, and a table of size n with chaining and k keys has the minimum max(0, k-n) collisions and O(1 + k/n) comparisons for lookup. For the worst choice of hash function, every insertion causes a collision, and hash tables degenerate to linear search, with Ω(k) amortized comparisons per insertion and up to k comparisons for a successful lookup.

Related: Is a Java hashmap really O(1)?

2010-10-16 14:57:26

We all have easy access to the source, where the comment preceding set_lookkey() says:

/* set object implementation
 Written and maintained by Raymond D. Hettinger <python@rcn.com>
 Derived from Lib/sets.py and Objects/dictobject.c.
 The basic lookup function used by all operations.
 This is based on Algorithm D from Knuth Vol. 3, Sec. 6.4.
 The initial probe index is computed as hash mod the table size.
 Subsequent probe indices are computed as explained in Objects/dictobject.c.
 To improve cache locality, each probe inspects a series of consecutive
 nearby entries before moving on to probes elsewhere in memory.  This leaves
 us with a hybrid of linear probing and open addressing.  The linear probing
 reduces the cost of hash collisions because consecutive memory accesses
 tend to be much cheaper than scattered probes.  After LINEAR_PROBES steps,
 we then use open addressing with the upper bits from the hash value.  This
 helps break-up long chains of collisions.
 All arithmetic on hash should ignore overflow.
 Unlike the dictionary implementation, the lookkey function can return
 NULL if the rich comparison returns an error.
*/


...
#ifndef LINEAR_PROBES
#define LINEAR_PROBES 9
#endif

/* This must be >= 1 */
#define PERTURB_SHIFT 5

static setentry *
set_lookkey(PySetObject *so, PyObject *key, Py_hash_t hash)  
{
...
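To make the probing strategy in that comment concrete, here is a toy Python sketch of a fixed-size table that inspects a short run of consecutive slots and then jumps using the upper hash bits. It is a simplification under stated assumptions, not CPython's actual code: the class `ToyProbingSet`, its method names, the lack of resizing and deletion, and the 64-bit mask are all choices made for this example. Only `LINEAR_PROBES` and `PERTURB_SHIFT` come from the excerpt above.

```python
LINEAR_PROBES = 9   # same constants as the C excerpt
PERTURB_SHIFT = 5


class ToyProbingSet:
    """Fixed-size toy hash set: no resizing and no deletion (assumptions)."""

    def __init__(self, size=64):
        if size < 2 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self._mask = size - 1
        self._table = [None] * size   # each slot is None or (hash, key)
        self._used = 0

    def _probe_order(self, h):
        """Yield slot indices: a short linear run, then a perturbed jump."""
        perturb = h
        i = h & self._mask
        while True:
            yield i
            if i + LINEAR_PROBES <= self._mask:   # run must fit in the table
                for step in range(1, LINEAR_PROBES + 1):
                    yield i + step
            perturb >>= PERTURB_SHIFT             # feed in the upper hash bits
            i = (i * 5 + 1 + perturb) & self._mask

    def _find_slot(self, key):
        """Return (index, hash) of the key's slot or of the first empty slot."""
        h = hash(key) & 0xFFFFFFFFFFFFFFFF        # treat the hash as unsigned
        for i in self._probe_order(h):
            slot = self._table[i]
            if slot is None or (slot[0] == h and slot[1] == key):
                return i, h

    def add(self, key):
        i, h = self._find_slot(key)
        if self._table[i] is None:
            if self._used >= self._mask:          # always keep one slot empty
                raise OverflowError("toy table is full")
            self._table[i] = (h, key)
            self._used += 1

    def __contains__(self, key):
        i, _ = self._find_slot(key)
        return self._table[i] is not None


toy = ToyProbingSet(64)
colliding = [k * 64 for k in range(40)]   # every key starts at slot 0
for key in colliding:
    toy.add(key)
```

All forty keys share the same starting slot, yet each one is still found: the first ten land in neighboring slots, and the rest are scattered by the perturbed jumps.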

2010-10-16 14:59:38

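When people say sets have O(1) membership checking, they are talking about the average case. In the worst case (when all hash values collide), membership checking is O(n). See the Python wiki page on time complexity.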
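The worst case can be provoked on purpose. The sketch below uses a hypothetical `Key` class whose hash can be forced to a constant, and counts `__eq__` calls during a failed lookup; the class and the helper `comparisons_for_miss` are invented for this illustration, and the exact counts depend on the interpreter.

```python
class Key:
    """Hypothetical key type that can be told to hash badly."""
    eq_calls = 0   # class-wide counter of equality comparisons

    def __init__(self, value, constant_hash):
        self.value = value
        self._constant_hash = constant_hash

    def __hash__(self):
        # A constant hash makes every key collide with every other key.
        return 1 if self._constant_hash else hash(self.value)

    def __eq__(self, other):
        Key.eq_calls += 1
        return isinstance(other, Key) and self.value == other.value


def comparisons_for_miss(n, constant_hash):
    """Build a set of n keys, then count __eq__ calls for one failed lookup."""
    members = {Key(i, constant_hash) for i in range(n)}
    Key.eq_calls = 0
    found = Key(-1, constant_hash) in members
    return found, Key.eq_calls


miss_colliding = comparisons_for_miss(200, constant_hash=True)
miss_spread = comparisons_for_miss(200, constant_hash=False)
```

With the constant hash, the failed lookup has to compare against every stored key, so the count grows with n; with well-spread hashes, the same lookup needs few or no comparisons.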

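The Wikipedia article says the best-case time complexity for a hash table that does not resize is O(1 + k/n). This result does not directly apply to Python sets, since Python sets use a hash table that resizes.

A little further on, the Wikipedia article says that for the average case, assuming a simple uniform hashing function, the time complexity is O(1/(1-k/n)), where k/n can be bounded by a constant c<1.

Big-O refers only to asymptotic behavior as n → ∞. Since k/n can be bounded by a constant c<1 that is independent of n, O(1/(1-k/n)) is no bigger than O(1/(1-c)), which is equivalent to O(constant) = O(1).

So, assuming simple uniform hashing, membership checking for Python sets is O(1) on average.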
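The bound can be checked numerically. The sketch below evaluates the 1/(1-k/n) estimate for tables of very different sizes while keeping the load factor k/n under a cap c. The value `MAX_LOAD = 0.6` and the function name `expected_probes` are assumptions chosen for illustration, not figures taken from the answer above.

```python
def expected_probes(k, n):
    """Average-case estimate 1/(1 - k/n) for k keys in a table of n slots."""
    return 1 / (1 - k / n)


MAX_LOAD = 0.6                 # assumed cap c < 1 on the load factor k/n
bound = 1 / (1 - MAX_LOAD)     # the constant 1/(1 - c)

estimates = {}
for n in (8, 1024, 2 ** 20):
    k = int(MAX_LOAD * n)      # as many keys as the cap allows
    estimates[n] = expected_probes(k, n)
```

However large n becomes, every estimate stays at or below the same constant, which is the sense in which the average cost is O(1).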

2010-10-16 16:47:12
