我见过有人说python中的set对象有O(1)成员检查。它们在内部是如何实现的呢?它使用什么样的数据结构?这个实现还有什么其他含义?

这里的每个答案都很有启发性,但我只能接受一个,所以我将采用最接近我最初问题的答案。谢谢你提供的信息!


根据这个线程:

事实上,CPython的集合被实现为类似字典的东西 使用虚拟值(键是集合的成员),使用一些 利用这种价值缺失的优化

所以基本上一个集合使用哈希表作为它的底层数据结构。这解释了O(1)成员检查,因为平均而言,在哈希表中查找项是O(1)操作。

如果你有这样的倾向,你甚至可以浏览CPython源代码的set,根据Achim Domma,最初主要是从dict实现的剪切和粘贴。

注意:现在,set和dict的实现已经有了很大的分歧,所以在不同的用例中,精确的行为(例如任意顺序和插入顺序)和性能是不同的;它们仍然是根据哈希表实现的,所以平均大小写查找和插入仍然是O(1),但set不再只是“dict,而是带有哑键/省略键”。


我认为这是一个常见的错误,集查找(或哈希表)不是O(1)。 来自维基百科

In the simplest model, the hash function is completely unspecified and the table does not resize. For the best possible choice of hash function, a table of size n with open addressing has no collisions and holds up to n elements, with a single comparison for successful lookup, and a table of size n with chaining and k keys has the minimum max(0, k-n) collisions and O(1 + k/n) comparisons for lookup. For the worst choice of hash function, every insertion causes a collision, and hash tables degenerate to linear search, with Ω(k) amortized comparisons per insertion and up to k comparisons for a successful lookup.

相关:Java hashmap真的是O(1)吗?


我们都可以很容易地访问源代码,其中set_lookkey()之前的注释说:

/* set object implementation
 Written and maintained by Raymond D. Hettinger <python@rcn.com>
 Derived from Lib/sets.py and Objects/dictobject.c.
 The basic lookup function used by all operations.
 This is based on Algorithm D from Knuth Vol. 3, Sec. 6.4.
 The initial probe index is computed as hash mod the table size.
 Subsequent probe indices are computed as explained in Objects/dictobject.c.
 To improve cache locality, each probe inspects a series of consecutive
 nearby entries before moving on to probes elsewhere in memory.  This leaves
 us with a hybrid of linear probing and open addressing.  The linear probing
 reduces the cost of hash collisions because consecutive memory accesses
 tend to be much cheaper than scattered probes.  After LINEAR_PROBES steps,
 we then use open addressing with the upper bits from the hash value.  This
 helps break-up long chains of collisions.
 All arithmetic on hash should ignore overflow.
 Unlike the dictionary implementation, the lookkey function can return
 NULL if the rich comparison returns an error.
*/


...
#ifndef LINEAR_PROBES
#define LINEAR_PROBES 9
#endif

/* This must be >= 1 */
#define PERTURB_SHIFT 5

static setentry *
set_lookkey(PySetObject *so, PyObject *key, Py_hash_t hash)  
{
...

当人们说集合有O(1)个成员检验时,他们说的是平均情况。在最坏的情况下(当所有散列值碰撞时),成员检查是O(n)。请参阅Python wiki中的时间复杂度。

维基百科文章说,不调整大小的哈希表的最佳情况下时间复杂度是O(1 + k/n)。这个结果并不直接应用于Python集,因为Python集使用一个可以调整大小的哈希表。

维基百科的文章进一步说,对于平均情况,并假设一个简单的统一哈希函数,时间复杂度为O(1/(1-k/n)),其中k/n可以被常数c<1所限制。

大o仅指n→∞时的渐近行为。 由于k/n可以以常数c<1为界,与n无关,

O(1/(1-k/n))不大于O(1/(1-c)),等于O(常数)= O(1)。

所以假设统一的简单哈希,平均来说,Python集合的成员检查是O(1)。


为了进一步强调set's和dict's之间的区别,下面是seobject .c注释部分的摘录,它阐明了set's和dict的主要区别。

集合的用例与字典有很大的不同 钥匙更有可能出现。相反,集合主要是 中未知元素存在时的成员关系测试 的进步。因此,set实现需要对两者进行优化 找到和没有找到的情况。

来源github


Sets in python employ hash table internally. Let us first talk about hash table. Let there be some elements that you want to store in a hash table and you have 31 places in the hash table where you can do so. Let the elements be: 2.83, 8.23, 9.38, 10.23, 25.58, 0.42, 5.37, 28.10, 32.14, 7.31. When you want to use a hash table, you first determine the indices in the hash table where these elements would be stored. Modulus function is a popular way of determining these indices, so let us say we take one element at a time, multiply it by 100 and apply modulo by 31. It is important that each such operation on an element results in a unique number as an entry in a hash table can store only one element unless chaining is allowed. In this way, each element would be stored at a location governed by the indices obtained through modulo operation. Now if you want to search for an element in a set which essentially stores elements using this hash table, you would obtain the element in O(1) time as the index of the element is computed using the modulo operation in a constant time. To expound on the modulo operation, let me also write some code:

piles = [2.83, 8.23, 9.38, 10.23, 25.58, 0.42, 5.37, 28.10, 32.14, 7.31]

def hash_function(x):
    return int(x*100 % 31)

[hash_function(pile) for pile in piles]

输出:[4,17,8,0,16,11,10,20,21,18]