我见过有人说python中的set对象有O(1)成员检查。它们在内部是如何实现的呢?它使用什么样的数据结构?这个实现还有什么其他含义?

这里的每个答案都很有启发性,但我只能接受一个,所以我将采用最接近我最初问题的答案。谢谢你提供的信息!


当前回答

Sets in python employ hash table internally. Let us first talk about hash table. Let there be some elements that you want to store in a hash table and you have 31 places in the hash table where you can do so. Let the elements be: 2.83, 8.23, 9.38, 10.23, 25.58, 0.42, 5.37, 28.10, 32.14, 7.31. When you want to use a hash table, you first determine the indices in the hash table where these elements would be stored. Modulus function is a popular way of determining these indices, so let us say we take one element at a time, multiply it by 100 and apply modulo by 31. It is important that each such operation on an element results in a unique number as an entry in a hash table can store only one element unless chaining is allowed. In this way, each element would be stored at a location governed by the indices obtained through modulo operation. Now if you want to search for an element in a set which essentially stores elements using this hash table, you would obtain the element in O(1) time as the index of the element is computed using the modulo operation in a constant time. To expound on the modulo operation, let me also write some code:

piles = [2.83, 8.23, 9.38, 10.23, 25.58, 0.42, 5.37, 28.10, 32.14, 7.31]

def hash_function(x):
    return int(x*100 % 31)

[hash_function(pile) for pile in piles]

输出:[4,17,8,0,16,11,10,20,21,18]

其他回答

根据这个线程:

事实上,CPython的集合被实现为类似字典的东西 使用虚拟值(键是集合的成员),使用一些 利用这种价值缺失的优化

所以基本上一个集合使用哈希表作为它的底层数据结构。这解释了O(1)成员检查,因为平均而言,在哈希表中查找项是O(1)操作。

如果你有这样的倾向,你甚至可以浏览CPython源代码的set,根据Achim Domma,最初主要是从dict实现的剪切和粘贴。

注意:现在,set和dict的实现已经有了很大的分歧,所以在不同的用例中,精确的行为(例如任意顺序和插入顺序)和性能是不同的;它们仍然是根据哈希表实现的,所以平均大小写查找和插入仍然是O(1),但set不再只是“dict,而是带有哑键/省略键”。

我认为这是一个常见的错误,集查找(或哈希表)不是O(1)。 来自维基百科

In the simplest model, the hash function is completely unspecified and the table does not resize. For the best possible choice of hash function, a table of size n with open addressing has no collisions and holds up to n elements, with a single comparison for successful lookup, and a table of size n with chaining and k keys has the minimum max(0, k-n) collisions and O(1 + k/n) comparisons for lookup. For the worst choice of hash function, every insertion causes a collision, and hash tables degenerate to linear search, with Ω(k) amortized comparisons per insertion and up to k comparisons for a successful lookup.

相关:Java hashmap真的是O(1)吗?

我们都可以很容易地访问源代码,其中set_lookkey()之前的注释说:

/* set object implementation
 Written and maintained by Raymond D. Hettinger <python@rcn.com>
 Derived from Lib/sets.py and Objects/dictobject.c.
 The basic lookup function used by all operations.
 This is based on Algorithm D from Knuth Vol. 3, Sec. 6.4.
 The initial probe index is computed as hash mod the table size.
 Subsequent probe indices are computed as explained in Objects/dictobject.c.
 To improve cache locality, each probe inspects a series of consecutive
 nearby entries before moving on to probes elsewhere in memory.  This leaves
 us with a hybrid of linear probing and open addressing.  The linear probing
 reduces the cost of hash collisions because consecutive memory accesses
 tend to be much cheaper than scattered probes.  After LINEAR_PROBES steps,
 we then use open addressing with the upper bits from the hash value.  This
 helps break-up long chains of collisions.
 All arithmetic on hash should ignore overflow.
 Unlike the dictionary implementation, the lookkey function can return
 NULL if the rich comparison returns an error.
*/


...
#ifndef LINEAR_PROBES
#define LINEAR_PROBES 9
#endif

/* This must be >= 1 */
#define PERTURB_SHIFT 5

static setentry *
set_lookkey(PySetObject *so, PyObject *key, Py_hash_t hash)  
{
...

Sets in python employ hash table internally. Let us first talk about hash table. Let there be some elements that you want to store in a hash table and you have 31 places in the hash table where you can do so. Let the elements be: 2.83, 8.23, 9.38, 10.23, 25.58, 0.42, 5.37, 28.10, 32.14, 7.31. When you want to use a hash table, you first determine the indices in the hash table where these elements would be stored. Modulus function is a popular way of determining these indices, so let us say we take one element at a time, multiply it by 100 and apply modulo by 31. It is important that each such operation on an element results in a unique number as an entry in a hash table can store only one element unless chaining is allowed. In this way, each element would be stored at a location governed by the indices obtained through modulo operation. Now if you want to search for an element in a set which essentially stores elements using this hash table, you would obtain the element in O(1) time as the index of the element is computed using the modulo operation in a constant time. To expound on the modulo operation, let me also write some code:

piles = [2.83, 8.23, 9.38, 10.23, 25.58, 0.42, 5.37, 28.10, 32.14, 7.31]

def hash_function(x):
    return int(x*100 % 31)

[hash_function(pile) for pile in piles]

输出:[4,17,8,0,16,11,10,20,21,18]

为了进一步强调set's和dict's之间的区别,下面是seobject .c注释部分的摘录,它阐明了set's和dict的主要区别。

集合的用例与字典有很大的不同 钥匙更有可能出现。相反,集合主要是 中未知元素存在时的成员关系测试 的进步。因此,set实现需要对两者进行优化 找到和没有找到的情况。

来源github