set()是如何实现的?

我见过有人说python中的set对象有O(1)成员检查。它们在内部是如何实现的呢?它使用什么样的数据结构?这个实现还有什么其他含义?

这里的每个答案都很有启发性，但我只能接受一个，所以我将采用最接近我最初问题的答案。谢谢你提供的信息!

根据这个线程:

事实上，CPython的集合被实现为类似字典的东西使用虚拟值(键是集合的成员)，使用一些利用这种价值缺失的优化

所以基本上一个集合使用哈希表作为它的底层数据结构。这解释了O(1)成员检查，因为平均而言，在哈希表中查找项是O(1)操作。

如果你有这样的倾向，你甚至可以浏览CPython源代码的set，根据Achim Domma，最初主要是从dict实现的剪切和粘贴。

注意:现在，set和dict的实现已经有了很大的分歧，所以在不同的用例中，精确的行为(例如任意顺序和插入顺序)和性能是不同的;它们仍然是根据哈希表实现的，所以平均大小写查找和插入仍然是O(1)，但set不再只是“dict，而是带有哑键/省略键”。

2010-10-16 14:47:43

我认为这是一个常见的错误，集查找(或哈希表)不是O(1)。来自维基百科

In the simplest model, the hash function is completely unspecified and the table does not resize. For the best possible choice of hash function, a table of size n with open addressing has no collisions and holds up to n elements, with a single comparison for successful lookup, and a table of size n with chaining and k keys has the minimum max(0, k-n) collisions and O(1 + k/n) comparisons for lookup. For the worst choice of hash function, every insertion causes a collision, and hash tables degenerate to linear search, with Ω(k) amortized comparisons per insertion and up to k comparisons for a successful lookup.

相关:Java hashmap真的是O(1)吗?

2010-10-16 14:57:26

我们都可以很容易地访问源代码，其中set_lookkey()之前的注释说:

/* set object implementation
 Written and maintained by Raymond D. Hettinger <python@rcn.com>
 Derived from Lib/sets.py and Objects/dictobject.c.
 The basic lookup function used by all operations.
 This is based on Algorithm D from Knuth Vol. 3, Sec. 6.4.
 The initial probe index is computed as hash mod the table size.
 Subsequent probe indices are computed as explained in Objects/dictobject.c.
 To improve cache locality, each probe inspects a series of consecutive
 nearby entries before moving on to probes elsewhere in memory.  This leaves
 us with a hybrid of linear probing and open addressing.  The linear probing
 reduces the cost of hash collisions because consecutive memory accesses
 tend to be much cheaper than scattered probes.  After LINEAR_PROBES steps,
 we then use open addressing with the upper bits from the hash value.  This
 helps break-up long chains of collisions.
 All arithmetic on hash should ignore overflow.
 Unlike the dictionary implementation, the lookkey function can return
 NULL if the rich comparison returns an error.
*/


...
#ifndef LINEAR_PROBES
#define LINEAR_PROBES 9
#endif

/* This must be >= 1 */
#define PERTURB_SHIFT 5

static setentry *
set_lookkey(PySetObject *so, PyObject *key, Py_hash_t hash)  
{
...

2010-10-16 14:59:38

当人们说集合有O(1)个成员检验时，他们说的是平均情况。在最坏的情况下(当所有散列值碰撞时)，成员检查是O(n)。请参阅Python wiki中的时间复杂度。

维基百科文章说，不调整大小的哈希表的最佳情况下时间复杂度是O(1 + k/n)。这个结果并不直接应用于Python集，因为Python集使用一个可以调整大小的哈希表。

维基百科的文章进一步说，对于平均情况，并假设一个简单的统一哈希函数，时间复杂度为O(1/(1-k/n))，其中k/n可以被常数c<1所限制。

大o仅指n→∞时的渐近行为。由于k/n可以以常数c<1为界，与n无关，

O(1/(1-k/n))不大于O(1/(1-c))，等于O(常数)= O(1)。

所以假设统一的简单哈希，平均来说，Python集合的成员检查是O(1)。

2010-10-16 16:47:12

为了进一步强调set's和dict's之间的区别，下面是seobject .c注释部分的摘录，它阐明了set's和dict的主要区别。

集合的用例与字典有很大的不同钥匙更有可能出现。相反，集合主要是中未知元素存在时的成员关系测试的进步。因此，set实现需要对两者进行优化找到和没有找到的情况。

来源github

2017-11-15 08:58:10

Sets in python employ hash table internally. Let us first talk about hash table. Let there be some elements that you want to store in a hash table and you have 31 places in the hash table where you can do so. Let the elements be: 2.83, 8.23, 9.38, 10.23, 25.58, 0.42, 5.37, 28.10, 32.14, 7.31. When you want to use a hash table, you first determine the indices in the hash table where these elements would be stored. Modulus function is a popular way of determining these indices, so let us say we take one element at a time, multiply it by 100 and apply modulo by 31. It is important that each such operation on an element results in a unique number as an entry in a hash table can store only one element unless chaining is allowed. In this way, each element would be stored at a location governed by the indices obtained through modulo operation. Now if you want to search for an element in a set which essentially stores elements using this hash table, you would obtain the element in O(1) time as the index of the element is computed using the modulo operation in a constant time. To expound on the modulo operation, let me also write some code:

piles = [2.83, 8.23, 9.38, 10.23, 25.58, 0.42, 5.37, 28.10, 32.14, 7.31]

def hash_function(x):
    return int(x*100 % 31)

[hash_function(pile) for pile in piles]

输出:[4,17,8,0,16,11,10,20,21,18]

2020-10-23 07:26:00

set()是如何实现的?

推荐文章

最新文章

标签