Find running median from a stream of integers

Possible duplicate: Rolling median algorithm

Given that integers are read from a data stream, find the median of the elements read so far in an efficient way.

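The solution I have read: we can use a max-heap on the left side to hold the elements that are less than the effective median, and a min-heap on the right side to hold the elements that are greater than the effective median.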

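After processing an incoming element, the number of elements in the two heaps differs by at most 1. When both heaps contain the same number of elements, we take the average of the two heap roots as the effective median. When the heaps are not balanced, we take the effective median from the root of the heap that contains more elements.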

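But how would we construct the max-heap and the min-heap, i.e. how would we know the effective median here? I think we would insert 1 element into the max-heap, then the next element into the min-heap, and so on for all the elements. Correct me if I am wrong here.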

Current answer

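If you cannot hold all the items in memory at once, this problem becomes much harder. The heap solution requires you to hold all the elements in memory at once. That is not possible in most real-world applications of this problem.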

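Instead, as you see numbers, keep track of how many times you have seen each integer. Assuming 4-byte integers, that is 2^32 buckets, or at most 2^33 integers (a key and a count for each int), which is 2^35 bytes or 32 GB. It will likely be much less than this, because you do not need to store the key or the count for entries whose count is 0 (e.g. like a defaultdict in Python). Inserting each new integer takes constant time.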

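Then at any point, to find the median, just use the counts to determine which integer is the middle element. This takes constant time (albeit with a large constant, but constant nonetheless).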
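The counting idea can be sketched in Python as follows. This is a minimal sketch, not the answerer's code: the class name `CountingMedian` and its methods are made up for illustration, and it assumes the stream holds integers. It sorts the distinct keys on each query, so the query cost depends on the number of distinct values (bounded by the integer range), not on the length of the stream.

```python
from collections import defaultdict


class CountingMedian:
    """Hypothetical helper: keeps one counter per distinct integer seen."""

    def __init__(self):
        self._counts = defaultdict(int)  # value -> number of occurrences
        self._total = 0                  # number of integers read so far

    def add(self, x):
        # Constant-time insert: bump the counter for this value.
        self._counts[x] += 1
        self._total += 1

    def median(self):
        if self._total == 0:
            raise ValueError("no data")
        # 0-based ranks of the middle element(s); equal when the count is odd.
        lo_rank = (self._total - 1) // 2
        hi_rank = self._total // 2
        seen = 0
        lo_val = None
        # Walk the distinct values in increasing order, accumulating counts.
        for value in sorted(self._counts):
            seen += self._counts[value]
            if lo_val is None and seen > lo_rank:
                lo_val = value
            if seen > hi_rank:
                if lo_val == value:
                    return value
                return (lo_val + value) / 2


cm = CountingMedian()
for x in [2, 2, 7, 9]:
    cm.add(x)
print(cm.median())  # prints 4.5
```

With a fixed array of buckets covering the whole integer range, the sort would be unnecessary, because the buckets are already in order; the dictionary variant trades that for lower memory use on sparse data.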

2012-05-21 21:19:09

Other answers

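"Efficient" is a word that depends on context. The solution to this problem depends on the number of queries performed relative to the number of insertions. Suppose you insert N numbers and, toward the end, ask for the median K times. The heap-based algorithm's complexity would be O(N log N + K).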

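Consider the following alternative. Put the numbers in an array, and for each query run a linear-time selection algorithm (quickselect, say, which uses quicksort-style pivoting). Now you have an algorithm with running time O(K N).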

Now if K is sufficiently small (infrequent queries), the latter algorithm is actually more efficient, and vice versa.
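A rough sketch of the array-plus-selection alternative, assuming a randomized quickselect (expected linear time, not worst-case linear). The function names `quickselect` and `median_of` are illustrative, not from the answer.

```python
import random


def quickselect(items, k):
    """Return the k-th smallest element (0-based) in expected O(len(items)) time."""
    items = list(items)  # work on a copy so the caller's list is untouched
    while True:
        pivot = random.choice(items)
        lows = [x for x in items if x < pivot]
        pivots = [x for x in items if x == pivot]
        if k < len(lows):
            items = lows                      # answer lies left of the pivot
        elif k < len(lows) + len(pivots):
            return pivot                      # answer is the pivot itself
        else:
            k -= len(lows) + len(pivots)      # answer lies right of the pivot
            items = [x for x in items if x > pivot]


def median_of(items):
    """Median of everything stored so far; one or two selections per query."""
    n = len(items)
    if n == 0:
        raise ValueError("no data")
    if n % 2 == 1:
        return quickselect(items, n // 2)
    return (quickselect(items, n // 2 - 1) + quickselect(items, n // 2)) / 2


stored = []
for x in [5, 15, 1, 3]:
    stored.append(x)      # O(1) insertion, no ordering maintained
print(median_of(stored))  # prints 4.0
```

Insertion is just an append here, so all the cost is pushed onto the queries, which is the trade-off the answer describes.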

2012-05-21 20:50:04

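There are a number of different solutions for finding the running median of streamed data; I will talk about them briefly at the very end of the answer.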

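The question is about the details of one specific solution (the max-heap/min-heap solution), and how the heap-based solution works is explained below: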

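For the first two elements, add the smaller one to the maxHeap on the left and the bigger one to the minHeap on the right. Then process the stream data one by one: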

Step 1: Add next item to one of the heaps

   if next item is smaller than maxHeap root add it to maxHeap,
   else add it to minHeap

Step 2: Balance the heaps (after this step heaps will be either balanced or
   one of them will contain 1 more item)

   if number of elements in one of the heaps is greater than the other by
   more than 1, remove the root element from the one containing more elements and
   add to the other one

Then at any given time you can calculate the median like this:

   If the heaps contain equal amount of elements;
     median = (root of maxHeap + root of minHeap)/2
   Else
     median = root of the heap with more elements
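The steps above can be sketched in Python roughly as follows. This is one possible rendering, not the answerer's code: the class name `RunningMedian` is made up, and because Python's `heapq` only provides a min-heap, the max-heap is simulated by storing negated values. The first two elements need no special case here, since the rebalancing step sorts them out.

```python
import heapq


class RunningMedian:
    """Hypothetical helper implementing the two-heap scheme."""

    def __init__(self):
        self._max_heap = []  # smaller half, stored negated (root = largest)
        self._min_heap = []  # larger half (root = smallest)

    def add(self, x):
        # Step 1: put the item into one of the heaps.
        if self._max_heap and x < -self._max_heap[0]:
            heapq.heappush(self._max_heap, -x)
        else:
            heapq.heappush(self._min_heap, x)
        # Step 2: rebalance so the sizes differ by at most 1.
        if len(self._max_heap) > len(self._min_heap) + 1:
            heapq.heappush(self._min_heap, -heapq.heappop(self._max_heap))
        elif len(self._min_heap) > len(self._max_heap) + 1:
            heapq.heappush(self._max_heap, -heapq.heappop(self._min_heap))

    def median(self):
        if not self._max_heap and not self._min_heap:
            raise ValueError("no data")
        if len(self._max_heap) == len(self._min_heap):
            return (-self._max_heap[0] + self._min_heap[0]) / 2
        if len(self._max_heap) > len(self._min_heap):
            return -self._max_heap[0]
        return self._min_heap[0]


rm = RunningMedian()
medians = []
for x in [5, 15, 1, 3]:
    rm.add(x)
    medians.append(rm.median())
print(medians)  # prints [5, 10.0, 5, 4.0]
```

Each insertion costs O(log n) for the heap operations, and each median query reads only the heap roots, so it is O(1).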

Now I will talk about the problem in general as promised in the beginning of the answer. Finding running median from a stream of data is a tough problem, and finding an exact solution with memory constraints efficiently is probably impossible for the general case. On the other hand, if the data has some characteristics we can exploit, we can develop efficient specialized solutions. For example, if we know that the data is an integral type, then we can use counting sort, which can give you a constant memory constant time algorithm. Heap based solution is a more general solution because it can be used for other data types (doubles) as well. And finally, if the exact median is not required and an approximation is enough, you can just try to estimate a probability density function for the data and estimate median using that.

2012-05-18 18:15:42

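The most efficient way I have found to calculate a percentile of a stream is the P² algorithm: Raj Jain, Imrich Chlamtac: The P² Algorithm for Dynamic Calculation of Quantiles and Histograms Without Storing Observations. Commun. ACM 28(10): 1076-1085 (1985)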

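The algorithm is straightforward to implement and works extremely well. It is only an estimate, however, so keep that in mind. From the abstract: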

A heuristic algorithm is proposed for dynamic calculation of the median and other quantiles. The estimates are produced dynamically as the observations are generated. The observations are not stored; therefore, the algorithm has a very small and fixed storage requirement regardless of the number of observations. This makes it ideal for implementing in a quantile chip that can be used in industrial controllers and recorders. The algorithm is further extended to histogram plotting. The accuracy of the algorithm is analyzed.

2012-05-21 23:14:09


Can't you do this with just one heap? Update: no. See the comments.

Invariant: after reading 2*n inputs, the min-heap holds the n largest of them.

Loop: read 2 inputs. Add them both to the heap, and remove the heap's min. This reestablishes the invariant.

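So when 2n inputs have been read, the heap's min is the nth largest. A little extra complication would be needed to average the two elements around the median position and to handle queries after an odd number of inputs.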
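A small experiment suggests one reason the single-heap idea breaks down: an element discarded early can later turn out to belong to the larger half. The sketch below follows the loop as described; the function name `single_heap_state` and the sample stream are chosen for illustration, and this reading is an assumption, since the comments referred to are not reproduced here.

```python
import heapq


def single_heap_state(stream):
    """Run the proposed loop: read 2 inputs, push both, pop the heap's min."""
    heap = []
    for i in range(0, len(stream) - 1, 2):
        heapq.heappush(heap, stream[i])
        heapq.heappush(heap, stream[i + 1])
        heapq.heappop(heap)  # discard the current minimum for good
    return sorted(heap)


stream = [10, 9, 1, 2]
kept = single_heap_state(stream)
largest_half = sorted(stream)[len(stream) // 2:]

print(kept)          # prints [2, 10]
print(largest_half)  # prints [9, 10]
```

After the first pair, 9 is thrown away, but once 1 and 2 arrive it should be back in the larger half, so the claimed invariant no longer holds for this input.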

2012-05-21 21:12:22
