可能的重复:
滚动中值算法
假设整数是从数据流中读取的。以有效的方式查找到目前为止读取的元素的中位数。
我读过的解决方案:我们可以在左边使用max堆来表示小于有效中位数的元素,在右边使用min堆来表示大于有效中位数的元素。
在处理一个传入元素后,堆中的元素数量最多相差1个元素。当两个堆包含相同数量的元素时,我们发现堆根数据的平均值为有效中位数。当堆不平衡时,我们从包含更多元素的堆根中选择有效中值。
但是我们如何构造最大堆和最小堆也就是说,我们如何知道这里的有效中值?我认为我们应该在max-heap中插入1个元素然后在min-heap中插入下一个元素,如此类推。如果我说错了请指正。
如果我们想要找到n个最近出现的元素的中值,这个问题有一个精确的解决方案,只需要将n个最近出现的元素保存在内存中。它速度快,规模大。
可索引skiplist支持O(ln n)插入、删除和任意元素的索引搜索,同时保持排序顺序。当再加上一个FIFO队列来跟踪第n个最古老的条目时,解决方案很简单:
class RunningMedian:
'Fast running median with O(lg n) updates where n is the window size'
def __init__(self, n, iterable):
self.it = iter(iterable)
self.queue = deque(islice(self.it, n))
self.skiplist = IndexableSkiplist(n)
for elem in self.queue:
self.skiplist.insert(elem)
def __iter__(self):
queue = self.queue
skiplist = self.skiplist
midpoint = len(queue) // 2
yield skiplist[midpoint]
for newelem in self.it:
oldelem = queue.popleft()
skiplist.remove(oldelem)
queue.append(newelem)
skiplist.insert(newelem)
yield skiplist[midpoint]
以下是完整工作代码的链接(一个易于理解的类版本和一个内联可索引的skiplist代码的优化生成器版本):
http://code.activestate.com/recipes/576930-efficient-running-median-using-an-indexable-skipli/
http://code.activestate.com/recipes/577073。
我可以确认@schmil-the-cat的答案是正确的。
下面是一个JS的实现。我不是算法专家,但我认为它可能对其他人有用。
class Heap {
constructor(isMin) {
this.heap = [];
this.isMin = isMin;
}
heapify() {
if (this.heap.length === 1) {
return;
}
let currentIndex = this.heap.length - 1;
while (true) {
if (currentIndex === 0) {
break;
}
const parentIndex = Math.floor((currentIndex - 1) / 2);
const parentValue = this.heap[parentIndex];
const currentValue = this.heap[currentIndex];
if (
(this.isMin && parentValue < currentValue) ||
(!this.isMin && parentValue > currentValue)
) {
break;
}
this.heap[parentIndex] = currentValue;
this.heap[currentIndex] = parentValue;
currentIndex = parentIndex;
}
}
insert(val) {
this.heap.push(val);
this.heapify();
}
pop() {
const val = this.heap.shift();
this.heapify();
return val;
}
top() {
return this.heap[0];
}
length() {
return this.heap.length;
}
}
function findMedian(arr) {
const topHeap = new Heap(true);
const bottomHeap = new Heap(false);
const output = [];
if (arr.length === 1) {
return arr[0];
}
topHeap.insert(Math.max(arr[0], arr[1]));
bottomHeap.insert(Math.min(arr[0], arr[1]));
for (let i = 0; i < arr.length; i++) {
const currentVal = arr[i];
if (i === 0) {
output.push(currentVal);
continue;
}
if (i > 1) {
if (currentVal < bottomHeap.top()) {
bottomHeap.insert(currentVal);
} else {
topHeap.insert(currentVal);
}
}
if (bottomHeap.length() - topHeap.length() > 1) {
const bottomVal = bottomHeap.pop();
topHeap.insert(bottomVal);
}
if (topHeap.length() - bottomHeap.length() > 1) {
const topVal = topHeap.pop();
bottomHeap.insert(topVal);
}
if (bottomHeap.length() === topHeap.length()) {
output.push(Math.floor((bottomHeap.top() + topHeap.top()) / 2));
continue;
}
if (bottomHeap.length() > topHeap.length()) {
output.push(bottomHeap.top());
} else {
output.push(topHeap.top());
}
}
return output;
}
如果我们想要找到n个最近出现的元素的中值,这个问题有一个精确的解决方案,只需要将n个最近出现的元素保存在内存中。它速度快,规模大。
可索引skiplist支持O(ln n)插入、删除和任意元素的索引搜索,同时保持排序顺序。当再加上一个FIFO队列来跟踪第n个最古老的条目时,解决方案很简单:
class RunningMedian:
'Fast running median with O(lg n) updates where n is the window size'
def __init__(self, n, iterable):
self.it = iter(iterable)
self.queue = deque(islice(self.it, n))
self.skiplist = IndexableSkiplist(n)
for elem in self.queue:
self.skiplist.insert(elem)
def __iter__(self):
queue = self.queue
skiplist = self.skiplist
midpoint = len(queue) // 2
yield skiplist[midpoint]
for newelem in self.it:
oldelem = queue.popleft()
skiplist.remove(oldelem)
queue.append(newelem)
skiplist.insert(newelem)
yield skiplist[midpoint]
以下是完整工作代码的链接(一个易于理解的类版本和一个内联可索引的skiplist代码的优化生成器版本):
http://code.activestate.com/recipes/576930-efficient-running-median-using-an-indexable-skipli/
http://code.activestate.com/recipes/577073。
我发现的最有效的计算流百分位数的方法是P²算法:Raj Jain, Imrich Chlamtac:不存储观测数据的动态计算分位数和直方图的P²算法。Commun。Acm 28(10): 1076-1085 (1985)
该算法易于实现,工作效果非常好。然而,这只是一个估计,所以要记住这一点。来自摘要:
A heuristic algorithm is proposed for dynamic calculation qf the median and other quantiles. The estimates are produced dynamically as the observations are generated. The observations are not stored; therefore, the algorithm has a very small and fixed storage requirement regardless of the number of observations. This makes it ideal for implementing in a quantile chip that can be used in industrial controllers and recorders. The algorithm is further extended to histogram plotting. The accuracy of the algorithm is analyzed.