Possible duplicate:
Rolling median algorithm
Suppose integers are read from a data stream. Find the median of the elements read so far in an efficient way.
Solution I have read: we can use a max-heap on the left side to hold the elements smaller than the effective median, and a min-heap on the right side to hold the elements larger than the effective median.
After processing an incoming element, the number of elements in the heaps differs by at most 1. When both heaps contain the same number of elements, the average of the two heap roots is the effective median. When the heaps are not balanced, we take the effective median from the root of the heap containing more elements.
But how do we construct the max-heap and the min-heap, i.e., how do we know the effective median here? I think we would insert one element into the max-heap and then the next element into the min-heap, and so on. Correct me if I am wrong.
If the variance of the input is statistically distributed (e.g., normal, log-normal, etc.), then reservoir sampling is a reasonable way of estimating percentiles/medians from an arbitrarily long stream of numbers.
int n = 0;               // Running count of elements observed so far
#define SIZE 10000
int reservoir[SIZE];

while (streamHasData())
{
    int x = readNumberFromStream();

    if (n < SIZE)
    {
        reservoir[n++] = x;
    }
    else
    {
        int p = random(++n); // Choose a random number 0 <= p < n
        if (p < SIZE)
        {
            reservoir[p] = x;
        }
    }
}
The "reservoir" is then a running, uniform (fair) sample of all of the input, regardless of its size. Finding the median (or any percentile) is then a simple matter of sorting the reservoir and polling the point of interest.
Since the reservoir is of fixed size, the sort can be considered effectively O(1), and this method runs with both constant time and constant memory consumption.
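As a quick illustration (not part of the original answer, and the function names here are made up), reading a percentile off the sampled reservoir could look like this in Python:

# Sketch: reading the median (or any percentile) off a filled reservoir.
# `reservoir` stands in for the sampled array built by the loop above.
def percentile_from_reservoir(reservoir, q):
    """Estimate the q-th percentile (0 <= q <= 100) from the sample."""
    ordered = sorted(reservoir)                 # fixed-size sort: effectively O(1)
    idx = min(len(ordered) - 1, int(q / 100.0 * len(ordered)))
    return ordered[idx]

def median_from_reservoir(reservoir):
    return percentile_from_reservoir(reservoir, 50)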
There are many different solutions for finding the running median from streamed data; I will talk about them briefly at the very end of the answer.
The question is about the details of a specific solution (the max-heap/min-heap solution), and an explanation of how the heap-based solution works follows:
For the first two elements, add the smaller one to the maxHeap on the left and the larger one to the minHeap on the right. Then process the stream data one by one:
Step 1: Add next item to one of the heaps

   if next item is smaller than maxHeap root add it to maxHeap,
   else add it to minHeap

Step 2: Balance the heaps (after this step heaps will be either balanced or
   one of them will contain 1 more item)

   if number of elements in one of the heaps is greater than the other by
   more than 1, remove the root element from the one containing more elements and
   add to the other one
Then at any given time you can calculate the median like this:
If the heaps contain equal amount of elements;
  median = (root of maxHeap + root of minHeap) / 2
Else
  median = root of the heap with more elements
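Here is a minimal Python sketch of the two-heap scheme described above, using heapq (which only provides a min-heap, so the smaller half is stored negated to behave as a max-heap); the class and method names are my own, not from the original answer:

import heapq

class StreamMedian:
    """Running median via the max-heap/min-heap scheme described above."""

    def __init__(self):
        self.left = []    # max-heap of the smaller half (values stored negated)
        self.right = []   # min-heap of the larger half

    def add(self, x):
        # Step 1: add the item to one of the heaps.
        if self.left and x <= -self.left[0]:
            heapq.heappush(self.left, -x)
        else:
            heapq.heappush(self.right, x)
        # Step 2: rebalance so the heap sizes differ by at most one.
        if len(self.left) > len(self.right) + 1:
            heapq.heappush(self.right, -heapq.heappop(self.left))
        elif len(self.right) > len(self.left) + 1:
            heapq.heappush(self.left, -heapq.heappop(self.right))

    def median(self):
        if len(self.left) == len(self.right):
            return (-self.left[0] + self.right[0]) / 2.0
        if len(self.left) > len(self.right):
            return -self.left[0]
        return self.right[0]

For example, feeding it 5, 15, 1, 3 produces the running medians 5, 10, 5, 4.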
Now I will talk about the problem in general, as promised at the beginning of the answer. Finding the running median from a stream of data is a tough problem, and finding an exact solution with memory constraints efficiently is probably impossible for the general case. On the other hand, if the data has some characteristics we can exploit, we can develop efficient specialized solutions. For example, if we know that the data is an integral type, then we can use counting sort, which can give you a constant-memory, constant-time algorithm. The heap-based solution is more general because it can be used for other data types (doubles) as well. And finally, if the exact median is not required and an approximation is enough, you can just try to estimate a probability density function for the data and estimate the median using that.
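To illustrate the counting-sort idea mentioned above, here is a rough sketch that assumes the integers are known to lie in a fixed range (0..255 here, an assumption not made in the original text); memory and per-element time are then constant with respect to the stream length (linear in the value range):

class HistogramMedian:
    """Running median for integers known to lie in [0, max_value]."""

    def __init__(self, max_value=255):       # assumed value range, for illustration
        self.counts = [0] * (max_value + 1)  # frequency of each possible value
        self.total = 0

    def add(self, x):
        self.counts[x] += 1
        self.total += 1

    def median(self):
        # Walk the histogram until the cumulative count covers the middle.
        lower_target = (self.total - 1) // 2  # rank of the lower middle element
        upper_target = self.total // 2        # rank of the upper middle element
        seen = 0
        lower = None
        for value, c in enumerate(self.counts):
            if c == 0:
                continue
            if lower is None and seen + c > lower_target:
                lower = value
            if seen + c > upper_target:
                return (lower + value) / 2.0
            seen += c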
Below is my simple but efficient algorithm (in C++) for computing the running median from a stream of integers:
#include <algorithm>
#include <fstream>
#include <list>
#include <stdexcept>
#include <vector>

using namespace std;

void runningMedian(std::ifstream& ifs, std::ofstream& ofs, const unsigned bufSize) {
    if (bufSize < 1)
        throw runtime_error("Wrong buffer size.");
    bool evenSize = bufSize % 2 == 0;
    list<int> q;        // FIFO of the last bufSize values, in arrival order
    vector<int> nums;   // the same values, kept in sorted order
    int n;
    unsigned count = 0;
    while (ifs >> n) {
        q.push_back(n);
        auto ub = std::upper_bound(nums.begin(), nums.end(), n);
        nums.insert(ub, n);
        count++;
        if (nums.size() > bufSize) {
            // Drop (an element equal to) the value that fell out of the window.
            auto it = std::find(nums.begin(), nums.end(), q.front());
            nums.erase(it);
            q.pop_front();
        }
        if (nums.size() == bufSize) {
            if (evenSize)
                ofs << count << ": " << (static_cast<double>(nums[bufSize / 2 - 1]) +
                                         static_cast<double>(nums[bufSize / 2])) / 2.0 << '\n';
            else
                ofs << count << ": " << static_cast<double>(nums[bufSize / 2]) << '\n';
        }
    }
}
The bufSize parameter specifies the size of the sequence of numbers on which the running median must be calculated. While reading numbers from the input stream ifs, a vector of size bufSize is maintained in sorted order. The median is calculated by taking the middle of the sorted vector if bufSize is odd, or the sum of the two middle elements divided by 2 when bufSize is even. Additionally, I maintain a list of the last bufSize elements read from the input. When a new element is added, I put it in the right place in the sorted vector and remove from the vector the element added bufSize steps before (the value retained at the front of the list). At the same time I update the list: every new element is placed at the back of the list, and the oldest element is removed from the front. After reaching bufSize, both the list and the vector stop growing, and every insertion of a new element is compensated by the deletion of the old element placed in the list bufSize steps before. Note that I do not care whether I remove from the vector exactly the element placed bufSize steps before, or just an element with the same value; for the value of the median it does not matter.
All calculated median values are written to the output stream ofs.
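The same bookkeeping can be sketched more compactly in Python (a rough equivalent of the C++ above, not part of the original answer), with bisect.insort playing the role of upper_bound plus insert, and a deque serving as the FIFO of recent values:

from bisect import insort, bisect_left
from collections import deque

def running_median(stream, buf_size):
    """Yield the median of the last buf_size values seen in `stream`."""
    window = deque()      # last buf_size values, in arrival order
    ordered = []          # the same values, kept sorted
    for x in stream:
        window.append(x)
        insort(ordered, x)
        if len(ordered) > buf_size:
            # Drop (an element equal to) the value that fell out of the window.
            del ordered[bisect_left(ordered, window.popleft())]
        if len(ordered) == buf_size:
            mid = buf_size // 2
            if buf_size % 2 == 0:
                yield (ordered[mid - 1] + ordered[mid]) / 2.0
            else:
                yield ordered[mid]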
If we want the median of the n most recently seen elements, this problem has an exact solution that only needs to keep the n most recent elements in memory. It is fast and scales well.
An indexable skiplist supports O(ln n) insertion, removal, and indexed lookup of arbitrary elements while maintaining sorted order. When coupled with a FIFO queue that tracks the n-th oldest entry, the solution is simple:
from collections import deque
from itertools import islice
# IndexableSkiplist comes from the first recipe linked below.

class RunningMedian:
    'Fast running median with O(lg n) updates where n is the window size'

    def __init__(self, n, iterable):
        self.it = iter(iterable)
        self.queue = deque(islice(self.it, n))
        self.skiplist = IndexableSkiplist(n)
        for elem in self.queue:
            self.skiplist.insert(elem)

    def __iter__(self):
        queue = self.queue
        skiplist = self.skiplist
        midpoint = len(queue) // 2
        yield skiplist[midpoint]
        for newelem in self.it:
            oldelem = queue.popleft()
            skiplist.remove(oldelem)
            queue.append(newelem)
            skiplist.insert(newelem)
            yield skiplist[midpoint]
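A short usage sketch (assuming IndexableSkiplist has been pulled in from the first recipe linked below; the input values are made up):

# Assumes IndexableSkiplist from the linked ActiveState recipe is available.
values = [5, 1, 9, 4, 7, 3, 8, 2, 6]
for m in RunningMedian(3, values):
    print(m)   # prints 5, 4, 7, 4, 7, 3, 6 -- median of each window of 3 values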
Here are links to complete working code (an easy-to-understand class version and an optimized generator version with the indexable skiplist code inlined):
http://code.activestate.com/recipes/576930-efficient-running-median-using-an-indexable-skipli/
http://code.activestate.com/recipes/577073