Possible duplicate: Rolling median algorithm








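#include <algorithm>
#include <fstream>
#include <list>
#include <stdexcept>
#include <vector>

using namespace std;

// Reads integers from ifs and writes the median of the last bufSize values to ofs.
void runningMedian(std::ifstream& ifs, std::ofstream& ofs, const unsigned bufSize) {
    if (bufSize < 1)
        throw invalid_argument("Wrong buffer size.");
    const bool evenSize = bufSize % 2 == 0;
    list<int> q;       // last bufSize values, oldest at the front
    vector<int> nums;  // the same values, kept sorted
    int n;
    unsigned count = 0;
    while (ifs >> n) {  // stops cleanly at end of input or on a parse error
        if (nums.size() >= bufSize) {
            // Drop the oldest value; any element with that value will do.
            auto it = std::find(nums.begin(), nums.end(), q.front());
            nums.erase(it);
            q.pop_front();
        }
        auto ub = std::upper_bound(nums.begin(), nums.end(), n);
        nums.insert(ub, n);
        q.push_back(n);
        if (nums.size() >= bufSize) {
            if (evenSize)
                ofs << count << ": " << (static_cast<double>(nums[nums.size() / 2 - 1]) +
                    static_cast<double>(nums[nums.size() / 2])) / 2.0 << '\n';
            else
                ofs << count << ": " << static_cast<double>(nums[nums.size() / 2]) << '\n';
            ++count;
        }
    }
}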

bufSize specifies the length of the window over which the running median is calculated. While numbers are read from the input stream ifs, a vector holding the last bufSize numbers is maintained in sorted order. The median is the middle element of the sorted vector if bufSize is odd, or the sum of the two middle elements divided by 2 if bufSize is even. Additionally, I maintain a list of the last bufSize elements read from the input, in arrival order. When a new element arrives, I insert it at the right place in the sorted vector and remove from the vector the element that was added bufSize steps earlier (its value is at the front of the list). At the same time I update the list: every new element is placed at the back, and every old element is removed from the front. Once bufSize elements have been read, both the list and the vector stop growing, and every insertion of a new element is compensated by the deletion of the element that entered the list bufSize steps earlier. Note that I do not care whether I remove from the vector exactly the element added bufSize steps earlier or just an element with the same value; for the median it does not matter. All calculated median values are written to the output stream.





Step 1: Add next item to one of the heaps

   if next item is smaller than maxHeap root add it to maxHeap,
   else add it to minHeap

Step 2: Balance the heaps (after this step heaps will be either balanced or
   one of them will contain 1 more item)

   if number of elements in one of the heaps is greater than the other by
   more than 1, remove the root element from the one containing more elements and
   add to the other one


Step 3: Calculate the median

   if the heaps contain an equal number of elements,
     median = (root of maxHeap + root of minHeap)/2
   else
     median = root of the heap with more elements
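
The steps above can be sketched in C++ roughly as follows. This is only an illustration under a few assumptions: the class name RunningMedian and its member names are made up here, values are treated as double, and the very first value goes to the max-heap because there is no root to compare against yet.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical helper (name chosen for illustration): median of all values seen so far.
class RunningMedian {
public:
    void add(double x) {
        // Step 1: values below the max-heap root go to the lower half.
        if (lower_.empty() || x < lower_.top())
            lower_.push(x);
        else
            upper_.push(x);
        // Step 2: rebalance so the heap sizes differ by at most 1.
        if (lower_.size() > upper_.size() + 1) {
            upper_.push(lower_.top());
            lower_.pop();
        } else if (upper_.size() > lower_.size() + 1) {
            lower_.push(upper_.top());
            upper_.pop();
        }
    }

    // Step 3: read the median off the heap roots.
    double median() const {
        assert(!lower_.empty() || !upper_.empty());  // at least one value was added
        if (lower_.size() == upper_.size())
            return (lower_.top() + upper_.top()) / 2.0;
        return lower_.size() > upper_.size() ? lower_.top() : upper_.top();
    }

private:
    std::priority_queue<double> lower_;  // max-heap holding the smaller half
    std::priority_queue<double, std::vector<double>, std::greater<double>> upper_;  // min-heap holding the larger half
};
```

Feeding it 5, 1, 3, 8 one at a time should give the medians 5, 3, 3, 4. Each add costs O(log n) and each median query O(1).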

Now I will talk about the problem in general, as promised at the beginning of the answer. Finding the running median of a data stream is a tough problem, and finding an exact solution efficiently under memory constraints is probably impossible in the general case. On the other hand, if the data has some characteristics we can exploit, we can develop efficient specialized solutions. For example, if we know that the data is of an integral type (with a bounded range of values), then we can use counting sort, which gives a constant-memory, constant-time algorithm. The heap-based solution is more general because it can be used for other data types (doubles) as well. And finally, if the exact median is not required and an approximation is enough, you can try to estimate a probability density function for the data and estimate the median from that.
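
To make the counting idea concrete, here is a minimal sketch. It assumes the values are integers in a known range 0..maxValue; the class name CountingMedian and that range are assumptions for illustration, not part of the answer. Memory depends only on the range, and a query scans the counts rather than the data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: exact median of integers known to lie in [0, maxValue].
class CountingMedian {
public:
    explicit CountingMedian(int maxValue)
        : counts_(static_cast<std::size_t>(maxValue) + 1, 0) {}

    void add(int x) {
        assert(x >= 0 && static_cast<std::size_t>(x) < counts_.size());
        ++counts_[x];
        ++total_;
    }

    double median() const {
        assert(total_ > 0);
        // 0-based ranks of the middle element(s); they coincide when total_ is odd.
        const std::size_t loRank = (total_ - 1) / 2;
        const std::size_t hiRank = total_ / 2;
        int lo = -1, hi = -1;
        std::size_t seen = 0;  // how many values are <= the current v
        for (std::size_t v = 0; v < counts_.size(); ++v) {
            seen += counts_[v];
            if (lo < 0 && seen > loRank) lo = static_cast<int>(v);
            if (seen > hiRank) { hi = static_cast<int>(v); break; }
        }
        return (lo + hi) / 2.0;
    }

private:
    std::vector<std::size_t> counts_;  // counts_[v] = occurrences of value v
    std::size_t total_ = 0;
};
```

For the values 3, 1, 2 the median is 2; after adding 4 it becomes 2.5. The trade-off is that the cost of a query grows with the size of the value range, so this only pays off when that range is small compared to the stream.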

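The most efficient method I have found for computing percentiles of a stream is the P² algorithm: Raj Jain, Imrich Chlamtac: The P² Algorithm for Dynamic Calculation of Quantiles and Histograms Without Storing Observations. Commun. ACM 28(10): 1076-1085 (1985)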


A heuristic algorithm is proposed for dynamic calculation of the median and other quantiles. The estimates are produced dynamically as the observations are generated. The observations are not stored; therefore, the algorithm has a very small and fixed storage requirement regardless of the number of observations. This makes it ideal for implementing in a quantile chip that can be used in industrial controllers and recorders. The algorithm is further extended to histogram plotting. The accuracy of the algorithm is analyzed.



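class Heap {
  constructor(isMin) {
    this.heap = [];
    this.isMin = isMin;
  }

  // Sift the last element up until its parent is in heap order.
  heapify() {
    if (this.heap.length === 1) {
      return;
    }

    let currentIndex = this.heap.length - 1;

    while (true) {
      if (currentIndex === 0) {
        break;
      }

      const parentIndex = Math.floor((currentIndex - 1) / 2);
      const parentValue = this.heap[parentIndex];
      const currentValue = this.heap[currentIndex];

      if (
        (this.isMin && parentValue < currentValue) ||
        (!this.isMin && parentValue > currentValue)
      ) {
        break;
      }

      this.heap[parentIndex] = currentValue;
      this.heap[currentIndex] = parentValue;

      currentIndex = parentIndex;
    }
  }

  // Sift the root down until both children are in heap order.
  siftDown() {
    let i = 0;
    while (true) {
      const left = 2 * i + 1;
      const right = left + 1;
      let best = i;
      if (left < this.heap.length && this.outranks(left, best)) best = left;
      if (right < this.heap.length && this.outranks(right, best)) best = right;
      if (best === i) break;
      [this.heap[i], this.heap[best]] = [this.heap[best], this.heap[i]];
      i = best;
    }
  }

  // True when the element at index a belongs above the element at index b.
  outranks(a, b) {
    return this.isMin ? this.heap[a] < this.heap[b] : this.heap[a] > this.heap[b];
  }

  insert(val) {
    this.heap.push(val);
    this.heapify();
  }

  pop() {
    // A plain shift() would break the heap order, so move the last leaf
    // to the root and sift it down instead.
    const val = this.heap[0];
    const last = this.heap.pop();
    if (this.heap.length > 0) {
      this.heap[0] = last;
      this.siftDown();
    }
    return val;
  }

  top() {
    return this.heap[0];
  }

  length() {
    return this.heap.length;
  }
}

// Returns the running median after each element of arr
// (rounded down when the count is even, as in the original).
function findMedian(arr) {
  const topHeap = new Heap(true); // min-heap holding the larger half
  const bottomHeap = new Heap(false); // max-heap holding the smaller half

  const output = [];

  if (arr.length === 1) {
    return [arr[0]]; // keep the return type consistent with the general case
  }

  topHeap.insert(Math.max(arr[0], arr[1]));
  bottomHeap.insert(Math.min(arr[0], arr[1]));

  for (let i = 0; i < arr.length; i++) {
    const currentVal = arr[i];

    if (i === 0) {
      // The median of a single element is the element itself.
      output.push(currentVal);
      continue;
    }

    if (i > 1) {
      // The first two elements were already placed before the loop.
      if (currentVal < bottomHeap.top()) {
        bottomHeap.insert(currentVal);
      } else {
        topHeap.insert(currentVal);
      }
    }

    if (bottomHeap.length() - topHeap.length() > 1) {
      const bottomVal = bottomHeap.pop();
      topHeap.insert(bottomVal);
    }

    if (topHeap.length() - bottomHeap.length() > 1) {
      const topVal = topHeap.pop();
      bottomHeap.insert(topVal);
    }

    if (bottomHeap.length() === topHeap.length()) {
      output.push(Math.floor((bottomHeap.top() + topHeap.top()) / 2));
      continue;
    }

    if (bottomHeap.length() > topHeap.length()) {
      output.push(bottomHeap.top());
    } else {
      output.push(topHeap.top());
    }
  }

  return output;
}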
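The word "efficient" depends on context. The right solution to this problem depends on how many queries are performed relative to the number of insertions. Suppose you insert N numbers and, towards the end, ask for the median K times. The heap-based algorithm then has complexity O(N log N + K).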

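Consider the following alternative. Put the numbers in an array and, for each query, run a linear-time selection algorithm (for example, one based on the quicksort pivot). Now you have an algorithm with running time O(K N).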
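
As a rough sketch of the per-query selection step, std::nth_element from the C++ standard library can stand in for the selection algorithm; it runs in linear time on average. The function name medianBySelection is made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: median of an unsorted array via selection, no full sort.
// The vector is taken by value because std::nth_element reorders its input.
double medianBySelection(std::vector<double> values) {
    assert(!values.empty());
    const std::size_t mid = values.size() / 2;
    // After this call values[mid] is the element a full sort would put there,
    // and everything before it is less than or equal to it.
    std::nth_element(values.begin(), values.begin() + mid, values.end());
    const double upper = values[mid];
    if (values.size() % 2 == 1)
        return upper;
    // Even count: the lower middle is the largest element of the first half.
    const double lower = *std::max_element(values.begin(), values.begin() + mid);
    return (lower + upper) / 2.0;
}
```

For example, medianBySelection({5, 1, 4, 2, 3}) gives 3 and medianBySelection({4, 1, 3, 2}) gives 2.5. Insertions stay O(1) (append to the array), so this trades cheap inserts for a linear cost per query.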



int n = 0;  // Running count of elements observed so far
#define SIZE 10000
int reservoir[SIZE];

while (streamHasData())  // placeholder for the stream's end-of-data check
{
  int x = readNumberFromStream();

  if (n < SIZE)
  {
       reservoir[n++] = x;
  }
  else
  {
      int p = random(++n); // Choose a random number 0 <= p < n
      if (p < SIZE)
      {
           reservoir[p] = x;
      }
  }
}


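Because the reservoir has a fixed size, sorting it can be considered effectively O(1), so this method runs in constant time per element and with constant memory.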
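
Here is a self-contained C++ sketch of the same idea, including the step of sorting the reservoir to read off the estimate. The class name ReservoirMedian, the fixed default seed, and the use of std::mt19937 are choices made for this example only; the result is an estimate once more values than the capacity have been seen.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: approximate running median from a fixed-size random sample.
class ReservoirMedian {
public:
    explicit ReservoirMedian(std::size_t capacity, unsigned seed = 42)
        : capacity_(capacity), rng_(seed) {
        assert(capacity > 0);
        reservoir_.reserve(capacity);
    }

    void add(int x) {
        ++seen_;
        if (reservoir_.size() < capacity_) {
            reservoir_.push_back(x);  // still filling the reservoir
            return;
        }
        // Pick p uniformly in [0, seen_); the new value replaces slot p
        // only if p falls inside the reservoir.
        std::uniform_int_distribution<std::size_t> pick(0, seen_ - 1);
        const std::size_t p = pick(rng_);
        if (p < capacity_)
            reservoir_[p] = x;
    }

    // Median of the current sample; exact while seen_ <= capacity_.
    double estimate() const {
        assert(!reservoir_.empty());
        std::vector<int> sorted(reservoir_);
        std::sort(sorted.begin(), sorted.end());
        const std::size_t mid = sorted.size() / 2;
        if (sorted.size() % 2 == 1)
            return sorted[mid];
        return (static_cast<double>(sorted[mid - 1]) + sorted[mid]) / 2.0;
    }

private:
    std::size_t capacity_;
    std::size_t seen_ = 0;  // number of values observed so far
    std::vector<int> reservoir_;
    std::mt19937 rng_;
};
```

With a capacity of 10 and the inputs 5, 1, 4, 2, 3 the reservoir still holds every value, so estimate() returns the exact median 3. On longer streams the accuracy depends on the reservoir size.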