用MySQL计算中位数最简单(希望不会太慢)的方法是什么?我已经使用AVG(x)来寻找平均值,但我很难找到一个简单的方法来计算中位数。现在,我将所有的行返回到PHP,进行排序,然后选择中间的行,但是肯定有一些简单的方法可以在一个MySQL查询中完成它。

示例数据:

id | val
--------
 1    4
 2    7
 3    2
 4    2
 5    9
 6    8
 7    3

对val排序得到2 2 3 4 7 8 9,因此中位数应该是4,而SELECT AVG(val) == 5。


当前回答

上面的大多数解决方案只适用于表中的一个字段,您可能需要获得查询中多个字段的中位数(第50百分位数)。

我用这个:

SELECT CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(
 GROUP_CONCAT(field_name ORDER BY field_name SEPARATOR ','),
  ',', 50/100 * COUNT(*) + 1), ',', -1) AS DECIMAL) AS `Median`
FROM table_name;

你可以将上面例子中的“50”替换为任何百分位数,这是非常有效的。

只要确保你有足够的内存给GROUP_CONCAT,你可以改变它:

SET group_concat_max_len = 10485760; #10MB max length

详情:http://web.performancerasta.com/metrics-tips-calculating-95th-99th-or-any-percentile-with-single-mysql-query/

其他回答

我有一个包含大约10亿行的数据库,我们需要它来确定集合中的年龄中位数。对十亿行进行排序是困难的,但如果你将可以找到的不同值(年龄范围从0到100)聚合在一起,你可以对这个列表进行排序,并使用一些算术魔术来找到你想要的任何百分位数,如下所示:

with rawData(count_value) as
(
    select p.YEAR_OF_BIRTH
        from dbo.PERSON p
),
overallStats (avg_value, stdev_value, min_value, max_value, total) as
(
  select avg(1.0 * count_value) as avg_value,
    stdev(count_value) as stdev_value,
    min(count_value) as min_value,
    max(count_value) as max_value,
    count(*) as total
  from rawData
),
aggData (count_value, total, accumulated) as
(
  select count_value, 
    count(*) as total, 
        SUM(count(*)) OVER (ORDER BY count_value ROWS UNBOUNDED PRECEDING) as accumulated
  FROM rawData
  group by count_value
)
select o.total as count_value,
  o.min_value,
    o.max_value,
    o.avg_value,
    o.stdev_value,
    MIN(case when d.accumulated >= .50 * o.total then count_value else o.max_value end) as median_value,
    MIN(case when d.accumulated >= .10 * o.total then count_value else o.max_value end) as p10_value,
    MIN(case when d.accumulated >= .25 * o.total then count_value else o.max_value end) as p25_value,
    MIN(case when d.accumulated >= .75 * o.total then count_value else o.max_value end) as p75_value,
    MIN(case when d.accumulated >= .90 * o.total then count_value else o.max_value end) as p90_value
from aggData d
cross apply overallStats o
GROUP BY o.total, o.min_value, o.max_value, o.avg_value, o.stdev_value
;

这个查询取决于你的db支持窗口函数(包括ROWS UNBOUNDED precede),但如果你没有,这是一个简单的事情,将aggData CTE与自身连接,并将所有先前的总数聚合到' cumulative '列,用于确定哪个值包含指定的预分词。上面的示例计算p10、p25、p50(中位数)、p75和p90。

屁股的

我使用了两个查询方法:

第一个得到count, min, Max和avg 第二个语句(预处理语句)使用“LIMIT @count/ 2,1”和“ORDER BY ..”子句来获得中值

它们被包装在函数defn中,因此可以从一次调用中返回所有值。

如果您的范围是静态的,并且数据不经常更改,那么预先计算/存储这些值并使用存储的值,而不是每次都从头查询,可能会更有效。

MariaDB / MySQL:

SELECT AVG(dd.val) as median_val
FROM (
SELECT d.val, @rownum:=@rownum+1 as `row_number`, @total_rows:=@rownum
  FROM data d, (SELECT @rownum:=0) r
  WHERE d.val is NOT NULL
  -- put some where clause here
  ORDER BY d.val
) as dd
WHERE dd.row_number IN ( FLOOR((@total_rows+1)/2), FLOOR((@total_rows+2)/2) );

Steve Cohen指出,在第一次传递之后,@rownum将包含总行数。这可用于确定中值,因此不需要第二次传递或连接。

此外,AVG(dd.val)和dd.row_number IN(…)用于在有偶数条记录时正确地产生中位数。推理:

SELECT FLOOR((3+1)/2),FLOOR((3+2)/2); -- when total_rows is 3, avg rows 2 and 2
SELECT FLOOR((4+1)/2),FLOOR((4+2)/2); -- when total_rows is 4, avg rows 2 and 3

最后,MariaDB 10.3.3+包含一个MEDIAN函数

我建议一个更快的方法。

获取行数:

SELECT CEIL(COUNT(*)/2);

然后取排序子查询的中间值:

SELECT max(val) FROM (SELECT val FROM data ORDER BY val limit @middlevalue);

我用5x10e6的随机数数据集进行了测试,它将在10秒内找到中位数。

MySQL文档中这一页的注释有以下建议:

-- (mostly) High Performance scaling MEDIAN function per group
-- Median defined in http://en.wikipedia.org/wiki/Median
--
-- by Peter Hlavac
-- 06.11.2008
--
-- Example Table:

DROP table if exists table_median;
CREATE TABLE table_median (id INTEGER(11),val INTEGER(11));
COMMIT;


INSERT INTO table_median (id, val) VALUES
(1, 7), (1, 4), (1, 5), (1, 1), (1, 8), (1, 3), (1, 6),
(2, 4),
(3, 5), (3, 2),
(4, 5), (4, 12), (4, 1), (4, 7);



-- Calculating the MEDIAN
SELECT @a := 0;
SELECT
id,
AVG(val) AS MEDIAN
FROM (
SELECT
id,
val
FROM (
SELECT
-- Create an index n for every id
@a := (@a + 1) mod o.c AS shifted_n,
IF(@a mod o.c=0, o.c, @a) AS n,
o.id,
o.val,
-- the number of elements for every id
o.c
FROM (
SELECT
t_o.id,
val,
c
FROM
table_median t_o INNER JOIN
(SELECT
id,
COUNT(1) AS c
FROM
table_median
GROUP BY
id
) t2
ON (t2.id = t_o.id)
ORDER BY
t_o.id,val
) o
) a
WHERE
IF(
-- if there is an even number of elements
-- take the lower and the upper median
-- and use AVG(lower,upper)
c MOD 2 = 0,
n = c DIV 2 OR n = (c DIV 2)+1,

-- if its an odd number of elements
-- take the first if its only one element
-- or take the one in the middle
IF(
c = 1,
n = 1,
n = c DIV 2 + 1
)
)
) a
GROUP BY
id;

-- Explanation:
-- The Statement creates a helper table like
--
-- n id val count
-- ----------------
-- 1, 1, 1, 7
-- 2, 1, 3, 7
-- 3, 1, 4, 7
-- 4, 1, 5, 7
-- 5, 1, 6, 7
-- 6, 1, 7, 7
-- 7, 1, 8, 7
--
-- 1, 2, 4, 1

-- 1, 3, 2, 2
-- 2, 3, 5, 2
--
-- 1, 4, 1, 4
-- 2, 4, 5, 4
-- 3, 4, 7, 4
-- 4, 4, 12, 4


-- from there we can select the n-th element on the position: count div 2 + 1