根据MSDN, Median在Transact-SQL中不能作为聚合函数使用。但是,我想知道是否可以创建此功能(使用create Aggregate函数、用户定义函数或其他方法)。

最好的方法(如果可能的话)是什么——允许在聚合查询中计算中值(假设是数值数据类型)?


当前回答

以下是我的解决方案:

with tempa as

 (

    select value,row_number() over (order by value) as Rn,/* Assigning a 
                                                           row_number */
           count(value) over () as Cnt /*Taking total count of the values */
    from numbers
    where value is not null /* Excluding the null values */
 ),

tempb as

  (

    /* Since we don't know whether the number of rows is odd or even, we shall 
     consider both the scenarios */

    select round(cnt/2) as Ref from tempa where mod(cnt,2)=1
    union all
    select round(cnt/2) a Ref from tempa where mod(cnt,2)=0
     union all
    select round(cnt/2) + 1 as Ref from tempa where mod(cnt,2)=0
   )
  select avg(value) as Median_Value

  from tempa where rn in

    ( select Ref from tempb);

其他回答

虽然Justin grant的解决方案看起来很可靠,但我发现当您在给定的分区键中有许多重复值时,ASC重复值的行号最终会不按顺序排列,因此它们不能正确对齐。

以下是我的研究结果的一个片段:

KEY VALUE ROWA ROWD  

13  2     22   182
13  1     6    183
13  1     7    184
13  1     8    185
13  1     9    186
13  1     10   187
13  1     11   188
13  1     12   189
13  0     1    190
13  0     2    191
13  0     3    192
13  0     4    193
13  0     5    194

我使用Justin的代码作为这个解决方案的基础。尽管考虑到使用多个派生表效率不高,但它确实解决了我遇到的行排序问题。任何改进都会受到欢迎,因为我在T-SQL方面不是那么有经验。

SELECT PKEY, cast(AVG(VALUE)as decimal(5,2)) as MEDIANVALUE
FROM
(
  SELECT PKEY,VALUE,ROWA,ROWD,
  'FLAG' = (CASE WHEN ROWA IN (ROWD,ROWD-1,ROWD+1) THEN 1 ELSE 0 END)
  FROM
  (
    SELECT
    PKEY,
    cast(VALUE as decimal(5,2)) as VALUE,
    ROWA,
    ROW_NUMBER() OVER (PARTITION BY PKEY ORDER BY ROWA DESC) as ROWD 

    FROM
    (
      SELECT
      PKEY, 
      VALUE,
      ROW_NUMBER() OVER (PARTITION BY PKEY ORDER BY VALUE ASC,PKEY ASC ) as ROWA 
      FROM [MTEST]
    )T1
  )T2
)T3
WHERE FLAG = '1'
GROUP BY PKEY
ORDER BY PKEY

我只是在寻找一个基于集的中位数的解决方案时偶然发现了这一页。在研究了一些解决方案之后,我想到了以下几点。希望是有用的。

DECLARE @test TABLE(
    i int identity(1,1),
    id int,
    score float
)

INSERT INTO @test (id,score) VALUES (1,10)
INSERT INTO @test (id,score) VALUES (1,11)
INSERT INTO @test (id,score) VALUES (1,15)
INSERT INTO @test (id,score) VALUES (1,19)
INSERT INTO @test (id,score) VALUES (1,20)

INSERT INTO @test (id,score) VALUES (2,20)
INSERT INTO @test (id,score) VALUES (2,21)
INSERT INTO @test (id,score) VALUES (2,25)
INSERT INTO @test (id,score) VALUES (2,29)
INSERT INTO @test (id,score) VALUES (2,30)

INSERT INTO @test (id,score) VALUES (3,20)
INSERT INTO @test (id,score) VALUES (3,21)
INSERT INTO @test (id,score) VALUES (3,25)
INSERT INTO @test (id,score) VALUES (3,29)

DECLARE @counts TABLE(
    id int,
    cnt int
)

INSERT INTO @counts (
    id,
    cnt
)
SELECT
    id,
    COUNT(*)
FROM
    @test
GROUP BY
    id

SELECT
    drv.id,
    drv.start,
    AVG(t.score)
FROM
    (
        SELECT
            MIN(t.i)-1 AS start,
            t.id
        FROM
            @test t
        GROUP BY
            t.id
    ) drv
    INNER JOIN @test t ON drv.id = t.id
    INNER JOIN @counts c ON t.id = c.id
WHERE
    t.i = ((c.cnt+1)/2)+drv.start
    OR (
        t.i = (((c.cnt+1)%2) * ((c.cnt+2)/2))+drv.start
        AND ((c.cnt+1)%2) * ((c.cnt+2)/2) <> 0
    )
GROUP BY
    drv.id,
    drv.start

MS SQL Server 2012(及以后版本)有PERCENTILE_DISC函数,计算排序值的特定百分比。PERCENTILE_DISC(0.5)将计算中位数- https://msdn.microsoft.com/en-us/library/hh231327.aspx

使用COUNT聚合, 首先可以计算有多少行,并存储在一个名为@cnt的变量中。然后 你可以计算OFFSET-FETCH过滤器的参数来指定,基于数量排序, 要跳过多少行(偏移值)和筛选多少行(获取值)。

行数 跳过是(@cnt - 1) / 2。很明显,对于奇数,这个计算是正确的,因为 首先对单个中间值减去1,然后再除以2。

这也适用于偶数计数,因为表达式中使用的除法是 整数除法;所以,当一个偶数减去1时,你得到的是一个奇数。

When dividing that odd value by 2, the fraction part of the result (.5) is truncated. The number of rows to fetch is 2 - (@cnt % 2). The idea is that when the count is odd the result of the modulo operation is 1, and you need to fetch 1 row. When the count is even the result of the modulo operation is 0, and you need to fetch 2 rows. By subtracting the 1 or 0 result of the modulo operation from 2, you get the desired 1 or 2, respectively. Finally, to compute the median quantity, take the one or two result quantities, and apply an average after converting the input integer value to a numeric one as follows:

DECLARE @cnt AS INT = (SELECT COUNT(*) FROM [Sales].[production].[stocks]);
SELECT AVG(1.0 * quantity) AS median
FROM ( SELECT quantity
FROM [Sales].[production].[stocks]
ORDER BY quantity
OFFSET (@cnt - 1) / 2 ROWS FETCH NEXT 2 - @cnt % 2 ROWS ONLY ) AS D;

2019 UPDATE: In the 10 years since I wrote this answer, more solutions have been uncovered that may yield better results. Also, SQL Server releases since then (especially SQL 2012) have introduced new T-SQL features that can be used to calculate medians. SQL Server releases have also improved its query optimizer which may affect perf of various median solutions. Net-net, my original 2009 post is still OK but there may be better solutions on for modern SQL Server apps. Take a look at this article from 2012 which is a great resource: https://sqlperformance.com/2012/08/t-sql-queries/median

本文发现,以下模式比所有其他选择都要快得多,至少在他们测试的简单模式上是这样。该解决方案比测试的最慢解决方案(PERCENTILE_CONT)快373x (!!)注意,这个技巧需要两个独立的查询,这可能不是在所有情况下都可行。它还需要SQL 2012或更高版本。

DECLARE @c BIGINT = (SELECT COUNT(*) FROM dbo.EvenRows);

SELECT AVG(1.0 * val)
FROM (
    SELECT val FROM dbo.EvenRows
     ORDER BY val
     OFFSET (@c - 1) / 2 ROWS
     FETCH NEXT 1 + (1 - @c % 2) ROWS ONLY
) AS x;

当然,仅仅因为2012年对一个模式的一次测试产生了很好的结果,您的实际情况可能会有所不同,特别是如果您使用的是SQL Server 2014或更高版本。如果性能对中值计算很重要,我强烈建议尝试并测试那篇文章中推荐的几个选项,以确保您找到了最适合您的模式的选项。

我还会特别小心地使用(SQL Server 2012新增的)函数PERCENTILE_CONT,这是这个问题的其他答案之一中推荐的,因为上面链接的文章发现这个内置函数比最快的解决方案慢373x。在过去的7年里,这种差异可能已经得到了改善,但就我个人而言,在验证它与其他解决方案的性能之前,我不会在大型表上使用这个函数。

2009年的原始帖子如下:

有很多方法可以做到这一点,它们的性能差别很大。下面是一个优化得特别好的解决方案,包括median、ROW_NUMBERs和性能。当涉及到执行期间生成的实际I/ o时,这是一个特别优的解决方案——它看起来比其他解决方案成本更高,但实际上要快得多。

该页还包含对其他解决方案和性能测试细节的讨论。请注意,如果有多行具有相同的中位数列值,则使用唯一列作为消歧器。

就像所有的数据库性能场景一样,总是尝试在真实的硬件上用真实的数据测试解决方案——你永远不知道什么时候对SQL Server优化器的更改或环境中的某个特性会使正常快速的解决方案变慢。

SELECT
   CustomerId,
   AVG(TotalDue)
FROM
(
   SELECT
      CustomerId,
      TotalDue,
      -- SalesOrderId in the ORDER BY is a disambiguator to break ties
      ROW_NUMBER() OVER (
         PARTITION BY CustomerId
         ORDER BY TotalDue ASC, SalesOrderId ASC) AS RowAsc,
      ROW_NUMBER() OVER (
         PARTITION BY CustomerId
         ORDER BY TotalDue DESC, SalesOrderId DESC) AS RowDesc
   FROM Sales.SalesOrderHeader SOH
) x
WHERE
   RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY CustomerId
ORDER BY CustomerId;