根据MSDN, Median在Transact-SQL中不能作为聚合函数使用。但是,我想知道是否可以创建此功能(使用create Aggregate函数、用户定义函数或其他方法)。

最好的方法(如果可能的话)是什么——允许在聚合查询中计算中值(假设是数值数据类型)?


当前回答

在UDF中,写:

 Select Top 1 medianSortColumn from Table T
  Where (Select Count(*) from Table
         Where MedianSortColumn <
           (Select Count(*) From Table) / 2)
  Order By medianSortColumn

其他回答

使用COUNT聚合, 首先可以计算有多少行,并存储在一个名为@cnt的变量中。然后 你可以计算OFFSET-FETCH过滤器的参数来指定,基于数量排序, 要跳过多少行(偏移值)和筛选多少行(获取值)。

行数 跳过是(@cnt - 1) / 2。很明显,对于奇数,这个计算是正确的,因为 首先对单个中间值减去1,然后再除以2。

这也适用于偶数计数,因为表达式中使用的除法是 整数除法;所以,当一个偶数减去1时,你得到的是一个奇数。

When dividing that odd value by 2, the fraction part of the result (.5) is truncated. The number of rows to fetch is 2 - (@cnt % 2). The idea is that when the count is odd the result of the modulo operation is 1, and you need to fetch 1 row. When the count is even the result of the modulo operation is 0, and you need to fetch 2 rows. By subtracting the 1 or 0 result of the modulo operation from 2, you get the desired 1 or 2, respectively. Finally, to compute the median quantity, take the one or two result quantities, and apply an average after converting the input integer value to a numeric one as follows:

DECLARE @cnt AS INT = (SELECT COUNT(*) FROM [Sales].[production].[stocks]);
SELECT AVG(1.0 * quantity) AS median
FROM ( SELECT quantity
FROM [Sales].[production].[stocks]
ORDER BY quantity
OFFSET (@cnt - 1) / 2 ROWS FETCH NEXT 2 - @cnt % 2 ROWS ONLY ) AS D;

查看SQL中位数计算的其他解决方案: “用MySQL计算中位数的简单方法”(解决方案大多与供应商无关)。

这是我能想到的求中位数的最优解。示例中的名称基于Justin示例。确保表有索引 销售。SalesOrderHeader以索引列CustomerId和TotalDue的顺序存在。

SELECT
 sohCount.CustomerId,
 AVG(sohMid.TotalDue) as TotalDueMedian
FROM 
(SELECT 
  soh.CustomerId,
  COUNT(*) as NumberOfRows
FROM 
  Sales.SalesOrderHeader soh 
GROUP BY soh.CustomerId) As sohCount
CROSS APPLY 
    (Select 
       soh.TotalDue
    FROM 
    Sales.SalesOrderHeader soh 
    WHERE soh.CustomerId = sohCount.CustomerId 
    ORDER BY soh.TotalDue
    OFFSET sohCount.NumberOfRows / 2 - ((sohCount.NumberOfRows + 1) % 2) ROWS 
    FETCH NEXT 1 + ((sohCount.NumberOfRows + 1) % 2) ROWS ONLY
    ) As sohMid
GROUP BY sohCount.CustomerId

更新

我有点不确定哪种方法性能最好,所以我比较了我的方法Justin Grants和Jeff Atwoods,在一个批量中运行基于这三种方法的查询,每个查询的批量成本为:

没有指数:

我的30% Justin Grants 13% Jeff Atwoods 58%

还有index

我的3%。 Justin Grants 10% Jeff Atwoods 87%

I tried to see how well the queries scale if you have index by creating more data from around 14 000 rows by a factor of 2 up to 512 which means in the end around 7,2 millions rows. Note I made sure CustomeId field where unique for each time I did a single copy, so the proportion of rows compared to unique instance of CustomerId was kept constant. While I was doing this I ran executions where I rebuilt index afterwards, and I noticed the results stabilized at around a factor of 128 with the data I had to these values:

我的3%。 贾斯汀·格兰特5% Jeff Atwoods 92%

我想知道,在保持惟一CustomerId不变的情况下,扩展行数会如何影响性能,因此我设置了一个新的测试,在其中执行了上述操作。现在,批成本比率并没有稳定下来,而是不断分化,每个CustomerId平均大约有20行,最后每个这样唯一的Id大约有10000行。数字如下:

我的4% 贾斯汀60% 杰夫斯35%

通过比较结果,我确保我正确地实现了每个方法。 我的结论是,只要索引存在,我使用的方法通常更快。还要注意,本文针对这个特定问题推荐使用这种方法https://www.microsoftpressstore.com/articles/article.aspx?p=2314819&seqNum=5

进一步提高对该查询的后续调用的性能的一种方法是在辅助表中持久化计数信息。您甚至可以通过一个触发器来维护它,该触发器更新并保存有关依赖于CustomerId的SalesOrderHeader行计数的信息,当然您也可以简单地存储中值。

虽然Justin grant的解决方案看起来很可靠,但我发现当您在给定的分区键中有许多重复值时,ASC重复值的行号最终会不按顺序排列,因此它们不能正确对齐。

以下是我的研究结果的一个片段:

KEY VALUE ROWA ROWD  

13  2     22   182
13  1     6    183
13  1     7    184
13  1     8    185
13  1     9    186
13  1     10   187
13  1     11   188
13  1     12   189
13  0     1    190
13  0     2    191
13  0     3    192
13  0     4    193
13  0     5    194

我使用Justin的代码作为这个解决方案的基础。尽管考虑到使用多个派生表效率不高,但它确实解决了我遇到的行排序问题。任何改进都会受到欢迎,因为我在T-SQL方面不是那么有经验。

SELECT PKEY, cast(AVG(VALUE)as decimal(5,2)) as MEDIANVALUE
FROM
(
  SELECT PKEY,VALUE,ROWA,ROWD,
  'FLAG' = (CASE WHEN ROWA IN (ROWD,ROWD-1,ROWD+1) THEN 1 ELSE 0 END)
  FROM
  (
    SELECT
    PKEY,
    cast(VALUE as decimal(5,2)) as VALUE,
    ROWA,
    ROW_NUMBER() OVER (PARTITION BY PKEY ORDER BY ROWA DESC) as ROWD 

    FROM
    (
      SELECT
      PKEY, 
      VALUE,
      ROW_NUMBER() OVER (PARTITION BY PKEY ORDER BY VALUE ASC,PKEY ASC ) as ROWA 
      FROM [MTEST]
    )T1
  )T2
)T3
WHERE FLAG = '1'
GROUP BY PKEY
ORDER BY PKEY

如果你使用的是SQL 2005或更好的版本,这是一个很好的,简单的中位数计算表中的单列:

SELECT
(
 (SELECT MAX(Score) FROM
   (SELECT TOP 50 PERCENT Score FROM Posts ORDER BY Score) AS BottomHalf)
 +
 (SELECT MIN(Score) FROM
   (SELECT TOP 50 PERCENT Score FROM Posts ORDER BY Score DESC) AS TopHalf)
) / 2 AS Median