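According to MSDN, Median is not available as an aggregate function in Transact-SQL. However, I would like to find out whether it is possible to create this functionality (using the CREATE AGGREGATE function, a user-defined function, or some other method).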

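What would be the best way (if possible) to do this, so that a median value (assuming a numeric data type) can be calculated in an aggregate query?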


Current answer

Building on Jeff Atwood's answer, here it is with GROUP BY and a correlated subquery to get the median for each group.

SELECT TestID, 
(
 (SELECT MAX(Score) FROM
   (SELECT TOP 50 PERCENT Score FROM Posts WHERE TestID = Posts_parent.TestID ORDER BY Score) AS BottomHalf)
 +
 (SELECT MIN(Score) FROM
   (SELECT TOP 50 PERCENT Score FROM Posts WHERE TestID = Posts_parent.TestID ORDER BY Score DESC) AS TopHalf)
) / 2.0 AS MedianScore,  -- 2.0 avoids integer division when Score is an integer type
AVG(Score) AS AvgScore, MIN(Score) AS MinScore, MAX(Score) AS MaxScore
FROM Posts_parent
GROUP BY Posts_parent.TestID

Other answers

Try the following logic to find the median:

Consider a table containing the following numbers: 1, 1, 2, 3, 4, 5

The median is 2.5

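with tempa as
(
    -- cnt is the total number of rows, rnum the position of each row in sorted order
    select num, count(num) over () as cnt,
        row_number() over (order by num) as rnum
    from temp),
tempb as
    (
        -- odd count: the single middle position
        select (cnt + 1) / 2 as ref_value
        from tempa where cnt % 2 <> 0
        union all
        -- even count: the two middle positions
        select cnt / 2 from tempa where cnt % 2 = 0
        union all
        select cnt / 2 + 1
        from tempa where cnt % 2 = 0
    )
-- 1.0 * num keeps AVG from truncating when num is an integer column
select avg(1.0 * num) from tempa
where rnum in (select ref_value from tempb);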
    

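This is the most optimal solution for finding medians that I can think of. The names in the example are based on Justin's example. Make sure an index exists on the table Sales.SalesOrderHeader with the index columns CustomerId and TotalDue, in that order.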

SELECT
 sohCount.CustomerId,
 AVG(sohMid.TotalDue) as TotalDueMedian
FROM 
(SELECT 
  soh.CustomerId,
  COUNT(*) as NumberOfRows
FROM 
  Sales.SalesOrderHeader soh 
GROUP BY soh.CustomerId) As sohCount
CROSS APPLY 
    (Select 
       soh.TotalDue
    FROM 
    Sales.SalesOrderHeader soh 
    WHERE soh.CustomerId = sohCount.CustomerId 
    ORDER BY soh.TotalDue
    OFFSET sohCount.NumberOfRows / 2 - ((sohCount.NumberOfRows + 1) % 2) ROWS 
    FETCH NEXT 1 + ((sohCount.NumberOfRows + 1) % 2) ROWS ONLY
    ) As sohMid
GROUP BY sohCount.CustomerId
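As a rough sketch of the two things the query above depends on: the first statement creates the index the answer asks for (the index name is an assumption, only the column order comes from the answer), and the second shows what the OFFSET and FETCH expressions evaluate to for a few row counts, so the odd and even cases can be checked by hand.

```sql
-- Index name is hypothetical; the key order (CustomerId, TotalDue) follows the answer.
CREATE NONCLUSTERED INDEX IX_SalesOrderHeader_CustomerId_TotalDue
    ON Sales.SalesOrderHeader (CustomerId, TotalDue);

-- Evaluate the OFFSET / FETCH arithmetic for sample group sizes.
SELECT n.NumberOfRows,
       n.NumberOfRows / 2 - ((n.NumberOfRows + 1) % 2) AS OffsetRows,
       1 + ((n.NumberOfRows + 1) % 2)                  AS FetchRows
FROM (VALUES (1), (2), (5), (6)) AS n (NumberOfRows);
-- NumberOfRows 1 -> skip 0, fetch 1
-- NumberOfRows 2 -> skip 0, fetch 2
-- NumberOfRows 5 -> skip 2, fetch 1 (the 3rd row)
-- NumberOfRows 6 -> skip 2, fetch 2 (the 3rd and 4th rows)
```

For an odd count a single middle row is fetched, and for an even count the two middle rows are fetched, so the outer AVG returns their mean.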

Update

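I was a bit unsure about which method performs best, so I compared my method with Justin Grant's and Jeff Atwood's by running queries based on all three methods in one batch. The batch cost of each query was:

Without index:

Mine 30%, Justin Grant's 13%, Jeff Atwood's 58%

And with index:

Mine 3%, Justin Grant's 10%, Jeff Atwood's 87%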

I tried to see how well the queries scale when an index exists by creating more data from around 14,000 rows, by factors of 2 up to 512, which means around 7.2 million rows in the end. Note that I made sure the CustomerId field was unique each time I made a single copy, so the proportion of rows to unique instances of CustomerId was kept constant. While doing this I ran executions where I rebuilt the index afterwards, and I noticed that, with the data I had, the results stabilized at around a factor of 128 at these values:

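Mine 3%, Justin Grant's 5%, Jeff Atwood's 92%

I wondered how performance would be affected by scaling the number of rows while keeping the number of unique CustomerId values constant, so I set up a new test where I did just that. Now, instead of stabilizing, the batch cost ratio kept diverging, from an average of about 20 rows per CustomerId up to about 10,000 rows per unique Id in the end. The numbers were:

Mine 4%, Justin's 60%, Jeff's 35%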

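I made sure I had implemented each method correctly by comparing the results. My conclusion is that the method I used is generally faster as long as an index exists. Also note that this method is what is recommended for this particular problem in this article: https://www.microsoftpressstore.com/articles/article.aspx?p=2314819&seqNum=5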

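One way to further improve the performance of subsequent calls to this query is to persist the count information in an auxiliary table. You could even maintain it with a trigger that updates and holds the count of SalesOrderHeader rows for each CustomerId. Of course, you could then simply store the median as well.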
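A minimal sketch of that idea follows. The table name dbo.CustomerOrderCount, its columns, and the trigger name are all assumptions made for illustration, and the sketch assumes CustomerId is a non-null INT. It is not tuned for concurrency and has not been benchmarked against the numbers above.

```sql
-- Hypothetical auxiliary table holding one row count per customer.
CREATE TABLE dbo.CustomerOrderCount
(
    CustomerId   INT NOT NULL PRIMARY KEY,
    NumberOfRows INT NOT NULL
);

-- Initial population from the existing data.
INSERT INTO dbo.CustomerOrderCount (CustomerId, NumberOfRows)
SELECT CustomerId, COUNT(*)
FROM Sales.SalesOrderHeader
GROUP BY CustomerId;
GO  -- batch separator (SSMS / sqlcmd): CREATE TRIGGER must start its own batch

CREATE TRIGGER Sales.trg_SalesOrderHeader_Count
ON Sales.SalesOrderHeader
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Net change per customer: +1 for each inserted row, -1 for each deleted row.
    -- An UPDATE appears in both pseudo-tables, so it nets to 0 unless CustomerId changed.
    WITH delta AS
    (
        SELECT d.CustomerId, SUM(d.Change) AS Change
        FROM (SELECT CustomerId, 1 AS Change FROM inserted
              UNION ALL
              SELECT CustomerId, -1 FROM deleted) AS d
        GROUP BY d.CustomerId
    )
    MERGE dbo.CustomerOrderCount AS t
    USING delta AS s
        ON t.CustomerId = s.CustomerId
    WHEN MATCHED AND t.NumberOfRows + s.Change <= 0 THEN
        DELETE  -- no orders left, so drop the row instead of keeping a zero count
    WHEN MATCHED THEN
        UPDATE SET NumberOfRows = t.NumberOfRows + s.Change
    WHEN NOT MATCHED BY TARGET AND s.Change > 0 THEN
        INSERT (CustomerId, NumberOfRows) VALUES (s.CustomerId, s.Change);
END;
```

With such a table in place, the derived table sohCount in the median query could be replaced by dbo.CustomerOrderCount, which saves the COUNT(*) pass over Sales.SalesOrderHeader on each call. Removing zero-count rows matters because a count of 0 would produce a negative OFFSET.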

If you're using SQL 2005 or better, this is a nice, fairly simple median calculation for a single column in a table:

SELECT
(
 (SELECT MAX(Score) FROM
   (SELECT TOP 50 PERCENT Score FROM Posts ORDER BY Score) AS BottomHalf)
 +
 (SELECT MIN(Score) FROM
   (SELECT TOP 50 PERCENT Score FROM Posts ORDER BY Score DESC) AS TopHalf)
) / 2.0 AS Median  -- 2.0 avoids integer division when Score is an integer type

MS SQL Server 2012 (and later) has the PERCENTILE_DISC function, which computes a specific percentile for sorted values. PERCENTILE_DISC(0.5) will compute the median - https://msdn.microsoft.com/en-us/library/hh231327.aspx
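As a small illustration, here is a sketch that runs PERCENTILE_DISC next to its sibling PERCENTILE_CONT on the six sample values used earlier in this thread (1, 1, 2, 3, 4, 5). The table variable @temp is only a stand-in for a real table. Both are analytic functions in SQL Server, so they need an OVER clause, and DISTINCT collapses the repeated per-row result to a single row.

```sql
DECLARE @temp TABLE (num INT);
INSERT INTO @temp (num) VALUES (1), (1), (2), (3), (4), (5);

SELECT DISTINCT
    -- Returns a value that actually occurs in the data.
    PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY num) OVER () AS MedianDisc,
    -- Interpolates between the two middle values when the count is even.
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY num) OVER () AS MedianCont
FROM @temp;
-- MedianDisc = 2, MedianCont = 2.5
```

For an odd number of rows the two functions agree. For an even number, PERCENTILE_DISC returns the lower of the two middle values, so PERCENTILE_CONT is the one to use if the mean of the two middle values is wanted. For a median per group, replace OVER () with OVER (PARTITION BY ...) on the grouping column.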

DECLARE @Obs int
DECLARE @RowAsc table
(
ID      INT IDENTITY,
Observation  FLOAT
)
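-- Identity values are assigned in ORDER BY order, so ID is the rank of each observation
INSERT INTO @RowAsc (Observation)
SELECT Observations FROM MyTable
ORDER BY 1
-- (COUNT(*)+1)/2 is the middle row for an odd count and the lower middle row for an even count
SELECT @Obs=(COUNT(*)+1)/2 FROM @RowAsc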
SELECT Observation AS Median FROM @RowAsc WHERE ID=@Obs