根据MSDN, Median在Transact-SQL中不能作为聚合函数使用。但是,我想知道是否可以创建此功能(使用create Aggregate函数、用户定义函数或其他方法)。

最好的方法(如果可能的话)是什么——允许在聚合查询中计算中值(假设是数值数据类型)?


当前回答

使用一条语句——一种方法是使用ROW_NUMBER(), COUNT()窗口函数并过滤子查询。下面是薪资中位数:

 SELECT AVG(e_salary) 
 FROM                                                             
    (SELECT 
      ROW_NUMBER() OVER(ORDER BY e_salary) as row_no, 
      e_salary,
      (COUNT(*) OVER()+1)*0.5 AS row_half
     FROM Employee) t
 WHERE row_no IN (FLOOR(row_half),CEILING(row_half))

我在网上看到过类似的解决方案,使用地板和天花板,但尝试使用单一的语句。(编辑)

其他回答

对于像我这样正在学习基础知识的新手来说,我个人觉得这个例子更容易理解,因为它更容易理解到底发生了什么以及中值来自哪里……

select
 ( max(a.[Value1]) + min(a.[Value1]) ) / 2 as [Median Value1]
,( max(a.[Value2]) + min(a.[Value2]) ) / 2 as [Median Value2]

from (select
    datediff(dd,startdate,enddate) as [Value1]
    ,xxxxxxxxxxxxxx as [Value2]
     from dbo.table1
     )a

不过,对上面的一些代码绝对敬畏!!

如果你想在SQL Server中使用创建聚合函数,这是如何做到的。这样做的好处是能够编写干净的查询。注意,这个过程可以很容易地计算一个百分位值。

创建一个新的Visual Studio项目,并将目标框架设置为。net 3.5(这是针对SQL 2008的,在SQL 2012中可能有所不同)。然后创建一个类文件,并放入以下代码或c#等效代码:

Imports Microsoft.SqlServer.Server
Imports System.Data.SqlTypes
Imports System.IO

<Serializable>
<SqlUserDefinedAggregate(Format.UserDefined, IsInvariantToNulls:=True, IsInvariantToDuplicates:=False, _
  IsInvariantToOrder:=True, MaxByteSize:=-1, IsNullIfEmpty:=True)>
Public Class Median
  Implements IBinarySerialize
  Private _items As List(Of Decimal)

  Public Sub Init()
    _items = New List(Of Decimal)()
  End Sub

  Public Sub Accumulate(value As SqlDecimal)
    If Not value.IsNull Then
      _items.Add(value.Value)
    End If
  End Sub

  Public Sub Merge(other As Median)
    If other._items IsNot Nothing Then
      _items.AddRange(other._items)
    End If
  End Sub

  Public Function Terminate() As SqlDecimal
    If _items.Count <> 0 Then
      Dim result As Decimal
      _items = _items.OrderBy(Function(i) i).ToList()
      If _items.Count Mod 2 = 0 Then
        result = ((_items((_items.Count / 2) - 1)) + (_items(_items.Count / 2))) / 2@
      Else
        result = _items((_items.Count - 1) / 2)
      End If

      Return New SqlDecimal(result)
    Else
      Return New SqlDecimal()
    End If
  End Function

  Public Sub Read(r As BinaryReader) Implements IBinarySerialize.Read
    'deserialize it from a string
    Dim list = r.ReadString()
    _items = New List(Of Decimal)

    For Each value In list.Split(","c)
      Dim number As Decimal
      If Decimal.TryParse(value, number) Then
        _items.Add(number)
      End If
    Next

  End Sub

  Public Sub Write(w As BinaryWriter) Implements IBinarySerialize.Write
    'serialize the list to a string
    Dim list = ""

    For Each item In _items
      If list <> "" Then
        list += ","
      End If      
      list += item.ToString()
    Next
    w.Write(list)
  End Sub
End Class

然后编译它,并将DLL和PDB文件复制到您的SQL Server机器,在SQL Server中运行以下命令:

CREATE ASSEMBLY CustomAggregate FROM '{path to your DLL}'
WITH PERMISSION_SET=SAFE;
GO

CREATE AGGREGATE Median(@value decimal(9, 3))
RETURNS decimal(9, 3) 
EXTERNAL NAME [CustomAggregate].[{namespace of your DLL}.Median];
GO

然后你可以写一个查询来计算中位数,就像这样: SELECT dbo.Median(Field) FROM Table

我想自己想出一个解决办法,但我的大脑绊倒了。我觉得很管用,但别让我早上解释。: P

DECLARE @table AS TABLE
(
    Number int not null
);

insert into @table select 2;
insert into @table select 4;
insert into @table select 9;
insert into @table select 15;
insert into @table select 22;
insert into @table select 26;
insert into @table select 37;
insert into @table select 49;

DECLARE @Count AS INT
SELECT @Count = COUNT(*) FROM @table;

WITH MyResults(RowNo, Number) AS
(
    SELECT RowNo, Number FROM
        (SELECT ROW_NUMBER() OVER (ORDER BY Number) AS RowNo, Number FROM @table) AS Foo
)
SELECT AVG(Number) FROM MyResults WHERE RowNo = (@Count+1)/2 OR RowNo = ((@Count+1)%2) * ((@Count+2)/2)

这是我能想到的求中位数的最优解。示例中的名称基于Justin示例。确保表有索引 销售。SalesOrderHeader以索引列CustomerId和TotalDue的顺序存在。

SELECT
 sohCount.CustomerId,
 AVG(sohMid.TotalDue) as TotalDueMedian
FROM 
(SELECT 
  soh.CustomerId,
  COUNT(*) as NumberOfRows
FROM 
  Sales.SalesOrderHeader soh 
GROUP BY soh.CustomerId) As sohCount
CROSS APPLY 
    (Select 
       soh.TotalDue
    FROM 
    Sales.SalesOrderHeader soh 
    WHERE soh.CustomerId = sohCount.CustomerId 
    ORDER BY soh.TotalDue
    OFFSET sohCount.NumberOfRows / 2 - ((sohCount.NumberOfRows + 1) % 2) ROWS 
    FETCH NEXT 1 + ((sohCount.NumberOfRows + 1) % 2) ROWS ONLY
    ) As sohMid
GROUP BY sohCount.CustomerId

更新

我有点不确定哪种方法性能最好,所以我比较了我的方法Justin Grants和Jeff Atwoods,在一个批量中运行基于这三种方法的查询,每个查询的批量成本为:

没有指数:

我的30% Justin Grants 13% Jeff Atwoods 58%

还有index

我的3%。 Justin Grants 10% Jeff Atwoods 87%

I tried to see how well the queries scale if you have index by creating more data from around 14 000 rows by a factor of 2 up to 512 which means in the end around 7,2 millions rows. Note I made sure CustomeId field where unique for each time I did a single copy, so the proportion of rows compared to unique instance of CustomerId was kept constant. While I was doing this I ran executions where I rebuilt index afterwards, and I noticed the results stabilized at around a factor of 128 with the data I had to these values:

我的3%。 贾斯汀·格兰特5% Jeff Atwoods 92%

我想知道,在保持惟一CustomerId不变的情况下,扩展行数会如何影响性能,因此我设置了一个新的测试,在其中执行了上述操作。现在,批成本比率并没有稳定下来,而是不断分化,每个CustomerId平均大约有20行,最后每个这样唯一的Id大约有10000行。数字如下:

我的4% 贾斯汀60% 杰夫斯35%

通过比较结果,我确保我正确地实现了每个方法。 我的结论是,只要索引存在,我使用的方法通常更快。还要注意,本文针对这个特定问题推荐使用这种方法https://www.microsoftpressstore.com/articles/article.aspx?p=2314819&seqNum=5

进一步提高对该查询的后续调用的性能的一种方法是在辅助表中持久化计数信息。您甚至可以通过一个触发器来维护它,该触发器更新并保存有关依赖于CustomerId的SalesOrderHeader行计数的信息,当然您也可以简单地存储中值。

中找到

这是查找属性中值的最简单方法。

Select round(S.salary,4) median from employee S 
where (select count(salary) from station 
where salary < S.salary ) = (select count(salary) from station
where salary > S.salary)