根据MSDN, Median在Transact-SQL中不能作为聚合函数使用。但是,我想知道是否可以创建此功能(使用create Aggregate函数、用户定义函数或其他方法)。
最好的方法(如果可能的话)是什么——允许在聚合查询中计算中值(假设是数值数据类型)?
根据MSDN, Median在Transact-SQL中不能作为聚合函数使用。但是,我想知道是否可以创建此功能(使用create Aggregate函数、用户定义函数或其他方法)。
最好的方法(如果可能的话)是什么——允许在聚合查询中计算中值(假设是数值数据类型)?
当前回答
这是我能想到的求中位数的最优解。示例中的名称基于Justin示例。确保表有索引 销售。SalesOrderHeader以索引列CustomerId和TotalDue的顺序存在。
SELECT
sohCount.CustomerId,
AVG(sohMid.TotalDue) as TotalDueMedian
FROM
(SELECT
soh.CustomerId,
COUNT(*) as NumberOfRows
FROM
Sales.SalesOrderHeader soh
GROUP BY soh.CustomerId) As sohCount
CROSS APPLY
(Select
soh.TotalDue
FROM
Sales.SalesOrderHeader soh
WHERE soh.CustomerId = sohCount.CustomerId
ORDER BY soh.TotalDue
OFFSET sohCount.NumberOfRows / 2 - ((sohCount.NumberOfRows + 1) % 2) ROWS
FETCH NEXT 1 + ((sohCount.NumberOfRows + 1) % 2) ROWS ONLY
) As sohMid
GROUP BY sohCount.CustomerId
更新
我有点不确定哪种方法性能最好,所以我比较了我的方法Justin Grants和Jeff Atwoods,在一个批量中运行基于这三种方法的查询,每个查询的批量成本为:
没有指数:
我的30% Justin Grants 13% Jeff Atwoods 58%
还有index
我的3%。 Justin Grants 10% Jeff Atwoods 87%
I tried to see how well the queries scale if you have index by creating more data from around 14 000 rows by a factor of 2 up to 512 which means in the end around 7,2 millions rows. Note I made sure CustomeId field where unique for each time I did a single copy, so the proportion of rows compared to unique instance of CustomerId was kept constant. While I was doing this I ran executions where I rebuilt index afterwards, and I noticed the results stabilized at around a factor of 128 with the data I had to these values:
我的3%。 贾斯汀·格兰特5% Jeff Atwoods 92%
我想知道,在保持惟一CustomerId不变的情况下,扩展行数会如何影响性能,因此我设置了一个新的测试,在其中执行了上述操作。现在,批成本比率并没有稳定下来,而是不断分化,每个CustomerId平均大约有20行,最后每个这样唯一的Id大约有10000行。数字如下:
我的4% 贾斯汀60% 杰夫斯35%
通过比较结果,我确保我正确地实现了每个方法。 我的结论是,只要索引存在,我使用的方法通常更快。还要注意,本文针对这个特定问题推荐使用这种方法https://www.microsoftpressstore.com/articles/article.aspx?p=2314819&seqNum=5
进一步提高对该查询的后续调用的性能的一种方法是在辅助表中持久化计数信息。您甚至可以通过一个触发器来维护它,该触发器更新并保存有关依赖于CustomerId的SalesOrderHeader行计数的信息,当然您也可以简单地存储中值。
其他回答
虽然Justin grant的解决方案看起来很可靠,但我发现当您在给定的分区键中有许多重复值时,ASC重复值的行号最终会不按顺序排列,因此它们不能正确对齐。
以下是我的研究结果的一个片段:
KEY VALUE ROWA ROWD
13 2 22 182
13 1 6 183
13 1 7 184
13 1 8 185
13 1 9 186
13 1 10 187
13 1 11 188
13 1 12 189
13 0 1 190
13 0 2 191
13 0 3 192
13 0 4 193
13 0 5 194
我使用Justin的代码作为这个解决方案的基础。尽管考虑到使用多个派生表效率不高,但它确实解决了我遇到的行排序问题。任何改进都会受到欢迎,因为我在T-SQL方面不是那么有经验。
SELECT PKEY, cast(AVG(VALUE)as decimal(5,2)) as MEDIANVALUE
FROM
(
SELECT PKEY,VALUE,ROWA,ROWD,
'FLAG' = (CASE WHEN ROWA IN (ROWD,ROWD-1,ROWD+1) THEN 1 ELSE 0 END)
FROM
(
SELECT
PKEY,
cast(VALUE as decimal(5,2)) as VALUE,
ROWA,
ROW_NUMBER() OVER (PARTITION BY PKEY ORDER BY ROWA DESC) as ROWD
FROM
(
SELECT
PKEY,
VALUE,
ROW_NUMBER() OVER (PARTITION BY PKEY ORDER BY VALUE ASC,PKEY ASC ) as ROWA
FROM [MTEST]
)T1
)T2
)T3
WHERE FLAG = '1'
GROUP BY PKEY
ORDER BY PKEY
如果你想在SQL Server中使用创建聚合函数,这是如何做到的。这样做的好处是能够编写干净的查询。注意,这个过程可以很容易地计算一个百分位值。
创建一个新的Visual Studio项目,并将目标框架设置为。net 3.5(这是针对SQL 2008的,在SQL 2012中可能有所不同)。然后创建一个类文件,并放入以下代码或c#等效代码:
Imports Microsoft.SqlServer.Server
Imports System.Data.SqlTypes
Imports System.IO
<Serializable>
<SqlUserDefinedAggregate(Format.UserDefined, IsInvariantToNulls:=True, IsInvariantToDuplicates:=False, _
IsInvariantToOrder:=True, MaxByteSize:=-1, IsNullIfEmpty:=True)>
Public Class Median
Implements IBinarySerialize
Private _items As List(Of Decimal)
Public Sub Init()
_items = New List(Of Decimal)()
End Sub
Public Sub Accumulate(value As SqlDecimal)
If Not value.IsNull Then
_items.Add(value.Value)
End If
End Sub
Public Sub Merge(other As Median)
If other._items IsNot Nothing Then
_items.AddRange(other._items)
End If
End Sub
Public Function Terminate() As SqlDecimal
If _items.Count <> 0 Then
Dim result As Decimal
_items = _items.OrderBy(Function(i) i).ToList()
If _items.Count Mod 2 = 0 Then
result = ((_items((_items.Count / 2) - 1)) + (_items(_items.Count / 2))) / 2@
Else
result = _items((_items.Count - 1) / 2)
End If
Return New SqlDecimal(result)
Else
Return New SqlDecimal()
End If
End Function
Public Sub Read(r As BinaryReader) Implements IBinarySerialize.Read
'deserialize it from a string
Dim list = r.ReadString()
_items = New List(Of Decimal)
For Each value In list.Split(","c)
Dim number As Decimal
If Decimal.TryParse(value, number) Then
_items.Add(number)
End If
Next
End Sub
Public Sub Write(w As BinaryWriter) Implements IBinarySerialize.Write
'serialize the list to a string
Dim list = ""
For Each item In _items
If list <> "" Then
list += ","
End If
list += item.ToString()
Next
w.Write(list)
End Sub
End Class
然后编译它,并将DLL和PDB文件复制到您的SQL Server机器,在SQL Server中运行以下命令:
CREATE ASSEMBLY CustomAggregate FROM '{path to your DLL}'
WITH PERMISSION_SET=SAFE;
GO
CREATE AGGREGATE Median(@value decimal(9, 3))
RETURNS decimal(9, 3)
EXTERNAL NAME [CustomAggregate].[{namespace of your DLL}.Median];
GO
然后你可以写一个查询来计算中位数,就像这样: SELECT dbo.Median(Field) FROM Table
中找到
这是查找属性中值的最简单方法。
Select round(S.salary,4) median from employee S
where (select count(salary) from station
where salary < S.salary ) = (select count(salary) from station
where salary > S.salary)
这段代码有点长,但很容易理解
medii是有列val的表,它有数据集, Smedi是一个cte,它将列idx作为行号,val作为medi表中的'val',该表是升序排序的。 这是基本的数学,如果行号是奇数,那么它的中值来自smedi。 当它是偶数时,它是中间两个值的平均值。
with smedi(idx,vals) as(
select ROW_NUMBER() over(order by val),val from medi
)
select (case
when (select count(*) from medi)%2!=0 then (select vals from smedi where (((select count(*) from medi)/2))=idx)
else (select avg(vals) from smedi where idx in ((select count(*)/2 from medi),(select (count(*)/2)+1 from medi)))
end)
with t1 as (select *, row_number() over(order by ordqty) as rn,
count(*) over() as rc from ord_line)
select rn,* from t1 where rn in((rc+1)/2, (rc+2)/2);
它将计算偶数和奇数的中位数。
Ord_line是一个表 Ordqty是一个列