我有一个表,我想获得每组的最新条目。下面是表格:

DocumentStatusLogs表

|ID| DocumentID | Status | DateCreated |
| 2| 1          | S1     | 7/29/2011   |
| 3| 1          | S2     | 7/30/2011   |
| 6| 1          | S1     | 8/02/2011   |
| 1| 2          | S1     | 7/28/2011   |
| 4| 2          | S2     | 7/30/2011   |
| 5| 2          | S3     | 8/01/2011   |
| 6| 3          | S1     | 8/02/2011   |

该表将按documententid分组,并按DateCreated降序排序。对于每个documententid,我希望获得最新的状态。

我的首选输出:

| DocumentID | Status | DateCreated |
| 1          | S1     | 8/02/2011   |
| 2          | S3     | 8/01/2011   |
| 3          | S1     | 8/02/2011   |

Is there any aggregate function to get only the top from each group? See pseudo-code GetOnlyTheTop below: SELECT DocumentID, GetOnlyTheTop(Status), GetOnlyTheTop(DateCreated) FROM DocumentStatusLogs GROUP BY DocumentID ORDER BY DateCreated DESC If such function doesn't exist, is there any way I can achieve the output I want? Or at the first place, could this be caused by unnormalized database? I'm thinking, since what I'm looking for is just one row, should that status also be located in the parent table?

更多信息请参见父表:

当前文档表

| DocumentID | Title  | Content  | DateCreated |
| 1          | TitleA | ...      | ...         |
| 2          | TitleB | ...      | ...         |
| 3          | TitleC | ...      | ...         |

父表应该是这样的,以便我可以轻松地访问它的状态吗?

| DocumentID | Title  | Content  | DateCreated | CurrentStatus |
| 1          | TitleA | ...      | ...         | s1            |
| 2          | TitleB | ...      | ...         | s3            |
| 3          | TitleC | ...      | ...         | s1            |

更新 我刚刚学会了如何使用“apply”,它可以更容易地解决这类问题。


当前回答

从上面验证克林特的正确答案:

下面两个查询之间的性能非常有趣。52%是最高的。48%是第二个。使用DISTINCT而不是ORDER BY提高了4%的性能。但是ORDER BY具有按多列排序的优势。

IF (OBJECT_ID('tempdb..#DocumentStatusLogs') IS NOT NULL) BEGIN DROP TABLE #DocumentStatusLogs END

CREATE TABLE #DocumentStatusLogs (
    [ID] int NOT NULL,
    [DocumentID] int NOT NULL,
    [Status] varchar(20),
    [DateCreated] datetime
)

INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (2, 1, 'S1', '7/29/2011 1:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (3, 1, 'S2', '7/30/2011 2:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 1, 'S1', '8/02/2011 3:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (1, 2, 'S1', '7/28/2011 4:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (4, 2, 'S2', '7/30/2011 5:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (5, 2, 'S3', '8/01/2011 6:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 3, 'S1', '8/02/2011 7:00:00')

选项1:

    SELECT
    [Extent1].[ID], 
    [Extent1].[DocumentID],
    [Extent1].[Status], 
    [Extent1].[DateCreated]
FROM #DocumentStatusLogs AS [Extent1]
    OUTER APPLY (
        SELECT TOP 1
            [Extent2].[ID], 
            [Extent2].[DocumentID],
            [Extent2].[Status], 
            [Extent2].[DateCreated]
        FROM #DocumentStatusLogs AS [Extent2]
        WHERE [Extent1].[DocumentID] = [Extent2].[DocumentID]
        ORDER BY [Extent2].[DateCreated] DESC, [Extent2].[ID] DESC
    ) AS [Project2]
WHERE ([Project2].[ID] IS NULL OR [Project2].[ID] = [Extent1].[ID])

选项2:

SELECT 
    [Limit1].[DocumentID] AS [ID], 
    [Limit1].[DocumentID] AS [DocumentID], 
    [Limit1].[Status] AS [Status], 
    [Limit1].[DateCreated] AS [DateCreated]
FROM (
    SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM #DocumentStatusLogs AS [Extent1]
) AS [Distinct1]
    OUTER APPLY  (
        SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
        FROM (
            SELECT 
                [Extent2].[ID] AS [ID], 
                [Extent2].[DocumentID] AS [DocumentID], 
                [Extent2].[Status] AS [Status], 
                [Extent2].[DateCreated] AS [DateCreated]
            FROM #DocumentStatusLogs AS [Extent2]
            WHERE [Distinct1].[DocumentID] = [Extent2].[DocumentID]
        )  AS [Project2]
        ORDER BY [Project2].[ID] DESC
    ) AS [Limit1]

在Microsoft SQL Server Management Studio中:在突出显示并运行第一个块后,突出显示选项1和选项2,右键单击->[显示估计的执行计划]。然后运行整个程序以查看结果。

选项1结果:

ID  DocumentID  Status  DateCreated
6   1   S1  8/2/11 3:00
5   2   S3  8/1/11 6:00
6   3   S1  8/2/11 7:00

选项2结果:

ID  DocumentID  Status  DateCreated
6   1   S1  8/2/11 3:00
5   2   S3  8/1/11 6:00
6   3   S1  8/2/11 7:00

注意:

当我想要一个连接为1到-(多个连接中的1个)时,我倾向于使用APPLY。 如果我想要连接为1对多或多对多,则使用JOIN。 我使用ROW_NUMBER()避免CTE,除非我需要做一些高级的事情,并且可以接受窗口性能损失。

我还避免在WHERE或ON子句中使用EXISTS / IN子查询,因为我曾经经历过这样会导致一些糟糕的执行计划。但具体里程各不相同。在需要的时候随时随地检查执行计划和概要文件的性能!

其他回答

;WITH cte AS
(
   SELECT *,
         ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
   FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1

如果您希望每天有2个条目,那么这将任意选择一个。要获得一天的两个条目,请使用DENSE_RANK代替

至于是否正常化,这取决于你是否想:

在两个地方保持状态 保存状态历史 ...

目前,您保留了状态历史。如果你也想在父表中保持最新状态(这是非规范化的),你需要一个触发器来维持父表中的“状态”。或者删除这个状态历史表。

此解决方案可用于获取每个分区的TOP N最近行(在本例中,WHERE语句中的N为1,partition为doc_id):

SELECT T.doc_id, T.status, T.date_created FROM 
(
    SELECT a.*, ROW_NUMBER() OVER (PARTITION BY doc_id ORDER BY date_created DESC) AS rnk FROM doc a
) T
WHERE T.rnk = 1;
SELECT * FROM
DocumentStatusLogs JOIN (
  SELECT DocumentID, MAX(DateCreated) DateCreated
  FROM DocumentStatusLogs
  GROUP BY DocumentID
  ) max_date USING (DocumentID, DateCreated)

什么数据库服务器?这段代码并不是对所有的都有效。

关于你问题的后半部分,我认为应该在一栏中写上“地位”。您可以将DocumentStatusLogs保留为日志,但仍然将最新信息存储在主表中。

顺便说一句,如果你已经在文档表中有DateCreated列,你可以使用它来加入DocumentStatusLogs(只要DateCreated在DocumentStatusLogs中是唯一的)。

编辑:MsSQL不支持USING,因此将其更改为:

ON DocumentStatusLogs.DocumentID = max_date.DocumentID AND DocumentStatusLogs.DateCreated = max_date.DateCreated

我知道这是一个旧的线程,但TOP 1与TIES解决方案是相当不错的,可能有助于阅读一些解决方案。

select top 1 with ties
   DocumentID
  ,Status
  ,DateCreated
from DocumentStatusLogs
order by row_number() over (partition by DocumentID order by DateCreated desc)

select top 1 with ties子句告诉SQL Server要返回每个组的第一行。但是SQL Server如何知道如何对数据进行分组呢?这就是按row_number()的顺序除以(按documententid分区的顺序除以DateCreated desc的顺序。分区后的列定义了SQL Server如何对数据进行分组。在每个组中,行将根据列的顺序进行排序。排序之后,查询中将返回每个组中的第一行。

关于TOP子句的更多信息可以在这里找到。

从上面验证克林特的正确答案:

下面两个查询之间的性能非常有趣。52%是最高的。48%是第二个。使用DISTINCT而不是ORDER BY提高了4%的性能。但是ORDER BY具有按多列排序的优势。

IF (OBJECT_ID('tempdb..#DocumentStatusLogs') IS NOT NULL) BEGIN DROP TABLE #DocumentStatusLogs END

CREATE TABLE #DocumentStatusLogs (
    [ID] int NOT NULL,
    [DocumentID] int NOT NULL,
    [Status] varchar(20),
    [DateCreated] datetime
)

INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (2, 1, 'S1', '7/29/2011 1:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (3, 1, 'S2', '7/30/2011 2:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 1, 'S1', '8/02/2011 3:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (1, 2, 'S1', '7/28/2011 4:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (4, 2, 'S2', '7/30/2011 5:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (5, 2, 'S3', '8/01/2011 6:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 3, 'S1', '8/02/2011 7:00:00')

选项1:

    SELECT
    [Extent1].[ID], 
    [Extent1].[DocumentID],
    [Extent1].[Status], 
    [Extent1].[DateCreated]
FROM #DocumentStatusLogs AS [Extent1]
    OUTER APPLY (
        SELECT TOP 1
            [Extent2].[ID], 
            [Extent2].[DocumentID],
            [Extent2].[Status], 
            [Extent2].[DateCreated]
        FROM #DocumentStatusLogs AS [Extent2]
        WHERE [Extent1].[DocumentID] = [Extent2].[DocumentID]
        ORDER BY [Extent2].[DateCreated] DESC, [Extent2].[ID] DESC
    ) AS [Project2]
WHERE ([Project2].[ID] IS NULL OR [Project2].[ID] = [Extent1].[ID])

选项2:

SELECT 
    [Limit1].[DocumentID] AS [ID], 
    [Limit1].[DocumentID] AS [DocumentID], 
    [Limit1].[Status] AS [Status], 
    [Limit1].[DateCreated] AS [DateCreated]
FROM (
    SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM #DocumentStatusLogs AS [Extent1]
) AS [Distinct1]
    OUTER APPLY  (
        SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
        FROM (
            SELECT 
                [Extent2].[ID] AS [ID], 
                [Extent2].[DocumentID] AS [DocumentID], 
                [Extent2].[Status] AS [Status], 
                [Extent2].[DateCreated] AS [DateCreated]
            FROM #DocumentStatusLogs AS [Extent2]
            WHERE [Distinct1].[DocumentID] = [Extent2].[DocumentID]
        )  AS [Project2]
        ORDER BY [Project2].[ID] DESC
    ) AS [Limit1]

在Microsoft SQL Server Management Studio中:在突出显示并运行第一个块后,突出显示选项1和选项2,右键单击->[显示估计的执行计划]。然后运行整个程序以查看结果。

选项1结果:

ID  DocumentID  Status  DateCreated
6   1   S1  8/2/11 3:00
5   2   S3  8/1/11 6:00
6   3   S1  8/2/11 7:00

选项2结果:

ID  DocumentID  Status  DateCreated
6   1   S1  8/2/11 3:00
5   2   S3  8/1/11 6:00
6   3   S1  8/2/11 7:00

注意:

当我想要一个连接为1到-(多个连接中的1个)时,我倾向于使用APPLY。 如果我想要连接为1对多或多对多,则使用JOIN。 我使用ROW_NUMBER()避免CTE,除非我需要做一些高级的事情,并且可以接受窗口性能损失。

我还避免在WHERE或ON子句中使用EXISTS / IN子查询,因为我曾经经历过这样会导致一些糟糕的执行计划。但具体里程各不相同。在需要的时候随时随地检查执行计划和概要文件的性能!