我有一个表,我想获得每组的最新条目。下面是表格:

DocumentStatusLogs表

|ID| DocumentID | Status | DateCreated |
| 2| 1          | S1     | 7/29/2011   |
| 3| 1          | S2     | 7/30/2011   |
| 6| 1          | S1     | 8/02/2011   |
| 1| 2          | S1     | 7/28/2011   |
| 4| 2          | S2     | 7/30/2011   |
| 5| 2          | S3     | 8/01/2011   |
| 6| 3          | S1     | 8/02/2011   |

该表将按documententid分组,并按DateCreated降序排序。对于每个documententid,我希望获得最新的状态。

我的首选输出:

| DocumentID | Status | DateCreated |
| 1          | S1     | 8/02/2011   |
| 2          | S3     | 8/01/2011   |
| 3          | S1     | 8/02/2011   |

Is there any aggregate function to get only the top from each group? See pseudo-code GetOnlyTheTop below: SELECT DocumentID, GetOnlyTheTop(Status), GetOnlyTheTop(DateCreated) FROM DocumentStatusLogs GROUP BY DocumentID ORDER BY DateCreated DESC If such function doesn't exist, is there any way I can achieve the output I want? Or at the first place, could this be caused by unnormalized database? I'm thinking, since what I'm looking for is just one row, should that status also be located in the parent table?

更多信息请参见父表:

当前文档表

| DocumentID | Title  | Content  | DateCreated |
| 1          | TitleA | ...      | ...         |
| 2          | TitleB | ...      | ...         |
| 3          | TitleC | ...      | ...         |

父表应该是这样的,以便我可以轻松地访问它的状态吗?

| DocumentID | Title  | Content  | DateCreated | CurrentStatus |
| 1          | TitleA | ...      | ...         | s1            |
| 2          | TitleB | ...      | ...         | s3            |
| 3          | TitleC | ...      | ...         | s1            |

更新 我刚刚学会了如何使用“apply”,它可以更容易地解决这类问题。


当前回答

这里有3种不同的方法来解决这个问题,以及为每个查询建立索引的最佳选择(请自己尝试索引,并查看逻辑读取、消耗时间和执行计划。我根据自己的经验提供了关于此类查询的建议,但没有针对这个特定问题执行)。

方法1:使用ROW_NUMBER()。如果rowstore索引不能提高性能,对于具有聚合和分组的查询以及始终按不同列排序的表,可以尝试使用非聚集/聚集的columnstore索引,columnstore索引通常是最佳选择。

;WITH CTE AS
    (
       SELECT   *,
                RN = ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
       FROM     DocumentStatusLogs
    )
    SELECT  ID      
        ,DocumentID 
        ,Status     
        ,DateCreated
    FROM    CTE
    WHERE   RN = 1;

方法2:使用FIRST_VALUE。如果rowstore索引不能提高性能,对于具有聚合和分组的查询以及始终按不同列排序的表,可以尝试使用非聚集/聚集的columnstore索引,columnstore索引通常是最佳选择。

SELECT  DISTINCT
    ID      = FIRST_VALUE(ID) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
    ,DocumentID
    ,Status     = FIRST_VALUE(Status) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
    ,DateCreated    = FIRST_VALUE(DateCreated) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
FROM    DocumentStatusLogs;

方法3:使用CROSS APPLY。在DocumentStatusLogs表上创建涵盖查询中使用的列的行存储索引应该足以覆盖查询,而不需要columnstore索引。

SELECT  DISTINCT
    ID      = CA.ID
    ,DocumentID = D.DocumentID
    ,Status     = CA.Status 
    ,DateCreated    = CA.DateCreated
FROM    DocumentStatusLogs D
    CROSS APPLY (
            SELECT  TOP 1 I.*
            FROM    DocumentStatusLogs I
            WHERE   I.DocumentID = D.DocumentID
            ORDER   BY I.DateCreated DESC
            ) CA;

其他回答

如果你担心性能问题,你也可以用MAX()这样做:

SELECT *
FROM DocumentStatusLogs D
WHERE DateCreated = (SELECT MAX(DateCreated) FROM DocumentStatusLogs WHERE ID = D.ID)

ROW_NUMBER()要求对SELECT语句中的所有行进行排序,而MAX则不需要。应该会大大加快你的查询速度。

这里有3种不同的方法来解决这个问题,以及为每个查询建立索引的最佳选择(请自己尝试索引,并查看逻辑读取、消耗时间和执行计划。我根据自己的经验提供了关于此类查询的建议,但没有针对这个特定问题执行)。

方法1:使用ROW_NUMBER()。如果rowstore索引不能提高性能,对于具有聚合和分组的查询以及始终按不同列排序的表,可以尝试使用非聚集/聚集的columnstore索引,columnstore索引通常是最佳选择。

;WITH CTE AS
    (
       SELECT   *,
                RN = ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
       FROM     DocumentStatusLogs
    )
    SELECT  ID      
        ,DocumentID 
        ,Status     
        ,DateCreated
    FROM    CTE
    WHERE   RN = 1;

方法2:使用FIRST_VALUE。如果rowstore索引不能提高性能,对于具有聚合和分组的查询以及始终按不同列排序的表,可以尝试使用非聚集/聚集的columnstore索引,columnstore索引通常是最佳选择。

SELECT  DISTINCT
    ID      = FIRST_VALUE(ID) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
    ,DocumentID
    ,Status     = FIRST_VALUE(Status) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
    ,DateCreated    = FIRST_VALUE(DateCreated) OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC)
FROM    DocumentStatusLogs;

方法3:使用CROSS APPLY。在DocumentStatusLogs表上创建涵盖查询中使用的列的行存储索引应该足以覆盖查询,而不需要columnstore索引。

SELECT  DISTINCT
    ID      = CA.ID
    ,DocumentID = D.DocumentID
    ,Status     = CA.Status 
    ,DateCreated    = CA.DateCreated
FROM    DocumentStatusLogs D
    CROSS APPLY (
            SELECT  TOP 1 I.*
            FROM    DocumentStatusLogs I
            WHERE   I.DocumentID = D.DocumentID
            ORDER   BY I.DateCreated DESC
            ) CA;

一些数据库引擎*开始支持允许过滤窗口函数结果的qualifier子句(接受的答案使用该子句)。

所以公认的答案可以变成

SELECT *, ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
QUALIFY rn = 1

查看这篇文章以获得更深入的解释:https://jrandrews.net/the-joy-of-qualify

您可以使用此工具查看哪个数据库支持此子句:https://www.jooq.org/translate/ 当目标方言不支持qualifier子句时,可以选择转换它。

*Teradata, BigQuery, H2, Snowflake…

我相信这是可以做到的。这可能需要一些调整,但你可以从组中选择最大值。

这些答案太夸张了。

SELECT
  d.DocumentID,
  MAX(d.Status),
  MAX(d1.DateCreated)
FROM DocumentStatusLogs d, DocumentStatusLogs d1
USING DocumentID
GROUP BY 1
ORDER BY 3 DESC

从上面验证克林特的正确答案:

下面两个查询之间的性能非常有趣。52%是最高的。48%是第二个。使用DISTINCT而不是ORDER BY提高了4%的性能。但是ORDER BY具有按多列排序的优势。

IF (OBJECT_ID('tempdb..#DocumentStatusLogs') IS NOT NULL) BEGIN DROP TABLE #DocumentStatusLogs END

CREATE TABLE #DocumentStatusLogs (
    [ID] int NOT NULL,
    [DocumentID] int NOT NULL,
    [Status] varchar(20),
    [DateCreated] datetime
)

INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (2, 1, 'S1', '7/29/2011 1:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (3, 1, 'S2', '7/30/2011 2:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 1, 'S1', '8/02/2011 3:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (1, 2, 'S1', '7/28/2011 4:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (4, 2, 'S2', '7/30/2011 5:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (5, 2, 'S3', '8/01/2011 6:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 3, 'S1', '8/02/2011 7:00:00')

选项1:

    SELECT
    [Extent1].[ID], 
    [Extent1].[DocumentID],
    [Extent1].[Status], 
    [Extent1].[DateCreated]
FROM #DocumentStatusLogs AS [Extent1]
    OUTER APPLY (
        SELECT TOP 1
            [Extent2].[ID], 
            [Extent2].[DocumentID],
            [Extent2].[Status], 
            [Extent2].[DateCreated]
        FROM #DocumentStatusLogs AS [Extent2]
        WHERE [Extent1].[DocumentID] = [Extent2].[DocumentID]
        ORDER BY [Extent2].[DateCreated] DESC, [Extent2].[ID] DESC
    ) AS [Project2]
WHERE ([Project2].[ID] IS NULL OR [Project2].[ID] = [Extent1].[ID])

选项2:

SELECT 
    [Limit1].[DocumentID] AS [ID], 
    [Limit1].[DocumentID] AS [DocumentID], 
    [Limit1].[Status] AS [Status], 
    [Limit1].[DateCreated] AS [DateCreated]
FROM (
    SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM #DocumentStatusLogs AS [Extent1]
) AS [Distinct1]
    OUTER APPLY  (
        SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
        FROM (
            SELECT 
                [Extent2].[ID] AS [ID], 
                [Extent2].[DocumentID] AS [DocumentID], 
                [Extent2].[Status] AS [Status], 
                [Extent2].[DateCreated] AS [DateCreated]
            FROM #DocumentStatusLogs AS [Extent2]
            WHERE [Distinct1].[DocumentID] = [Extent2].[DocumentID]
        )  AS [Project2]
        ORDER BY [Project2].[ID] DESC
    ) AS [Limit1]

在Microsoft SQL Server Management Studio中:在突出显示并运行第一个块后,突出显示选项1和选项2,右键单击->[显示估计的执行计划]。然后运行整个程序以查看结果。

选项1结果:

ID  DocumentID  Status  DateCreated
6   1   S1  8/2/11 3:00
5   2   S3  8/1/11 6:00
6   3   S1  8/2/11 7:00

选项2结果:

ID  DocumentID  Status  DateCreated
6   1   S1  8/2/11 3:00
5   2   S3  8/1/11 6:00
6   3   S1  8/2/11 7:00

注意:

当我想要一个连接为1到-(多个连接中的1个)时,我倾向于使用APPLY。 如果我想要连接为1对多或多对多,则使用JOIN。 我使用ROW_NUMBER()避免CTE,除非我需要做一些高级的事情,并且可以接受窗口性能损失。

我还避免在WHERE或ON子句中使用EXISTS / IN子查询,因为我曾经经历过这样会导致一些糟糕的执行计划。但具体里程各不相同。在需要的时候随时随地检查执行计划和概要文件的性能!