我听说,应该在索引声明的开头放置最具有选择性的列。例子:

CREATE NONCLUSTERED INDEX MyINDX on Table1
(
   MostSelective,
   SecondMost,
   Least
)

首先,我说的对吗?如果是这样,我是否可能通过重新排列索引中的列的顺序来看到性能上的巨大差异,或者它更像是一种“不错的做法”?

我这样问的原因是,在通过DTA进行查询后,它建议我创建一个索引,其中几乎所有列都与现有索引相同,只是顺序不同。我正在考虑把缺失的列添加到现有的索引中,并称之为好。想法吗?


您应该将最具选择性的列放在索引声明的开头。

正确的。 索引可以是复合索引(由多个列组成),由于最左边原则,索引的顺序很重要。原因是,数据库从左到右检查列表,并且必须找到与定义顺序匹配的相应列引用。例如,在一个列地址表上有一个索引:

地址 城市 状态

任何使用地址列的查询都可以使用索引,但如果查询只有城市和/或州引用,则不能使用索引。这是因为最左边的列没有被引用。查询性能应该告诉您哪个是最优的——单个索引,还是具有不同顺序的多个组合。《引爆点》(The Tipping Point),金伯利·特里普著


看看这样的索引:

Cols
  1   2   3
-------------
|   | 1 |   |
| A |---|   |
|   | 2 |   |
|---|---|   |
|   |   |   |
|   | 1 | 9 |
| B |   |   |
|   |---|   |
|   | 2 |   |
|   |---|   |
|   | 3 |   |
|---|---|   |

看看首先限制A,因为你的第一列比限制第二列减少了更多的结果?如果您想象一下索引必须如何遍历,列1,然后列2,等等,这就更容易了……您可以看到,在第一次传递中删除大部分结果会使第二步更快。

另一种情况是,如果在列3上查询,优化器甚至不会使用索引,因为它对缩小结果集没有任何帮助。在进行查询时,在下一步之前缩小要处理的结果数量意味着更好的性能。

由于索引也是以这种方式存储的,因此在查询索引时,不需要回溯索引来查找第一列。

简而言之:不,这不是为了炫耀,它有真正的性能好处。


The order of columns is critical. Now which order is correct it depends on how you are going to query it. An index can be used to do an exact seek or an range scan. An exact seek is when values for all columns in the index are specified and the query lands exactly on the row is interested in. For seeks the order of columns is irrelevant. A range scan is when only some columns are specified, and in this case when the order becomes important. SQL Server can use an index for a range scan only if the leftmost column is specified, and then only if the next leftmost column is specified, and so on. If you have an index on (A,B,C) it can be used to range scan for A=@a, for A=@a AND B=@b but not for B=@b, for C=@c norB=@b AND C=@c. The case A=@a AND C=@c is mixed one, as in the A=@a portion will use the index, but the C=@c not (the query will scan all B values for A=@a, will not 'skip' to C=@c). Other database systems have the so called 'skip scan' operator that can take some advantage of inner columns in an index when the outer columns are not specified.

With that knowledge in hand you can look at the index definitions again. An index on (MostSelective, SecondMost, Least) will be effective only when MostSelective column is specified. But that being the most selective, the relevance of the inner columns will quickly degrade. Very often you'll find that a better index is on (MostSelective) include (SecondMost, Least) or on (MostSelective, SecondMost) include (Least). Because the inner columns are less relevant, placing low selectivity columns in such right positions in the index makes them nothing but noise for a seek, so it makes sense to move them out of the intermediate pages and keep them only on the leaf pages, for query coverability purposes. In other words, move them to INCLUDE. This becomes more important as the size of Least column increases. The idea is that this index can only benefit queries that specify MostSelective either as an exact value or a range, and that column being the most selective it already restricts the candidate rows to great extent.

On the other hand an index on (Least, SecondMost, MostSelective) may seem a mistake, but it actually quite a powerful index. Because it has the Least column as its outermost query, it can be used for queries that have to aggregate results on low selectivity columns. Such queries are prevalent in OLAP and analysis data warehouses, and this is exactly where such indexes have a very good case going for them. Such indexes actually make excellent clustered indexes, exactly because they organize the physical layout on large chunks of related rows (same Least value, which usually indicate some sort of category or type) and they facilitate analysis queries.

所以,不幸的是,没有“正确”的顺序。您不应该遵循任何千篇一律的配方,而是要分析您将要对这些表使用的查询模式,并决定哪个索引列顺序是正确的。


正如Remus所说,这取决于你的工作量。

不过,我想指出公认答案的误导方面。

对于在索引中的所有列上执行相等搜索的查询,没有显著差异。

下面创建两个表并用相同的数据填充它们。唯一的区别是,一个键的顺序从最多到最少,另一个则相反。

CREATE TABLE Table1(MostSelective char(800), SecondMost TINYINT, Least  CHAR(1), Filler CHAR(4000) null);
CREATE TABLE Table2(MostSelective char(800), SecondMost TINYINT, Least  CHAR(1), Filler CHAR(4000) null);

CREATE NONCLUSTERED INDEX MyINDX on Table1(MostSelective,SecondMost,Least);
CREATE NONCLUSTERED INDEX MyINDX2 on Table2(Least,SecondMost,MostSelective);

INSERT INTO Table1 (MostSelective, SecondMost, Least)
output inserted.* into Table2
SELECT TOP 26 REPLICATE(CHAR(number + 65),800), number/5, '~'
FROM master..spt_values
WHERE type = 'P' AND number >= 0
ORDER BY number;

现在对这两个表进行查询……

SELECT *
FROM   Table1
WHERE  MostSelective = REPLICATE('P', 800)
       AND SecondMost = 3
       AND Least = '~';

SELECT *
FROM   Table2
WHERE  MostSelective = REPLICATE('P', 800)
       AND SecondMost = 3
       AND Least = '~'; 

... 两者都使用指数罚款,两者的成本完全相同。

公认答案中的ASCII艺术实际上并不是索引的结构。表1的索引页如下所示(单击图像以全尺寸打开)。

索引页包含包含整个键的行(在这种情况下,实际上有一个额外的键列附加为行标识符,因为索引没有声明为唯一的,但可以忽略关于此的进一步信息,可以在这里找到)。

对于上面的查询,SQL Server不关心列的选择性。它对根页面进行二进制搜索,并发现Key (PPP…,3,~)是>=(JJJ…,1,~)和< (SSS…,3,~)所以应该读第1页118。然后,它对该页上的关键项进行二进制搜索,并定位要向下移动到的叶页。

按照选择性的顺序更改索引既不会影响二元搜索中键比较的预期数量,也不会影响进行索引查找所需导航的页面数量。最好的情况下,它可能会略微加快键比较本身。

不过,有时优先排序最具选择性的索引对于工作负载中的其他查询是有意义的。

例如,如果工作负载包含以下两种表单的查询。

SELECT * ... WHERE  MostSelective = 'P'

SELECT * ...WHERE Least = '~'

上面的指数并没有涵盖这两种情况。MostSelective具有足够的选择性,使得包含查找和查找的计划有价值,但针对Least的查询则没有。

然而,这种情况(对复合索引的主列子集进行非覆盖索引查找)只是索引可以帮助的一种可能的查询类别。如果你从来没有通过MostSelective或MostSelective和SecondMost的组合来搜索,而总是通过这三列的组合来搜索,那么这个理论上的优势对你来说是没有用的。

相反,查询如下

SELECT MostSelective,
       SecondMost,
       Least
FROM   Table2
WHERE  Least = '~'
ORDER  BY SecondMost,
          MostSelective 

如果使用与通常指定的顺序相反的顺序,则会有所帮助-因为它涵盖了查询,可以支持seek并以所需的顺序返回行以引导。

因此,这是一个经常被重复的建议,但最多只是关于其他查询的潜在好处的启发式——它不能替代实际查看您的工作量。


选择性是一个非常次要的因素;“最左边”很关键

在选择顺序时,复合索引中各个列的选择性无关紧要。

下面是简单的思考过程:实际上,索引是相关列的连接。

基于这种原理,唯一的区别是比较字符串中较早和较晚不同的两个“字符串”。这只是总成本的一小部分。没有“第一关/第二关”,就像在一个答案中提到的那样。

那么,应该用什么顺序呢?

从用=测试的列开始,以任意顺序。 然后附加一个范围列。

例如,非常低的选择性列必须排在前面:

WHERE deleted = 0  AND  the_datetime > NOW() - INTERVAL 7 DAY
INDEX(deleted, the_datetime)

交换索引中的顺序将使其完全忽略删除。

(对于列的排序还有很多规则。)