Sunday, August 23, 2009

Index Organization In SQL Server 2005



Index Organization
Think of the indexes you might see in your everyday lifethose in books and other documents. In this era of massive amounts of online information and extremely fast online search engines, it's probably rare that you actually open a book to look for information, so you might have to stretch your imagination a bit to follow my analogy here. But assume that all you have is printed reference materials. Suppose you want to know the definition of a word and you have two options. First one is your text book and the second option is the dictionary. You can quickly find the information either from the book or from the dictionary even though the two are organized in completely different way.
In the dictionary all the words are organized in alphabetical order. Letters are shown on top of the each page to tell you the words that are on the page. So you can quickly flip through just a few pages and end up at the page that has your word.
In your text book there is no helpful key words on top of each page except the page number but you have the index at back of the book and all the index entries are organized alphabetically. So again you can quickly find your key word. Unlike the dictionary, the book index only gives you a page number where you can find the definition of your key word.
These searches through the two books are analogous to using clustered and non-clustered indexes. In a clustered index, the data is actually stored in order, just like the dictionary has all the words in order. Once you find the data you’re looking for you are done with the search. In a nonclustered index, the index is a completely separate structure from the data itself. Once you find what you're looking for in the index, you have to follow pointers to the actual data. A nonclustered index in SQL Server is very much like the index in the back of a book. If only one page reference is listed in the index, you can quickly find the information. If a dozen pages are listed, it takes quite a bit longer to find what you need. If hundreds of pages are listed, you might think there's no point in using the index at all.
Indexes in SQL Server store their information using standard B-trees. A B-tree provides fast access to data by searching on a key value of the index. B-trees cluster records with similar keys. The B stands for balanced, and balancing the tree is a core feature of a B-tree's usefulness. The trees are managed, and branches are grafted as necessary, so navigating down the tree to find a value and locate a specific record takes only a few page accesses. Because the trees are balanced, finding any record requires about the same amount of resources, and retrieval speed is consistent because the index has the same depth throughout.



Clustered Indexes
The leaf level of a clustered index contains the data pages i.e. all the columns of every row are in the leaf level. Another way to say this is that the data itself is part of the clustered index. A clustered index keeps the data in a table ordered around the key. The data pages in the table are kept in a doubly linked list called a page chain. (Note that pages in a heap are not linked together.) The order of pages in the page chain, and the order of rows on the data pages, is the order of the index key or keys. Deciding which key to cluster on is an important performance consideration. When the index is traversed to the leaf level, the data itself has been retrieved, not simply pointed to.
Because the actual page chain for the data pages can be ordered in only one way, a table can have only one clustered index. The query optimizer strongly favors a clustered index in many cases because such an index allows the data to be found directly at the leaf level. Because it defines the logical order of the data, a clustered index allows especially fast access for queries that look for a range of values. The query optimizer detects that only a certain range of data pages must be scanned.
Most tables should have a clustered index. If your table will have only one index, it generally should be clustered. Many documents describing SQL Server indexes will tell you that the clustered index physically stores the data in sorted order. This can be misleading if you think of physical storage as the disk itself. If a clustered index had to keep the data on the actual disk in a particular order, it could be prohibitively expensive to make changes. If a page got too full and had to be split in two, all the data on all the succeeding pages would have to be moved down. Sorted order in a clustered index simply means that the data page chain is logically in order. If SQL Server follows the page chain, it can access each row in clustered index key order, but new pages can be added simply by adjusting the links in the page chain.
In SQL Server 2005, all clustered indexes are unique. If you build a clustered index without specifying the UNIQUE keyword, SQL Server guarantees uniqueness internally by adding a uniqueifier to the rows when necessary. This uniqueifier is a 4-byte value added as an additional clustered index key column. It is added only to the rows that have duplicates of their declared index key column(s).



Nonclustered Indexes
In a nonclustered index, the leaf level does not contain all the data. In addition to the key values, each index row in the leaf level (the lowest level of the tree) contains a bookmark that tells SQL Server where to find the data row corresponding to the key in the index. A bookmark can take one of two forms. If the table has a clustered index, the bookmark is the clustered index key for the corresponding data row. If the table is a heap (in other words, it has no clustered index), the bookmark is a row identifier (RID), which is an actual row locator in the form File#:Page#:Slot#.
The presence or absence of a nonclustered index doesn't affect how the data pages are organized, so you're not restricted to having only one nonclustered index per table, as is the case with clustered indexes. Each table can include as many as 249 nonclustered indexes, but you'll usually want to have far fewer than that.
When you search for data using a nonclustered index, the index is traversed and then SQL Server retrieves the record or records pointed to by the leaf-level indexes. For example, if you're looking for a data page using a seek operation on an index with a depth of three a root page, one intermediate page, and the leaf page all three index pages must be traversed. If the leaf level contains a clustered index key, all the levels of the clustered index must then be traversed to locate the specific row. The clustered index will probably also have three levels, but in this case remember that the leaf level is the data itself. There are two additional index levels separate from the data, typically one less than the number of levels needed for a nonclustered index. The data page still must be retrieved, but because it has been exactly identified, there's no need to scan the entire table. Still, it takes six logical I/O operations to get one data page. You can see that a nonclustered index is a win only if it's highly selective.