Sunday, December 29, 2013

Clustered Index in SQL Server

Indices are of many types in SQL Server. But mainly we hear the words Clustered Index and Non-Clustered Index. Both have different functionality and architecture. In this article we will see about Clustered Index, its functionality and architecture.

Index acts like a pointer. Index is mainly used to sort the data rows. Clustered Index sorts the rows in the physical order they are present on the disk. It may be Ascending or Descending order, based on user's choice. Clustered Index has fields like Key Value and it stores the data row. When a data row is queried, Optimizer goes to the row through Index and gets the row through its Key Value.

Clustered Index can be only one per table because sorting the rows in a physical order they are stored on disk can be done only once. It can be of two types namely Unique and Non-Unique. When a column is defined as Primary Key, automatically a unique Clustered Index is created. It can also be defined on a table manually, either Unique or Non-Unique. Non-Unique Clustered Index allows duplicate values where as Unique doesn't allow.

Architecture


The aim of Index is to divide the table into pages, allocating some number of rows per page. The size of a page is 8 KB. This 8 KB is allocated based on type of columns you use in table. If you use varchar columns of larger size, which is equal to 8 KB then it may allocate only one row per page.


Levels of a clustered index

This is the architecture specified by Microsoft in its official documentation on Clustered Index. It is represented as a B- Tree. It is a doubly linked list. The leaf nodes contain a Key Value and Data Pages which in turn contain Data Rows. The intermediate level nodes point to the leaf nodes. These nodes contain the Key Values of the leaf nodes. The root node points to the intermediate level leaf nodes. 

The above architecture is for one partition. Partitioning means dividing a table among different file groups. Each partition contains some number of rows. But this feature is available in Enterprise Edition of SQL Server, not in Express Editions. In Express Editions, a complete table is considered as a Partition. For a partition, the id of Clustered Index is always 1. You can check by running the following query:

SELECT * FROM sys.partitions

See the Result Pane now:




You can see in the Result Pane, wherever the column index_id has value 1 that represents the respective table has a clustered index on it. You can also observe that partition_number has value 1. It is because no partitions are made on any table. The value 0 for index_id for a table means, a No Index exists on that table. 

To know about indexes on tables, you can use the following query:

SELECT * FROM sys.indexes

See the Result Pane:




It can be seen what type of indexes are present on which table. If the index_id column has 1 then it's a clustered index because only one clustered index can exist per table. The type of clustered index is 1. If that clustered index is created automatically on creation of Primary Key then the column is_primary_key column has value 1.

Usage of Clustered Index


Indexing do has advantages and Usage. When you query a record from a table, Optimizer doesn't scan the whole table to fetch that row. Instead it goes by index. It searches key values of index, it gets the memory address of the data row and then fetches it to user. You can notice that in Execution Plan. Let's have a small practical on this.

Let's create a Test table and insert some records into it as follows:

CREATE TABLE Test (ID INT,Name VARCHAR(100))

INSERT Test VALUES (1,'A')
INSERT Test VALUES (2,'B')
INSERT Test VALUES (3,'B')

Turn ON 'Include Actual Execution Plan' and query data from above table.

SELECT * FROM Test

SELECT * FROM Test WHERE ID=2

Run the above two queries, you will find the same execution plan.




Here Query Optimizer is scanning the whole table if you query whole table or a single row with a query predicate. Right now there are only three records in the table. If there are millions of records then it takes a long time for Optimizer to search the whole table for a single record. In such cases, Index comes into use. Let's create a clustered index on ID column.

CREATE CLUSTERED INDEX CI_ID ON Test (ID DESC)

I specified an order DESC (descending). It can be specified if we need such order, otherwise the default ordering is ASC (ascending order). Now let's run above queries again.

SELECT * FROM Test

SELECT * FROM Test WHERE ID=2

See the Execution Plan now:




Here optimizer performed Clustered Index Scan operation when whole table is queried and Clustered Index Seek operation when a single record is queried. Scan means it scanned the whole index whereas Seek means it just went to a particular Key Value and an Address in the index. Now go to the Results tab to see the order we have given while creating Index (i.e descending order).



The values of ID column are sorted in descending order hence the values of Name column are also sorted in the same. Now let's test one more functionality of our index. We didn't specify UNIQUE while creating index which means it is Non-Unique and it should allow duplicate values in the column ID. So let's add a duplicate row.

INSERT Test VALUES (1,'D')

Now it can be seen that this row is inserted into the table when we query data.




Now let's drop the existing index and create Unique index on the same column.

DROP INDEX CI_ID ON Test
GO
CREATE UNIQUE CLUSTERED INDEX CI_ID ON Test (ID DESC)

After running the above code we get an error like:

The CREATE UNIQUE INDEX statement terminated because a duplicate key was found for the object name 'dbo.Test' and the index name 'CI_ID'. The duplicate key value is (1).
The statement has been terminated.

This means there are duplicate values in the table, so Unique Clustered Index cannot be created. It even specifies the duplicate value present as 1. This also proves another feature. When a Primary Key is created on a column, it creates a Unique Clustered Index which allows neither duplicate values nor NULL values.

Disadvantages of Clustered Index


Every feature has its own advantages and disadvantages. So does Clustered Index.

Disk Space


Having index occupies more disk space with increasing number of records in the table. Sometimes a situation occurs that size occupied by index is greater than size occupied by records in the table. The space occupation can be acquired by running the following system stored procedure.

EXEC sp_spaceused Test

See the Result Pane:




We can see that size occupied by data and index are same in Test table i.e. 8 KB. In such cases, creating index is an overhead for Server.

Fragmentation


Generally in Windows, if we do more file deletions and creations in a disk then there occurs a fragmentation of disk space. The same is the case with index. If more number of INSERT, UPDATE and DELETE operations are performed on the table with index then it results in Fragmentation of index. This fragmentation causes slow execution of query. For 45000 rows, it takes 5 seconds to fetch. In such cases, Re-Organize or Re-Build of Index should be done. Then the fragmented space is claimed and index is de-fragmented.

Sorting on Modification


Every time a row is deleted from a table with index, optimizer needs to sort the remaining rows in the order specified while creating the index. This increases burden on Optimizer and also consumes more I/O resources. So it is recommended that index should not be created on tables which are modified frequently.

1 comment:

  1. SQL is actually amongst the many database operations and queries that help people to break down beneath the database errors and solutions.

    SQL Server Load Soap API

    ReplyDelete