Sunday, April 10, 2016

SQL Server - Indexes with Included Columns

SQL Server 2005 came up with a facility to extend the Non Clustered Index to cover another column and such column is called "Included Column". The column on which the index is created is called "Key Column" and including a non key column to it adds it to the leaf level of non clustered index. There is no restriction that key column and non key column of the index should be of same data type. It has also an advantage that including non key column does not increase the size occupied by index.

An included index increases performance with less I/O reads when all the columns included in the index is used in the SELECT list. If so, the optimizer never goes to clustered index thus decreasing I/O reads. Let's see a scenario of included columns.

Create a table and populate some records in it,

CREATE TABLE INCLUDE_COLUMN (ID INT,NAME VARCHAR(100),CONSTRAINT PK_INCLUDE_COLUMN PRIMARY KEY (ID))
GO
DECLARE @ID INT = 1
WHILE (@ID <= 1000)
BEGIN
INSERT INTO INCLUDE_COLUMN SELECT @ID,'A_'+CAST(@ID AS varchar)
SET @ID = @ID + 1
END
GO

Turn on "Actual Execution Plan" and run the below query,

SELECT * FROM INCLUDE_COLUMN

As known, optimizer uses clustered index created along with primary key. Now, add a column to the table and populate some values in it,

ALTER TABLE INCLUDE_COLUMN ADD NAME1 VARCHAR(100)
GO
UPDATE INCLUDE_COLUMN SET NAME1 = CAST(ID AS varchar)+NAME

Now create a non clustered index on the column "Name" and include the newly added "Name1" column in that index as follows,

CREATE NONCLUSTERED INDEX IX_INCLUDE_COLUMN_NAME ON INCLUDE_COLUMN (NAME) INCLUDE (NAME1)

If we query the entire table now, optimizer may use clustered index or non clustered index depending on its feasibility, as stated in our previous article. But Books Online states that optimize does not go for clustered index only if all the included columns of non clustered index are used in SELECT list. Right now, we have all the columns covered in some indexes. So, to test the performance of included columns, let's add another column to our table.

ALTER TABLE INCLUDE_COLUMN ADD SURNAME VARCHAR(100)
GO
UPDATE INCLUDE_COLUMN SET SURNAME = NAME + NAME1
GO

Now if the whole table is queried then clustered index is used for sure. Let's see in what scenario, clustered index is avoided.

SELECT ID,NAME,NAME1 FROM INCLUDE_COLUMN

This query goes for non clustered index as we have included all the indexed columns in our SELECT list. As it claims to have lesser I/O reads, let's force the optimizer to use our clustered index and compare the costs for both the queries.

SELECT ID,NAME,NAME1 FROM INCLUDE_COLUMN
GO
SELECT ID,NAME,NAME1 FROM INCLUDE_COLUMN WITH (INDEX(PK_INCLUDE_COLUMN))

Compare the I/O cost, CPU Cost, Operator Cost and CPU Cost and you can notice that using Non Clustered Index incurs lesser cost than that of clustered index. To use clustered index, optimizer needs to access disk which requires more amount of CPU resources. So, include columns are introduced. This gives more performance when you have a huge amount of data in your table.

Remember one thing,

SQL Server Query Optimizer knows the best way of retrieve the data that gives good performance.

SQL Server - Usage of Indexes in Execution Plan

Indexes are used to have a better performance while retrieving data. I'm not going to refer basics in this article. Rather, I'd like to share my experiences with the well known indexes i.e., Clustered and Non Clustered Indexes and how they are used by SQL Server Query Optimizer in its execution plan.

As the performance of index can be observed when there is a huge amount of data, let's have that data using the following code,

CREATE DATABASE YASHWANTH
GO
USE YASHWANTH
GO
CREATE TABLE TEST (ID INT,NAME VARCHAR(100),CONSTRAINT PK_TEST_ID PRIMARY KEY (ID))
GO
DECLARE @ID INT = 1
WHILE (@ID <= 2000000)
BEGIN
INSERT INTO TEST SELECT @ID,'A_'+CAST(@ID AS varchar)
SET @ID = @ID + 1
END
GO

Now that, we have 20 lakhs of unique records in our table. As we created a table with a primary key column, a clustered index is created automatically. Let's do some operations on this table.

Turn on "Actual Execution Plan" and run the following query,

SELECT * FROM TEST

After the entire table result is retrieved, check the execution plan where we can find this,


As per our knowledge, query optimizer uses the primary key's clustered index and scans it using the operator "Clustered Index Scan". If a "where" clause is used then it uses the same index using "Clustered Index Seek" operator.

We know that among all indexes, Clustered is a high prioritized index and this is why it gets created automatically when a primary key is created on the table. Is it true in all the cases?

Let's create a Non Clustered Index on the column "Name" of our table. You can create a unique or non-unique or ascending or descending. Here, I'm creating a normal non-unique ascending non clustered index,

CREATE NONCLUSTERED INDEX IX_TEST_NAME ON TEST (NAME)

Now run the below query and observe the execution plan,

SELECT * FROM TEST


Though there is a high prioritized clustered index, optimizer used newly created Non Clustered index. We know that clustered index orders table data on the disk according to it. But why does optimizer go for non clustered index instead of clustered index.

Now, let's do some more operations to have our doubts cleared. Clear the procedure cache.

CHECKPOINT
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS

We can force the optimizer to use our favourite clustered index instead of non clustered index. Let's try that now. Run the following query,

SELECT * FROM TEST WITH (INDEX(PK_TEST_ID))

My system is of I5 processor with 8GB of physical memory. It used clustered index and retrieved 20 lakh records within 11 seconds. Now run the below query,

SELECT * FROM TEST

It used its favourite non clustered index and retrieved the same result of 20 lakh records within 10 seconds in my system.

Why did this happen?

Because SQL Server is built with query optimization capability without the manual intervention. The optimizer knows it can get result faster when it uses non clustered index and hence it used it. Earlier, it had only clustered index to use and now it has two and an option to decide which can give better performance.

Still confused?

Take the following query,

SELECT ID FROM TEST

Now, it should definitely use the clustered index as we used only the column "ID" in our SELECT list. But execute it and see the result and execution plan,


In my system, I got the result in 9 seconds this time, performing "Non Clustered Index Scan".

Let's see the CPU usage when we run our previous queries individually,

SELECT * FROM TEST
GO
SELECT * FROM TEST WITH (INDEX(PK_TEST_ID))

Observe the Subtree Cost, I/O Cost and Operator Cost for both the queries where we can find that using Non Clustered Index incurred lesser costs. Now, run the both queries at once and see the execution plan to get the percentage wise performance stats,


Using clustered index used 52% of the total 100% execution plan for both the queries. Since SQL Server 2005, query optimizer knows how to use resources to have the better performance.

When used a "where" clause, optimizer uses "Clustered Index Seek" operator for the below first two queries and "Non Clustered Index Seek" operator for the third query,

SELECT * FROM TEST WHERE ID = 1 AND NAME = 'A_1'
GO
SELECT * FROM TEST WHERE NAME = 'A_1' AND ID = 1
GO
SELECT * FROM TEST WHERE NAME = 'A_1'

Now let's do another observation on usage of indexes in execution plan by adding a new column to the table,

ALTER TABLE TEST ADD SURNAME VARCHAR(100)

Now, we have no index on the column "SURNAME". Let's run the below query,

SELECT * FROM TEST


Clustered Index Scan is performed because we are querying all the columns out of which "SURNAME" has no index and optimizer has to rely on clustered index to retrieve its data. Here are some more observations,

SELECT ID,NAME FROM TEST -- PERFORMS NON CLUSTERED INDEX SCAN
GO
SELECT ID,SURNAME FROM TEST -- PERFORMS CLUSTERED INDEX SCAN
GO
SELECT NAME,SURNAME FROM TEST -- PERFORMS CLUSTERED INDEX SCAN
GO
SELECT * FROM TEST WHERE NAME = 'A_1' -- PERFORMS NON CLUSTERED INDEX SEEK AND KEY LOOKUP IN CLUSTERED INDEX FOR "SURNAME" COLUMN
GO
SELECT NAME,SURNAME FROM TEST WHERE NAME = 'A_1' -- PERFORMS NON CLUSTERED INDEX SEEK AND KEY LOOKUP IN CLUSTERED INDEX FOR "SURNAME" COLUMN
GO
SELECT SURNAME FROM TEST -- PERFORMS CLUSTERED INDEX SCAN
GO
SELECT ID FROM TEST -- PERFORMS NON CLUSTERED INDEX SCAN
GO
SELECT NAME FROM TEST WHERE ID = 1 -- PERFORMS CLUSTERED INDEX SEEK
GO
SELECT SURNAME FROM TEST WHERE NAME = 'A_1' -- PERFORMS NON CLUSTERED INDEX SEEK AND KEY LOOKUP IN CLUSTERED INDEX FOR "SURNAME" COLUMN

Try any other combinations I missed here and let me know through your comments.

Now, we observed when the optimizer is going for Clustered and when for Non Clustered. There is one more index called "Include Index" in used terminology which is called "Include Column" technically. It doesn't let optimizer go for Clustered Index hence reducing disk I/O and increasing query performance. I will write about it in my next article.

Now it's time for clean up,

DROP TABLE TEST
GO
USE master
GO
DROP DATABASE YASHWANTH