Sunday, December 29, 2013

Clustered Index in SQL Server

Indices are of many types in SQL Server. But mainly we hear the words Clustered Index and Non-Clustered Index. Both have different functionality and architecture. In this article we will see about Clustered Index, its functionality and architecture.

Index acts like a pointer. Index is mainly used to sort the data rows. Clustered Index sorts the rows in the physical order they are present on the disk. It may be Ascending or Descending order, based on user's choice. Clustered Index has fields like Key Value and it stores the data row. When a data row is queried, Optimizer goes to the row through Index and gets the row through its Key Value.

Clustered Index can be only one per table because sorting the rows in a physical order they are stored on disk can be done only once. It can be of two types namely Unique and Non-Unique. When a column is defined as Primary Key, automatically a unique Clustered Index is created. It can also be defined on a table manually, either Unique or Non-Unique. Non-Unique Clustered Index allows duplicate values where as Unique doesn't allow.

Architecture


The aim of Index is to divide the table into pages, allocating some number of rows per page. The size of a page is 8 KB. This 8 KB is allocated based on type of columns you use in table. If you use varchar columns of larger size, which is equal to 8 KB then it may allocate only one row per page.


Levels of a clustered index

This is the architecture specified by Microsoft in its official documentation on Clustered Index. It is represented as a B- Tree. It is a doubly linked list. The leaf nodes contain a Key Value and Data Pages which in turn contain Data Rows. The intermediate level nodes point to the leaf nodes. These nodes contain the Key Values of the leaf nodes. The root node points to the intermediate level leaf nodes. 

The above architecture is for one partition. Partitioning means dividing a table among different file groups. Each partition contains some number of rows. But this feature is available in Enterprise Edition of SQL Server, not in Express Editions. In Express Editions, a complete table is considered as a Partition. For a partition, the id of Clustered Index is always 1. You can check by running the following query:

SELECT * FROM sys.partitions

See the Result Pane now:




You can see in the Result Pane, wherever the column index_id has value 1 that represents the respective table has a clustered index on it. You can also observe that partition_number has value 1. It is because no partitions are made on any table. The value 0 for index_id for a table means, a No Index exists on that table. 

To know about indexes on tables, you can use the following query:

SELECT * FROM sys.indexes

See the Result Pane:




It can be seen what type of indexes are present on which table. If the index_id column has 1 then it's a clustered index because only one clustered index can exist per table. The type of clustered index is 1. If that clustered index is created automatically on creation of Primary Key then the column is_primary_key column has value 1.

Usage of Clustered Index


Indexing do has advantages and Usage. When you query a record from a table, Optimizer doesn't scan the whole table to fetch that row. Instead it goes by index. It searches key values of index, it gets the memory address of the data row and then fetches it to user. You can notice that in Execution Plan. Let's have a small practical on this.

Let's create a Test table and insert some records into it as follows:

CREATE TABLE Test (ID INT,Name VARCHAR(100))

INSERT Test VALUES (1,'A')
INSERT Test VALUES (2,'B')
INSERT Test VALUES (3,'B')

Turn ON 'Include Actual Execution Plan' and query data from above table.

SELECT * FROM Test

SELECT * FROM Test WHERE ID=2

Run the above two queries, you will find the same execution plan.




Here Query Optimizer is scanning the whole table if you query whole table or a single row with a query predicate. Right now there are only three records in the table. If there are millions of records then it takes a long time for Optimizer to search the whole table for a single record. In such cases, Index comes into use. Let's create a clustered index on ID column.

CREATE CLUSTERED INDEX CI_ID ON Test (ID DESC)

I specified an order DESC (descending). It can be specified if we need such order, otherwise the default ordering is ASC (ascending order). Now let's run above queries again.

SELECT * FROM Test

SELECT * FROM Test WHERE ID=2

See the Execution Plan now:




Here optimizer performed Clustered Index Scan operation when whole table is queried and Clustered Index Seek operation when a single record is queried. Scan means it scanned the whole index whereas Seek means it just went to a particular Key Value and an Address in the index. Now go to the Results tab to see the order we have given while creating Index (i.e descending order).



The values of ID column are sorted in descending order hence the values of Name column are also sorted in the same. Now let's test one more functionality of our index. We didn't specify UNIQUE while creating index which means it is Non-Unique and it should allow duplicate values in the column ID. So let's add a duplicate row.

INSERT Test VALUES (1,'D')

Now it can be seen that this row is inserted into the table when we query data.




Now let's drop the existing index and create Unique index on the same column.

DROP INDEX CI_ID ON Test
GO
CREATE UNIQUE CLUSTERED INDEX CI_ID ON Test (ID DESC)

After running the above code we get an error like:

The CREATE UNIQUE INDEX statement terminated because a duplicate key was found for the object name 'dbo.Test' and the index name 'CI_ID'. The duplicate key value is (1).
The statement has been terminated.

This means there are duplicate values in the table, so Unique Clustered Index cannot be created. It even specifies the duplicate value present as 1. This also proves another feature. When a Primary Key is created on a column, it creates a Unique Clustered Index which allows neither duplicate values nor NULL values.

Disadvantages of Clustered Index


Every feature has its own advantages and disadvantages. So does Clustered Index.

Disk Space


Having index occupies more disk space with increasing number of records in the table. Sometimes a situation occurs that size occupied by index is greater than size occupied by records in the table. The space occupation can be acquired by running the following system stored procedure.

EXEC sp_spaceused Test

See the Result Pane:




We can see that size occupied by data and index are same in Test table i.e. 8 KB. In such cases, creating index is an overhead for Server.

Fragmentation


Generally in Windows, if we do more file deletions and creations in a disk then there occurs a fragmentation of disk space. The same is the case with index. If more number of INSERT, UPDATE and DELETE operations are performed on the table with index then it results in Fragmentation of index. This fragmentation causes slow execution of query. For 45000 rows, it takes 5 seconds to fetch. In such cases, Re-Organize or Re-Build of Index should be done. Then the fragmented space is claimed and index is de-fragmented.

Sorting on Modification


Every time a row is deleted from a table with index, optimizer needs to sort the remaining rows in the order specified while creating the index. This increases burden on Optimizer and also consumes more I/O resources. So it is recommended that index should not be created on tables which are modified frequently.

Wednesday, December 25, 2013

Joins in SQL Server

Joins are very useful in situations where you need to get data from different tables. Joins are written based on columns from both the tables. A column from each table is taken for writing the join.

The following are the kinds of joins used in SQL Server

                                   1. Inner Join
                                   2. Outer Join
                                   3. Cross Join
                                   4. Self Join
                                   5. Comma Join

Let's go inside each join with a practical explanation.

Inner Join


Inner Join is a join which returns the matched rows from both the tables. The matching is done based on a column in each table. It ignores NULL values. Though there are NULL values in both the tables, it doesn't match them.

Let's see this in practical:

To do research on different features, Microsoft provides a sample database with tables and data inserted into it. namely AdventureWorks. Let's do our Join adventures on that database.

USE AdventureWorks2012
GO
SELECT * FROM Person.Person
GO
SELECT * FROM Person.PersonPhone

There are two tables like I specified above. If I run the above code, I get results as:


You can see that there is a common column among both the tables by name BusinessEntityID.Let's join both the tables based on this column.

SELECT p.BusinessEntityID,p.PersonType,p.FirstName,p1.PhoneNumber,p1.ModifiedDate
FROM Person.Person p
INNER JOIN Person.PersonPhone p1
ON p.BusinessEntityID = p1.BusinessEntityID 
WHERE p.BusinessEntityID < 10

I need BusinessEntityID, PersonType, FirstName columns from table Person.Person and PhoneNumber, ModifiedDate from table Person.PersonPhone. I've chosen to join those table based on BusinessEntityID column to get my desired result. Note that in the key word 'INNER JOIN', the word INNER is optional. If you don't specify it, Query Optimizer automatically performs INNER JOIN. But as a standard and to make your fellow users understand whay kind of join you performed, specify INNER keyword.

Now see the Result Pane after executing the above query:



You can see in the result pane that the result is acquired by joining the specified columns from both the tables like a single table. 

As I said above, Inner Join ignores NULL values though they exist in both the tables. Let's see a small case of this:

I updated a column value to NULL in both the tables.

UPDATE Person.Person SET BusinessEntityID=NULL WHERE BusinessEntityID=9

UPDATE Person.PersonPhone SET BusinessEntityID=NULL WHERE BusinessEntityID=9

Run the following query:

SELECT p.BusinessEntityID,p.PersonType,p.FirstName,p1.PhoneNumber,p1.ModifiedDate
FROM Person.Person p
INNER JOIN Person.PersonPhone p1
ON p.BusinessEntityID = p1.BusinessEntityID 
WHERE p.BusinessEntityID < 10


See the Result Pane, though the values of column BusinessEntityID in both the tables have same column value NULL, Inner Join didn't display that record. This proves that Inner Join ignores even the matching NULL values.

Outer Join

Outer Joins are of three kinds,

                               1. Left Outer Join.
                               2. Right Outer Join.
                               3. Full Outer Join

Left Outer Join


It joins two tables based on a column but it doesn't work on matching principle like Inner Join. It returns all the rows present in Left Table. If any matching is found between Left and Right Tables it returns those rows. If no matching is found, it returns NULL values.

Let's have a practical:

Run the following query:

USE AdventureWorks2012
GO
SELECT P.Name,P.ProductNumber,pr.ReviewerName,pr.ReviewDate FROM Production.Product p
LEFT OUTER JOIN Production.ProductReview pr
ON p.ProductID = pr.ProductReviewID

Here the Left Table is Production.Product.

Now see the Result Pane


We tried to match based on ProductID column in Production.Product table and ProductReviewID in Production.ProductReview table. There are only four matching records in both the tables. So those four matching records are returned. You can also see that after those four records, all the NULL values are present. This means there are no matching records based on the matched columns. All the records of left table (Product.Product) are displayed.

Right Outer Join


Right Outer Join also follows the same principle as Left Outer Join. The only difference is. if you use Left Outer Join, all the records of Left Table are displayed whereas if you use Right Outer Join, all the records of Right Table are used.

If you want to perform Right Outer Join on the above query, the query is as follows:

USE AdventureWorks2012
GO
SELECT P.Name,P.ProductNumber,pr.ReviewerName,pr.ReviewDate FROM Production.Product p
RIGHT OUTER JOIN Production.ProductReview pr
ON p.ProductID = pr.ProductReviewID

Full Outer Join


Unlike Left Outer and Right Outer Joins, Full Outer Join returns all the rows from all the tables that are joined.

Suppose there are 500 rows in Table1 and 400 rows in Table2. If a Full Outer Join is performed among these tables then you will get 500 rows from table1 and 400 rows from table2. In the Result Pane, you can see that the last 100 rows in table2 are NULL values. In total you get 500 rows in Result.

Use the above query:

USE AdventureWorks2012
GO
SELECT P.Name,P.ProductNumber,pr.ReviewerName,pr.ReviewDate FROM Production.Product p
RIGHT OUTER JOIN Production.ProductReview pr
ON p.ProductID = pr.ProductReviewID

Now see the Result Pane:



Here after 4th row all are NULL values. This is because in the table Production.ProductReview, we have only 4 rows. So we got only 4 rows and remaining all NULLs.

Let's check whether I said right or wrong. Run the following queries as a single batch:

SELECT COUNT(*) AS PRODUCT FROM Production.Product

SELECT COUNT(*) AS PRODUCTREVIEW FROM Production.ProductReview


So, we got 504 records by running the JOIN query.

Cross Join


Cross Join is like a Cartesian Product. It can also be considered as One-To-Many Relationship. Let's have a practical to understand this.

In AdventureWorks2012, we have two tables by names Sales.SalesPerson and Sales.SalesTerritory.

Let's see how many records exist in both the tables by running the following queries:

SELECT COUNT(*) AS SALESPERSON FROM Sales.SalesPerson

SELECT COUNT(*) AS SALESTERRITORY FROM Sales.SalesTerritory


There are 17 and 10 records respectively in both the tables. Now let's perform a cross join among these tables.

SELECT p.BusinessEntityID, t.Name AS Territory
FROM Sales.SalesPerson p
CROSS JOIN Sales.SalesTerritory t

See the Result Pane


Here, you can see that the column BusinessEntityID column values are 274,275....290 are with Territory column value 'Australia'. Later they are with Territory column value 'Canada'. It means a kind of one-to-many relationship is there between Australia and 274,275...290. If you check the number of records obtained in this join, it will be 170 which is the product of number f records from Sales.SalesPerson and Sales.SalesTerritory tables (i.e. 17x10 = 170).

In the above Cross Join query, we didn't use a WHERE clause. If we use it, Cross Join behaves as Inner Join.

Run the following query:

SELECT p.BusinessEntityID, t.Name AS Territory

FROM Sales.SalesPerson p
CROSS JOIN Sales.SalesTerritory t
WHERE p.TerritoryID = t.TerritoryID

See the Result Pane now:


Now let's perform a Inner Join just by replacing the keyword CROSS with INNER.

SELECT p.BusinessEntityID, t.Name AS Territory
FROM Sales.SalesPerson p
INNER JOIN Sales.SalesTerritory t
ON p.TerritoryID = t.TerritoryID

See the Result Pane

                                    

You can see that, Cross Join and Inner Join produced the same result if we use a WHERE clause in Cross Join.

Self Join


If a table is joined to itself then it is called Self Join. This is used if some records of a table should be joined with some other records in the same table. Here we need to use two different aliases for same table in the same query to help Query Processor know which column values should be displayed from which part of the table.

Let's see a practical on this:

SELECT st.BusinessEntityID,sp.SalesQuota, sp.SalesYTD
FROM Sales.SalesPerson AS sp
JOIN Sales.SalesPerson AS st
ON sp.TerritoryID = st.TerritoryID

Here we need to get records matching on the column TerritoryID within the table. You can also consider Self Join as Inner Join within the table. It also ignores NULL values. That's why I said in Inner Join part to specify INNER keyword.

See the Result Pane now:


Comma Join


One more join in SQL Server is Comma Join. It doesn't follow the principles of above joins. It just needs tables to join and get the result. It doesn't even require columns for matching. Just a Comma (,) is needed. Here's an example of Comma Join:

SELECT * FROM table1,table2

Now see the Result Pane:


The columns firstname and lastname belong to table1 while the remaining belong to table2. Result is acquired just by joining those tables but not matching any column. That's about Comma Join.


I just tried to explain my best about joins with practical examples. Please follow the official documentation of Microsoft for the fundamentals of Joins.