Saturday, November 30, 2013

Benefits and Limitations of Common Table Expression (CTE)

In my previous post, I've written about the basics of CTE. But every feature has its own benefits and limitations. I want to write about them now.

Benefits of CTE


  • CTE can be termed as 'Temporary View' used as a good alternative for a View in some cases.
    • The main advantage over a view is usage of memory. As CTE's scope is limited only to its batch, the memory allocated for it is flushed as soon as its batch is crossed. But once a view is created, it is stored until user drops it. If the view is not used after creation then it's a mere waste of memory.
    • CPU cost for CTE execution is lesser when compared to that of View.
    • Like View, CTE doesn't store any metadata of its definition and provides better readability.
  • A CTE can be referred for multiple times in a query.
  • As the scope is limited to the batch, multiple CTEs can have the same name which a view cannot have.
  • It can be made recursive.

Limitations of CTE

Though using CTE is advantageous, it does have some limitations to be kept in mind,
  • We knew that it is a substitute for a view but a CTE cannot be nested while Views can be nested.
  • View once declared can be used for any number of times but CTE cannot be used. It should be declared every time you want to use it. For this scenario, CTE is not recommended to use as it is a tiring job for user to declare the batches again and again.
  • Between the anchor members there should be operators like UNION, UNION ALL or EXCEPT etc.
  • In Recursive CTEs, you can define many Anchor Members and Recursive Members but all the Anchor Members must be defined before the first Recursive Member. You cannot define an Anchor Member between two Recursive Member.
  • The number of columns, the data types used in Anchor and Recursive Members should be same.
  • In Recursive Member, aggregate functions like TOP, operator like DISTINCT, clause like HAVING and GROUP BY, Sub-queries, joins like Left Outer or Right Outer or Full Outer are not allowed. Regarding Joins, only Inner Join is allowed in Recursive Member.
  • Recursion Limit is 32767, crossing which results in the crash of server due to infinite loop.

Regarding this topic, Microsoft provided a very good documentation. 

Common Table Expression (CTE) in SQL Server

According to Microsoft's Official Document, Common Table Expression is  a temporary result set that is defined within the execution scope of a single SELECT, INSERT, UPDATE, DELETE, or CREATE VIEW statement.

But as far as I observed, it acts like a Temporary View and can be used as a substitute for view in many cases. The scope of CTE is present until the execution of its same batch. You cannot access the CTE outside its batch.

Let me be more clear with an example.

;WITH CTE
AS
(
     SELECT ID1,ID2 FROM myTable
)
SELECT * FROM CTE

For understanding purpose, I've taken a small table containing only two columns. The above CTE should be executed as a single batch. If you execute leaving the SQL statement below or if you want to try to access it like a base table by using SELECT statement, SQL Server raises an exception. 

A CTE should always be started with a WITH keyword followed by a custom name, followed by AS keyword inside which you can write your own query. Please do keep in mind that a CTE should always be followed by a SQL statement like SELECT, INSERT, UPDATE or DELETE.
The semicolon (;) before WITH keyword is a delimiter. It is used to separate the CTE from other queries in the batch. If you want to execute only the CTE as a batch then no need of keeping it. If you want to execute the CTE along with other SQL statements then it should be specified, otherwise SQL Server raises an exception.

The main advantage of using CTE is, it reduces the overhead of using sub queries and joins if you want to use them for many times in a Query Window. You can write multiple queries inside CTE but between two queries operator like UNION,INTERSECT,EXCEPT or UNION ALL must be specified based upon the requirement.

One more advantage is, CTE can be made recursive. Let's have an example of recursive CTE


WITH CTE
AS
(
      SELECT X = CONVERT(VARCHAR(1000),'HELLO')
      UNION ALL
      SELECT X = CONVERT(VARCHAR(1000),'GOODBYE')
      UNION ALL
      SELECT CONVERT(VARCHAR(1000),X + 'A') FROM CTE
      WHERE LEN(X) < 10
)
SELECT X FROM CTE
ORDER BY LEN(X), X

I got the above example from a blog but I wanna have a detailed explanation of the recursion which I came to know by implementing. In the above CTE, there is a recursion applied in the last statement i.e., the function CONVERT(VARCHAR(1000),X + 'A') will be implemented by taking the values obtained from its above two lines. This will be implemented until the length of X is less than 10. It means the CTE is recursively called for 10 times.

SQL Server provides a default recursion limit of 100 times. If you want to have your own recursion limit, you need to specify it manually by using an option MAXRECURSION. Let's test it. Suppose I change the above code as follows. Please observe carefully:

WITH CTE
AS
(
      SELECT X = CONVERT(VARCHAR(1000),'HELLO')
      UNION ALL
      SELECT X = CONVERT(VARCHAR(1000),'GOODBYE')
      UNION ALL
      SELECT CONVERT(VARCHAR(1000),X + 'A') FROM CTE
      WHERE LEN(X) < 200
)
SELECT X FROM CTE
ORDER BY LEN(X), X

If I execute the above code, SQL Server throws the following error:
The statement terminated. The maximum recursion 100 has been exhausted before statement completion.

It means I crossed the limit of recursion by specifying LEN(X) < 200. By the error message, we can know that the automatic recursion limit provided by SQL Server is 100. That's fine. Now I want to execute the above CTE with LEN(X) < 200. Then I will go by specifying the recursion limit manually by adding a line to the above CTE code. It goes as follows:

WITH CTE
AS
(
      SELECT X = CONVERT(VARCHAR(1000),'HELLO') --> ANCHOR MEMBER
      UNION ALL
      SELECT X = CONVERT(VARCHAR(1000),'GOODBYE') –> ANCHOR MEMBER
      UNION ALL
      SELECT CONVERT(VARCHAR(1000),X + 'A') FROM CTE
      WHERE LEN(X) < 200 --> RECURSIVE MEMBER
)
SELECT X FROM CTE
ORDER BY LEN(X), X
OPTION(MAXRECURSION 200)

I won't get any error now as I specified the limit. But note that you cannot specify this limit as you wish. The default limit provided is 100. The manual limit is restricted to 32767. You can specify the limit ranging from 0 to 32767. There is also one more option here. Like in other programming languages, we have an infinite looping even here. If you specify MAXRECURSION as "NOLIMIT" then it results in an infinite loop. Unlike other languages, SQL Server's infinite looping is so dangerous. If it occurs, it results in the crash of your server. So be careful about this point.

There is also a terminology used in the CTE which I specified in the above example. The members of CTE are called ANCHOR MEMBERS. If you use a recursion there then that CTE member is called RECURSIVE MEMBER.

That's all about the basics of Common Table Expression (CTE). I'll write more about it in my coming posts because I don't want to increase the length of this post so that you may feel tired in scrolling as I do.

For the basic concept of CTE, read the official documentation of Microsoft

Monday, November 25, 2013

Codd's Rules

Edgar Frank "Ted" Codd is considered a pioneer of relational model for databases. He proposed 13 rules to say why a DataBase Management System (DBMS) is called a Relational and what it has to give in order to be called as Relational.

These are the fundamental rules to be followed by every database. However some databases do not follow some of the rules Codd specified.

The following are the rules:

Rule 0 – Foundation Rule

It states that the system must qualify as a RELATIONAL as a database as well as MANAGEMENT SYSTEM.

RELATIONAL specifies that data is represented in the form of tuples (records) and grouped into relations.
A database is said to be RELATIONAL if the data is stored in the form of tables and there can exist a relationship among the tables.

MANAGEMENT SYSTEM specifies that the database is maintained well through some procedures and processes so that the functionality of the database is good.

Rule 1 – The Information Rule

The information in the RDBMS is always presented in the form of rows and columns.

The data present in those rows and columns is called VALUE.

Rule 2 – The Guaranteed Access Rule

All the data should be accessible.

There should be a PRIMARY KEY on one of the columns of the table.

Data can be accessed from the database by specifying the name of the table. When a particular row is to be accessed then the name of the table and the name of the column on which the PRIMARY KEY is specified should be mentioned.

Rule 3 – Systematic Treatment of NULL Values

DBMS should allow the fields of the table remain empty or NULL. The moto behind this is there will be an indication that some data is missing.

These NULLs are different from the regular values and are independent of data types. NULL cannot be considered as ZERO (0) even.

NULL refers neither positive nor negative. It just denotes an empty space.

Rule 4 – Active Online Catalog based on the Relational Model

Here the Catalog means Dictionary. There should be a catalog which contains the definitions of database objects like Tables, Views, Indexes, Statistics etc.

"Active Online" specifies that there should be access to the users to know the catalog information.

"Based on the Relational Model" User should be able to know this information through the query language which he/she uses to get the information from the regular database objects.

There is a uniform means specified by SQL standards to know this metadata called INFORMATION_SCHEMA.

Rule 5 – The Comprehensive Data Sub-Language Rule

There should be a language communicating through which the data can be accessed by the Users.

Such language contains some syntax and semantics defined. This resulted in the development of Structured Query Language (SQL).

Every database user should know this language to access the data. Otherwise if the data is accessible if user uses any kind of language then there won't be a proper organization in the database.

This SQL is used in the database interface as well as within the application programs.

This language supports Data Definition, Data Manipulation, Constraints, Indexes etc.

Rule 6 – The View Updating Rule

All the Views that are theoretically updatable must be updated by the system.

This means the content of the view should be modified if the content of the table with which the view is referencing, is modified.

Some databases do not follow this rule. In Microsoft SQL Server, you can see that this rule is followed.

Rule 7 – High-Level Insert, Update and Delete

All the operations like INSERT, UPDATE and DELETE should be supported.

Even the SET operations of SET THEORY must be supported.

This means that data can be retrieved from a relational database in sets constructed of data from multiple rows and/or multiple tables.

A single statement is enough to update multiple number of rows and values so that there is less work for user.

Rule 8 – Physical Data Independence

Any changes in the physical storage of the data should not effect the application performance.

If some supporting tables are modified or moved from one disk to another then a situation shouldn't arise where application should be changed.

Rule 9 – Logical Data Independence

Any changes to the Logical storage like tables should not affect the application structure.

If a table structure is changed, like it is split into two tables then the user shouldn't have a situation that he/she has to get the data from two tables. Though there are two tables, user should get the result in the form of single table obtained by joining the two tables.

Logical Data Independence is tougher to achieve than Physical Data Independence.

Rule 10 – Integrity Independence

Integrity Constraints are used to enforce accuracy and consistency in relational databases.

They must be specified separately from the application program and are stored in the catalog.

If any change has to be made to these constraints then that change should not effect the application.

The database should maintain its own integrity rather than depending on other programs. This makes RDBMS independent of the Front-End.

Rule 11 – Distribution Independence

If a database is distributed across different networks then there shouldn't be any difference in the performance of it.

Even the users shouldn't know that the database is distributed across the network and can be able to work as usual.

Rule 12 – The Non-Subversion Rule

Suppose a user is provided a low level access to the data then it should be ensured that he/she is unable to cross the integrity rule and change the data.

Such security is achieved by using some locking and encryption mechanisms.