As part of our continuing series about window functions
on SQL Server, we’ll review, in this article, which improvements there
are in the next version of SQL Server 2012, and explore how a running
aggregation can perform much better by combining the ORDER BY clause
with the OVER clause.
Let’s start with a diagram that shows which of the functions of the
SQL ANSI/ISO Standard have been introduced with each of the SQL Server
versions:Even though we have a very good support for these window functions, I would like to see full support. For instance, as we can see in the pictures, the clause OVER has been supported since SQL Server 2005 but with some limitations; and it was improved on SQL Server 2012, but it still doesn’t yet have full support as you can see in the list of limitations below.
I‘ve used the words “getting close” in the sequence above because we still have some unsupported standard functionality. Here is a list of aspects of the standard that are still missing in SQL Server 2012:
- Function FIRST, returns the first value of an ordered group
- MAX(City) KEEP (DENSE_RANK FIRST ORDER BY SUM(Value))*
- Function LAST, returns the last value of an ordered group MIN(City) KEEP (DENSE_RANK LAST ORDER BY SUM(Value))*
- Clause NULLs FIRST and NULLs LAST
- OVER(ORDER BY Coluna1 NULLs FIRST)
- Support on interval frame (Year, Month, Day, Hour, Minute, Second)
- Window Clause
- SELECT LAG(Col1) OVER MyAliasWin AS Col1 FROM Tab1
- WINDOW MyAliasWin AS (ORDER BY Col1 ROWS 2 PRECEDING)
* Not standard (ISO/ANSI) clauses
New functions on SQL Server 2012
I’ll now briefly introduce some important window functions supported on SQL Server 2012. After that I’ll explain how the window frame works so that you can understand better why and when to use it in the clause OVER.I’ll not cover all the possible syntax options of these functions because these are well covered on Books Online here and here.
If you want to test these functions, you will need the copy of SQL Server 2012 RC0, which you can download here.
Here you can see the script to create a table to test the functions:
USE TempDB
GOIF OBJECT_ID('Tab1') IS NOT NULL DROP TABLE Tab1
GOCREATE TABLE Tab1 (Col1 INT) GOINSERT INTO Tab1 VALUES(5), (5), (3) , (1) GO
GOIF OBJECT_ID('Tab1') IS NOT NULL DROP TABLE Tab1
GOCREATE TABLE Tab1 (Col1 INT) GOINSERT INTO Tab1 VALUES(5), (5), (3) , (1) GO
LEAD()
The LEAD function is used to read a value from the next row, or the row below the actual row. If the next row doesn’t exist, then NULL is returned.
-- LEADSELECT Col1,
LEAD(Col1) OVER(ORDER BY Col1) AS "LEAD()" FROM Tab1
LEAD(Col1) OVER(ORDER BY Col1) AS "LEAD()" FROM Tab1
By default, the next row is returned; but you can change this behavior by specifying a parameter to read N following rows, for instance:
-- LEADSELECT Col1,
LEAD(Col1, 2) OVER(ORDER BY Col1) AS "LEAD()" FROM Tab1
LEAD(Col1, 2) OVER(ORDER BY Col1) AS "LEAD()" FROM Tab1
LAG()
The LAG() function is similar to the LEAD() function, but it returns the row before the actual row rather than return the next row. For instance:
-- LEADSELECT Col1,
LEAD(Col1, 2) OVER(ORDER BY Col1) AS "LEAD()" FROM Tab1
LEAD(Col1, 2) OVER(ORDER BY Col1) AS "LEAD()" FROM Tab1
You may be wondering whether I could do the same thing by using the function LEAD() with a negative parameter (OffSet): In other words, instead of read 1 following value I could read -1 following value.
Let’s see a sample:
SELECT Col1,
LEAD(Col1, -1) OVER(ORDER BY Col1) AS "LEAD() as LAG()" FROM Tab1
LEAD(Col1, -1) OVER(ORDER BY Col1) AS "LEAD() as LAG()" FROM Tab1
Msg 8730, Level 16, State 1, Line 1
Offset parameter for Lag and Lead functions cannot be a negative value.
As we can see, we cannot use negative values in this parameter to the function.Offset parameter for Lag and Lead functions cannot be a negative value.
FIRST_VALUE()
As the name says, FIRST_VALUE() returns the first value in a partition window. For instance:
-- FIRST_VALUESELECT Col1,
FIRST_VALUE(Col1) OVER(ORDER BY Col1) AS "FIRST_VALUE()" FROM Tab1
FIRST_VALUE(Col1) OVER(ORDER BY Col1) AS "FIRST_VALUE()" FROM Tab1
LAST_VALUE()
Also, as the name suggests, it returns the last value in a partition window: For instance:
-- LAST_VALUESELECT Col1,
LAST_VALUE(Col1) OVER(ORDER BY Col1 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS "LAST_VALUE()" FROM Tab1
LAST_VALUE(Col1) OVER(ORDER BY Col1 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS "LAST_VALUE()" FROM Tab1
PERCENT_RANK()
In my first article about window functions, I’ve demonstrated the behavior of the RANK() function. The PERCENT_RANK() function is very similar to the RANK() function, but the values that are returned range between 0 and 1.
SELECT Col1,
PERCENT_RANK() OVER(ORDER BY Col1) AS "PERCENT_RANK()"
RANK() OVER(ORDER BY Col1) AS "RANK()"
(SELECT COUNT(*) FROM Tab1) "COUNT" FROM Tab1
PERCENT_RANK() OVER(ORDER BY Col1) AS "PERCENT_RANK()"
RANK() OVER(ORDER BY Col1) AS "RANK()"
(SELECT COUNT(*) FROM Tab1) "COUNT" FROM Tab1
- (RANK() – 1) / (NumberOfRows - 1)
CUME_DIST()
The function CUME_DIST() is also used to calculate a rank from 0 to 1 based on the position of the row in the rank, for instance:
-- CUME_DIST()SELECT Col1,
CUME_DIST() OVER(ORDER BY Col1) AS "CUME_DIST()" FROM Tab1
CUME_DIST() OVER(ORDER BY Col1) AS "CUME_DIST()" FROM Tab1
- COUNT(*) OVER (ORDER BY Col1) / COUNT(*) OVER ()
Window Frame
To start investigating, and to understand the concept of a window frame, let’s examine the syntax of a window function and the window frame:
OVER ( [ <PARTITION BY clause> ] [ <ORDER BY clause> ] [ <ROW or RANGE clause> ] ) |
In the window frame, you can specify the subset of rows in which the windowing function will work. You can specify the top and bottom boundary condition of the sliding window using the window specification clause. The syntax for the window specification clause is:
[ROWS | RANGE] BETWEEN <Start expr> AND <End expr>
Where:
<Start expr> is one of:
- UNBOUNDED PRECEDING: The window starts in the first row of the partition
- CURRENT ROW: The window starts in the current row
- <unsigned integer literal> PRECEDING or FOLLOWING
- UNBOUNDED FOLLOWING: The window ends in the last row of the partition
- CURRENT ROW: The window ends in the current row
- <unsigned integer literal> PRECEDING or FOLLOWING
To see all this theory in practice, let’s consider the following query on the table Orders from the Northwind database:
USE NorthWind
GO
SELECT OrderID, CustomerIDFROM OrdersWHERE CustomerID IN (1,2)
GO
SELECT OrderID, CustomerIDFROM OrdersWHERE CustomerID IN (1,2)
Even then, this is not a 100% correct representation of a window, but it’s easier to understand it when we look at the window as it is in the picture; where the windows correspond to each distinct CustomerID.
A more correct picture of a window might be the following:
I can understand if this is not quite so straightforward to understand, so let’s, instead, see some examples: Let’s suppose that I want to return the first OrderID of each window above. I could write something like the following query:
SELECT OrderID, CustomerID,
FIRST_VALUE(OrderID) OVER(PARTITION BY CustomerID ORDER BY OrderID) AS FirstOrderIDFROM OrdersWHERE CustomerID IN (1,2)
FIRST_VALUE(OrderID) OVER(PARTITION BY CustomerID ORDER BY OrderID) AS FirstOrderIDFROM OrdersWHERE CustomerID IN (1,2)
SELECT OrderID, CustomerID,
FIRST_VALUE(OrderID) OVER(PARTITION BY CustomerID
ORDER BY OrderID
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW) AS FirstOrderIDFROM OrdersWHERE CustomerID IN (1,2)
To try to make things clearer, here’s an illustration of the way that
the top and bottom boundary works in the query above (using the
function FIRST_VALUE). This would work something like this:FIRST_VALUE(OrderID) OVER(PARTITION BY CustomerID
ORDER BY OrderID
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW) AS FirstOrderIDFROM OrdersWHERE CustomerID IN (1,2)
A good example of how the window frame works is the function LAST_VALUE, because we need to change the default frame in order to really return the last value of a partition.
This function can be a little confusing at first, but as soon we understood the window frame we can see that the action that the function performs by default is correct. It’s very common to test the last_value function and think that this is not working properly, and some people even demand a “fix” for the function because they reckon that it is not working correctly.
Let’s see what the function returns for the same sample in the Orders table:
SELECT OrderID, CustomerID,
LAST_VALUE(OrderID) OVER(PARTITION BY CustomerID
ORDER BY OrderID) AS FirstOrderIDFROM OrdersWHERE CustomerID IN (1,2)
LAST_VALUE(OrderID) OVER(PARTITION BY CustomerID
ORDER BY OrderID) AS FirstOrderIDFROM OrdersWHERE CustomerID IN (1,2)
Let’s see the same illustration we saw earlier with the first_value function but now using the LAST_VALUE concept:
Now let’s see the difference when if I specify the frame as unbounded, in the following code:
SELECT OrderID, CustomerID,
LAST_VALUE(OrderID) OVER(PARTITION BY CustomerID
ORDER BY OrderID
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) AS FirstOrderIDFROM OrdersWHERE CustomerID IN (1,2)
LAST_VALUE(OrderID) OVER(PARTITION BY CustomerID
ORDER BY OrderID
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) AS FirstOrderIDFROM OrdersWHERE CustomerID IN (1,2)
RANGE versus ROW
Another confusing thing about the window frame is the RANGE versus ROW. The question is, what is the difference between them?The ROWS clause limits the rows within a partition by specifying a fixed number of rows preceding or following the current row. Alternatively, the RANGE clause logically limits the rows within a partition by specifying a range of values with respect to the value in the current row. Preceding and following rows are defined based on the ordering specified by the ORDER BY clause. The window frame “RANGE … CURRENT ROW …” includes all rows that have the same values in the ORDER BY expression as the current row.
Again, let’s see this in practice, so as to try to understand it better.
Let’s suppose we have the following query:
USE tempdb
GOIF OBJECT_ID('tempdb.dbo.#TMP') IS NOT NULL
DROP TABLE #TMPGOCREATE TABLE #TMP (ID INT, Col1 CHAR(1), Col2 INT)GO
INSERT INTO #TMP VALUES(1,'A', 5), (2, 'A', 5), (3, 'B', 5), (4, 'C', 5), (5, 'D', 5)GO--SELECT * FROM #TMP
SELECT *,
SUM(Col2) OVER(ORDER BY Col1 RANGE UNBOUNDED PRECEDING) "Range"
SUM(Col2) OVER(ORDER BY Col1 ROWS UNBOUNDED PRECEDING) "Rows" FROM #TMP
GOIF OBJECT_ID('tempdb.dbo.#TMP') IS NOT NULL
DROP TABLE #TMPGOCREATE TABLE #TMP (ID INT, Col1 CHAR(1), Col2 INT)GO
INSERT INTO #TMP VALUES(1,'A', 5), (2, 'A', 5), (3, 'B', 5), (4, 'C', 5), (5, 'D', 5)GO--SELECT * FROM #TMP
SELECT *,
SUM(Col2) OVER(ORDER BY Col1 RANGE UNBOUNDED PRECEDING) "Range"
SUM(Col2) OVER(ORDER BY Col1 ROWS UNBOUNDED PRECEDING) "Rows" FROM #TMP
There is another very important difference between the ROW and RANGE clause: the FRAME clause always uses an on-disk window to process the window spool operator. But this is a subject for the next and final article.
Conclusion
Window functions are very powerful, and can help us to think in set based terms, and get more elegant solutions for many problems we have on building efficient T-SQL code for which we’ve formerly required cursors or non-standard SQL.In the next and final article about window functions I’ll answer the question you may are doing to yourself. What about performance and scalability?
I’ll be describing the window_spool_ondisk_warning eXtended Event, to tell the difference between a on disk window spool versus a on memory window spool. Also we’ll see some performance tests between the actual solutions versus windowing functions. The question is this: Is it really worth changing all my queries to start using windowing functions?
No comments:
Post a Comment