Hi Folks, I’m going to start a new series of three articles about
Window functions, starting off by explaining what they are, and how they
are used to provide details of an aggregation.
Window functions
Window functions belong to a type of function known as a ‘set function’, which means a function that applies to a set of rows. The word ‘window’ is used to refer to the set of rows that the function works on.Windowing functions were added to the standard SQL:2003 that is managed by the ISO and it was specified in more detail in SQL:2008 For some time, other DBMSs such as Oracle, Sybase and DB2 have had support for window functions. Even the open source RDBMS PostgreSQL has a full implementation. SQL Server has had only a partial implementation up to now, but it is coming in SQL 2012.
One of the most important benefits of window functions is that we can access the detail of the rows from an aggregation. To see an example, let’s first suppose we have this table and data:
USE tempdb
GOIF OBJECT_ID('TestAggregation') IS NOT NULL
DROP TABLE TestAggregation
GOCREATE TABLE TestAggregation (ID INT, Value Numeric(18,2))GO
INSERT INTO TestAggregation (ID, Value) VALUES(1, 50.3), (1, 123.3), (1, 132.9),
(2, 50.3), (2, 123.3), (2, 132.9), (2, 88.9),
(3, 50.3), (3, 123.3);GO
This is what the data looks like:GOIF OBJECT_ID('TestAggregation') IS NOT NULL
DROP TABLE TestAggregation
GOCREATE TABLE TestAggregation (ID INT, Value Numeric(18,2))GO
INSERT INTO TestAggregation (ID, Value) VALUES(1, 50.3), (1, 123.3), (1, 132.9),
(2, 50.3), (2, 123.3), (2, 132.9), (2, 88.9),
(3, 50.3), (3, 123.3);GO
SELECT * FROM TestAggregation
SELECT ID, SUM(Value)
FROM TestAggregation
GROUP BY ID;
FROM TestAggregation
GROUP BY ID;
Let’s suppose I need to write a query to return the total value of sales, the average value of sales and the quantity of sales for each ID, and still return the actual values of the rows, then we might think that we could use something like this to return this data:
SELECT ID,
Value,
SUM(Value) AS "Sum"
AVG(Value) AS "Avg"
COUNT(Value) AS "Quantity" FROM TestAggregation
GROUP BY ID;
Msg 8120, Level 16, State 1, Line 2Value,
SUM(Value) AS "Sum"
AVG(Value) AS "Avg"
COUNT(Value) AS "Quantity" FROM TestAggregation
GROUP BY ID;
Column 'TestAggregation.Value' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
Unfortunately it is against the way that aggregations work. If you group by something, then you lose access to the details.
A very commonly used alternative is to write every aggregation into a subquery, and then correlate with the main query using a join, something like:
SELECT TestAggregation.ID,
TestAggregation.Value,
TabSum."Sum"
TabAvg."Avg"
TabCount."Quantity"
FROM TestAggregation
INNER JOIN (SELECT ID, SUM(Value) AS "Sum"
FROM TestAggregation
GROUP BY ID) AS TabSum
ON TabSum.ID = TestAggregation.ID
INNER JOIN (SELECT ID, AVG(Value) AS "Avg"
FROM TestAggregation
GROUP BY ID) AS TabAvg
ON TabAvg.ID = TestAggregation.ID
INNER JOIN (SELECT ID, COUNT(Value) AS "Quantity"
FROM TestAggregation
GROUP BY ID) AS TabCount
ON TabCount.ID = TestAggregation.ID
TestAggregation.Value,
TabSum."Sum"
TabAvg."Avg"
TabCount."Quantity"
FROM TestAggregation
INNER JOIN (SELECT ID, SUM(Value) AS "Sum"
FROM TestAggregation
GROUP BY ID) AS TabSum
ON TabSum.ID = TestAggregation.ID
INNER JOIN (SELECT ID, AVG(Value) AS "Avg"
FROM TestAggregation
GROUP BY ID) AS TabAvg
ON TabAvg.ID = TestAggregation.ID
INNER JOIN (SELECT ID, COUNT(Value) AS "Quantity"
FROM TestAggregation
GROUP BY ID) AS TabCount
ON TabCount.ID = TestAggregation.ID
SELECT
TestAggregation.ID,
TestAggregation.Value,
AggregatedValues.[Sum],
AggregatedValues.[Avg],
AggregatedValues.QuantityFROM TestAggregationINNER JOIN
(
SELECT ID,
SUM(Value) AS "Sum"
AVG(Value) AS "Avg"
COUNT(Value) AS "Quantity" FROM TestAggregation
GROUP BY ID) AggregatedValuesON AggregatedValues.ID=TestAggregation.ID
A very elegant alternative is to use the clause OVER() implemented in
the aggregation functions, it allow-us to access the details of the rows
that have been aggregated.TestAggregation.ID,
TestAggregation.Value,
AggregatedValues.[Sum],
AggregatedValues.[Avg],
AggregatedValues.QuantityFROM TestAggregationINNER JOIN
(
SELECT ID,
SUM(Value) AS "Sum"
AVG(Value) AS "Avg"
COUNT(Value) AS "Quantity" FROM TestAggregation
GROUP BY ID) AggregatedValuesON AggregatedValues.ID=TestAggregation.ID
For instance we could try to get the result we want with this …
SELECT ID,
Value,
SUM(Value) OVER() AS "Sum"
AVG(Value) OVER() AS "Avg"
COUNT(Value) OVER() AS "Quantity" FROM TestAggregation
Value,
SUM(Value) OVER() AS "Sum"
AVG(Value) OVER() AS "Avg"
COUNT(Value) OVER() AS "Quantity" FROM TestAggregation
SELECT ID,
Value,
SUM(Value) OVER(PARTITION BY ID) AS "Sum"
AVG(Value) OVER(PARTITION BY ID) AS "Avg"
COUNT(Value) OVER(PARTITION BY ID) AS "Quantity"
FROM TestAggregation
Value,
SUM(Value) OVER(PARTITION BY ID) AS "Sum"
AVG(Value) OVER(PARTITION BY ID) AS "Avg"
COUNT(Value) OVER(PARTITION BY ID) AS "Quantity"
FROM TestAggregation
In the following picture we can see that even the results are grouped by ID we can see access the details of the aggregated partitions through the clause OVER and PARTITION BY.
Set based thinking
It can get tedious to hear about this set base thing, but in fact it’s not so easy to think set based. Sometimes it’s very hard to find a set based solution to a query we have. Window Functions give us more set-based solutions to awkward problems.The main point of windowing functions it’s that they were created to work with a set. SQL Server never was good on processing queries row by row, that’s why you always hearing that ‘cursors are evil’, and are ‘not good for performance’, ‘you should avoid them’ and so on. SQL Server was not built to work row by row.
Let’s illustrate this. SQL Server takes an average of 50 seconds to run a loop of 1 hundred million times against 100 milliseconds of a Win32 app.
DECLARE @i INT = 0, @Time Time(3) = GETDATE()WHILE @i < 100000000BEGIN
SET @i += 1;END
SELECT CONVERT(Time(3), GETDATE() - @Time) AS "Total Time to run the Loop"
SET @i += 1;END
SELECT CONVERT(Time(3), GETDATE() - @Time) AS "Total Time to run the Loop"
PROCEDURE TForm1.Button1Click(Sender: TObject);Var
i : Integer;
Tempo : TDateTime;BEGIN
i := 0;
Tempo := Now();
WHILE i < 100000000 do
BEGIN
inc(i);
END;
i : Integer;
Tempo : TDateTime;BEGIN
i := 0;
Tempo := Now();
WHILE i < 100000000 do
BEGIN
inc(i);
END;
I once was in London doing training with my colleague Itzik Ben-Gan when I remember he said: “There is no row-by-row code that you cannot run an equivalent set-based version, the problem is that you didn’t figured out how to do it”. Yes it’s a heavy phrase but, who could tell Itzik that this is not true? Not I!
Window functions on SQL Server 2005 and 2008
Since SQL Server 2005 we have had support for some window functions, they are: ROW_NUMBER, RANK, DENSE_RANK and NTILE.In this first article we’ll review how these functions works, and how they can help-us to write better and efficient set-based codes.
Test database
To test the functions we’ll use a table called Tab1. The code to create the table is the following:
USE TempDB
GOIF OBJECT_ID('Tab1') IS NOT NULL
DROP TABLE Tab1
GOCREATE TABLE Tab1 (Col1 INT)GO
INSERT INTO Tab1 VALUES(5), (5), (3) , (1)GO
GOIF OBJECT_ID('Tab1') IS NOT NULL
DROP TABLE Tab1
GOCREATE TABLE Tab1 (Col1 INT)GO
INSERT INTO Tab1 VALUES(5), (5), (3) , (1)GO
Row_Number()
The ROW_NUMBER function is used to generate a sequence of numbers based in a set in a specific order, in easy words, it returns the sequence number of each row inside a set in the order that you specify.For instance:
-- RowNumberSELECT Col1,
ROW_NUMBER() OVER(ORDER BY Col1 DESC) AS "ROW_NUMBER()"
FROM Tab1
ROW_NUMBER() OVER(ORDER BY Col1 DESC) AS "ROW_NUMBER()"
FROM Tab1
Rank() & Dense_Rank()
Return the position in a ranking for each row inside a partition. The ranking is calculated by 1 plus the number of previews rows.It’s important to mention that the function RANK returns the result with a GAP after a tie, whereas the function DENSE_RANK doesn’t. To understand this better, let’s see some samples.
-- RankSELECT Col1,
RANK() OVER(ORDER BY Col1 DESC) AS "RANK()" FROM Tab1
GO
-- Dense_RankSELECT Col1,
DENSE_RANK() OVER(ORDER BY Col1 DESC) AS "DENSE_RANK" FROM Tab1
RANK() OVER(ORDER BY Col1 DESC) AS "RANK()" FROM Tab1
GO
-- Dense_RankSELECT Col1,
DENSE_RANK() OVER(ORDER BY Col1 DESC) AS "DENSE_RANK" FROM Tab1
NTILE()
The NTILE function is used for calculating summary statistics. It distributes the rows within an ordered partition into a specified number of “buckets” or groups. The groups are numbered, starting at one. For each row, NTILE returns the number of the group to which the row belongs. It makes it easy to calculate n-tile distributions such as percentiles.Let’s see a sample:
-- NTILESELECT Col1,
NTILE(3) OVER(ORDER BY Col1 DESC) AS "NTILE(3)"
FROM Tab1
NTILE(3) OVER(ORDER BY Col1 DESC) AS "NTILE(3)"
FROM Tab1
-- NTILESELECT Col1,
NTILE(2) OVER(ORDER BY Col1 DESC) AS "NTILE(2)"
FROM Tab1
NTILE(2) OVER(ORDER BY Col1 DESC) AS "NTILE(2)"
FROM Tab1
The Power of Window functions
Now let’s see some examples where window functions could be used to return some complex queries.Example 1
Let’s suppose I need to write a query to return those employees that receive more than his department’s average.Let’s start by creating two sample tables called Departamentos (departments in Portuguese) and Funcionarios (employees in Portuguese):
IF OBJECT_ID('Departamentos') IS NOT NULL
DROP TABLE Departamentos
GOCREATE TABLE Departamentos (ID INT IDENTITY(1,1) PRIMARY KEY,
Nome_Dep VARCHAR(200))GO
INSERT INTO Departamentos(Nome_Dep) VALUES('Vendas'), ('TI'), ('Recursos Humanos')GO
IF OBJECT_ID('Funcionarios') IS NOT NULL
DROP TABLE Funcionarios
GOCREATE TABLE Funcionarios (ID INT IDENTITY(1,1) PRIMARY KEY,
ID_Dep INT,
Nome VARCHAR(200),
Salario Numeric(18,2))GO
INSERT INTO Funcionarios (ID_Dep, Nome, Salario) VALUES(1, 'Fabiano', 2000), (1, 'Amorim', 2500), (1, 'Diego', 9000),
(2, 'Felipe', 2000), (2, 'Ferreira', 2500), (2, 'Nogare', 11999),
(3, 'Laerte', 5000), (3, 'Luciano', 23500), (3, 'Zavaschi', 13999)GO
This is what the data looks like:DROP TABLE Departamentos
GOCREATE TABLE Departamentos (ID INT IDENTITY(1,1) PRIMARY KEY,
Nome_Dep VARCHAR(200))GO
INSERT INTO Departamentos(Nome_Dep) VALUES('Vendas'), ('TI'), ('Recursos Humanos')GO
IF OBJECT_ID('Funcionarios') IS NOT NULL
DROP TABLE Funcionarios
GOCREATE TABLE Funcionarios (ID INT IDENTITY(1,1) PRIMARY KEY,
ID_Dep INT,
Nome VARCHAR(200),
Salario Numeric(18,2))GO
INSERT INTO Funcionarios (ID_Dep, Nome, Salario) VALUES(1, 'Fabiano', 2000), (1, 'Amorim', 2500), (1, 'Diego', 9000),
(2, 'Felipe', 2000), (2, 'Ferreira', 2500), (2, 'Nogare', 11999),
(3, 'Laerte', 5000), (3, 'Luciano', 23500), (3, 'Zavaschi', 13999)GO
SELECT Departamentos.Nome_Dep,
Funcionarios.Nome AS Funcionario,
Funcionarios.Salario,
AVG(Funcionarios.Salario) OVER(PARTITION BY Departamentos.Nome_Dep) "Média por Departamento"
Salario - AVG(Funcionarios.Salario) OVER(PARTITION BY Departamentos.Nome_Dep) "Diferença de Salário" FROM Funcionarios
INNER JOIN Departamentos
ON Funcionarios.ID_Dep = Departamentos.ID
ORDER BY 5 DESC
Funcionarios.Nome AS Funcionario,
Funcionarios.Salario,
AVG(Funcionarios.Salario) OVER(PARTITION BY Departamentos.Nome_Dep) "Média por Departamento"
Salario - AVG(Funcionarios.Salario) OVER(PARTITION BY Departamentos.Nome_Dep) "Diferença de Salário" FROM Funcionarios
INNER JOIN Departamentos
ON Funcionarios.ID_Dep = Departamentos.ID
ORDER BY 5 DESC
Using the OVER clause, I was able to partition the data per each department, and then I could access the average with the details of the salary.
You can avoid using window functions by using a query like this.
SELECT Departamentos.Nome_Dep,
Funcionarios.Nome AS Funcionario,
Funcionarios.Salario,
[Média por Departamento],
Salario - [Média por Departamento] AS [Diferença de Salário]
FROM Funcionarios
INNER JOIN Departamentos
ON Funcionarios.ID_Dep = Departamentos.ID
INNER JOIN (SELECT ID_Dep, AVG(Funcionarios.Salario) AS [Média por Departamento] FROM FuncionariosGROUP BY ID_Dep)[Média]ON [Média].ID_Dep=Funcionarios.ID_DepORDER BY [Diferença de Salário] DESC
Funcionarios.Nome AS Funcionario,
Funcionarios.Salario,
[Média por Departamento],
Salario - [Média por Departamento] AS [Diferença de Salário]
FROM Funcionarios
INNER JOIN Departamentos
ON Funcionarios.ID_Dep = Departamentos.ID
INNER JOIN (SELECT ID_Dep, AVG(Funcionarios.Salario) AS [Média por Departamento] FROM FuncionariosGROUP BY ID_Dep)[Média]ON [Média].ID_Dep=Funcionarios.ID_DepORDER BY [Diferença de Salário] DESC
Example 2
Another very common problem is the “running totals”. To show this sample I’ll use a very simple data. Let’s started by creating a table called tblLancamentos where I’ve a column with a date and a column with a numeric value.
IF OBJECT_ID('tblLancamentos') IS NOT NULL
DROP TABLE tblLancamentos
GO
-- Tabela de Lançamentos para exemplificar o SubtotalCREATE TABLE tblLancamentos (DataLancamento Date,
ValorLancamento FLOAT)GO
-- Insere os registrosINSERT INTO tblLancamentos VALUES ('20080623',100)INSERT INTO tblLancamentos VALUES ('20080624',-250)INSERT INTO tblLancamentos VALUES ('20080625',380)INSERT INTO tblLancamentos VALUES ('20080626',200)INSERT INTO tblLancamentos VALUES ('20080627',-300)GO
One of the alternatives to return the running total of the column
ValorLancamento ordered by DataLancamento is to write a query like the
following:DROP TABLE tblLancamentos
GO
-- Tabela de Lançamentos para exemplificar o SubtotalCREATE TABLE tblLancamentos (DataLancamento Date,
ValorLancamento FLOAT)GO
-- Insere os registrosINSERT INTO tblLancamentos VALUES ('20080623',100)INSERT INTO tblLancamentos VALUES ('20080624',-250)INSERT INTO tblLancamentos VALUES ('20080625',380)INSERT INTO tblLancamentos VALUES ('20080626',200)INSERT INTO tblLancamentos VALUES ('20080627',-300)GO
SELECT DataLancamento,
ValorLancamento,
(SELECT SUM(ValorLancamento)
FROM tblLancamentos
WHERE DataLancamento <= QE.DataLancamento) AS SaldoFROM tblLancamentos AS QE
ValorLancamento,
(SELECT SUM(ValorLancamento)
FROM tblLancamentos
WHERE DataLancamento <= QE.DataLancamento) AS SaldoFROM tblLancamentos AS QE
SELECT DataLancamento,
ValorLancamento,
SUM(ValorLancamento) OVER(ORDER BY DataLancamento) AS Saldo
FROM tblLancamentos AS QE
ValorLancamento,
SUM(ValorLancamento) OVER(ORDER BY DataLancamento) AS Saldo
FROM tblLancamentos AS QE
Conclusion
In the next article I’ll show in more details the limitations of the window functions on SQL Server 2008 R2, and compare the performance of a running aggregation on SQL Server 2012 (Denali).In the final and third article I’ll show in details how the window frame works. Also I’ll show how the windowing functions LEAD, LAG, FIRST_VALUE, LAST_VALUE, PERCENT_RANK and CUME_DIST works in SQL Server.
That’s all folks, see you soon with the second part of this article.
No comments:
Post a Comment