3/01/2012

The Road to Professional Database Development: Set-Based Thinking

Under the pseudonym of 'SwePeso', Peter Larsson is famous on SQL forums for the amazing performance he can get from SQL. How does he do it? In the first of a series of articles, Peter explains his secrets.
The single most common barrier to writing efficient database code, for a developer versed in a typical client language (such as C#), is an inability to leave behind the habit of working on one row at a time, in favor of working with a set (a collection of rows).
What is the difference? Let's examine a very simple example of summing up 5 rows in a table. The following code creates a table variable (@Sample) and populates it with the rows of data.
DECLARE @Sample TABLE ( Data INT )
INSERT @Sample( Data )VALUES ( 1 ),
(
2 ),
( -
4 ),
(
5 ),
(
8 )
SELECT DataFROM @Sample
For a .NET developer this pseudo-code would be the way to perform this simple sum:
Dim intS AS Integer = 0, 'Initialise the sum to zero
objR AS Row

For Each objR In coll.Rows
intS += objR.Data
Next

Debug.Print intS
This is the standard approach for a procedural language; we not only tell the compiler what we want (the sum of all objects) but also how we want to approach the objects when calculating the sum. In this example, the coll variable holds a collection with objects of type Row. The For-Next loop iterates through all the objects and, for each object, fetches the current value from the Data property and adds the value to the intS variable. In other words the code operates on one row at a time.
T-SQL is a declarative language which means that the programmer is supposed to tell the database server what is required but let SQL Server decide the best way to finalize the task. However, many programmers force the procedural approach on SQL Server, with code such as the following:
DECLARE curSum CURSOR FOR
SELECT
DataFROM @Sample
DECLARE @s INT ,@Total INT = 0
OPEN curSum
FETCH NEXTFROM curSumINTO @s
WHILE @@FETCH_STATUS = 0 BEGIN
SET
@Total + = @sFETCH NEXTFROM curSumINTO @sEND

CLOSE
curSumDEALLOCATE curSum
SELECT @Total
Even though this code is written in T-SQL, it still operates on one object (row) at any given time. This approach is commonly referred to as Row-by-Agonizing Row (RBAR). If we increase the number of rows tenfold, the time to complete the task would increase by a factor of ten; for a very large table (100 million rows or more) this sum could easily take hours to complete.
So how do you do this in a set-based way? Luckily for us, T-SQL has an aggregate function available to do the job for us.
SELECT SUM ( Data )FROM @Sample
The following table compares the performance of the cursor method, versus the set-based method, for an increasing numbers of rows.
Rows CURSOR SET BASED Performance Factor

DURATION (mS) READS DURATION READS DURATION READS
1k 47 4009 0 7 953
10k 492 40004 3 42 164 953
100k 5189 400389 28 388 179 1039
1000k 51358 4003848 286 3847 180 1041
As you can see, both solutions scales linearly i.e. when we increase the number of rows by a factor of ten, we increase the query duration by roughly the same factor. However, there is one big difference, and that is that the set-based solution is about 180 times faster, and puts only about 1/1000th of the read pressure on the server!
So why is the performance difference so vast? We'll discuss this in full detail in a later article in this series, but in essence, the set-based approach allows SQL Server to return the all the required rows very efficiently, fully exploiting the underlying data storage mechanisms. The internal structure where the data is stored in SQL Server is called a Page and it defines the minimum amount of data read or written to the database, regardless of whether you want all rows in that Page or only one row. So, with a set-based approach, we only need to fetch a page once to get all x rows stored within, whereas with the RBAR approach, we must re-fetch the page x times. We'll also see in a later article how, with appropriate indexes in place, we can speed up this process still further.
This paradigm shift is all there is to it! SQL Server has some built-in aggregate functions to accomplish the task of summing a number of rows, and many others. The key difference is that you, in set-based thinking, instead of working with individual rows, we work with sets which can contain any number of rows.
To become a professional database developer you have to abandon the old habit of thinking what you want to do with each row, and start thinking instead about what you want to do with the set, by manipulating the columns. Think vertical instead of horizontal!
If we take this example a step further, considering the task of multiplying together all the rows, we will encounter the second biggest barrier to becoming a professional database developer, and that is math skills.
For a .NET developer, this pseudo code would be the way to go:
Dim intP AS Integer = 1, ' Initialise the product to 1
objR AS Row

For Each objR In coll.Rows
intP *= objR.Data
Next

Debug.Print intP
However, we just established that we no longer want to work with single rows, right? Unfortunately, however, while SQL Server provides a handy SUM aggregate function, it does not provide a MULTIPLY aggregate function. So, how do we perform this task in a set-based fashion? This is where your math skills become handy.
A * B * C equals e^(ln(A)+ln(B)+ln(C))
Having transformed the multiplication into a sum, it's easier to see that the final query should look something like this:
SELECT EXP ( SUM ( LOG ( Data )))FROM @Sample
But we still have an issue here; logarithmic operations are only allowed on positive values, and not including zero (logarithmic operations on zero and negative values are undefined in natural numbers and will give an Invalid Floating Point operation in T-SQL), and multiplication with zero, under any circumstances, is equal to zero. Also, remember that a multiplication with an odd number of negative numbers results in a negative product and multiplication with an even number of negative number results in a positive product. Here is one example of how to solve this problem:
IF EXISTS( SELECT * FROM @Sample WHERE Data = 0 )SELECT 0.0E -- If any value is zero, the total product is zero.ELSE
SELECT
CASE IsNegativeProductWHEN 1 THEN - EXP ( theSum )ELSE EXP ( theSum )END
FROM
(SELECT SUM ( LOG ( ABS ( Data ))), -- Sum all exponents
-- Keep track if product is negative or positive
SUM ( CASE WHEN Data < 0 THEN 1 ELSE 0 END ) % 2FROM @Sample
) AS d ( theSum , IsNegativeProduct )
The main reason for the checks in the above code is the limitations of the LOG function. So, can we complete the task in another fashion and still produce the correct result? Yes, we can:
DECLARE @p INT = 1
UPDATE @SampleSET @p * = Data
SELECT @p
This code, which looks very much like the C# pseudo code, produces the right answer, a value of -320, but how did it work? The difference is that we tell T-SQL to multiply the values but we do not force T-SQL to retrieve the rows in any order. Another reason we can use the code above is due to the nature of a product. If we want the product of A and B (a * b), the reverse is also valid (b * a); it doesn't matter which order we take the values in the multiplication, as long they are of same data type.
This leads us to the third 'light bulb' moment on the road to becoming a professional database developer: understanding that rows in a database have no order. This is very different to the way .NET developers think. A collection of objects is normally a double-linked list, where you can traverse forward and backward. Rows in a table can be logically ordered by assigning a sequence number or a sequence denoted by date and time, but you can never rely on the physical order of rows in a set.

Summary

  • Think in sets and not in single rows
  • Brush up your math skills as much as you possibly can
  • Rows have no determined physical order
Coming Next….
Efficient storage and retrieval: Normalization rules.

No comments:

Post a Comment