Under the pseudonym of 'SwePeso', Peter Larsson is
famous on SQL forums for the amazing performance he can get from SQL.
How does he do it? In the first of a series of articles, Peter explains
his secrets.
The single most common barrier to writing efficient
database code, for a developer versed in a typical client language (such
as C#), is an inability to leave behind the habit of working on one row
at a time, in favor of working with a set (a collection of rows).
What is the difference? Let's examine a very simple example of
summing up 5 rows in a table. The following code creates a table
variable (@Sample) and populates it with the rows of data.
DECLARE @Sample TABLE ( Data INT )
INSERT @Sample( Data )VALUES ( 1 ),
( 2 ),
( - 4 ),
( 5 ),
( 8 )
SELECT DataFROM @Sample
For a .NET developer this pseudo-code would be the way to perform this simple sum: INSERT @Sample( Data )VALUES ( 1 ),
( 2 ),
( - 4 ),
( 5 ),
( 8 )
SELECT DataFROM @Sample
Dim intS AS Integer = 0, 'Initialise the sum to zero
objR AS Row
For Each objR In coll.Rows
intS += objR.Data
Next
Debug.Print intS
This is the standard approach for a procedural language; we not only
tell the compiler what we want (the sum of all objects) but also how we
want to approach the objects when calculating the sum. In this example,
the coll variable holds a collection with objects of type Row. The For-Next loop iterates through all the objects and, for each object, fetches the current value from the Data property and adds the value to the intS variable. In other words the code operates on one row at a time. objR AS Row
For Each objR In coll.Rows
intS += objR.Data
Next
Debug.Print intS
T-SQL is a declarative language which means that the programmer is supposed to tell the database server what is required but let SQL Server decide the best way to finalize the task. However, many programmers force the procedural approach on SQL Server, with code such as the following:
DECLARE curSum CURSOR FOR
SELECT DataFROM @Sample
DECLARE @s INT ,@Total INT = 0
OPEN curSum
FETCH NEXTFROM curSumINTO @s
WHILE @@FETCH_STATUS = 0 BEGIN
SET @Total + = @sFETCH NEXTFROM curSumINTO @sEND
CLOSE curSumDEALLOCATE curSum
SELECT @Total
Even though this code is written in T-SQL, it still operates on one
object (row) at any given time. This approach is commonly referred to as
Row-by-Agonizing Row (RBAR). If we increase the number of rows tenfold,
the time to complete the task would increase by a factor of ten; for a
very large table (100 million rows or more) this sum could easily take
hours to complete. SELECT DataFROM @Sample
DECLARE @s INT ,@Total INT = 0
OPEN curSum
FETCH NEXTFROM curSumINTO @s
WHILE @@FETCH_STATUS = 0 BEGIN
SET @Total + = @sFETCH NEXTFROM curSumINTO @sEND
CLOSE curSumDEALLOCATE curSum
SELECT @Total
So how do you do this in a set-based way? Luckily for us, T-SQL has an aggregate function available to do the job for us.
SELECT SUM ( Data )FROM @Sample
The following table compares the performance of the cursor method,
versus the set-based method, for an increasing numbers of rows. Rows | CURSOR | SET BASED | Performance Factor | |||
DURATION (mS) | READS | DURATION | READS | DURATION | READS | |
1k | 47 | 4009 | 0 | 7 | – | 953 |
10k | 492 | 40004 | 3 | 42 | 164 | 953 |
100k | 5189 | 400389 | 28 | 388 | 179 | 1039 |
1000k | 51358 | 4003848 | 286 | 3847 | 180 | 1041 |
So why is the performance difference so vast? We'll discuss this in full detail in a later article in this series, but in essence, the set-based approach allows SQL Server to return the all the required rows very efficiently, fully exploiting the underlying data storage mechanisms. The internal structure where the data is stored in SQL Server is called a Page and it defines the minimum amount of data read or written to the database, regardless of whether you want all rows in that Page or only one row. So, with a set-based approach, we only need to fetch a page once to get all x rows stored within, whereas with the RBAR approach, we must re-fetch the page x times. We'll also see in a later article how, with appropriate indexes in place, we can speed up this process still further.
This paradigm shift is all there is to it! SQL Server has some built-in aggregate functions to accomplish the task of summing a number of rows, and many others. The key difference is that you, in set-based thinking, instead of working with individual rows, we work with sets which can contain any number of rows.
To become a professional database developer you have to abandon the old habit of thinking what you want to do with each row, and start thinking instead about what you want to do with the set, by manipulating the columns. Think vertical instead of horizontal!
If we take this example a step further, considering the task of multiplying together all the rows, we will encounter the second biggest barrier to becoming a professional database developer, and that is math skills.
For a .NET developer, this pseudo code would be the way to go:
Dim intP AS Integer = 1, ' Initialise the product to 1
objR AS Row
For Each objR In coll.Rows
intP *= objR.Data
Next
Debug.Print intP
However, we just established that we no longer want to work with
single rows, right? Unfortunately, however, while SQL Server provides a
handy SUM aggregate function, it does not provide a MULTIPLY aggregate function. So, how do we perform this task in a set-based fashion? This is where your math skills become handy. objR AS Row
For Each objR In coll.Rows
intP *= objR.Data
Next
Debug.Print intP
A * B * C equals e^(ln(A)+ln(B)+ln(C))
Having transformed the multiplication into a sum, it's easier to see that the final query should look something like this:
SELECT EXP ( SUM ( LOG ( Data )))FROM @Sample
But we still have an issue here; logarithmic operations are only
allowed on positive values, and not including zero (logarithmic
operations on zero and negative values are undefined in natural numbers
and will give an Invalid Floating Point operation in T-SQL), and
multiplication with zero, under any circumstances, is equal to zero.
Also, remember that a multiplication with an odd number of negative
numbers results in a negative product and multiplication with an even
number of negative number results in a positive product. Here is one
example of how to solve this problem:
IF EXISTS( SELECT * FROM @Sample WHERE Data = 0 )SELECT 0.0E -- If any value is zero, the total product is zero.ELSE
SELECT CASE IsNegativeProductWHEN 1 THEN - EXP ( theSum )ELSE EXP ( theSum )END
FROM (SELECT SUM ( LOG ( ABS ( Data ))), -- Sum all exponents
-- Keep track if product is negative or positiveSUM ( CASE WHEN Data < 0 THEN 1 ELSE 0 END ) % 2FROM @Sample
) AS d ( theSum , IsNegativeProduct )
The main reason for the checks in the above code is the limitations of the LOG function. So, can we complete the task in another fashion and still produce the correct result? Yes, we can: SELECT CASE IsNegativeProductWHEN 1 THEN - EXP ( theSum )ELSE EXP ( theSum )END
FROM (SELECT SUM ( LOG ( ABS ( Data ))), -- Sum all exponents
-- Keep track if product is negative or positiveSUM ( CASE WHEN Data < 0 THEN 1 ELSE 0 END ) % 2FROM @Sample
) AS d ( theSum , IsNegativeProduct )
DECLARE @p INT = 1
UPDATE @SampleSET @p * = Data
SELECT @p
This code, which looks very much like the C# pseudo code, produces
the right answer, a value of -320, but how did it work? The difference
is that we tell T-SQL to multiply the values but we do not force T-SQL
to retrieve the rows in any order. Another reason we can use the code
above is due to the nature of a product. If we want the product of A and
B (a * b), the reverse is also valid (b * a); it doesn't matter which
order we take the values in the multiplication, as long they are of same
data type. UPDATE @SampleSET @p * = Data
SELECT @p
This leads us to the third 'light bulb' moment on the road to becoming a professional database developer: understanding that rows in a database have no order. This is very different to the way .NET developers think. A collection of objects is normally a double-linked list, where you can traverse forward and backward. Rows in a table can be logically ordered by assigning a sequence number or a sequence denoted by date and time, but you can never rely on the physical order of rows in a set.
Summary
- Think in sets and not in single rows
- Brush up your math skills as much as you possibly can
- Rows have no determined physical order
Efficient storage and retrieval: Normalization rules.
No comments:
Post a Comment