
Converting String Data to XML and XML to String Data


01 February 2012
We all appreciate that, in general, XML documents or fragments are held in strings as text markup. In SQL Server, XML variables and columns are instead tokenised to allow rapid access to the data within. This is fine, but can cause some odd problems, such as 'entitization'. What, also, do you do if you need to preserve the formatting? As usual, Rob Sheldon comes to our aid.
When you’re working with XML data, you might find that you want to convert a value from the XML data type to another type, or from another type to the XML type. SQL Server lets you do both, to a limited degree. You can convert XML data to any of the string or binary types, and you can convert any of the string or binary types to XML. The process you follow is fairly straightforward in either case. Mostly, you need to know when SQL Server will do implicit conversions or when you must explicitly cast data into another type. And you should be aware of a couple of other subtleties when converting to or from XML data.
NOTE: This article is the fourth in a series about working with XML data in SQL Server. The first three articles cover the XML data type (“Working with the XML Data Type in SQL Server”), its methods (“The XML Methods in SQL Server”), and incorporating XML into database objects (“Incorporating XML into Your Database Objects”).

Converting String Data to XML

In SQL Server, you can convert data configured with any of the character or binary data types—such as CHAR, VARCHAR, and VARBINARY—to the XML data type. You can use the CAST() or CONVERT() function to explicitly cast the data to a different type, or you can let SQL Server implicitly convert the data. For instance, SQL Server will automatically convert a string value stored with the NVARCHAR data type into an XML value, as shown in the following example:
DECLARE @string NVARCHAR(MAX);
DECLARE @xml XML;
SET @string = '<bookstore><book>Candide</book></bookstore>';
SET @xml = @string;
SELECT @xml;
In this set of statements, I first declare the @string variable with the NVARCHAR(MAX) data type and then the @xml variable with the XML data type. I assign a string value—an XML fragment—to the @string data type, then set the value of @xml to equal @string. Because SQL Server can implicitly convert an NVARCHAR value to an XML value, the assignment is very straightforward, which I confirm by querying the @xml variable. As expected, the SELECT statement returns the following results:
<bookstore><book>Candide</book></bookstore>
The result set contains the XML fragment as it was originally assigned to the @string variable. The same holds true for other character data types, in terms of performing implicit conversions. For instance, in the following example, I replace the NVARCHAR(MAX) data type with VARCHAR(100):
DECLARE @string VARCHAR(100);
DECLARE @xml XML;
SET @string = '<bookstore><book>Candide</book></bookstore>';
SET @xml = @string;
SELECT @xml;
This example is identical to the preceding one, except for switching the character data types. As a result, the SELECT statement returns the same XML fragment as before:
<bookstore><book>Candide</book></bookstore>
Again, SQL Server has implicitly converted the string data to XML. However, you can explicitly convert that data if you want. In the following example, I use the CAST() function to convert the @string value:
DECLARE @string VARCHAR(100);
DECLARE @xml XML;
SET @string = '<bookstore><book>Candide</book></bookstore>';
SET @xml = CAST(@string AS XML);
SELECT @xml;
When you specify the CAST() function, you must pass in an expression that references the data to be converted, in this case, the @string variable, followed by the AS keyword and the name of the target data type, XML. All this is enclosed in parentheses, as I’ve done here. I then assign the value returned by the CAST() function to the @xml variable. As to be expected, the SELECT statement returns the same results as the previous examples:
<bookstore><book>Candide</book></bookstore>
You can just as easily use the CONVERT() function to achieve the same results. The main difference is how you structure the arguments that you pass into the function. For CONVERT(), you must first specify the target data type (XML), followed by the expression that references the source data (@string), as shown in the following example:
DECLARE @string VARCHAR(100);
DECLARE @xml XML;
SET @string = '<bookstore><book>Candide</book></bookstore>';
SET @xml = CONVERT(XML, @string);
SELECT @xml;
Notice that I follow the XML data type argument with a comma, then I specify the @string variable. Once again, the SELECT statement returns the XML fragment as it is assigned to the @string variable:
<bookstore><book>Candide</book></bookstore>
You might be wondering why you would explicitly cast string or binary data to the XML type if SQL Server handles such conversions implicitly. Let’s look at the CAST() function first. One reason you might want to use CAST() is if you plan to run your SQL script against a database management system in addition to SQL Server and that system doesn’t support implicit conversions. Because the CAST() function conforms to ANSI specifications, you can use it with any database systems that conform to those standards.
That’s not the case with the CONVERT() function, which is specific to Transact-SQL in SQL Server. If you want to run your script against another database system, you have to modify your code to conform to that system’s specifications. However, if you’re running your script only against SQL Server, you can take advantage of the CONVERT() function’s support for additional options, which let you better refine the conversion process.
For instance, in the following example I use tabs between the parent and child elements in the XML fragment:
DECLARE @string VARCHAR(100);
DECLARE @xml XML;
SET @string = '<bookstore> <book>Candide</book> </bookstore>';
SET @xml = CONVERT(XML, @string);
SELECT @xml;
The tabs serve only to add white space between the XML elements and do not affect the elements themselves. Then, as I do in the previous example, I use the CONVERT() function to cast the string as XML data. However, when I run the SELECT statement, the data no longer includes the white space, as shown in the following results:
<bookstore><book>Candide</book></bookstore>
But I can preserve the white space in the original string by adding a third argument to the CONVERT() function, as shown in the following example:
DECLARE @string VARCHAR(100);
DECLARE @xml XML;
SET @string = '<bookstore> <book>Candide</book> </bookstore>';
SET @xml = CONVERT(XML, @string, 1);
SELECT @xml;
Notice that the function’s third argument is 1, which tells SQL Server to preserve the white space in the source data during the conversion process. Now the SELECT statement returns the following results:
<bookstore> <book>Candide</book> </bookstore>
This time the tabs are intact. I could have used spaces instead and received similar results. Now let’s look at what happens when we add linefeeds to the XML fragment. In the following example, I modify the string to include an additional book along with linefeeds and tabs:
DECLARE @string VARCHAR(100);
DECLARE @xml XML;
SET @string = '<bookstore>
<book>Candide</book>
<book>Pride and Prejudice</book>
</bookstore>';
SET @xml = CONVERT(XML, @string, 1);
SELECT @xml;
Because I pass the third argument (1) into the CONVERT() function, as I did in the previous example, the conversion process preserves the white space and linefeeds, as shown in the results returned by the SELECT statement:
<bookstore>
<book>Candide</book>
<book>Pride and Prejudice</book>
</bookstore>
Another thing worth noting about converting data to the XML data type is that once you’ve converted the data, you can use the data type’s methods to retrieve data. For example, I modified the SELECT statement in the previous example to include the query() method:
DECLARE @string VARCHAR(100);
DECLARE @xml XML;
SET @string = '<bookstore>
<book>Candide</book>
<book>Pride and Prejudice</book>
</bookstore>';
SET @xml = CONVERT(XML, @string, 1);
SELECT @xml.query('/bookstore/book[2]');
As you can see, I pass in the second instance of the child element (/bookstore/book[2]) as an argument to the query() method. The SELECT statement now returns the following results.
<book>Pride and Prejudice</book>
You can, of course, use the query() method to perform more complex queries, and you can use other methods. The key is to save the string data to the XML type and then call the method.
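For instance, here is a minimal sketch, using the same hypothetical bookstore fragment, that converts the string and then pulls a single typed value out with the value() method:

```sql
DECLARE @string VARCHAR(100);
DECLARE @xml XML;
SET @string = '<bookstore>
<book>Candide</book>
<book>Pride and Prejudice</book>
</bookstore>';
SET @xml = CONVERT(XML, @string);

-- value() returns a scalar of the requested SQL type,
-- here the text of the second <book> element as VARCHAR
SELECT @xml.value('(/bookstore/book[2])[1]', 'VARCHAR(50)');
```

The SELECT statement returns the plain string Pride and Prejudice rather than an XML fragment, which is the key difference between value() and query().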

Converting XML to String Data

Although SQL Server implicitly converts string and binary data to the XML data type, it doesn’t work the other way. All conversions from the XML type to string and binary types must be done explicitly using the CAST() or CONVERT() function. If you try to perform an implicit conversion, your statement will fail. For instance, the following example tries to convert the XML fragment to VARCHAR(100):
DECLARE @xml XML;
DECLARE @string VARCHAR(100);
SET @xml = '<bookstore><book>Candide</book></bookstore>';
SET @string = @xml;
SELECT @string;
After I declare the @xml and @string variables, I assign the XML fragment to the @xml variable. I then try to set the @string variable to equal the @xml variable, but when I try to run these statements, SQL Server returns the following error:
Implicit conversion from data type xml to varchar is not allowed. Use the CONVERT function to run this query.
The problem is easy enough to fix. In the following example, I include the CAST() function to explicitly convert the XML value to VARCHAR(100):
DECLARE @xml XML;
DECLARE @string VARCHAR(100);
SET @xml = '<bookstore><book>Candide</book></bookstore>';
SET @string = CAST(@xml AS VARCHAR(100));
SELECT @string;
Now the XML value is converted into a string with no problem. When I retrieve the value of @string, the SELECT statement returns the XML fragment, as shown in the following results:
<bookstore><book>Candide</book></bookstore>
I could have just as easily used the CONVERT() function to cast the XML value to VARCHAR:
DECLARE @xml XML;
DECLARE @string VARCHAR(100);
SET @xml = '<bookstore><book>Candide</book></bookstore>';
SET @string = CONVERT(VARCHAR(100), @xml);
SELECT @string;
As to be expected, the conversion occurs without a hitch and the XML fragment is assigned to the @string variable, which the SELECT statement confirms. Now suppose you insert linefeeds and tabs into your XML fragment, as I do in the following example:
DECLARE @xml XML;
DECLARE @string VARCHAR(100);
SET @xml = '<bookstore>
<book>Candide</book>
<book>Pride and Prejudice</book>
</bookstore>';
SET @string = CONVERT(VARCHAR(100), @xml, 1);
SELECT @string;
Because the XML fragment contains these new characters, you might expect that you can simply add the third argument to the CONVERT() function to preserve the tabs and linefeeds. However, although I take this approach, the value saved to the @string variable does not preserve them, as shown in the results returned by the SELECT statement:
<bookstore><book>Candide</book><book>Pride and Prejudice</book></bookstore>
The problem is not with how I convert the @xml value, but with the way I assign the XML fragment to that variable. Notice that I assign the value simply by setting @xml to equal the XML fragment, enclosed in single quotes. What is essentially happening here is that SQL Server is implicitly converting a string value to the XML type, which means that the tabs and linefeeds are not being preserved during that assignment. The way to get around this is to use the CONVERT() function to explicitly cast the string to the XML type, as I do in the following example:
DECLARE @xml XML;
DECLARE @string VARCHAR(100);
SET @xml = CONVERT(XML, '<bookstore>
<book>Candide</book>
<book>Pride and Prejudice</book>
</bookstore>', 1);
SET @string = CONVERT(VARCHAR(100), @xml, 1);
SELECT @string;
As you can see, I include a third argument in the CONVERT() function that specifies these elements be preserved. The SELECT statement returns the following results:
<bookstore>
<book>Candide</book>
<book>Pride and Prejudice</book>
</bookstore>
Now the string includes the tabs and linefeeds as they exist in the original XML fragment. Let’s take a closer look at this process so you can better understand what’s happening behind the scenes when you convert XML data.

XML Entitization

As the last example demonstrates, you can use the CONVERT() function to preserve characters that the XML parser normally ignores, such as spaces and tabs between elements. What this points to is that SQL Server handles certain components of an XML fragment—whether converting to or from the XML type—differently from the basic elements, attributes, and their values.
To better understand what’s happening, take a look at another example. The following set of Transact-SQL statements are similar to the last example, except that I’ve included a second SELECT statement to retrieve the @xml value and I’ve modified the second instance of the CONVERT() function so it doesn’t include the third argument (1):
DECLARE @xml XML;
DECLARE @string VARCHAR(100);
SET @xml = CONVERT(XML, '<bookstore>
<book>Candide</book>
<book>Pride and Prejudice</book>
</bookstore>', 1);
SELECT @xml;
SET @string = CONVERT(VARCHAR(100), @xml);
SELECT @string;
As you might expect, the new SELECT statement returns the XML fragment with the linefeeds and tabs in place:
<bookstore>
<book>Candide</book>
<book>Pride and Prejudice</book>
</bookstore>
However, the second SELECT statement returns several unexpected values:
<bookstore>
&#x09;<book>Candide</book>
&#x09;<book>Pride and Prejudice</book>&#x0A;</bookstore>
NOTE: If I were to try to run this example without converting the original XML fragment, or converting the fragment without specifying the third argument in the CONVERT() function, the tabs and linefeeds would not be preserved, and the two SELECT statements would return the string without those tabs and linefeeds.
When the XML parser stores characters such as tabs and linefeeds within an XML document or fragment, it must do so in a way that preserves them throughout the parser’s normalization process, allowing these characters to be maintained whether being stored or retrieved or whether being converted to or from XML. Tabs and linefeeds are just two of the special characters that need to be preserved in this way, as the above example demonstrates, with tabs being saved as &#x09; and linefeeds saved as &#x0A;. When you retrieve the XML value directly, these special characters are automatically displayed in a readable format.
The process of preserving certain characters with these special symbols is known as entitization. Unfortunately, the XML parser is not always consistent in entitizing characters, as demonstrated by the fact that only one linefeed is stored as &#x0A;. Another issue is the way the parser handles special characters that appear in an element or attribute value. In such cases, SQL Server simply returns an error, rather than trying to entitize them. For instance, in the following example I replace the word “and” in the second book title with an ampersand (&):
DECLARE @xml XML;
DECLARE @string VARCHAR(100);
SET @xml = CONVERT(XML, '<bookstore>
<book>Candide</book>
<book>Pride & Prejudice</book>
</bookstore>', 1);
SELECT @xml;
SET @string = CONVERT(VARCHAR(100), @xml);
SELECT @string;
When I run this statement, I receive the following error:
XML parsing: line 3, character 15, illegal name character
The XML parser doesn’t like characters such as ampersands, left brackets (<), and right brackets (>) within element and attribute values. To store XML that contains these characters, you must manually entitize them by replacing the character with their symbols. In this case, I replace the ampersand with &amp;:
DECLARE @xml XML;
DECLARE @string VARCHAR(100);
SET @xml = CONVERT(XML, '<bookstore>
<book>Candide</book>
<book>Pride &amp; Prejudice</book>
</bookstore>', 1);
SELECT @xml;
SET @string = CONVERT(VARCHAR(100), @xml);
SELECT @string;
The parser can now parse the XML fragment. In doing so, it preserves the entitized ampersand, as shown in the results returned by the first SELECT statement:
<bookstore>
<book>Candide</book>
<book>Pride &amp; Prejudice</book>
</bookstore>
As you can see, the second book title contains the entitized ampersand. Now SQL Server has no problem handling the element’s value when that value is stored, retrieved, or converted. The entitized ampersand is also preserved when you convert the XML value to a string, which the second SELECT statement confirms:
<bookstore>
&#x09;<book>Candide</book>
&#x09;<book>Pride &amp; Prejudice</book>&#x0A;</bookstore>
Of course, chances are pretty good you won’t want your string value to show the entitized characters. As before, you can try to get rid of them by adding the third argument to the second CONVERT() function, as shown in the following Transact-SQL:
DECLARE @xml XML;
DECLARE @string VARCHAR(100);
SET @xml = CONVERT(XML, '<bookstore>
<book>Candide</book>
<book>Pride &amp; Prejudice</book>
</bookstore>', 1);
SELECT @xml;
SET @string = CONVERT(VARCHAR(100), @xml, 1);
SELECT @string;
The second SELECT statement now returns the following results:
<bookstore>
<book>Candide</book>
<book>Pride &amp; Prejudice</book>
</bookstore>
Almost there. The entitized tab and linefeed characters are returned as regular characters, but the ampersand within the element value is still entitized. To address this issue, you have to specifically remove it from your string. For example, in the following Transact-SQL I use the REPLACE() function to replace the &amp; value with an ampersand:
DECLARE @xml XML;
DECLARE @string VARCHAR(100);
SET @xml = CONVERT(XML, '<bookstore>
<book>Candide</book>
<book>Pride &amp; Prejudice</book>
</bookstore>', 1);
SELECT @xml;
SET @string = REPLACE(CONVERT(VARCHAR(100), @xml, 1), '&amp;', '&');
SELECT @string;
For the first argument of the REPLACE() function, I pass in the converted @xml value, then specify the value to be replaced (&amp;) and the new value (&). Now the second SELECT statement returns the value we want:
<bookstore>
<book>Candide</book>
<book>Pride & Prejudice</book>
</bookstore>
As the results show, the string value no longer contains any entitized characters. As you work with XML data, you’ll get a better feel for when to manually entitize and de-entitize data. For more information about the entitization process in SQL Server, see the topic “Serialization of XML Data” in SQL Server Books Online.
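Going in the other direction, if you build XML markup from raw string values, you can entitize the markup-significant characters yourself before converting. Here is a minimal sketch with an invented book title; note that the ampersand must be entitized first, so that the entities produced by the later REPLACE() calls aren’t double-encoded:

```sql
DECLARE @title VARCHAR(100) = 'Pride & Prejudice <uncut>';
DECLARE @entitized VARCHAR(200);

-- entitize & first, then < and >, so new entities aren't re-encoded
SET @entitized = REPLACE(REPLACE(REPLACE(@title,
    '&', '&amp;'),
    '<', '&lt;'),
    '>', '&gt;');

DECLARE @xml XML = CONVERT(XML, '<book>' + @entitized + '</book>');
SELECT @xml;
```

Without the REPLACE() calls, the CONVERT() would fail with the same "illegal name character" parsing error shown earlier.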

XML Data Conversion

For the most part, converting data from the XML data type to another type or converting it from another type to the XML type is a fairly painless process. Not surprisingly, there are a few gotchas along the way, entitization being one of them. For more details about converting XML data, see the topics “Generating XML Instances” and “CAST and CONVERT (Transact-SQL)” in SQL Server Books Online. Also worth checking out is how to incorporate the FOR XML clause in a SELECT statement to return results as XML. You can find details about that in my Simple-Talk article “Using the FOR XML Clause to Return Query Results as XML.” As you can see, SQL Server provides a number of ways to work with XML data. This article, along with the first three in the series, should have provided you with a good overview of the various considerations to take into account when working with XML.

Recommendations - NVARCHAR



Don’t Use VARCHAR(MAX) Everywhere

One design-simplification technique that has been suggested is to use VARCHAR(MAX) for every string column. Though the tests here show that using VARCHAR(MAX) (in-row) instead of VARCHAR(n) is not a performance problem for strings under 8,000 characters, there are a number of reasons to limit the length of strings in your database.

UI Issues

Anyone with experience creating an application UI knows that handling long strings is difficult. When laying out the UI for an application, it is important to know the expected and maximum length of strings that will be displayed. Leaving this open to whatever someone writes to the database makes designing, building, and testing an application very difficult.

Performance – Writing or Updating Long Strings

From the performance results, you can see that writing long strings definitely affects write delays. While writing a record of any size (even zero characters) takes time, writing longer strings takes more time. As an example, using VARCHAR(n), writing 1,000 characters takes an average of 217 microseconds while writing 8,000 characters takes an average of 448 microseconds.

Compression

What I believe is the most compelling reason to avoid over-8,000-character strings is compression. LOB data never gets compressed.
“When considering data compression one very key thing to remember is that out of row data (LOB data) isn’t compressed. If the LOB data is stored in row, then it will be compressed, but only when you compress the table with PAGE level compression. If however you use ROW level compression then the LOB data will never be compressed, no matter if it is stored in row or if it is stored out of row.” (ref)
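To illustrate the point, here is a hypothetical example (the table and column names are invented, but the syntax is standard Transact-SQL). Only PAGE-level compression can compress in-row LOB values; out-of-row LOB pages are never compressed either way:

```sql
-- hypothetical table with a large string column that stays in row when it fits
CREATE TABLE dbo.Articles
(
    ArticleID INT PRIMARY KEY,
    Body VARCHAR(MAX)
);

-- PAGE compression can compress in-row data, including in-row LOB values;
-- ROW compression never compresses LOB data
ALTER TABLE dbo.Articles
REBUILD WITH (DATA_COMPRESSION = PAGE);
```

Keeping string columns under the 8,000-character in-row limit therefore gives the compression features more to work with.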

Incorporating XML into your Database Objects


04 January 2012
XML data can become a full participant in a SQL Server Database, and can be used in views, functions, check constraints, computed columns and defaults. Views and table-valued functions can be used to provide a tabular view of XML data that can be used in SQL Expressions. Robert Sheldon explains how.
In my last two articles, “Working with the XML Data Type in SQL Server” and “The XML Methods in SQL Server,” I discussed how to use the XML data type and its methods to store and access XML data. This article takes these topics a step further and explains ways in which you can implement XML within various database objects, including views, functions, computed columns, check constraints, and defaults. If you’re not familiar with how the XML data type and its methods are implemented in SQL Server, you should review the first two articles before starting in on this one. Once you have the basic information, you’ll find that incorporating XML within these other database objects is a relatively straightforward process.

Creating Views

When you create a view, you can include an XML column just like you would a column configured with another data type. For example, the Sales.Store table in the AdventureWorks2008R2 sample database includes the Demographics column, which is configured with the XML type. For each row in the table, the column stores an XML document that contains details about the store described in that row of data.
In the following example, I create a view that retrieves data from the Sales.Store table, including the Demographics column:
USE AdventureWorks2008R2;
IF OBJECT_ID('StoreSurvey') IS NOT NULL
DROP VIEW StoreSurvey;
GO

CREATE VIEW StoreSurvey
AS
SELECT
BusinessEntityID AS StoreID,
Demographics AS Survey
FROM
Sales.Store;
GO

SELECT Survey
FROM StoreSurvey
WHERE StoreID = 292;
There should be no surprises here. I simply specified the name of the XML column as I would other columns and assigned the alias Survey to the column name. The SELECT statement I tagged onto the view definition returns the following results:
<StoreSurvey xmlns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/StoreSurvey">
<AnnualSales>800000</AnnualSales>
<AnnualRevenue>80000</AnnualRevenue>
<BankName>United Security</BankName>
<BusinessType>BM</BusinessType>
<YearOpened>1996</YearOpened>
<Specialty>Mountain</Specialty>
<SquareFeet>21000</SquareFeet>
<Brands>2</Brands>
<Internet>ISDN</Internet>
<NumberEmployees>13</NumberEmployees>
</StoreSurvey>
The SELECT statement returns the full XML document, as it’s stored in the Demographics column (for the row with a StoreID value of 292). Although this is all fine, it’s not particularly noteworthy. Where things get interesting is when you use an XML method within a view to retrieve only part of the XML data. For example, in the following view definition, I use the value() method to retrieve a single value from the XML document in the Demographics column:
USE AdventureWorks2008R2;
IF OBJECT_ID('StoreSales') IS NOT NULL
DROP VIEW StoreSales;
GO

CREATE VIEW StoreSales
AS
SELECT
BusinessEntityID AS StoreID,
Demographics.value('declare namespace ns=
"http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/StoreSurvey";
(/ns:StoreSurvey/ns:AnnualSales)[1]'
, 'int') AS AnnualSales
FROM
Sales.Store;
GO

SELECT AnnualSales
FROM StoreSales
WHERE StoreID = 292;
The value() method retrieves the value stored in the <AnnualSales> child element of the <StoreSurvey> element. Because the Demographics column is a typed XML column, the value() method’s first argument includes the namespace reference, along with a reference to the actual element. And the method’s second argument specifies that the value be returned as type int. The SELECT statement I’ve tagged onto the view definition now returns only a single value, 800000.
Because XQuery expressions can get somewhat complex, particularly when they include the namespace reference, views provide a handy way of storing those expressions in your database. For instance, if you want to retrieve annual sales information regularly, a view such as the one above can make the process quite easy.

Creating Functions

As is the case with views, you can also work with XML data within functions. For example, in the following function definition, I return the contents of the Demographics column based on an inputted store ID value (which is equivalent to the table’s BusinessEntityID value):
USE AdventureWorks2008R2;
IF OBJECT_ID('Survey') IS NOT NULL
DROP FUNCTION Survey;
GO

CREATE FUNCTION Survey(@StoreID INT) RETURNS XML
AS BEGIN
RETURN
(
SELECT Demographics
FROM Sales.Store
WHERE BusinessEntityID = @StoreID
);
END;
GO

SELECT dbo.Survey(292);
Once again, no surprises here. When I call the function with the store ID (292) in a SELECT statement, the function returns the full XML document from the Demographics column for that store. However, notice that I specified XML as the return type. Because XML is being returned, you can even use the XML methods when you call the function. For instance, in the following example, I again call the Survey function, passing in the store ID (292), but this time I also include the value() method:
SELECT dbo.Survey(292).value('declare namespace ns=
"http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/StoreSurvey";
(/ns:StoreSurvey/ns:AnnualSales)[1]'
, 'int');
Notice that I simply add the period and method name when I call the function, followed by the necessary arguments. As you would expect, the function now returns only a single value, in this case, 800000.
I could have just as easily specified the query() method, as in the following example:
SELECT dbo.Survey(292).query('declare namespace ns=
"http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/StoreSurvey";
/ns:StoreSurvey/ns:AnnualSales'
);
Now the SELECT statement returns the entire <AnnualSales> element, along with the namespace information, as shown in the following results:
<ns:AnnualSales xmlns:ns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/StoreSurvey">800000</ns:AnnualSales>
So far, working with XML data within a function has been pretty straightforward, but suppose you don’t want to retrieve the entire XML document. You can instead use XML methods within the function to return specific information from the document. In the following example, I create a function that returns only the value from the <AnnualSales> child element, just like I did earlier in my view definition:
USE AdventureWorks2008R2;
IF OBJECT_ID('AnnualSales') IS NOT NULL
DROP FUNCTION AnnualSales;
GO

CREATE FUNCTION AnnualSales(@survey XML) RETURNS INT
AS BEGIN
RETURN
@survey.value('declare namespace ns=
"http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/StoreSurvey";
(/ns:StoreSurvey/ns:AnnualSales)[1]'
, 'int');
END;
GO

SELECT dbo.AnnualSales(Demographics) AS AnnualSales
FROM Sales.Store
WHERE BusinessEntityID = 292;
There are a couple of things worth noting about this function definition. First, unlike the preceding example, I specified INT as the return type (rather than XML) because INT is the value type returned by the value() method. In addition, I specified XML as the type for the input parameter (@survey). I then used that parameter to call the value() method. That way, I can pass in the column name as the argument to the function, and the value() method will return the specific value from the XML document in the column.
To verify this, I created a SELECT statement that calls the function and passes in the Demographics column as its input argument. Now the function returns a value of 800000, without having to specify an XML method when calling the function itself, as in the preceding example.
Being able to use XQuery expressions within functions provides the same advantages as being able to use them in views. You can save complex expressions to the database that you can call as often as necessary, without having to reconstruct those expressions each time you need them.

Creating Computed Columns

There might be times when you want to use XML data to construct a computed column. One simple way to do that is to convert an entire XML value to a string, as I do in the following example:
USE AdventureWorks2008R2;
IF OBJECT_ID('Stores') IS NOT NULL
DROP TABLE Stores;
GO

CREATE TABLE Stores
(
StoreID INT PRIMARY KEY,
SurveyXml XML,
SurveyChar AS CAST(SurveyXml AS NVARCHAR(MAX))
);
GO

INSERT INTO Stores(StoreID, SurveyXml)
SELECT BusinessEntityID, Demographics
FROM Sales.Store;
SELECT * FROM Stores
WHERE StoreID = 292;
First, I create the Stores table, which includes one XML column and one calculated column. The calculated column is simply a recasting of the XML column to an NVARCHAR(MAX) column. Notice in the INSERT statement, I retrieve data from the Sales.Store table, including the Demographics column. The Demographics data is converted to NVARCHAR(MAX) when inserted into the SurveyChar column. As you would expect, the SELECT statement returns the full XML document for the SurveyXml column and returns the same document as character data for the SurveyChar column.
This is all well and good if all you want to do is play around with the entire XML document. But suppose instead you want to create a calculated column based on only a portion of that document. That’s where things get a bit sticky.
It turns out that you can’t use XML methods to define a calculated column. However, there is a work-around. You can define a function that retrieves the data you need and then use that function within the column definition, as I do in the following example:
USE AdventureWorks2008R2;
IF OBJECT_ID('Stores') IS NOT NULL
DROP TABLE Stores;
GO

IF OBJECT_ID('AnnualSales') IS NOT NULL
DROP FUNCTION AnnualSales;
GO

CREATE FUNCTION AnnualSales(@survey XML) RETURNS INT
AS BEGIN
RETURN
@survey.value('declare namespace ns=
"http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/StoreSurvey";
(/ns:StoreSurvey/ns:AnnualSales)[1]'
, 'int');
END;
GO

CREATE TABLE Stores
(
StoreID INT PRIMARY KEY,
Survey XML,
SalesAmount AS dbo.AnnualSales(Survey)
);
GO

INSERT INTO Stores(StoreID, Survey)
SELECT BusinessEntityID, Demographics
FROM Sales.Store;
SELECT * FROM Stores
WHERE StoreID = 292;
First, I create a function (AnnualSales) that uses the value() method to retrieve the <AnnualSales> value from the inputted XML document, as you saw in a previous example. I then create a table that includes an XML column (Survey) and a calculated column (SalesAmount). Within the SalesAmount column definition, I call the AnnualSales function and pass in the Survey column as its argument. Next, I populate the table with data from the Sales.Store table and then use a SELECT statement to retrieve a row from that table. As to be expected, the Survey column returns the full XML document, and the SalesAmount column returns a value of 800000.
This is, of course, a roundabout way to use XML methods to create a calculated column, but it’s an effective approach nonetheless. The key is making sure that the function targets the XML data that will provide the source for the calculated column. That means any schema and element references must be consistent with the source data.
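Once the table is in place, the calculated column can be queried like any other column. The following quick check is only a sketch (it assumes the Stores table above has been created and populated, and the sales threshold is an arbitrary value chosen for illustration); note that each reference to SalesAmount invokes the AnnualSales function against the Survey column:

```sql
-- Hypothetical sanity check: stores whose latest survey reports
-- annual sales of at least one million. SalesAmount is computed
-- by calling dbo.AnnualSales(Survey) for each row.
SELECT StoreID, SalesAmount
FROM Stores
WHERE SalesAmount >= 1000000
ORDER BY SalesAmount DESC;
```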

Creating Check Constraints

Another way you can work with XML in your database is to create check constraints on an XML column. However, chances are, if you decide to do that, you’ll want to use one of the XML methods to define the constraint. The problem is that, as with calculated columns, you can’t use these methods directly within the constraint definition. Once again, you must first create a function, as I’ve done in the following example:
USE AdventureWorks2008R2;
IF OBJECT_ID('Stores') IS NOT NULL
DROP TABLE Stores;
GO

IF OBJECT_ID('SpecialtyExists') IS NOT NULL
DROP FUNCTION SpecialtyExists;
GO

CREATE FUNCTION SpecialtyExists(@survey XML)
RETURNS BIT
AS BEGIN
RETURN
@survey.exist('declare namespace ns=
"http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/StoreSurvey";
/ns:StoreSurvey/ns:Specialty')
END;
GO

CREATE TABLE Stores
(
StoreID INT PRIMARY KEY,
Survey XML CHECK(dbo.SpecialtyExists(Survey) = 1)
);
GO

INSERT INTO Stores (StoreID, Survey)
SELECT BusinessEntityID, Demographics
FROM Sales.Store;
SELECT * FROM Stores
WHERE StoreID = 292;
In this case, the function (SpecialtyExists) uses the exist() method to determine whether the input XML document contains the <Specialty> child element within the <StoreSurvey> parent element. The exist() method returns 1 if the child element exists within the parent element and 0 if the elements don’t exist as specified, which is why the return type used for the function is BIT. As for the exist() method itself, it takes only one argument: an XQuery expression that, in this case, includes the namespace declaration and the path to the parent and child elements.
After I defined the function, I created a table (Stores), which includes an XML column (Survey). I’ve also defined a check constraint on the column. The constraint uses the SpecialtyExists function to determine if the data to be inserted into the Survey column contains the specified child and parent elements. If the data contains the elements, the rows are inserted into the table; otherwise the insertion fails.
For instance, when I inserted the data from the Sales.Store table, the Demographics data was added to the store data because that data includes the required child and parent elements. However, if the function had specified a different child element, such as <MainType>, no rows would have been inserted because the check constraint would have evaluated to False in each case.
As you saw with calculated columns, functions provide a workaround for using XML methods to create check constraints on XML columns. In the example above, I used the exist() method to verify the existence of an element. However, I could have also used that method to check for the existence of an attribute, or I could have used another XML method, although the exist() method will probably prove to be the handiest one in this case.
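To see the constraint in action, you can try inserting a document that lacks the required element. In this sketch (the StoreID value is hypothetical), the document contains no properly namespaced <Specialty> element, so SpecialtyExists returns 0 and the INSERT fails with a CHECK constraint violation:

```sql
-- This document omits the <Specialty> element (and the StoreSurvey
-- namespace), so the check constraint rejects the row.
INSERT INTO Stores (StoreID, Survey)
VALUES (9999, '<StoreSurvey><AnnualSales>300000</AnnualSales></StoreSurvey>');
```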

Creating Defaults

If you create a table that includes an XML column, you can create a DEFAULT constraint on that column, as you would other column types. The main difference with the XML column, however, is that if that column is typed XML, your constraint can’t violate the schema associated with that column.
You can create a DEFAULT constraint on an XML column in one of two ways: by implicitly converting the string value to XML or explicitly converting the value. In the following example, I use both approaches:
USE AdventureWorks2008R2;
IF OBJECT_ID('Stores') IS NOT NULL
DROP TABLE Stores;
GO

CREATE TABLE Stores
(
StoreID INT PRIMARY KEY,
Survey_implicit1 XML
DEFAULT
'<Survey>Survey to follow</Survey>',
Survey_implicit2 XML
DEFAULT
'<Survey>Survey to follow</Survey>',
Survey_explicit1 XML
DEFAULT
CAST('<Survey>Survey to follow</Survey>' AS XML),
Survey_explicit2 XML
DEFAULT
CAST('<Survey>Survey to follow</Survey>' AS XML)
);
GO

INSERT INTO Stores
(StoreID, Survey_implicit1, Survey_explicit1)
SELECT BusinessEntityID, Demographics, Demographics
FROM Sales.Store;
SELECT * FROM Stores
WHERE StoreID = 292;
Notice that I created four XML columns in the table. I did this to verify how the defaults work when I don’t add data to the columns. The first two XML columns (Survey_implicit1 and Survey_implicit2) each include a default definition that specifies an XML value as a string, without trying to convert that value to the XML type. For the third and fourth columns (Survey_explicit1 and Survey_explicit2), I explicitly cast the string values to the XML type. Because SQL Server automatically converts the string values, you can take either approach.
After I defined the table, I inserted data into the Survey_implicit1 and Survey_explicit1 columns. I then retrieved a row from the table. As expected, the Survey_implicit1 and Survey_explicit1 columns returned the entire XML document, and the Survey_implicit2 and Survey_explicit2 columns each returned the value defined in the DEFAULT constraint.
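You can also watch the defaults fire by inserting a row that supplies only the primary key (the StoreID value here is hypothetical):

```sql
-- Only StoreID is supplied, so all four XML columns fall back
-- to their DEFAULT constraints.
INSERT INTO Stores (StoreID)
VALUES (9999);

SELECT * FROM Stores
WHERE StoreID = 9999;
```

Each of the four Survey columns in the returned row should contain the <Survey>Survey to follow</Survey> fragment defined in the DEFAULT constraints.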

Working with XML

As you saw with the DEFAULT constraints shown in the example above, you can convert non-XML data into XML. You can also convert XML data into other types. In my next article, I’ll provide more specifics on how those conversions work. In the meantime, this article should have provided you with the details you need to work with XML data when creating views, functions, computed columns, check constraints, and defaults. Of course, the XML methods, and the XQuery expressions you can create by using those methods, often play an important role in how you incorporate XML in your database objects, so knowing those methods and the XQuery language can be pivotal in effectively using that XML.

8/25/2012

SSIS Basics: Adding Data Flow to Your Package


12 June 2012
Annette continues her popular series for SSIS beginners by showing how a data flow task can be used in a package to move data from a SQL Server database to an Excel file, and insert an additional column into the Excel file that’s based on derived data.
In my previous article, "SSIS Basics: Setting Up Your Initial Package", I showed you how to create an SSIS package and configure connection managers, data sources, and data source views. In this article, I will show you how to use some of those data connections to retrieve data from a SQL Server database and load the data into an Excel file. I will also show you how to add a computed column based on data derived from the data flow. In addition, I will demonstrate how to run the package.
SSIS supports many control flow items that manage a package’s workflow, but the one I consider the most important and most often used is the Data Flow task. For this reason, I focus on that task in this article. In future articles, I’ll cover other control flow items.
Note:
If you want to try out the examples in this article, you’ll need to create an OLE DB connection manager that points to the AdventureWorks database and an Excel connection manager that points to an Excel file. You can find details about how to set up these connection managers in my previous article, which is referenced above. I created the examples shown in this article and the last one on a local instance of SQL Server 2008 R2.

Adding a Data Flow Task

Our goal in creating this package is to move data from a SQL Server database to an Excel file. As part of that goal, we also want to insert an additional column into the Excel file that’s based on derived data.
To carry out our goal, we must add a Data Flow task to our control flow. The task lets us retrieve data from our data source, transform that data, and insert it into our destination, the Excel file. The Data Flow task is one of the most important and powerful components in SSIS and as such has its own workspace, which is represented by the Data Flow tab in SSIS Designer, as shown in Figure 1.
The Data Flow tab in SSIS Designer
Figure 1: The Data Flow tab in SSIS Designer
Before we can do anything on the Data Flow tab, we must first add a Data Flow task to our control flow. To add the task, drag it from the Control Flow Items window to the Control Flow tab of the SSIS Designer screen, as illustrated in Figure 2.
Adding a Data Flow task to the control flow
Figure 2: Adding a Data Flow task to the control flow
To configure the data flow, double-click the Data Flow task in the control flow. This will move you to the Data Flow tab, shown in Figure 3.
The Data Flow tab in SSIS Designer
Figure 3: The Data Flow tab in SSIS Designer

Configuring the Data Flow

You configure a Data Flow task by adding components to the Data Flow tab. SSIS supports three types of data flow components:
  • Sources: Where the data comes from
  • Transformations: How you can modify the data
  • Destinations: Where you want to put the data
A Data Flow task always starts with a source and usually, but not always, ends with a destination. You can also add as many transformations as necessary to prepare the data for the destination. For example, you can use the Derived Column transformation to add a computed column to the data flow, or you can use a Conditional Split transformation to split data into different destinations based on specified criteria. These and other components will be explained in future articles.
To add components to the Data Flow task, you need to open the Toolbox if it’s not already open. To do this, point to the View menu and then click ToolBox, as shown in Figure 4.
Opening the Toolbox to view the data flow components
Figure 4: Opening the Toolbox to view the data flow components
At the left side of the Data Flow tab, you should now find the Toolbox window, which lists the various components you can add to your data flow. The Toolbox organizes the components according to their function, as shown in Figure 5.
The component categories as they appear in the Toolbox
Figure 5: The component categories as they appear in the Toolbox
To view the actual components, you must expand the categories. For example, to view the source components, you must expand the Data Flow Sources category, as shown in Figure 6.
Viewing the data flow source components
Figure 6: Viewing the data flow source components

Adding an OLE DB Source

The first component we’re going to add to the data flow is a source. Because we’re going to be retrieving data from a SQL Server database, we’ll use an OLE DB source. To add the component, expand the Data Flow Sources category in the Toolbox, and then drag an OLE DB source to the Data Flow window. Your data flow should now look similar to Figure 7.
Adding an OLE DB source to your data flow
Figure 7: Adding an OLE DB source to your data flow
You will see that we have a new item named OLE DB Source. You can rename the component by right-clicking it and selecting Rename. For this example, I renamed it Employees.
There are several other features of the OLE DB source worth noting:
  • A database icon is associated with that source type. Other source types will show different icons.
  • A reversed red X appears to the right of the name. This indicates that the component has not yet been properly configured.
  • Two arrows extend below the component. These are called data paths. In this case, there is one green and one red. The green data path marks the flow of data that has no errors. The red data path redirects rows whose values are truncated or that generate an error. Together these data paths enable the developer to specifically control the flow of data, even if errors are present.
To configure the OLE DB source, right-click the component and then click Edit. The OLE DB Source Editor appears, as shown in Figure 8.
Configuring the OLEDB source
Figure 8: Configuring the OLEDB source
From the OLE DB connection manager drop-down list, select the OLE DB connection manager we set up in the last article, the one that connects to the AdventureWorks database.
Next, you must select one of the following four options from the Data access mode drop-down list:
  • Table or view
  • Table name or view name variable
  • SQL command
  • SQL command from variable
For this example, we’ll select the Table or View option because we’ll be retrieving our data through the uvw_GetEmployeePayRate view, which returns the latest employee pay raise and the amount of that raise. Listing 1 shows the Transact-SQL used to create the view in the AdventureWorks database.
CREATE VIEW uvw_GetEmployeePayRate
AS
    SELECT  H.EmployeeID ,
            RateChangeDate ,
            Rate
    FROM    HumanResources.EmployeePayHistory H
            JOIN ( SELECT   EmployeeID ,
                            MAX(RateChangeDate) AS [MaxDate]
                   FROM     HumanResources.EmployeePayHistory
                   GROUP BY EmployeeID
                 ) xx ON H.EmployeeID = xx.EmployeeID
                         AND H.RateChangeDate = xx.MaxDate
GO
Listing 1: The uvw_GetEmployeePayRate view definition
After you ensure that Table or view is selected in the Data access mode drop-down list, select the uvw_GetEmployeePayRate view from the Name of the table or the view drop-down list. Now go to the Columns page to select the columns that will be returned from the data source. By default, all columns are selected. Figure 9 shows the columns (EmployeeID, RateChangeDate, and Rate) that will be added to the data flow for our package, as they appear on the Columns page.
The Columns page of the OLE DB Source Editor
Figure 9: The Columns page of the OLE DB Source Editor
If there are columns you don’t wish to use, you can simply uncheck them in the Available External Columns box.
Now click on the Error Output page (shown in Figure 10) to view the actions that the SSIS package will take if it encounters errors.
The Error Output page of the OLE DB Source Editor
Figure 10: The Error Output page of the OLE DB Source Editor
By default, if there is an error or truncation, the component will fail. You can override the default behavior, but explaining how to do that is beyond the scope of this article. You’ll learn about error handling in future articles.
Now return to the Connection Manager page and click the Preview button to view a sample dataset in the Preview Query Results window, shown in Figure 11. Previewing the data ensures that what is being returned is what you are expecting.
Previewing a sample dataset
Figure 11: Previewing a sample dataset
After you’ve configured the OLE DB Source component, click OK.

Adding a Derived Column Transformation

The next step in configuring our data flow is to add a transformation component. In this case, we’ll add the Derived Column transformation to create a column that calculates the annual pay increase for each employee record we retrieve through the OLE DB source.
To add the component, expand the Data Flow Transformations category in the Toolbox window, and drag the Derived Column transformation (shown in Figure 12) to the Data Flow tab design surface.
The Derived Column transformation as it’s listed in the Toolbox
Figure 12: The Derived Column transformation as it’s listed in the Toolbox
Drag the green data path from the OLE DB source to the Derived Column transformation to associate the two components, as shown in Figure 13. (If you don’t connect the two components, they won’t be linked and, as a result, you won’t be able to edit the transformation.)
Using the data path to connect the two components
Figure 13: Using the data path to connect the two components
The next step is to configure the Derived Column component. Double-click the component to open the Derived Column Transformation Editor, as shown in Figure 14.
Configuring the Derived Column transformation
Figure 14: Configuring the Derived Column transformation
This editor is made up of three regions, which I’ve labeled 1, 2 and 3:
  1. Objects you can use as a starting point. For example you can either select columns from your data flow or select a variable. (We will be working with variables in a future article.)
  2. Functions and operators you can use in your derived column expression. For example, you can use a mathematical function to calculate data returned from a column or use a date/time function to extract the year from a selected date.
  3. Workspace where you build one or more derived columns. Each row in the grid contains the details necessary to define a derived column.
For this exercise, we’ll be creating a derived column that calculates a pay raise for employees. The first step is to select the existing column that will be the basis for our new column.
To select the column, expand the Columns node, and drag the Rate column to the Expression column of the first row in the derived columns grid, as shown in Figure 15.
Adding a column to the Expression column of the derived column grid
Figure 15: Adding a column to the Expression column of the derived column grid
When you add your column to the Expression column, SSIS prepopulates the other columns in that row of the grid, as shown in Figure 16.
Prepopulated values in derived column grid
Figure 16: Prepopulated values in derived column grid
As you can see, SSIS has assigned our derived column the name Derived Column 1 and set the Derived Column value to <add as new column>. In addition, our [Rate] field now appears in the Expression column, and the currency [DT_CY] value has been assigned to the Data Type column.
You can change the Derived Column Name value by simply typing a new name in the box. For this example, I’ve renamed the column NewPayRate.
For the Derived Column value, you can choose to add a new column to your data flow (which is the default value, <add as new column>) or to replace one of the existing columns in your data flow. In this instance, we’ll add a new column, but there may be times when overwriting a column is required.
The data type is assigned automatically by the system and can’t be changed at this stage.
Our next step is to refine our expression. Currently, because only the Rate column is included in the expression, the derived column will return the existing values in that column. However, we want to calculate a new pay rate. The first step, then, is to add an operator. To view the list of available operators, expand the list and scroll through them. Some of the operators are for string functions and some for math functions.
To increase the employee’s pay rate by 5%, we’ll use the following calculation:
[Rate] * 1.05
To do this in the Expression box, either type the multiplication operator (*), or drag it from the list of operators to our expression (just after the column name), and then type 1.05, as shown in Figure 17.
Defining an expression for our derived column
Figure 17: Defining an expression for our derived column
You will see that the Data Type has now changed to numeric [DT_NUMERIC].
Once you are happy with the expression, click on OK to complete the process. You will be returned to the Data Flow tab. From here, you can rename the Derived Column transformation to clearly show what it does. Again, there are two data paths to use to link to further transformations or to connect to destinations.
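If it helps to think of the data flow in set-based terms, the Employees source combined with the derived column is roughly equivalent to the following query. This is only a sketch for comparison; the SSIS pipeline actually performs the calculation row by row as the data flows through the transformation:

```sql
SELECT EmployeeID,
       RateChangeDate,
       Rate,
       Rate * 1.05 AS NewPayRate  -- the derived column
FROM uvw_GetEmployeePayRate;
```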

Adding an Excel Destination

Now we need to add a destination to our data flow to enable us to export our results into an Excel spreadsheet.
To add the destination, expand the Data Flow Destinations category in the Toolbox, and drag the Excel destination to the SSIS Designer workspace, as shown in Figure 18.
Adding an Excel destination to your data flow
Figure 18: Adding an Excel destination to your data flow
Now connect the green data path from the Derived Column transformation to the Excel destination to associate the two components, as shown in Figure 19.
Connecting the data path from the transformation to the destination
Figure 19: Connecting the data path from the transformation to the destination
As you can see, even though we have connected the PayRate transformation to the Excel destination, we still have the reversed red X showing us that there is a connection issue. This is because we have not yet selected the connection manager or linked the data flow columns to those in the Excel destination.
Next, right-click the Excel destination, and then click Edit. This launches the Excel Destination Editor dialog box, shown in Figure 20. On the Connection Manager page, click the New button next to the OLE DB connection manager drop-down list. Under Excel file path, click the Browse button, select the file you created in the previous article, and click OK. Then, from the Name of the Excel sheet drop-down list, select the appropriate sheet from the file.
Configuring the Excel destination component
Figure 20: Configuring the Excel destination component
At the bottom of the Connection Manager page, you’ll notice a message indicating that we haven’t mapped the source columns to the destination columns. To do this, go to the Mappings page (shown in Figure 21) and ensure that the columns in the data flow (the input columns) map correctly to the columns in the destination Excel file. The package will make a best guess based on field names; however, for this example, I purposefully named the columns in the Excel spreadsheet differently from those in the source database so they wouldn’t be matched automatically.
The Mappings page of the Excel Destination Editor
Figure 21: The Mappings page of the Excel Destination Editor
To match the remaining columns, click the column name in the Input Column grid at the bottom of the page, and select the correct column. As you select the column, the list will be reduced so that only those columns not linked are available. At the same time, the source and destination columns in the top diagram will be connected by arrows, as shown in Figure 22.
Mapping the columns between the data flow and the destination
Figure 22: Mapping the columns between the data flow and the destination
Once you’ve properly mapped the columns, click OK. The Data Flow tab should now look similar to the screenshot in Figure 23.
The configured data flow in your SSIS package
Figure 23: The configured data flow in your SSIS package

Running an SSIS Package in BIDS

Now all we need to do is execute the package and see if it works. To do this, click the Execute button. It’s the green arrow on the toolbar, as shown in Figure 24.
Clicking the Execute button to run your SSIS package
Figure 24: Clicking the Execute button to run your SSIS package
As the package progresses through the data flow components, each one will change color. The component will turn yellow while it is running, then turn green or red on completion. If it turns green, it has run successfully, and if it turns red, it has failed. Note, however, that if a component runs too quickly, you won’t see it turn yellow. Instead, it will go straight from white to green or red.
The Data Flow tab also shows the number of rows processed at each step along the way, displayed next to the data path. For our example package, 290 rows were processed between the Employees source and the PayRate transformation, and 290 rows between the transformation and the Excel destination. Figure 25 shows the data flow after the three components ran successfully, with the number of processed rows displayed.
The data flow after it has completed running
Figure 25: The data flow after it has completed running
You can also find details about the package’s execution on the Progress tab (shown in Figure 26). The tab displays each step of the execution process. If there is an error, a red exclamation mark is displayed next to the step’s description. If there is a warning, a yellow exclamation mark is displayed. We will go into resolving errors and how to find them in a future article.
The Progress tab in SSIS Designer
Figure 26: The Progress tab in SSIS Designer
Now all that’s needed is to check the Excel file to ensure that the data was properly added. You should expect to see results similar to those in Figure 27.
Reviewing the Excel file after package execution
Figure 27: Reviewing the Excel file after package execution

Summary

In this article of the "SSIS Basics" series, I’ve shown you how to add the data flow to your SSIS package in order to retrieve data from a SQL Server database and load it into an Excel file. I’ve also shown you how to add a derived column that calculates the data to be inserted into the file. In addition, I’ve demonstrated how to run the package.
In future articles, I plan to show you how to deploy the package so you can run it as part of a scheduled job or call it in other ways. In addition, I’ll explain how to use variables in your package and pass them between tasks. I also aim to cover more control flow tasks and data flow components, including those that address conditional flow logic and for-each looping logic. There is much, much more that can be done using SSIS, and I hope over the course of this series to cover as much of it as possible.