Sometimes, XML seems a bewildering convention that offers solutions to problems that the average database user doesn't have. To make things worse, XML experts generally lack the wherewithal to provide simple answers to even the simplest questions. Rob Sheldon, in contrast, can answer even those questions we somehow feel silly asking in public, and think twice about doing so.
The Questions
- Why store XML data in a SQL Server database? Why not just store the XML as documents in a file system?
- What makes an XML document different from a regular text file?.
- What is an XML instance in SQL Server, as opposed to an XML document?.
- When storing XML documents in a SQL Server database, should you always use the xml data type?
- What does it mean for an xml column to be typed?
- How do I convert an XML value to a string value in SQL Server?
- Is it worth indexing an xml column in a SQL Server database?
- What’s this I hear about selective XML indexes in SQL Server?
- How do I retrieve an individual value from a specific element within an XML instance?
- Why can't I just explore XML data using SELECT statements?
- How do I update an individual value for a specific element in an XML instance?
- Is it possible to retrieve relational data in a SQL Server database as XML?
Why store XML data in a SQL Server database? Why not just store the XML as documents in a file system?
In some cases, the file system might be fine, if you don’t plan to associate the files with any data in your SQL Server database and you don’t want to take advantage of any of the XML-specific features that SQL Server provides, such as being able to associate relational data with data in an XML instance. For example, suppose you have a database that stores product information. For each product, you maintain an XML document that provides details about that product. By storing the document within the database, you can create queries that retrieve both relational data and specific information within the XML instance related to that data.
In SQL Server, you usually store XML data in a column configured with the xml data type. The data type supports several methods that let you query and modify individual elements, attributes, and their values directly within the XML instance, rather than having to work with that instance as a whole. In addition, the xml data type ensures that each XML instance is, at the very least, well formed according to ISO standards. And if you associate a schema collection with your xmlcolumn, you can further validate the structure and its data.
SQL Server also provides several options for indexing your xml column, with expanded features starting in SQL Server 2012. In addition, storing your XML in a SQL Server database lets you take advantage of SQL Server’s administrative capabilities and data access mechanisms, such as OLE DB and ADO.NET. Even if you don’t store data in an xmlcolumn, you can still leverage the other SQL Server storage and access capabilities. That said, if you don’t need any of these features, you might not need SQL Server. Take a look at how you plan to access and modify your XML files, and then decide.
What makes an XML document different from a regular text file?
An XML document is one that conforms to the ISO formatting rules governing the eXtensible Markup Language (XML). As the name suggests, XML is a text-based language similar to HTML. It uses tags to describe the nature of the data and how the data should be formatted. Unlike HTML, however, XML lets you define your own tags, thus its extensible nature, which is why XML is often referred to as a self-describing language. Because it has been so universally adopted, you can use it to share a wide range of data across heterogeneous systems, as long as the XML conforms to ISO standards.
An XML document includes two basic types of information: the data itself and the tags that organize and describe the data. The following XML demonstrates the basic components of an XML document:
<Cars>
<Car id="1234">
<Make>Volkswagen</Make>
<Model>Eurovan</Model>
<Year>2003</Year>
<Color>White</Color>
</Car>
<Car id="5678">
<Make>Honda</Make>
<Model>CRV</Model>
<Year>2009</Year>
<Color>Black</Color>
<Mileage>35,600</Mileage>
</Car>
</Cars>
Elements represent the building blocks of in an XML instance. They include the tags enclosed in angle brackets and the data between those tags. All elements contain an opening tag and an end tag, such as <Cars> and </Cars>, respectively. Between the opening and end tags, you’ll find child elements, data, or both. For example, the <Car>elements are embedded in the <Cars> element, with child elements such as <Make> and <Model> containing the actual data.
In addition, an element’s opening tag can include attributes, such as id="1234", which is a combination of an attribute name (id) and it’s value (1234). You’ll also notice that the XML begins with a special type of tag called a declaration. The declaration specifies which XML version is being used and can optionally include encoding information. In this case, the declaration indicates that the data is stored as a sequence of 8-bit Unicode characters.
This basic example also points out the hierarchical nature of XML and its self-describing nature. We could have just as easily created an XML structure that describes people, places, or other things. Although XML is nothing but simple text, its self-describing format lets it pass data to a variety of systems. As long as applications such as SQL Server have the capacity to parse the XML and that XML conforms to ISO standards, there’s no limit to the type of data you can pass between heterogeneous systems. Its universal acceptance and standardized methodology make it a simple yet effective way to move data between applications that serve different functions and are based on different technologies.
What is an XML instance in SQL Server, as opposed to an XML document?
You’ll often find that SQL Server documentation refers to the XML stored in an xml object (column, variable, or parameter) as an XML instance. This is because SQL Server supports XML documents and XML fragments, both of which are considered XML instances. Although you’ll often see the terms XML instance and XML document used interchangeable, strictly speaking, they are different, with a document being a type of instance, as is a fragment.
You can store either an XML document or fragment in an xml object. However, if you associate that object with a schema collection, you can specify that only documents be permitted. To be considered a document, an XML instance must have only one root element, with no text specified at the top level. Consequently, the following XML instance is considered a document:
<Cars>
<Car id="1234">
<Make>Volkswagen</Make>
<Model>Eurovan</Model>
<Year>2003</Year>
<Color>White</Color>
</Car>
<Car id="5678">
<Make>Honda</Make>
<Model>CRV</Model>
<Year>2009</Year>
<Color>Black</Color>
<Mileage>35,600</Mileage>
</Car>
</Cars>
However, if we were to strip out the <Cars> element, we would have only two <Car> elements with no root, which would leave us with an XML fragment, as shown in the following example:
<Car id="1234">
<Make>Volkswagen</Make>
<Model>Eurovan</Model>
<Year>2003</Year>
<Color>White</Color>
</Car>
<Car id="5678">
<Make>Honda</Make>
<Model>CRV</Model>
<Year>2009</Year>
<Color>Black</Color>
<Mileage>35,600</Mileage>
</Car>
Both examples are considered XML instances, but only the first is a document.
When storing XML documents in a SQL Server database, should you always use the xml data type?
No, you should not. In some cases, you should use large object storage such as varchar(max) or nvarchar(max). It all depends on how you’ll be accessing and updating the XML instances and whether you want to validate the XML structure.
The xml data type requires that all XML be well formed according to ISO standards, which specify how the XML elements must be formatted. You can even take this a step further by associating a schema collection with your xml columns. A schema collection contains one or more XML schemas that define the types of elements and attributes an XML instance can contain. Schemas also more tightly define the type of values that can be associated with an element or attribute.
In addition, the xml data type supports a set of methods that let you query specific elements and attributes within an XML instance. For example, suppose your xml column stores all the contact information associated with each customer. The data might include email addresses, phone numbers, social networking handles, and physical addresses. From this column, you can retrieve an email address or add a cell phone number or change a home address. Plus, you can define an XML index on an xml column if your query workload commonly includes that column.
In some cases, however, you don’t need any of the features that the xml data type provides. For example, you might be storing product manuals that are accessed only in their entirety. If the manual is modified, the xml value itself is replaced with the new document. You never need to query or modify individual components, and you don’t care about the XML being validated. In a situation such as this, you should use a large object data type to avoid the extra validating processes that the database engine performs for an xml column. In general, whenever your application is simply storing and retrieving the entire XML instance or you want to preserve an XML instance in its original form, as might be the case with legal documents, you should stick with large object storage.
What does it mean for an xml column to be typed?
A typed xml column (or variable or parameter) refers to one that is bound to an XML schema collection within SQL Server. A schema collection is a set of one or more XML schemas that determine the permitted structure of the XML instances stored within that column.
The schema specifies the types of elements and attributes permitted in the XML and how those components should be ordered and hierarchically preserved. A schema also determines the type of values permitted in an element or attribute (such as string or integer) and the number to times an element can be included. For example, the schema might specify that there must be exactly one <FirstName> child element within each <Customer> element and the <FirstName>value must be a string.
An xml column not associated with a schema collection is considered untyped. However, the XML instance must still be well formed accorded to ISO standards. For example, each element must be defined by opening and end tags, as in the following example:
<FirstName>Elizabeth</FirstName>
Notice that the two tags are identical, except that the end tag also includes a slash to distinguish it from the opening tag. In addition, the opening and end tag names must have identical capitalization.
When you insert an XML instance into an untyped xml column, the database engine validates the XML to ensure that it is well formed. When you insert an XML instance into a typed column, the engine ensures that the XML is well formed andthat it adheres to the more rigid structure defined in the associated schema.
You should use typed xml columns whenever you have the necessary schema collections and you want SQL Server to validate the XML against that schema. If you don’t have a schema collection or don’t want to associate an xml column with the collection you do have, then use an untyped column.
How do I convert an XML value to a string value in SQL Server?
Unlike many types of conversions in SQL Server, you cannot implicitly convert an xml value to a string value. Although SQL Server supports many types of implicit data type conversions, including string data to XML, it does not support implicit XML-to-string conversions. For that, you must use either the CAST or CONVERT function. (If you do try to do an implicit conversion, you’ll receive an error.)
Converting XML to a string is a fairly straightforward process. The following example shows you how to use the CASTfunction to convert an xml variable varchar(max):
DECLARE @xmlCar xml;
DECLARE @strCar varchar(max);
SET @xmlCar =
'<Car id="1234">
<Make>Volkswagen</Make>
<Model>Eurovan</Model>
<Year>2003</Year>
<Color>White</Color>
</Car>'
SET @strCar = CAST(@xmlCar AS varchar(max));
SELECT @strCar;
In the example, I’ve simply assigned the XML to the @xmlCar variable and used the CAST function to convert it to the string data type. If you were to run the SELECT statement, you would receive a single line of text that contains your XML instance:
<Car id="1234"><Make>Volkswagen</Make><Model>Eurovan</Model><Year>2003</Year><Color>White</Color></Car>
If you want to instead use the CONVERT function, the final SET statement would look as follows:
SET @strCar = CONVERT(varchar(max), @xmlCar);
The function would again convert the XML instance to the varchar(max) data type, and the SELECT statement would return the same single line of text.
When using the CONVERT function, be aware that it’s specific to SQL Server and does not port to other systems. However, unlike CAST, the CONVERT function supports additional options for converting data. For example, if your XML instance has been saved with the white space and line breaks preserved, you can add a third argument to the function to preserve them when converting the instance to a string. Even so, if there’s any possibility that you’ll be porting your database to another system, you should probably stick with CAST
Is it worth indexing an xml column in a SQL Server database?
In some cases, yes. That’s why XML indexes are available. Keep in mind that SQL Server stores XML data as large binary objects (BLOBS). When there is no index, the database engine shreds the XML at runtime when evaluating the query. In some cases, this can represent a fairly time-consuming process. If you plan to query the xml column routinely, then you should consider an XML index.
Also, you should consider an XML index if your XML instances are relatively large compared to the parts you’re retrieving. With an index, the engine doesn’t have to parse the entire instance at runtime, and queries can benefit from how the tags, values, and paths have been indexed. But be sure to take into account your data manipulation operations. XML indexes receiving heavy data modifications can severely affect query performance.
SQL Server supports two types of XML indexes: primary and secondary. The first index on an XML column must be a primary index, which stores a shredded representation of the XML instance. The index includes all tags, paths, and values, as well as other details about the instance, in order to support queries that target specific components. Queries that retrieve the entire instance do not use the index.
Secondary indexes are specific to the paths, values, or properties within the XML instance and support queries that target those specific elements. You can define any combination of one or more secondary index on you xml column, as long as you first define your primary index. For example, you might define a secondary index on the XML paths and another on the XML values because your queries frequently target both types of components within your XML instances.
What’s this I hear about selective XML indexes in SQL Server?
The selective index is new to SQL Server 2012 SP1, although it’s received relatively little press. The new feature addresses one of the main limitations of the original XML index, that is, it includes the entire XML instance, which can lead to unnecessary hits on performance and index maintenance.
The selective index feature lets you create primary and secondary XML indexes that contain only specific paths within your XML instances. In other words, you can create XML indexes based on a subset of nodes, ideally those you most often query. Selective indexes can improve query performance as well as reduce storage and maintenance costs. (A secondary selective index is based on a single path that is already included in an existing primary selective XML index.)
When you create a selective XML index, you specify the paths that should be included. However, you cannot index nodes with complex or binary XS types. In addition, you can index only child, attribute, or descendent axes. In general, you should avoid using selective XML indexes if you must index a large number of nodes.
In order to create a selective XML index, you must first enable the feature on your database. To do so, use thesp_db_selective_xml_index system stored procedure.
How do I retrieve an individual value from a specific element within an XML instance?
The simplest way to retrieve an individual value from an XML instance is to use the value() method, one of the five methods available to the xml data type. When using the value() method, you call it by first specifying the name of thexml object that contains the target data and then providing two arguments, an XQuery expression and the data type of the returned value, as shown in the following syntax:
xml_object.value('xquery_exp', 'data_type')
XQuery is a scripting language used to access XML data. SQL Server supports a subset of the language that you can use in the expression you pass into the value() method and other xml methods. Although a full discussion of XQuery, as it is implemented in SQL Server, is way beyond what we can cover here, you can find more information in the MSDN XQuery Language Reference.
As you can see in the syntax, you must include the XQuery expression as the first argument to the value() method. In most cases, that expression will be a fairly straightforward path that points to the XML value you want to retrieve. If you're querying a typed column, that expression will also include a reference to the schema.
Let’s look at an example to get a sense of how the value() method works against an untyped variable. The following T-SQL declares the @cars variable with the xml data type and assigns an XML document to the variable:
DECLARE @cars xml;
SET @cars =
'<?xml version="1.0" encoding="UTF-8"?>
<Cars>
<Car id="1234">
<Make>Volkswagen</Make>
<Model>Eurovan</Model>
<Year>2003</Year>
<Color>White</Color>
</Car>
<Car id="5678">
<Make>Honda</Make>
<Model>CRV</Model>
<Year>2009</Year>
<Color>Black</Color>
<Mileage>35,600</Mileage>
</Car>
</Cars>';
SELECT
@cars.value('(/Cars/Car[@id=5678]/Make)[1]', 'varchar(15)')
AS Make,
@cars.value('(/Cars/Car[@id=5678]/Model)[1]', 'varchar(15)')
AS Model;
After we declare and set the @cars variable, we can use it to retrieve the specific value we want. (It works the same way for an xml column.) When we call the variable, we add the value() method and pass in the two arguments. The first argument is the XQuery expression that points to the path where our data is located. The second argument specifies that the value be returned as a varchar(15) string.
In this case, we’re specifying that we retrieve the <Make> and <Model> values for the <Car> element whose idattribute is 5678. Notice that, although we’re referencing a specific <Car> element, we must still include the [1] after the XQuery expression. The value() method requires a scalar value so you must specify the [1] to ensure that a single value is being returned, whether or not it’s needed.
As to be expected, the SELECT statement returns the value Honda from the <Make> element and returns CRV from the<Model> element. However, by modifying the XQuery, we can retrieve whatever individual values we need.
Why can't I just explore XML data using SELECT statements?
You can. There is an excellent function published by Jacob Sebastian called XMLTable to get you started. If you don't feel like writing your own queries to explore XML documents, you can use this to see what data is in a document, and the path to specific values within the document. If, on the other hand, you wish to know more, it is well worth investigating the source to see how it is done.
If, for example, you want to investigate the XPaths and values in an XML fragment, you can do this
DECLARE @cars xml;
SET @cars =
'<?xml version="1.0" encoding="UTF-8"?>
<Cars>
<Car id="1234">
<Make>Volkswagen</Make>
<Model>Eurovan</Model>
<Year>2003</Year>
<Color>White</Color>
</Car>
<Car id="5678">
<Make>Honda</Make>
<Model>CRV</Model>
<Year>2009</Year>
<Color>Black</Color>
<Mileage>35,600</Mileage>
</Car>
</Cars>';
SELECT xpath, value FROM dbo.xmlTable(@cars)
WHERE value is not null ORDER BY xpath
This would give the result of …
xpath Value
-------------------------------------
Cars[1]/Car[1]/@id 1234
Cars[1]/Car[1]/Color[1] White
Cars[1]/Car[1]/Make[1] Volkswagen
Cars[1]/Car[1]/Model[1] Eurovan
Cars[1]/Car[1]/Year[1] 2003
Cars[1]/Car[2]/@id 5678
Cars[1]/Car[2]/Color[1] Black
Cars[1]/Car[2]/Make[1] Honda
Cars[1]/Car[2]/Mileage[1] 35,600
Cars[1]/Car[2]/Model[1] CRV
Cars[1]/Car[2]/Year[1] 2009
You could then use the XPaths supplied in order to select individual values from the XML document. For example:
SELECT @cars.value('Cars[1]/Car[2]/Make[1]','VARCHAR(20)')+' '+@cars.value('Cars[1]/Car[2]/Model[1]','VARCHAR(20)') AS CarType
... giving ...
CarType
-------------------
Honda CRV
Or then, to select all the cars in the XML file you can use the XPath in the .nodes method.
SELECT t.value('Make[1]','NVARCHAR(MAX)')+' '+ t.value('Model[1]','NVARCHAR(MAX)') AS CarType
from @cars.nodes('/Cars/Car') x(t)
giving …
CarType
------------------
Volkswagen Eurovan
Honda CRV
How do I update an individual value for a specific element in an XML instance?
Another method that the xml data type supports is modify(), which uses the XML Data Modification Language (XML DML) to update, insert, and delete individual components within an XML instance. The XML DML language is a SQL Server extension to the XQuery language. Like XQuery, XML DML supports a number of different elements. For example, the second SET statement in the following T-SQL uses XML DML to update the Eurovan’s <Year> value to 2001:
DECLARE @cars xml;
SET @cars =
'<?xml version="1.0" encoding="UTF-8"?>
<Cars>
<Car id="1234">
<Make>Volkswagen</Make>
<Model>Eurovan</Model>
<Year>2003</Year>
<Color>White</Color>
</Car>
<Car id="5678">
<Make>Honda</Make>
<Model>CRV</Model>
<Year>2009</Year>
<Color>Black</Color>
<Mileage>35,600</Mileage>
</Car>
</Cars>';
SET @cars.modify(
'replace value of
(/Cars/Car[@id=1234]/Year/text())[1]
with "2001" ');
SELECT @cars;
As in the previous example, we declare the @cars variable and assign an XML value. Then we issue a second SETstatement that again specifies the variable, but this time calls the modify() method. In this case, the method takes only one argument enclosed in singe quotes, but that argument includes several components.
The argument starts with the replace value of keywords, which indicates that we’re updating data. This is followed by an XQuery path expression, in parentheses and followed by [1], which specifies what value we want to update. In this case, the path points to the <Year> node for car 1234. Notice, however, that the XQuery path expression also includes the text() function, which specifically targets the path’s value. We then follow with the with keyword and then the new value, 2001, enclosed in double quotes. When we run the SELECT statement, we’ll find that the <Year> value has been updated, but nothing else has been touched.
This, of course, is only one simple example of the modify() method. It can do a lot more than just update a single value. Be sure to refer to SQL Server Books Online more details about this and all the methods supported by the xml data type.
Is it possible to retrieve relational data in a SQL Server database as XML?
Yes. As a matter of fact, it’s fairly easy. Just add the FOR XML clause to your SELECT statement and specify a mode for how the data should be returned. Rather than returning a typical rowset, the SELECT statement will return the data as XML. You can add the FOR XML clause to your top-level SELECT statements as well as to subqueries embedded inSELECT, INSERT, UPDATE and DELETE statements.
When you include the FOR XML clause in your SELECT statement, you must also specify one of four nodes: RAW, AUTO,EXPLICIT, or PATH. The RAW mode returns a single element for each row in the result set. The element uses the<row> tag and treats each non-null column as an attribute. For example, the following SELECT statement joins several tables and returns a column from each:
USE AdventureWorks2012;
GO
SELECT TOP 2
Sales.SalesOrderID AS SalesID,
Territory.name AS TerritoryName,
Country.name AS CountryName
FROM Sales.SalesOrderHeader Sales
INNER JOIN Sales.SalesTerritory Territory
ON Sales.TerritoryID = Territory.TerritoryID
INNER JOIN Person.CountryRegion Country
ON Territory.CountryRegionCode = Country.CountryRegionCode
FOR XML RAW;
Because the FOR XML clause is included, the data will be returned as XML. You can see this in the following results, in which each row returned is its own <row> element with the columns treated as attributes:
<row SalesID="43664" TerritoryName="Northwest" CountryName="United States" />
<row SalesID="43665" TerritoryName="Northwest" CountryName="United States" />
Unlike the RAW mode, the AUTO mode returns the results as nested XML elements, forming a hierarchy based on the order of the columns in the SELECT list. If we were to specify the AUTO keyword instead of RAW in our example SELECTstatement, the results would look quite different:
<Sales SalesID="43659">
<Territory TerritoryName="Southeast">
<Country CountryName="United States" />
</Territory>
</Sales>
<Sales SalesID="43660">
<Territory TerritoryName="Southeast">
<Country CountryName="United States" />
</Territory>
</Sales>
As you can see, the XML has fewer attributes and more elements than the preceding results. However, the form they take depends on how you construct your SELECT statement. It’s also worth noting that both modes support additional options, although neither one provide much control over how you form your XML. For that, we need to turn to the EXPLICIT andPATH modes.
The EXPLICIT mode provides much more control over the XML, but it can be more difficult to use because it requires some cumbersome workarounds to get your query to return the results in the proper form. The PATH mode provides much of the same control as the EXPLICIT mode, but it is far simpler to use, so let’s look at how that one works:
SELECT TOP 2
Sales.SalesOrderID AS "@SalesID",
Territory.name AS TerritoryName,
Country.name AS CountryName
FROM Sales.SalesOrderHeader Sales
INNER JOIN Sales.SalesTerritory Territory
ON Sales.TerritoryID = Territory.TerritoryID
INNER JOIN Person.CountryRegion Country
ON Territory.CountryRegionCode = Country.CountryRegionCode
WHERE SalesOrderID < 43662
FOR XML PATH('Sale'), ROOT('Sales');
Again, we’re running the same SELECT statement as the previous examples, but this time around, our FOR XML clause specifies the PATH mode and a name for each row element, in this case, <Sale>. The mode name is then followed by the ROOT directive, which specifies that our root node should be <Sales>. The statement now returns the following results:
<Sales>
<Sale SalesID="43659">
<TerritoryName>Southeast</TerritoryName>
<CountryName>United States</CountryName>
</Sale>
<Sale SalesID="43660">
<TerritoryName>Southeast</TerritoryName>
<CountryName>United States</CountryName>
</Sale>
</Sales>
As you can see, the PATH mode gives us a great deal of control over how we format the XML output. Of course, these examples are just a simple overview of how the FOR XML clause and its modes work, but they give you the idea of the many options available to you for outputting your queries to XML.