12/20/2011

Controlling Growth of the msdb Database

I recently encountered a situation where the drive hosting SharePoint databases in a staging environment ran out of space. I logged onto the server and found that the msdb database had itself occupied 38 GB of the total disk space. The msdb database generally contains maintenance information for the instance, such as backup history, log shipping data, and so on. My first step was to examine all the tables, and I noted that there was not an abnormally large number of records.
I then decided to verify, using the T-SQL script shown below, exactly what the culprit was, and the results were rather strange.
SELECT object_name(i.object_id) as objectName,
i.[name] as indexName,
sum(a.total_pages) as totalPages,
sum(a.used_pages) as usedPages,
sum(a.data_pages) as dataPages,
(sum(a.total_pages) * 8 ) / 1024 as totalSpaceMB,
(sum(a.used_pages) * 8 ) / 1024 as usedSpaceMB,
(sum(a.data_pages) * 8 ) / 1024 as dataSpaceMB
FROM sys.indexes i
INNER JOIN sys.partitions p
ON i.object_id = p.object_id
AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a
ON p.partition_id = a.container_id
GROUP BY i.object_id, i.index_id, i.[name]
ORDER BY sum(a.total_pages) DESC, object_name(i.object_id)
GO
The output of the above T-SQL query is as shown in the screen capture below:

As you can see in the first highlighted section, the table log_shipping_monitor_history_detail contains two indexes, c1lsmonitor_history_detail and nc2lsmonitor_history_detail, which together occupied 25698 + 9899 = 35597 MB (about 34.76 GB). These were the primary cause of the large database size.
I then decided to rebuild the two indexes noted above, which I did by simply right-clicking each index name and selecting Rebuild. I also updated the statistics of the corresponding indexes. After completing the reindexing and statistics update, I shrank the msdb database, which reduced it from 35 GB to a mere 700 MB. Please refer to the screen capture below:

I then decided to include the msdb database in the reindexing and update-statistics job that runs daily.
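For reference, the targeted fix can also be scripted. Below is a minimal T-SQL sketch, assuming the index and table names reported in the query output above; DBCC SHRINKDATABASE is not something to run routinely, but it is reasonable after a one-off cleanup like this:
USE msdb
GO
-- Rebuild the two bloated log shipping indexes identified earlier
ALTER INDEX c1lsmonitor_history_detail ON dbo.log_shipping_monitor_history_detail REBUILD
ALTER INDEX nc2lsmonitor_history_detail ON dbo.log_shipping_monitor_history_detail REBUILD
-- Refresh the statistics on the same table
UPDATE STATISTICS dbo.log_shipping_monitor_history_detail WITH FULLSCAN
GO
-- Reclaim the freed space, leaving 10 percent free space in the files
DBCC SHRINKDATABASE (msdb, 10)
GO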
Reindexing and updating statistics across all databases (excluding master, model, and tempdb) can alternatively be accomplished using the T-SQL below.
T-SQL for Re-indexing:
DECLARE
@dbname varchar(1000),
@command varchar(1000)
-- Cursor over every database except master, model, and tempdb
-- (msdb is intentionally included)
DECLARE DBCUR CURSOR FOR
select [name]
from sys.databases where name not in
(
'master',
'model',
'tempdb'
)
order by 1
OPEN DBCUR
FETCH NEXT FROM DBCUR INTO @dbname
WHILE @@FETCH_STATUS = 0
BEGIN
-- Rebuild every index in every table of the current database
select @command =
'
use ['+@dbname+']
Exec sp_MSForEachtable ''DBCC DBREINDEX ("?")''
'
exec (@command)
FETCH NEXT FROM DBCUR INTO @dbname
END
CLOSE DBCUR
DEALLOCATE DBCUR
GO
T-SQL for Update Statistics:
DECLARE
@dbname varchar(1000),
@command varchar(1000)
-- Cursor over every database except master, model, and tempdb
-- (msdb is intentionally included)
DECLARE DBCUR CURSOR FOR
select [name]
from sys.databases where name not in
(
'master',
'model',
'tempdb'
)
order by 1
OPEN DBCUR
FETCH NEXT FROM DBCUR INTO @dbname
WHILE @@FETCH_STATUS = 0
BEGIN
-- Update statistics on every table of the current database
select @command =
'
use ['+@dbname+']
Exec sp_MSForEachtable ''update statistics ? with fullscan''
'
exec (@command)
FETCH NEXT FROM DBCUR INTO @dbname
END
CLOSE DBCUR
DEALLOCATE DBCUR
GO
Thus we have successfully controlled the growth of the msdb database. Please let us know if you have any suggestions or comments on this approach.

Drop a Database by Closing Existing Connections using SSMS or T-SQL

To drop a SQL Server database, you need exclusive access to it: ensure there are no other current users of the database, or you will encounter the error:
Drop failed for Database ‘dbName’ …. Cannot drop database because it is currently in use
Ensuring there are no other current users can be very difficult: there may be jobs running that use the database, or there could be idle users who have opened connections to the database, and so on.
Therefore, you need to find all the SPIDs and kill them. In SSMS, when using the UI to drop the database, there is an option to Close existing connections:

Alternatively, this can be done using the T-SQL script below.
USE master
GO
-- Kick out all other users, rolling back their open transactions immediately
ALTER DATABASE [ClusterKey] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DROP DATABASE ClusterKey
As you can see, the database is first set to single-user mode; the point to remember is that all transactions on existing connections will be rolled back.
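If you first want to see exactly which sessions would be affected, a rough sketch like the following (again using the hypothetical ClusterKey database name) generates a KILL statement for each session in the database so you can inspect the list before executing it:
-- Build KILL statements for every session using the target database
DECLARE @kill varchar(8000)
SET @kill = ''
SELECT @kill = @kill + 'KILL ' + CONVERT(varchar(5), spid) + '; '
FROM master.dbo.sysprocesses
WHERE dbid = DB_ID('ClusterKey')
AND spid <> @@SPID -- never kill your own session
PRINT @kill -- review the generated commands first
-- EXEC (@kill) -- uncomment to actually kill the sessions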

12/19/2011

SQL Server Data Warehouse Cribsheet

by Robert Sheldon


For things you need to know rather than the things you want to know
It is time to shed light on Data Warehousing and to explain how SSAS, SSRS and Business Intelligence fit into the puzzle. Who better to explain it all than Robert Sheldon?


Introduction

One of the primary components in a SQL Server business intelligence (BI) solution is the data warehouse. Indeed, the data warehouse is, in a sense, the glue that holds the system together. The warehouse acts as a central repository for heterogeneous data that is to be used for purposes of analysis and reporting.
Because of the essential role that the data warehouse plays in a BI solution, it’s important to understand the fundamental concepts related to data warehousing if you’re working with such a solution, even if you’re not directly responsible for the data warehouse itself. To this end, the article provides a basic overview of what a data warehouse is and how it fits into a relational database management system (RDBMS) such as SQL Server. The article then describes database modelling concepts and the components that make up the model, and concludes with an overview of how the warehouse is integrated with other components in the SQL Server suite of BI tools.
Note:
The purpose of this article is to provide an overview of data warehouse concepts. It is not meant as a recommendation for any specific design. The article assumes that you have a basic understanding of relational database concepts such as normalization and referential integrity. In addition, the examples used here tend to be specific to SQL Server 2005 and 2008, although the underlying principles can apply to any RDBMS.

The Data Warehouse

A data warehouse consolidates, standardizes, and organizes data in order to support business decisions that are made through analysis and reporting. The data might originate in RDBMSs such as SQL Server or Oracle, Excel spreadsheets, CSV files, directory services stores such as Active Directory, or other types of data stores, as is often the case in large enterprise networks. Figure 1 illustrates how heterogeneous data is consolidated into a data warehouse.
Figure 1: Using a Data Warehouse to Consolidate Heterogeneous Data
The data warehouse must be able to store data from a variety of data sources in a way that lets tools such as SQL Server Analysis Services (SSAS) and SQL Server Reporting Services (SSRS) efficiently access the data. These tools are, in effect, indifferent to the original data sources and are concerned only with the reliability and viability of the data in the warehouse.
A data warehouse is sometimes considered to be a place for archiving data; however, that is not its purpose. Although historical data is stored in a data warehouse, only the historical range necessary to support analysis and reporting is retained there. For example, if a business rule specifies that the warehouse must maintain two years' worth of historical data, older data is offloaded to another system for archiving or is deleted, depending on the specified business requirements.

Data Warehouse vs. Data Mart

A data warehouse is different from a data mart, although the terms are sometimes used interchangeably. In addition, there is some debate as to what exactly constitutes a data mart as compared to a data warehouse. However, it is generally accepted that a data warehouse is associated with enterprise-wide business processes and decisions (and consequently is usually a repository for enterprise-wide data), whereas the data mart tends to focus on a specific business segment of that enterprise. In some cases, a data mart might be considered a subset of the data warehouse, although this is by no means a universal interpretation or practice. For the purposes of this article, we're concerned only with the enterprise-wide repository known as a data warehouse.

Relational Database vs. Dimensional Database

Because SQL Server, like Oracle and MySQL, is an RDBMS, any database stored within that system can be considered, by extension, a relational database. And that's where things can get confusing.
The typical relational database supports online transaction processing (OLTP). For example, an OLTP database might support bank transactions or store sales. The transactions are immediate and the data is current, with regard to the most recent transaction. The database conforms to a relational model for efficient transaction processing and data integrity. The database design should, in theory, adhere to the strict rules of normalization which aim, among other things, to ensure that the data is treated as atomic units and that there is a minimal amount of redundant data.
A data warehouse, on the other hand, generally conforms to a dimensional model, which is more concerned with query efficiency than issues of normalization. Even though a data warehouse is, strictly speaking, a relational database (because it’s stored in a RDBMS), the tables and relationships between those tables are modelled very differently from the tables and relationships defined in the relational database. (The specifics of data warehouse modelling are discussed below.)
Note:
Because of the reasons described above, you might come across documentation that refers to a data warehouse as a relational database. However, for the purposes of this article, I refer to an OLTP database as a relational database and a data warehouse as a dimensional database.

Dimensional Database vs. Multidimensional Database

Another source of confusion at times is the distinction between a data warehouse and an SSAS database. The confusion results because some documentation refers to an SSAS database as a dimensional database. However, unlike the SQL Server database engine, which supports OLTP as well as data warehousing, Analysis Services supports online analytical processing (OLAP), which is designed to quickly process analytical queries. Data in an OLAP database is stored in multidimensional cubes of aggregated data, unlike the typical table/column model found in relational and dimensional databases.
Note:
I’ve also seen a data warehouse referred to as a staging database when used to support SSAS analysis. Perhaps from an SSAS perspective, this might make sense, especially if all data in the data warehouse is always rolled up into SSAS multidimensional cubes. In this sense, SSAS is seen as the primary database and the warehouse as only supporting that database. Personally, I think such a classification diminishes the essential role that the data warehouse plays in a BI solution, so I would tend to avoid this sort of reference.
OLAP technologies are usually built with dimensionally modelled data warehouses in mind, although products such as SSAS can access data directly from a relational database. However, it is generally recommended to use a warehouse to support more efficient queries, properly cleanse the data, ensure data integrity and consistency, and support historical data. The data warehouse also acts as a checkpoint (not unlike a staging database!) for troubleshooting data extraction, transformation, and load (ETL) operations and for auditing the data.

The Data Model

A data warehouse should be structured to support efficient analysis and reporting. As a result, the tables and their relationships must be modelled so that queries to the database are both efficient and fast. For this reason, a dimensional model looks very different from a relational model.
There are basically two types of dimensional models: the star schema and snowflake schema. Often, a data warehouse model follows one schema or the other. However, both schemas are made up of the same two types of tables: facts and dimensions. Fact tables represent a core business process, such as retail sales or banking transactions. Dimension tables store related details about those processes, such as customer data or product data. (Each table type is described in greater detail later in the article.)

The Star Schema

The basic structure of a star schema is a fact table with foreign keys that reference a set of dimensions. Figure 2 illustrates how this structure might look for an organization’s sales process.

Figure 2: Using a Star Schema for Sales Data
The fact table (FactSales) pulls together all information necessary to describe each sale. Some of this data is accessed through foreign key columns that reference dimensions. Other data comes from columns within the table itself, such as Quantity and UnitPrice. These columns, referred to as measures, are normally numeric values that, along with the data in the referenced dimensions, provide a complete picture of each sale.
The dimensions, then, provide details about the functional groups that support each sale. For example, the DimProduct dimension includes specific details about each product, such as color and weight. Notice, however, that the dimension also includes the Category and Subcategory columns, which represent the hierarchical nature of the data. (Each category contains a set of subcategories, and each subcategory contains a set of products.) In essence, the dimension in the star schema denormalizes—or flattens out—the data. This means that most dimensions will likely include a fair amount of redundant data, thus violating the rules of normalization. However, this structure provides for more efficient querying because joins tend to be much simpler than those in queries accessing comparable data in a relational database.
Dimensions can also be used by other fact tables. For example, you might have a fact table that references the DimProduct and DimDate dimensions as well as other dimensions specific to that fact table. The key is to be sure that the dimension is set up to support both fact tables so that data is presented consistently through each one.
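To make the structure concrete, here is a minimal T-SQL sketch of the star schema in Figure 2. The column lists are illustrative assumptions based on the description above, not an exact copy of the figure, and the fact table is deliberately left without a primary key (a design choice discussed later in the article):
CREATE TABLE DimProduct (
  ProductKey int NOT NULL PRIMARY KEY,
  ProductName nvarchar(50) NOT NULL,
  Color nvarchar(15) NULL,
  Subcategory nvarchar(50) NULL, -- hierarchy flattened into the dimension
  Category nvarchar(50) NULL
);
CREATE TABLE DimDate (
  DateKey int NOT NULL PRIMARY KEY, -- smart key, e.g. 20081020
  CalendarYear smallint NOT NULL,
  CalendarQuarter tinyint NOT NULL
);
CREATE TABLE FactSales (
  ProductKey int NOT NULL REFERENCES DimProduct (ProductKey),
  OrderDateKey int NOT NULL REFERENCES DimDate (DateKey),
  Quantity smallint NOT NULL, -- the measures follow
  UnitPrice money NOT NULL,
  SalesAmt money NOT NULL
);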

The Snowflake Schema

You can think of the snowflake schema as an extension of the star schema. The difference is that, in the snowflake schema, dimensional hierarchies are extended (normalized) through multiple tables to avoid some of the redundancy found in a star schema. Figure 3 shows a snowflake schema that stores the same data as the star schema in Figure 2.
Figure 3: Using a Snowflake Schema for Sales Data
Notice that the dimensional hierarchies are now extended into multiple tables. For example, the DimProduct dimension references the DimProductSubcategory dimension, and the DimProductSubcategory dimension references the DimProductCategory dimension. However, the fact table still remains the hub of the schema, with the same foreign key references and measures.
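A sketch of how the product hierarchy might be normalized in T-SQL, replacing the flattened DimProduct from the previous sketch (column names are again assumptions):
CREATE TABLE DimProductCategory (
  ProductCategoryKey int NOT NULL PRIMARY KEY,
  CategoryName nvarchar(50) NOT NULL
);
CREATE TABLE DimProductSubcategory (
  ProductSubcategoryKey int NOT NULL PRIMARY KEY,
  ProductCategoryKey int NOT NULL
    REFERENCES DimProductCategory (ProductCategoryKey),
  SubcategoryName nvarchar(50) NOT NULL
);
-- DimProduct now carries only product-level attributes plus a link up the hierarchy
CREATE TABLE DimProduct (
  ProductKey int NOT NULL PRIMARY KEY,
  ProductSubcategoryKey int NOT NULL
    REFERENCES DimProductSubcategory (ProductSubcategoryKey),
  ProductName nvarchar(50) NOT NULL,
  Color nvarchar(15) NULL
);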

The Star Schema vs. the Snowflake Schema

There is much debate over which schema model is better. Some suggest—in fact, insist—that the star schema with its denormalized structure should always be used because of its support for simpler and better performing queries. However, such a schema makes updating the data a more complex process. The snowflake schema, on the other hand, is easier to update, but queries are more complex. At the same time, there’s the argument that SSAS queries actually perform better against a snowflake schema than a star schema. And the debate goes on.
In many cases, data modellers choose to do a hybrid of both, normalizing those dimensions that support the user the best. The AdventureWorksDW and AdventureWorksDW2008 sample data warehouses take this approach. For example, in the AdventureWorksDW2008 database, the hierarchy associated with the DimProduct dimension is extended in a way consistent with a snowflake schema, but the DimDate dimension is consistent with the star schema, with its denormalized structure. The important point to keep in mind is that the business requirements must drive the data model—with query performance a prime consideration.

The Fact Table

At the core of the dimensional data model is the fact table. As shown in figures 2 and 3, the fact table acts as the central access point to all data related to a specific business process—in this case, sales. The table is made up of columns that reference dimensions (the foreign keys) and columns that are measures. A fact table supports several types of measures:
  • Additive: A type of measure in which meaningful information can be extracted by aggregating the data. In the preceding examples, the SalesAmt column is additive because those figures can be aggregated based on a specified date range, product, salesperson, or customer. For example, you can determine the total sales amount for each customer in the past year.
  • Nonadditive: A type of measure that does not expose meaningful information through aggregation. In the above examples, the UnitPrice column might be considered nonadditive. For example, totaling the column based on a salesperson’s sales in the past year does not produce meaningful data. Ultimately, a measure is considered nonadditive if it cannot be aggregated along any dimensions. (If a measure is additive along some dimensions and not others, it is sometimes referred to as semiadditive.)
  • Calculated: A type of measure in which the value is derived by performing a function on one or more measures. For example, the TotalAmt column is calculated by adding the SalesAmt and TaxAmt values. (These types of columns are sometimes referred to as computed columns.)
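One way to realize a calculated measure such as TotalAmt is a computed column, sketched below; the article doesn't prescribe this mechanism, and the calculation could equally be performed in the ETL process or in queries:
CREATE TABLE FactSalesAmounts (
  SalesAmt money NOT NULL,
  TaxAmt money NOT NULL,
  TotalAmt AS (SalesAmt + TaxAmt) PERSISTED -- derived; never loaded directly
);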
If you refer back to figure 2 or 3, you’ll see that no primary key has been defined on the fact table. I have seen various recommendations on how to handle the primary key. One recommendation is to not include a primary key, as I’ve done here. Another recommendation is to create a composite primary key based on the foreign-key columns. A third recommendation is to add an IDENTITY column configured with the INT data type. Finally, another approach is to add columns from the original data source that can act as the primary key. For example, the source data might include an OrderID column. (This is the approach taken by the AdventureWorksDW2008 data warehouse.)
As stated above, the goal of any data warehouse design should be to facilitate efficient and fast queries (while still ensuring data integrity). However, other considerations should include whether it will be necessary to partition the fact table, how much overhead additional indexing (such as that added along with a primary key) will generate, and whether the indexing actually improves the querying process. As with other database design considerations, it might come down to testing various scenarios to see where you receive your greatest benefits and what causes the greatest liabilities.

The Dimension

A dimension is a set of related data that supports the business processes represented by one or more fact tables. For example, the DimCustomer table in the examples shown in figures 2 and 3 contains only customer-related information, such as name, address, and phone number, along with relevant key columns.

The Surrogate Key

Notice that the DimCustomer dimension includes the CustomerKey column and the CustomerBusKey column. Let’s start with CustomerKey.
The CustomerKey column is the dimension’s primary key. In most cases, the key will be configured as an IDENTITY column with the INT data type. In a data warehouse, the primary key is usually a surrogate key. This means that the key is generated specifically within the context of the data warehouse, independent of the primary key used in the source systems.
You should use a surrogate key for a number of reasons. Two important ones are that the key provides the mechanism necessary for updating certain types of dimension attributes over time, and it removes the dependencies on keys originating in different data sources. For example, if you retrieve customer data from two SQL Server databases, a single customer might be associated with multiple IDs or the same ID might be assigned to multiple customers.
That doesn’t mean that the original primary key is discarded. The original key is carried into the dimension and maintained along with the surrogate key. The original key, usually referred to as the business key, lets you map the source data to the dimension. For example, the CustomerBusKey column in the DimCustomer dimension contains the original customer ID, and the CustomerKey column contains the new ID (which is the surrogate key and primary key).

The Date Dimension

A date dimension (such as the DimDate dimension in the examples) is treated a little differently from other types of dimension. This type of table usually contains one row for each day for whatever period of time is necessary to support an application. Because a date dimension is used, dates are not tracked within the fact table (except through the foreign key), but instead in the referenced date dimension. This way, not only is the day, month, and year stored, but also such data as the week of the year, quarter, and semester. As a result, this type of information does not have to be calculated within the queries. This is important because date ranges usually play an important role in both analysis and reporting.
Note:
The DimDate dimension in the examples uses only the numerical equivalents to represent day values such as day of week and month. However, you can create a date dimension that also includes the spelled-out names, such as Wednesday and August.
A fact table can reference a date dimension multiple times. For instance, the OrderDateKey and ShipDateKey columns both reference the DateKey column in DimDate. If you also want to track the time of day, your warehouse should include a time dimension that stores the specific time increments, such as hour and minute. Again, a time dimension can be referenced by multiple fact tables or can be referenced multiple times within the same fact table.
Unlike other types of dimensions whose primary key is an integer, a date dimension uses a primary key that represents the date. For example, the primary key for the October 20, 2008 row is 20081020. A time dimension follows the same logic. If the dimension stores hours, minutes, and seconds, each row would represent a second in the day. As a result, the primary key for a time such as 10:30:27 in the morning would be 103027.
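Such smart keys are easy to derive in T-SQL; this sketch uses the built-in convert styles (112 yields yyyymmdd, and 108 yields hh:mi:ss):
DECLARE @d datetime
SET @d = '2008-10-20 10:30:27'
-- Integer date key of the form yyyymmdd
SELECT CONVERT(int, CONVERT(char(8), @d, 112)) AS DateKey -- returns 20081020
-- Integer time key of the form hhmmss (strip the colons from style 108)
SELECT CONVERT(int, REPLACE(CONVERT(char(8), @d, 108), ':', '')) AS TimeKey -- returns 103027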

The Slowly Changing Dimension

Dimension tables are often classified as slowly changing dimensions (SCDs), that is, the stored data changes relatively slowly, compared to fact tables. For instance, in the previous examples, the FactSales table will receive far more updates than the DimProduct or DimSalesPerson dimensions. Such dimensions normally change very slowly.
The way in which you identify a SCD affects how the dimension is updated during the ETL process. There are three types of slowly changing dimensions:
  • Type 1: Rows to be updated are overwritten, thus erasing the row's history. For example, if the size of a product changes (which is represented as one row in the DimProduct dimension), the original size is overwritten with the new size, and there is no historical record of the original size.
  • Type 2: Rows to be updated are added as new records to the dimension. The new record is flagged as current, and the original record is flagged as not current. For example, if the product’s color changes, there will be two rows in the dimension, one with the original color and one with the new color. The row with the new color is flagged as the current row, and the original row is flagged as not current.
  • Type 3: Updated data is added as new columns to the dimension. In this case, if the product color changes, a new column is added so that the old and new colors are stored in a single row. In some designs, if the color changes more than once, only the current and original values are stored.
The built-in mechanisms in SQL Server Integration Services (SSIS) implement only Type 1 and Type 2 SCDs. Although you can build an ETL solution to work with Type 3 SCDs, Type 2 SCDs are generally considered more efficient, easier to work with, and better suited to storing historical data.
When working with SCDs in SSIS, you can use the Slowly Changing Dimension transformation to implement the SCD transformation workflow. However, SSIS does not identify SCDs at the dimension level, but rather at the attribute (column) level. Figure 4 shows the Slowly Changing Dimension Columns screen of the Slowly Changing Dimension wizard.

Figure 4: Specifying the Slowly Changing Dimension Columns in SSIS
For each column you can choose whether that column is a Changing Attribute (Type 1) or a Historical Attribute (Type 2). The wizard also offers a third option—Fixed Attribute, in which case, the value of the column can never change.
There is one other consideration when working with SCDs. If a dimension supports Type 2 SCDs, you need some way to mark each row to show whether it is the current row (the most recent updated row) and the historical range of when each row was current. The most effective way to do this is to add the following three columns to any dimension that supports SCD attributes:
  • Start date: The date/time when the row was inserted into the dimension.
  • End date: The date/time when the row became no longer current and a new row was inserted to replace this row. If the row is the most current row, this column value is set to null or assigned a maximum date that is out of the present-day range, such as December 31, 9999.
  • Current flag: A Boolean or other indicator that marks which row within a given set of associated rows is the current row.
Some systems implement only the start date and end date columns, without a flag, and then use the end date to calculate which row is the current row. In fact, the Slowly Changing Dimension transformation supports using only the dates or only a flag. However, implementing all three columns is generally considered the most effective strategy when working with SCDs. One approach you can take when using the Slowly Changing Dimension wizard is to select the status flag as the indicator and then modify the data flow to incorporate the start and end date updates.
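As a rough T-SQL sketch of a Type 2 change, using hypothetical column names (ProductBusKey, StartDate, EndDate, CurrentFlag), expiring the current row and inserting its replacement might look like this:
BEGIN TRANSACTION
-- Expire the current row for the product whose color changed
UPDATE DimProduct
SET EndDate = GETDATE(), CurrentFlag = 0
WHERE ProductBusKey = 'BK-M68B' AND CurrentFlag = 1
-- Insert the new version of the row, flagged as current and open-ended
INSERT INTO DimProduct (ProductBusKey, ProductName, Color, StartDate, EndDate, CurrentFlag)
VALUES ('BK-M68B', 'Mountain-200', 'Silver', GETDATE(), '99991231', 1)
COMMIT TRANSACTION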

The Data Warehouse and the BI Suite

Although the data warehouse is an essential component in an enterprise-wide BI system, there are indeed other important components. Figure 5 provides an overview of how the SQL Server suite of tools might be implemented in a BI system.
Figure 5: The SQL Server BI Suite
As you can see, SSIS provides the means to retrieve, transform, and load data from the various data sources. The data warehouse itself is indifferent to the source of the data. SSIS does all the work necessary to ensure that the data conforms to the structure of the warehouse, so it is critical that the warehouse is designed to ensure that the reporting and analysis needs are being met. To this end, it is also essential to ensure that the SSIS ETL operation thoroughly cleanses the data and guarantees its consistency and validity. Together SSIS and the data warehouse form the foundation on which all other BI operations are built.
After the data is loaded into the warehouse, it can then be processed into the SSAS cubes. Note that SSIS, in addition to loading the data warehouse, can also be used to process the cubes as part of the ETL operation.
In addition to supporting multidimensional analysis, SSAS supports data mining in order to identify patterns and trends in the data. SSAS includes a set of predefined data-mining algorithms that help data analyzers perform such tasks as forecasting sales or targeting mailings toward specific customers.
Another component in the SQL Server BI suite is SSRS. You can use SSRS to generate reports based on the data in either the data warehouse or in the SSAS database. (You can also use SSRS to access the data sources directly, but this approach is generally not recommended for an enterprise operation because you want to ensure that the data is cleansed and consistent before including it in reports.) Whether you generate SSRS reports based on warehouse data or SSAS data depends on your business needs. You might choose not to implement SSAS or SSRS, although it is only when these components are all used in orchestration that you realize the full power of the SQL Server BI suite.

Conclusion

Regardless of the components you choose to implement or the business rules you’re trying to address in your BI solution, you can see that the data warehouse plays a critical role in that solution. By storing heterogeneous and historical data in a manner that ensures data integrity and supports efficient access to that data, the data warehouse becomes the heart of any BI solution. As a result, the better you understand the fundamental concepts associated with the data warehouse, the more effectively you will understand and be able to work with all your BI operations.

12/16/2011

Data Warehousing - Fact and Dimension Tables

Data warehouses are built using dimensional data models, which consist of fact and dimension tables. Dimension tables are used to describe dimensions; they contain dimension keys, values, and attributes. For example, the time dimension would contain every hour, day, week, month, quarter, and year that has occurred since you started your business operations. The product dimension could contain the name and description of the products you sell, their unit price, color, weight, and other attributes as applicable.



Dimension tables are typically small, ranging from a few to several thousand rows. Occasionally, however, dimensions can grow fairly large. For example, a large credit card company could have a customer dimension with millions of rows. Dimension table structure is typically very lean; for example, a customer dimension could look like the following:

Customer_key
Customer_full_name
Customer_city
Customer_state
Customer_country
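Expressed as a table definition, that lean structure might look like the following sketch (the data types are assumptions):
CREATE TABLE Customer_dim (
  Customer_key int NOT NULL PRIMARY KEY,
  Customer_full_name nvarchar(100) NULL,
  Customer_city nvarchar(50) NULL,
  Customer_state nvarchar(50) NULL,
  Customer_country nvarchar(50) NULL
);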


Although there might be other attributes that you store in the relational database, data warehouses might not need all of those attributes. For example, customer telephone numbers, email addresses, and other contact information would not be necessary for the warehouse. Keep in mind that a data warehouse is used to make strategic decisions by analyzing trends; it is not meant to be a tool for daily business operations. On the other hand, you might have some reports that do include data elements that aren't necessary for data analysis.



Most data warehouses will have one or multiple time dimensions. Since the warehouse will be used for finding and examining trends, data analysts will need to know when each fact occurred. The most common time dimension is calendar time. However, your business might also need a fiscal time dimension if your fiscal year does not start on January 1 like the calendar year.



Most data warehouses will also contain product or service dimensions since each business typically operates by offering either products or services to others. Geographically dispersed businesses are likely to have a location dimension.



Fact tables contain keys to dimension tables as well as measurable facts that data analysts would want to examine. For example, a store selling automotive parts might have a fact table recording a sale of each item. The fact table of an educational entity could track credit hours awarded to students. A bakery could have a fact table that records manufacturing of various baked goods.



Fact tables can grow very large, with millions or even billions of rows. It is important to identify the lowest level of facts that makes sense to analyze for your business; this is often referred to as the fact table "grain". For instance, for a healthcare billing company it might be sufficient to track revenues by month; daily and hourly data might not exist or might not be relevant. On the other hand, assembly line warehouse analysts might be very concerned with the number of defective goods manufactured each hour. Similarly, a marketing data warehouse might be concerned with the activity of a consumer group with a specific income level rather than purchases made by each individual.
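The chosen grain shows up directly in the fact table design. As a sketch, compare a monthly-grain billing fact with an hourly-grain manufacturing fact (all names here are hypothetical):
-- Monthly grain: one row per customer per month
CREATE TABLE Fact_revenue_monthly (
  Customer_key int NOT NULL,
  Month_key int NOT NULL, -- e.g. 201112
  Revenue_amt money NOT NULL
);
-- Hourly grain: one row per production line per hour
CREATE TABLE Fact_defects_hourly (
  Production_line_key int NOT NULL,
  Date_key int NOT NULL, -- e.g. 20111216
  Hour_of_day tinyint NOT NULL, -- 0 through 23
  Defect_count int NOT NULL
);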

SQL Server 2008’s Management Data Warehouse - Part 2


To identify the location of the MDW, enter a server name and database. Use the "New" button to create a new MDW database. Note that if you click on the New button, specify a new database to be created, and then cancel the wizard, the database will still be created.

There are two modes in which a Data Collection might run. Some collections collect data based on a snapshot in time, while others constantly collect data. The Data Collections that constantly collect data use the "Cache directory" to store collected data between uploads to the MDW. To specify a cache location, either type in the name of the directory or browse for it by using the ellipsis (…) button. If you browse for the cache directory, it must already exist; if you type it in, you must make sure you identify an existing directory. If the directory doesn't exist, the system Data Collection processes that require a cache will fail until you create the specified cache folder. I'd suggest you create the cache folder in advance to avoid having any Data Collection failures to deal with.

Once you select a server and database and identify a cache directory location, the "Next >" button will become available. When you click on the Next button, the following screen will be displayed:






On this screen you identify the rights each login will have in the MDW. There are three roles: mdw_admin, mdw_reader, and mdw_writer. The mdw_admin role is allowed to read, write, and update data, as well as run purge and cleanup jobs against the MDW. The mdw_reader role only has access to read data in the MDW, whereas the mdw_writer role can write and upload data to the MDW. If you are a DBA and you want to completely manage the MDW, make sure you give yourself access to the mdw_admin role. You don't have to add all the rights when using the wizard; you can go directly into the MDW and place people in these roles later if you desire. Once you are done mapping logins, you can click on either the "Next>" or "Finish>>|" button; they both take you to a summary window. On the summary window you can verify what you have entered in the wizard. If, while you are reviewing the summary information, you find something you want to change, you can click on the "<Back" button to go back and change your MDW configuration options, then click the "Finish>>|" button to jump back to the summary window.
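If you defer the role mapping, granting the rights later in T-SQL is straightforward. A sketch, assuming an MDW database named MDW and a login named YourDomain\DBA1 (both placeholders):
USE MDW
GO
-- Create a database user for the login, then add it to the admin role
CREATE USER [YourDomain\DBA1] FOR LOGIN [YourDomain\DBA1]
EXEC sp_addrolemember 'mdw_admin', 'YourDomain\DBA1'
GO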

After you click finish the wizard will take a few minutes to create the MDW environment. When the wizard is done setting up the MDW the following screen is displayed:




Once you have completed the wizard, the MDW database will have been created, along with three system Data Collections and a number of SQL Server Agent jobs and SSIS packages.

If you expand the "Data Collection" item in Object Explorer within SSMS, you will be able to see the three different system Data Collections under the "System Data Collection Set" item.

In addition to the three different system Data Collections a number of SQL Agent jobs and SSIS packages are also created. These SQL Agent jobs and SSIS packages are used to automate the extraction and load process of the MDW for each Data Collection.

System Data Collections
The three different Data Collection items created are Disk Usage, Query Statistics, and System Statistics. The Disk Usage Data Collection collects disk space usage information related to the data and log files associated with each database. This collected information can be used to track the growth rate of your databases over time, which is useful for performing capacity management.

The Query Statistics Data Collection collects information about the queries that are run against the SQL Server instance. Since the Query Statistics Data Collection is disabled by default, you will need to enable it if you want to collect query statistics. The System Statistics Data Collection gathers performance counter information, such as CPU and memory. This Data Collection allows you to monitor the resource usage of various counters over time, so you can identify resource bottlenecks and trends.
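Once the collections have uploaded data, you can query it directly. As a sketch, assuming the snapshots.disk_usage table that the Disk Usage collection set populates (check the snapshots schema in your own MDW for the exact objects):
USE MDW
GO
-- Peek at the most recently uploaded disk usage rows
SELECT TOP (10) * FROM snapshots.disk_usage
GO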

SQL Server 2008 Performance Monitoring
The MDW and the Data Collection containers provide DBAs with an easy method to gather performance monitoring data for SQL Server 2008. Having the performance data collection process built into SQL Server 2008 reduces the need to build your own data collection routines. By using the MDW and the Data Collection process DBAs now have the tools necessary to provide them with data that they can use to track the performance of their SQL Server environment over time and perform capacity planning when new hardware needs to be acquired.

SQL Server 2008’s Management Data Warehouse


I’ve been on a performance kick lately. I’ve been sharing with you how to use SQL Server 2005 DMVs and DMFs to identify the resource intensive queries for SQL Server 2005. Well now I’m going to switch gears and look at SQL Server 2008 and discuss one of the new concepts known as Management Data Warehouse (MDW) and the processes around managing data in the MDW. In this article I’ll give you an overview of what the MDW is, how to set it up using the wizard, and will discuss the different system Data Collections that are installed with the wizard.

What is the Management Data Warehouse?
The MDW is just a single database inside of a SQL Server 2008 instance, which holds various performance related statistics. The performance statistics in the MDW are gathered via special data gathering routines, known as Data Collections. The MDW can contain Data Collection information from a single instance, or can hold data collected from multiple instances. The Data Collection process uses SSIS and SQL Agent jobs to gather, schedule and purge information in the MDW.

SQL Server 2008 provides three different system Data Collection definitions: Disk Usage, Query Activity, and Server Activity. Each of these Data Collection definitions identifies the data to be collected, how often it should be collected, and how long it should be stored in the MDW. Information for these system Data Collections can be found in the snapshots schema within the MDW. Microsoft has also provided standard reports that let you drill down into the data collected for each of these collections using SSMS. I will not be covering the standard reports in this article.

If you want to collect your own performance data periodically, you can define your own Data Collection definitions to feed the MDW database. User-defined Data Collection information will be stored in the custom_snapshots schema. For your own data collections you will need to build your own reporting process. You can do that with an application, or by building custom reports that can be run directly from SQL Server Management Studio (SSMS) against the data collected in the MDW.

What is a Data Collection?
A data collection is a container that allows you to specify what data to collect, when to collect it, and how long to keep it. It does this by using a Data Collector. Everyone should be familiar with the concept of a data collector: someone or something that goes around and collects items from a source, using a specific method, and then deposits the collected items in a target location somewhere. The tax man is a collector. He takes taxes from you, via a tax return, and then deposits them into the tax reserves. A SQL Server 2008 Data Collection does the same thing. It collects data from a source location (a SQL Server instance), using a Data Collector, and deposits it in a target location, the MDW.

A given Data Collector collects a specific type of data. As of the February CTP of SQL Server 2008, there are only four different collector types: T-SQL Query, SQL Trace, Performance Counters, and Query Activity. Each data collection is defined against the source where you want to collect the data and then associated with a collector type to identify the type of data that will be collected.

The data can be collected continuously, on a predetermined schedule, or manually. A data collection also identifies how long the collected data will be retained in the MDW. The Data Collection process handles the loading and purging of data in the MDW. When data is collected continuously, SQL Server caches the information; by caching it, SQL Server 2008 minimizes the performance impact of collecting data continuously.

All data collected by a Data Collection container is stored in the MDW that is identified for the specific instance of SQL Server where the Data Collection is defined. The MDW database for an instance might be stored physically within the instance, or stored on another instance. Where the collected data is stored depends on how the MDW is set up for a particular instance. When I discuss how to use the wizard to set up the MDW for an instance, you should get a better understanding of how to identify a location for the MDW for a given SQL Server instance.

The Data Collection process requires SSIS to transform and store the data collection information and SQL Server Agent schedules to execute the data gathering process. Therefore, if you plan to use the MDW, you will need to install SSIS and make sure both the SSIS and SQL Server Agent services are running.
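Collection sets can also be managed from T-SQL rather than through SSMS, using the sp_syscollector procedures in msdb. A brief sketch, starting the built-in Disk Usage collection set:
USE msdb
GO
-- Enable the data collector for this instance
EXEC dbo.sp_syscollector_enable_collector
GO
-- Start the built-in Disk Usage collection set by name
EXEC dbo.sp_syscollector_start_collection_set @name = N'Disk Usage'
GO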

Setting Up the Management Data Warehouse
To set up the MDW you need to run through a wizard. The wizard is started by clicking on the "Configure Management Data Warehouse" item, which can be found by right-clicking on the "Data Collection" item under the "Management" folder in the "Object Explorer" pane within SSMS, as the screen shot below shows:



When the wizard starts, a "Welcome to the Configuration Management Warehouse Wizard" window will be displayed. The welcome screen explains a little about the wizard and the rights needed to create and manage the MDW. The kind of rights needed by the wizard will depend on whether you are setting up a new MDW database or just setting up a new SQL Server instance to populate an existing MDW. If you are setting up a new MDW database, you need to make sure the login you run the wizard under has rights to create a database. If you are only using the wizard to set up a new SQL Server instance to use an existing MDW, the login only needs rights to define users in the MDW database and associate the database users with appropriate roles (more on this later). When you click on the "Next>" button on the welcome screen, the following window will be displayed:






On this screen you will identify the server and database where the MDW will live, or already lives if you are just configuring a new SQL Server instance to use an existing MDW database. The MDW database could live on the SQL Server instance being configured, on another instance on the same server, or on a completely different SQL Server 2008 machine. If you have many SQL Server machines and/or instances, you might want to consider using the same MDW database for all instances. This way, all your data collection information will be stored in a single MDW database, which is especially useful if you want to write your own reporting mechanism: by doing this you can simplify the reporting process by housing all your Data Collection data in a single database.

12/13/2011

The XML Methods in SQL Server


The XML Data type has definite uses, but the way of interrogating, retrieving, and manipulating the values of properties and attributes within XML have been so foreign to the SQL language as to be somewhat of a barrier to their use. Fortunately, Robert Sheldon has once more managed to make the subject accessible to those of us who just need to get the job done.
In my last article, "Working with the XML Data Type in SQL Server," I included examples that returned data from columns and variables of the XML data type. Based on those examples, it might seem that retrieving XML data is similar to retrieving objects configured with other data types. However, that's true only when returning the entire XML value. If instead you want to work with individual components within an XML instance, such as when retrieving the value of a single attribute, you must use one of the five methods that are available to the XML data type: query(), value(), exist(), nodes(), or modify().
To use an XML method, you call it in association with the database object configured with the XML data type, as the following syntax illustrates:
DbObject.XmlMethod('XQuery'[, 'SqlType']) [AS TableAlias(ColumnAlias)]
The DbObject placeholder refers to a column, variable, or parameter configured with the XML data type. After the object name, you add a period, followed by the name of the XML method. Next you provide, in parentheses, the arguments required for the specific method.
Most of the XML methods require only one argument: an XQuery expression that identifies the XML components to be retrieved or modified, and any actions to be taken on those components. XQuery, a powerful scripting language used specifically to access XML data, contains the elements necessary to create complex expressions that can include functions, operators, variables, and values. MSDN provides a complete XQuery language reference that you can refer to while you work with the XML methods.
In addition to the XQuery expression, an XML method might also require a Transact-SQL data type as a second argument or a table and column alias tagged on after the arguments. As we work through the article, you’ll see when and why these additional components are necessary.
To demonstrate how the various XML methods work, I’ve written a number of examples that use the methods to retrieve and modify data. The examples are based on the Stores table, which I created and populated with the following Transact-SQL script:
USE AdventureWorks2008R2
GO

IF OBJECT_ID('Stores') IS NOT NULL
DROP TABLE Stores
GO

CREATE TABLE Stores
(
  StoreID INT PRIMARY KEY,
  Survey_untyped XML,
  Survey_typed XML(Sales.StoreSurveySchemaCollection)
);

 INSERT INTO Stores
 VALUES
 (
   292,
  '<StoreSurvey>
    <AnnualSales>800000</AnnualSales>
    <AnnualRevenue>80000</AnnualRevenue>
    <BankName>United Security</BankName>
    <BusinessType>BM</BusinessType>
    <YearOpened>1996</YearOpened>
    <Specialty>Mountain</Specialty>
    <SquareFeet>21000</SquareFeet>
    <Brands>2</Brands>
    <Internet>ISDN</Internet>
    <NumberEmployees>13</NumberEmployees>
    <Products Type="Bikes">
      <Product>Mountain</Product>
      <Product>Road</Product>
      <Product>Racing</Product>
    </Products>
    <Products Type="Clothes">
      <Product>Jerseys</Product>
      <Product>Jackets</Product>
      <Product>Shorts</Product>
    </Products>
  </StoreSurvey>',
  (SELECT Demographics FROM Sales.Store
   WHERE BusinessEntityID = 292)
);
Notice that the table includes the Survey_untyped column, which is an XML untyped column, and the Survey_typed column, which is an XML typed column. The schema collection I associated with the typed column is already included in the AdventureWorks2008R2 database, which is where I’ve created the table. The INSERT statement in the Transact-SQL above also shows the XML document that I added to the untyped column. For the typed column, I retrieved the following XML document from the Demographics column (a typed XML column) in the Sales.Store table in the AdventureWorks2008R2 database:
<StoreSurvey xmlns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/StoreSurvey">
  <AnnualSales>800000</AnnualSales>
  <AnnualRevenue>80000</AnnualRevenue>
  <BankName>United Security</BankName>
  <BusinessType>BM</BusinessType>
  <YearOpened>1996</YearOpened>
  <Specialty>Mountain</Specialty>
  <SquareFeet>21000</SquareFeet>
  <Brands>2</Brands>
  <Internet>ISDN</Internet>
  <NumberEmployees>13</NumberEmployees>
</StoreSurvey>
Notice that the XML includes the schema associated with that data. The schema comes from the same schema collection I used when I created the typed column in the Stores table.
The data I added to the untyped column is nearly identical to the typed XML. The untyped XML doesn't include the namespace information, of course. However, it does include additional product information, which we'll use to demonstrate the XML methods. So now that we have the setup out of the way, let's look at how those methods work.

The query() method

The query() method retrieves a subset of untyped XML from the target XML instance. It’s probably the simplest and most straightforward of the XML methods. You need only specify the database object, the method name, and an XQuery expression, as shown in the following syntax:
DbObject.query('XQuery')
You’ll notice that the XQuery expression is entirely enclosed in single quotes and those are enclosed in parentheses. The trick with any XML method is to get the XQuery expression right. But at its simplest, the expression must define the path of the element you want to retrieve.
In the following example, I use the query() method to return data from the <StoreSurvey> element in both the typed and untyped columns:
SELECT
  Survey_untyped.query('/StoreSurvey')
    AS Info_untyped,
  Survey_typed.query('declare namespace
    ns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/StoreSurvey";
    /ns:StoreSurvey') AS Info_typed
FROM
  Stores;
For the untyped column, I specified the column name (Survey_untyped) followed by a period and then the method name (query). Within the parentheses and single quotes, I defined the XQuery expression (/StoreSurvey). That’s all there is to it. Because the <StoreSurvey> element is the root node, the entire XML document is returned.
The XQuery expression for the Survey_typed column is a bit more complex. Because it is a typed column, the expression should be preceded by a namespace declaration. The namespace must be the same as the one referenced within the XML document stored in the typed column.
To declare a namespace, you specify the declare namespace keywords, followed by the alias name (in this case, ns). After the alias, you include an equal sign (=) and then the schema path and name. You then end the declaration with a semi-colon (;) to separate it from the main body of the XQuery expression. After you declare the namespace, you can use the alias within the expression to reference that namespace.
The expression itself is nearly identical to the one used for the untyped column, except that you must precede the element name with the namespace alias and a colon (ns:). As with the untyped column, the expression will return the entire XML document because it specifies only the root node.
Although the preceding example is helpful in demonstrating the basics of using the query() method, it’s not much use beyond that because you can just as easily retrieve all the column contents simply by specifying the column name. However, in the following example, I get more specific by limiting the results to the <AnnualSales> child element:
SELECT
  Survey_untyped.query('/StoreSurvey/AnnualSales')
    AS Info_untyped,
  Survey_typed.query('declare namespace
    ns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/StoreSurvey";
    /ns:StoreSurvey/ns:AnnualSales') AS Info_typed
FROM
  Stores;
The only difference between this example and the preceding one is that I added /AnnualSales to the XQuery expression for the untyped column and /ns:AnnualSales to the expression for the typed column. The XQuery for the untyped column returns the following results.
<AnnualSales>800000</AnnualSales>
And the XQuery for the typed column returns these results:
<ns:AnnualSales xmlns:ns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/StoreSurvey">800000</ns:AnnualSales>
The primary difference between the two is that the data returned by the typed column includes the namespace information. The element values themselves (800000) are the same in both columns. If the <AnnualSales> element had included its own child elements, those too would have been displayed. For instance, the following example retrieves the <Products> elements, which are child elements of <StoreSurvey>, just like <AnnualSales>:
SELECT
  Survey_untyped.query('/StoreSurvey/Products')
    AS Products
FROM
  Stores;
Because the XML document includes two <Products> elements and those elements each include several <Product> child elements, the SELECT statement returns all product-related elements, as shown in the following results:
<Products Type="Bikes">
  <Product>Mountain</Product>
  <Product>Road</Product>
  <Product>Racing</Product>
</Products>
<Products Type="Clothes">
  <Product>Jerseys</Product>
  <Product>Jackets</Product>
  <Product>Shorts</Product>
</Products>
If you want to return a specific element and its child elements, you can do so by referencing one of its attributes. For example, suppose you want to include only products from the Bikes category. To do so, you can modify your XQuery expression as follows:
SELECT
  Survey_untyped.query('/StoreSurvey/Products[@Type="Bikes"]')
    AS BikeProducts
FROM
  Stores;
Now the XQuery expression includes bracketed information that specifies the Type attribute and its value. The attribute name is preceded by the at (@) symbol, and the attribute value is preceded by an equal sign (=) and enclosed in double quotes. As a result, the SELECT statement now returns the following XML fragment:
<Products Type="Bikes">
  <Product>Mountain</Product>
  <Product>Road</Product>
  <Product>Racing</Product>
</Products>
As you would expect, only elements whose products are in the Bikes category are returned. You could have just as easily specified “Clothes” rather than “Bikes,” and you would have received the product elements related to clothing.

The value() method

The value() method returns a scalar value from the targeted XML document. The returned value is converted to the data type you specify when you call the method. The value() method makes it easier to work with individual values in order to compare or combine them, either with other XML values or values of different types.
As the following syntax illustrates, when you call the value() method, you must specify the XQuery expression and the Transact-SQL data type for the returned data:
DbObject.value('XQuery', 'SqlType')
For example, suppose you want to pull the amount of sales from the <AnnualSales> element in either the typed or untyped column in the Stores table. You can specify the path in the XQuery expression plus a numerical data type as the second argument, as shown in the following SELECT statement:
SELECT
  Survey_untyped.value('(/StoreSurvey/AnnualSales)[1]', 'int')
    AS Sales_untyped,
  Survey_typed.value('declare namespace
    ns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/StoreSurvey";
    (/ns:StoreSurvey/ns:AnnualSales)[1]', 'int') AS Sales_typed
FROM
  Stores;
For the untyped column, the XQuery expression includes the element path as you saw in earlier examples. However, after the path, you must add an integer in brackets that indicates which element of that name you should retrieve. More often than not, you will simply add [1], as I’ve done here. This assures that, if there are multiple elements with the same name, only the first one will be returned. It also assures that only one element is being referenced at a time. In fact, even when there is only one element, as is the case with <AnnualSales>, you must still include the [1] because the value() method expects a singleton value.
The second argument passed into the value() method is the name of the data type, in this case, int. That means an int value will be returned by the method. Note that the method’s two arguments must each be enclosed in single quotes and separated with a comma.
As for the typed column, the namespace declaration and path are also similar to what you've seen in earlier examples. The only difference is that this expression must also include the [1], just as we did for the untyped column. As a result, for each column, the SELECT statement returns a single int value of 800000, without any additional element information.
You can also retrieve an attribute value, rather than an element value, by specifying the name of the attribute in your element path. For example, the following SELECT statement retrieves the value of the Type attribute for the second instance of the <Products> element:
SELECT
  Survey_untyped.value('(/StoreSurvey/Products/@Type)[2]', 'varchar(10)')
    AS ProductType
FROM
  Stores;
The expression includes [2], rather than [1], in order to retrieve data from the second instance of <Products>. As a result, the SELECT statement now returns the varchar value Clothes.
As stated earlier, the value() method is also handy if you want to combine or compare data. In the following example, I create a calculated column that’s based on two values returned from the untyped column:
SELECT
  Survey_untyped.value('(/StoreSurvey/AnnualSales)[1]', 'int') -
  Survey_untyped.value('(/StoreSurvey/AnnualRevenue)[1]', 'int')
    AS Expenses
FROM
  Stores;
The first instance of value() retrieves the <AnnualSales> value. The second value() instance retrieves the <AnnualRevenue> value. The second value is then subtracted from the first value to return a scalar value of 720000.
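Because value() returns an ordinary scalar, you can also use it to filter rows. Here is a minimal sketch, assuming the same Stores table; the 500000 threshold is purely illustrative:
SELECT
  StoreID
FROM
  Stores
WHERE
  Survey_untyped.value('(/StoreSurvey/AnnualSales)[1]', 'int') >= 500000;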
You can also use XQuery functions in your expressions. For instance, in the following example, I use the concat function to add a string to the <Specialty> value:
SELECT
  Survey_untyped.value('concat("Bike specialty: ",
   (/StoreSurvey/Specialty)[1])', 'varchar(25)')
   AS Specialty
FROM
  Stores;
When you use the concat function, you specify each value that you want to concatenate, whether a string literal or an element path, as an argument to the function, separating the arguments with commas. The statement returns the value Bike specialty: Mountain.
As you can see, the string has been concatenated with the <Specialty> value. I could have just as easily concatenated multiple element values or added more string values.
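For instance, here is a sketch that concatenates two element values instead (the separator string and the SpecialtyAndType alias are illustrative):
SELECT
  Survey_untyped.value('concat((/StoreSurvey/Specialty)[1],
    " - ", (/StoreSurvey/BusinessType)[1])', 'varchar(25)')
    AS SpecialtyAndType
FROM
  Stores;
Given the sample data, this returns the value Mountain - BM.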

The exist() method

The exist() method lets you test for the existence of an element or one of its values. The method takes only one argument, the XQuery expression, as shown in the following syntax:
DbObject.exist('XQuery')
The key to using the exist() method is in understanding the values it returns. Unlike the query() and value() methods, the exist() method doesn’t return XML content. Rather, the method returns one of the following three values:
  • A BIT value of 1 if the XQuery expression returns a nonempty result.
  • A BIT value of 0 if the XQuery expression returns an empty result.
  • A NULL value if the XML data type instance is NULL.
A good way to test how the exist() method works is to use a variable to capture the method’s results, as I’ve done in the following example:
DECLARE @xml XML;
DECLARE @exist BIT;
SET @xml = (SELECT Survey_untyped FROM Stores);
SET @exist = @xml.exist('/StoreSurvey[BusinessType="BM"]');
SELECT @exist;
First, I declared the @xml variable as type XML. Then I declared the @exist variable as type BIT. I set the value of @xml to equal the XML document in the Survey_untyped column. I then used the exist() method on the @xml variable to test for the existence of /StoreSurvey[BusinessType="BM"] within the XML document. In other words, the exist() method checks whether the <BusinessType> child element exists and whether it contains a value of “BM”.
I then assigned the results returned by the exist() method to the @exist variable and used a SELECT statement to return the contents of the variable. Because the XML document contains this child element and that element has a value of “BM”, the method returns a value of 1. However, if I change “BM” to “BMX”, as I’ve done in the following example, the method returns 0 because the element and value don’t exist exactly as specified:
DECLARE @xml2 XML;
DECLARE @exist2 BIT;
SET @xml2 = (SELECT Survey_untyped FROM Stores);
SET @exist2 = @xml2.exist('/StoreSurvey[BusinessType="BMX"]');
SELECT @exist2;
Now that you have a sense of how the exist() method works, let’s look at how you might use it in your queries. In the following example, I include a WHERE clause that uses the exist() method to again test for the existence of /StoreSurvey[BusinessType="BM"]:
SELECT
  Survey_untyped.query('/StoreSurvey/Products[@Type="Bikes"]')
    AS BikeProducts
FROM
  Stores
WHERE
  Survey_untyped.exist('/StoreSurvey[BusinessType="BM"]') = 1;
The WHERE clause compares the result of the exist() method to the number 1. If the result equals 1, that is, if the XML contains the specified element and value, the WHERE clause evaluates to true and the SELECT statement returns the product information, as shown in the following results:
<Products Type="Bikes">
  <Product>Mountain</Product>
  <Product>Road</Product>
  <Product>Racing</Product>
</Products>
If the exist() method had specified a different <BusinessType> value, as it does in the following example, the SELECT statement would instead return an empty result set:
SELECT
  Survey_untyped.query('/StoreSurvey/Products[@Type="Bikes"]')
    AS BikeProducts
FROM
  Stores
WHERE
  Survey_untyped.exist('/StoreSurvey[BusinessType="BMX"]') = 1;
Again, the key to the exist() method is to remember that it returns only one of three values: 1, 0, or NULL.
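To see the NULL case for yourself, here is a quick sketch with an XML variable that is never assigned a value (the variable names are illustrative):
DECLARE @xml3 XML;
DECLARE @exist3 BIT;
SET @exist3 = @xml3.exist('/StoreSurvey[BusinessType="BM"]');
SELECT @exist3;
Because @xml3 is NULL, the exist() method returns NULL rather than 0.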

The nodes() method

The nodes() method can be a bit more slippery to understand than the other XML methods. To begin with, rather than returning XML or scalar values, the nodes() method returns what is essentially a table that includes one column. That means you should use the method only in those parts of a statement that can handle rowset views, such as the FROM clause. It also means that, when you call the nodes() method, you must assign a table alias and column alias to the rowset view returned by the method, as shown in the following syntax:
DbObject.nodes('XQuery') AS TableAlias(ColumnAlias)
The table and column aliases let you reference the rowset view in other parts of the statement. The method is useful when you want to shred an XML document, that is, decompose the document into a relational format. To better understand how this works, let’s look at an example that uses the nodes() method on an XML variable:
DECLARE @bikes XML
SET @bikes =
  '<Products>
    <Product>Mountain</Product>
    <Product>Road</Product>
    <Product>Racing</Product>
   </Products>'
SELECT
  Category.query('./text()')
    AS BikeTypes
FROM
  @bikes.nodes('/Products/Product')
    AS Bike(Category);
First, I declared the @bikes variable with the XML data type. Then I set its value to equal the XML fragment. In the FROM clause, I invoked the nodes() method on the variable to shred the <Product> elements. My goal was to retrieve the value from each instance of that element. I then provided a table alias (Bike) and a column alias (Category) to identify my rowset view. Now I essentially have a table with a single column that contains a row for each <Product> element.
When you use the nodes() method to create a rowset view of the XML data, you have to use one of the other XML methods to retrieve the contents of that view. In this case, I used the query() method in the SELECT list to return the value of each row. Notice that to retrieve the value I used a period to reference the current context node and the text() node function to retrieve only the element values, as shown in the following results:
Mountain
Road
Racing
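Alternatively, you could use the value() method against the rowset view to return each product as a varchar value rather than as XML. A minimal sketch, run in the same batch as the @bikes declaration above:
SELECT
  Category.value('.', 'varchar(25)')
    AS BikeTypes
FROM
  @bikes.nodes('/Products/Product')
    AS Bike(Category);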
If you want to use the nodes() method to retrieve data from a table, you can use the CROSS APPLY operator in the FROM clause to associate the table with the rowset view returned by the method, as shown in the following example:
SELECT
  Category.query('./text()')
    AS BikeTypes
FROM
  Stores CROSS APPLY
  Survey_untyped.nodes('/StoreSurvey/Products[@Type="Bikes"]/Product')
    AS Bike(Category);
In this case, I specified the XQuery path as /StoreSurvey/Products[@Type="Bikes"]/Product in order to return only the bike-related products from the Survey_untyped column. However, the SELECT list itself is the same as in the preceding example, so this statement returns the same results.
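One benefit of shredding table data this way is that you can return ordinary relational columns alongside the shredded values. Here is a sketch, assuming the same Stores table and its StoreID column:
SELECT
  StoreID,
  Category.value('.', 'varchar(25)')
    AS BikeType
FROM
  Stores CROSS APPLY
  Survey_untyped.nodes('/StoreSurvey/Products[@Type="Bikes"]/Product')
    AS Bike(Category);
Each bike product is returned in its own row, along with the StoreID of the store it belongs to.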

The modify() method

The methods we’ve looked at up to this point have been concerned only with reading data in an XML document, but the modify() method lets you update that data. As the following syntax shows, the only argument you pass into the modify() method is the XQuery expression:
DbObject.modify('XQuery')
In this case, however, the XQuery expression is actually a special type of expression that uses the XML Data Modification Language (XML DML), which is a SQL Server extension to XQuery. The modify() method lets you define XML DML expressions that can add, update, or delete elements within an XML document.
NOTE: This section covers only the basics of the modify() method. You can find more information about the method and its uses in the XML Data Modification Language Workbench, which provides additional details and examples about modifying XML data.

Adding components to XML data

To add components to your XML data, you must specify the insert keyword and target location in your expression. For instance, the following UPDATE statement adds the <Comments> element to the typed and untyped XML columns in the Stores table:
UPDATE Stores
SET Survey_untyped.modify('
  insert(<Comments>Largest bike store in region</Comments>)
  after(/StoreSurvey/NumberEmployees)[1]'),
  Survey_typed.modify('declare namespace ns=
  "http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/StoreSurvey";
  insert(<ns:Comments>Largest bike store in region</ns:Comments>)
  after(/ns:StoreSurvey/ns:NumberEmployees)[1]')
WHERE StoreID = 292;
For the Survey_untyped column, I first specified the modify() method name, followed by the XML DML expression, which I enclosed in parentheses and single quotes, just like the other XML methods. Within the expression, I included the insert keyword and the element I wanted to add: (<Comments>Largest bike store in region</Comments>). Notice that I enclosed the element in parentheses and included the element’s value.
Next, I added the after keyword and specified the location of where to add the new element. I also included the [1] because the method requires a singleton value. Now the new element will be added after the first instance of the /StoreSurvey/NumberEmployees element.
As you would expect, modifying the Survey_typed column follows the same process, except that I also included the namespace declaration and aliases. An important issue to consider when working with typed XML columns, however, is that any changes you make must conform to the schema that governs the XML content. For example, if I had tried to add the <Comments> element at any other location within the document, I would have received a schema violation error. As long as you’ve set up your statement properly and adhered to the schema, the new element should be added to the XML document with no problem.
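Note that after is not the only placement option; XML DML also supports the before, as first into, and as last into keywords. For example, here is a sketch that would instead append the element as the last child of <StoreSurvey> (using the untyped column, where no schema constrains the position):
UPDATE Stores
SET Survey_untyped.modify('
  insert <Comments>Largest bike store in region</Comments>
  as last into (/StoreSurvey)[1]')
WHERE StoreID = 292;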

Updating components in XML data

To use the modify() method to update XML data, you must include the replace value of keywords, rather than the insert keyword. You must then specify the component you want to update as well as the updated information. For example, in the following UPDATE statement, I change the value of the <Comments> element:
UPDATE Stores
SET Survey_untyped.modify('
  replace value of (/StoreSurvey/Comments/text())[1]
  with "2nd largest bike store in region" ')
WHERE StoreID = 292;
After I specified the replace value of keywords, I added the element path along with the text() node function, which let me change only the element’s value (without affecting the element itself). And as with the previous example, I also included a [1] because a singleton value is expected. Next, I specified the with keyword and the new element value, enclosed in double quotes. As a result, when I ran the statement, it replaced the old element value with the new one.
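If the replacement text lives in a Transact-SQL variable rather than a literal, you can pass it into the expression with the sql:variable() extension function. A minimal sketch (the variable name is illustrative):
DECLARE @newComment varchar(50);
SET @newComment = '2nd largest bike store in region';
UPDATE Stores
SET Survey_untyped.modify('
  replace value of (/StoreSurvey/Comments/text())[1]
  with sql:variable("@newComment")')
WHERE StoreID = 292;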

Deleting components from XML data

To use the modify() method to remove data from an XML document, you must specify the delete keyword, followed by the component you want to delete, as shown in the following example:
UPDATE Stores
SET Survey_untyped.modify('delete(/StoreSurvey/Comments)[1]')
WHERE StoreID = 292;
In this case, it was simply a matter of specifying the path to the <Comments> element, in parentheses, after the delete keyword. Of course, all this was followed by [1] to keep the database engine happy.
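The delete keyword works on any node type, not just elements. For instance, here is a sketch that removes the Type attribute from the second <Products> element (again, purely illustrative):
UPDATE Stores
SET Survey_untyped.modify('delete (/StoreSurvey/Products/@Type)[2]')
WHERE StoreID = 292;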

The XML Methods

The methods available to the XML data type provide you with a set of powerful tools for working with XML data. And as you can see from the examples, most of that power rests in your ability to create XQuery expressions that target the information you want to access. Yet the expressions shown here are relatively basic when compared to how extensive the XQuery language is. In fact, to make the most of what the XML methods offer, you must invest the time necessary to understand the various elements that make up that language. Until then, what I’ve shown you here should provide you with the first steps necessary to start accessing and updating your XML data. Just know that there’s a much bigger universe out there waiting to be discovered.