Saturday, August 21, 2010

Moving blog… again

Hi

It’s been a while since I have written here. The main reason is that it is summer and I was on vacation, not very keen on sitting in front of a computer.

The other reason is that I have decided to move the blog again. Unfortunately, the visibility of this blog is not great, even though I think I have written a few posts that are pretty interesting – or so say those who read them :). I made quite a bit of effort to increase the blog’s visibility, with great help from Brent Ozar. The main problem, however, is still the near-total absence of my posts from Google search results. This makes the life of my posts very short, and I feel they could do better.

Thankfully, Adam Machanic emailed me asking if I would consider setting up a blog on SQLBlog.com. I must say I found this very rewarding, because the bloggers on SQLBlog.com are among the finest SQL Server experts. I have decided to move my blog to SQLBlog.com, and I hope the content I publish there will serve more developers looking for knowledge. So please repoint your RSS readers to my new blog: http://sqlblog.com/blogs/piotr_rodak

See you there!

Thursday, July 22, 2010

With head in the clouds

I write this post inspired by an excellent article by Jeremiah Peschka (blog) about the future of databases. I think it is a good topic on which to write something, I hope, meaningful.

Jeremiah made several good observations about the nature of current database designs and technologies, and about what seems to be the upcoming technology of the future – name-value (key-value) stores.

I am sometimes faced with opinions of the sort “Dude, these relational databases are sooo gone!”. Are they? Let’s see what a good “NoSQL” cloud database should have:

An obviously trendy key-value collection. This is usually implemented with a hash table, where the key is transformed into a numeric value, and this value is used to index the table that holds the actual data values. To add a value, you calculate the hash of the key, resolve any collision and insert the value at the appropriate index of the table. The other option is to use a tree structure, where nodes are balanced according to the key value and each node contains the actual data. To access data, you traverse the tree based on the key value. When you insert a value, you have to rebalance the tree (if you use a balanced tree) to maintain predictable access time to the data.
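
Just to make the hash-table idea concrete, here is a toy sketch in T-SQL (the table and key names are made up), where a persisted CHECKSUM column stands in for the hash and collisions are resolved by comparing the actual key:

--a toy key-value table: KeyHash plays the role of the hash bucket
create table dbo.tKeyValueStore
(
    KeyName varchar(100) not null primary key,
    KeyHash as checksum(KeyName) persisted,
    Value varchar(max) null
)
create index IX_tKeyValueStore_Hash on dbo.tKeyValueStore(KeyHash)
go
--lookup: hash the key first, then resolve possible collisions by comparing the key itself
declare @key varchar(100) = 'customer:42'
select Value
    from dbo.tKeyValueStore
    where KeyHash = checksum(@key) and KeyName = @key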

In some cases you want to retrieve a collection of objects in a specific order. This can be done in two ways – either you sort the data each time the collection is requested, or you maintain a doubly linked list of keys that lets you move between nodes in the required order. Ah, those good old *prev and *next pointers. The first option costs you overhead each time you retrieve the data, but doesn’t necessarily consume much memory. The second option requires additional memory and management when adding or removing values; if you change a key value, the list has to be updated so the elements are returned in the proper order.
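
The trade-off is easy to see in SQL Server terms. A rough sketch (the table name is made up): the clustered index keeps rows chained in key order, so ordering by the key needs no sort operator, while ordering by an unindexed column pays the sorting cost on every query:

create table dbo.tOrderedDemo (KeyValue int not null primary key clustered, Payload char(100) not null)
go
insert dbo.tOrderedDemo(KeyValue, Payload)
    select top (10000) row_number() over(order by (select null)), 'x'
    from sys.all_columns a cross join sys.all_columns b
go
select KeyValue, Payload from dbo.tOrderedDemo order by KeyValue --follows the page chain, no sort
select KeyValue, Payload from dbo.tOrderedDemo order by Payload  --has to sort at query time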

Believe it or not, even clouds touch the ground sometimes. Even cloud data has to be persisted at some point. How do you do it? You can use either a proprietary binary file structure (for performance) or XML – that’s pretty much it. When you write the data, you have to ensure that no other process is writing it at the same time – concurrency control. You have to ensure that while data is being written, other processes can still read a consistent view of it. There’s another thing – you don’t always write all the data you have in memory; you should be able to persist only the changes, identifying the changed values on disk and updating, inserting or deleting them accordingly. And if you want to keep performance up, these on-disk structures have to be optimized for searching – here we have indexes.
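
Incidentally, SQL Server exposes exactly this kind of “persist only the changes” bookkeeping. A quick look (requires VIEW SERVER STATE permission):

--pages modified in memory but not yet written to disk, per database
select db_name(database_id) as database_name, count(*) as dirty_pages
    from sys.dm_os_buffer_descriptors
    where is_modified = 1
    group by database_id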

All the points above are key ingredients of a good “NoSQL” database. All of them are also implemented in SQL Server. You have balanced trees for indexes, hash tables for the plan cache, key-indexed tables for row offsets at the page level, linked lists of pages in indexes, sorting, and data persistence technology optimized for concurrency and speed. You can, if you want, access the data through an OR mapper so you deal with collections, attributes and so on.
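
If you want to see a couple of these structures for yourself, the DMVs below show the hash bucket a cached plan landed in and the depth of the B-trees behind your indexes (an illustration only – run it against a test database):

--the plan cache is a hash table: every cached plan sits in a bucket
select top (10) bucketid, usecounts, cacheobjtype, objtype
    from sys.dm_exec_cached_plans
    order by usecounts desc

--indexes are balanced trees: index_depth is the number of B-tree levels
select object_name(object_id) as table_name, index_id, index_depth, page_count
    from sys.dm_db_index_physical_stats(db_id(), null, null, null, 'DETAILED')
    where index_level = 0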

So, what are the differences? The main and most important difference is replication. SQL Server insists on storing changed data to disk first and then replicating it; cloud data stores do the opposite – they replicate data to neighbouring nodes of the network before persisting it. This change of approach was made possible by the development of technology, the growing speed of networks and the capacity of memory installed in servers. It may be cheaper and faster to throw a few gigabytes over a fast network to the other side of the world than to store them in a local SAN. The idea relies on statistics – it is very unlikely that servers located in several places around the world will all fail at the same time. The SQL Server approach stems from older times, when servers were mostly standalone machines that had to ensure that, no matter what, the data would be available after disaster recovery.

Cloud applications often use traditional databases as their back ends in various locations, for the reasons I described above. Traditional databases offer a transactional and safe way of persisting data, which has to happen at some point.

I hope this article sheds some light on the internal implementation of cloud databases, and especially on the fact that all the data structures and algorithms involved were invented many years ago, some of them 50 years ago. It is the power of technology and money that made cloud data stores possible, nothing else. There is nothing new under the sun.

Friday, July 16, 2010

Disabling audit triggers

This is a follow-up to my posts about audit triggers, which I wrote some time ago on my previous blogging platform. The first post discussed the COLUMNS_UPDATED() function, which is very useful in triggers when you want to find out which columns are affected by a DML operation. The second post showed how you can create an audit framework that is subsequently used to log what happens to data when users modify it.

Now, there are situations when you want to make changes to a table but you don’t want these operations logged. And I am not talking about criminal activity – for example, a table may be populated daily by an ETL process that updates or inserts thousands of rows. Sometimes it doesn’t make sense to log all these operations – after all, triggers incur some performance overhead on the database.

There are several ways to disable a trigger. First, you can use the DISABLE TRIGGER statement. This approach has one drawback – when you disable a trigger, it is disabled for every user who modifies the data, so you may lose audit entries while your ETL is running. In some scenarios that is not an option.
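
For reference, the statement itself is a one-liner (the trigger and table names here are hypothetical):

--auditing is off for everybody until the trigger is re-enabled
disable trigger dbo.trgAudit_tCustomer on dbo.tCustomer
go
--... ETL load runs here ...
enable trigger dbo.trgAudit_tCustomer on dbo.tCustomer
go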

Alternatively, you can implement the trigger so that it checks for a certain condition and decides whether the DML operation should be audited or not. There are several ways of doing this. Two are described by Itzik Ben-Gan in his excellent Inside Microsoft SQL Server 2005: T-SQL Programming: a trigger can check whether a table with a specific name exists in tempdb, or it can inspect the session context. The disadvantage of these options is that you have to modify code external to the triggers, for example your stored procedures or batches.
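
A minimal sketch of the temp-table variant, just to show its shape (the table name is arbitrary):

--in the ETL batch, before the load; the marker lives only for the duration of the session
create table #skip_audit (dummy bit)

--near the top of the audit trigger
if object_id('tempdb..#skip_audit') is not null
    return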

You can also check whether there is a certain entry in a configuration table. For example, you can have a table Audit.tIgnoreUsers that contains the user names you don’t want to audit. Then a simple query decides whether the trigger should proceed or not:

if(exists (select 1 from Audit.tIgnoreUsers where UserName = suser_name()))
    return;
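
If you want to try this out, a minimal version of that configuration table could look like this (the ETLLogin value is just an example):

create table Audit.tIgnoreUsers
(
    UserName sysname not null primary key
)
go
insert Audit.tIgnoreUsers(UserName) values ('ETLLogin')
go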

Alternatively, you can check whether a particular user belongs to a particular role. To do this, you have to create the role first and add the users that you want to exclude from auditing.

if not exists (select 1 from sys.database_principals where name = 'ExcludeFromAudit' and type = 'R')
    exec sp_executesql N'create role ExcludeFromAudit authorization dbo'
go

exec sp_addrolemember 'ExcludeFromAudit', 'ETLLogin'
go

raiserror('Logins skipped in auditing: ', 10, 1) with nowait;
exec sp_helprolemember 'ExcludeFromAudit'

Then, in the audit trigger, you just add this code somewhere near the top:

if(is_rolemember('ExcludeFromAudit') = 1)
begin
    --print 'Audit skipped for user ' + suser_name()
    return
end

And that’s it. I find this approach easier to manage, as users typically do not have rights to modify roles. It is good practice to give the ETL process a login separate from other logins, so you can audit every action that, for example, front-end users perform, while the ETL stays almost unaffected.

Please note that if you happen to be a sysadmin and want to test is_rolemember() on a user-defined role, it will always return 0. Quite confusing, I know. There is a Connect item for the IS_MEMBER function, which behaves in a similar way, by design.

Thursday, July 8, 2010

INSTEAD OF triggers - Using views in SSIS

This is the last post in the series about updateable views. The previous posts (Updateable views – how to use them and INSTEAD OF triggers on views) described the mechanics of the views and triggers you can use. What I haven’t answered yet is the question: where can such views and triggers be useful?

A common issue developers run into when building SSIS packages is that there is no built-in destination transformation that inserts rows when they don’t exist in the destination table and updates the rows that do. Usually developers use a Lookup and split the data flow between an OLE DB Destination and an OLE DB Command. Adam Haines (blog) wrote a very nice post about using INSTEAD OF triggers on views to improve this solution.

The other scenario I find quite interesting is logging rows that fail to be inserted into the destination table. Usually the failure occurs because of constraint violations – an invalid value in a column with a CHECK constraint, or a NULL where it shouldn’t be.

BCP allows invalid rows to be output to a separate file, and the same approach is sometimes used in SSIS solutions: you can create a file destination to save the rows that fail to be inserted into the table. I see a few issues with this approach. If you use a CSV file, you lose the distinction between NULL and empty fields. If you use a raw binary file, you can’t really investigate its contents. There is also all the hassle with destinations, file locations, disk space, access and so on.

I’ve been working with SSIS for a few years, and in the past we came up with the idea of writing a custom component in C# that would be used in SSIS script components at the end of data flows, connected to the final destinations with the red (error output) arrows. This worked pretty well. In the ProcessInputRow handler the row would be passed to the component, which would create XML from it using reflection. The XML would subsequently be saved in a table in the database, making it easy to query and troubleshoot.

Well, this approach has a few drawbacks as well: you need to deploy and maintain the component’s code, the error handling is processed row by row, and reflection does not help performance, especially when the upstream data is completely wrong and all rows have to be redirected to the audit output.

I recently began a new project and had to approach the question of auditing failed rows once more. I didn’t want to use the component, because of the problems I described above. Then I realized that I can use SQL Server itself to convert the data coming from the package into XML. Let’s look at how it works.

First, let’s create a test environment. We will create a package that moves data from one database to another. The following snippet creates the source and destination databases and the table that contains the source data.

use master
go
create database SSISTestSourceDB
go
create database SSISTestDestinationDB
go

use SSISTestSourceDB
go
create table tSource(Col1 int, Col2 int null, Col3 varchar(20), Col4 datetime)
go
insert tSource(Col1, Col2, Col3, Col4)
values 
(1, 1, 'Value1', CURRENT_TIMESTAMP),
(2, null, 'Value1', CURRENT_TIMESTAMP), --for null value error
(3, 3, 'DB', CURRENT_TIMESTAMP), --for min length check constraint violation
(4, 4, 'Value1', CURRENT_TIMESTAMP),
(5, 5, 'Value1', CURRENT_TIMESTAMP),
(6, 100, 'Value1', CURRENT_TIMESTAMP), --for out of range check constraint violation
(7, 7, 'Value1', CURRENT_TIMESTAMP),
(8, 8, 'Value1', CURRENT_TIMESTAMP),
(9, 9, 'Value1', CURRENT_TIMESTAMP),
(10, 10, 'Value1', CURRENT_TIMESTAMP), --two tens for pk violation on destination
(10, 11, 'Value1', CURRENT_TIMESTAMP),
(11, 12, 'Value1', CURRENT_TIMESTAMP)
go

Now let’s create the table in the destination database:

use SSISTestDestinationDB
go
create schema Staging authorization dbo
go
create table Staging.tDestination 
(
    Col1 int not null constraint PK__1 primary key clustered,
    Col2 int not null constraint CHK__Col2_Range check (Col2 < 30),
    Col3 varchar(20) not null constraint CHK__MinLenGreaterThan2 check (len(Col3) > 2),
    Col4 datetime not null
) 
go

The Audit table will store the audit data:

--table that will accept errors from the pack (the Audit schema has to exist first)
create schema Audit authorization dbo
go
create table Audit.tAuditETLErrors
(
    [ErrorCode] [bigint] NULL,
    [ErrorColumn] [bigint] NULL,
    [ErrorDescription] [nvarchar](1000) NULL,
    [ErrorData] [xml] NULL, --actual row data
    [UpdateDate] [datetime] not null constraint DF_UpdateDate1 default (getdate())
)
go

As you can see, the table is pretty simple: ErrorCode and ErrorColumn come directly from the data flow in the SSIS package, ErrorDescription is added to provide a more meaningful error description, and ErrorData is an XML column that contains the serialized audited row.

Now the main part of the idea: a view with an INSTEAD OF INSERT trigger on it.

--create error handling view and trigger
create view Audit.vDestinationSSISError
as
select convert(bigint, 0) ErrorCode, 
        convert(bigint, 0) ErrorColumn, 
        convert(nvarchar(1000), N'') ErrorDescription, 
        dst.* 
    from Staging.tDestination dst
    where 1 = 0 --this view doesn't have to return any data
go
create trigger Audit.trgins_vDestinationSSISError on Audit.vDestinationSSISError
instead of insert
as
begin
    set nocount on
    --insert new row to audit table. actual data are converted to XML
    insert Audit.tAuditETLErrors (ErrorCode, ErrorColumn, ErrorDescription, ErrorData, UpdateDate)
    select ErrorCode, ErrorColumn, ErrorDescription, 
    (select * from inserted 
        for xml raw, elements xsinil, root('etl_error')), current_timestamp
    from inserted
end
go

The view doesn’t return any data, because in this case it exists only to host the trigger. The view has to expose the ErrorCode, ErrorColumn and ErrorDescription columns, plus all the columns of the destination table if you want to audit every column.

The trigger uses a SELECT … FOR XML query to serialize the incoming data into an XML stream that is eventually inserted into the ErrorData column.
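
Before wiring up the package you can sanity-check the view and trigger straight from Management Studio; the values below are made up, but the row should land in Audit.tAuditETLErrors serialized as XML:

insert Audit.vDestinationSSISError (ErrorCode, ErrorColumn, ErrorDescription, Col1, Col2, Col3, Col4)
    values (0, 0, N'manual test row', 99, 100, 'XY', current_timestamp)
go
select ErrorCode, ErrorDescription, ErrorData from Audit.tAuditETLErrors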

The package is built in a pretty simple way. There is an OLE DB source and an OLE DB destination using the ‘fast’ bulk load option. Then there is a ‘normal’ OLE DB destination – for the batches that were rolled back because of errors in the ‘fast’ destination.

screenshot47

Then, rows that didn’t make it into the destination table are redirected to an additional destination – the view:

 screenshot51

The script component before the final destination adds ErrorDescription to the data flow:

screenshot49

screenshot50

There’s a catch: you are better off using an ADO.NET destination in the package, because the OLE DB destination is unaware of the triggers on the views. While you can configure an OLE DB destination to use the view, the rows will fail to be inserted for the same reasons they weren’t inserted in the first place. You can work around this by creating a table with exactly the same schema as the destination table but without any constraints, but that approach is a bit awkward and you have to keep the tables in sync. I found it easier to use the ADO.NET destination. If you use a connection string stored in package configurations, just make sure you use a .NET connection string for this destination, otherwise it will not work.

screenshot48

You map the columns the normal way:

screenshot52

This is what you’ll see when you run the package:

screenshot56

Of the 12 rows read from the source, three were inserted using bulk insert and nine went to the ‘slow’ destination. There, rows were inserted row by row, and the rows that failed were redirected to the error output – you can see there were four of them. Then the script component adds the ErrorDescription column and the rows are inserted into the view. When you select from the Audit.tAuditETLErrors table you will see all four of them:

screenshot57
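
And because the audited rows are stored as XML, getting them back in tabular form is a single query. A sketch using the column names from this example (FOR XML RAW produces <row> elements under the etl_error root):

select e.ErrorCode, e.ErrorDescription, e.UpdateDate,
        r.value('(Col1/text())[1]', 'int') as Col1,
        r.value('(Col2/text())[1]', 'int') as Col2,
        r.value('(Col3/text())[1]', 'varchar(20)') as Col3,
        r.value('(Col4/text())[1]', 'datetime') as Col4
    from Audit.tAuditETLErrors e
    cross apply e.ErrorData.nodes('/etl_error/row') x(r)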

I find this setup pretty useful – I wonder what you think.

I attach the test script and the test SSIS project so you can go straight to the fun part.

Monday, June 28, 2010

INSTEAD OF triggers on views

Last time I wrote about updateable views – a feature that many database architects and developers don’t consider when they implement a data interface. Updateable views have some limitations that make them useful only in specific situations. The two most important limitations, in my opinion, are that you cannot update columns from more than one table in a single UPDATE statement, and that derived or calculated columns cannot be updated directly.
Luckily, there is a way to bypass these limitations by using INSTEAD OF triggers on views. As the name implies, these triggers are fired instead of the DML statement, so you have to implement the effective operation within the trigger. If the trigger body is empty, no update, insert or delete will be made.
Let’s look at the Person.vContact2 view again. The view returns the columns ContactId, NameStyle, Full Title, FirstName, MiddleName, LastName and several others. The Full Title column returns custom information derived from the underlying Title column of the Person.Contact table. You cannot update the Full Title column directly, because it doesn’t exist in Person.Contact; if you want to change the information returned by the view, you have to modify the Title column in the table. Of course, you can update the table directly or via a stored procedure, and in most cases that would be the preferred way, but in some cases updating the view has its benefits. You may also have a requirement that the tables must not be updateable directly and that views are the only interface you can use to modify the data.
So, to change the Full Title column in the view, you have to modify the Title value appropriately. As you remember, the definition of the view is as follows:
   1: ALTER view [Person].[vContact2]
   2: as
   3: select ContactID, NameStyle, 
   4: case when Title = 'Mr.' then 'Very long title 1'
   5:     when Title = 'Ms.' then 'Very long title 2'
   6:     else 'Other title' end [Full Title], 
   7: FirstName, MiddleName, LastName, Suffix, EmailAddress, EmailPromotion, Phone, PasswordHash, PasswordSalt, rowguid, ModifiedDate
   8: from Person.Contact

Lines 4-6 contain the definition of the Full Title column. In my AdventureWorks database, the view returns ‘Other title’ for the contact I am going to modify; I want to change it to ‘Very long title 1’.

You can create triggers for INSERT, UPDATE and DELETE statements. One way is to create a single trigger for all three types of operation; the other is to create a separate trigger for each type. The latter approach simplifies the logic within each trigger a bit, at the expense of maintainability – you have three database objects to worry about instead of one.
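
For comparison, the skeleton of the three-trigger variant would look roughly like this (a sketch with hypothetical names – you would use either this or the single combined trigger shown next, not both):

create trigger Person.trgVContact2_ins on Person.vContact2 instead of insert
as
begin
    set nocount on
    --insert logic only
end
go
create trigger Person.trgVContact2_upd on Person.vContact2 instead of update
as
begin
    set nocount on
    --update logic only
end
go
create trigger Person.trgVContact2_del on Person.vContact2 instead of delete
as
begin
    set nocount on
    --delete logic only
end
go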

This is a sample trigger that is fired for every DML operation against the view:

create trigger Person.trgVContact2 on Person.vContact2
instead of insert, update, delete
as
begin
    print 'Trigger called'
end

Now let’s call an update on the view:

select * from Person.vContact2 where ContactId = 4
update Person.vContact2 set LastName = 'NoName' where ContactId = 4
select * from Person.vContact2 where ContactId = 4

Nice. The trigger was fired, as you can see in the output. But if you look at the LastName column, it was not changed. This is because it is an INSTEAD OF trigger and it replaces the original operation. To actually update the column, you have to implement the trigger in a more useful way:

   1: if exists (select 1 from sys.objects where object_id = object_id('Person.trgVContact2'))
   2: drop trigger Person.trgVContact2
   3: go
   4: create trigger Person.trgVContact2 on Person.vContact2
   5: instead of insert, update, delete
   6: as
   7: begin
   8:     if (@@rowcount = 0)
   9:     begin
  10:         print 'No rows matching criteria'
  11:         return
  12:     end
  13:     if exists(select 1 from inserted) and exists (select 1 from deleted) --update operation
  14:     begin
  15:         update contacts set LastName = inserted.LastName
  16:         from Person.Contact contacts inner join inserted on contacts.ContactId = inserted.ContactId 
  17:     end
  18:     else if exists(select 1 from inserted) --insert operation
  19:     begin
  20:         return; --TODO: implement with useful logic
  21:     end
  22:     else --delete operation
  23:     begin
  24:         return; --TODO: implement with useful logic
  25:     end
  26: end

As you can see, this trigger allows updating only the LastName column, which is exposed by the view without any modification. As a side note, did you know that triggers fire even when no rows are affected? This was a surprise to me a few months ago. You can prevent potentially costly code from running by checking whether there is anything to process – lines 8-12 show how it can be done.

OK, what about Full Title and the other columns? This is the modified code, for the update operation only:

   1: if exists(select 1 from inserted) and exists (select 1 from deleted) --update operation
   2:     begin
   3:         if update(PasswordHash) or update(PasswordSalt)
   4:         begin
   5:             print 'Password data changes are not permitted'
   6:             return
   7:         end
   8:         update contacts 
   9:             set NameStyle = inserted.NameStyle,
  10:                 Title = case inserted.[Full Title]
  11:                             when 'Very long title 1' then 'Mr.'
  12:                             when 'Very long title 2' then 'Ms.'
  13:                             else contacts.Title --no change in this case
  14:                         end,
  15:                 FirstName = inserted.FirstName,
  16:                 MiddleName = inserted.MiddleName,
  17:                 LastName = inserted.LastName,
  18:                 Suffix = inserted.Suffix,
  19:                 EmailAddress = inserted.EmailAddress,
  20:                 EmailPromotion = inserted.EmailPromotion,
  21:                 Phone = inserted.Phone,
  22:                 ModifiedDate = current_timestamp
  23:         from Person.Contact contacts inner join inserted on contacts.ContactId = inserted.ContactId 
  24:     end

Triggers give you the opportunity to validate which columns are being updated. In certain scenarios you don’t want users to be able to modify sensitive data, and this can be achieved with triggers, among other means. In lines 3-7 of the above script you can see code preventing updates of the PasswordHash and PasswordSalt columns.

OK, so far so good. Triggers on views are one method of implementing the logic required to update data, although usually it is better to do it with stored procedures. There are scenarios, though, where triggers on views give you quite interesting ways of implementing ETL. More about this in the next post.

Monday, June 21, 2010

Updateable views – how to use them

This post is the first part of a mini-series that discusses ways of updating data through views. In this post I will discuss updateable views; in the second part there will be a short overview of INSTEAD OF triggers and the interesting possibilities they offer.

Note: all examples in this post use the AdventureWorks sample database, which you can download from CodePlex.

For simple views that select from a single table and have no derived columns, the query engine knows how to translate an insert or update statement against the view into the appropriate operation on the underlying table, and no trigger is required. The following script shows an example of such behaviour:

select * from Person.vContact1 where ContactID = 1
update Person.vContact1 set MiddleName = 'S.' where ContactID = 1
select * from Person.vContact1 where ContactID = 1


The query engine translates the update statement against the view into an update statement against the underlying table, and if you look at the execution plan of the update, you’ll see there is no difference from an update statement against the bare table:


viewupdate1


If the view is more complex, the query engine doesn’t know how to execute the operation and the update fails. There may be several different error messages, depending on the type of error. For example, if you have a calculated column in your view, you can’t update this column, and obviously you can’t insert rows into the view either. The following view returns more or less the same information as Person.vContact1, but it replaces the original Title column with a calculated [Full Title] string:


ALTER view [Person].[vContact2]
as
select ContactID, NameStyle, 
case when Title = 'Mr.' then 'Very long title 1'
    when Title = 'Ms.' then 'Very long title 2'
    else 'Other title' end [Full Title], 
FirstName, MiddleName, LastName, Suffix, EmailAddress, EmailPromotion, Phone, PasswordHash, PasswordSalt, rowguid, ModifiedDate
from Person.Contact


If you try to update the Full Title column, the update will fail:

select * from Person.vContact2 where ContactID = 1
update Person.vContact1 set [Full Title] = 'Some other title' where ContactID = 1
select * from Person.vContact2 where ContactID = 1

Msg 207, Level 16, State 1, Line 2
Invalid column name 'Full Title'.



This is obviously reasonable: the Full Title column is calculated at runtime, when the view is executed, and cannot be altered in any way because there is no storage related to this column – it doesn’t exist in the table. You can, however, update other columns in this view, because they are bound directly to columns in the table.

select * from Person.vContact2 where ContactID = 1
update Person.vContact1 set MiddleName = 'D.' where ContactID = 1
select * from Person.vContact2 where ContactID = 1

In some cases views return data from more than one table. Let’s examine such a case. The following script updates the City name in one of the tables that make up the Purchasing.vVendor view:

select * from Purchasing.vVendor where VendorId = 1
select * from Person.Address where AddressLine1 = '683 Larch Ct.'
update Purchasing.vVendor set City = 'Buenos Aires' where VendorId = 1
select * from Person.Address where AddressLine1 = '683 Larch Ct.'
select * from Purchasing.vVendor where VendorId = 1

When you run the above script, you’ll notice that the query engine was smart enough to update only the appropriate row in the Person.Address table. Note also that although the view joins eight tables, it uses inner joins only, and the query engine is able to determine the exact range of rows to update in each participating table.

There are some other conditions that have to be met to make a view updateable – for example, when you run an update query, you can modify columns from only one base table at a time. The following query fails:


update Purchasing.vVendor 
    set City = 'Buenos Aires',
    MiddleName = 'M.' 
    where VendorId = 1

Msg 4405, Level 16, State 1, Line 1
View or function 'Purchasing.vVendor' is not updatable because the modification affects multiple base tables.


Other conditions include (per MSDN) – see the example after the list:
  • The columns being modified in the view must directly reference the underlying data in the table columns. The columns cannot be derived in any other way, such as through the following:

    • An aggregate function: AVG, COUNT, SUM, MIN, MAX, GROUPING, STDEV, STDEVP, VAR, and VARP.
    • A computation. The column cannot be computed from an expression that uses other columns. Columns that are formed by using the set operators UNION, UNION ALL, CROSSJOIN, EXCEPT, and INTERSECT amount to a computation and are also not updatable.
  • The columns being modified are not affected by GROUP BY, HAVING, or DISTINCT clauses.
  • TOP is not used anywhere in the select_statement of the view together with the WITH CHECK OPTION clause.
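
For example, a view whose column comes from an aggregate falls under the first rule above – a sketch against AdventureWorks (the view name is made up):

create view Sales.vCustomerTotals
as
select CustomerID, sum(TotalDue) as SalesTotal
    from Sales.SalesOrderHeader
    group by CustomerID
go
--fails: SalesTotal is derived through an aggregate and the view uses GROUP BY
update Sales.vCustomerTotals set SalesTotal = 0 where CustomerID = 1
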
This concludes the first part of the series. In the next post I will write about triggers on views and their possible uses.