Sunday, June 6, 2010

How to prepare sample data

Recently someone on one of SQL Server forums had a question how to prepare sample data so users can work on database restored from production. The requirement was that testers must not know the real data, names, addresses, emails and so on. On the other hand, users find very hard to work with totally random strings, like xwzr as first name, for example.
There are several tools on the market that help to prepare sample data. One of them is Visual Studio Database Edition, the other is SQL Data Generator from RedGate. I am sure there are many more.
The drawback of the above two excellent products is that they are not free. In some cases this is a major obstacle, because bosses are somehow not as likely to spend money for your tools (and toys) as you would like them to be.
So sometimes you have to implement sample data generator yourself. As it turns out, it is not too complicated (especially for simple cases).
Let’s assume that we want to replace all names in Person.Contact table in AdventureWorks database.
There are 19972 rows in this table in my database. I would like to replace all FirstName, MiddleName and LastName values with values that I prepare. Obviously, it is not so easy to come up with 20000 other names which don’t belong to Klingon language.
But if you think about this, 20000 is 50 x 40 x 10. If you have one fifty first names, forty last names and ten middle names, you can create 20000 unique combinations of these. You can easily find lists of names if you search, but for example this page contains all names that we might need to populate 20000 rows.
For this example, I created three tables, each for separate list of names:
   1: create table #RandomLastNames(RandomName nvarchar(50))
   2: create table #RandomFirstNames(RandomName nvarchar(50))
   3: create table #RandomMiddleNames(RandomName nvarchar(50))


I populated them using the names I found on the page I linked before. For middle names, I just typed in several initials and a NULL value, as NULLs are present in MiddleName column in Person Contact.

   1: insert #RandomMiddleNames (RandomName)
   2: select N'J.' union all
   3: select N'T.' union all
   4: select N'R.' union all
   5: select N'A.' union all
   6: select N'C.' union all
   7: select N'M.' union all
   8: select N'W.' union all
   9: select N'D.' union all
  10: select N'S.' union all
  11: select N'K.' union all
  12: select N'Z.' union all
  13: select NULL


To get number of combinations of the data you just select count from cartesian product of all three tables:

   1: select count(*) [No of combinations] from
   2: #RandomFirstNames cross join 
   3:     #RandomLastNames cross join 
   4:         #RandomMiddleNames

The rest is easy. To preview how data you have will be replaced you run this query:


   1: select cn.ContactID, cn.FirstName + N' ' + isnull(cn.MiddleName + N' ', N'') + cn.LastName OriginalData,
   2: (select top 1 RandomName from #RandomFirstNames where cn.ContactId = cn.ContactId order by newid()) FirstName, 
   3: (select top 1 RandomName from #RandomMiddleNames where cn.ContactId = cn.ContactId order by newid()) MiddleName, 
   4: (select top 1 RandomName from #RandomLastNames where cn.ContactId = cn.ContactId order by newid()) LastName
   5: into #DataMapping
   6: from Person.Contact cn

To actually update the data just run this update:

   1:  
   2: update Person.Contact set
   3: FirstName = b.FirstName,
   4: MiddleName = b.MiddleName,
   5: LastName = b.LastName
   6: from Person.Contact a inner join 
   7:         #DataMapping b on a.ContactId = b.ContactId

And that’s it! You can modify the script I attach to suit your needs, you can modify other data the same way, like emails, address lines etc.

Sample code download here.


No comments:

Post a Comment