20091113

Recursive Common Table Expressions

Problem:  I need to load a lot of dummy data into a table and time is limited.

Solution:  Common Table Expressions (CTEs) used in a recursive fashion.  Here is a quick demo of how CTEs can be used recursively:


Example:

SET NOCOUNT ON

-- Test table to hold the dummy data
CREATE TABLE [sqlCTE]([field_bigint] [bigint] NULL) ON [PRIMARY]
GO

-- Method 1: WHILE loop, one INSERT per row
DECLARE @s DATETIME
DECLARE @i int

SET @i = 0
SET @s = getdate()
WHILE (@i < 1024032)
BEGIN
      INSERT INTO [sqlCTE]([field_bigint]) VALUES (@i)
      SET @i = @i + 1
END
PRINT datediff(ss,@s,getdate())
GO

TRUNCATE TABLE sqlCTE
GO

-- Method 2: recursive CTE, a single set-based INSERT
DECLARE @s DATETIME
SET @s = getdate();
with n (j)
as
(
select j = 0                                -- anchor member
union all
select j = j + 1 from n where j < 32000     -- recursive member: generates 0 through 32,000
)
INSERT INTO [sqlCTE]([field_bigint])
select na.i
from ( select i = j + k
   from n
   cross join
   ( select k = j * 32001                   -- 32 offset blocks of 32,001 values each
      from n
      where j < 32
   ) n2
) na
option (maxrecursion 32000)
PRINT datediff(ss,@s,getdate())
GO

Results (elapsed seconds):
297
5

So, to load 1,024,032 records:

1)      With a WHILE loop: 4 minutes 57 seconds
2)      With a recursive CTE: 0 minutes 5 seconds

(The row count works out because the CTE generates the numbers 0 through 32,000, which is 32,001 rows, and the cross join against 32 offset blocks of 32,001 yields 32,001 x 32 = 1,024,032 values.)

You do the math.

Thanks to my colleague who introduced me to recursive CTEs, and to Nigel Rivett of Simple-Talk (http://www.simple-talk.com/) for the thorough explanation of the CTE structure (http://www.simple-talk.com/content/print.aspx?article=260).
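
For quick reference, the basic shape of a recursive CTE looks like this (a minimal sketch with illustrative names, separate from the demo above):

WITH counter (n) AS
(
    SELECT 0                        -- anchor member: the starting row(s)
    UNION ALL
    SELECT n + 1 FROM counter       -- recursive member: references the CTE itself
    WHERE n < 9                     -- termination condition
)
SELECT n FROM counter
OPTION (MAXRECURSION 100)           -- server-side recursion limit (100 is the default)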

Also, check out the SQL Server Template Explorer (Ctrl+Alt+T in SSMS) in the “Recursive Queries” folder for more examples from Microsoft.



20091028

Rounding issues with OLE DB Destination in SSIS

Problem:  Your MS Excel source data holds a value, 12.3456789.  You need to INSERT this value into a SQL Server destination table with a column of type decimal(19,2).  The desired result is 12.35; however, when you use the SSIS Data Flow Task with the OLE DB source and destination components, your result is 12.34.  By default, the underlying code in the SSIS task truncates the remaining decimal places rather than rounding.


Troubleshooting:  To keep things simple, I decided to use a SQL Server Table source instead of an Excel source for this example (the core issue remains the same and we’ll assume that we cannot change the way we query the source data; for example, “SELECT ROUND([Amount], 2) FROM <table or worksheet>…”).  So, first, let’s create two tables.  One table representing the source data:

CREATE TABLE [dbo].[tbl_Source]([Amount] [float] NULL)
INSERT INTO [dbo].[tbl_Source]([Amount]) VALUES (12.3456789)

…and one table representing the destination data:

CREATE TABLE [dbo].[tbl_Destination](
      [Test_ID] [int] IDENTITY(1,1) NOT NULL,
      [Amount] [decimal](19, 2) NOT NULL
)

Now let’s manually execute an INSERT statement:

INSERT INTO [dbo].[tbl_Destination]([Amount])
SELECT [Amount] FROM [dbo].[tbl_Source]

The results are (Test #1):

(screenshot: Amount = 12.35)

Somewhere behind the scenes, the SQL Server database engine is implicitly handling the rounding for us.  This is the behavior I was expecting out of my SSIS package:

(screenshot: the SSIS Data Flow, an OLE DB Source feeding an OLE DB Destination)

…here are the results produced from the steps above (Test #2):

(screenshot: Amount = 12.34)

…12.34 is not what I was expecting!  I was running Profiler while the package above executed, and here is the important statement:

insert bulk [dbo].[tbl_Destination]([Amount] decimal(19,2))with(TABLOCK,CHECK_CONSTRAINTS)

Who has used THAT statement before?  And where in the world is the source data coming from?  Apparently "INSERT BULK" is an undocumented feature, so let's try changing the "Data Access Mode" of the OLE DB Destination task from "Table or view – fast load" to just "Table or view":

(screenshot: the OLE DB Destination Editor with Data access mode set to "Table or view")

The results are the same (Test #3):

(screenshot: Amount = 12.34 again)

…however, the key statements that I gathered from Profiler are:

INSERT [dbo].[tbl_Destination]([Amount]) VALUES(@Param000004)
exec sp_cursor 180150003,4,0,N'[dbo].[tbl_Destination]',@Amount=12.34

So, the process was definitely different on the back end, but the results were the same.  I looked into the SQL Server Destination component, but that would not implicitly convert the float to decimal(19,2).  I also looked into the Data Conversion component, but it would not let me specify the precision parameter for the decimal data type.
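
The difference comes down to rounding versus truncation when the float is converted to decimal(19,2), which is easy to reproduce in plain T-SQL (a quick sketch using the same value as the source table):

SELECT CAST(12.3456789E0 AS decimal(19,2))   -- returns 12.35: the conversion rounds
SELECT ROUND(12.3456789E0, 2, 1)             -- returns 12.34: a non-zero third argument tells ROUND to truncate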

Solution:  So then I decided to look into expressions.  The expressions on the existing Source and Destination Data Flow components would not do what I needed them to do.  The Derived Column component would:

(screenshot: the Data Flow with a Derived Column component added between the source and the destination)

…and the properties window for the Derived Column (Expression = ROUND(Amount, 2)):

(screenshot: the Derived Column Transformation Editor with Expression = ROUND(Amount, 2))

…and the result (Test #4):

(screenshot: Amount = 12.35)

This wasn't too hard to figure out and probably not worth blogging about, but I'm still new to SSIS, had some idle time, and writing all this down helps me learn, so there you go!  Feel free to show me other ways to accomplish this, or ask if you have any questions.

20091022

Demystifying the SSIS PIVOT and UNPIVOT Data Flow Tasks

From a high-level, general point of view:

PIVOT:  turns ROWS into COLUMNS
UNPIVOT:  turns COLUMNS into ROWS

From a graphical point of view:
PIVOT:

UNPIVOT:


From another graphical point of view:
PIVOT:


UNPIVOT:


From a logical point of view:
PIVOT:  If you need to summarize or group your row data, most likely you will need to use the PIVOT task.

UNPIVOT:  If you have a set of horizontal columns like [Ham, Soda, Milk, Beer, Chips, etc.] or [Period 1, Period 2, Period 3, etc.] or [Joe, Mike, Kelly, Pat, etc.] or [New York, Atlanta, Los Angeles, San Francisco, etc.] and you would like to list these values vertically in rows, UNPIVOT is probably the task for you.

ONE IMPORTANT NOTE:  The Pivot Data Flow Item does NOT aggregate data.  Therefore, to accomplish the first pivot operation above, you will need to use the Aggregate Data Flow Item to aggregate the data before sending the dataset to the Pivot Data Flow Item.
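
In T-SQL terms, the Aggregate task plays roughly the role a GROUP BY plays in a query: it collapses the detail rows down to one row per Set Key / Pivot Column combination before the pivot happens.  A rough sketch (the table and column names here are made up for illustration):

-- Summarize first (the Aggregate task's job in the Data Flow)...
SELECT [Name], [Category], SUM([Amount]) AS [Amount]
FROM [dbo].[tbl_SalesDetail]
GROUP BY [Name], [Category]
-- ...and only then feed the summarized rows to the Pivot task.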


SSIS Example (with actual data shown):

PIVOT operation:


UNPIVOT operation:


Same SSIS Example (with properties shown):


PIVOT operation (2 pics):



UNPIVOT operation:



The trickiest part is setting up the properties of the Pivot Data Flow Task:



For the input columns, the most important property is PivotUsage.  Here are the possible values and what they mean:

0 - This column is passed straight through.  (It will still be a column in the output; there are none in this example.)
1 - "Set Key".  The distinct values in this column (or columns) become the rows in the output.  (In this example, "Name" is the Set Key.)
2 - "Pivot Column".  The VALUES in this column become columns in the output, as determined by you in the Output Columns.
3 - "Values".  The values in this column end up at the intersection of the rows and columns you just set up.

Microsoft reference: http://msdn.microsoft.com/en-us/library/ms140308.aspx

If you have six columns that you want displayed in your output, then you'd create six columns here.  In this example, we only want 4 ("Name", "Category X", "Category Y", "Category Z"):





You must map each output column to an input column using the input's LineageID.  You put the LineageID into the output column's SourceColumn.  The PivotKeyValue identifies which distinct value in the input's Pivot Column (PivotUsage = 2) will serve as the "key" for this column (in this case it's "x" for the "Category X" column).
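
For the T-SQL minded, here is roughly what the two operations correspond to in plain SQL (a sketch only: the table names and the "Amount" column are made up, while "Name" and the x/y/z category values mirror the example above; note that T-SQL's PIVOT requires an aggregate function, whereas the SSIS Pivot task expects the data to be aggregated beforehand):

-- PIVOT: the distinct Category values x, y, z become the columns
SELECT [Name],
       [x] AS [Category X],
       [y] AS [Category Y],
       [z] AS [Category Z]
FROM ( SELECT [Name], [Category], [Amount]
       FROM [dbo].[tbl_Summary] ) AS src
PIVOT ( SUM([Amount]) FOR [Category] IN ([x], [y], [z]) ) AS pvt

-- UNPIVOT: the Category X/Y/Z columns go back to being rows
SELECT [Name], [Category], [Amount]
FROM [dbo].[tbl_Pivoted]
UNPIVOT ( [Amount] FOR [Category] IN ([Category X], [Category Y], [Category Z]) ) AS unpvt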


Clear as mud?

20091012

Hello world!

Thanks for visiting!  Here you will find BI posts from a longtime .NET developer newly turned Data Architect.  I hope you will find answers to your BI-related questions.