Friday, September 21, 2007

SSIS offers some interesting 'out of the box' components for obtaining data via FTP and even Web Services. However, in creating a demonstration project for a class I am teaching, I found the need to download files from a web page via HTTP. I found a cool little script on sqljunkies that did just the trick.
To further describe the problem: I am interested in downloading the weekly performance statistics for Division 1 College Football, graciously made available by the NCAA here. There are three files, one each for the schedule, offensive, and defensive statistics. I want the package to fire on a weekly basis and pull the new stats into my data mart.
As a brute-force, first-pass effort, I created package variables to represent the remote file URI, the local file name, and the local path. In the case of the offensive production statistics, these variables were named RemoteUriOffense, LocalFileNameOffense, and Extract_Home. Unfortunately, one of the downsides of the Script Task is having to create variables at the package (or container) level and explicitly declare each one as available for read access by the script. An example of that is below:
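On the Script page of the Script Task editor, the ReadOnlyVariables property takes a comma-separated list of the variable names. For the Offense extract it looks something like this:

    ReadOnlyVariables: RemoteUriOffense,LocalFileNameOffense,Extract_Home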
Within the script code itself, the variables declared as readable are then referenced by name, as illustrated below.
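I won't reproduce the sqljunkies script verbatim here, but a minimal sketch of the same approach, reading the package variables named above and using the .NET WebClient class, looks something like this:

    Imports System.IO
    Imports System.Net
    Imports Microsoft.SqlServer.Dts.Runtime

    Public Class ScriptMain
        Public Sub Main()
            ' Read the package variables declared in ReadOnlyVariables.
            Dim remoteUri As String = CStr(Dts.Variables("RemoteUriOffense").Value)
            Dim extractHome As String = CStr(Dts.Variables("Extract_Home").Value)
            Dim localFile As String = CStr(Dts.Variables("LocalFileNameOffense").Value)

            Try
                ' WebClient performs the plain-http GET and writes the response to disk.
                Dim client As New WebClient()
                client.DownloadFile(remoteUri, Path.Combine(extractHome, localFile))
                Dts.TaskResult = Dts.Results.Success
            Catch ex As Exception
                Dts.TaskResult = Dts.Results.Failure
            End Try
        End Sub
    End Class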
While the script task from sqljunkies does get the job done, there are a few disadvantages. First and foremost is the creation of a separate set of variables for each of the files. I could have manipulated a single set of variables, iterating through all three files with a single script task that fires once per file. Additionally, given the variable infrastructure the script task needs, and the probability that this is not the last time I will ever get data via HTTP, this may be a good candidate for a custom component.
Among the advantages of a custom component, all the logic for file download would be centralized. What happens to my three script tasks when a change occurs to the download logic (for instance)? Yep, I am replicating the change for each extract. True enough, this particular problem is also solved by using a single script task and iterating over a list of files to download, as sketched below.
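For completeness, here is a sketch of that single-task variant, dropped into the same ScriptMain.Main; the URIs below are hypothetical stand-ins for the three NCAA file addresses:

    ' Hypothetical list of remote files; Extract_Home still supplies the local path.
    Dim extractHome As String = CStr(Dts.Variables("Extract_Home").Value)
    Dim uris() As String = { _
        "http://example.org/stats/schedule.csv", _
        "http://example.org/stats/offense.csv", _
        "http://example.org/stats/defense.csv"}

    Dim client As New WebClient()
    For Each uri As String In uris
        ' Name the local file after the last segment of the URI.
        client.DownloadFile(uri, Path.Combine(extractHome, Path.GetFileName(uri)))
    Next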
However, this may just be the perfect task on which to base a first custom component. Stay tuned.
Thursday, September 13, 2007
Expressions and the Data Flow Pipeline
File this under the "more than one way to skin a cat" topic. I am creating a demo for one of my upcoming presentations, and I noticed a technique for cleaning data from extracts that escaped my earlier reading/training/lab time.
In particular, I am creating some SSIS packages to load Comma Separated Values (CSV) files. Only these files, from an extract provider over which I have little or no control, contain embedded quotes. For example, a data record would read:
"column1 value", "column2 value", "column3 value"
This did not strike me as odd until the flat file connection manager for this extract refused to treat the quotes as part of the field delimiter. Unfortunately, my first pass at loading this data created a table with the following values:
Column 1  | Column 2  | Column 3
"Value"   | "Value"   | "Value"
Obviously NOT what I wanted to begin my Transform and Load processing with, even in a staging table.
One way or another, I would have to strip the double quotes out of the data, and for some reason I latched onto the idea of using an expression within an SSIS Derived Column transformation.
Soon enough, I had the Derived Column transform in place and was feeding it a steady diet of DT_STR columns from the Flat File source. It seemed as though surely the REPLACE function would suit my needs, but how to express the double quote character inside an expression? To little surprise, REPLACE(column1, """, "") is NOT what the SSIS execution engine is looking for. Thankfully, in Books Online Microsoft has furnished a reference for common string literal escape sequences. In my case, \" was just what the doctor ordered: REPLACE(column1, "\"", "") effectively stripped all of the quote characters from the data, leaving my staging table looking a little more appealing.
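For the three columns in the sample record above, the Derived Column grid ends up carrying expressions along these lines (the column names are placeholders for the real extract's names, and each column is replaced in place rather than added as a new one):

    Derived Column         Expression
    Replace 'Column1'      REPLACE(Column1, "\"", "")
    Replace 'Column2'      REPLACE(Column2, "\"", "")
    Replace 'Column3'      REPLACE(Column3, "\"", "")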
Somewhere on the Web, I found a custom SSIS component to strip quotes from an extract. That may still be worthwhile for larger loads; my candidate inbound set only had 50,000 or so records.
If anyone out there reading this has ideas on how to download files via HTTP without a Web Service, drop me a reply via the blog. Perhaps that will be the subject of my next post.
Thursday, September 06, 2007
The Dummy Script Task, or Conditional Execution 100
So I am working on building a data warehouse ETL system with SSIS. Like so many times before, there are certain steps in the process that I wish to execute only when some condition holds. Maybe it is just one of Larry Wall's virtues, laziness, but I want a process to run flawlessly whenever executed, 3pm or 3am, manually or automatically. An example of this is the truncation of destination tables in a data mart / data warehouse project, either in a Continuous Integration environment or in a priming load. To that end, I have employed the following trick I call the Dummy Script Task. SSIS provides a nice mechanism to conditionally execute a task based upon evaluation of an expression, a constraint, or both.
Truncating the destination table is easy enough using an Execute SQL Task. However, creating the conditional execution is less straightforward. To employ conditional execution, the package must have two tasks with the conditional flow sitting between them. The Truncate Destination Tables (Execute SQL) task is just sitting in a Sequence Container (labeled "Reload_YN" for clarity), with no inbound flow, conditional or otherwise, onto which to attach the condition.
Enter the dummy script task.
The dummy script task is just that: a Script Task from the Toolbox, with NO CODE BEHIND IT! The task merely serves as the starting point for a flow that will conditionally execute. It could not be simpler; just drag a Script Task from the Toolbox and optionally rename it.
As you will note from the figure at left, there is a blue precedence constraint connecting the Script Task to the Truncate Destination Tables task. The constraint is further marked as conditional by the fx label attached to it. Right-clicking the constraint and choosing Edit presents you with a dialog similar to the one below.
This dialog allows us to set the Truncate task to execute ONLY if one or more conditions are true. I have created a package variable named ReloadYN, of type Boolean, that determines whether tasks related only to a reload are to be executed. The variable can be set many ways (perhaps the subject of another post): via a dtexec command-line option, a script, or, in this case, an indirect package configuration. By having this little gem in my standard dimension and fact table package templates, I know the entire database will be truncated (or not) on a consistent basis. One more thing not to have to worry about.
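For reference, the precedence constraint editor settings end up something like this (assuming ReloadYN lives in the default User namespace):

    Evaluation operation: Expression and Constraint
    Value:                Success
    Expression:           @[User::ReloadYN] == TRUE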
Putting it all together
Based on the value of ReloadYN at run time, the Truncate Destination Tables task either executes (wiping the slate clean for a new load/test) or is skipped. You can verify this by executing the package, or the sequence container, from within Business Intelligence Development Studio. The skipped task will turn neither red, yellow, nor green; execution continues at the first step following the Sequence Container.