SSIS, Tempdb Database, and SQL Server Log Files

The tempdb system database is a global resource that is available to all users connected to the instance of SQL Server. Tempdb is re-created every time SQL Server is started, which means the system always starts with a clean copy of the database and there is never anything in tempdb that is saved from one session of SQL Server to another.

Tempdb usage can be roughly separated into three categories:

User objects: Any user with permission to connect to an instance of SQL Server can create temporary tables and table variables. Explicitly created temporary user objects, such as global or local temporary tables, temporary stored procedures, table variables, and cursors, are stored in the tempdb database.

Internal objects: Tempdb holds intermediate data while SQL Server is doing tasks like:

  • DBCC CHECKDB or DBCC CHECKTABLE;
  • Common Table Expressions (CTE);
  • ORDER BY, GROUP BY, UNION Queries;
  • Index rebuild or creation.

Version store: The version store is a collection of data pages that hold the data rows used for row versioning. Triggers, MARS (Multiple Active Result Sets), and online index operations are examples of SQL Server features that use the version store.
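
If you want to see how much of tempdb each of these three categories is currently using, the sys.dm_db_file_space_usage DMV breaks the allocated space down along exactly these lines. A minimal sketch (page counts are converted to MB, at 8 KB per page):

USE tempdb;

-- Reserved tempdb space per usage category, in MB (8 KB pages)
SELECT     SUM(user_object_reserved_page_count)     * 8 / 1024.0 AS user_objects_mb
         , SUM(internal_object_reserved_page_count) * 8 / 1024.0 AS internal_objects_mb
         , SUM(version_store_reserved_page_count)   * 8 / 1024.0 AS version_store_mb
FROM       sys.dm_db_file_space_usage;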

In this post, I’m going to talk about how tempdb is affected when SSIS is used to load data into target tables or update previously loaded data (specifically in fact tables), and provide a workaround that prevents the database from growing too large. The solution provided here has pretty much the same effect on SQL Server logs, resulting in smaller log files and saved space.

Populating Fact Tables

A fact table in a data warehouse has a direct or indirect (through mapping tables) relationship to its dimension tables. Whether the table a fact table is related to is a dimension or a mapping table, there must be a column (foreign key) in the fact table that refers to the primary key of the linked table. This means there are two sets of columns in a fact table: measure columns, which hold the numeric values to be analysed, and foreign keys to dimension tables, which are used to slice and dice the measure values.

The set of dimension and/or mapping foreign keys in a fact table defines its granularity: the lowest level to which data should be aggregated before loading into the DW. In other words, data should be aggregated at the right granular level, which is defined by the fact table’s foreign key columns.

A good way of aggregating data at the required granular level is using a GROUP BY clause in the query that pumps the data into the DW’s fact table. For example, if the structure of the fact table to be populated is like the one depicted below:

[Image: the example fact table, with ProductKey and DateKey foreign key columns and a SalesAmount measure]

then the following query is a good fit for populating the table:

SELECT     'ProductKey' = p.[ProductKey]
         , 'DateKey' = d.[DateKey]
         , 'SalesAmount' = SUM(s.Sales_Value)
FROM       [OLTP].[dbo].[Sales] s
INNER JOIN [DW].[dbo].[DimProduct] p ON s.Product_Key = p.ProductKey
INNER JOIN [DW].[dbo].[DimDate] d ON s.Sales_DateTime = d.DateTime_Value
GROUP BY   p.[ProductKey]
         , d.[DateKey]

What should be considered at this stage is that populating a fact table, especially for the first time, requires a huge number of records to be transferred from the OLTP database to the DW database, and the query that does so has a GROUP BY clause that engages tempdb while preparing the data. The ETL should therefore be designed so that it doesn’t blow up tempdb: instead of loading the data in a single data transfer process, it must break the data into smaller chunks and load one chunk at a time.

The best candidate for breaking down the data is the date-related level at which the data is aggregated. In our example the data is aggregated at the date level (it could be any other level, like week, month, or year).

Design and implementation of ETL using SSIS:

As explained before, the ETL for populating a fact table should be designed in such a way that it loads data into the DW by period (day, week, month, or year) to prevent tempdb and the log files from growing too large. This requires our SSIS package to do the following steps:

A- Get the list of distinct periods for the whole data set that is going to be loaded,

B- Save the periods extracted in the previous step into an SSIS variable,

C- Run the process that cleanses and inserts data into the DW once for each period.

Let’s see how each step could be implemented using SSIS:

1- Open BIDS, create a new SSIS project;

2- Add a new Execute SQL Task to your SSIS project. This task is responsible for reading the data to be loaded and building a list of distinct periods;

3- Create a variable of type Object at the scope of the package. This variable is going to hold the list of periods and will be used by different tasks at different steps of loading the data. I named it Periods;

4- Add another variable of type Integer (Int32) to your package, again at the scope of the package. This variable will hold each period read from the Periods variable, for which the package is going to load data. In my example this variable is called CalendarDate;

5- The next step is to extract the periods for which data should be loaded into the fact table. This is done by writing a query that extracts the distinct list of time periods from the OLTP database, and configuring the previously added Execute SQL Task to run the query and save the result in the Periods variable (a sketch of such a query follows these steps):

a-  Add the SQL script to your Execute SQL Task:

b- Change the ResultSet property of the Execute SQL Task to Full Result Set;

c- Click on the Result Set tab, and add a new result set to store the result of your query in the Periods variable:
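
For reference, the period-extraction query in this example can be as simple as the following sketch (based on the Sales and DimDate tables from the earlier query; your table and column names will differ):

-- Distinct list of periods (days) for which there is data to load
SELECT DISTINCT d.[DateKey]
FROM       [OLTP].[dbo].[Sales] s
INNER JOIN [DW].[dbo].[DimDate] d ON s.Sales_DateTime = d.DateTime_Value
ORDER BY   d.[DateKey];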

6- To get the package to transfer data for one period at a time, we need a Foreach Loop container. The Foreach container is going to loop through what we have already loaded into the Periods SSIS variable, and run the data transfer process once for each period:

a- Add a Foreach Loop container to your project and open its editor by double-clicking on it;

b- Go to the Collection tab, and select “Foreach ADO Enumerator” from the Enumerator drop-down and “User::Periods” as the ADO object source variable:

c- Click on the Variable Mappings tab of the Foreach Loop Editor and add the CalendarDate variable. This variable is going to hold the period for which the data is going to be transferred from OLTP to DW in each iteration:

7- Add the DFT and the sequence container (optional) to your project to do the transfer using the CalendarDate variable. You should be able to find loads of resources on the internet about using a variable in an OLE DB source of a DFT, in case you don’t know how to do that; a sketch follows.
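
To sketch what that looks like with the example tables above, the OLE DB source inside the DFT runs the same aggregation query restricted to a single period, with the ? placeholder mapped to the User::CalendarDate variable in the source’s Parameters dialog:

-- Fact-population query for a single period; the ? parameter
-- is mapped to User::CalendarDate by the OLE DB source
SELECT     'ProductKey' = p.[ProductKey]
         , 'DateKey' = d.[DateKey]
         , 'SalesAmount' = SUM(s.Sales_Value)
FROM       [OLTP].[dbo].[Sales] s
INNER JOIN [DW].[dbo].[DimProduct] p ON s.Product_Key = p.ProductKey
INNER JOIN [DW].[dbo].[DimDate] d ON s.Sales_DateTime = d.DateTime_Value
WHERE      d.[DateKey] = ?
GROUP BY   p.[ProductKey]
         , d.[DateKey]

Because each run aggregates only one period’s worth of rows, the GROUP BY work that lands in tempdb (and the log activity on the destination) stays small.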

Improve your SSIS package’s performance

Hello everyone.

I spent almost the whole of last week and the first two days of this week trying to improve my BI solutions’ performance. In my quest to learn the tricks that make my packages faster, I came across the SSIS Performance Design Patterns video by Matt Masson. A comprehensive discussion indeed, which I’m gonna list in brief here:
1- Utilize parallelism: It is easy to utilize parallelism in SSIS. All you need to do is recognize which Data Flow Tasks (DFTs) can be started at the same time and set your package’s control flow constraints so that they can all run simultaneously.

2- Synchronous vs. asynchronous components: A synchronous transformation in SSIS takes a buffer, processes it, and passes the result through without waiting for the next buffer to come in. An asynchronous transformation, on the other hand, needs to process all of its input data before it can produce any output. This can cause serious performance issues when the input to the asynchronous transformation is too big to fit into memory and has to be spilled to disk at multiple stages.

3- Execution trees: An execution tree starts where a buffer starts and ends where that same buffer ends. Execution trees specify how buffers and threads are allocated in the package. Each tree creates a new buffer and may execute on a different thread. When a new buffer is created, such as when a partially blocking or blocking transformation is added to the pipeline, additional memory is required to handle the data transformation; however, it is important to note that each new tree may also give you an additional worker thread.

4- OLE DB Command transformation: OLE DB Command is a row-by-row transformation, meaning that it runs its command once for each of its input rows. This makes it painfully slow when the number of rows goes up. The solution for boosting performance is to stage the data into a temporary table and apply the change with an Execute SQL Task outside that DFT, as sketched below.
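
As a sketch of that pattern (FactSales and Staging_Sales are hypothetical names): the DFT writes its rows to a staging table through a plain OLE DB Destination, and an Execute SQL Task then applies one set-based statement instead of one command per row:

-- Run once by an Execute SQL Task after the DFT finishes,
-- replacing a per-row OLE DB Command with a set-based UPDATE
UPDATE     f
SET        f.SalesAmount = st.SalesAmount
FROM       [DW].[dbo].[FactSales] f
INNER JOIN [DW].[dbo].[Staging_Sales] st ON f.ProductKey = st.ProductKey
                                        AND f.DateKey = st.DateKey;

TRUNCATE TABLE [DW].[dbo].[Staging_Sales];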

5- SQL Server Destination vs. OLE DB Destination: There are multiple reasons to use the OLE DB Destination and not the SQL Server Destination:

  • OLE DB Destination is mostly faster,
  • OLE DB Destination is a lot clearer when it fails (The error message is more helpful),
  • SQL Server Destination works only when SSIS is installed on the destination server.

6- Change Data Capture (CDC): Reduce the amount of data to be transferred as much as you can, and do it as close to the source as you can. A Modified On column on the source table(s) helps a lot in this case, as sketched below.
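
A minimal sketch of that idea, assuming a hypothetical Modified_On column on the source Sales table and a package variable that holds the time of the last successful load (mapped to the ? placeholder):

-- Pull only the rows changed since the last successful load
SELECT     s.*
FROM       [OLTP].[dbo].[Sales] s
WHERE      s.Modified_On > ?;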

7- Slowly Changing Dimension (SCD) transformation: There is only one piece of advice about SSIS’s Slowly Changing Dimension transformation, and that is to get rid of it! The reasons are (a set-based replacement is sketched after this list):

  • It doesn’t use any cached data, and goes to the data source every single time it is called,
  • It uses many OLE DB Command transformations,
  • Fast Data Load is off by default on its OLE DB Destination.
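
A common set-based replacement (my sketch, not from the video; the staging table and columns are hypothetical, and MERGE requires SQL Server 2008 or later) is to land the incoming rows in a staging table and let one MERGE statement handle a Type 1 dimension:

-- Type 1 SCD: update matching members, insert new ones
MERGE [DW].[dbo].[DimSalesPerson] AS d
USING [DW].[dbo].[Staging_SalesPerson] AS s
    ON d.SalesPersonFullAlternateKey = s.SalesPersonID
WHEN MATCHED THEN
    UPDATE SET d.FullName = s.FullName
WHEN NOT MATCHED BY TARGET THEN
    INSERT (SalesPersonFullAlternateKey, FullName)
    VALUES (s.SalesPersonID, s.FullName);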

I recommend you go and watch the whole video if you have enough time. All these topics are discussed in more detail there.
Cheers!

Populating Data Warehouse Using SSIS

ETL tools are meant to be used for moving data from a source to a destination. They make the life of a data warehouse or BI developer easier by providing easy-to-use interfaces and many functionalities for cleansing data before loading it into the destination. But the situation is not always that easy, especially when moving data from the production database (OLTP) to the data warehouse: you need to reference data in both databases to be able to populate your DW with correct data, which generally means substituting dimension ID columns with the “surrogate keys” that are pre-loaded into the data warehouse. Let’s have a look at the different choices a developer has for referencing multiple databases in an SSIS package:

1- Using Linked Servers: The first solution that might occur to a developer is referencing the tables of both databases at the query level, by doing a JOIN between them and stating each database’s name before the table names to differentiate tables from different sources. This requires a linked server to be set up between the two referenced databases. You can check this page if you’d like to learn how to create a linked server. Let’s see an example of the queries that can be used when a linked server is in place. Suppose that the production database instance is named OLTP and the database instance acting as the data warehouse is named DW. Let’s assume there are two tables in our OLTP database, with the following relationship:

When pulling data from this database into the DW, the dimension tables’ IDs (in this case, SalesPersonID) must be replaced with surrogate keys:

If there is a linked server set up between the OLTP and DW servers of the above example, then the query that populates the SalesFact table in the DW looks like this:

SELECT     'Amount' = s.Amount
         , 'SalesPersonKey' = dsp.SalesPersonKey
FROM       [OLTP].[dbo].Sales s
INNER JOIN [DW].[dbo].DimSalesPerson dsp ON s.SalesPersonID = dsp.SalesPersonFullAlternateKey

2- SSIS Lookup Transformation: In cases where it is not possible to create a linked server between the two data sources, for reasons like security concerns, SSIS’s Lookup transformation can be used. The Lookup transformation is capable of doing many things for a BI developer, from adding new records to a destination table to replacing columns with columns from other tables. To replace a column from one table with a column from another using the Lookup transformation, all you need to do is tick the column from the referencing table in the Columns tab of your Lookup transformation:

You can even replace the old column (SalesPersonID) with the new one (SalesPersonKey) by setting the Lookup Operation column:

This mechanism can be used to replace many IDs with the appropriate surrogate keys by using a waterfall of Lookups, one for each dimension; a sketch of a typical Lookup reference query follows.
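
As a sketch, each Lookup in such a waterfall is typically fed a narrow reference query rather than a whole table, so its cache holds only the two columns involved (names taken from the example above):

-- Reference query for the SalesPerson Lookup: just the business key
-- used for matching and the surrogate key being retrieved
SELECT     SalesPersonFullAlternateKey
         , SalesPersonKey
FROM       [DW].[dbo].DimSalesPerson;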

SSIS SQL Server Deployment, File System Deployment, and SSIS Package Store

SSIS packages can be stored in the file system as well as in SQL Server. Let’s have a look at each of these options for storing our SSIS packages.

1- SQL Server:

When a package is deployed to SQL Server, it is actually stored in the MSDB database. The easiest way to deploy an SSIS package to SQL Server is by using the Package Deployment Utility:

  • Open your SSIS project in BIDS (Business Intelligence Development Studio),
  • Right-click on the project entry in Solution Explorer and select Properties,
  • Navigate to the Deployment tab, and set CreateDeploymentUtility to true,
  • Build your project.

After completing these steps, go to the bin -> Deployment folder inside your project’s folder and double-click on the .SSISDeploymentManifest file. This will launch the Package Installation Wizard, which will take you through the necessary steps to deploy your package to SQL Server:

  • In the Deploy SSIS Package page of the wizard, select SQL Server deployment and click Next:

  • The Specify Target SQL Server page is where you must specify the instance of SQL Server to which you want to deploy your package. Enter the instance name in the Server name drop-down list, select the appropriate Authentication mode, and set the Package path for the folder in which the package will be deployed:

Note that the list of folders you’ll get for the Package path is exactly the same as the folders you’ll see when you connect to SSIS using SSMS. If you have a new project in hand and need to create a new folder for this project’s packages, you should connect to SSIS through SSMS, right-click on the MSDB folder in Object Explorer, and select New Folder.

  • The next page is Select Installation Folder. You can set where the package and its dependencies will be stored on your hard drive by clicking the Browse button:

  • The last step is to click Next on the Confirm Installation page:

Apart from the Package Deployment Utility, developers can use the DTUtil command-line utility for moving a package, which I believe is more complex and time-consuming than this option. A sketch of its usage follows.
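
For instance, copying a package from the file system into MSDB might look like this (the file path, server, and package name are hypothetical):

REM Copy a .dtsx file into the MSDB database on the target server
dtutil /FILE "C:\Packages\LoadFactSales.dtsx" /DestServer MYSERVER /COPY SQL;LoadFactSales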

Once deployment completes, your package is stored in the MSDB database, in the dbo.sysssispackages table.
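
If you want to verify the deployment, a quick sketch of a query against that table:

-- List the SSIS packages deployed to this instance's MSDB database
SELECT     name
         , createdate
FROM       msdb.dbo.sysssispackages
ORDER BY   createdate DESC;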

2- File System:

When a package is stored in the file system, the package itself and its dependencies are stored in a folder on the hard drive. A package can be deployed to the file system easily, by copying and pasting the .dtsx file to the target folder. This is the easiest way. The Package Deployment Utility and DTUtil can be used as well, just like with SQL Server deployment.

If you connect to SSIS through SSMS, you’ll see a folder named File System. This folder displays the packages that are deployed to the file system. But not every package available on your hard drive will be displayed here: the folder SSMS looks into to list the packages is C:\Program Files\Microsoft SQL Server\100\DTS\Packages. You can add new folders under it and then copy your packages into them, for better management.

If you’d like to add new folders for your SSIS packages, other than just File System and MSDB, to be viewable in SSMS, you should locate the file named “MsDtsSrvr.ini.xml” and configure it. Basically, what you need to do is add a new Folder node to this XML file, as sketched below.
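
As a sketch, here is the relevant part of MsDtsSrvr.ini.xml with one extra file-system folder added (the Projects name and path are hypothetical; the MSDB and File System nodes are the defaults):

<TopLevelFolders>
  <Folder xsi:type="SqlServerFolder">
    <Name>MSDB</Name>
    <ServerName>.</ServerName>
  </Folder>
  <Folder xsi:type="FileSystemFolder">
    <Name>File System</Name>
    <StorePath>..\Packages</StorePath>
  </Folder>
  <!-- New node: shows packages stored under C:\SSIS\Projects -->
  <Folder xsi:type="FileSystemFolder">
    <Name>Projects</Name>
    <StorePath>C:\SSIS\Projects</StorePath>
  </Folder>
</TopLevelFolders>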

Do not forget to restart your SSIS service after making any of these changes, otherwise you won’t see any effect.

SSIS Package Store:

Some websites and blogs mention a third type of package storage named the SSIS Package Store, which is not correct. The SSIS Package Store is nothing but the combination of SQL Server and file system deployment, as you can see when you connect to SSIS through SSMS: it looks like a store that has categorized its contents (packages) into different categories based on the taste of its manager (which is you, the package developer). So don’t mistake it for something different from the two types of package deployment.

Search for Objects in an instance of SQL Server

There are cases when a developer needs to look for a specific object (table, column, etc.) inside a database. A good example of such a scenario is when you have a database that has no diagrams or PK-FK constraints assigned, and you want to find out the PK-FK relationships and how tables in the database are connected together. This happens regularly when developing a data warehouse and designing an OLAP solution, when you want to transfer all the required tables to the data warehouse. And when the database is huge in terms of the number of tables and views, and the naming convention is not good and understandable, this becomes crazy!

SQL Server Management Studio (SSMS) gives you the ability to export the script used to do any operation on any type of object accessible through SSMS. For example, to get the T-SQL script for creating a table you can right-click on the table, select Script Table AS -> CREATE TO -> New Query Editor Window.

You may have guessed that we can follow almost the same procedure to export all the objects in a database to a query window and look for an object, e.g. a column, using its name. To do this, follow these steps:

1- After connecting to your instance of SQL Server, right-click on the database you are going to search in, navigate to Tasks, and then Generate Scripts… . The Script Wizard will launch:

2- Click Next in the Welcome window. You will see the Select Database window. This is where you can select the database you want to create the script for and search in:

3- On the next page, you can select the options for the objects you want to script. If you are generating the script of the database just to search for an object in it, leave the options as they are:

4- The Choose Object Types page is where you can select the types of objects to script. Note that you can only select object types of which you have at least one instance in your database. For example, you won’t see the option to script user-defined functions if there are no user-defined functions in the database:

5- Based on the selections you have made in the previous page, you will get one or more pages to select the objects to script. If you have chosen to script only tables, you will see a page to select the tables to script:

6- In the output options, you will choose the script mode. You can pick destinations of type File, Clipboard, or Query Window. I suggest scripting the definition of your objects to a Query Window, because it is more readable in SSMS:

7- Check the objects listed in the Script Wizard Summary and click Finish. Once the Generate Script Progress step is done, a new window containing the generated script will open in SSMS. You can look for objects there by pressing Ctrl+F:
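
As a side note, when you are only hunting for a column name, querying the catalog views directly is a quicker alternative to scripting the whole database; a minimal sketch (the search pattern is just an example):

-- Find every table column whose name matches a pattern
SELECT     t.name AS table_name
         , c.name AS column_name
FROM       sys.columns c
INNER JOIN sys.tables t ON c.object_id = t.object_id
WHERE      c.name LIKE '%SalesPerson%'
ORDER BY   t.name, c.name;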

P.S.: I owe a special thanks to Hendra, a very good friend and colleague of mine. Thanks mate. 🙂