It has been suggested that when loading dimensions, the first step should be to clear the current dimension table and reload in total from the data source. I use integer identity PK fields in all my dimension tables.
When loading my fact table I'm only loading rows that have been added or changed since the last SSIS run.
Wouldn't this introduce the possibility of mis-matched key relationships if, for some reason, the order of the rows in the dimension table changed and a fact row that doesn't get modified in this SSIS run pointed to a surrogate key in a dimension table that was changed?
I suppose in most cases a change in the OLTP that caused a reordering of the dimension table - say a customer was dropped - would show up as a modification in the fact load and cause the row to be updated. However if someone changed the indexing in the OLTP it might cause the customer order to chagne causing them to be loaded in a different sequence with different surrogate keys in the dimension table.
It just concerns me that a possibility for corruption seems to exist in this type of scenerio. It seems like one should load the dimensions and facts in the same manner.
Who suggests truncating dimensions before loading them? This is not good practice. New rows should be inserted and existing rows should be updated. If you need to track changes (Slowly Changing Dimension), then you should use the SCD wizard inside SSIS.|||John:
I am a student of the Kimball method of data warehousing. In most of my experience I have never completely rebuilt a dimension during each update, for precisely the same reason you mention. In my experience most changes in dimensions are handled using the Slowly Changing Dimension process, and SSIS provides a very elegant tool to do this for you.
I am not sure what your fact is, but unless you have scheduled a regular roll-out of data from your warehouse, say you only wanted to maintain a specific time period of data in your warehouse, you should not really be deleting rows from your dimension. If you have a very large dimension that you wish to purge occassionally I suggest that you devise a process that will account for the fact relationships that exist with the dimension records that are marked for deletion.
On a regular basis though it is not typical to completely refresh dimensions during each update.
|||Thanks! I thought it came from the Kimball video posted here but I could have misunderstood something I saw on the screen that was being used for something else or it might have been from one of my books. It seemed too easy to be workable. However it's not that big a deal to treat dimension loading the same as facts.
I'm lucky in that I don't have any slowly changing data (yet) so I can treat all my modifications as simple updates. And every row in my OLTP has a CreateDate, ChangeDate, DeleteDate attached to it.
|||The only thing I truncate in my data warehouses are aggregated fact tables. The detail fact tables undergo inserts only, while aggregates (monthly data, for instance) are just roll ups of the detail tables. Instances like this are perfect for truncations since they need to be rebuilt every night.|||Now that I'm looking at incremental updates of my dimensions I forsee another problem.
My sales date comes from the same OLTP table as most of my facts. I don't have a separte date table. So to load the date dimension I'm going to read the same rows as I would for loading my fact table. My OLTP row will alert me that the row has been modified but NOT whether the date column has been modified.
Also, I assume I'm going to be picking up several thousand sales each day so there will be several duplications. Is this why some designs, including some from microsoft, leave the date field in the fact table and let SSAS generate date dimensions from there?
|||You do not need to have a data source of a Time Period table. You can very simply create a time period dimension using SSIS or a stored procedure for a time period that spans the entire history and then several years into the future, and the appropriate grain of your fact.
Again, it is typical that you have a very robust Time Period dimension that has all the attributes users would need to slice the data along.
|||John:
I suggest you start a new thread on this topic of Time Dimension as you have already marked this post as being answered. This way others will look at the post.
No comments:
Post a Comment