Tuesday, March 27, 2012

Dimensional modelling advice

I am a former DBA that has moved into a totally new role and
environment, as a data architect on an enterprise data warehousing
project for the national statistics office. I've started reading the
Kimball Data Warehouse Toolkit (2nd Edition) and now have a feel for
the approach that I'd like to follow with the data modelling, ie.
using the data warehouse bus architecture approach that is advocated
by Kimball.
However, the main concern that I have with this approach is that we
have hundreds of different data sources in the form of surveys that we
conduct as well as data that we share from other govt and non-govt
agencies. So the data that we will be loading into our DW is not
sourced from our OLTP systems but from a multitude of other businesses
and households. If we were to model the data on a business process
basis there is the potential to end up with dozens/hundreds of star
schemas, as well as the ongoing maintenance of systems development for
when new surveys are created. Internally there is some support for
developing a generic dimensional model that has the flexibility to
accept any type of data (be it of an economic, social or environmental
nature). I could see that this would work as far as loading data into
the warehouse goes, but think that it would introduce a layer of
abstraction that would make it difficult for the business users to
understand and query the data (or do BI tools get around this). So
the first question is, how suitable is generic modelling for a data
warehouse implementation and if not, how big an issue is deploying a
large number of star schemas for an integrated data warehouse?
Secondly, a question around granularity. We have many situations
where different surveys request similar data but with slight
variations. eg. We could ask about a businesses export totals in a
number of different ways in several surveys - total exports of all
products for 2003, exports of fruit and veges for April, exports of
apples for Q3 2004, exports of apples for June in US$, total exports
to Europe etc. How is this handled in a dimensional model? Is it
possible to model this in a single star schema when the granularity
appears to be different. NB. Is may not be possible to derive
aggregate totals from data with a finer grain because a different
subset of businesses will have been asked a different set of
questions, so aggregate totals might not include the full subset of
data that is needed for an accurate total.
Lastly, does anybody know of an example of a dimensional model that
could be used as a starting point for survey collection and analysis
data warehouse?
TIA,
Keith.
Keith -
I would suggest that you also purchase another Kimball book entitled
"Data Warehouse ETL Toolkit"; they should really be sold as a pair.
Approaching a data warehouse with a strong relational 3nf
bakcground can be difficult to understand and implement the
denormalizing structures needed for a successful dw.
I've implemented large dw systems with numerous external sources of
data, and I've found that if one really boils the inbound information
the complexity is not what it seems at first.
FWIW.
\
b.
Keith Chung wrote:

> I am a former DBA that has moved into a totally new role and
> environment, as a data architect on an enterprise data warehousing
> project for the national statistics office. I've started reading the
> Kimball Data Warehouse Toolkit (2nd Edition) and now have a feel for
> the approach that I'd like to follow with the data modelling, ie.
> using the data warehouse bus architecture approach that is advocated
> by Kimball.
> However, the main concern that I have with this approach is that we
> have hundreds of different data sources in the form of surveys that we
> conduct as well as data that we share from other govt and non-govt
> agencies. So the data that we will be loading into our DW is not
> sourced from our OLTP systems but from a multitude of other businesses
> and households. If we were to model the data on a business process
> basis there is the potential to end up with dozens/hundreds of star
> schemas, as well as the ongoing maintenance of systems development for
> when new surveys are created. Internally there is some support for
> developing a generic dimensional model that has the flexibility to
> accept any type of data (be it of an economic, social or environmental
> nature). I could see that this would work as far as loading data into
> the warehouse goes, but think that it would introduce a layer of
> abstraction that would make it difficult for the business users to
> understand and query the data (or do BI tools get around this). So
> the first question is, how suitable is generic modelling for a data
> warehouse implementation and if not, how big an issue is deploying a
> large number of star schemas for an integrated data warehouse?
> Secondly, a question around granularity. We have many situations
> where different surveys request similar data but with slight
> variations. eg. We could ask about a businesses export totals in a
> number of different ways in several surveys - total exports of all
> products for 2003, exports of fruit and veges for April, exports of
> apples for Q3 2004, exports of apples for June in US$, total exports
> to Europe etc. How is this handled in a dimensional model? Is it
> possible to model this in a single star schema when the granularity
> appears to be different. NB. Is may not be possible to derive
> aggregate totals from data with a finer grain because a different
> subset of businesses will have been asked a different set of
> questions, so aggregate totals might not include the full subset of
> data that is needed for an accurate total.
> Lastly, does anybody know of an example of a dimensional model that
> could be used as a starting point for survey collection and analysis
> data warehouse?
> TIA,
> Keith.

No comments:

Post a Comment