Automatic Construction of Multidimensional Schema from OLAP Requirements Ahlem Nabli* — Jamel Feki* — Faiez Gargouri** Laboratoire LARIM *
Faculté des Sciences Economiques et de Gestion de Sfax Route l’aérodrome km 4B.P. 1088-3018 Sfax, Tunisie ** Institut Supérieur d’Informatique et du Multimédia de Sfax
Route Mharza km 1,5B.P. 1030-3018 Sfax, Tunisie
E-mail : {ahlem.nabli, jamel.feki , faiez.gargouri}@fsegs.rnu.tn Abstract The manual design of data warehouse and data mart schemes can be a tedious, error-prone, and timeconsuming task. In addition, it is a highly complex engineering task that calls for methodological support. This paper lays the grounds for an automatic generation approach of multidimensional schemes. It first defines a tabular format for OLAP requirements. Secondly, it presents a set of algebraic operators used to transform automatically the OLAP requirements, specified in the tabular format, to data mart modelled either as star or constellation schemes. Our approach is illustrated with an example.
1. Introduction During the last decades, the database community has devoted an increasing effort in the data warehouse (DW) research area [8]. Nevertheless, the crucial issues related to DW design still need further investigations. In most cases, researchers have focused on specific issues such as dimensional data models c.f. [16] [11] [6], materialized views c.f.[7] [5] and integration of heterogeneous data sources c.f.[13]. A few researches have focused on the semi-automatic or automatic construction of data warehouse or data mart (DM) schemes c.f. [12] [4] [2] [3]. However, no significant research has been made so far to develop a complete design methodology [6] [15]. More specifically, there is no DW design methodology that addresses OLAP requirement acquisition, systematic DM schema derivation and, finally, DW schema construction. In addition, designing a DW requires a methodology different from those commonly adopted for operational information systems [15]. For the latter, the design is typically based on the analysis of the whole operational data system whereas DW design focuses on the analysis of analytical user requirements. In addition, current software tools are dedicated to assist the administrator in the DM and DW construction and production of
0-7803-8735-X/05/$20.00©2005 IEEE
analytical results. However, with these tools, the DW and DM schemes must be built beforehand and in most cases manually. Consequently, this task can be tedious, error prone and time-consuming, especially with the large volume and variation of data sources. This lack of methods and tools motivated us to explore a new track leading to an appropriate design methodology. Our long-term objective is to develop a computeraided methodology for, first, DM design and then DW design based on use defined OLAP requirements. In this methodology, OLAP requirements are specified as twodimensional fact sheets (2D-F) comparable with those usually used in the visualization of the analytical results [11]. The 2D-F sheet format is appropriate because it constitutes a familiar format and intuitive presentation for the decision makers [16] [18]. The specified OLAP requirements are used to generate systematically DM schemes that are integrated to build the DW schema. Thus, this methodology defines precisely the OLAP requirements, and accelerates the DM and DW schema design. In this paper, we focus on the DM schema derivation; more specifically, we present a set of algebraic operators for generating the DM schemes based on given OLAP requirements. Section 2 is an overview of our approach. Section 3 presents the multidimensional models and defines the two-dimensional fact sheet (2D-F) as an input model to express the OLAP requirements. Section 4 defines the algebraic operators, while section 5 illustrates their use through an example.
2. Our Approach In our approach, an OLAP requirement is a 2D-F sheet mainly composed of: one fact to be analysed, several measures, several dimensions and hierarchies. The DM design phase accepts requirements expressed according to this model and generates multidimensional schemes [11]. The multidimensional schema generation is done in two complementary phases (Figure 1):
OLAP Generation of Stars
Requirements
DM
Generation of constellation
Generated stars
Schemes
Multidimensional schemes
Figure 1. DM schemes generation from OLAP requirements. - Generation of multidimensional star schemes by grouping sheets referring to the same application domain and describing the same fact, then by merging them to build a star schema. - Generation of constellation schemes by integrating the star schemes issued from the previous phase and relevant to the same application domain. The following section introduces definitions that we use to detail the above two steps.
3. Multidimensional model Nowadays, decisional systems use multidimensional models (MDM). In a MDM, data are represented as dimensional schemes that consist of a set of facts, dimensions and hierarchies [17]. We formally present these concepts in the following subsections.
- dN is the name of a dimension, - Att is a set of all attributes of d (including weak attributes), - HIER = {HM1d, HM 2d.., HM td} is a set of full hierarchies of d. The attributes of a dimension d are organized in full hierarchies. These attributes are ordered from the finest towards the highest granularity. Definition. A full hierarchy HM id of a dimension d is an acyclic path defined as (NHMid, ParamF, Att-F) where: - NHMid is the name of a full hierarchy, - ParamF= is an ordered list of attributes used in HMid , - Att-F is a function which associates an attribute pi to the set of its weak-attributes with ∀ i∈ [1..n], Att-F(pi)={ ate, ..,atr } and ∀ j∈ [e..r], atj∈Att et atj ∉ParamF.
3.3. Data models 3.1. Fact An analyzed subject is represented by the fact concept. Each fact reflects information that has to be analysed. Definition. A fact F is defined as (fname, Mf) where: - fname is the name of a fact, - Mf ={mF1, mF2…, mFn} is a finite set of attributes representing the measures of the fact, each measure mFi is defined as mFi = (NameMFi, FuncMFi) - NameMFi is the name of a measure - FuncMFi is an aggregate function (Sum, Average, ..).
3.2. Dimension A dimension reflects information according to which measures will be analysed, i.e., it is the axis of analysis. A dimension is generally made up of a finite set of attributes [17]. In a dimension, some attributes take part to define various levels of detail (hierarchies), whereas others are less significant but used, for instance, to label results. The latter are said weak attributes. Definition. HIER) where:
A dimension d is defined as (dN, Att,
Our approach uses two data models: - An OLAP requirement model that is the input model presented as two dimensional fact sheets (2D-F), - A data mart model that is a multidimensional output model. 3.3.1. OLAP requirement model. As mentioned previously, in our design methodology, the OLAP requirements are collected through an acquisition module in a familiar format to the decision makers, as twodimensional fact sheets (2D-F). Each 2D-F is a matrix that describes the measures of a fact F belonging to a domain. The 2D-F measures are recorded according to two dimensions of analysis. Each dimension can be visualized as a sub-hierarchy of a full one. Thus, a 2D-F sheet is a particular star schema that is reduced to two dimensions (DIM = {d1, d2}), each one associated with only one subhierarchy. During the acquisition step, weak-attributes are implicitly selected and not visualized with each selected non-weak-attributes. Also dimensions are associated to the analysed fact. Example: The 2D-F sheet given by the figure 2 analysis the SALE fact where the measure qty is recorded according to Client and Date dimensions.
Since a 2D-F sheet is a particular star, we suggest representing it graphically as a star according to Golfarelli formalism [14]. Thus, the above example is represented by the S1 schema of Figure 3.
- Funct is a function which associates a fact fi to the list of its dimensions with ∀i∈ [1..s ], Funct(fi ) = {di…, dp} with ∀ j ∈ [ i..p ], dj ∈DIM. Unit-Price
Category Client Dimension
Fact measure SALE ( Qty)
Sub-Hierarchy Client
Client Region
Prod-Name IdProduct Product
First-Name
Slice
Name
Age IdClient Client
City Date
Year
SALE
Month
SALE
qty amountt
Date
Date Dimension
Sub-Hierarchy Date
Figure 2. An example of 2D-F sheet analysing the SALE Fact. First-Name
Region City
Name
IdClient Client
qty Date
IdDate Month
IdDate Month
Quarter
Quarter
Year
Year
S2
S3
Figure 4. Examples of two star schemes representing OLAP requirements.
SALE qty
4. An algebraic operators for DM schema construction
Date
This section presents DM schema construction by merging OLAP requirements. The DM schema generation is carried out through a function defined as a succession of basic operations among which we distinguish the following two categories: the substitution operation and union operations.
IdDate Month Year
Figure 3. S1 : Graphical representation of the above SALE sheet.
4.1. Substitution operation In order to illustrate the construction process, let us consider the two additional requirements S2 and S3 graphically represented as two star schemes in Figure 4. 3.3.2. Data mart model. A DM is characterized by its Multidimensional Schema (MS) which can be either a star schema analysing a single fact examined according to axis of analysis (dimensions) or a constellation schema gathering several facts with shared dimensions [17] [9]. Each schema belongs to one specific application domain. We use multidimensional schema as a generic term for both star and constellation. Definition. A multidimensional schema is defined as a tuple (Nsch, ND-sch, Fsch, DIM, Funct) where: - Nsch is the name of a multidimensional schema, - ND-sch is the name of a domain to which a schema belongs, - Fsch = {F1, F2,..,Fs} is a set of facts, - DIM = {d1, d2…, dv} is a set of dimensions,
Naturally, each dimension in a 2D-F sheet relates to a sub-hierarchy of a full hierarchy. Full hierarchies are already prepared in a specific referential (Figure 11) and used by the acquisition module. This referential is currently built by identifying expected facts, dimensions and hierarchies from existing operational and external databases [1] [4] [14]. To prepare suitably the operations of drill down and roll up at various levels, we substitute each sub-hierarchy in a 2D-F sheet by its associated full hierarchy. Definition. The hierarchy substitution operation SubstH replaces an existing sub-hierarchy in a 2D-F sheet by its corresponding full hierarchy: SubstH (Sch, d, hd, HMid) = Sch’ Input: - Sch =(Nsch, ND-sch, Fsch, DIM, Funct) is a MS - d = (dN, Att, HIER) is a dimension ,
- hd= (Nh, ParamF, Att-F) is a sub-hierarchy of d, with h.ParamF = - HMid =(NHMid, ParamF, Att-F) is the full hierarchy HMid.ParamF = Condition: - d∈ Sch.Dim - h.paramF⊂ HMid.ParamF - e>=1 and l.
4.2. Union operation The union operations are useful to merge a set of OLAP requirements (2D-F sheets) describing the same fact or
HM1produit
HM1 client
HM2client
Region First-Name
Department Id
t
Sub-Category
Name
City
Category Categorie
Unit-Price
Slice
Prod-Name
Age
IdProduct
Clien
Product
Client
Id
SALE
SALE
Client
Revenue
amount
Date
Date
Name
client
SALE
qty
qty
First-Name
Date
Id Day
IdDate Day
IdDate Day
Month
Month
Month
Quarter
Quarter
Semester
Semester
Semester
Year
Year
Year
Date
S1’’
S2’’
HM1Date
Figure 5. Schemes after hierarchy substitution.
Quarter
S3’’
Slice
Region
Example: as we aim to incrementally construct multidimensional schemes by merging OLAP requirements, the next step to complete the previous example is to merge S1’’ and S3’’. This consists in adding the HM2Client hierarchy in S3’’ to the Client dimension of S1’’. Thus, we require the application of the following operation: UnionH (S1’’, Client, HM2clientS3’’) = SE.
Department City IdClient Client
Year
Date
Figure 6. SE: Star schema built on S1’’ and S3’’. 4.2.2. Measure union. Different measures coming from two schemes defined on the same fact should be joined together to obtain a schema richer in measures. In fact, the resulting schema gathers complementary users’ points of views. This can be achieved through the measure union operation. Definition. The measure union operation UnionM adds a measure m to a fact F in a multidimensional schema Sch. The syntax of the operator is: UnionM (Sch, F, m) = Sch’ Input: - Sch =(Nsch, ND-sch, Fsch, DIM, Funct) is a MS - F =(fname, Mf) , where Mf ={mF1, mF2…, mFn} - m = (NameM, FuncM) is a measure, Conditions: - F∈ Sch. Fsch - m ∉ F.Mf Output: - Sch’ is the MS Sch where the fact F is enriched by measure m such as Mf=Mf ∪<m>. Example: We continue the construction of the multidimensional schema (a star) on the resulting one by the addition of measure revenue of the S3’’ schema in SE schema. We require the following operation: UnionM (SE, SALE,revenue) = SE1.
Age First-Name Name
SALE qty revenue Mont IdDate h Quarter Day
Semestre Year
Age First-Name Name
SALE qty Mont IdDate h Quarter Day
IdClient Client
Slice
Region
Semestre
Department City
Date
Figure 7. SE1: Star schema built on SE and S3’’. 4.2.3. Dimension union. The 2D-F requirement sheets describing the same fact in a domain are gathered to build a star schema. This operation enriches iteratively, the original schema by complementary dimensions coming from another schema. We define the operator of dimension union. Definition. The dimension union operation UnionD consists in adding a dimension to a fact in a multidimensional schema. The syntax of the operator is: UnionD (Sch, F, ds) = Sch’ Input: - Sch =(Nsch, ND-sch, Fsch, DIM, Funct) is a MS - F =(fname, Mf) with Funct(F) = {d1, d2…, dv} - ds = (Nds, Att, HIER) is a dimension Conditions: - F∈ Sch. Fsch - ds ∉ Funct(F) Output: - Sch’ is the multidimensional schema Sch enriched by the dimension ds with DIM = DIM ∪ and Funct(F) = {d1, d2… dv}∪{ds}. Example: To build a star on SE1 and S2’’ we add the Product dimension of S2’’ then the amount measure of the Sale fact of S2'' to SE1. We require the following operations: UnionD (SE1, SALE, Product) = SE2; UnionM (SE2, SALE, amount) = SE3 These operations produce the star schema SE3 of Figure 8. Region
Age
Department City Id
Slice
Client
First-Name Name
Client
Semester Year
Month Quarter
IdDate Day
Date
SALE qty revenue amount Product
Prod-Name Unit-Price Category IdProduct Sub-category
Figure 8. SE3: Star schema built on SE1and S2’’.
4.2.4. Fact union. This union consists in building a constellation schema by adding a fact defined in a star schema to an existing multidimensional schema. Definition. The fact union operation UnionF consists in adding a fact to a multidimensional schema. The syntax of this operator is as follows: UnionF (Sch, F, DimF) = Sch’ Input: - Sch =(Nsch, ND-sch, Fsch, DIM, Funct) is a MS - F =(fname, Mf) is a fact, - DimF = {d1…,db} is a set of dimensions, Conditions: - F∉ Sch. Fsch - DimF ⊂ Dim - Funct(F)= Dim Output: - Sch’ is the Sch constellation schema such as Fsch= Fsch∪{F} and Funct(F) = Dim. Example: To complete our case study, let’s have a new star SE4 schema to merge with SE3 (the criterion of fusion is a coefficient of similarity, not presented here [11]).
stores in a DM referential. This latter is described as an UML class diagram (see Figure 12). Region Social-Name
City Id
First-Name
City
Supplier
Id
Month
Name
Client
Client
SHIPMENT qty amount
SALE qty revenue amount
IdDate
Date
Prod-Name Unit-Price IdProduct Category
Product
Sub-category
Day
Quarter
Year
Age
Departement
Supplier
Semester
Slice
Region
Department
Figure 10. SC1: Constellation schema built on SE3 and SE4. Measure NameM Formule TypeM
1..n +Mt
Sheet NameSh DateSpec Desc
Domain NameD
1..n +DomSh
Additivity NameAd Fact NameF
1..n
Region 2..n
Department
Social-Name
City
2..n
2..n Full-Hierarchy NameFH
Dimension NameD
Sub-Hierarchy NameSH
+Dim_hierarchy
IdSupplier
+Hier_sousHier
Attribute NameA Type
Supplier
1..n
Param Level
SHIPMENT qty amount
Date
Quarter
Day IdDate
Month
Year
Weak
Semester
Figure 9. SE4: Star schema. The fusion of SE3 with SE4 consists in adding the SHIPMENT fact to SE3 then the Supplier dimension to the schema result. This fusion requires the application of the following operations: UnionF(SE3,SHIPMENT,{Date}) = SC ; UnionD(SC, SHIPMENT, Supplier) = SC1. The result of the application of the fact union and dimension union operators is given by the constellation schema SC1 of Figure 10.
Strong
1..n
0..n
Figure 11. Simplified UML class diagram of the OLAP requirements referential. Measure NameM Formule TypeM
Constellation NameC NbFact
Domain NameD
1..n +DomSh
Additivity NameAd
1..n
2..n
2..n Dimension NameD
Fact NameF
Hierarchy NameH +Dim_sousHier
5. Experimentation In order to validate our proposals, we have developed a prototype that helps the administrator during the design phase of the DM schema. The interface enables us to input OLAP requirements expressed as set of 2D-F sheets. The requirement set is stored in a referential of OLAP requirements (see Figure 11). From these requirements, the prototype automatically generates DM schemes that it
Attribute NameA Type
1..n
Param Level Weak
Strong
1..n
0..n
Figure 12. Simplified UML class diagram of the DM referential.
6. Conclusion The work presented in this paper is a step towards the automatic construction of DW schemes. More precisely, it defined a set of algebraic operators that allow the automatic generation of DM schemes from precisely specified OLAP requirements. The latter are presented as a 2D-F sheet format that easily enables decision makers to express their requirements. In addition, this paper presented a set of algebraic operators, which can be applied iteratively to generate DM schemes from 2D-F sheets interpreted as star schemes. Furthermore, the algebraic operators can be used to accommodate for data mart and data warehouse evolution. The algebraic operators have been implemented in a toolset. Currently, we are examining how to generate automatically the DW schema from DM schemes, and how to map the DW schema to data sources.
7. References [1] A. Bonifati, F. Cattaneo, S. Ceri, A. Fuggetta, and S. Paraboschi,” Designing Data Marts for Data Warehouse”,in ACM Transaction on Software Engineering and Methodology, ACM, vol. 10, Octobre 2001, p. 452483. [2] C. Phipps, K. Davis, “Automating data warehouse conceptual schema design and evaluation ”, DMDW'02, Canada, 2002. [3] C. Sapia” On Modeling and Predicting User Behavior in OLAP Systems”, In Proc. of the Int. Workshop on Design and Management of Data Warehouses (DMDW’99), Heidelberg, Germany, 1999. [4] D. Moody, M. Kortnik, “ From Enterprise Models to Dimensionals Models : A Methodology for Data Warehouse and Data Mart Design ”, DMDW’00, Sweden, 2000. [5] D. Theodoratos, S.Ligoudistianos, T.Sellis, “ View selection for designing the global data warehouse”. Data and Knowledge Engineering (DKE).39, 3, 2001, pp.219240. [6] D.Theodoratos,T. Sellis, “Designing Data Warehouse”. Data and Knowledge Engineering (DKE), 31, 3, Oct. 1999, pp. 279 - 301. [7] E. Baralis, S.Paraboschi, E.Teniente, “ Materialized view selection in multidimensional database", Proc. 23 rd
Very Large Database Conf. (VLDB97),Athens, Greece, 1997, pp. 156-165. [8] E.F.Codd, “ Providing OLAP (on-line analytical processing) to user-analysts : an IT mandate ”, Technical Report, Codd E.F. and Associates, 1993. [9] F. Ravat, O. Teste, G. Zurfluh, “ Modélisation multidimensionnels des systèmes décisionnels ”, Revue ECA, vol. 1, n° 1-2, 1999, p.201-212. [10] J. Pokorny, “Modelling Stars Using XML”. The 4th ACM international workshop on Data warehousing and OLAP. Atlanta, Georgia, USA, 2001,pp 24 – 31. [11] J.Feki, “Vers une conception automatisée des entrepôts de données : Modélisation des besoins OLAP et génération de schémas multidimensionnels “, 8th MCSEAI, 9-12 Mai 2004, Souse-Tunisie. [12] K.Hahn, C. Sapia, M. Blaschka,” Automatically Generating OLAP Schemata from Conceptual Graphical Models”, In Proc. of the 1st International Workshop on Data Warehousing and OLAP (DOLAP), Washington, DC, USA, November 10, 2000, [13] M.Bouzeghoub, Farias B.Loscio, Z.Kedad, Soukane A. “ Heterogenous data source integration and evolution ”, 13th International Conference on Database and Expert Systems Applications - DEXA 2002, September 2-6, 2002. [14] M.Golfarelli, D.Maio, S.Rizzi, “Conceptual design of data warehouses from E/R schemes”, 31st Hawaii International Conference on System Sciences, 1998. [15] M.Golfarelli, S.Rizzi, “Designing the Data Warehouse : Key Steps and Crucial Issues ”, Journal of Computer Science and Information Manegement, vol. 2, n°3, 2001, p. 1-14. [16] R Agrawal, A.Gupta, S.Sarawagi, “ Modeling Multidimensional Databases “, Research Report, IBM Almaden Research Center, San Jose (California), 1995. Parus dans les actes de ICDE'97. [17] R.Kimball, The Data Warehouse Toolkit, John Wiley and Sons, Inc., New York, 1996. [18] W. Lehner, “ Modeling Large Scale OLAP Scenarios ”, 6th International Conference on Extending Database Technology (EDBT'98), Valence (Espagne), 23-27 Mars 1998.