N. García-Pedrajas et al. (Eds.): IEA/AIE 2010, Part III, LNAI 6098, pp. 616–625, 2010. © Springer-Verlag Berlin Heidelberg 2010
Processing of Crisp and Fuzzy Measures in the Fuzzy Data Warehouse for Global Natural Resources* Bożena Małysiak-Mrozek1, Dariusz Mrozek1, Stanisław Kozielski1 Institute of Informatics, Silesian University of Technology Akademicka 16, 44-100 Gliwice, Poland {Bozena.Malysiak, Dariusz.Mrozek, Stanislaw.Kozielski}@polsl.pl 1
Abstract. Fuzzy Data Warehouse (FDW) is a data repository, which contains fuzzy data and allows fuzzy processing of the data. Incorporation of fuzziness into data warehouse systems gives the opportunity to process data at higher level of abstraction and improves the analysis of imprecise data. It also gives the possibility to express business indicators in natural language using terms, like: high, low, about 10, almost all, etc., represented by appropriate membership functions. Fuzzy processing in data warehouses can affect many operations, like data selection, filtering, aggregation, and grouping. In the paper, we concentrate on various cases of data aggregation in our recently implemented fuzzy data warehouse storing consumption and requirement for global natural resources represented as crisp and fuzzy measures. We show several examples of data aggregation and filtering using the extended syntax of the SQL SELECT statement. Keywords: fuzzy data warehouse, data aggregation, fuzzy sets, membership function, query language, natural resources, decision support
1 Introduction Data warehouses are special purpose repositories that collect huge volumes of data usually for reporting and querying. These kinds of systems support fast and complex data analysis and constitute a foundation for decision support systems. Data warehouses are optimized for fast data retrieval. The goal is often achieved through the denormalization of the data warehouse schema towards so called star schema or snowflake schema [1]. This minimizes the number of join operations during the querying process. In data warehouse systems, fact data are organized in central fact table, which stores numerical values of some business indicators. These facts can be analyzed along different dimensions, like time or geography, which are represented by dimension tables. In order to guarantee the fast data retrieval from data warehouse systems, data are usually pre-aggregated at different levels of attribute hierarchies in particular dimensions [2]. *
The research presented here were done as a part of research and development project no. O R00 0068 07 and have been supported by Ministry of Science and Higher Education funds in the years 2009-2011
616
Analysis of the consumption and requirement for global natural resources is very important for the long-term management of these resources. It delivers useful information for ecologists, giving them an insight into the politics of natural resource use in particular countries and how it affects the natural environment. It is also one of the key indicators for the Ministry of Economy in every country providing the information about the current exploitation and exhaustion of the resources deposits. Furthermore, it allows to predict the future exploitation capabilities and therefore, has a vast influence on the economic safety and independence of the country. In the analysis of global natural resources some facts cannot be represented by crisp measures. E.g. if we consider requirement for particular resources and the cost of the requirement, these values can be merely estimated. For this reason, it seems desirable to include some fuzziness into the data warehouse storing this kind of facts. Furthermore, processing crisp measures with the use of fuzzy techniques gives huge latitude in the formulation of analysis criteria. Therefore, we define the Fuzzy Data Warehouse (FDW) as the data repository, which contains fuzzy data and allows the fuzzy processing of the data. This approach to data analysis and processing gives several advantages. One of the advantages is the possibility to analyze data in more general manner, e.g. users can search the information based on imprecise terms [3-5]. Moreover, the use of fuzzy techniques makes it easy to concentrate similar data using appropriate criteria [6]. Finally, it is natural to use fuzzy techniques, when we collect inaccurate data, e.g. measures coming from test systems or estimated facts, and we need to explore them with a bit of tolerance. Having fuzzy measures (measures defined on columns that store fuzzy data), we must implement the arithmetic of fuzzy numbers in order to aggregate fuzzy data. The following works [7-9] present successful implementations of fuzzy methods in processing data in data warehouse systems. However, they usually concentrate on fuzzy processing of crisp values. In the paper, we show different methods and cases of data aggregation for crisp and fuzzy measures stored in the fuzzy data warehouse for global natural resources.
2 Theoretical Background A fuzzy number is of L-R type, if there exist reference functions L (left), R (right) and scalars > 0, > 0 and m – called modal value. The scalars , are called left and right spreads, appropriately [10]. Symbolically, the L-R fuzzy number is represented by the triple (m, , ). Generally, we can say the L-R fuzzy number is a fuzzy set A, for which the membership function is described by the expression (1). m x L for x m where m, , R A 1 for x m, x m for x m R
(1)
617
A fuzzy interval is of L-R type, if there exist reference functions L, R and left and right spreads > 0, > 0 and m, n, where m < n, and (m, n) is the interval of modal values. The L-R fuzzy interval is denoted by the four (m, n, , ) [10]. Generally, we can say that the L-R fuzzy interval is a fuzzy set A, for which the membership function is described by the expression (2). m x L for x m where m, n, , R A 1 for m x n, x n for x n R
(2)
Data aggregation is a key element for the analytical processing in all OLAP (OnLine Analytical Processing) systems [1], [2]. During the OLAP processing, we usually want to perform a sum or an average of values for the particular range of rows, and sometimes we need to know the minimal or maximal value of some attributes in a group of records. While aggregating fuzzy data, sum and average aggregates can be implemented according to the arithmetic of fuzzy numbers, the other two aggregates, min and max, can be applied thanks to the fuzzy logic. The fifth important operation allows to count the number of rows or values in a group of records. This operation does not require any special implementation and can be used regardless of the type of data that we operate on – crisp or fuzzy. Addition of L-R fuzzy numbers In order to implement the sum aggregate (FSUM) for fuzzy data, we have to define the addition of L-R fuzzy numbers. If we assume that fuzzy numbers A1 and A2 are represented by triples: A1 = (mA1, A1, A1), and A2 = (mA2, A2, A2),
(3)
the sum of them has the following form [11]: A1 + A2 = (mA1 + A2, A1 + A2, A1 + A2 ) = (mA1 + mA2, A1 + A2, A1 + A2)
(4)
Addition of two L-R fuzzy numbers is presented in Fig. 1. A1(x1)
A2(x2)
1
0
A1 +
1
A1
mA1 - A1
A1
mA1 mA1 + A1
0
x1 mA2 - A2
A2
(y)
1
A2
A2
mA2 mA2 + A2 x2
A1 + A2 A1 + A2 y 0 mA1 + A2 mA1 + A2 mA1 + A2 - + A1 + A2
A1 + A2
Fig. 1. Addition of two L-R fuzzy numbers A1 and A2
Division of L-R fuzzy number by a crisp value In the work, we use the division of the L-R fuzzy number by a crisp value in order to calculate the average aggregate function FAVG, which operates on fuzzy data. In this
618
case, we adopt dependencies that are defined for the division of two L-R fuzzy numbers. Crisp value is a special instance of the L-R fuzzy number, which has not left and right spreads (B = 0, B = 0). In the situation, the following dependencies occur [11]: mA/B = mA/mB mA/B + A/B = (mA + A)/mB mA/B - A/B = (mA - A)/mB
(5)
Both operations (addition and division) can be also applied to L-R fuzzy intervals. For example, for two L-R fuzzy intervals A1 and A2 represented by four parameters: A1 = (mA1, n A1, A1, A1), and A2 = (mA2, nA2 A2, A2),
(6)
the sum of these L-R fuzzy intervals is determined by the expression (7). A1 +A2 = (mA1, nA1, A1, A1 ) + (mA2, nA2, A2, A2) = = (mA1 + mA2, nA1 + nA2, A1 + A2, A1 + A2).
(7)
Logical operators We have to define the majority and minority operators for fuzzy numbers in order to compare and sort fuzzy data, and also to implement the maximum (FMAX) and minimum (FMIN) aggregate functions. For given fuzzy numbers the majority relation is defined on the basis of comparison of appropriate parameters of their membership functions. In our approach, we first compare the modal values and check, which of them is greater. If they are equal, we check right spreads, and if they are equal, we check left spreads. The minority relation is calculated in the same way, but if the modal values are equal, we first check left spreads and if they are equal, we then check right spreads. Linguistic Quantifiers Linguistic quantifiers are special terms introduced by Zadeh and taken from the natural language. They are expressed by terms, like: almost all, most, at least half, several, at least 20% [10]. Since these terms include the information about proportions, Zadeh suggested they can be represented as fuzzy subsets of the unit interval [0,1]. Linguistic quantifiers express the cardinality of the group of aggregated data. With the use of linguistic quantifiers we can also specify filtering conditions on groups of data. In the syntax of the SQL SELECT statement the fuzzy linguistic quantifiers occupy the place of aggregate functions. For these reasons, they are sometimes called linguistic aggregate functions.
3 Example of a Fuzzy Data Warehouse In data warehouses, we store business or other types of facts in so called fact table. These facts are described by means of measures – calculable attributes in the fact table. Since some facts can be provided with a bit of approximation, some measures must have the fuzzy representation. Therefore in our work, in order to gather and
619
analyze fuzzy values, we have defined the new data type LR and methods that operate on the LR data in the data warehouse management system. In section 4, we show different cases of data aggregation in the data warehouse containing fuzzy measures. We present extensions of the SELECT statements regarding aggregation of fuzzy data and filtering groups. In given examples, we base on our last implementation of the Resources fuzzy data warehouse storing consumption and requirement for global natural resources. In the simplified schema of the Resources data warehouse presented in Fig. 2, we can distinguish the following elements of the multidimensional model: Dimensions Geography, with the hierarchy of attributes: ContinentCountry Time, with the hierarchy of attributes: YearQuarterMonth Resource, with the hierarchy of attributes: CategoryResource Facts – Requirements, with measures: ConsumptionAmount – resource consumption, crisp data ConsumptionCost – cost of the resource consumption, crisp data RequirementAmount – requirement for a resource, L-R fuzzy number RequirementCost – cost of the requirement for a resource, L-R fuzzy number
Fig. 2. Structure of the Resources data warehouse
Fuzzy data stored in the Resources data warehouse can be presented in two forms: as the four (l, m, n, p), where: (m, n) is the interval of modal values, l=m-, p=n+ ; and >0, >0 are left and right spreads, respectively. The four (l, m, n, p) determines the trapezoidal membership function for a fuzzy number. The triangular membership function is a special case of the trapezoidal function, where m=n. as linguistic value approx. x, where x is a mean of m and n values. Sample values of the RequirementAmount fuzzy measure in both presentation forms are shown below. RequirementAmount ------------------------(2416;2517;2617;2717) (976;1017;1017;1057) (1571;1637;1697;1762)
RequirementAmount ------------------approx.2567 approx.1017 approx.1667
The Resources data warehouse is stored in the ROLAP technology, where data and aggregates are maintained by the relational database engine. The data warehouse was populated with data from bulletin boards, press news and electronic papers.
620
4 Data Aggregation and Filtering Groups in FDW Data warehouse systems are generally oriented on querying and aggregating data. If a data warehouse consists of fuzzy measures, we need to extend the standard syntax of the SQL SELECT statement, which is commonly used while retrieving data. This implicates not only the implementation of fuzzy aggregates, like: FSUM, FAVG, FMIN, FMAX, but also extensions to the HAVING clause, which is applied to filter groups of records during data analysis. In the section, we present several examples of aggregation of fuzzy measures incorporated in the extended SQL language. In our analysis of the global resources requirements, we have identified the following cases, which differ from each other with the processing procedure: 1. aggregation of fuzzy data, 2. aggregation of crisp data with a fuzzy condition in the HAVING clause, 3. aggregation of fuzzy data with a crisp condition in the HAVING clause, 4. aggregation of fuzzy data with a fuzzy condition in the HAVING clause, 5. usage of fuzzy quantifiers that operate on the selected group of rows. Example 1: In the example, we present the aggregation of fuzzy values for the RequirementCost measure. We want to generate the ranking of 10 countries, which GDP per capita exceeds 30 000$ and bear the highest expenses of the required coal resource in the 2010. The query in our extended SQL language has the following form: SELECT TOP 10 g.Country, FSUM(RequirementCost) AS TotalCost FROM DimGeography g JOIN FactRequirements rf ON g.GeoId=rf.GeoId JOIN DimTime t ON rf.TimeId=t.TimeId JOIN DimResource r ON rf.ResId=r.ResId WHERE r.Resource=’coal’ AND g.GDP > 30000 AND t.Year=2010 GROUP BY g.Country ORDER BY TotalCost DESC;
In the presented query, the FSUM aggregate function operates on fuzzy numbers stored in the RequirementCost column. Results of the SELECT statement are shown below. Aggregated value of the RequirementCost measure is presented with the use of linguistic value. In the result set we present sample data. Country -------------USA Japan France Australia ...
TotalReqCost --------------approx. 1863 approx. 1632 approx. 1629 approx. 1534
Example 2: In this example, we show the use of standard average aggregate function (AVG) operating on crisp data. In the case, the AVG function appears also in the HAVING clause of the SELECT statement. In this statement, we construct the fuzzy filtering condition for particular groups of data. Display average month consumption of the oil resource in particular years and countries, which area exceeds 120000 km2. In the final result display only these countries, which consumption is high with the compatibility degree greater than 0.7.
621
The query in the extended SQL language has the following form: SELECT g.Country, t.Year, r.Resource, AVG(ConsumptionAmount) AS AverageMonthConsumption FROM DimGeography g JOIN FactRequirements rf ON g.GeoID=rf.GeoID JOIN DimTime t ON rf.TimeID=t.TimeID JOIN DimResource r ON rf.ResID=r.ResID WHERE g.Area > 120000 AND r.Resource = 'oil' GROUP BY r.Resource, g.Country, t.Year HAVING ( AVG(ConsumptionAmount) ~= high() ) > 0.7;
In the fuzzy filtering condition of the HAVING clause the result of the AVG function (crisp value) is compared to the L-R fuzzy number (high consumption) represented by the trapezoidal membership function (Fig. 3) defined earlier by the domain specialist. The result of the comparison is the compatibility degree between crisp value and fuzzy number. The compatibility degree (CD) should be greater than 0.7.
Fig. 3. Comparison of the crisp value (4184) and fuzzy number (high consumption)
In order to compare the crisp value of the AVG aggregate function with the fuzzy number, we had to develop the new equality operator (~=). This operator returns the value of the compatibility degree from the set [0,1]. Returned compatibility degree is compared to the given threshold 0.7, determining whether the row is displayed in the result set, or not. The result of the sample SQL query is presented below. Country -------China Russia USA
Year -----2009 2007 1998
Resource -------oil oil oil
AverageMonthConsumption ----------------------5773 5142 4184
Example 3: In this example, we show the use of the average aggregate function (FAVG) operating on fuzzy data. The FAVG function appears in the fuzzy filtering condition in the HAVING clause. The result of the aggregation (fuzzy number) is compared to the given crisp value. Generate report showing average costs of requirements for particular resource categories in particular countries and continents in the year 2010. Show only these countries, where costs of requirements are about US$ 150mln with the compatibility degree greater than or equal to 0.5. The query in the extended SQL language has the following form: SELECT g.Continent, g.Country, r.Category, FAVG(RequirementCost) AS AverageReqCost FROM DimGeography g JOIN FactRequirements rf ON g.GeoID=rf.GeoID JOIN DimTime t ON rf.TimeID=t.TimeID JOIN DimResource r ON rf.ResID=r.ResID WHERE t.Year = 2010 GROUP BY g.Continent, g.Country, r.Category HAVING ( FAVG(RequirementCost) ~= 150 ) >= 0.5;
622
Processing of the fuzzy filtering condition is similar to the previous example (comparison of the crisp value and fuzzy number, Fig. 3). Results of the query look as follows (sample data): Continent -------------------Asia Asia Europe Europe North America
Country --------------China Saudi Arabia France Italy USA
Category --------------non-renewable non-renewable non-renewable renewable non-renewable
AverageReqCost ----------------approx. 146 approx. 148 approx. 150 approx. 146 approx. 153
Example 4: In the example, we present the use of the sum aggregate function (FSUM) operating on fuzzy data. We also show the fuzzy filtering condition in the HAVING clause, which consists of two fuzzy numbers. Display total requirements for resources in particular continents. Show these continents, where total requirement is low, with a compatibility degree greater than 0.8. The query in the extended SQL language has the following form: SELECT g.Continent, r.Resource, r.Unit, FSUM(RequirementAmount) AS TotalRequirement FROM DimGeography g JOIN FactRequirements rf ON g.GeoID=rf.GeoID JOIN DimTime t ON rf.TimeID=t.TimeID JOIN DimResource r ON rf.ResID=r.ResID GROUP BY g.Continent, r.Resource, r.Unit HAVING ( FSUM(RequirementAmount) ~= low() ) > 0.8;
The result of the FSUM function (in the HAVING clause), which is a fuzzy number, is compared to the fuzzy value low requirement. The comparison is realized by the ~= equality operator, which returns the compatibility degree between two fuzzy numbers (Fig. 4). In the final result we obtain rows having the degree grater than 0.8. Continent ---------------------Australia & Oceania Australia & Oceania Australia & Oceania South America South America South America
Resource -----------gas tea water coffee grain oil
Unit ---ton kg m3 kg ton ton
TotalRequirement ------------------approx. 315491 approx. 316487 approx. 301659 approx. 318204 approx. 307051 approx. 316282
In Fig. 4 we can observe different possible cases of the comparison of two fuzzy numbers represented by trapezoidal membership functions. If two fuzzy numbers have only one common point (Fig. 4a), this point determines the compatibility degree of these two fuzzy numbers. If there is more than one common point (Fig. 4b) and we obtain more compatibility degrees, we choose the highest one.
Fig. 4. Comparison of two fuzzy numbers: a) one common point, b) many common points
623
Example 5: In the last example, we use the almost_all fuzzy linguistic quantifier in order to determine the proportion of data in the group, for which the FSUM aggregate function is calculated. The proportion is determined based on the given condition. Show these countries, for which almost whole requirement for the coal took place in the 3rd quarter of the 2009. Show only these cases, for which the compatibility degree is greater than 0.7. The query in the extended SQL language looks as follows: SELECT g.Country, FSUM(RequirementAmount) AS TotalRequirement FROM DimGeography g JOIN FactRequirements rf ON g.GeoID=rf.GeoID JOIN DimTime t ON rf.TimeID=t.TimeID JOIN DimResource r ON rf.ResID=r.ResID WHERE t.Year = 2009 AND r.Resource = 'coal' GROUP BY g.Country HAVING allmost_all( t.Quarter = ‘Q3’ ) > 0.7;
We treat the almost all linguistic quantifier as a fuzzy subset of the unit interval [0,1]. This quantifier is defined by the membership function presented in Fig. 5a. For each group of rows (for each country) we calculate the percentage of rows, which fulfil the specified condition (t.Quarter = ‘Q3’). Afterwards, we determine the compatibility degree between the percentage and the almost_all quantifier (Fig. 5b).
a)
b)
Fig. 5. Membership function for the almost_all linguistic quantifier (a). Calculation of the compatibility degree between sample percentage (0.925) and the almost_all quantifier (b)
In the result set, we want to show only these countries, which have the compatibility degree is greater than the given threshold 0.7 (we present sample data): Country --------------China France Germany Japan
TotalRequirement ---------------------approx. 30661 approx. 25783 approx. 26836 approx. 23773
It is worth noting that in all presented examples, we used only classical grouping by crisp attributes. It was our conscious intention. Actually, we have developed methods of fuzzy grouping of crisp data and grouping methods for fuzzy data. However, they were a subject of our previous paper [12].
5 Concluding Remarks Exploring the consumption and requirement for global natural resources needs soft computing methods in order to extend the spectrum of the analysis. Implementation of the L-R fuzzy type in a data warehouse allows to store and analyze imprecise data and study data, which have similar characteristics. Especially, in huge data repositories,
624
this leads to the rough and fast data analysis and filtering. Therefore, it makes the general improvement of the investigation possibilities, which is sometimes difficult to achieve in standard data warehouses. If we decide to include fuzzy measures (e.g. numerical values of some estimated facts), we have to use elements of arithmetic of fuzzy numbers or fuzzy logic during data aggregation. This may lead to the increase of the response time. However, our research on the effectiveness of presented queries shows the response time rises about 10% in comparison to corresponding queries implemented with the use of the standard SQL language. Such a good efficiency of our solution was achieved through the implementation of presented methods as a part of a database management system, where the fuzzy data warehouse resided. Data aggregation is one of the basic operations performed in data warehouses and methods presented in the paper have a general purpose. Recently, we have been developing a fuzzy data warehouse for forensic and criminalistic analysis as a part of the project O R00 0068 07.
References 1. Kimball, R., Reeves, L., Margy, R., Thornthwaite, W.: The Data Warehouse Lifecycle Toolkit. John Wiley & Sons (1998) 2. Ponniah P.: Data Warehousing Fundamentals. A Comprehensive Guide for IT Professionals. John Wiley & Sons (2001) 3. Bosc, P., Pivert, O.: SQLf: A Relational Database Language for Fuzzy Querying. IEEE Transactions on Fuzzy Systems. Vol. 3, No 1, 1—17 (1995) 4. Kacprzyk, J., Zadrozny, S.: SQLf and FQUERY for Access. In: IFSA World Congress and 20th NAFIPS International Conference. pp. 2464—2469 (2001) 5. Małysiak, B., Mrozek, D., Kozielski, S.: Processing Fuzzy SQL Queries with Flat, ContextDependent and Multidimensional Membership Functions. In: 4th IASTED International Conference on Computational Intelligence, pp. 36—41, ACTA Press, Calgary (2005) 6. Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and Efficient Fuzzy Match for Online Data Cleaning. In: 2003 ACM SIGMOD International Conference on Management of Data. San Diego, California, pp. 313—324 (2003) 7. Hua-Yang Lin, Ping-Yu Hsu, Gwo-Ji Sheen: A Fuzzy-based Decision-Making Procedure for Data Warehouse System Selection. Journal of Expert Systems with Applications. pp. 939—953 (2007) 8. Perez, D., Somodevilla, M.J., Pineda, I.H.: Fuzzy Spatial Data Warehouse: A Multidimensional Model. In: 8th Mexican International Conference on Current Trends in Computer Science, pp. 3—9, IEEE (2007) 9. Fasel, D., Zumstein, D.: A Fuzzy Data Warehouse Approach for Web Analytics. LNCS, vol. 5736, pp. 276—285. Springer, Heidelberg (2009) 10.Bouchon-Meunier, B., Yager, R.R., Zadeh, L.A.: Fuzzy Logic and Soft Computing. Advances in Fuzzy Systems, Application and Theory, vol.4, Singapore (1995) 11.Dubois, D., Prade, H.: Fundamentals of Fuzzy Sets. Kluwer Academic Publisher (2000) 12.Małysiak-Mrozek, B., Mrozek, D., Kozielski, S.: Data Grouping Process in Extended SQL Language Containing Fuzzy Elements. AISC, vol. 59, pp. 247—256. Springer, Heidelberg (2009)
625