Software Feature Model Recommendations Using Data Mining
Abdel Salam Sayyad, Hany Ammar, Tim Menzies
Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV, USA
[email protected], [email protected], [email protected]
Abstract—Feature Models are popular tools for describing software product lines. Analysis of feature models has traditionally focused on consistency checking (yielding a yes/no answer) and product selection assistance, interactive or offline. In this paper, we describe a novel approach to identify the most critical decisions in product selection/configuration by taking advantage of a large pool of randomly generated, generally inconsistent, product variants. Range ranking, a data mining technique, is utilized to single out the most critical design choices, reducing the job of the human designer to making less consequential decisions. A large feature model is used as a case study; we show preliminary results of the new approach to illustrate its usefulness for practical product derivation.

Keywords- Feature Models, design decisions, range ranking.
I. INTRODUCTION
The activity of deriving product variants from a given feature model occurs early in the software life cycle, and the decisions made there propagate throughout the life of the product, for better or for worse. Detecting and removing bugs early on amounts to great savings in the cost of building the software. The longer a bug stays in a system, the more expensive it is to remove. For example, removing a design error can be two orders of magnitude more expensive during testing than during coding [5]. The reason for this is that the longer software contains some feature, the more likely it becomes that this feature is used by other parts of the software. Changing something that is months to years old is therefore much more expensive than changing a mistake that was made only yesterday. That is, the ability to avoid errors even before the system is built (as done in this range ranking analysis) can have a dramatic impact on the cost of delivering the software.
The space of choices can become so large that it is cumbersome (or even impossible) to check all the rules against all the choices. To assist designers in this task, we use the following algorithm, which finds the core decisions that most select for better designs. The process is quite generic (a minimal sketch of this loop appears at the end of this section):
1. Create an empty set of decisions and let the current "score" of that set be zero.
2. Randomly make some choices.
3. Score the resulting design.
4. Add to the decisions the choices that select for better scores.
5. If the new scores are not better than before, then exit.
6. Randomly make more choices using the commitments learned so far.
7. Go to step 3.
This search returns the set of key design choices that selects for better designs. The decisions generated in this way expose hidden influences inside the feature set and provide the designer with a much needed summary of the most critical decisions that must be made at the beginning, thus saving design time and reducing the probability of error. Of course, the designer is still left with much to decide based on their domain knowledge and the business concerns of various stakeholders. But this approach provides the peace of mind of knowing the features that are most likely to cause constraint violations, and knowing the recommended configurations which best mitigate that risk. Thus this new method can be embedded in broader interactive or offline configuration assistance tools.
The rest of the paper is organized as follows: Section II provides background material on feature models, the SPLOT website, and the SXFM format for representing feature models. Section III describes the proposed analysis method. Sections IV and V describe the case study and the results of applying our method. Sections VI and VII discuss related work, conclusions, and future work.
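The following Python sketch restates steps 1-7 schematically. The helper functions random_choices, score_of, and choices_that_improve are hypothetical placeholders for the model-specific operations detailed in Section III; this is an illustration of the loop, not the exact implementation used in this paper.

    # Schematic restatement of the seven-step search above.
    # random_choices(), score_of() and choices_that_improve() are hypothetical
    # placeholders; Section III describes the concrete versions we use.
    def find_core_decisions(feature_model):
        decisions, best_score = set(), 0.0                                  # step 1
        while True:
            variants = random_choices(feature_model, decisions)             # steps 2 and 6
            scored = [(score_of(v), v) for v in variants]                   # step 3
            new_decisions, new_best = choices_that_improve(scored, best_score)  # step 4
            if new_best <= best_score:                                      # step 5: no improvement
                return decisions
            decisions |= new_decisions                                      # commit the learned choices
            best_score = new_best                                           # step 7: repeat with commitments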
II. BACKGROUND
A. Feature Models
A feature is an end-user-visible behavior of a software product that is of interest to some stakeholder [1]. A feature model represents the information of all possible products of a software product line in terms of features and the relationships among them. Feature models are a special type of information model widely used in software product line engineering. A feature model is represented as a hierarchically arranged set of features composed of:
1. relationships between a parent feature and its child features (or subfeatures), and
2. cross-tree constraints, typically inclusion or exclusion statements of the form: if feature F is included, then features A and B must also be included (or excluded).
Figure 1 [4] depicts a simplified feature model inspired by the mobile phone industry. The model illustrates how features are used to specify and build software for mobile phones. The software loaded in the phone is determined by the features that it supports.
According to the model, all phones must include support for calls and for displaying information on either a basic, color, or high-resolution screen. Furthermore, the software for mobile phones may optionally include support for GPS and for multimedia devices such as a camera, an MP3 player, or both [4].
A feature model can also contain cross-tree constraints between features. These are typically of the form:
Requires: if a feature A requires a feature B, the inclusion of A in a product implies the inclusion of B in that product. In Figure 1, mobile phones including a camera must include support for a high-resolution screen.
Excludes: if a feature A excludes a feature B, both features cannot be part of the same product. In Figure 1, GPS and the basic screen are incompatible features [4].
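To make the semantics of these constraints concrete, the short Python sketch below (our own illustrative code, not part of the SPLOT tool set) checks a candidate feature selection of the mobile phone model against its two cross-tree constraints; feature names follow Figure 1.

    # Cross-tree constraints of the mobile phone model, written as predicates
    # over a set of selected feature names (see Figure 1).
    def satisfies_cross_tree_constraints(selected):
        camera_requires_hi_res = ("camera" not in selected) or ("hi_res" in selected)
        gps_excludes_basic = not ("gps" in selected and "basic" in selected)
        return camera_requires_hi_res and gps_excludes_basic

    # A phone with GPS and a basic screen violates the excludes constraint:
    print(satisfies_cross_tree_constraints({"calls", "gps", "basic"}))      # False
    print(satisfies_cross_tree_constraints({"calls", "hi_res", "camera"}))  # True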
Figure 1: Feature model for mobile phone product line
B. SPLOT and SXFM
The SPLOT (Software Product Line Online Tools) website [13] was launched in May 2009 to "put Software Product Lines research into practice through the delivery of state-of-the-art online tools targeting academics and practitioners in the field". The website hosts a feature model repository which (as of February 2012) includes 181 feature models, most of them representing realistic systems. The SPLOT website defines the Simple XML Feature Model (SXFM) format [14] and provides a Java library for handling it; all feature models in the online repository are therefore written in the SXFM format. Figure 2 shows the SXFM representation of the mobile phone feature model from Figure 1. It shows the root feature (marked with :r), the mandatory features (marked with :m), the optional features (marked with :o), and the feature groups (marked with :g). The cross-tree constraints are listed at the bottom in Conjunctive Normal Form (CNF).
<meta> Mobile Phone ISA_RDA …
:r Mobile Phone(root)
  :m calls(calls)
  :o GPS(gps)
  :m screen(screen)
    :g (_g_1) [1,1]
      : basic(basic)
      : color(color)
      : high-resolution(hi_res)
  :o media(media)
    :g (_g_2) [1,*]
      : camera(camera)
      : mp3(mp3)
constraint_1: ~gps or ~basic
constraint_2: camera or ~hi_res

Figure 2: Mobile Phone feature model in SXFM format
III. OUR ANALYSIS METHOD

A. Generating Random Product Variants
We first generate an arbitrarily large pool (we used 10,000) of product variants from the given feature model. Starting from the root of the SXFM model, we examine all sub-features: if they are mandatory (marked with :m), we keep them in every product; if they are optional (marked with :o), we toss a coin to decide whether they are included; if they form a group of m to n items, we pick a random integer between m and n and include that many items from the group, chosen at random. The removal of any feature results in the removal of all its children.
We score each resulting product variant by the ratio of satisfied cross-tree constraints. This is the measure we seek to optimize when we apply the range ranking learner to find the most far-reaching design decisions.
The output of this stage is a large CSV (Comma-Separated Values) file in which the rows represent the product variants and the columns represent the features. Each cell is filled with a '1' if the feature is selected and a '0' otherwise. The last column holds the score of each variant. The structure of the CSV file is shown in Figure 3; a sketch of the generation and scoring step follows below.

Figure 3: Structure of CSV file
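A minimal Python sketch of this generation-and-scoring step is shown below. It assumes the feature model has already been loaded into a simple in-memory tree (a hypothetical Feature class, used here instead of SPLOT's Java SXFM library) and that the cross-tree constraints are supplied as Python predicates over the set of selected feature names; it illustrates the procedure rather than reproducing the script we actually ran.

    import csv
    import random

    class Feature:
        # Hypothetical in-memory node: kind is 'm' (mandatory), 'o' (optional) or 'g' (group);
        # card is the (m, n) cardinality of a group, where n may be '*' (all members).
        def __init__(self, name, kind='m', children=(), card=(1, 1)):
            self.name, self.kind = name, kind
            self.children, self.card = list(children), card

    def random_variant(node, selected=None):
        # Walk the tree from the root; dropping a feature drops its whole subtree.
        selected = set() if selected is None else selected
        selected.add(node.name)
        for child in node.children:
            if child.kind == 'm':                                  # mandatory: always keep
                random_variant(child, selected)
            elif child.kind == 'o' and random.random() < 0.5:      # optional: toss a coin
                random_variant(child, selected)
            elif child.kind == 'g':                                # group [m, n]: random count
                low, high = child.card
                high = len(child.children) if high == '*' else high
                for member in random.sample(child.children, random.randint(low, high)):
                    random_variant(member, selected)
        return selected

    def score(selected, constraints):
        # Ratio of satisfied cross-tree constraints; this is the last CSV column.
        return sum(1 for c in constraints if c(selected)) / len(constraints)

    def write_variants_csv(path, feature_names, variants, constraints):
        with open(path, 'w', newline='') as fh:
            writer = csv.writer(fh)
            writer.writerow(feature_names + ['score'])
            for sel in variants:
                writer.writerow([int(name in sel) for name in feature_names] + [score(sel, constraints)])

With the mobile phone model of Figure 1 encoded as such a Feature tree and its two constraints written as predicates (for example, lambda s: not ('gps' in s and 'basic' in s)), calling random_variant on the root 10,000 times and passing the results to write_variants_csv produces a file with the structure of Figure 3.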
We also calculate the frequency at which each constraint is violated throughout the entire set of variants. This will be compared to the performance after the feature model is “improved” with the resulting design choices.
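Given the constraint-predicate representation assumed in the previous sketch, this bookkeeping is a one-liner; the helper below is again illustrative only.

    def violation_frequencies(variants, constraints):
        # For each cross-tree constraint, the fraction of variants that violate it.
        return [sum(1 for sel in variants if not c(sel)) / len(variants) for c in constraints]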
B. Range Ranking
Range ranking is a data mining technique used to find the key influences in large sets of data. The algorithm used here is adapted from previous work such as [7] and [8]. In range ranking, we divide the attributes into discrete ranges. In the case of the CSV table obtained from the previous stage, the features take the values '0' or '1', so the discretization is already done. Range ranking searches through the space of all possible ranges: if some range selects for a subset of the rows, and that subset has a better mean score than before, we print that range and recurse on the subset.
Range ranking explores two sets: asIs and toBe. asIs is initialized to all the rows; toBe is the subset of asIs containing the next best range. To find toBe, we work as follows:
1. Let before be the number of rows in asIs.
2. For each attribute:
   a. Sort the asIs rows by their attribute values.
   b. For all asIs rows with the same range r:
      i. find the number of rows n_r with that range;
      ii. find the sum s_r of the scores of those rows (in our case, the scores are the ratios of satisfied constraints).
3. Find the range r, over all attributes, with the maximum lift

      lift(r) = (s_r / n_r) / (sum_r s_r / sum_r n_r)

   Formally, the lift measures how much better the mean score of the rows containing r is than the mean score of all the rows in asIs.
4. Let toBe be all rows with the top-scoring range, and let now be the number of rows in toBe.
5. If now equals before, then nothing has been improved and we quit.
6. Else, we
   a. print the top-scoring range and report the number of rows in toBe;
   b. set asIs to toBe;
   c. go to step 1.
In step 5, we quit whenever the next range does not improve the current asIs set. Formally, this search is a forward select with a horizon of one; i.e., if no improvements are made within that horizon, the search terminates (such a search is sometimes called a "greedy search"). Better results might be obtained with a horizon greater than one; however, as shown below, we do remarkably well with a horizon of one.
The result of this stage is the extraction of the feature configuration decisions which characterize the top-ranked ranges, i.e. the choices that lead to better scores and thus better satisfy the design constraints. A sketch of this learner is given below.
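The following minimal sketch shows one way to implement this learner, assuming the CSV of Section III.A has been loaded into a list of Python dicts (one per variant, with a 0/1 entry per feature and a numeric 'score' field); it follows the greedy, horizon-one loop described above but is not the exact script used for the experiments.

    def range_rank(rows, attributes, score_key='score'):
        # rows: list of dicts, one per variant; attributes: the feature names (CSV columns).
        as_is, decisions = rows, []
        while True:
            before = len(as_is)                                        # step 1
            overall_mean = sum(r[score_key] for r in as_is) / before
            best = None                                                # (lift, attribute, value, subset)
            for attr in attributes:                                    # step 2
                for value in (0, 1):                                   # features are already discrete;
                    subset = [r for r in as_is if r[attr] == value]    # grouping by value replaces the sort in 2a
                    if not subset:
                        continue
                    mean_r = sum(r[score_key] for r in subset) / len(subset)
                    lift = mean_r / overall_mean                       # step 3: lift of range (attr = value)
                    if best is None or lift > best[0]:
                        best = (lift, attr, value, subset)
            to_be = best[3]                                            # step 4
            if len(to_be) == before:                                   # step 5: no improvement, quit
                return decisions
            decisions.append((best[1], best[2], len(to_be)))           # step 6a: record the top range
            as_is = to_be                                              # steps 6b, 6c: recurse on the subset

Each recorded (attribute, value) pair corresponds to one decision row of Table I below.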
C. Applying the Design Decisions
With the set of choices resulting from the previous stage, we produce a constrained version of the original feature model, and we re-run the procedure outlined in Section III.A to obtain a new list of "frequencies of violations". This is the main point of comparison and comment.

IV. CASE STUDY
Our case study for this paper is the "Electronic Shopping" feature model [10], found in the SPLOT feature model repository (see Section II.B). It is the largest realistic feature model in that repository, with 290 features and 21 cross-tree constraints. We present the results for this feature model in Section V.

V. RESULTS
We produced 10,000 variants of the Electronic Shopping feature model, scored them, and output a CSV file. We note at this stage that 126 variants scored 100%, which indicates a 1.26% probability of producing correct product variants at random.
Next we fed the CSV file into the range ranking script. The output is shown in Table I. Each row represents a step in the range ranking process. The columns show the 10th, 30th, 50th, 70th, and 90th percentiles of the scores of the included variants. In the first step, all the product variants are included, and the median score is 76%. In the second step, 3341 variants are included which share the property (register_to_buy = 1), and the median score improves to 85%. In the third step, 1112 variants are included which share the properties (register_to_buy = 1 and physical_goods = 0), and the median score improves to 90%, and so on.
Table I: Result of range ranking. For each step, the "variants" column gives the number of variants that satisfy the decision in that row together with all preceding decisions (the decisions are cumulative), and the remaining columns give the 10th, 30th, 50th, 70th, and 90th percentiles of their scores (percentage of satisfied constraints).

decision                   variants   10th   30th   50th   70th   90th
none                       10000      61     71     76     80     90
register_to_buy = 1        3341       71     80     85     90     95
physical_goods = 0         1112       76     85     90     90     95
size = 1                   569        80     85     90     90     95
registered_checkout = 0    136        80     90     90     95     100
_id_26 = 1                 41         85     90     90     95     100
_id_31 = 1                 9          0      95     95     100    100
shipping_address = 0       4          0      95     95     100    100
_id_25 = 1                 2          0      0      100    100    100
We implemented the first four decisions, namely:
1. register_to_buy = 1, i.e. make it a mandatory feature. This also requires its "grandparent" feature to become mandatory.
2. physical_goods = 0, i.e. delete this feature.
3. size = 1, i.e. make it mandatory.
4. registered_checkout = 0, i.e. delete this feature and the two children below it. Since it is part of a [1,*] group, and only one member is left, namely Guest checkout (_id_87), _id_87 must be moved one level up and labeled mandatory, and the group designation is removed.
A sketch of such model edits is given below.
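For illustration, the two kinds of edits (force a feature to be mandatory, or delete it) can be sketched on the hypothetical Feature tree from the sketch in Section III.A; the special group handling of decision 4 (promoting _id_87 and removing the group) is not automated here and was applied by hand.

    def make_mandatory(node, name, ancestors=()):
        # Flip the named feature, and any optional ancestors, from 'o' to 'm'
        # (e.g. register_to_buy and its "grandparent").
        if node.name == name:
            for a in list(ancestors) + [node]:
                if a.kind == 'o':
                    a.kind = 'm'
            return True
        return any(make_mandatory(c, name, tuple(ancestors) + (node,)) for c in node.children)

    def delete_feature(node, name):
        # Remove the named feature together with its whole subtree.
        node.children = [c for c in node.children if c.name != name]
        for c in node.children:
            delete_feature(c, name)

Applying make_mandatory(root, "register_to_buy") and delete_feature(root, "physical_goods"), for example, contributes to the "reduced" model from which the second pool of variants is generated.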
Why have we implemented only four decisions and not all eight? Simple:
1. It shows the great impact of just a few "influential choices", and hence the power of range ranking.
2. The designer still needs to make domain-informed choices; we do not intend to cancel the role of the human designer!
We next generate 10,000 variants from the "reduced" feature model and make the following observations:
1. This time, 1056 variants scored 100%, which indicates a 10.56% probability of producing correct product variants at random, compared to 1.26% before applying the decisions.
2. Figure 4 shows the frequency of violation of each constraint before applying the decisions (in the foreground) and after applying the decisions (in the background). We notice that 11 constraints out of 22 are no longer violated at all, just by making four decisions.
Figure 4: Frequency of violation of each constraint before and after applying the decisions (y-axis: frequency of violations, 0-60%; x-axis: constraints)
VI. RELATED WORK
Existing tools that assist the designer in product configuration and feature selection fall into two categories:
1. Semantic correctness checking, in which the tool seeks to prioritize the features to be selected based on the stakeholders' business concerns. Examples are:
a. Paper [11] presented an approach to collaborative product configuration that supports splitting the feature model into smaller units called configuration spaces and arranging such spaces in a workflow-like plan. A moderate-sized feature model (the Web Portal) was used to illustrate this method.
b. Paper [2] proposed a Stratified (i.e. layered) Analytic Hierarchy Process, which first helps to rank and select the most relevant high-level business objectives for the target stakeholders (e.g., security over implementation cost), and then helps to rank and select the most relevant features from the feature model to be used as the starting point in the staged configuration process. The E-Shop feature model was used, and user input was collected to evaluate this method.
c. Paper [12] employs Hierarchical Task Network (HTN) planning, a popular planning technique, to automatically select suitable features that satisfy the stakeholders' business concerns and resource limitations. A performance evaluation is provided for three feature models containing 25, 45, and 65 features. The worst-case run time is reported to be 89 seconds, which is quite significant for these moderate-size feature models.
d. Paper [3] provides automated reasoning on extended feature models (i.e. feature models with extra-functional features). Using this extension, extra-functional information such as a price range or a time range can be attached to features. The model is mapped to a constraint satisfaction problem (CSP), and CSP solvers return a set of features which satisfy the stakeholders' criteria.
e. Paper [15] used a filtered Cartesian flattening method to select optimal feature sets according to resource constraints. The feature selection problem is mapped to a multidimensional multi-choice knapsack problem (MMKP); by applying existing MMKP approximation algorithms, a partially optimal feature configuration is obtained in polynomial time.
f. Paper [16] introduced the MUSCLE tool, which provides a formal model for multi-step configuration and maps it to constraint satisfaction problems (CSPs). CSP solvers are then used to determine the path from the start of the configuration to the desired final configuration. Non-functional requirements, such as cost constraints between two configurations, are considered, and a sequence of minimal feature adaptations is calculated to reach the desired configuration from the initial one.
2. Syntactic correctness checking, in which the tool lets the user choose features to include or remove, and provides instant feedback on whether each selection violates any design constraints. This is also referred to as interactive or online configuration, as in the tools described in [6] and [9]. In [17], a procedure is suggested for fixing errors in product variants using constraint satisfaction problems (CSPs).
Our work is closer to the second category, since we are concerned with the consistency of products, i.e. satisfying all the constraints. However, the existing tools do not prioritize the features for the configuration process based on constraint satisfaction.
VII. CONCLUSIONS AND FUTURE WORK
In this paper, we presented a novel approach to product configuration starting from a feature model, in which we utilize a data mining technique to find the most critical feature selections, together with a recommendation on whether to include or exclude each of them. Given our recommendations, the designer is aware of the most influential design decisions and should therefore turn their attention to those decisions first. Once the initial few decisions are locked in, the designer proceeds to make the remaining choices using their domain knowledge and the stakeholders' concerns, knowing that the constraint-violation probability has been dramatically reduced, and can thus focus more on the semantic correctness aspect of product configuration.
Future work may pursue the following avenues:
1. Examining the scalability and performance of this method on larger feature models, and comparing it to other analysis tools.
2. Examining the effect of this added facility on the product configuration process, such as interactive configuration and cooperative configuration, and measuring the improvements in design cost, software reliability, and maintainability.
3. Integrating the range ranking method into existing interactive configuration tools.
ACKNOWLEDGMENT
This research work was funded by the Qatar National Research Fund (QNRF) under the National Priorities Research Program (NPRP), Grant No. 09-1205-2-470.
REFERENCES
[1] Sven Apel, Hendrik Speidel, Philipp Wendler, Alexander von Rhein, and Dirk Beyer, "Detection of Feature Interactions using Feature-Aware Verification," in Proceedings of the IEEE/ACM International Conference on Automated Software Engineering (ASE), Lawrence, KS, USA, 2011, pp. 372-375.
[2] Ebrahim Bagheri, Mohsen Asadi, Dragan Gasevic, and Samaneh Soltani, "Stratified Analytic Hierarchy Process: Prioritization and Selection of Software Features," in Proceedings of the 14th International Software Product Line Conference, Jeju Island, South Korea, 2010, pp. 300-315.
[3] D. Benavides, A. Ruiz-Cortés, and P. Trinidad, "Automated Reasoning on Feature Models," in Proceedings of CAiSE, 2005, pp. 491-503.
[4] David Benavides, Sergio Segura, and Antonio Ruiz-Cortés, "Automated analysis of feature models 20 years later: A literature review," Information Systems, vol. 35, no. 6, pp. 615-636, September 2010.
[5] B. Boehm, "Software Engineering," IEEE Transactions on Computers, vol. 25, no. 12, pp. 1226-1241, 1976.
[6] Wonseok Chae and Timothy L. Hinrichs, "SMARTFORM: A Web-based Feature Configuration Tool," in Fourth International Workshop on Variability Modelling of Software-intensive Systems (VaMoS), Linz, Austria, 2010.
[7] Gregory Gay, Tim Menzies, Misty Davies, and Karen Gundy-Burlet, "Automatically finding the control variables for complex system behavior," Automated Software Engineering, vol. 17, no. 4, pp. 439-468, 2010.
[8] Gregory Gay et al., "Finding robust solutions in requirements models," Automated Software Engineering, vol. 17, no. 1, pp. 87-116, 2010.
[9] Mikolas Janota, Goetz Botterweck, Radu Grigore, and Joao Marques-Silva, "How to complete an interactive configuration process?," in Proceedings of the 36th International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM), 2010.
[10] Sean Quan Lau, "Domain analysis of e-commerce systems using feature-based model templates," Master's thesis, Dept. of Electrical and Computer Engineering, University of Waterloo, Canada, 2006.
[11] Marcilio Mendonca, Thiago Bartolomei, and Donald Cowan, "Decision-making coordination in collaborative product configuration," in Proceedings of the 23rd Annual ACM Symposium on Applied Computing, 2008.
[12] Samaneh Soltani, Mohsen Asadi, Marek Hatala, Dragan Gasevic, and Ebrahim Bagheri, "Automated Planning for Feature Model Configuration based on Stakeholder's Business Concerns," in Proceedings of the IEEE/ACM International Conference on Automated Software Engineering (ASE), Lawrence, KS, USA, 2011, pp. 536-539.
[13] SPLOT - Software Product Line Online Tools. [Online]. http://www.splot-research.org
[14] SXFM Format. [Online]. http://gsd.uwaterloo.ca:8088/SPLOT/sxfm.html
[15] J. White, B. Dougherty, and D. C. Schmidt, "Selecting highly optimal architectural feature sets with Filtered Cartesian Flattening," Journal of Systems and Software, vol. 82, no. 8, pp. 1268-1284, August 2009.
[16] J. White, B. Dougherty, D. C. Schmidt, and D. Benavides, "Automated reasoning for multi-step feature model configuration problems," in Proceedings of the 13th International Software Product Line Conference, San Francisco, CA, USA, 2009, pp. 11-20.
[17] J. White, D. C. Schmidt, D. Benavides, P. Trinidad, and A. Ruiz-Cortés, "Automated Diagnosis of Product-line Configuration Errors in Feature Models," in Proceedings of the 12th International Software Product Line Conference, Limerick, Ireland, 2008.