June 23, 2010 17:16 WSPC/118-IJUFKS

S0218488510006635

International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems Vol. 18, No. 4 (2010) 431–449 © World Scientific Publishing Company

DOI: 10.1142/S0218488510006635

USING COLLABORATIVE FILTERING FOR DEALING WITH MISSING VALUES IN NUCLEAR SAFEGUARDS EVALUATION

ROSA M. RODRÍGUEZ∗ and LUIS MARTÍNEZ†
Department of Computer Science, University of Jaén,
Campus las Lagunillas s/n, Jaén, 23071, Spain
∗[email protected]
†[email protected]

DA RUAN
Belgian Nuclear Research Centre (SCK·CEN), Mol, 2400 & Ghent University, Gent, Belgium
[email protected]; [email protected]

JUN LIU
School of Computing and Mathematics, University of Ulster, BT37 0QB, N. Ireland, UK
[email protected]

Received 30 January 2010
Revised 15 May 2010
Accepted 5 June 2010

Nuclear safeguards evaluation aims to verify that countries are not misusing nuclear programs for nuclear weapons purposes. Experts of the International Atomic Energy Agency (IAEA) carry out an evaluation process in which several hundred indicators are assessed according to the information obtained from different sources, such as State declarations, on-site inspections, IAEA non-safeguards databases and other open sources. These assessments are synthesized in a hierarchical way to obtain a global assessment. Much of the information, and many of the sources of information, related to nuclear safeguards are vague, imprecise and ill-defined. The use of the fuzzy linguistic approach has provided good results in dealing with such uncertainties in this type of problem. However, a new challenge in nuclear safeguards evaluation has attracted the attention of researchers. Due to the complexity and vagueness of the sources of information obtained by IAEA experts and the huge number of indicators involved in the problem, it is common that the experts cannot assess all of them, producing missing values in the evaluation that can bias the nuclear safeguards results. This paper proposes a model based on collaborative filtering (CF) techniques to impute missing values and provides a trust measure that indicates the reliability of the nuclear safeguards evaluation with the imputed values.

Keywords: Nuclear safeguards; fuzzy linguistic approach; 2-tuple; missing values; trustworthiness; collaborative filtering; imputation.
∗Corresponding author.


1. Introduction

Among its different duties, the IAEA carries out nuclear safeguards evaluation, which aims to detect the diversion of nuclear materials and to verify that a State is living up to its international undertakings not to use nuclear programs for nuclear weapons purposes. The safeguards system is based on assessments of the correctness and completeness of the State's declarations to the IAEA concerning nuclear material and related nuclear activities.18 As part of its efforts to strengthen international safeguards, the IAEA is enhancing its ability to provide credible assurance of the absence of undeclared nuclear material and activities. The IAEA uses huge amounts of information on States' nuclear and related activities. To develop an effective nuclear safeguards evaluation, the IAEA has developed the Physical Model,18 a systematic and comprehensive indicator system built on a convenient structure of a State's nuclear activities in the nuclear fuel cycle that includes all the main activities that might be involved in the production of nuclear weapon-usable materials. The evaluation framework thus consists of a hierarchical, multi-layer structure (see Fig. 1) from indicators to factors. The latter are synthesized from the former, such that the existence of nuclear activities is determined by indicators and/or factors that suggest or indicate the existence of nuclear programs for nuclear weapons.

Fig. 1. Structure of the overall evaluation.

The IAEA experts evaluate indicators on the basis of their analysis and knowledge, according to the information available in different sources of vague and uncertain information.27,28 So far, the main focus in solving nuclear safeguards satisfactorily has been the management of the uncertainty inherent to this problem, and different proposals have been published.27,28,31,35 The use of the fuzzy linguistic approach,46 which had provided successful results in other evaluation fields5–7,24,29,36 and other topics,2,30,43 has also provided a way to cope with such uncertainties in nuclear safeguards evaluation.27,28 Once the uncertainty of the problem can be managed in a reasonable way, the focus of interest in the nuclear safeguards problem has recently moved to the treatment of missing values in the experts' evaluations.


Due to the complexity, vagueness and ill-structure of nuclear safeguards and the huge number of indicators, the IAEA experts often cannot provide assessments for some indicators, which produces missing values in the evaluation framework. There exist different methods for managing missing values,11,22,32,34 such as deletion, imputation, and use as they are. Different treatments can provide different results; therefore, it is necessary to study, develop and apply different strategies in order to decide which is the best one for dealing with missing values in nuclear safeguards evaluation. So far, the initial works for managing missing values in nuclear safeguards have treated them as they are.21,22 This paper, however, proposes their management by an imputation process based on a CF model with a K-nn scheme.8,13 An imputation consists of substituting a missing value with an assessment that should represent the value that the expert would provide if he/she had the necessary knowledge. Collaborative filtering based on a K-nn scheme has been successfully applied in recommender systems,14 academic orientation4 and many other fields. Although these problems are quite different, the goal of finding values to replace missing values is similar. Additionally, the dataset used in nuclear safeguards fits the requirements of CF well. Therefore, a CF-based model can provide suitable values to be used as imputations in this problem. Since no CF model is optimal for every problem and every dataset, we have carried out a case study with a nuclear safeguards linguistic dataset to compute the imputations with the most suitable model. The imputations computed by the CF are not real experts' assessments, but rather approximations. It is therefore crucial for this model to provide a trustworthiness value for each imputed value, in order to compute the reliability of the nuclear safeguards evaluation. This paper is structured as follows: In Sec.
2, we review some works related to nuclear safeguards evaluation. In Sec. 3, we review some necessary preliminaries. In Sec. 4, we present a new proposal for managing missing values in nuclear safeguards evaluation by means of an imputation model and a trust measure. In Sec. 5, we show an illustrative example of the proposed methodology; finally, we conclude the paper in Sec. 6.

2. Related Works

The IAEA Physical Model17 provides a structure for organizing the safeguards-relevant information that is used by IAEA experts to better evaluate the safeguards significance of information on a State's activities. The Physical Model identifies and describes indicators for a particular process that already exists or is under development. Over 900 indicators are identified within the IAEA study throughout the whole fuel cycle. The specificity of each indicator with respect to a given nuclear activity is used to classify the indicators according to


their strength. An indicator that appears only if a specific nuclear activity exists or is being developed, or whose presence is always accompanied by a certain nuclear activity, is a "strong" indicator. An indicator that is present for many other reasons or is associated with many other activities is a "weak" indicator. In between there are "medium" indicators. In Ref. 27, a linguistic evaluation model was presented for the treatment of nuclear safeguards based on the 2-tuple symbolic approach.15 This model is based on the IAEA Physical Model and is divided into several levels of lower complexity (see Fig. 1). The global assessment is obtained by a multi-step linguistic aggregation process. The authors focus on the synthesis and evaluation analysis of the Physical Model indicator information. Besides, they present and analyze different kinds of ordinal linguistic aggregation operators. Another proposal to manage the nuclear safeguards problem was presented in Ref. 28, for modelling, analyzing and synthesizing information on nuclear safeguards. This framework can be divided into three parts: (i) it uses the multi-layer evaluation model based on the hierarchical structure of the Physical Model17 and is a multi-layer comprehensive structure, (ii) the modelling framework of multi-expert synthesis is based on the evidential reasoning approach, and (iii) the inference of the rule base is implemented by using the evidential reasoning algorithm. Other proposals about nuclear safeguards can be found in Refs. 31 and 35. In Ref. 31, a decision support system was presented to help the analyst in the State evaluation process in the nuclear non-proliferation framework. In Ref. 35, a flexible and realistic linguistic assessment approach was developed to provide a mathematical tool for the synthesis and evaluation analysis of nuclear safeguards indicator information. In this framework, some weighted aggregation functions introduced by Yager42,44 were analyzed and extended.
Despite the previous proposals dealing with the general framework of nuclear safeguards, IAEA inspectors (experts) are not fully able to score every indicator at the time they inspect a specific site. In reality, inspectors (experts) can hardly obtain real data/values for all the indicators they have inspected. As a result, experts provide evaluations with missing values in several of the 914 indicators of nuclear safeguards. So far, the IAEA has not used any mathematical framework to address the problem of missing values; it is handled mainly by means of meetings, discussions and further investigations, and then by common-sense subjective judgements plus some statistical tools, without considering extra uncertainties. However, due to the necessity of addressing this main issue in nuclear safeguards, some proposals21,22 formally dealing with experts' incomplete evaluations have recently arisen. These proposals treat the missing values as they are, avoiding both the deletion of scarce information and any imputation.


3. Preliminaries

In this section, we briefly review the linguistic background and some concepts about missing values that will be used in the proposal.

3.1. Linguistic background

This section revises some concepts of the fuzzy linguistic approach46 and the linguistic 2-tuple model,15 because the proposal for imputing missing values builds on the linguistic nuclear safeguards model presented in Refs. 27 and 28.

3.1.1. Fuzzy linguistic approach

Nuclear safeguards evaluation deals with uncertain information, which is mainly related to human cognitive processes. Such uncertainties cannot easily be assessed in a quantitative form, but rather in a qualitative one. In that case, the use of linguistic assessments instead of numerical values seems a better approach. The fuzzy linguistic approach provides a direct way to represent qualitative aspects as linguistic values by means of linguistic variables.46 We shall use linguistic modelling for nuclear safeguards evaluation by extending the work presented in Refs. 27 and 28. To use linguistic information for modelling nuclear safeguards assessments, we need to choose appropriate linguistic descriptors for the term set and their semantics. Another important parameter to be determined is the granularity of uncertainty, i.e., the cardinality of the linguistic term set used to express the information.
One possibility for generating a linguistic term set, S = {s0, ..., sg}, consists of directly supplying the term set by considering all the terms evenly distributed on a scale on which a total order is defined.43 For example, a seven-term set, S, could be:

S = {s0: nothing (n), s1: very low (vl), s2: low (l), s3: medium (m), s4: high (h), s5: very high (vh), s6: perfect (p)}

In these cases, it is usually required that the following operators exist:

• Negation operator: Neg(si) = sj such that j = g − i (g + 1 is the cardinality)
• Maximum operator: max(si, sj) = si if si ≥ sj
• Minimum operator: min(si, sj) = si if si ≤ sj

The semantics of the terms is represented by fuzzy numbers defined in the interval [0, 1], described by membership functions. A way to characterize a fuzzy number is to use a representation based on the parameters of its membership function,3 e.g., see Fig. 2. The use of linguistic information implies processes of computing with words (CW),45 for which different computational models can be found in the literature.9,15,40,41 By


Fig. 2. A seven-term set: membership functions for nothing, very low, low, medium, high, very high and perfect over [0, 1], with labels placed at 0, 0.17, 0.33, 0.5, 0.67, 0.83 and 1.

extending the works,27,28 we shall use the 2-tuple linguistic representation model15 to accomplish the processes of CW, as described in the following subsection.

3.1.2. Linguistic 2-tuple model

The 2-tuple linguistic model was introduced in Ref. 15 to improve the precision of the processes of CW, and in Refs. 27 and 28 it was used to model and manage the uncertainty in nuclear safeguards evaluation and to accomplish the computational processes. This model represents the linguistic information by means of a pair of values, called a 2-tuple, (si, α), where si is a linguistic term and α is a numerical value representing the symbolic translation.

Definition 1. The symbolic translation is a numerical value assessed in [−0.5, 0.5) that supports the difference of information between a counting of information β, assessed in the interval of granularity [0, g] of the term set S, and the closest value in {0, ..., g}, which indicates the index of the closest linguistic term in S.

This linguistic model defines a set of functions to carry out transformations between numerical values and 2-tuples, to facilitate the processes of CW.

Definition 2. Let S = {s0, ..., sg} be a linguistic term set and β ∈ [0, g] a value supporting the result of a symbolic aggregation operation. The 2-tuple that expresses the information equivalent to β is obtained as follows:

∆ : [0, g] → S × [−0.5, 0.5)

∆(β) = (s_i, α), with i = round(β) and α = β − i, α ∈ [−0.5, 0.5)

where round is the usual rounding operation, s_i has the index of the label closest to β, and α is the value of the symbolic translation. It is noteworthy that ∆ is bijective15 and ∆−1 : S × [−0.5, 0.5) → [0, g] is defined by ∆−1(s_i, α) = i + α. In this way, every 2-tuple of S is identified by a numerical value in the interval [0, g].
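As a minimal sketch (not code from the paper; the term names follow the seven-term set of Sec. 3.1.1 and the function names are ours), the ∆ and ∆⁻¹ functions of Definition 2 can be written as:

```python
import math

# the seven-term set of Sec. 3.1.1, indexed 0..g
S = ["nothing", "very low", "low", "medium", "high", "very high", "perfect"]
g = len(S) - 1

def delta(beta):
    """Delta: [0, g] -> S x [-0.5, 0.5).
    Map a symbolic aggregation result beta to the closest linguistic
    term and its symbolic translation alpha = beta - i."""
    i = min(int(math.floor(beta + 0.5)), g)   # index of the closest label
    return (S[i], round(beta - i, 3))

def delta_inv(term, alpha):
    """Delta^{-1}: recover the numerical value i + alpha from a 2-tuple."""
    return S.index(term) + alpha

print(delta(3.54))               # -> ('high', -0.46), as in the example of Sec. 5
print(delta_inv("high", -0.46))  # -> 3.54
```

Note that `math.floor(beta + 0.5)` is used instead of Python's built-in `round` so that α always falls in [−0.5, 0.5), matching the definition.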


Additionally, the 2-tuple linguistic model has an associated computational model that was presented in further detail in Ref. 15. Such a computational model is based on the aforementioned functions.

3.2. Treatment of missing values

The performance of decision making, data mining, classification and statistical models is mainly based on the quality of the input data. In many real-world situations the gathered dataset is incomplete, because it presents missing values that produce different problems and a lack of quality in performance.19,32,38 Missing values can appear for different reasons,26 such as negligence, damage, privacy issues, imprecision, vagueness and lack of expertise. Therefore, it is necessary to treat such missing values when data is incomplete. There exist different strategies for dealing with missing values:16,19,32,38

(1) Deletion: It consists of deleting the objects that present missing values. Though this strategy is easy to carry out, it presents several drawbacks:

• The possibility of eliminating useful information in the data, leading to biased results.32,47
• It might happen that the proportion of missing values is so big that most of the data are ignored because of the deletion of the objects that present missing values.

Jiménez and Mateos20 presented two approaches to deal with missing performances in multi-criteria decision making problems by disregarding the attributes for which a decision alternative provides no performance.

(2) Use as they are: In this approach, original datasets with missing attribute values are not preprocessed; the missing values are ignored and the remaining observations are used. The models that adopt this strategy should be capable of handling incomplete data. Kabak and Ruan22 proposed an initial methodology to solve problems in nuclear safeguards evaluation where the missing values are managed with this strategy.

(3) Imputation: It consists of replacing a missing value with an estimation. Most models dealing with missing values use this strategy. Once the missing values have been imputed, classical models can be applied. Different proposals have been developed for this strategy.11,12,16,34 Grzymala-Busse12 presented probabilistic methods to handle missing values. One method consists of replacing missing values by the most probable known attribute value if the attribute is symbolic, and by the average of the known values if it is numerical. Another method replaces a missing value by the attribute value with the largest conditional probability, given the concept to which the case belongs. Pawlak34 presented different methods for dealing with missing data in classification, based on probabilistic methods as well. Recently, Herrera-Viedma et al.16 presented a method for imputing values in incomplete preference relations based on additive transitivity.
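As a hedged sketch of the imputation strategy, the mean/mode rule described above (most frequent value for symbolic attributes, average of known values for numerical ones) can be written as follows; the function name and toy data are ours, not from the paper:

```python
from collections import Counter

def impute_column(values, symbolic):
    """Replace None entries: most frequent value for symbolic attributes,
    average of known values for numerical ones (a common baseline rule)."""
    known = [v for v in values if v is not None]
    fill = (Counter(known).most_common(1)[0][0] if symbolic
            else sum(known) / len(known))
    return [fill if v is None else v for v in values]

print(impute_column(["l", "m", None, "m"], symbolic=True))   # -> ['l', 'm', 'm', 'm']
print(impute_column([2.0, None, 4.0], symbolic=False))       # -> [2.0, 3.0, 4.0]
```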


4. A Collaborative Filtering Model for Imputing Missing Values in Nuclear Safeguards Evaluation

Satisfactory evaluation models for nuclear safeguards have so far been developed.27,28 Such models deal with linguistic information and develop a synthesis process to obtain a global assessment (see Fig. 3).

Fig. 3. General steps for nuclear safeguards evaluation.

However, the ideal situation shown in Fig. 3, in which every expert assesses all indicators, is not common in nuclear safeguards, as has been pointed out previously. The real and common situation in this problem is closer to the one shown in Fig. 4, in which some indicators are not assessed by the experts and, therefore, missing values are included in the nuclear safeguards evaluation. The management of these missing values is currently a main issue.

Fig. 4. General steps for nuclear safeguards evaluation with missing values.

Some initial proposals to tackle this problem based on the "use as they are" strategy have already been made.21–23 However, in this paper we propose a new model to address this problem that consists of imputing the missing values by means of a Collaborative Filtering (CF) method following the scheme presented in Fig. 5: once the CF model has imputed the missing values, a method similar to the one presented in Fig. 3 is applied.

Fig. 5. Safeguards evaluation by means of collaborative filtering.

The imputed values computed by the CF model are approximations of the values that the expert would provide if he/she had adequate knowledge. Usually, CF models provide precision metrics that show the average errors of the computed values, but these metrics are not enough in the imputation process, because we need a trust value rather than an error value to figure out the reliability of the nuclear safeguards results obtained with the imputed values. Figure 5 shows that the global assessment is provided together with a trust value measuring its reliability.

4.1. An imputation process based on collaborative filtering

To facilitate the understanding of the process, we describe the objects concerned in the nuclear safeguards framework. Let E = {e1, ..., em} be a set of IAEA experts that assess a set of indicators I = {ind1, ..., indn}. Given that we deal with a linguistic framework, each expert, ei, has to assess the indicators according to his/her knowledge and expertise, providing a vector of linguistic assessments, Ni = {xij}, xij = sk ∈ S = {s0, ..., sg}, i ∈ {1, ..., m}, j ∈ {1, ..., n}. The existence of missing values means that some assessments, xij, are not provided by ei, for the reasons pointed out previously. Such missing values can seriously bias the nuclear safeguards results, leading to wrong decisions. Analyzing the type of datasets used in nuclear safeguards (experts, indicators) and the objective we pursue of imputing missing values, this problem can be addressed by a CF process based on a K-nn scheme,13 in a similar way to collaborative recommender systems,14 where, starting from a dataset of users and items, a probable preference value of an item for a user is computed according to the similarity to other items or users. This proposal consists of the following phases:

• Grouping nuclear safeguards indicators: a K-nn algorithm groups the indicators by their similarity.
• Estimating missing values: an estimation process based on the previous groups is carried out to obtain the linguistic values that will be imputed.

Both phases are further detailed in the coming subsections.
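The grouping phase can be sketched as plain top-K neighbor selection over any similarity measure (a generic K-nn sketch under our own names; the concrete similarity measures used in the paper are given in Sec. 4.1.1):

```python
def k_nearest(target, candidates, similarity, K):
    """Rank `candidates` by similarity to `target` (higher = closer)
    and keep the K most similar ones as the neighborhood."""
    ranked = sorted(candidates, key=lambda c: similarity(target, c), reverse=True)
    return ranked[:K]

# toy usage: indicators as numeric assessment vectors,
# similarity = negative squared distance (illustrative only)
vectors = {"ind1": [4, 2, 4, 6], "ind2": [1, 1, 0, 2], "ind3": [3, 2, 3, 6]}
sim = lambda t, name: -sum((a - b) ** 2 for a, b in zip(t, vectors[name]))
print(k_nearest([3, 2, 3, 6], list(vectors), sim, K=2))  # -> ['ind3', 'ind1']
```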


Fig. 6. A Collaborative Filtering Process scheme.

4.1.1. Grouping indicators

To elaborate imputations, the CF process first groups the necessary elements; in the nuclear safeguards framework it groups indicators (see Fig. 6). There exist many methods to group elements based on neighbor selection, such as k-nearest-neighbor selection,13 threshold-based neighbor selection33 and clustering-based grouping.39 Neighborhood formation based on K-nn is the most common scheme because of its accuracy and robustness for both memory-based and model-based approaches.1,10,37 This method groups the concerned objects according to their similarity. Previous studies show that neighbors with higher similarity are more valuable for future estimations. The K-nn scheme has been successfully applied to different domains,4,13,14 but clearly there is no optimum configuration for every dataset. Hence, in each case a previous study should be carried out to adjust the different parameters and measures that provide useful groups as estimators. In Sec. 5 we show the study carried out for our proposal. The groups are computed according to similarity, so a similarity measure must be chosen. In Ref. 13, two similarity measures were introduced as the most used in the CF field. For the nuclear safeguards framework, we have modified the original definitions of the measures to deal with linguistic information, by using the linguistic 2-tuple representation model15 and introducing the function ∆−1.

• Cosine similarity measure:

w(l, m) = \cos(x_l, x_m) = \frac{\sum_k \Delta^{-1}(x_l^k)\, \Delta^{-1}(x_m^k)}{\sqrt{\sum_k \Delta^{-1}(x_l^k)^2}\, \sqrt{\sum_k \Delta^{-1}(x_m^k)^2}}   (1)

where l and m are indicators and x_l^k is the assessment provided by expert e_k to the indicator l.

• Pearson correlation coefficient:

w(l, m) = \frac{\sum_k \big(\Delta^{-1}(x_l^k) - \Delta^{-1}(\bar{x}_l)\big)\big(\Delta^{-1}(x_m^k) - \Delta^{-1}(\bar{x}_m)\big)}{\sqrt{\sum_k \big(\Delta^{-1}(x_l^k) - \Delta^{-1}(\bar{x}_l)\big)^2}\, \sqrt{\sum_k \big(\Delta^{-1}(x_m^k) - \Delta^{-1}(\bar{x}_m)\big)^2}}   (2)
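Eqs. (1) and (2) can be sketched over the ∆⁻¹-transformed assessments, skipping experts with missing values in either indicator, as done in the worked example of Sec. 5 (the function names and the treatment of `None` as a missing value are our own conventions):

```python
import math

def cosine_sim(xs_l, xs_m):
    """Eq. (1): cosine similarity between two indicators, computed over
    the Delta^{-1} values of experts who assessed both (None = missing)."""
    pairs = [(a, b) for a, b in zip(xs_l, xs_m) if a is not None and b is not None]
    num = sum(a * b for a, b in pairs)
    den = (math.sqrt(sum(a * a for a, _ in pairs))
           * math.sqrt(sum(b * b for _, b in pairs)))
    return num / den

def pearson_sim(xs_l, xs_m):
    """Eq. (2): Pearson correlation over the co-assessed Delta^{-1} values."""
    pairs = [(a, b) for a, b in zip(xs_l, xs_m) if a is not None and b is not None]
    ma = sum(a for a, _ in pairs) / len(pairs)
    mb = sum(b for _, b in pairs) / len(pairs)
    num = sum((a - ma) * (b - mb) for a, b in pairs)
    den = (math.sqrt(sum((a - ma) ** 2 for a, _ in pairs))
           * math.sqrt(sum((b - mb) ** 2 for _, b in pairs)))
    return num / den

# indicator 1 vs indicator 5 of Table 4 (h, l, h, p vs ?, l, m, p under Delta^{-1})
print(round(cosine_sim([4, 2, 4, 6], [None, 2, 3, 6]), 3))  # -> 0.993
print(round(pearson_sim([1, 2, 3], [3, 2, 1]), 3))          # -> -1.0
```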


where x_l^k is the assessment provided by expert e_k to the indicator l, and \bar{x}_l is the average assessment of the indicator l.

4.1.2. Estimating values for imputation

Once the neighborhoods have been computed, the CF model estimates a plausible value for each indicator that has not been assessed yet, i.e., for each missing value. The estimations to be imputed are computed by a weighted aggregation process over the K nearest neighbors selected previously. As with the similarity measures, different possibilities exist.13 In this case we have studied two of them:

• Weighted sum:

x_l^e = \Delta\left( \frac{\sum_{m=1}^{k} w(l, m)\, \Delta^{-1}(x_m^e)}{\sum_{m=1}^{k} w(l, m)} \right)   (3)

• Item average + adjustment:

x_l^e = \Delta\left( \frac{\sum_{m=1}^{k} w(l, m)\, \Delta^{-1}(x_m^e) - \bar{x}^e}{\sum_{m=1}^{k} |w(l, m)|} \right)   (4)

where l is an indicator, e ∈ E is the expert who has not provided his/her assessment, and k is the number of indicators selected to compute the value x_l^e.
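The weighted-sum estimator of Eq. (3) can be sketched as follows; the result β would then be mapped to a 2-tuple by the ∆ function of Sec. 3.1.2 (the function name and toy data are ours):

```python
def weighted_sum_estimate(neighbors):
    """Eq. (3): similarity-weighted average of the neighbors' Delta^{-1}
    values, where `neighbors` is a list of (similarity, assessment) pairs.
    Returns beta in [0, g]; Delta(beta) then yields the imputed 2-tuple."""
    num = sum(w * v for w, v in neighbors)
    den = sum(w for w, _ in neighbors)
    return num / den

# toy usage: three neighbors with their similarities and Delta^{-1} assessments
print(round(weighted_sum_estimate([(0.99, 4), (0.95, 3), (0.9, 4)]), 3))  # -> 3.665
```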

Because we are dealing with linguistic information, we use the functions ∆ and ∆−1 to manage it.

4.2. Trustworthiness of the imputed values

Collaborative filtering methods provide precision metrics, such as the MAE (Mean Absolute Error)14 or ROC,25 to compute the error committed in the estimations, but these metrics are not sufficient for the imputed values in nuclear safeguards, because they only indicate average errors. Instead, we need a value that indicates the trustworthiness of each imputed value, so that the reliability of the nuclear safeguards results can be computed when the imputed values are used. To this end, we define a measure that provides the trustworthiness of the estimated values obtained by the CF process. This measure is based on the case study carried out for this proposal (see Sec. 5) and on others, such as the case study for academic orientation.4 The trust measure is given by:

T(x_l^i) = (1 − \overline{sim}_l)\, h + \overline{sim}_l\, \frac{k}{K}, \quad T(x_l^i) ∈ [0, 1]   (5)

h = \frac{g − sd(x_l^i)}{g}

where \overline{sim}_l is the arithmetic mean of the similarities between the indicator l and its k nearest neighbors, and h indicates the homogeneity of the assessments x_l^i.
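Eq. (5) can be sketched directly; plugging in the values from the worked example of Sec. 5 (sd = 1.45, mean similarity 0.968, k = 13, K = 15, g = 6) reproduces the reported trust of 0.863 (the function name is ours):

```python
def trust(sim_mean, sd, k, K, g=6):
    """Eq. (5): trust of an imputed value, combining the mean neighbor
    similarity, the homogeneity h of the neighbors' assessments, and the
    ratio k/K of neighbors actually used out of the K selected."""
    h = (g - sd) / g                      # homogeneity of the assessments
    return (1 - sim_mean) * h + sim_mean * (k / K)

print(round(trust(0.968, 1.45, 13, 15), 3))  # -> 0.863
```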

Here sd is the standard deviation of the assessments used to compute the imputed value, g + 1 is the granularity of S, and k is the actual number of neighbors involved in the computation of x_l^i out of the initial K selected by the K-nn algorithm. The rationale for this definition is that, in the aforementioned case studies, we have observed that the more of the K neighbors' assessments are used to estimate the imputed value, the more trustworthy it is; similarly, the more homogeneous the assessments are, the more reliable the imputed values are. Hence, the greater T(x_l^i), the more reliable the value. Importantly, this trust value will be used across the nuclear safeguards process (see Fig. 5), carrying out a synthesis and aggregation process similar to the one applied to the indicators, in order to obtain a reliability value for the nuclear safeguards results that takes the imputed missing values into account.

5. An Illustrative Example

In this section we illustrate the proposed imputation process on a reduced nuclear safeguards dataset that has also been used in other nuclear safeguards models dealing with missing values.21,22 We show the performance of the imputation process and the trustworthiness of the imputed values, keeping in mind that there is no optimum CF model based on the K-nn scheme for every problem and every dataset. First, we present the case study carried out to adjust the CF model, by means of a memory-based K-nn scheme, to the problem; afterwards, we present the imputation process in nuclear safeguards evaluation.

5.1. Case study to adjust K-nn parameters for the nuclear safeguards dataset

The dataset used in this case study is a reduced nuclear safeguards dataset that contains just 22 indicators out of the 914 used in the whole process17 (see Table 1).

Table 1. Description of the dataset.

  Number of experts:             4
  Number of indicators:          22
  Total number of assessments:   88
  Linguistic domain:             {n, vl, l, m, h, vh, p}

This case study aims to optimize the different parameters of the CF model based on a K-nn scheme4,22 in order to obtain the best imputed values. Given the reduced dataset, we can use a memory-based approach, because there are no scalability problems at this stage. Such parameters are:


• K-neighborhood size: the number of neighbors used to estimate the missing values of the indicators that have not been assessed.
• Similarity measure: the measure used to group the objects.
• Estimation method: the method that estimates the missing values to be imputed.

This survey has been run 500 times for each studied configuration (Conf_i), as described in Table 2:

Table 2. Configurations.

  Cosine and Weighted sum:    K = 15 → Conf1    K = 10 → Conf2    K = 5 → Conf3
  Cosine and Item average:    All-1  → Conf4    n = 8  → Conf5    n = 4 → Conf6
  Pearson and Weighted sum:   K = 15 → Conf7    K = 10 → Conf8    K = 5 → Conf9
  Pearson and Item average:   All-1  → Conf10   n = 8  → Conf11   n = 4 → Conf12

Table 3. Experimental results.

  Num. missing values   Conf1   Conf2   Conf3   Conf4   Conf5   Conf6
  10%                   1.173   1.237   1.307   1.214   1.228   1.296
  20%                   1.231   1.272   1.375   1.275   1.301   1.353

  Num. missing values   Conf7   Conf8   Conf9   Conf10  Conf11  Conf12
  10%                   1.488   1.499   1.585   1.498   1.514   1.576
  20%                   1.749   1.800   1.870   1.782   1.814   1.892

where K is the number of nearest neighbors selected, used with the Weighted sum; with the Item average we have used All-1 (all assessments except the missing value) and a given n, which is usually a multiple of two. To choose the best configuration for the proposal, we have used the MAE14 precision metric. The experimental results obtained are shown in Table 3. For each configuration we have used rates of 10% and 20% of missing values; the best results are those of Conf1. Therefore, the configuration used for the imputation process of the illustrative example is:

• Number of nearest neighbors: K = 15
• Similarity measure: cosine distance (see Eq. (1))
• Estimation method: weighted sum (see Eq. (3))
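The MAE metric used to rank the configurations can be sketched as follows, over imputed and withheld true assessments expressed on the numeric [0, g] scale via ∆⁻¹ (the function name and toy data are ours):

```python
def mae(estimated, actual):
    """Mean Absolute Error between imputed values and the withheld
    true assessments, both on the numeric [0, g] scale via Delta^{-1}."""
    return sum(abs(e - a) for e, a in zip(estimated, actual)) / len(actual)

# toy usage: absolute errors 0.46, 0.0 and 1.1 average to 0.52
print(round(mae([3.54, 2.0, 5.1], [4, 2, 4]), 2))  # -> 0.52
```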


5.2. Imputation process based on CF for nuclear safeguards: an illustrative example

Here we show the performance of the imputation process in nuclear safeguards evaluation. To do so, we use the dataset in Table 4, where there are four experts and 22 indicators that are assessed in the linguistic term set S (see Fig. 2). Additionally, there are 10 missing values.

Table 4. Experts' evaluations.

  ind.   1    2    3    4    5    6    7    8    9    10   11
  p1     h    p    vh   m    ?    h    m    p    m    l    ?
  p2     l    vh   ?    m    l    m    l    vl   l    l    m
  p3     h    h    p    vh   m    vh   vh   h    ?    m    vh
  p4     p    p    p    ?    p    m    vh   vh   h    h    m

  ind.   12   13   14   15   16   17   18   19   20   21   22
  p1     vl   p    vh   ?    l    h    m    m    p    h    m
  p2     m    p    m    l    l    m    m    ?    m    m    vl
  p3     l    p    p    vl   l    p    m    vh   p    vh   ?
  p4     ?    vh   h    l    m    ?    l    vh   h    p    vl

Using the configuration (Conf1) obtained in the case study of Sec. 5.1, we obtain the imputed values and their corresponding trust values in Table 5.

Table 5. Predictions and trust measures.

  ind.   e1 pred.    trust   e2 pred.    trust   e3 pred.     trust   e4 pred.     trust
  3                          (m,-.154)   .866
  4                                                                   (h,.313)     .866
  5      (h,-.46)    .863
  9                                              (vh,-.361)   .931
  11     (h,-.132)   .93
  12                                                                  (h,.242)     .796
  15     (m,.493)    .798
  17                                                                  (vh,-.459)   .866
  19                         (vh,-.4)    .858
  22                                              (m,-.16)    .866

To clarify the computation of the imputed values and their trustworthiness, we show in further detail the computations for the missing value, x_5^1, corresponding to expert e1 and indicator ind5.

• First, the similarity between ind5 and the remaining indicators is computed (see Table 6), using the cosine distance and keeping the K = 15 nearest indicators

Table 6.

Similarities between indicator 5 and the other 21 indicators:

w(5,1)  = 0.993     w(5,12) = 0.923
w(5,2)  = 0.944     w(5,13) = 0.87
w(5,3)  = 0.949     w(5,14) = 0.878
w(5,4)  = 0.999     w(5,15) = 0.905
w(5,6)  = 0.85      w(5,16) = 0.97
w(5,7)  = 0.953     w(5,17) = 0.992
w(5,8)  = 0.97      w(5,18) = 0.822
w(5,9)  = 0.91      w(5,19) = 0.949
w(5,10) = 0.982     w(5,20) = 0.878
w(5,11) = 0.85      w(5,21) = 0.973
                    w(5,22) = 0.894

Table 7. The 15 nearest neighbors.

sim(5,5)  = 1        sim(5,8)  = 0.97
sim(5,4)  = 0.999    sim(5,7)  = 0.953
sim(5,1)  = 0.993    sim(5,3)  = 0.949
sim(5,17) = 0.992    sim(5,19) = 0.949
sim(5,9)  = 0.99     sim(5,2)  = 0.944
sim(5,10) = 0.982    sim(5,12) = 0.923
sim(5,21) = 0.973    sim(5,15) = 0.905
sim(5,16) = 0.97

(see Table 7):

w(5,1) = ((4 ∗ null) + (2 ∗ 2) + (4 ∗ 3) + (6 ∗ 6)) / (√(null + 4 + 16 + 36) ∗ √(null + 4 + 9 + 36)) = (4 + 12 + 36) / (√56 ∗ √49) = 0.993

• The estimation of the imputed value is then computed by the weighted sum:

xil = x15 = ∆(3.54) = (high, −.46)
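The two computations above can be sketched as follows. This is an illustrative sketch, assuming the 0–6 numeric scale behind S (l=2, m=3, h=4, p=6), the label names of Fig. 2, and that the missing value is excluded from both the dot product and the norms, as in the worked equation:

```python
import math

# Expert assessments of indicators 1 and 5 (experts p1..p4), 0-6 scale;
# None is the missing value x15 of expert e1 on indicator 5.
ind1 = [4, 2, 4, 6]     # h, l, h, p
ind5 = [None, 2, 3, 6]  # ?, l, m, p

def cosine_sim(a, b):
    """Cosine similarity over the positions assessed in both vectors."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    dot = sum(x * y for x, y in pairs)
    norm_a = math.sqrt(sum(x * x for x, _ in pairs))
    norm_b = math.sqrt(sum(y * y for _, y in pairs))
    return dot / (norm_a * norm_b)

def delta(beta):
    """2-tuple representation: beta -> (closest label, symbolic translation)."""
    labels = ('none', 'very low', 'low', 'medium', 'high', 'very high', 'perfect')
    i = round(beta)
    return labels[i], round(beta - i, 2)

print(round(cosine_sim(ind5, ind1), 3))  # → 0.993
print(delta(3.54))                       # → ('high', -0.46)
```

The first print reproduces w(5,1) above, and the second the 2-tuple obtained from the weighted sum of the neighbors.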

• Once the imputed value has been estimated, the model computes its trustworthiness according to Eq. (5). To apply such a function it is necessary to compute the following values: sd = 1.45, g = 6, h = 0.759, sim5 = 0.968 and k = 13. Thus, the trustworthiness of the imputed value x15 is:

T(x15) = (1 − 0.968) ∗ 0.759 + 0.968 ∗ 0.867 = 0.863

According to the model, the imputed value for the missing value of indicator 5 provided by expert 1 is (high, −.46), with a trustworthiness of 0.863.

After having shown the performance of the imputation model, we impute all the missing values. The safeguards process presented in Fig. 5 is then carried out by a multi-step aggregation process to obtain an overall nuclear safeguards assessment; additionally, in this case we also compute its trustworthiness. Different aggregation operators can be used in this synthesis, but for the sake of simplicity, we applied an average operator, obtaining the following results:


• Nuclear safeguards overall assessment: (high, −.23)
• Trustworthiness: 0.864

At this point, we remark that the nuclear safeguards model presented in Fig. 4 without the imputation process obtains the following result:

• Nuclear safeguards overall assessment: (medium, 0.33)

We can then observe the relevance of a correct treatment of the missing values in nuclear safeguards, because their management leads to different results.

6. Conclusions

Nuclear safeguards evaluation is a complex problem dealing with vague and uncertain information. Both the uncertainty and the inaccuracy of the sources often lead experts to leave missing values in their evaluations. A proper treatment of these missing values improves the final results. In this contribution, we presented a collaborative filtering imputation process for imputing missing values in nuclear safeguards. Together with the imputation process, we also introduced a trust measure of the imputed values to compute the reliability of the results. In further research on this problem, we will investigate a hybrid model for dealing with missing values such that a missing value is imputed when its trustworthiness is high enough, while those with a low trustworthiness are treated as they are.22

Acknowledgements

This work is partially supported by the Research Projects TIN-2009-08286 and P08-TIC-3548 and FEDER funds.

References

1. G. Adomavicius and A. Tuzhilin, Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions, IEEE Trans. Knowledge and Data Engineering 17(6) (2005) 734–749.
2. B. Arfi, Fuzzy decision making in politics: A linguistic fuzzy-set approach (LFSA), Political Analysis 13(1) (2005) 23–56.
3. P. P. Bonissone and K. S. Decker, Selecting uncertainty calculi and granularity: An experiment in trading-off precision and complexity, in L. H. Kanal and J. F. Lemmer (eds.), Uncertainty in Artificial Intelligence (North-Holland, 1986).
4. E. J. Castellano and L. Martínez, A web-decision support system based on collaborative filtering for academic orientation. Case study of the Spanish secondary school, J. Universal Computer Science 15(14) (2009) 2786–2807.
5. S. L. Chang, R. C. Wang and S. Y. Wang, Applying a direct multi-granularity linguistic and strategy-oriented aggregation approach on the assessment of supply performance, European J. Operational Research 117(2) (2007) 1013–1025.
6. C. T. Chen, Applying linguistic decision-making method to deal with service quality evaluation problems, Int. J. Uncertainty, Fuzziness and Knowledge-Based Systems 9(Suppl.) (2001) 103–114.


7. C. H. Cheng and Y. Lin, Evaluating the best main battle tank using fuzzy decision theory with linguistic criteria evaluation, European J. Operational Research 142 (2002) 174–186.
8. T. M. Cover and P. E. Hart, Nearest neighbor pattern classification, IEEE Trans. Information Theory 13(1) (1967) 21–27.
9. R. Degani and G. Bortolan, The problem of linguistic approximation in clinical decision making, Int. J. Approximate Reasoning 2 (1988) 143–162.
10. M. Deshpande and G. Karypis, Item-based top-N recommendation algorithms, ACM Trans. Information Systems 22(1) (2004) 143–177.
11. D. Dubois and H. Prade, Incomplete conjunctive information, Computers and Mathematics with Applications 15(10) (1988) 797–810.
12. J. W. Grzymala-Busse, Three approaches to missing attribute values: A rough set perspective, in Proc. 4th Int. ISKE Conf. Intelligent Decision Making Systems, Hasselt, Belgium (World Scientific, 2009), pp. 153–158.
13. J. L. Herlocker, J. A. Konstan and J. Riedl, An empirical analysis of design choices in neighborhood-based collaborative filtering algorithms, Information Retrieval 5(4) (2002) 287–310.
14. J. L. Herlocker, J. A. Konstan, L. G. Terveen and J. T. Riedl, Evaluating collaborative filtering recommender systems, ACM Trans. Information Systems 22 (2004) 5–53.
15. F. Herrera and L. Martínez, A 2-tuple fuzzy linguistic representation model for computing with words, IEEE Trans. Fuzzy Systems 8(6) (2000) 746–752.
16. E. Herrera-Viedma, F. Chiclana, F. Herrera and S. Alonso, Group decision-making model with incomplete fuzzy preference relations based on additive consistency, IEEE Trans. Systems, Man and Cybernetics, Part B: Cybernetics 37(1) (2007) 176–189.
17. IAEA, Physical model, Int. Atomic Energy Agency, Rep. STR-314, Vienna, 1999.
18. IAEA, Nuclear security and safeguards, IAEA Bulletin, Annual Report, Vol. 43, 2001.
19. N. Jiang and L. Gruenwald, Estimating missing data in data streams, in R. Kotagiri, P. R. Krishna, M. Mohania and E. Nantajeewarawat (eds.), Advances in Databases: Concepts, Systems and Applications, Vol. 4443/2008, 2007, pp. 981–987.
20. A. Jiménez and A. Mateos, Two approaches to deal with missing performances within MAUT, in Proc. 4th Int. ISKE Conf. Intelligent Decision Making Systems, Hasselt, Belgium (World Scientific, 2009), pp. 203–208.
21. Ö. Kabak and D. Ruan, Dealing with missing values in nuclear safeguards evaluation, in Proc. 4th Int. ISKE Conf. Intelligent Decision Making Systems, Hasselt, Belgium (World Scientific, 2009), pp. 145–152.
22. Ö. Kabak and D. Ruan, A cumulative belief degree-based approach for missing values in nuclear safeguards evaluation, IEEE Trans. Knowledge and Data Engineering (2010), doi:10.1109/TKDE.2010.60.
23. Ö. Kabak and D. Ruan, A cumulative belief-degree approach for nuclear safeguards evaluation, in Proc. 2009 IEEE Int. Conf. Systems, Man and Cybernetics, San Antonio, TX, USA, 2009, pp. 2285–2290.
24. Y.-L. Kuo and C.-H. Yeh, Evaluating passenger services of Asia-Pacific international airports, Transportation Research Part E: Logistics and Transportation Review 39(1) (2003) 35–48.
25. T. C. W. Landgrebe and R. P. W. Duin, Efficient multiclass ROC approximation by decomposition via confusion matrix perturbation analysis, IEEE Trans. Pattern Analysis and Machine Intelligence 30(5) (2008) 810–822.


26. R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing Data (John Wiley, New York, 1987).
27. J. Liu, D. Ruan and R. Carchon, Synthesis and evaluation analysis of the indicator information in nuclear safeguards applications by computing with words, Int. J. Appl. Math. Comput. Sci. 12(3) (2002) 449–462.
28. J. Liu, D. Ruan, H. Wang and L. Martínez, Improving nuclear safeguards evaluation through enhanced belief rule-based inference methodology, Int. J. Nuclear Knowledge Management 3(3) (2009) 312–339.
29. L. Martínez, Sensory evaluation based on linguistic decision analysis, Int. J. Approximate Reasoning 44(2) (2007) 148–164.
30. L. Martínez, M. J. Barranco, L. G. Pérez and M. Espinilla, A knowledge based recommender system with multigranular linguistic information, Int. J. Computational Intelligence Systems 1(3) (2008) 225–236.
31. L. Maschio, A decision support system for safeguards information analysis, Int. J. Nuclear Knowledge Management 2(4) (2007) 410–421.
32. L. B. Oltman and S. B. Yahia, Yet another approach for completing missing values, in S. B. Yahia, E. M. Nguifo and R. Belohlavek (eds.), Concept Lattices and Their Applications, Vol. 4923/2008, 2008, pp. 155–169.
33. M. P. O'Mahony, N. J. Hurley and G. C. M. Silvestre, An evaluation of neighbourhood formation on the performance of collaborative filtering, Artificial Intelligence Review 21(3–4) (2004) 215–228.
34. M. Pawlak, Kernel classification rules from missing data, IEEE Trans. Information Theory 39(3) (1993) 979–988.
35. D. Ruan, J. Liu and R. Carchon, Linguistic assessment approach for managing nuclear safeguards indicator information, Int. J. Logistics Information Management 16(6) (2003) 401–419.
36. P. J. Sánchez, L. Martínez, C. García, F. Herrera and E. Herrera-Viedma, A fuzzy model to evaluate the suitability of installing an ERP system, Information Sciences 179(14) (2009) 2333–2341.
37. B. Sarwar, G. Karypis, J. Konstan and J. Riedl, Item-based collaborative filtering recommendation algorithms, in WWW '01: Proc. 10th Int. Conf. World Wide Web, New York, USA (ACM, 2001), pp. 285–295.
38. J. Siddique and T. R. Belin, Using an approximate Bayesian bootstrap to multiply impute nonignorable missing data, Computational Statistics and Data Analysis 53(2) (2008) 405–415.
39. L. H. Ungar and D. P. Foster, Clustering methods for collaborative filtering, in Proc. Workshop on Recommendation Systems (AAAI Press, 1998).
40. J. Wang and J. Hao, A new version of 2-tuple fuzzy linguistic representation model for computing with words, IEEE Trans. Fuzzy Systems 14 (2006) 435–445.
41. Z. S. Xu, A method based on linguistic aggregation operators for group decision making with linguistic preference relations, Information Sciences 166 (2004) 19–30.
42. R. R. Yager, Families of OWA operators, Fuzzy Sets and Systems 59(2) (1993) 125–148.
43. R. R. Yager, An approach to ordinal decision making, Int. J. Approximate Reasoning 12 (1995) 237–261.
44. R. R. Yager, Fusion of ordinal information using weighted median aggregation, Int. J. Approximate Reasoning 18 (1998) 32–35.
45. R. R. Yager, Approximate reasoning as a basis for computing with words, in Computing with Words in Information/Intelligent Systems 2: Applications (Physica-Verlag, 1999), pp. 50–77.


46. L. A. Zadeh, The concept of a linguistic variable and its applications to approximate reasoning, Parts I, II and III, Information Sciences 8 (1975) 199–249; 8 (1975) 301–357; 9 (1975) 43–80.
47. S. Zhang, Y. Qin, X. Zhu, J. Zhang and C. Zhang, Optimized parameters for missing data imputation, in Q. Yang and G. Webb (eds.), PRICAI 2006: Trends in Artificial Intelligence, Vol. 4099/2006, 2006, pp. 1010–1016.