Using Visualization for Information Management Tasks

Christiaan Fluit, Jeroen Wester

Aidministrator Nederland BV, Amersfoort http://www.aidministrator.nl/

Abstract

Taxonomies are a powerful modelling tool when building interfaces for disclosing large information repositories. However, their actual use is far from trivial; tasks such as creation, instantiation and maintenance of taxonomies are often difficult and time-consuming. We present a number of ways in which the Cluster Map, a component for the visualization of instantiated taxonomies, can help in these tasks. Using the proposed visualizations, a user gains insight into the information, detects anomalies, monitors the information as it evolves over time and assesses the quality of the output of automatic classification tools. The proposed visualizations are presented in the context of one of our customers, for whom we create web portals based on taxonomies, providing access to a large document collection.

1. Introduction

The sheer size and explosive growth of information repositories in recent years have prompted organizations to reconsider how efficient and effective access to their information can be guaranteed. One way to approach this problem is to provide end users with personalized information presentations, taking into account their task, vocabulary, perspective, etc. Portals are a common and easy to use mechanism for realizing such personalized information presentations [7] [9]. Often, the content of these portals is based on the repository’s metadata, which contains descriptive and contextual information about the items in the repository. Taxonomies are a particular kind of metadata, capturing the characteristics of a domain in a hierarchy of classes (also referred to as concepts). Their use implies a number of non-trivial tasks, e.g. taxonomy creation, classification of items and maintenance of the taxonomy and its instantiation. These tasks require overview of and insight into the domain and the data involved.

Information visualizations are good candidates for assisting users with acquiring this overview and insight, harnessing human perceptual capabilities to detect patterns and outliers in visual information. This paper explores the use of visualization in a taxonomy-based environment. Spectacle [3], one of Aidministrator’s core products, is a platform for constructing information presentations, such as portals, based on taxonomies. It has been used for disclosing large information repositories, e.g. document collections. Assuming that a taxonomy already exists or can be derived from existing information structures, the following information management tasks occur frequently when applying Spectacle (see Figure 1):

A. Acquiring insight into an information repository.
B. Monitoring the repository as it evolves over time.
C. Populating the repository using automatic, heuristic classification techniques.

Figure 1: Three information management tasks.

We will describe these information management tasks in the context of one of our customers. This customer uses Spectacle for providing portals to various kinds of end users, through which they can access a large document repository. We will show how these tasks can benefit from our visualization, the Spectacle Cluster Map [4] [6], enabling the information managers to offer higher-quality portals. Note that we position the proposed visualizations as an aid for information providers, enabling them to optimise their information dissemination, rather than as an alternative means for information consumers to access the information.

This work builds on earlier proposed usage of our technology presented at IV2001 [6], and reports on its use in the context of a specific case. The visualisations described in this paper were made together with and judged by domain experts in the customer’s organization. The goal was to decide on ways to support the tasks of information managers.

This paper is organized as follows: Sections 2 and 3 describe the customer’s domain and the Cluster Map visualization respectively. Section 4 elaborates on the information management tasks and describes how they can benefit from visualization in general. Section 5 proposes several Cluster Map visualizations that address these tasks, illustrated with examples on the customer’s data. Finally, Section 6 summarizes our findings so far.

Proceedings of the Sixth International Conference on Information Visualisation (IV’02) 1093-9547/02 $17.00 © 2002 IEEE

2. The Application Domain

One of Aidministrator’s customers is Bouwradius, a knowledge centre in the building and construction sector in the Netherlands. One of Bouwradius’ tasks is to act as an “information broker”: documents from various building- and construction-related sources are gathered, classified and distributed among various information consumers within and outside Bouwradius. Bouwradius uses Spectacle for providing convenient access to the documents in their repository. Generated portals deliver relevant documents to information consumers and give them a natural and intuitive perspective on those documents, based on their user profile.

The taxonomy used in this domain consists of a large set of keywords. Each document is labelled with one or more keywords. Therefore, the classes in the taxonomy can overlap (i.e. share instances). This is an important taxonomy characteristic, not only for Bouwradius but also for Spectacle applications in general, since it allows users to find the same information in multiple ways. The classes in the taxonomy are used for defining the structure and content of each portal. Currently, domain experts from Bouwradius label the documents manually. In the near future this will be replaced by an automatic classification system based on machine learning techniques.

This Spectacle application has two kinds of users:

• The information consumers receive relevant and structured content in their personalized portal, optimised for their task and reflecting their views.
• The information managers are Bouwradius employees responsible for filtering and structuring the information, maintaining the set of user profiles that the information consumers may choose from, and in time also the automatic document classification.

In order to perform their tasks, information managers need to know what information is available and how (well) it is classified. Note that they are domain experts, not experts in document management and automatic classification technologies.

3. The Spectacle Cluster Map

We will briefly introduce the basics of the Cluster Map; see [4] and [6] for a more elaborate introduction. The Cluster Map visualizes the instances of a number of selected classes, using these classes as its main organization principle. Figure 2 shows a Cluster Map, displaying the instances of three classes. The large spheres represent classes and are labelled with the class name and cardinality. The smaller spheres represent instances, in this case documents.

Figure 2: An example Cluster Map.

Instances that have the same set of class memberships are grouped together in a cluster. Each cluster connects to the classes it represents through a balloon-shaped edge. When a subclass relation holds between two classes (not used in this example), they are connected by a directed edge. Clusters are then only connected to their most specific classes.

Cluster Maps contain a lot of information about the instantiation of the classes, specifically exploiting the overlaps between them. For example, Figure 2 shows that the Insulation class has more overlap with Materials than with Tools. Such observations can trigger hypotheses about the available information and the domain in general. The graph layout is calculated such that the geometric closeness of objects indicates their semantic closeness: classes that share instances are located near each other, and so are instances with the same or similar class memberships.

The Cluster Map comes with a highly interactive, animated GUI, designed for browsing-oriented exploration of the taxonomy. Through interaction a user can also retrieve information about the specific documents that are contained in a class or cluster. In summary, Cluster Maps enable a user to acquire a high-level overview of and insight into the taxonomy structure and instantiation, and map very naturally onto the domain discussed above.
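The grouping rule behind the clusters can be illustrated with a small sketch. This is not the Spectacle implementation: the function names and the sample documents are hypothetical, and only the rule itself (instances with the same set of class memberships form one cluster) is taken from the description above.

```python
from collections import defaultdict

def build_clusters(memberships):
    """Group instances that share exactly the same set of classes."""
    clusters = defaultdict(set)
    for instance, classes in memberships.items():
        clusters[frozenset(classes)].add(instance)
    return dict(clusters)

def class_overlap(memberships, class_a, class_b):
    """Count the instances that two classes have in common."""
    return sum(1 for classes in memberships.values()
               if class_a in classes and class_b in classes)

# Hypothetical documents labelled with the three class names of Figure 2.
memberships = {
    "doc1": {"Insulation", "Materials"},
    "doc2": {"Insulation", "Materials"},
    "doc3": {"Insulation", "Tools"},
    "doc4": {"Materials"},
    "doc5": {"Tools"},
}

clusters = build_clusters(memberships)
for classes, members in clusters.items():
    print(sorted(classes), sorted(members))
```

Each key of `clusters` corresponds to one cluster in the map, connected to every class in that key. Comparing `class_overlap` for two class pairs gives the kind of observation made about Insulation, Materials and Tools.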

4. Three Tasks of an Information Manager

In the introduction we have identified three frequently occurring information management tasks related to the use of taxonomies. In this section we will elaborate on these tasks, using Bouwradius as a context, and point out opportunities for the application of visualization technologies.

4.1. Acquiring Insight

Information managers require a global overview of and insight into the document collection, e.g. to see how documents are distributed over classes, how classes relate to each other through shared instances, etc. This knowledge enables them to determine how to present information to the consumers in an effective and meaningful way. Additionally, they can optimise the existing classification, e.g. by filtering noise in the document labelling or restructuring the existing taxonomies, resulting in higher-quality portals. Numerous possibilities for visualizing such information collections are described and compared in the information visualization literature, see e.g. [1] and [2].

4.2. Monitoring Information Repositories

Information managers want to know how the content of an information repository changes over time, so that they can adapt the taxonomy accordingly. For example, the taxonomy may need to provide more detail in an area with an increasing number of documents. This kind of overview can be achieved by periodically creating the visualizations proposed for the former task, or by extending them with an explicit time dimension. Furthermore, there is always the possibility of a conceptual change in the domain: the nature of the central topics may slowly change. This is also called semantic or conceptual drift. It is interesting to see whether visualization can be used as an early warning system for such subtle but important changes.

4.3. Measuring Quality of Automatic Classification

Bouwradius plans to use an automatic classification tool based on machine learning techniques in the near future, to replace the manual classification process (domain experts reading and labelling every document). The application of this tool consists of a number of steps:

1. A taxonomy is defined.
2. A set of documents is manually classified using the classes of this taxonomy.
3. This set is split into a training set and a test set.
4. The classification tool uses the training set to learn how to recognize these classes. This results in a classifier that can automatically classify a document with some degree of certainty.
5. The classifier is applied on the test set. The resulting classification is used to assess the classifier’s quality, e.g. in terms of number of misclassified documents, precision and recall, etc. A separate data set is used for this purpose to check the classifier’s generalizability.
6. Depending on this quality, one may wish to change certain parameters and repeat steps 4 and 5.

In the domain of Bouwradius, it seems natural to use the existing, manually labelled documents for creating training and test sets. Clearly, the quality of these data sets is very important, since it determines the quality the classifier can achieve. Section 4.1 already suggested that visualization can aid in discovering and fixing errors in the data.

Another area where visualization can help is in step 6 of the application procedure. Assessment of the quality of a classifier often comes down to interpreting statistics such as the ratio of correctly classified documents. Although useful as a first indication, these statistics give little qualitative information on the performance of the system. For example, a classifier may correctly classify certain classes but fail to differentiate between certain other classes. Visualization of the classifier’s output with respect to the desired output enables a user to acquire insight into these qualities, allowing him to make well-informed parameter changes. When the training and quality assessment is fast enough, this can even make interactive exploration of the parameter settings possible.

The generalization ability measured by the test set should also be monitored over time, since a conceptual drift in the domain may result in less accurate classifications. Therefore, this task overlaps with the task formulated in Section 4.2.

A beneficial side effect of using visual representations of the classifier’s quality is that they help the information managers, who cannot be expected to be experts in machine learning, to better comprehend these techniques. This increases their trust in and understanding of the system (even though it remains a “black box”), which in turn increases its acceptability for everyday usage.
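Steps 3 and 5 of the procedure can be sketched as follows. This is only an illustration under stated assumptions: the helper names, the 30% test fraction and the sample labels are hypothetical, and `predicted` stands in for the output of whatever classification tool is used. The sketch also shows why a single ratio can hide the per-class behaviour discussed above.

```python
import random

def split_documents(labelled, test_fraction=0.3, seed=0):
    """Shuffle document ids and split them into training and test ids."""
    ids = sorted(labelled)
    random.Random(seed).shuffle(ids)
    n_test = max(1, int(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]

def exact_match_ratio(truth, predicted, ids):
    """Share of documents whose predicted label set equals the manual one."""
    return sum(truth[d] == predicted[d] for d in ids) / len(ids)

def per_class_scores(truth, predicted, ids, cls):
    """Precision and recall for one class; None when undefined."""
    tp = sum(cls in truth[d] and cls in predicted[d] for d in ids)
    n_pred = sum(cls in predicted[d] for d in ids)
    n_true = sum(cls in truth[d] for d in ids)
    precision = tp / n_pred if n_pred else None
    recall = tp / n_true if n_true else None
    return precision, recall

# Hypothetical manual labels and hypothetical classifier output.
truth = {"d1": {"Beton"}, "d2": {"Beton", "Daken"},
         "d3": {"Daken"}, "d4": {"Energie"}}
predicted = {"d1": {"Beton"}, "d2": {"Beton"},
             "d3": {"Beton"}, "d4": {"Energie"}}

train_ids, test_ids = split_documents(truth)
ids = sorted(truth)  # evaluated on all four documents to keep the example small
print(exact_match_ratio(truth, predicted, ids))
for cls in ("Beton", "Daken", "Energie"):
    print(cls, per_class_scores(truth, predicted, ids, cls))
```

In this made-up data half of the documents are classified exactly right, yet the per-class scores reveal that Daken is never recognised while Energie is handled perfectly.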

4.4. Summary

The main problems in this context can be summarized as follows: the information consumer’s requirements are met by their personalized portals, but the information managers often lack the global overview needed to fulfill their tasks. Furthermore, the use of automatic classification will only help to automate the classification task, but overview and insight into the information are still needed to direct the classifier and ensure high-quality output. Visualization can significantly help to achieve this overview and insight, enabling information managers to produce higher-quality portals.

5. Using Cluster Maps for Information Management Tasks

We will now describe several ways in which the Cluster Map can be used for the formulated information management tasks, following the structure of Section 4, and illustrate them with examples on the Bouwradius data.

5.1. Acquiring Insight

The Cluster Map can immediately be applied on the set of classified documents for acquiring insight into the domain, since it was originally designed for such analysis tasks. In order to realize insight-providing visualizations, assistance from domain experts is needed to find relevant sets of classes to visualize, since the taxonomy is too large to visualize at once.

Several methods are conceivable for establishing useful visualizations. Usually one starts with visualizing collections of classes about similar topics to see how they relate at the instance level. At first, this can result in detecting flaws in the classification. For example, Figure 3 is a visualization of 5 classes concerning “beton” (concrete). Note that this map differs from previous maps in that it displays a cluster as a single visual entity instead of its individual objects, leading to better scalability. Also, the edges starting from the same class have the same colour, resulting in better interpretability.

From this visualization two observations can be made about the classification. The first is that synonymous terms in the taxonomy are not necessarily treated as synonymous at the instance level: Betonverwerken and Betonverwerking both refer to the same concept (concrete processing), but they share hardly any instances; only two documents have been labelled with both keywords. This effect occurs very often in the classification, in various forms, e.g. singular vs. plural words, abbreviations vs. full descriptions, nouns vs. verbs, etc. This is problematic because the document repository using this classification has no knowledge about thesaurical relations between keywords. Therefore, if a query to the document repository does not specify all applicable keywords, only a fragment of the relevant documents will be returned.

The second observation is that all classes at the border of Figure 3 are conceptually specializations of the class in the middle (Beton), i.e. in a richer taxonomy they would have been subclasses. Therefore, all their documents would ideally also be classified as being about Beton. However, there are several clusters of documents not classified as Beton, leading to the same problem as mentioned above.

Figure 3: Documents about "beton" (concrete).

It is not clear what has caused these flaws. It could be that the authors overlooked some keywords, since the taxonomy is rather large and unstructured, or that they mistakenly assumed that the document repository would know about redundancies in the taxonomy, etc. The next two examples provide more insight into the domain.


Figure 4: Documents about "daken" (roofs).

Figure 5: Documents about educational topics.

Figure 4 shows a number of classes related to “daken” (roofs). Figure 5 visualizes various educational classes. Although both images are rather complex for making very detailed observations, they do allow for one high-level observation: the classification provides much more detail for documents about education. Not only are there more classes about education than about roofs, the classes and clusters are also much smaller, indicating a higher distinctiveness of the classification. Domain experts from Bouwradius explain that the classification of technical subjects is more sound and stable than the classification of educational topics, and that the characteristics as observed in the educational classification are not desirable due to their complexity.

In the Bouwradius context, the taxonomy classes are used for defining presentation-oriented classes, reflecting the portal structures. Visualizations of these classes result in maps with a much more task- and user-oriented perspective. Figure 6 is an example of this visualization strategy. It shows how documents are used in the curriculum of bricklayers. Each class represents the documents relevant to a single module in the course. The number at the start of a class name indicates the module order. In order to reduce the visual clutter that results from visualising all classes at once, we have removed all clusters with fewer than 10 documents. This is justifiable since we only want to have a very global overview of the modules.

Figure 6: Bricklayer curriculum documents.


The visualisation helps to understand that some modules, like Metselwerk algemeen and Schoon- en vuilmetselwerk, are more central to the domain: they have a strong interrelationship with most other modules. This can be expected since these are the first two modules in the course. Other modules are more peripheral, like Vloeren and Stelwerk voor metselaars. Note that three subareas emerge at the top (courses 13 and 15), left (4, 5, 6, 7 and 9) and right side (12 and 14) of the map; these classes are closely interconnected and are mainly connected with the rest through courses 1 and 2.

The general layout and the formation of conceptual subareas in the visualisation met the expectations of the domain experts. They were concerned about the fact that some modules, such as courses 5 and 14, had few or no documents of their own (even when considering that small clusters have been removed).

There are numerous ways to obtain additional insightful visualizations, for example, by using derived classes such as in Figure 6, or by using additional information such as a document’s author, origin, etc. The domain experts are often able to extend this list, since they know a lot about the context of the taxonomy (e.g. how it is established, how it is used) that is not apparent from the taxonomy itself and that can be exploited to establish highly informative visualizations.

It may be interesting to see whether algorithms can be developed that support the user in the interactive exploration of the taxonomy, by suggesting interesting visualizations based on the taxonomy’s characteristics. For example, such tools could notify a user about sets of classes with a similar instantiation (e.g. a large intersection and a small symmetric difference).
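One possible reading of the suggested criterion is sketched below. It is a hypothetical illustration, not an existing Spectacle feature: the function name, the thresholds `min_shared` and `max_difference`, and the sample class extents are all assumptions.

```python
from itertools import combinations

def similar_class_pairs(extents, min_shared=2, max_difference=1):
    """Return class pairs with a large intersection and a small symmetric difference."""
    pairs = []
    for a, b in combinations(sorted(extents), 2):
        shared = len(extents[a] & extents[b])
        difference = len(extents[a] ^ extents[b])
        if shared >= min_shared and difference <= max_difference:
            pairs.append((a, b, shared, difference))
    return pairs

# Hypothetical class extents: class name -> set of document ids.
extents = {
    "ClassA": {1, 2, 3, 4},
    "ClassB": {1, 2, 3},
    "ClassC": {4, 5},
}

print(similar_class_pairs(extents))
```

Suitable threshold values would depend on the sizes of the classes involved. A relative measure, such as the intersection size divided by the union size, may be a more robust choice for a large taxonomy.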

5.2. Monitoring Information Repositories

Several options exist for tracking changes in an information repository over time. One approach is to periodically visualize the total document set, or the set of added documents, and see how the distribution of the information changes. This will not only show changes in volume but also changes in relationships between classes, possibly triggering a revision of the taxonomy.

Figure 7 shows a content drift in four consecutive years. Each graph shows all documents about energie and duurzaam bouwen, as well as a class representing all documents published in a specific year. We see that in 1997 (top left) there are no documents published on energie. The number of energie documents grows in the following years but decreases in 2000 (bottom right). The number of documents related to duurzaam bouwen is evenly distributed over the last three years. Instead of adding the year class to the visualisation, we could also use it to filter the documents displayed in the map. This map would be easier to interpret, but gives no indication of how the shown amounts of documents relate to the total number published in that year.
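The relation between the number of documents on a topic and the total number published in a year can also be computed directly. The sketch below assumes that each document carries a publication year and a set of keywords; the function name and the sample data are invented for illustration and do not reproduce the counts behind Figure 7.

```python
from collections import Counter

def yearly_share(documents, keyword):
    """For each year: (documents with keyword, all documents, share)."""
    total = Counter(year for year, _ in documents)
    tagged = Counter(year for year, keywords in documents if keyword in keywords)
    return {year: (tagged[year], total[year], tagged[year] / total[year])
            for year in sorted(total)}

# Hypothetical (year, keywords) pairs.
documents = [
    (1997, {"duurzaam bouwen"}),
    (1997, {"beton"}),
    (1998, {"energie"}),
    (1998, {"energie", "duurzaam bouwen"}),
    (1998, {"beton"}),
    (1998, {"daken"}),
]

print(yearly_share(documents, "energie"))
```

Reporting the share next to the absolute count helps to separate growth of a topic from growth of the repository as a whole.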

Figure 7: Changes over time for Energie and Duurzaam bouwen.

A problem with this visualization is that it is hard to differentiate changes in volume from conceptual changes, which can also result in growing or shrinking classes and clusters. A solution may lie in the fact that many classifiers do not inherently produce a binary classification, where a document either belongs to a class or does not. Instead they produce a weighted classification, where a document belongs to a class with some probability. A binary classification is then established by thresholding the weights. We suspect that conceptual changes in the domain are reflected in these weights (their average value will decrease). Furthermore, when using automatic classification, the visualization of the errors made on a manually labelled test set may be repeated periodically, in order to track if and how the generalization capabilities change.
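The thresholding step and the suspected drift indicator can be sketched as follows. The paper only states a suspicion that average weights decrease under conceptual change, so this is an untested illustration: the function names, the cutoff of 0.5 and the weights for the two periods are all assumptions.

```python
def threshold(weights, cutoff=0.5):
    """Turn weighted class memberships into a binary classification."""
    return {doc: {cls for cls, w in per_class.items() if w >= cutoff}
            for doc, per_class in weights.items()}

def mean_assigned_weight(weights, cls, cutoff=0.5):
    """Average weight of the documents assigned to cls; None if there are none."""
    assigned = [per_class[cls] for per_class in weights.values()
                if per_class.get(cls, 0.0) >= cutoff]
    return sum(assigned) / len(assigned) if assigned else None

# Hypothetical classifier weights for two consecutive periods.
period_1 = {"d1": {"energie": 0.9}, "d2": {"energie": 0.8}, "d3": {"energie": 0.2}}
period_2 = {"d4": {"energie": 0.6}, "d5": {"energie": 0.7}, "d6": {"energie": 0.1}}

for name, period in (("period 1", period_1), ("period 2", period_2)):
    assigned = [d for d, classes in threshold(period).items() if "energie" in classes]
    print(name, len(assigned), mean_assigned_weight(period, "energie"))
```

In this made-up example both periods assign two documents to energie, so the volume is unchanged, while the mean weight drops. That is the kind of signal that would be invisible in a map based on the binary classification alone.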


Visualization for monitoring purposes may be implemented simply by producing various maps over time. A user interface is also conceivable that allows its users to “go back and forth in time”, showing the changes using smooth animations.

5.3. Visualizing Automatic Classification Quality

In this subsection we describe a recently devised and speculative method for giving a visual impression of a classifier’s quality. Section 4.3 introduced the training and test sets that are used when training an automatic classifier. By visualizing both the original classification of the training set and the classification produced by the classifier, the classification quality can be assessed; ultimately both visualizations should be identical. Another approach is to visualize the original classification and highlight those documents that are incorrectly classified by the classifier. This allows one to quickly survey the quality of the classifier and detect parts of the taxonomy that appear to be difficult to characterise. Figure 8 shows a mock-up of such a visualization, using grey spheres for correctly classified documents and red spheres for errors. This visualization tells us that the classification quality is reasonable for the two classes at the right but that it can be improved for the classes at the left.

Figure 8: Visualizing classification errors.

Note that this visualization gives considerably more information about the classifier’s quality than a ratio of correctly classified documents. Using this information one can decide to change certain parameter settings (some algorithms use separate parameters for each class) or even switch to a different algorithm. Subsequent changes in the results can immediately be visualized, giving instant insight into the obtained improvements.

This visualization method can also be applied on the test set, resulting in a visual overview of the generalization ability of the algorithm. This can again give rise to different parameter settings or the choice of a different algorithm.

Instead of using the original classification, one can also use the classification produced by the classifier as a basis for the map and highlight the errors in it. The resulting visualization has a different meaning. When the original classification is displayed, one clearly sees which parts of the taxonomy are hard to learn. On the other hand, when the classifier’s classification is used, one gets an overview of the reliability of the classifier’s classification per class, i.e. it indicates for every class whether documents classified as such are likely to really belong to that class. Note that this difference in content corresponds more or less to the concepts of recall and precision, which are common error measurements for information retrieval and machine learning algorithms.

5.3.1 Problems to be Investigated

Although this visualization method seems promising, there are still a number of problems that need to be investigated. First, there are some methodological issues, e.g. whether it is sound to use the visualization of the training set; maybe one should only use the test set. Also, it has to be clearly formulated how these visualization methods relate to the precision and recall concepts, as they are very common quality measurements in Machine Learning and Information Retrieval.

Furthermore, a shortcoming of the proposed visualization is that, when the original classification is visualized, a user will want to know the classifier’s classification of an incorrectly classified object, and vice versa. For example, in Figure 8 (which is based on the original classification) one might wonder whether the large number of errors arises because the algorithm fails to differentiate between Heat Insulation and Moistureproof Constructing or whether the incorrectly classified objects are simply distributed over all other classes by the classifier. Related to this is the fact that, since objects may belong to multiple classes, the current visualization gives no indication of the degree of misclassification of an object. Various solutions are conceivable, e.g. offering new ways of user interaction (mouse-over effects, animations, etc.) or an extension of the visualization metaphors.
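The two open questions above, where the classifier put the documents it missed and how badly an individual document is misclassified, could for instance be answered with counts such as the following. This is a sketch under assumptions, not part of the proposed visualization: the function names are hypothetical, the labels are invented, and the size of the symmetric difference is only one possible definition of the degree of misclassification.

```python
from collections import Counter

def misclassification_degree(truth, predicted, doc):
    """Number of class memberships that differ for one document."""
    return len(truth[doc] ^ predicted[doc])

def assigned_instead(truth, predicted, cls):
    """For documents of cls that the classifier missed, count the classes it chose."""
    counts = Counter()
    for doc, classes in truth.items():
        if cls in classes and cls not in predicted[doc]:
            counts.update(predicted[doc] - classes)
    return counts

# Hypothetical original classification and hypothetical classifier output.
truth = {
    "d1": {"Heat Insulation"},
    "d2": {"Heat Insulation"},
    "d3": {"Heat Insulation", "Materials"},
    "d4": {"Moistureproof Constructing"},
}
predicted = {
    "d1": {"Moistureproof Constructing"},
    "d2": {"Moistureproof Constructing"},
    "d3": {"Heat Insulation"},
    "d4": {"Moistureproof Constructing"},
}

print(assigned_instead(truth, predicted, "Heat Insulation"))
print({doc: misclassification_degree(truth, predicted, doc) for doc in truth})
```

A count that concentrates on one other class would suggest that the classifier confuses the two, whereas a flat distribution would point to errors spread over the taxonomy. Such numbers could drive the mouse-over effects or visual extensions mentioned above.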


5.3.2 Related Work

There are few systems that use visualization for aiding machine learning tasks and that go beyond regular charts and similar diagrams (see [5] for an introduction to this field). A prominent example is MineSet [8], a system for (among other things) classifying objects based on decision tree and Bayesian methods. A difference with our approach is that MineSet uses the visualization to let the user understand the classifier’s structure (e.g. the decision tree), with only a secondary role for the resulting classification. Our approach abstracts from the inner workings of the classifier, still regarding it as a black box, but tries to increase the user’s trust in and understanding of the classifier’s capabilities by focusing on the difference between the desired and the actual output. Consequently, our approach is also algorithm-independent. A drawback is that we cannot give an explanation of the classifier’s mistakes, although it is not clear how many of our users (the Bouwradius information managers) would be able to comprehend such an explanation in the first place without expert assistance.

6. Summary

In this paper we have explored a number of visualizations that assist in building and maintaining taxonomy-based applications such as portals. These visualizations are specifically aimed at assisting information managers, rather than information consumers. The initial results show that the visualizations already benefit current practice, since they enable a user to achieve insight into the existing classification. Furthermore, visualization methods are formulated that assist in monitoring an information repository over time and assessing the quality of the output of automatic classification tools.

Several topics for further research and development have been mentioned, varying from the improvement of the current visualization component (e.g. increasing the number of classes that can be visualized at the same time, zooming and filtering functionality) to the development of special-purpose viewers optimised for certain tasks. Empirical results are strongly needed to guide future development.

References

[1] C. Chen. Information Visualisation and Virtual Environments. Springer-Verlag, London, 1999.
[2] M. Dodge, R. Kitchin. Atlas of Cyberspace. Addison Wesley, 2001.
[3] C. Fluit, H. ter Horst, J. van der Meer and M. Sabou. Spectacle. In J. Davies, D. Fensel, and F. van Harmelen, editors, Towards the Semantic Web: Ontology-driven Knowledge Management. John Wiley, 2002 (to appear).
[4] C. Fluit, M. Sabou and F. van Harmelen. Ontology-Based Information Visualisation. In V. Geroimenko, C. Chen, editors, Visualising the Semantic Web. Springer-Verlag, 2002 (to appear).
[5] U. Fayyad, G. Grinstein and A. Wierse. Information Visualisation in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001.
[6] F. van Harmelen, J. Broekstra, C. Fluit, H. ter Horst, A. Kampman, J. van der Meer and M. Sabou. Ontology based Information Visualisation. In Proceedings of the Fifth International Conference on Information Visualisation. IEEE Computer Society, 2001.
[7] A. Maedche, S. Staab, N. Stojanovic, R. Studer and Y. Sure. SEAL – A Framework for Developing SEmantic Web PortALs. In Proceedings of the British National Conference on Databases, 2001.
[8] MineSet. http://www.sgi.com/software/mineset.html.
[9] S. Staab and A. Maedche. Knowledge Portals – Ontologies at Work. AI Magazine, 21(2), 2001.
