Learning Comprehensible Descriptions of Multivariate Time Series

Mohammed Waleed Kadous
School of Computer Science & Engineering, University of New South Wales, Sydney, NSW 2052, Australia
[email protected]

Abstract

Supervised classification is one of the most active areas of machine learning research. Most work has focused on classification in static domains, where an instantaneous snapshot of attributes is meaningful. In many domains, attributes are not static; in fact, it is the way they vary temporally that can make classification possible. Examples of such domains include speech recognition, gesture recognition and electrocardiograph classification. While it is possible to use ad hoc, domain-specific techniques for "flattening" the time series to a learner-friendly representation, this fails to take into account both the special problems and special heuristics applicable to temporal data, and often results in unreadable concept descriptions. Though traditional time series techniques can sometimes produce accurate classifiers, few can provide comprehensible descriptions. We propose a general architecture for classification and description of multivariate time series. It employs event primitives to analyse the training data and extract events. These events are clustered, creating prototypical events which are used as the basis for creating more accurate and comprehensible classifiers. A minimal implementation of this architecture, called TClass, is applied to two domains, one real and one artificial, and compared against a naïve approach. TClass shows great promise, particularly in comprehensibility, but also in accuracy.

Keywords: machine learning, classification, temporal classification, gesture recognition, time series.
1 INTRODUCTION

One prominent area of machine learning research is supervised classification, and in particular attribute-value learning. Attribute-value learners typically assume that each training instance is defined by a set of attributes that, for the purposes of the learner, do not change. In many domains, however, it is this change in attributes over time that makes classification possible. Examples of such domains include recognition of gestures, medical signal analysis, robot sensor analysis and mining in temporal databases.

To make the distinction clear, consider the speech recognition task, where an audio signal representing a word must be classified. The audio signal consists of 22 spectral frequency coefficients, updated 50 times a second. Looking at the values of the coefficients at a single instant of time is not very useful for classification; rather, classification can only be performed by looking at the changes of these coefficients over time.

It is possible to extract features from time series, then apply a conventional learner. There are several drawbacks to such an approach. Firstly, the techniques tend to be ad hoc, domain-specific and labour-intensive; the current work arises from the author's frustrating experiences trying to do this for the sign recognition task [Kadous, 1995]. Secondly, there are special heuristics applicable to temporal domains that are difficult to capture by such a conversion process (for example, different parts of the instance being slightly stretched or delayed in time). Thirdly, descriptions built using these extracted features can be hard to understand.

A description of the problem, followed by two examples of temporal classification tasks, are given. A general architecture is proposed for the classification and description of multivariate time series. An implementation of the system, called TClass, is presented, as is its application to the two temporal classification tasks.

1.1 PROBLEM DESCRIPTION

As with conventional supervised classification, the input consists of a set of training instances and associated class labels. In this case, however, the instances are streams. Each stream consists of a sequence of frames. Each frame represents an instant of time of the stream, and consists of a set of measurements or values from different sources, termed channels. The number of frames need not be fixed. Figure 1 illustrates the relationship between these terms. Each stream is also labelled with its class¹.

Figure 1: The relationship between channels, frames and streams. [Diagram: a stream is a sequence of frames along the time axis; each frame holds one value per channel.]

Our main goals in this work are: to produce a classifier with a low error rate, and to produce descriptions which are comprehensible to a human.

¹ It is also possible to conceive of a more complex learning task, where each stream has a sequence of class labels. However, for the rest of this paper, we will only consider single class labels.

2 EXAMPLE LEARNING TASKS

Because of the novelty of classification in temporal domains, standard datasets, such as those used in static classification, are not easy to find. For this reason, we have had to use one artificial domain and one natural domain where we collected the data ourselves. We discuss these domains first to give a practical example of typical temporal classification problems.

2.1 CYLINDER-BELL-FUNNEL

The artificial cylinder-bell-funnel task was originally proposed by Saito [Saito, 1994], and further worked on by Manganaris [Manganaris, 1997]. The task is to classify a stream as one of three classes, cylinder (c), bell (b) or funnel (f). Samples are generated as follows:

    c(t) = (6 + η) · χ[a,b](t) + ε(t)
    b(t) = (6 + η) · χ[a,b](t) · (t − a)/(b − a) + ε(t)
    f(t) = (6 + η) · χ[a,b](t) · (b − t)/(b − a) + ε(t)

    χ[a,b](t) = 0 if t < a;  1 if a ≤ t ≤ b;  0 if t > b

where η and ε(t) are drawn from a standard normal distribution N(0, 1), a is an integer drawn uniformly from [16, 32] and b − a is an integer drawn uniformly from [32, 96]. See figure 2 (right hand side) for some instances of each class. The cylinder class is characterised by a plateau from time a to b, the bell class by a gradual increase from a to b followed by a sudden decline, and the funnel class by a sudden increase at a and a gradual decrease until b.

Although univariate (i.e., it only has one channel) and of a fixed length (128 frames), the CBF task attempts to characterise some of the typical properties of temporal domains. Firstly, there is random amplitude variation as a result of the η in the equation. Secondly, there is random noise (represented by the ε(t)). Thirdly, there is significant temporal variation in both the start of events (since a can vary from 16 to 32) and the duration of events (since b − a can vary from 32 to 96). The left-hand side of figure 2 shows the rules generated by our system and discussed in the rest of this paper.

2.2 AUSLAN SIGN RECOGNITION
The goal for this learning task is to classify a subset of Auslan (Australian sign language, the language of the Australian Deaf community) signs. Instances of the signs were collected using an instrumented glove². The domain consists of eight channels: x (left/right), y (up/down) and z (away/towards the body) position, wrist roll (whether the palm is facing up or down), and thumb, fore, middle and ring finger bend. This information is updated approximately 23 times a second. Thus each frame consists of the value of these eight channels at an instant in time.

² A Nintendo PowerGlove was used to collect the data. It is highly sensitive to noise, suffers low temporal and value resolution, is sensitive to environmental factors, only captures information from one hand and has no sensor on the little finger. Newer equipment has been procured which should significantly improve the quality of the data, and consequently classification.

The signs were collected one at a time, starting and ending at a fixed
Rule 7:
    local max at time 31
    local min at time 116
    -> class cyl

Rule 1:
    increases from time 28 to 52
    local max at time 48
    local max at time 67
    -> class bell

Rule 12:
    decreases from time 34 to 48
    flat from 2 to 26
    local max at time 61
    -> class funnel

Figure 2: Selected rules produced by TClass in the CBF task and some instances of each class. [Plots of Value against Time (0-120) for cylinder, bell and funnel examples appear alongside the rules above.]
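As a concrete sketch, the CBF generation process described in section 2.1 can be written as follows (function and parameter names are illustrative, not from the original implementation):

```python
import random

def cbf_instance(cls, length=128):
    """Generate one CBF stream: (6 + eta) * shape(t) + eps(t)."""
    eta = random.gauss(0, 1)           # random amplitude variation
    a = random.randint(16, 32)         # event start
    b = a + random.randint(32, 96)     # event end (duration 32..96)
    stream = []
    for t in range(length):
        chi = 1.0 if a <= t <= b else 0.0   # characteristic function chi_[a,b]
        if cls == "cylinder":
            shape = chi                       # plateau from a to b
        elif cls == "bell":
            shape = chi * (t - a) / (b - a)   # gradual rise, sudden drop
        elif cls == "funnel":
            shape = chi * (b - t) / (b - a)   # sudden rise, gradual fall
        else:
            raise ValueError(cls)
        stream.append((6 + eta) * shape + random.gauss(0, 1))
    return stream
```

Each call draws fresh η, ε(t), a and b, so repeated calls produce the within-class amplitude and timing variation visible in figure 2.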
Rule 6: z: global maximum is 0.070312 [hand goes far away from body] z: increases from time 10 to 30 [hand moving to far position shown] fore: flat from 0 to 5 with value 0.72 [initially finger is fully bent] -> class come
Rule 7: y: local maximum at time 38 [i.e. touching chin as shown] z: decreases from time 11 to 31 [hand is pulled towards body] fore: flat from time 3 to 41 with value 0.02 [forefinger kept straight for most of sign] -> class thank
Figure 3: Selected rules produced by TClass and the corresponding definitions from the Auslan Dictionary [Johnston, 1989].
Table 1: The training set for the BR task.

    Stream  c                Class
    1       FFFTTTFFFFFF     Red
    2       FFFTFFTFFTTTT    Blue
    3       FFTFFTFFFFFFTTT  Blue
    4       FFFFTTTTFFFFF    Red
    5       FFFTTTFFFF       Red
    6       FFTTFFTFFTTT     Blue
point (a closed fist at the side of the body). The average number of frames per sign was 51, but varied from 30 frames up to 102 frames.

Figure 3 shows two signs from the Auslan domain. The top part shows the gloss (dictionary definition) of the sign from the Auslan dictionary [Johnston, 1989]. The lower part shows some of the rules extracted by TClass.
3 ARCHITECTURE AND IMPLEMENTATION

We propose a general architecture for temporal classification problems, as shown in figure 4. To explain concisely, we will use a simple illustrative domain, the Blues and Reds (BR) task. This domain has a single channel (c), which can only have two possible values: true (T) or false (F). There are two possible classes for each stream, Blue or Red. A sample training set is shown in table 1. As can be seen, streams vary in length from 10 frames to 15 frames.

The data is processed in two ways: global feature calculation and event extraction. Global feature calculation extracts various features of the stream as a whole. Typical global features include global maxima, global minima, means of each channel and the duration of the stream. For many temporal domains, these global features provide a good "first cut" representation for classification³. In the BR task, a simple global feature is the ratio of time that the channel c is true to the overall duration of the stream. The results of applying this to the data of table 1 are shown in column 2 of table 3.

Event extraction is applied by using a set of parametrised event primitives (PEPs). PEPs represent particular types of "events" that are expected to occur in the domain (the best PEPs are typically

³ For example, in the large Auslan domain, 30 per cent accuracy can be obtained using only the maxima and minima of the x, y and z channels.
Table 2: TrueRun PEP applied to the BR training set.

    Stream  Events found
    1       {(3, 3)}
    2       {(3, 1), (6, 1), (9, 4)}
    3       {(2, 1), (5, 1), (12, 3)}
    4       {(4, 4)}
    5       {(3, 3)}
    6       {(2, 2), (6, 1), (9, 3)}
domain-specific). Each PEP is defined by:

- A tuple of parameters p = (p1, ..., pk), which describe a particular parametrised event. Let P1 represent the range of the parameter p1. Let P be the space of possible parameter values, i.e., P = P1 × P2 × ... × Pk. P will be referred to as the parameter space.

- A finding function f which takes a stream s and returns a set of parametrised events e, such that e ⊆ P.
A key characteristic of a PEP is that it should explicitly represent the temporal characteristics of the events as parameters, thus representing time in a way that facilitates learning. One simple PEP we could use with the BR task is one that finds runs of T values and returns a tuple (t, d), where t is the time the run starts and d is the duration of the run. The finding function would be given a stream and return all such pairs (t, d). The results of applying this PEP to the BR training data are shown in table 2. We shall term this PEP the TrueRun PEP.

Since each PEP is described in terms of a set of parameters, clustering can be performed in the parameter space of each PEP. For example, if the data from table 2 is plotted (as in figure 5), we can cluster it as shown. Typically, rather than hand-clustering, some clustering algorithm would be used. Each cluster then forms the basis of a synthetic event attribute. For each stream, the value of the synthetic event attribute is true only if it has an event belonging to that cluster. More generally, a confidence metric of cluster membership is used, rather than just a true/false value.

Once we have these clusters, the training data can be re-attributed with these synthetic event attributes. Applying this to the BR data, we get the results shown in columns 3, 4 and 5 of table 3. The data from the global and event processing streams
are recombined, in a format suitable for processing by an attribute-value learner. The result of applying c4.5rules [Quinlan, 1993] to the BR data in table 3 is shown in figure 6 (left side). Furthermore, if there is some way to produce a prototype from the clusters (for instance, by finding the centroid of the cluster), we can substitute the prototype's description to produce a descriptive rule⁴ (as shown on the right hand side of figure 6). It should also be pointed out that because of the way C4.5 operates, the rule for Blue is discriminant (i.e., it says how to tell Blue from Red), rather than characteristic (giving the defining characteristics of Blue).

Figure 4: General architecture for training. [Diagram: raw training data feeds global feature calculation and, via parametrised event primitives, event extraction; the parametrised event data is clustered in parameter space to give synthetic events (stored in a synthetic events database), which are used for event attribution; the resulting synthetic event attributes are combined with the global features and passed to a conventional learner to induce a classifier.]
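The event-clustering and attribution step can be sketched as follows. The centroids below are illustrative stand-ins for clusters A, B and C of figure 5, and the helper names are not TClass's actual interface; TClass also normalises parameter values by standard deviation, which this sketch omits:

```python
def nearest(event, centroids):
    """Index of the centroid closest (squared Euclidean) to an event (t, d)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(event, centroids[i])))

def attribute(events_per_stream, centroids):
    """Boolean synthetic event attributes: one column per cluster, true if
    the stream has at least one event assigned to that cluster."""
    rows = []
    for events in events_per_stream:
        assigned = {nearest(e, centroids) for e in events}
        rows.append([i in assigned for i in range(len(centroids))])
    return rows
```

With the TrueRun events of table 2 and illustrative centroids (4, 1), (3, 3) and (10, 3), stream 2's late long run (9, 4) lands in the third (C-like) cluster, the membership that drives the Blue rule of figure 6.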
Figure 5: TrueRun parameter space and clustering for BR task. [Scatter plot of TrueRun events, start time t (0-12) against duration d (0-5), grouped into clusters A, B and C.]
Table 3: Global and event attributes for the BR task.

    Stream  True ratio  A    B    C
    1       0.25        No   Yes  No
    2       0.46        Yes  No   Yes
    3       0.33        Yes  No   Yes
    4       0.31        No   Yes  No
    5       0.30        No   Yes  No
    6       0.50        No   No   Yes
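A minimal sketch of the TrueRun finding function and the true-ratio global feature that underlie tables 1-3 (function names are illustrative, not TClass's actual interface):

```python
def true_run_events(stream):
    """Finding function for the TrueRun PEP: return (start, duration)
    pairs for every maximal run of T values in a stream of 'T'/'F'."""
    events, t = [], 0
    while t < len(stream):
        if stream[t] == "T":
            start = t
            while t < len(stream) and stream[t] == "T":
                t += 1
            events.append((start, t - start))
        else:
            t += 1
    return events

def true_ratio(stream):
    """Global feature: fraction of frames in which channel c is true."""
    return stream.count("T") / len(stream)
```

For stream 2 of table 1, true_run_events("FFFTFFTFFTTTT") yields [(3, 1), (6, 1), (9, 4)], matching table 2, and true_ratio gives 6/13 ≈ 0.46, matching column 2 of table 3.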
Rule 1:
    clusterC = Yes -> class blue  [63.0%]
Default class: red

Rule 1 (post processed):
    true run starting at 10 duration 3 -> class blue
Default class: red

Figure 6: Rules produced by applying c4.5rules to table 3.

3.1 IMPLEMENTATION

An instance of the above architecture, called TClass, has been implemented. This implementation is designed in an object-oriented manner to allow different PEPs, global feature calculators, clustering algorithms and learners to be substituted. Currently, the following global feature calculators are available: mean, median, mode, maximum value and minimum value for a channel. These can be applied to any channel. Similarly, the PEPs implemented are:

- Increasing (an extended period with significant positive gradient) and Decreasing (an extended period with significant negative gradient). These both return four parameters: start time, duration, average value of the channel, and average gradient.

- Flat (no perceptible change in gradient). This returns three parameters: the start time, average value and duration.

- Localmax and Localmin⁵. These both return two parameters: the time of the maximum/minimum, and the value.

All of these have some simple heuristics for dealing with noise. For instance, the local maximum PEP first applies a smoothing filter to reduce noise before locating maxima.

Currently, TClass uses k-means clustering, with all values normalised by standard deviation. The confidence metric used for cluster membership is the distance to the cluster's centroid. The learner can be either a naïve Bayes learner⁶ or C4.5 [Quinlan, 1993] with the default parameters.

Empirically, it was found that the k-means clustering algorithm did not perform well when all of the events from different classes were clustered simultaneously. Thus a slight modification was made to the architecture: the basic architecture was replicated from the event clustering stage onwards on a per-class basis, as shown in figure 7. Event clustering is now performed only on events coming from instances of the same class. These clusters are then used to extract the synthetic events as before. All training instances are then attributed⁷ with these synthetic events, creating a set of synthetic event attributes. These class-specific synthetic event attributes are combined with

⁴ Such a rule is not suited for classification; it is only for description.
⁵ Note the distinction between a global maximum (over the whole signal) and a local maximum (a peak relative to the points surrounding it).
⁶ Laplace correction [Domingos and Pazzani, 1997] and equal frequency discretisation into 5 bins [Dougherty et al., 1995] for continuous values were used.
⁷ In this context, attribution occurs by finding the event from the instance that most closely matches the synthetic event. The distance between the synthetic event and the closest match is used as the synthetic event attribute.
the class-independent global features, creating a set of features that are used by the learner to induce a classifier. Because one such classifier is constructed per class, if there are n classes, there are n classifiers induced.

To classify an unseen instance, event extraction and global feature calculation are applied. For each class, the instance is then attributed with that class's synthetic events, thus generating the synthetic event attributes required by the classifier associated with that class. The instance is then classified by each classifier. In this way, n classifications will be made. The most common classification is then taken as the final prediction.
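The per-class classify-and-vote procedure can be sketched as follows; the classifier and attributor objects are illustrative stand-ins (plain callables), not TClass's actual interfaces:

```python
from collections import Counter

def classify(instance, per_class_classifiers, attributors):
    """Vote among per-class classifiers, as in the modified architecture.

    per_class_classifiers: {class_name: classifier}, one induced per class.
    attributors: {class_name: fn} mapping an instance to that class's
    synthetic event attributes (combined with the global features).
    """
    votes = []
    for cls, clf in per_class_classifiers.items():
        features = attributors[cls](instance)   # class-specific attribution
        votes.append(clf(features))             # each classifier predicts a class
    # The most common prediction is taken as the final classification.
    return Counter(votes).most_common(1)[0][0]
```

With n classes this yields n classifications, of which the majority wins, as described above.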
4 EXPERIMENTAL RESULTS

4.1 ERROR RATE

TClass was applied to both the Auslan and CBF tasks to get some idea of its accuracy compared with other techniques. Unfortunately, there is no central repository for temporal classification datasets (unlike conventional classification). This limits how rigorously the accuracy of TClass can be evaluated.

For the CBF task, all five PEPs were applied to the only channel. The global mean was the only global feature used. 266 instances of each class were generated. For ease of comparison with previous results, 10-fold cross-validation was used.

For the Auslan task, all five PEPs were applied to the x, y and z channels. Increasing, Decreasing and Flat were applied to the remaining rotation and finger bend channels. The mean was calculated for each channel, as were the global maximum and minimum for the x, y and z channels. Two tests were done, one with a small dataset consisting of 10 signs, and the other with a larger dataset with 95 signs. To simplify comparison with previously published results, 5-fold cross-validation was performed.

A naïve segmentation approach to feature extraction was used to allow comparison with TClass's performance. Each channel was divided into 10 segments along the
time axis. The mean for each segment was computed and used as a feature. This created 10 features for the CBF task and 80 for the Auslan task. The same global features used with TClass were also combined with the segment features before being processed with the learners. Two learning algorithms were compared: naïve Bayes and C4.5. The error rates of these experiments are shown in table 4.

Figure 7: Approach used in the implementation to simplify the clustering task. Classifier 1 through to Classifier N are voted to make a final classification. [Diagram: the architecture of figure 4 is replicated per class from the event clustering stage onwards; each class's parametrised event training data is clustered in parameter space, the resulting synthetic events are used for attribution, combined with the global training data, and a conventional learner induces one classifier per class.]

Table 4 shows some interesting patterns. Firstly, the "segment" approach to feature extraction performs surprisingly well, especially in the small Auslan task, where there is no significant difference⁸ between it and the TClass approach. At 2.4 per cent error it outperforms the best known published results for the CBF task, including complex approaches involving local discriminant bases using wavelets (3.75 per cent) [Saito, 1994] and trend episode analysis (2.98 per cent) [Manganaris, 1997], but is beaten by TClass's 1.9 per cent error rate. TClass has a better error rate overall compared to the naïve segmentation technique, regardless of the underlying learner.

Learner performance is consistent with previous results [Domingos and Pazzani, 1997], in that in the Auslan task, which is characterised by few data (only 16 instances to train on per class), many classes and
many attributes, naïve Bayes outperforms C4.5. On the other hand, in a domain with many data (240 training instances per class), few channels and few classes, such as the CBF task, C4.5 outperforms naïve Bayes.

The author had previously attempted both the small and large Auslan tasks using domain-specific, labour-intensive, ad hoc feature extraction and iterative analysis of the errors to guide the inclusion of additional features, followed by feature selection and a nearest neighbour classifier [Kadous, 1995]. TClass exceeds the performance of this approach for the small Auslan task (2.5 per cent error rate vs 8.0 per cent), but lags behind in the large Auslan task (24.0 per cent error rate vs 17.0 per cent). Considering the PEPs are generic, this is hardly surprising.

It is also interesting to note the positive impact of using an ensemble of classifiers based on class. For example, in the large Auslan task, with TClass features and naïve Bayes learning, the average error rate of an individual class classifier was 36.1 per cent, whereas the combined classifier had an error rate of 24.0 per cent. Similarly, in the CBF task, the average error rate of individual class classifiers was 5.6 per cent, compared with the combined classifier's error rate of 1.9 per cent.
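The naïve segmentation baseline described above can be sketched as follows (the function name is illustrative):

```python
def segment_features(channel, n_segments=10):
    """Divide a channel into n_segments equal slices along the time axis
    and use the mean of each slice as a feature (assumes the channel has
    at least n_segments frames, as in both tasks here)."""
    n = len(channel)
    feats = []
    for i in range(n_segments):
        lo = i * n // n_segments
        hi = (i + 1) * n // n_segments
        seg = channel[lo:hi]
        feats.append(sum(seg) / len(seg))
    return feats
```

Applied per channel, this yields the 10 features for CBF and 8 × 10 = 80 features for Auslan used in the comparison.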
⁸ All significance statements are based on a paired t-test at the 99 per cent confidence level.
Table 4: Error rates expressed as percentages (mean ± standard error of the mean) on the learning tasks.

    Features/Learner        Small Auslan    Large Auslan    CBF
    TClass/Naïve Bayes      2.50 ± 0.79     24.00 ± 1.09    3.67 ± 0.61
    Segments/Naïve Bayes    3.00 ± 1.46     29.58 ± 1.10    6.20 ± 0.67
    TClass/C4.5             14.00 ± 2.03    40.37 ± 0.83    1.90 ± 0.57
    Segment/C4.5            18.00 ± 2.67    53.37 ± 1.18    2.41 ± 0.48

4.2 COMPREHENSIBILITY

An important advantage of TClass over other techniques is that it produces comprehensible results. For instance, after combining the outputs of synthetic events from the clusterer and c4.5rules, we can produce rules that may be useful for understanding what the learner is doing.

Selected rules produced by TClass for the CBF task (see figure 2) capture some essential characteristics of each class. Because C4.5 is designed for discrimination, not characterisation, the rules should be interpreted with this in mind. For instance, it is unlikely that anything except a cylinder class will have both a maximum early in the signal and a minimum towards the end of the signal. But this does not define the cylinder class, only how it differs from the other classes.

Because matching is based on confidences, one event may actually "register" (i.e., produce a non-zero confidence) with multiple synthetic features. For instance, for the bell class, if there was a local maximum at time 55, it would match, with a certain confidence, both the local maximum at time 48 and the local maximum at time 61. Even a maximum at time 80 would match both of these events, but with less confidence.

For the Auslan domain, selected rules found for two signs are shown against the glosses from the Auslan Dictionary [Johnston, 1989] in figure 3. Comments in square braces explain the connection between the rule and the gloss. As can be seen, some of the key characteristics of each sign are identified. For example, the identifying features for the come sign are that the hand travels far from the body (as indicated by the maximum z), that there is a sudden movement of the hand away from the body early in the sign, and that the finger is initially fully bent. For the thank sign, the detected features are that there is a local maximum on channel y about midway through the sign (corresponding to the point in time where the hand touches the chin), that there is a pulling of the hand towards the body before the chin is touched, and that the forefinger remains straight for most of the sign.

The comprehensibility of descriptions on both tasks is, to the author's knowledge, novel. HMMs and other techniques have not, to the author's knowledge, been used to create comprehensible descriptions.
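The distance-based matching described above (and in footnote 7) can be sketched as follows; the Euclidean metric over raw parameters and the helper names are illustrative, since TClass normalises parameter values by standard deviation:

```python
import math

def distance(event, prototype):
    """Euclidean distance between an event and a synthetic-event
    prototype in the PEP's parameter space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(event, prototype)))

def attribute_value(events, prototype):
    """Synthetic event attribute for one instance: the distance from the
    prototype to the instance's closest matching event. Smaller distance
    means higher confidence of a match."""
    return min(distance(e, prototype) for e in events)
```

For the bell example above, a local maximum at time 55 is a closer (more confident) match to the prototype at time 48 than a maximum at time 80 would be, yet both register with non-zero confidence.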
5 RELATED WORK

The best developed technique for temporal classification is the hidden Markov model (see [Rabiner and Juang, 1986] for more details). HMMs have proved themselves to be very useful for speech recognition, and are the basis of most commercial systems. Despite this, they do suffer some serious drawbacks for general use. Firstly, the structure of the HMM (similar to that of a finite state machine) needs to be specified a priori. The structure selected can have a critical effect on performance. Secondly, extracting comprehensible rules from HMMs is not at all easy. Thirdly, even making some very strong assumptions about the probability model, there are frequently hundreds or thousands of parameters per HMM. As a result, many training instances are required to learn effectively. None of these three problems is of primary importance in the speech recognition domain⁹.

A recent development has been dynamic Bayes networks (DBNs), a superset of HMMs, for temporal classification tasks. DBNs augment HMMs by allowing for more complex representations of the state space. Zweig and Russell, for example, were able to obtain higher accuracy than conventional hidden Markov models [Zweig and Russell, 1998] on a speech recognition task. The structure of their dynamic Bayes network was informed by physiological models of the vocal system; while this is feasible for speech recognition, it may not be for other tasks. Friedman et al. [Friedman et al., 1998] explore some techniques for learning DBN structure; these techniques are still primitive.

⁹ It is difficult to compare the current work with HMMs in an objective manner, since it is not clear what states and transitions are appropriate for the domains used in this work. Furthermore, because of the noise levels in the data and the small number of instances per class in the Auslan domain, we found in preliminary experiments that in many cases a simple 3-state (54 parameter) HMM did not converge.
Another technique that has gained some use is recurrent neural networks (RNNs) ([Bengio, 1996] has a good discussion of their application to temporal classification). This method utilises a normal feed-forward neural network, but introduces a "context layer" that is fed back to the hidden layer one timestep later. The context layer allows for retention of some state information. RNNs suffer from many of the same problems as HMMs: incomprehensibility, many parameters that need to be set with little guidance from theory, and slow learning. In addition, they do not work well for longer sequences of observations [Bengio, 1996].

Some work has also been completed on signals with high-level event sequence descriptions. Rather than representing temporal information as a time-sampled signal, as in this work, temporal information is represented as a set of timestamped events with parameters. This is a higher level of temporal abstraction than is used in this work, but is applicable to many problems, for example network traffic analysis [Mannila et al., 1995] or network failure analysis [Oates et al., 1998]. In such cases, researchers look for sequences of events that cause particular phenomena.

Several temporal expert systems have also been developed. Shahar [Shahar and Musen, 1995] suggests an expert system architecture for knowledge-based temporal abstraction in medical domains. This is extended to a framework for temporal abstraction in his later work [Shahar, 1997]. This framework is more extensive than the one presented here, but has a different focus, namely the building of expert systems; it seems overly complicated for learning purposes. Paliouras [Paliouras, 1997] shows how to automatically refine parameters in an existing temporal expert system, and applies this technique to the analysis of whale songs.

Recently, machine learning approaches have shown some promise.
Manganaris [Manganaris, 1997] developed a system for supervised classification of univariate (not multivariate, as in this work) signals using piecewise polynomial modelling, and applied it to space shuttle data as well as artificial datasets. Keogh and Pazzani [Keogh and Pazzani, 1998] developed a technique for agglomerative clustering of univariate time series based on enhancing the time series with a line segment representation. By using delay portraits (looking at the relationship between a channel at time t and time t − n), Rosenstein and Cohen [Rosenstein and Cohen, 1998] improve the reliability of robot sensor readings. Das et al. [Das et al., 1998] extract simple rules from univariate datasets.
6 CONCLUSIONS & FUTURE WORK

6.1 CONCLUSIONS

The general architecture proposed improves both accuracy and comprehensibility by taking advantage of the special properties of temporal classification and dealing with its specific problems. An implementation of this architecture, called TClass, was applied to several learning tasks. It proved to be more accurate than a naïve approach, competitive with hand-crafted solutions, and in some cases showed better performance than complex feature extraction techniques, while still providing useful descriptions of learnt concepts.
6.2 FUTURE WORK

The PEPs presented are simple and do not identify domain-specific features. Surprisingly, they perform relatively well. It is possible to improve performance by designing multi-channel PEPs (for example, rather than treating x, y and z separately, a PEP could treat them as a single unit). Furthermore, the PEP design did not make use of a useful property of temporal domains: correlated events on different channels.

The classifiers used here do not use feature subset selection at all. Previous research [John et al., 1994] shows that learners may benefit significantly from feature selection, especially in cases like this one, where so many features are generated by event clustering.

The descriptions produced could be significantly improved over those generated by C4.5, since C4.5 is designed to tell classes apart; as a result, it does not always produce characteristic descriptions, preferring discriminant ones [Quinlan, 1998]. While this is appropriate for a tree, it is not necessarily appropriate for building intuitive descriptions of temporal classes. Future work will focus on alternative learning algorithms for producing comprehensible descriptions.
Acknowledgements Thanks to The Creator for giving me the ability to do this work. Secondarily, my sincere thanks go to Michael Harries for assistance in many forms; Claude Sammut for sparing me time when it was very valuable to him; Ross Quinlan for many suggestions; my wonderful family; and to Hoi Shan Chong for helping me stay calm and stable.
References

[Bengio, 1996] Bengio, Y. (1996). Neural Networks for Speech and Sequence Recognition. International Thomson Publishing Inc.

[Danyluk, 1998] Danyluk, A. (1998). Predicting the future: AI approaches to time-series problems. Technical Report WS-98-07, AAAI Press.

[Das et al., 1998] Das, G., Lin, K.-I., Mannila, H., Renganathan, G., and Smyth, P. (1998). Rule discovery from time series. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98). AAAI Press.

[Domingos and Pazzani, 1997] Domingos, P. and Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29:103-130.

[Dougherty et al., 1995] Dougherty, J., Kohavi, R., and Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. In Machine Learning: Proceedings of the 12th International Conference on Machine Learning, pages 194-202. Morgan Kaufmann.

[Friedman et al., 1998] Friedman, N., Murphy, K., and Russell, S. (1998). Learning the structure of dynamic probabilistic networks. In Proceedings of the Uncertainty in Artificial Intelligence Conference 1998 (UAI-98). AAAI Press.

[John et al., 1994] John, G. H., Kohavi, R., and Pfleger, K. (1994). Irrelevant features and the subset selection problem. In Proceedings of the International Conference on Machine Learning 1994, pages 121-129.

[Johnston, 1989] Johnston, T. (1989). Auslan Dictionary: a Dictionary of the Sign Language of the Australian Deaf Community. Deafness Resources Australia Ltd.

[Kadous, 1995] Kadous, M. W. (1995). GRASP: Recognition of Australian sign language using instrumented gloves. Master's thesis, School of Computer Science and Engineering, University of New South Wales.

[Keogh and Pazzani, 1998] Keogh, E. J. and Pazzani, M. J. (1998). An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback. In [Danyluk, 1998], pages 44-51.

[Manganaris, 1997] Manganaris, S. (1997). Supervised Classification with Temporal Data. PhD thesis, Computer Science Department, School of Engineering, Vanderbilt University.

[Mannila et al., 1995] Mannila, H., Toivonen, H., and Verkamo, A. I. (1995). Discovering frequent episodes in sequences. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), pages 210-215.

[Oates et al., 1998] Oates, T., Jensen, D., and Cohen, P. R. (1998). Discovering rules for clustering and predicting asynchronous events. In [Danyluk, 1998], pages 73-79.

[Paliouras, 1997] Paliouras, G. (1997). Refinement of Temporal Constraints in an Event Recognition System using Small Datasets. PhD thesis, University of Manchester.

[Quinlan, 1993] Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.

[Quinlan, 1998] Quinlan, J. R. (1998). Personal discussion.

[Rabiner and Juang, 1986] Rabiner, L. R. and Juang, B. H. (1986). An introduction to hidden Markov models. IEEE Magazine on Acoustics, Speech and Signal Processing, 3(1):4-16.

[Rosenstein and Cohen, 1998] Rosenstein, M. T. and Cohen, P. R. (1998). Concepts from time series. In AAAI '98: Fifteenth National Conference on Artificial Intelligence, pages 739-745. AAAI Press.

[Saito, 1994] Saito, N. (1994). Local feature extraction and its application using a library of bases. PhD thesis, Yale University.

[Shahar, 1997] Shahar, Y. (1997). A framework for knowledge-based temporal abstraction. Artificial Intelligence, 90(1-2):79-133.

[Shahar and Musen, 1995] Shahar, Y. and Musen, M. A. (1995). Knowledge-based temporal abstraction in clinical domains. Technical report, Stanford University.

[Zweig and Russell, 1998] Zweig, G. and Russell, S. (1998). Speech recognition with dynamic Bayesian networks. In Fifteenth National Conference on Artificial Intelligence (AAAI '98), pages 173-180. AAAI Press.