Enhancing the Business Analysis Function with Semantics

Sean O’Riain1 and Peter Spyns2

1 Semantic Infrastructure Research Group, European Software Centre, Hewlett-Packard, Ballybrit Business Park, Galway, Ireland
[email protected]
http://h40055.www4.hp.com/galway/
2 Vrije Universiteit Brussel, STAR Lab, Pleinlaan 2, Gebouw G-10, B-1050 Brussel, Belgium
[email protected]
http://www.starlab.vub.ac.be
Abstract. This paper outlines a prototypical work bench which offers semantically enhanced analytical capabilities to the business analyst. The business case for such an environment is outlined, and user scenario development is used to illustrate system requirements. Based upon ideas from meta-discourse, and exploiting advances within the fields of ontology engineering, annotation, natural language processing and personal knowledge management, the Analyst Work Bench offers the automated identification of business discourse items with possible propositional content, and of the relationships between them. The semantically annotated results are presented visually, allowing personalised report path traversal marked up against the original source.
1 Introduction

1.1 Background

Business analysis is largely performed as a Business Intelligence1 (BI) activity, with data mining and data warehousing acting as the driving force in the monitoring, identification and gathering of information on topics of interest. On-line analytical processing performed on historical data allows report generation and data views, from which further BI analysis is typically performed. Data not formally mapped as part of the extract, transform and load phases passes through the process unaltered. Current efforts to mine this unstructured data rely heavily upon problematic document-level technologies such as string-based searching, resulting in data being overlooked and its information value being lost. Enterprises performing customer analysis as a means to identify new business opportunities must by necessity work their way through large volumes of free text to identify information of interest and check whether there are other informational items relevant to them. Current BI technologies present
1 Term coined by Gartner in the late 1980s that refers to the user-centred process of data gathering and analysis for the purpose of developing insights and understanding, leading to improved and informed decision making.
R. Meersman, Z. Tari et al. (Eds.): OTM 2006, LNCS 4275, pp. 818 – 835, 2006. © Springer-Verlag Berlin Heidelberg 2006
limitations in facilitating this identification and association activity when processing information. Introducing semantic technology offers the potential to address these limitations and contribute towards a more accurate and sophisticated analytical function for the analyst.

For informational analysis purposes the Form 10-Q2 may be considered as comprising financial accounts and statements from the CEO3. CEO statements concern a company’s performance, are seen as a promotional medium designed to present a corporate image, and are hugely important in building credibility and investor confidence. They also serve to present the quantitative aspects of the financial accounts. Despite the fact that analysts have a clear expectation of the information content that the statements may contain, it remains a huge task to search for, identify and filter actually relevant information, because the writing style is purposely rhetorical, argumentative and subjective. The writing style attempts to restrict readers from developing alternative interpretations of the information presented and draws upon meta-discourse (cf. section 3.1 below) to achieve these goals, i.e. the CEO in making financial commentary attempts to guide the reader/analyst into accepting and agreeing with the company position or view point. Despite this, an analyst engaged in the analytic process intuitively filters, refines and ultimately infers relevant information from what is being presented. Analysts are largely assisted by experience coupled with an inherent awareness and understanding of the meta-discourse employed, whether they are consciously aware of it or not.

With information identification as the context, this paper’s aim is to present the idea of a semantically enabled analyst’s work bench which would allow the identification of business discourse items and their relationships to other discourse items if present. The business case, technical requirements and design for such an application are presented.
Development work to date is outlined along with the experimental scenario and its evaluation framework.

1.2 Business Case

HP4 has for a number of years provided outsourced services to Independent Software Vendors (ISVs). Due to changing business practices the Business Process Outsourcing (BPO) business team was tasked with exploring the possibility of extending the current service offerings to different areas of the product development cycle where HP has considerable competencies. Which ISVs to pursue for business was determined on the basis of findings from company health checks5. Forms 10-Q, consisting of consolidated financial information and management statements, were identified as a major source for this type of information.6

2 Quarterly report filed to the Securities and Exchange Commission (SEC) in the US. It includes un-audited financial statements and provides a continuing view of the company's financial position. Investors and financial professionals rely upon these 10-Q forms when evaluating investment opportunities.
3 Chief Executive Officer, the highest ranking officer of a company, who oversees the company's finances and strategic planning.
4 European Software Centre, Hewlett-Packard Galway Limited.
5 Analysis of a company’s performance and its strategic plans.
6 Typically downloadable from a company’s web site.
An initial 70 candidates were selected and subjected to a lengthy analysis process, resulting in the identification of five, which were then approached for business discussion. The elapsed time period for this activity was nine months. The majority of it was taken up with the identification and extraction of information of potential interest hidden within the Form 10-Q sections containing the management’s discussion and analysis of financial conditions. Contributory factors impeding this manual intelligence gathering activity were the volume of available information, its growth rate, and the fact that much of the information is available in free text form only, for which no automated processing procedures have been applied until now. There is clearly a requirement for an automated intelligence monitoring solution that would make the identification and analysis of this information more manageable and facilitate its extraction for re-use within the specific domain area. This paper discusses how semantic technologies and natural language processing techniques, along with business specific input, are currently being applied to develop such a workbench.

The remainder of the paper is organized as follows. Section 2 presents the functional requirements in terms of user scenario development and introduces the case study. Section 3 introduces the linguistic analysis approach (subsections 3.1 & 3.2) and the ontology methodology (subsections 3.3 & 3.4) adhered to. Section 4 outlines the proposed solution, its high level components and grounding technologies, accompanied by a worked example covering all implementation steps from natural language identification to eventual semantic annotation. Section 5 discusses preliminary evaluation results. Section 6 presents related work areas, and finally Section 7 concludes the paper.
2 Requirements

Business requirements are presented in two steps. The first (cf. section 2.1) involves gathering high level requirements before moving to general usability and functionality considerations, of which only the former is outlined here. The second (cf. section 2.2) presents the results of translating the user requirements to application design. To facilitate understanding, detailed requirements and functionality are presented in terms of user scenario development. This approach was also used to present and refine the application design with the user community.

2.1 General

The business requirement, put succinctly, is for an application that contributes to the analytic function by reducing the time taken, and subsequently the resources necessary, to accurately process and evaluate Form 10-Qs for the purpose of performing company health checks. The two main areas identified where analytic processing resource savings could be achieved were:

1. Identification and association of information items: The ability to perform automated analysis on the free text areas of a Form 10-Q would assist analysts in the identification of information of importance to them.
Importance was defined on two levels. Firstly, as relevant information containing some element of propositional information7 (here termed informational items). Secondly, as defined relationships between these information items, allowing them to be considered collectively, which would additionally contribute towards the further identification of propositional content and hence the analytic function.

2. Personal Knowledge Management: Having the ability to view the level of relevant informational items and their relationship instantiations (termed semantic paths) would assist in providing the analyst with an overview of the document’s information content. Path traversal to information items of interest, along with the ability to view these paths in their entirety, accompanied by supporting functionality allowing for information extraction and/or exclusion, would assist an analyst in structuring, managing and personalising the analysis approach.

2.2 Scenario Development

Fig. 1 represents an application mock-up for the work bench based upon the business requirements of the analysts involved. It comprises three viewing areas:

− Navigator: A tree view hierarchical ontology navigator allowing report traversal and selection of terms, their roles8 and instances for viewing.
− Browser Display: Displays and allows dynamic traversal of the semantically annotated report with instance mark up.
− Relationship Viewer: A tree view hierarchical ontology navigator which displays the roles between terms along with the text associated with either the role or term instance.

An analyst performing a company health check will typically scan the introductory section to gain an insight into which sections of the report may contain useful information. The remainder of the report is then systematically worked through, using these identified areas to guide the analysis.
The major difficulty faced is the search for, identification of and filtering of actually relevant information, involving substantial document traversal through large volumes of free text. The viewing areas presented provide the functionality to perform these activities. In a prototypical case, an analyst having loaded the Form 10-Q will have the report automatically processed and annotated (i.e. enriched with meta-data related to the typical content of a Form 10-Q) based upon the argumentation categories and term/roles like those listed in section 4.1. Using the Navigator to traverse a report allows context specification (here ’Sales’) and provides a listing of all associated terms and roles. As illustrated in Fig. 1, further
7 This is information about thoughts, actors or state of affairs outside of the text [15]. Here interpreted as information that may contribute towards an understanding of the actual information content in the text.
8 DOM2 terminology, refer to section 3.4.
Fig. 1. AWB Main in-text Visualization Area
traversal provides indicators as to which information items, such as ‘Products’, have multiple instances within the report. The indicators also provide a summary of which of the anticipated partial semantic paths’ binary relationships are actually instantiated (here ‘Products’ has one instantiated relationship, ‘Announcements’; ‘Announcements’ in turn has two instantiated relationships, ‘Release’ and ‘Delay’). In effect the analyst has the capability to select and instantiate semantic paths9 through the document. Individual or multiple term or role annotation instances selected for viewing (here ‘Delay’) will be highlighted with a background colour in the report displayed within the Browser Display. Within this area the analyst can easily identify and select a particular annotation instance (here ’market delay’) to view its instantiated in-text binary relationships within the Relationship Viewer. The Relationship Viewer, taking the original context (’Sales’) as its root, presents the semantic traversal path between terms in the binary relationship (cf. section 4.1). Based upon the ontology structure used by the Navigator, it additionally includes the annotation instance information item as part of the tree view, allowing the analyst to build up a complete picture of in-context business informational items without the overhead of having to actually consider the context. The overall in-text viewing capability allows quick identification of informational items and their binary relationships, which when considered collectively (i.e. as the semantic path) offer the best opportunity to access their actual propositional content value. Once accessed, semantic path terms, binary relationships or specific information item instances can
9 Currently the functionality for partial semantic path recognition is under development.
be selectively excluded and filtered out from further consideration. Exclusion used in this manner provides the analyst with varying degrees of flexibility in managing the information overload as part of the analytic process.
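The path selection and exclusion behaviour described above can be illustrated with a small sketch. This is purely illustrative and not the AWB implementation (which the paper does not show as code); the link names echo the ‘Products’/‘Announcements’/‘Release’/‘Delay’ example from Fig. 1, and all function and role names here are invented:

```python
from dataclasses import dataclass

# Illustrative only: a semantic path modelled as a chain of binary links.
# Term names follow the Fig. 1 example; the role names and API are invented.
@dataclass(frozen=True)
class Link:
    source: str   # e.g. "Products"
    role: str     # e.g. "announced_in"
    target: str   # e.g. "Announcements"

def paths_from(root, links):
    """Enumerate the instantiated semantic paths starting at a root term."""
    paths = []
    def walk(node, acc):
        children = [l for l in links if l.source == node]
        if not children and acc:
            paths.append(acc)
        for link in children:
            walk(link.target, acc + [link])
    walk(root, [])
    return paths

def exclude(links, banned):
    """Drop links touching terms the analyst has filtered out."""
    return [l for l in links
            if l.source not in banned and l.target not in banned]

links = [Link("Products", "announced_in", "Announcements"),
         Link("Announcements", "publicises", "Release"),
         Link("Announcements", "publicises", "Delay")]
```

Excluding ‘Release’ here leaves the single path Products → Announcements → Delay, mirroring the selective filtering the requirements call for.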
3 Methods

The critical research question is whether it is possible to express what the analyst is looking for in terms of meta-discourse functions, and if so, to what extent this can be automated. Expressing analyst information needs therefore requires an understanding of how meta-discourse and its function intrude into the activity. It is the identification of this propositional information (but stripped of the rhetorical and subjective elements) that is the overall goal of the Analyst’s Work Bench (AWB). To achieve this aim, a multidisciplinary approach combining natural language processing and ontology engineering is proposed. The following sections describe the theories, methodologies and tools used in this experiment.

3.1 Linguistic Analysis

Hyland defines meta-discourse as a linguistic tool that creates a textual structure that goes beyond the statement of the subject matter and gives clues as to the purpose and attitude of the writer [18]. In effect meta-discourse consists of text tokens that do not contribute to the propositional development of the text but serve to guide the reader in interpretation of, and response to, the text [17]. In describing the functional categories of meta-discourse, Hyland incorporates Thompson’s classification to model meta-discourse as comprising the functional categories of interactive and interactional [17,24]. Interactive refers to the writer’s attempts at constraining the text to their preferred interpretation and goals. Resource usage in this category is concerned with how discourse is organized and the extent to which the text is structured with knowledge of the reader and their needs in mind. Interactive resources are used by the writer to organize propositional information in a manner that the reader is likely to find coherent and convincing, effectively managing the information flow. Interactional refers to the level of writer intrusion into the text by way of comment, opinion and evaluation [18].
Interactional resources are employed to anticipate, acknowledge, challenge or negate alternative interpretations being drawn, in effect restricting opportunities for alternative views in the first instance. Table 1 below details the meta-discourse categories and their functional descriptions. Hyland uses the proposition/meta-discourse distinction as a starting position for academic meta-discourse exploration, but others have included propositional content as part of meta-discourse [5,18]. To further blur the issue, it is often the case that meta-discourse and propositional elements occur within one sentence, and what is considered propositional in one context is meta-discourse in another.

Results from the study of meta-discourse within CEOs’ letters indicate that the functional devices of transitions and hedges (see Table 1) together account for 66% of all discourse items [17]. Within the area of business postgraduate studies, this figure
Table 1. Model of Meta-discourse in Academic Texts (adapted from [18])
Category | Function | Device lexicalisation

Interactive Resources
Transitions | Express semantic relation between main clauses | And / or / but / in addition / thus
Frame Markers | Draw attention to discourse goals or indicate topic or argument shifts | Finally / to conclude / my purpose here is to / I argue here / well now
Endophoric Markers | Refer to information in other parts of the text | See section X / noted above / see Fig. X / in section X
Evidentials | Refer to sources of information from other texts | According to X / (2005), X / X states
Code Glosses | Assist reader in interpretation of ideational material | Such as / for instance / in other words / namely / e.g.

Interactional Resources
Hedges | Withhold writer's full commitment to proposition | Possible / might / perhaps / about
Boosters / Emphatics | Emphasise force or writer's certainty in proposition | In fact / it is obvious / definitely / clearly / it is clear that
Attitude markers | Express writer's attitude to proposition | Unfortunately / I agree / agreement / surprise / surprisingly
Engagement markers | Explicitly refer to or build relationship with reader | Consider / note that / you can see
Self mentions | Explicit reference to author(s) | I / we / my / our
rises to 90% [5]. While we were unable to use the meta-discourse devices themselves as a means to assist in propositional content identification, we were able to use the idea of the transitions and hedges functional categories when constructing the ontology and subsequent grammar rules (expanded upon in section 3.3).

3.2 The GATE Linguistic Engineering Framework

The General Architecture for Text Engineering (GATE)10 is a component-based general architecture and graphical development environment for natural language engineering. The provided APIs allow selective inclusion of language, processing and visual resource components such as tokenisation, semantic tagging, verb phrase chunking and sentence splitting. GATE’s Java Annotation Patterns Engine (JAPE), providing finite state transducers over annotations, allows grammar rule specification and the recognition of regular expressions within these annotations. The annotations, organised as a graph, are modelled as Java sets of annotations, facilitating manipulation. The last decade has seen the acceptance of shallow analysis techniques involving pattern analysis and regular expressions. GATE provides flexibility
10 May be downloaded from http://gate.ac.uk. Further details on GATE may be found in [1].
allowing adaptation to activities such as these, and to follow-on activities such as ontology-based information extraction – e.g., for technology watch [20]. It is for these reasons that we selected GATE.

3.3 Ontology Modelling

Reports written in the language of business discourse ensure that analysts face the problem of first identifying information items of interest and then attempting to interpret their actual business message. In tackling identification within this context we adopted the idea of the meta-discourse hedges category and used it predominantly to assist the analyst in targeting sentence clauses that were making a contribution to actual information content. Complicating the problem was the fact that any single information item provides only part of the overall proposition being made and of the level of company commitment to it. Interpretation therefore had to be based upon considering relevant information together, requiring a level of semantic association between items that adhered to business logic. Consequently we drew upon the meta-discourse transitions category for semantic association and the hedges category for assisting the analyst in introducing business logic. Combining both approaches offered the possibility of drawing out, from the manner in which something is being said, what in fact is being said. It provides the analyst with the ability to actually determine what proposition is being made (cf. section 4.1 for the results of these influences). Adhering to this novel approach, domain experts were used to manually construct a taxonomy of business argumentation categories and their lexicalisations. As the domain of application requires deep background knowledge of the discourse used, the inclusion of large vocabularies, which would eventually require gazetteers, was purposely avoided and only domain specific terms and phrases were considered.
This helped to address a secondary problem introduced by the nature of discourse itself, namely that uniform use of synonyms was not possible or practicable in all circumstances, as synonyms can and do vary in their meaning. E.g. while the phrase ‘delays some_text introduction’ or ‘delays in market acceptance’ could be rephrased with the delay synonym “postpone”, neither could be rephrased with other “delay” synonyms such as “wait/stay/check” and still retain their intended business meaning.

Wishing to build upon database modelling and development experience as a means of bridging the knowledge gap between database and ontology design, we required an ontology development methodology that would facilitate the transition. Consequently the DOGMA Ontology Modelling Methodology (DOM2) [30,31], inspired by the principles of Object Role Modelling (ORM) [14]11 and aN Information Analysis Method (NIAM) [33], was selected. The ontology engineering process resulted in the manual construction of a high level domain ontology (see Table 3 and Table 4 for an example extract).

3.4 The DOGMA Ontology Engineering Framework

“Developing Ontology Guided Mediation for Agents” or DOGMA [23] has as its core the notion of double articulation, which decomposes an ontology into an ontology base (intuitive binary and plausible conceptualisations of a domain) and a separate
11 Familiarity with ORM and its use in Relational Database Design is assumed.
commitment layer, holding a set of instances of explicit ontological commitments (or domain rules) for an application [29,31].

The ontology base

An ontology base consists of intuitively plausible conceptualisations of a real world domain, i.e. specific binary fact types, called meta-lexons, formally noted as a triple ⟨concept1, role, concept2⟩. They are abstracted (see below) from lexons, written as sextuples ⟨γ, ζ, term1, role, co-role, term2⟩. Informally we say that a lexon is a fact that may hold for some domain, expressing that within the context γ and for the natural language ζ the term1 may plausibly have term2 occur in role with it (and inversely term2 maintains a co-role relation with term1) [30] (example in subsection 4.1). Lexons and meta-lexons are meant to reach a common and agreed understanding about the domain conceptualisation (important/relevant notions and how they are expressed in one or more natural languages [22]) and to assist human understanding.

Lexons are independent of specific applications and should cover relatively broad domains (linguistic level). They form a lexon base, which is constituted by lexons grouped by context and language [29]. Meta-lexons are language-neutral and context-independent (conceptual level). Natural language terms are associated, via the language and context combination, to a unique word sense represented by a concept label (e.g. the WordNet [9] identifier person#2). With each word sense, a gloss or explanatory information is associated that describes that notion. To account for synonymy, homonymy and translation equivalents there is an m:n relationship between natural language terms and word senses [30]. Going from the language level to the concept level corresponds with converting the lexons into meta-lexons.

The commitment layer

The layer of ontological commitments mediates between the ontology base and its applications. Each commitment is a consistent set of rules (or axioms) that add specific semantics to a selection of meta-lexons of the ontology base [19].
The commitment layer, with its formal constraints, is intended for interoperability between information systems, software agents and web services, as is currently promoted in the Semantic Web area. The constraints are mathematically founded and concern rather typical DB schema constraints, e.g. cardinality, optionality etc. While we note the purpose and function of the commitment idea and indeed use it (cf. section 4.1), our implementation does not require the formal definition of rules. The selection of application-relevant meta-lexons to form a commitment rule corresponds to the notion of a semantic path, but differs in that instantiating the particular commitment requires formal constraints; the semantic path does not. Simply stated, the domain model as represented in the ontology base can be richer than the actual content of the semantic paths.
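The double articulation just described can be paraphrased as two small data structures. The sketch below is ours, not DOGMA code; the field names follow the lexon sextuple and meta-lexon triple in the text, the concept labels are taken from Tables 2–4 on the assumption that their first rows align, and the word-sense dictionary is simplified to a 1:1 mapping (the text notes the real relationship is m:n):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Lexon:            # language level: <context, language, t1, r1, r2, t2>
    context: str        # gamma, e.g. "Sales"
    language: str       # zeta, e.g. "UK English"
    term1: str
    role: str
    co_role: str
    term2: str

@dataclass(frozen=True)
class MetaLexon:        # conceptual level: <concept1, role, concept2>
    concept1: str
    role: str
    concept2: str

def abstract(lexon, senses):
    """Move from the language level to the concept level by resolving each
    term and role, via its (context, language) combination, to a concept label."""
    resolve = lambda t: senses[(lexon.context, lexon.language, t)]
    return MetaLexon(resolve(lexon.term1), resolve(lexon.role),
                     resolve(lexon.term2))

# Assumed label assignments, read off Tables 2-4.
senses = {("Sales", "UK English", "Product"): "C1002",
          ("Sales", "UK English", "Follows"): "R1001",
          ("Sales", "UK English", "Announcement"): "C1005"}
meta = abstract(Lexon("Sales", "UK English", "Product", "Follows",
                      "Precedes", "Announcement"), senses)
```

Note how the role and co-role collapse into a single relationship label at the conceptual level, as the paper's implementation does in section 4.1.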
4 Proposed Solution

The workbench’s primary functions are firstly information identification and secondly information extraction. Fig. 2 provides a high level overview of the workbench conceptual architecture and its natural language processing components.
Fig. 2. The AWB High Level Component Modules
The work bench browser component is responsible for display and user related interaction, while the Natural Language Processing (NLP) component is responsible for NLP, source mark up and information extraction (IE). The NLP module tokenizes the report, stems the tokens and performs sentence splitting. JAPE grammar rules identify and categorize recognized patterns to concept categories. The source mark up module semantically marks up the patterns found in the source report and renders the annotated result to the AWB browser for display. The IE module, using the ontological concept structure, is responsible for template slot filling (cf. Table 5) and ontology population. The application’s data store can be a relational database, an RDF data store (e.g., [3,4]) or both.

4.1 Prototype Development

In conjunction with domain experts, (syntactic) analysis of information sources (e.g. the report text displayed in Fig. 1) was embarked upon to identify domain specific categories, terms and binary relationships. Applying the DOGMA philosophy as explained in section 3.4 to the analysis results, the ontology engineer manually constructed a series of lexons such as those presented in Table 2. The approach taken was that searching for a combination of these patterns would provide the best opportunity for propositional content identification. Table 2 represents the lexon set for the semantic path from ’Product’ to ‘Announcement’ to ‘Delays’. From an analyst’s view point the lexon extract (expressed as elementary notions) would be interpreted as: when constructing a picture of the sales area, references12 to products are of interest. To understand what is occurring in the ‘Product’ space, references to ‘Announcements’ are of interest. Lastly for ‘Announcements’, references to items that bring to attention ‘Developments’, ‘Releases’ or ‘Delays’ are important.
12 Term purposely used to simplify dealings with the business community; it refers to the tacit binary relationship.
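The NLP module's basic flow (tokenise, stem, sentence-split, pattern-match, annotate) can be sketched in a few lines. The sketch below is a toy stand-in and not GATE: the suffix stripper merely imitates what the Porter stemmer does, and the two-entry lexicon is invented for illustration, reusing the concept labels of section 4.1:

```python
import re

def sentences(text):
    """Naive sentence splitter (GATE ships a proper one)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokens(sentence):
    return re.findall(r"[a-z']+", sentence.lower())

def stem(token):
    """Crude suffix stripping, standing in for the Porter stemmer."""
    for suffix in ("ements", "ement", "ed", "ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def annotate(text, lexicon):
    """Tag each sentence with the concept labels whose stems it contains."""
    tagged = []
    for s in sentences(text):
        stems = {stem(t) for t in tokens(s)}
        labels = {label for stem_form, label in lexicon if stem_form in stems}
        tagged.append((s, labels))
    return tagged

# Invented two-entry lexicon: stem -> concept label.
lexicon = [("announc", "C1005"), ("delay", "C1014")]
```

Stemming is what lets a single lexicon entry such as "announc" cover "announce", "announcement" and "announcements", which is the role the (ANNOUNCE) macro plays in the real JAPE rules.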
Table 2. Sales Lexon Extract
Context (γ) = Sales, Language (ζ) = UK English

Head term (t1) | Role (r1) | Co-role (r2) | Tail term (t2)
Product | Follows | Precedes | Announcement
Product | Is_described_by | Describes | Announcement
Announcement | Publicises | Is_announced_in | Delay*
Announcement | Publicises | Is_announced_in | Release
Announcement | Publicises | Is_announced_in | Development
Next, synsets (as used in WordNet [9]) and natural language explanations and glosses were introduced, allowing the grounding of concepts and relationships (cf. Table 3) and the subsequent creation of meta-lexons. Conversion of lexons into meta-lexons copes with the fact that ideas can be expressed in several ways, e.g. by morphology (inflection or conjugation allowing different forms of the same word), by synonymic wording (lexicology/terminology) or by syntax, such as organising a sentence around a noun instead of a verb. As there was no requirement13 for multi-lingual capability, the principle of having ontologies transcend, in so far as possible, specific linguistic influences14 has not been strictly adhered to. Consequently, schema mapping issues arising from different language usage, where words in one language have no direct equivalent lexical translation in another and necessitate paraphrasing, were not of concern15, and this allowed us to combine the role and co-role into a single semantic relationship. The labels assigned to the concepts and relationships do not mimic any existing language expressions [25] – cf. Table 3. Due to issues with synonym introduction (see section 3.3) and the lack of a requirement to cater for multi-lingual aspects, the combination of language and context normally associated with term disambiguation was used only as a means of moving from the language to the concept level. The resulting meta-lexons, based upon the Sales Lexon Extract in Table 2, are shown in terms of their constituent concept labels (as established in Table 3) in Table 4, along with sample invoked grammar rules. Automated concept identification was implemented as a set of JAPE grammar rules using the GATE tool (see section 3.2). The grammar rules provided the implementation vehicle for the meta-lexons that correspond to the semantic path building blocks.
Referring to the row indicated by an asterisk in Table 2 and Table 4, Rule 1 provides the JAPE rule to recognise the concept “announcement” while Rule 2 detects the presence of ”delays in market acceptance”. The former rule uses the macro (ANNOUNCE), based upon stemming16 and synonym expansion, to identify the inflected word form for annotation. Once found, the right hand side of the rule
13 Form 10-Qs are a requirement for US companies only.
14 Due to the difficulty of such an exercise, some authors argue for language neutral representations rather than language independent ones [16].
15 See [8] for details on how multilinguality is handled within DOGMA.
16 GATE’s provided stemmer plug-in is based upon the Porter stemmer for English.
Table 3. Grounding of concepts (excerpt)

Context (γ) = Sales, Language (ζ) = UK English

Label | Explanation; gloss; synonyms etc.
C1002 | Item manufactured/made and sold to general public; Saleable item; item, goods
C1005 | Public statement made for public consumption; Scheduled event; Inform, proclaim, advise
C1014 | Delay in market indicating reduction in average order intake; Reduction in order intake; No synonyms
R1001 | Set of Program to expand; Intention; Postponement
R1003 | Cancel or reduce; Order; Timely
R1005 | Cycle; Resources; Market; Longer product

Table 4. Sales Meta-lexon
C1
R
C2
Rule: C1005
C1002
R1001
C1005
( (ANNOUNCE) )
C1002
R1002
C1005
C1005
R1003
C1014*
C1005
R1004
C1016
C1005
R1005
C1017
Rule 1
: C1005--> : C1005.C1005 Rule: C1014 ( (DELAY) (SPACE_WORD_SPACE)* ((MARKET) (ACCEPT) ) : C1014--> : 1014.C1014
Rule 2
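JAPE grammars run inside GATE, but their matching behaviour can be approximated outside it. The following Python sketch emulates Rules 1 and 2 with regular expressions; the macro expansions (word lists for ANNOUNCE, DELAY, MARKET, ACCEPT) are hypothetical stand-ins, not the actual grammar:

```python
import re

# Approximation of the JAPE macros (hypothetical word lists):
# ANNOUNCE covers stemmed/synonym forms of "announce";
# DELAY, MARKET, ACCEPT cover inflected forms of those stems.
ANNOUNCE = r"(?:announc\w*|proclaim\w*|inform\w*)"
DELAY = r"delay\w*"
MARKET = r"market\w*"
ACCEPT = r"accept\w*"
NOISE = r"(?:\s+\w+)*?"   # SPACE_WORD_SPACE*: intervening noise words

RULES = [
    ("C1005", re.compile(ANNOUNCE, re.I)),                 # Rule 1
    ("C1014", re.compile(                                  # Rule 2
        DELAY + NOISE + r"\s+" + MARKET + r"\s+" + ACCEPT, re.I)),
]

def annotate(text):
    """Return (concept_label, matched_span) pairs for a passage."""
    hits = []
    for label, pattern in RULES:
        for m in pattern.finditer(text):
            hits.append((label, m.group(0)))
    return hits

sample = "The company announced delays in market acceptance of new products."
for label, span in annotate(sample):
    print(label, "->", span)
```

Running this on the sample sentence tags "announced" with C1005 and the span "delays in market acceptance" with C1014, mirroring the behaviour described for the two rules below.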
(denoted by '-->') fires to annotate the pattern with the ontology concept label 'C1005'. The latter rule works in a similar manner, first recognising (DELAY), followed by multiple white spaces and noise words, before recognising occurrences of (MARKET) and (ACCEPT). The entire concept is then given the label 'C1014'. The JAPE rules cope with morphological variations. General synonym expansion using WordNet [9] produced "noise" in intermediary results due to issues with the applicability of general language knowledge. As a result, synonym expansion was introduced only with analyst selection and agreement17. Subsequently, a combination of scenario and relational templates is used to reflect abstract representations of the domain and present an instantiated part of the ontology. Table 5 provides such a reduced Information Template, used for ontology population based upon the meta-lexons introduced earlier in Table 4. Instantiated templates can be built per report source, from which specific IE actions and expert system reasoning (here business analysis) can be performed, e.g., checking
17 After this annotation stage, more complex JAPE rules operating on the concept level could be applied to add specific tags for meta-lexon instances indicating partial semantic paths. We opted, however, not to search for implicit relationships at this point, leaving them to the ontological association stage, where they are based upon both semantic and business logic.
Table 5. Information Template (Reduced)
Announcement
InstanceID          Auto generated
InfoItem            Further, the announcement of the release … and the actual release, of new Windows-based server operating systems or products incorporating similar .. could cause our existing ….
LinkedConcept1      C1005
LinkedRelationship  R1002
LinkedConcept2      C1014

Delay in market acceptance
InstanceID          Auto generated
InfoItem            If we are unable to keep pace with technological developments … hindered by: delays in our introduction of new products… delays in market acceptance of new products and services or new releases …
LinkedConcept1      C1014
LinkedRelationship  R1003
LinkedConcept2      C1005
of whether or not a company's products, already announced, suffer from a delay in being brought to market. Template slots essentially act as database records, lending themselves easily to SQL operations within an RDB context, or to RDF triples if using an RDF store (e.g., [3,4]18). Construction of closed semantic paths with specific meta-lexons in this manner ensures inherent semantics and a bounded view. In this regard, the templates themselves and the semantic path correspond closely to the DOGMA notion of a commitment, bridging the gap between the ontology base and application layers and representing a particular view of the ontology. Here the ontology concepts themselves, as defined by a semantic path, perform the constraint function, removing the necessity for formalised rules, as only a selection of relevant meta-lexons is needed.
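The record-like nature of the template slots makes flattening an instantiated template into triples straightforward. A minimal Python sketch, with slot names taken from Table 5 and hypothetical instance identifiers:

```python
# Minimal sketch: flatten an instantiated Information Template into
# (subject, predicate, object) triples, as for an RDF store.
# Slot names follow Table 5; the instance ID is hypothetical.

def template_to_triples(instance_id, slots):
    """Each filled slot becomes one triple with the instance as subject."""
    return [(instance_id, slot, value) for slot, value in slots.items()]

announcement = {
    "InfoItem": "Further, the announcement of the release ...",
    "LinkedConcept1": "C1005",
    "LinkedRelationship": "R1002",
    "LinkedConcept2": "C1014",
}

triples = template_to_triples("inst-001", announcement)
for t in triples:
    print(t)

# A bounded semantic-path check over the store: does any instance
# link concept c1 (as LinkedConcept1) to c2 (as LinkedConcept2)?
def has_path(triples, c1, c2):
    subjects_c1 = {s for s, p, o in triples
                   if p == "LinkedConcept1" and o == c1}
    return any(s in subjects_c1 and p == "LinkedConcept2" and o == c2
               for s, p, o in triples)

print(has_path(triples, "C1005", "C1014"))  # True for this instance
```

The `has_path` query illustrates the bounded view mentioned above: only the meta-lexons selected into templates are reachable, so no separate formalised constraint rules are needed.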
5 Preliminary Evaluation

The requirements identify the identification and association of information items, and Personal Knowledge Management, as the two areas where business resource savings are possible. As the application is still under development, the discussion will be limited to a preliminary (and reduced in scope) validation of the initial research hypothesis, i.e. that a combination of NLP, IE and ontology technology contributes to a tangible resource reduction for business analysts in gathering business intelligence from company reports. The purpose of the evaluation
18 May be downloaded from http://jena.sourceforge.net/index.html.
exercise was to gain an initial indication as to the tool's usefulness and whether the overhead associated with the initial setup was justifiable. Therefore, we involved only a single senior annotator expert. Thus, for reasons of methodological soundness, precision and recall have not been calculated as such. In the future, a thorough evaluation according to the method described in [11] will be conducted. In performing the evaluation (see Table 6 for results) we had a senior domain expert manually analyse 10-Q forms19, with the instruction to identify and annotate information items based upon partial semantic path recognition. The semantic path used was the full sales meta-lexon set, of which the extract in Table 4 is only an excerpt. From these, the analyst performed further expert validation to identify only those items of actual relevance. The same activity sequence was then performed on reports that were automatically annotated to identify partial semantic paths. The outcomes have been evaluated in a three-fold manner. The first phase involved the performance comparison of manual vs. automated annotation, while the second measured the overall impact of the tool on the business intelligence activities (manual vs. automated relevance). Finally, the time needed by a business analyst to perform his/her task with and without tool support has been compared (cf. Table 7). The tool-supported analysis outperformed the manual analysis by 34%, translating, in terms of actually relevant items (8 vs. 12), to an increase of 50%, with the number of additional informational items annotated by the analyst rising from 11 to 22. Further expert analysis of these additional relevant items indicates that they directly contributed to reinforcing the analysis hypothesis and, most significantly, in one instance brought to attention an item that led to new analysis thinking.

Table 6. Partial Semantic Path Identification results
                 Information Item No.
Concept    Manual               Automated
           Annotated  Relevant  Annotated  Relevant
C1002      2          2         8          4
C1005      1          1         2          1
C1011      0          0         0          0
C1012      0          0         2          0
C1013      0          0         0          0
C1014      4          3         4          4
C1015      2          0         2          1
C1016      2          2         4          2
C1017      0          0         0          0
Totals     11         8         22         12
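The totals row of Table 6 follows directly from the per-concept counts; a small Python sketch reproducing the aggregation (counts transcribed from Table 6):

```python
# Per-concept counts from Table 6:
# concept: (manual_annotated, manual_relevant, auto_annotated, auto_relevant)
counts = {
    "C1002": (2, 2, 8, 4),
    "C1005": (1, 1, 2, 1),
    "C1011": (0, 0, 0, 0),
    "C1012": (0, 0, 2, 0),
    "C1013": (0, 0, 0, 0),
    "C1014": (4, 3, 4, 4),
    "C1015": (2, 0, 2, 1),
    "C1016": (2, 2, 4, 2),
    "C1017": (0, 0, 0, 0),
}

# Column-wise sums reproduce the Totals row.
totals = tuple(sum(col) for col in zip(*counts.values()))
print(totals)  # (11, 8, 22, 12)
```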
Table 7 lists the timings taken for an analyst performing the report analysis activity mentioned previously, first manually and then after automated semantic annotation. Automation resulted in a 78% overall resource saving over the manual activities and completely removed the need for the introductory section analysis.
19 Based upon the average of a year's quarterly reports for one company.
Table 7. Resource Saving

                        Time (minutes)                  Resource
Report section   Manual   Automated   Difference     Saving (%)
Introduction     10       ignored     -10            100
Main body        80       20          -60            75
Totals           90       20          -70            78
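The savings in Table 7 are simple ratios of automated to manual time; a Python sketch reproducing the percentages (timings from Table 7):

```python
# Timings from Table 7, in minutes. The automated introduction analysis
# is no longer needed ("ignored" in the table), hence 0.
timings = {
    "Introduction": {"manual": 10, "automated": 0},
    "Main body":    {"manual": 80, "automated": 20},
}

def saving_pct(manual, automated):
    """Resource saving as a percentage of the manual effort."""
    return round(100 * (manual - automated) / manual)

for section, t in timings.items():
    print(section, saving_pct(t["manual"], t["automated"]))

total_manual = sum(t["manual"] for t in timings.values())    # 90
total_auto = sum(t["automated"] for t in timings.values())   # 20
print("Totals", saving_pct(total_manual, total_auto))  # 78
```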
6 Related Work

The application of IE techniques to SEC Form 10-Q business reports is principally similar to the MUC evaluation exercises (e.g. [15]). Recent efforts such as OntoText's KIM [26] have extended the classical (but limited) nature and number of entity recognition sets to include proper names, geographical names, business names, etc. These sets, however, remain insufficient for the nature and type of information that an analyst would wish to extract from the Form 10-Q. The medical field in particular saw earlier large scale research efforts to extract complex knowledge from documents (e.g., cf. [10,27]). Typical of these projects is that they draw upon "deep" linguistic and semantic analysis along with well developed models of knowledge representation and reasoning. The last decade has seen the emergence and adaptation of lightweight pattern analysis techniques based upon regular expressions, such as RegExTest20 applied to the identification of Nigerian fraud emails [13]. Shallow NLP analysis techniques, in many cases involving pattern analysis and regular expressions, such as the GATE system, have become mainstream [6]. A more recent step has been the combination of ontologies with the information templates used in IE21. The template fillers are instances of domain ontology concepts, a selection of which constitutes the building blocks of a template. In a further step, reasoning procedures can be added to the templates (turning the templates into frames in the tradition of Schank [28]). A rule-based IE application concerned with market monitoring and technology watch within the employment market place obtained promising results (precision 97%, recall 92%), albeit on a small test-bed of documents [20]. Moreale, using the same approach but within an e-learning context, constructed a service that presented the argumentation content of a student essay [24].
Typical of these prototypes is that, while the results were said to be favourable, in each case the conclusion was that the tool required further work for large scale deployment. These findings confirm earlier results, namely that IE is improved when based upon an ontology [14,20]. Having conducted literature reviews, we remain unaware of any application similar to the AWB in a business setting as described above, one that has as its basis a pressing business case and represents for HP a considerable resource saving opportunity and an intelligence gathering aid for the Business Process Re-Engineering Group.
20 http://sourceforge.net/projects/regextest/.
21 A good introduction on combining ontologies with IE can be found in [7].
7 Conclusion

The paper's main contribution is the outline of the AWB prototype, offering semantically enhanced analytical capabilities to the business analyst based upon the notion of semantic paths. Preliminary 'in development' findings have been presented, outlining promising results of greatly enhanced relevant information identification and association capabilities, resulting in both resource savings and new analysis insight. As efforts to date have been directed toward the partial semantic path, we now aim to expand upon this to highlight the semantic relationship by implementing the complete semantic path. We are confident that evaluation of the resultant fully functional prototype will reinforce these findings. The other key learning has been that a blanket adaptation of NLP for particular domains (here Business Process Outsourcing) will be problematic without tailored usage of synonym expansion, particularly if a complex understanding of discourse within the domain is required. With the idea of semantic paths providing encouraging results even at this early developmental stage, investigative research into functionality allowing restricted domain natural language querying and reasoning, based upon the ontology being expressed in OWL, is currently underway.
Acknowledgements

We gratefully thank John Collins, Business Development & Business Engineering Manager, HP Galway, for his evaluation effort, and David O'Sullivan, DERI Galway, along with Robert Meersman, VUB STAR Lab, for their comments on draft versions of this text.
References

1. Bontcheva K., Tablan V., Maynard D. & Cunningham H., (2004), Evolving GATE to meet new challenges in language engineering, Journal of Natural Language Engineering 10 (3/4): 349-373
2. Broekstra J., Kampman A. & van Harmelen F., (2002), SESAME: A generic architecture for storing and querying RDF and RDF Schema, in Horrocks I. & Hendler J. (eds.), Proc. of the First International Semantic Web Conf. (ISWC02), LNCS 2342, Springer, pp. 54-68
3. Cao T.-D., R. D.-K. & Fiès B., (2004), An Ontology-Guided Annotation System for Technology Monitoring, in Proc. of the IADIS International WWW/Internet 2004 Conference
4. Caroll J., Dickinson I., Dollin C., Reynolds D., Seaborne A. & Wilkinson K., (2004), Jena: implementing the semantic web recommendations, in Proc. of the 13th International World Wide Web Conference, Alternate track papers & posters
5. Crismore A. & Farnsworth R., (1990), Metadiscourse in popular and professional science discourse, in Nash W. (ed.), The Writing Scholar, Studies in Academic Discourse, Newbury Park: Sage, pp. 119-136
6. Cunningham H., Maynard D., Bontcheva K. & Tablan V., (2002), GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications, in Proc. of the 40th Anniversary Meeting of the Association for Computational Linguistics
7. Cunningham H., Bontcheva K. & Li Y., (2005), Knowledge Management and human language: crossing the chasm, Journal of Knowledge Management 9 (5): 108-131
834
S. O’Riain and P. Spyns
8. De Bo J., Spyns P. & Meersman R., (2003), Creating a "DOGMAtic" multilingual ontology infrastructure to support a semantic portal, in Meersman R., Tari Z. et al. (eds.), On the Move to Meaningful Internet Systems 2003: OTM 2003 Workshops, LNCS 2889, Springer Verlag, pp. 253-266
9. Fellbaum C. (ed.), (1998), WordNet, An Electronic Lexical Database, MIT Press
10. Friedman C., Hripcsak G., Alderson P., DuMouchel W., Johnson S. & Clayton P., (1995), Natural Language Processing in an operational clinical information system, Journal of Natural Language Engineering 1 (1): 83-103
11. Friedman C. & Hripcsak G., (1998), Evaluating Natural Language Processors in the Clinical Domain, Methods of Information in Medicine 37 (4/5): 334-344
12. Gao Y. & Zhao G., (2005), Knowledge-based Information Extraction: a case study of recognizing emails of Nigerian frauds, in Montoyo A., Munoz R. & Metais E. (eds.), Proceedings of the 10th International Conference on Applications of Natural Language to Information Systems (NLDB05), LNCS 3513, Springer, pp. 161-172
13. Guarino N., Masolo C. & Vetere G., (1999), OntoSeek: Content-Based Access to the Web, IEEE Intelligent Systems (4-5): 70-80
14. Halpin T., (2001), Information Modeling and Relational Databases: from conceptual analysis to logical design, Morgan Kaufmann, San Francisco
15. Hirschman L., (1998), Language Understanding Evaluations: Lessons learned from MUC and ATIS, in Rubio A., Gallardo N., Castro R. & Tejada A. (eds.), 1st International Conference on Language Resources and Evaluation (LREC 98), ELRA, pp. 117-122
16. Hovy E. & Nirenburg S., (1992), Approximating an interlingua in a principled way, in Proceedings of the DARPA Speech and Natural Language Workshop, http://www.isi.edu/natural-language/people/hovy/papers/92darpa-il.pdf
17. Hyland K., (1998), Exploring corporate rhetoric: metadiscourse in the CEO's letter, The Journal of Business Communication
18. Hyland K. & Tse P., (2004), Metadiscourse in Academic Writing: A Reappraisal, Applied Linguistics 25 (2): 156-177
19. Jarrar M. & Meersman R., (2002), Formal Ontology Engineering in the DOGMA Approach, in Meersman R., Tari Z. et al. (eds.), On the Move to Meaningful Internet Systems 2002: CoopIS, DOA, and ODBASE, LNCS 2519, Springer, pp. 1238-1254
20. Maynard D. et al., (2005), Ontology-based information extraction for market monitoring and technology watch, in ESWC Workshop "End User Aspects of the Semantic Web"
21. McGuinness D., (2004), Question Answering on the Semantic Web, IEEE Intelligent Systems Jan/Feb 2004: 82-85
22. Meersman R., (1999), The Use of Lexicons and Other Computer-Linguistic Tools, in Zhang Y., Rusinkiewicz M. & Kambayashi Y. (eds.), Semantics, Design and Cooperation of Database Systems, Proceedings of CODAS 99, Springer Verlag, pp. 1-14
23. Meersman R., (2001), Ontologies and Databases: More than a Fleeting Resemblance, in d'Atri A. & Missikoff M. (eds.), OES/SEO 2001 Rome Workshop, Luiss Publications
24. Moreale E. & Vargas-Vera M., (2004), Semantic Services in e-Learning: an Argumentation Case Study, International Forum of Educational Technology & Society 4 (7): 112-128
25. Nirenburg S. & Raskin V., (2001), Ontological Semantics, Formal Ontology, and Ambiguity, in Proceedings of the Second International Conference on Formal Ontology in Information Systems, ACM Press, pp. 151-161
26. Popov B., Kiryakov A., Kirilov A., Manov D., Ognyanoff D. & Goranov M., (2003), KIM: Semantic Annotation Platform, in Proceedings of the 2nd International Semantic Web Conference (ISWC 03), Springer Verlag, pp. 484-499
27. Sager N., Friedman C. & Lyman M., (1987), Medical Language Processing: computer management of narrative data, Addison Wesley
28. Schank R. & Abelson R., (1977), Scripts, Plans, Goals and Understanding, Lawrence Erlbaum Associates, Hillsdale N.J.
29. Spyns P., Meersman R. & Jarrar M., (2002), Data modelling versus Ontology engineering, in Sheth A. & Meersman R. (eds.), SIGMOD Record Special Issue 31 (4): 12-17
30. Spyns P., (2005), Adapting the Object Role Modelling method for Ontology Modelling, in Hacid M.-S., Murray N., Ras Z. & Tsumoto S. (eds.), Foundations of Intelligent Systems, Proceedings of the 15th International Symposium on Methodologies for Information Systems, LNAI 3488, Springer Verlag, pp. 276-284
31. Spyns P., (2005), Object Role Modelling for Ontology Engineering in the DOGMA framework, in Meersman R., Tari Z., Herrero P. et al. (eds.), Proceedings of the OTM 2005 Workshops, LNCS 3762, Springer Verlag, pp. 710-719
32. Thompson G., (2001), Interaction in academic writing: Learning to argue with the reader, Applied Linguistics 22 (1): 58-78
33. Verheyen G. & van Bekkum P., (1982), NIAM, aN Information Analysis Method, in Olle T., Sol H. & Verrijn-Stuart A. (eds.), Information Systems Design Methodologies: A Comparative Review, North-Holland/IFIP WG8.1, pp. 537-590