Evaluation of Automatically Assigned MeSH Terms for Retrieval of Medical Images

Miguel E. Ruiz¹ and Aurélie Névéol²

¹ University of North Texas, School of Library and Information Sciences, P.O. Box 311068, Denton, Texas 76203-1068, USA
² National Library of Medicine, Bldg. 38A, B1N-28A, 8600 Rockville Pike, Bethesda, MD 20894, USA

Abstract. This paper presents the results of the State University of New York at Buffalo (UB) team, in collaboration with the National Library of Medicine (NLM), in the 2007 ImageCLEFmed task. We used a system that combines visual features (using a CBIR system) and text retrieval. We used the Medical Text Indexer (MTI), developed by NLM, to automatically assign MeSH terms and UMLS concepts to the English free-text annotations of the images, and an equivalent system called MAIF to automatically assign MeSH terms and UMLS concepts to French free text. Our results indicate that the use of automatically assigned UMLS concepts improves retrieval performance significantly. We also identified specific aspects of the system that could be improved in the future, such as the method used to perform the automatic translation of medical terms and the addition of image classification to process queries targeted at a specific image modality.

1 Introduction

This paper presents the results of our participation in ImageCLEFmed 2007. In previous years we have used a method that maps the queries to Unified Medical Language System (UMLS) concepts and then uses these concepts to find translations of the English queries into French and German [1, 2]. This method has been successful in handling English queries to find the corresponding French and German translations. For this year's challenge, we focused on assessing 1) the use of an automatic indexing system providing Medical Subject Headings (MeSH terms) and UMLS concepts, and 2) the use of UMLS-based translation with French as the query language. The impact of both features on retrieval performance was analyzed.

2 System Description

The system that was used this year combines two publicly available systems:

– SMART: an information retrieval system developed by Gerald Salton and his collaborators at Cornell University [3]. SMART implements a generalized vector space model representation of documents and queries. This is an important feature, since we wanted to include three different representations of the image annotations: free text, MeSH terms, and UMLS concepts.
– Flexible Image Retrieval Engine (FIRE): an open-source content-based image retrieval system developed at RWTH Aachen University, Germany [4].

For processing the annotations we also used two automatic text categorization tools that map free text to MeSH terms. We used the Medical Text Indexer (MTI), a tool developed at the U.S. National Library of Medicine (NLM), to assign MeSH terms to the English annotations. For processing French text we used the Medical Automatic Indexer for French (MAIF), a tool similar to MTI that uses NLP as well as statistical methods to assign MeSH terms to free text. We did not have a tool to perform a similar mapping of the German text. We also decided to add the concept unique identifiers (CUIs) from the UMLS so that we could match queries and documents using these language-independent concepts. Since MeSH is one of the vocabularies of the UMLS, the assignment of UMLS concepts was performed by looking up the identifiers of the assigned MeSH terms in the UMLS, as sketched below.
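As an illustration, here is a minimal sketch of this lookup step, assuming the MeSH-to-CUI correspondence has already been extracted from the UMLS Metathesaurus into a dictionary; the table below is a tiny hypothetical stand-in for that resource.

```python
# Minimal sketch of mapping assigned MeSH headings to UMLS CUIs.
# This dictionary is a hypothetical stand-in for a lookup built from the
# UMLS Metathesaurus; the entries shown are illustrative only.
MESH_TO_CUI = {
    "Radiography": "C0034571",
    "Lung Neoplasms": "C0024121",
    "Magnetic Resonance Imaging": "C0024485",
}

def assign_cuis(mesh_terms):
    """Return language-independent CUIs for a list of MeSH headings,
    skipping headings with no entry in the lookup table."""
    return [MESH_TO_CUI[t] for t in mesh_terms if t in MESH_TO_CUI]

# The same CUIs result whether the headings came from MTI (English) or
# from MAIF (French), which is what enables cross-language matching.
print(assign_cuis(["Radiography", "Lung Neoplasms"]))  # ['C0034571', 'C0024121']
```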

3 Collection Preparation

As described in the ImageCLEFmed 2007 overview paper [5], the image collection used in this task consists of six sub-collections. Each collection has its own metadata in XML format for the image annotations. In order to process all collections uniformly we created a common XML schema and converted all the annotations to this new schema. Figure 1 shows the common metadata schema that was used.

Fig. 1. Common XML schema and Ctypes for indexing

English queries and documents were processed by parsing them with MTI to identify the MeSH concepts present in the free text and then adding the corresponding MeSH terms as well as the UMLS concepts. MTI uses NLP techniques (implemented in MetaMap) as well as a statistical k-nearest-neighbor (KNN) method that takes advantage of the entire MEDLINE collection [6]. MTI is currently used at NLM as a semi-automatic and fully automatic indexing tool. For this task, we used the top 25 recommendations provided by the system run with default filtering.

French queries and documents were processed using a modified version of MAIF, described in [7]. MAIF retrieves MeSH terms from biomedical text in French, specifically main headings and main heading/subheading pairs. However, for the purpose of the ImageCLEF task, we only used MAIF to retrieve MeSH main headings, which were then mapped to UMLS concepts. We used a collection of 15,000 French citations available from CISMeF (Catalogue and Index of Online Health Information in French, available at www.cismef.org) for retrieving the French MeSH terms used in MAIF. The modified version of MAIF is similar to MTI in that it combines an NLP method and a statistical, knowledge-based method [7]. However, the two systems differ in the specific implementation of both methods. The combination of these two approaches takes into account the relative score assigned to the terms by each method. The "relative score" of a term is obtained by dividing the score of the term by the sum of all the scores assigned by the corresponding method. Combining the methods in this way gives an advantage to terms retrieved by the NLP method: because the NLP approach tends to retrieve a smaller number of terms per document, the relative importance of each term tends to be higher than the relative importance of terms retrieved by the statistical method. The final term selection is performed using the breakage function described in [8]. The score assigned to a MeSH candidate represents its likelihood of being a good indexing term: the higher the score, the more likely it is that the corresponding MeSH term is a good indexing candidate. Given a list of indexing candidates and the scores that have been assigned to them, the breakage function is meant to detect a breach of continuity in the scores, thereby highlighting the point in the candidate list where terms become significantly less likely to be correct indexing terms. The final set of MeSH main headings assigned to a document consists of all the terms ranked above this threshold (see the sketch at the end of this section).

Once the collections were converted into the common XML schema, we used SMART to parse the XML documents and create three indexes (also called Ctypes in SMART). Ctype 0 was used for indexing the free text from the original annotations, Ctype 1 was used to index the MeSH terms automatically assigned by the medical text indexing tools (MTI for English text and MAIF for French text), and Ctype 2 was used to index the UMLS concepts identified by MTI or MAIF.
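As an illustration of the term-selection step described above, here is a minimal sketch under simplifying assumptions: each method's scores are turned into relative scores, the two candidate lists are merged by summing, and a simple largest-gap cutoff stands in for the breakage function of [8], whose exact form is not reproduced here.

```python
def relative_scores(scored_terms):
    """Divide each term's score by the sum of all scores assigned by the
    same method, so a method that returns fewer terms gives each of its
    terms a higher relative weight."""
    total = sum(scored_terms.values())
    return {term: s / total for term, s in scored_terms.items()}

def select_terms(nlp_terms, stat_terms):
    """Merge NLP and statistical candidates by summed relative score, then
    cut the ranked list at the largest drop between consecutive scores
    (a simple stand-in for the breakage function of [8])."""
    merged = {}
    for scored in (relative_scores(nlp_terms), relative_scores(stat_terms)):
        for term, s in scored.items():
            merged[term] = merged.get(term, 0.0) + s
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return [term for term, _ in ranked]
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = gaps.index(max(gaps)) + 1  # keep everything above the largest break
    return [term for term, _ in ranked[:cut]]

# Hypothetical candidate lists: the NLP method returns fewer terms, so each
# of its terms carries more relative weight.
nlp = {"Lung Neoplasms": 8.0, "Radiography": 6.0}
stat = {"Lung Neoplasms": 20.0, "Radiography": 5.0, "Thorax": 4.0, "Adult": 1.0}
print(select_terms(nlp, stat))  # ['Lung Neoplasms']
```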

4 Retrieval Model

We used a generalized vector space model that combines the vector representations of each of the four indexes presented in Figure 1. The final retrieval model can be represented by the following formula:

\[ \mathrm{score}(image) = \alpha \cdot \mathrm{Score}_{\mathrm{CBIR}} + \beta \cdot \mathrm{sim}_{\mathrm{text}}(d_i, q) \tag{1} \]

where α and β are coefficients that weight the contribution of each system and sim_text is defined as:

\[ \mathrm{sim}_{\mathrm{text}}(d_i, q) = \lambda \cdot \mathrm{sim}_{\mathrm{words}}(d_i, q) + \mu \cdot \mathrm{sim}_{\mathrm{MeSH\,terms}}(d_i, q) + \rho \cdot \mathrm{sim}_{\mathrm{UMLS\,concepts}}(d_i, q) \tag{2} \]

where λ, μ and ρ are coefficients that control the contribution of each of the Ctypes. The values of these coefficients were determined empirically by optimizing results on the 2006 topics. The similarity values are computed using cosine-normalized weights for the documents (atc) and augmented term frequency weights for the queries (atn). We also performed automatic retrieval feedback by retrieving 1,000 documents using the original query and assuming that the top n documents are relevant. This allowed us to select the top m terms ranked according to Rocchio's relevance feedback formula [9].
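To make the model concrete, here is a minimal sketch of this fusion and of the Rocchio-based blind feedback step. All coefficient values and data structures below are illustrative assumptions; the coefficient values tuned on the 2006 topics are not listed in the paper.

```python
# Sketch of equations (1) and (2) plus Rocchio-style blind feedback [9].
# Coefficient values are placeholders, not the empirically tuned ones.
ALPHA, BETA = 0.2, 0.8             # eq. (1): CBIR vs. text contribution
LAMBDA, MU, RHO = 0.5, 0.25, 0.25  # eq. (2): free text / MeSH / UMLS Ctypes

def sim_text(sim_words, sim_mesh, sim_umls):
    """Equation (2): weighted combination of the three Ctype similarities."""
    return LAMBDA * sim_words + MU * sim_mesh + RHO * sim_umls

def score_image(score_cbir, sim_words, sim_mesh, sim_umls):
    """Equation (1): fuse the CBIR score with the text similarity."""
    return ALPHA * score_cbir + BETA * sim_text(sim_words, sim_mesh, sim_umls)

def rocchio_expand(query_vec, top_docs, m, a=1.0, b=0.75):
    """Blind relevance feedback: assume the top-ranked documents are
    relevant and re-weight terms with Rocchio's formula,
    q' = a*q + (b/|R|) * sum(d in R), keeping the m best terms."""
    expanded = {t: a * w for t, w in query_vec.items()}
    for doc in top_docs:
        for term, w in doc.items():
            expanded[term] = expanded.get(term, 0.0) + b * w / len(top_docs)
    ranked = sorted(expanded.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:m])

# Hypothetical values for one image and a two-document feedback set.
print(score_image(score_cbir=0.02, sim_words=0.6, sim_mesh=0.4, sim_umls=0.5))
print(rocchio_expand({"hand": 1.0, "xray": 1.0},
                     [{"xray": 0.9, "bone": 0.7}, {"bone": 0.8, "hand": 0.5}],
                     m=3))
```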

5 Experimental Results and Analysis

We submitted 7 official runs, which are shown in Table 1. A total of 5 runs use queries in English and 2 runs use queries in French. Translations of the queries into the other two languages were automatically generated by expanding the query with all the UMLS terms associated with the concepts assigned by MTI or MAIF. From these runs we can see that the highest scores were obtained by runs that used the English queries and combined the text and image results, with Mean Average Precision (MAP) values of 0.2938 and 0.2930 (UB-NLM-UBTI 3 and UB-NLM-UBTI 1). These two runs perform well above the median run in ImageCLEFmed 2007 (median MAP = 0.1828) and rank 5th and 6th among all automatic mixed runs. Unfortunately, our multilingual runs perform significantly worse (MAP 0.254), indicating that our automatic translation approach decreases performance compared to using the English queries only. We suspect that the translations may add terms that change the focus of the query.

Tables 2a-d show a series of unofficial runs that allow comparison of the methods used in our system. Table 2a shows the performance obtained using free text (English only), automatically assigned UMLS concepts, and CBIR retrieval with FIRE. Our baselines for free text and UMLS concepts are quite strong, since both perform above the median system. The CBIR baseline is quite weak compared with the text and concept baselines, although it is around average among visual-only runs. A query-by-query analysis of the CBIR run shows that the MAP for 21 of the 30 queries is below 0.0001, which is a major factor in the poor performance. A likely explanation is that many queries request a specific image modality, and our CBIR system does not include an image classification module that could identify the modality and filter out images that do not match the one requested in the query.


Table 1. Performance of Official Runs

Run name              Description         Type       MAP     Exact-P  P10     P20
UB-NLM-UBTI 3         English queries     Mixed run  0.2938  0.2893   0.4167  0.3867
UB-NLM-UBTI 1         English queries     Mixed run  0.2930  0.2992   0.4000  0.3933
UB-NLM-UBmixedMulti2  English cross-lang  Mixed run  0.2537  0.2579   0.3167  0.3017
UB-NLM-UBTextBL1      English queries     Text only  0.2833  0.2833   0.4100  0.3817
UB-NLM-UBTextBL2      English cross-lang  Text only  0.2436  0.2461   0.3033  0.3017
UB-NLM-UBTextFR       French cross-lang   Text only  0.1414  0.1477   0.1933  0.1650
UB-NLM-UBmixedFR      French cross-lang   Mixed run  0.1364  0.1732   0.2000  0.1933

Table 2b shows the results obtained using only English queries. Because the collection has predominantly English annotations, these runs correspond to our highest scoring official runs (UBTI 1 and UBTI 3). All these runs use the free text as well as the UMLS concepts automatically assigned to both queries and documents. These results confirm that the use of automatically identified concepts improves performance considerably compared to using free text only. We can also see that the merging formula that combines visual and text features works properly despite the fact that the CBIR run contributes little to the overall MAP. Our two top scoring runs use text as well as image features. The best automatic run (MAP = 0.3018) was not submitted but is only marginally better than our highest official run.

Tables 2c and 2d show the performance of our cross-lingual runs. These runs use the automatic translations based on the UMLS concept mapping obtained from the English text. We can see that this actually harms performance significantly compared with using English-only queries. We believe this is due to the aggressive translation method we used, which seems to add terms that shift the focus of the query; we plan to explore this issue in more detail in our future research. Despite this, the results confirm that using UMLS concepts (which are language independent) improves performance with respect to using only free-text translations. The use of the results from the CBIR system yields only small improvements in retrieval performance.

Table 2d shows the results of our cross-lingual runs that use French as the query language. Our official French runs used the same parameters as the English runs, and this seems to have harmed the French results, since our unofficial runs show significantly better performance. These results are comparable to the best French cross-lingual results presented by other teams at the conference. However, the overall French cross-lingual results achieve only 56% of the English retrieval performance. This could be because the French resources we used (citation database and medical lexicon) are much smaller than the UMLS resources available for English.

Table 3 presents runs that use all the manually generated terms in English, French and German that were provided in the ImageCLEFmed topics. These queries achieve the highest score with our system, a MAP of 0.3148, which is comparable to the best manual run reported this year [5]. As in the previous experiments, the results with the manual queries show improvements when automatically generated UMLS concepts and pseudo-relevance feedback are used. Use of the CBIR results yields a small improvement.

Table 2. Unofficial Runs

(a) Baseline runs
Run name                      MAP     Exact-P  P10     P20
EN-free text only             0.2566  0.2724   0.4000  0.3433
UMLS concepts only            0.1841  0.2007   0.2655  0.2345
FIRE baseline (CBIR)          0.0096  0.0287   0.0300  0.0317

(b) English only runs
Run name                      MAP     Exact-P  P10     P20
EN-text-RF                    0.2966  0.2782   0.4033  0.3800
EN-text baseline + images     0.2965  0.3189   0.4067  0.3817
EN-text RF + images           0.3028  0.2908   0.4033  0.3800

(c) Automatic English cross-lingual runs
Run name                      MAP     Exact-P  P10     P20
EN-Multi-Baseline             0.2111  0.2283   0.2533  0.2467
EN-Multi + concepts           0.2789  0.2975   0.3400  0.3100
EN-Multi + concepts + images  0.2800  0.2981   0.3433  0.3117
EN-Multi-RF                   0.2789  0.2975   0.3400  0.3100

(d) Automatic French cross-lingual runs
Run name                      MAP     Exact-P  P10     P20
FR-Multi-Baseline             0.1500  0.1442   0.1456  0.1700
FR-Multi-Baseline + images    0.1550  0.1453   0.1466  0.1700
FR-Multi-RF                   0.1883  0.1618   0.1680  0.2133
FR-Multi-RF + images          0.1967  0.1707   0.1873  0.2167

Table 3. Manual runs

Run name                               MAP     Exact-P  P10     P20
Multi-manual text only                 0.2655  0.3082   0.3933  0.3467
Multi-manual text + concepts           0.3052  0.3127   0.4133  0.3933
Multi-manual text + concepts + images  0.3069  0.3148   0.4167  0.3933
Multi-manual RF                        0.3092  0.2940   0.4233  0.3983
Multi-manual RF + images               0.3148  0.3005   0.4200  0.3967

Table 4. Comparison of results (average MAP) by type of query

Type             Free text  UMLS concepts  CBIR     Combination
Visual           0.19737    0.11478        0.01159  0.22064
Visual-Semantic  0.11375    0.10560        0.01508  0.20118
Semantic         0.32208    0.32275        0.00209  0.45960

We performed a query-by-query analysis to try to understand how the different methods are affected by different types of queries. Table 4 shows the average MAP by groups of topics, according to whether they are visual, semantic, or mixed (visual-semantic). As expected, the text-based and UMLS-concept-based runs perform better on the semantic topics. The CBIR system performs slightly better on the visual and mixed topics, while its poorest performance is on the semantic topics. The combination shows consistent improvements in all three groups of topics.

6 Conclusions

From the results we can conclude that the use of UMLS concepts automatically assigned by MTI significantly improves performance for the retrieval of medical images with English annotations. We also confirm that our generalized vector space model works well for combining retrieval results from free text, UMLS concepts, and CBIR systems. Despite the low performance of our CBIR system, the merging method is robust enough to maintain or even improve results. We also conclude that our methods work better for semantic queries while still achieving relatively high performance for visual or mixed visual-semantic queries. Our cross-lingual results using French as the query language are relatively low and indicate that we need to improve our translation method based on UMLS mapping; we plan to explore this further in our future research. The low results from the CBIR system indicate that we need to address the image classification problem so that the CBIR results can make a more significant contribution to the overall fusion of results.

Acknowledgements

This work was supported in part by an appointment of A. Névéol and M. E. Ruiz to the NLM Research Participation Program. This program is administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the U.S. Department of Energy and the National Library of Medicine. We also want to thank Dr. Alan Aronson and the Indexing Initiative Project team at the NLM for their support and for making the MTI system available for this project.

References

[1] Ruiz, M.: Combining image features, case descriptions and UMLS concepts to improve retrieval of medical images. In: Proceedings of the AMIA Annual Symposium, Washington, DC, pp. 674–678 (2006)
[2] Ruiz, M.: UB at ImageCLEFmed 2006. In: Peters, C., Clough, P., Gonzalo, J., Jones, G., Kluck, M., Magnini, B. (eds.) CLEF 2006. LNCS, vol. 4730, pp. 702–705. Springer, Heidelberg (2007)
[3] Salton, G. (ed.): The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Englewood Cliffs (1983)
[4] Deselaers, T., Keysers, D., Ney, H.: Features for image retrieval: A quantitative comparison. In: Rasmussen, C.E., Bülthoff, H.H., Schölkopf, B., Giese, M.A. (eds.) DAGM 2004. LNCS, vol. 3175, pp. 228–236. Springer, Heidelberg (2004)
[5] Müller, H., Deselaers, T., Kim, E., Kalpathy-Cramer, J., Deserno, T.M., Hersh, W.: Overview of the ImageCLEF 2007 medical retrieval and annotation tasks. In: Peters, C., et al. (eds.) CLEF 2007. LNCS, vol. 5152. Springer, Heidelberg (2008)
[6] Aronson, A., Mork, J., Gay, C., Humphrey, S., Rogers, W.: The NLM Indexing Initiative's Medical Text Indexer. In: MEDINFO 2004, 11(Pt 1), pp. 268–272 (2004)
[7] Névéol, A., Mork, J., Aronson, A., Darmoni, S.: Evaluation of French and English MeSH indexing systems with a parallel corpus. In: Proceedings of the AMIA Annual Symposium, pp. 565–569 (2005)
[8] Névéol, A., Rogozan, A., Darmoni, S.: Automatic indexing of online health resources for a French quality controlled gateway. Information Processing and Management 42, 695–709 (2006)
[9] Rocchio, J.J.: Relevance feedback in information retrieval. In: Salton, G. (ed.) The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Englewood Cliffs (1971)