Spark-MCA: Large-scale, Exhaustive Formal Concept Analysis for Evaluating the Semantic Completeness of SNOMED CT

Wei Zhu1,2, Licong Cui3, PhD, Guo-Qiang Zhang1, PhD

1 Institute for Biomedical Informatics, University of Kentucky, Lexington, KY
2 Department of EECS, Case Western Reserve University, Cleveland, OH
3 Department of Computer Science, University of Kentucky, Lexington, KY

Wei Zhu

Spark-MCA

11/06/2017

1 / 16

Disclosure

The authors disclose that they have no relationships with commercial interests.


SNOMED CT serves as a knowledge source in many biomedical applications.

Its fast-paced evolution (SNOMED International releases a new version almost every 6 months) results in quality issues such as semantic incompleteness.

Example: the concept Structure of muscle acting on metatarsophalangeal joint (identifier 707861009) was not present in the 201403 release but was added in the 201603 release.


Challenge

SNOMED CT has over 300,000 concepts and 1,360,000 relations.

Jiang and Chute randomly selected 10% of the contexts from the subbranches of the two largest domains of SNOMED CT to perform an FCA-based analysis [1].

Performing FCA on the entire SNOMED CT is computationally expensive.

[1] Jiang G, Chute CG. Auditing the semantic completeness of SNOMED CT using formal concept analysis. Journal of the American Medical Informatics Association. 2009 Feb 28;16(1):89-102.


Methods: Spark-MCA

Multistage algorithm for constructing concept lattices (MCA) [2]

Extend MCA by taking advantage of the Apache Spark framework:
  Distributed MCA
  Lattice construction

[2] Faster concept analysis. In: International Conference on Conceptual Structures 2007 Jul 22 (pp. 206-219). Springer Berlin Heidelberg.


Multistage algorithm for constructing concept lattices (MCA)

The input data for FCA is a binary relation between a set of objects and a set of attributes. This relation is represented as a formal context I = (X, Y, R), where X is a set of objects, Y is a set of attributes, and R ⊆ X × Y is a relation between them.
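The definition can be made concrete with a small sketch. The toy context and the helper names `intent_of`, `extent_of`, and `is_formal_concept` are illustrative assumptions and do not come from the slides. A pair (A, B) is a formal concept when B is exactly the set of attributes shared by all objects in A, and A is exactly the set of objects having all attributes in B.

```python
from typing import FrozenSet, Set, Tuple

# Hypothetical toy context I = (X, Y, R); not data from the slides.
X: Set[str] = {"x1", "x2", "x3"}
Y: Set[str] = {"a", "b", "c"}
R: Set[Tuple[str, str]] = {
    ("x1", "a"), ("x1", "b"),
    ("x2", "b"), ("x2", "c"),
    ("x3", "a"), ("x3", "b"), ("x3", "c"),
}


def intent_of(objects: Set[str]) -> FrozenSet[str]:
    """Attributes shared by every object in `objects` (all of Y if empty)."""
    return frozenset(y for y in Y if all((x, y) in R for x in objects))


def extent_of(attributes: Set[str]) -> FrozenSet[str]:
    """Objects that have every attribute in `attributes` (all of X if empty)."""
    return frozenset(x for x in X if all((x, y) in R for y in attributes))


def is_formal_concept(objects: Set[str], attributes: Set[str]) -> bool:
    """(A, B) is a formal concept iff intent(A) == B and extent(B) == A."""
    return (intent_of(objects) == frozenset(attributes)
            and extent_of(attributes) == frozenset(objects))


# Closing the attribute set {"a"}: take its extent, then that extent's intent.
closed_intent = intent_of(extent_of({"a"}))
```

In this toy context, every object that has attribute a also has b, so closing {a} adds b to the intent.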


Spark-MCA

Split tasks across computing nodes.

Spark is well suited to iterative algorithms: it caches intermediate results in memory, avoiding disk I/O latency between iterations.

Optimization: cache a hash table for fast subset retrieval.
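The following is a minimal local sketch of the two ideas on this slide, under stated assumptions. A Python thread pool stands in for Spark executors, and a plain hash set stands in for the cached hash table. The slides do not say how Spark-MCA partitions the work or what its hash table is keyed on, so the function names `intersect_chunk` and `one_stage`, the round-robin split, and the sample intents are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from typing import FrozenSet, List, Set, Tuple

Intent = FrozenSet[int]


def intersect_chunk(pairs: List[Tuple[Intent, Intent]]) -> Set[Intent]:
    """Work done by one worker: intersect every pair in its chunk."""
    return {a & b for a, b in pairs}


def one_stage(known: Set[Intent], n_workers: int = 4) -> Set[Intent]:
    """Return intents produced by pairwise intersection that are not yet known."""
    pairs = list(combinations(known, 2))
    # Round-robin split of the pair list into one chunk per worker (assumption).
    chunks = [pairs[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(intersect_chunk, chunks))
    candidates: Set[Intent] = set().union(*results)
    # `known` is a hash set, so each membership check is O(1) on average.
    return candidates - known


# Hypothetical intents, for illustration only.
known_intents = {frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({3, 4, 5})}
new_intents = one_stage(known_intents)
```

Repeating `one_stage` until it returns an empty set gives an iterative computation of the kind that benefits from keeping `known` in memory between rounds.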


MCA with SNOMED CT

C1  Shunt of cerebral ventricle to extracranial site (69483009)
C2  Neck repair (119592007)
C3  Ventricular shunt to cervical subarachnoid space (25600005)
C4  Creation of ventriculo-jugular shunt (60501001)
C5  Creation of subarachnoid/subdural-jugular shunt (60510009)

Figure 1: Concept with attributes in SNOMED CT

    1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
C1  ×  ×  ×  .  .  .  .  .  .  .  .  .  .  .  .  .
C2  .  .  .  ×  ×  ×  ×  .  .  .  .  .  .  .  .  .
C3  ×  ×  ×  ×  ×  ×  ×  ×  ×  ×  ×  .  .  .  .  .
C4  ×  ×  ×  ×  ×  ×  ×  .  .  .  .  ×  ×  ×  .  .
C5  ×  ×  ×  ×  ×  ×  ×  .  .  .  .  .  ×  .  ×  ×

Table 1: Formal context of five selected concepts

Figure 2: IS-A relations of concepts in Table 1.


MCA result

Stage 0
  C0  {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}
  C1  {1,2,3}
  C2  {4,5,6,7}
  C3  {1,2,3,4,5,6,7,8,9,10,11}
  C4  {1,2,3,4,5,6,7,12,13,14}
  C5  {1,2,3,4,5,6,7,13,15,16}
Stage 1
  C6  {1,2,3,4,5,6,7,13}
  C7  {1,2,3,4,5,6,7}
Stage 2
  C8  {}

Table 2: The newly generated formal concepts in each stage for the formal context in Table 1 using MCA.

Figure 3: IS-A relations of concepts in Table 2.
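The attribute sets in Table 2 can be checked with a naive staged fixpoint: start from the attribute sets of C0 to C5 and keep intersecting pairs until nothing new appears. This is a sketch for illustration, not the authors' MCA; the function names `close_under_intersection` and `covering_pairs` are assumptions.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Intent = FrozenSet[int]

# Attribute sets of C1..C5 as listed in Table 2.
rows: Dict[str, Intent] = {
    "C1": frozenset({1, 2, 3}),
    "C2": frozenset({4, 5, 6, 7}),
    "C3": frozenset(range(1, 12)),
    "C4": frozenset({1, 2, 3, 4, 5, 6, 7, 12, 13, 14}),
    "C5": frozenset({1, 2, 3, 4, 5, 6, 7, 13, 15, 16}),
}
all_attributes: Intent = frozenset(range(1, 17))  # C0 in Table 2


def close_under_intersection(seed: Set[Intent]) -> List[Set[Intent]]:
    """Naive staged fixpoint: each stage holds only intents not seen before."""
    known = set(seed)
    stages = [set(seed)]
    while True:
        new = {a & b for a, b in combinations(known, 2)} - known
        if not new:
            return stages
        stages.append(new)
        known |= new


def covering_pairs(intents: Set[Intent]) -> Set[Tuple[Intent, Intent]]:
    """Pairs (a, b) where a is a proper subset of b with no intent strictly between."""
    return {
        (a, b)
        for a in intents
        for b in intents
        if a < b and not any(a < c < b for c in intents)
    }


stages = close_under_intersection(set(rows.values()) | {all_attributes})
intents: Set[Intent] = set().union(*stages)
edges = covering_pairs(intents)
```

The final set contains the same nine attribute sets as Table 2, and `covering_pairs` yields the direct subset relations between them. In this naive version the empty attribute set already appears in the first round of intersections (C1 ∩ C2), whereas Table 2 lists C8 under Stage 2, so only the final set, not the stage assignment, should be compared with the table.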


Scalability

Figure 4: Scalability Experiment of Distributed MCA (x-axis: processors; y-axis: seconds)

Figure 5: Scalability Experiment of Lattice Construction (x-axis: processors; y-axis: seconds)


Results

Table 3: Comparison of Spark-MCA results and retrospective ground truthing with respect to main subhierarchies.

Subhierarchy (# of concepts)                FCA-New   Delta  Matched  Precision (%)  Recall 1 (%)  Recall 2 (%)
Body structure (30,623)                      11,311     743      353           3.12          47.5          68.1
Clinical finding (100,652)                  153,401  10,121    1,825           1.19          18.0          43.0
Pharmaceutical biologic product (16,797)      7,696   2,099      109           1.41           5.2          30.4
Procedure (54,091)                          325,114   2,893      875           0.27          30.2          45.1
Situation with explicit context (3,723)       1,718     991       45           2.62           4.5          19.7
Specimen (1,475)                              1,283     200       24           1.87            12            25
Event (3,683)                                    14      23        0              0             0             0
Staging and scales (1,308)                        6     123        0              0             0             0

FCA-New: concepts found by Spark-MCA on fully defined concepts in the 201403 release.
Delta: accumulated inserted concepts from the 201403 release to the 201609 release, including both fully defined and primitive concepts.
Matched: concepts in both FCA-New and Delta.
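The precision and Recall 1 columns can be recomputed from the counts. The formulas used here (precision = Matched / FCA-New, Recall 1 = Matched / Delta) are inferred from the numbers rather than stated on the slide, and they agree with the reported values to within rounding. The denominator of Recall 2 is not given in the slides, so it is not reproduced.

```python
from typing import List, Tuple

# (subhierarchy, FCA-New, Delta, Matched, reported precision %, reported recall 1 %)
Row = Tuple[str, int, int, int, float, float]
table3: List[Row] = [
    ("Body structure", 11311, 743, 353, 3.12, 47.5),
    ("Clinical finding", 153401, 10121, 1825, 1.19, 18.0),
    ("Pharmaceutical biologic product", 7696, 2099, 109, 1.41, 5.2),
    ("Procedure", 325114, 2893, 875, 0.27, 30.2),
    ("Situation with explicit context", 1718, 991, 45, 2.62, 4.5),
    ("Specimen", 1283, 200, 24, 1.87, 12.0),
    ("Event", 14, 23, 0, 0.0, 0.0),
    ("Staging and scales", 6, 123, 0, 0.0, 0.0),
]


def percent(part: int, whole: int) -> float:
    """Return part / whole as a percentage."""
    return 100.0 * part / whole


for name, fca_new, delta, matched, reported_p, reported_r1 in table3:
    precision = percent(matched, fca_new)  # assumed: Matched / FCA-New
    recall1 = percent(matched, delta)      # assumed: Matched / Delta
    print(f"{name}: precision {precision:.2f}% (reported {reported_p}), "
          f"recall 1 {recall1:.1f}% (reported {reported_r1})")
```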


Evaluation

Spark-MCA can be a practical method for evaluating semantic completeness.

Example: Entire muscle acting on lumbar intervertebral joint (714363004)
Spark-MCA derived a concept with the same attributes.
The concept was added in the 201603 release.


Evaluation

Spark-MCA cannot find concepts added in a new release whose definitions involve newly added attributes.

Example: Difficulty swimming (714997002)
Spark-MCA did not derive it.
It was added in the 201603 release, with one attribute not in the list of all 201403 attributes.
New attribute: 363714003 |Interprets (attribute)| = 714992008 |Ability to swim (observable entity)|
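A short check illustrates why such concepts are out of reach: intersecting existing attribute sets can only yield attributes that already exist. In the sketch below, only the Interprets = Ability to swim pair comes from the slide. The 201403 attribute universe, the other pair in the definition, and the function name `derivable_from` are hypothetical placeholders.

```python
from typing import Set, Tuple

Attribute = Tuple[int, int]  # (attribute type id, value id)


def derivable_from(universe: Set[Attribute], definition: Set[Attribute]) -> bool:
    """Necessary condition for FCA to derive `definition`: it uses only known attributes."""
    return definition <= universe


# Hypothetical stand-in for the 201403 attribute list (ids are placeholders).
universe_201403: Set[Attribute] = {(1, 10), (1, 11), (2, 20)}

# The pair named on the slide: Interprets (363714003) = Ability to swim (714992008).
new_attribute: Attribute = (363714003, 714992008)

# Placeholder definition: one known pair plus the newly added attribute.
difficulty_swimming: Set[Attribute] = {(1, 10), new_attribute}

can_derive = derivable_from(universe_201403, difficulty_swimming)
```

The condition is necessary but not sufficient: a definition built only from known attributes is still derived only if it arises as an intersection of existing attribute sets.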


Discussion

The FCA approach is less useful for terminological systems with very minimal semantic definitions (e.g., limited to one type of relationship).

FCA-based approaches are generally known to be sensitive to the density and complexity of the input formal context.


Conclusion

Spark-MCA is a scalable approach for exhaustively computing the formal concepts of a context together with their associated subset relations.

Our results show that Spark-MCA makes evaluating the semantic completeness of SNOMED CT feasible on cloud computing infrastructure.


Acknowledgment

This work was made possible by the University of Kentucky Center for Clinical and Translational Science (Clinical and Translational Science Award UL1TR001998).

This work was also supported by the National Science Foundation under MRI award No. 1626364.

We thank Amazon Web Services (AWS) for its Cloud Credits for Research program, which enabled the implementation of our distributed algorithms in Spark.

I thank my advisor, Dr. Guo-Qiang Zhang, for his guidance.
