Spark-MCA: Large-scale, Exhaustive Formal Concept Analysis for Evaluating the Semantic Completeness of SNOMED CT
Wei Zhu1,2, Licong Cui3, PhD, Guo-Qiang Zhang1, PhD
1 Institute for Biomedical Informatics, University of Kentucky, Lexington, KY
2 Department of EECS, Case Western Reserve University, Cleveland, OH
3 Department of Computer Science, University of Kentucky, Lexington, KY
[email protected], [email protected], [email protected]
Wei Zhu
Spark-MCA
11/06/2017
1 / 16
Disclosure
The authors disclose that they have no relationships with commercial interests.
SNOMED CT serves as a knowledge source in many biomedical applications.
Its fast-paced evolution (SNOMED International releases a new version almost every 6 months) results in quality issues such as semantic incompleteness.
Example: the concept Structure of muscle acting on metatarsophalangeal joint (identifier 707861009) was not present in the 201403 release but was added in the 201603 release.
Challenge
SNOMED CT has over 300,000 concepts and 1,360,000 relations.
Jiang and Chute randomly selected 10% of the contexts from the sub-branches of the two largest domains of SNOMED CT to perform an FCA-based analysis. 1
Performing FCA on the entire SNOMED CT is computationally expensive.
1 Jiang G, Chute CG. Auditing the semantic completeness of SNOMED CT using formal concept analysis. Journal of the American Medical Informatics Association. 2009 Feb 28;16(1):89-102.
Methods: Spark-MCA
Multistage algorithm for constructing concept lattices (MCA) 2
Extend MCA by taking advantage of the Apache Spark framework:
Distributed MCA
Lattice construction
2 Faster concept analysis. In International Conference on Conceptual Structures 2007 Jul 22 (pp. 206-219). Springer Berlin Heidelberg.
Multistage algorithm for constructing concept lattices (MCA)
The input data for FCA is a binary relation between a set of objects and a set of attributes. This relation is represented as a formal context I = (X, Y, R), where X is a set of objects, Y is a set of attributes, and R ⊆ X × Y is the relation between them.
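To make the definition concrete, here is a minimal Python sketch of a formal context together with the two standard FCA derivation operators (objects to shared attributes, attributes to shared objects). The slide itself only defines the context; the operators are textbook FCA, and the toy context and the function names `intent`, `extent`, and `is_formal_concept` are illustrative assumptions, not part of Spark-MCA.

```python
def intent(objects, Y, R):
    """Attributes in Y shared by every object in `objects`."""
    return frozenset(y for y in Y if all((x, y) in R for x in objects))


def extent(attributes, X, R):
    """Objects in X that have every attribute in `attributes`."""
    return frozenset(x for x in X if all((x, y) in R for y in attributes))


def is_formal_concept(A, B, X, Y, R):
    """(A, B) is a formal concept when each set determines the other."""
    return intent(A, Y, R) == frozenset(B) and extent(B, X, R) == frozenset(A)


# Hypothetical toy context I = (X, Y, R); not taken from SNOMED CT.
X = {"a", "b", "c"}
Y = {1, 2, 3}
R = {("a", 1), ("a", 2), ("b", 2), ("b", 3), ("c", 2)}

print(sorted(intent({"a", "b"}, Y, R)))   # attributes shared by a and b
print(sorted(extent({2}, X, R)))          # objects having attribute 2
print(is_formal_concept({"a"}, {1, 2}, X, Y, R))
```

A pair such as ({a, b}, {2}) fails the check because object c also has attribute 2, so the extent of {2} is larger than {a, b}.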
Spark-MCA
Split tasks across computing nodes
Spark is suitable for iterative algorithms: intermediate results are cached in memory, avoiding disk I/O latency
Optimization: cache a hash table for fast subset retrieval
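The slide does not say how the cached hash table is organized, so the following is only one possible reading, written in plain Python rather than Spark. It sketches a hash-based index that deduplicates attribute sets and retrieves every stored set that is a subset of a query set. The class name `SubsetIndex` and its methods are hypothetical.

```python
from collections import defaultdict


class SubsetIndex:
    """Hypothetical in-memory index: attribute -> ids of stored sets containing it."""

    def __init__(self):
        self._sets = []                      # id -> frozenset of attributes
        self._ids = {}                       # frozenset -> id (deduplication)
        self._by_attr = defaultdict(set)     # attribute -> ids of sets containing it

    def add(self, attrs):
        """Store a set once and return its id."""
        key = frozenset(attrs)
        if key in self._ids:
            return self._ids[key]
        idx = len(self._sets)
        self._sets.append(key)
        self._ids[key] = idx
        for a in key:
            self._by_attr[a].add(idx)
        return idx

    def subsets_of(self, query):
        """Return all stored sets that are subsets of `query`."""
        hits = defaultdict(int)
        for a in frozenset(query):
            for idx in self._by_attr.get(a, ()):
                hits[idx] += 1
        # A stored set is a subset when every one of its attributes was hit.
        found = [self._sets[i] for i, n in hits.items() if n == len(self._sets[i])]
        empty_id = self._ids.get(frozenset())
        if empty_id is not None:             # the empty set is a subset of anything
            found.append(self._sets[empty_id])
        return found


index = SubsetIndex()
index.add({1, 2, 3})
index.add({4, 5, 6, 7})
index.add({1, 2, 3, 4, 5, 6, 7})
index.add({1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11})
print(index.subsets_of({1, 2, 3, 4, 5, 6, 7, 13}))
```

Only the posting lists of the query's attributes are scanned, so lookups touch the sets that share at least one attribute with the query instead of the whole collection.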
MCA with SNOMED CT
C1: Shunt of cerebral ventricle to extracranial site (69483009)
C2: Neck repair (119592007)
C3: Ventricular shunt to cervical subarachnoid space (25600005)
C4: Creation of ventriculo-jugular shunt (60501001)
C5: Creation of subarachnoid/subdural-jugular shunt (60510009)
Figure 1: Concept with attributes in SNOMED CT

Attribute  C1  C2  C3  C4  C5
1          ×   -   ×   ×   ×
2          ×   -   ×   ×   ×
3          ×   -   ×   ×   ×
4          -   ×   ×   ×   ×
5          -   ×   ×   ×   ×
6          -   ×   ×   ×   ×
7          -   ×   ×   ×   ×
8          -   -   ×   -   -
9          -   -   ×   -   -
10         -   -   ×   -   -
11         -   -   ×   -   -
12         -   -   -   ×   -
13         -   -   -   ×   ×
14         -   -   -   ×   -
15         -   -   -   -   ×
16         -   -   -   -   ×
Table 1: Formal context of five selected concepts

Figure 2: IS-A relations of concepts in Table 1.
MCA result
Stage    Concept  Attributes
Stage 0  C0       {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}
Stage 0  C1       {1,2,3}
Stage 0  C2       {4,5,6,7}
Stage 0  C3       {1,2,3,4,5,6,7,8,9,10,11}
Stage 0  C4       {1,2,3,4,5,6,7,12,13,14}
Stage 0  C5       {1,2,3,4,5,6,7,13,15,16}
Stage 1  C6       {1,2,3,4,5,6,7,13}
Stage 1  C7       {1,2,3,4,5,6,7}
Stage 2  C8       {}
Table 2: The newly generated formal concepts in each stage for the formal context in Table 1 using MCA.

Figure 3: IS-A relations of concepts in Table 2.
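The attribute sets in Table 2 can be reproduced by closing the Stage 0 sets under pairwise intersection. The sketch below is a simplified fixed-point iteration under that assumption, not the MCA algorithm itself: the way it groups new sets into rounds is not guaranteed to match the stage labels in Table 2, and it does not compute the subset (IS-A) relations. The function name `close_under_intersection` is hypothetical.

```python
def close_under_intersection(initial_sets):
    """Repeatedly intersect newly found sets with all known sets until nothing new appears."""
    known = {frozenset(s) for s in initial_sets}
    frontier = set(known)
    while frontier:
        new = set()
        for a in frontier:
            for b in known:
                c = a & b
                if c not in known:
                    new.add(c)
        known |= new
        frontier = new       # only sets found in this round can yield further new sets
    return known


# Stage 0 attribute sets C0..C5 as listed in Table 2.
stage0 = [
    set(range(1, 17)),                     # C0
    {1, 2, 3},                             # C1
    {4, 5, 6, 7},                          # C2
    set(range(1, 12)),                     # C3
    {1, 2, 3, 4, 5, 6, 7, 12, 13, 14},     # C4
    {1, 2, 3, 4, 5, 6, 7, 13, 15, 16},     # C5
]

closed = close_under_intersection(stage0)
for s in sorted(closed, key=lambda s: (len(s), sorted(s))):
    print(sorted(s))
```

Running it on the six Stage 0 sets adds exactly three further sets, which correspond to the attribute sets of C6, C7, and C8.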
Scalability
Figure 4: Scalability Experiment of Distributed MCA (x-axis: processors; y-axis: seconds)
Figure 5: Scalability Experiment of Lattice Construction (x-axis: processors; y-axis: seconds)
Results
Table 3: Comparison of Spark-MCA results and retrospective ground truthing with respect to main subhierarchies.
Subhierarchy (# of concepts)              FCA-New  Delta   Matched  Precision (%)  Recall 1 (%)  Recall 2 (%)
Body structure (30,623)                   11,311   743     353      3.12           47.5          68.1
Clinical finding (100,652)                153,401  10,121  1,825    1.19           18.0          43.0
Pharmaceutical biologic product (16,797)  7,696    2,099   109      1.41           5.2           30.4
Procedure (54,091)                        325,114  2,893   875      0.27           30.2          45.1
Situation with explicit context (3,723)   1,718    991     45       2.62           4.5           19.7
Specimen (1,475)                          1,283    200     24       1.87           12            25
Event (3,683)                             14       23      0        0              0             0
Staging and scales (1,308)                6        123     0        0              0             0
FCA-New: concepts found by Spark-MCA on fully defined concepts in the 201403 release
Delta: accumulated inserted concepts from the 201403 release to the 201609 release, including both fully defined and primitive concepts
Matched: concepts in both FCA-New and Delta
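As a check on how the percentage columns relate to the counts, the sketch below assumes Precision = Matched / FCA-New and Recall 1 = Matched / Delta. These formulas are an inference that reproduces the tabulated values; the slide does not state them. Recall 2 is not defined on the slide, so it is not computed here. The function name `metrics` is hypothetical.

```python
def metrics(fca_new, delta, matched):
    """Return (precision %, recall-1 %) under the assumed definitions."""
    precision = round(100 * matched / fca_new, 2) if fca_new else 0.0
    recall_1 = round(100 * matched / delta, 1) if delta else 0.0
    return precision, recall_1


# (FCA-New, Delta, Matched) counts copied from Table 3.
table3 = {
    "Body structure": (11311, 743, 353),
    "Clinical finding": (153401, 10121, 1825),
    "Pharmaceutical biologic product": (7696, 2099, 109),
    "Procedure": (325114, 2893, 875),
    "Situation with explicit context": (1718, 991, 45),
    "Specimen": (1283, 200, 24),
    "Event": (14, 23, 0),
    "Staging and scales": (6, 123, 0),
}

for name, counts in table3.items():
    precision, recall_1 = metrics(*counts)
    print(f"{name}: precision {precision}%, recall 1 {recall_1}%")
```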
Evaluation
Spark-MCA can be a practical method for evaluating semantic completeness.
Example: Entire muscle acting on lumbar intervertebral joint (714363004)
Spark-MCA derived a concept with the same attributes
Added in the 201603 release
Evaluation
Spark-MCA cannot find concepts added in a new release when they involve newly added attributes.
Example: Difficulty swimming (714997002)
Spark-MCA did not derive it
Added in the 201603 release, with one attribute not in the list of all 201403 attributes
New attribute: 363714003 Interprets (attribute) = 714992008 Ability to swim (observable entity)
Discussion
It is less useful to apply the FCA approach to terminological systems with very minimal semantic definitions (e.g., limited to one type of relationship).
The FCA-based approach is generally known to be sensitive to the density and complexity of the input formal context.
Conclusion
Spark-MCA is a scalable approach for exhaustively computing the formal concepts of a context as well as the associated subset relations.
Our results show that Spark-MCA provides a feasible cloud-computing approach for evaluating the semantic completeness of SNOMED CT.
Acknowledgment
This work was made possible by the University of Kentucky Center for Clinical and Translational Science (Clinical and Translational Science Award UL1TR001998). This work was also supported by the National Science Foundation under MRI award No. 1626364. We thank Amazon Web Services (AWS) for its Cloud Credits for Research program, which enabled the implementation of our distributed algorithms in Spark. I thank my advisor, Dr. Guo-Qiang Zhang, for his guidance.