October 1, 2013 21:5 WSPC/INSTRUCTION 2013GIW˙JBCB˙VirusClassificationCopmarison˙Final
Journal of Bioinformatics and Computational Biology © Imperial College Press
COMPARING VIRUS CLASSIFICATION USING GENOMIC MATERIALS ACCORDING TO DIFFERENT TAXONOMIC LEVELS
JING-DOO WANG Department of Computer Science and Information Engineering, Asia University, No. 500, Lioufeng Rd. Wufeng, Taichung 41354, Taiwan.
[email protected]
In this paper, three genomic materials, namely DNA sequences, protein sequences and regions (domains), are used to compare methods of virus classification. Virus classes (categories) are divided by taxonomic level into three datasets of 6 orders, 42 families and 33 genera. To increase the robustness and comparability of the experimental results, only classes that contain at least 10 instances are selected, and each instance contains at least one region name. Experimental results show that the approach using region names achieved the best accuracies, reaching 99.9%, 97.3% and 99.0% for 6 orders, 42 families and 33 genera, respectively. This paper not only reports exhaustive experiments that compare virus classifications using different genomic materials, but also proposes a novel approach to biological classification based on molecular biology instead of traditional morphology.
Keywords: Virus Classification; Taxonomy; Genome Sequence; Protein Clustering; Region.
1. Introduction
Virus classification concerns the naming of viruses and their placement into a taxonomic system. The two main systems currently used for virus classification are the ICTV (International Committee on Taxonomy of Viruses) system [4] and the Baltimore classification system [9]. The former shares many features with the classification system of cellular organisms, such as taxon structure; the latter places viruses into one of seven groups depending on a combination of their type of nucleic acid (DNA or RNA), strandedness (single-stranded or double-stranded), sense, and method of replication [7]. Viruses are mainly classified by their phenotypes, such as morphology, type of nucleic acid, mode of replication, host organisms, and the type of disease they cause. Observing the phenotypes of viruses requires considerable effort on the part of biologists (or virologists). Moreover, inconsistencies among observations made at different laboratories or times may lead to disputes when attempts are made to verify or classify unknown viruses. Viruses are diverse and flexible,
and many viruses exist whose taxa are still unknown and are labeled "unclassified" in the ICTV. Therefore, a novel approach for classifying viruses automatically and precisely is sought. A growing number of complete genomes are available in the NCBI [2], enabling research based on genome-wide comparisons. For example, some studies have compared genomic signatures to analyze evolutionary relationships [15,28], to identify signature genes for taxonomic characterization [16,17], to classify sequences [10,18], and to elucidate viral phylogeny [33]. Studies that compare genome-wide sequences may need to address the challenge of making such comparisons without sequence alignment [21,27]. To take advantage of available classifiers from machine learning [8] or data mining [32], instances (species) must be transformed into representative vectors in the vector space model [13]. To achieve this vector transformation precisely using genomic materials, two important issues must be addressed. One is feature extraction, which identifies the characteristics (features) of one class (category) of viruses that distinguish it from another. The other is the design of a weighting method that specifies the relative importance of these features. Various studies of virus classification using genomic sequences have been published [29,30,34]. Yu et al. [34] proposed a natural vector approach that converted each virus into a 12-dimensional vector according to the quantity and global distribution of the nucleotides in its viral sequences, and then used the nearest neighbor method to classify 2044 single-segment viruses at the levels of Baltimore class, family, subfamily and genus. Their virus classification was computed quickly because it took into account topological information about the viruses in advance.
Wang [29] compared classifications of 35 virus families based on "DNA" (deoxyribonucleic acid) and "Protein" (amino acid) sequences. To make the experiments more robust and to extend them to different taxonomic levels, Wang [30] used 6 orders, 43 families and 33 genera to compare virus classifications. However, the experimental results conflicted with the original expectation that the approach based on protein sequences would be more accurate than that based on DNA sequences. In those studies [29,30], a group of protein sequences that were deemed to perform one biological function was found to combine with another group with a different function, making the functionality of the combined group ambiguous. To avoid this ambiguity, the "region" names (domains) within the annotations of proteins in NCBI are used as the features for virus classification in this paper. Section 2.1 introduces the preprocessing used to collect and extract these three genomic materials. In this study, experiments were performed to classify viruses in NCBI using existing taxonomic levels. The experimental resources comprise three datasets of 6 orders, 42 families and 33 genera. Each class (category) in a dataset contains at least 10 instances (species), each of which includes at least one region name. Experimental results show that the approach that was based on
Fig. 1. The Processes of Extracting Virus Genomic Materials.
"region" names achieved the best accuracies of 99.9%, 97.3% and 99.0% with the three datasets of 6 orders, 42 families and 33 genera, respectively. In summary, this paper provides a novel approach for analyzing taxonomy using genomic materials from molecular biology, instead of using phenotypes. The remainder of this paper is organized as follows. Section 2 describes the method of transforming virus instances into representative vectors for the three genomic materials. Section 3 presents the experimental results. Section 4 presents discussions and possible avenues for future work. Section 5 draws conclusions.

2. Method
This paper presents two main processes for classifying viruses using genomic materials in the vector space model [22]. One is to gather whole genomes of viruses and extract genomic materials. The other is to transform each virus instance into a representative vector using these genomic materials. Fig. 1 and Fig. 2 present these two processes. Section 2.1 and Section 2.2 describe the processes in detail.

Fig. 2. The Processes of Virus Classification via Genomic Materials.
Fig. 3. Genomic Materials extracted from "NC_006324.gbk".

2.1. Genomic Materials Extraction
As shown in Fig. 1, the compressed file "all.gbk.tar.gz" of virus genomes was first downloaded from the NCBI FTP site [2], and then the genomic materials, including virus taxonomy, DNA sequences, protein sequences and the proteins' "GI" numbers, were extracted from the GenBank flat file format files contained in "all.gbk.tar.gz". For example, as shown in Fig. 3, the genomic materials of the virus "Bovine adenovirus A" were extracted from the file "NC_006324". Fig. 3 presents the family and genus of the virus as "Adenoviridae" and "Mastadenovirus", respectively. The bottom of the figure displays the protein annotated with "CDS", its sequence, labeled with the tag "/translation=", and the DNA sequence. As presented in Fig. 4, the region name "Adeno_E1A", for example, was extracted from the annotation of "YP_094027", which was downloaded automatically by a web agent [11] by querying with the number "GI:52801680" via the Entrez Programming Utilities (E-utilities) [1].
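The extraction step described in Section 2.1 can be sketched roughly as follows. This is a minimal illustration and not the author's implementation (the paper reports using a web agent [11]). The function names and the toy records are hypothetical, the protein sequence in the toy record is a made-up placeholder, and the qualifier names `/db_xref="GI:..."` and `/region_name="..."` are assumed from the usual GenBank/GenPept flat-file layout. The EFetch URL is only built, never requested.

```python
import re
from urllib.parse import urlencode

# Toy excerpts for illustration only: the layout mimics GenBank/GenPept flat
# files, but the sequence below is a made-up placeholder, not the real protein.
TOY_GENBANK = "\n".join([
    "SOURCE      Bovine adenovirus A",
    "  ORGANISM  Bovine adenovirus A",
    "            Viruses; dsDNA viruses, no RNA stage; Adenoviridae;",
    "            Mastadenovirus.",
    "FEATURES             Location/Qualifiers",
    "     CDS             1..30",
    '                     /db_xref="GI:52801680"',
    '                     /translation="MKHLRL',
    '                     LPST"',
    "//",
]) + "\n"

TOY_GENPEPT = "\n".join([
    "     Region          1..10",
    '                     /region_name="Adeno_E1A"',
]) + "\n"


def parse_lineage(record: str) -> list[str]:
    """Return the taxonomic lineage listed under the ORGANISM line."""
    match = re.search(r"^  ORGANISM  .*\n((?: {4,}\S.*\n)+)", record, re.MULTILINE)
    if match is None:
        return []
    joined = " ".join(line.strip() for line in match.group(1).splitlines())
    return [taxon.strip() for taxon in joined.rstrip(".").split(";") if taxon.strip()]


def parse_translations(record: str) -> list[str]:
    """Return protein sequences from /translation= qualifiers, whitespace removed."""
    return [re.sub(r"\s+", "", seq)
            for seq in re.findall(r'/translation="([^"]+)"', record)]


def parse_gi_numbers(record: str) -> list[str]:
    """Return GI numbers found in /db_xref qualifiers (assumed qualifier layout)."""
    return re.findall(r'/db_xref="GI:(\d+)"', record)


def parse_region_names(genpept: str) -> list[str]:
    """Return region (domain) names from /region_name= qualifiers."""
    return re.findall(r'/region_name="([^"]+)"', genpept)


def efetch_url(gi: str) -> str:
    """Build (but do not send) an E-utilities EFetch request for one protein record."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    params = {"db": "protein", "id": gi, "rettype": "gp", "retmode": "text"}
    return base + "?" + urlencode(params)


lineage = parse_lineage(TOY_GENBANK)
print(lineage[-2], lineage[-1])          # family and genus of the toy record
print(parse_translations(TOY_GENBANK))   # placeholder protein sequence
print(parse_region_names(TOY_GENPEPT))   # region names of the toy protein
print(efetch_url(parse_gi_numbers(TOY_GENBANK)[0]))
```

A production pipeline would more likely rely on a dedicated flat-file parser; the regular expressions here only illustrate which fields feed the three genomic materials.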
Fig. 4. Region name "Adeno_E1A" extracted from "YP_094027" (GI:52801680).
Table 1. The outline of processing vector transformation

                     Vector Transformation
Genomic Materials    Feature Extraction     Feature Weighting    Vector Dimension (m)
DNA sequences        k-mers                 tf*(1/Entropy)       4^k
Protein sequences    sequence clustering    tf*idf               # of clusters
Regions (Domains)    region name            tf*idf               # of region names
2.2. Vector Transformation for Instances
The transformation of instances into representative vectors raises practical issues such as feature extraction and weighting [14]. Table 1 gives an overview of the approaches to vector transformation based on the three genomic materials. As shown in Fig. 2, after the three virus genomic materials (DNA sequences, protein sequences and region names) are extracted, virus instances must be transformed into representative vectors using proper weighting methods such that each vector represents its original instance precisely. After vector transformation, as shown in Fig. 2, LIBSVM [12] was used to perform virus classification. In the following, Section 2.2.2, Section 2.2.3 and Section 2.2.4 describe the vector transformations of the three genomic materials. Notably, the methods for transforming DNA sequences and protein sequences into vectors were adopted from previous works [29,30].
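Once an instance has been turned into a weighted vector, it has to be written in the sparse text format that the LIBSVM tools read, one instance per line as `<label> <index>:<value> ...` with 1-based feature indices in ascending order. The helper below is a minimal sketch of that serialization. The function name is hypothetical, and the paper does not describe how its input files were produced.

```python
def to_libsvm_line(label: int, vector: list[float]) -> str:
    """Format one instance in LIBSVM sparse format.

    Indices are 1-based and zero-valued features are omitted, which keeps
    files small when vectors are sparse (e.g. thousands of region names).
    """
    features = [f"{d}:{value:g}" for d, value in enumerate(vector, start=1) if value != 0.0]
    return " ".join([str(label)] + features)


# Hypothetical example: class label 3 and a 4-dimensional vector.
print(to_libsvm_line(3, [0.0, 1.5, 0.0, 2.0]))
```

Omitting zeros matters mostly for the protein and region representations, whose dimensionality runs into the thousands while each virus touches only a few features.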
2.2.1. Notations
Let $\{C_1, C_2, \ldots, C_c\}$ be an actual partition of a data set $X$:

$$X = \left\{ \begin{array}{l} x_{1,1}, x_{1,2}, \ldots, x_{1,n_1}, \\ x_{2,1}, x_{2,2}, \ldots, x_{2,n_2}, \\ \ldots, \\ x_{c,1}, x_{c,2}, \ldots, x_{c,n_c} \end{array} \right\}. \qquad (1)$$

where $x_{i,l} \in \mathbb{R}^m$ is the $l$th instance of the class $C_i$, $i = 1, 2, \ldots, c$; $l = 1, 2, \ldots, n_i$; $N = \sum_{i=1}^{c} n_i$; $x_{i,1}, x_{i,2}, \ldots, x_{i,n_i} \in C_i$; $\mathbb{R}$ denotes the real numbers; $m$ is the number of dimensions in the vector model, and $c$ is the number of classes.

2.2.2. DNA Sequences vs. k-mer Approach
The k-mer approach is a well-known method for transforming sequences (strings) into vectors [20]. Let $P_d$ be the $d$th pattern of k-mers. Let the Pattern Frequencies $PF(P_d, C_i)$ and $PF(P_d, x_{i,l})$ be the number of occurrences of the pattern $P_d$ in the class $C_i$ and in the instance $x_{i,l}$, respectively. Let

$$Prob(P_d, C_i) = \frac{PF(P_d, C_i)}{\sum_{j=1}^{c} PF(P_d, C_j)}$$

be the probability that $P_d$ is in class $C_i$. The Shannon entropy [24] $Entropy(P_d)$ of pattern $P_d$ across the $c$ classes is given by Eq. 2.

$$Entropy(P_d) = -\sum_{i=1}^{c} Prob(P_d, C_i) * \log(Prob(P_d, C_i)). \qquad (2)$$

Given a value $k$ for the k-mer transformation of DNA sequences whose alphabet contains 4 symbols, "A", "C", "G" and "T", one instance $x_{i,l}$ was transformed herein into a $4^k$-dimensional vector as in Eq. 3.

$$\langle x_{i,l} \rangle = \langle x^{1}_{i,l}, x^{2}_{i,l}, \ldots, x^{d}_{i,l}, \ldots, x^{4^k}_{i,l} \rangle, \qquad (3)$$

where $x^{d}_{i,l} = PF(P_d, x_{i,l}) * \frac{1}{Entropy(P_d)}$, $1 \le d \le m = 4^k$. Notably, the well-known weighting method tf*idf [22] cannot be applied because when k is small, such as k = 5, the k-mers might appear in all of the sequences, possibly causing the idf values of all k-mer patterns to be the same.
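The k-mer weighting of Eq. 2 and Eq. 3 can be sketched as below. This is an illustrative reading rather than the author's code. Overlapping k-mer counting, the natural logarithm, and the small constant `eps` that guards against division by zero (a k-mer seen in only one class has zero entropy) are all assumptions that the text does not specify, and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def kmer_counts(seq: str, k: int) -> Counter:
    """PF(P_d, x): overlapping k-mer counts, skipping windows with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if set(window) <= set("ACGT"):
            counts[window] += 1
    return counts


def pattern_entropy(pattern: str, class_counts: dict[str, Counter]) -> float:
    """Entropy(P_d) of Eq. 2, computed from per-class pattern frequencies."""
    total = sum(counts[pattern] for counts in class_counts.values())
    if total == 0:
        return 0.0
    probs = [counts[pattern] / total
             for counts in class_counts.values() if counts[pattern] > 0]
    return -sum(p * math.log(p) for p in probs)


def kmer_vector(seq: str, k: int, class_counts: dict[str, Counter],
                eps: float = 1e-9) -> list[float]:
    """Eq. 3: one weight PF * 1/Entropy per k-mer, in lexicographic pattern order."""
    pf = kmer_counts(seq, k)
    patterns = ["".join(p) for p in product("ACGT", repeat=k)]
    # eps is an assumption: it avoids dividing by a zero entropy.
    return [pf[p] / (pattern_entropy(p, class_counts) + eps) for p in patterns]


# Hypothetical toy classes, each summarized by the pooled k-mer counts of its members.
toy_classes = {
    "class_1": kmer_counts("ACGTACGT", 2),
    "class_2": kmer_counts("ACGTTTTT", 2),
}
vector = kmer_vector("ACGT", 2, toy_classes)
print(len(vector))  # 4**2 dimensions
```

Under this weighting, a k-mer spread evenly over all classes has high entropy and therefore a small weight, while a k-mer concentrated in few classes is emphasized.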
2.2.3. Protein Sequences vs. Clustering
The approach to clustering protein sequences, adopted from previous work [29], is used in the rest of this paper. To transform viruses into vectors via protein sequences, the protein sequences were clustered into groups under the simplifying assumption that similar protein sequences have similar functionalities. The similarity between two protein sequences was measured by the E value, written as $e^{-E}$ and determined using the "pblast" program [20]; two protein sequences were put into the same group if the exponent $E$ was greater than a given threshold $T$, that is, if their E value was below $e^{-T}$. To determine the best value of the threshold $T$, several candidate values of $T$ were used in the experiments and the one that yields the highest accuracy was selected as the final
threshold value. After the protein sequences were clustered into $m$ groups, these groups could be used to represent each virus as one $m$-dimensional vector; the weight of each dimension of that vector is determined by a weighting method similar to the tf*idf weighting approach [22]. Let $CDS(x_{i,l})$ be the set of protein sequences that belong to $x_{i,l}$ and let $|CDS(x_{i,l})|$ be the number of protein sequences in $CDS(x_{i,l})$. Let $S = \bigcup_{1 \le i \le c,\, 1 \le l \le n_i} CDS(x_{i,l}) = \{s_1, s_2, s_3, \ldots, s_{|\#ofProteins|}\}$ be the set of all protein sequences in $X$ and let $|\#ofProteins|$ be the number of protein sequences in $S$. First, all of the protein sequences in $S$ are mapped into $m$ distinct groups $GID_1, GID_2, \ldots, GID_m$, in which the sequences in one group $GID_d$, $1 \le d \le m$, are assumed to have similar functions. The similarity between two protein sequences, for example $s_p$ and $s_q$, is measured using the "pblast" program [20], and $s_p$ and $s_q$ are clustered into the same group $GID_d$, $1 \le d \le m$, if the E value of $s_p$ relative to $s_q$ is under the given threshold $e^{-T}$, e.g. $T = 3$. In this study, a weighting method similar to tf*idf [14] was adopted. Let $CDS(x_{i,l})_{GID_d} = |CDS(x_{i,l}) \cap GID_d|$ be the number of protein sequences in $CDS(x_{i,l})$ that are mapped to group $GID_d$. Let the Group Frequency $GF(GID_d)$ be the number of instances that contain a CDS mapped to $GID_d$, and let the Inverse Group Frequency (IGF) be $IGF(GID_d) = \log \frac{N}{GF(GID_d)}$. After all CDS in $S$ are mapped to the $m$ distinct groups $GID_d$, $1 \le d \le m$, each instance $x_{i,l}$ can be represented as one vector $\langle x_{i,l} \rangle$ using Eq. 4.

$$\langle x_{i,l} \rangle = \langle x^{1}_{i,l}, x^{2}_{i,l}, \ldots, x^{d}_{i,l}, \ldots, x^{|\#ofGroups|}_{i,l} \rangle, \qquad (4)$$

where $x^{d}_{i,l} = CDS(x_{i,l})_{GID_d} * IGF(GID_d)$, $1 \le d \le m = |\#ofGroups|$.

2.2.4. Region Names from Protein Annotation
The regions (domains) within one protein are well known to support a particular function of that protein. After the region names are extracted and collected from the annotation of the proteins, as described in Section 2.1, the "tf*idf" weighting method [14] is applied to build the vectors, where one region name is treated as one term and one virus as one document. Accordingly, the term frequency (tf) of a region name for one virus instance is the number of times that region name appears in the annotations of the proteins that belong to that virus; the document frequency (df) of one region name is the number of viruses that contain that region name. Let $r_d$ be the $d$th region name in the set of region names and let $tf(x_{i,l}, r_d)$ be the number of times $r_d$ appears in the instance $x_{i,l}$. Let $df(r_d)$ be the number of instances whose protein annotations contain the region $r_d$, and let the inverse document frequency be $idf(r_d) = \log(\frac{N}{df(r_d)})$. One instance $x_{i,l}$ is then transformed into a vector as follows.

$$\langle x_{i,l} \rangle = \langle x^{1}_{i,l}, x^{2}_{i,l}, \ldots, x^{d}_{i,l}, \ldots, x^{|\#ofRegion|}_{i,l} \rangle, \qquad (5)$$

where $x^{d}_{i,l} = tf(x_{i,l}, r_d) * idf(r_d)$, $1 \le d \le m = |\#ofRegion|$.
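The tf*idf transformation of Eq. 5 can be illustrated with the short sketch below, in which each virus is a "document" and each region name is a "term". The function name, the virus identifiers and all region names other than "Adeno_E1A" are hypothetical, and the natural logarithm is an assumption since the paper does not state the base.

```python
import math
from collections import Counter


def tfidf_vectors(regions_by_virus: dict[str, list[str]]):
    """Return (vocabulary, {virus: vector}) with weights tf(x, r_d) * idf(r_d)."""
    n_instances = len(regions_by_virus)                 # N in the paper's notation
    df = Counter()
    for names in regions_by_virus.values():
        df.update(set(names))                           # count each virus once per region
    vocabulary = sorted(df)                             # fixes the order of the dimensions
    idf = {r: math.log(n_instances / df[r]) for r in vocabulary}
    vectors = {}
    for virus, names in regions_by_virus.items():
        tf = Counter(names)
        vectors[virus] = [tf[r] * idf[r] for r in vocabulary]
    return vocabulary, vectors


# Hypothetical toy data: three viruses and the region names found in their proteins.
toy = {
    "virus_1": ["Adeno_E1A", "Adeno_E1A", "R_shared"],
    "virus_2": ["R_shared"],
    "virus_3": ["R_other", "R_shared"],
}
vocabulary, vectors = tfidf_vectors(toy)
print(vocabulary)
print(vectors["virus_1"])
```

A region name that occurs in every virus receives idf = log(1) = 0 and so contributes nothing. The same routine would cover the Inverse Group Frequency weighting of Eq. 4 if protein group identifiers were passed in place of region names.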
Table 2. The statistics of virus taxonomy

                 ICTV     NCBI     Selected (#ofSpecies)
# of Orders      7        6        6 (812)
# of Families    96       85       42 (1,922)
# of Genera      420      326      33 (693)
# of Species     2,618    2,406
Table 3. The statistics of six virus orders.

                                                                          Average (Per Virus)
Ci  Order            #ofViruses  DNA Length(bp)  #ofProteins  #ofRegions  DNA Length(bp)      #ofProteins      #ofRegions
1   Caudovirales     446         3628439253      39312        22158       8135514.0           88.1             49.7
2   Herpesvirales    47          856390145       4672         4518        18221066.9          99.4             96.1
3   Mononegavirales  64          7394800         493          544         115543.8            7.7              8.5
4   Nidovirales      33          8177344         293          818         247798.3            8.9              24.8
5   Picornavirales   114         1616712         171          780         14181.7             1.5              6.8
6   Tymovirales      108         3903885         520          926         36147.1             4.8              8.6
    Total            812         4505922139      45461        29744

3. Experimental Results
In this paper, the "easy.py" program from LIBSVM [12] was used as the SVM classifier for virus classification, and 10-fold cross-validation was adopted to avoid the over-fitting problem [19]. Notably, SVM is a well-known classifier in machine learning [8] and LIBSVM supports multi-class classification. In the following, Section 3.1 gives statistics of the viruses in ICTV and NCBI, and of the viruses in the 6 orders, 42 families and 33 genera that were selected for the experiments. Section 3.2 compares the accuracies of classification according to the three genomic materials.

3.1. The Statistics of viruses
To provide an overview of the existing virus taxonomy in ICTV (International Committee on Taxonomy of Viruses) [4] and the viral genomes available in NCBI (National Center for Biotechnology Information) [5], Table 2 gives the statistics concerning virus taxonomy. Based on the official ICTV 2012 taxonomy [25], a total of 2,618 virus species belonged to 7 orders, 96 families and 420 genera. Based on the whole virus genomes that were extracted from NCBI's FTP site [2] when this study started (2012-6-21), 2,406 virus species belonged to 6 orders, 85 families and 326 genera. To ensure the robustness of the experimental results and to provide three types of comparable genomic materials for virus classification, 6 orders, 42 families and 33 genera were selected for the experiments. Each of the classes (orders, families or genera) contained at least 10 species and each of these species had at least one region name tagged in the annotation of the corresponding protein, as illustrated in Fig. 4. Table 3, Table 4 and Table 5 provide details of the statistics of the DNA sequences, the number of proteins, the number of region names and their corresponding averages per species by the order, family and genus of the viruses, respectively.
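The class selection rule of Section 3.1 can be sketched as a simple filter. This reflects one reading of the rule, in which instances without any region name are dropped first and a class is kept only if at least 10 instances remain. The function name and the toy data are hypothetical.

```python
from collections import defaultdict


def select_classes(instances, min_size: int = 10) -> dict[str, list[str]]:
    """Keep classes with at least `min_size` instances that have a region name.

    `instances` is an iterable of (virus_id, class_label, region_names) tuples.
    """
    by_class = defaultdict(list)
    for virus_id, label, region_names in instances:
        if region_names:                      # at least one region name is required
            by_class[label].append(virus_id)
    return {label: ids for label, ids in by_class.items() if len(ids) >= min_size}


# Hypothetical toy data: class "A" has 10 annotated viruses, class "B" only 7.
toy_instances = [(f"A{i}", "A", ["r1"]) for i in range(10)]
toy_instances += [(f"B{i}", "B", ["r2"] if i < 7 else []) for i in range(12)]
selected = select_classes(toy_instances)
print(sorted(selected))
```

Applying such a filter separately at the order, family and genus level would give three datasets of different sizes, as in the "Selected" column of Table 2.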
Table 4. The statistics of 42 virus families.

                                                                             Average (Per Virus)
Ci   Family             #ofViruses  DNA Length(bp)  #ofProteins  #ofRegions  DNA Length(bp)      #ofProteins      #ofRegions
1    Adenoviridae       25          27030426        794          776         1081217.0           31.8             31.0
2    Alphaflexiviridae  39          1452534         211          308         37244.5             5.4              7.9
3    Anelloviridae      34          360257          108          102         10595.8             3.2              3.0
4    Arenaviridae       25          1052628         100          124         42105.1             4.0              5.0
5    Astroviridae       11          209256          31           51          19023.3             2.8              4.6
6    Baculoviridae      51          960584193       7149         6328        18834984.2          140.2            124.1
7    Betaflexiviridae   45          1990938         238          476         44243.1             5.3              10.6
8    Bromoviridae       29          1115486         132          178         38465.0             4.6              6.1
9    Bunyaviridae       25          1406346         99           153         56253.8             4.0              6.1
10   Caliciviridae      19          369837          48           119         19465.1             2.5              6.3
11   Caulimoviridae     33          1193484         152          222         36166.2             4.6              6.7
12   Circoviridae       14          87332           44           48          6238.0              3.1              3.4
13   Closteroviridae    23          3856475         233          204         167672.8            10.1             8.9
14   Coronaviridae      29          7633320         259          765         263217.9            8.9              26.4
15   Dicistroviridae    14          278978          30           93          19927.0             2.1              6.6
16   Flaviviridae       52          612284          57           720         11774.7             1.1              13.8
17   Geminiviridae      254         6043944         1597         1907        23795.1             6.3              7.5
18   Herpesviridae      41          674477523       3885         4246        16450671.3          94.8             103.6
19   Inoviridae         26          2127893         287          213         81842.0             11.0             8.2
20   Luteoviridae       21          723719          126          165         34462.8             6.0              7.9
21   Microviridae       14          728653          140          126         52046.6             10.0             9.0
22   Myoviridae         104         2412036372      15976        8347        23192657.4          153.6            80.3
23   Nodaviridae        12          175085          39           25          14590.4             3.3              2.1
24   Papillomaviridae   67          3685566         479          594         55008.4             7.1              8.9
25   Paramyxoviridae    33          4401661         277          349         133383.7            8.4              10.6
26   Partitiviridae     18          169081          39           18          9393.4              2.2              1.0
27   Parvoviridae       52          1073344         209          246         20641.2             4.0              4.7
28   Picornaviridae     55          451983          59           465         8217.9              1.1              8.5
29   Podoviridae        88          236689280       4944         3163        2689650.9           56.2             35.9
30   Polyomaviridae     22          642091          125          221         29186.0             5.7              10.0
31   Potyviridae        82          1655075         168          753         20183.8             2.0              9.2
32   Poxviridae         27          929391315       4685         6871        34421900.6          173.5            254.5
33   Reoviridae         32          8989522         363          231         280922.6            11.3             7.2
34   Retroviridae       56          2283443         252          690         40775.8             4.5              12.3
35   Rhabdoviridae      25          2162899         169          128         86516.0             6.8              5.1
36   Secoviridae        32          726507          65           147         22703.3             2.0              4.6
37   Siphoviridae       248         959712930       17974        10412       3869810.2           72.5             42.0
38   Togaviridae        17          460911          40           180         27112.4             2.4              10.6
39   Tombusviridae      43          961398          227          184         22358.1             5.3              4.3
40   Totiviridae        26          301678          55           48          11603.0             2.1              1.8
41   Tymoviridae        22          413049          64           133         18775.0             2.9              6.0
42   Virgaviridae       37          1718589         196          339         46448.4             5.3              9.2
     Total              1922        6261437285      62125        50868

3.2. Comparison of Accuracies of Classification and Numbers of Dimensions of Vectors
Fig. 5 and Fig. 6 present the accuracies of virus classification by SVM classifiers for two types of genomic materials, DNA and protein sequences, respectively. The values of "k" and "T" in the experiments ranged from 1 to 8 and from 3 to 75, respectively. As shown in Fig. 5 (Fig. 6), the best accuracies were 99.5% (98.0%), 93.7% (91.5%) and 98.1% (94.5%) when k=5 (T=30), k=4 (T=21) and k=6 (T=12) were set for the three virus datasets of 6 orders, 42 families and 33 genera, respectively. As shown in Table 6, the classification accuracies obtained using "region" names were 99.9%, 97.3% and 99.0%, respectively, so "Region" achieved the best accuracy on every dataset. For the 42 families, for example, the numbers of dimensions of the vectors were 256 for "DNA" when k=4, 28,136 for "Protein" when T=21, and 4,538 for "Region". Section 4.1 explains why the use of "Region" yielded the best accuracy.
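As a quick consistency check, the DNA vector dimensions listed in Table 6 follow directly from the relation $m = 4^k$ of Section 2.2.2 for the best values of k reported for families, orders and genera:

```latex
m = 4^{k}: \qquad 4^{4} = 256, \qquad 4^{5} = 1024, \qquad 4^{6} = 4096 .
```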
Table 5. The statistics of 33 virus genera.

                                                                                  Average (Per Virus)
Ci   Genus                   #ofViruses  DNA Length(bp)  #ofProteins  #ofRegions  DNA Length(bp)      #ofProteins      #ofRegions
1    Alphabaculovirus        35          700041223       5081         4826        20001177.8          145.2            137.9
2    Alphapapillomavirus     14          789935          100          137         56423.9             7.1              9.8
3    Alphatorquevirus        16          195467          52           55          12216.7             3.3              3.4
4    Alphavirus              16          441387          38           172         27586.7             2.4              10.8
5    Badnavirus              18          494642          65           111         27480.1             3.6              6.2
6    Begomovirus             123         3253448         813          965         26450.8             6.6              7.8
7    Begomovirus*            13          17843           13           24          1372.5              1.0              1.8
8    Betabaculovirus         12          225849175       1687         1298        18820764.6          140.6            108.2
9    Betacoronavirus         10          3013335         98           297         301333.5            9.8              29.7
10   Carlavirus              27          1403493         164          326         51981.2             6.1              12.1
11   Carmovirus              13          295267          74           51          22712.8             5.7              3.9
12   Circovirus              11          64075           33           38          5825.0              3.0              3.5
13   Crinivirus              11          2035824         122          93          185074.9            11.1             8.5
14   Dependovirus            15          191112          40           72          12740.8             2.7              4.8
15   Enterovirus             13          95345           13           130         7334.2              1.0              10.0
16   Flavivirus              37          442757          41           551         11966.4             1.1              14.9
17   Gammaretrovirus         13          278712          40           120         21439.4             3.1              9.2
18   Ilarvirus               14          559013          66           73          39929.5             4.7              5.2
19   Inovirus                14          1056331         147          137         75452.2             10.5             9.8
20   Mastadenovirus          16          17663233        519          561         1103952.1           32.4             35.1
21   Mastrevirus             13          137527          51           63          10579.0             3.9              4.8
22   Nepovirus               10          245938          20           54          24593.8             2.0              5.4
23   New world arenaviruses  18          756288          72           90          42016.0             4.0              5.0
24   Partitivirus            11          101890          23           11          9262.7              2.1              1.0
25   Parvovirus              11          243618          49           70          22147.1             4.5              6.4
26   Polerovirus             13          449983          78           106         34614.1             6.0              8.2
27   Polyomavirus            22          642091          125          221         29186.0             5.7              10.0
28   Potexvirus              31          1054674         162          236         34021.7             5.2              7.6
29   Potyvirus               64          1251743         128          600         19558.5             2.0              9.4
30   Sobemovirus             12          218780          52           43          18231.7             4.3              3.6
31   Tobamovirus             22          579754          90           168         26352.5             4.1              7.6
32   Tombusvirus             10          236700          50           60          23670.0             5.0              6.0
33   Tymovirus               15          282903          45           94          18860.2             3.0              6.3
     Total                   693         964383506       10151        11853
* Begomovirus-associated alphasatellites

Fig. 5. Accuracy Comparison using DNA sequences using various k values.

Table 6. Comparison of Classification Accuracy and Numbers of Dimensions of Vectors

                     DNA                  Protein                Region
6 Virus Orders       99.5%, 1024 (k=5)    98.0%, 26942 (T=30)    99.9%, 2783
42 Virus Families    93.7%, 256 (k=4)     91.5%, 28136 (T=21)    97.3%, 4538
33 Virus Genera      98.1%, 4096 (k=6)    94.5%, 2223 (T=12)     99.0%, 2939
Fig. 6. Accuracy Comparison using protein sequences using various T values.
4. Discussions
4.1. Why "region" yielded the best accuracy
Table 6 shows that "region" provided the best classification accuracy. The reasons are discussed below. First, the frequency distributions of k-mers derived from DNA sequences were used for virus classification. Generally, longer k-mers present more specific features. However, two characteristics of viruses, their rapid evolution and their diversity, cause the frequency distribution of k-mers to be too sparse to be used for classification when the DNA sequences are short but the value of k is large. Fig. 5 shows that the accuracy decreases as the value of k increases over 7. Second, in this study, the protein sequences within the same group after protein clustering were assumed to have similar functions. This assumption was used to derive the distinguishing features for vector transformation. However, the protein clustering approach used the "pblast" program to measure the similarity between two protein sequences and the single-linkage method to join two groups into one. This approach might generate impure protein groups, in which one group may exhibit two functions. For example, Fig. 7 presents 7 protein sequences, $S_1, S_2, \ldots, S_7$, in two distinct groups (functions), $GID_1$ and $GID_2$, determined by the single-linkage method with a threshold value $T$. The two groups, $GID_1$ and $GID_2$, are formed due to regions $R_1$ and $R_2$, respectively, and are disjoint because all of the distances between the nodes of $GID_1$ and those of $GID_2$ are larger than $e^{-T}$. However, the appearance of $S_8$, containing both $R_1$ and $R_2$, results in the merging of $GID_1$ and $GID_2$ into $GID_3$. The region name, by contrast, is a distinguishing feature for classification in this study.
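The merging effect of single-linkage clustering can be reproduced with a small union-find sketch. It is illustrative only. Which of the sequences S1 to S7 belongs to which group is not stated in the text, so the split used here (S1 to S4 in one group, S5 to S7 in the other) is an assumption, as are the function name and the list of "close" pairs, which stands in for pairs of sequences whose E value falls below the threshold.

```python
def single_linkage_groups(n: int, close_pairs: list[tuple[int, int]]) -> list[set[int]]:
    """Group items 0..n-1 so that any two items linked by a chain of close pairs share a group."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in close_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a         # one close pair is enough to merge two groups

    groups: dict[int, set[int]] = {}
    for item in range(n):
        groups.setdefault(find(item), set()).add(item)
    return sorted(groups.values(), key=min)


# Indices 0..6 stand for S1..S7 (assumed split: S1-S4 share R1, S5-S7 share R2).
pairs = [(0, 1), (1, 2), (2, 3), (4, 5), (5, 6)]
before = single_linkage_groups(7, pairs)

# Index 7 stands for S8, which carries both regions and is close to S1 and to S5.
after = single_linkage_groups(8, pairs + [(7, 0), (7, 4)])
print(len(before), len(after))
```

A single bridging sequence is enough to collapse the two functional groups into one, which is the impurity that the region-name features avoid.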
With respect to the distribution of the class frequency (CF) of region names across the 42 families, shown in Fig. 8, the majority of the region names had "CF=1" (78.45%); that is, most of the region names appeared in only one
Fig. 7. Two distinct groups, $GID_1$ and $GID_2$, are merged as the group $GID_3$ due to the sequence $S_8$ that contains $R_1$ and $R_2$.
Fig. 8. The distribution of class frequency (CF) of the ”Region” derived from 42 families.
class. Notably, about 20% of the regions had "CF=2" (18.85%) or "CF=3" (5.61%), and these regions, like those contained in $S_8$ described above, may have caused the impurity of the protein groups.
4.2. Drawbacks of "region" name annotation
As shown in Table 2, ICTV and NCBI contained 420 and 326 genera, respectively. However, only 33 genera (693 viruses) were selected in the experiments owing to
the requirement that each class should contain at least ten instances that included at least one region name. Hence, the majority of the viruses were not used in the classification experiments, so the experimental results of this study are not yet robust enough. The "region" names were usually assigned manually, while related sequences were aligned using RPS-BLAST against the CDD (Conserved Domain Database) [23]. The way in which a region name is assigned may have the side effect that related sequences are highly specific to some viral family, for example. Therefore, the region names may carry some metadata about the label of the original family, which may amount to a form of cheating in classification experiments. To avoid such a situation, region names should be annotated automatically without knowledge of the class label, using HMMER3 [3] against Pfam [6] or other automated domain annotation tools. Doing so would make the proposed approach more practical and provide more convincing experimental classification in the future.
4.3. Verifying fitness of class structure within existing virus taxonomy
The misclassified instances were examined using a confusion matrix [26] to identify the implicit relationship between two classes. The experimental results thus obtained are not shown herein owing to the page limit. However, analyzing the ambiguities among classes is useful for evaluating the fitness of an existing class structure [31]. After a feasible type of genomic material is selected for classification, existing class structures of biological taxonomy can be verified via molecular biology instead of traditional morphology. Such work may provide clues for biologists or taxonomists to reinspect and adjust existing class structures when they work with taxonomy in the future.
5. Conclusion
In this study, three genomic materials are used to compare methods of virus classification: DNA sequences, protein sequences and region names. The first two materials are extracted directly from virus genomes, and the last is obtained from the annotation of the proteins. The resources used in the experiments are collected at three taxonomic levels and include 6 orders, 42 families and 33 genera. Experimental results show that using "region" names to classify viruses yielded the best classification accuracy when the SVM classifier from LIBSVM was used. The obtained accuracies were 99.9%, 97.3% and 99.0% for the three datasets that comprised 6 orders, 42 families and 33 genera, respectively. This paper provides a novel approach to classifying viruses based on molecular biology instead of morphology. This approach, using genomic materials, can be applied to classify other organisms. This work opens up a new way to determine whether the existing taxonomic structure is suitable from the point of view of molecular biology [31].
Acknowledgment
This study is supported by Asia University, Taiwan, under project 101-asia-30. The author thanks the reviewers for their valuable comments and suggestions.

References
1. Entrez Programming Utilities Help, http://www.ncbi.nlm.nih.gov/books/NBK25501/.
2. FTP Site for Genomes in NCBI, ftp://ftp.ncbi.nih.gov/genomes.
3. HMMER, http://hmmer.janelia.org/.
4. International Committee on Taxonomy of Viruses (ICTV), http://www.ncbi.nlm.nih.gov/ICTVdb/.
5. National Center for Biotechnology Information (NCBI), http://www.ncbi.nlm.nih.gov/.
6. Pfam database, http://pfam.sanger.ac.uk/.
7. Wikipedia: Virus Classification, http://en.wikipedia.org/wiki/Virus_classification.
8. Alpaydin E, Introduction to Machine Learning, The MIT Press, 2004.
9. Baltimore D, Animal Virology, no. 4, Elsevier Science, 1976. ISBN 9780323142281.
10. Bazinet A, Cummings M, A comparative evaluation of sequence classification programs, BMC Bioinformatics 13(1):92, 2012.
11. Burke SM, Torkington N, Aas G, Perl and LWP. Fetching Web Page, Parsing HTML, Writing Spiders and More, O'Reilly, Beijing, 2002.
12. Chang CC, Lin CJ, LIBSVM: a library for support vector machines, 2001, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
13. Croft B, Metzler D, Strohman T, Search Engines: Information Retrieval in Practice, 1st ed., Addison Wesley, 2009.
14. Croft B, Metzler D, Strohman T, Search Engines: Information Retrieval in Practice, Addison-Wesley Publishing Company, USA, 2009. ISBN 0136072240, 9780136072249.
15. Deschavanne P, DuBow M, Regeard C, The use of genomic signature distance between bacteriophages and their hosts displays evolutionary relationships and phage growth cycle determination, Virology Journal 7, 2010.
16. Dutilh BE, He Y, Hekkelman ML, Huynen MA, Signature, a web server for taxonomic characterization of sequence samples using signature genes, Nucleic Acids Research (suppl 2):W470–W474.
17. Dutilh BE, Snel B, Ettema TJ, Huynen MA, Molecular Biology and Evolution 25, 2008.
18. Exarchos TP, Tsipouras MG, Papaloukas C, Fotiadis DI, A two-stage methodology for sequence classification based on sequential pattern mining and optimization, Data Knowl Eng 66(3):467–487, 2008.
19. Han J, Kamber M, Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann, 2007.
20. Jones NC, Pevzner PA, An Introduction to Bioinformatics Algorithms, MIT Press, 2004. ISBN 0-262-10106-8.
21. Jun SR, Sims GE, Wu GA, Kim SH, Whole-proteome phylogeny of prokaryotes by feature frequency profiles: An alignment-free method with optimal feature resolution, Proceedings of the National Academy of Sciences 107(1):133–138, 2010.
22. Manning CD, Raghavan P, Schütze H, Introduction to Information Retrieval, Cambridge University Press.
23. Marchler-Bauer A, Zheng C, Chitsaz F, Derbyshire MK, Geer LY, Geer RC, Gonzales NR, Gwadz M, Hurwitz DI, Lanczycki CJ, Lu F, Lu S, Marchler GH, Song JS, Thanki N, Yamashita RA, Zhang D, Bryant SH, CDD: conserved domains and protein three-dimensional structure, Nucleic Acids Research 41(Database issue):D348–D352, 2013.
24. Mitchell TM, Machine Learning, The McGraw-Hill Companies, Inc, 1997.
25. International Committee on Taxonomy of Viruses, King A, Adams M, Lefkowitz E, Carstens E, Virus Taxonomy: IXth Report of the International Committee on Taxonomy of Viruses, Immunology and microbiology, Academic Press, 2011. ISBN 9780123846846.
26. Roiger R, Geatz MW, Data Mining: A Tutorial Based Primer, Addison Wesley, 2003.
27. Trifonov V, Rabadan R, Frequency analysis techniques for identification of viral genetic data, mBio 1(3), July/August 2010.
28. van Passel M, Kuramae E, Luyf A, Bart A, Boekhout T, The reach of the genome signature in prokaryotes, BMC Evolutionary Biology 6(1):84, 2006.
29. Wang JD, A Comparison study of Virus Classification by Genome Sequences, The 11th IEEE International Conference on Bioinformatics and Bioengineering, pp. 270–273, 2011.
30. Wang JD, Virus Classification via Genomic Sequences From Different Taxonomic Level, The 23rd International Conference on Genome Informatics, p. 76, 2012.
31. Wang JD, Liu HC, An Approach to Evaluate the Fitness of One Class Structure via Dynamic Centroids, Expert Systems with Applications 38(11):13764–13772, 2011.
32. Witten IH, Frank E, Data Mining: Practical Machine Learning Tools and Techniques (Third Edition), Elsevier, 2011. ISBN 0120884070.
33. Wu GA, Jun SR, Sims GE, Kim SH, Whole-proteome phylogeny of large dsDNA virus families by an alignment-free method, Proceedings of the National Academy of Sciences 106(31):12826–12831, 2009.
34. Yu C, Hernandez T, Zheng H, Yau SC, Huang HH, He RL, Yang J, Yau SST, Real time classification of viruses in 12 dimensions, PLOS ONE 8(5), 2013.
Jing-Doo Wang received his B.S. degree in Computer Science and Information Engineering from the University of Tatung (formerly Tatung Institute of Technology) in 1989, and his M.S. and Ph.D. degrees in Computer Science and Information Engineering from the University of Chung Cheng in 1993 and 2002, respectively. He has been with Asia University (formerly Taichung Healthcare and Management University) since spring 2003, where he is currently an assistant professor in the Department of Computer Science and Information Engineering. He also holds a joint appointment with the Department of Biomedical Informatics. His research interests include bioinformatics, text mining for trend analysis, and the extraction of maximal repeats via cloud computing.