IMPROVING MICRORNA TARGET PREDICTION BY PERFORMANCE-BASED ALGORITHM COMBINATION

Ignacio Sánchez Caballero
MAGÍSTER EN BIOINFORMÁTICA Y BIOLOGÍA COMPUTACIONAL
UNIVERSIDAD COMPLUTENSE DE MADRID, 2010-2011

Centro Nacional de Biotecnología - Consejo Superior de Investigaciones Científicas
Supervised by: Alberto Pascual-Montano
Co-supervised by: Gonzalo Gómez
September 2011
Contents

1 Objective
2 Introduction
   2.1 Biogenesis
   2.2 Features Affecting Targeting
       2.2.1 Complementarity
       2.2.2 Conservation
       2.2.3 Other factors
3 Materials and Methods
   3.1 Materials: Predictive Databases
       3.1.1 ElMMo
       3.1.2 MirTarget
       3.1.3 TargetSpy
       3.1.4 miRWalk
       3.1.5 Microcosm
       3.1.6 TargetScan
       3.1.7 PITA
       3.1.8 DIANA-microT
       3.1.9 microRNA.org
   3.2 Materials: Experimental Databases
       3.2.1 TarBase
       3.2.2 mirTarBase
       3.2.3 miRecords
       3.2.4 miRWalk
4 Methods
   4.1 Unification of formats
   4.2 Combination of experimental databases
   4.3 Performance evaluation
   4.4 Combination
5 Results
   5.1 Predictive databases: ROC curve
   5.2 Predictive databases: Precision curve
   5.3 Combined database
6 Discussion
7 Appendix
   7.1 ROC curves
   7.2 Precision curves
1 Objective
MicroRNAs have been at center stage in the biomedical community for more than a decade now, after a team of scientists working at Harvard University in Cambridge discovered their vital role in the regulation of gene expression [1]. Since then, the use of bioinformatic tools has become a major accelerator in our understanding of microRNA function. Many algorithms have been created to predict where microRNAs are encoded, as well as what genes they regulate [2-10]. Unfortunately, given the popularity of the field, it is not always clear which of the available computational methods is best suited for determining which mRNA transcripts are regulated by which microRNAs. We propose a straightforward method to combine the tens of currently available prediction algorithms, assigning each a credibility measure based on its previous performance, to simplify the task of experimental validation.
2 Introduction
MicroRNAs (miRNAs) are small non-coding RNA molecules of approximately 22 nucleotides in length that act as post-transcriptional regulators of gene expression, and control many cellular processes in animals, plants, and other organisms. Their regulation is accomplished by base-pairing to specific regions in the transcripts of their target genes, thereby degrading them or inhibiting their translation [11].
2.1 Biogenesis
miRNAs are generated by a maturation process that begins in the nucleus of all eukaryotic cells (see Figure 1). The first precursor molecule of the process, called primary miRNA (pri-miRNA), results from the transcription by RNA polymerase II of either a dedicated miRNA gene or an intron of a protein-coding gene. The RNase III enzyme Drosha cleaves the flanks of the pri-miRNA, forming a ∼70-nucleotide hairpin called precursor miRNA (pre-miRNA). This pre-miRNA is exported from the nucleus to the cytoplasm by the enzyme Exportin-5. Then, it is processed by the RNase III enzyme Dicer, creating a ∼22-nucleotide miRNA/miRNA* duplex. One of the duplex strands (the mature strand) is incorporated into the RNA-induced silencing complex (RISC) and eventually binds to the 3′-untranslated region (3′-UTR) of the target mRNA [12]. Binding to the 5′-UTR has also been shown to occur [13], but it is considered rare, and its repressing effects are less significant than in conventional binding.

If the amount of complementarity between the miRNA and the mRNA target site is high, the RISC will degrade the mRNA; if there is only partial complementarity, the RISC will inhibit its translation [14]. Some miRNAs bind hundreds of different mRNAs [15], which means they can regulate multiple processes simultaneously. It is estimated that miRNAs regulate the expression of nearly all mammalian mRNAs [16].

The set of rules that determine what mRNA transcripts a given miRNA will target has not yet been completely elucidated. One reason for this limitation is that metazoan miRNAs bind with only partial complementarity over short sequences, making them difficult to predict computationally; another is that they function as subtle thermostats of protein production, sculpting gene expression only mildly, making their effects difficult to detect.
2.2 Features Affecting Targeting
The process of understanding miRNA function requires our ability to identify what targets they bind to. In order to do so, we need to understand what factors make a target site functional, that is, capable of being bound by a miRNA.

Figure 1: miRNA biogenesis

A complete description of the empirical basis for targeting specificity is still in the making, as is our understanding of the relative contributions that each factor makes [16]. Much emphasis has been placed on the requirement of sequence complementarity (mainly because it is the easiest feature to analyze computationally), but discarding miRNAs that do not fulfill it means overlooking all species-specific target sites, of which there are many. Most of what we know about targeting requirements comes from analyzing the minority of miRNAs that are the most highly expressed (and therefore easier to detect), but for equally valid yet less expressed miRNAs these requirements will not necessarily hold [16].

2.2.1 Complementarity
Unlike plant miRNAs, which pair with their targets with almost full complementarity, most metazoan miRNAs have perfect complementarity with their target sites only over a 6-8 nucleotide region near their 5′ end called the "seed." This region is evolutionarily conserved and is thought to be important in target recognition [7, 15]. Although it was one of the first features to be identified, and is still thought to be one of the most important, it is not the only one [17]. A seed match does not always guarantee that the miRNA will repress translation. Sometimes supplemental pairing near the 3′ end of the miRNA is also required [16], especially to compensate for imperfect seed matches that contain a wobble (non-Watson-Crick) pair.

Target sites can be classified into four types according to their size and the complementarity between their miRNAs and mRNAs [16] (see Figure 2). These include a perfect 6-nucleotide seed match (6mer site), a seed match plus an additional match at position 8 (7mer-m8 site), a seed match with an A at position 1 (7mer-A1 site), and an 8-nucleotide site combining both the additional match and the A at position 1 (8mer site). Recently a new class of site called the "centered site" has been identified [18], which requires 11-12 contiguous Watson-Crick pairs instead of seed matching.
Figure 2: miRNA seed types
2.2.2 Conservation
Conservation is another factor that gives clues to which targets are functional. The reasoning is that if a region is conserved across distantly related species, it is likely to be functional, so conservation patterns serve to pinpoint these areas as biologically active. However, restricting target predictions to those that show cross-species conservation will miss any target that is non-conserved (species-specific) [19]. There is, in fact, ample evidence contradicting the notion that seed conservation is a requirement for functional interactions [20].
2.2.3 Other factors
• AU composition. A high proportion of As and Us surrounding a target site is a good indicator that it is functional. This effect is especially important in the immediate vicinity of the site and plays less of a role farther away [17].

• miRNA cooperation. When several miRNAs are coexpressed, and their binding sites are located close to each other on the same transcript, their efficacy increases. This mechanism allows repression to become more sensitive to small changes in miRNA levels [19], which means that predicted targets should be weighed differently depending on how many similar targets appear in the same mRNA.

• Position in the 3′-UTR. Functional target sites are usually located in the 3′-UTR, but not too close to the stop codon. Additionally, target sites placed near both ends of long UTRs are more effective than those near the center [2].

• Thermodynamic stability. The free energy of hybridization between the miRNA and its target is often used as a measure of how likely the interaction is to take place spontaneously, that is, of how likely the target is to be functional: the lower the free energy of the paired RNAs, the more stable the duplex. Even though this appears to be an important factor for determining true targets, some authors argue that this feature brings no additional information beyond what is already gained by looking at the evolutionary conservation pattern [21].
3 Materials and Methods

3.1 Materials: Predictive Databases
The following databases were chosen because they provide a relatively recent version of their predictions in a downloadable precompiled format. Most databases have predictions for more than one organism, but we will focus on human predictions, since this is often the most complete set. The overlap between them is small, and their differences in size, method, format and performance are extensive. To ease the task of comparison, we generated a set of auxiliary databases that allowed the standardization of all databases into a common format that could be compared and analyzed, with a common set of gene and miRNA names and identifiers.

All nine predictive databases (microRNA.org, Microcosm, TargetScan conserved, PITA, DIANA-microT, ElMMo, MirTarget (miRDB), TargetSpy and mirWalk) cover Homo sapiens; coverage of the remaining organisms varies by database.

Table 1: Organisms with predictive algorithms. Organism codes: hsa-Homo sapiens, mmu-Mus musculus, rno-Rattus norvegicus, gga-Gallus gallus, dre-Danio rerio, dme-Drosophila melanogaster, cel-Caenorhabditis elegans
3.1.1 ElMMo
The ElMMo database [2] is based on a parameter-free Bayesian probabilistic model. It models the evolution of orthologous target sites in a set of related species, inferring the phylogenetic distribution of functional target sites independently for each miRNA. It then assigns to each putative site in a 3′-UTR that is complementary to a miRNA seed a posterior probability that the site is a functional target site for the miRNA (meaning that the site has been selected in evolution for its ability to bind the miRNA). For each miRNA and each seed type (out of 9 possible subtypes) it collects all the putative sites in the 3′-UTRs of the reference species of the chosen clade and determines the conservation pattern of each. To compute the posterior probability that a site is functional given its conservation pattern, it counts the number of times that conservation pattern is observed across putative target sites of the same seed type, and compares this count with what would be expected by chance.

It can be downloaded at http://www.mirz.unibas.ch/ElMMo3/BulkDownloads/v4/hg_targets_FullList_flat.tab.gz

3.1.2 MirTarget
MirTarget [3] is a target prediction program based on support vector machines (SVMs, a universal constructive learning procedure based on statistical learning theory that maximizes the separation between two data groups in a non-linear feature space), a large microarray training dataset, and RNA secondary structure (thermodynamic stability). The algorithm uses step-wise logistic regression analysis together with resampling methods to identify the most predictive features. Some of the 131 training features used to separate positive samples from negative ones are seed conservation, base composition, secondary structure, and location in the 3′-UTR. The parameters of the training model were optimized by multiple rounds of cross-validation to minimize the risk of over-training.

It can be downloaded at http://mirdb.org/miRDB/download/MirTarget2_v3.0_prediction_result.txt.gz

3.1.3 TargetSpy
TargetSpy [4] uses a machine learning prediction scheme (MultiBoost) with decision stumps as the base learner and automatic feature selection over a wide spectrum of compositional, structural and base-pairing features. The model does not depend on evolutionary conservation (allowing the detection of species-specific interactions) and predicts target sites regardless of the presence of a seed match. This makes it especially suited to predict species-specific and 3′-compensatory target sites. Instead of generating candidates by requiring seed pairing, it finds duplexes whose predicted Gibbs free energy is below a certain threshold.

It can be downloaded at http://www.targetspy.org/data/hsa_refseq_seed_sens.gz

3.1.4 miRWalk
miRWalk [5] identifies multiple consecutive Watson-Crick complementary subsequences between miRNA and gene sequences. It searches for seeds by walking the complete sequence of a gene, starting with a heptamer seed. When it finds a perfect heptamer base-pairing, it extends the length of the miRNA seed until a mismatch arises, returning all possible hits with matches of 7 nucleotides or longer. It then separates these binding sites depending on their locations in the analyzed sequences and assigns prediction results in five parts: promoter region, 5′-UTR, coding sequence, 3′-UTR, and mitochondrial genes.

It can be downloaded at http://www.ma.uni-heidelberg.de/apps/zmf/mirwalk/microRNApredictedtarget.html

3.1.5 Microcosm
Microcosm [6] (formerly known as miRBase Targets) uses the miRanda algorithm [22]. It does not assume that miRNA binding sites must be conserved. The miRanda algorithm uses dynamic programming to score alignments based on the complementarity of nucleotides, and it also allows G-U wobble pairs. To estimate the thermodynamic properties of a duplex, it uses the Vienna 1.3 RNA secondary structure programming library [23]. It sets threshold scores for complementarity and minimum energy, and then adds the additional requirement of conservation in at least two species.

It can be downloaded at ftp://ftp.ebi.ac.uk/pub/databases/microcosm/v5/arch.v5.txt.homo_sapiens.zip

3.1.6 TargetScan
TargetScan [7] uses the TargetScanS algorithm, which requires perfect seed pairing. A context score is evaluated to predict site performance, taking into account the context where the target site is embedded in the mRNA: the algorithm scores the type of seed match, the 3′-pairing contribution, the local AU contribution and the position contribution. Predictions are ranked by total context score, based on site type, site number, and site context [17].

It can be downloaded at http://www.targetscan.org//vert_50//vert_50_data_download/Conserved_Site_Context_scores.txt.zip

3.1.7 PITA
PITA [8] stands for Probability of Interaction by Target Accessibility. Using standard settings, it identifies initial seeds for each miRNA in the 3′-UTR and applies a parameter-free model that computes the difference between the free energy gained from the formation of the duplex and the cost of unpairing the target to make it accessible to the miRNA. It then combines the sites for the same miRNA into a total interaction score. It requires full seed complementarity.

It can be downloaded at http://genie.weizmann.ac.il/pubs/mir07/catalogs/PITA_sites_hg18_0_0_TOP.tab.gz
3.1.8 DIANA-microT
DIANA-microT [9] is an algorithm that requires seed matching, but the match does not have to be fully complementary; it may contain a wobble pair. The overall score is the weighted sum of the scores of all the identified interactions on the 3′-UTR, and a conservation profile is used in the estimation of the final score. The algorithm compares the predicted targets against a set of randomly created mock miRNAs and uses that comparison to calculate the false positive rate (FPR) of a particular interaction.

It can be downloaded at http://diana.cslab.ece.ntua.gr/data/public/microT_v3.0.txt.gz

3.1.9 microRNA.org
mirSVR (microRNA.org) [10] is a machine-learning method for scoring the efficiency of miRanda-predicted miRNA target sites, using supervised learning on mRNA expression data collected after miRNA transfections. It trains a regression model, using support vector regression (SVR), on a wide range of sequence and contextual features extracted from miRanda-predicted target sites, such as secondary-structure accessibility of the binding site and phylogenetic conservation. It does not require full complementarity, since it allows a wobble.

It can be downloaded at http://cbio.mskcc.org/microRNA_data/human_predictions_S_C_aug2010.txt.gz
3.2 Materials: Experimental Databases
Determining that a predicted target is functional requires experimental validation, which is expensive and time-consuming. For this reason, the compilation of validated interactions in databases has been instrumental in advancing our understanding of the field. We chose the following experimental databases because they are supervised by manual curators and are frequently updated. The strong experimental evidence they record includes reporter assays (GFP and luciferase), quantitative PCR, and Western blots, which detect mRNA and protein expression levels under conditions of miRNA overexpression or knock-down in the cell. Additionally, they also record weaker experimental evidence from high-throughput miRNA target identification methods, including pSILAC and microarray experiments. These are weaker because they do not show whether the observed expression changes are direct or indirect.
All four experimental databases (TarBase, miRTarBase, miRecords and mirWalk) cover Homo sapiens; coverage of the remaining organisms varies by database.

Table 2: Organisms with experimentally validated data. Organism codes: hsa-Homo sapiens, mmu-Mus musculus, rno-Rattus norvegicus, gga-Gallus gallus, dre-Danio rerio, dme-Drosophila melanogaster, cel-Caenorhabditis elegans

3.2.1 TarBase
TarBase [24] houses a manually curated collection of experimentally tested miRNA targets in human, mouse, fruit fly, worm, and zebrafish. Each record specifies the miRNA that binds, the gene where the target site is located, and the experiments that were conducted to validate it.

It can be downloaded at http://diana.cslab.ece.ntua.gr/data/public/TarBase_V5.0.rar

3.2.2 mirTarBase
miRTarBase [25] has the largest number of validated interactions. It is built using text-mining programs that survey the literature; the resulting list is then manually validated by at least two curators. It covers several organisms.

It can be downloaded at http://mirtarbase.mbc.nctu.edu.tw/cache/download/hsa.xls

3.2.3 miRecords
miRecords [26] is a manually curated database of experimentally validated miRNA-target interactions with systematic documentation of the experimental support for each interaction.

It can be downloaded at http://mirecords.biolead.org/download_data.php?v=2

3.2.4 miRWalk
miRWalk [5] uses automated text-mining on titles and abstracts from the PubMed database against curated dictionaries.

It can be downloaded at http://www.ma.uni-heidelberg.de/apps/zmf/mirwalk/miRNAtargetpub.html
4 Methods
The approach we followed to generate a combined database included standardizing all databases under a single format, combining all experimental databases, evaluating the predictive databases, assigning them weights and, finally, combining them.
4.1 Unification of formats
Heterogeneity of formats is a big problem. Each database uses different columns; some have headers, others do not; some identify their genes by transcript ids, others by gene ids; some use Refseq codes, others Genbank's, others Ensembl's; some genes are obsolete, others are mentioned under two different names. Each database also uses a different range for its confidence scores; some use a negative scale, others a positive one; some go from 1 to 0, others from -100 to 100; for some, a smaller score is better, for others it is the opposite. Figure 3 shows how much their score distributions vary.

To fix this issue we created dictionary files to translate between the different formats, so that all databases, once normalized, would use the same gene and miRNA names. Additionally, we rescaled the confidence scores using Equation 1 to lie between 0 and 1, with 1 representing the highest confidence in an interaction:

rescaled score = (old score - min(scores)) / (max(scores) - min(scores))    (1)

Figure 3: Score distributions of human miRNA-mRNA interaction predictive algorithms. (a) Microcosm. (b) PITA.
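The rescaling of Equation 1 can be sketched as a few lines of code. This is a minimal illustration, not the actual pipeline; the example score list is hypothetical, and databases where a smaller score is better would be negated before rescaling.

```python
def rescale(scores):
    """Min-max rescale a list of confidence scores to [0, 1].

    After rescaling, 1 represents the highest confidence in an
    interaction, regardless of the database's original score range.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:  # degenerate case: all scores are identical
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

# Hypothetical database scoring interactions from -100 (worst) to 100 (best)
print(rescale([-100.0, 0.0, 50.0, 100.0]))  # [0.0, 0.5, 0.75, 1.0]
```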
4.2 Combination of experimental databases
Over the last decade, a number of databases have collected the validated results of miRNA-mRNA interaction experiments [5, 24-26]. We can use these experimental databases as a gold standard to measure the performance of our predictive algorithms. We unified all the experimental databases into a single one, containing a single entry for each interaction that appears in at least one of the databases. In the case of humans, this amounts to 8112 validated interactions. The first field of the combined database (see Table 3) is the miRNA name, as assigned by miRBase [6]. The second is the Genbank id of the gene. Comparing this list of a few thousand validated interactions with even the smallest predictive database (size ∼200,000) clearly shows that most predicted interactions are not validated.

miRNA name      Genbank id
hsa-miR-155     1595
hsa-miR-146a    6098
hsa-miR-23a*    4775
hsa-miR-192     10597
hsa-miR-19a     7057
hsa-miR-1       79814
hsa-miR-10a     613
hsa-miR-31      3717
hsa-miR-21      1871
hsa-miR-7       1611

Table 3: Combined experimental table (truncated)
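Once the databases share the unified format, the merge reduces to a set union over (miRNA name, Genbank id) pairs. A minimal sketch; the example records are illustrative, not taken from the real databases:

```python
def combine_experimental(*databases):
    """Union several experimental databases, keeping a single entry per
    validated (miRNA name, Genbank id) interaction."""
    combined = set()
    for db in databases:
        combined.update(db)
    return sorted(combined)

# Illustrative records already translated into the unified format
tarbase = {("hsa-miR-155", "1595"), ("hsa-miR-21", "1871")}
mirecords = {("hsa-miR-21", "1871"), ("hsa-miR-7", "1611")}

merged = combine_experimental(tarbase, mirecords)
print(len(merged))  # 3: the shared interaction is counted only once
```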
4.3 Performance evaluation
A hypothetical perfect algorithm would generate a set of predictions that would completely overlap with the set of experimentally validated interactions (we call the intersection between predicted and validated interactions True Positives, TP); additionally, the number of experimentally validated interactions that were not predicted by the algorithm should be zero (we call the validated but not predicted interactions False Negatives, FN).

Maximizing the number of True Positives and minimizing the number of False Negatives are not the only criteria that determine a good algorithm. Ideally, it should also maximize the number of not predicted and not validated interactions (True Negatives), and minimize the number of predicted but not validated interactions (False Positives). The problem, however, is that once we know an interaction has not been validated experimentally, we cannot be certain whether the interaction never takes place under any condition in nature, or whether the particular experiment that would validate it was never carried out. Even if it was carried out, and confirmed to be negative under a specific condition, that would not imply it is also negative in all conditions.

We evaluated each predictive database region against the combined experimental database. We are assuming that performance is not a meaningful metric at the scope of the whole database, but rather that it varies along its different regions. If we want to reward databases whose True Positive predictions are located in high-confidence regions, we first need to find out where these are. Our confidence in a database for a given interaction is the number of validated predicted interactions with equal or higher score divided by the total number of predicted interactions with equal or higher score. This metric is called precision.

precision = TP / (TP + FP)    (2)

The predictive databases that have the most validated interactions are not necessarily the best performing, since this can also mean that they are paying for an increase in sensitivity with a decrease in specificity. Therefore our criterion for assigning confidence scores is not the total number of validated interactions, but the precision they obtain. Figure 5 shows how different these can be.
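The per-interaction precision of Equation 2 can be computed in one pass over the predictions sorted by descending score, since the predictions with an equal or higher score are exactly the prefix ending at the interaction's rank. A sketch with a hypothetical toy database and validated set (score ties are treated approximately here, by prefix order):

```python
def precision_at_score(predictions, validated):
    """For each predicted (miRNA, gene, score) triple, compute the
    precision of the database region at or above that score:
    TP / (TP + FP) among predictions with an equal or higher score."""
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)
    out, tp = {}, 0
    for rank, (mirna, gene, score) in enumerate(ranked, start=1):
        if (mirna, gene) in validated:
            tp += 1
        out[(mirna, gene)] = tp / rank  # rank = TP + FP so far
    return out

validated = {("hsa-miR-21", "1871")}
preds = [("hsa-miR-21", "1871", 0.9),
         ("hsa-miR-7", "1611", 0.7),
         ("hsa-miR-155", "1595", 0.5)]
prec = precision_at_score(preds, validated)
print(prec[("hsa-miR-7", "1611")])  # 0.5: one TP among the top two
```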
4.4 Combination
An initial attempt at combining all predictive databases might be to do it by score: sort all predictions by score and, in those cases where several databases predict the same interaction, take the average score. This is less than ideal, since each database defines a "good" score differently; for some, it might be anything above 0.6, for others nothing less than 0.9. Another attempt could be taking the union, or the intersection, of all these algorithms. In fact, this is often done by experimental scientists, since it is one of the most straightforward ways to get an intuition for prediction confidence. The problem with this approach is that it can give higher weight to several low-confidence databases that predict the same interaction than to a single high-confidence database, which is not necessarily correct. We overcame this limitation by using the precision that we had previously assigned to each interaction and combining it with the original scores of each algorithm that predicted it (see Equation 3). This allowed us to create a global score for each interaction by combining several databases without letting low-performing databases drag down well-performing ones.
The combined predictive database (see Table 4) follows a similar format to the combined experimental database: a miRNA name, a Genbank id, a gene name, a combined score, a precision value, and the number of True Positives that have an equal or higher score.

combined score = Σ_i score_i × precision_i    (3)

where the sum runs over the databases i that predict the interaction.

miRNA name            Genbank id   Gene Name   Score         Precision    TP count
hsa-miR-493*          1299         COL9A3      0.0168806     0.00328941   2589
hsa-miR-211           643432       TSG1        0.0038764     0.00164134   3386
hsa-miR-92a-2-star    54584        GNB1L       0.00818215    0.00209322   3038
hsa-miR-181b          5321         PLA2G4A     0.118192      0.014327     1367
hsa-miR-1301          56963        RGMA        0.000177761   0.00111633   3972
hsa-miR-323-3p        9695         EDEM1       0.078212      0.00988287   1706
hsa-miR-545           4520         MTF1        0.0261155     0.00434324   2354
hsa-miR-922           57047        PLSCR2      0.00758106    0.00202568   3072
hsa-miR-33a-star      2168         FABP1       3.52438e-05   0.00103047   4132
hsa-miR-583           65998        C11orf95    0.00339401    0.00158163   3426

Table 4: Combined predictive table (truncated)
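Equation 3 is a precision-weighted sum over the databases that predict an interaction. A minimal sketch with hypothetical score/precision pairs, illustrating why two low-precision databases agreeing no longer outweigh one high-precision database:

```python
def combined_score(per_db):
    """per_db: list of (score_i, precision_i) pairs, one per database
    that predicts the interaction. Returns sum(score_i * precision_i)."""
    return sum(score * precision for score, precision in per_db)

# Two agreeing low-precision databases versus one high-precision database
two_weak = combined_score([(0.9, 0.01), (0.8, 0.01)])  # 0.017
one_strong = combined_score([(0.7, 0.10)])             # 0.07
print(one_strong > two_weak)  # True
```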
5 Results: Target Prediction Integration

5.1 Predictive databases: ROC curve
We used the Receiver Operating Characteristic (ROC) curve to give an intuitive indication of database performance (see Figure 4). The graphs are obtained by setting a score threshold of 1 and decreasing it by fixed increments down to 0. This shows the performance along each region of the database, as opposed to the overall performance (which would be a single point). As we mentioned before, these graphs should be viewed with caution, since False Positives and True Negatives appear undifferentiated. The False Positive Rate is the complement of a classifier's specificity, and the True Positive Rate measures its sensitivity.

FPR = FP / (FP + TN)    (4)

TPR = TP / (TP + FN)    (5)
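Sweeping the threshold and computing one (FPR, TPR) point per step can be sketched as follows. This is an illustration using the unified (miRNA, gene) pair format with hypothetical data; the count of negative candidates is supplied by the caller, since (as noted above) True Negatives cannot be observed directly:

```python
def roc_points(predictions, validated, total_negatives, steps=10):
    """Trace ROC points by lowering the score threshold from 1 to 0.

    predictions: list of ((mirna, gene), score) with scores in [0, 1].
    validated: set of experimentally validated (mirna, gene) pairs.
    total_negatives: assumed number of non-validated candidates,
    needed for the FPR denominator (FP + TN).
    """
    total_positives = len(validated)
    points = []
    for i in range(steps + 1):
        threshold = 1 - i / steps
        above = [pair for pair, score in predictions if score >= threshold]
        tp = sum(1 for pair in above if pair in validated)
        fp = len(above) - tp
        tpr = tp / total_positives if total_positives else 0.0
        fpr = fp / total_negatives if total_negatives else 0.0
        points.append((fpr, tpr))
    return points

validated = {("hsa-miR-21", "1871")}
preds = [(("hsa-miR-21", "1871"), 0.9), (("hsa-miR-7", "1611"), 0.4)]
pts = roc_points(preds, validated, total_negatives=1, steps=2)
print(pts)  # [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
```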
Figure 4: ROC curves. (a) ElMMo. (b) Targetspy.

From these graphs we can infer that the ElMMo database provides more validated interactions in its higher-scoring positions than Targetspy. Regardless of Targetspy's performance in comparison with ElMMo, we would still like to combine it with ElMMo in a way that does not impact ElMMo's high scores, but adds confidence to its lower scores, since two scores from two low-confidence regions in two databases would give more confidence than just one.
5.2 Predictive databases: Precision curve
The precision curve gives an indication of the number of validated interactions that are predicted in the database. A high-performing database must give a higher score to validated interactions than to other predictions.
Figure 5: Precision curves. (a) DIANA-microT. (b) mirWalk. The red line indicates the average precision.
5.3 Combined database
We combined all predictive databases using equation 3. The resulting ROC and precision curves for this combined database show better performance than any of the predictive databases by itself.
Figure 6: (a) ROC curve of the combined database. (b) Precision curve of the combined database, cropped on both axes. The red line indicates the average precision.
6 Discussion
No algorithm makes perfect predictions under every condition [14, 27]. Because of the multi-faceted nature of miRNA targeting, and the lack of consensus among existing predictions, it makes sense to combine them in a way that maximizes the number of true predicted results while minimizing that of false ones. There have been previous attempts to combine the predictions of several algorithms by first taking their union or intersection, as a way to improve accuracy or coverage, balancing out their sensitivity and specificity, and finally choosing the most likely candidates by consensus [28-30]. Most of these algorithms give the user the freedom to choose which combination of databases to use. The problem with this approach is that a significant proportion of users do not have the necessary information about each database's performance to make an informed decision.

Our approach presents an alternative solution that assigns confidence scores to each database's predictions. This solves the problem introduced by choosing candidates by consensus, namely that several low-confidence predictions can erroneously appear more credible than a single high-confidence prediction.

There are some limitations in our approach that could be interesting future research directions. For example, we assume that databases have a high precision when they contain many validated interactions among their top scores, but it does not necessarily follow that databases with low precision are not predicting true interactions. It may just mean that the interactions they predict are harder to prove experimentally because they have subtler effects, or because the necessary experiments to validate them were never carried out. Recently, Shin et al. [18] reported that miRNAs can also bind without being complementary to the seed region, in what they called centered sites. No algorithm has yet been designed to predict these interactions.
Therefore, we can be sure that many of the possible interactions that were previously dismissed will have to be revisited. Our choice of validated interactions could be narrowed down further to discard those obtained through high-throughput methods like microarrays. This would make sense since it is believed that microRNAs act through translational inhibition more often than through mRNA degradation. Conversely, using a more inclusive dataset may give a more realistic picture than using a small subset of validated interactions. As our understanding of miRNA targeting improves and experimental methods become cheaper and more precise, our combined database will become more sensitive and specific. Ideally there should be no need to combine individual algorithms, but until we gain a clearer picture of the factors involved in targeting, this approach can serve as a useful bridge to higher-confidence predictions.
7 Appendix
7.1 ROC curves
Figure 7: ROC curves: ElMMo
Figure 8: ROC curves: Microcosm
Figure 9: ROC curves: microRNA.org
Figure 10: ROC curves: DIANA-microT
Figure 11: ROC curves: MirTarget
Figure 12: ROC curves: miRWalk
Figure 13: ROC curves: PITA
Figure 14: ROC curves: TargetScan Conserved
Figure 15: ROC curves: TargetSpy
7.2 Precision curves
Figure 16: Precision curves: ElMMo
Figure 17: Precision curves: Microcosm
Figure 18: Precision curves: microRNA.org
Figure 19: Precision curves: DIANA-microT
Figure 20: Precision curves: MirTarget
Figure 21: Precision curves: miRWalk
Figure 22: Precision curves: PITA
Figure 23: Precision curves: TargetScan Conserved
Figure 24: Precision curves: TargetSpy
References

[1] R. C. Lee, R. L. Feinbaum, and V. Ambros, "The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14," Cell, vol. 75, pp. 843–854, Dec. 1993.

[2] D. Gaidatzis, E. van Nimwegen, J. Hausser, and M. Zavolan, "Inference of miRNA targets using evolutionary conservation and pathway analysis," BMC Bioinformatics, vol. 8, p. 69, 2007.

[3] X. Wang and I. M. El Naqa, "Prediction of both conserved and nonconserved microRNA targets in animals," Bioinformatics, vol. 24, pp. 325–332, Feb. 2008.

[4] M. Sturm, M. Hackenberg, D. Langenberger, and D. Frishman, "TargetSpy: a supervised machine learning approach for microRNA target prediction," BMC Bioinformatics, vol. 11, no. 1, p. 292, 2010.

[5] H. Dweep, C. Sticht, P. Pandey, and N. Gretz, "miRWalk – Database: Prediction of possible miRNA binding sites by 'walking' the genes of three genomes," Journal of Biomedical Informatics, May 2011.

[6] S. Griffiths-Jones and R. Grocock, "miRBase: microRNA sequences, targets and gene nomenclature," Nucleic Acids Research, 2006.

[7] B. P. Lewis, C. B. Burge, and D. P. Bartel, "Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets," Cell, vol. 120, pp. 15–20, Jan. 2005.

[8] M. Kertesz, N. Iovino, U. Unnerstall, U. Gaul, and E. Segal, "The role of site accessibility in microRNA target recognition," Nature Genetics, vol. 39, pp. 1278–1284, Sept. 2007.

[9] M. Maragkakis, T. Vergoulis, P. Alexiou, M. Reczko, K. Plomaritou, M. Gousis, K. Kourtis, N. Koziris, T. Dalamagas, and A. G. Hatzigeorgiou, "DIANA-microT Web server upgrade supports Fly and Worm miRNA target prediction and bibliographic miRNA to disease association," Nucleic Acids Research, vol. 39, pp. W145–W148, July 2011.

[10] D. Betel, A. Koppal, P. Agius, C. Sander, and C. Leslie, "Comprehensive modeling of microRNA targets predicts functional non-conserved and non-canonical sites," Genome Biology, vol. 11, no. 8, p. R90, 2010.

[11] D. P. Bartel, "MicroRNAs: Genomics, Biogenesis, Mechanism, and Function," Cell, 2004.

[12] B. Cullen, "Transcription and Processing of Human microRNA Precursors," Molecular Cell, 2004.

[13] W. P. Kloosterman, E. Wienholds, R. F. Ketting, and R. H. A. Plasterk, "Substrate requirements for let-7 function in the developing zebrafish embryo," Nucleic Acids Research, vol. 32, no. 21, pp. 6284–6291, 2004.

[14] M. Selbach, B. Schwanhäusser, N. Thierfelder, Z. Fang, R. Khanin, and N. Rajewsky, "Widespread changes in protein synthesis induced by microRNAs," Nature, vol. 455, pp. 58–63, Sept. 2008.

[15] J. Brennecke, A. Stark, R. B. Russell, and S. M. Cohen, "Principles of MicroRNA–Target Recognition," PLoS Biology, vol. 3, p. e85, Feb. 2005.

[16] D. P. Bartel, "MicroRNAs: target recognition and regulatory functions," Cell, vol. 136, pp. 215–233, Jan. 2009.

[17] A. Grimson, K. K.-H. Farh, W. K. Johnston, P. Garrett-Engele, L. P. Lim, and D. P. Bartel, "MicroRNA targeting specificity in mammals: determinants beyond seed pairing," Molecular Cell, vol. 27, pp. 91–105, July 2007.

[18] C. Shin, J.-W. Nam, K. K.-H. Farh, H. R. Chiang, A. Shkumatava, and D. P. Bartel, "Expanding the MicroRNA Targeting Code: Functional Sites with Centered Pairing," Molecular Cell, vol. 38, pp. 789–802, June 2010.

[19] K. K.-H. Farh, "The Widespread Impact of Mammalian MicroRNAs on mRNA Repression and Evolution," Science, vol. 310, pp. 1817–1821, Dec. 2005.

[20] M. Lindow and J. Gorodkin, "Principles and limitations of computational microRNA gene and target finding," DNA and Cell Biology, vol. 26, pp. 339–351, May 2007.

[21] H. Min and S. Yoon, "Got target? Computational methods for microRNA target prediction and their extension," Experimental and Molecular Medicine, vol. 42, no. 4, p. 233, 2010.

[22] A. J. Enright, B. John, U. Gaul, T. Tuschl, C. Sander, and D. S. Marks, "MicroRNA targets in Drosophila," Genome Biology, vol. 5, no. 1, p. R1, 2003.

[23] I. Hofacker, "Vienna RNA secondary structure server," Nucleic Acids Research, 2003.

[24] P. Sethupathy, M. Megraw, and A. G. Hatzigeorgiou, "A guide through present computational approaches for the identification of mammalian microRNA targets," Nature Methods, vol. 3, pp. 881–886, Oct. 2006.

[25] S. D. Hsu, F. M. Lin, W. Y. Wu, C. Liang, W. C. Huang, W. L. Chan, W. T. Tsai, G. Z. Chen, C. J. Lee, C. M. Chiu, C. H. Chien, M. C. Wu, C. Y. Huang, A. P. Tsou, and H. D. Huang, "miRTarBase: a database curates experimentally validated microRNA–target interactions," Nucleic Acids Research, vol. 39, pp. D163–D169, Dec. 2010.

[26] F. Xiao, Z. Zuo, G. Cai, S. Kang, X. Gao, and T. Li, "miRecords: an integrated resource for microRNA–target interactions," Nucleic Acids Research, vol. 37, pp. D105–D110, Jan. 2009.

[27] D. Baek, J. Villén, C. Shin, F. D. Camargo, S. P. Gygi, and D. P. Bartel, "The impact of microRNAs on protein output," Nature, vol. 455, pp. 64–71, Sept. 2008.

[28] M. Megraw, P. Sethupathy, B. Corda, and A. G. Hatzigeorgiou, "miRGen: a database for the study of animal microRNA genomic organization and function," Nucleic Acids Research, vol. 35, pp. D149–D155, Jan. 2007.

[29] E. A. Shirdel, W. Xie, T. W. Mak, and I. Jurisica, "NAViGaTing the micronome – using multiple microRNA prediction databases to identify signalling pathway-associated microRNAs," PLoS ONE, vol. 6, no. 2, p. e17429, 2011.

[30] S. Nam, B. Kim, S. Shin, and S. Lee, "miRGator: an integrated system for functional annotation of microRNAs," Nucleic Acids Research, vol. 36, pp. D159–D164, Jan. 2008.