The notion of motifs (or patterns) in biological molecules, defined as local recurring elements in functionally related entities, arising either from evolutionary relationships or through convergence, has been exploited successfully by computational methods aimed at functional characterization. Motifs can be detected with relative ease at the primary sequence level, but they almost always have a structural meaning, being clusters of spatially close residues working in concert to achieve a given function. The bioinformatics field of motif finding in proteins and DNA is well developed, providing several tools, approaches and databases (Bailey et al., 2009; Burge et al., 2013; Sonnhammer et al., 1997), while fewer resources are available for structural motif finding in RNAs. Such tools can be particularly useful for the functional characterization of noncoding RNAs (ncRNAs), for which information about the specific sequences and structures involved is still scarce. ncRNAs take part in a wide range of biological functions through diverse molecular mechanisms, often involving interaction with one or more RNA-binding protein (RBP) partners, with other RNAs or with genomic DNA (4,5). Experimental and computational techniques are becoming available to characterize, in high-throughput settings and at high resolution, protein-RNA interactions, chromatin-RNA interactions and RNA secondary structures, allowing the identification of binding partners, binding sites and functional determinants. Protein-RNA interactions are central to many cellular processes (Kishore et al., 2010; Lukong et al., 2008; Licatalosi and Darnell, 2010; R., 2002) and they often involve ncRNAs. These processes include transcription factors/telomere regulation, alternative splicing, chromatin remodelling, nucleotide modification and many others.
The complexity of the protein-RNA interaction network is starting to be fully appreciated thanks to several technological advances (Ferrè et al., 2016), such as the high-throughput assays CLIP-Seq and PAR-CLIP. Generally, sequence-level binding preferences are found, allowing the definition of sequence motifs and the use of sequence-only tools such as MEME (Bailey et al., 2009) or cERMIT (Georgiev et al., 2010). Still, these sequence determinants must frequently be presented in a specific structural context (Buckanovich and Darnell, 1997; Hiller et al., 2007; Meisner et al., 2005), while in other cases it is the RNA secondary structure that dictates the interaction specificity: for example, some proteins tend to recognize complex secondary structure elements such as stem-loops and bulges (Cusack, 1999). RBP-RNA binding is therefore heterogeneous in nature, and different RBP domains are governed by different rules. The influence of the RNA structural context on protein binding, and its impact on motif-finding methods, has recently been reviewed (Li et al., 2014). Given the importance of the structural context of functional motifs in RNA molecules, a number of methods that include the secondary structure in the RNA motif-finding problem are available (for two recent reviews, see Achar and Sætrom, 2015; Badr et al., 2013). FOLDALIGN and its variants (Gorodkin, 2001; Gorodkin et al., 1997), comRNA (Ji et al., 2004), RNAProfile (Pavesi et al., 2004; Zambelli and Pavesi, 2015), RSmatch (Liu et al., 2015), RNAmine (Hamada et al., 2006), MEMERIS (Hiller et al., 2006), CMfinder (Yao et al., 2006a), Seed (Anwar et al., 2006), GeRNAMo (Michal et al., 2007), RNApromo (Rabani et al., 2008), SCARNA_LM (Tabei and Asai, 2009) and GraphProt (Maticzka et al., 2014) are all tools that take advantage of secondary structure information to tackle the motif-finding problem, employing different approaches and to different extents.
Other methods were developed specifically for the identification of protein-binding motifs, e.g. RNAcontext (Kazan et al., 2010), the algorithm by Li et al. (Li et al., 2010), mCarts (Zhang et al., 2013), RBPmotif (Kazan and Morris, 2013) and Zagros (Bahrami-Samani et al., 2015). The underlying algorithms vary: expectation maximization (MEMERIS), covariance models (CMfinder), stochastic context-free grammars (RNApromo), graph matching (comRNA, RNAmine), graph kernels (GraphProt), fold-and-align methods (FOLDALIGN), conditional random fields (SCARNA_LM), hidden Markov models (mCarts), genetic programming (GeRNAMo), and others. The nature of the secondary structure information needed by these methods also varies: some need pre-computed structures or perform a minimum free energy prediction on the fly, others employ base-pairing probabilities, and others try to build the secondary structure simultaneously with the motif-finding procedure. Some methods search for purely structural motifs, while others can consider sequence information as well. Finally, many algorithms are restricted to motifs of a specific nature, for example motifs found only in single-stranded regions (MEMERIS), or in regions containing a limited and/or fixed number of hairpins (CMfinder, FOLDALIGN, RNAProfile), or motifs obtained by starting from and expanding well-conserved stem structures (RNApromo, RNAmine). When the algorithm requires the RNA secondary structure, it is often converted into formats with various degrees of complexity and information content. Graph representations provide very accurate results, but they are usually computationally expensive and limited to topological comparisons, which can hardly detect structural similarities that arise from biological relationships; moreover, models of RNA structure evolution are not applied when comparing RNA secondary structures.
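As an illustration of how a secondary structure can be converted into formats of different information content, the following minimal Python sketch takes a structure in the widely used dot-bracket notation and derives (i) a pair table and (ii) a coarse per-position annotation. The function names and the three labels (`S` for paired positions, `H` for hairpin-loop positions, `U` for any other unpaired position) are assumptions made for this example only; they do not correspond to the encoding used by any of the tools cited above.

```python
def pair_table(dot_bracket):
    """Return, for each position, the index of its pairing partner (-1 if unpaired)."""
    partners = [-1] * len(dot_bracket)
    stack = []  # indices of '(' still waiting for a matching ')'
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unbalanced ')' at position {i}")
            j = stack.pop()
            partners[i], partners[j] = j, i
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unbalanced '(' at position {stack[-1]}")
    return partners


def coarse_elements(dot_bracket):
    """Label each position as S (paired), H (hairpin loop) or U (other unpaired).

    Hypothetical three-letter annotation used only for illustration.
    """
    partners = pair_table(dot_bracket)
    labels = ["S" if p >= 0 else "U" for p in partners]
    for i, j in enumerate(partners):
        # A base pair (i, j) closes a hairpin when everything between i and j is unpaired.
        if j > i and all(partners[k] < 0 for k in range(i + 1, j)):
            for k in range(i + 1, j):
                labels[k] = "H"
    return "".join(labels)


if __name__ == "__main__":
    structure = "((..((...))..))"
    print(pair_table(structure))
    print(coarse_elements(structure))
```

The pair table keeps the full topology, whereas the label string discards which position pairs with which; this loss is the kind of trade-off between complexity and information content referred to above.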
To address this issue, my research group recently proposed BEAR, a representation of the RNA secondary structure based on an alphabet of characters that describe secondary structure elements and their size, and computed substitution matrix-like rates of variation of these structural elements in functionally related RNAs (Mattei et al., 2014). With an informative string-based representation of the secondary structure and a substitution matrix, it becomes possible to apply standard sequence alignment algorithms to the problem of RNA structural comparison (Mattei et al., 2014, 2015). In my work I developed BEAM (BEAr Motif finder) (Pietrosanto et al., 2016), a method that explores sets of unaligned RNA structures sharing a biological property (e.g. the ability to bind a specific RNA-binding protein), looks for the most represented local secondary structure motifs, and evaluates their significance with respect to a common background. BEAM employs the BEAR secondary structure notation and its associated similarity matrix of secondary structure elements in order to capture motifs through structural similarities derived from evolutionarily related ncRNAs, in a way that covers topological comparison yet extends it by considering the evolutionary history behind the structure representation. BEAM is able to identify structurally similar sites shared by hundreds or thousands of RNAs, and the length of the motifs is not subject to limitations other than those imposed by the user. Hence, it is suitable for low-, medium- and high-throughput settings; the last of these, exemplified by CLIP-Seq analysis (Änkö and Neugebauer, 2012), is a scale that structural motif-finding methods could not handle until now (Cook et al., 2015). BEAM has been tested on a number of artificial and real cases, and evaluated for its robustness to noisy datasets and for the impact of imprecise secondary structure predictions on the results.
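The idea that a string encoding plus a substitution matrix lets standard alignment algorithms compare structures can be sketched as follows. This is a minimal Smith-Waterman local alignment score over structure strings, written under stated assumptions: the three-letter alphabet (`S` paired, `H` hairpin loop, `U` other unpaired), the scores in `toy_score` and the gap penalty are invented for illustration. They are not the BEAR alphabet or its substitution matrix, and the code is not the BEAM algorithm.

```python
def toy_score(a, b):
    """Hypothetical similarity between two structure characters (not the BEAR matrix)."""
    if a == b:
        return 2
    if {a, b} <= {"H", "U"}:  # two different kinds of unpaired position
        return 1
    return -2  # paired versus unpaired


def local_align_score(s, t, score=toy_score, gap=-3):
    """Smith-Waterman local alignment with a linear gap penalty.

    Returns the best local score and the 1-based end positions (i, j) in s and t.
    """
    best, best_end = 0, (0, 0)
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        curr = [0]
        for j in range(1, len(t) + 1):
            cell = max(
                0,                                         # start a new local alignment
                prev[j - 1] + score(s[i - 1], t[j - 1]),   # match or substitution
                prev[j] + gap,                             # gap in t
                curr[j - 1] + gap,                         # gap in s
            )
            curr.append(cell)
            if cell > best:
                best, best_end = cell, (i, j)
        prev = curr
    return best, best_end


if __name__ == "__main__":
    # A stem-loop embedded in a longer structure versus the stem-loop alone.
    print(local_align_score("UUSSHHHSSUU", "SSHHHSS"))
```

Because the comparison operates on plain strings, its cost grows with the product of the two string lengths, as in ordinary sequence alignment; a richer alphabet and an empirically derived matrix would replace `toy_score` without changing the recursion.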
Comparisons with similar state-of-the-art methods gave favorable results. The requirement of a known or predicted secondary structure might limit the applicability of BEAM, but this should not remain a major hindrance, as recent technological advances are quickly leading towards an era in which high-quality RNA secondary structure information will be available for entire transcriptomes (Bai et al., 2014). The BEAM source code is freely available at https://github.com/noise42/beam, and a web server with some additional features has been developed for online use.
(2016). BEAM: a novel method to infer conserved structural patterns in RNA.
BEAM: a novel method to infer conserved structural patterns in RNA
PIETROSANTO, MARCO
2016-01-01