Authors: Marschall, Tobias
Title: Algorithms and statistical methods for exact motif discovery
Language (ISO): en
Abstract: The motif discovery problem consists of uncovering exceptional patterns (called motifs) in sets of sequences. It arises in molecular biology when searching for yet unknown functional sites in DNA sequences. In this thesis, we develop a motif discovery algorithm that (1) is exact, that means it returns a motif with optimal score, (2) can use the statistical significance with respect to complex background models as a scoring function, (3) takes into account the effects of self-overlaps of motif instances, and (4) is efficient enough to be useful in large-scale applications. To this end, several algorithms and statistical methods are developed. First, the concepts of deterministic arithmetic automata (DAAs) and probabilistic arithmetic automata (PAAs) are introduced. We prove that they allow calculating the distributions of values resulting from deterministic computations on random texts generated by arbitrary finite-memory text models. This technique is applied three times: first, to compute the distribution of the number of occurrences of a pattern in a random string, second, to compute the distribution of the number of character accesses made by windowbased pattern matching algorithms, and, third, to compute the distribution of clump sizes, where a clump is a maximal set of overlapping motif occurrences. All of these applications are interesting theoretical topics in themselves and, in all three cases, our results go beyond those known previously. In order to compute the distribution of the number of occurrences of a motif in a random text, a deterministic finite automaton (DFA) accepting the motif’s instances is needed to subsequently construct a PAA. We therefore address the problem of efficiently constructing minimal DFAs for motif types common in computational biology. We introduce simple non-deterministic finite automata (NFAs) and prove that these NFAs are transformed into minimal DFAs by the classical subset construction. We show that they can be built from (sets of) generalized strings and from consensus strings with a Hamming neighborhood, allowing the direct construction of minimal DFAs for these pattern types. As a contribution to the field of motif statistics, we derive a formula for the expected clump size of motifs. It is remarkably simple and does not involve laborious operations like matrix inversions. This formula plays an important role in developing bounds for the expected clump size of partially known motifs. Such bounds are needed to obtain bounds for the p-value of a partially known motif. Using these, we are finally able to devise a branch-and-bound algorithm for motif discovery that extracts provably optimal motifs with respect to their p-values in compound Poisson approximation. Markovian text models of arbitrary order can be used as a background model (or null model). The algorithm is further generalized to jointly handle a motif and its reverse complement. An Open Source implementation is publicly available as part of the MoSDi software i package. An experimental evaluation using synthetic and real data sets follows. On the carefully crafted benchmark set of Sandve et al. (2007), the proposed algorithm outperforms Weeder (Bailey and Elkan, 1994) and MEME (Pavesi et al., 2004) in terms of the commonly used average nucleotide-level correlation coefficient. With respect to this measure, it is also superior to other algorithms tested by Fauteux et al. (2008) on the same benchmark suite; namely Seeder (Fauteux et al., 2008), BioProspector (Liu et al., 2001), GibbsSampler (Lawrence et al., 1993), and MotifSampler (Thijs et al., 2001). Besides the comparison to other algorithms, we perform motif discovery on the non-coding regions of Mycobacterium tuberculosis and on CpG-rich regions in the human genome. In both cases, we report on found motifs that are strikingly over-represented. While the function of most of these motifs remains unknown to us, some motifs found in M. tuberculosis can be attributed to a known biological function.
Subject Headings: Compound poisson
Motif discovery
Motif statistics
Mycobacterium tuberculosis
Probabilistic arithmetic automation
URI: http://hdl.handle.net/2003/27760
http://dx.doi.org/10.17877/DE290R-552
Issue Date: 2011-05-23
Appears in Collections:LS 11

Files in This Item:
File Description SizeFormat 
dissertation.pdfDNB2.26 MBAdobe PDFView/Open


This item is protected by original copyright



All resources in the repository are protected by copyright.