Algorithms and statistical methods for exact motif discovery
Loading...
Date
2011-05-23
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
The motif discovery problem consists of uncovering exceptional patterns (called motifs)
in sets of sequences. It arises in molecular biology when searching for yet unknown
functional sites in DNA sequences.
In this thesis, we develop a motif discovery algorithm that (1) is exact, that means it
returns a motif with optimal score, (2) can use the statistical significance with respect
to complex background models as a scoring function, (3) takes into account the effects
of self-overlaps of motif instances, and (4) is efficient enough to be useful in large-scale
applications.
To this end, several algorithms and statistical methods are developed. First, the
concepts of deterministic arithmetic automata (DAAs) and probabilistic arithmetic automata
(PAAs) are introduced. We prove that they allow calculating the distributions of values
resulting from deterministic computations on random texts generated by arbitrary
finite-memory text models. This technique is applied three times: first, to compute
the distribution of the number of occurrences of a pattern in a random string, second,
to compute the distribution of the number of character accesses made by windowbased
pattern matching algorithms, and, third, to compute the distribution of clump
sizes, where a clump is a maximal set of overlapping motif occurrences. All of these
applications are interesting theoretical topics in themselves and, in all three cases, our
results go beyond those known previously.
In order to compute the distribution of the number of occurrences of a motif in a
random text, a deterministic finite automaton (DFA) accepting the motif’s instances
is needed to subsequently construct a PAA. We therefore address the problem of
efficiently constructing minimal DFAs for motif types common in computational biology.
We introduce simple non-deterministic finite automata (NFAs) and prove that these NFAs
are transformed into minimal DFAs by the classical subset construction. We show that
they can be built from (sets of) generalized strings and from consensus strings with a
Hamming neighborhood, allowing the direct construction of minimal DFAs for these
pattern types.
As a contribution to the field of motif statistics, we derive a formula for the expected
clump size of motifs. It is remarkably simple and does not involve laborious operations
like matrix inversions. This formula plays an important role in developing bounds for
the expected clump size of partially known motifs. Such bounds are needed to obtain
bounds for the p-value of a partially known motif. Using these, we are finally able to
devise a branch-and-bound algorithm for motif discovery that extracts provably optimal
motifs with respect to their p-values in compound Poisson approximation. Markovian
text models of arbitrary order can be used as a background model (or null model). The
algorithm is further generalized to jointly handle a motif and its reverse complement.
An Open Source implementation is publicly available as part of the MoSDi software
i
package.
An experimental evaluation using synthetic and real data sets follows. On the
carefully crafted benchmark set of Sandve et al. (2007), the proposed algorithm outperforms
Weeder (Bailey and Elkan, 1994) and MEME (Pavesi et al., 2004) in terms of the
commonly used average nucleotide-level correlation coefficient. With respect to this
measure, it is also superior to other algorithms tested by Fauteux et al. (2008) on the
same benchmark suite; namely Seeder (Fauteux et al., 2008), BioProspector (Liu et al.,
2001), GibbsSampler (Lawrence et al., 1993), and MotifSampler (Thijs et al., 2001).
Besides the comparison to other algorithms, we perform motif discovery on the
non-coding regions of Mycobacterium tuberculosis and on CpG-rich regions in the human
genome. In both cases, we report on found motifs that are strikingly over-represented.
While the function of most of these motifs remains unknown to us, some motifs found
in M. tuberculosis can be attributed to a known biological function.
Description
Table of contents
Keywords
Compound poisson, Motif discovery, Motif statistics, Mycobacterium tuberculosis, Probabilistic arithmetic automation