Bioinformatics from genetic variants to methylation
Date
2018
Abstract
An important research topic in bioinformatics is the analysis of DNA, the
molecule that encodes the genetic information of all organisms. The basis
for this is sequencing, a procedure in which the sequence of DNA bases
is determined. In addition to the identification of variations in the base sequence
itself, advances in sequencing methods and a steady reduction in sequencing
costs open up new fields of research: the analysis of functionally
relevant non-base-related changes, so-called epigenetics. An important example
of such a mechanism is DNA methylation, a process in which methyl
groups are added to DNA without altering the sequence itself. Methylation
takes place only at specific sites, and the methylation information of human
DNA consists of approximately 30 million methylation levels between
0 and 1 in total. This thesis deals with problems and solutions for each
phase of DNA methylation analysis.
In terms of resolution, the most advanced method for detecting DNA methylation
is Whole-Genome Bisulfite Sequencing (WGBS), a technique that
modifies DNA at unmethylated sites. We describe the special in-silico treatment
required to process this altered DNA, together with existing concepts and
newly developed bioinformatic methods for the efficient determination of
DNA methylation levels and their further processing with our
tool camel. A common downstream analysis step is the detection of differentially
methylated regions (DMRs), for which we have implemented a
modification of the widely used method BSmooth in order to deal with
today’s common data sizes.
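The idea of a methylation level derived from bisulfite-treated reads can be illustrated with a minimal sketch. It is not the implementation used in camel; it assumes forward-strand, ungapped reads given as (start, sequence) pairs, considers only CpG sites, and relies on the fact that bisulfite treatment makes unmethylated cytosines appear as T while methylated cytosines remain C. All function names are hypothetical.

```python
def in_silico_convert(seq):
    """Replace every C with T, mimicking a fully converted forward strand.

    Aligners for bisulfite data commonly compare converted reads against a
    converted reference so that C/T differences do not count as mismatches.
    """
    return seq.upper().replace("C", "T")


def methylation_level(n_methylated, n_unmethylated):
    """Fraction of reads supporting methylation, or None without coverage."""
    total = n_methylated + n_unmethylated
    if total == 0:
        return None
    return n_methylated / total


def call_levels(reference, reads):
    """Return {position: level} for CpG cytosines covered by the reads.

    reads: iterable of (start, sequence) pairs (assumed input format).
    """
    counts = {}
    for start, seq in reads:
        for offset, base in enumerate(seq.upper()):
            pos = start + offset
            # Only cytosines followed by G (CpG context) are evaluated here.
            if reference[pos:pos + 2] != "CG":
                continue
            meth, unmeth = counts.get(pos, (0, 0))
            if base == "C":      # cytosine was protected, i.e. methylated
                meth += 1
            elif base == "T":    # cytosine was converted, i.e. unmethylated
                unmeth += 1
            counts[pos] = (meth, unmeth)
    return {
        pos: methylation_level(m, u)
        for pos, (m, u) in counts.items()
        if m + u > 0
    }
```

Each resulting value lies between 0 and 1, matching the methylation levels described above; values of exactly 0 or 1 occur whenever all reads at a site agree.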
Setting up and creating new sequencing protocols, such as the aforementioned
WGBS, is complicated and requires adjustments to several parameters. We
have developed a method based on a linear program (LP) that can predict
the duplication rate of supersamples. This critical quality measure represents
the proportion of redundant data that in most cases needs to be removed
from any further analysis. By using our method, it becomes possible to
test, adjust and improve parameters for small test libraries only and to
estimate the duplication rate for potential full-size samples.
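To make the quality measure concrete, the following sketch computes the observed duplication rate of a single library. It does not reproduce the LP-based prediction described above; it only measures redundancy in data that is already available. The fragment key (chromosome, position, strand) and both function names are assumptions chosen for illustration.

```python
from collections import Counter


def duplicate_rate(fragment_keys):
    """Proportion of reads that are redundant copies of another read.

    fragment_keys: one hashable key per read, e.g. (chromosome, start, strand).
    Reads sharing a key are treated as duplicates of each other.
    """
    keys = list(fragment_keys)
    if not keys:
        return 0.0
    return 1.0 - len(set(keys)) / len(keys)


def occupancy_histogram(fragment_keys):
    """Map k to the number of distinct fragments observed exactly k times."""
    return Counter(Counter(fragment_keys).values())
```

A histogram of this kind is a compact summary of a small test library. Whether the method described above takes it as input is not stated in this abstract.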
Once the sequencing protocol has been established, the methylation calling
of camel can be used as part of automated workflows, such as our
mosquito workflow. This pipeline processes the generated WGBS samples
from the raw data to the methylation levels, including all essential
intermediate steps. Such workflows are one of the central components of
bioinformatics because computations must be parallel, reproducible, and scalable.
The distribution of the detected methylation levels, e.g., values of several
samples at a specific location, can often be described by a beta mixture
model. The standard approach for estimating the parameters of such a
model, the EM algorithm, has problems with data points equal to 0 or 1, which are
very common among methylation levels. For this reason, we have developed an
alternative algorithm based on moments that overcomes this disadvantage.
It is robust for data points within the closed interval [0, 1] and can also be
applied to similar data sets in addition to methylation levels.
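Why moments tolerate boundary values can be seen in a simplified setting. The sketch below fits a single beta distribution, not a mixture, using the textbook method of moments; it is an illustration only and not the algorithm developed in the thesis. The beta log-likelihood contains the terms log(x) and log(1 - x), which are not finite at x = 0 or x = 1, whereas the sample mean and variance are defined for any values in [0, 1]. The function name is hypothetical.

```python
def beta_moment_estimates(values):
    """Estimate (alpha, beta) of one beta distribution from sample moments.

    Uses the standard identities
        alpha = m * (m * (1 - m) / v - 1)
        beta  = (1 - m) * (m * (1 - m) / v - 1)
    with sample mean m and (population) sample variance v.
    Values of exactly 0.0 or 1.0 are allowed as input.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    # A beta distribution requires 0 < variance < mean * (1 - mean).
    if var <= 0 or var >= mean * (1 - mean):
        raise ValueError("moments are not compatible with a beta distribution")
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common
```

For example, the levels 0.0, 0.5 and 1.0 give a mean of 0.5 and a variance of 1/6, which yields alpha = beta = 0.25 without ever evaluating a logarithm at the boundary.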
This work deals not only with epigenetic but also with genetic variants. To
analyze these, we present a second pipeline (ape) for data from targeted
sequencing, in which, for example, only genes are sequenced. The detected
variants then serve as input for our graphical environment eagle, a tool
for computer scientists and geneticists to identify possible causal genetic
variants. As the name implies, the configuration of the analysis and the presentation
of the results are done via a graphical user interface. Unlike other
tools, eagle is not based on databases, but on encapsulated HDF5 files. The
use of this universal, file-system-like data structure offers some advantages
and makes the system easy to use, especially for non-computer scientists.
At the end of the thesis, we use all methods presented for the detection,
analysis, and characterization of interindividual DMRs between several
donors. This leads to some computational challenges because DMR
detection is usually performed on two different groups.
Our approach processes independent samples and calculates
key metrics such as p-values and the number of undetectable DMRs.
Through genome-wide association studies (GWAS) on more than 1000 array
data sets of methylation and variants, we show that (interindividual)
DMRs, as a subtype of epigenetic variation, are related to genetic variation.
Keywords
Bioinformatics, Methylation, Variants