LS 11

Recent Submissions

Now showing 1 - 20 of 60
  • Item
    Efficient string algorithmics across alphabet realms
    (2024) Ellert, Jonas; Fischer, Johannes; Gawrychowski, Paweł
    Stringology is a subfield of computer science dedicated to analyzing and processing sequences of symbols. It plays a crucial role in various applications, including lossless compression, information retrieval, natural language processing, and bioinformatics. Recent algorithms often assume that the strings to be processed are over a polynomial integer alphabet, i.e., each symbol is an integer that is at most polynomial in the lengths of the strings. In contrast, the earlier days of stringology were shaped by the weaker comparison model, in which strings can only be accessed by mere equality comparisons of symbols or (if the symbols are totally ordered) order comparisons of symbols. Nowadays, these flavors of the comparison model are referred to as the general unordered alphabet and the general ordered alphabet, respectively. In this dissertation, we dive into the realm of both integer alphabets and general alphabets. We present new algorithms and lower bounds for classic problems, including Lempel-Ziv compression, computing the Lyndon array, and the detection of squares and runs. Our results show that, instead of only assuming the standard model of computation, it is important to also consider both weaker and stronger models. In particular, we should not discard the older and weaker comparison-based models too quickly, as they are not only powerful theoretical tools but also lead to fast and elegant practical solutions, even by today's standards.
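    As a point of reference for one of the objects above: the Lyndon array stores, for every text position, the length of the longest Lyndon word starting there. The quadratic-time Python sketch below only illustrates the definition; it is not one of the thesis's algorithms, which target (near-)linear time in the respective alphabet models.

      def is_lyndon(w: str) -> bool:
          """A nonempty string is a Lyndon word iff it is strictly smaller
          than every one of its proper suffixes."""
          return all(w < w[k:] for k in range(1, len(w)))

      def lyndon_array(s: str) -> list[int]:
          """lam[i] = length of the longest Lyndon word starting at position i.
          Naive reference implementation for illustration only."""
          n = len(s)
          lam = [1] * n
          for i in range(n):
              for l in range(1, n - i + 1):
                  if is_lyndon(s[i:i + l]):
                      lam[i] = l
          return lam

      print(lyndon_array("banana"))  # -> [1, 2, 1, 2, 1, 1]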
  • Item
    AAM: a dataset of Artificial Audio Multitracks for diverse music information retrieval tasks
    (2023-03-23) Ostermann, Fabian; Vatolkin, Igor; Ebeling, Martin
    We present a new dataset of 3000 artificial music tracks with rich annotations, based on real instrument samples and generated by algorithmic composition with respect to music theory. Our collection provides ground truth onset information and has several advantages compared to many available datasets. It can be used to compare and optimize algorithms for various music information retrieval tasks like music segmentation, instrument recognition, source separation, onset detection, key and chord recognition, or tempo estimation. As the audio is perfectly aligned to the original MIDIs, all annotations (onsets, pitches, instruments, keys, tempos, chords, beats, and segment boundaries) are absolutely precise. Because of that, specific scenarios can be addressed, for instance, detection of segment boundaries with instrument and key change only, or onset detection only in tracks with drums and slow tempo. This allows for the exhaustive evaluation and identification of individual weak points of algorithms. In contrast to datasets with commercial music, all audio tracks are freely available, allowing for the extraction of one's own audio features. All music pieces are stored as single instrument audio tracks and a mix track, so that different augmentations and DSP effects can be applied to extend training sets and create individual mixes, e.g., for deep neural networks. In three case studies, we show how different algorithms and neural network models can be analyzed and compared for music segmentation, instrument recognition, and onset detection. In the future, the dataset can easily be extended, taking specific demands on the composition process into account.
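    To illustrate the kind of evaluation the precise onset annotations enable, here is a small, hedged scoring sketch in Python; the ±50 ms tolerance and the greedy one-to-one matching rule are illustrative assumptions, not a protocol prescribed by the dataset.

      def onset_f_measure(reference, estimated, tolerance=0.05):
          """Greedily match estimated onsets (seconds) to unmatched reference
          onsets within +-tolerance seconds; return precision, recall, F."""
          ref = sorted(reference)
          est = sorted(estimated)
          matched = 0
          i = 0
          for onset in est:
              # skip reference onsets that lie too far in the past
              while i < len(ref) and ref[i] < onset - tolerance:
                  i += 1
              if i < len(ref) and abs(ref[i] - onset) <= tolerance:
                  matched += 1
                  i += 1  # each reference onset may be matched at most once
          precision = matched / len(est) if est else 0.0
          recall = matched / len(ref) if ref else 0.0
          f = 2 * precision * recall / (precision + recall) if matched else 0.0
          return precision, recall, f

      print(onset_f_measure([0.50, 1.00, 1.52], [0.48, 1.06, 1.50, 2.00]))
      # -> (0.5, 0.666..., 0.571...)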
  • Item
    Emergency corridor formation with autonomous vehicles on hardware with limited resources
    (2024) Kuzmic, Jurij; Rudolph, Günter; Mostaghim, Sanaz
    The focus of this dissertation is on the development of algorithms and software for low-power IoT devices, applied to the problem of forming an emergency corridor on motorways with autonomous vehicles. Severe, unforeseeable accidents occur on motorways time and again. In such situations, drivers sometimes forget to form an emergency corridor for police and rescue vehicles and obstruct them in reaching the accident site, even though every minute counts for those involved in the accident. Autonomous vehicles, in contrast, could execute the developed corridor-formation algorithms when a specific event occurs; the problem can thus be resolved by autonomous vehicles. Accordingly, the motivation of this research is to counteract the problem of forming an emergency corridor on motorways for police and rescue vehicles in the event of an accident by means of autonomous vehicles. To this end, two algorithms are investigated and subsequently tested. The algorithms are evaluated with a simulator developed as part of this work. These algorithms are then verified in a real environment and not only in simulation: to address the real-world problem, real autonomous model vehicles are developed and tested on a test track. A further motivation is to keep the budget as low as possible and, for the sake of the environment, to save power and money. This is why the focus lies on developing software for low-power IoT devices. This raises the following questions: Which of the developed algorithms and methods can be executed on low-power IoT hardware at all? What runtimes can be achieved on these IoT devices? Can real-time applications also be implemented on the IoT hardware used? These questions are answered by the various experiments of this research. The topic of limited resources is not only of interest for model vehicles; nowadays, limited resources are also found in drones, smart surveillance cameras, wildlife cameras, and other IoT devices.
  • Item
    Memory-based deep reinforcement learning in endless imperfect information games
    (2023) Pleines, Marco; Rudolph, Günter; Preuss, Mike
    Memory capabilities in Deep Reinforcement Learning (DRL) agents have become increasingly crucial, especially in tasks characterized by partial observability or imperfect information. However, the field faces two significant challenges: the absence of a universally accepted benchmark and limited access to open-source baseline implementations. We present "Memory Gym", a novel benchmark suite encompassing both finite and endless versions of the Mortar Mayhem, Mystery Path, and Searing Spotlights environments. The finite tasks emphasize strong dependencies on memory and memory interactions, while the remarkable endless tasks, inspired by the game "I packed my bag", act as an automatic curriculum, progressively challenging an agent's retention and recall capabilities. To complement this benchmark, we provide two comprehensible and open-source baselines anchored on the widely-adopted Proximal Policy Optimization algorithm. The first employs a recurrent mechanism through a Gated Recurrent Unit (GRU) cell, while the second adopts an attention-based approach using Transformer-XL (TrXL) for episodic memory with a sliding window. Given the dearth of readily available transformer-based DRL implementations, our TrXL baseline offers significant value. Our results reveal an intriguing performance dynamic: TrXL is often superior in finite tasks, but in the endless environments, GRU unexpectedly marks a comeback. This discrepancy prompts further investigation into TrXL's potential limitations, including whether its initial query misses temporal cues, the impact of stale hidden states, and the intricacies of positional encoding.
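    As a hedged illustration of the recurrent baseline idea (a memory state carried across time steps, feeding actor and critic heads), consider the minimal PyTorch sketch below; the layer sizes and structure are assumptions for illustration, not the authors' published architecture.

      import torch
      import torch.nn as nn
      from torch.distributions import Categorical

      class RecurrentActorCritic(nn.Module):
          """Minimal GRU-based policy core: observation -> encoder -> GRUCell -> heads."""
          def __init__(self, obs_dim: int, num_actions: int, hidden: int = 128):
              super().__init__()
              self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
              self.memory = nn.GRUCell(hidden, hidden)
              self.actor = nn.Linear(hidden, num_actions)
              self.critic = nn.Linear(hidden, 1)

          def forward(self, obs: torch.Tensor, h: torch.Tensor):
              x = self.encoder(obs)
              h = self.memory(x, h)  # hidden state carries memory across steps
              return Categorical(logits=self.actor(h)), self.critic(h).squeeze(-1), h

      # One step of interaction with a batch of 4 environments, 16-dim observations.
      policy = RecurrentActorCritic(obs_dim=16, num_actions=5)
      h = torch.zeros(4, 128)
      dist, value, h = policy(torch.randn(4, 16), h)
      action = dist.sample()  # shape: (4,)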
  • Item
    Automatic online algorithm selection for optimization in cyber-physical production systems
    (2023) Fischbach, Andreas; Rudolph, Günter; Bartz-Beielstein, Thomas
    Shrinking product lifecycles, progressing market penetration of innovative product technologies, and increasing demand for product individualization lead to frequent adjustments of production processes and thus to an increasing demand for frequent optimization of production processes. Offline solutions are not always available, and even the optimization problem class itself may have changed in terms of the value landscape of the objective function: Parameters may have been added, the locations of optimal values and the values themselves may have changed. This thesis develops an automatic solution to the algorithm selection problem for continuous optimization. Furthermore, based on the evaluation of three different real-world use cases and a review of well-known architectures from the field of automation and cognitive science, a system architecture suitable for use in large data scenarios was developed. The developed architecture has been implemented and evaluated on two real-world problems: A Versatile Production System (VPS) and Injection Molding Optimization (IM). The developed solution for the VPS was able to automatically tune the feasible algorithms and select the most promising candidate, which significantly outperformed the competitors. This was evaluated by applying statistical tests based on the generated test instances using the process data and by performing benchmark experiments. This solution was extended to the area of multi-objective optimization for the IM use case by specifying an appropriate algorithm portfolio and selecting a suitable performance metric to automatically compare the algorithms. This allows the automatic optimization of three largely uncorrelated objectives: cycle time, average volume shrinkage, and maximum warpage of the parts to be produced. The extension to multi-objective handling for IM optimization showed a huge benefit in terms of manual implementation effort, as most of the work could be done by configuration. The implementation effort was reduced to selecting optimizers and hypervolume computation.
  • Item
    Graph set data mining
    (2023) Schäfer, Till; Mutzel, Petra; Buchin, Kevin
    Graphs are among the most versatile abstract data types in computer science. With the variety comes great adoption in various application fields, such as chemistry, biology, social analysis, logistics, and computer science itself. With the growing capacities of digital storage, the collection of large amounts of data has become the norm in many application fields. Data mining, i.e., the automated extraction of non-trivial patterns from data, is a key step to extract knowledge from these datasets and generate value. This thesis is dedicated to concurrent scalable data mining algorithms beyond traditional notions of efficiency for large-scale datasets of small labeled graphs; more precisely, structural clustering and representative subgraph pattern mining. It is motivated by, but not limited to, the need to analyze molecular libraries of ever-increasing size in the drug discovery process. Structural clustering makes use of graph theoretical concepts, such as (common) subgraph isomorphisms and frequent subgraphs, to model cluster commonalities directly in the application domain. It is considered computationally demanding for non-restricted graph classes and with very few exceptions prior algorithms are only suitable for very small datasets. This thesis discusses the first truly scalable structural clustering algorithm StruClus with linear worst-case complexity. At the same time, StruClus embraces the inherent values of structural clustering algorithms, i.e., interpretable, consistent, and high-quality results. A novel two-fold sampling strategy with stochastic error bounds for frequent subgraph mining is presented. It enables fast extraction of cluster commonalities in the form of common subgraph representative sets. StruClus is the first structural clustering algorithm with a directed selection of structural cluster-representative patterns regarding homogeneity and separation aspects in the high-dimensional subgraph pattern space. Furthermore, a novel concept of cluster homogeneity balancing using dynamically-sized representatives is discussed. The second part of this thesis discusses the representative subgraph pattern mining problem in more general terms. A novel objective function maximizes the number of represented graphs for a cardinality-constrained representative set. It is shown that the problem is a special case of the maximum coverage problem and is NP-hard. Based on the greedy approximation of Nemhauser, Wolsey, and Fisher for submodular set function maximization a novel sampling approach is presented. It mines candidate sets that contain an optimal greedy solution with a probabilistic maximum error. This leads to a constant-time algorithm to generate the candidate sets given a fixed-size sample of the dataset. In combination with a cheap single-pass streaming evaluation of the candidate sets, this enables scalability to datasets with billions of molecules on a single machine. Ultimately, the sampling approach leads to the first distributed subgraph pattern mining algorithm that distributes the pattern space and the dataset graphs at the same time.
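    The link to maximum coverage drawn above can be illustrated with the standard greedy rule: repeatedly pick the pattern that covers the most not-yet-represented graphs. The sketch below shows this generic Nemhauser-Wolsey-Fisher-style greedy step only, not the thesis's sampling-based candidate generation.

      def greedy_representatives(coverage, k):
          """coverage maps a candidate subgraph pattern to the set of graph ids
          it occurs in. Greedily pick up to k patterns, each time maximizing the
          number of newly covered graphs (the classic (1 - 1/e)-approximation
          for maximum coverage)."""
          covered = set()
          chosen = []
          candidates = dict(coverage)
          while candidates and len(chosen) < k:
              best = max(candidates, key=lambda p: len(candidates[p] - covered))
              gain = candidates.pop(best) - covered
              if not gain:
                  break  # no remaining pattern adds a new graph
              chosen.append(best)
              covered |= gain
          return chosen

      coverage = {"C-C-O": {1, 2, 3}, "C=O": {3, 4}, "N-C": {2, 5}, "C-C": {1, 2, 3, 4}}
      print(greedy_representatives(coverage, 2))  # ['C-C', 'N-C'] covers all five graphs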
  • Item
    Analysis and application of hash-based similarity estimation techniques for biological sequence analysis
    (2021) Timm, Henning; Rahmann, Sven; Mosig, Axel
    In bioinformatics, a large group of problems requires the computation or estimation of sequence similarity. However, the analysis of biological sequence data faces, among many others, three major challenges: a large amount of generated data, technology-specific errors (that can be mistaken for biological signals), and the need to analyze the data without access to a reference genome. Through the use of locality sensitive hashing methods, both the efficient estimation of sequence similarity and tolerance against the errors specific to biological data can be achieved. We developed a variant of the winnowing algorithm for local minimizer computation, which is specifically geared to deal with repetitive regions within biological sequences. By compressing redundant information, we can both reduce the size of the hash tables required to store minimizer sketches and reduce the amount of redundant low-quality alignment candidates. Analyzing the distribution of segment lengths generated by this approach, we can better judge the size of the required data structures, as well as identify hash functions feasible for this technique. Our evaluation verified that simple and fast hash functions, even when using small hash value spaces (hash functions with a small codomain), are sufficient to compute compressed minimizers and perform comparably to uniformly randomly chosen hash values. We also outlined an index for a taxonomic protein database using multiple compressed winnowings to identify alignment candidates. To store MinHash values, we present a cache-optimized implementation of a hash table using Hopscotch hashing to resolve collisions. As a biological application of similarity-based analysis, we describe the analysis of double digest restriction site associated DNA sequencing (ddRADseq). We implemented a simulation software able to model the biological and technological influences of this technology to allow better development and testing of ddRADseq analysis software. Using datasets generated by our software, as well as data obtained from population genetic experiments, we developed an analysis workflow for ddRADseq data based on the Stacks software. Since the quality of the results generated by Stacks strongly depends on how well the used parameters are adapted to the specific dataset, we developed a Snakemake workflow that automates preprocessing tasks while also allowing the automatic exploration of different parameter sets. As part of this workflow, we developed a PCR deduplication approach able to generate consensus reads incorporating the base quality values (as reported by the sequencing device) without performing an alignment first. As an outlook, we outline a MinHashing approach that can be used for faster and more robust clustering, while addressing incomplete digestion and null alleles, two effects specific to ddRADseq that current analysis tools cannot reliably detect.
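    To make the minimizer idea concrete: within every window of w consecutive k-mers, the k-mer with the smallest hash value is selected, so neighbouring windows usually agree on their sample. The sketch below is plain winnowing with a stable hash; the thesis's variant additionally compresses the runs caused by repetitive regions, which is not shown here.

      import hashlib

      def stable_hash(kmer: str) -> int:
          """Deterministic 64-bit hash (Python's built-in str hash is salted per process)."""
          return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")

      def minimizers(seq: str, k: int = 5, w: int = 4) -> set[tuple[int, str]]:
          """Return (position, k-mer) pairs: the hash-minimal k-mer of every window
          of w consecutive k-mers (ties broken by leftmost position)."""
          kmers = [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]
          picked = set()
          for start in range(len(kmers) - w + 1):
              window = kmers[start:start + w]
              picked.add(min(window, key=lambda ik: (stable_hash(ik[1]), ik[0])))
          return picked

      print(sorted(minimizers("ACGTACGTACGTTTTT")))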
  • Item
    Reconsideration and extension of Cartesian genetic programming
    (2021) Kalkreuth, Roman Tobias; Rudolph, Günter; Kaufmann, Paul
    This dissertation aims at analyzing fundamental concepts and dogmas of a graph-based genetic programming approach called Cartesian Genetic Programming (CGP) and introduces advanced genetic operators for CGP. The results of the experiments presented in this thesis lead to more knowledge about the algorithmic use of CGP and its underlying working mechanisms. CGP has mostly been used with a parametrization pattern that has been prematurely generalized as the most efficient pattern for standard CGP and its variants. Several parametrization patterns are evaluated with more detailed and comprehensive experiments by using meta-optimization. This thesis also presents a first runtime analysis of CGP. The time complexity of a simple (1+1)-CGP algorithm is analyzed on a simple mathematical problem and a simple Boolean function problem. In the subfield of genetic operators for CGP, new recombination and mutation techniques that work on a phenotypic level are presented. The effectiveness of these operators is demonstrated on a widespread set of popular benchmark problems. Especially the role of recombination can be seen as a big open question in the field of CGP, since the lack of an effective recombination operator limits CGP to mutation-only use. Phenotypic exploration analysis is used to analyze the effects caused by the presented operators. This type of analysis also leads to new insights into the search behavior of CGP in continuous and discrete fitness spaces. Overall, the outcome of this thesis leads to a reconsideration of how CGP is effectively used and extends its adaptation of Darwin's and Lamarck's theories of biological evolution.
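    For readers unfamiliar with CGP, the toy (1+1)-CGP sketch below evolves a Boolean circuit for XOR; the genome is a fixed list of nodes (one function gene plus two connection genes each) and a single output gene. The function set, mutation rate, and node count are illustrative assumptions and do not reproduce the parametrization patterns or operators studied in the thesis.

      import random
      random.seed(3)

      N_INPUTS, N_NODES = 2, 10
      FUNCS = [lambda a, b: a and b, lambda a, b: a or b,
               lambda a, b: not a, lambda a, b: a != b]  # AND, OR, NOT, XOR

      def random_node(index):
          """Node = [function id, two connection genes pointing to earlier nodes/inputs]."""
          return [random.randrange(len(FUNCS)),
                  random.randrange(index), random.randrange(index)]

      def random_genome():
          nodes = [random_node(N_INPUTS + i) for i in range(N_NODES)]
          output = random.randrange(N_INPUTS + N_NODES)
          return nodes, output

      def evaluate(genome, inputs):
          nodes, output = genome
          values = list(inputs)
          for f, a, b in nodes:            # feed-forward evaluation of the node grid
              values.append(FUNCS[f](values[a], values[b]))
          return values[output]

      def fitness(genome):
          """Number of correct rows of the XOR truth table."""
          table = [((a, b), a != b) for a in (0, 1) for b in (0, 1)]
          return sum(evaluate(genome, x) == y for x, y in table)

      def mutate(genome, rate=0.1):
          nodes, output = [list(n) for n in genome[0]], genome[1]
          for i in range(len(nodes)):
              if random.random() < rate:
                  nodes[i] = random_node(N_INPUTS + i)
          if random.random() < rate:
              output = random.randrange(N_INPUTS + N_NODES)
          return nodes, output

      parent = random_genome()
      for _ in range(2000):                # (1+1)-CGP: accept offspring if not worse
          child = mutate(parent)
          if fitness(child) >= fitness(parent):
              parent = child
      print("fitness:", fitness(parent), "/ 4")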
  • Item
    Tree comparison: enumeration and application to cheminformatics
    (2021) Droschinsky, Andre; Mutzel, Petra; Fischer, Johannes
    Graphs are a well-known data structure used in many application domains that rely on relationships between individual entities. Examples are social networks, where the users may be in friendship with each other, road networks, where one-way or bidirectional roads connect crossings, and work package assignments, where workers are assigned to tasks. In chem- and bioinformatics, molecules are often represented as molecular graphs, where vertices represent atoms and bonds between them are represented by edges connecting the vertices. Since there is an ever-increasing amount of data that can be treated as graphs, fast algorithms are needed to compare such graphs. A well-researched concept to compare two graphs is the maximum common subgraph. On the one hand, this allows finding substructures that are common to both input graphs. On the other hand, we can derive a similarity score from the maximum common subgraph. A practical application is rational drug design, which involves molecular similarity searches. In this thesis, we study the maximum common subgraph problem, which entails finding a largest graph that is isomorphic to subgraphs of two input graphs. We focus on restrictions that allow polynomial-time algorithms with a low exponent. An example is the maximum common subtree of two input trees. We succeed in improving the previously best-known time bound. Additionally, we provide a lower time bound under certain assumptions. We study a generalization of the maximum common subtree problem, the block-and-bridge preserving maximum common induced subgraph problem between outerplanar graphs. This problem is motivated by the application to cheminformatics. First, the vast majority of drugs modeled as molecular graphs are outerplanar, and second, the blocks correspond to the ring structures and the bridges to atom chains or linkers. If we allow disconnected common subgraphs, the problem becomes NP-hard even for trees as input. We propose a second generalization of the maximum common subtree problem, which allows skipping vertices in the input trees while maintaining polynomial running time. Since a maximum common subgraph is not unique in general, we investigate the problem of enumerating all maximum solutions. We do this for both the maximum common subtree problem and the block-and-bridge preserving maximum common induced subgraph problem between outerplanar graphs. A subproblem that arises, and which we analyze, is the enumeration of maximum weight matchings in bipartite graphs. For all common subgraph methods proposed in this thesis, we support a weight function on the vertices and edges; thus the objective is to compute a common subgraph of maximum weight. The weights may be integral or real-valued, including negative values. A special case of using such a weight function is computing common subgraph isomorphisms between labeled graphs, where labels of mapped vertices and edges must be equal. An experimental study evaluates the practical running times and the usefulness of our block-and-bridge preserving maximum common induced subgraph algorithm against state-of-the-art algorithms.
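    For the rooted, unit-weight special case, the classic dynamic program behind maximum common subtree computations pairs the children of mapped vertices via a maximum weight bipartite matching. The sketch below shows only this textbook recursion; the thesis treats the unrooted, weighted, and enumeration variants and improves the time bound beyond what this naive version achieves.

      import numpy as np
      from scipy.optimize import linear_sum_assignment

      def mcs_rooted(t1, r1, t2, r2):
          """Size of a largest common subtree of two rooted trees (adjacency given
          as parent -> children lists) when the mapping pairs the two roots and
          preserves parent-child relations."""
          def rec(u, v):
              cu, cv = t1.get(u, []), t2.get(v, [])
              if not cu or not cv:
                  return 1
              # weight[i][j] = best common subtree when child cu[i] is mapped to cv[j]
              weight = np.array([[rec(a, b) for b in cv] for a in cu])
              rows, cols = linear_sum_assignment(weight, maximize=True)
              return 1 + int(weight[rows, cols].sum())
          return rec(r1, r2)

      #       1            a
      #      / \          / \
      #     2   3        b   c
      #        / \           |
      #       4   5          d
      t1 = {1: [2, 3], 3: [4, 5]}
      t2 = {"a": ["b", "c"], "c": ["d"]}
      print(mcs_rooted(t1, 1, t2, "a"))  # 4: map 1->a, 2->b, 3->c, one of {4, 5}->d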
  • Item
    Parallel text index construction
    (2020) Kurpicz, Florian; Fischer, Johannes; Puglisi, Simon J.
    In this dissertation, we consider the parallel construction of text indices. Text indices provide additional information about a text that can speed up queries on it. One example are full-text indices, which are used for efficient phrase queries, i.e., to answer whether a phrase occurs in a text or not. This dissertation is mainly, but not exclusively, concerned with the parallel construction of text indices in shared and distributed memory. In the first part of the dissertation, we consider wavelet trees. These are compact indices that generalize rank and select queries from binary alphabets to alphabets of arbitrary size. In the second part, we consider the suffix array, the most thoroughly studied text index of all. The suffix array contains the starting positions of all lexicographically sorted suffixes of a text, i.e., we want to sort all suffixes of the text. The suffix array is often augmented with the longest common prefix array (LCP array), which contains the length of the longest common prefix of every pair of lexicographically consecutive suffixes. Finally, we use distributed suffix and LCP arrays to construct the distributed Patricia trie, which allows us to answer various phrase queries more efficiently than using the suffix array alone.
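    For orientation, the two central objects of the second part can be built naively in a few lines: sort all suffixes, then run Kasai's algorithm for the LCP array. The thesis is about constructing these structures in parallel and at scale, which this sequential sketch does not attempt.

      def suffix_array(s: str) -> list[int]:
          """Starting positions of all suffixes in lexicographic order (naive sort)."""
          return sorted(range(len(s)), key=lambda i: s[i:])

      def lcp_array(s: str, sa: list[int]) -> list[int]:
          """Kasai's algorithm: lcp[i] = longest common prefix of suffixes sa[i-1] and sa[i]."""
          n = len(s)
          rank = [0] * n
          for i, suf in enumerate(sa):
              rank[suf] = i
          lcp = [0] * n
          h = 0
          for i in range(n):
              if rank[i] > 0:
                  j = sa[rank[i] - 1]
                  while i + h < n and j + h < n and s[i + h] == s[j + h]:
                      h += 1
                  lcp[rank[i]] = h
                  if h:
                      h -= 1
              else:
                  h = 0
          return lcp

      s = "banana"
      sa = suffix_array(s)         # [5, 3, 1, 0, 4, 2] -> a, ana, anana, banana, na, nana
      print(sa, lcp_array(s, sa))  # lcp: [0, 1, 3, 0, 0, 2]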
  • Item
    Computing Lexicographic Parsings
    (2019-12-18) Köppl, Dominik
    We give a memory-friendly algorithm computing the compression scheme plcpcomp or lex-parse in linear or near-linear time.
  • Item
    Learning with graphs: kernel and neural approaches
    (2019) Morris, Christopher; Mutzel, Petra; Kersting, Kristian
    Linked data arise in many real-world settings, from chemical molecules and protein-protein interaction networks to social networks. Hence, in recent years, machine learning methods for graphs have become an important area of research. The present work deals with (supervised) graph classification, i.e., given a set of (class) labeled graphs, we want to train a model such that it can predict the labels of unlabeled graphs. To this end, we want to map a graph to a (usually) high-dimensional vector such that standard machine learning techniques can be applied. In the first part of this thesis, we present kernel methods for graphs. Specifically, we present a framework for scalable kernels that can handle continuous vertex and edge labels, which often arise with real-world graphs, e.g., chemical and physical properties of atoms. Moreover, we present a graph kernel which can take global or higher-order graph properties into account, usually not captured by other graph kernels. To this end, we propose a local version of the $k$-dimensional Weisfeiler-Leman algorithm, which is a well-known heuristic for the graph isomorphism problem. We show that a variant of our local algorithm has at least the same power as the original algorithm, while at the same time taking the sparsity of the underlying graph into account and preventing overfitting. To make this kernel scalable, we present an approximation algorithm with theoretical guarantees. Subsequently, we introduce a theoretical framework for the analysis of graph kernels, showing that most kernels are not capable of distinguishing simple graph-theoretic properties, and propose a theoretically well-founded graph kernel which can distinguish these properties. The second part of this thesis deals with neural approaches to graph classification and their connection to kernel methods. We show that the expressivity of so-called graph neural networks can be upper-bounded by the $1$-dimensional Weisfeiler-Leman graph isomorphism heuristic. We then leverage this insight to propose a generalization which is provably more powerful than graph neural networks in distinguishing non-isomorphic graphs.
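    The 1-dimensional Weisfeiler-Leman heuristic referenced above repeatedly recolours each vertex with a signature of its own colour and the multiset of its neighbours' colours; graphs whose stable colour histograms differ are certainly non-isomorphic. A minimal sketch follows; the thesis works with k-dimensional and local variants, which are not shown.

      from collections import Counter

      def wl_refine(adj, rounds=3):
          """1-WL colour refinement; adj maps a vertex to its list of neighbours."""
          colors = {v: 0 for v in adj}
          for _ in range(rounds):
              signatures = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                            for v in adj}
              palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
              colors = {v: palette[signatures[v]] for v in adj}
          return colors

      def wl_distinguishes(adj1, adj2, rounds=3):
          """Refine the disjoint union so colours are comparable, then compare histograms.
          True means the graphs are certainly non-isomorphic; False is inconclusive."""
          union = {("a", v): [("a", u) for u in ns] for v, ns in adj1.items()}
          union.update({("b", v): [("b", u) for u in ns] for v, ns in adj2.items()})
          colors = wl_refine(union, rounds)
          h1 = Counter(c for (g, _), c in colors.items() if g == "a")
          h2 = Counter(c for (g, _), c in colors.items() if g == "b")
          return h1 != h2

      triangle_plus_edge = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3]}
      path_of_five = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
      print(wl_distinguishes(triangle_plus_edge, path_of_five))  # True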
  • Item
    Seconds first!
    (2019) Brinkjost, Tobias; Mutzel, Petra; Rauh, Daniel
    The visualization of a protein is hardly imaginable without secondary structure elements (SSEs). But SSEs play a far more important role in the field of chemical biology, apart from the creation of fancy protein images. They are essential in structure-based analyses due to their impact on secondary structure prediction and structural protein alignment. However, their proper classification is a challenging issue. There are more than 30 tools available that underline the subjective character of the classification of SSEs. But only two tools have dominated this field of research for decades. Why is that? What are the advantages of hydrogen bond-based methods, despite the fact that they are often unable to assign left-handed helices, PPII helices, or bent structures? We have developed SCOT, a novel multipurpose software that incorporates the benefits of a multitude of approaches for the classification of helices, strands, and turns in proteins. To our knowledge, it is the very first method that not only captures a variety of rare and basic SSEs (right- and left-handed α-, 3₁₀-, and 2.2₇-helices, plus right-handed π-helices, PPII helices, and β-sheets) in protein structures, but also their irregularities in a single step, and provides proper output and visualization options. SCOT combines the benefits of geometry-based and hydrogen bond-based methods by using hydrogen bond and geometric information to gain insights into the structural space of proteins. Its dual character enables robust classifications of SSEs without major influence on the geometric regularity of the assigned SSEs. In consequence, it is perfectly suited to automatically assign SSEs for subsequent helix- and strand-based protein alignments with methods such as LOCK2. This is especially supported by our elaborate kink detection. All of these benefits are clearly demonstrated by our results. Together with the easy-to-use visualization of assignments by means of PyMOL scripts, SCOT enables a comprehensive analysis of regular backbone geometries in protein structures. The high number of available secondary structure assignment methods (SSAMs) hampers a straightforward selection of the most suitable one for a certain application. In addition, relying on the most frequently cited tool does not necessarily result in an optimal choice. Thus, we have developed SNOT to fill the gap of a tool that provides a multitude of objective and rational criteria for the comparison and evaluation of different SSAMs. It provides exhaustive information on geometrical parameters, residue statistics, the consistency with respect to protein flexibility and quality, SSE overlaps and sequence coverage, and the agreement between two classifications. We used SNOT to compare SCOT to DSSP, STRIDE, ASSP, SEGNO, DISICL, and SHAFT. The results point toward SCOT's unique features as a solitary multipurpose SSAM with optimal performance for numerous challenges: the support of commonly observed and rare SSEs, the comprehensive assignment of turn types, the elaborate kink detection, the geometric consistency of the SSEs, the robustness with respect to structure quality and protein flexibility, and its superior suitability for SSE-based protein structure alignments. Our analyses of alternative π- and PPII-helix assignments indicate challenges which we try to address with our methodology. There are also hints toward a correlation between the SSEs of a protein and its function.
    Koch and Waldmann proposed that similar arrangements of SSEs in the neighborhood of a ligand binding site (ligand-sensing cores) can recognize similar scaffolds in disregard of the overall fold. We have developed SLOT to discover these unrevealed similarities, which are solely based on SSEs. Its graph-based methodology is able to mimic the geometry of SSEs by using a flexible multi-point representation instead of a straight vector. These points are used to capture the geometry of a protein, i.e., the arrangement of SSEs, in distance matrices. This unique representation enables the comparison of protein structures regardless of their SSEs' directions or SSE sequence. An optimized algorithm to determine the MCS of two given graphs is used to calculate their structural similarity and to ensure fast runtimes. What sets SLOT further apart from its 40 competitors is that it can be used with any external secondary structure classification. Our exhaustive evaluation highlights the benefits of SCOT for the use with SLOT and also covers our optimizations of the comparison algorithm. It additionally questions the applicability of the concept of ligand-sensing cores. In the end, SCOT, SNOT, and SLOT were the beacons on our journey to answer the question: Are there similar ligand-sensing cores or undiscovered structural similarities solely based on SSEs?
  • Item
    Data-driven optimization of hot rolling processes
    (2019) Jung, Christian; Rudolph, Günter; Bartz-Beielstein, Thomas
    The rolling process is a complex real-world problem which requires adequate handling and analysis. During the production of strips, plates, or coils, a huge amount of data is generated which can be used to optimize the process. The process is described with so-called process models, which are regular software programs. It is shown in this thesis that data-driven surrogate-based optimization is an excellent method for improving even complex processes. The optimization procedure was tested with data of an aluminum roughing mill to optimize flow curve parameters for an unknown material. The tremendous improvement observed during the optimization was also validated with real process data. Residual prediction inaccuracies will still be observable for all kinds of analytical models, even after optimization. To reduce these residuals, online algorithms are used. The most promising candidate for all rolling datasets was the online SVR algorithm, which was extended to fit the requirements of real-world processes. These extensions included strategies for handling infinite datasets, the management of the stored data, and online parameter optimization. Furthermore, the thesis provides strategies for handling categorical variables. It is shown that all online algorithms outperform their offline variants and that the online SVR achieves the best results on these data. The developed algorithms are not limited to the hot rolling process and can be directly applied to other real-world processes.
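    The online-learning idea above (update the model as new process data arrive instead of refitting offline) can be sketched with scikit-learn; since scikit-learn offers no online SVR, SGDRegressor with an epsilon-insensitive loss and partial_fit serves here as a generic stand-in for the thesis's extended online SVR, and the simulated rolling-mill data are purely illustrative.

      import numpy as np
      from sklearn.linear_model import SGDRegressor
      from sklearn.preprocessing import StandardScaler

      rng = np.random.default_rng(1)
      scaler = StandardScaler()
      model = SGDRegressor(loss="epsilon_insensitive", epsilon=0.1)  # SVR-like loss

      def stream_of_batches(n_batches=50, batch_size=32, n_features=6):
          """Simulated process batches: features (temperatures, speeds, ...) -> target."""
          w = rng.normal(size=n_features)
          for _ in range(n_batches):
              X = rng.normal(size=(batch_size, n_features))
              y = X @ w + 0.1 * rng.normal(size=batch_size)  # noisy target
              yield X, y

      for i, (X, y) in enumerate(stream_of_batches()):
          Xs = scaler.partial_fit(X).transform(X)            # keep feature scaling up to date
          if i > 0:                                          # evaluate before updating
              error = np.mean(np.abs(model.predict(Xs) - y))
              if i % 10 == 0:
                  print(f"batch {i:2d}: mean absolute error {error:.3f}")
          model.partial_fit(Xs, y)                           # online update with the new batch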
  • Item
    More compact orthogonal drawings by allowing additional bends
    (2018-06-26) Jünger, Michael; Mutzel, Petra; Spisla, Christiane
    Compacting orthogonal drawings is a challenging task. Usually, algorithms try to compute drawings with small area or total edge length while preserving the underlying orthogonal shape. We suggest a moderate relaxation of the orthogonal compaction problem, namely the one-dimensional monotone flexible edge compaction problem with fixed vertex star geometry. We further show that this problem can be solved in polynomial time using a network flow model. An experimental evaluation shows that allowing additional bends can reduce the total edge length and the drawing area.
  • Item
    Bioinformatics from genetic variants to methylation
    (2018) Schröder, Christopher; Rahmann, Sven; Klau, Gunnar W.
    An important research topic in bioinformatics is the analysis of DNA, the molecule that encodes the genetic information of all organisms. The basis for this is sequencing, a procedure in which the sequence of DNA bases is determined. In addition to the identification of variations in the base sequence itself, advances in sequencing methods and a steady reduction in sequencing costs open up new fields of research: the analysis of functionally relevant non-base-related changes, so-called epigenetics. An important example of such a mechanism is DNA methylation, a process in which methyl groups are added to DNA without altering the sequence itself. Methylation takes place only at specific sites, and the methylation information of human DNA consists of approximately 30 million methylation levels between 0 and 1 in total. This thesis deals with problems and solutions for each phase of DNA methylation analysis. The most advanced method for detecting DNA methylation in terms of resolution is Whole-Genome Bisulfite Sequencing (WGBS), a technique that modifies DNA at unmethylated sites. We describe the special in-silico treatment required to process this altered DNA and existing concepts as well as newly developed bioinformatic methods for the efficient determination of DNA methylation levels and their further processing with our tool camel. A common downstream analysis step is the detection of differentially methylated regions (DMRs), for which we have implemented a modification of the widely used method BSmooth in order to deal with today's common data sizes. Setting up and creating new sequencing protocols, e.g., the mentioned WGBS, is complicated and requires adjustments to several parameters. We have developed a method based on a linear program (LP) that can predict the duplicate rate of supersamples. This critical quality measure represents the proportion of redundant data that in most cases needs to be removed from any further analysis. Using our method, it becomes possible to test, adjust, and improve parameters on small test libraries only and to estimate the duplication rate for potential full-size samples. Once the sequencing protocol has been established, the methylation recognition of camel can be used as part of automated workflows, such as our mosquito workflow. This pipeline processes the generated WGBS samples from the raw data to the degree of methylation, including all essential intermediate steps. Such workflows are one of the central components of bioinformatics, since the computations must be parallel, reproducible, and scalable. The distribution of the detected methylation levels, e.g., the values of several samples at a specific location, can often be described by a beta-mixture model. The standard approach for estimating the parameters of such a model, the EM algorithm, has problems with data points of exactly 0 or 1, which are very common as methylation levels. For this reason, we have developed an alternative algorithm based on moments that overcomes this disadvantage. It is robust for data points within the closed interval [0, 1] and can also be applied to similar datasets beyond methylation levels. This work deals not only with epigenetic but also with genetic variants. To analyze these, we present a second pipeline (ape) for data from targeted sequencing, where, for example, only genes are sequenced. The recognized variants then serve as input for our graphical environment eagle, a tool for computer scientists and geneticists to identify possible causal genetic variants. As the name implies, the configuration of the analysis and the presentation of the results are done via a graphical user interface. Unlike other tools, eagle is not based on databases, but on encapsulated HDF5 files. The use of this universal file-system-like data structure offers some advantages and makes the system easy to use, especially for non-computer scientists. At the end of the thesis, we use all presented methods for the detection, analysis, and characterization of interindividual DMRs between several donors. This leads to some computational challenges, because DMR detection is usually performed on two different groups. Our approach processes independent samples and calculates key metrics such as p-values and the number of undetectable DMRs. Through genome-wide association studies (GWAS) on more than 1000 array datasets of methylation and variants, we show that (interindividual) DMRs, as a subtype of epigenetics, are related to genetic variation.
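    The moment-based idea for beta-distributed methylation levels can be shown for a single component: match the sample mean and variance to the Beta(alpha, beta) moments and solve for the parameters. The sketch below covers one component only and is not the thesis's mixture algorithm; note that values of exactly 0 or 1 pose no problem for the moments themselves.

      from statistics import mean, pvariance

      def beta_method_of_moments(levels):
          """Estimate Beta(alpha, beta) parameters from methylation levels in [0, 1]
          by matching sample mean m and variance v:
              alpha = m * t,  beta = (1 - m) * t,  with t = m * (1 - m) / v - 1."""
          m = mean(levels)
          v = pvariance(levels, m)
          if v == 0 or v >= m * (1 - m):
              raise ValueError("sample moments incompatible with a Beta distribution")
          t = m * (1 - m) / v - 1
          return m * t, (1 - m) * t

      # Methylation levels of one CpG site across several samples, incl. an exact 0 and 1.
      levels = [0.0, 0.05, 0.12, 0.30, 0.45, 0.70, 0.88, 0.95, 1.0]
      print(beta_method_of_moments(levels))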
  • Item
    Uncertainty handling in surrogate assisted optimisation of games
    (2019) Volz, Vanessa; Rudolph, Günter; Preuss, Mike
    In this thesis, entitled Uncertainty handling in surrogate assisted optimisation of games, we started out with the goal to investigate the uncertainty in game optimisation problems, as well as to identify or develop suitable optimisation algorithms. In order to approach this problem systematically, we first created a benchmark consisting of suitable game optimisation functions (GBEA). The suitability of these functions was determined using a taxonomy that was created based on the results of a literature survey of automatic game evaluation approaches. In order to improve the interpretability of the results, we also implemented an experimental framework that adds several features aiding the analysis of the results, specifically for surrogate-assisted evolutionary algorithms. After describing potentially suitable algorithms, we proposed a promising algorithm (SAPEO), to be tested on the benchmark alongside state-of-the-art optimisation algorithms. SAPEO utilises the observation that most evolutionary algorithms need fitness evaluations only for survival selection. If the individuals in a population can be distinguished reliably based on predicted values, the number of function evaluations can therefore be reduced. After a theoretical analysis of the performance limits of SAPEO, which produced very promising insights, we conducted several sets of experiments in order to answer the three central hypotheses guiding this thesis. We find that SAPEO performs comparably to state-of-the-art surrogate-assisted algorithms, but all are frequently outperformed by stand-alone evolutionary algorithms. From a more detailed analysis of the behaviour of SAPEO, we identify a few pointers that could help to further improve its performance. Before running experiments on the developed benchmark, we first verify its suitability using a second set of experiments. We find that GBEA is practical and contains interesting and challenging functions. However, we also discover that, in order to produce interpretable results with the benchmark, a set of baseline results is required. Due to this issue, we are not able to produce meaningful results with the GBEA at the time of writing. However, once more experiments have been conducted with the benchmark, we will be able to interpret our results. The insights developed will most likely not only provide an assessment of optimisation algorithms, but can also be used to gain a deeper understanding of the characteristics of game optimisation problems.
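    A hedged sketch of the general mechanism exploited above: train a surrogate on all truly evaluated points and spend expensive evaluations only on the offspring the surrogate ranks best. The Gaussian-process surrogate, the population sizes, and the simple "evaluate the predicted-best half" rule are illustrative assumptions; SAPEO's uncertainty-based distinction of individuals is more refined.

      import numpy as np
      from sklearn.gaussian_process import GaussianProcessRegressor

      def sphere(x):                       # expensive objective (stand-in), minimized
          return float(np.sum(x ** 2))

      rng = np.random.default_rng(0)
      dim, parents, offspring_per_gen = 5, 8, 16

      # archive of all truly evaluated points
      X = rng.uniform(-3, 3, size=(parents, dim))
      y = np.array([sphere(x) for x in X])
      evaluations = len(y)

      for generation in range(20):
          pop = X[np.argsort(y)[:parents]]                       # current best parents
          offspring = pop[rng.integers(parents, size=offspring_per_gen)] \
              + rng.normal(scale=0.3, size=(offspring_per_gen, dim))  # Gaussian mutation

          surrogate = GaussianProcessRegressor().fit(X, y)
          predicted = surrogate.predict(offspring)
          promising = offspring[np.argsort(predicted)[: offspring_per_gen // 2]]

          new_y = np.array([sphere(x) for x in promising])       # only half are truly evaluated
          evaluations += len(new_y)
          X, y = np.vstack([X, promising]), np.concatenate([y, new_y])

      print(f"best value {y.min():.4f} after {evaluations} true evaluations")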
  • Item
    Surrogate models for discrete optimization problems
    (2018) Zaefferer, Martin; Rudolph, Günter; Bartz-Beielstein, Thomas
    Surrogate models are crucial tools for many real-world optimization problems. An optimization algorithm can evaluate a data-driven surrogate model instead of an expensive objective function. While surrogate models are well-established in the continuous optimization domain, they are less frequently applied to more complex search spaces with discrete or combinatorial solution representations. The main goal of this thesis is to fill this gap. We develop and improve methods for data-driven surrogate modeling in discrete search spaces. After an initial review of existing approaches, this work focuses on a similarity-based, or kernel-based, model: Kriging. The intuitive idea is to change the underlying kernel, thus adapting Kriging to arbitrary data types. However, the model is sensitive to the employed kernel. Hence, methods for combining or selecting kernels are required. To that end, this thesis discusses various methods based on concepts like correlation, likelihood, and cross-validation. Our experiments show that choosing or constructing the right kernel determines the success of the optimization algorithm. Another challenge is that kernels are often desired to be positive semi-definite (e.g., correlation functions) or conditionally negative semi-definite (distances). Analytical proofs of a kernel's definiteness are possible, yet not always feasible in practice. At the same time, these properties can have a significant effect on model accuracy. Hence, we propose a randomized, empirical search procedure to determine whether a kernel is definite. Once a kernel is determined to be indefinite, appropriate counter-measures have to be taken to avoid performance losses. Taking inspiration from the field of support vector learning, we use eigenspectrum transformations and related methods to correct the kernel matrices. We add to the state of the art by considering various repair mechanisms that are linked to additional requirements imposed by Kriging models. We test most of our methods with sets of simple benchmark problems. To extend this, we also investigate two problems that are more complex. Firstly, we consider solutions represented by tree structures, which are of interest in genetic programming. We discuss different kernels on tree structures and test them on symbolic regression tasks. Another more complex test case are hierarchical search spaces. Here, some variables are only active when other variables satisfy a condition. We design several new kernels for hierarchical search spaces and compare them to an existing kernel. Even with those more complex test cases, it remains unclear how performance estimates translate to real-world problems. To address this, we propose a data-driven test function generator based on Kriging simulation. In contrast to estimation, simulation may avoid smoothing the training data and is inherently suited to generate randomized test instances that reflect real-world behavior.
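    One repair mechanism from the family discussed above can be sketched in a few lines: check (semi-)definiteness empirically via the eigenvalues of the kernel matrix and, if necessary, clip the negative part of the spectrum. The sketch shows plain eigenvalue clipping only; the thesis compares several such transformations together with the additional requirements Kriging imposes.

      import numpy as np

      def is_positive_semidefinite(K: np.ndarray, tol: float = 1e-10) -> bool:
          """Empirical definiteness check on a symmetric kernel matrix."""
          return bool(np.min(np.linalg.eigvalsh(K)) >= -tol)

      def clip_spectrum(K: np.ndarray) -> np.ndarray:
          """Repair an indefinite kernel matrix by setting negative eigenvalues to zero."""
          eigval, eigvec = np.linalg.eigh(K)
          return (eigvec * np.clip(eigval, 0.0, None)) @ eigvec.T

      # A symmetric similarity matrix on discrete objects that happens to be indefinite.
      K = np.array([[1.0, 0.9, 0.1],
                    [0.9, 1.0, 0.9],
                    [0.1, 0.9, 1.0]])
      print(is_positive_semidefinite(K))          # False: one eigenvalue is negative
      K_fixed = clip_spectrum(K)
      print(is_positive_semidefinite(K_fixed))    # True after the repair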
  • Item
    K-best enumeration - theory and application
    (2017) Kurz, Denis; Mutzel, Petra; Chimani, Markus
    Where an optimal solution does not contain sufficient information about a given problem instance, enumerating good solutions is a common coping strategy. In combinatorial optimization, k-best enumeration, or ranking, has been studied and applied extensively. The k shortest simple path problem in directed, weighted graphs (kSSP), introduced in 1963 by Clarke, Krikorian and Rausen, is particularly well known. Efficient existing algorithms are based on Yen's algorithm for this problem; they all feature a worst-case running time of O(kn(m + n log n)) on a graph with n vertices and m edges. Vassilevska Williams and Williams show that a polynomially faster kSSP algorithm would also result in an algorithm for the all-pairs shortest path problem with running time O(n^{3-ε}), which seems unlikely at the moment. We present a kSSP algorithm that is not based on Yen's algorithm, matching the state-of-the-art running time of O(kn(m + n log n)). In particular, we do not solve a single instance of the replacement path problem, a basic building block of Yen's algorithm. Instead, we make use of ideas used in Eppstein's algorithm for a similar problem where paths are allowed to be non-simple. Using our algorithm, one may find Θ(m) simple paths with a single shortest path tree computation and additional O(m + n) time per path in well-behaved cases. We also propose practical improvements for our algorithm, comprising dynamic edge pruning, lazy shortest path tree computations, and fast path simplicity checks. In a detailed computational study, we demonstrate that well-behaved cases are quite common in random graphs, grids, and road networks. We showcase the usefulness of the dynamic edge pruning approach on those three graph classes and on network topology graphs. Despite 40 years of heavy research on Yen-based kSSP algorithms, including involved algorithm engineering, our algorithm proves to be faster than existing algorithms by at least an order of magnitude. However, there is not much room for improvement in the general worst case. We demonstrate that kSSP can be solved considerably faster if the input graph is restricted to have bounded treewidth. Specifically, we propose an algorithm template for enumerating the k best solutions in a bounded treewidth graph for any problem that can be expressed in counting monadic second-order logic. Our template is a generalization of Courcelle's theorem, mainly utilizing dynamic programming and a persistent heap data structure. It enumerates any constant number of solutions in time O(n). For general k, it requires O(log n) extra time per solution, resulting in a total running time of O(n + k log n). Finding the initial solution is parallelizable, so the linear term can be dropped and we obtain a running time of O(k log n) in the PRAM model. The class of problems expressible in counting monadic second-order logic contains kSSP, matching problems, and the spanning tree problem, but also a number of NP-hard problems like the traveling salesman problem, where we achieve a doubly-exponential speedup.
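    For context on the Yen-based state of the art discussed above, NetworkX ships a generator that yields simple paths in order of increasing length (based on Yen's algorithm); the toy graph below is an assumption for illustration, and the call does not reproduce the thesis's own algorithm or its engineering.

      from itertools import islice
      import networkx as nx

      G = nx.DiGraph()
      G.add_weighted_edges_from([
          ("s", "a", 1), ("a", "t", 1),          # s-a-t: length 2
          ("s", "b", 1), ("b", "t", 2),          # s-b-t: length 3
          ("a", "b", 1), ("b", "a", 1),          # detours creating further simple paths
          ("s", "t", 4),                         # direct edge: length 4
      ])

      k = 4
      for path in islice(nx.shortest_simple_paths(G, "s", "t", weight="weight"), k):
          length = nx.path_weight(G, path, weight="weight")
          print(length, path)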
  • Item
    Output-sensitive complexity of multiobjective combinatorial optimization with an application to the multiobjective shortest path problem
    (2018) Bökler, Friedrich Konstantin; Mutzel, Petra; Ehrgott, Matthias
    In this thesis, we are concerned with multiobjective combinatorial optimization (MOCO) problems. We attempt to fill the gap between theory and practice which comes from the lack of a deep complexity theory for MOCO problems. In the first part, we consider a complexity notion for multiobjective optimization problems which derives from output-sensitive complexity. The primary goal of such a complexity notion is to classify problems into easy and hard problems. We show that the multiobjective minimum cut problem as well as the multiobjective linear programming problem are easy problems in this complexity theory. We also show that finding extreme points of nondominated sets is an easy problem under certain assumptions. On the side of hard problems, there are obvious ones like the multiobjective traveling salesperson problem. Moreover, we show that the well-known multiobjective shortest path problem is a hard problem. This is also the case for the multiobjective matching and integer min-cost flow problems. We also identify a class of problems for which the biobjective unconstrained combinatorial optimization problem is a barrier for efficient solvability. In the second part, we are again concerned with the gap between theory and practice. This time, we approach the multiobjective shortest path (MOSP) problem from the practical side. For the application in the planning of power transmission lines, we need implementations which can cope with large graphs and a larger number of objectives. The results from the first part suggest that exact methods might be incapable of achieving this goal, which we also confirm empirically. This is why we study the literature on approximation algorithms for the MOSP problem. We conclude that they do not scale well with the number of objectives in general and that there are no practical implementations available. Hence, in the second part we develop a novel approximation algorithm which leans on the exact approaches that are well tested in practice. In an extensive computational study, we show that our implementation of this algorithm performs well even on a larger number of objectives. We compare our implementation to implementations of the other existing approximation algorithms and conclude that our implementation is superior on instances with more than three objectives.
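    The central object throughout, the nondominated (Pareto) set of objective vectors, can be computed naively for small instances as sketched below; the thesis is concerned with output-sensitive and approximate methods whose cost is measured against the size of exactly this set. The example vectors are illustrative.

      def nondominated(points):
          """Return the points not dominated by any other point (minimization in every
          objective); p dominates q if p <= q componentwise and p < q in at least one."""
          def dominates(p, q):
              return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

          return [p for p in points if not any(dominates(q, p) for q in points)]

      # Objective vectors of five s-t paths, e.g. (travel time, cost, environmental impact).
      paths = [(3, 10, 2), (4, 7, 2), (5, 7, 1), (6, 12, 3), (4, 9, 3)]
      print(nondominated(paths))  # [(3, 10, 2), (4, 7, 2), (5, 7, 1)]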