LS 11

Recent Submissions

Now showing 1 - 20 of 60
  • Item
    Efficient string algorithmics across alphabet realms
    (2024) Ellert, Jonas; Fischer, Johannes; Gawrychowski, Paweł
    Stringology is a subfield of computer science dedicated to analyzing and processing sequences of symbols. It plays a crucial role in various applications, including lossless compression, information retrieval, natural language processing, and bioinformatics. Recent algorithms often assume that the strings to be processed are over a polynomial integer alphabet, i.e., each symbol is an integer that is at most polynomial in the lengths of the strings. In contrast, the earlier days of stringology were shaped by the weaker comparison model, in which strings can only be accessed by mere equality comparisons of symbols or (if the symbols are totally ordered) order comparisons of symbols. Nowadays, these flavors of the comparison model are referred to as the general unordered alphabet and the general ordered alphabet, respectively. In this dissertation, we dive into the realm of both integer alphabets and general alphabets. We present new algorithms and lower bounds for classic problems, including Lempel-Ziv compression, computing the Lyndon array, and the detection of squares and runs. Our results show that, instead of only assuming the standard model of computation, it is important to also consider both weaker and stronger models. In particular, we should not discard the older and weaker comparison-based models too quickly, as they are not only powerful theoretical tools but also lead to fast and elegant practical solutions, even by today's standards.
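    As a point of reference for one of the objects above: the Lyndon array stores, for every text position, the length of the longest Lyndon word starting there. The quadratic-time Python sketch below only illustrates the definition; it is not one of the thesis's algorithms, which target (near-)linear time in the respective alphabet models.

      def is_lyndon(w: str) -> bool:
          """A nonempty string is a Lyndon word iff it is strictly smaller
          than every one of its proper suffixes."""
          return all(w < w[k:] for k in range(1, len(w)))

      def lyndon_array(s: str) -> list[int]:
          """lam[i] = length of the longest Lyndon word starting at position i.
          Naive reference implementation for illustration only."""
          n = len(s)
          lam = [1] * n
          for i in range(n):
              for l in range(1, n - i + 1):
                  if is_lyndon(s[i:i + l]):
                      lam[i] = l
          return lam

      print(lyndon_array("banana"))  # -> [1, 2, 1, 2, 1, 1]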
  • Item
    AAM: a dataset of Artificial Audio Multitracks for diverse music information retrieval tasks
    (2023-03-23) Ostermann, Fabian; Vatolkin, Igor; Ebeling, Martin
    We present a new dataset of 3000 artificial music tracks with rich annotations, based on real instrument samples and generated by algorithmic composition with respect to music theory. Our collection provides ground truth onset information and has several advantages compared to many available datasets. It can be used to compare and optimize algorithms for various music information retrieval tasks like music segmentation, instrument recognition, source separation, onset detection, key and chord recognition, or tempo estimation. As the audio is perfectly aligned to the original MIDIs, all annotations (onsets, pitches, instruments, keys, tempos, chords, beats, and segment boundaries) are absolutely precise. Because of that, specific scenarios can be addressed, for instance, detection of segment boundaries with instrument and key change only, or onset detection only in tracks with drums and slow tempo. This allows for the exhaustive evaluation and identification of individual weak points of algorithms. In contrast to datasets with commercial music, all audio tracks are freely available, allowing for the extraction of one's own audio features. All music pieces are stored as single instrument audio tracks and a mix track, so that different augmentations and DSP effects can be applied to extend training sets and create individual mixes, e.g., for deep neural networks. In three case studies, we show how different algorithms and neural network models can be analyzed and compared for music segmentation, instrument recognition, and onset detection. In the future, the dataset can easily be extended, taking specific demands on the composition process into account.
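    To illustrate the kind of evaluation the precise onset annotations enable, here is a small, hedged scoring sketch in Python; the ±50 ms tolerance and the greedy one-to-one matching rule are illustrative assumptions, not a protocol prescribed by the dataset.

      def onset_f_measure(reference, estimated, tolerance=0.05):
          """Greedily match estimated onsets (seconds) to unmatched reference
          onsets within +-tolerance seconds; return precision, recall, F."""
          ref = sorted(reference)
          est = sorted(estimated)
          matched = 0
          i = 0
          for onset in est:
              # skip reference onsets that lie too far in the past
              while i < len(ref) and ref[i] < onset - tolerance:
                  i += 1
              if i < len(ref) and abs(ref[i] - onset) <= tolerance:
                  matched += 1
                  i += 1  # each reference onset may be matched at most once
          precision = matched / len(est) if est else 0.0
          recall = matched / len(ref) if ref else 0.0
          f = 2 * precision * recall / (precision + recall) if matched else 0.0
          return precision, recall, f

      print(onset_f_measure([0.50, 1.00, 1.52], [0.48, 1.06, 1.50, 2.00]))
      # -> (0.5, 0.666..., 0.571...)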
  • Item
    Emergency corridor formation with autonomous vehicles on hardware with limited resources
    (2024) Kuzmic, Jurij; Rudolph, Günter; Mostaghim, Sanaz
    The focus of this dissertation is on the development of algorithms and software for low-power IoT devices, applied to the problem of forming an emergency corridor on motorways with autonomous vehicles. Severe, unforeseeable accidents occur on motorways time and again. In such situations, drivers sometimes forget to form an emergency corridor for police and rescue vehicles and obstruct them in reaching the accident site, even though every minute counts for those involved in the accident. Autonomous vehicles, in contrast, could execute the developed corridor-formation algorithms when a specific event occurs; the problem can thus be resolved by autonomous vehicles. Accordingly, the motivation of this research is to counteract the problem of forming an emergency corridor on motorways for police and rescue vehicles in the event of an accident by means of autonomous vehicles. To this end, two algorithms are investigated and subsequently tested. The algorithms are evaluated with a simulator developed as part of this work. These algorithms are then verified in a real environment and not only in simulation: to address the real-world problem, real autonomous model vehicles are developed and tested on a test track. A further motivation is to keep the budget as low as possible and, for the sake of the environment, to save power and money. This is why the focus lies on developing software for low-power IoT devices. This raises the following questions: Which of the developed algorithms and methods can be executed on low-power IoT hardware at all? What runtimes can be achieved on these IoT devices? Can real-time applications also be implemented on the IoT hardware used? These questions are answered by the various experiments of this research. The topic of limited resources is not only of interest for model vehicles; nowadays, limited resources are also found in drones, smart surveillance cameras, wildlife cameras, and other IoT devices.
  • Item
    Memory-based deep reinforcement learning in endless imperfect information games
    (2023) Pleines, Marco; Rudolph, Günter; Preuss, Mike
    Memory capabilities in Deep Reinforcement Learning (DRL) agents have become increasingly crucial, especially in tasks characterized by partial observability or imperfect information. However, the field faces two significant challenges: the absence of a universally accepted benchmark and limited access to open-source baseline implementations. We present "Memory Gym", a novel benchmark suite encompassing both finite and endless versions of the Mortar Mayhem, Mystery Path, and Searing Spotlights environments. The finite tasks emphasize strong dependencies on memory and memory interactions, while the remarkable endless tasks, inspired by the game "I packed my bag", act as an automatic curriculum, progressively challenging an agent's retention and recall capabilities. To complement this benchmark, we provide two comprehensible and open-source baselines anchored on the widely-adopted Proximal Policy Optimization algorithm. The first employs a recurrent mechanism through a Gated Recurrent Unit (GRU) cell, while the second adopts an attention-based approach using Transformer-XL (TrXL) for episodic memory with a sliding window. Given the dearth of readily available transformer-based DRL implementations, our TrXL baseline offers significant value. Our results reveal an intriguing performance dynamic: TrXL is often superior in finite tasks, but in the endless environments, GRU unexpectedly marks a comeback. This discrepancy prompts further investigation into TrXL's potential limitations, including whether its initial query misses temporal cues, the impact of stale hidden states, and the intricacies of positional encoding.
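    As a hedged illustration of the recurrent baseline idea (a memory state carried across time steps, feeding actor and critic heads), consider the minimal PyTorch sketch below; the layer sizes and structure are assumptions for illustration, not the authors' published architecture.

      import torch
      import torch.nn as nn
      from torch.distributions import Categorical

      class RecurrentActorCritic(nn.Module):
          """Minimal GRU-based policy core: observation -> encoder -> GRUCell -> heads."""
          def __init__(self, obs_dim: int, num_actions: int, hidden: int = 128):
              super().__init__()
              self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
              self.memory = nn.GRUCell(hidden, hidden)
              self.actor = nn.Linear(hidden, num_actions)
              self.critic = nn.Linear(hidden, 1)

          def forward(self, obs: torch.Tensor, h: torch.Tensor):
              x = self.encoder(obs)
              h = self.memory(x, h)  # hidden state carries memory across steps
              return Categorical(logits=self.actor(h)), self.critic(h).squeeze(-1), h

      # One step of interaction with a batch of 4 environments, 16-dim observations.
      policy = RecurrentActorCritic(obs_dim=16, num_actions=5)
      h = torch.zeros(4, 128)
      dist, value, h = policy(torch.randn(4, 16), h)
      action = dist.sample()  # shape: (4,)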
  • Item
    Automatic online algorithm selection for optimization in cyber-physical production systems
    (2023) Fischbach, Andreas; Rudolph, Günter; Bartz-Beielstein, Thomas
    Shrinking product lifecycles, progressing market penetration of innovative product technologies, and increasing demand for product individualization lead to frequent adjustments of production processes and thus to an increasing demand for frequent optimization of production processes. Offline solutions are not always available, and even the optimization problem class itself may have changed in terms of the value landscape of the objective function: Parameters may have been added, the locations of optimal values and the values themselves may have changed. This thesis develops an automatic solution to the algorithm selection problem for continuous optimization. Furthermore, based on the evaluation of three different real-world use cases and a review of well-known architectures from the field of automation and cognitive science, a system architecture suitable for use in large data scenarios was developed. The developed architecture has been implemented and evaluated on two real-world problems: A Versatile Production System (VPS) and Injection Molding Optimization (IM). The developed solution for the VPS was able to automatically tune the feasible algorithms and select the most promising candidate, which significantly outperformed the competitors. This was evaluated by applying statistical tests based on the generated test instances using the process data and by performing benchmark experiments. This solution was extended to the area of multi-objective optimization for the IM use case by specifying an appropriate algorithm portfolio and selecting a suitable performance metric to automatically compare the algorithms. This allows the automatic optimization of three largely uncorrelated objectives: cycle time, average volume shrinkage, and maximum warpage of the parts to be produced. The extension to multi-objective handling for IM optimization showed a huge benefit in terms of manual implementation effort, as most of the work could be done by configuration. The implementation effort was reduced to selecting optimizers and hypervolume computation.
  • Item
    Graph set data mining
    (2023) Schäfer, Till; Mutzel, Petra; Buchin, Kevin
    Graphs are among the most versatile abstract data types in computer science. With the variety comes great adoption in various application fields, such as chemistry, biology, social analysis, logistics, and computer science itself. With the growing capacities of digital storage, the collection of large amounts of data has become the norm in many application fields. Data mining, i.e., the automated extraction of non-trivial patterns from data, is a key step to extract knowledge from these datasets and generate value. This thesis is dedicated to concurrent scalable data mining algorithms beyond traditional notions of efficiency for large-scale datasets of small labeled graphs; more precisely, structural clustering and representative subgraph pattern mining. It is motivated by, but not limited to, the need to analyze molecular libraries of ever-increasing size in the drug discovery process. Structural clustering makes use of graph theoretical concepts, such as (common) subgraph isomorphisms and frequent subgraphs, to model cluster commonalities directly in the application domain. It is considered computationally demanding for non-restricted graph classes and with very few exceptions prior algorithms are only suitable for very small datasets. This thesis discusses the first truly scalable structural clustering algorithm StruClus with linear worst-case complexity. At the same time, StruClus embraces the inherent values of structural clustering algorithms, i.e., interpretable, consistent, and high-quality results. A novel two-fold sampling strategy with stochastic error bounds for frequent subgraph mining is presented. It enables fast extraction of cluster commonalities in the form of common subgraph representative sets. StruClus is the first structural clustering algorithm with a directed selection of structural cluster-representative patterns regarding homogeneity and separation aspects in the high-dimensional subgraph pattern space. Furthermore, a novel concept of cluster homogeneity balancing using dynamically-sized representatives is discussed. The second part of this thesis discusses the representative subgraph pattern mining problem in more general terms. A novel objective function maximizes the number of represented graphs for a cardinality-constrained representative set. It is shown that the problem is a special case of the maximum coverage problem and is NP-hard. Based on the greedy approximation of Nemhauser, Wolsey, and Fisher for submodular set function maximization a novel sampling approach is presented. It mines candidate sets that contain an optimal greedy solution with a probabilistic maximum error. This leads to a constant-time algorithm to generate the candidate sets given a fixed-size sample of the dataset. In combination with a cheap single-pass streaming evaluation of the candidate sets, this enables scalability to datasets with billions of molecules on a single machine. Ultimately, the sampling approach leads to the first distributed subgraph pattern mining algorithm that distributes the pattern space and the dataset graphs at the same time.
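    The link to maximum coverage drawn above can be illustrated with the standard greedy rule: repeatedly pick the pattern that covers the most not-yet-represented graphs. The sketch below shows this generic Nemhauser-Wolsey-Fisher-style greedy step only, not the thesis's sampling-based candidate generation.

      def greedy_representatives(coverage, k):
          """coverage maps a candidate subgraph pattern to the set of graph ids
          it occurs in. Greedily pick up to k patterns, each time maximizing the
          number of newly covered graphs (the classic (1 - 1/e)-approximation
          for maximum coverage)."""
          covered = set()
          chosen = []
          candidates = dict(coverage)
          while candidates and len(chosen) < k:
              best = max(candidates, key=lambda p: len(candidates[p] - covered))
              gain = candidates.pop(best) - covered
              if not gain:
                  break  # no remaining pattern adds a new graph
              chosen.append(best)
              covered |= gain
          return chosen

      coverage = {"C-C-O": {1, 2, 3}, "C=O": {3, 4}, "N-C": {2, 5}, "C-C": {1, 2, 3, 4}}
      print(greedy_representatives(coverage, 2))  # ['C-C', 'N-C'] covers all five graphs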
  • Item
    Analysis and application of hash-based similarity estimation techniques for biological sequence analysis
    (2021) Timm, Henning; Rahmann, Sven; Mosig, Axel
    In bioinformatics, a large group of problems requires the computation or estimation of sequence similarity. However, the analysis of biological sequence data faces, among many others, three major challenges: a large amount of generated data, technology-specific errors (that can be mistaken for biological signals), and the need to analyze the data without access to a reference genome. Through the use of locality sensitive hashing methods, both the efficient estimation of sequence similarity and tolerance against the errors specific to biological data can be achieved. We developed a variant of the winnowing algorithm for local minimizer computation, which is specifically geared to deal with repetitive regions within biological sequences. By compressing redundant information, we can both reduce the size of the hash tables required to store minimizer sketches and reduce the amount of redundant low-quality alignment candidates. Analyzing the distribution of segment lengths generated by this approach, we can better judge the size of the required data structures, as well as identify hash functions feasible for this technique. Our evaluation verified that simple and fast hash functions, even when using small hash value spaces (hash functions with a small codomain), are sufficient to compute compressed minimizers and perform comparably to uniformly randomly chosen hash values. We also outlined an index for a taxonomic protein database using multiple compressed winnowings to identify alignment candidates. To store MinHash values, we present a cache-optimized implementation of a hash table using Hopscotch hashing to resolve collisions. As a biological application of similarity-based analysis, we describe the analysis of double digest restriction site associated DNA sequencing (ddRADseq). We implemented a simulation software able to model the biological and technological influences of this technology to allow better development and testing of ddRADseq analysis software. Using datasets generated by our software, as well as data obtained from population genetic experiments, we developed an analysis workflow for ddRADseq data based on the Stacks software. Since the quality of the results generated by Stacks strongly depends on how well the used parameters are adapted to the specific dataset, we developed a Snakemake workflow that automates preprocessing tasks while also allowing the automatic exploration of different parameter sets. As part of this workflow, we developed a PCR deduplication approach able to generate consensus reads incorporating the base quality values (as reported by the sequencing device) without performing an alignment first. As an outlook, we outline a MinHashing approach that can be used for faster and more robust clustering, while addressing incomplete digestion and null alleles, two effects specific to ddRADseq that current analysis tools cannot reliably detect.
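    To make the minimizer idea concrete: within every window of w consecutive k-mers, the k-mer with the smallest hash value is selected, so neighbouring windows usually agree on their sample. The sketch below is plain winnowing with a stable hash; the thesis's variant additionally compresses the runs caused by repetitive regions, which is not shown here.

      import hashlib

      def stable_hash(kmer: str) -> int:
          """Deterministic 64-bit hash (Python's built-in str hash is salted per process)."""
          return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")

      def minimizers(seq: str, k: int = 5, w: int = 4) -> set[tuple[int, str]]:
          """Return (position, k-mer) pairs: the hash-minimal k-mer of every window
          of w consecutive k-mers (ties broken by leftmost position)."""
          kmers = [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]
          picked = set()
          for start in range(len(kmers) - w + 1):
              window = kmers[start:start + w]
              picked.add(min(window, key=lambda ik: (stable_hash(ik[1]), ik[0])))
          return picked

      print(sorted(minimizers("ACGTACGTACGTTTTT")))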
  • Item
    Reconsideration and extension of Cartesian genetic programming
    (2021) Kalkreuth, Roman Tobias; Rudolph, Günter; Kaufmann, Paul
    This dissertation aims at analyzing fundamental concepts and dogmas of a graph-based genetic programming approach called Cartesian Genetic Programming (CGP) and introduces advanced genetic operators for CGP. The results of the experiments presented in this thesis lead to more knowledge about the algorithmic use of CGP and its underlying working mechanisms. CGP has mostly been used with a parametrization pattern that has been prematurely generalized as the most efficient pattern for standard CGP and its variants. Several parametrization patterns are evaluated with more detailed and comprehensive experiments by using meta-optimization. This thesis also presents a first runtime analysis of CGP. The time complexity of a simple (1+1)-CGP algorithm is analyzed on a simple mathematical problem and a simple Boolean function problem. In the subfield of genetic operators for CGP, new recombination and mutation techniques that work on a phenotypic level are presented. The effectiveness of these operators is demonstrated on a widespread set of popular benchmark problems. Especially the role of recombination can be seen as a big open question in the field of CGP, since the lack of an effective recombination operator limits CGP to mutation-only use. Phenotypic exploration analysis is used to analyze the effects caused by the presented operators. This type of analysis also leads to new insights into the search behavior of CGP in continuous and discrete fitness spaces. Overall, the outcome of this thesis leads to a reconsideration of how CGP is effectively used and extends its adaptation of Darwin's and Lamarck's theories of biological evolution.
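    For readers unfamiliar with CGP, the toy (1+1)-CGP sketch below evolves a Boolean circuit for XOR; the genome is a fixed list of nodes (one function gene plus two connection genes each) and a single output gene. The function set, mutation rate, and node count are illustrative assumptions and do not reproduce the parametrization patterns or operators studied in the thesis.

      import random
      random.seed(3)

      N_INPUTS, N_NODES = 2, 10
      FUNCS = [lambda a, b: a and b, lambda a, b: a or b,
               lambda a, b: not a, lambda a, b: a != b]  # AND, OR, NOT, XOR

      def random_node(index):
          """Node = [function id, two connection genes pointing to earlier nodes/inputs]."""
          return [random.randrange(len(FUNCS)),
                  random.randrange(index), random.randrange(index)]

      def random_genome():
          nodes = [random_node(N_INPUTS + i) for i in range(N_NODES)]
          output = random.randrange(N_INPUTS + N_NODES)
          return nodes, output

      def evaluate(genome, inputs):
          nodes, output = genome
          values = list(inputs)
          for f, a, b in nodes:            # feed-forward evaluation of the node grid
              values.append(FUNCS[f](values[a], values[b]))
          return values[output]

      def fitness(genome):
          """Number of correct rows of the XOR truth table."""
          table = [((a, b), a != b) for a in (0, 1) for b in (0, 1)]
          return sum(evaluate(genome, x) == y for x, y in table)

      def mutate(genome, rate=0.1):
          nodes, output = [list(n) for n in genome[0]], genome[1]
          for i in range(len(nodes)):
              if random.random() < rate:
                  nodes[i] = random_node(N_INPUTS + i)
          if random.random() < rate:
              output = random.randrange(N_INPUTS + N_NODES)
          return nodes, output

      parent = random_genome()
      for _ in range(2000):                # (1+1)-CGP: accept offspring if not worse
          child = mutate(parent)
          if fitness(child) >= fitness(parent):
              parent = child
      print("fitness:", fitness(parent), "/ 4")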
  • Item
    Tree comparison: enumeration and application to cheminformatics
    (2021) Droschinsky, Andre; Mutzel, Petra; Fischer, Johannes
    Graphs are a well-known data structure used in many application domains that rely on relationships between individual entities. Examples are social networks, where the users may be in friendship with each other, road networks, where one-way or bidirectional roads connect crossings, and work package assignments, where workers are assigned to tasks. In chem- and bioinformatics, molecules are often represented as molecular graphs, where vertices represent atoms and bonds between them are represented by edges connecting the vertices. Since there is an ever-increasing amount of data that can be treated as graphs, fast algorithms are needed to compare such graphs. A well-researched concept to compare two graphs is the maximum common subgraph. On the one hand, this allows finding substructures that are common to both input graphs. On the other hand, we can derive a similarity score from the maximum common subgraph. A practical application is rational drug design, which involves molecular similarity searches. In this thesis, we study the maximum common subgraph problem, which entails finding a largest graph that is isomorphic to subgraphs of two input graphs. We focus on restrictions that allow polynomial-time algorithms with a low exponent. An example is the maximum common subtree of two input trees. We succeed in improving the previously best-known time bound. Additionally, we provide a lower time bound under certain assumptions. We study a generalization of the maximum common subtree problem, the block-and-bridge preserving maximum common induced subgraph problem between outerplanar graphs. This problem is motivated by the application to cheminformatics. First, the vast majority of drugs modeled as molecular graphs are outerplanar, and second, the blocks correspond to the ring structures and the bridges to atom chains or linkers. If we allow disconnected common subgraphs, the problem becomes NP-hard even for trees as input. We propose a second generalization of the maximum common subtree problem, which allows skipping vertices in the input trees while maintaining polynomial running time. Since a maximum common subgraph is not unique in general, we investigate the problem of enumerating all maximum solutions. We do this for both the maximum common subtree problem and the block-and-bridge preserving maximum common induced subgraph problem between outerplanar graphs. A subproblem that arises, and which we analyze, is the enumeration of maximum weight matchings in bipartite graphs. For all common subgraph methods proposed in this thesis, we support a weight function on the vertices and edges; thus the objective is to compute a common subgraph of maximum weight. The weights may be integral or real-valued, including negative values. A special case of using such a weight function is computing common subgraph isomorphisms between labeled graphs, where labels of mapped vertices and edges must be equal. An experimental study evaluates the practical running times and the usefulness of our block-and-bridge preserving maximum common induced subgraph algorithm against state-of-the-art algorithms.
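    For the rooted, unit-weight special case, the classic dynamic program behind maximum common subtree computations pairs the children of mapped vertices via a maximum weight bipartite matching. The sketch below shows only this textbook recursion; the thesis treats the unrooted, weighted, and enumeration variants and improves the time bound beyond what this naive version achieves.

      import numpy as np
      from scipy.optimize import linear_sum_assignment

      def mcs_rooted(t1, r1, t2, r2):
          """Size of a largest common subtree of two rooted trees (adjacency given
          as parent -> children lists) when the mapping pairs the two roots and
          preserves parent-child relations."""
          def rec(u, v):
              cu, cv = t1.get(u, []), t2.get(v, [])
              if not cu or not cv:
                  return 1
              # weight[i][j] = best common subtree when child cu[i] is mapped to cv[j]
              weight = np.array([[rec(a, b) for b in cv] for a in cu])
              rows, cols = linear_sum_assignment(weight, maximize=True)
              return 1 + int(weight[rows, cols].sum())
          return rec(r1, r2)

      #       1            a
      #      / \          / \
      #     2   3        b   c
      #        / \           |
      #       4   5          d
      t1 = {1: [2, 3], 3: [4, 5]}
      t2 = {"a": ["b", "c"], "c": ["d"]}
      print(mcs_rooted(t1, 1, t2, "a"))  # 4: map 1->a, 2->b, 3->c, one of {4, 5}->d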
  • Item
    Parallel text index construction
    (2020) Kurpicz, Florian; Fischer, Johannes; Puglisi, Simon J.
    In this dissertation, we consider the parallel construction of text indices. Text indices provide additional information about a text that can speed up queries on it. One example are full-text indices, which are used for efficient phrase queries, i.e., to answer whether a phrase occurs in a text or not. This dissertation is mainly, but not exclusively, concerned with the parallel construction of text indices in shared and distributed memory. In the first part of the dissertation, we consider wavelet trees. These are compact indices that generalize rank and select queries from binary alphabets to alphabets of arbitrary size. In the second part, we consider the suffix array, the most thoroughly studied text index of all. The suffix array contains the starting positions of all lexicographically sorted suffixes of a text, i.e., we want to sort all suffixes of the text. The suffix array is often augmented with the longest common prefix array (LCP array), which contains the length of the longest common prefix of every pair of lexicographically consecutive suffixes. Finally, we use distributed suffix and LCP arrays to construct the distributed Patricia trie, which allows us to answer various phrase queries more efficiently than using the suffix array alone.
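    For orientation, the two central objects of the second part can be built naively in a few lines: sort all suffixes, then run Kasai's algorithm for the LCP array. The thesis is about constructing these structures in parallel and at scale, which this sequential sketch does not attempt.

      def suffix_array(s: str) -> list[int]:
          """Starting positions of all suffixes in lexicographic order (naive sort)."""
          return sorted(range(len(s)), key=lambda i: s[i:])

      def lcp_array(s: str, sa: list[int]) -> list[int]:
          """Kasai's algorithm: lcp[i] = longest common prefix of suffixes sa[i-1] and sa[i]."""
          n = len(s)
          rank = [0] * n
          for i, suf in enumerate(sa):
              rank[suf] = i
          lcp = [0] * n
          h = 0
          for i in range(n):
              if rank[i] > 0:
                  j = sa[rank[i] - 1]
                  while i + h < n and j + h < n and s[i + h] == s[j + h]:
                      h += 1
                  lcp[rank[i]] = h
                  if h:
                      h -= 1
              else:
                  h = 0
          return lcp

      s = "banana"
      sa = suffix_array(s)         # [5, 3, 1, 0, 4, 2] -> a, ana, anana, banana, na, nana
      print(sa, lcp_array(s, sa))  # lcp: [0, 1, 3, 0, 0, 2]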
  • Item
    Computing Lexicographic Parsings
    (2019-12-18) Köppl, Dominik
    We give a memory-friendly algorithm computing the compression scheme plcpcomp or lex-parse in linear or near-linear time.
  • Item
    Learning with graphs: kernel and neural approaches
    (2019) Morris, Christopher; Mutzel, Petra; Kersting, Kristian
    Linked data arise in many real-world settings, from chemical molecules and protein-protein interaction networks to social networks. Hence, in recent years, machine learning methods for graphs have become an important area of research. The present work deals with (supervised) graph classification, i.e., given a set of (class) labeled graphs, we want to train a model such that it can predict the labels of unlabeled graphs. To this end, we want to map a graph to a (usually) high-dimensional vector such that standard machine learning techniques can be applied. In the first part of this thesis, we present kernel methods for graphs. Specifically, we present a framework for scalable kernels that can handle continuous vertex and edge labels, which often arise with real-world graphs, e.g., chemical and physical properties of atoms. Moreover, we present a graph kernel which can take global or higher-order graph properties into account, usually not captured by other graph kernels. To this end, we propose a local version of the $k$-dimensional Weisfeiler-Leman algorithm, which is a well-known heuristic for the graph isomorphism problem. We show that a variant of our local algorithm has at least the same power as the original algorithm, while at the same time taking the sparsity of the underlying graph into account and preventing overfitting. To make this kernel scalable, we present an approximation algorithm with theoretical guarantees. Subsequently, we introduce a theoretical framework for the analysis of graph kernels, showing that most kernels are not capable of distinguishing simple graph-theoretic properties, and propose a theoretically well-founded graph kernel which can distinguish these properties. The second part of this thesis deals with neural approaches to graph classification and their connection to kernel methods. We show that the expressivity of so-called graph neural networks can be upper-bounded by the $1$-dimensional Weisfeiler-Leman graph isomorphism heuristic. We then leverage this insight to propose a generalization which is provably more powerful than graph neural networks in distinguishing non-isomorphic graphs.
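    The 1-dimensional Weisfeiler-Leman heuristic referenced above repeatedly recolours each vertex with a signature of its own colour and the multiset of its neighbours' colours; graphs whose stable colour histograms differ are certainly non-isomorphic. A minimal sketch follows; the thesis works with k-dimensional and local variants, which are not shown.

      from collections import Counter

      def wl_refine(adj, rounds=3):
          """1-WL colour refinement; adj maps a vertex to its list of neighbours."""
          colors = {v: 0 for v in adj}
          for _ in range(rounds):
              signatures = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                            for v in adj}
              palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
              colors = {v: palette[signatures[v]] for v in adj}
          return colors

      def wl_distinguishes(adj1, adj2, rounds=3):
          """Refine the disjoint union so colours are comparable, then compare histograms.
          True means the graphs are certainly non-isomorphic; False is inconclusive."""
          union = {("a", v): [("a", u) for u in ns] for v, ns in adj1.items()}
          union.update({("b", v): [("b", u) for u in ns] for v, ns in adj2.items()})
          colors = wl_refine(union, rounds)
          h1 = Counter(c for (g, _), c in colors.items() if g == "a")
          h2 = Counter(c for (g, _), c in colors.items() if g == "b")
          return h1 != h2

      triangle_plus_edge = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3]}
      path_of_five = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
      print(wl_distinguishes(triangle_plus_edge, path_of_five))  # True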
  • Item
    Seconds first!
    (2019) Brinkjost, Tobias; Mutzel, Petra; Rauh, Daniel
    The visualization of a protein is hardly imaginable without secondary structure elements (SSEs). But SSEs play a far more important role in the field of chemical biology, apart from the creation of fancy protein images. They are essential in structure-based analyses due to their impact on secondary structure prediction and structural protein alignment. However, their proper classification is a challenging issue. There are more than 30 tools available that underline the subjective character of the classification of SSEs. But only two tools have dominated this field of research for decades. Why is that? What are the advantages of hydrogen bond-based methods, despite the fact that they are often unable to assign left-handed helices, PPII helices, or bent structures? We have developed SCOT, a novel multipurpose software that incorporates the benefits of a multitude of approaches for the classification of helices, strands, and turns in proteins. To our knowledge, it is the very first method that not only captures a variety of rare and basic SSEs (right- and left-handed α-, 3₁₀-, and 2.2₇-helices, plus right-handed π-helices, PPII helices, and β-sheets) in protein structures, but also their irregularities in a single step, and provides proper output and visualization options. SCOT combines the benefits of geometry-based and hydrogen bond-based methods by using hydrogen bond and geometric information to gain insights into the structural space of proteins. Its dual character enables robust classifications of SSEs without major influence on the geometric regularity of the assigned SSEs. In consequence, it is perfectly suited to automatically assign SSEs for subsequent helix- and strand-based protein alignments with methods such as LOCK2. This is especially supported by our elaborate kink detection. All of these benefits are clearly demonstrated by our results. Together with the easy-to-use visualization of assignments by means of PyMOL scripts, SCOT enables a comprehensive analysis of regular backbone geometries in protein structures. The high number of available secondary structure assignment methods (SSAMs) hampers a straightforward selection of the most suitable one for a certain application. In addition, relying on the most frequently cited tool does not necessarily result in an optimal choice. Thus, we have developed SNOT to fill the gap of a tool that provides a multitude of objective and rational criteria for the comparison and evaluation of different SSAMs. It provides exhaustive information on geometrical parameters, residue statistics, the consistency with respect to protein flexibility and quality, SSE overlaps and sequence coverage, and the agreement between two classifications. We used SNOT to compare SCOT to DSSP, STRIDE, ASSP, SEGNO, DISICL, and SHAFT. The results point toward SCOT's unique features as a solitary multipurpose SSAM with optimal performance for numerous challenges: the support of commonly observed and rare SSEs, the comprehensive assignment of turn types, the elaborate kink detection, the geometric consistency of the SSEs, the robustness with respect to structure quality and protein flexibility, and its superior suitability for SSE-based protein structure alignments. Our analyses of alternative π- and PPII-helix assignments indicate challenges which we try to address with our methodology. There are also hints toward a correlation between the SSEs of a protein and its function.
    Koch and Waldmann proposed that similar arrangements of SSEs in the neighborhood of a ligand binding site (ligand-sensing cores) can recognize similar scaffolds in disregard of the overall fold. We have developed SLOT to discover these unrevealed similarities, which are solely based on SSEs. Its graph-based methodology is able to mimic the geometry of SSEs by using a flexible multi-point representation instead of a straight vector. These points are used to capture the geometry of a protein, i.e., the arrangement of SSEs, in distance matrices. This unique representation enables the comparison of protein structures regardless of their SSEs' directions or SSE sequence. An optimized algorithm to determine the MCS of two given graphs is used to calculate their structural similarity and to ensure fast runtimes. What sets SLOT further apart from its 40 competitors is that it can be used with any external secondary structure classification. Our exhaustive evaluation highlights the benefits of SCOT for the use with SLOT and also covers our optimizations of the comparison algorithm. It additionally questions the applicability of the concept of ligand-sensing cores. In the end, SCOT, SNOT, and SLOT were the beacons on our journey to answer the question: Are there similar ligand-sensing cores or undiscovered structural similarities solely based on SSEs?
  • Item
    Data-driven optimization of hot rolling processes
    (2019) Jung, Christian; Rudolph, Günter; Bartz-Beielstein, Thomas
    The rolling process is a complex real-world problem which requires adequate handling and analysis. During the production of strips, plates, or coils, a huge amount of data is generated which can be used to optimize the process. The process is described with so-called process models, which are regular software programs. It is shown in this thesis that data-driven surrogate-based optimization is an excellent method for improving even complex processes. The optimization procedure was tested with data of an aluminum roughing mill to optimize flow curve parameters for an unknown material. The tremendous improvement observed during the optimization was also validated with real process data. Residual prediction inaccuracies will still be observable for all kinds of analytical models, even after optimization. To reduce these residuals, online algorithms are used. The most promising candidate for all rolling datasets was the online SVR algorithm, which was extended to fit the requirements of real-world processes. These extensions included strategies for handling infinite datasets, the management of the stored data, and online parameter optimization. Furthermore, the thesis provides strategies for handling categorical variables. It is shown that all online algorithms outperform their offline variants and that the online SVR achieves the best results on these data. The developed algorithms are not limited to the hot rolling process and can be directly applied to other real-world processes.
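    The online-learning idea above (update the model as new process data arrive instead of refitting offline) can be sketched with scikit-learn; since scikit-learn offers no online SVR, SGDRegressor with an epsilon-insensitive loss and partial_fit serves here as a generic stand-in for the thesis's extended online SVR, and the simulated rolling-mill data are purely illustrative.

      import numpy as np
      from sklearn.linear_model import SGDRegressor
      from sklearn.preprocessing import StandardScaler

      rng = np.random.default_rng(1)
      scaler = StandardScaler()
      model = SGDRegressor(loss="epsilon_insensitive", epsilon=0.1)  # SVR-like loss

      def stream_of_batches(n_batches=50, batch_size=32, n_features=6):
          """Simulated process batches: features (temperatures, speeds, ...) -> target."""
          w = rng.normal(size=n_features)
          for _ in range(n_batches):
              X = rng.normal(size=(batch_size, n_features))
              y = X @ w + 0.1 * rng.normal(size=batch_size)  # noisy target
              yield X, y

      for i, (X, y) in enumerate(stream_of_batches()):
          Xs = scaler.partial_fit(X).transform(X)            # keep feature scaling up to date
          if i > 0:                                          # evaluate before updating
              error = np.mean(np.abs(model.predict(Xs) - y))
              if i % 10 == 0:
                  print(f"batch {i:2d}: mean absolute error {error:.3f}")
          model.partial_fit(Xs, y)                           # online update with the new batch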
  • Item
    More compact orthogonal drawings by allowing additional bends
    (2018-06-26) Jünger, Michael; Mutzel, Petra; Spisla, Christiane
    Compacting orthogonal drawings is a challenging task. Usually, algorithms try to compute drawings with small area or total edge length while preserving the underlying orthogonal shape. We suggest a moderate relaxation of the orthogonal compaction problem, namely the one-dimensional monotone flexible edge compaction problem with fixed vertex star geometry. We further show that this problem can be solved in polynomial time using a network flow model. An experimental evaluation shows that allowing additional bends can reduce the total edge length and the drawing area.
  • Item
    Bioinformatics from genetic variants to methylation
    (2018) Schröder, Christopher; Rahmann, Sven; Klau, Gunnar W.
    An important research topic in bioinformatics is the analysis of DNA, the molecule that encodes the genetic information of all organisms. The basis for this is sequencing, a procedure in which the sequence of DNA bases is determined. In addition to the identification of variations in the base sequence itself, advances in sequencing methods and a steady reduction in sequencing costs open up new fields of research: the analysis of functionally relevant non-base-related changes, so-called epigenetics. An important example of such a mechanism is DNA methylation, a process in which methyl groups are added to DNA without altering the sequence itself. Methylation takes place only at specific sites, and the methylation information of human DNA consists of approximately 30 million methylation levels between 0 and 1 in total. This thesis deals with problems and solutions for each phase of DNA methylation analysis. The most advanced method for detecting DNA methylation in terms of resolution is Whole-Genome Bisulfite Sequencing (WGBS), a technique that modifies DNA at unmethylated sites. We describe the special in-silico treatment required to process this altered DNA and existing concepts as well as newly developed bioinformatic methods for the efficient determination of DNA methylation levels and their further processing with our tool camel. A common downstream analysis step is the detection of differentially methylated regions (DMRs), for which we have implemented a modification of the widely used method BSmooth in order to deal with today's common data sizes. Setting up and creating new sequencing protocols, e.g., the mentioned WGBS, is complicated and requires adjustments to several parameters. We have developed a method based on a linear program (LP) that can predict the duplicate rate of supersamples. This critical quality measure represents the proportion of redundant data that in most cases needs to be removed from any further analysis. Using our method, it becomes possible to test, adjust, and improve parameters on small test libraries only and to estimate the duplication rate for potential full-size samples. Once the sequencing protocol has been established, the methylation recognition of camel can be used as part of automated workflows, such as our mosquito workflow. This pipeline processes the generated WGBS samples from the raw data to the degree of methylation, including all essential intermediate steps. Such workflows are one of the central components of bioinformatics, since the computations must be parallel, reproducible, and scalable. The distribution of the detected methylation levels, e.g., the values of several samples at a specific location, can often be described by a beta-mixture model. The standard approach for estimating the parameters of such a model, the EM algorithm, has problems with data points of exactly 0 or 1, which are very common as methylation levels. For this reason, we have developed an alternative algorithm based on moments that overcomes this disadvantage. It is robust for data points within the closed interval [0, 1] and can also be applied to similar datasets beyond methylation levels. This work deals not only with epigenetic but also with genetic variants. To analyze these, we present a second pipeline (ape) for data from targeted sequencing, where, for example, only genes are sequenced. The recognized variants then serve as input for our graphical environment eagle, a tool for computer scientists and geneticists to identify possible causal genetic variants. As the name implies, the configuration of the analysis and the presentation of the results are done via a graphical user interface. Unlike other tools, eagle is not based on databases, but on encapsulated HDF5 files. The use of this universal file-system-like data structure offers some advantages and makes the system easy to use, especially for non-computer scientists. At the end of the thesis, we use all presented methods for the detection, analysis, and characterization of interindividual DMRs between several donors. This leads to some computational challenges, because DMR detection is usually performed on two different groups. Our approach processes independent samples and calculates key metrics such as p-values and the number of undetectable DMRs. Through genome-wide association studies (GWAS) on more than 1000 array datasets of methylation and variants, we show that (interindividual) DMRs, as a subtype of epigenetics, are related to genetic variation.
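    The moment-based idea for beta-distributed methylation levels can be shown for a single component: match the sample mean and variance to the Beta(alpha, beta) moments and solve for the parameters. The sketch below covers one component only and is not the thesis's mixture algorithm; note that values of exactly 0 or 1 pose no problem for the moments themselves.

      from statistics import mean, pvariance

      def beta_method_of_moments(levels):
          """Estimate Beta(alpha, beta) parameters from methylation levels in [0, 1]
          by matching sample mean m and variance v:
              alpha = m * t,  beta = (1 - m) * t,  with t = m * (1 - m) / v - 1."""
          m = mean(levels)
          v = pvariance(levels, m)
          if v == 0 or v >= m * (1 - m):
              raise ValueError("sample moments incompatible with a Beta distribution")
          t = m * (1 - m) / v - 1
          return m * t, (1 - m) * t

      # Methylation levels of one CpG site across several samples, incl. an exact 0 and 1.
      levels = [0.0, 0.05, 0.12, 0.30, 0.45, 0.70, 0.88, 0.95, 1.0]
      print(beta_method_of_moments(levels))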
  • Item
    Uncertainty handling in surrogate assisted optimisation of games
    (2019) Volz, Vanessa; Rudolph, Günter; Preuss, Mike
    In this thesis, entitled Uncertainty handling in surrogate assisted optimisation of games, we started out with the goal to investigate the uncertainty in game optimisation problems, as well as to identify or develop suitable optimisation algorithms. In order to approach this problem systematically, we first created a benchmark consisting of suitable game optimisation functions (GBEA). The suitability of these functions was determined using a taxonomy that was created based on the results of a literature survey of automatic game evaluation approaches. In order to improve the interpretability of the results, we also implemented an experimental framework that adds several features aiding the analysis of the results, specifically for surrogate-assisted evolutionary algorithms. After describing potentially suitable algorithms, we proposed a promising algorithm (SAPEO), to be tested on the benchmark alongside state-of-the-art optimisation algorithms. SAPEO utilises the observation that most evolutionary algorithms need fitness evaluations only for survival selection. If the individuals in a population can be distinguished reliably based on predicted values, the number of function evaluations can therefore be reduced. After a theoretical analysis of the performance limits of SAPEO, which produced very promising insights, we conducted several sets of experiments in order to answer the three central hypotheses guiding this thesis. We find that SAPEO performs comparably to state-of-the-art surrogate-assisted algorithms, but all are frequently outperformed by stand-alone evolutionary algorithms. From a more detailed analysis of the behaviour of SAPEO, we identify a few pointers that could help to further improve its performance. Before running experiments on the developed benchmark, we first verify its suitability using a second set of experiments. We find that GBEA is practical and contains interesting and challenging functions. However, we also discover that, in order to produce interpretable results with the benchmark, a set of baseline results is required. Due to this issue, we are not able to produce meaningful results with the GBEA at the time of writing. However, once more experiments have been conducted with the benchmark, we will be able to interpret our results. The insights developed will most likely not only provide an assessment of optimisation algorithms, but can also be used to gain a deeper understanding of the characteristics of game optimisation problems.
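    A hedged sketch of the general mechanism exploited above: train a surrogate on all truly evaluated points and spend expensive evaluations only on the offspring the surrogate ranks best. The Gaussian-process surrogate, the population sizes, and the simple "evaluate the predicted-best half" rule are illustrative assumptions; SAPEO's uncertainty-based distinction of individuals is more refined.

      import numpy as np
      from sklearn.gaussian_process import GaussianProcessRegressor

      def sphere(x):                       # expensive objective (stand-in), minimized
          return float(np.sum(x ** 2))

      rng = np.random.default_rng(0)
      dim, parents, offspring_per_gen = 5, 8, 16

      # archive of all truly evaluated points
      X = rng.uniform(-3, 3, size=(parents, dim))
      y = np.array([sphere(x) for x in X])
      evaluations = len(y)

      for generation in range(20):
          pop = X[np.argsort(y)[:parents]]                       # current best parents
          offspring = pop[rng.integers(parents, size=offspring_per_gen)] \
              + rng.normal(scale=0.3, size=(offspring_per_gen, dim))  # Gaussian mutation

          surrogate = GaussianProcessRegressor().fit(X, y)
          predicted = surrogate.predict(offspring)
          promising = offspring[np.argsort(predicted)[: offspring_per_gen // 2]]

          new_y = np.array([sphere(x) for x in promising])       # only half are truly evaluated
          evaluations += len(new_y)
          X, y = np.vstack([X, promising]), np.concatenate([y, new_y])

      print(f"best value {y.min():.4f} after {evaluations} true evaluations")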
  • Item
    Surrogate models for discrete optimization problems
    (2018) Zaefferer, Martin; Rudolph, Günter; Bartz-Beielstein, Thomas
    Surrogate models are crucial tools for many real-world optimization problems. An optimization algorithm can evaluate a data-driven surrogate model instead of an expensive objective function. While surrogate models are well-established in the continuous optimization domain, they are less frequently applied to more complex search spaces with discrete or combinatorial solution representations. The main goal of this thesis is to fill this gap. We develop and improve methods for data-driven surrogate modeling in discrete search spaces. After an initial review of existing approaches, this work focuses on a similarity-based, or kernel-based, model: Kriging. The intuitive idea is to change the underlying kernel, thus adapting Kriging to arbitrary data types. However, the model is sensitive to the employed kernel. Hence, methods for combining or selecting kernels are required. To that end, this thesis discusses various methods based on concepts like correlation, likelihood, and cross-validation. Our experiments show that choosing or constructing the right kernel determines the success of the optimization algorithm. Another challenge is that kernels are often desired to be positive semi-definite (e.g., correlation functions) or conditionally negative semi-definite (distances). Analytical proofs of a kernel's definiteness are possible, yet not always feasible in practice. At the same time, these properties can have a significant effect on model accuracy. Hence, we propose a randomized, empirical search procedure to determine whether a kernel is definite. Once a kernel is determined to be indefinite, appropriate counter-measures have to be taken to avoid performance losses. Taking inspiration from the field of support vector learning, we use eigenspectrum transformations and related methods to correct the kernel matrices. We add to the state of the art by considering various repair mechanisms that are linked to additional requirements imposed by Kriging models. We test most of our methods with sets of simple benchmark problems. To extend this, we also investigate two problems that are more complex. Firstly, we consider solutions represented by tree structures, which are of interest in genetic programming. We discuss different kernels on tree structures and test them on symbolic regression tasks. Another more complex test case are hierarchical search spaces. Here, some variables are only active when other variables satisfy a condition. We design several new kernels for hierarchical search spaces and compare them to an existing kernel. Even with those more complex test cases, it remains unclear how performance estimates translate to real-world problems. To address this, we propose a data-driven test function generator based on Kriging simulation. In contrast to estimation, simulation may avoid smoothing the training data and is inherently suited to generate randomized test instances that reflect real-world behavior.
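    One repair mechanism from the family discussed above can be sketched in a few lines: check (semi-)definiteness empirically via the eigenvalues of the kernel matrix and, if necessary, clip the negative part of the spectrum. The sketch shows plain eigenvalue clipping only; the thesis compares several such transformations together with the additional requirements Kriging imposes.

      import numpy as np

      def is_positive_semidefinite(K: np.ndarray, tol: float = 1e-10) -> bool:
          """Empirical definiteness check on a symmetric kernel matrix."""
          return bool(np.min(np.linalg.eigvalsh(K)) >= -tol)

      def clip_spectrum(K: np.ndarray) -> np.ndarray:
          """Repair an indefinite kernel matrix by setting negative eigenvalues to zero."""
          eigval, eigvec = np.linalg.eigh(K)
          return (eigvec * np.clip(eigval, 0.0, None)) @ eigvec.T

      # A symmetric similarity matrix on discrete objects that happens to be indefinite.
      K = np.array([[1.0, 0.9, 0.1],
                    [0.9, 1.0, 0.9],
                    [0.1, 0.9, 1.0]])
      print(is_positive_semidefinite(K))          # False: one eigenvalue is negative
      K_fixed = clip_spectrum(K)
      print(is_positive_semidefinite(K_fixed))    # True after the repair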
  • Item
    K-best enumeration - theory and application
    (2017) Kurz, Denis; Mutzel, Petra; Chimani, Markus
    Where an optimal solution does not contain sufficient information about a given problem instance, enumerating good solutions is a common coping strategy. In combinatorial optimization, k-best enumeration, or ranking, has been studied and applied extensively. The k shortest simple path problem in directed, weighted graphs (kSSP), introduced in 1963 by Clarke, Krikorian and Rausen, is particularly well known. Efficient existing algorithms are based on Yen's algorithm for this problem; they all feature a worst-case running time of O(kn(m + n log n)) on a graph with n vertices and m edges. Vassilevska Williams and Williams show that a polynomially faster kSSP algorithm would also result in an algorithm for the all-pairs shortest path problem with running time O(n^{3-ε}), which seems unlikely at the moment. We present a kSSP algorithm that is not based on Yen's algorithm, matching the state-of-the-art running time of O(kn(m + n log n)). In particular, we do not solve a single instance of the replacement path problem, a basic building block of Yen's algorithm. Instead, we make use of ideas used in Eppstein's algorithm for a similar problem where paths are allowed to be non-simple. Using our algorithm, one may find Θ(m) simple paths with a single shortest path tree computation and additional O(m + n) time per path in well-behaved cases. We also propose practical improvements for our algorithm, comprising dynamic edge pruning, lazy shortest path tree computations, and fast path simplicity checks. In a detailed computational study, we demonstrate that well-behaved cases are quite common in random graphs, grids, and road networks. We showcase the usefulness of the dynamic edge pruning approach on those three graph classes and on network topology graphs. Despite 40 years of heavy research on Yen-based kSSP algorithms, including involved algorithm engineering, our algorithm proves to be faster than existing algorithms by at least an order of magnitude. However, there is not much room for improvement in the general worst case. We demonstrate that kSSP can be solved considerably faster if the input graph is restricted to have bounded treewidth. Specifically, we propose an algorithm template for enumerating the k best solutions in a bounded treewidth graph for any problem that can be expressed in counting monadic second-order logic. Our template is a generalization of Courcelle's theorem, mainly utilizing dynamic programming and a persistent heap data structure. It enumerates any constant number of solutions in time O(n). For general k, it requires O(log n) extra time per solution, resulting in a total running time of O(n + k log n). Finding the initial solution is parallelizable, so the linear term can be dropped and we obtain a running time of O(k log n) in the PRAM model. The class of problems expressible in counting monadic second-order logic contains kSSP, matching problems, and the spanning tree problem, but also a number of NP-hard problems like the traveling salesman problem, where we achieve a doubly-exponential speedup.
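    For context on the Yen-based state of the art discussed above, NetworkX ships a generator that yields simple paths in order of increasing length (based on Yen's algorithm); the toy graph below is an assumption for illustration, and the call does not reproduce the thesis's own algorithm or its engineering.

      from itertools import islice
      import networkx as nx

      G = nx.DiGraph()
      G.add_weighted_edges_from([
          ("s", "a", 1), ("a", "t", 1),          # s-a-t: length 2
          ("s", "b", 1), ("b", "t", 2),          # s-b-t: length 3
          ("a", "b", 1), ("b", "a", 1),          # detours creating further simple paths
          ("s", "t", 4),                         # direct edge: length 4
      ])

      k = 4
      for path in islice(nx.shortest_simple_paths(G, "s", "t", weight="weight"), k):
          length = nx.path_weight(G, path, weight="weight")
          print(length, path)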
  • Item
    Output-sensitive complexity of multiobjective combinatorial optimization with an application to the multiobjective shortest path problem
    (2018) Bökler, Friedrich Konstantin; Mutzel, Petra; Ehrgott, Matthias
    In this thesis, we are concerned with multiobjective combinatorial optimization (MOCO) problems. We attempt to fill the gap between theory and practice which comes from the lack of a deep complexity theory for MOCO problems. In the first part, we consider a complexity notion for multiobjective optimization problems which derives from output-sensitive complexity. The primary goal of such a complexity notion is to classify problems into easy and hard problems. We show that the multiobjective minimum cut problem as well as the multiobjective linear programming problem are easy problems in this complexity theory. We also show that finding extreme points of nondominated sets is an easy problem under certain assumptions. On the side of hard problems, there are obvious ones like the multiobjective traveling salesperson problem. Moreover, we show that the well-known multiobjective shortest path problem is a hard problem. This is also the case for the multiobjective matching and integer min-cost flow problems. We also identify a class of problems for which the biobjective unconstrained combinatorial optimization problem is a barrier for efficient solvability. In the second part, we are again concerned with the gap between theory and practice. This time, we approach the multiobjective shortest path (MOSP) problem from the practical side. For the application in the planning of power transmission lines, we need implementations which can cope with large graphs and a larger number of objectives. The results from the first part suggest that exact methods might be incapable of achieving this goal, which we also confirm empirically. This is why we study the literature on approximation algorithms for the MOSP problem. We conclude that they do not scale well with the number of objectives in general and that there are no practical implementations available. Hence, in the second part we develop a novel approximation algorithm which leans on the exact approaches that are well tested in practice. In an extensive computational study, we show that our implementation of this algorithm performs well even on a larger number of objectives. We compare our implementation to implementations of the other existing approximation algorithms and conclude that our implementation is superior on instances with more than three objectives.
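    The central object throughout, the nondominated (Pareto) set of objective vectors, can be computed naively for small instances as sketched below; the thesis is concerned with output-sensitive and approximate methods whose cost is measured against the size of exactly this set. The example vectors are illustrative.

      def nondominated(points):
          """Return the points not dominated by any other point (minimization in every
          objective); p dominates q if p <= q componentwise and p < q in at least one."""
          def dominates(p, q):
              return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

          return [p for p in points if not any(dominates(q, p) for q in points)]

      # Objective vectors of five s-t paths, e.g. (travel time, cost, environmental impact).
      paths = [(3, 10, 2), (4, 7, 2), (5, 7, 1), (6, 12, 3), (4, 9, 3)]
      print(nondominated(paths))  # [(3, 10, 2), (4, 7, 2), (5, 7, 1)]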