LS 08 Künstliche Intelligenz

Permanent URI for this collection

http://hdl.handle.net/2003/71

Browse

Now showing 1 - 20 of 90

Measuring and resource-efficient optimization of clustering quality
(2025) Lenssen, Lars Filipp; Schubert, Erich; Zimek, Arthur
Cluster analysis is a fundamental task in exploratory data mining, widely used to uncover hidden structures within datasets across various fields. It has broad applications, from identifying subgroups in gene expression data for disease research to segmenting customer bases in industry. Over time, a diverse range of clustering methods has been developed to handle the complex structure of different data domains. Despite these, key challenges remain, particularly in evaluating the quality of clustering results and optimizing the performance of clustering algorithms. The research presents the Average Medoid Silhouette (AMS), an improved version of the Average Silhouette Width (ASW), and introduces the FastMSC and FasterMSC algorithms, which optimize the AMS directly. The DynMSC algorithm is also proposed to simplify determining the optimal number of clusters. For categorical data, the Average Relative Entropy Score (ARES) and Minimum Relative Entropy Contrast (MREC) are introduced, forming the basis of the CatRED algorithm, an agglomerative hierarchical method applied in information systems research.
Feature selection on quantum computers
(2023-02-20) Mücke, Sascha; Heese, Raoul; Müller, Sabine; Wolter, Moritz; Piatkowski, Nico
In machine learning, fewer features reduce model complexity. Carefully assessing the influence of each input feature on the model quality is therefore a crucial preprocessing step. We propose a novel feature selection algorithm based on a quadratic unconstrained binary optimization (QUBO) problem, which allows to select a specified number of features based on their importance and redundancy. In contrast to iterative or greedy methods, our direct approach yields higher-quality solutions. QUBO problems are particularly interesting because they can be solved on quantum hardware. To evaluate our proposed algorithm, we conduct a series of numerical experiments using a classical computer, a quantum gate computer, and a quantum annealer. Our evaluation compares our method to a range of standard methods on various benchmark data sets. We observe competitive performance.
Joint leaf-refinement and ensemble pruning through L1 regularization
(2023-03-15) Buschjäger, Sebastian; Morik, Katharina
Ensembles are among the state-of-the-art in many machine learning applications. With the ongoing integration of ML models into everyday life, e.g., in the form of the Internet of Things, the deployment and continuous application of models become more and more an important issue. Therefore, small models that offer good predictive performance and use small amounts of memory are required. Ensemble pruning is a standard technique for removing unnecessary classifiers from a large ensemble that reduces the overall resource consumption and sometimes improves the performance of the original ensemble. Similarly, leaf-refinement is a technique that improves the performance of a tree ensemble by jointly re-learning the probability estimates in the leaf nodes of the trees, thereby allowing for smaller ensembles while preserving their predictive performance. In this paper, we develop a new method that combines both approaches into a single algorithm. To do so, we introduce L1 regularization into the leaf-refinement objective, which allows us to jointly prune and refine trees at the same time. In an extensive experimental evaluation, we show that our approach not only offers statistically significantly better performance than the state-of-the-art but also offers a better accuracy-memory trade-off. We conclude our experimental evaluation with a case study showing the effectiveness of our method in a real-world setting.
Smaller world models for reinforcement learning
(2023-08-10) Robine, Jan; Uelwer, Tobias; Harmeling, Stefan
Model-based reinforcement learning algorithms try to learn an agent by training a model that simulates the environment. However, the size of such models tends to be quite large which could be a burden as well. In this paper, we address the question, how we could design a model with fewer parameters than previous model-based approaches while achieving the same performance in the 100 K-interactions regime. For this purpose, we create a world model that combines a vector quantized-variational autoencoder to encode observations and a convolutional long short-term memory to model the dynamics. This is connected to a model-free proximal policy optimization agent to train purely on simulated experience from this world model. Detailed experiments on the Atari environments show that it is possible to reach comparable performance to the SimPLe method with a significantly smaller world model. A series of ablation studies justify our design choices and give additional insights.
Explainable online ensemble of deep neural network pruning for time series forecasting
(2022-08-02) Saadallah, Amal; Jakobs, Matthias; Morik, Katharina
Both the complex and evolving nature of time series data make forecasting among one of the most challenging tasks in machine learning. Typical methods for forecasting are designed to model time-evolving dependencies between data observations. However, it is generally accepted that none of them are universally valid for every application. Therefore, methods for learning heterogeneous ensembles by combining a diverse set of forecasters together appears as a promising solution to tackle this task. While several approaches in the context of time series forecasting have focused on how to combine individual models in an ensemble, ranging from simple and enhanced averaging tactics to applying meta-learning methods, few works have tackled the task of ensemble pruning, i.e. individual model selection to take part in the ensemble. In addition, in classical ML literature, ensemble pruning techniques are mostly restricted to operate in a static manner. To deal with changes in the relative performance of models as well as changes in the data distribution, we employ gradient-based saliency maps for online ensemble pruning of deep neural networks. This method consists of generating individual models’ performance saliency maps that are subsequently used to prune the ensemble by taking into account both aspects of accuracy and diversity. In addition, the saliency maps can be exploited to provide suitable explanations for the reason behind selecting specific models to construct an ensemble that plays the role of a forecaster at a certain time interval or instant. An extensive empirical study on many real-world datasets demonstrates that our method achieves excellent or on par results in comparison to the state-of-the-art approaches as well as several baselines. Our code is available on Github (https://github.com/MatthiasJakobs/os-pgsm/tree/ecml_journal_2022).
Approaching phase retrieval with deep learning
(2023) Uelwer, Tobias; Harmeling, Stefan; Dickscheid, Timo
Phase retrieval is the process of reconstructing images from only magnitude measurements. The problem is particularly challenging as most of the information about the image is contained in the missing phase. An important phase retrieval problem is Fourier phase retrieval, where the magnitudes of the Fourier transform are given. This problem is relevant in many areas of science, e.g., in X-ray crystallography, astronomy, microscopy, array imaging, and optics. In addition to Fourier phase retrieval, we also take a closer look at two additional phase retrieval problems: Fourier phase retrieval with a reference image and compressive Gaussian phase retrieval. Most methods for phase retrieval, e.g., the error-reduction algorithm or Fienup's hybrid-input output algorithms are optimization-based algorithms which solely minimize an error-function to reconstruct the image. These methods usually make strong assumptions about the measured magnitudes which do not always hold in practice. Thus, they only work reliably for easy instances of the phase retrieval problems but fail drastically for difficult instances. With the recent advances in the development of graphics processing units (GPUs), deep neural networks (DNNs) have become fashionable again and have led to breakthroughs in many research areas. In this thesis, we show how DNNs can be applied to solve the more difficult instances of phase retrieval problems when training data is available. On the one hand, we show how supervised learning can be used to greatly improve the reconstruction quality when training images and their corresponding measurements are available. We analyze the ability of these methods to generalize to out-of-distribution data. On the other hand, we take a closer look at an existing unsupervised method that relies on generative models. Unsupervised methods are agnostic toward the measurement process which is particularly useful for Gaussian phase retrieval. We apply this method to the Fourier phase retrieval problem and demonstrate how the reconstruction performance can be further improved with different initialization schemes. Furthermore, we demonstrate how optimizing intermediate representations of the underlying generative model can help overcoming the limited range of the model and, thus, can help to reach better solutions. Finally, we show how backpropagation can be used to learn reference images using a modification of the well-established error-reduction algorithm and discuss whether learning a reference image is always efficient. As it is common in machine learning research, we evaluate all methods on benchmark image datasets as it allows for easy reproducibility of the experiments and comparability to related methods. To better understand how the methods work, we perform extensive ablation experiments, and also analyze the influence of measurement noise and missing measurements.
Explainable adaptation of time series forecasting
(2022) Saadallah, Amal; Morik, Katharina; Hammer, Barbara
A time series is a collection of data points captured over time, commonly found in many fields such as healthcare, manufacturing, and transportation. Accurately predicting the future behavior of a time series is crucial for decision-making, and several Machine Learning (ML) models have been applied to solve this task. However, changes in the time series, known as concept drift, can affect model generalization to future data, requiring thus online adaptive forecasting methods. This thesis aims to extend the State-of-the-Art (SoA) in the ML literature for time series forecasting by developing novel online adaptive methods. The first part focuses on online time series forecasting, including a framework for selecting time series variables and developing ensemble models that are adaptive to changes in time series data and model performance. Empirical results show the usefulness and competitiveness of the developed methods and their contribution to the explainability of both model selection and ensemble pruning processes. Regarding the second part, the thesis contributes to the literature on online ML model-based quality prediction for three Industry 4.0 applications: NC-milling, bolt installation in the automotive industry, and Surface Mount Technology (SMT) in electronics manufacturing. The thesis shows how process simulation can be used to generate additional knowledge and how such knowledge can be integrated efficiently into the ML process. The thesis also presents two applications of explainable model-based quality prediction and their impact on smart industry practices.
Machine learning for acquiring knowledge in astro-particle physics
(2022) Bunse, Mirko; Morik, Katharina; Sebastiani, Fabrizio
This thesis explores the fundamental aspects of machine learning, which are involved with acquiring knowledge in the research field of astro-particle physics. This research field substantially relies on machine learning methods, which reconstruct the properties of astro-particles from the raw data that specialized telescopes record. These methods are typically trained from resource-intensive simulations, which reflect the existing knowledge about the particles—knowledge that physicists strive to expand. We study three fundamental machine learning tasks, which emerge from this goal. First, we address ordinal quantification, the task of estimating the prevalences of ordered classes in sets of unlabeled data. This task emerges from the need for testing the agreement of astro-physical theories with the class prevalences that a telescope observes. To this end, we unify existing methods on quantification, propose an alternative optimization process, and develop regularization techniques to address ordinality in quantification problems, both in and outside of astro-particle physics. These advancements provide more accurate reconstructions of the energy spectra of cosmic gamma ray sources and, hence, support physicists in drawing conclusions from their telescope data. Second, we address learning under class-conditional label noise. More particularly, we focus on a novel setting, in which one of the class-wise noise rates is known and one is not. This setting emerges from a data acquisition protocol, through which astro-particle telescopes simultaneously observe a region of interest and several background regions. We enable learning under this type of label noise with algorithms for consistent, noise-aware decision thresholding. These algorithms yield binary classifiers, which outperform the existing state-of-the-art in gamma hadron classification with the FACT telescope. Moreover, unlike the state-of-the-art, our classifiers are entirely trained from the real telescope data and thus do not require any resource-intensive simulation. Third, we address active class selection, the task of actively finding those proportions of classes which optimize the classification performance. In astro-particle physics, this task emerges from the simulation, which produces training data in any desired class proportions. We clarify the implications of this setting from two theoretical perspectives, one of which provides us with bounds of the resulting classification performance. We employ these bounds in a certificate of model robustness, which declares a set of class proportions for which the model is accurate with a high probability. We also employ these bounds in an active strategy for class-conditional data acquisition. Our strategy uniquely considers existing uncertainties about those class proportions that have to be handled during the deployment of the classifier, while being theoretically well-justified.
Some representation learning tasks and the inspection of their models
(2022) Pfahler, Lukas; Morik, Katharina; Hotho, Andreas
Today, the field of machine learning knows a wide range of tasks with a wide range of supervision sources, ranging from the traditional classification tasks with neatly labeled data, over data with noisy labels to data with no labels, where we have to rely on other forms of supervision, like self-supervision. In the first part of this thesis, we design machine learning tasks for applications where we do not immediately have access to neatly-labeled training data. First, we design unsupervised representation learning tasks for training embedding models for mathematical expression that allow retrieval of related formulae. We train convolutional neural networks, transformer models and graph neural networks to embed formulas from scientific articles into a real-valued vector space using contextual similarity tasks as well as self-supervised tasks. We base our studies on a novel dataset that consists of over 28 million formulae that we have extracted from scientific articles published on arXiv.org. We represent the formulas in different input formats — images, sequences or trees — depending on the embedding model. We compile an evaluation dataset with annotated search queries from several different disciplines and showcase the usefulness of our approach for deploying a search engine for mathematical expressions. Second, we investigate machine learning tasks in astrophysics. Prediction models are currently trained on simulated data, with hand-crafted features and using multiple singletask models. In contrast, we build a single multi-task convolutional neural network that works directly on telescope images and uses convolution layers to learn suitable feature representations automatically. We design loss functions for each task and propose a novel way to combine the different loss functions to account for their different scales and behaviors. Next, we explore another form of supervision that does not rely on simulated training data, but learns from actual telescope recordings. Through the framework of noisy label learning, we propose an approach for learning gamma hadron classifiers that outperforms existing classifiers trained on simulated, fully-labeled data. Our method is general: it can be used for training models in scenarios that fit our noise assumption of class-conditional label noise with exactly one known noise probability. In the second part of this work, we develop methods to inspect models and gain trust into their decisions. We focus on large, non-linear models that can no longer be understood in their entirety through plain inspection of their trainable parameters. We investigate three approaches for establishing trust in models. First, we propose a method to highlight influential input nodes for similarity computations performed by graph neural networks. We test this approach with our embedding models for retrieval of related formulas and show that it can help understand the similarity scores computed by the models. Second, we investigate explanation methods that provide explanations based on the training process that produced the model. This way we provide explanations that are not merely an approximation of the computation of the prediction function, but actually an investigation into why the model learned to produce an output grounded in the actual data. We propose two different methods for tracking the training process and show how they can be easily implemented within existing deep learning frameworks. Third, we contribute a method to verify the adversarial robustness of random forest classifiers. Our method is based on knowledge distillation of a random forest model into a decision tree model. We bound the approximation error of using the decision tree as a proxy model to the given random forest model and use these bounds to provide guarantees on the adversarial robustness of the random forest. Consequently, our robustness guarantees are approximative, but we can provably control the quality of our results using a hyperparameter.
Ensemble learning with discrete classifiers on small devices
(2022) Buschjäger, Sebastian; Morik, Katharina; Fürnkranz, Johannes
Machine learning has become an integral part of everyday life ranging from applications in AI-powered search queries to (partial) autonomous driving. Many of the advances in machine learning and its application have been possible due to increases in computation power, i.e., by reducing manufacturing sizes while maintaining or even increasing energy consumption. However, 2-3 nm manufacturing is within reach, making further miniaturization increasingly difficult while thermal design power limits are simultaneously reached, rendering entire parts of the chip useless for certain computational loads. In this thesis, we investigate discrete classifier ensembles as a resource-efficient alternative that can be deployed to small devices that only require small amounts of energy. Discrete classifiers are classifiers that can be applied -- and oftentimes also trained -- without the need for costly floating-point operations. Hence, they are ideally suited for deployment to small devices with limited resources. The disadvantage of discrete classifiers is that their predictive performance often lacks behind their floating-point siblings. Here, the combination of multiple discrete classifiers into an ensemble can help to improve the predictive performance while still having a manageable resource consumption. This thesis studies discrete classifier ensembles from a theoretical point of view, an algorithmic point of view, and a practical point of view. In the theoretical investigation, the bias-variance decomposition and the double-descent phenomenon are examined. The bias-variance decomposition of the mean-squared error is re-visited and generalized to an arbitrary twice-differentiable loss function, which serves as a guiding tool throughout the thesis. Similarly, the double-descent phenomenon is -- for the first time -- studied comprehensively in the context of tree ensembles and specifically random forests. Contrary to established literature, the experiments in this thesis indicate that there is no double-descent in random forests. While the training of ensembles is well-studied in literature, the deployment to small devices is often neglected. Additionally, the training of ensembles on small devices has not been considered much so far. Hence, the algorithmic part of this thesis focuses on the deployment of discrete classifiers and the training of ensembles on small devices. First, a novel combination of ensemble pruning (i.e., removing classifiers from the ensemble) and ensemble refinement (i.e., re-training of classifiers in the ensemble) is presented, which uses a novel proximal gradient descent algorithm to minimize a combined loss function. The resulting algorithm removes unnecessary classifiers from an already trained ensemble while improving the performance of the remaining classifiers at the same time. Second, this algorithm is extended to the more challenging setting of online learning in which the algorithm receives training examples one by one. The resulting shrub ensembles algorithm allows the training of ensembles in an online fashion while maintaining a strictly bounded memory consumption. It outperforms existing state-of-the-art algorithms under resource constraints and offers competitive performance in the general case. Last, this thesis studies the deployment of decision tree ensembles to small devices by optimizing their memory layout. The key insight here is that decision trees have a probabilistic inference time because different observations can take different paths from the root to a leaf. By estimating the probability of visiting a particular node in the tree, one can place it favorably in the memory to maximize the caching behavior and, thus, increase its performance without changing the model. Last, several real-world applications of tree ensembles and Binarized Neural Networks are presented.
Randomized outlier detection with trees
(2020-12-15) Buschjäger, Sebastian; Honysz, Philipp-Jan; Morik, Katharina
Isolation forest (IF) is a popular outlier detection algorithm that isolates outlier observations from regular observations by building multiple random isolation trees. The average number of comparisons required to isolate a given observation can then be used as a measure of its outlierness. Multiple extensions of this approach have been proposed in the literature including the extended isolation forest (EIF) as well as the SCiForest. However, we find a lack of theoretical explanation on why IF, EIF, and SCiForest offer such good practical performance. In this paper, we present a theoretical framework that views these approaches from a distributional viewpoint. Using this viewpoint, we show that isolation-based approaches first accurately approximate the data distribution and then secondly approximate the coefficients of mixture components using the average path length. Using this framework, we derive the generalized isolation forest (GIF) that also trains random isolation trees, but combining them moves beyond using the average path length. That is, GIF splits the data into multiple sub-spaces by sampling random splits as do the original IF variants do and directly estimates the mixture coefficients of a mixture distribution to score the outlierness on entire regions of data. In an extensive evaluation, we compare GIF with 18 state-of-the-art outlier detection methods on 14 different datasets. We show that GIF outperforms three competing tree-based methods and has a competitive performance to other nearest-neighbor approaches while having a lower runtime. Last, we highlight a use-case study that uses GIF to detect transaction fraud in financial data.
Geospatial IoT—the need for event-driven architectures in contemporary spatial data infrastructures
(2018-09-25) Rieke, Matthes; Bigagli, Lorenzo; Herle, Stefan; Jirka, Simon; Kotsev, Alexander; Liebig, Thomas; Malewski, Christian; Paschke, Thomas; Stasch, Christoph
The nature of contemporary spatial data infrastructures lies in the provision of geospatial information in an on-demand fashion. Although recent applications identified the need to react to real-time information in a time-critical way, research efforts in the field of geospatial Internet of Things in particular have identified substantial gaps in this context, ranging from a lack of standardisation for event-based architectures to the meaningful handling of real-time information as “events”. This manuscript presents work in the field of event-driven architectures as part of spatial data infrastructures with a particular focus on sensor networks and the devices capturing in-situ measurements. The current landscape of spatial data infrastructures is outlined and used as the basis for identifying existing gaps that retain certain geospatial applications from using real-time information. We present a selection of approaches—developed in different research projects—to overcome these gaps. Being designed for specific application domains, these approaches share commonalities as well as orthogonal solutions and can build the foundation of an overall event-driven spatial data infrastructure.
How is a data-driven approach better than random choice in label space division for multi-label classification?
(2016-07-30) Szymanski, Piotr; Kajdanowicz, Tomasz; Kersting, Kristian
We propose using five data-driven community detection approaches from social networks to partition the label space in the task of multi-label classification as an alternative to random partitioning into equal subsets as performed by RAkELd. We evaluate modularity-maximizing using fast greedy and leading eigenvector approximations, infomap, walktrap and label propagation algorithms. For this purpose, we propose to construct a label co-occurrence graph (both weighted and unweighted versions) based on training data and perform community detection to partition the label set. Then, each partition constitutes a label space for separate multi-label classification sub-problems. As a result, we obtain an ensemble of multi-label classifiers that jointly covers the whole label space. Based on the binary relevance and label powerset classification methods, we compare community detection methods to label space divisions against random baselines on 12 benchmark datasets over five evaluation measures. We discover that data-driven approaches are more efficient and more likely to outperform RAkELd than binary relevance or label powerset is, in every evaluated measure. For all measures, apart from Hamming loss, data-driven approaches are significantly better than RAkELd ( α=0.05 ), and at least one data-driven approach is more likely to outperform RAkELd than a priori methods in the case of RAkELd’s best performance. This is the largest RAkELd evaluation published to date with 250 samplings per value for 10 values of RAkELd parameter k on 12 datasets published to date.
A mathematical theory of making hard decisions: model selection and robustness of matrix factorization with binary constraints
(2018) Heß, Sibylle Charlotte; Morik, Katharina; Siebes, Arno P. J. M.
One of the first and most fundamental tasks in machine learning is to group observations within a dataset. Given a notion of similarity, finding those instances which are outstandingly similar to each other has manifold applications. Recommender systems and topic analysis in text data are examples which are most intuitive to grasp. The interpretation of the groups, called clusters, is facilitated if the assignment of samples is definite. Especially in high-dimensional data, denoting a degree to which an observation belongs to a specified cluster requires a subsequent processing of the model to filter the most important information. We argue that a good summary of the data provides hard decisions on the following question: how many groups are there, and which observations belong to which clusters? In this work, we contribute to the theoretical and practical background of clustering tasks, addressing one or both aspects of this question. Our overview of state-of-the-art clustering approaches details the challenges of our ambition to provide hard decisions. Based on this overview, we develop new methodologies for two branches of clustering: the one concerns the derivation of nonconvex clusters, known as spectral clustering; the other addresses the identification of biclusters, a set of samples together with similarity defining features, via Boolean matrix factorization. One of the main challenges in both considered settings is the robustness to noise. Assuming that the issue of robustness is controllable by means of theoretical insights, we have a closer look at those aspects of established clustering methods which lack a theoretical foundation. In the scope of Boolean matrix factorization, we propose a versatile framework for the optimization of matrix factorizations subject to binary constraints. Especially Boolean factorizations have been computed by intuitive methods so far, implementing greedy heuristics which lack quality guarantees of obtained solutions. In contrast, we propose to build upon recent advances in nonconvex optimization theory. This enables us to provide convergence guarantees to local optima of a relaxed objective, requiring only approximately binary factor matrices. By means of this new optimization scheme PAL-Tiling, we propose two approaches to automatically determine the number of clusters. The one is based on information theory, employing the minimum description length principle, and the other is a novel statistical approach, controlling the false discovery rate. The flexibility of our framework PAL-Tiling enables the optimization of novel factorization schemes. In a different context, where every data point belongs to a pre-defined class, a characterization of the classes may be obtained by Boolean factorizations. However, there are cases where this traditional factorization scheme is not sufficient. Therefore, we propose the integration of another factor matrix, reflecting class-specific differences within a cluster. Our theoretical considerations are complemented by empirical evaluations, showing how our methods combine theoretical soundness with practical advantages.
Exponential families on resource-constrained systems
(2018) Piatkowski, Nico Philipp; Morik, Katharina; Ermon, Stefano
This work is about the estimation of exponential family models on resource-constrained systems. Our main goal is learning probabilistic models on devices with highly restricted storage, arithmetic, and computational capabilities—so called, ultra-low-power devices. Enhancing the learning capabilities of such devices opens up opportunities for intelligent ubiquitous systems in all areas of life, from medicine, over robotics, to home automation—to mention just a few. We investigate the inherent resource consumption of exponential families, review existing techniques, and devise new methods to reduce the resource consumption. The resource consumption, however, must not be reduced at all cost. Exponential families possess several desirable properties that must be preserved: Any probabilistic model encodes a conditional independence structure—our methods keep this structure intact. Exponential family models are theoretically well-founded. Instead of merely finding new algorithms based on intuition, our models are formalized within the framework of exponential families and derived from first principles. We do not introduce new assumptions which are incompatible with the formal derivation of the base model, and our methods do not rely on properties of particular high-level applications. To reduce the memory consumption, we combine and adapt reparametrization and regularization in an innovative way that facilitates the sparse parametrization of high-dimensional non-stationary time-series. The procedure allows us to load models in memory constrained systems, which would otherwise not fit. We provide new theoretical insights and prove that the uniform distance between the data generating process and our reparametrized solution is bounded. To reduce the arithmetic complexity of the learning problem, we derive the integer exponential family, based on the very definition of sufficient statistics and maximum entropy estimation. New integer-valued inference and learning algorithms are proposed, based on variational inference, proximal optimization, and regularization. The benefit of this technique is larger, the weaker the underlying system is, e.g., the probabilistic inference on a state-of-the-art ultra-lowpower microcontroller can be accelerated by a factor of 250. While our integer inference is fast, the underlying message passing relies on the variational principle, which is inexact and has unbounded error on general graphs. Since exact inference and other existing methods with bounded error exhibit exponential computational complexity, we employ near minimax optimal polynomial approximations to yield new stochastic algorithms for approximating the partition function and the marginal probabilities. Changing the polynomial degree allows us to control the complexity and the error of our new stochastic method. We provide an error bound that is parametrized by the number of samples, the polynomial degree, and the norm of the model’s parameter vector. Moreover, important intermediate quantities can be precomputed and shared with the weak computational device to reduce the resource requirement of our method even further. All new techniques are empirically evaluated on synthetic and real-world data, and the results confirm the properties which are predicted by our theoretical derivation. Our novel techniques allow a broader range of models to be learned on resource-constrained systems and imply several new research possibilities.
Distributed analysis of vertically partitioned sensor measurements under communication constraints
(2017) Stolpe, Marco; Morik, Katharina; Rehof, Jakob; Brefeld, Ulf
Nowadays, large amounts of data are automatically generated by devices and sensors. They measure, for instance, parameters of production processes, environmental conditions of transported goods, energy consumption of smart homes, traffic volume, air pollution and water consumption, or pulse and blood pressure of individuals. The collection and transmission of data is enabled by electronics, software, sensors and network connectivity embedded into physical objects. The objects and infrastructure connecting such objects are called the Internet of Things (IoT). In 2010, already 12.5 billion devices were connected to the IoT, a number about twice as large as the world's population at that time. The IoT provides us with data about our physical environment, at a level of detail never known before in human history. Understanding such data creates opportunities to improve our way of living, learning, working, and entertaining. For instance, the information obtained from data analysis modules embedded into existing processes could help their optimization, leading to more sustainable systems which save resources in sectors such as manufacturing, logistics, energy and utilities, the public sector, or healthcare. IoT's inherent distributed nature, the resource constraints and dynamism of its networked participants, as well as the amounts and diverse types of data collected are challenging even the most advanced automated data analysis methods known today. Currently, there is a strong research focus on the centralization of all data in the cloud, processing it according to the paradigm of parallel high-performance computing. However, the resources of devices and sensors at the data generating side might not suffice to transmit all data. For instance, pervasive distributed systems such as wireless sensors networks are highly communication-constrained, as are streaming high throughput applications, or those where data masses are simply too huge to be sent over existing communication lines, like satellite connections. Hence, the IoT requires a new generation of distributed algorithms which are resource-aware and intelligently reduce the amount of data transmitted and processed throughout the analysis chain. This thesis deals with the distributed analysis of vertically partitioned sensor measurements under communication constraints, which is a particularly challenging scenario. Here, not observations are distributed over nodes, but their feature values. The learning of accurate prediction models may require the combination of information from different nodes, necessarily leading to communication. The main question is how to design communication-efficient algorithms for the scenario, while at the same time preserving sufficient accuracy. The first part of the thesis introduces fundamental concepts. An overview of the IoT and its many applications is given, with a special focus on data analysis, the vertically partitioned data scenario, and accompanying research questions. Then, basic notions of machine learning and data mining are introduced. A selection of existing distributed data mining approaches is presented and discussed in more detail. Distributed learning in the vertically partitioned data scenario is then motivated by a smart manufacturing case study. In a hot rolling mill, different machines assess parameters describing the processing of single steel blocks, whose quality should be predicted as early as possible, by analysis of distributed measurements. Each machine creates not single value series, but many of them. Their heterogeneity leads to challenging questions concerning the steps of preprocessing and finding a good representation for learning, for which solutions are proposed. Another problem is that quality information is not given for individual blocks, but charges of blocks. How can we nevertheless predict the quality of individual blocks? Time constraints lead to questions typical for the vertically partitioned data scenario. Which data should be analyzed locally, to match the constraints, and which should be sent to a central server? Learning from aggregated label information is a relatively novel problem in machine learning research. A new algorithm for the task is developed and evaluated, the Learning from Label Proportions by Clustering (LLPC) algorithm. The algorithm's performance is compared to three other state-of-the-art approaches, in terms of accuracy and running time. It can be shown that LLPC achieves results with lower running time, while accuracy is comparable to that of its competitors, or significantly higher. The proposed algorithm comes with many other benefits, like ease of implementation and a small memory footprint. For highly decentralized systems, the Training of Local Models from (Label) Counts (TLMC) algorithm is proposed. The method builds on LLPC, reducing communication by transferring only label counts for batches of observations between nodes. Feasibility of the approach is demonstrated by evaluating the algorithm's performance in the context of traffic flow prediction. It is shown that TLMC is much more communication-efficient than centralization of all data, but that accuracy can nevertheless compete with that of a centrally trained global model. Finally, a communication-efficient distributed algorithm for anomaly detection is proposed, the Vertically Distributed Core Vector Machine (VDCVM). It can be shown that the proposed algorithm communicates up to an order of magnitude less data during learning, in comparison to another state-of-the-art approach, or training a global model by the centralization of all data. Nevertheless, in many relevant cases, the VDCVM achieves similar or even higher accuracy on several controlled and benchmark datasets. A main result of the thesis is that communication-efficient learning is possible in cases where features from different nodes are conditionally independent, given the target value to be predicted. Most efficient are local models, which exchange label information between nodes. In comparison to consensus algorithms, which transmit labels repeatedly, TLMC sends labels only once between nodes. Communication could be even reduced further by learning from counts of labels. In the context of traffic flow prediction, the accuracy achieved is still sufficient in comparison to centralizing all data and training a global model. In the case of anomaly detection, similar results could be achieved by utilizing a sampling approach which draws only as many observations as needed to reach a (1+ε)-approximation of the minimum enclosing ball (MEB). The developed approaches have many applications in communication-constrained settings, in the sectors mentioned above. It has been shown that data can be reduced and learned from before it even enters the cloud. Decentralized processing might thus enable the analysis of big data masses, the more devices are getting connected to the IoT.
Automatic methods to extract latent meanings in large text corpora
(2016) Pölitz, Christian; Morik, Katharina; Müller, Heinrich
This thesis concentrates on Data Mining in Corpus Linguistic. We show the use of modern Data Mining by developing efficient and effective methods for research and teaching in Corpus Linguistics in the fields of lexicography and semantics. Modern language resources as they are provided by Common Language Resources and Technology Infrastructure (http://clarin.eu) offer a large number of heterogeneous information resources of written language. Besides large text corpora, additional information about the sources or publication date of the documents from the corpora are available. Further, information about words from dictionaries or WordNets offer prior information of the word distributions. Starting with pre-studies in lexicography and semantics with large text corpora, we investigate the use of latent variable methods to extract hidden concepts in large text collections. We show that these hidden concepts correspond to meanings of words and subjects in text collections. This motivates an investigation of latent variable methods for large corpora to support linguistic research. In an extensive survey, latent variable models are described. Mathematical and geometrical foundations are explained to motivate the latent variable methods. We distinguish two starting points for latent variable models depending on how we represent documents internally. The first representation is based on geometric objects in a vector space and latent variable are represented by vectors. Latent factor models are described to extract latent variables by finding factorizations of matrices summarizing the document objects. The second representation is based on random sequences and the latent variables are random variables on which the sequences conditionally depend. Latent topic models are described to extract latent variables by finding conditionally depending variables. We explain state-of-the-art methods for factor and topic models. To show the quality and hence the use of latent variable methods for corpus linguistic, different evaluation methods are discussed. Qualitative evaluation methods are described to effectively present the results of the latent variable methods to users. State-of-the-art quantitative evaluation methods are summarized to illustrate how to measure the quality of latent variable methods automatically. Additional, we propose new methods to efficiently estimate the quality of latent variable methods for corpora with time information about the documents. Besides standard evaluation methods based on likelihoods and coherences of the extracted hidden concepts, we develop methods to estimate the coherence of the concepts in terms of temporal aspects and likelihoods that including time. Based on the survey on latent variable methods, we interpret the latent variable methods as optimization problem that finds latent variables to optimally describe the document corpus. To efficiently integrate additional information about a corpus from a modern language resources, we propose to extend the optimization for the latent variables with a regularization that includes this additional information. In terms of the different latent variable models, regularizations are proposed to either align latent factors or jointly model latent topics with information about the documents in the corpus. From pre-studies and collaborations with researches from corpus linguistics, we compiled use cases to investigate the regularized latent variable methods for linguistic research and teaching. Two major application are investigated. In diachronic linguistics, we show efficient regularized latent topic models to jointly model latent variables with time stamps from documents. In variety linguistics, we integrate information about the sources of the documents to model similarities and dissimilarities between corpora. Finally, a software package as Plugin for the Data Mining toolkit RapidMiner as it is developed to implement the methods from the thesis is described. The interfaces to the language resources and text corpora, the text processing methods, the latent variable methods and the evaluation methods are specified. We give detailed information about how the software is used on the use cases. The integration of the developed methods in the modern language resources like WebLicht or the Dictionary of the German Languages is explained to show the acceptance of our method in corpus linguistic research and teaching.
Graphical models beyond standard settings: lifted decimation, labeling, and counting
(2015) Hadiji, Fabian; Kersting, Kristian; Natarajan, Sriraam
With increasing complexity and growing problem sizes in AI and Machine Learning, inference and learning are still major issues in Probabilistic Graphical Models (PGMs). On the other hand, many problems are specified in such a way that symmetries arise from the underlying model structure. Exploiting these symmetries during inference, which is referred to as "lifted inference", has lead to significant efficiency gains. This thesis provides several enhanced versions of known algorithms that show to be liftable too and thereby applies lifting in "non-standard" settings. By doing so, the understanding of the applicability of lifted inference and lifting in general is extended. Among various other experiments, it is shown how lifted inference in combination with an innovative Web-based data harvesting pipeline is used to label author-paper-pairs with geographic information in online bibliographies. This results is a large-scale transnational bibliography containing affiliation information over time for roughly one million authors. Analyzing this dataset reveals the importance of understanding count data. Although counting is done literally everywhere, mainstream PGMs have widely been neglecting count data. In the case where the ranges of the random variables are defined over the natural numbers, crude approximations to the true distribution are often made by discretization or a Gaussian assumption. To handle count data, Poisson Dependency Networks (PDNs) are introduced which presents a new class of non-standard PGMs naturally handling count data.
Mining big data streams for multiple concepts
(2015) Bockermann, Christian; Morik, Katharina; Bifet, Albert
About the exploration of data mining techniques using structured features for information extraction
(2012-06-26) Jungermann, Felix; Morik, Katharina; Jannach, Dietmar
The World Wide Web is a huge source of information. The amount of information being available in the World Wide Web becomes bigger and bigger every day. It is impossible to handle this amount of information by hand. Special techniques have to be used to deliver smaller excerpts of information which become manageable. Unfortunately, these techniques like search engines, for instance, just deliver a certain view of the informations original appearance. The delivered information is present in various types of les like websites, text documents, video clips, audio files and the like. The extraction of relevant and interesting pieces of information out of these files is very complex and time-consuming. Special techniques which allow for an automatic extraction of interesting informational units are analyzed in this work. Such techniques are based on Machine Learning methods. In contrast to traditional Machine Learning tasks the processing of text documents in this context needs certain techniques. The structure of natural language contained in text document poses constraints which should be respected by the Machine Learning method. These constraints and the specially tuned methods respecting them are another important aspect in this work. After defining all needed formalisms of Machine Learning which are used in this work, I present multiple approaches of Machine Learning applicable to the fields of Information Extraction. I describe the historical development from first approaches of Information Extraction over Named Entity Recognition to the point of Relation Extraction. The possibilities of using linguistic resources for the creation of feature sets for Information Extraction purposes are presented. I show how Relation Extraction is formally defined, and I additionally show what kind of methods are used for Relation Extraction in Machine Learning. I focus on Relation Extraction techniques which benefit on the one hand from minimum optimization and on the other hand from efficient data structure. Most of the experiments and implementations described in this work were done using the open source framework for Data Mining RapidMiner. To apply this framework on Information Extraction tasks I developed an extension called Information Extraction Plugin which is exhaustively described. Finally, I present applications which explicitly benefit from the collaboration of Data Mining and Information Extraction.

Browse

Recent Submissions