Towards formal explainability: Faithful distillation of deep neural networks into interpretable surrogate models
Date
2025
Abstract
The goal of this thesis is to make the semantic function of trained deep neural networks accessible through a compact and efficient data structure grounded in formal principles.
Deep neural networks are today the predominant machine learning model, owing to their remarkable performance over the last two decades. From the first breakthroughs in computer vision with AlexNet to today's sophisticated large language models for natural language processing, deep neural networks have become the state of the art in machine learning.
One key factor in this success is their ability to learn autonomously from data without human guidance, enabling end-to-end optimization. Conversely, this lack of human involvement makes the internals of neural networks hard to understand, as they lack explicit structure and deliberate design. Besides the large number of learnable parameters, three key factors make the learned intermediate representations so difficult to understand: they are distributed, non-linear, and sub-symbolic.
This thesis proposes a new post-hoc approach to explaining deep neural networks based on faithful surrogate models. Through a systematic, property-preserving decomposition of piecewise linear neural networks into their linear regions, the internal structure of a network is compiled into a new surrogate model. In this way, the typical dataflow representation of DNNs, which is optimized for execution speed on graphics hardware, is converted into a control-flow representation that is more suitable for formal analysis. Consequently, the new representation is free of distributed and non-linear internal representations. The surrogate model provides explanatory information from which different types of explanations can be derived, such as outcome explanations, class characterizations, and model explanations.
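To illustrate the underlying idea of the decomposition, the following minimal sketch (illustrative only, not the thesis implementation; the toy network and its weights are assumptions) shows that fixing the ReLU activation pattern at an input yields the exact affine map the network computes on the surrounding linear region:

```python
# For a fixed ReLU activation pattern, the network is an affine map,
# so each linear region can be summarized by one weight matrix and bias.
import numpy as np

rng = np.random.default_rng(0)

# Toy piecewise linear network: 2 -> 4 -> 3 (weights are arbitrary).
W1, b1 = rng.standard_normal((4, 2)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((3, 4)), rng.standard_normal(3)

def affine_map_for_region(x):
    """Return (W, b) of the affine function the network computes
    on the linear region containing x."""
    pre = W1 @ x + b1
    pattern = pre > 0                   # activation pattern at x
    D = np.diag(pattern.astype(float))  # fixes each ReLU to a linear mask
    W = W2 @ D @ W1                     # composed effective weights
    b = W2 @ (D @ b1) + b2              # composed effective bias
    return W, b, pattern

x = np.array([0.5, -1.0])
W, b, pattern = affine_map_for_region(x)
# On this region the network output equals W @ x + b exactly.
assert np.allclose(W @ x + b, W2 @ np.maximum(W1 @ x + b1, 0) + b2)
```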
At the core of this approach lies an optimized data structure, a binary decision tree, that combines ideas from Algebraic Decision Trees, Binary Space Partitioning Trees, and classic program optimization. By placing function composition at the center, these trees enable a modular approach to faithful distillation that is easily extensible and simplifies reasoning. Optimizations such as infeasible path elimination identify and prune redundancies in the tree.
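The sketch below illustrates one plausible form of infeasible path elimination, assuming each tree edge contributes a halfspace constraint on the input; a path whose collected constraints admit no solution can never be taken and can be pruned. The helper name and the LP-based feasibility check are assumptions for illustration, not the thesis code:

```python
# Infeasible path elimination via a linear feasibility check (sketch).
import numpy as np
from scipy.optimize import linprog

def path_is_feasible(A, b, bounds=(-100.0, 100.0)):
    """Check whether {x : A x <= b} is non-empty inside a bounding box
    by solving a trivial LP (minimize 0 subject to the constraints)."""
    n = A.shape[1]
    res = linprog(c=np.zeros(n), A_ub=A, b_ub=b,
                  bounds=[bounds] * n, method="highs")
    return res.status == 0  # 0 = optimal, i.e. a feasible point exists

# A path asserting x0 <= 0 and x0 >= 1 (written as -x0 <= -1)
# is contradictory, so the corresponding subtree can be removed.
A = np.array([[1.0, 0.0], [-1.0, 0.0]])
b = np.array([0.0, -1.0])
print(path_is_feasible(A, b))  # False -> this branch can be pruned
```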
As the distilled tree mirrors the network's behavior, it can be used to analyze semantic properties such as fairness or robustness. Owing to their formal grounding, these trees integrate well with mathematical notions. Furthermore, two-dimensional slices make it possible to visualize the actual decision boundaries of a neural network, providing an ideal basis for exploring its behavior intuitively.
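A minimal sketch of such a slice visualization, under the assumption of a toy ReLU classifier and an axis-aligned plane (both illustrative): the network is evaluated on a grid over the plane and each point is coloured by its predicted class, making the piecewise linear decision boundaries visible.

```python
# Visualizing decision regions on a 2D slice through the input space.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((8, 5)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((3, 8)), rng.standard_normal(3)

def net(x):
    return W2 @ np.maximum(W1 @ x + b1, 0) + b2

origin = np.zeros(5)
d1, d2 = np.eye(5)[0], np.eye(5)[1]  # plane spanned by two input axes

us = np.linspace(-3, 3, 200)
vs = np.linspace(-3, 3, 200)
classes = np.array([[np.argmax(net(origin + u * d1 + v * d2))
                     for u in us] for v in vs])

plt.pcolormesh(us, vs, classes)  # piecewise linear region boundaries
plt.xlabel("direction d1"); plt.ylabel("direction d2")
plt.title("Decision regions on a 2D slice")
plt.show()
```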
Keywords
Deep neural networks, Model distillation, Decision trees, Interpretable surrogate model, Activation pattern decomposition, Symbolic execution, Rule extraction, Input space partition, Continuous piecewise linear
Subjects based on RSWK
Deep neural network, Decision tree, Symbolic execution, Knowledge extraction, Piecewise linear function