Modeling Approaches for Dose-Response Data in Toxicology Dissertation zur Erlangung des Doktorgrades Dr. rer. nat. der Fakultät Statistik der Technischen Universität Dortmund Vorgelegt von Julia Christin Duda Dortmund, Januar 2024 Amtierender Dekan: Prof. Dr. Philipp Doebler Gutachter: Prof. Dr. Jörg Rahnenführer (Technische Universität Dortmund) JProf. Dr. Kirsten Schorning (Technische Universität Dortmund) Prof. Dr. Jan G. Hengstler (Leibnizinstitut für Arbeitsforschung an der Technischen Universität Dortmund) Tag der Prüfung: 14. März 2024 Abstract Dose-response modeling occurs in many application areas and has a rich research history. An extensively studied application field is clinical studies, where dose-response model- ing is used in Phase II studies to identify the dose closest to a pre-defined effect. Many non-clinical, toxicological studies also aim at identifying a dose-response relationship. However, for non-clinical or toxicological studies there are fewer regulations or guide- lines. This leads to a gap between nowadays research advances in statistical modeling and the use of these methods in practice in toxicology. In addition, toxicological dose- response studies differ from clinical studies in various technical aspects. For example, cells might be studied instead of human patients, and administered doses are constrained due to laboratory, and technical reasons rather than ethical considerations. Therefore, the transfer of clinical methodological knowledge into toxicological applications is only possible to a limited extent and tailored methodologies are required that match the specific data structure of toxicological studies. This cumulative thesis is based upon four works that all present approaches for modeling toxicological dose-response data. The first manuscript reveals the potential of applying the Multiple Comparison Testing and Modeling (MCP-Mod) approach by Bretz et al. (2005) developed for Phase II clinical studies on toxicological, gene-expression dose- response data. In the second manuscript, a parametric, mechanistically motivated model for toxicological dose-time-response data is developed. The third manuscript is application-focused and explains the use of interaction effects when analyzing dose- response gene expression in a two-factor setting. At last, a non-parametric Bayesian dose-response modeling approach was developed that performs functional shrinkage for non-linear function spaces. While the first three manuscripts are published, the fourth work is attached in its current version. iii Acknowledgments I would like to thank my supervisor, Jörg, for his unwavering support and optimism throughout each phase of my dissertation. Thank you for always showing me another perspective when I needed it, while also granting me a lot of freedom to explore and find my ways. I am grateful for the opportunity to work on my Ph.D. within the Research Training Group 2624 Statistical Methods for High-Dimensional Data in Toxicology. It was a great experience to work on so many interdisciplinary projects. Special thanks go out to Prof. Hengstler, whose passion for his research is beyond scales and I am grateful for the opportunity to collaborate with him - thank you! I would also like to thank all members of the RTG and other colleagues, for being a great community to gain experiences, feel supported, and share fun times throughout the last years. A special thanks goes out to Matt, the co-author of my last work, who gave me the opportunity to visit him at the National Institute of Environmental Health Sciences in the United States to work on our research project. Thank you for pushing me to work through challenges that I would have thought invincible before. I would like to thank Ezekiel King, my roommate during my USA stay. Thank you for teaching me to become a braver version of myself and to always cherish the little things in life. I hope I made your life a little brighter in the short time we shared and that you can rest peacefully now. At last, I would like to thank my family and friends who greatly supported me throughout all the ups and downs of this academic journey. Without your love and support, I would not have made it through the hard times. “Just keep swimming.” – Dory, Finding Nemo. v List of Publications This cumulative thesis is based on the following four manuscripts: Article 1: Duda, J. C., Kappenberg, F., & Rahnenführer, J. (2022). Model selection character- istics when using MCP-Mod for dose–response gene expression data. Biometrical Journal, 64(5), 883–897. https://doi.org/10.1002/bimj.202000250 Contribution of the author: The author of this thesis implemented the data analyses and the simulation studies, added additional ideas and wrote the first draft of the manuscript. Discussions and revising was done together with Franziska Kappenberg under the supervision of Jörg Rahnenführer. The reuse of this article in the thesis is granted under the terms of the Creative Commons Attribution 4.0 International License. Article 2: Duda, J. C., Hengstler, J. G., & Rahnenführer, J. (2022). td2pLL: an intuitive time-dose-response model for cytotoxicity data with varying exposure durations. Computational Toxicology, 43, Article 100234. https://doi.org/10.1016/j.comtox.2022.100234 Contribution of the author: The author of this thesis had the central idea for the suggested model, implemented the software package, simulation study, and data application, and wrote the first draft of the manuscript. Helpful comments by the co-authors were incorporated. The reuse of this article in the thesis is granted under the terms of the Creative Commons Attribution 4.0 International License. vii Article 3: Duda, J. C., Drenda, C., Kästel, H., Rahnenführer, J., & Kappenberg, F. (2023). Benefit of using interaction effects for the analysis of high-dimensional time- response or dose-response data for two-group comparisons. Scientific Reports, 13(1), 20804. https://doi.org/10.1038/s41598-023-47057-0 Contribution of the author: The author of this thesis implemented the analyses together with Carolin Drenda and Hue Kästel, and wrote the manuscript. Comments from fruitful discussions with Jörg Rahnenführer and Franziska Kappenberg were incorporated. The reuse of this article in the thesis is granted under the terms of the Creative Commons Attribution 4.0 International License. Article 4: Duda, J. C. & Wheeler, M. Bayesian non-linear subspace shrinkage using horse- shoe priors. Unpublished. Contribution of the author: The original idea for this manuscript was proposed by Matthew Wheeler. The computational implementation, design, conduction, and analysis of the simulation study and the search for and analysis of the application example as well as writing the manuscript was done by the author of this thesis under the supervision of Matthew Wheeler. The manuscript is attached in its current version. Further publications: 1. Tug, T., Duda, J. C., Menssen, M., Bruce, S. W., Bringezu, F., Dammann, M., Frötschl, R., Harm, V., Ickstadt, K., Igl, B. W., Jarzombek, M., Kellner, R., Lott, J., Pfuhler, S., Plappert-Helbig, U., Rahnenführer, J., Schulz, M., Vaas, L., Vasquez, M., Ziegler, V., Ziemann, C. (2024). In vivo alkaline comet assay: Statistical con- siderations on historical negative and positive control data. Regulatory Toxicology and Pharmacology, 105583. https://doi.org/10.1016/j.yrtph.2024.105583 2. Ghallab, A., González, D., Strängberg, E., Hofmann, U., Myllys, M., Hassan, R., Hobloss, Z., Brackhagen, L., Begher-Tibbe, B., Duda, J. C., Drenda, C., Kappenberg, F., Reinders, J., Friebel, A., Vucur, M., Turajski, M., Seddek, A., Abbas, T., Abdelmageed, N., Morad, S. A. F., Morad, W. , Hamdy, A., Albrecht, W., Kittana, N., Assali, M., Vartak, N., van Thriel, C., Sous, A., Nell, P., Villar-Fernandez, M., Cadenas, C., Genc, E., Marchan, R., Luedde, T., Åkerblad, P., Mattsson, J., Marschall, H., Hoehme, S., Stirnimann, G., Schwab, M., Boor, P., Amann, K., Schmitz, J., Bräsen, J. H., Rahnenführer, J., Edlund, K., Karpen, S.J., Simbrunner, B., Reiberger, T., Mandorfer, M., Trauner, M., Dawson, P. A., Lindström, E., Hengstler, J. G. (2023). Inhibition of the Renal Apical Sodium Dependent Bile Acid Transporter Prevents Cholemic Nephropa- thy in Mice with Obstructive Cholestasis. Journal of Hepatology. Published. https://doi.org/10.1016/j.jhep.2023.10.035 3. Kappenberg, F.,Duda, J. C., Schürmeyer, L., Gül, O., Brecklinghaus, T., Hengstler, J. G., Schorning, K., & Rahnenführer, J. (2023). Guidance for statistical design and analysis of toxicological dose–response experiments, based on a comprehensive literature review. Archives of Toxicology. In press. https://doi.org/10.1007/s00204- 023-03561-w 4. Su, H., Haque, M., Becker, S., Edlund, K., Duda, J. C., Wang, Q., Reißing, J., Marschall, H., Candels, L. S., Mohamed, M., Sjöland, W., Liao, L., Drexler, S. A., Strowig, T., Rahnenführer, J., Hengstler, J. G., Hatting, M., Trautwein, C. (2023). Long-term hypercaloric diet exacerbates metabolic liver disease in PNPLA3 I148M animals. Liver International. In press. https://doi.org/10.1111/liv.15587 5. Brecklinghaus, T., Albrecht, W., Duda, J. C., Kappenberg, F., Gründler, L., Edlund, K., Marchan, R., Ghallab, A., Cadenas, C., Rieck, A., Vartak, N., Tolosa, L., Castell, J. V., Gardner, I., Halilbasic, E., Trauner, M., Ullrich, A., Zeigerer, A., Turgunbayer, Ö. D., Damm, G., Seehofer, D., Rahnenführer, J., Hengstler, J. G. (2022). In vitro/in silico prediction of drug induced steatosis in relation to oral doses and blood concentrations by the Nile Red assay. Toxicology Letters, 368, 33–46. https://doi.org/10.1016/j.toxlet.2022.08.006 6. Brecklinghaus, T., Albrecht, W., Kappenberg, F., Duda, J. C., Vartak, N., Edlund, K., Marchan, R., Ghallab, A., Cadenas, C., Günther, G., Leist, M., Zhang, M., Gardner, I., Reinders, J., Russel, F. GM., Foster, A. J., Williams, D. P., Damle- Vartak, A., Grandits, M., Ecker, G., Kittana, N., Rahnenführer, J., Hengstler, J. G. (2022). The hepatocyte export carrier inhibition assay improves the separation of hepatotoxic from non-hepatotoxic compounds. Chemico-Biological Interactions, 351, Article 109728. https://doi.org/10.1016/j.cbi.2021.109728 7. Brecklinghaus, T., Albrecht, W., Kappenberg, F., Duda, J. C., Zhang, M., Gardner, I., Marchan, R., Ghallab, A., Turgunbayer, Ö. D., Rahnenführer, J., Hengstler, J. G. (2022). Influence of bile acids on the cytotoxicity of chemicals in cultivated human hepatocytes. Toxicology in Vitro, 81, Article 105344. https://doi.org/10.1016/j.tiv.2022.105344 8. Ghallab, A., Myllys, M., Friebel, A., Duda, J. C., Edlund, K., Halibasic, E., Vucur, M., Hobloss, Z., Brackhagen, L., Begher-Tibbe, B., Hassan, R., Burke, M., Genç, E., Frohwein, L. J., Hofmann, U., Holland, C. H., González Leiva, D. F., Keller, M., Seddek, A., Abbas, T., Mohammed, E. S. I., Teufel, A., Itzel, T., Metzler, S., Marchan, R., Cadenas, C., Watzl, C., Nitsche, M. A., Kappenberg, K., Luedde, T., Longerich, T., Rahnenführer, J., Hoehme, S., Trauner, M., Hengstler, J. G. (2021). Spatio-temporal multiscale analysis of Western diet-fed mice reveals a translationally relevant sequence of events during NAFLD progression. Cells, 10(10), Article 2516. https://doi.org/10.3390/cells10102516 Contents Abstract iii Acknowledgments v List of Publications vii I Introduction 1 1 Motivation 3 2 Statistical Methods 9 2.1 Dose-response models . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2 MCP-Mod . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3 Gene expression data . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.4 Functional shrinkage for linear subspaces . . . . . . . . . . . . . . . . 13 3 Summary of the Articles 15 3.1 Article 1: MCP-Mod on Gene-Expression Dose-Response Data . . . . . 15 3.2 Article 2: Time-Dose-Response Modeling . . . . . . . . . . . . . . . . 16 3.3 Article 3: Interaction Effects . . . . . . . . . . . . . . . . . . . . . . . 18 3.4 Article 4: Non-linear Shrinkage . . . . . . . . . . . . . . . . . . . . . 19 4 Discussion and Outlook 23 Bibliography 27 II Publications 33 xi Part I Introduction 1 1 Motivation Dose-response analysis occurs in a variety of fields but is most intensely studied within the pharmaceutical development context. When developing a new therapeutic compound, the Phase I and II clinical trials specifically aim at finding a ’good’ dose (Ting et al., 2017). Therefore, it is crucial to understand and statistically model the dose-response relationship of the compound under investigation. The analysis of dose-response relation- ships is also a central goal in related fields, such as pre-clinical studies or toxicological research outside the pharmaceutical sector (Hothorn, 2016). The goal of modeling a dose-response relationship in whichever application field appears identical from a mathe- matical perspective, at least at first glance. But more specialized, tailored methodological approaches to each field are required due to different experimental conditions. For exam- ple, the challenge of small sample sizes in toxicological experiments calls for a different methodological development. However, the methodological research states and practices are more advanced in clinical research than in non-clinical and toxicological research for historical, political, and economic reasons. The pharmaceutical industry faces complex regulations, initiated by the Contergan scandal in the 1960s, while simultaneously being an economically very profitable sector (Bauschke, 2010). This combination quickly pushed methodological research for clinical studies towards new, complex methods that save study duration time while remaining in line with regulatory standards. For toxicological, non-clinical studies, there are many guidelines to standardize labora- tory procedures of assays and study types, for example by the Organization of Economic Co-Operation and Development (OECD) (Gordon, 2001). Laboratory testing results related to chemical safety that adhere to OECD Testing Guidelines are accepted by OECD countries. This harmonization effort reduces repeated testing for country-specific regulations and creates trust and predictability in society and the international market. Other authorities such as the European Food Safety Agency (EFSA) of the European Union or the U.S. Environmental Protection Agency (EPA) also influence policy making by providing scientific advice. Such guidelines are updated regularly to keep pace with scientific advances. Especially within the risk assessment context, there were advances in the recommendation for the statistical evaluation of dose-response data. For example, in 2009 the Scientific Committee of the EFSA published a guidance document that 3 1 Motivation promoted actual modeling of dose-response relationships rather than multiple testing approaches between dose levels to determine the critical effect of a substance (EFSA, 2009). Despite advances and efforts through guidelines and recommendations, their impact is limited as they are not binding for all toxicological studies or experiments. International guidelines apply for specific studies conducted by international companies and indepen- dent risk assessors. The vast majority of research in toxicology is not required to be guideline-coherent. This freedom empowers exploratory research but also reinforces the gap between statistical methodological research and its application in practice. The discrepancy between widely accepted methodological advances in risk assessment and their practical use is quantified in recent a review by Kappenberg et al. (2023). From 3269 analyzed dose-response curves where some form of modeling was applied, for the vast majority (>2,250) simple linear interpolation was used. Only for about 500 curves, an actual parametric model was fitted. In clinical research, dose-response modeling methods typically aim at dose-finding goals, i.e. finding a clinically relevant dose (Bretz et al., 2008). Mathematically, the goal is similar to finding toxicity levels, when analyzing dose-response data in risk assessment. However, methodological research for clinical dose-finding studies can be considered more prevalent. It started with the traditional rule-based, "up-and-down" 3+3 design developed in 1946 (Dixon and Mood, 1946) which was introduced for clinical trials in 1989 (Storer, 1989). Model-based designs followed such as the continual reassessment method (CRM) (O’Quigley and Shen, 1996), escalation with overdose control (EWOC) (Babb et al., 1998; Rogatko et al., 2005) and extensions such as the time-to-event CRM (TITE-CRM) (Cheung and Chappell, 2000). A hybrid approach that combines classical multiple comparison procedures (MCP) testing approaches (Tamhane et al., 1996) with modeling approaches (Pinheiro et al., 2006) is MCP-Mod (Bretz et al., 2005). Further methods and reviews are compared and referenced by Ananthakrishnan et al. (2017). Despite this active, fast-paced development, and agreement on the superiority of the advanced methods, also in clinical trials there is a high reluctance to adopt new designs in practice (Kurzrock et al., 2021). A review by Rogatko et al. (2007) revealed that from 1,235 clinical Phase I studies published between 1991 and 2006, only 20 (1.6%) used a statistical modeling method that was published during that time period. The vast majority of the remaining studies used some form of the classical rule-based design. The above considerations reveal how despite efforts both from academia and regulatory bodies and a theoretically very advanced research state on dose-response modeling, the application of said methods falls short in practice. Before linking how the work of this 4 thesis contributes to filling this gap, one additional development that is intertwined with this topic is elucidated. Another fast-developing field that effects dose-response modeling in toxicology is ge- nomics. Starting with the discovery of the molecular structure of DNA by Watson and Crick (Watson and Crick, 1953), sequencing technologies that transform genetical information into machine-readable data were continuously improved. A breakthrough was the first complete sequence of the human genome in 2001 (Venter et al., 2001). This was achieved with the first sequencing technology, Sanger sequencing (Sanger and Coulson, 1975). The sequencing technology was rapidly advances, bringing forward microarray technology (Schena et al., 1995), next generation sequencing, which allowed massive parallel sequencing (Margulies et al., 2005), and more recently single-cell se- quencing (Aldridge and Teichmann, 2020), with its first experiment in 2008. Sanger and next-generation sequencing can nowadays be considered as first and second generation sequencing. In 2009, single-cell sequencing or third generation sequencing was first introduced commercially by Pacific Biosciences (PacBio) as Single Molecule Real Time (SMRT) Sequencing (Eid et al., 2009). Third generation sequencing differs from first and second generation sequencing, as it allows much longer sequencing lengths up to tens of kilobases of DNA base pairs of a single DNA molecule. It does not require DNA amplification (Satam et al., 2023). With next generation sequencing, the size of generated data grew rapidly. This called for new methods and software and highly influenced the research field of bioinformatics. Data generated by sequencing machines has to be pre-processed prior to the actual data analysis (Gondane and Itkonen, 2023). After pre-processing the raw reads, they have to be aligned or assembled using alignment algorithms such as STAR (Dobin et al., 2013), TopHat (Trapnell et al., 2009) or Salmon (Patro et al., 2015). Subsequent to additional normalization steps, the statistical differential gene-expression analysis follows. This analysis aims at finding genes that are differentially expressed (have different activity levels) between two groups, hence identifying genes that might cause the condition under investigation. The most used softwares to identify differentially expressed genes are DESeq2 (Love et al., 2014) and edgeR (Robinson et al., 2010). Over the last two decades, the rapidly decreasing cost of genome sequencing allowed for larger sample sizes. From $100,000,000 per human genome sequencing in 2001, the costs dropped below $1,000 per genome in 2021 (Wetterstrand, 2021). Consequently, not only two groups might be compared, but repeated measurements at different dose- or time-points became financially feasible. Such experimental setups allowed for dose- response modeling considerations in the field of RNA sequencing data. 5 1 Motivation In summary, dose-response modeling is a central task for toxicological risk assessment and faces various, interdisciplinary challenges. It is intensively studied in related fields such as clinical trials, where the mathematical approach is similar, though contextual differences lead to different focuses. Additionally, ceaseless technical developments in genomics increase RNA sequencing data set sizes such that dose-response modeling is starting to become relevant in bioinformatical analyses, too. Despite regulations and guidelines, the transfer from state-of-the-art methodological dose-response modeling research into practice is slow both for toxicological and clinical research. Overarching use of general dose-response modeling approaches developed in the context of either clinical or toxicological research is rare but promising, due to the mathematical similar goal. Specific, contextual differences between toxicological and clinical studies certainly prevent a general interchangeability of methods. For example, censored survival data in clinical trials are typically not present in controlled toxicological studies with mice or cells, or exposure times in clinical models typically refer to the time after an administra- tion, rather than the time period of a continuous exposure duration. Such differences do not allow indiscriminate methodological interchangeability, but uphold the need for tailored method developments. The works of this thesis contribute to the aforementioned challenges from various perspectives. The first paper reveals the potential of using MCP-Mod - a dose-response modeling approach developed in the clinical research context - on toxicological, in vitro gene expression dose-response data (Duda et al., 2022b). MCP-Mod is a two-step procedure designed to first detect a Proof-of-Concept (PoC) of a candidate drug and then, conditioned on an established PoC, determine the dose of interest by considering model uncertainty. These ideas map well to the goals of gene expression analysis, where the detection of active genes is of interest, while the dose-response model of each gene is uncertain a priori. The second paper introduces a time-dose-response modeling approach that is tailored to exposure-time dose-response modeling for cytotoxicity experiments (Duda et al., 2022a). Motivated by Haber’s law (Haber, 1924), the model links a physiological idea to a popular dose-response model, yielding a time-dose-response model. Equipped with an R software package and a recommended two-step procedure, the approach is presented with a practical focus, thereby minimizing the burden for potential users. Promoting the optimal use of existing tools and methods in RNA Sequencing analyses in practice is the aim of the third paper of this thesis. A common experimental scenario in RNA Sequencing is the comparison of a factor with two (or more) levels between two groups (Duda et al., 2023). Statistically, research questions for these scenarios naturally translate into the analysis of an interaction effect. However, in practice, methodological 6 workarounds are often used. The paper explains and demonstrates in detail using an RNA-Seq mouse data set, the benefits of directly modeling interaction effects compared to the commonly used approach. The last manuscript presents a new modeling approach that combines non-parametric modeling with functional shrinkage into a parametric space. The approach shrinks a non-parametric model into a pre-defined parametric function space if the data does not suggest deviations from that space. If the data presents deviations from the assumed function space, the approach is not limited to the assumed space and remains flexible. We demonstrate the particular use of this general method for dose-response modeling, where often the parametric Hill model is considered plausible, even though deviations (e.g. due to downturn effects at large doses) cannot be ruled out in advance and must be modeled flexibly. Further, the approach’s adaptive behavior is appealing for modeling high throughput dose-response data, as it reduces the burden of correctly selecting a fixed set of candidate models that is appropriate for thousands of genes. The remainder of this thesis is structured as follows. In Chapter 2, a methodological background is provided. It contains general concepts and methods that are used in the works of this thesis but were developed before and are not part of the research contribution of this thesis. Chapter 3 provides a summary of each of the four papers. Close attention is paid to pointing out the innovative aspect of each work and how it contributes to the respective research landscape. The thesis closes with a discussion in Chapter 4. All full-length papers are attached thereafter. 7 1 Motivation 8 2 Statistical Methods In this chapter, a general methodological background of existing methods that were used or extended in the works of this thesis is presented. To allow intuitive notations, symbols are mostly but not strictly used consistently across topics. 2.1 Dose-response models For the remainder of this thesis, we use the terms dose and concentration interchange- ably, despite its differences from a biological perspective. Mathematically, there is no distinction, and for brevity, we will mostly refer to a dose. Consider n ∈ N doses x = (x ⊤1, . . . , xn) with corresponding response values y = (y1, . . . , yn) ⊤. Doses might be grouped into k <∑n distinct dose levels d1, . . . , dk with n1, . . . , nk repeated response measurements and k i=1 ni = n. Assume a parametric relationship between the true mean response and the dose level, and additive, mean-zero, homoscedastic noise with variance σ2 > 0: y = f(x) + ε, (2.1) with ε = (ε1, . . . , εn)⊤ ∼ N (0, I 2nσ ). The parametric function f is unknown, and has to be modeled. The following parametric dose-response models were considered in the papers. The Hill model (a.k.a. sigmoidal Emax model or four-parameter log-logistic model) was developed by A. V. Hill in 1910 (Hill, 1910) to describe biochemical equilibrium between oxygen availability and hemoglobin saturation. It is one of the most frequently used dose-response models (Goutelle et al., 2008) and has a characteristic sigmoidal shape. The Hill model is known under different names and parameterizations (Ritz, 2010). We use the parametrization as in the R package DoseFinding (Bornkamp et al., 2010): 9 2 Statistical Methods xh fθ(x) = E0 + Emax (2.2) EDh h50 + x with parameters θ = (E0, E ⊤max, ED50, h) . The intercept or background response at zero dose (x = 0) isE0. Emax is the asymptotic maximum andED50 is the dose yielding half of the asymptotic maximum effect. The parameter h defines the steepness of the curve at ED50. For the special case h = 1, the Hill model becomes the Emax model, which has an asymptote only for large doses, but not at the zero dose level. The Beta model is defined as f δ1 δ2θ(x) = E0 + EmaxB(δ1, δ2)(x/scal) (1− x/scal) (2.3) with B(δ1, δ2) = (δ + δ )(δ1+δ2)1 2 /(δδ1δδ21 2 ) capturing the shape of the density function of a Beta distribution on [0, scal]. The Beta model can account for downturn effects after an increase and is a comparatively flexible parametric dose-response model. While the scaling parameter scal is a pre-defined hyperparameter that is by default set to 1.2 times the maximal dose, the parameters to be estimated are θ = (E0, E ⊤max, δ1, δ2) . The linear parameters E0 and Emax can be interpreted as for the Hill model, while δ1 > 0 and δ2 > 0 are non-linear parameters that define the location and steepness of the mode. Another simpler non-linear model is the Power model fθ(x) = E0 + E h maxx (2.4) with power coefficient h defining the steepness and potency in the dose-response context. It is used in the U.S. EPA’s Benchmark Dose Software (BMDS) (Davis et al., 2011). Further, simple linear regression models or quadratic models are candidate dose-response models and are assumed familiar. In the frequentistic setup, model fitting can be performed by ordinary least squares, which coincides with a maximum likelihood approach for the normal error assumption, or weighted least squares. In the Bayesian framework, the posterior distribution is calculated analytically or approximated computationally. The specific computational implementation for both frameworks varies. For examples and details, see Pinheiro et al. (2014) for the MCP-Mod Software, Ritz et al. (2015) for the drc R package (frequentistic), or Wheeler et al. (2023) and Davis et al. (2011) for the bayesian ToxicR R package and BMDS software. 10 2.2 MCP-Mod 2.2 MCP-Mod MCP-Mod was developed for Phase I/II clinical dose finding studies by Bretz et al. (2005) to analyze dose-response data and thereby combine two classical approaches: Multiple comparison procedures (MCP) and modeling (Mod). It is a dual approach to address two goals simultaneously: Performing a proof-of-concept (PoC) step, and if the PoC is established, model the dose-response relationship to derive the dose of interest. With the PoC step or MCP step, it is analyzed if the true dose-response f is non-flat, i.e. a dose-response signal exists. More precisely, one tests if f is within a set M of considered dose-response functions, or not. Practitioners a-priori define reasonable, standardized model shapes f 0 01 , . . . , fM by decomposing fm into a location, scale and shape component: f(x,θ) = θ0 + θ f 0 1 (x,θ 0). A guesstimate approach translates interpretable statements on the expected response into quantifiable shape parameters θ0 for each candidate model. Each model shape f 0m then defines a mean response vector µm. The PoC test is constructed based on the hypotheses Hm : c⊤0 mµm = 0 and H m 1 : c ⊤ mµm ̸= 0 where the contrast vector cm is constructed to maximize the power of the test under the assumption that µm is the true mean response. Each contrast defines a test statistic ∑k T = √ i=1 cmiȳim ∑ , (2.5) S k 2i=1 cmi/ni ∑ ∑ where ȳi is the mean response at distinct dose level di and S2 = k ni i=1 j=1(yij − ȳi) 2/(n−k)with yij being the jth measurement at dose di. Given the normal distribution assumption in (2.1), (T1, . . . , Tm) follows a multivariate normal distribution. A PoC is established if T = max(T1, . . . , Tm) is greater than the corresponding quantile qα, and α is the pre-selected significance level. This multivariate approach controls the family-wise error rate while remaining powerful because correlations between similar model shapes are accounted for automatically. If a PoC is established, a non-flat dose-response relationship is assumed. All models that passed the PoC step are subject to modeling the dose-response curve in the modeling (Mod) step. In the Mod-step either model selection using the Akaike Information Criterion (AIC) or model averaging with weights based on the AIC is performed. MCP- Mod is used in the first paper of this thesis, with a focus on model selection via the AIC criterion. 11 2 Statistical Methods 2.3 Gene expression data Gene expression data cover a wide variety of data types and technologies to generate them, and algorithms for downstream analyses. A general historical and technical overview was provided in Chapter 1. With a focus on the papers of this thesis, this section provides more details on microarray data and RNA Sequencing data. For elementary biological concepts on cells, proteins and DNA, see Brazma et al. (2001). Microarrays quantify gene activities, by quantifying a single sample’s messenger RNA (mRNA), a gene product assumed proportional to a gene’s activity, at thousands of genes simultaneously. Small mRNA molecules - transcripts - are labeled with a fluorescent dye and become targets. The transcripts are deposited over the microarray chip. On the chip’s surface, there are small oligonucleotides (RNA strands) called probes which the sample’s targets can bind to (hybridization), if the genetic sequence is similar enough. In theory, only compatible probe targets remain on the ship after a washing phase. Fluorescence-based intensity measurements are assumed proportional to the corresponding gene activities (Sánchez and de Villa, 2008). There are further technical details such as adjusting for background fluorescence that orig- inates from non-specific hybridization, i.e. mRNA samples that bind to non-complementary probes. After quality control assessment, background signal adjustment and normal- ization to account for technical artifacts can be performed using the robust multi-array averaging (RMA) measure (Irizarry et al., 2003). The final, pre-processed intensity values of different samples are typically on a log2 scale and considered approximately normally distributed. An important conceptual difference between microarrays and next generation, RNA- seqeuencing, is that microarrays have a pre-defined set of known target sequences, while RNA-sequencing also allows de novo assembly of unknown genomes (Gondane and Itkonen, 2023). The bioinformatical assembling of the reads is not subject to the research of this thesis and shall not be discussed here. Starting with the resulting raw read counts, normalization and statistical modeling follow, as summarized in Chapter 1. In the papers of this thesis, the R package DESeq2 (Love et al., 2014) is used to identify differentially expressed genes (DEGs). The relevant modeling and normalization of DESeq2 is explained in the following (cf. Love et al. (2014)). The read counts of gene i for sample j are modeled using a negative binomial distribution, Kji ∼ NB(µij, αi), where µij is the expected mean read count and αi a dispersion parameter. To account for different gene lengths and other technical considerations, model µij = qijsj , where sj is the 12 2.4 Functional shrinkage for linear subspaces size factor or technical read depth for sample j, and qij is the unknown, true DNA fragment concentration of gene i for sample j. The size factors sj are estimated using the median-of-ratios method (Anders and Huber, 2010). The fragment concentration that represents the gene activit∑y is then modeled using a generalized linear model with a logarithmic link: log2 q = R ij r=1 xjrβir, where xjr dummy-codes the experimental conditions r = 1, . . . , R in a design matrix X = {x}jr and βij are the coefficients. The selection of the model design and corresponding interpretation of the model fit for typical gene expression experiments is the topic of the third paper of this thesis. 2.4 Functional shrinkage for linear subspaces For the regression problem outlined in (2.1), Shin et al. (2020) developed a non- parametric, Bayesian, flexible modeling approach that shrinks the posterior mean re- sponse towards a pre-specified linear subspace Ω0 = {Φ0x|x ∈ Rn}. The approach is adaptive because there is no shrinkage if the data suggests that the true mean response is outside of Ω0. The mean response is modeled non-parametrically using B-splines (Carl, 2001): ∑k f(x) = βmϕm(x), m=1 where ϕm are B-spline bases evaluated at m = 1, . . . , k knots and design matrix Φ = {ϕm(xi)}i,m. Shrinkage of the posterior mean response E(Φβ|y) into Ω0 is enforced by the conditional prior specifications ( ) p(β|σ2, τ 2) ∝ (τ 2)−(k−d0)/2 1exp − β⊤Φ⊤(I − P )Φβ . (2.6) 2σ2τ 2 Φ0 and (τ 2)b−1/2 p(τ) ∝ 1(0,∞)(τ), (2.7) (1 + τ 2)(a+b) where d0 = rank(Φ0), and PΦ0 = Φ0(Φ ⊤ 0 Φ ) −1 0 Φ ⊤ 0 is the orthogonal projection matrix into the column space of Φ0. For example, shrinkage into a simple linear subspace can be implemented using Φ0 = {1x}. The prior specification on τ , the shrinkage or scaling parameter, is a horseshoe, standard half Cauchy prior for a = b = 1/2 (Carvalho et al., 2010). It allows strong shrinkage of weak signals close to zero while leaving strong signals unconstrained. In the model, this shrinks weak deviations of the response from 13 2 Statistical Methods Ω0 such that the response is effectively in Ω0, while strong deviations of the response from Ω0 remain unconstrained. For the hyperparameters a and b, Shin et al. (2020) provides boundaries that guarantee optimally fast adaptive behavior (shrinkage or no shrinkage) w.r.t. the sample size. In the forth paper of this thesis, the approach by Shin et al. (2020) is extended to non- linear function spaces. The relevance of this extension for toxicological applications is highlighted as it is demonstrated on the non-linear Hill model. 14 3 Summary of the Articles 3.1 Article 1: Model selection characteristics when using MCP-Mod for dose-response gene expression data The first article contributes to the research area of dose-response modeling in a trans- lational way. We identified MCP-Mod - a modeling method developed for clinical dose-response analyses - as an adequate tool for dose-response modeling in a field where MCP-Mod has not yet been used: gene expression data. Application of MCP-Mod in gene-expression dose-response modeling is motivated by uncertainty considerations. Gene-expression data are high-dimensional and it is therefore not feasible to carefully choose appropriate modeling constraints for each gene. Also, differences in cell ex- periments compared to human dosing data lead to further uncertainties in cell-based dose-response data. If doses are applied to cells, there are fewer ethical constraints on higher doses. This might lead to deviations from expected shapes. For example, down- turn effects due to systematic toxicity at higher doses violate monotonicity assumptions. Uncertainty regarding the final dose-response shape for gene expression data is hence imminent, both due to the large number of genes that can not be viewed individually as well as less constraints on dose selection. MCP-Mod addresses the issue of model uncertainty by allowing a set of models to be considered, from which a single best model is used or models are averaged for the final fit. We apply and analyze the model selection workflow of MCP-Mod on a gene-expression data set by Krug et al. (2013), where several doses of valproic acid (VPA) are applied to embryonic stem cells. The six models linear, quadratic, sigmoidal Emax, Emax, and beta model, are considered. The aim of the first article was to analyze if any of the considered models are more relevant than others, or might even be superfluous for the analysis. To address this question, MCP-Mod was applied to the VPA data set, a simulation study was conducted and a score that grades the model set’s performance was proposed. The analyses lead to several conclusions that are relevant in practice and motivated 15 3 Summary of the Articles the use of MCP-mod in other applied dose-response projects that face uncertainty (e.g. Brecklinghaus et al. (2022)). The monotonicity assumption should not be stated for analyzing dose-response gene-expression data. Models that were able to capture a down-turn effect (beta and quadratic) cannot be excluded from the model set without performance loss. The popular Emax model is a special case of the more general, and also very popular sigmoidal Emax model. However, the Emax model can be excluded without notable performance loss, especially if the sigmoidal Model is included in the model set. The sigmoidal model is often selected if the response signal is clear. This result aligns with the sigmoidal Emax model’s popularity in practice and its mechanistically motivated origin. Often, gene expression data is noisy due to the many possible sources of variability. For such genes, the linear model is useful for the detection of a general trend, though the true dose-response shape likely is not linear and cannot be detected accurately due to the noise. To sum up, the first article laid an important groundwork for the cumulative work of this thesis, which thereafter proceeded to focus on dose-response modeling, gene-expression data, and uncertainty considerations. 3.2 Article 2: td2pLL: An intuitive time-dose-response model for cytotoxicity data with varying exposure durations The second article expands the research scope of dose-response modeling to time-dose- response (TDR) modeling. TDR is often considered in the clinical context when the time after intake of a dose is considered. However, this work focuses on TDR modeling in the toxicological context where cells are continuously exposed to a dose of a compound for a certain amount of time. Cells can be exposed continuously throughout a time period in contrast to the momentary intake of a dose for humans, as cells are in a solution that contains the administered concentration. The contribution to the TDR research of this article is the proposition of a new time- dose-response model tailored to cytotoxicity experiments and the development of an R software package to facilitate its use. The proposed model is called time-dose 2- parameter log-logistic (td2pLL) model. It is a two-dimensional, parametric approach and relates the ED50 parameter to the exposure time: ED50(t) = ∆t −γ + C0, (3.1) which leads to the TDR model 16 3.2 Article 2: Time-Dose-Response Modeling xh f(t, x) = 100− 100 . (3.2) xh + ED h50(t) This relationship is mechanistically motivated by a modification of Haber’s law, which states that the toxic effect of a compound is the product of the concentration and the exposure time (Haber, 1924). The added parameters ∆, γ and C0 are interpretable as the magnitude of the exposure-time dependent effect, its potency, and a potential minimal concentration toxicity threshold. The latter means that a minimal dose is required to yield a toxicity effect for a hypothetical, infinitely long exposure duration. The lower and upper asymptotes of 100 and 0 are considered fixed, as the model is tailored to cytotoxicity experiments where initially all cells live and then continuously lose viability with increasing dose. The resulting model is more complex for practitioners than fitting a single dose-response curve for each considered exposure time. However, the model has several advantages if the experimental ranges of the dose and the exposure period allow to observe a time-dependent change in the ED50. These include an increased precision in the ED50 estimate and the possibility to extrapolate ED50 for time periods not considered in the experiment. To decide whether the benefits justify the model’s complexity, a two-step pipeline was suggested. In the first step, an ANOVA-based approach is used to verify if the ED50 depends on the exposure duration. If there is a dependency, the td2pLL model is used in the second step. If not, this joint model is considered unnecessarily complicated, and separate dose-response curves are fitted at each exposure duration. The two-step approach using the suggested model is compared to other approaches (always using the td2pLL model, always fitting separate time-doses curves per time duration, ignoring the time data and fitting a single time-dose curve) in a simulation study and applied on a real cytotoxicity data set by Gu et al. (2018). The simulation results suggest that the two-step approach successfully decides between using the td2pLL model or the simpler approach and leads to more precise ED50 estimates. The real data demonstration also indicates the benefit of the proposed model and pipeline. However, the low number of different exposure points per compound used in the experiments limits the reliability of these conclusions. Larger data examples are required to provide a stronger empirical evaluation of the mechanistically derived ED50 dependency. In summary, the suggested pipeline together with the provided software underlines the application-oriented focus of this work. The article targets a broader audience, possibly 17 3 Summary of the Articles with less profound statistical knowledge, to help bridge the gap between methodological knowledge and its use in practice in toxicological risk assessment. 3.3 Article 3: Benefit of using interaction effects for the analysis of high-dimensional time-response or dose-response data for two-group comparisons The third article is an application-oriented work and was motivated by the interdisci- plinary projects together with the Leibniz Research Centre for Working Environments and Human Factors (IfADo), e.g. by Su et al. (2023). It addresses how a statistically well-known, potentially beneficial method has little use in practice for a very common research situation in gene expression analyses. Precisely, we explain and illustrate the use of interaction effects in gene expression analyses of experimental situations with two factors with two levels each, or where one factor has more levels. A typical experimental setup might compare the effect of a treatment (using Treatment A or B) on two groups of genetically modified mice (Genotype 0 and 1). The biological research question can then be formulated as: Does the treatment effect differ w.r.t. the genotype? Statistically, this naturally translates into analyzing an interaction effect γ in the model yj = µ+ αgj + βcj + γgjcj + εj, where µ is the mean in the reference group (genotype = 0, treatment = A), cj = 1 if the treatment assigned to sample j is 1, otherwise cj = 0. Equivalently the genotype of sample j is indicated by gj and εj is some error term. To test whether the treatment effect differs between the genotypes, one considers the hypotheses H0 : γ = 0 and H1 : γ ̸= 0. The above approach is rather simple. However, while for many experimental situations in practice, it would be appropriate to analyze them using the above approach, the reality is different. Often a statistical detour is used that aims at analyzing interactions, but evades the concept of (statistical) interaction effects in a model. Precisely, practitioners might split the data set into samples of genotype A and samples of genotype B. Then, separately for each genotype data set, the treatment effect is estimated. If the treatment effect is significant (and possibly relevant) for one genotype data set, but not for the other, a difference in the treatment effects w.r.t. the genotype is concluded. We denote 18 3.4 Article 4: Non-linear Shrinkage this commonly found approach as ’Method I’ and the interaction effect-based approach as ’Method II’. In the article, we explain and demonstrate the differences and potential benefits of Method II - modeling an interaction effect - using a gene expression data set where mice are fed two different diets over many weeks. The diet was either a healthy standard diet (SD) or an unhealthy, fatty western diet (WD) and we focused on the feeding periods of either 3 weeks or 6 weeks. The effect of the WD over the SD is of interest. Biologically, genes where the diet effect is different after 6 weeks of feeding compared to the diet effect at the reference time of three weeks of feeding are of interest. Statistically, this calls for modeling an interaction effect between diet type and feeding period. As we dealt with RNA sequencing data, the expression levels are modeled using a Negative Binomial distribution and a generalized linear model, as described in Section 2.3, and implemented using the DESeq2 R package (Love et al., 2014). Method I and method II were compared on the data set and results suggest that modeling an interaction effect can lead to smaller but more specific gene sets. The article provides practitioner-orientated explanations and real-data demonstrations on using interaction effects tailored to gene expression analyses. Gene expression analyses are a standard experimental procedure used by many laboratories with different levels of statistical training available in-house. Therefore, the article contributes to bridging the gap between available statistical research and software for gene expression analyses and its optimal use in practice. 3.4 Article 4: Bayesian non-linear subspace shrinkage using horseshoe priors This article extends the work of Shin et al. (2020), who introduced shrinkage of non- parametrically modeled response curves into parametric linear subspaces, towards shrink- age into non-linear function spaces (cf. Section 2.4). In contrast to the previous articles, this work is formulated within a Bayesian framework. While the work’s extension towards non-linear function spaces is generally applicable for any non-linear parametric model, its particular relevance for toxicological research is demonstrated by shrinkage into the non-linear Hill model (cf. Section 2.1). The underlying motivation for this work is that mechanistically motivated models as the Hill model are often non-linear and always limited to being an approximation of a more complex, underlying physiological mechanism. Therefore, deviations from a model that is overall plausible, cannot be ruled out beforehand and a flexible modeling approach is needed that shrinks into the 19 3 Summary of the Articles reasonable model, but is not fully constrained to the model space. The methodological extension is based on conditional linearization of the non-linear function space using a first-order Taylor expansion. Let hθ be the Hill model function (cf. 2.1) and ΩΘ0 = {hθ(x)|θ ∈ Θ, x ∈ R+0 } the corresponding function space. ΩΘ0 cannot be represented by a matrix column space as for linear subspace shrinkage. Therefore, consider a linear approximation of ΩΘ0 at θ0 for fixed x: hθ(x) ≈ hθ0(x) + Ḟθ0(x)(θ − θ0) (3.3) ∣ with partial derivatives Ḟ = Ḟ (x) = ∂hθ(x)θ0 θ0 ∣ ∣ ∈ Rn×s. Instead of P ∂θ Φ0 for the θ=θ0 linear case, we propose to use Pθ = PḞ as projection into the column space of Ḟθ forθ a given θ. Geometrically, this can be justified as Ḟθ(x) locally approximates hθ(x) by spanning tangent planes at each xi ∈ x (Seber and Wild, 2003)[p. 130]. The resulting, conditional shrinkage prior for β is ( ) p(β|σ2, τ 2,θ) ∝ 2 1(τ )−k/2 exp − β⊤Φ⊤(I − Pθ)Φβ . (3.4) 2σ2τ 2 Notably, the prior now depends on θ, which has to be updated and provided with prior distributions as opposed to the linear shrinkage of Shin et al. (2020). The prior choices for θ depend on the choice of hθ and available knowledge on the expected response. Further, the conditional precision matrix of β has full rank due to the non-linearity of hθ(x), leading to (τ 2)−k/2 instead of (τ 2)−(k−d0)/2 in the linear shrinkage case. The proposed method can further be extended to shrinkage into combined function spaces. Therefore, the corresponding Ḟθ of different hθ are combined horizontally to form an overall derivative matrix, under the constraint that only one intercept column remains. This straightforward extension to shrinkage into combined function spaces allows for additional robustness and can properly capture uncertainty if more than one parametric model is reasonable. We demonstrate the method for shrinkage into the single Hill model space and the combined space of the Hill model and the Power model. In an extensive simulation study, these method implementations are compared to various alternative approaches, namely B-splines, P-splines, a parametric fit with an additive horse-shoe shrinkage spline, and common parametric fits (correctly or incorrectly specified). Additionally, the method is applied to real data where age-dependent testosterone levels in men are modeled. 20 3.4 Article 4: Non-linear Shrinkage As a result, the non-parametric non-linear functional shrinkage approach performs comparative to the oracle-condition parametric fit in case of correct function space specification. If the wrong function space is specified, it correctly decides against shrinkage into the function space and effectively becomes an unconstrained B-spline fit as by construction. Shrinkage into combined function spaces leads to better fits due to the increased robustness. In the context of dose-response modeling, this feature is helpful for the common scenario of down-turn effects at higher doses. Here, the response curve matches a Hill model up to a certain dose where a downturn starts, for example, due to systematic toxicity. A limitation of the proposed method is that in the case of function space misspecification, the missing shrinkage leads to too unconstrained fits that are prone to overfitting. Approaches to overcome this and other limitations are discussed in Section 4. 21 3 Summary of the Articles 22 4 Discussion and Outlook For this thesis, four manuscripts with different forms of contributions to the dose- response modeling research landscape in toxicology were presented. Dose-response modeling is a large field with major applications in clinical trials, risk assessment, or toxicology. Its methodological developments are additionally influenced by the fast technological developments in high-throughput testing and genomics. The first manuscript applied the dose-response modeling method MCP-Mod, originally developed within the clinical context, for high-dimensional, toxicological dose-response gene-expression data (Duda et al., 2022b). This application is innovative as the transfer between dose-response modeling approaches between clinical applications and toxico- logical applications is rare, despite large methodological overlaps of the two fields. Here, the identification of the methodological transfer potential has an additional novelty aspect as the transfer is not only from the clinical towards the general toxicological context but to more specific, high-dimensional, toxicological gene expression data. The main motivation to use MCP-Mod for gene-expression dose-response data is the method’s ability to handle model uncertainty, which is crucial when modeling thousands of genes simultaneously. As a result of this work, the necessity of considering model uncertainty in gene-expression data was demonstrated. To quantify this, tailored measures and evaluation procedures were additionally developed and a large simulation study was performed. The first paper’s scope was on model selection using MCP-Mod and further investiga- tions on dose-response model averaging leave opportunities for future investigations. Benchmark dose (BMD) estimation based on model averaging is a promising approach. However, the inclusion of down-turn effect models might undermine the desirable plateau at low doses and require further investigation. Further, multivariate approaches to combine BMDs, potentially w.r.t. functional genetic groups, but within limits of computational feasibility, are research opportunities based on this work. Related to this task are the works of Kappenberg and Rahnenführer (2023) and the BMDExpress 2 software of the National Toxicology Program (Phillips et al., 2019). Kappenberg and Rahnenführer (2023) uses an empirical Bayes framework to share Hill model parameter 23 4 Discussion and Outlook information in functional genetical groups to improve the estimation of gene-wise and gene-group-wise effective doses. The work is limited to the Hill model as a basis for the parameter-sharing approach. BMDExpress 2 considers various models and genetic pathway information to pool BMDs of genes within functional groups. The pooling is limited to simple summary statistics, and challenges to incorporate fast, multivariate approaches remain. The second paper of this thesis proposed a mechanistically motivated two-dimensional time-dose-response (TDR) model tailored to cytotoxicity experiments. Additionally, a two-step guideline for parsimonious use of the model was provided and the methodology is made available in a developed R-package (Duda et al., 2022a). The work is an example that many toxicological applications benefit from tailored dose-response or TDR method development, demonstrating the limitations of methodological interchangeability between clinical and toxicological dose-response research. A limitation of the proposed model is the small scale of the data used for the application, which limits statements on empirical validation of the model. Larger, appropriate cytotoxicity data sets would also promote the proposition of more parametric TDR models, opening doors for model averaging considerations in the TDR modeling context. The third paper of this thesis is application-focused and aims at promoting the optimal use of available methods and software when analyzing RNA-sequencing data in dose- response two-group comparisons. In particular, correct modeling of an interaction effect, its implementation in the R-package DESeq2 (Love et al., 2014), thorough explanations, and a demonstration of its methodological benefits on an RNA-sequencing mice data set are provided. To build upon this work, extensive simulation studies and literature search-based work that empirically assesses the potential of modeling interaction effects on conducted studies offer a research outlook. The last work of this thesis proposes a general Bayesian non-linear functional shrinkage (NLFS) approach that can specifically be applied in the dose-response context. Based upon a non-parametric model, the response adaptively shrinks towards a pre-defined, non-linear function space (e.g. the Hill model), or a combination of function spaces. NLFS is adaptive as shrinkage is not enforced if the data suggests that the function space is misspecified, yielding a highly flexible approach that accounts for prior knowledge, but also allows deviations. The method is an extension of Shin et al. (2020), who developed this method for linear subspace shrinkage. The extension to non-linear function spaces is based on a first-order Taylor expansion of the model space of interest. In an extensive simulation study, NLFS was compared to other modeling approaches and applied to a testosterone data set. To speed up computation, a tailored block Gibb’s sampler is implemented, which uses Metropolis-Hastings and Slice Sampling. 24 The simulation results suggest that the proposed non-parametric method successfully approximates a parametric fit if the function space is correctly specified. If the function space is misspecified, the shrinkage quickly (for low sample sizes) vanishes, allowing an unconstrained fit. We also demonstrated that shrinkage towards a combination of approximated function spaces is straightforward and provides additional robustness as further model uncertainty considerations can be met. An application example of testosterone levels modeled for men of any age highlights the method’s strengths, as the response is similar to the Hill model, while plausible, data-driven deviations from the Hill model are present. Several improvements and extensions of the proposed NLFS approach offer further research directions. By construction, the model is a flexible non-parametric, uncon- strained B-spline if the assumed function space is severely missspecified. This can lead to undesirable overfitting. To overcome this limitation, the implementation of a smooth- ness penalty as suggested by Wiemann and Kneib (2021) appears promising. Further, the NLFS approach cannot handle a few (n < 10), distinct dose levels with repeated measurements, which is common in practice. So far NLFS is implemented based on uniformly drawn doses, which guarantees balanced shrinkage across the dose range and circumvents rank-deficiency challenges. Adapting the approach and implementing an additional grid that regulates the shrinkage independent of the doses is worth further investigations. Within this extension, additional weights that allow for less shrinkage at the end, e.g. to better account for down-turn effects, would meet further practical needs. However, a grid approach requires additional hyperparameter considerations for the grid size and density. To avoid these limitations, embedding the approach into a Gaussian Process framework is another research perspective. 25 4 Discussion and Outlook 26 Bibliography Aldridge, S. and Teichmann, S. A. (2020). Single cell transcriptomics comes of age. Nature Communications, 11(1):4307. Ananthakrishnan, R., Green, S., Chang, M., Doros, G., Massaro, J., and LaValley, M. (2017). Systematic comparison of the statistical operating characteristics of various phase i oncology designs. Contemporary clinical trials communications, 5:34–48. Anders, S. and Huber, W. (2010). Differential expression analysis for sequence count data. Nature Precedings, pages 1–1. Babb, J., Rogatko, A., and Zacks, S. (1998). Cancer phase i clinical trials: efficient dose escalation with overdose control. Statistics in medicine, 17(10):1103–1120. Bauschke, R. (2010). The Effectiveness of European Regulatory Governance: The Case of Pharmaceutical Regulation. PhD thesis. Bornkamp, B., Pinheiro, J., and Bretz, F. (2010). Dosefinding: planning and analyzing dose finding experiments. R package version 0.4-1. Brazma, A., Parkinson, H., Schlitt, T., and Shojatalab, M. (2001). A quick introduction to elements of biology-cells, molecules, genes, functional genomics, microarrays. European Bioinformatics Institute, Draft. Brecklinghaus, T., Albrecht, W., Kappenberg, F., Duda, J., Vartak, N., Edlund, K., Marchan, R., Ghallab, A., Cadenas, C., Günther, G., et al. (2022). The hepatocyte export carrier inhibition assay improves the separation of hepatotoxic from non- hepatotoxic compounds. Chemico-biological interactions, 351:109728. Bretz, F., Hsu, J., Pinheiro, J., and Liu, Y. (2008). Dose finding–a challenge in statistics. Biometrical Journal: Journal of Mathematical Methods in Biosciences, 50(4):480– 504. Bretz, F., Pinheiro, J. C., and Branson, M. (2005). Combining multiple comparisons and modeling techniques in dose-response studies. Biometrics, 61(3):738–748. 27 Bibliography Carl, D. (2001). A practical guide to splines. Springer. Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480. Cheung, Y. K. and Chappell, R. (2000). Sequential designs for phase i clinical trials with late-onset toxicities. Biometrics, 56(4):1177–1182. Davis, J. A., Gift, J. S., and Zhao, Q. J. (2011). Introduction to benchmark dose methods and us epa’s benchmark dose software (bmds) version 2.1. 1. Toxicology and applied pharmacology, 254(2):181–191. Dixon, W. J. and Mood, A. M. (1946). The statistical sign test. Journal of the American Statistical Association, 41(236):557–566. Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., and Gingeras, T. R. (2013). Star: ultrafast universal rna-seq aligner. Bioinformatics, 29(1):15–21. Duda, J., Hengstler, J. G., and Rahnenführer, J. (2022a). td2pll: An intuitive time-dose- response model for cytotoxicity data with varying exposure durations. Computational Toxicology, 23:100234. Duda, J. C., Drenda, C., Kästel, H., Rahnenführer, J., and Kappenberg, F. (2023). Benefit of using interaction effects for the analysis of high-dimensional time-response or dose-response data for two-group comparisons. Scientific Reports, 13(1):20804. Duda, J. C., Kappenberg, F., and Rahnenführer, J. (2022b). Model selection charac- teristics when using mcp-mod for dose–response gene expression data. Biometrical Journal, 64(5):883–897. EFSA (2009). Guidance of the scientific committee on use of the benchmark dose approach in risk assessment. EFSA Journal, 7(6):1150. Eid, J., Fehr, A., Gray, J., Luong, K., Lyle, J., Otto, G., Peluso, P., Rank, D., Baybayan, P., Bettman, B., et al. (2009). Real-time dna sequencing from single polymerase molecules. Science, 323(5910):133–138. Gondane, A. and Itkonen, H. M. (2023). Revealing the history and mystery of rna-seq. Current Issues in Molecular Biology, 45(3):1860–1874. Gordon, K. (2001). The oecd guidelines and other corporate responsibility instruments: a comparison. 28 Bibliography Goutelle, S., Maurin, M., Rougier, F., Barbaut, X., Bourguignon, L., Ducher, M., and Maire, P. (2008). The hill equation: a review of its capabilities in pharmacological modelling. Fundamental & clinical pharmacology, 22(6):633–648. Gu, X., Albrecht, W., Edlund, K., Kappenberg, F., Rahnenführer, J., Leist, M., Moritz, W., Godoy, P., Cadenas, C., Marchan, R., et al. (2018). Relevance of the incubation period in cytotoxicity testing with primary human hepatocytes. Archives of toxicology, 92:3505–3515. Haber, F. (1924). Zur geschichte des gaskrieges. Fuenf vortraege aus den jahren 1920–1923, pages 76–92. Hill, A. V. (1910). The possible effects of the aggregation of the molecules of hemoglobin on its dissociation curves. The Journal of Physiology, 40:iv–vii. Hothorn, L. A. (2016). Statistics in toxicology using R. CRC Press. Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y. D., Antonellis, K. J., Scherf, U., and Speed, T. P. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4(2):249–264. Kappenberg, F., Duda, J. C., Schürmeyer, L., Gül, O., Brecklinghaus, T., Hengstler, J. G., Schorning, K., and Rahnenführer, J. (2023). Guidance for statistical design and analysis of toxicological dose–response experiments, based on a comprehensive literature review. Archives of Toxicology, 97(10):2741–2761. Kappenberg, F. and Rahnenführer, J. (2023). Information sharing in high-dimensional gene expression data for improved parameter estimation in concentration-response modelling. Plos one, 18(10):e0293180. Krug, A. K., Kolde, R., Gaspar, J. A., Rempel, E., Balmer, N. V., Meganathan, K., Vojnits, K., Baquié, M., Waldmann, T., Ensenat-Waser, R., et al. (2013). Human embryonic stem cell-derived test systems for developmental neurotoxicity: a transcriptomics approach. Archives of toxicology, 87:123–143. Kurzrock, R., Lin, C.-C., Wu, T.-C., Hobbs, B. P., Pestana, R. C., and Hong, D. S. (2021). Moving beyond 3+ 3: the future of clinical trial design. American Society of Clinical Oncology Educational Book, 41:e133–e144. Love, M. I., Huber, W., and Anders, S. (2014). Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Genome Biology, 15:550. 29 Bibliography Margulies, M., Egholm, M., Altman, W. E., Attiya, S., Bader, J. S., Bemben, L. A., Berka, J., Braverman, M. S., Chen, Y.-J., Chen, Z., et al. (2005). Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437(7057):376–380. O’Quigley, J. and Shen, L. Z. (1996). Continual reassessment method: a likelihood approach. Biometrics, pages 673–684. Patro, R., Duggal, G., and Kingsford, C. (2015). Salmon: accurate, versatile and ultrafast quantification from rna-seq data using lightweight-alignment. BioRxiv, 10:021592. Phillips, J. R., Svoboda, D. L., Tandon, A., Patel, S., Sedykh, A., Mav, D., Kuo, B., Yauk, C. L., Yang, L., Thomas, R. S., et al. (2019). Bmdexpress 2: enhanced transcriptomic dose-response analysis workflow. Bioinformatics, 35(10):1780–1782. Pinheiro, J., Bornkamp, B., Glimm, E., and Bretz, F. (2014). Model-based dose finding under model uncertainty using general parametric models. Statistics in medicine, 33(10):1646–1661. Pinheiro, J. C., Bretz, F., and Branson, M. (2006). Analysis of dose–response stud- ies—modeling approaches. In Dose finding in drug development, pages 146–171. Springer. Ritz, C. (2010). Toward a unified approach to dose–response modeling in ecotoxicology. Environmental Toxicology and Chemistry, 29(1):220–229. Ritz, C., Baty, F., Streibig, J. C., and Gerhard, D. (2015). Dose-response analysis using r. PloS one, 10(12):e0146021. Robinson, M. D., McCarthy, D. J., and Smyth, G. K. (2010). edger: a bioconductor pack- age for differential expression analysis of digital gene expression data. bioinformatics, 26(1):139–140. Rogatko, A., Babb, J. S., Tighiouart, M., Khuri, F. R., and Hudes, G. (2005). New paradigm in dose-finding trials: patient-specific dosing and beyond phase i. Clinical cancer research, 11(15):5342–5346. Rogatko, A., Schoeneck, D., Jonas, W., Tighiouart, M., Khuri, F. R., and Porter, A. (2007). Translation of innovative designs into phase i trials. Journal of Clinical Oncology, 25(31):4982–4986. Sánchez, A. and de Villa, M. (2008). A tutorial review of microarray data analysis. Bioinformatics, 1(1):1–55. 30 Bibliography Sanger, F. and Coulson, A. R. (1975). A rapid method for determining sequences in dna by primed synthesis with dna polymerase. Journal of molecular biology, 94(3):441–448. Satam, H., Joshi, K., Mangrolia, U., Waghoo, S., Zaidi, G., Rawool, S., Thakare, R. P., Banday, S., Mishra, A. K., Das, G., et al. (2023). Next-generation sequencing technology: Current trends and advancements. Biology, 12(7):997. Schena, M., Shalon, D., Davis, R. W., and Brown, P. O. (1995). Quantitative moni- toring of gene expression patterns with a complementary dna microarray. Science, 270(5235):467–470. Seber, G. A. and Wild, C. J. (2003). Nonlinear regression. hoboken. New Jersey: John Wiley & Sons, 62(63):1238. Shin, M., Bhattacharya, A., and Johnson, V. E. (2020). Functional horseshoe priors for subspace shrinkage. Journal of the American Statistical Association, 115(532):1784– 1797. Storer, B. E. (1989). Design and analysis of phase I clinical trials. Biometrics, pages 925–937. Su, H., Haque, M., Becker, S., Edlund, K., Duda, J., Wang, Q., Reißing, J., Marschall, H.-U., Candels, L. S., Mohamed, M., et al. (2023). Long-term hypercaloric diet exacerbates metabolic liver disease in pnpla3 i148m animals. Liver International. Tamhane, A. C., Hochberg, Y., and Dunnett, C. W. (1996). Multiple test procedures for dose finding. Biometrics, pages 21–37. Ting, N., Chen, D.-G., Ho, S., Cappelleri, J. C., et al. (2017). Phase II clinical develop- ment of new drugs. Springer. Trapnell, C., Pachter, L., and Salzberg, S. L. (2009). Tophat: discovering splice junctions with rna-seq. Bioinformatics, 25(9):1105–1111. Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., et al. (2001). The sequence of the human genome. Science, 291(5507):1304–1351. Watson, J. D. and Crick, F. H. (1953). Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature, 171(4356):737–738. 31 Bibliography Wetterstrand, K. A. (Last Updated: November 1 2021). The cost of sequencing a human genome. https://www.genome.gov/about-genomics/fact-sheets/ Sequencing-Human-Genome-cost. Accessed: January 5 2024. Wheeler, M. W., Lim, S., House, J. S., Shockley, K. R., Bailer, A. J., Fostel, J., Yang, L., Talley, D., Raghuraman, A., Gift, J. S., et al. (2023). Toxicr: A computational plat- form in r for computational toxicology and dose–response analyses. Computational Toxicology, 25:100259. Wiemann, P. and Kneib, T. (2021). Adaptive shrinkage of smooth functional effects towards a predefined functional subspace. arXiv preprint arXiv:2101.05630. 32 Part II Publications 33 Article 1 Received: 19 August 2020 Revised: 11 November 2021 Accepted: 12 January 2022 DOI: 10.1002/bimj.202000250 RESEARCH ARTICLE Model selection characteristics when using MCP-Mod for dose–response gene expression data Julia C. Duda Franziska Kappenberg Jörg Rahnenführer Department of Statistics, TU Dortmund University, Dortmund, Germany Abstract We extend the scope of application for MCP-Mod (Multiple Comparison Proce- Correspondence Julia C. Duda, Department of Statistics, dure andModeling) to in vitro gene expression data and assess its characteristics TU Dortmund University, Vogelpothsweg regarding model selection for concentration gene expression curves. Precisely, 87, 44227 Dortmund, Germany. we apply MCP-Mod on single genes of a high-dimensional gene expression data Email: duda@statistik.tu-dortmund.de set, where human embryonic stem cells were exposed to eight concentration lev- Funding information els of the compound valproic acid (VPA). As candidate models we consider the Bundesministerium für Bildung und sigmoid 𝐸max (four-parameter log-logistic), linear, quadratic, 𝐸max , exponential,Forschung, Grant/Award Number: LivSysTransfer (031L0119); Deutsche and beta model. Through simulations we investigate the impact of omitting one Forschungsgemeinschaft, Grant/Award or more models from the candidate model set to uncover possibly superfluous Number: RTG 2624 (427806116) models and to evaluate the precision and recall rates of selected models. Each model is selected according to Akaike information criterion (AIC) for a consid- erable number of genes. For less noisy cases the popular sigmoid 𝐸max model This article has earned an open data badge “Reproducible Research” for is frequently selected. For more noisy data, often simpler models like the linear making publicly available the code model are selected, but mostly without relevant performance advantage com- necessary to reproduce the reported pared to the second best model. Also, the commonly used standard 𝐸max model results. The results reported in this article could fully be reproduced. has an unexpected low performance. KEYWORDS dose–response curves, gene expression, MCP-mod, model selection, toxicology 1 INTRODUCTION In drug development, two major steps are of interest when a new compound is examined. First, changes in the dose or concentration of the compound are intended to cause changes in the response. Once this relation is established, the precise modeling of the dose–response curve is the next goal. It aims at finding the target dose for the confirmatory Phase III trials. If multiple comparison procedures (MCPs) are used for signal detection, this can lead to less flexibility as target dose estimation is restricted to the tested dose levels. One major methodological advancement in this field is the Multiple Comparison Procedure and Modeling (MCP-Mod) approach by Bretz et al. (2005). It combines MCP and a modeling (Mod) step by proposing amultistage procedure.MCP-Mod received a positive qualification opinion and a “fit for purpose” designation by theEMAandFDA in 2014 and 2016, respectively, as statisticalmethodology to analyze Phase II dose-finding studies under model uncertainty (European Medicines Agency, 2015; Food and Drug Administration, 2016). This work extends the usual scope of application of MCP-Mod from clinical Phase II to gene expression data. As a practical example, human embryonic stem cells are analyzed (O’Quigley et al., 2017, Chap. 12.3). Valproic acid (VPA) is This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2022 The Authors. Biometrical Journal published by Wiley-VCH GmbH. Biometrical Journal. 2022;64:883–897. www.biometrical-journal.com 883 884 DUDA et al. used for treating epilepsy but it is known to be embryo-toxic when taken in the first trimenon of pregnancy (Genton et al., 2006). The MCP-Mod framework can help to gain insights on concentration–response relationships between the concentration of VPA and gene activity. Specifically, we are interested in two aspects of MCP-Mod when applied on concentration–response data: the detection of genes where VPA has an effect on the dose–response curve (power) and model selection. We investigate these proper- ties in analyses on real and on simulated data. Further, the model performance or goodness-of-fit of selected models is evaluated to identify which models are suitable for gene expression dose–response data. Model selection and model performance differ substantially in the underlying theory. In model selection a statistical model from a set of candidate models has to be selected, given a data set. The aim is to select the model that represents the true, unknownmodel fun𝑅ction best (Chap. 1 of Claeskens & Hjort, 2008; Schorning et al., 2016)). In addition to select-ing the best model among the candidates, we also aim at identifying necessary or dispensable models. Therefore, we usethe goodness-of-fit measure 2adj. We combine the three aspects power, model selection, and goodness-of-fit in a newly proposed score that summarizes the suitability of a model set. This approach is applied on the VPA data set and on simu- lated data. In the context of clinical Phase II trials, model uncertainty for dose–response modeling is considered to increase pre- cision in target dose estimation—Ting (2006), Wheeler and Bailer (2009), Bornkamp et al. (2011) among many others. In Phase II trials, decisions on the model set can be based on expert knowledge and concentrate on a single compound and dose–response relationship. For gene expression data, model selection must be considered for thousands of genes simultaneously and it is not straightforward to find or use prior knowledge on the dose–response profile of each gene. House et al. (2017) and Filer et al. (2016) propose experimental pipelines that include concentration–response modeling and model selection for toxicological gene expression data. They consider a flat model, the sigmoid 𝐸max model with all four parameters or with the lower asymptote fixed to zero, and a gain–loss model that is similar to the beta model con- sidered here. However, detailed investigations on the necessity of model selection and on appropriateness of candidate model sets for gene expression concentration–response data are lacking, which motivates our work. This paper is structured as follows. The VPA data set is introduced in Section 2. The statistical methodology including MCP-Mod and both established performancemeasures and a newly proposed one are presented in Section 3. Our analysis procedures and results that are based on theVPAdata set are presented in Section 4. Different controlled simulation setups and corresponding results follow in Section 5. Final conclusions are summarized in Section 6. Source code to reproduce the results is available as Supporting Information on the journalsweb page (http://onlinelibrary.wiley.com/doi/xxx/suppinfo). 2 GENE EXPRESSION DATA SET The data set was first presented in the study of Krug et al. (2013), whereVPA is applied, among others, to human embryonic stem cells (hESC). VPA is widely used to treat different forms of epilepsy. However, it is linked to an increased incidence in congenital abnormalities (Cotariu & Zaidman, 1991). Krug et al. (2013) state that identifying changes in the transcriptome induced by toxic substances illustrates interesting mechanistic insights. Gene expression levels of the hESCs aremeasured repeatedly for different concentrations, using theGeneChipRHuman Genome U133 Plus 2.0. The=data are preprocessed with the Robust Multi-Chip Average algorithm by Irizarry et al. (2003),such that the expression data are on the logarithmic scale with base 2.Th𝑑e d=ata set cont𝑑ain=s G 5𝜇4,675 probe sets, which will be referred to as genes in the following, for simplicity. Forevery 7gene, expressio8n values c𝑛orresponding to the concentrations 𝑑1 = 0, 𝑑2 = 25, 𝑑3 = 150, 𝑑4 = 350, 𝑑5 = 450, 𝑑6 =550, 800, and 1000 M2 =V⋯PA=ar𝑛e8a=va3ilable. For the control𝑁lev=el2𝑑71, 𝑛1 = 6 replicates were measured. For allother concentrations there are replicates. There are measurements per gene. The replicates are biological replicates since different cells were used for each experiment. Due to functional relationships between genes, we cannot assume independence between the measurements from different genes. Further, six or three replicates per concentration is small for statistical modeling approaches. These problems are addressed in Section 4. 3 MCP-MODMETHODOLOGY AND PERFORMANCEMEASURES In this sec𝑆tion, the methodology is presented. First, the MCP-Mod approach is outlined. Then, the performance measuresprecision and recall for evaluating the model selection in MCP-Mod are explained. Additionally, the newly proposedmeasure  is presented. 15214036, 2022, 5, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/bimj.202000250 by Technische Universitaet Dortmund, Wiley Online Library on [14/12/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License DUDA et al. 885 TABL𝐵E 1 Dose–response models 𝑓(𝑑, 𝜽), their standardized versions 𝑓0(𝑑, 𝜽0), and the guesstimates for 𝜽0 for the analysis. For the betamodel is defined as 𝐵(𝛿1, 𝛿2) = (𝛿1 + 𝛿2)𝛿1+𝛿2∕(𝛿1𝛿1𝛿2𝛿2 ) and 𝐷 = 1200 M𝐸 odelmax 𝐸 𝐸 𝒇(𝒅+, 𝜽𝐸)𝐸0 + 𝐸max𝑑𝑑∕ℎ∕(𝐸(𝐸𝐷 + 𝑑) 𝑑 𝒇∕𝟎((𝒅,𝐷50ℎ + 𝑑ℎ) 𝑑ℎ∕𝐸(𝐸𝐷 𝜽𝟎) 𝐷50ℎ++𝑑) 𝜽𝟎 𝑑ℎ) 𝐸𝐸𝐷𝐷max 0 max 50 50 550 ∈ {100}Sigmoid Exponential 𝐸𝐸0 ++ 𝐸1{exp(𝑑∕𝛿) − 1} exp(𝑑∕𝛿) − 1 𝛿 ∈ 0{1=444.45505, }ℎ = 5.117 Linear 0 Quadratic 𝐸𝐸0 + 𝛿 + 𝐸𝛽 𝑑 1𝑑 + 𝛽 𝑑2 𝑑𝑑 + (𝛽 ∕|𝛽 |)𝑑2 ∅𝛿 = 𝛽 ∕|𝛽 | = −0.001 Beta 0 max𝐵(𝛿21, 𝛿2)(𝑑∕𝐷)𝛿1 ⋅ (1 − 𝑑∕𝐷)𝛿2 (𝑑∕𝐷)𝛿21 (1 −1 𝑑∕𝐷)𝛿2 𝛿1 = 22 1, 𝛿2 = 1 3.1 MCP-Mod The MCP-Mod approach was originally developed by Bretz et al. (2005) to model dose–response relationships in Phase II clinical trials under model uncertainty. For details see also Xun and Bretz (2017) and Bornkamp et al. (2009). The MCP-Mod methodology comprises two analysis steps. First, in the MCP-step, a statistically significant signal in a gene is determined by an optimal-contrast test for a prespecified set of candidate models. If such a signal is found for at least one model, a significant result of the multiple comparison procedure (signifMCP) is present for the gene. This means that an effect of VPA on the gene activity is established. The second step, Mod, refers to the modeling. From the set of mod𝑑els, for which a signifMCP has been established, one model fit is chosen and used as final fit for the data.𝑖A=lte1r,n…at,i𝑘vel1y, m𝑗o=de1l ,averaging can be performed.Denote as placebo concentration and 𝑑2 < … < 𝑑𝑘 as active concentrations with 𝑛𝑖 replicates. For concentrationand … , 𝑛𝑖 , 𝑁 = 𝑛1 +⋯+ 𝑛𝑘, the (preprocessed) expression values are modeled as 𝑦𝑖𝑗 = 𝜇(𝑑𝑖) + 𝜀𝑖𝑗, 𝜀𝑖𝑗 𝑖.∼𝑖.𝑑.  (0, 𝜎2), (1) with homogeneous variance 𝜎2 𝑀> 0. The mean response E(𝑦𝑖𝑗) = 𝜇𝑖 = 𝑓(𝑑𝑖, 𝜽) at concentration 𝑑𝑖 is assumed to follow aconcentration–response model with parameter vector 𝜽 and 𝜀𝑖𝑗 as independent errors.For the MCP step, a set of candidate models needs to be prespecified. Models commonly used for dose–response relationships are summarized in Table 1. All models summarized in the first column o𝑓f(T𝑑a,b𝜽l)e=1 c𝜃an be r𝑓 (𝑑, 𝜽 ) 0 + 𝜃1𝑓 efor 0(𝑑m, 𝜽ulated as0), (2) (see second column of Table 1), where 0 0 is the standardized version of a model. Introduction of the standardized model shape concept is crucial for choosing optimal contrasts in the MCP step, as their choice is scale invariant. It remains to determine initial guesses for the parameter 𝜽0. In practice, for a Phase II study, careful considerations and prior knowledge on expected percentages of maximal effects at certain doses are translated into guesstimates for 𝜽0. Here, the large number of genes makes individual, gene-dependent deci𝑓si0o(𝑑n,s𝜽o0n 𝜽0 difficult. We therefore use the sameguesstimates for all genes. Figure 1 displays the (rescaled) model shapes ) used for the analysis. The guesstimates are listed in Table 1. To the best of our knowledge there is little preliminary work on dose–response model selection in the context of gene expression data (Filer et al., 2016; House et al., 2017). In toxicology, often monotone dose–response relationships are assumed. Especially the 𝐸max model, a special case of the sigmoid 𝐸max model with ℎ = 1, was found to be appropri- ate for the majority of dose–response relationships in a large meta-analysis of clinical dose–response studies (Thomas et al., 2014). The inclusion of these two monotonic models in the candidate model set is therefore obligatory. The linear model is added as a reference or baseline model. For genes where the true underlying model might be a sigmoid 𝐸max model, but at the maximal considered dose, the turning point has not yet been reached, the exponential model might be more suitable. The quadratic and the beta model are included as nonmonotone shapes. They are similar to the gain–loss model used by Filer et al. (2016). There might be a nonmonotone relationship between concentration and gene activity, for example, for metabolic genes. Such genes might be activated at lower VPA concentrations but successively deactivated at increasing, highly toxic concentrations. 15214036, 2022, 5, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/bimj.202000250 by Technische Universitaet Dortmund, Wiley Online Library on [14/12/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License 886 DUDA et al. 0 200 400 600 800 1000 betaMod emax exponential 1.0 0.8 0.6 0.4 0.2 0.0 linear quadratic sigEmax 1.0 0.8 0.6 0.4 0.2 0.0 0 200 400 600 800 1000 0 200 400 600 800 1000 Dose F IGURE 1 Candidate model shapes used for the six concentration–response models The specific guesstimates for 𝜽0 for each (m𝐸o𝐷del)are chosen such that a wide range of dose–response shapes is covered.Further, we consider that during the experime5n0tal design stage, the concentrations were chosen with the expectationth𝐸a𝐷t th≈e d4o5s0e with𝐸𝐷the h≈a8lf0m0 aximal effect is close to 450 and a plateau is reached at concentration 1000, whichtran5s0lates into the sec9𝐸o5𝐷nd guess that 95% of themaximal effect is reached at concentration 800.With these two assumptions( and 50𝑚 =),1t,h…e guesstimate 𝜽 0 for the sigmoid 𝐸max model can be calculated analytically. For the 𝐸max model, a gu𝑡ess of an of 300 is,u𝑀sed. And for the exponential model, an 𝐸𝐷50 of 700 is assumed.Eac𝒄h 𝝁candidate s𝒄hap=e,(𝑐 , … , 𝑐 , defines a respective mean response vector 𝝁𝑚 = (𝜇𝑚1, … , 𝜇𝑚𝑘). For the MCP-step,a contr𝑚⊤ast𝑚 -test as f𝑚irst de𝑚sc1ribed𝑚by Abelson et al. (1963) is calculated. The test is constructed based on the linear con-trast where 𝑘)⊤ is chosen to maximize the power of the test for the assumed mean response 𝝁𝑚 (Bornkamp et al., 2009). This yields the hypotheses H𝑚0 ∶ 𝒄⊤𝑚𝝁𝑚 = 0 and H1𝑚 ∶ 𝒄𝑚⊤𝝁𝑚 ≠ 0. The test statistics for the contrasts are given by 𝑇𝑚 = ∑𝑘𝑆√ 𝑖=∑𝑘1 𝑐𝑚𝑖=1 𝑐2𝑖 ?̄?𝑖 ?̄? 𝑚𝑖 ∕𝑛𝑖 , 𝑚 = 1,… ,𝑀, (3) whe(r𝑇e1, 𝑖…is, 𝑇th𝑚e)o⊤bservedmean at dose 𝑑𝑖 and 𝑆2 𝑡= ∑𝑘𝑖=1∑𝑛𝑗=𝑖 1(𝑦𝑖𝑗 − ?̄?𝑖)2∕(𝑁 − 𝑘) is themean squared error. Under H0 and(1), follows a central, multivariate -distribution. A dose–response sig𝛼nal is established if 𝑇𝑇ma>x =𝑞 max(𝑇1, … , 𝑇𝑚) > 𝑞1=−𝛼{,𝑀w{h, …e𝑚0r,e,𝑀𝑞𝛼𝑚1}i}s th𝐿e equicoordinate 𝛼-quantile ofthe null distribution. This approach leads to𝑚multi1p−l𝛼e testing ad∅justm∗ent for1H H𝐿 with a strict control of the familywise error rate at level . The models with form∗th≠e set of significant models with estab-lished signifMCP. The modeling step is only executed if , that is, a dose–response signal is established for at least one model. During the Mod-step, either one fitted model of those that passed the MCP-step can be chosen for a final fit or all fitted models that passed theMCP-step can be averaged. If a singlemodel is selected, criteria as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) as well as the largest test statistic (maxT) can be used to pick a model. For model averaging, standardized weights based on the AIC or BIC can be calculated for the models in, and the final model is the resulting weighted average of each of the fitted models. Calculations are done with the DoseFinding R package, version 0.9-17, and the statistical software R, version 4.0.2 (R Core Team, 2020). For theℎnumerical estimation of the nonlinear parameters, we use the default boundary setting of theDoseFinding package. As the maximum concentration is 1000, this leads to boundaries for the 𝐸𝐷50 parameter of [1,1500] and [1/2, 10] for the parameter of the sigmoid 𝐸max model. Model means 15214036, 2022, 5, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/bimj.202000250 by Technische Universitaet Dortmund, Wiley Online Library on [14/12/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License DUDA et al. 887 3.2 Measures In this section, we briefly present for our context the definitions of the evaluation measures precision, recall, and 𝑅2adj. Further, a new measure  is proposed s𝑀pe∈cifically=f{o𝑀r th, e…c,o𝑀nte}xt of using MCP-Mod with a fixed candidate set  onmany dose–response data sets.For a set of genes and a specific model ∗ 1 𝐿 , the precision is defined as the conditional probability that a model is correct, given that it has been selected. Accordingly, the recall is defined as the conditional probability that a model is selected, given that it is correct (Buckland & Gey, 1994). Formally, we denote precision = ?̂?(Model is correct |Model is selected) = 𝑡𝑝 𝑡+𝑝𝑓𝑝 , recall = ?̂?(Model is selected |Model is correct) = 𝑡𝑝 𝑡𝑝 𝑓𝑝 𝑓𝑛 𝑡𝑝 + 𝑓𝑛 , where , , and are the number of true positives, false positives, and false negatives. Precision and recall values are in the interval [0,1], and a larger value corresponds to a better performance. They can only be evaluated in simulations where the correct model is known. For a model fit 𝑓(⋅, ?̂?) and data 𝑦𝑖𝑗 of a specific gene, 𝑖 = 1, … , 𝑘, 𝑗 = 1,… , 𝑛𝑖 , we use 𝑅2adj defined as 𝑅2 = 1 − (1 − 𝑁𝑅2−)(𝑁𝑝 − 1)adj , 𝑅2 = 1 − 𝑆𝑆𝑆𝑆𝐸𝑇 , (4) where 𝑆𝑆𝐸 = ∑𝑘𝑖=1∑𝑛𝑗=𝑖 1(𝑦𝑖𝑗 − 𝑓(𝑑𝑖, ?̂?))2 and 𝑆𝑆𝑇 =𝑝 ∑𝑘𝑖=1∑𝑗𝑛=𝑖 1(𝑦𝑖𝑗 − ?̄?)2 is the sum of squared errors and total sum ofsquares, respectively. The number of parameters is and the total number of measurements is 𝑁 = ∑𝑘𝑖=1 𝑛𝑖 .We further propose a new measure, the suitability of model set score . It can be used in a descriptive manner when MCP-Mod is applied to dose–response or concentration–response data of many genes. The score balances two desired properties. First, the number of detected signa𝑅l2s (signifMCPs) is desired to be large. Additionally, the detected signals arealso desired to be clear, that is, to have a large adj value. The score balances the number of detected signals and the model performance, that is, power and goodness-of-fit. It is defined as  = 𝐺1 ∑𝐺 𝑔=1 𝟙{Gene 𝑔 has significant MCP after adjustment} ⋅ 𝑅2adj, (5) where 𝐺 denotes the total number of genes and  the considered set of candidate models. For a given set , the score summarizes the proportion of genes with significant MCP after adjustment, weighted by the goodness-of-fit of the respec- tive genes. Adjustment means that the false discovery rate (FDR) is controlled using the Benjamini–Hochberg (BH) pro- cedure (Benjamini & Hochberg, 1995). In the context of MCP-Mo𝑝d, this means that for each gene, the sma𝑝llest 𝑝-value fr𝑝om the MCP tests of all candidatemodels is chosen. Consequently, each gene is represent𝑅ed by a single -value. These -values are adjusted with the BHprocedure. If a BH-adjusted -value is below 0.05 then th2 is results in a multiplicity-adjusted significant concentration–response signal. As performance measure, the value of adj of the winner model w.r.t. AIC for the corresponding gene is used. In general, when comparing two values 1 and 2 , the larger v𝑅alue indicates a favorable choice of the candidateset, since both the number of detected signals and the model performance2are taken into account. For improved clarity, inaddition both the proportion of genes with detected signal and themean adj of the fit of the winnermodels corresponding to those genes will be reported. 4 DATA-BASED ANALYSIS In this section, setups and results of the data-based analyses are presented. In the following, they are referred to as Analysis I and Analysis II. In Analysis I, MCP-Mod is applied on the real VPA data set and an overview on model selection results 15214036, 2022, 5, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/bimj.202000250 by Technische Universitaet Dortmund, Wiley Online Library on [14/12/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License 888 DUDA et al. TABLE 2 Overview of the analyses scenarios and their respective data generation details (Section 4) as well as for the simulation study of Section 5 Part Data (generation) Analysis I main Original VPA data Analysis II LOMO O𝑛 ri=ginSimulation 𝜎1= 𝑞⋯ al VPA data (0.5=)𝑛⋅ 8 ∈ {3, 5, 10}range (𝜎 = 𝑞null(0.5) for null-model) 6942 F IGURE 2 Distribution of winner counts per model 6000 4000 3739 3067 3090 2376 2000 979 0 betaMod emax exponential linear quadratic sigEmax Winner model by AIC given signifMCP is provided. An additional goal is to check if any model can be omitted from the candidate model set because it can be easily substituted by another model. Analysis I is extended by Analysis II through leave-one-model-out (LOMO) analyses. These include that the entire analysis of Analysis I is repeated several times, and each time one of the candidate modelsof is omitted. For an overview of the different analyses and the simulation, see Table 2. 4.1 Setup for Analysis I In Analysis I, MCP-Mod is applied independently on each gene of the VPA data set. As we cannot assume only increasing or decreasing effects, each gene is tested with two-sided contrast tests with significance level 𝛼 = 0.05. We apply multiplicity adjustment between genes by controlling the FDR using the BH procedure as described in Sec- tion 3.2. For each gene, if a dose–response signal is detected and hence at least one model passes the MCP-step, the AIC is used as model selection criterion. For small sample sizes, Schorning et al. (2016) show that the AIC outperforms the BIC, especially if the true underlying model is a complex one among the considered models. There are 𝑁 = 27 observations per gene in the VPA data set. Thus we use the AIC to avoid too low selection counts of possibly correct, more complex models. In our analysis we will see that even with the AIC the simple linear model is often selected. 4.2 Results for Analysis I Of the 54,675 genes, when controlling the FDR, 20,193 (36.9%) genes pass the MCP-step, that is, their concentration– response profile significantly differs from a flat profile. VPA has a significant effect on the activation (deactivation) of these genes. For each gene one winner model is selected by AIC as a final fit (Figure 2). The linear model is selected most often (34.4%), because the AIC penalizes more complex models. However, Figure 3 clearly shows that the linear model performs compa𝐸rmaatxively poorly with respect to the 𝑅2adj.The popu𝑅l2ar model (cf. Thomas et al., 2014, among many others) wins the fewest times and when it does win, itsfit has low adj values (Figures 2, 3). Figure 4 shows the distribution of model winner counts w.r.t. AIC stratified by 𝑅2adj. Less noisy genes are represented by the rightmost plot. Assuming (1), we refer to more (less) noisy genes as genes whose Number of genes 15214036, 2022, 5, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/bimj.202000250 by Technische Universitaet Dortmund, Wiley Online Library on [14/12/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License DUDA et al. 889 F IGURE 3 Performance of winner models 1.00 0.75 0.50 0.25 betaMod emax exponential linear quadratic sigEmax Winner model by AIC given signifMCP [0, 0.25] (0.25, 0.5] (0.5, 0.75] (0.75, 1] n = 526 n = 7821 3452 n = 6854 2346 n = 5086 1702 232 3000 2000 1500 200 1500 2000 1249 1000 981 912 1115 1141 108 694 100 1235 1000 78 1177 741 609 1000 500 44 632 771 485 500 22 262 18817 0 0 0 0 od ax l M nt ia ea r tic ax d x l l l m n ra m M o ma nt ia r cea ati a x od ax nti a ar tic ax od ax tia ar tic ax eta e n e li ad e a e a o igE n u et a e e lin rn m M r m M ad E ta em ne lin ad E ta em ne lin ad r Em b xp q s b xp o qu si g be o ig e o g e e ex p qu s b xpe q u si Winner model by AIC given signifMCP F IGURE 4 Distribution of winner counts per model stratified by 𝑅2adj underlyi𝑅n2g model has larger (smaller) s𝐸tamnadxard deviations 𝜎 in relation to the response range, which leads to smaller(larger) adj values. While the sigmoid and the beta model win most often for the least noisy stratum, the 𝐸max model is chosen rarely, regardless of the strata. The nonmonotone beta and quadratic model are chosen considerably often. For more noisy genes the linear model is preferred. For these genes, none of the models explains a lot of variance, which favors the linearmodel in terms𝐸of AIC. Hence, if the linearmodel is selected by AIC, one should hesitate to assumea true linear concentration–resp𝐸o𝐷nse reTo ensure that the low number 5o0f m laatxionship. Some example fits are visualiwinners is not only due to too strict𝐸zed in Figure 5.pmaarxameter constraints in the numericaloptimization, 𝐸w𝐷e visualize the p𝐸arameter estimates for genes where the model won (Appendix, Figure A1).There is no 𝑆ev𝑆i𝑇den50ce that the poor perfomr𝑅amx ance of the 𝐸max model is due to optimization constraints, but instead due tothe often low estimates. For an model, a low 𝐸𝐷50 translates to an early plateau, which can lead to an 𝑆𝑆𝐸close to the and therefore to a small 2adj. It is also of interest if any of the models in the candidate se𝑅t is redundant such that it can be substituted by anothermodel. Removing such a model from the candidate set would2increase power as the number of hypotheses would bedecreased in the MCP-step of each gene. If for many genes the adj for the winner model and the second best model differ substantially, th𝐸e winning model should be considered for future analyses. If the quadratic model is the winning modelwith a good fit, mmaanxy genes cannot be modeled well by the secon𝐸d best model (Figure 6).The sigmoid and the beta model pe𝑅rformances also differmbayx a considerable amount to the second best model’sperformance across the whole range of explained variance. The and the exponential model can mostly be replacedby othermodels without substantial loss in 2adj. This especially applies to genes with larger explained𝐸vmaraixance. The linearmodel can always be replaced with minimal loss in explained variance, as it is a special case of the model and the quadratic model. Number of genes R2adj 15214036, 2022, 5, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/bimj.202000250 by Technische Universitaet Dortmund, Wiley Online Library on [14/12/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License 890 DUDA et al. Gene: 235456_at, Gene: 201152_s_at, Gene: 1554418_s_at, Model: betaMod Model: emax Model: exponential 7.5 5.0 8.0 7.0 4.5 4.0 6.5 7.5 3.5 6.0 3.0 0 250 500 750 1000 0 250 500 750 1000 0 250 500 750 1000 Concentration Concentration Concentration Gene: 204400_at, Gene: 203883_s_at, Gene: 217837_s_at, Model: linear Model: quadratic Model: sigEmax 9.8 10.0 8.5 9.4 9.6 9.0 8.0 9.2 8.6 7.5 8.8 0 250 500 750 1000 0 250 500 750 1000 0 250 500 750 1000 Concentration Concentration Concentration Gene: 1555004_a_at, Gene: 236937_at, Gene: 1558000_at, Model: emax Model: exponential Model: linear 4.1 4.2 3.9 4.0 4.0 3.7 3.9 3.8 3.5 3.8 3.6 3.3 3.7 3.4 3.6 0 250 500 750 1000 0 250 500 750 1000 0 250 500 750 1000 Concentration Concentration Concentration F IGURE 5 Example selection of nonnoisy (rows 1 and 2) and noisy (row 3) genes of the VPA data set with significant concentration–response model, with added fit of the model selected as winner w.r.t. AIC. Gray dots represent single response values, and the red dots indicate the mean responses per concentration. Each model has an 𝑅2adj of at least 0.75 (rows 1 and 2) or below 0.25 (row 3) 4.3 Setup for Analysis II Analysis II offers further insights into possibly expendable models in the candidate set. The analysis setup is similar to the one of Analysis I. Analysis I is redone several times, but each time one model from the candidate model set is omitted. We refer to these as LOMO analyses. 4.4 Results for Analysis II The number of FDR adjusted significant concentration–response relationships is similar to the main analysis where no model is left out (Table 3). This finding is consistent with the results of Pinheiro et al. (2006). If the quadratic model is omitted from the candidate model set, the total number of signifMCPs increases at the cost of reduced mean 𝑅2adj for the remaining genes.This is due to the different, rarely appropriate shape of the quadraticmodel compared to all othermodels.Measured by the  score, it is proposed to drop the 𝐸max model from the candidate model set (indicated in bold). The log2Expression log2Expression log2Expression log2Expression log2Expression log2Expression log2Expression log2Expression log2Expression 15214036, 2022, 5, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/bimj.202000250 by Technische Universitaet Dortmund, Wiley Online Library on [14/12/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License DUDA et al. 891 𝑅F2IGURE 6 Scatter plots stratified by winner model (by AIC) showing the 𝑅2adj value of the winner on the 𝑥-axis and the difference to theadj value of the second best model (by AIC) on the 𝑦-axis TABLE 3 Total number of FDR adjusted significant genes in each LOMO analysis, number of gained and lost genes compared to the case when no model is left out (Analysis I),  score, rate of FDR adjusted significant genes, and mean 𝑅2adj among significant genes. Largest score indicated in bold Model 𝐸 Total Gained Lost  signifMCP rate mean 𝑹𝟐adjSigmoid max 20,122 61 132 0.2114 0.3680 0.5743 Quadratic 20,459 697 431 0.2119 0.3742 0.5664 Beta 20,221 150 122 0.2120 0.3698 0.5732 Exponential 20,164 529 558 0.2120 0.3688 0.5748 Linear 20,178 91 106 0.2127 0.3691 0.5763 𝐸None 20,193 0 0 0.2134 0.3693 0.5778max 20,349 330 174 0.2138 0.3722 0.5745 sigmoid 𝐸max model, which contains the 𝐸max model as a special case, decreases the score the most when removed from the candidate set. We are further interested by which model an originally selected model after its omission is typically substituted in the modeling step (Figure 7). The beta model is selected more often, if the sigmoid 𝐸max model is removed and vice versa. If the often selected linear model is omitted, the exponential model is most often replacing it.Two additional evaluations regarding the validity of the  score were conducted. First, a copy of the VPA data set was generated and all 3067 genes where the exponential model won by AIC were removed and the LOMO analyses wererepeated. As expected, in this artificial scenario the  score suggests to drop the exponential model (Table 4, second column from the left). Second, Analysis I was repeated but with a single model as candidate model (Table 4). When a single candidate modelis used, the  score is always smaller than when only one or no model is omitted from the candidate model set and the original VPA data ar𝐸e used. The lowest score of 0.0825 is obtained if the quadratic model is the only candidate model. Thisis because the quadramtiacxmodel shape passes the MCP-step for only 12.27% of the genes. When using only one candidatemodel, the sigmoid model has the largest score of 0.2036. The absolute differences in scoresmight appear small butmust not bemisinterpreted as irrelevant. For the artificial scenario where genes with the exponential model as winner model are removed, omitting the exponential model from thecandidate model set is considered reasonable by construction. Therefore, the corresponding difference in the  score of 15214036, 2022, 5, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/bimj.202000250 by Technische Universitaet Dortmund, Wiley Online Library on [14/12/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License 892 DUDA et al. 8000 Left out model 6000 betaMod emax 4000 exponential linear quadratic 2000 sigEmax none 0 betaMod emax exponential linear quadratic sigEmax AIC winner model (given signifMCP) F IGURE 7 Absolute count of selections by AIC in the modeling step of each model that had a signifMCP in the MCP-step for each LOMO scenario TABLE 4  score, rate of significant genes, and mean 𝑅2adj among significant genes for two added analyses: The LOMO analyses on the modified VPA data set where genes with exponential winner model are removed and Analysis I repeated but with only one model in the candidate model set. Largest  score indicated in bold LOMO analyses on VPA data set without Only model in candidate model set on exponential winner genes original VPA data set signif. mean signif. mean Mod𝐸el  genes (%) 𝑹𝟐adj  genes (%) 𝑹𝟐adjsig. max 0.1789 0.3062 0.5844 0.2036 0.3771 0.5399 Beta 0.1795 0.3077 0.5834 0.2013 0.3647 0.5520 Quadratic 0.1795 0.3117 0.5758 0.0825 0.1227 0.6722 Linear 0.1803 0.3071 0.5869 0.1855 0.3768 0.4922 𝐸None 0.1810 0.3078 0.5882 - - -max 0.1813 0.3097 0.5854 0.1690 0.3235 0.5225 Exponential 0.1834 0.3173 0.5778 0.1669 0.3011 0.5544 0.0024 can be interpreted as relevant. Only using the sigmoid 𝐸max model compared to using the full candidate model set differs by 0.0098, which can hence be viewed as a relevant difference such that it would not suffice to use a single model.The interpretation of the  score is not straightforward, which is discussed in Section 6. 5 SIMULATION-BASED ANALYSIS The simulation gives insights on the effect of the number of replications per concentration level while standard deviation of the noise is fixed (Table 2). Opposed to the data-based analysis, the correct model is known such that precision, recall, and goodness-of-fit can be evaluated. 5.1 Setup Concentration–response data sets are generated for 10,000 genes for each of the six considered models and for the null case, as well as for three different numbers of replicates 𝑛𝑖 and a fixed standard deviation to range ratio (see Table 2). Details on how the range and𝑛standard deviation are chosen are explained in the following. The null case means that aconstant model is used to gener𝑖ate the data. In order to have a realistic data generation process, it is based on the real VPAdata set. For each con𝑓sid=er𝑓e(d𝑑, 𝜽,)a∈data set structu=ra7lly similar to the VPA data set but with 70,000 genes, 10,000 for eachof the six nonflat models, and 10,000for the flat𝑞n(0u.l5l)model, is generated as follows.Consider a model where | | and for the null case, 𝑓 = 𝑓(𝑑, 𝜽) = 𝑓(𝑑, 𝑐) = 𝑐 > 0. The assumedratio of standard deviation to range denoted by is explained below. Define ∗(𝑓) as the set of all genes 𝑔 for which Gene count 15214036, 2022, 5, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/bimj.202000250 by Technische Universitaet Dortmund, Wiley Online Library on [14/12/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License DUDA et al. 893 TABLE 5 Summary of the simulation results, stratified by correct model and chosen model, respectively. Signif. genes (%) is the rate of FDR adjusted, detected dose–response signals among the 10,000 generated dose–response data for each model, with 𝑛𝑖 replicates at each dose level. For the recall rate, th𝒏e model is the correct m𝑬 odel. For the precision rate, the model is the selected modelMeasure 𝒊 Beta 𝐦𝐚𝐱 Exponential Flat Linear Quadratic sig. 𝑬𝐦𝐚𝐱 3 0.989 0.982 0.979 0.037 0.988 0.988 0.997 Signif. genes (%) 5 1.000 1.000 1.000 0.042 1.000 1.000 1.000 10 1.000 1.000 1.000 0.045 1.000 1.000 1.000 3 0.447 0.496 0.561 0.963 0.705 0.543 0.471 Recall 5 0.567 0.552 0.657 0.958 0.750 0.638 0.567 10 0.712 0.616 0.754 0.955 0.770 0.723 0.717 3 0.582 0.658 0.682 0.925 0.416 0.547 0.510 Precision 5 0.627 0.746 0.745 0.999 0.529 0.565 0.586 10 0.676 0.790 0.822 1.000 0.713 0.614 0.689 model 𝑓 was the selected winner model by AIC in the VPA data set in Analysis I. Further, (𝑓) is a sample of 10,000 genes drawn with replacement𝑓f(r𝑔o)(m𝑑,?̂?∗)(𝑓). For a g𝑓ene 𝑔 ∈ (𝑓), the true und𝑔erlying concentration–response relationshipis assumed to be the mod𝑐el fit of model on the VPA d𝑔ata of gene . For the null model, the mean response ?̄? (𝑔) is used as true value for . Given this true concentration–response relationship of gene , noise is added to generate a data set according to the m𝑓(o𝑔)d(e𝑑l𝑖 ,e?̂?q)u+ati𝑒o𝑖𝑗n (𝑗1)=. F1o,r…co, 𝑛n𝑖centration levels 𝑑 ∈ {𝑑1, … , 𝑑8} used in the original experiment (see Section 2), generate 𝑦𝜎 𝑖 (𝑗𝑔) = max (𝑓 (𝑑, , ?̂?))(−𝑓(𝑔),m𝑠)i.∶nT=he(a𝑓dde((𝑑d𝑓,n(?̂?𝑔o)))i)s⋅e𝑞v(a𝑠l)ues 𝑒𝑖𝑗𝑞a(r𝑠e) independently drawn from 𝜀 ∼  (0, (𝜎(𝑓 (𝑔), 𝑠))2). If 𝑓(𝑔) is not the null case(𝑔,)then ran(g𝑔)e , where the range for a gene with model 𝑓 is calculated as range(𝑓(𝑔)) ∶=𝑑∈{𝑑1,…,𝑑8}𝑔 𝑑∈{𝑑1,…,𝑑8} . The term is the empirical 𝑠-quantile of the ratio 𝑆(𝑔) (𝑔)all genes with a detected signal and their respect𝑠iv=e 0fi.t5s in Analysis𝑞I(.0H.5e)re=, (0𝑆.3(𝑔2)2)22= ∑𝑘𝑖=1∑𝑛𝑗=𝑖 1(𝑦𝑖(𝑔) ∕−ra?̄?n(𝑔g)e)(2𝑓∕(𝑁)−ac𝑗 𝑖 𝑘ro)ssis the estim𝜎(ated variance for each gene. Hence, for , we obtain , which is used for all nonflat modelsand all𝑞ge𝑓n((e𝑔0s).,5t𝑠o))=c=a0l𝑞c (𝑔) .1u9la0(t9𝑠e)𝜎(𝑓, 0.5) (Appendix, Figure A2). If 𝑓 (𝑔) is the null case, then range(𝑓(𝑔)) = 0. In this case, we use null , which is the empirical 𝑠-quantile of 𝑆 calculated for nonsignificant genes 𝑔 of Analysis I. We obtain null (Appendix, Figure A3). Using an adapted standard deviation per gene and model is preferred over using a fixed standard deviation, because it allows for comparability between different models and ranges with respect to goodness-of-fit (Kappenberg et al., 2021). Finally, the generated data set is analyzed as the original VPA data set in Analysis I. 5.2 Results Table 5 summarizes the results of the simulation w.r.t. signal detection (power), recall, and precision. For 𝑛𝑖 = 3, which mimics the conditions in the real VPA data set, a signal is almost always detected if it is present. However, the recall and precision rates for n𝐸onlinear and nonflat models for this scenario are below 0.69. If 𝑛𝑖 = 10, the rates of these modelselection errors are still lamrgaxe, even though the sample size𝑁 = 8 ⋅ 𝑛𝑖 = 80 is rather large in the context of toxicology. Forexample, if the sigmoid model is correct, for 31.1% of the generate𝑛d dose–response data another model is𝐸inmcaoxrrectlyselected. Due to the penalty term of the AIC used in model selection, complex correct models as the sigmoid or thebeta model have a comparatively large increase in recall rates when 𝑖 is increased. For comparatively noisy scenarios, these models are rarely selected. The opposite holds for the least complex model, the linear model. It has a comparatively very low precision rate (41.6%) and very high recall rate (70.5%) for 𝑛𝑖 = 3, but not for 𝑛𝑖 = 10. Precision values naturally have more practical value, as they give insight on how confident one can be with the model selection. The precision rate increases from 92.5%𝑛t𝑖o 99.9% for the flat model if 𝑛𝑖 increases from 3 to 5. For nonflat models, the precision rate does notexceed 82.2% at any . In practice, often one is not mainly interested in selecting the true underlying model but to have a sufficiently good fit. Figure 8 summarizes the relative loss in model fit by considering the log-ratio in root-mean-square error (RMSE) between the winner and the true model, that is, between the actually selected model and the fitted model if the correct 15214036, 2022, 5, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/bimj.202000250 by Technische Universitaet Dortmund, Wiley Online Library on [14/12/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License 894 DUDA et al. ni3 ni5 ni10 4 Winner model betaMod 2 emax exponential linear quadratic 0 sigEmax −2 7686 7534 8235 16958 9921 9251 9038 7399 8822 14196 11279 9675 10523 7794 9167 10804 11763 10400 F IGURE 8 Distributi𝑅on𝑀s𝑆o𝐸f log2(𝑅𝑀𝑆𝐸winner∕𝑅𝑀𝑆𝐸true) in the simulation. 𝑅𝑀𝑆𝐸true is the root mean squared error (RMSE) if thecorrect model is fitted and winner the RMSE of the selected winner model. For 108 genes, the log-ratio is greater than 5 and not displayed model is selected. For both, the RMSE is calculated with respect to the true dose–response curve that is known in the simulation. If the correct model is actually selected, this ratio is 0, because the true and selected model are the same. If the ratio is greater than one, then the selected model differs from the correct model and has a worse RMSE. The ratio of the RMSE values is independent of 𝑛𝑖 and of the range of the respective gene. It only captures the effect of the model selection. In general, the relative loss inRMSEdecrease𝑛s𝑖 w=it3h increasing𝑛𝑖 but is still present for𝑛𝑖 = 10, although𝑁 = 8 ⋅ 𝑛𝑖 = 80is already a large sample size in toxicology. For themedian of the log-ratio is 0.000𝑛0𝑖for all winnermodels, but 0.0543for the linear model. This demonstrates the low precision of the linear model for𝐸small . For small 𝑛𝑖 = 3, the penalty ofthe AIC is comparatively2large,=yi1e.l4d8i0ng too2many=se1l.e5c0t9ions of the simpler linearmmaxodel in cases where a more complexmodel m𝐸ight be required. If a more complex model such as the beta or sigmoid model is selected, the ratio’s upperquartile armealxargest with 0.464 and 0.594𝐸 , respectively. Hence, for 25% of the generated genes where thesigmoid mo𝑛del is smelaexcted, the selection is not correct and the RMSE is at least 50.9% larger than the RMSE of thecorrect model. For𝑖 the and for the exponential model, the upper quartiles of the ratio are closest to zero for each 𝑛𝑖 .𝐸With increasing , the penalty term of the AIC becomes comparatively weak. For 𝑛𝑖 = 10, this heavily affects the linearmmoadxel. It is selected less often and has log-ratios closely concentrated around zero. For the beta, quadratic, and sigmoidmodel, the upper quartile of the log-ratios remain comparatively2f0a.r23f5ro=m10.1.7I7f the beta model is selected, for 25% ofthe generated genes the selection is incorrect and the RMSE is at least times the RMSE if the correct model was selected. Thus, not selecting the correct model results in a noteworthy relative loss in goodness-of-fit, even when larger sample sizes are used in toxicology. 6 CONCLUSION In this work, MCP-Mod was used as model selection approach for gene expression concentration–response data. For the data set at hand, human embryonic stem cells were exposed to varying concentrations of VPA. For 54,675 probe sets or g𝐸enes the expression is measured. The data set indicated that modeling gene expression concentration–response dataremqauxires the consideration of several models, that is, a candidate model set. Only considering the popular 𝐸max or sigmoidmodel might not be sufficient. Especially nonmonotone models like the quadratic model should also be taken into account. When usingMCP-Mod, frequent selections of a linear model should not be misinterpreted as evidence for a true, linear concentration–response relationship. A large noise-to-signal ratio, or, more precisely, a large standard deviation to true response range ratio, favors the selection of the linearmodel. Also, there is typically no notable loss in goodness-of-fit, when instead of the linear modelthe second be𝐸st model is use𝐸d.Using a newly proposed score, , it was obsermvaexd that the max model can be omitted from the candidate set despite itspopularity, as long as themore general sigmoid model is included in the candidate set. Further, the score discourages to omit the linearmodel, even though it can be easily substitutedwith respect to goodness-of-fit. Including the linearmodel in the candidate set aims to detect unclear concentration–response signals rather than modeling detected signals well. If the linear model is omitted, one might fail to identify potentially interesting genes. Simulation studies based on the data set indicate that the confidence in the correctness of the selected model, measured by the precision, is not very high. log2RMSEwinner RMSEtrue 15214036, 2022, 5, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/bimj.202000250 by Technische Universitaet Dortmund, Wiley Online Library on [14/12/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License DUDA et al. 895 Even when increasing the sample size per concentration from 3 to 10, which is very large for this type of toxicological experiments, the precision of nonflat models does not exceed 0.83. Thus, increasing the number of experiments does not increase the precision in model selection proportionally. The relative loss in goodness-of-fit due to model selection mistakes decreases withincreasing sample size, but remains notable even for 10 replicates per concentration.The newly proposed  score served as a help to summarize analysis and simulation results. For a given candidate set, it considers the power, that is, number of detected genes, and the goodness-of-fit of genes with a detected signal simultaneously. Despite its simple form, its interpretability is not straightforward, which allows for improvements. If one does not want to consider both power and goodness-of-fit at the same time but, for example, focuses on optimizing power, the score is not an adequate tool. The data basis of this work is a single data set, which, despite its size and quality, is an obvious limitation. Similar anal- yses on other gene expression concentration–response data would be valuable to confirm our results. Another promising approach, which is not considered in this work, is model averaging. It would be interesting to analyze the influence of the different approaches and parameters on target dose estimation. ACKNOWLEDGMENTS This work has received funding from LivSysTransfer (BMBF: Project Number 031L0119) and was partially supported by the Research Training Group “Biostatistical Methods for High-Dimensional Data in Toxicology” (RTG 2624, Project P2) funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation: Project Number 427806116). The authors would like to thank the anonymous reviewers for their numerous helpful comments. CONFL ICT OF INTEREST The authors have declared no conflict of interest. DATA AVAILAB IL ITY STATEMENT The original data that support the findings of this study are openly available in ArrayExpress as stated by Krug et al. (2013) (https://doi.org/10.1007/s00204-012-0967-3). OPEN RESEARCH BADGES This article has earned an Open Data badge for making publicly available the digitally-shareable data necessary to reproduce the reported results. The data is available in the Supporting Information section. This article has earned an open data badge “Reproducible Research” for making publicly available the code necessary to reproduce the reported results. The results reported in this article could fully be reproduced. ORCID JuliaC.Duda https://orcid.org/0000-0002-3894-0124 FranziskaKappenberg https://orcid.org/0000-0001-8066-5333 JörgRahnenführer https://orcid.org/0000-0002-8947-440X REFERENCES Abelson, R. P., & Tukey, J. W. (1963). Efficient utilization of non-numerical information in quantitative analysis general theory and the case of simple order. The Annals of Mathematical Statistics, 34(4), 1347–1369. Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300. Bornkamp, B., Bretz, F., Dette, H., & Pinheiro, J. (2011).. Response-adaptive dose-finding under model uncertainty. The Annals of Applied Statistics, 5(2B), https://doi.org/10.1214/10-aoas445 Bornkamp, B., Pinheiro, J., &Bretz, F. (2009).MCPMod:AnRPackage for theDesign andAnalysis ofDose-Finding Studies. Journal of Statistical Software, 29(7), https://doi.org/10.18637/jss.v029.i07 Bretz, F., Pinheiro, J. C., & Branson,M. (2005). Combiningmultiple comparisons andmodeling techniques in dose-response studies. Biometrics, 61(3), 738–748. Buckland, M., & Gey, F. (1994). The relationship between recall and precision. Journal of the American Society for Information Science, 45(1), 12–19. Claeskens, G., & Hjort, N. L. (2008).Model selection and model averaging. Cambridge Books. Cotariu, D., Zaidman, J. L. (1991). Developmental toxicity of valproic acid. Life Sciences, 48(14), 1341–1350. 15214036, 2022, 5, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/bimj.202000250 by Technische Universitaet Dortmund, Wiley Online Library on [14/12/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License 896 DUDA et al. European Medicines Agency (2015). Qualification opinion of mcp-mod as an efficient statistical ethodology for model-based design and analysis of phase II dose finding studies under model uncertainty. https://www.ema.europa.eu/en/documents/regulatory-procedural-guideline/draft- qualification-opinion-mcp-mod-efficient-statistical-methodology-model-based-design-analysis_en.pdf. Filer, D. L., Kothiya, P., Setzer, R. W., Judson, R. S., & Martin, M. T. (2016). tcpl: the ToxCast pipeline for high-throughput screening data. Bioinformatics, btw680, https://doi.org/10.1093/bioinformatics/btw680 Food and Drug Administration. (2016). Determination letter. https://www.fda.gov/media/99313/download. Genton, P., Semah, F., & Trinka, E. (2006). Valproic acid in epilepsy. Drug Safety, 29(1), 1–21. House, J. S., Grimm, F. A., Jima, D. D., Zhou, Y.-H., Rusyn, I., & Wright, F. A. (2017). A pipeline for high-throughput concentration response modeling of gene expression for toxicogenomics. Frontiers in Genetics, 8, 168. Kappenberg, F., Grinberg, M., Jiang, X., Kopp-Schneider, A., Hengstler, J. G., & Rahnenführer, J. (2021). Comparison of observation-based and model-based identification of alert concentrations from concentration–expression data. Bioinformatics, 37(14), 1990–1996. Krug, A. K., Kolde, R., Gaspar, J. A., Rempel, E., Balmer, N. V., Meganathan, K., Vojnits, K., Baquié, M., Waldmann, T., Ensenat-Waser, R., Evans, R. M., Julien, S., Kortenkamp, A., Hescheler, J., Hothorn, L., Bremer, S., van Thriel, C., Krause, K.-H., Hengstler, J. G., Rahnenführer, J., Leist, M., & Sachinidis, A. (2013). Human embryonic stem cell-derived test systems for developmental neurotoxicity: A transcriptomics approach. Archives of Toxicology, 87(1), 123–143. O’Quigley, J., Iasonos, A., & Bornkamp, B. (2017).Handbook of methods for designing, monitoring, and analyzing dose-finding trials. Handbooks of Modern Statistical Methods, CRC Press. Pinheiro, J., Bornkamp, B., & Bretz, F. (2006). Design and analysis of dose-finding studies combining multiple comparisons and modeling procedures. Journal of Biopharmaceutical Statistics, 16(5), 639–656. R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Schorning, K., Bornkamp, B., Bretz, F., &Dette, H. (2016). Model selection versusmodel averaging in dose finding studies. Statistics inMedicine, 35(22), 4021–4040. Thomas, N., Sweeney, K., & Somayaji, V. (2014). Meta-analysis of clinical dose–response in a large drug development portfolio. Statistics in Biopharmaceutical Research, 6(4), 302–317. Ting, N. (2006). Dose finding in drug development. Springer Science & Business Media. Wheeler, M. W., & Bailer, A. J. (2009). Comparing model averaging with other model selection strategies for benchmark dose estimation. Environmental and Ecological Statistics, 16(1), 37–51. Xun, X. & Bretz, F. (2017). TheMCP-Modmethodology: Practical considerations and the dose finding r package. In J. O’Quigley, A. Iasonos, & B. Bornkamp (Eds.), Handbook of methods for designing, monitoring and analyzing dose-finding trials (pp. 205-227). Chapman and Hall/CRC. SUPPORT ING INFORMATION Additional supporting information may be found in the online version of the article at the publisher’s website. How to cite this article: Duda, J.C., Kappenberg, F., & Rahnenführer, J. (2022). Model selection characteristics when using MCP-Mod for dose–response gene expression data. Biometrical Journal, 64, 883–897. https://doi.org/10.1002/bimj.202000250 APPENDIX 300 150 100 75 100 200 50 50 100 25 0 0 0 0 500 1000 1500 −5 0 5 10 5 10 ED50estimate Emaxestimate E0estimate F IGURE A1 Barplots of parameter estimates of the 𝐸max model for genes where it was selected as winner model by the AIC Gene count Gene count Gene count 15214036, 2022, 5, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/bimj.202000250 by Technische Universitaet Dortmund, Wiley Online Library on [14/12/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License DUDA et al. 897 1500 1000 500 0 0.0 0.2 0.4 0.6 0.8 S / range 𝑞F(I0G.1)U=R0E.14A272 𝑞(0B.5a)rp=lo0t.o3f22𝑆2to rang𝑞e(0r.a9t)io=fo0r.5g2e4n5es with an FDR adjusted signifMCP in Analysis I. The vertical dotted lines are at, , and 10000 7500 5000 2500 0 0.0 0.2 0.4 0.6 0.8 S F IGUR and 𝑞null(0E.9)A=30.28B9a8rplot of 𝑆 for nonsignificant genes in Analysis I. The vertical dotted lines are at 𝑞null(0.1) = 0.1236, 𝑞null(0.5) = 0.1909, Gene count Gene count 15214036, 2022, 5, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/bimj.202000250 by Technische Universitaet Dortmund, Wiley Online Library on [14/12/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License Article 2 Computational Toxicology 23 (2022) 100234 Contents lists available at ScienceDirect Computational Toxicology journal homepage: www.sciencedirect.com/journal/computational-toxicology td2pLL: An intuitive time-dose-response model for cytotoxicity data with varying exposure durations Julia Duda a,*, Jan G. Hengstler b, Jörg Rahnenführer a a Department of Statistics, TU Dortmund University, Vogelpothsweg 87, Dortmund 44227, Germany b Department of Toxicology, TU Dortmund University, Leibniz Research Centre for Working Environment and Human Factors, Ardeystr. 67, Dortmund 44139, Germany A R T I C L E I N F O A B S T R A C T Keywords: Statistical modeling approaches for dose-response or concentration-response analyses are often required in Exposure duration-concentration-response toxicological applications, especially for cytotoxicity assays. By fitting a concentration-response curve, one can Time-dose-response derive target concentrations, such as the EC50. In practice, concentration-response data for different exposure Dose-response durations might be available and the target concentration for each or some exposure duration(s) are of interest. Exposure-duration Cytotoxicity In this work, we propose a statistical modeling approach that improves the precision of the target concentration Statistical modelling estimation at a given exposure duration by extrapolating the concentration-response data from other exposure durations. The method further enables target concentration estimation at exposure durations that were not conducted in the experiment. For practitioners, the proposed model yields additional complexity compared to the simple approach of a single concentration-response curve for all exposure durations. It would only be used if it improves the estimation of the target concentration compared to the simple approach. We propose a two-step pipeline to decide between using the complex and the simple approach to result in a precise target concentra- tion estimation. The methods were evaluated using a simulation study and a real data set. The models are accessible for practitioners through the R package td2pLL. cell cycle [3]. This usually leads to recommended exposure durations of 1. Introduction 24, 48, or 72 h, depending on the test. In their guidance document, the ICCVAM also referred to Riddell et al. [4], who first addressed the In cytotoxicity assays, concentration-response curves help to un- exposure duration question for cytotoxicity tests. They found large dif- derstand the functional relationship between exposure of cells in the ferences in the toxicity of compounds between 48 h and 72 h exposure culture medium and viability. To conduct an assay, in addition to the duration. Possible mechanistic explanations are that some substances concentration, the exposure duration of the cells should also be set. Gu damage cell membranes while others affect DNA replication or cell di- et al. [1] analyzed the relevance of the exposure or incubation duration vision. The toxic effects of the former can be observed after a short time on cytotoxicity in primary human hepatocytes (PHH), in which exposure and the latter only after longer exposure durations, as the cell is only durations of 1 or 2 days are normally administered [2]. They showed a affected during certain phases of the cell cycle [5]. Consequently, if the clear influence of exposure duration on EC50 values of a set of 30 com- toxicological mechanism of a compound is unknown, one should pounds incubated for 1, 2 and 7 days by separately fitting concentration- consider various exposure durations. If toxicity can be measured for response curves for each compound and incubation duration. The se- short exposure durations, it is of further interest to investigate the lection of one or more exposure duration(s) is therefore relevant for exposure duration dependency. cytotoxicity assays as it affects the target concentration estimation. An exposure duration-concentration-response (ECR) model could However, there is no clear guideline for selecting exposure durations for facilitate answering questions such as: What is the hypothetical limit cytotoxicity tests. The guidance document of the Interagency Coordi- target concentration for infinitely long exposure durations? When does nating Committee on the Validation of Alternative Methods (ICCVAM), increasing the exposure duration cease to decrease the target concen- for example, states that longer exposure durations tend to enhance the tration? The mechanistic motivation of the proposed ECR model helps to sensitivity of a test and propose to use exposure durations of at least one answer questions relating to dependency between the exposure duration * Corresponding author. E-mail addresses: duda@statistik.tu-dortmund.de (J. Duda), hengstler@ifado.de (J.G. Hengstler), rahnenfuehrer@statistik.tu-dortmund.de (J. Rahnenführer). https://doi.org/10.1016/j.comtox.2022.100234 Available online 1 June 2022 2468-1113/© 2022 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). J. Duda et al. C o m p u t a t i o n a l T o x i c o l o gy 23 (2022) 100234 Nomenclature DMSO Dimethyl sulfoxide ETOH Ethanol Abbreviations FAM Famotidine td2pLL time-dose two-parameter log–logistic GLC Glucose EC50 effective concentration of half maximal effect HYZ Hydroxyzine PHH primary human hepatocytes INAH Isoniazid ICCVAM Interagency Coordinating Committee on the Validation of KC Ketoconazole Alternative Methods LAB Labetalol ECR exposure duration-concentration-response LEV Levofloxacin 4pLL four-parameter log–logistic MEL Melatonin 2pLL two-parameter log–logistic MePA Methylparaben RSS residual sum of squares NAC Acetylcysteine AMAFC average mean absolute log2 fold-change NFT Nitrofurantoin APAP Acetaminophen NIM Nimesulide ASP Aspirin PHB Phenylbutazone BOS Bosentan PMZ Promethazine BRP Buspirone PPL Propanolol BUSF Busulfan RIF Rifampicin CBZ Carbamazepine TSN Triclosan CHL Chlorpheniramine maleate VITC Vitamin C CLON Clonidine VPA Valporic acid DFN Diclofenac and the target concentration. The hypothetical target concentration the specific ECR model, proposed here in Section 2.1. In addition to the EC50 for an infinitely long exposure duration is just one parameter of the mechanistic motivation, the application of the ECR model on a real ECR model. The parameter and the answer to the question are therefore cytotoxicity data set is presented in Section 2.4, which supports the available as soon as the model is fitted. ECR modeling could solve the model assumptions. The benefits of our model are described, for which issue of exposure-duration selection for concentration-dependent cyto- the assumptions and possible modifications are justified as far as the toxicity testing. limited size of the available data allowed. Fitting an ECR model has two main advantages over fitting separate To understand Section 2.4, a brief explanation of the concept of concentration-response curves for each exposure duration: overfitting may be required for readers unfamiliar with this term. Generally, a model with many parameters can fit to a data set more 1. The target concentration estimation is not restricted to the exposure closely than a model with fewer parameters. A more flexible model, durations conducted in the experiment. As the ECR model fits a however, is subject to a high risk of overfitting the data, i.e. it even fits to surface over the combinations of concentration and exposure dura- small deviations in the data that are due to noise. When overfitting is tion, it predicts a viability response for each possible combination strong, the resulting model would perform poorly on new data if the including those that are not conducted in the experiment. ECR model experiment is repeated. The overfitting effect is always present, but less fitting thus yields a target concentration estimation that is less sen- strong or negligible for large data sets and less parameters. For small sitive to the selection of exposure durations than fitting separate data sets and many parameters its effect is stronger. Hence, the number concentration-response curves at each exposure duration. of parameters must be accounted for, i.e. the flexibility of a model, when 2. A benefit of an ECR model is the increased statistical precision in comparing how closely a model fits the data as a measure of model target concentration estimation. If one fits a one-dimensional con- performance. Otherwise, a model with more parameters will automati- centration-response curve for each exposure duration separately, cally fit the data more closely than a model with fewer parameters, but only the data that belong to the respective exposure duration are the seemingly better performance in the sense of a closer fit can be a used for each fit. When fitting an ECR model, only a single fit is result of overfitting. calculated using the data from all exposure durations. This increases In addition to the proposed ECR model, we propose a two-step the sample size, the precision in estimating the model parameters pipeline to automatically distinguish between experimental data for and, in turn, the target concentration estimation. which fitting an ECR model is beneficial and where it might be unnec- essary complicated. It is possible that concentration-response data are These two benefits of the ECR model over separate concentration- available for different exposure durations, but all exposure durations are response curve fitting at each exposure duration underline the attrac- rather long so that the target concentration (such as the EC50) does not tiveness of the ECR modeling approach. More recent works highlighted differ between these exposure concentrations. In this case, it would be the need (Morin et al. 2018) and efforts (Focke et al. 2017; Serra et al. unnecessarily complicated and would not yield better target concen- 2020) to overcome suboptimal separate concentration-response curve tration estimation when fitting the proposed ECR model. For these cases, fitting when concentration-response relationships at different exposure it is preferred both in terms of a better target-concentration estimations durations are to be analyzed. These are discussed in Section 2.2. Two- and simplicity of the model to treat the data (after normalization with dimensional ECR modeling has key advantages in comparison to sepa- exposure duration-wise control) as if there is only one exposure dura- rate concentration-response modeling when different exposure dura- tion. A simple concentration-response curve should be fitted using the tions are conducted in the experiment and current research seeks for data of all exposure durations. To account automatically for such a good ECR models. scenario, we additionally propose a statistical two-step procedure. In The validity of the model assumptions is crucial for the major step 1, it is decided objectively whether the EC50 differs between improvement of ECR modeling over separate concentration-response different exposure durations. If so, the ECR model is fitted in step 2. If fitting. We addressed this considering the mechanistic motivation of not, a single concentration-response curve is fitted in step 2. The pro- 2 J. Duda et al. C o m p u t a t i o n a l T o x i c o l o gy 23 (2022) 100234 posed two-step pipeline helps practitioners to fit an ECR model only on experimental data where it is beneficial and necessary. 100 Exp. dur. To quantify and compare the potential benefit of fitting the proposed ECR model and the two-step pipeline, we conducted a simulation study t = 1 based on real cytotoxicity data of Gu et al. [1], which is presented in 75 t = 2 Section 2.5. We compared the new model and pipeline to fitting separate t = 3 concentration-response curves, as well as fitting a single concentration- t = 100 response curve to all exposure durations. The latter approach effectively ignores that there are different exposure durations. The simulation study 50 accounts for many scenarios, such as different exposure duration effects ∆ = 0.1 on the target concentration EC50 (no, small and large effect) and different levels of background noise (small, medium, large) in the data. 25 A common bottleneck for implementing sound theory into practical usage is that sophisticated statistical methods are often difficult to grasp for practitioners with less profound mathematical background or there is a lack of software availability. These drawbacks can prevent theoreti- 0 C0 = 0.01 cally well-established methods from being used in practice. Hence, simplicity and good interpretation of the proposed ECR model is an 0.001 0.010 0.100 1.000 advantage compared to other works in this field, as discussed in Section Concentration 2.1. To overcome problems with practical implementation and calcu- lations of our proposed ECR model and methods, they were made Fig. 1. Concentration-response curves of a td2pLL model at different exposure available through an open source R software [6] package td2pLL via durations t with parameter Δ = 0.1. Black symbols on the dotted line show how GitHub (https://github.com/jcduda/td2pLL/). We therefore eased the the EC50(t) moves dependent on t. usage of our methods for practitioners both by making our model easy to interpret and by providing a software implementation. with respect to the raw mean response at the control, the viability is 100 In this study, we addressed the question as to whether there is a [%] at the control, and for large enough concentrations, the viability benefit in using the proposed ECR model and the two-step pipeline when tends towards 0[%]. Mathematically, this is equivalent to setting E0 = estimating target concentrations such as the EC50 for cytotoxicity data, 100 and Emax = −100 in Eq. (1), yielding: where different exposure durations are conducted in the experiment. ( ) xh The methods have intrinsic advantages compared to fitting separate f x = 100 − 100 h . (2) xh + EC concentration-response curves for each exposure duration and they 50 improve target concentration estimation, which was demonstrated in a Hence, for cytotoxicity one can reduce the 4pLL model to the 2pLL simulation study based on real data. model, as only two parameters, EC50 and h, remain to be estimated. In We note that in the mathematical literature, time-dose-response model cytotoxicity assays it can occur that the viability does not reach 0%. For is the general term and hence motivates the chosen name td2pLL (time- an initial development of an ECR model, we restrict ourselves to the dose two-parameter log–logistic) model. However, we used the biolog- reasonable assumption of 0% viability at large concentration. Examples ically correct term exposure duration-concentration-response (ECR) model of 2pLL curves are presented in Fig. 1. throughout this manuscript. The 2pLL concentration-response model depends only on the con- centration. To incorporate exposure duration, the EC50 parameter is 2. Materials and methods modelled to be dependent on the exposure duration, t, by setting EC50 = EC50(t). This is intuitive because, e.g., for a longer exposure duration, 2.1. Exposure duration-concentration-response model one expects a smaller concentration to yield 50% viability. How exactly the EC50(t) changes according to the exposure duration, t, needs to be In this section, we explain how the proposed ECR model is an formulated mathematically. We therefore applied a modified version of extension of a commonly used concentration-response curve. A detailed Haber’s law [10] following Miller et al. [11]. Originally, Haber postu- explanation if given of how the parameters of the model can be inter- lated that a lethal effect of a compound is determined by multiplying the preted in terms of toxicological research interests. concentration of the compound and the exposure duration, i.e. effect = For cytotoxicity assays, the most popular model used for describing concentration ⋅ exposure duration. For exposure duration t = 1, we concentration-response curves is the sigmoidal four-parameter log–lo- denote the lethal effect for which half of the cells die, by Δ̃. By definition, gistic (4pLL) model ([7–9]): the corresponding concentration is the EC50: ( ) xh f x = E0 + Emax h , (1) Δ̃ = EC50⋅t. xh + EC50 Miller et al. [11] further introduced a parameter C0 as a lower limit where x is the concentration and f(x) the viability in %. The Hill or slope for EC50(t). The parameter C0 can be interpreted as the EC50 at an infi- parameter, h, represents the steepness of the curve, with a greater value nitely large exposure duration. This leads to the equation indicating a steeper curve. EC50 is the concentration where the half- Δ̃ = (EC − C )⋅t. maximal effect is reached, i.e. half of Emax. For clarity, note that the 50 0 other well-known parametrization ([7]) of the 4pLL model is Lastly, concentration and exposure duration might not change the ⎛ ⎞ lethal effect equally when increased or decreased by one unit. To capture d − c d − c this, different exponents, α and β, were added in the model: f ⎜⎝x⎟⎠ = c + = c +1 ( )+ exp{b(log(x) − log(e))} 1 x b + e Δ̃ = (EC50 − C α0) ⋅tβ. where d is the upper asymptote, c is the lower asymptote, e is the EC Solving the equation for EC50 50 yields: and b is h. For viability assays, one can assume that after normalization 3 Response J. Duda et al. C o m p u t a t i o n a l T o x i c o l o gy 23 (2022) 100234 Fig. 2. Explanation on how to interpret the modeled influence of exposure duration on EC50 (red line) when fitting the proposed exposure duration-concentration-response (ECR) model with four ((a)-(d)) example parametrizations of the proposed ECR model, td2pLL. They have the same steepness parameter h = 2 and the same Δ parameter. Δ is the difference in EC50 values at exposure duration t = 1 and hypothetical exposure duration t→∞, where EC50(t→∞) = C0. Comparing the top row ((a), (b)) with the bottom row ((c), (d)), γ increases, which ex- plains the stronger change of the EC50 at short exposure durations. In the right column ((b), (d)), the EC50 does not appear to change much between different exposure durations compared to the left column ((a), (c)). This is because the concentration is visualized on a logarithmic scale and the fold-change be- tween C0 +Δ = 0.2 (= EC50 at exposure duration t = 1) and C0 = 0.1 is only 0.2/ 0.1 = 2, compared to 11 in (a) and (c). 1/α Δ̃ t−β/α C EC concentration. The parameters that define how the EC50 changes for ⏟̅⏞⏞̅⏟ + 0 = 50. different exposure durations are C0, γ, and Δ and can be interpreted as :=Δ follows: Mathematically, the above parametrization does not lead to a unique solution when fitting the model to an appropriate data set. A re- • C0 is the limit EC50 for a hypothetical, infinitely long exposure parametrization circumvents this technical problem, yielding the final duration. For increasing exposure duration t, EC50(t) approaches C0 dependency of the EC50 on the exposure duration: (if γ > 0, cf. Figs. 2 and 1). ( ) • γ > 0 indicates when the EC50(t) changes most between different ⇔ γ=β/αΔ⋅t−γ + C0 = EC50 t . exposure durations. A larger γ indicates a stronger initial change of Note that as soon as the parameters, Δ, γ and C are known/esti- the EC50 followed by quickly reaching a plateau in the EC50. This 0 mated, the EC50(t) can be derived/estimated directly at any exposure means that for shorter exposure durations the EC50 differs markedly duration t of interest using the above formula. There is no restriction to and for larger exposure durations the EC50 does not change much as exposure durations conducted in the experiment. By plugging EC (t) it is already close to its limit C0. A smaller γ indicates a more 50 into the 2pLL model, the new td2pLL ECR model is obtained: balanced change of the EC50 between exposure durations. This ( ) means that the EC50 differs considerably even for two comparatively xh long exposure durations. In other words, the EC50 reaches a plateau f t, x = 100 − 100 . (3) xh + (Δ⋅t−γ C h+ ) less quickly, i.e. at even longer exposure durations compared to 0 larger γ > 0. Examples of the model are visualized in Fig. 2. The interpretation of the • γ < 0 is the unintuitive case that the EC50 increases when the expo- slope parameter, h, remains as in the standard concentration-response sure duration increases. This increase could be valid if the substance 2pLL model: A larger h leads to a steeper curve alongside the 4 J. Duda et al. C o m p u t a t i o n a l T o x i c o l o gy 23 (2022) 100234 is beneficial instead of toxic. Mathematically, the model contains no constraints on γ, but practically this case is typically irrelevant. • Δ is the difference in EC50 values at exposure duration t = 1 and Significant exposure duration t = ∞, i.e. the limit EC50 value for infinitely large exposure durations, which is C . Δ hence indicates the range of the difference0 EC50 values for exposure durations t⩾1 (cf. Fig. 1). The EC50 at Step 1 between 50s exposure duration t = 1 is always Δ + C0. An example of the inter- pretation of Δ and C is described below. Mathematically, EC t across exposure0 50( = 1) −EC50(t→∞) = (Δ + C0) −C0 = Δ. durations? In the following we explain the interpretation of the model param- eters C0 and Δ using examples (Figs. 2 and 1). When considering fold changes in the EC50 estimates between the exposure duration t = 1 and the hypothetical exposure duration, t→∞, both estimates of Δ and C0 should be conducted simultaneously. For example, given the estimates C0 = 0.01 and Δ = 0.1, the fold change is 11. This can be derived Fit td2pLL Fit single 2pLL directly from the model parameters by Step 2 (Account for effect of (Ignore effect of exposure exposure duration) duration) EC50(t = 1)/EC50(t→∞) = (C0 + Δ)/C0 = (0.01 + 0.1)/0.01 = 11. However, if the estimated EC at an infinitely long exposure duration is Fig. 3. The proposed two-step pipeline decides in an objective, statistical 50 manner if the two-dimensional model should be used to model both concen- C0 = 0.1 while still Δ = 0.1, then there is only a fold change of tration- and exposure duration-dependency, or if it suffices to use a one- (0.1 +0.1)/0.1 = 2 in the EC50 value between exposure durations t = 1 dimensional model that only includes the concentration. and t→∞. In summary, the proposed ECR model, td2pLL, has a clear derivation as it naturally extends the commonly used sigmoidal the experimental data, we propose a two-step pipeline. Step 1 involves concentration-response using Haber’s law [10]. deciding statistically if the exposure duration has an influence on the EC50. If it does, an ECR model is used for fitting in step 2. If not, a single, 2.2. Comparison to other approaches one-dimensional 2pLL concentration-response curve is fitted on all data in step 2 and the information on exposure duration is ignored (Fig. 3). Efforts on modeling ECR data have been increasing lately. Focke For step 1, an ANOVA-based test can be used. Statistically, the hy- et al. [12] used a similar approach to ours, but the exposure duration- pothesis that the EC50 is the same across all considered exposure dura- response relationship was modeled with a sigmoidal function and the tions ti is tested against the hypothesis that it is different for any pair of concentration component was added according to Haber’s law. This values t1 and t2. Here, a typical signal plus noise model resulted in a different ECR model. There was no software implementa- ( ) tion provided by Focke et al. [12]. In the work of Schüttler et al. [13], an yijk = f ti, xj + εijk (4) ECR model for toxicogenomic data was proposed where the concentration-response relationship was a 4pLL model and the exposure with i = 1, …, k exposure duration levels, j = 1, …, l concentration duration was also incorporated using the EC50 parameter. However, the levels, nij replicates at exposure duration-concentration setting (ti, xj), 2 dependence of the EC50 parameter was modeled using a non-monotone and corresponding noise εijk ∼ N (0, σ ) is assumed. logarithmic Gaussian function. The authors justified this type of de- The two nested models Q0 and Q1 are then compared. Model Q0 is a pendency with its good empirical fit to toxicogenomic data. For cyto- regular 2pLL concentration-response model that is fitted to the data, toxicity data, however, a non-monotone exposure duration dependency where exposure duration is completely neglected. Model Q1 allows for is less realistic. A software implementation was provided by Schüttler each exposure duration an individual EC50 parameter. The remaining et al. [13] through an R-package. In the R-package TinderMIX intro- hill parameter h remains shared across exposure duration levels. Given duced by Serra et al. [14], ECR toxicogenomic data were modeled using the assumption of normally distributed errors, the (nested) ANOVA linear regression with polynomials with a maximal order of three. One statistic is obvious limitation is that these models might predict unrealistic (e.g. ( )/( )RSSQ − RSSQ η − ν negative) response values for exposure duration-concentration settings F 0= /1 , (5) RSS ν that are not close to the exposure duration-concentration settings of the Q1 experiment. Furthermore, the model parameters are not easily inter- where η are the residual degrees of freedom for Q , ν are the residual pretable if quadratic terms or terms of a higher order are included. With 0degrees of freedom for Q , and RSS is the residual sum of squares. The td2pLL, we fill a scientific gap by providing a mechanistically motivated, 1test rejects the hypothesis that Q0 is the true model at significance level easily interpretable model for cytotoxicity data that is embedded in an α, if the observed value for F is greater than F(1 −α, η −ν, ν), the 1 −α R-package to ease its application for practitioners. quantile of an F-distribution with η −ν and ν degrees of freedom. A rejection of the null hypothesis that Q0 is the true model means that the 2.3. Two-step pipeline for exposure duration-concentration-response data hypothesis that there is no exposure duration-dependency for the EC50 is rejected. Even when exposure duration-resolved and concentration-resolved Hence, rejecting the null hypothesis in step 1 leads to fitting the ECR toxicity data are available, they might not exhibit a clear exposure model in step 2. If the test does not reject the null hypothesis in step 1, duration-effect. One possible reason is that the experimentally chosen information on exposure duration is ignored and a single 2pLL exposure durations all lie within a range where the exposure duration- concentration-response model is fitted in step 2. The significance level α effect is already saturated and the EC50 no longer changes. For such must be chosen in advance. cases, it is unnecessarily complicated and not appropriate to use the The two-step pipeline provides an objective framework to make a proposed ECR model. Instead, ignoring the information on exposure decision regarding the data as to whether it is beneficial to account for duration and fitting a single concentration-response curve would be possible differences in target concentrations across different exposure considered reasonable. To automatically decide which case applies to 5 J. Duda et al. C o m p u t a t i o n a l T o x i c o l o gy 23 (2022) 100234 Table 1 2.5. Simulation study - setup Overview of the simulation study setup that compares the precision of the EC50 estimation between the proposed method (Two-Step) and other methods, based 2.5.1. Overview on various simulated scenarios that use real cytotoxicity data from Gu et al. [1]. To compare the performance of our method to other statistical Aim Comparing the performance of the proposed exposure analysis approaches, we used a simulation study based on the cytotox- duration-concentration-response modeling approach (Two- icity data of Gu et al. [1]. The computations are performed with the step pipeline) for toxicity assays with respect to target td2pLL R package (version 1.0.0.). In such a simulation study, one as- concentration estimation precision with other approaches. sumes different scenarios of ECR relationships as true. Given such a true scenario, data can be generated and the methods are applied on the Data Generating Based on data from Gu et al. [1] with models M ∈ {M0, M1 , generated data to estimate a target concentration for the assumed, true Mechanism M2} =: M where - M is the scenario where the EC does not change across ECR scenario. For reliable simulation results, for each scenario, the data 0 50 exposure durations, generation and method application process was repeated many (nsim = - M1 is the scenario where the EC50 changes moderately 1000) times. Performance of the methods was assessed by averaging across exposure durations, - M is the scenario where the EC changes a lot between precision of target concentration estimation, ÊC50, across the different 2 50 exposure durations. simulation scenarios. -Experimental design An overview of the main components of the simulation study using -k ∈ {3, 4} =: K different exposure durations the ADEMP summarization principle [16] (Aim, Data Generating -nobs ∈ {72, 216} =: N obs observations -(Normal) noise N N N N Mechanism, Estimands, Methods, Performance) is provided in Table 1. ∈ { 1, 2 , 3} = : N oise In our simulation study, the general estimand was the target concen- tration EC . To enhance the generalization of the simulation study re- Estimands EC50 (Concentration causing a response that is 50 percent of 50 the maximum achievable) sults on real data, many factors of data generation were crossed with each other to generate a large, robust pool of simulation scenarios. To understand the presentation of the results in Section 3.2, impor- Methods -Two-Step: Proposed two-step-procedure with ANOVA pre- test in step 1 and td2pLL ECR model fit or 2pLL tant factors of the data generation should be noted. To mimic the pos- concentration-response model fit with ignored exposure sibility of different magnitudes of influence of the exposure duration on duration, respectively, in step 2. the target concentration in real data, in the simulation study we -Always td2pLL ECR model fit considered three levels of influence of the exposure duration on the -Always separate 2pLL concentration-response model fit per exposure duration target concentration EC50, namely no (M0), little (M1) and strong (M2) -Always single 2pLL concentration-response model fit influence. We further accounted for varying levels of background noise neglecting exposure duration and thus pooling with respect in real data by including little (N1), medium (N2) and strong (N3) to exposure durations background noise scenarios in the simulated data. In the following, further detailed explanations on the data generating mechanism, Performance -AMAFC: Mean of the absolute log2 fold change of estimated methods, and performance measures are presented. and true EC50 across exposure durations and averaged over simulation replications. A small AMAFC is desirable. 2.5.2. Details To generate ECR data for the simulation study, the assumed true durations by fitting the proposed ECR model, or if it is sufficient to fit a model M, the experimental design, and the variability N of the added concentration-response curve. background noise must be specified. Three models representing different types of exposure duration dependency were used. These 2.4. Data models were derived from models fitted on real data. A strong exposure duration dependency means that the EC50 differs a lot between different The data on which our simulation study is based come from Gu et al. exposure durations. For a strong (M2) and a weak (M1) exposure dura- [1] where cytotoxicity testing was performed on primary human hepa- tion dependency, the td2pLL model was fitted to data from two com- tocytes. These data are publicly available and the publication contains pounds tested by Gu et al. [1], namely chlorpheniramine (CHL) and details of the laboratory protocol. The data set contains concentration- ethanol, respectively. A model without exposure duration dependency response data of 30 compounds. For each compound, 3 biological rep- (M0) was derived from the fitted model M1 by setting γ = 0, which licates (donors) are available. For each donor, there are measurements removed the influence of the exposure duration on the viability response of 3 exposure durations (1, 2 and 7 days) and (including solvent) 6, 7, or such that the EC50 was the same across all exposure durations (Fig. 4). To 8 concentrations (mostly 6). The concentrations are equidistant on a log- enhance comparability, the concentration ranges of the compounds √̅̅̅̅̅̅ scale with base 10. There are 4 or 8 measurements per compound, were rescaled to be between 0 and 1. donor, exposure duration and concentration. Note that the data was To cover more aspects of a real experiment in the simulation sce- normalized: For each compound, the viability values per donor were narios, we varied the number of exposure durations in the generated divided by the respective donor-solvent response value and multiplied data. Given a true ECR model M from the set {M0, M1, M2}, the true by 100. This normalization implicitly assumes that compound, donor response values are calculated at either k = 3 or k = 4 different exposure and exposure duration only affect the response at concentration 0. Other durations as depicted in Fig. 5. For 3 exposure durations, the selected correlations cannot be accounted for with this typical normalization exposure duration-concentration points (ti, xj) agree with the ones procedure. More over, the ‘retting’ approach by Kappenberg et al. [15] selected in the real experiment of Gu et al. [1] for compound CHL. Note was applied as an additional pre-processing step, to better justify the that in Gu et al. an effect of the exposure duration was anticipated, such assumption on the asymptotes. Lastly, the concentration-range was that for the larger exposure durations of 7 (days), lower concentrations rescaled to [0,1] to guarantee comparison of the results across were selected. In the simulation, either nobs = 72 or nobs = 216 obser- compounds. vations are generated with equal sample size for all exposure duration- concentration combinations (ti,xj). To understand the influence of noise, we added different levels of background noise to the assumed true mean responses. It was assumed that all data points at a specific exposure duration-concentration point 6 J. Duda et al. C o m p u t a t i o n a l T o x i c o l o gy 23 (2022) 100234 Fig. 4. To increase the representative- ness of the simulation study for real data, three different scenarios of effect (none, weak, large) of exposure duration on EC50 were included in the simulation. The weak (b, model M1) and large (c, model M2) effect are each a td2pLL model fit on real ECR data from Gu et al. [1] of the compounds ethanol and chlorpheniramine. The no-effect model M0 (a) was derived from the low-effect model M1 (b), with complete elimina- tion of the exposure duration effect by setting γ = 0. The models M0, M1, and M2 served as assumed true exposure duration-concentration-response re- lationships for the simulation study. Exact parameters of the models are: M0 (h = 1.26, Δ = 0.07, γ = 0, C0 = 0.05), M1 (h = 1.26, Δ = 0.07, γ = 2.19, C0 = 0.05), M2 (h = 1.54,Δ = 0.06,γ = 2.05, C0 = 0.02). Note that the controls could not be shown due to the logarithmic scale of the concentration axis. driven approach yields more realistic simulated data. For each com- 3 exposure durations 4 exposure durations pound, donor and exposure duration-concentration point (ti, xj), the 7 12 12 12 12 12 12 9 9 9 9 9 9 standard deviation σi,j of the pre-processed (cf. Section 2.4) cytotoxicity measurements was calculated. Note that this procedure calculated a single standard deviation value from 3 to 4 measurements. A model was fitted to the standard deviations depending on the exposure duration, ti, 4 9 9 9 9 9 9 and pseudo-log transformed concentrations, xi (see Figs. 5 and 6). The outcome (standard deviation) was modeled as a linear combination of 2 2 12 12 12 12 12 12 9 9 9 9 9 9 intercept, concentration, concentration , exposure duration and the 1 12 12 12 12 12 12 9 9 9 9 9 9 interaction between concentration 2 and exposure duration. This means that the concentration can have a quadratic effect and exposure duration a linear effect on the standard deviation. Only significant variables (p- Concentration value < 0.05) were retained in the model. The resulting model was used to calculate the standard deviation for the normally distributed noise ε ∼ Fig. 5. To increase the representativeness of the simulation study for real data, N (0, (σ 2i,j) ). From ε, realizations were drawn that were added to the the number of exposure durations was varied in the simulation. Experimental response values f (t ,x ). designs are shown with k 3 (left) and k 4 (right) different exposure dura- M i j= = tions for the data generating mechanism in the simulation study. Only the cases We considered 3 scenarios with increasing noise levels. In N1 (low with nobs = 216 are depicted. For nobs = 72, the replicates at each exposure noise), the generated standard deviations were divided by 2, in N2 duration-concentration point were reduced to 4 and 3, respectively. Note that a (medium noise) they were left unchanged, and in N3 (high noise), they pseudo-log transformation was used to display the control at concentration- were multiplied by 2. This approach of adding background in the level 0. simulation study closely mimicked real data background noise and therefore further increased the representativeness of the simulation (ti, xj) have the same, true mean response fM(ti, xj) of model M ∈ {M1,M1, study for real ECR data. M2}. Noise was added as realizations of independent, normally distrib- To compare the approaches considered in this work for fitting ECR uted, mean zero noise (cf. Eq. (4)). The noise standard deviation, σ, was data and predicting target concentrations, each approach was applied on chosen based on a linear model of the empirical standard deviations each ECR data generating scenario. All 3⋅2⋅2⋅3 = 36 combinations of observed in the real data of all compounds in Gu et al. [1]. This data- data generation in the space M × K × N obs × N oise were crossed with 7 Exposure duration 0 1e−04 0.001 0.01 0.1 1 0 1e−04 0.001 0.01 0.1 1 J. Duda et al. C o m p u t a t i o n a l T o x i c o l o gy 23 (2022) 100234 ti = 1 ti = 2 ti = 7 60 40 20 0 Concentration xj Fig. 6. The model used for determining exposure duration-concentration dependent average noise levels in the simulation study captured properties of real data background noise. Points are empirical standard deviations per compound, donor, concentration and exposure duration in the pre-processed data of Gu et al. [1]. The blue line shows a model fit to these standard errors depending on the exposure duration ti and the pseudo-log transformed concentration xj. The model fit was used to select exposure duration-concentration point dependent standard deviations of the noise that was added to the generated data in the simulation. Note that the model captures intuitive noise characteristics, such as a decrease in noise at the maximal concentration. each method for data fitting (Two-step, always td2pLL, always separate 2pLL, always single 2pLL, cf. Table 1), which yielded 36⋅4 = 144 NAC simulation scenarios with nsim = 1000 repetitions for each scenario in total. Note that fitting separate 2pLL curves for each exposure duration represented the most basic concentration-response modeling approach for accounting for an exposure duration effect on the target MEL VITC concentration. 25 For each simulation scenario, the performance of a model fit and associated target concentration estimation was measured using the averaged mean absolute log2 fold change (AMAFC). For a fitted model DFN and a fixed exposure duration, ti, ÊC50(ti) was the estimated EC50 value and EC50(ti) was the true value. The AMAFC was defined as the mean of DMSO KC 20 LAB TSNthe values for |log2(ÊD50(ti)/ED50(ti))|, where the mean was taken over the exposure durations, ti, and these means were averaged over all n BOS sim CBZ CHL NIM simulation repetitions. A small AMAFC is desirable. VPA PPLNFT In summary, the resulting simulation was based on real exposure GLC duration-concentration response data of Gu et al. [1] and compared the ASP HYZ INAH proposed method regarding estimation precision in target-concentration PHB APAP ETOH estimation with several other approaches across many realistic 15 PMZ scenarios. CLON LEV BPR 3. Results 15 20 25 New td2pLL model: Deviation model fit to data 3.1. Real data application Fig. 7. The newly proposed td2pLL model for ECR data has valid assumptions To validate the assumptions of our newly proposed model, model fits based on its application to a small cytotoxicity data set, which allows for a more on real ECR data were analyzed. Validation of the assumptions of our precise target concentration estimation. The comparison of the td2pLL model new model is crucial as they are the basis for the benefits of the model, in fits with the classical separate 2pLL model fits per exposure duration for all terms of a potentially improved target concentration estimation for a compounds of Gu et al. [1] based on the deviation between the model fit and given exposure duration, by also exploiting data from other exposure the data, which is measured by the standard error (SE) of the residuals. A small durations. We therefore fitted our newly proposed model, the td2pLL SE is desirable, which may however be artificially decreased due to overfitting. model, to pre-processed ECR cytotoxicity data of Gu et al. [1]. We The points scatter around the diagonal line, which means that the td2pLL model compared the fit of the td2pLL model (approach 1) with separate 2pLL fits are comparable to those of the separate 2pLL fits. This indicates that the assumptions of the td2pLL model are valid. For compounds BUSF, MePA, RIF fits for each exposure duration (approach 2). This helped to investigate and FAM, computational problems in the separate 2pLL fitting occurred and the whether the td2pLL model has valid assumptions for application to data points are not shown (cf. Fig. A1). For those compounds, there was no cytotoxicity data, thus improving target concentration estimations. concentration-dependent decline in viability for some exposure durations. To properly compare the two approaches with respect to the validity of the td2pLL model assumptions, the statistical concept of overfitting is than approach 1, where the td2pLL was fitted only once to ECR data of crucial, as explained in the introduction. Approach 2 involved more all exposure durations and therefore used fewer parameters in total to fit parameters because for each of the three exposure durations, a new 2pLL all data of one compound. model with 2 parameters each was fitted. Also, for each fit only the data We reduced the overfitting effect in the comparison of the ap- of the respective exposure duration were used. For approach one, only 4 proaches by calculating the standard error (SE) for both approaches for parameters were fitted. Hence, approach 2 was more prone to overfitting each compound. Note that this measure takes into account the numbers 8 σ(ti, xj) 0 0.003 0.01 0.032 0.1 00..331168 1 0 0.001 0.003 Separate 2pLL models: Deviation model fit to data 0.01 0.032 0.1 00..331186 1 0 0.001 0.003 0.01 0.032 0.1 00..331168 1 J. Duda et al. C o m p u t a t i o n a l T o x i c o l o gy 23 (2022) 100234 PPL Fig. 8. The newly proposed td2pLL model leads to clear improvements in model fitting for some compounds (LAB, DFN) of [1] and less optimal fits for other compounds (PPL, 90 90 GLC). The resulting concentration-response Time curves at exposure durations 1, 2 and 7 60 60 (days - legend in upper left plot) are shown 1 for separate 2pLL models (left column) and for the td2pLL model (right column). For 30 2 30 the compounds PPL and GLC, separate fits 7 might be more plausible, while for LAB and DFN the td2pLL model seems to stabilize the 0 0 fit. Only concentration-wise response means are shown. 04 01 32 1 16 6 04 1 32 60 e− .0 .00 .0 03 0. 1 .310 . 1 0 e− .0 0 .00 0.0 1 310 0.1 .31 6 1 3 0 0 0 0 3 0 0 0. 0 GLC 100 100 80 75 60 50 40 25 0 3.16 10 31.6 100 316 0 3.16 10 31.6 100 316 LAB 125 125 100 100 75 75 50 50 25 25 0 0 0 0.0032 0.01 0.0316 0.1 0.316 0 0.0032 0.01 0.0316 0.1 0.316 DFN 90 90 60 60 30 30 0 0 0 0.0158 0.05 0.158 0.5 1.58 0 0.0158 0.05 0.158 0.5 1.58 Concentration Concentration 9 Response Response Response Response J. Duda et al. C o m p u t a t i o n a l T o x i c o l o gy 23 (2022) 100234 Simulated background noise level Little (N1) Medium (N2) Large (N3) 1.00 0.75 0.50 0.25 0.00 1.00 0.75 0.50 0.25 0.00 1.00 0.75 0.50 0.25 0.00 LL L LL L L 2p LL pL tep 2p L L pL tep 2p L d p 2 L L LL p s t p . 2 le −s s t d . 2 p p le 2 −s s t d . 2 le 2 −s te ay ep ing w o ay ep ng woi y ep g o Alw S S T SAlw S T Alw a S Sin T w Method nobs 216 72 Fig. 9. The proposed methods, Two-step and always td2pLL, outperformed other methods regarding EC50 estimation across different simulation scenarios for exposure duration-concentration response cytotoxicity data. The three vertical panels indicate increased simulated background noise levels. The three horizontal panels correspond to an increasing effect of the exposure duration on the EC50. For M0 (upper horizontal panel), the EC50 is the same across all exposure durations. For M1 (middle horizontal panel), the EC50 decreases moderately for increasing exposure durations, and for M2 (lower horizontal panel) it decreases a lot for increasing exposure durations. Two-step is the proposed two-step pipeline with pre-test for exposure duration-dependency. Small values of a deviation between the estimated and the true EC50 are desirable. of parameters by dividing by the degrees of freedom, i.e. the number of how many exposure durations are investigated. Therefore the td2pLL observations minus the number of parameters. The residual SE measures model would not overfit the data. In summary, the assumptions of the how much the model fit deviates from the data and penalizes models new td2pLL model and thus its potential benefit in target concentration with more parameters. Without this penalty, approach 2 would have estimation precision are generally supported by the analysis on the small always fitted more closely to the data than approach 1 because of ECR data set of [1]. overfitting, even if the assumptions of the td2pLL model are true. If the For qualitative demonstration purposes, we have focused on four assumptions of the td2pLL model are true or close to the truth, we expect example compounds (PPL, DLC, LAB, DFN) where the td2pLL model the SE of the two approaches to be similar. The similarity of the SEs of appears to be either more or less adequate than separate 2pLL fits the two approaches across the compounds can be seen in Fig. 7, in which (Fig. 8). For PPL and GLC, separate 2pLL fits appear more appropriate. the SEs of the approaches are close to the diagonal line. If the assump- Especially for PPL, the assumption of a fixed steepness parameter, h, tions of the td2pLL model were markedly incorrect, the SE of approach 1 across all exposure durations seems questionable. For LAB and DFN, would be systematically larger and the points would systematically tend however, the td2pLL model fit seems to be more robust (less overfitting): to lie below the diagonal line. For LAB, the data at the exposure duration of 7 days were very noisy, The data are not optimal to validate the td2pLL model, as there are which made a separate fit for these data difficult. By fitting a td2pLL only 3 different exposure durations and often too low maximal con- model, the concentration-response curve fit at t = 7 was stabilized centrations were used (Fig. A1). through the (less noisy) concentration-response curves at the other For experiments with more exposure durations investigated, exposure durations. For DFN, the concentration-response curves for approach 2 would increasingly overfit the data, as with every added exposure durations of 2 and 7 days were similar. A td2pLL model fit exposure duration, a new 2pLL model is fitted. By contrast, for the practically pooled and therefore stabilized the curves of these two td2pLL model, the number of parameters remains the same, regardless of exposure durations by fitting a large γ, i.e. a strong initial change of the 10 Estimated EC50 to true EC50 (Absolute log2 deviation)Simulated effect of exposure duration on EC50 None (M0) Small (M1) Large (M2) J. Duda et al. C o m p u t a t i o n a l T o x i c o l o gy 23 (2022) 100234 EC50(t) and almost no change in the EC50(t) for larger exposure improved assessment of concentration and exposure-duration depen- durations. dent cytotoxicity. For cytotoxicity experiments where concentration- In summary, the assumptions of the newly proposed td2pLL model response data are available for different exposure durations, the pro- for ECR data are supported by the data set of Gu et al. [1] as far as the posed ECR model can improve the precision of target-concentration small size of the data set allows any validations. A possible modification estimation, such as the EC50, compared to fitting separate of the td2pLL model is a varying steepness of the concentration-response concentration-response curves for each exposure duration. This is curves across exposure durations. However, this would require more possible because the model implicitly uses the concentration-response parameters, which can easily lead to overfitting or computational issues. data from all exposure durations to calculate the target concentration In general, the application on the real cytotoxicity data of Gu et al. [1] at an exposure duration of interest. In addition, joint modeling allows supports that the newly proposed td2pLL model might lead to more the extrapolation of cytotoxicity responses to exposure duration- precise target concentration estimation for ECR data at an exposure concentration combinations that were not conducted in the experi- duration of interest, by exploiting available concentration-response data ment. This is a main advantage over separately fitting concentration- of other exposure durations. response curves at each exposure duration, where cytotoxicity estima- tion is restricted to the exposure durations used in the experiments. 3.2. Results of the simulation study The potential benefit of the proposed ECR model over separate concentration-response curve fitting depends on the validity of the In this section we present the results of the simulation study model assumptions. We presented arguments both on a theoretical and described in Section 2.5. In particular, the precision of the target con- practical level, that support these model assumptions. From a theoretical centration (EC50) estimation for the proposed two-step pipeline that uses perspective, the model assumptions are mechanistically motivated as it the ECR model, td2pLL, was compared with other approaches for esti- extends the well-established, one-dimensional Hill model to a two- mating the EC50 from ECR data. The main finding was that the two-step dimensional ECR model using Haber’s law. For a practical investiga- pipeline, as well as always fitting the new td2pLL model, outperformed tion, we checked the suitability of the model using a real cytotoxicity other approaches that ignore or overfit the potentially different EC50 data set of 30 compounds, with measurements for multiple exposure values at different exposure durations. Fig. 9 summarizes the simulation durations and concentrations for each compound. This practical appli- results, comparing in each subgraph the four methods Two-step (new cation generally supported the model assumptions, as far as the limited two-step pipeline described in Section 2.3), Sep. 2pLL (separate 2pLL size of the data set allowed strong conclusions. The td2pLL model models for each exposure duration), Single 2pLL (Single 2pLL model seemed to be beneficial for the data for some compounds, while for independent of exposure duration) and td2pLL (new ECR model). The others, a more flexible model might be more suitable. subgraphs correspond to increasing noise (from left to right) and A detailed look suggested a possible extension of the td2pLL model, increasing exposure duration effect on the EC50 dependency (no, weak, namely a dependency on the (currently constant) slope parameter on the strong, from top to bottom). exposure duration. However, such model extensions might cause prob- In general, the Two-step pipeline performed very well in all sce- lems as a more complex model is more difficult to fit, especially for narios, with comparatively small deviations between estimated and true typically small cytotoxicity data sets. Due to the small sample sizes and EC50 values. The method decides in a first step, if the EC50 changes number of different exposure durations of the analyzed data, further across exposure durations using a nested ANOVA approach. If so, a ECR cytotoxicity data are required to assess the general benefit of the td2pLL model is fitted that accounts for such an effect. If not, the effect of model for target-concentration estimation or if model modifications are exposure duration is ignored and a single 2pLL curve is fitted for all required. exposure durations. In general, exposure duration-related questions regarding the toxi- On the one hand, ignoring the exposure duration (Single 2pLL) was cological mechanism of a compound can be analyzed with the td2pLL clearly worse, when an exposure duration-effect was present (in models model, such as: What is the hypothetical EC50 of an infinitely long M1 and M2, middle and bottom row subgraphs). Fitting separate 2pLL exposure duration? When does expanding the exposure duration have curves (Sep. 2pLL) lead to larger performance losses especially in the no further influence on the toxic effect? Due to its mechanistic moti- case without an exposure duration effect (M0, top subgraphs). Ignoring vation, the model parameters can be easily interpreted to answer these the exposure duration and fitting a single concentration-response curve questions. In fact, one model parameter is the hypothetical limit EC50 for would better exploit the data structure. The performance loss was due to an infinitely long exposure duration. the much smaller effective sample size in the separate fits (Sep. 2pLL). We also proposed a two-step pipeline that first performs a statistical Furthermore, increasing the sample size (nobs) naturally improved test to decide whether a td2pLL model fit might be beneficial, or if it the performance (lower values for estimated to true EC50 deviance are suffices to ignore the exposure duration and fit a regular 2pLL observed for boxplots with red color compared to blue color). However, concentration-response curve. The two-step procedure and always the beneficial effect diminished if there was only little noise (N1) or if the fitting a td2pLL model achieved higher precision in target concentration method was based on severely incorrect assumptions (using Single 2pLL estimation than separately fitting regular 2pLL curves per exposure for scenario M2 with strong exposure duration effect). duration or just fitting a single 2pLL curve for all exposure durations, as Lastly, always fitting a td2pLL model (without ANOVA pre-step) demonstrated in a simulation study based on the data of Gu et al. [1]. We performed almost as well as the two-step pipeline that includes the recommend using the new two-step pipeline as it was successful in pre-test, with only little performance loss when there was no exposure deciding if a td2pLL model or just a simple 2pLL model should be fitted. duration effect (M0). Since using the two-step pipeline instead of always For practitioners, easy applicability of the proposed model and the two- fitting the tp2pLL model can lead to a simpler model with fewer pa- step pipeline was achieved by providing the new R-package td2pLL, rameters (Single 2pLL), this is preferred over the (more complicated) which is available at https://github.com/jcduda/td2pLL/. td2pLL model. This means that in some cases, the ANOVA-based two- In summary, the td2pLL model and the two-step pipeline are prom- step pipeline avoids the fit of an unnecessarily complex model. In ising approaches to increase precision in target concentration estimation summary, the simulation study clearly promoted the use of the proposed for cytotoxicity experiments in which different exposure durations are two-step pipeline for target concentration estimation in ECR data. tested. The proposed methods outperformed the typical approach of separately fitting concentration-response curves for each exposure 4. Discussion duration, as they can incorporate the data of all exposure durations for a single fit, which additionally allows cytotoxicity response estimations at In this work we proposed a new ECR model, named td2pLL, for an concentrations and exposure durations not conducted in the experiment. 11 J. Duda et al. C o m p u t a t i o n a l T o x i c o l o gy 23 (2022) 100234 APAP ASP BOS BPR BUSF 200 120 1 150 140 150 2 150 100 120 7 100 80 100100 100 60 80 50 50 40 6050 20 40 0 0 0 200 0 0.1 1 10 0 0.1 1 10 0 0.01 1 0 0.001 0.1 1 0 0.01 0.1 1 Concentration [mg] Concentration [mg] Concentration [mg] Concentration [mg] Concentration [mg] CBZ CHL CLON DFN DMSO 150 120 200 100 150 150 100 10080 100 60 100 50 40 50 50 50 20 0 0 0 0 0 0 0.01 0.1 1 0 0.001 0.1 1 0 0.1 10 0 0.01 0.1 1 0 10 100 1000 Concentration [mg] Concentration [mg] Concentration [mg] Concentration [mg] Concentration [mg] ETOH FAM GLC HYZ INAH 150 150 120 140 120 100 100 100 100 100 80 80 60 50 60 50 40 50 40 20 20 0 0 0 0 0 1 10 100 1000 0 0.1 1 0 1 10 100 0 0.01 1 0 1 10 100 Concentration [mg] Concentration [mg] Concentration [mg] Concentration [mg] Concentration [mg] KC LAB LEV MEL MePA 150 150 150 150 100 100100 100 100 50 50 50 50 50 0 0 0 0 0 0.001 0.01 0.1 0 0.001 0.01 0.1 0 0.1 1 10 0 0.1 1 0 0.001 0.01 0.1 Concentration [mg] Concentration [mg] Concentration [mg] Concentration [mg] Concentration [mg] NAC NFT NIM PHB PMZ 200 150 250 150 150 200 150 100 100 100 150 100 100 50 5050 50 50 0 0 0 0 0 0 0.1 1 10 0 0.01 1 0 0.001 0.1 1 0 0.01 0.1 1 0 0.01 1 Concentration [mg] Concentration [mg] Concentration [mg] Concentration [mg] Concentration [mg] PPL RIF TSN VITC VPA 140 200 150 120 120 100 150 100 150 100 80 100 80 60 10060 40 5050 40 50 20 20 0 0 0 0 0 0 0.01 1 0 0.01 0.1 0 0.001 0.01 0.1 0 0.01 1 10 0 0.1 1 10 Concentration [mg] Concentration [mg] Concentration [mg] Concentration [mg] Concentration [mg] Fig. A1. Overview of separate 2pLL model fits by exposure duration (1, 2 or 7 days - legend in upper left plot) for all compounds of Gu et al. [1]. For compounds BUSF, FAM, MePA and RIF, computational problems occured as due to a high level of noise or a lack of a detectable concentration-dependent trend in viability, a flat concentration-response curve was fitted for some exposure durations. Only the concentration-wise response means are shown. Declarations Toxicology” (RTG 2624, Project P2) funded by the Deutsche For- schungsgemeinschaft (DFG, German Research Foundation - Project Funding Number 427806116). This work has been supported (in part) by the Research Training Group ”Biostatistical Methods for High-Dimensional Data in 12 Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] Viability [%] J. Duda et al. C o m p u t a t i o n a l T o x i c o l o gy 23 (2022) 100234 Data availability References The data presented in this study are openly available in Archives of [1] X. Gu, W. Albrecht, K. Edlund, F. Kappenberg, J. Rahnenführer, M. Leist, Toxicology at 10.1007/s00204-018–2302-0. W. Moritz, P. Godoy, C. Cadenas, R. Marchan, Relevance of the incubation period in cytotoxicity testing with primary human hepatocytes, Arch. Toxicol. 92 (12) (2018) 3505–3515. Supplementary information [2] M.D. Arbo, S. Melega, R. Stöber, M. Schug, E. Rempel, J. Rahnenführer, P. Godoy, R. Reif, C. Cadenas, M. de Lourdes Bastos, Hepatotoxicity of piperazine designer drugs: up-regulation of key enzymes of cholesterol and lipid biosynthesis, Arch. To reproduce all figures and the simulation results, code is available Toxicol. 90 (12) (2016) 3045–3060. at https://github.com/jcduda/td2pLL-code. [3] ICCVAM: Guidance document on using in vitro data to estimate in vivo starting doses for acute toxicity. NIH Publication NO 01-4500 (2001). Available here [accessed 29 September 2021]. CRediT authorship contribution statement [4] R. Riddell, D. Panacer, S. Wilde, R. Clothier, M. Balls, The importance of exposure period and cell type in in vitro cytotoxicity tests, Altern. Lab. Anim. 14 (2) (1986) 86–92. Julia Duda: Conceptualization, Methodology, Software, Validation, [5] NICEATM: In vitro cytotoxicity test methods for estimating acute oral systemic Formal analysis, Investigation, Data curation, Writing - original draft, toxicity: Background review document. NIH Publication No. 07-4518 (2006). Available here [accessed 29 September 2021]. Visualization. Jan G. Hengstler: Validation, Resources, Writing - re- [6] R Core Team: R: A Language and Environment for Statistical Computing. R view & editing, Supervision, Funding acquisition. Jörg Rahnenführer: Foundation for Statistical Computing, Vienna, Austria (2021). R Foundation for Conceptualization, Validation, Writing - review & editing, Supervision, Statistical Computing.https://www.R-project.org/. Project administration, Funding acquisition. [7] C. Ritz, Toward a unified approach to dose-response modeling in ecotoxicology, Environ. Toxicol. Chem. 29 (1) (2010) 220–229. [8] C. Ritz, F. Baty, J.C. Streibig, D. Gerhard, Dose-response analysis using r, PloS One 10 (12) (2015) 0146021. Declaration of Competing Interest [9] F. Kappenberg, M. Grinberg, X. Jiang, A. Kopp-Schneider, J.G. Hengstler, J. Rahnenführer, Comparison of observation-based and model-based identification of alert concentrations from concentration–expression data, Bioinformatics (2021). The authors declare that they have no known competing financial [10] F. Haber, Zur Geschichte des Gaskrieges, in: Fünf Vorträge aus den Jahren interests or personal relationships that could have appeared to influence 1920–1923, Springer, Berlin, Heidelberg, 1924, pp. 76–92. [11] F.J. Miller, P.M. Schlosser, D.B. Janszen, Haber’s rule: a special case in a family of the work reported in this paper. curves relating concentration and duration of exposure to a fixed level of response for a given endpoint, Toxicology 149 (1) (2000) 21–34. [12] W.W. Focke, I. Van der Westhuizen, N. Musee, M.T. Loots, Kinetic interpretation of Acknowledgments log-logistic dose-time response curves, Sci. Rep. 7 (1) (2017) 1–11. [13] A. Schüttler, R. Altenburger, M. Ammar, M. Bader-Blukott, G. Jakobs, J. Knapp, The authors gratefully acknowledge the computing time provided on J. Krüger, K. Reiche, G.-M. Wu, W. Busch, Map and model—moving from observation to prediction in toxicogenomics, GigaScience 8 (6) (2019) 057. the Linux HPC cluster at Technical University Dortmund (LiDO3), [14] A. Serra, M. Fratello, G. Del Giudice, L.A. Saarimäki, M. Paci, A. Federico, D. Greco, partially funded in the course of the Large-Scale Equipment Initiative by Tindermix: Time-dose integrated modelling of toxicogenomics data, GigaScience 9 the German Research Foundation (DFG) as project 271512359. (5) (2020) 055. [15] F. Kappenberg, T. Brecklinghaus, W. Albrecht, J. Blum, C. van der Wurp, M. Leist, J.G. Hengstler, J. Rahnenführer, Handling deviating control values in Appendix A. Figures concentration-response curves, Arch. Toxicol. 94 (11) (2020) 3787–3798. [16] T.P. Morris, I.R. White, M.J. Crowther, Using simulation studies to evaluate statistical methods, Stat. Med. 38 (11) (2019) 2074–2102. 13 Article 3 www.nature.com/scientificreports OPEN Benefit of using interaction effects for the analysis of high‑dimensional time‑response or dose‑response data for two‑group comparisons Julia C. Duda *, Carolin Drenda , Hue Kästel , Jörg Rahnenführer & Franziska Kappenberg High throughput RNA sequencing experiments are widely conducted and analyzed to identify differentially expressed genes (DEGs). The statistical models calculated for this task are often not clear to practitioners, and analyses may not be optimally tailored to the research hypothesis. Often, interaction effects (IEs) are the mathematical equivalent of the biological research question but are not considered for different reasons. We fill this gap by explaining and presenting the potential benefit of IEs in the search for DEGs using RNA‑Seq data of mice that receive different diets for different time periods. Using an IE model leads to a smaller, but likely more biologically informative set of DEGs compared to a common approach that avoids the calculation of IEs. With the rapid developments in next-generation sequencing (NGS) technology in the last decades, analyses of gene expression data have become regular in many l aboratories1. A common goal is to identify differentially expressed genes (DEGs) that are responsible for the observable differences between, e.g., groups of individuals with different treatments or genotypes. Many software applications became available to optimally extract infor- mation from the large amounts of experimental data2. The mathematics behind these algorithms and models is often complicated, which can lead to suboptimal data analysis from practitioners and bioinformaticians. The interaction effect (IE) between two or more factors of interest is a methodological aspect that is often not con- sidered in analyses where it could be beneficial. IEs are well-known in statistical modeling but are often not used in practice. Properly including and interpreting an IE in gene expression data analyses can be challenging, and the possibility of using an IE is often overlooked. An obvious reason for not using IEs in DEGs analyses might be the complexity of the statistical models and their correct computational implementation. In the literature, there are many application examples similar to the one we will use throughout the manu- script, where an IE was likely beneficial to find interesting DEGs, but not considered. For example,3 dealt with time-restricted feeding of mice to test whether it could prevent obesity. They used D ESeq24 and the design included several factors such as genotype, feeding group, and time. In this setting, combining different variables to explore the interaction between e.g. time and genotype could have led to other, potentially more interesting DEGs. In another e xample5 used four separate study groups to analyze the differences in heart failure in mice. They either received a standardized chow or a high-fat diet for 12 weeks, and either additionally received angio- tensin II after 8 weeks or not. Here as well, analyzing the excluded interaction between diet and hormones could lead to additional interesting insights. Examples with an IE included in the DEG analysis were provided b y6,7. Sloley et al.6 studied the exposure to high-frequency head impacts in mice. They use the DESeq2 package and their design contains an IE of the two factors treatment and injury. Similar methods are used i n7, in which mice were treated with acarbose at three independent study sites. Their model contains the variables treatment, sex, and the interaction between them. In this work, we explain the use, interpretation, and potential benefit of using IEs in gene expression analysis to identify DEGs. The article equips practitioners with a less profound statistical background with the knowl- edge to decide if the use of an IE helps answer their research question. We therefore aim at keeping the level of mathematical complexity low, to reach a wider range of potential users. Mathematical details can be found in8,9. We illustrate, explain, and compare DEG analyses with and without IE using a gene expression data set f rom10, where mice were fed either an unhealthy or a healthy diet for 3 to 48 weeks. Department of Statistics, TU Dortmund University, Vogelpothsweg 87, 44227 Dortmund, Germany. *email: duda@ statistik.tu-dortmund.de Scientific Reports | (2023) 13:20804 | https://doi.org/10.1038/s41598-023-47057-0 1 Vol.:(0123456789) www.nature.com/scientificreports/ The article is structured as follows. We first explain the IE from different perspectives. Then we conceptually compare the use of an IE with the common approach that avoids modeling of interaction w.r.t. the resulting DEGs. The two methods are applied to the data set at hand and the differences in the results are discussed and explained in detail. Material and methods Data The data set was first presented b y10, where mice were fed with two different diets over the course of 48 weeks. One diet was the high-fat or ‘Western’ diet (WD) and the control was a standard diet (SD). The nine analysis time points within the 48 weeks were week 3, 6, 12, 18, 24, 30, 36, 42 and 48. In total 79 samples (mice) were used. The gene expression data from 35,727 genes were measured using RNA-seq. After removing the weeks with no data from mice in one of the two groups, 64 samples from the weeks 3, 6, 30, 36, 42, and 48 were left. To focus on the explanatory aim, analyses were mostly restricted to the data of weeks 3 and 6. The sample sizes in the remaining weeks are 7, 5, 5, 7, 3, 5 for SD and 5, 5, 5, 5, 4, 8 for WD. Further pre-processing is explained in “Implementation”. Interaction effects explained When two or more factors are of interest in an experiment, one should consider including IEs in the statistical model. Only using additive or main effects may not result in adequate modeling of the data. In Fig. 1, different effect scenarios are visualized using interaction plots for the case of two factors of interest, e.g. some group (0 = blue, 1 = red) and a compound with low and high concentration. In Fig. 1a, there is no interaction between the group and the concentration: The increase of the response from the low to the high concentration is the same for group 0 and group 1. At the same time, for a fixed concentration, the difference in the responses between group 0 and group 1 is the same. One can describe the absence of an IE graphically, biologically, and mathematically. • Graphically, an additive effect or the lack of an IE results in parallel lines between the two groups. • Biologically, the effect of the concentration does not interact with the effect of the group, because it is always the same increase in response from low to high concentration, regardless of the group. • Mathematically, considering two factors with two levels each, a classical linear model, or equivalently an ANOVA model, with only additive effects for the two factors and normal noise is appropriate to model the data. This formalizes to yj = µ+ α · gj + β · cj + εj (1) where j indicates the sample, gj indicates if the jth sample is in group 0 ( gj = 0 ) or in group 1 ( gj = 1 ), and cj indicates if the j-th sample is exposed to the low concentration ( cj = 0 ) or the high concentration (c j = 1). The mean difference in the responses for group 1 compared to group 0 is α and for increasing the concen- tration from low to high, the mean difference is β. For example, if the j-th sample is in group 0 ( gj = 0 ) and exposed to the low concentration ( cj = 0 ), the expected response is µ+ 0 · α + 0 · β = µ . The intercept µ represents the mean response in the reference group (0) with the reference concentration (low). Figure 1. Schematic depiction of data scenarios without and with IE. (a) Group 0 (blue) and 1 (red) both have a positive effect for treatment high compared to low and a positive group effect, but no IE. (b) As in (a), but with an additional positive IE. (c) Negative IE between group and treatment. (d) No treatment effect for group 0. The treatment effect for group 1 is entirely represented by the IE. (e) Both groups display a positive treatment effect and there is no group effect in the treatment category low, only in high, i.e. an IE is present. (f) Negative IE between group and treatment, but no line crossing as in (c). Scientific Reports | (2023) 13:20804 | https://doi.org/10.1038/s41598-023-47057-0 2 Vol:.(1234567890) www.nature.com/scientificreports/ The contrary case, the presence of a clear IE with a changed direction for the concentration effect, is depicted in Figure 1c. The crossing lines mean that the effect of a concentration increase is not additive (it is not the same within both groups). Instead, the concentration effect depends on the group, i.e. there is an interaction with the group effect. For group 0, an increase in the concentration leads to an increase in the response, whereas for group 1, an increase in the concentration leads to a decrease in the response. The additive model (1) can not capture this interaction as the model fit would force parallel lines into the effect plot. Mathematically, a model that accounts for the interaction between group and treatment is, therefore, an extension of the model in Eq. (1) by adding the IE γ: yj = µ+ α · gj + β · cj + γ · gj · cj + εj . (2) If the j-th sample is exposed to the higher concentration ( cj = 1 ) and is in group 1 ( gj = 1 ), then the mean response is µ+ α + β + γ . The interaction term γ · gj · cj allows the lines in the interaction plot to be non- parallel. It is important to note that an IE does not necessarily visualize as a crossing of lines in an interaction plot, but simply non-parallel lines, such as in the examples shown in Fig. 1b, d, e, and 1f. We elucidate the use of IEs when analyzing real data in the context of biological research questions in “When do interaction effects capture the research question?”. Interaction effects calculated with DESeq2 In this section, we explain the mathematical background of gene expression modeling with the popular R-package DESeq2 4. Details on statistical concepts presented here may not be relevant to readers who are more application- oriented and can be ignored without risking comprehension of the remaining sections. However, to understand an IE in more depth, we encourage to understand the parameters in the model formula (4). Consider the count matrix K, where Kij are the count reads of gene i for sample j, i ∈ {1, ..., n} , j ∈ {1, ...,m} . To model the count data, DESeq2 uses a generalized linear model with a negative binomial distribution Kij ∼ NB(µij , τi) with mean µij and gene-specific dispersion τi. The mean of the observed counts µij = sjqij is modeled with the parameter qij , which is proportional to the expected true concentration of fragments for sample j and rescaled with a sample-specific size∑ factor sj . The parameter qij is modeled with a generalized linear model using the logarithmic link: log2(qij) = r βirxjr . In a factorial design, xjr ∈ {0, 1} indicates if the rth explanatory variable applies to sample j, such that for the ith gene, βir is the log2 FC for factor level r compared to the reference factor level. For our application example (“Data” ), the model has one factor for the diet (two values) and one factor for the week (six values). A model with the parameters for the week and diet without interaction is fitted for each gene i, 1 ≤ i ≤ 35, 727 . In the following, we suppress the gene index i and consider the sample (mouse) index j. The model used in DEseq2 is then ∑6 log2(qj) = µ+ α · dj + βr · wjr , (3) r=2 where µ (intercept) denotes the response at the reference (SD and week 3), and α is the WD (main) effect. The variable dj is binary with value 0 for the SD and value 1 for the WD. The parameters βr , r ∈ {2, ..., 6} , correspond to the week effects. The variable wjr is the indicator variable for the week, i.e. wj2 = 1 only for week 6. Now, adding an IE, the model is ∑6 ∑6 log2(qj) = µ+ α · dj + βr · wjr + γr · dj · wjr . (4) r=2 r=2 The parameter γ2 denotes the IE between the factor diet and the factor week, comparing week 6 to week 3. The parameter γ3 refers to the interaction between the diet and week, comparing week 30 to week 3, and so on. Due to the log2 transformation for the sample concentration qj , the parameters must all be interpreted accordingly. For example, an IE of γ2 = 3 means that the difference between the diet effect in week 3 and the diet effect in week 6 is 23 = 8 , or has a FC of 8. When do interaction effects capture the research question? In RNA-Seq experiments, often the case of two factors, e.g. treatment and genotype, are analyzed, and it is of interest whether the effect of the treatment differs between the genotypes (in certain genes). The research ques- tion might be formulated as: Does the genotype affect the treatment effect? IEs capture such a research question and they should therefore be considered for the analysis. In our application example, the two factors are diet and week, where diet is either a WD or a SD and week indicates the feeding duration. In this dataset measurements for different time points are available, and we focus on the two shortest durations, 3 weeks and 6 weeks, to explain the IE concept. The 3-week time point can be considered the reference level of the factor week. The research goal is to identify genes where activation/deac- tivation from weeks 3 to 6 induced by the WD is different compared to the SD. Mathematically, this research question translates into identifying genes with an IE between diet and week. Consequently, the use of a model that includes an IE should be considered. Scientific Reports | (2023) 13:20804 | https://doi.org/10.1038/s41598-023-47057-0 3 Vol.:(0123456789) www.nature.com/scientificreports/ How do interaction effects capture the research question? To explain how IEs capture the research question, we visualize the benefit of adding IEs to a linear model, using our example dataset. In Fig. 2, for the mice groups, for each combination of diet type and week, expression values and fitted means are plotted, exemplary for one selected gene. Once no IEs are included in the model (Fig. 2, left), and once IEs are included (Fig. 2, right). Without IEs, the estimated effect differences between the diets, represented by arrows, are mathematically forced to be the same across all weeks (vertical lines have the same length). Consequently, in week 3, the effect is markedly overestimated, as the arrow between SD and WD is larger than the pure difference in group means. In contrast, if an IE is used (Fig. 2, right), then the group means estimated by the model capture well that the diet effect varies across weeks. The mathematical formulas of the estimated effects represented by the arrows are explained in “Interaction effects calculated with DESeq2”. Comparison of methods for estimating interaction effects In this section, we compare the results obtained by fitting an interaction model between two factors (called Method II in the following) with a far more popular alternative, which we call Method I. The alternative approach avoids the direct modeling of an IE between two factors as follows: The data are split with respect to the second factor (e.g. week) into two groups G0 and G1 . Then for group G0 and G1 separately, a model comparing the groups with respect to the first factor (e.g. diet) is fitted. Finally, it is analyzed, if for one group, typically the reference group G0 , no significant effect is observed, and for the other group G1 , there is a significant effect present. The differences between the two approaches are illustrated and discussed on the mouse dataset, where for Method I the groups G0 and G1 are defined by week 3 (as reference) and week 6 (or larger week numbers, respec- tively). The models per week contain only one factor (diet) with two levels, SD and WD. Since separate models are fitted per week, the model-wise diet effect is allowed to vary across weeks. When interpreting the results of the differential expression analysis, a consideration of both statistical sig- nificance and biological relevance is necessary: A p-value smaller than the significance level, which constitutes a statistically significant result, does not necessarily mean that the mean effect level, given here by the log2-Fold Change ( log2FC), is of relevant size. On the other hand, a mean effect level larger than a pre-specified threshold, motivated by the biological context, does not always correspond to small p-values 11. Thus, to interpret a gene to be a differentially expressed gene (DEG), we always require two conditions to be fulfilled: The (FDR-adjusted) p-value is smaller than a significance level, and the log2 FC is larger than a pre-specified threshold. For the mouse dataset and the separate models (Method I), only those genes that show a diet effect (both significant and relevant) in week 6, but not in the reference week 3, are considered DEGs. The motivation is that interesting genes show no effect at the reference time point, where the diet had too little time to cause a differential effect, but later (at 6 weeks) the diet causes such a difference. For the interaction model (Method II), not two models but only a single model is fitted. To detect DEGs, one simply checks if the estimated IE is both significant and relevant. • Method I (Separate): Separately for each week: Fit a one-factor model (two-group comparison, see equation (1)). A gene is DEG if the diet effect is both significant and relevant in week 6, but not both in week 3. • Method II (Interaction): Fit a two-factor model between week and diet (including week, diet, and interaction), see equation (2). A gene is DEG if the IE is both significant and relevant. 11 11 10 10 9 9 8 8 7 7 6 6 5 5 0 5 10 15 20 25 30 35 40 45 50 0 5 10 15 20 25 30 35 40 45 50 weeks weeks count observed mean diet SD WD Figure 2. Visualization of the fitted model without IE (left) and with IE (right) for the mice dataset, for the gene identifier ENSMUSG00000069170 (Adgrv1). The arrows represent the estimated log2FCs according to Eq. (3) for the left fit, and Eq. (4) for the right fit. For both fits, µ (green arrow) is the expected mean gene expression level for the reference values three weeks and SD, and α (vertical dark grey arrows) is the estimated FC between SD and WD at each week. Further, both models include the week effects βr (blue arrows). The right model additionally includes interaction effects (yellow, orange, and red arrows) that correspond to γr in formula (4). Scientific Reports | (2023) 13:20804 | https://doi.org/10.1038/s41598-023-47057-0 4 Vol:.(1234567890) log2(normalized count) log2(normalized count) www.nature.com/scientificreports/ To visualize the differences between the decision outcomes (gene is DEG or not) of Method I and II, Fig. 3 dis- plays 7 cases using simulated data scenarios. The data are generated with constant residual variance, so that the decision is not influenced by differing variance values, but only by the estimated effect (arrow lengths). • Case 1: Within both weeks, the estimated diet effect is not relevant (dotted green effect arrow). There is hence is no DEG by Method I. Since the effects are of similar size, the IE estimated by Method II (pink arrow) is not significant, and neither Method II classifies the gene as DEG. • Case 2: In week 3, the effect is not relevant, in week 6 it is both significant and relevant. This leads to a sig- nificant IE for Method II. Therefore, both Method I and Method II classify the gene as DEG. • Case 3: The diet effect is significant in both weeks. Since it is significant in week 3, Method I does not classify the gene as DEG. However, the diet effect in the second week is much larger, such that the IE is significant, and Method II classifies the gene as DEG. • Case 4: Similar to case 3, but the effect direction of the diet effect changes: In the first week, there is a positive effect, and in the second week a negative effect. Again, only Method II classifies the gene as DEG. • Case 5: In week 3, the diet effect is just below the significance level, whereas in week 6 it is just above the significance level. Therefore, Method I labels the gene as DEG. For Method II, the IE is not significant as the diet effect does not differ much between the weeks. Method II does not label the gene as DEG. • Case 6: Similar to case 4, but the effect in week 3 is not significant. Now both methods classify the gene as DEG. • Case 7: The direction of the diet effect changes. It is positive in week 3 and negative in week 6. Within each week, the effect size is not significant, therefore Method I classifies the gene as not DEG. The overall change in the effect represented by the IE is significant. Therefore, Method II labels this gene as DEG. Implementation For all calculations, R 12, version 4.2.2, and the packages DESeq24, version 1.38.1, and topGO13, version 2.50.0, were used for determining DEGs and performing gene ontology enrichment analyses (GO EA), respectively. The entire code is shared on GitHub (https:// github. com/ jcduda/g ene_e xpre ssion_i nter action). We specify the models of Method I and II in DESeq2 using • Method I: DESeqDataSet(gse, design = ∼ diet) • Method II: DESeqDataSet(gse, design = ∼ diet + weeks + diet:weeks) In the example, the code for Method I is applied twice for separate weeks, i.e. for two different data sets ‘gse’, while the code for Method II is applied only once. Note that a model based on ∼ diet + weeks results in the same parameter values for each week, making it unsuitable for comparison with Method I and Method II, see Fig. 2. One notable preprocessing step was the filtering. Removing only genes with less than ten counts over all samples resulted in a peak of the estimated diet effect at 0.206 (Supplementary Fig. 1). However, removing genes with more than 50% of samples with 0 counts leads to reasonably estimated effects without artifactual spikes in the histogram (Supplementary Fig. 2). Further, we shrunk the estimated effects using approximate posterior estimation with the lfcShrink f unction14. Effects that are non-zero only due to noise are shrunk to zero, while large, reliable effects are not affected. Results We compare Method I (separate) and Method II (interaction) for the mouse dataset, w.r.t. classification of genes as DEG or not DEG, as described in “Comparison of methods for estimating interaction effects”. In the following list, we define the terms significant, relevant, and DEG in the context of the example study. For Method I we call a gene • significant, if false discovery rate (FDR) adjusted p-value < 0.05 (for a specific week X) • relevant, if absolute log2 FC > log2(1.5) (for a specific week X) • DEG for week X, if it is significant and relevant for week X • DEG, if it is not DEG for week 3, but DEG for week 6 For Method II we call a gene • significant, if FDR adjusted p-value < 0.05 (for the IE) • relevant, if absolute log2 FC > log2(1.5) (for the IE) • DEG, if it is significant and relevant (for the IE) For Method I, up-regulated DEGs for week X have a positive diet effect in week X. For Method II, up-regulated DEGs have a positive IE. Down-regulated DEGs are defined accordingly. Comparison of genes selected by Method I and Method II We expect a relevant number of DEGs, since a biological effect of the diet (WD vs. SD) is reported b y10. Table 1 shows the number of DEGs in week 3 and DEGs in week 6, according to Method I (simple comparison per week). There are more DEGs after 6 weeks of feeding compared to 3 weeks, both for up- or down-regulation. Scientific Reports | (2023) 13:20804 | https://doi.org/10.1038/s41598-023-47057-0 5 Vol.:(0123456789) www.nature.com/scientificreports/ For up-regulated genes, 104 genes are DEGs only for week 3, 81 genes that are DEGs in both weeks, and 1,622 genes that are DEGs only in week 6. Hence, for Method I, regarding up-regulation, one would focus on the 1622 DEGs that are only identified for week 6 and not for week 3. Table 2 presents a main finding of our study, a comparison of DEGs identified with Method I and Method II. One can see that Method I (separate) identifies more DEGs than Method II (interaction). However, the DEGs identified by Method II are not all contained in the DEGs identified by Method I. There are almost 200 genes only identified by Method II, both for up-regulation and for down-regulation. Characterization of genes that are DEG only for Method I or only for Method II To understand the benefits of the two methods, we characterize the genes that are only identified by one of the two approaches, respectively. After a mathematical characterization, we also investigate biological differences. An insightful example is gene Sirt7 in Fig. 5, which is a typical case for being DEG by Method II, but not by Method I. From week 3 to week 6, there is an interaction between the factor week and diet (crossing of grey lines). The IE (large yellow arrow) is significant and relevant, making this gene DEG for Method II. However, for Method I the log2 FC of the diet effect in week 6 is not large enough to pass the threshold of log2(1.5) . Hence, Sirt7 is not identified as DEG by Method I, even though an important underlying diet effect dependent on the time seems reasonable. Such genes are overlooked by the popular Method I. To better understand the differences between the two approaches, Fig. 6 shows regions of genes classified as DEG by both, none, or only one of the two methods, dependent on the main effect (diet) and the IE, as obtained by the interaction model (2) used by Method II. Each dot represents a single gene. If there is no interaction (cf. Fig. 1a), the estimated IE is (close to) 0, such that the x- and y-value are identical and the gene is on the diagonal. For better illustration, the estimated effects are not shrunk and the decision rule depends on the log2 FC threshold only. In practice, log2 FC estimates should be subject to shrinkage and the classification into a DEG depends on both, log2 FC and adjusted p-value (Sup- plementary Fig. 3 in the Appendix). The genes can be divided into four groups according to the DEG classification of Method I and Method II. The numbers 1–7 assigned to regions match the simulated cases in Fig. 3 and a real gene expression pattern of a representative gene shown in Fig. 4. In the following, the gene expression patterns corresponding to the colored regions in Fig. 6 are explained. • Orange: not DEG for both methods. Genes closer to the diagonal than log2(1.5), such that the IE is below this threshold and the gene is not DEG for Method II. Further, genes with absolute main effect above log2(1.5) are DEG for week 3 and thus not DEG for Method I. • Green: DEG only for Method I. Genes with absolute main effect and IE less than log2(1.5), but overall effect in week 6 greater than log2(1.5). These genes are not DEG in week 3 by being slightly below the threshold but are DEG in week 6 by being slightly above the threshold. Hence, they are DEG for Method I, but the IE is small and the gene is not DEG for Method II. • Purple: DEG for Method I and II. Genes with an estimated main effect (for week 3) below the log2 FC bounda- ries, but the sum of main and IE (diet effect for week 6) is outside these boundaries. Hence, these genes are DEG for Method I. For Method II, they are DEG since the IE is large enough (points far from the diagonal line). • Blue: DEG only for Method II. Genes that are not DEG for Method I since they are either DEG in week 3 (main effect outside ± log2(1.5)) or have a main effect inside ± log2(1.5) (as gene 7) but are not DEG in week 6, since the corresponding effect (main plus IE) is also within ± log2(1.5)). We further looked at differences concerning the biological conclusions of the found DEGs. First, a qualitative, small literature research on the top 10 (lowest adj. p-value) upregulated DEGs found only by Method I or only by Method II, respectively, suggests that both methods find genes that are reasonably associated with liver disease induced by a fatty diet (Table 4; Supplementary Table 1). On a broader scale, a GO EA was performed on the DEGs found by Method I, Method II, and the combination of both DEG sets (Table 3; Supplementary Table 2). Despite the smaller number of DEGs identified by Method II, the biological interpretation based on the processes identified by GO EA is very similar and plausibly covers immune activation related to fatty liver disease. This suggests that the DEGs found by Method II are more specific in the sense that they include fewer non-relevant genes while yielding similar GO EA results. Discussion Using an IE model with 2 factors (Method II) instead of two separate models with one factor each (Method I) clearly changes the set of DEGs found in a gene expression analysis. The set of DEGs found with Method II is usually smaller. A theoretical reason for this is that statistical inference that aims at detecting IEs is less powerful in the sense that the sample size must be four times larger to have the same power for detecting an IE than to detect a main effect15,16, p. 100f. Further, a gene that just passed the thresholds for being DEG for the reference group, but just not for the other group, is DEG for Method I but usually not for Method II, and it is not a good candidate for a biologically meaningful statement. The resulting DEGs for Method II are smaller in number, but lead to equally reasonable biological findings based on enrichment analyses. A limitation of Method II is that a single model with two main factors and an IE can be more difficult to interpret correctly than two models with one factor each and no IE. Quantifying if the smaller set of DEGs found by Method II contains less irrelevant genes is difficult for several Scientific Reports | (2023) 13:20804 | https://doi.org/10.1038/s41598-023-47057-0 6 Vol:.(1234567890) www.nature.com/scientificreports/ Figure 3. Visualization of seven example scenarios with different main effects and IEs, leading to different decisions for Method I (left column) and Method II (right column). Dots represent data points (blue: SD, red: WD; left: 3 weeks, right: 6 weeks), arrows represent effects (black: reference mean, green: main effect of diet, purple: IE). Dotted arrows indicate non-relevance (absolute effect size below threshold), solid arrows represent relevant effects. Dotted arrows are only shown for the main effects of IEs. The label ’DEG’ below a scenario indicates if the respective method classifies a gene as DEG (green) or not DEG (red). Scientific Reports | (2023) 13:20804 | https://doi.org/10.1038/s41598-023-47057-0 7 Vol.:(0123456789) www.nature.com/scientificreports/ Figure 4. Example genes that are, according to DEG decision cases 1–7, not always classified in the same way by Method I (left) and II (right). Note that the original data are the same per gene (row), but due to the differences between Method I and II, background normalizations yield slightly different data for each gene. For normalization, DESeq estimates the library sizes as the median of the ratios of observed counts9. See caption of Figure 3 for an explanation of the arrows. Scientific Reports | (2023) 13:20804 | https://doi.org/10.1038/s41598-023-47057-0 8 Vol:.(1234567890) www.nature.com/scientificreports/ Week 3 only Overlap Week 6 only Up 104 81 1,622 Down 81 93 726 Table 1. Overview of DEGs for Method I, comparison of SD and WD. Method I only Overlap Method II only Up-regulated 914 695 167 Down-regulated 540 177 186 Table 2. Comparison of DEGs identified with Method I and Method II. Note that 914 + 695 = 1609 does not equal 1622 in Table 1, because here we do not include genes that are downregulated in week 3, as otherwise they would not be DEG by Method I. 11.0 10.5 10.0 9.5 9.0 0 5 10 15 20 25 30 35 40 45 50 weeks count observed mean diet SD WD Figure 5. Expression pattern for the gene Sirt7, which is for the comparison week 3 vs. week 6 DEG for Method II (interaction), but not by Method I (separate), since the effect size is too low for week 6. See caption of Fig. 2 for detailed explanation of the arrows. reasons. First, a literature search to determine if a gene is not reported within the context of liver disease is fruit- less. Due to false positive results and extensive research in this area, almost any gene can be found as associated. Second, the data set at hand does not have a clean reference, because mice were already fed with HFD for three weeks in the reference group, instead of being fed for zero weeks. However, within the limits of this study, the conceptual reasoning and analyses of GO enrichment analyses suggest that gene sets identified by Method II are smaller but likely contain fewer irrelevant genes. Conclusion An IE might often be an adequate translation of a biological research question into a statistical concept. However, this relationship might remain unnoticed due to a lack of expertise or reluctance to deviate from routines. In this work, we offer an extensive explanation of IEs and why they might be scientifically relevant in the context of detecting differentially expressed genes (DEGs) in gene expression analysis. We compare the IE-based approach (Method II) with a popular alternative approach (Method I) that avoids the calculation of IEs. While Method I detects more DEGs, many of them might not be scientifically relevant, whereas the smaller set of DEGs found with Method II can be interpreted as more specific by having fewer irrelevant genes. We encourage researchers to clarify for each project if an IE is the accurate mathematical representation of the formulated research question and to use this concept when appropriate. Further, if the research goal is to identify a smaller gene set containing less irrelevant genes (less false positives), we encourage Scientific Reports | (2023) 13:20804 | https://doi.org/10.1038/s41598-023-47057-0 9 Vol.:(0123456789) log2(normalized count) www.nature.com/scientificreports/ Figure 6. Characterization of regions of genes that are identified as DEG only by Method I or by Method II, or by both or none of the methods. The x-axis shows the estimated main effect (diet), i.e. the estimated log2 FC from a SD to WD in the reference week 3, and on the y-axis the sum of this main effect and the IE, i.e. the overall effect between the two diets in week 6 in the interaction model, is plotted. Method I Method II Method I or II 1 Immune system process ( −292.44× 10 ) Immune system process (3 −28.33× 10 ) Immune system process ( 2.27× −2910 ) 2 Immune response ( −292.44× 10 ) Immune response (3 × −28.33 10 ) Immune response ( × −292.27 10 ) 3 Defense response ( −292.44× 10 ) Cell activation (3 −28.33× 10 ) Defense response ( 2.27× −2910 ) 4 Pos. reg. of immune system process ( × −292.44 10 ) Response to external stimulus (5 × −2810 ) Regulation of immune system process ( × −292.27 10 ) 5 Regulation of immune system process ( × −292.44 10 ) Defense response ( × −286 10 ) Pos. reg. of immune system process ( × −292.27 10 ) 6 Response to other organism ( × −292.44 10 ) Response to stimulus ( −271 .65× 10 ) Response to external stimulus ( × −292.27 10 ) 7 Response to external biotic stimulus ( −292.44× 10 ) Leukocyte activation ( × −27) Response to biotic stimulus ( × −292.57 10 2.27 10 ) 8 Response to biotic stimulus ( × −292.44 10 ) Regulation of immune system process ( −251 .2× 10 ) Response to other organism ( −292.27× 10 ) 9 Response to external stimulus ( × −292.44 10 ) Response to external biotic stimulus ( × −25 −292.27 10 ) Response to external biotic stimulus ( 2.27× 10 ) 10 Defense response to other organism ( −292.44× 10 ) Response to other organism ( −252.27× 10 ) Defense response to other organism ( −292.27× 10 ) 11 Innate immune response ( × −292.44 10 ) Response to biotic stimulus ( −25 Biol. proc. involved in interspecies interaction btw organ-2.27× 10 ) isms ( −292.27× 10 ) 12 Cell activation ( −29 −252.44× 10 ) Pos. reg. of immune system process ( 2.92× 10 ) Cell activation ( −292.27× 10 ) 13 Biol. proc. involved in interspecies interaction btw Pos. regulation of multicellular organismal process Pos. regulation of multicellular organismal process organisms ( −29 −252.44× 10 ) ( 4.31× 10 ) ( × −292.27 10 ) 14 Inflammatory response ( −29 Biol. proc. involved in interspecies interaction btw −292.44× 10 ) organisms ( −25) Inflammatory response ( 2.27× 10 )8.57× 10 15 Pos. reg. of response to external biotic stimulus ( −29) Pos. reg. of response to stimulus ( −22 2.93× 10 ) Innate immune response ( × −292.27 10 ) 2.44× 10 Table 3. Top 15 most significant GO groups found based on upregulated DEGs by Method I, Method II and combining the genes found by Method I and Method II. FDR-adjusted p-values are in parentheses. to use Method II. However, if the research goal is rather exploratory and more false positives are acceptable, we suggest to use Method I. Data availability The analyzed data sets are publicly available at the SRA database with reference number PRJNA 953810. Code availability The code is available on GitHub (https:// github.c om/ jcduda/g ene_e xpre ssion_ intera ction). Appendix See Table 4. Scientific Reports | (2023) 13:20804 | https://doi.org/10.1038/s41598-023-47057-0 10 Vol:.(1234567890) www.nature.com/scientificreports/ DEG only in Gene Log2FC FDR-adj. p Literature Acnat2 (ENSMUSG00000060317) 2.02 Considered a candidate for specific metabolic processes within the Type I acyl-CoA thioesterase/< 0.01 acyltransferase gene family17 Tlr12 (ENSMUSG00000062545) 1.45 < 0.01 Signaling in chronic liver diseases via complex immune responses mediating h epatocyte18 Fgf21 (ENSMUSG00000030827) 2.39 < 0.01 Associated with development and progression of NAFLD19 Tgfbi (ENSMUSG00000035493) 0.79 Overexpression in mice resulted in an increased incidence of spontaneous tumors and N,N-diethyl-< 0.01 nitrosamine (DEN)-induced liver tumor n odules20 Method I Ehhadh (ENSMUSG00000022853) 0.90 < 0.01 Associated with development of fatty liver disease in dairy cows21 Hpgds (ENSMUSG00000029919) 2.41 < 0.01 Overexpression associated with adipogenesis and increased insulin sensitivity22 Slc17a4 (ENSMUSG00000021336) 1.14 An intestinal organic anion exporter expressed predominantly in the pancreas, liver, colon, and < 0.01 small intestine23 Lgmn (ENSMUSG00000021190) 0.76 < 0.01 Elevated expression of LGMN is reported in the tumor cells of l iver24 Slc7a8 (ENSMUSG00000022180) 1.26 < 0.01 Slc7a8 deletion is protective against diet-induced obesity25 Pgm3 (ENSMUSG00000056131) 1.17 26< 0.01 Up-regulated in livers of high fat diet fed mice Mmp12 (ENSMUSG00000049723) 4.16 < 0.01 Mmp12 is a matrix metalloproteinase and associated with liver disease and i nflammation27 Cyp2c38 (ENSMUSG00000032808) 2.47 < 0.01 Cyp2c family up-regulated in NAFLD mouse model28 Itgam (ENSMUSG00000030786) 2.92 < 0.01 Increase in Itgam (aka Cd11b) in liver X receptors knockout mice29 Adgrg2 (ENSMUSG00000031298) 2.76 < 0.01 Found upregulated in cholestasis liver tissue compared to mildly damaged liver t issue30 Nap1l1 (ENSMUSG00000058799) 0.71 < 0.01 Tumor promoter in hepatocellular carcinoma31 Method II Gstm3 (ENSMUSG00000004038) 2.40 < 0.01 Associated with acute-on-chronic hepatitis B liver failure32 Myo1c (ENSMUSG00000017774) 0.61 < 0.01 Upregulation associated with human chronic liver disease33 Rac2 (ENSMUSG00000025003) 1.89 < 0.01 Associated with NAFLD34 Cyp2c39 (ENSMUSG00000025003) 2.65 < 0.01 Cyp2c family up-regulated in NAFLD mouse model28 Gm3776 (ENSMUSG00000111709) 3.11 Gm3776, or glutathione S-transferase (GST) alpha 13, belongs to GST genes that are associated < 0.01 with liver disease35 Table 4. Top 10 most significant up-regulated genes found only by Method I or Method II, respectively. Received: 14 March 2023; Accepted: 8 November 2023 References 1. Murray, D., Doran, P., MacMathuna, P. & Moss, A. C. In silico gene expression analysis—An overview. Mol. Cancer 6, 1–10 (2007). 2. Costa-Silva, J., Domingues, D. & Lopes, F. M. RNA-seq differential expression analysis: An extended review and a software tool. PloS one 12, e0190152 (2017). 3. Chaix, A., Lin, T., Le, H. D., Chang, M. W. & Panda, S. Time-restricted feeding prevents obesity and metabolic syndrome in mice lacking a circadian clock. Cell Metab. 29, 303–319 (2019). 4. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with deseq2. Genome Biol. 15, 550. https:// doi. org/ 10. 1186/ s13059-0 14- 0550-8 (2014). 5. Withaar, C. et al. The effects of liraglutide and dapagliflozin on cardiac function and structure in a multi-hit mouse model of heart failure with preserved ejection fraction. Cardiovasc. Res. 117, 2108–2124 (2021). 6. Sloley, S. S. et al. High-frequency head impact causes chronic synaptic adaptation and long-term cognitive impairment in mice. Nat. Commun. 12, 1–20 (2021). 7. Smith, B. J. et al. Changes in the gut microbiome and fermentation products concurrent with enhanced longevity in acarbose- treated mice. BMC Microbiol. 19, 1–16 (2019). 8. Turner, J. R. & Thayer, J. Introduction to Analysis of Variance: Design, Analyis & Interpretation: Design, Analyis & Interpretation. (Sage, 2001). 9. Anders, S. & Huber, W. Differential expression analysis for sequence count data. Nat. Prec. 1–1 (2010). 10. Ghallab, A. et al. Spatio-temporal multiscale analysis of western diet-fed mice reveals a translationally relevant sequence of events during NAFLD progression. Cells 10, 2516 (2021). 11. Hothorn, L. A. Statistics in Toxicology Using R. (CRC Press, 2015). 12. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing (2022). 13. Alexa, A. & Rahnenfuhrer, J.topGO: Enrichment Analysis for Gene Ontology. R Package Version 2.50.0. (2022). 14. Zhu, A., Ibrahim, J. G. & Love, M. I. Heavy-tailed prior distributions for sequence count data: Removing the noise and preserving large differences. Bioinformatics 35, 2084–2092 (2019). 15. Leon, A. C. & Heo, M. Sample sizes required to detect interactions between two binary fixed-effects in a mixed-effects linear regression model. Comput. Stat. Data Anal. 53, 603–608 (2009). 16. Fleiss, J. L. Design and Analysis of Clinical Experiments. (Wiley, 2011). 17. Reilly, S.-J. et al. A peroxisomal acyltransferase in mouse identifies a novel pathway for taurine conjugation of fatty acids. FASEB J. 21, 99–107 (2007). 18. Kiziltas, S. Toll-like receptors in pathophysiology of liver diseases. World J. Hepatol. 8, 1354 (2016). 19. Tucker, B., Li, H., Long, X., Rye, K.-A. & Ong, K. L. Fibroblast growth factor 21 in non-alcoholic fatty liver disease. Metabolism 101, 153994 (2019). 20. Han, B. et al. The role of tgfbi ( βig-h3) in gastrointestinal tract tumorigenesis. Mol. Cancer 14, 1–12 (2015). 21. Le-Tian, Z. et al. Protein acetylation in mitochondria plays critical functions in the pathogenesis of fatty liver disease. BMC Genom- ics 21, 1–17 (2020). 22. Fujitani, Y. et al. Pronounced adipogenesis and increased insulin sensitivity caused by overproduction of prostaglandin d2in vivo. FEBS J. 277, 1410–1419 (2010). Scientific Reports | (2023) 13:20804 | https://doi.org/10.1038/s41598-023-47057-0 11 Vol.:(0123456789) www.nature.com/scientificreports/ 23. Togawa, N., Miyaji, T., Izawa, S., Omote, H. & Moriyama, Y. A Na+-phosphate cotransporter homologue (slc17a4 protein) is an intestinal organic anion exporter. Am. J. Physiol.-Cell Physiol. 302, C1652–C1660 (2012). 24. Reddy, B. D., Beeraka, N. M., Chitturi, C. & Madhunapantula, S. V. An overview of targeting legumain for inhibiting cancers. Curr. Pharmaceut. Des. 27, 3337–3348 (2021). 25. Pitere, R. R., van Heerden, M. B., Pepper, M. S. & Ambele, M. A. Slc7a8 deletion is protective against diet-induced obesity and attenuates lipid accumulation in multiple organs. Biology 11, 311 (2022). 26. Jang, J.-H. et al. Klhl3 deficiency in mice ameliorates obesity, insulin resistance, and nonalcoholic fatty liver disease by regulating energy expenditure. Exp. Mol. Med. 54, 1250–1261 (2022). 27. Naim, A., Pan, Q. & Baig, M. S. Matrix metalloproteinases (MMPS) in liver diseases. J. Clin. Exp. Hepatol. 7, 367–372 (2017). 28. Xiang, L. et al. Comparison of hepatic gene expression profiles between three mouse models of nonalcoholic fatty liver disease. Genes Dis. 9, 201–215 (2022). 29. Endo-Umeda, K. et al. Liver x receptors regulate hepatic f4/80+ cd11b+ Kupffer cells/macrophages and innate immune responses in mice. Sci. Rep. 8, 1–14 (2018). 30. Liu, X., Taylor, S. A., Celaj, S., Levitsky, J. & Green, R. M. Expression of unfolded protein response genes in post-transplantation liver biopsies. BMC Gastroenterol. 22, 380 (2022). 31. Zhang, Y.-W. et al. Nap1l1 functions as a tumor promoter via recruiting hepatoma-derived growth factor/c-jun signal in hepatocel- lular carcinoma. Front. Cell Dev. Biol. 9, 659680 (2021). 32. Sun, F.-K. et al. High promoter methylation levels of glutathione-S-transferase m3 predict poor prognosis of acute-on-chronic hepatitis b liver failure. Hepatol. Res. 47, 566–573 (2017). 33. Arif, E. et al. Targeting myosin 1c inhibits murine hepatic fibrogenesis. Am. J. Physiol.-Gastrointest. Liver Physiol. 320, G1044–G1053 (2021). 34. Zhu, J., Min, N., Gong, W., Chen, Y. & Li, X. Identification of hub genes and biological mechanisms associated with non-alcoholic fatty liver disease and triple-negative breast cancer. Life 13, 998 (2023). 35. Prysyazhnyuk, V. et al. Glutathione-S-transferases genes-promising predictors of hepatic dysfunction. World J. Hepatol. 13, 620 (2021). Acknowledgements This work has been supported (in part) by the Research Training Group “Biostatistical Methods for High- Dimensional Data in Toxicology” (RTG 2624, Project P2) funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation-Project Number 427806116). Author contributions J.R. conceived the original idea. C.D. performed the calculations with the help of H.K. J.D. wrote the manuscript with input from all authors. J.R. and F.K. supervised the work, revised the manuscript, and proposed analyses in discussions with J.D. and C.D. All authors read and improved the manuscript. Funding Open Access funding enabled and organized by Projekt DEAL. Competing interests The authors declare no competing interests. Additional information Supplementary Information The online version contains supplementary material available at https://d oi. org/ 10. 1038/s 41598-0 23- 47057-0. Correspondence and requests for materials should be addressed to J.C.D. Reprints and permissions information is available at www.nature.com/reprints. Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://c reati vecom mons.o rg/l icens es/ by/4.0 /. © The Author(s) 2023 Scientific Reports | (2023) 13:20804 | https://doi.org/10.1038/s41598-023-47057-0 12 Vol:.(1234567890) Article 4 Bayesian non-linear subspace shrinkage using horseshoe priors Julia Christin Duda duda@statistik.tu-dortmund.de Department of Statistics, TU Dortmund University, Dortmund, Germany Matthew Wheeler matt.wheeler@nih.gov Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Research Triangle Park, Durham, North Carolina, U.S.A. 2024/06/05 Abstract When modeling biological responses using Bayesian non-parametric regression, prior information may be available on the shape of the response in the form of non-linear function spaces that define the general shape of the response. To incorporate such information into the analysis, we develop a non-linear functional shrinkage (NLFS) approach that uniformly shrinks the non-parametric fitted function into a non-linear function space while allowing for fits outside of this space when the data suggest alternative shapes. This approach extends existing functional shrinkage approaches into linear subspaces to shrinkage into non-linear function spaces using a Taylor series expansion and corresponding updating of non-linear parameters. We demonstrate this general approach on the Hill model, a popular, biologically motivated model, and show that shrinkage into combined function spaces, i.e., where one has two or more non-linear functions a priori, is straightforward. We demonstrate this approach through synthetic and real data. Computational details on the underlying MCMC sampling are provided with data and analysis available in an online supplement. 1 1 Introduction 2 When modeling complex biological systems, mechanistic knowledge about the system under in- 3 vestigation is often available; however, including this information in a statistical model may be im- 4 possible due to the system’s complexity in relation to experimental and computational resources 5 [Mesarovic et al., 2004]. Often, simplified models are used in lieu of the true mechanistic model 6 [Šimon, 2005]. When using these simplified models, one expects them to describe the observed 1 1 INTRODUCTION 7 data correctly or be mildly misspecified, and in the case of misspecification, the model may still 8 be helpful in describing the response. 9 When modeling biological systems, an example of this situation is the use of the Hill model. 10 This model, which represents sigmoidal-shaped responses, is a simplification of the complex 11 biochemical process based upon chemical kinetics [Hill, 1910] and is used to model a wide variety 12 of biochemical processes [Goutelle et al., 2008]. Despite its widespread use, it may not always 13 represent the observed response. Non-monotone deviations of the Hill’s functional form may be 14 evident in the data. Additionally, other competing models may also be available, and the modeler 15 might like to include this information to inform the fitting process, too. We develop a framework 16 that allows one to define a subspace over one or more function spaces of interest for Bayesian 17 non-parametric regression. 18 From the Bayesian perspective, there is a rich literature on approaches incorporating prior 19 knowledge in non-parametric regression. Naively, one may center the non-parametric model on 20 the specified parametric function. When the parametric data-generating mechanism’s mean is 21 the known parametric model, ensuring that estimates do not contain artifactual deviations from 22 that model is difficult, implying that shrinkage to the prior model will not be uniform. Further, using 23 this method, there is no way to create a space based on multiple parametric functions. More 24 sophisticated approaches use shape constraints induced through the prior distribution, which 25 include monotonicity or limits to the number of extrema [Brezger and Steiner, 2008, Shively et al., 26 2009, Meyer, 2008, Shively et al., 2011, Meyer et al., 2011, Gunn and Dunson, 2005, Köllmann 27 et al., 2014, Wheeler et al., 2017]. Though these approaches are often effective, they do not 28 directly incorporate parametric modeling information on the shape of the model; they force the 29 response to be in the constrained space by putting a prior mass of zero on all responses outside 30 of that space. 31 Alternatively, one may merge mechanistic prior knowledge into a model is through ordinary 32 differential equations (ODEs) within a Bayesian framework. Parametric Bayesian models include 33 pharmacokinetic/pharmacodynamic modeling, discussed by Lunn et al. [2002], and Huang et al. 34 [2006] present an HIV-modeling example using Bayesian hierarchical models with non-linear dif- 35 ferential equations. More flexible non-parametric approaches use differential equations to inform 36 stochastic processes with induced constraints [Golightly andWilkinson, 2011, Titsias et al., 2012]. 37 While Alvarez et al. [2013] and Wheeler et al. [2014] proposed a Gaussian Process (GP) ap- 38 proach that incorporates mechanistic knowledge defined by differential equations. More recently, 39 Chen et al. [2022] incorporate mechanistic knowledge defined by linear or non-linear partial dif- 40 ferential equations (PDE) into a GP framework by selecting PDE points, i.e. pseudo covariate 2 2 MODEL 41 points through which the assumed PDE information is incorporated. Like the shape-constrained 42 approaches, these methods form a Basis expansion consistent with a subspace defined using 43 mechanistic knowledge. Thus, these priors imply that an estimated function is within the given 44 subspace, and they do not allow for deviations outside of this space. 45 We define a prior distribution over a non-linear subspace - such as the Hill model and power 46 model - that does not require a fitted function to be within that subspace. When the non-linear 47 subspace is correctly specified, shrinkage into it occurs; but, when the true model is outside 48 of the subspace, the approach is unconstrained. We build upon the work of Shin et al. [2020] 49 who introduced the functional horseshoe (fHS) prior for linear spaces. The fHS prior shrinks the 50 non-parametric fit towards a pre-specified, linear subspace. This approach is different from well- 51 known shrinkage approaches such as Ridge, Lasso or Horseshoe [Hoerl and Kennard, 1970, 52 Tibshirani, 1996, Carvalho et al., 2010], which shrink model coefficients in a non-parametric 53 regression towards the origin. The prior of Shin et al. [2020] has the appealing property that 54 the posterior shrinks into the pre-specified subspace f it is consistent with the observed data or, 55 alternatively, is left unconstrained otherwise. The shrinkage occurs at the minimax optimal rate. 56 In our extension, we use a Taylor expansion to locally linearize the response function, where 57 the derivatives depend on parameters of the non-linear model. The extension allows functional 58 shrinkage into a non-linear function space or adapts the function to be outside of the non-linear 59 space. The relevant non-linear function space is specified a priori using one or more parametric 60 models. 61 We present our shrinking approach in Section 2. Section 3 then illustrates the approach both 62 for the case of shrinkage into a single function space - shown for the Hill model - and into a 63 combined function space - shown for the Hill and the power models. We compare our method 64 against other parametric and non-parametric approaches in a simulation study in Section 4. We 65 apply our method to a real-world data example of total testosterone levels measured in 9943 66 males aged between 3 and 85 years in section 5. The computational back-bone of the approach 67 is MCMC sampling combining Gibbs-, Metropolis-Hastings- and Slice-sampling [Brooks et al., 68 2011, Neal, 2003], detailed in the supplementary material. 69 2 Model 70 2.1 Spline Model 71 Consider the non-parametric regression problem yi = g(xi) + εi, (1) 3 2.2 Bayesian Priors for a Non-linear Subspace 2 MODEL 72 with unknown mean function g : R → R. We observe y = (y1, . . . , y )′n corresponding to co- 73 variates iidx = (x1, . . . , x ′n) and wish to estimate g. Assuming εi ∼ N(0, σ2), it is common to 74 approximate g using a B-spline basis expansion [Carl, 2001], i.e., ∑k f(x ) = ϕji m(xi)βm, (2) m=1 75 or f(x) = Φβ. Here, the B-spline basis ϕjk(x) are of order j defined on k ∗ internal knots, where ∗ 76 k = k +j, β = (β1, . . . , β ) ⊤ k denotes the vector of basis coefficients. We consider cubic splines 77 and omit the superscript j = 3. With a dense knot set, the spline approximation f can be made 78 to be arbitrarily close to any continuous g, allowing one to estimate a large space of functions to 79 arbitrary precision. 80 2.2 Bayesian Priors for a Non-linear Subspace 81 For many prior specifications, the expansion in (2) may not place high prior probability on biologi- 82 cally relevant responses. To define a biologically relevant model, we construct a prior distribution 83 that places significant prior mass on the function space defined by the non-linear model, e.g., the 84 space of Hill models, but does not put zero mass outside the function space. 85 To do this, assume knowledge about the shape of g through a twice differentiable function 86 hθ : R → R. The function hθ depends on parameter vector θ, and defines the function space ΩΘ87 0 = {hθ|θ ∈ Θ} for all realizations Θ ⊆ Rs. If the true mean function g happens to be outside 88 ΩΘ, shrinkage towards ΩΘ0 0 is undesirable. Given a dense knot set, the spline f can approximate 89 hθ for any ϵ−ball. Consequently, the space of functions represented by the spline contains ΩΘ0 . 90 We define a prior for (2) that places prior mass on ΩΘ0 , but does not limit responses to be only in Θ 91 Ω0 . 92 To define this prior, we consider Shin et al. [2020], who defined a projection prior that shrinks 93 into the linear column space defined by the matrix Φ ∈ Rn×d0 through ( ) p(β|σ2, τ2) ∝ 1(τ2)−(k−d0)/2 exp − β⊤Φ⊤(I − PΦ0)Φβ , (3)2σ2τ2 94 where d0 = rank(Φ0), Φ0 is constructed as a linear space of known covariates, and PΦ0 is the 95 orthogonal projection matrix into the column space of Φ0. The hyperparameter , τ, is given a 96 generalized horseshoe (HS) prior with hyperparameters a and b (cf. Shin et al. [2020]). When 97 a = b = 0.5 the prior is a half-Cauchy distribution, and one arrives at the HS prior [Carvalho 98 et al., 2010]. 99 In (3), one constructs PΦ0 using the linear column space of Φ0. Given our space is non-linear, 100 there is no direct analogue to PΦ0 . As an approximation, we use a Taylor series approximation of 4 3 NON-LINEAR FUNCTIONAL SHRINKAGE FOR SINGLE OR COMBINED FUNCTION SPACES 101 hθ0 , θ0 ∈ Θ. That is, we linearly approximate ΩΘ0 at any θ0 using a first-order Taylor expansion hθ(x) ≈ hθ0(x) + Ḣθ0(x)(θ − θ0) (4) ∣ 102 where Ḣ (x) = ∂hθ(x) ∣∣ ∈ Rn×sθ0 ∂θ is the Jacobian containing the partial derivatives of hθθ=θ0 103 evaluated at θ0. The column space of Ḣθ(x) approximates hθ(x) [Seber and Wild, 2003][p. 130] 104 and we use Ḣθ(x) to construct Pθ = PḢ . Thus, for any θ0, we project f(x) onto the space locallyθ 105 approximating hθ0 . When there are multiple function spaces to consider, the same approach 106 applies; in this case, operator PḢ defines the projection into a combined linear space, where Ḣθθ 107 represents the Jacobian across all assumed functions. 108 We place the prior ( ) p(β| 1σ2, τ2, θ) ∝ (τ2)−k/2 exp − β⊤Φ⊤(I − Pθ)Φβ (5) 2σ2τ2 109 over β to shrink realizations of (2) into ΩΘ0 . In (5), θ is given an appropriate prior to complete 110 the specification. This approach penalizes deviations of Φβ based upon the projection operator 111 (I−Pθ). As we shrink back to a planar approximation given a specific θ0, we require appropriately 112 specified priors for the non-linear parameters in Θ. As β is defined conditional on θ through the 113 linear projection operator Pθ, only priors for the non-linear parameters can be learned. 114 In this formulation, (τ2)−(k−d0)/2 in (3) becomes (τ2)−k/2 because we separately model the 115 intercept (cf. Section 3.1). This change yields proper priors as due to the non-linearity, no linear 116 basis of Φ is in (I − Pθ) and Φ⊤(I − Pθ)Φ has full rank. 117 3 Non-linear functional shrinkage for single or combined function 118 spaces 119 3.1 Single function spaces 120 As an example of non-linear functional shrinkage using a single function, we consider the Hill 121 model. This function is given by xθ4 h(x, θ) = θ1 + θ2 , (6) θ θ4 θ3 + x 4 122 where θ1 is the background response at x = 0, θ2 is the maximal change in the response, θ3 is 123 the dose where half of this change is reached and θ4 defines the steepness of the curve. The 124 Jacobian, Ḣθ, is ∣ ∂h(x, θ) ∣ ( )∣ ∣∣ = 1 x θ4 −θ ∣4 , (7) ∂θ xθ4+θ θ θ2 θ s(x, θ3, θ4) θ2 log(θ3/x)s(x, θ3, θ4) ∣ θ=θ 3 4 3 θ=θ 0 0 5 3 NON-LINEAR FUNCTIONAL SHRINKAGE FOR SINGLE OR COMBINED FUNCTION 3.2 Combined function spaces SPACES 125 with s(x, θ , θ ) = ((xθ −1)θ43 4 3 + 1)−1((θ −1 θ4 −13x ) + 1) . The derivative matrix does not depend 126 upon the linear parameters θ1, but it still depends on θ2. However, Pθ does not depend on θ1 and 127 θ2 (cf. Lemma 1 in the appendix), which gives a direct example of why we do not place a prior over 128 these linear quantities. Of the parameters in (7), parameter θ3 is of particular interest because 129 it represents the value of x that produces a response that is the average of the lower and upper 130 asymptote. Values of the covariate below θ3 correspond to values of the response less than 50% 131 of the maximal response. Further, θ4 corresponds to the steepness of the response and speed 132 of a chemical reaction in a biological substrate. As both quantities have direct interpretation, 133 informative priors can be developed for these quantities accordingly, which in turn informs the 134 subspace the model may shrink into. 135 To specify the hyperprior over (θ3, θ4), we assume x ∈ [0, 1], and let E[θ3] = 0.5, the mid- 136 point, and for θ4, we center it on 3, letting the parameter vary within a range that we have often 137 seen in bioassays. In our model, θ1 enters as the intercept, and θ2, the maximal response 138 change, implicitly enters the model through the β coefficients. Using the Hill model as a prior to 139 define (5), we complete the prior specification as (y|β, σ2, θ ) ∼ N(θ +Φβ, σ21 1 In) (8) θ1 ∼ N(0, 20), θ3 ∼ N+(0.5, 0.05) θ4 ∼ LN(0.95, 0.29) (9) σ2 ∼ IG(0.001, 0.001), (10) 140 where N+(a, b) is a truncated normal distribution with mean a and variance b (before truncation), 141 LN(a, b) is a log-normal distribution with log-mean a and log-variance b, and IG(a, b) is an 142 inverse-gamma distribution with shape a and scale b. Note that θ4 ∼ LN(0.95, 0.29) results in 143 E[θ4] = 3 and V [θ4] = 3. 144 3.2 Combined function spaces 145 If one desires multiple functions to define in the function space because of uncertainty in the 146 function space, one can add multiple functions. Here, assume there are r ∈ {1, . . . , R} = R function spaces (r) { (r)147 Ω0 = hθ |θ ∈ Θ(r)} of interest; we omit the index r on each θ for simplicity. (r) 148 For each (r)Ω0 , calculate the Jacobian, Ḣθ , i.e., (R) (1) (R) Ḣθ = (Ḣθ . . . Ḣθ ), (R) 149 and use this to construct Pθ. The Jacobian, Ḣθ , must be full rank without linear bases other 150 than an intercept column for Equ. 5 to hold. 151 To illustrate the combined subspace shrinkage approach, we use the Hill and power models. 152 The latter function defined as as hθ(x) = θ1 + θ xθ32 , which has only one non-linear parameter, 6 4 SIMULATION STUDY 153 θ3, that requires a prior specification. We use θ3 ∼ N(0.5, 0.25), to center on a concave shape. 154 The partial derivatives of the power model are ∂h(x, θ) ( ) = 1 xθ (1) 3 log(x)xθ3 = Ḣθ . (11)∂θ (1) (2) 155 Prior to combining Ḣθ of the power model and Ḣθ of the Hill model (Eq. 7), we remove the (1) 156 intercept from Ḣθ to obtain a full rank. Shrinkage into the combined subspaces is equivalent to 157 shrinkage into a single subspace. 158 4 Simulation Study 159 4.1 Setup 160 We perform a simulation study and evaluate the performance of the proposed approach against 161 other fitting strategies. Full details of the simulation design are summarized by the ADEMP 162 principle described in Morris et al. [2019] (Table S2). We generate data using three parametric 163 cases: the Hill model, the power model, and a misspecified model (the Hill model with downturn). 164 We look at exposure-response data as, for such data, chemical kinetics of exposure are often 165 approximated by the Hill model, but the results generalize to other domains. 166 For each data set, we draw x ∈ [0, 1] uniformly for n ∈ {50, 100, 200, 500} observations, 167 where 50 is a realistic assay size and larger n are chosen to study the large sample behavior. 168 Mean zero normal noise with variance σ2 = 0.005 and a larger noise σ2 = 0.05 is added. These 169 variances represent a 2-SD spread that is approximately 14% and 45% of the maximal response. 170 In total, 24 data generating scenarios are used, with nrep = 1000 simulations per scenario. For 171 each dataset, we apply the following methods: 172 4.1.1 Modeling Approaches 173 Non-linear functional shrinkage (NLFS) 174 175 Non-linear functional shrinkage is performed with shrinkage into the Hill space (NLFS(Hill)), 176 power space (NLFS(power)), or a combination (cf. Section 3.2) of both function spaces (NLFS(Hill+power)). 2 177 Two variations for the shrinkage parameter τ are considered. One uses a half Cauchy prior 178 (a = b = 0.5) and is implemented according to Makalic and Schmidt [2015]; the other, imple- 179 mented ourselves using slice sampling [Neal, 2003], uses a ω ∼ Beta(a, b) prior where a = 0.5 180 and b = exp(−k log(n)/2) as proposed by Shin et al. [2020] and k is the number of knots. 7 4.1 Setup 4 SIMULATION STUDY 181 Parametric Model (Param.) 182 183 We investigate the performance fitting of the Oracle model using Bayesian parametric regression 184 for the Hill model (Param.(Hill)) or the power model (Param.(power)) (priors in Table S2). Fitting 185 these models allows us to compare the performance of the Oracle NLFS to the Oracle parametric 186 model. 187 B-splines 188 189 Bayesian B-splines with a scaling parameter λ2 ∼ IG(0.001, 0.001) where the spline coefficients 190 are given by the prior β ∼ N(0, σ2λ2diag(k)). This model represents a basis approach without 191 smoothing and is used to compare the performance of the NLFS approach when the shrinkage 192 subspace is misspecified. 193 P-splines 194 195 Penalized Bayesian smoothing splines where β ∼ N(0, σ2τ2K−1) where K = R⊤R and R is 196 a second order penalty matrix and τ2 ∼ IG(1, 0.005), similar to the hyperparameter choices in 197 Lang and Brezger [2004]. This approach builds upon the B-spline approach, adding a smoothing 198 component, and typically performs better in practice than B-splines 199 Parametric Model + horseshoe B-spline 200 201 We also consider a model that includes the true parametric model plus an additional B-spline to 202 account for model misspecification, i.e., y = hθ(x) + Φβ + ε. When hθ(x) specifies the correct 203 model, one has β = 0; otherwise, β ̸= 0. To shrink the β coefficients to zero, we use a horseshoe 204 prior, i.e., β ∼ iidN(0, σ2τ2diag(λ 2 21 , . . . , λk )) where τ ∼ C+(0, 1) and λj ∼ C+(0, 1), cf. Makalic 205 and Schmidt [2015]. C+(0, 1) denotes a standard Half-Cauchy prior. As in the parametric model 206 case, hθ is either the parametric Hill (Param.(Hill)+B-spline) or power model (Param.(power)+B- 207 spline). This approach represents a direct competitor to the NLFS approach. 8 4.2 Results 4 SIMULATION STUDY 208 4.1.2 Further Considerations 209 210 For all simulations, we use k = 15 inner knots for the B-spline basis matrix. When MCMC 211 sampling, we took 10,000 draws from the posterior, discarding the first 2000 samples as burn-in. 212 Initial experiments indicated that this number of samples was adequate to estimate the posterior 213 distribution. For the spline-based approaches (NLFS, B-spline, P-spline), we place a vague prior 214 on the intercept term, θ1, defined in (8). Traceplots, of an NLFS fit with correctly and incorrectly 215 specified subspaces, are given in the supplement (supplemental Figures 4 and 5) and show 216 convergence. 217 4.2 Results 218 Figure 1 gives representative results of the simulation, where all results are provided in the sup- 219 plement. Unsurprisingly, when the Hill model is the truth (Figure 1a), we observe the largest 220 RMSE of 0.151 when fitting the misspecified parametric model (Param.(power)). Unlike the 221 misspecified parametric fits, when the function space is misspecified in the NLFS approach 222 (NLFS(power)), the RMSE is approximately one-third (0.046) that from fitting the misspecified 223 parametric model, indicating the NLFS approach adjusts to the data. In this scenario, the 224 NLFS(power) performance with mean RMSE of 0.046 was similar to that of the B-spline approach 225 with mean RMSE of 0.042. 226 When the correct function space is assumed for the NLFS prior, the RMSE drops to 0.019 227 (Figure 1a), as low as that of the 0racle parametric fit. This demonstrates the adaptive shrinkage 228 behavior of the NLFS approach in the case of correct subspace specification. Here, the NLFS 229 approach effectively shrank towards the correctly assumed space for sample sizes as low as 230 n = 50, and performed similarly to the oracle parametric model fit. The P-spline approach 231 receives no prior model or subspace specification but yields smooth splines. Consequently, its 232 performance was in between the approach with misspecification and correct specification. 233 When the correct function space is assumed, the NLFS approach tended to outperform the 234 parametric + horseshoe B-spline (PHBspline) approach. The PHBspline approach does not en- 235 force an equally smooth, global shrinkage of all β towards zero, especially when there are lever- 236 age points far from the observed mean. 237 When all assumed models or spaces are misspecified (Figure 1b), the NLFS approach was 238 outperformed by the PHBspline approach for the same model misspecification, i.e., NLFS(power) 239 was outperformed by param.(power) + horseshoe B-spline and NLFS(Hill) was outperformed by 240 param.(Hill) + horseshoe B-spline. However, the NLFS approach in general has an advantage 9 4.2 Results 4 SIMULATION STUDY a) Truth: Hill (n=50) 0.100 0.030 0.010 0.003 Param. Param. NLFS P−Spline NLFS* Param.* Param. NLFS (power) (power) (power) (Hill) (Hill) (Hill) (Hill+power) +horseshoe +horseshoe B−Spline B−spline b) Truth: Hill + Downturn (n=50) 0.10 0.05 0.03 Param. Param. NLFS P−Spline NLFS Param. Param. NLFS (power) (power) (power) (Hill) (Hill) (Hill) (Hill+power) +horseshoe +horseshoe B−Spline B−spline Figure 1: Representative root mean square error (RMSE) results of the simulation for the scenar- ios where the truth is Hill (a) or Hill + Downturn (b) and a medium noise level. Pane b represents the situation where a deviation of an unknown shape is the truth. All simulation results can be found in Tables S3 and S4. ∗Correct model used in the model fit. 10 RMSE RMSE 5 REAL DATA EXAMPLE 241 in misspecification scenarios due to its inherent flexibility to shrink toward combined function 242 spaces. The NLFS(Hill+power) outperformed all other approaches in this scenario with a mean 243 RMSE of 0.028. Only the P-spline approach came close, showing a slightly weaker performance 244 with a mean RMSE of 0.030. 245 The NLFS prior appropriately shrinks into the correct space, giving equivalent fits to the para- 246 metric Hill model (Figure 2a). Correspondingly, the P-Spline smoothing approach estimate shows 247 various artifactual bumps not evident in the NLFS approach. Even if the true model is the Hill 248 model, the naive PHBspline approach produces an artifactual bump in the asymptote region that 249 does not occur with the NLFS fit (Figure 2d). The NLFS approach and generic B-spline are 250 equivalent when the model is misspecified (i.e., fits a model outside of the assumed space) (Fig- 251 ure 2b). The NLFS(Hill+power) approach fits a model outside of the Hill space (Figure 2c) and 252 illustrates how shrinkage into combined subspaces can reduce misspecification errors involving 253 minor deviations. 254 5 Real Data Example 255 We applied the proposed method on a testosterone data set collected by Kelsey et al. [2014]. 256 They modeled total testosterone (TT) concentration in male participants dependent on age, to 257 identify normal TT ranges at any age. TT levels are the result of highly complex physiological 258 processes, mechanistic models are not available. TT is expected to increase during puberty, 259 reach a maximum and possibly slowly decline with age, which is a sigmoidal assumption. Con- 260 sequently, we a priori assume the Hill based shape through the NLFS prior. 261 Kelsey et al. [2014] collected data from 13 studies on TT by age, yielding>10.000 data points; 262 they then fit 330 polynomial models and selected a single best parametric model based on the 2 263 best R with 5-fold cross-validation. Due to the large number of data points, spline smoothing 264 approaches tend to produce local artifacts that are biologically unreasonable. The NLFS ap- 265 proach offers an alternative to extensive model comparisons while simultaneously incorporating 266 knowledge on the curves shape. 267 For the analysis, we set θ3 ∼ N(15, 4), (12) θ4 ∼ LN(2.28, 0.05). (13) 268 Since testosterone levels increase during puberty, a priori we assume TT levels to reach half 269 maximal levels (e.g., θ3) at approximately 15 years with a standard deviation of 2 years. For the 270 steepness parameter, θ4, we set the log-mean to 2.28 and the log-variance to 0.05 implying that 11 5 REAL DATA EXAMPLE a) Truth = Hill, n = 50 b) Truth = Hill, n = 50 NLFS(Hill) NLFS(power) Param.(Hill) B−Spline P−Spline Param.(power) 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Dose Dose c) Truth = Hill + Downturn, n = 50 d) Truth = Hill, n = 50 NLFS(Hill+power) NLFS(Hill): Post. mean P−Spline NLFS(Hill): 95% CI Truth Param.(Hill) + B−spline: Post. mean Param.(Hill) + B−spline: 95% CI 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Dose Dose Figure 2: Posterior mean responses (a-d) and credible intervals (d) of representative simulation runs with median noise level (σ2 = 0.005): (a) Oracle scenario. (b) misspecification scenario. (c) deviations of unknown shape. Combined subspace shrinkage reduces misspecification errors. (d) Comparison of credible intervals between NLFS and PHBspline in the Oracle scenario. 12 Posterior mean Posterior mean 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Response Posterior mean 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 5 REAL DATA EXAMPLE NLFS(Hill) Literature model P−spline 0 20 40 60 80 Age [years] Figure 3: Comparison of the NLFS approach to a P-spline fit and the best parametric model selected by Kelsey et al. [2014]. For the P-spline and NLFS(Hill), the posterior mean is displayed. The parametric model by Kelsey is log10(TT + 1) = (a+ cx+ ex 2 + gx3)/(1 + bx+ dx2 + fx2) with a = 0.04655, b = −0.05311, c = 0.05123, −d = 0.00793, e = −0.01222, f = 0.00058 and g = 0.00069. For display, the model is backtransformed using exp(log10(TT + 1) + σ̂ 2/2) where σ̂2 is the estimated residual variance. 271 the actual mean and variance to be 10 and 5, respectively. We choose these values as we expect 272 TT to increase rapidly upon the onset of puberty. 273 We fit our model to the slightly reduced data set of men of age 85 or younger (98.5% of 274 original data) because of extreme variability in the approximately 150 observations above 85. To model the noise, we let σ2 ∼ C+275 (0, 1), to account for the large variance in the data. 276 The NLFS and parametric fit by Kelsey et al. [2014] are roughly sigmoidal (Figure 3). They 277 both show a peak around age 19, followed by a slight descent that eventually plateaus. The NLFS 278 approach notably predicts a larger mean testosterone level than the parametric model. The P- 279 spline fit has a less pronounced peak TT around age 20 and has artifactual bumps that oscillate 280 around the NLFS estimate. For each method, the observed RMSE values were 5.123 (NLFS), 281 5.154 (parametric), and 5.111 (P-spline) and therefore similar, but the three resulting model fits 282 are visibly distinguishable. Arguably, the NLFS approach more appropriately models the mean 283 than the P-spline due to the latter’s bumpiness. Further, NLFS estimates a higher mean TT level 284 than the parametric model, which suggests there may be some underestimation of the mean TT 285 when using a parametric approach. 13 Total Testosterone [nmol/L] 0 10 20 30 40 50 6 DISCUSSION 286 6 Discussion 287 The NLFS prior enables adaptive shrinkage into a pre-specified non-linear function space but 288 does not constrain the resulting function to be in that space if the data are not compatible with 289 that a priori function space. This approach can be applied to function spaces defined by any 290 twice differentiable function. Because such a setting is common, the NLFS approach balances 291 adhering to prior assumptions and accounting for model misspecification. 292 The NLFS approach can shrink into a combined function space, thereby providing robustness 293 against misspecification. This benefit is supported by simulation results, where NLFS combined 294 function space prior outperformed all other methods under model misspecification. Defining such 295 a prior is straightforward for the NLFS approach. Attempting to account for parametric misspec- 296 ification by including a spline that shrinks to zero if the model is correctly specified can give 297 artifactual features in the Oracle scenario. As a comparison, the NLFS prior gave slightly better 298 RMSE results under the Oracle scenario, and provided more realistic curve fits without the ar- 299 tifactual features. Though NLFS with a misspecified single subspace performed slightly worse, 300 adding subspaces in NLFS did not lead to a relevant performance loss compared to only including 301 the correct model but also robustified NLFS against misspecification. 302 When modeling the TT data in Kelsey et al. [2014], the NLFS yielded a plausible non- 303 parametric estimate that did not produce artifactual features. In this regard, NLFS provided 304 equally reasonable mean estimates as the parametric model, while not requiring a model selec- 305 tion procedure on over 300 models. 306 Simulation results empirically show that NLFS correctly decides to either shrink towards the 307 specified subspace, or remain unconstrained. Though we have not provided a theoretical proof, 308 our simulation results suggest that the optimal, theoretical shrinkage properties given by Shin 309 et al. [2020] approximately hold in the non-linear case. Because it performs similarly to an un- 310 smoothed spline estimate, adding a smoothness penalty similar to the one proposed by Wiemann 311 and Kneib [2021] for linear subspace shrinkage may be a promising extension. 312 The construction of NLFS assumes the independent variable to be continuously distributed, 313 with a unique covariate value for each observation. In some biological applications, data are 314 generated in a planned experimental setting, with multiple units treated at few distinct exposure 315 levels. For such experiments, the exposure is typically a dose or concentration. Dose-response 316 modeling is often performed in terms of a simple parametric Hill model fit, which can lead to 317 misspecification errors that could be prevented using NLFS. Tailoring NLFS to a such a data 318 structure is necessary. This can be done using a grid that defines the shrinkage locations, such 319 that the shrinkage is independent of the few experimentally selected doses. Precisely, Φθ would 14 6 DISCUSSION 320 be evaluated at a grid instead of the observed exposure levels. This avoids a lack of shrink- 321 age at basis functions that fall between exposure levels. This extension would yield more model 322 parameters related to the construction of the grid. Another extension using a fully specified Gaus- 323 sian processes is an alternative and would reduce hyperparameter choices on knot sequences 324 and shrinkage grids. Another extension is to account for heteroscedasticity. For non-parametric 325 Bayesian modeling, different methodologies can be applied, e.g. Dirichlet process priors. Other 326 computational challenges in the NLFS approach relate to the derivatives. For example, using the 327 Hill model, derivatives w.r.t. the non-linear parameters can be almost linearly dependent. Careful 328 prior selection or expanding the shrinkage onto additional subspaces might soften this challenge. 15 A TABLES 329 Acknowledgements 330 This manuscript was funded in part by the Research Training Group “Biostatistical Methods for 331 High-Dimensional Data in Toxicology” (RTG 2624) funded by the Deutsche Forschungsgemein- 332 schaft (DFG, German Research Foundation-Project Number 427806116) and intramural funds 333 at the NIEHS. 334 Supporting Information 335 The code and data underlying this article are available on GitLab at 336 https://gitlab.tu-dortmund.de/functional shrinkage/nonlinear shrinkage. 337 A Tables Table 1: Overview of method settings used in the simulation study. Algorithm Assume Shrinkage (Horseshoe prior) 1 NLFS Hill a = b = 0.5 2 NLFS Hill a = 0.5, b = exp(−k log(n)/2) 3 NLFS Hill & power a = b = 0.5 4 NLFS Hill & power a = 0.5, b = exp(−k log(n)/2) 5 NLFS power a = b = 0.5 6 NLFS power a = 0.5, b = exp(−k log(n)/2) 7 parametric Hill - 8 parametric power - 9 B-spline - - 10 P-spline - - 11 Parametric + B-spline Hill a = b = 0.5 12 Parametric + B-spline power a = b = 0.5 16 A TABLES Table 2: Simulation setup summarized by the ADEMP principle. Aim Comparing proposed approach against existing approaches Data generation Dose-response models: - Hill: h (x) = x θ4 θ θ (θ4 θ 3 = 0.3, θ4 = 6)θ +x 43 - Power: h θ2θ(x) = θ1x , (θ2 = 0.5) - Hill + Downturn: hθ(x) = hHillθ (x) + 1[0.6,∞)(x)(−1.5(x− 0.6)2) (θ3 = 0.3, θ4 = 6) Doses: Unif∼ [0, 1] Sample sizes: n ∈ {50, 100, 200, 500} Added noise: ε ∼ N(0, σ2), σ2 ∈ {0.005, 0.05} Estimand Mean of posterior dose-response function estimate f(x) Methods Non-linear functional shrinkage (NLFS) (θ1 ∼ N(0, 1), σ2 ∼ IG(0.001, 0.001)) - assuming Hill (NLFS (Hill)) Priors: θ3 ∼ N+(0.5, 0.05), θ4 ∼ LN s.t. E(θ4) = 3, Var(θ4) = 3 - assuming power (NLFS (power)) Priors: θ3 ∼ N(0.5, 0.25) - assuming Hill and power (NLFS (Hill+power)) Priors: As in NLFS(Hill) and NLFS(power) Parametric Bayesian fit (Param.) (θ1 ∼ N(0, 1), log(σ2) ∼ N(−1.75, 1)) - assuming Hill (Param.(Hill)) Priors: θ3 ∼ N+(0.5, 0.05), θ4 ∼ LN s.t. E(θ4) = 3, Var(θ4) = 3 - assuming power (Param.(power)) Priors: θ ∼ N(0.5, 0.25) B-spline Priors: θ1 ∼ N(0, 1), σ2 ∼ IG(0.001, 0.001), λ2 ∼ IG(0.001, 0.001) P-spline Priors: θ1 ∼ N(0, 1), σ2 ∼ IG(0.001, 0.001), τ2 ∼ IG(1, 0.005) Parametric + horseshoe B-spline y = hθ(x) + Φ(β) + ε Priors: β ∼ N(0, σ2τ2diag(λ2 21, . . . , λk)) τ ∼ C+(0, 1), iidλ +j ∼ C (0, 1) θ1 ∼ N(0, 1), θ2 ∼ N(1.5, 2) (Scaling) - assuming Hill (Param.(Hill) + B-spline)) Prior: θ3 ∼ N+(0.5, 0.05), θ4 ∼ LN s.t. E(θ4) = 3, Var(θ4) = 3 - assuming power (Param.(power) + B-spline) Prior: θ ∼ N(0.5, 0.25) Performance RMSE between posterior mean E(f(x)|dis) and true g(x) evaluated at drawn doses x ∈ [0, 1]n 17 A TABLES 18 Table 3: Simulation results for σ2 = 0.005 (2σ = 14.1% of maximal effect) summarized by mean RMSE and corresponding standard deviation in parenthesis. OS means own slice and refers to setting a and b for the shrinkage parameters as suggested in Shin et al. [2020] whereas HC means Half Cauchy and refers to the standard horseshoe prior with a = b = 0.5. n=50 n=100 n=200 n=500 Method Hill power Hill+down Hill power Hill+down Hill power Hill+down Hill power Hill+down 1 NLFS(Hill), OS 0.019 0.027 0.037 0.014 0.017 0.028 0.009 0.011 0.024 0.006 0.008 0.017 (0.007) (0.008) (0.008) (0.005) (0.005) (0.005) (0.003) (0.003) (0.002) (0.002) (0.002) (0.004) 2 NLFS(power), OS 0.046 0.018 0.053 0.031 0.013 0.031 0.023 0.009 0.022 0.015 0.006 0.015 (0.013) (0.006) (0.017) (0.005) (0.005) (0.005) (0.004) (0.003) (0.004) (0.002) (0.002) (0.002) 3 NLFS(Hill+power), OS 0.021 0.022 0.028 0.015 0.015 0.023 0.01 0.011 0.018 0.007 0.007 0.011 (0.007) (0.007) (0.006) (0.005) (0.005) (0.004) (0.003) (0.003) (0.003) (0.002) (0.002) (0.003) 4 NLFS(Hill), HC 0.019 0.022 0.033 0.014 0.015 0.027 0.01 0.011 0.024 0.006 0.008 0.015 (0.007) (0.007) (0.007) (0.005) (0.005) (0.004) (0.003) (0.003) (0.002) (0.002) (0.002) (0.004) 5 NLFS(power), HC 0.046 0.017 0.054 0.03 0.012 0.03 0.021 0.009 0.022 0.013 0.006 0.013 (0.015) (0.006) (0.019) (0.005) (0.005) (0.005) (0.004) (0.003) (0.004) (0.002) (0.002) (0.002) 6 NLFS(Hill&power), HC 0.021 0.021 0.028 0.015 0.015 0.023 0.01 0.011 0.018 0.007 0.007 0.011 (0.007) (0.007) (0.006) (0.005) (0.005) (0.004) (0.003) (0.003) (0.003) (0.002) (0.002) (0.002) 7 param(Hill) 0.019 0.023 0.052 0.013 0.019 0.05 0.009 0.016 0.05 0.006 0.012 0.049 (0.007) (0.005) (0.008) (0.005) (0.004) (0.005) (0.003) (0.002) (0.003) (0.002) (0.001) (0.002) 8 param(power) 0.151 0.015 0.18 0.152 0.011 0.18 0.154 0.008 0.181 0.154 0.005 0.182 (0.011) (0.006) (0.011) (0.008) (0.005) (0.008) (0.005) (0.003) (0.006) (0.003) (0.002) (0.004) 9 bspline 0.042 0.042 0.042 0.032 0.033 0.032 0.023 0.024 0.023 0.014 0.014 0.014 (0.007) (0.008) (0.007) (0.005) (0.005) (0.005) (0.004) (0.004) (0.004) (0.002) (0.002) (0.002) 10 pspline 0.03 0.026 0.03 0.022 0.019 0.022 0.016 0.015 0.016 0.011 0.01 0.011 (0.007) (0.006) (0.007) (0.005) (0.004) (0.005) (0.003) (0.003) (0.003) (0.002) (0.002) (0.002) 11 param(Hill)+bspline 0.021 0.022 0.031 0.015 0.017 0.024 0.011 0.014 0.018 0.007 0.01 0.012 (0.007) (0.006) (0.006) (0.005) (0.004) (0.005) (0.003) (0.003) (0.003) (0.002) (0.002) (0.002) 12 param(power)+bspline 0.038 0.019 0.04 0.029 0.013 0.03 0.021 0.009 0.022 0.013 0.006 0.014 (0.007) (0.007) (0.007) (0.005) (0.005) (0.005) (0.004) (0.003) (0.004) (0.002) (0.002) (0.002) A TABLES 19 Table 4: Simulation results for σ2 = 0.05 (2σ = 44.7% of maximal effect) summarized by mean RMSE and corresponding standard deviation in parenthesis. OS means own slice and refers to setting a and b for the shrinkage parameters as suggested in Shin et al. [2020] whereas HC means Half Cauchy and refers to the standard horseshoe prior with a = b = 0.5. n=50 n=100 n=200 n=500 Method Hill power Hill+down Hill power Hill+down Hill power Hill+down Hill power Hill+down 1 NLFS(Hill), OS 0.059 0.081 0.076 0.043 0.058 0.063 0.03 0.039 0.051 0.019 0.022 0.036 (0.021) (0.021) (0.017) (0.015) (0.015) (0.012) (0.01) (0.011) (0.009) (0.007) (0.007) (0.007) 2 NLFS(power), OS 0.137 0.053 0.155 0.107 0.039 0.101 0.066 0.027 0.067 0.042 0.017 0.043 (0.015) (0.019) (0.022) (0.023) (0.014) (0.024) (0.012) (0.011) (0.012) (0.007) (0.007) (0.007) 3 NLFS(Hill+power), OS 0.064 0.066 0.072 0.046 0.046 0.055 0.032 0.031 0.041 0.02 0.02 0.03 (0.021) (0.021) (0.019) (0.015) (0.015) (0.014) (0.01) (0.011) (0.01) (0.007) (0.007) (0.006) 4 NLFS(Hill), HC 0.057 0.067 0.073 0.042 0.049 0.059 0.03 0.034 0.046 0.02 0.021 0.033 (0.02) (0.02) (0.017) (0.015) (0.015) (0.013) (0.01) (0.011) (0.01) (0.007) (0.007) (0.006) 5 NLFS(power), HC 0.133 0.048 0.141 0.098 0.036 0.097 0.066 0.026 0.072 0.042 0.017 0.043 (0.016) (0.02) (0.026) (0.021) (0.015) (0.019) (0.013) (0.011) (0.014) (0.007) (0.007) (0.008) 6 NLFS(Hill+power), HC 0.061 0.06 0.068 0.045 0.045 0.053 0.032 0.032 0.041 0.02 0.02 0.03 (0.02) (0.019) (0.019) (0.015) (0.015) (0.014) (0.01) (0.011) (0.01) (0.007) (0.007) (0.006) 7 param(Hill) 0.063 0.054 0.082 0.043 0.042 0.066 0.03 0.033 0.058 0.019 0.024 0.053 (0.021) (0.018) (0.016) (0.016) (0.013) (0.011) (0.011) (0.009) (0.006) (0.007) (0.005) (0.003) 8 param(power) 0.161 0.043 0.188 0.157 0.032 0.184 0.156 0.024 0.183 0.155 0.016 0.183 (0.012) (0.019) (0.012) (0.008) (0.014) (0.009) (0.005) (0.01) (0.006) (0.003) (0.007) (0.004) 9 bspline 0.112 0.107 0.111 0.088 0.084 0.086 0.068 0.067 0.067 0.047 0.048 0.046 (0.02) (0.022) (0.02) (0.014) (0.014) (0.014) (0.01) (0.01) (0.01) (0.006) (0.006) (0.006) 10 pspline 0.081 0.063 0.08 0.061 0.048 0.06 0.045 0.037 0.045 0.03 0.026 0.03 (0.019) (0.017) (0.019) (0.014) (0.013) (0.013) (0.01) (0.009) (0.01) (0.006) (0.006) (0.006) 11 param(Hill)+bspline, HC 0.072 0.06 0.081 0.05 0.045 0.061 0.034 0.034 0.047 0.022 0.023 0.033 (0.021) (0.02) (0.019) (0.016) (0.014) (0.014) (0.011) (0.01) (0.009) (0.007) (0.006) (0.006) 12 param(power)+bspline, HC 0.097 0.053 0.103 0.076 0.039 0.081 0.058 0.028 0.061 0.04 0.019 0.042 (0.019) (0.02) (0.019) (0.014) (0.015) (0.014) (0.01) (0.01) (0.01) (0.006) (0.007) (0.006) D COMPUTATION 338 B Figures 339 C Proofs 340 Lemma 1: Pθ does not depend on linear parameters. 341 Let h(x, θ) = θ1+θ2q(x, θ3) be a twice differentiable function with q non-linear in θ3. W.L.o.G. 342 assume that θ3 ∈ R. Then   1 0 0∂h(x, θ) ∂q(x, θ )  Ḣ 3 θ = = (1n q︸(x︷,︷θ3︸) θ2 ) = (︸1n ︷c︷1 c2∂θ ︸ ∂︷θ︷3 ︸ ︸ )0 1 0 :=c1 :=H ∈Rn×3 :=c 12 0 0 θ2︸ ︷︷ ︸ H2∈R3×3 343 and ⊤ ⊤ P = Ḣ (Ḣ Ḣ )−1θ θ θ θ Ḣθ = H H (H H⊤1 2 2 1 H1H ) −1 2 H2H ⊤ 1 = H −11H2H2 (H ⊤H )−1H−1 ⊤1 1 2 H2H1 = H1(H ⊤ −1 ⊤ 1 H1) H1 344 and H1 does not depend on θ2. 345 D Computation 346 The code to reproduce results is available at 347 https://gitlab.tu-dortmund.de/functional shrinkage/nonlinear shrinkage. 348 The non-linear functional shrinkage (NLFS) approach for the Hill model is implemented using 349 a combination of Gibbs-, Metropolis-Hastings- and Slice sampling [Brooks et al., 2011, Neal, 350 2003]. We separately model the function intercept θ1. 351 Given the likelihood and priors Y ∼ N(θ11n +Φβ, σ2In) σ2 ∼ IG(aσ, bσ), θ1 ∼ N(µ 2θ1 , σθ )1 β ∼ N(0, σ2τ2(Φ⊤(I − P )Φ)−1θ ) θ3 ∼ N+(µθ3 , σ2θ ), θ4 ∼ LN(µθ4 , σ23 θ )4 ω = 1/(1 + τ2) ∼ Beta(aω, bω), aω = 0.5, bω = exp(− log(n)/2), 20 D COMPUTATION α σ θ θ ω β Figure 4: Traceplots of the NLFS(Hill) example fit in Figure 2, pane (a), correct subspace speci- fication. The first 2000 samples were discarded as burn-in. Due to the correct subspace speci- fication, there is strong shrinkage (ω = (1 + τ2)−1 close to 1). The effective sample size (ESS) was calculated based on the 10000 draws after discarding the first 2000 burn-in draws using the coda R-package [Plummer et al., 2006]. 21 D COMPUTATION • • • • • Figure 5: Traceplots of the NLFS(power) example fit in Figure 2, pane (b), subspace misspecifi- cation. The first 2000 samples were discarded as burn-in. Due to the subspace misspecification, there is little shrinkage (ω = (1+ τ2)−1 close to 0). Further, power exponent θ3 is stuck at 0 for a few thousand draws, and θ1, the intercept, seems highly correlated. Due to the misspecification and effectively no shrinkage, single parameters are not well identifiable and the whole response must be viewed to inspect convergence. The resulting response mixes well (bottom row). The effective sample size (ESS) was calculated based on the 10000 draws after discarding the first 2000 burn-in draws using the coda R-package [Plummer et al., 2006]. 22 D COMPUTATION Algorithm 1 Non-linear functional shrinkage (NLFS) 1: Initialize: β(1), σ2(1), τ (1), ω(1), θ(1) 2: for i : 2 → B do 3: Calculate Ḣθ(i−1) 4: Sample β(i) ∼ p(β|·) ▷ Conjugate 5: Sample (i)θ1 ∼ p(θ1|·) ▷ Conjugate 6: Sample σ2(i) ∼ p(σ2|·) ▷ Conjugate 7: Sample ω(i) ∼ p(ω|·) ▷ Slice Sampler 8: Sample Non-linear θ(i) ∼ p(θ|·) ▷ MH Sampler 9: end for 10: return All samples 352 the sampler is summarized in Algorithm 1. 353 Update β 354 We update β using the full conditional posterior β|σ2, τ2, θ, y ∼ N(µ′ ,Σ′β β) (14) 355 where Σ′β = σ 2(Φ⊤Φ+ τ2Φ⊤(In − P )Φ)−1θ and µ′β = σ−2Σ′ ′βΦ ỹ and ỹ = y − 1nθ1. 356 Update θ1 357 358 The intercept θ1 is updated using θ |β, σ2, y ∼ N(µ′ 2 ′1 θ , σθ ) (15)1 1 359 where ′ ′σ2 = (σ2σ2θ θ )(nσ 2 θ + σ 2) and µ′θ = σ −2σ2 ⊤θ 1n (y − Φβ) + µθ1/σ21 1 1 1 1 θ .1 360 Update σ2 361 The noise variance σ2 is updated by σ2|β, τ2, θ ∼ IG(a′ , b′σ σ) (16) 362 where a′σ = (n+ k)/2 + a and b ′ σ σ = 0.5(RSS+ τ −2β⊤(Φ⊤(In − Pθ)Φ)β) + bσ where RSS = 363 ||y − (θ 1 + Φβ)||21 n 2 is the residual sum of squares and ||.||2 is the Euclidean norm. 364 Update τ2 We update τ2365 using a slice sampler [Neal, 2003] considering the posterior log likelihood g(τ) = log(p(τ |β, σ2, θ)) =(−k/2 + bω − 0.5) log(τ2) (17) + (−a 2ω − bω) log(1 + τ ) (18) ( ) − 1+ β⊤Φ⊤(In − P −22 θ)Φβ τ . (19)2σ 23 D COMPUTATION 366 Note that we use −k/2 and not −(k − d0)/2 as in Shin et al. [2020] where d0 is the rank of the 367 (linear) projection matrix. We omit −d0 as for the non-linear approach, there are no linear bases 368 in Ḣθ that are in Φ (because there is no intercept in Φ) and hence the prior covariance matrix of 369 β is of full rank k. For a current τ0, calculate v = g(τ0)). Uniformly draw z̃ ∼ U(0, exp(v)). For 370 z = log(z̃) define the slice Sz = {x : g(x) < g(z)} and sample the next τ1 uniformly from Sz. 371 For computational ease, we restrict τ2 to [0.001, 10]. 372 The above sampling for the general ω ∼ Beta(aω, bω) prior was primarily featured and labelled 373 ’own slice’ (OC) in Table 3. We also considered a standard horseshoe (HS) prior (a = b = 0.5). 374 Details on its implementation are in Makalic and Schmidt [2015]. 375 Update non-linear θ 376 One only has to update the non-linear parameters of θ, as Pθ only depends on the non-linear pa- 377 rameters. For the Hill model, the non-linear parameters are θ3 and θ4. We assume independence 378 and separately update θ3 and θ4 using a Metropolis-Hastings sampler [Brooks et al., 2011] and 379 explain the sampling for θ3. 380 Perform the three sampling steps 1. Draw a candidate (1)381 θ3 from a proposal distribution ppropusing (0) θ3 382 2. Calculate the hastings ratio (1) (0) (1) p(θ HR = 3 |·)pprop(θ3 |θ3 ) . (0) (1) (0) p(θ3 |·)pprop(θ3 |θ3 ) 383 3. Draw u ∼ Unif (1)[0, 1]. If HR > u, accept θ3 as new draw. Otherwise, reject and consider (0) 384 θ3 as new draw. 385 For step (1), sample a new candidate (1)θ3 from a proposal distribution, e.g. (0) N 2+(θ3 , σprop) 386 where σ2prop might be calculated as, e.g. the empirical variance of the latest 100 draws of θ3, or simply as σ2 = σ2387 prop θ . To sample x from a truncated normal distribution with positive support,3 388 x ∼ N+(µ, σ2), calculate l = P (X < 0) where X ∼ N(µ, σ2). Sample u ∼ Unif[l, 1] and 389 calculate x = qµ,σ2(u), the corresponding quantile. 390 For step (2), consider log(HR) for computational stability: HR (1)|· − (0)|· (0)| (1) − (1) (0)log( ) = log(p(θ3 )) log(p(θ3 )) + log(pprop(θ3 θ3 )) log(pprop(θ3 |θ3 )). 391 For the fully conditional log posterior log(p(θ|·)), we can integrate out β to reduce the autocorre- 392 lation in the sampling. Since p(θ|·) ∝ p(y|θ, σ2, τ2)p(θ) and y|θ, σ2, τ2 ∼ N(E(Φβ + ε),Cov(Φβ + ε)), 393 we use y|θ, σ2, τ2 ∼ N(0,Σy) with Σ = σ2τ2y Φ(Φ⊤(I − Pθ)Φ)−1Φ⊤ + σ2In. 24 REFERENCES REFERENCES 394 References 395 M. A. Alvarez, D. Luengo, and N. D. Lawrence. Linear latent force models using gaussian pro- 396 cesses. IEEE transactions on pattern analysis and machine intelligence, 35(11):2693–2705, 397 2013. 398 A. Brezger and W. J. Steiner. Monotonic regression based on bayesian p–splines. Journal 399 of Business & Economic Statistics, 26(1):90–104, 2008. doi: 10.1198/073500107000000223. 400 URL https://doi.org/10.1198/073500107000000223. 401 S. Brooks, A. Gelman, G. Jones, and X.-L. Meng. Handbook of markov chain monte carlo. CRC 402 press, 2011. 403 D. Carl. A practical guide to splines. Springer, 2001. 404 C. M. Carvalho, N. G. Polson, and J. G. Scott. The horseshoe estimator for sparse signals. 405 Biometrika, 97(2):465–480, 2010. 406 J. Chen, Z. Chen, C. Zhang, and C. Jeff Wu. Apik: Active physics-informed kriging model with 407 partial differential equations. SIAM/ASA Journal on Uncertainty Quantification, 10(1):481–506, 408 2022. 409 A. Golightly and D. J. Wilkinson. Bayesian parameter inference for stochastic biochemical network 410 models using particle markov chain monte carlo. Interface focus, 1(6):807–820, 2011. 411 S. Goutelle, M. Maurin, F. Rougier, X. Barbaut, L. Bourguignon, M. Ducher, and P. Maire. The 412 hill equation: a review of its capabilities in pharmacological modelling. Fundamental & clinical 413 pharmacology, 22(6):633–648, 2008. 414 L. H. Gunn and D. B. Dunson. A transformation approach for incorporating monotone or unimodal 415 constraints. Biostatistics, 6(3):434–449, 2005. 416 A. V. Hill. The possible effects of the aggregation of the molecules of hemoglobin on its dissocia- 417 tion curves. The Journal of Physiology, 40:iv–vii, 1910. 418 A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. 419 Technometrics, 12(1):55–67, 1970. 420 Y. Huang, D. Liu, and H. Wu. Hierarchical bayesian methods for estimation of parameters in a 421 longitudinal hiv dynamic system. Biometrics, 62(2):413–423, 2006. 25 REFERENCES REFERENCES 422 T. W. Kelsey, L. Q. Li, R. T. Mitchell, A. Whelan, R. A. Anderson, and W. H. B. Wallace. A 423 validated age-related normative model for male total testosterone shows increasing variance 424 but no decline after age 40 years. PloS one, 9(10):e109346, 2014. 425 C. Köllmann, B. Bornkamp, and K. Ickstadt. Unimodal regression using bernstein–schoenberg 426 splines and penalties. Biometrics, 70(4):783–793, 2014. 427 S. Lang and A. Brezger. Bayesian p-splines. Journal of computational and graphical statistics, 428 13(1):183–212, 2004. 429 D. J. Lunn, N. Best, A. Thomas, J. Wakefield, and D. Spiegelhalter. Bayesian analysis of popula- 430 tion pk/pd models: general concepts and software. Journal of pharmacokinetics and pharma- 431 codynamics, 29:271–307, 2002. 432 E. Makalic and D. F. Schmidt. A simple sampler for the horseshoe estimator. IEEE Signal Pro- 433 cessing Letters, 23(1):179–182, 2015. 434 D. Mesarovic, Mihajlo, S. Sreenath, and J. Keene. Search for organising principles: understand- 435 ing in systems biology. Systems biology, 1(1):19–27, 2004. 436 M. C. Meyer. Inference using shape-restricted regression splines. The Annals of Applied 437 Statistics, 2(3):1013–1033, 2008. ISSN 19326157. URL http://www.jstor.org/stable/ 438 30245118. 439 M. C. Meyer, A. J. Hackstadt, and J. A. Hoeting. Bayesian estimation and inference for gener- 440 alised partial linear models using shape-restricted splines. Journal of Nonparametric Statistics, 441 23(4):867–884, 2011. 442 T. P. Morris, I. R. White, and M. J. Crowther. Using simulation studies to evaluate statistical 443 methods. Statistics in medicine, 38(11):2074–2102, 2019. 444 R. M. Neal. Slice sampling. The annals of statistics, 31(3):705–767, 2003. 445 M. Plummer, N. Best, K. Cowles, and K. Vines. Coda: Convergence diagnosis and output 446 analysis for MCMC. R News, 6(1):7–11, 2006. URL https://journal.r-project.org/ 447 archive/. 448 G. A. Seber and C. J. Wild. Nonlinear regression. hoboken. New Jersey: John Wiley & Sons, 62 449 (63):1238, 2003. 450 M. Shin, A. Bhattacharya, and V. E. Johnson. Functional horseshoe priors for subspace shrink- 451 age. Journal of the American Statistical Association, 115(532):1784–1797, 2020. 26 REFERENCES REFERENCES 452 T. S. Shively, T. W. Sager, and S. G. Walker. A bayesian approach to non-parametric monotone 453 function estimation. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 454 71(1):159–175, 2009. 455 T. S. Shively, S. G. Walker, and P. Damien. Nonparametric function estimation subject to mono- 456 tonicity, convexity and other shape constraints. Journal of Econometrics, 161(2):166–181, 457 2011. 458 P. Šimon. Considerations on the single-step kinetics approximation. Journal of Thermal Analysis 459 and Calorimetry, 82(3):651–657, 2005. 460 R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical 461 Society: Series B (Methodological), 58(1):267–288, 1996. 462 M. K. Titsias, A. Honkela, N. D. Lawrence, and M. Rattray. Identifying targets of multiple co- 463 regulating transcription factors from expression time-series by bayesian model comparison. 464 BMC systems biology, 6:1–21, 2012. 465 M. W. Wheeler, D. B. Dunson, S. P. Pandalai, B. A. Baker, and A. H. Herring. Mechanistic 466 hierarchical gaussian processes. Journal of the American Statistical Association, 109(507): 467 894–904, 2014. 468 M. W. Wheeler, D. B. Dunson, and A. H. Herring. Bayesian local extremum splines. Biometrika, 469 104(4):939–952, 2017. 470 P. Wiemann and T. Kneib. Adaptive shrinkage of smooth functional effects towards a predefined 471 functional subspace. arXiv preprint arXiv:2101.05630, 2021. 27