Classic statistical and modern machine learning methods for

modeling and prediction of major tennis tournaments

by:

Nourah Buhamra

A thesis submitted in fulfilment of the requirements
for the Doctor of Statistic

of the

Department of Statistic,

TU Dortmund University

Dortmund, April 2025


Amtierender Dekan:

Prof. Dr. Philipp Doebler

Gutachter:

Prof. Dr. Andreas Groll (Technische Universität Dortmund)

Prof. Dr. Markus Pauly (Technische Universität Dortmund)

Tag der Prüfung:

11.06.2025


Abstract

We propose a comprehensive approach for predicting outcomes in Grand Slam tennis tour-
naments from 2011 to 2022, focusing on the probability that the first-named player would
win. Our study includes both regression and machine learning models, evaluated using
cross validation and external validation strategies through performance measures such as
classification rate, predictive likelihood, and Brier score. Some specific aspects are exam-
ined in greater detail: non-linear effects, court-specific abilities and global player abilities.
Additionally, we introduce the concept of “newcomer” players to estimate ability effects for
individuals participating in only one tournament. Among regression models, spline-based
approaches with P-splines yielded positive betting returns while placing fewer bets, sug-
gesting a more conservative and potentially safer approach, confirmed as top performers in
2022 validation, while LASSO models were less effective. Notably, all models outperformed
the benchmark model.

For the machine learning (ML) models, we investigate classification trees (CT), random
forests (RF), extreme gradient boosting (XGBoost), support vector machines (SVM), and ar-
tificial neural networks (ANNs) to model and predict match outcomes over the same period.
These models were compared with regression approaches using the same predefined valida-
tion strategies. However, within the framework of machine learning, a sequential validation
technique, namely the rolling window strategy, was employed to assess model stability and
accuracy. The findings highlighted that, among the different validation strategies consid-
ered, XGBoost achieved the highest classification rate, linear SVM was the most effective
in predictive likelihood, and spline models with Points and Prob covariates consistently
delivered the lowest Brier scores, emphasizing the strength of these methods across various
validation techniques.

Furthermore, we apply the concept of statistically enhanced learning to improve predic-
tive performance in our models. We incorporate covariates derived from separate modeling
approaches, such as the Elo rating, along with covariates based on meaningful mathematical
transformations, including two Age-related variables. Evaluating the predictive power of
these statistically enhanced covariates involves assessing their contribution to model accu-
racy, often resulting in improved performance compared to models that rely solely on classic
covariates. By integrating advanced statistical techniques, we can enhance the model’s abil-
ity to generalize to new data. This approach has been successfully applied in football, where
team ability parameters derived from separate statistical models have improved predictive

2


performance.
To enhance interpretability, we employed interpretable machine learning (IML) tools such

as partial dependence plots (PDPs) and individual conditional expectation (ICE) plots, par-
ticularly for models like RF. These visualization tools, combined with feature engineering,
allow us to assess the individual and combined impact of key variables, such as Age and Elo
ratings, on prediction accuracy. The results confirm that enhanced variables contribute posi-
tively to model performance and provide deeper insights into predictors of match outcomes
in sports analytics.

3


Acknowledgments

They say life is a river, and we must learn to go with the flow and embrace its surprises.
Learning to go with the flow would not have been possible without the support of the two
most important people in my life. My deepest gratitude goes to my wonderful mother and
my soulmate, my incredible husband. Thank you both for being my pillars of strength and
encouragement. I am truly grateful for everything you have done to help me succeed.

I extend my heartfelt thanks to my beloved children, whose love has always been a source
of strength and inspiration.

I am also profoundly grateful to my supervisor Professor Andreas Groll for guiding me
throughout my studies. I deeply appreciate his trust, mentorship, and the opportunity to be
one of his students. Thank you for your valuable feedback, for dedicating your time, and
for handling essential matters, including the visa process and follow-ups with the Kuwait
embassy, which I realize may have been inconvenient. I would also like to sincerely thank
Prof. Dr. Markus Pauly for dedicating a substantial amount of time to carefully reading
my thesis and providing valuable feedback in the report. I truly appreciate your support
throughout this process.

Many thanks also go to my colleagues, Hendrik van der Wurp and Guillermo Briseno.
Special thanks to my co-authors, Susanne Brunner and Alexander Gerharz, for their assis-
tance in proofreading my papers and their help with R code. I am deeply thankful for all
your help.

This work would not have been possible—or even completed—without the support and
confidence of everyone mentioned here. Finally, I am proud to have accomplished this, and
if given the chance, I would choose to relive this journey all over again. Thank you all for
being a part of this journey.

“The unexamined life is not worth living”

— Socrates

4


List of main publications

This cumulative thesis is based on the following three manuscripts:

Article 1: Buhamra, N., Groll, A., and Brunner, S. (2024). Modeling and prediction of tennis
matches at grand slam tournaments. Journal of Sports Analytics, 10(1):17–33.

Contribution of the author:

The author of this thesis played a leading role in the preparation and organization of
the manuscript. She was also responsible for implementing the simulation studies and
conducting the analysis of the data. S. Brunner was involved in data preparation and
preprocessing. Moreover, the co-authors contributed through scientific discussions,
feedback, and revision support during the development of the manuscript.

Article 2: Buhamra, N., Groll, A., and Gerharz, A. (2025). Comparing modern machine learning
approaches and different forecasting strategies for modeling tennis matches at grand
slam tournaments. Journal of Sports Analytics. to appear.

Contribution of the author:

The author of this thesis was responsible for the conceptualization, structuring, and
writing of the manuscript. She carried out the simulation studies and performed
the statistical analyses of the studies. The literature review, methodology selection,
and interpretation of the results were also primarily handled by the author. Input
from collaborator and supervisor was incorporated during the writing. In particu-
lar, A. Gerharz also provided continuous support throughout the development of the
manuscript, including advise during the simulation studies, the revision process, and
other technical tasks.

Article 3: Buhamra, N. and Groll, A. (2025). Statistical enhanced learning for modeling and pre-
diction tennis matches at grand slam tournaments. arXiv preprint arXiv:2502.01613.

5


Contribution of the author:

While this manuscript benefited from the guidance of supervisor, the author took a
leading role in the development of its structure and content. She was mainly responsi-
ble for conducting simulations, analyzing datasets, and drafting the text. Collaborative
discussions helped shape the direction of the study, but the implementation and inter-
pretation were carried out independently by the author. The final manuscript reflects
her substantial personal contribution to the research.

6


Contents

Abstract 2

Acknowledgments 3

List of main publications 5

Abbreviation 9

I Summary of thesis work 10

1 Introduction 12

2 Statistical methods 20
2.1 Logistic regression model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Extension for the regression models . . . . . . . . . . . . . . . . . . . . . . . . 22

2.2.1 Least absolute shrinkage and selection operator (LASSO) . . . . . . . . 22
2.2.2 Non-parametric models (splines) . . . . . . . . . . . . . . . . . . . . . . 26

2.3 Machine learning (ML) approaches . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3.1 Decision trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3.2 Random forest (RF) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3.3 Extreme gradient boosting (XGBoost) . . . . . . . . . . . . . . . . . . . 35
2.3.4 Support vector machine (SVM) . . . . . . . . . . . . . . . . . . . . . . . 37
2.3.5 Artificial neural networks (ANNs) . . . . . . . . . . . . . . . . . . . . . 41

2.4 Interpretable machine learning (IML) . . . . . . . . . . . . . . . . . . . . . . . 45
2.4.1 Partial dependence plot (PDP) . . . . . . . . . . . . . . . . . . . . . . . 46
2.4.2 Individual conditional expectation (ICE) . . . . . . . . . . . . . . . . . 47

3 Summary of the articles 49
3.1 Article 1: Modeling and prediction of tennis matches at Grand Slam tourna-

ments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Article 2: Comparing modern machine learning approaches and different

forecast strategies on Grand Slam tennis tournaments . . . . . . . . . . . . . . 52

7


3.3 Article 3: Statistical enhanced learning for modeling and prediction tennis
matches at Grand Slam tournaments . . . . . . . . . . . . . . . . . . . . . . . . 55

4 Discussion and outlook 58
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

II Publications 70

8


Abbreviations

ANNs Artificial neural networks

CART Classification and regression tree

cp Complexity parameter

CT Classification tree

CV Cross validation

c-ICE Centered ICE plot

d-ICE Derivative ICE plot

GAM Generalized additive model

ICE Individual conditional expectation

IML Interpretable machine learning

LASSO Least absolute shrinkage and selection operator

ML Machine learning

MLP Multilayer perceptron

P-splines Penalizes splines

PDP Partial dependence plot

RBF Radial basis function kernel

RF Random forest

RSS Residual sum of squares

SEL Statistical enhanced learning

SVM Support vector machine, a machine learning method

XGBoost Extreme gradient boosting

9


Part I

Summary of thesis work

10


11


Chapter 1

Introduction

In recent years, sports data analysis has gained significant attention and importance in both
professional and amateur athletics. With advancements in data collection technologies and
the increasing availability of sophisticated analytical tools, sports organizations and teams
have turned to data analytics to gain a competitive edge. These sports data cover various
aspects, such as tracking player movements, analyzing game strategies, monitoring physical
conditions, and even predicting injury risks. The impact of data analytics is particularly
evident in professional leagues and competitions, where every decision can influence the
outcome of a match or season. A notable example is tennis, which is undoubtedly one of
the most popular sports globally, played by millions of players, with its most prestigious
tournaments attracting significant international attention. This widespread popularity, com-
bined with the potential for profit in betting markets, makes tennis an appealing subject for
research.

In tennis, data is gathered from multiple sources, analyzed to identify patterns, to op-
timize player performance, and to develop winning strategies. By using such data, both
players and coaches can assess opponents’ strengths and weaknesses, refine their training
programs, and make informed decisions during a match.

In this cumulative dissertation, a summary of all my publications in this field (sports
analytics on tennis) is presented. Since it is a cumulative dissertation, only concise overviews
of the included articles are provided. For exhaustive results and further details, I recommend
reading the respective articles themselves which can be found in Part II.

First, we present a statistical approach for modeling and predicting the outcomes of Grand
Slam tennis tournaments using regression models. While linear regression is commonly
employed for this purpose, it may not be suitable when the relationships between variables
are nonlinear or complex. In such cases, non-linear methods, such as splines or decision
trees, can be used to develop reliable models without necessarily providing explicit insights
into the relationships between the input variables and the target outcomes.

In addition to selecting appropriate models, which is crucial for accurately predicting
match results, we also evaluate the proposed models from a betting perspective by assessing
the gains or losses of the centrality-based specification.

12


CHAPTER 1. INTRODUCTION 13

Furthermore, nowadays, machine learning (ML) has shown promising results in sports
prediction. Therefore, we propose several machine learning approaches to accurately predict
tennis match outcomes. Within this context, the dissertation also explores how interpretable
machine learning (IML) tools can be applied to sports analytics, focusing on the ability to
understand and interpret the predictions made by machine learning models. IML empha-
sizes building models that not only make accurate predictions but also provide explanations
or justifications for their decisions.

Finally, we provide examples of how statistical techniques are applied to tennis matches
at Grand Slam tournaments, including feature engineering to enhance predictive modeling
and derive insights from tennis data. Thus, the aim of this work is to investigate which
model approaches yield the most accurate and interpretable results.

State of the art in tennis analytics

Nowadays, there is growing interest in predicting sports match outcomes using statistical and
machine learning methods. Recent studies highlight the increasing application of machine
learning techniques in tennis analytics, aiming to enhance predictive accuracy and to better
understand match dynamics.

In Bunker et al. (2024b), a comparative evaluation of Elo ratings and machine learning-
based methods for tennis match result prediction is presented. The authors evaluate five
ML models: Alternating Decision Trees (ADTrees), Logistic Regression, Support Vector Ma-
chines (SVMs), Random Forests (RFs), and Artificial Neural Networks (ANNs). The findings
indicate that ADTrees and Logistic Regression achieve higher accuracies than Elo ratings and
comparable accuracies to predictions derived from betting odds. Notably, ADTrees provide
an interpretable decision tree model that accommodates variations in the average betting
odds difference threshold, highlighting their potential in this domain. The study suggests
that ADTrees, with their balance of accuracy and interpretability, could be particularly ben-
eficial for applications requiring both high predictive performance and transparency. Lv
et al. (2024) explore the impact of momentum in tennis matches, with a particular focus on
the 2023 Wimbledon men’s final between Carlos Alcaraz and Novak Djokovic. The authors
developed a Composite Bayesian Regression Framework (CBRF), integrating CatBoost and
Random Forest regression models to predict momentum shifts throughout the match. Their
analysis revealed significant non-zero autocorrelation coefficients, indicating a strong corre-
lation between momentum and match success. Additionally, the study proposed winning
strategies based on key influencing factors and conducted predictive analyses for six specific
time intervals. The models demonstrated strong generalization capabilities when applied to
women’s matches, different tournaments, and various playing surfaces.

Candila and Palazzo (2020) investigated the application of artificial neural networks
(ANNs) to predict the outcomes of men’s tennis matches and assess the profitability of
various betting strategies based on these predictions. The study not only demonstrates
ANN’s superior predictive performance compared to traditional models but also highlights


CHAPTER 1. INTRODUCTION 14

its practical applicability in developing profitable betting strategies. Their work contributes
to the broader field of sports analytics by illustrating the advantages of integrating advanced
machine learning techniques into predictive modeling and decision-making processes.

LeCun et al. (2015) provided a comprehensive overview of deep learning, particularly
focusing on artificial neural networks (ANNs) and their advancements. Their work high-
lighted how deep learning techniques, such as convolutional neural networks (CNNs) and
recurrent neural networks (RNNs), have revolutionized various fields, including computer
vision, speech recognition, and natural language processing. They discussed the underlying
principles of neural networks, optimization techniques, and the role of large-scale datasets
and computational power in improving model performance. The paper also explored prac-
tical applications and the potential future developments of deep learning in artificial intel-
ligence. Additionally, Hastie et al. (2009) and James et al. (2013) provide comprehensive
introductions to statistical learning methods, including ANNs.

Another study highlighting the growing application of ANNs and deep learning in
tennis analytics is Almarashi et al. (2024). This study evaluated the effectiveness of the
Neural Network Auto-Regressive (NNAR) model in forecasting tennis players’ performance
indicators, such as wining probability, aces, double faults, and game dominance. The
researchers compared the NNAR model with traditional linear stochastic time series models,
finding that the NNAR model outperformed others based on metrics like Root Mean Square
Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE).
The study emphasized the importance of considering time-dependent performance measures
over solely relying on official rankings, suggesting that players can utilize these insights to
enhance future performance.

The ANNs are also used for injury forecasting, strategy optimization, and even real-
time decision-making during matches, helping coaches and analysts gain deeper insights
into the game. For Example, Sun et al. (2023) employs a deep neural network leveraging
Long-Short-Term Memory (LSTM) cells to predict tennis match outcomes based on various
statistics. De Seranno (2020) utilizes neural networks to forecast the probability of winning
in tennis matches, incorporating a vast number of input variables to handle non-linear re-
lationships effectively. Sampaio et al. (2024) explore the role of machine learning in their
research, including neural networks, to enhance tennis performance through psychological
state monitoring, talent identification, match outcome prediction, spatial and tactical anal-
ysis, as well as injury prevention. Fernando et al. (2019) presents a novel framework for
predicting shot location and type in tennis, incorporating neural memory modules to model
the episodic and semantic memory components of a tennis player.

In Hovad et al. (2024), the authors explored the application of deep learning techniques
to classify tennis actions from video data. Utilizing the Slow Fast deep learning architecture,
they trained three models of varying sizes on the THETIS tennis dataset. The best-performing
model achieved a generalization accuracy of 74%, demonstrating the potential of deep
learning in accurately identifying specific tennis actions. The study also conducted an error


CHAPTER 1. INTRODUCTION 15

analysis and discussed limitations in current publicly available tennis datasets, providing
directions for future improvements in tennis action classification.

Zhang et al. (2024) tackles the challenge of assessing and predicting tennis players’ scoring
abilities using in-match data. The authors present a comprehensive approach that includes
modeling real-time scoring capabilities, establishing a predictive framework, and enhancing
model applicability. Using data from the 2023 Wimbledon men’s singles final, they apply
the Entropy Weighting Method (EWM) and Analytic Hierarchy Process (AHP) to calculate
weights, followed by the Technique for Order Preference by Similarity to Ideal Solution
(TOPSIS) to evaluate player performance. To improve real-time scoring assessment, they
introduce a Principal Component Analysis Stepwise Regression Model, which combines
Principal Component Analysis (PCA) with stepwise linear regression. The model accurately
captures players’ performance during matches. Additionally, the authors propose a scoring
ability migration model based on time series concepts to predict match trends, achieving
a prediction accuracy of up to 95.27% in women’s tennis matches. This research provides
valuable insights for future advancements in tennis match outcome prediction.

The literature overview in the remainder of this section is mainly based on Buhamra et al.
(2024) and Buhamra et al. (2025). In recent years, various approaches have also been proposed
for the statistical modeling of tennis matches and tournaments, leading to advancements
in existing methods for predicting match-winning probabilities. Once individual match
outcomes can be predicted, it becomes possible to estimate the winning probabilities for an
entire tournament. For example, Clarke and Dyte (2000) utilized the official Association
of Tennis Professionals (ATP) rankings to estimate a player’s probability of winning using
logistic regression. Arcagni et al. (2022) expanded the methodology for rating calculations to
determine match-winning probabilities. Their approach incorporated a centrality measure,
allowing player ratings to dynamically adjust with each new match. These updated ratings
were then used as covariates in a simple logit model. Klaassen and Magnus (2003) utilized
a large live dataset from Wimbledon to make predictions during the tournament, making
their approach particularly relevant for betting markets. Easton and Uylangco (2010) built
upon Klaassen and Magnus’ model, comparing it with bookmakers’ odds on a point-by-
point basis. Their findings confirmed that bookmakers’ odds serve as strong predictors of
match outcomes for both men’s and women’s tennis. Gu and Saaty (2019) analyzed Grand
Slam, ATP, and the Women’s Tennis Association (WTA) matches, combining statistical data
with subjective judgments. Their study identified and systematically prioritized multiple
factors—both objectively and subjectively—to enhance predictive accuracy. McHale and
Morton (2011) introduced a Bradley-Terry-type model to forecast the outcomes of top-tier
ATP and WTA competitions. Their approach incorporated surface types (hard court, carpet,
clay, and grass) and demonstrated that including match scores, play data, and surface
influence improved prediction accuracy compared to ranking-based models. However, they
did not account for dependencies between factors.

Ma et al. (2013) applied a logistic regression model to predict match outcomes (win/loss)


CHAPTER 1. INTRODUCTION 16

based on 16 variables, including player skills, performance metrics, player characteristics,
and match conditions. More recently, Yue et al. (2022) proposed a statistical framework
for predicting Grand Slam match outcomes, incorporating exploratory data analysis. Their
approach introduced new variables using the Glicko rating model, a Bayesian method com-
monly used in professional chess, to refine prediction accuracy. Del Corral and Prieto-
Rodríguez (2010) estimated separate probit models for men and women, using Grand Slam
tennis match data from 2005 to 2008. The explanatory variables were categorized into three
groups: a player’s past performance, physical characteristics, and match characteristics. The
accuracy of the models was assessed both in-sample and out-of-sample by computing Brier
scores, comparing predicted probabilities with actual outcomes from 2005 to 2008 and the
2009 Australian Open. Additionally, they employed bootstrapping techniques to evaluate
the out-of-sample Brier scores for the 2005–2008 data.

Somboonphokkaphan et al. (2009) proposed a method to predict the winner of tennis
matches by utilizing both match statistics and environmental data, based on a Multi-Layer
Perceptron (MLP) equipped with a backpropagation learning algorithm. MLP, a type of
ANN, is an effective technique for solving real-world classification problems, especially
when relying on large datasets and capable of handling incomplete or noisy data. Whiteside
et al. (2017) developed an automated stroke classification system to quantify the hitting load
in tennis, employing machine learning models such as a cubic kernel support vector machine.
Additionally, Wilkens (2021) focused on machine learning approaches and expanded upon
previous research by applying a broad range of machine learning techniques. He used
various models, including neural networks and random forests, along with one of the most
extensive datasets in professional men’s and women’s tennis singles matches. The author
also demonstrated that the average prediction accuracy could not exceed approximately
70%.

As well, Gao and Kowalczyk (2021) developed a model that predicts tennis match out-
comes with an accuracy exceeding 80%, significantly large than predictions based solely on
betting odds. The model identified serve strength as a key predictor of match outcomes.
Using a RF classifier, the study emphasizes the importance and effectiveness of simple mod-
els in the era of deep learning. Moreover, the model incorporates a wide range of features,
capturing physical, psychological, court-related, and match-related variables, all compiled
and processed from ATP data spanning 2000 to 2016. Additionally, in past few years, there
has been an increasing interest in feature engineering, as it plays a critical role in boosting
the performance of machine learning models by identifying and capturing relevant patterns
and relationships within the data. As an example, Felice et al. (2023) introduce Statisti-
cally Enhanced Learning (SEL), a framework that formalizes existing feature engineering
and extraction techniques in machine learning. This approach aims to tackle challenges in
ML tasks by optimizing both feature selection and representation. Furthermore, numerous
studies have explored methods to understand and interpret complex (black box) ML mod-
els. For instance, Auret and Aldrich (2012) utilized variable importance measures, directly


CHAPTER 1. INTRODUCTION 17

derived from RF models, along with partial dependence plots. Their work demonstrated
that RF models can reliably identify the influence of individual variables, even when there
is significant additive noise.

Guidelines and descriptions for the thesis

The dissertation focuses on modeling and predicting tennis matches at Grand Slam tourna-
ments by covering various classical statistical and machine learning approaches. Within this
work, different strategies are applied, including leave-one-tournament-out cross validation,
expanding window 1, and rolling window. Additionally, it covers some other main aspects
such as statistical enhanced learning (SEL) and interpretable machine learning tools, includ-
ing Partial Dependency Plots (PDP; Friedman, 2001) and Individual Conditional Expectation
(ICE; Apley and Zhu, 2020).

Dataset

The data includes information on 5,013 matches from men’s Grand Slam tournaments which
took place between 2011 and 2022. Several potential covariates were included, such as the
players’ Age, ATP ranking and Points, Odds, Elo rating, and two additional Age variables.
These additional age variables were designed to capture the “optimal” age range for a tennis
player, typically between 28 and 32 years (Weston, 2014). Note that, in our models, we
will consider these covariates in the form of differences. For example, the difference in the
players’ ranking positions was calculated by subtracting the Rank of the second player from
that of the first player. Similarly, for the Age variable, the age of the second player was
subtracted from that of the first player, and this method was applied to all other covariates.
For a detailed explanation and definition of those covariates we refer to Buhamra et al. (2024).
For this analysis, we compiled a dataset using variables obtained from the R, package deuce
(Kovalchik, 2019).

Classical regression models for tennis matches

First, we present a statistical approach for predicting the tennis matches of Grand Slam tennis
tournaments using regression models. While linear regression is commonly employed for
this purpose, it may not be suitable when the relationships between variables are non-linear
or complex. In such case, splines and LASSO are utilized as an extension to the logistic model.
So, we examined whether the assumption of non-linear effects or the inclusion of surface-
and player-specific abilities is reasonable (i.e., we account for the incorporation of court-
specific abilities). Furthermore, we address how to handle players who have participated in

1This strategy is used in Buhamra et al. (2024) and Buhamra et al. (2025), however in Buhamra et al. (2024)
was referred to as a rolling window. Hence, it should not be confused with the rolling window approach from
Buhamra et al. (2025).


CHAPTER 1. INTRODUCTION 18

only one tournament by introducing the concept of the “newcomer”.
Along with selecting the most suitable models, we evaluate their performance from a

betting perspective by examining the potential gains or losses each model may yield.
In this part of our work, we aim to fully leverage the flexibility of modern regression

approaches, utilizing their high interpretability to uncover insights into specific associa-
tions and relationships in professional tennis. We introduce various regression methods
and compare them based on some performance measures. A detailed description of these
performance measures can be found in Buhamra et al. (2024).

Machine learning approaches

Next, we switch to a ML perspective and introduce those methods to demonstrate how
modern machine learning approaches can help improve predictive performance. However,
since these methods tend to be less interpretable, we also show how this can be addressed
using interpretable ML techniques such as PDP and ICE. We introduce several machine
learning approaches suitable for binary responses, such as classification trees (Breiman et al.,
1984), random forests (Breiman, 2001), extreme gradient boosting (Chen and Guestrin, 2016),
support vector machines (Cortes and Vapnik, 1995), and ANNs (Rosenblatt, 1958). SVMs
have the advantage of being more effective in high-dimensional spaces where the number of
features is very large. According to Sipko and Knottenbelt (2015), they often achieve better
accuracy compared to ANNs. However, the latter will still be investigated in this thesis.

Additionally, a comparison of these approaches with the regression models examined
in our previous work (Buhamra et al., 2024) is provided. Both the regression and machine
learning models are compared. Here, the same dataset and performance measures are used
as for the regression approaches in the previous paragraph. The evaluation is conducted
separately using the same leave-one-tournament-out cross validation strategy, as well as
an expanding window approach. However, this time, a rolling window approach is also
introduced (see Buhamra et al., 2025).

Statistical enhanced learning aspect

Finally, in this part, we aim to explore the relevance of specific enhanced variables for both
regression and machine learning approaches. To achieve this, we examine the predictive
potential of “statistically enhanced covariates”2 in more detail. For instance, in Buhamra
and Groll (2025), we examined enhanced variables derived from separate approaches, such
as the Elo rating, as well as those obtained through advanced transformations of original
variables. One example is the covariate Age.30, which is based on the idea proposed by
Weston (2014), emphasizing that the optimal age of tennis players falls between 28 and 32

2This concept is closely related to feature engineering, which involves selecting, transforming, or creating
input variables to improve a model’s performance. Feature engineering helps machine learning algorithms
extract meaningful patterns from raw data by structuring it in a way that enhances predictive accuracy. Effective
feature engineering can significantly impact model performance and enhance a model’s predictive accuracy.


CHAPTER 1. INTRODUCTION 19

years. Therefore, the middle of this interval (i.e., 30 years) was used as the reference age
in our analysis, which is how the name of this variable also originated. Additionally, we
considered more conventional features directly available on certain websites, such as the
players’ ATP world ranking. Then, all enhanced covariates were analyzed alongside these
conventional covariates. Our objective is to determine whether these enhanced covariates
can improve statistical learning approaches, particularly in terms of predictive performance.
The remainder of the thesis is structured as follows. Chapter 2 provides an overview of all
the statistical methods used in this work. Each article is summarized in Chapter 3, including
their introductions and conclusions. In Chapter 4, the overall conclusion of the thesis is
presented. Finally, Part II consists of the three articles that form the core of this thesis.


Chapter 2

Statistical methods

This chapter introduces the reader to the statistical methods used in our analysis. It be-
gins with a classical logistic regression approach for a binary outcome. We then extend
the regression model, particularly focusing on regularization techniques like Least Absolute
Shrinkage and Selection Operator (LASSO). Non-parametric models, such as splines, are also
introduced. Finally, we cover five machine learning approaches, emphasizing interpretabil-
ity through IML tools, which help illustrate and understand ’black box’ models, such as RF.
Note that some parts of this chapter are adapted from the main articles (Buhamra et al., 2024;
2025).

2.1 Logistic regression model

Although linear models are well-suited for regression analyses when the response variable
is continuous and may show an approximate normal distribution (conditional on the co-
variates), if necessary after an appropriate transformation, in many cases the response is
not continuous but binary, as is the case in our study. Generalized linear models (GLMs)
combine various regression methods for response variables that do not need to follow a
normal distribution, including the logit model for binary responses, which will be the focus
of our analysis.

For n individuals, let be given observations (yi, xi1, . . . , xip), i = 1, . . . ,n, of a binary target
variable y and covariates x1, . . . , xp. In the logistic regression model, the relationship between
y and metric, categorical or binary covariates is examined. Here, y = 1 denotes the occurrence
of a particular event (typically defined as “success”) and y = 0 that the event does not occur
(also defined as “failure”). Then,

πi = P(yi = 1|xi1, . . . , xip) = E[y|xi1, . . . , xip]

is the (conditional) probability for the occurrence of yi = 1, given the covariate values
xi1, . . . , xip. The aim is to model πi appropriately as a function of the feature variables.

The associated linear predictor is defined as

20


CHAPTER 2. STATISTICAL METHODS 21

ηi = xi
⊤β, (2.1)

with xi = (1, xi1, . . . , xip)⊤ and β = (β0, β1, . . . , βp)⊤. Since πi ∈ [0, 1] has to hold for all possible
choices of x, the linear model approach π̂i = yi cannot be chosen.

Therefore, the linear predictor ηi is related to the probabilityπi by a strictly monotonically
increasing function h : R→ [0, 1], i.e.

πi = h(ηi) = h(β0 + β1xi1 + · · · + βkxip) .

The function h(·) is also called the response function. With the help of the inverse function
g = h−1, we can also write ηi = g(πi). Equation (2.1) can be written as

ηi = g(πi).

The function g is called link function (see, e.g., Nelder and Wedderburn, 1972, or Fahrmeir
et al., 2013). In the logit model, the logistic response function

π = h(η) =
exp(η)

exp(1 + η)

is chosen and the corresponding logit link function is given by

g(π) = ln
(
π

1 − π

)
= η = β0 + β1x1 + . . . + βpxp.

Parameter estimation

The regression parameters of the logit model are typically estimated via maximum likelihood
(Fahrmeir et al., 2013). We write the given data as (yi,xi), i = 1, . . . ,n. Assuming that the
observations are stochastically independent, the likelihood function is given by

L(β) =
n∏

i=1

f (yi|β; xi).

Here, f (yi|β;xi) denotes the likelihood contributions of yi|xi, which depend on the un-
known parameter vector β via πi = E[yi|xi] = h(x⊤i β). Thus, for Bernoulli data the likelihood
can also be represented as

L(β) =
n∏

i=1

Li(β) =
n∏

i=1

πyi
i (1 − πi)1−yi

where Li(β) is the individual likelihood contribution of the i-th observation. With


CHAPTER 2. STATISTICAL METHODS 22

πi =
exp(x⊤i β)

1 + exp(x⊤i β)

and by logarithmizing we obtain the log-likelihood

l(β) := ln(L(β)) =
n∑

i=1

[
yi(x⊤i β) − ln(1 + exp(x⊤i β))

]
.

The estimators for β0, . . . , βp are obtained by numerical maximization of the log-likelihood,
e.g. by using the Fisher scoring or the Newton-Raphson method, (see, e.g., Nelder and
Wedderburn, 1972). Generally, for more details on GLMs, see also Fahrmeir and Tutz (2001).
The standard logit model can be fitted in R via the glm function available in the basic stats
package (R Core Team, 2024).

2.2 Extension for the regression models

In the following, we will present extensions of the classical regression model. First, in
Section 2.2.1 we will explain the idea of LASSO penalization. Then, splines will be introduced
in more detiales in Section 2.2.2.

2.2.1 Least absolute shrinkage and selection operator (LASSO)

Regularization is a statistical technique used to prevent overfitting in predictive models, e.g.
by adding a penalty term to the loss function, discouraging overly complex models. This reg-
ularization can take various forms, such as L1 penalization (LASSO), which encourages spar-
sity by forcing some coefficients to be exactly zero, or L2 penalization (ridge), which shrinks
coefficients toward zero but does not eliminate them entirely. Additionally, spline smoothing
is a form of regularization that uses piecewise polynomial functions to create flexible models
while controlling excessive wiggliness in the fitted curve. Component-wise boosting (see,
e.g., Bühlmann and Yu, 2003, Bühlmann and Hothorn, 2007, or Schmid and Hothorn, 2008) is
another regularization approach that builds models incrementally by adding weak learners
sequentially, focusing on correcting errors made by previous iterations while maintaining
model simplicity. Penalization, as a specific concept of regularization, helps create simpler
and more interpretable models while improving their ability to generalize to unseen data.
The strength of the penalty is controlled by a tuning parameter, which balances the trade-
off between model complexity and predictive accuracy. For example, ridge regression (L2
penalization) adds a penalty proportional to the sum of the squared coefficients, shrinking
them toward zero without completely eliminating any. In contrast, LASSO regression (L1
penalization) applies a penalty based on the absolute values of the coefficients, encourag-
ing sparsity by forcing some coefficients to be exactly zero, effectively performing variable
selection. Throughout this work, we will mainly focus on LASSO penalization and spline
smoothing.


CHAPTER 2. STATISTICAL METHODS 23

Overall, LASSO is a powerful tool for improving prediction accuracy and simplifying
models by reducing the number of predictor variables (Tibshirani, 1996). It is particularly
applicable to binary response variables. LASSO enhances the traditional linear regression
model by incorporating an ℓ1-norm penalty on the coefficients. This penalty shrinks some
of the coefficient estimates toward zero, effectively performing variable selection.

Figure 2.1 by James et al. (2013) provides a geometrical interpretation of the constraints
imposed by LASSO and ridge regression, exemplarily in two-dimensional coefficient space
(β1, β2).

The left chart shows the LASSO constraint region, which is as a diamond-shaped area,
representing the ℓ1-norm constraint |β1| + |β2| ≤ s. The chart on the right illustrates the ridge
regression constraint, which is a circular region corresponding to the ℓ2-norm constraint
β2

1 + β
2
2 ≤ s. In the context of LASSO and ridge regression, s represents the constraint

or shrinkage parameter, which determines the maximum allowable size of the coefficient
vector within a specified norm. Thus, s controls the degree of regularization—a smaller s
forces coefficients to be closer to zero, while a larger s allows more flexibility in coefficient
values.

The red ellipses in both plots represent the contours of the residual sum of squares (RSS),
which indicate the direction of the optimal solution in an unconstrained regression setting.
The estimated coefficients are obtained where these ellipses first touch the constraint region.
For LASSO, the diamond-shaped constraint leads to solutions that often lie exactly on the
axes, meaning some coefficients are set to zero. This occurs because the diamond has sharp
corners, which increases the likelihood that the RSS contours will first intersect at these
points—enabling automatic feature selection.

In contrast, the circular constraint of ridge regression does not favor solutions where
coefficients are exactly zero. Instead, it shrinks all coefficients toward zero continuously, but
without completely eliminating any feature. This explains why ridge regression is effective
for shrinkage but not for variable selection. Thus, the fundamental difference between the
two approaches is that LASSO encourages sparsity by setting some coefficients to zero,
whereas ridge regression retains all predictors but shrinks their magnitudes to prevent
overfitting.

Note that in Figure 2.1, the authors explored the case where p = 2. Where p represents
the number of features in the regression model. When p = 3, the ridge regression constraint
region becomes a sphere, while the LASSO constraint forms a polyhedron. For p > 3, the
ridge regression constraint becomes a hypersphere, and the LASSO constraint becomes a
polytope. Despite these changes in the constraint shapes, the fundamental concepts shown
in the coreponding figure still apply. Specifically, the LASSO continues to perform feature
selection when p > 2, due to the sharp corners of the polyhedron or polytope. According
to James et al. (2013), the LASSO has a major advantage over ridge regression in that it
produces simpler and more interpretable models, involving only a subset of the predictors.
Further information and a comparison between the two methods, along with insights into


CHAPTER 2. STATISTICAL METHODS 24

which method yields better predictions, can be found e.g. in Hastie et al. (2015).

Figure 2.1: LASSO (ℓ1-norm) vs. ridge regression (ℓ2-norm) constraints: The LASSO con-
straint forms a diamond-shaped region promoting sparsity, while the ridge constraint forms
a circular region, leading to coefficient shrinkage without selection as provided by James
et al. (2013) in the online resources (see https://www.statlearning.com/resources-second
-edition).

When the number of covariates p becomes very large, estimation becomes numerically
unstable (see, e.g., Fahrmeir et al., 2013). This can also be the case if there is some substantial

mulitcollinearity between the columns of the design matrix X =
(
x1,x2, . . . ,xn

)T
. To address

this problem, a penalty term pen(β) is added to the negative log-likelihood in the logit model.
According to Park and Hastie (2007), the estimator is then obtained by minimizing

β̂pen = arg min
β

(
−l(β) + λ · pen(β)

)
,

where λ is the penalty parameter that controls the influence of the penalty term on the pa-
rameters estimated by the maximum likelihood method (Fahrmeir et al., 2013). Here, the
penalty term is given by

pen(β) =
p∑

j=1

|β j|.

By keeping the intercept free from penalization, regularization methods help maintain a
balance between model simplicity and predictive accuracy while preserving essential aspects
of the prediction. Penalizing the intercept could reduce model interpretability and make it
more challenging to understand the model’s baseline behavior.

Thus, the penalty term allows model estimation and variable selection to be performed
in one step, as small coefficients are shrunk to 0. According to Park and Hastie (2007), the
estimator β̂LASSO is given by


CHAPTER 2. STATISTICAL METHODS 25

β̂LASSO = arg min
β

−l(β) + λ ·
p∑

j=1

|β j|

 .
There is no closed-form representation for solving this minimization problem. Therefore,

numerical optimization methods are used to obtain the optimal LASSO estimator β̂LASSO

(see, e.g., Friedman et al., 2010). In order to determine an optimal value for λ, typically
for a sequence of different λ values K-fold cross validation3 is performed. The mean cross
validated (CV) error is then calculated for each of these λ. The error is typically compared
on the basis of the deviance on the omitted data, which is defined as the negative of twice the
log-likelihood. To estimate βLASSO, the λ that minimizes mean CV error is used (Friedman
et al., 2010). This method is implemented in the glmnet R package by Friedman et al. (2010)
and a notable update is detailed in Tay et al. (2023).

Figure 2.2 provides an illustrative example on an artificial dataset to explain the con-
cept of LASSO. The LASSO model is fitted using the glmnet function from the eponymous
R package, and the coefficient paths are then plotted. This visualization is useful for un-
derstanding how LASSO selects features and penalizes complexity to improve predictive
performance. It displays the coefficient paths of a LASSO-regularized logistic regression
model, fitted using the glmnet package on a simulated dataset with 500 observations and 10
predictors generated from a normal distribution. The coefficient paths illustrate how feature
selection is influenced across different values of the regularization parameter λ. The plot
helps identify which predictors remain important after applying LASSO regularization. The
x-axis represents the λ values, which control the strength of penalization. As λ increases,
many coefficients are shrunk to zero, leaving only a few important features (blue lines).
Conversely, as λ decreases, the model allows larger coefficient magnitudes, capturing more
complexity. The y-axis shows the values of the estimated regression coefficients (β̂ j) for
different predictors at each λ value. If a coefficient remains nonzero, it means the corre-
sponding predictor is more relevant for classification in the respective logistic regression
model. Each line represents how a specific predictor’s coefficient changes as λ varies. Black
lines represent coefficients that are strongly shrunk toward zero (less important predictors).
Blue lines represent the three most significant predictors (true nonzero coefficients), which
retain nonzero values even under strong penalization. As λ increases, most black lines
move toward zero, meaning those features are being eliminated. The red dashed vertical
line represents λopt, the optimal penalty parameter selected via tuning, in particular here via
10-fold cross validation. This is the point where the model achieves the best balance between
accuracy and generalization. At λopt, some coefficients remain nonzero (important predic-
tors), while others have been eliminated (unimportant features). The optimal model at λopt

includes only the strongest predictors, improving interpretability and avoiding overfitting.

3K-fold cross validation is a technique used to assess the performance of a machine learning model. It
involves dividing the dataset into K equally-sized subsets (or folds). The model is trained on K-1 folds and
tested on the remaining fold. This process is repeated K times, with each fold used as the test set once. The
results are then averaged to provide a more reliable estimate of model performance.


CHAPTER 2. STATISTICAL METHODS 26

0.00 0.04 0.08 0.12

−
0.

5
0.

0
0.

5
1.

0

λ

β j^

λopt = 0.01701217

Figure 2.2: Exemplary coefficient paths for the LASSO model across λ values on an artificial
dataset, with the three key nonzero coefficients in blue and others in black. The vertical red
dashed line marks the optimal λ from cross-validation.

2.2.2 Non-parametric models (splines)

In the methods discussed above, the relationship between the covariates and the target
variable is assumed to be linear. However, real-world data often exhibit non-linear patterns
that cannot be adequately captured by linear models. To address this, we begin with one
of the simplest ways to introduce non-linearity—polynomial regression. While polynomials
offer more flexibility, they can lead to issues such as overfitting and instability. A more
advanced and robust approach is the use of splines, which provide a piecewise polynomial
representation that balances flexibility and interpretability. In this context, we employ B-
splines (Eilers and Marx, 1996) to model non-linear relationships effectively.

Let yi be the i-th observation of the response variable and xi the corresponding value of
a single (metric) covariate. The influence of xi on yi can then be modeled as a polynomial
function f (·) of degree l, i.e.,

f (xi) = γ0 + γ1xi + . . . + γlxl
i.

For more flexible modeling, the range of values of the covariates is also divided into
intervals, so that one obtains several locally estimated polynomials. However, since this
leads to a non-steady function, polynomial splines are defined. A polynomial spline of degree
l ≥ 0 to the nodes a = κ1 < · · · < κm = b is a function f : [a, b]→ R with the following features
(see, e.g., Fahrmeir et al., 2013):

1. f (·) is (l − 1) times continuously differentiable.

2. On every interval [κ j, κ j+1) f (·) is a polynomial of the l-th degree.


CHAPTER 2. STATISTICAL METHODS 27

The range of values is thus decomposed by the knots κ1, . . . , κm. For the constructive repre-
sentation of polynomial splines, various equivalent approaches exist (Fahrmeir et al., 2013).
Here, we use the so-called B-splines basis proposed by (Eilers and Marx, 1996).

B-splines

The function f (x) from above can be represented as a linear combination of d = m + l − 1 so-
called basic-spline or B-spline basis functions. Such a basis function consists of l+1 polynomial
pieces of l-th degree, which are composed in a l−1 times continuously differentiable manner.
f (·) can then be represented according to Fahrmeir et al. (2013) as

f (x) =
d∑

j=1

γ jB j(x),

where γ j are the coefficients to be estimated, and B j(x) represents a set of basis functions.
Following Fahrmeir et al. (2013), the B-spline basis functions can be recursively defined

as following:

B l
j(x) =

x − κ j

κ j+l − κ j
B l−1

j (x) +
κ j+l+1 − x
κ j+l+1 − κ j+1

B l−1
j+1(x),

where

B0
j (x) = 1[κ j,κ j+1)(x) =

1, if κ j ≤ x < κ j+1,

0, otherwise.
j = 1, . . . , d − 1.

To estimate a B-spline, the corresponding basis functions have to be first computed at a
given number of knots. At this point, we introduce the B-spline design matrix B, which
plays a crucial role in representing smooth functions through B-splines. It is constructed by
evaluating a set of B-spline basis functions at specified data points. Each row of the matrix
corresponds to an observation, while each column represents the values of a specific B-spline
basis function evaluated at all n observations. The matrix B serves as the foundation for
spline-based regression and smoothing techniques, allowing flexible, piecewise polynomial
fitting. This matrix forms the foundation for representing the spline function and is later
used in the estimation process. It is given as

B =


B1(x1) B2(x1) · · · Bd(x1)
B1(x2) B2(x2) · · · Bd(x2)
...

...
. . .

...

B1(xn) B2(xn) · · · Bd(xn)

 ,
each row corresponds to an observation xi, each column represents the evaluation of a B-
spline basis function B j(·) at the corresponding data points, d is the number of basis functions,
and n is the number of data points.

This process is shown in Figure 2.3, where basis functions are computed to approximate


CHAPTER 2. STATISTICAL METHODS 28

the simulated data (colored gray here). In Figure 2.4, each basis function is assigned a weight
γ̂ j for scaling by the estimator γ̂.

Figure 2.3: B-spline basis Figure 2.4: Weighted basis functions

Figure 2.5: Final spline estimation

The final estimation is calculated by summing up the single weighted basis functions (see
Figure 2.5).

As an unpenalized estimation of a non-linear B-spline effect often overfits, typically
the non-linear effect is smoothed by using penalized B-splines, i.e. P-splines, which will be
introduced in the next paragraph.

P-splines

Beside the problem of potential overfitting, the goodness-of-fit of the B-spline approach
depends on the number of selected nodes. To avoid this problem, various penalization
methods exist in the form of P-splines (Eilers and Marx, 2021). Here, a penalized estimation
criterion, which is extended by a penalty term, is used instead of the usual estimation criterion
(Fahrmeir et al., 2013). For P-splines based on B-splines (see, e.g., Eilers and Marx, 1996), the
function f (x) is first approximated by a polynomial spline with many nodes (typically about
20 to 40). The penalty term then results in

λ

∫ (
f ′′(x)

)2 dx .


CHAPTER 2. STATISTICAL METHODS 29

This is motivated by the fact that the second derivative is used as a measure of the curvature
of a function. For approximation of the second derivative exist simple representations, so
that the penalty term results in

λ
d∑

j=3

(
∆2γ j

)2
,

where ∆2γ j = γ j − 2γ j−1 + γ j−2 (see again Eilers and Marx, 1996).
The second order difference ∆2γ j can be represented as a matrix operation involving the

vector of coefficients γ = (γ1, γ2, . . . , γd)T. The penalty on the second differences is equivalent
to:

λ
d∑

j=3

(γ j − 2γ j−1 + γ j−2)2.

This can be written in matrix form as:

λγTQγ

where the penalty matrix Q is a symmetric matrix. Specifically, Q is a tridiagonal matrix that
captures the second-order differences between the coefficients γ j, where the diagonal entries
are 2 and the off-diagonal entries are -1, as follows:

Q =



2 −1 0 · · · 0
−1 2 −1 · · · 0
0 −1 2 · · · 0
...

...
...
. . .
...

0 0 0 · · · 2


.

To fit the P-spline model e.g. in the simple case of a Gaussian response, i.e. a classic
linear model setup, we minimize the following objective function, which combines the least
squares loss with the penalty term:

L(γ) =
n∑

i=1

(yi − f (xi))2 + λγTQγ

the penalized least squares estimate then has the form

γ̂ =
(
BTB + λQ

)−1
BTy

where B is the design matrix formed by the B-spline basis functions, and Q is the penalty
matrix (which here exemplarily encodes second order differences between the coefficients
γ j). The penalty matrix Q encourages smoothness by shrinking differences between adjacent
coefficients. As shown by Eilers and Marx (2021), this approach results in a computationally
efficient and stable smoothing method, widely used in practical applications.

The optimal smoothing parameter λ is again a tuning parameter of the procedure and is


CHAPTER 2. STATISTICAL METHODS 30

determined using generalized CV. For more details on the methodology, refer to Eilers and
Marx (2021). Information on the corresponding software implementation in R can be found
in Wood (2017).

2.3 Machine learning (ML) approaches

In this section, we present the machine learning methods employed throughout this work.
While some of the mathematical formulas are derived from the original articles, additional
details and explanations are provided here for further clarity.

2.3.1 Decision trees

The fundamental concept behind the decision trees is to recursively partition the data into
subsets based on the values of the input features. At each node of the tree, a decision is
made about which feature to split on and what threshold to use. This process is repeated
until a stopping criterion is met, resulting in a tree structure with leaf nodes containing the
predicted values.

The Classification and Regression Tree (CART) algorithm, introduced by Breiman et al.
(1984), is a class of non-parametric algorithms capable of handling both classification and
regression tasks. While it has powerful capabilities in regression, where it can predict
continuous outcomes, this thesis primarily focuses on classification tasks, as our target
variable is binary. This type of learning is widely applied across various fields, e.g. Angelini
et al. (2008); Conneau et al. (2017). In medical diagnosis, the goal is often to assess whether a
particular disease is present in a patient, based on a combination of clinical data, test results,
and symptoms Kourou et al. (2015). Another example is provided by Bishop and Nasrabadi
(2006), which offers a comprehensive resource for understanding classification problems and
pattern recognition. This work is applicable to a wide range of fields, including computer
vision, speech recognition, and bioinformatics.

In the case of classification trees, the focus is on predicting categorical outcomes. In the
context of sport domain such as tennis, a classification tree can predict the winner of a match
based on features like player ranking, age, experience, recent performance, surface type,
and more. Unlike traditional linear regression models, which assume a linear relationship
between variables, classification trees are capable of capturing complex non-linear patterns
and interactions within the data. The target (response) variable is typically binary (e.g., in
our case, y = 1 if Player A wins, y = 0 if Player B wins). The tree grows by iteratively
splitting the dataset based on values of the covariates. These splits are selected to create
more homogeneous groups with respect to the response variable. Common splitting criteria
include metrics such as Gini impurity or entropy (information gain). Hence, decision trees
seek to find the best split to subset the data.

CARTs use the Gini index in the process of splitting the dataset into a decision tree, which
is defined by


CHAPTER 2. STATISTICAL METHODS 31

G =
K∑

k=1

p̂mk(1 − p̂mk),

a measure of total variance across the K classes. Here, p̂mk represents the proportion of
training observations in the m-th region that are from the k-th class. The Gini index is
minimized when all values of p̂mk are close to zero or one. Therefore, the Gini index serves
as a measure of node purity, where a low value indicates that a node is largely composed of
observations from a single class (James et al., 2013).

Another preferable measure which is an alternative to the Gini index is the entropy, given
by

E = −
K∑

k=1

p̂mk log p̂mk.

Similar to the Gini index, entropy takes on a low value when the m-th node is pure. In fact,
the Gini index and entropy are often numerically quite close (James et al., 2013). Note that,
in this work, we focus on using the Gini index. Some mathematical background is provided,
assuming a basic knowledge of tree-based methods. For more detailed description see James
et al. (2013) and Hastie et al. (2009).

One notable drawback of classification trees is their tendency to overfit the data, especially
when the tree grows too deep. A fully grown decision tree often overfits the training data,
leading to poor generalization on new, unseen test data. This issue can be addressed by
pruning the tree to reduce its complexity or by using ensemble methods such as random
forest or gradient boosting. In the rpart package (Therneau et al., 2015), this is managed
by the complexity parameter (cp), which sets a threshold that an improvement must exceed
for a split to remain in the tree. A very small cp value leads to overfitting, while a large cp
results in an overly simple tree, causing underfitting. Hence, cp is the main tuning parameter
of the method. Both extremes reduce the model’s predictive performance. The optimal cp
can be identified by testing different values and using cross validation (CV) to determine
the corresponding prediction accuracy of the model (we implemented this CV approach to
ensure that for all methods requiring tuning, the same data folds are used, allowing for
better comparability). The best cp is defined as the one that maximizes the CV accuracy.
Another issue is the instability of classification trees, as small changes in the data can lead to
different splits and significantly different trees. This instability is related to the high variance
of decision trees, making them highly sensitive to even small variations in the training data
and potentially leading to inconsistent predictions.

An example of a classification tree in practice is shown in Figure 2.6. To fit the tree and
produce a plot that reflects the specified rules accurately, the rpart package is used along
with the function rpart.plot for visualization. In a binary classification tree, the goal is
to predict which of two possible outcomes will occur. In the context of tennis matches, the
focus would be on predicting the winner of a match, such as:


CHAPTER 2. STATISTICAL METHODS 32

• Y = 1: Player A wins the match.

• Y = 0: Player B wins the match.

At each internal node of the tree, a decision is made (e.g., "Is Player A’s ranking larger
than 10?"). Based on the answer (Yes/No), the data is split and moves to the next node,
continuing until a prediction is made at the leaf node. The final prediction at the leaf node
(e.g., Player A wins or Player B wins) is determined by the majority class in the training
data that reaches that node. Based on the proportions of observations in the final nodes, the
respective probabilities for Y = 1 and Y = 0 are also obtained.

Thus, the classification tree splits the data based on input features (such as player statis-
tics) to classify the outcome into one of the predefined categories. This means that we can
train a classification tree using features like player rankings, their recent performance, age,
surface type, etc. Each node in the tree represents a decision point based on one of the
features (e.g., EloA, EloB, AgeA, etc.). The splits in the tree indicate how the model partitions
the data to make predictions. For example, it may first split based on whether EloA is larger
than a certain value. The terminal nodes (leaves) represent the final predictions (outcomes)
of matches. Each leaf will show the predicted class (0 or 1) and the proportion of instances
that belong to each class. The graph also shows the probability of winning for each player.
This provides insight into how confident the model is about its predictions.

By examining the splits, one can understand which features (Elo ratings, Age, Surface)
are most influential in determining the match outcome. If EloA is frequently used for splits,
it may indicate its importance. By following the branches from the root to the leaves, we
can identify the criteria that lead to each prediction, making the model interpretable for
tennis match outcomes. Note that later in our work, we modify this approach by using
the differences between the covariate values of the first-named and second-named players
to better capture their relative advantages. Specifically, the value of the second player was
always subtracted from the value of the first player.

In the following sections, we introduce both random forest and extreme gradient boosting
(XGBoost) as extensions and enhancements to this approach. Random forests are a parallel
ensemble method that improves prediction accuracy by averaging the outcomes of multiple
decision trees. Similarly, gradient boosting machines build trees sequentially, with each new
tree attempting to correct the errors of the previous ones. Classification trees, particularly
when combined with these ensemble methods, provide a robust framework for predicting
tennis matches by incorporating some factors such as player form, surface type, or ranking
into the model. While the literature on employing trees to model and forecast tennis matches
is still scarce (Gao and Kowalczyk, 2021; Svensson, 2021 and Ünal, 2023), this is different
in soccer. Bunker et al. (2024a) demonstrates such applications in predicting soccer match
results. Groll et al. (2019) introduced a hybrid modeling approach that combines random
forests with Poisson ranking methods to predict soccer match outcomes in international tour-
naments. This innovative model integrates team ability parameters derived from historical


CHAPTER 2. STATISTICAL METHODS 33

Figure 2.6: Example of a classification tree model to predict the outcome of a tennis match
based on players’ Elo ratings and Ages

match data into the random forest framework, enhancing its predictive accuracy. The results
confirmed the model’s strong predictive capabilities, outperforming traditional methods,
including betting odds.

2.3.2 Random forest (RF)

A random forest (RF) is a machine learning algorithm that can be used for both classification
and regression tasks, first proposed by Breiman (2001). It is based on a parallel ensemble
of regression or decision trees, depending on the type of outcome. Here, we’ll focus on the
version with decision trees. The algorithm is called “random” due to two key aspects of
randomness in its construction. First, in each tree it randomly selects subsets of predictors
at each split during training, ensuring diversity across the trees. Second, each decision tree
in the ensemble is trained on a randomly drawn bootstrap sample from the original dataset,
meaning that different subsets of the data are used for training each tree. It is referred to as a
“forest” because it builds an ensemble of multiple decision trees, which collectively improve
predictive accuracy and robustness.

Random forests are typically trained using the ’bagging’ method, which reduces model
variance by introducing randomness through bootstrap sampling with replacement from
the training data set. Nevertheless, sampling without replacement is often assumed, as it
simplifies analysis (Ramosaj, 2020; Scornet et al., 2015).

To apply bagging in a classification tree, a large number B of trees are generally generated
commonly with default values of B = 500 or B = 1000 (Probst et al., 2019), using B boot-
strapped training sets. The final prediction is then determined by majority voting among
the resulting trees. In our study, we found that using B = 400 was sufficient to achieve


CHAPTER 2. STATISTICAL METHODS 34

good performance (i.e., we set ntree = 400 and, hence, did not tune this parameter). Note
that, the number of trees in a forest is typically not tuned, as increasing the number of trees
generally enhances performance (Scornet, 2017; Díaz-Uriarte and Alvarez de Andrés, 2006).
Additional details on bagging are available in James et al. (2013).

The main extension compared to bagging is the idea of drawing randomly only a subset
of "mtry" features at each node for splitting in the individual trees, which involves selecting
a random subset of predictors from the set of all available features at each decision node
in a decision tree. This introduces diversity among the trees in the ensemble, decreasing
the variance of the ensemble and, hence, reducing the risk of overfitting and enhances the
model’s generalization performance by ensuring that each tree is built on a different subset of
features. For fitting random forest models, we use the ranger implementation in R (Wright
and Ziegler, 2017). In addition, there are other R packages that provide an automatic tuning
procedure for random forests, such as mlr3 and caret packages (see Bischl et al., 2023 and
Kuhn, 2008).

The choice of hyperparameters is task-dependent and can be determined using techniques
such as CV or random search (Bartz et al., 2023). Here, mtry and the ntree are the two
parameters that most likely have a big effect on our final accuracy and are defined as follows:

• mtry: number of variables randomly sampled as candidates at each split.

• ntree: number of trees to grow.

The mtry hyperparameter controls the split-variable randomization feature of the ran-
dom forest and it helps to balance low tree correlation with reasonable predictive strength.
According to Hastie et al. (2009), for classification mtry =

√
p is recommended, where p is

the number of predictors. However, when there are fewer relevant predictors a larger value
of mtry tends to perform better because it makes it more likely to select those features with
the strongest signal. When there are many relevant predictors, a lower mtrymight perform
better. For this reason, we decided to properly tune this parameter using (self-implemented)
10-fold CV. Regarding the minimum node size, it is recommended to set the minimum node
size to one for classification and five for regression (Díaz-Uriarte and Alvarez de Andrés,
2006).

There are no specific criterias when it comes to choosing the number of trees that the
random forest should use, as long as ntree is large enough. However, there is a threshold
where adding more trees does not give any significant gain to the random forest (Oshiro
et al., 2012), but computational costs would unnecessarily increase.

Like a decision tree, using random forest for tennis match prediction is advantageous
because they can handle non-linear relationships between variables and naturally account
for complex interactions, such as how a player’s performance might depend on the match
surface, Elo or their age. Additionally, random forest reduces the risk of overfitting by
averaging predictions across many trees, making it robust and highly accurate for real-world
predictions. This method also provides insights into which factors are most important in


CHAPTER 2. STATISTICAL METHODS 35

influencing match outcomes through variable importance metrics. For further details about
variable importance, the reader can refer to Hastie et al. (2009).

2.3.3 Extreme gradient boosting (XGBoost)

As alternative to the random forest approach from above (parallel ensemble methods) we
also consider sequential ensembles. A well-known approach in this context is boosting, a
technique derived from the machine learning field (Freund et al., 1996), which was sub-
sequently adapted to estimate predictors in statistical models (Friedman et al., 2000, and
Friedman, 2001). Generally, the concept of an iterative boosting algorithm is to combine
multiple weak learners additively into a single, strong ensemble model, which leads to high
predictive accuracy (Schapire, 1990). One notable advantage of statistical boosting algo-
rithms is their flexibility for high-dimensional data and their ability to incorporate variable
selection in the fitting process (Mayr et al., 2014). The boosting model f boost has the form

f boost(x) =
M∑

m=1

fm(xi)

where f1, . . . , fM, M ∈ N are weak learners, which will be referred to as base functions in the
following. According to Friedman et al. (2000), boosting can be considered as an additive
expansion in a set of elementary functions. Hence, f boost can be rewritten as

f boost(x) =
M∑

m=1

αmh(x; am),

where α1, . . . , αM ∈ R are expansion coefficients and h : Rd
→ Rwith x→ h(x; a) are described

in terms of simple basis functions by a set of parameters am ∈ RD. Fitting such a model
involves selecting the basis functions and then their parameters {αm, am}

M
m=1 are determined

by minimizing a loss function L averaged over the training dataset

{α̂m, âm}
M
m=1 = arg min

{αm,am}
M
m=1

1
n

n∑
i=1

L

Yi,
M∑

m=1

αmh(Xi, am)

 .
Here, a forward stagewise additive modeling approach is commonly applied. In this

method, optimization is carried out by sequentially adding new basis functions to the ad-
ditive model, without modifying the parameters and coefficients of those already included.
At each iteration m, m = 1, . . . ,M, we identify the following

(αm, am) = arg min
α,a

1
n

n∑
i=1

L(Yi, gm−1(Xi) + αh(Xi, a)), (2.2)

with

g0(x) = 0 and gm−1(x) =
m−1∑
k=1

αkh(x; ak),


CHAPTER 2. STATISTICAL METHODS 36

which results in

gm(x) = gm−1(x) + αmh(x; am).

Further details can be found in Hastie et al. (2009). The boosting framework was later gener-
alized to support the optimization of any differentiable loss function (Friedman, 2001), where
a functional gradient descent approach is used to solve equation (2.2). For a differentiable
loss function L, the corresponding negative gradient on the training data is given by

−γ(xi) =
[
∂L(Yi, g(xi))
∂g(xi)

]
g(x)=gm−1(x)

.

Since the negative gradient is available only at the data points given in the training dataset,
it is approximated by h(x; am) to enable generalization to other points, where

{αm, am} = arg min
α,a

n∑
i=1

(
−γm(xi) − αh(xi; a)

)2 .

The step length ρm is determined using a line search of the form

ρm = arg min
ρ

n∑
i=1

L
(
Yi, gm−1(xi) + ρh(xi; am)

)
.

Furthermore, Friedman (2001) proposed a regularization parameter 0 < ν ≤ 1, which is
applied as a multiplicative factor to the step length at each iteration. This factor, also known
as the learning rate, is typically small (e.g., ν = 0.1), and it determines the step size taken at
each iteration m. The corresponding update can then be defined as

νρmh(x; am).

See Hastie et al. (2009) for more details about the algorithm of the gradient boosting method.
Extreme Gradient Boosting, or XGBoost, is a powerful and efficient machine learning algo-
rithm built upon the principles of gradient boosting. It was introduced by Chen and Guestrin
(2016) and incorporates a Newton step instead of the standard gradient step, which enhances
performance and helps reduce overfitting (Friedman et al., 2000). Here, the second-order
derivatives are considered in addition to the negative gradient, which are given by

δm(xi) =
[
∂2L(Yi, g(xi))
∂g(xi)2

]
g(x)=gm−1(x)

.

The Newton step can be obtained by solving

am = arg min
a

n∑
i=1

(
γm(xi)h(xi, a) +

1
2
δm(xi)h(xi, a)2

)
The step length is subsequently defined as νh(x; am).

The extreme gradient boosting method is widely used for both regression and classifica-


CHAPTER 2. STATISTICAL METHODS 37

tion tasks. In the field of sports, XGBoost has been successfully applied in areas like tennis
and football prediction (Pan et al., 2024; Markopoulou, 2024; Yigit et al., 2020 and Groll et al.,
2021).

One important aspect of using XGBoost is that it involves several tuning parameters
to control the model’s behavior and performance. In our study, we tuned the following
parameters:

• max_depth: The maximum depth of the trees in the boosting ensemble. Higher values
allow the trees to capture more complex patterns but may lead to overfitting, while
lower values may underfit the data. Typical range is 3 to 10 (default is 6) (Chen and
Guestrin, 2016).

• nthread: Specifies the number of threads to be used for processing (Chen and Guestrin,
2016).

• learning rate: Controls the step size in each boosting step and the contribution of
each tree to the final prediction. A smaller learning rate reduces the risk of overfitting
by shrinking the contribution of each tree but requires more boosting rounds (nrounds)
to converge. Typical range values are 0.01 to 0.3, default is 0.3 (Contributors, 2024).

• nrounds: The number of boosting rounds (or trees) to train. A larger number of rounds
increases model complexity but can also lead to overfitting, especially if the learning
rate is high. The range depends on the dataset, but it usually falls between 50 and 1000
(Contributors, 2024).

The parameters max_depth, eta, and nrounds are critical for controlling the complexity,
speed, and convergence of the XGBoost model. Regularization and parallelization (using
nthread) further help to improve training efficiency and reduce the risk of overfitting.
Finding optimal values for these parameters often involves tuning techniques like grid
search or random search. For our approach, we specified reasonable, discrete grids and used
a self-implemented 10-fold cross validation to determine the optimal tuning parameters. The
approach is implemented using the xgboost function from the xgboost R package (Chen
et al., 2019).

2.3.4 Support vector machine (SVM)

Support vector machine (SVM) is a supervised learning algorithm for classification tasks,
introduced by Cortes and Vapnik (1995). The fundamental idea behind SVM is to find an
optimal decision boundary, or hyperplane, which separates data points into distinct classes
with the maximum possible margin.

In a binary classification task, SVM identifies a boundary that best separates data points
of one class from those of the other. The algorithm maximizes the distance between this
hyperplane and the nearest data points from each class, known as support vectors. Only


CHAPTER 2. STATISTICAL METHODS 38

these closest points influence the position and orientation of the hyperplane, determining
the optimal boundary that maximizes separation.

When the data is linearly separable, SVM finds a straight hyperplane that maximizes
the margin between classes. For non-linearly separable data, SVM uses kernel functions to
transform the original data into a higher-dimensional space where a separating hyperplane
can be found. Commonly used kernels include:

• Linear Kernel: suitable for linearly separable data.

• Polynomial Kernel: adds polynomial terms, enabling more complex boundaries.

• Radial Basis Function (RBF) Kernel: popular for capturing complex relationships in
non-linear data.

• Sigmoid Kernel: similar in behavior to neural networks.

SVM also includes a regularization parameter C, which controls the trade-off between
maximizing the margin and minimizing classification error. A higher C value attempts to
classify every point correctly, while a lower C allows more misclassifications, helping the
model generalize better.

There are several references that cover a range of SVM applications, illustrating the
versatility of SVM in handling both linear and non-linear data across various fields. For
example, in image classification and facial recognition, Osuna et al. (1997) applies SVM to
facial recognition, specifically in distinguishing faces from non-face images, demonstrating
that SVM’s margin-maximizing approach can improve accuracy in image classification tasks.
In the field of medical diagnostics, Guyon et al. (2002) applies SVM for feature selection in
cancer classification, highlighting its effectiveness in bioinformatics and showing SVM’s
capability in analyzing complex biological data to improve diagnostic accuracy. In sports
analytics, Wei et al. (2013) uses SVM to classify different types of tennis serves based on player
style and motion, showing that by analyzing shot characteristics, SVM can accurately classify
serve types and contribute to more advanced player analytics. Additionally, Hughes et al.
(2007) explore advancements in computerized notational analysis, showing how SVM can
support coaching strategies by identifying winning patterns and key performance indicators
in tennis, highlighting SVM’s role in analyzing historical match data to uncover predictive
patterns for strategic improvement. Finally, in Wei et al. (2015), SVM is again employed
to classify types of tennis serves based on player style and motion, reinforcing the model’s
effectiveness in advanced player analytics. Figure 2.7 illustrates an exemplary data situation
in a binary classification problem, where two different classes are represented by circles and
diamond shapes (Savareh et al., 2018). The support vectors, which are the points closest to
the hyperplane, are highlighted in dark shade. A hyperplane separates the classes, and the
margins, indicated by dashed lines, represent the distance between the hyperplane and the
support vectors.


CHAPTER 2. STATISTICAL METHODS 39

Figure 2.7: Graphical illustration of classification problem for two classes (Savareh et al.,
2018). The solid line corresponds to the decision boundary (hyperplane), and the dashed
lines visualize the margin.

The mathematical definition of a hyperplane in two dimensions is given by James et al.
(2013) as

β0 + β1x1 + β2x2 = 0 , (2.3)

for parameters β0, β1 and β2. Any x = (x1, x2)T for which equation (2.3) holds is a point on
the hyperplane. Equation (2.3) can be extended to the p-dimensional case by

β0 + β1x1 + β2x2 + ... + βpxp = 0 , (2.4)

with x = (x1, x2, ..., xp)T being a vector of length p. Again, if x satisfies equation (2.4), then x
lies on the hyperplane. In case that x does not satisfy equation (2.4), then this indicates that
x lies on one side of the hyperplane, e.g.

β0 + β1x1 + β2x2 + ... + βpxp > 0.

Otherwise, if

β0 + β1x1 + β2x2 + ... + βpxp < 0 ,

then x lies on the other side of the hyperplane.
Suppose that we have an n × p data matrix X that consists of n training observations in

the p-dimensional space with corresponding binary responses falling into two classes, that
is, y1, ..., yn ∈ {−1, 1}, where −1 represents one class and 1 the other class.

Then, a separating hyperplane has the property that

yi · (β0 + β1xi1 + β2xi2 + ... + βpxip) > 0


CHAPTER 2. STATISTICAL METHODS 40

for all i = 1, ...,n.
The maximal marginal hyperplane is the separating hyperplane for which the margin

is largest. For a set of n training observations x1, ..., xn ∈ R
p and associated class labels

y1, ..., yn ∈ {−1, 1}, the maximal margin hyperplane is e.g. defined by James et al. (2013) as
the solution to the following optimization problem. Let M > 0 represent the margin of our
hyperplane. Then, we want to maximize M and optimize β0, β1, . . . , βp in

yi · (β0 + β1xi1 + β2xi2 + ... + βpxip) ≥M , ∀i = 1, ...,n ,

subject to
∑p

j=1 β
2
j = 1. Further details about SVM can be found e.g. in James et al. (2013).

Often classes are not perfectly separable, then so-called slack variables ξ1, . . . , ξn are
introduced in additional constraints

yi ·
(
x⊤i β + β0

)
≥M (1 − ξi) , i = 1, . . . ,n

ξi ≥ 0, i = 1, . . . ,n
n∑

i=1

ξi ≤ C,

with C ≥ 0 constant. These slack variables allow some flexibility compared to perfect
separation, and ξi represents the extent to which observation i lies on the wrong side of
its margin line, relative to the total margin violations M (see, e.g., Hastie et al., 2009). For
0 < ξi ≤ 1, observation i lies between its margin line and the hyperplane, and for ξi > 1 it lies
on the wrong side of the hyperplane. Observations on the correct side of the margin yield
ξi = 0. Hence, C is a tuning parameter that restricts the total amount of observations that
may lie on wrong sides. Only those observations with ξi ≥ 0 affect the hyperplane and are
called support vectors.

In this work, we employ two types of SVM models: the linear SVM and the non-linear
SVM with a radial basis function (RBF) kernel.

In order to define a proper kernel, the kernel function k must be a proper inner product
for an introduction to inner product spaces (see Young, 1988). In particular, the kernel matrix
K with Ki, j := k(xi, x j) must be positive semi-definite. One of the most popular kernels is the
RBF kernel, defined by

kRBF(x1, x2) := exp
(
−
∥x1 − x2∥

2

2γ2

)
,

where γ is an additional kernel hyperparameter. A manual grid search over hyperparameter
values is conducted to fine-tune the parameters for the RBF kernel, identifying the optimal
SVM model. The performance of this tuned SVM model is then assessed using the testing
dataset, with the process carried out using the svm function from the e1071 R package
(Dimitriadou et al., 2009).

In Buhamra et al. (2025), we conduct a comparative analysis to predict and evaluate the


CHAPTER 2. STATISTICAL METHODS 41

performance of machine learning methods versus traditional regression techniques obtained
from Buhamra et al. (2024) in the context of sports data, specifically focusing on men’s Grand
Slam tennis tournaments.

2.3.5 Artificial neural networks (ANNs)

An alternative approach that can model highly complex functions of the input features is
artificial neural networks (ANNs). The term “neural network” has its origins in the field of
neuroscience and early computational models inspired by biological neural systems. The
concept dates back to McCulloch and Pitts (1943), who developed the first mathematical
model of artificial neurons. Their work laid the foundation for ANNs, influencing later
developments like perceptrons (Rosenblatt, 1958) and deep learning models. ANNs are
composed of layers of interconnected nodes, or neurons, which process input data through
weighted connections and activation functions. Typically, an ANN consists of an input layer,
one or more hidden layers, and an output layer, where each layer transforms the data before
passing it to the next.

In this subsection, we focus on neural networks as effective models for statistical pattern
recognition. Specifically, we concentrate on the class of neural networks that have demon-
strated the highest practical value, namely the multilayer perceptron. This representation has
the potential to reveal significant relationships between features that a lower-dimensional
model might overlook. However, employing ANNs also introduces new challenges.

ANNs can capture complex relationships between various match features (in the field
of sport analytics). However, they are often considered ’black box’ models, as they do not
provide transparency into the decision-making process, making interpretation difficult. In
addition to tuning hyperparameters, developing an effective tennis prediction model re-
quires a careful selection of these parameters to achieve optimal performance. Moreover,
ANNs are especially prone to overfitting, particularly when the number of hidden layers or
neurons is large compared to the number of training examples. So one could say that an
ANN’s ability to model non-linear relationships comes at the cost of increased susceptibil-
ity to overfitting, a larger number of hyperparameters to tune, and higher computational
expense. Nevertheless, this interesting model merits our attention and further investigation.

In such a model, each neuron computes a value based on its inputs, which is then passed
on to other neurons. A feedforward network is represented as a directed acyclic graph.
Typically, ANNs are structured into multiple layers, with neurons in non-input layers being
fully connected to all neurons in the preceding layer. Each connection in the network has
an associated weight, and a neuron uses these inputs and their corresponding weights to
calculate an output value. A common approach for this calculation is a non-linear weighted
sum:

f (x) = K

 n∑
i=1

wixi

 ,


CHAPTER 2. STATISTICAL METHODS 42

where wi is the weight of input xi. The non-linear activation function K(·) enables the network
to solve complex problems using a relatively small number of neurons. Among the various
sigmoid functions, the logistic function is commonly used for this purpose. Match prediction
can be performed by feeding the player and match features into the neurons of the input
layer and propagating the values through the network. When a logistic activation function
is applied, the network’s output can represent the probability of winning the match. Which
means if a logistic activation function is used, the output of the network can represent a
probability or a classification score. Specifically, the logistic function maps input values to a
range between 0 and 1, making it suitable for binary classification tasks where the output can
be interpreted as the probability of one class (e.g., class 1) occurring. The closer the output
is to 1, the larger the probability that the input belongs to the positive class, and the closer it
is to 0, the larger the probability it belongs to the negative class.

There are various training algorithms designed to optimize the network’s weights in or-
der to produce the most accurate outputs for a given set of training examples. For instance,
the backpropagation algorithm utilizes gradient descent to minimize the mean square error
between the target values and the network’s outputs. Somboonphokkaphan et al. (2009)
used a three-layers feedforward ANN for match prediction, employing the backpropaga-
tion algorithm. Multiple networks with different sets of input features were trained and
compared. The best-performing model had 27 input nodes, representing both player and
match features, and achieved an average accuracy of approximately 75% in predicting match
outcomes during the 2007 and 2008 Grand Slam tournaments.

Considering the network structure, the ANN used in this work follows a multilayer
perceptron (MLP) architecture, which is a type of feedforward network where information
flows in one direction from the input layer through one or more hidden layers to the output
layer. The network is trained using backpropagation, a gradient-based optimization method
that adjusts weights to minimize prediction errors.

This structure enables MLPs to learn complex patterns and make accurate predictions.
These networks are widely used in tasks such as image recognition, speech processing,
and predictive analytics due to their ability to model complex patterns effectively. Within
the frame work of sport analytics, particularly in tennis, they are used to predict match
outcomes, assess player performance, and analyze playing styles. By processing features
such as player rankings, recent form, head-to-head records, and match conditions, ANNs
can identify patterns and generate probabilities for match predictions.

One key aspect of the network structure is hyperparameter tuning, which involves ad-
justing key parameters to optimize model performance and improve its ability to generalize
to new data. These hyperparameters control the learning behavior and impact the accuracy
of the neural network. In this work, we focused on the following parameters:

• size: The number of neurons in the hidden layer, which defines the model’s complexity.
A larger number of neurons can result in a more complex model but also increases the
risk of overfitting. Therefore, grid search is used to select the values that minimize the


CHAPTER 2. STATISTICAL METHODS 43

error.

• decay: This parameter acts as a regularization factor. A smaller decay value will
correspond to a larger learning rate, while a larger decay value corresponds to a slower
learning rate. Tuning decay helps control overfitting and the rate at which the network
learns. The best values for decay is obtained by defining grid search.

• maxit: The maximum number of iterations (epochs) for training. If the decay is large,
fewer iterations may be required for convergence. If the decay is low, more iterations
may be needed for the model to converge. The convergence-based is used for tuning
the maxit parameter. It adjusts the number of iterations dynamically based on whether
the loss is still improving, helping to find a suitable stopping point for the neural
network training process. This approach avoids the need for exhaustive searching
over a range of maxit values. I.e., the method adapts the number of iterations based
on the model’s convergence, ensuring the training process is computationally efficient
while still trying to improve model performance.

• activation function: The most commonly used activation functions in neural net-
works are sigmoid-based functions. One example is the logistic (sigmoid) function,
which restricts output values to the range [0,1] and is frequently used in logistic re-
gression. Another option is the hyperbolic tangent (tanh) function, which outputs
values between [-1,1]. According to LeCun et al. (2012), symmetric activation func-
tions like tanh(·) can improve training efficiency by reducing training time. However,
here we employed the sigmoid function for the classification task, as it is well-suited
for predicting binary outcomes such as match wins and losses.

As shown by Hornik (1991), a single hidden layer with a finite number of neurons is suf-
ficient to approximate any continuous function. In this work, we used a single hidden layer.
This concept is known as the universal approximation theorem for neural networks Csáji
et al. (2001). However, the number of neurons in the hidden layer remains a hyperparameter
that needs to be optimized. Note that a neural network without hidden layers is essentially
equivalent to logistic regression. It is the inclusion of hidden layers that enables the network
to perform non-linear classification.

Further background and detailed information on ANN models can be found in Bishop
(1995) and Bishop and Nasrabadi (2006).

A visualization of a fully connected feedforward neural network is shown in Figure 2.8. A
network is considered feedforward as the connections between neurons do not form a cycle.
A neural network always consists of at least an input layer and an output layer. Additionally,
the network may include one or more intermediate layers that connect the input and output
layers. These intermediate layers are called hidden layers. Figure 2.8 illustrates the structure
of the ANN, which includes input nodes in the input layer, a variable number of nodes in
the hidden layer, and a single output node.


CHAPTER 2. STATISTICAL METHODS 44

Each layer consists of a specific number of neurons, with the number of neurons in the
input and output layers determined by the requirements of the problem or the specific case
being investigated. In our example, seven input neurons are used, each corresponding
to a feature. The output layer consists of a single neuron, as we only need to predict
the probability of a win. The values of the neurons in the input layer are equal to the
corresponding feature. The values of a neuron h j in the subsequent layer is computed as
follows:

h j = ϕ

b j +

n∑
i=1

xiwi j

 , (2.5)

where b is trainable bias weight unique to every neuron and ϕ(xi) is an activation function.
The bias weight fulfills a similar role to the intercept in logistic regression and is not strictly
required for our problem. The activation function introduces non-linearity into the model.
For this, we used commonly chosen and suitable activation functions, specifically the logistic
sigmoid function (Bishop and Nasrabadi, 2006; Goodfellow et al., 2016 and Orr and Müller,
1998).

By calculating equation (2.5) for each neuron beyond the input layer, the neuron values
are propagated through the network layers until they reach the output layer. The values
of the output layer neurons represent the neural network’s predictions. For binary classifi-
cation task, a single output neuron with the logistic (sigmoid) activation function is used.
During training, the network weights are optimized using a gradient descent approach, with
backpropagation employed to compute the gradients (Hecht-Nielsen, 1992).

The implementation is done in R using the nnet package by Ripley (2002). For visualiza-
tion, the plotnet function from the NeuralNetTools package (Beck, 2018) is used.
In the following, we will describe the layers (structure) of the ANN corresponding to Fig-
ure 2.8.

Input Layer

The input layer represents the features used to predict the outcome. In this model, the input
features are Prob, Elo, Points, Age, Age.30, Age.int and Rank.

Hidden Layer

The hidden layer processes input features and extracts meaningful patterns. This model
uses 3 hidden nodes (size = 3), which act as intermediate computational units.

Output Layer

The output layer generates the final prediction. Since this is a binary classification problem
(win or loss), the output layer consists of a single neuron, which means that the output is a
value between 0 and 1, representing the probability of winning.


CHAPTER 2. STATISTICAL METHODS 45

I1

I2

I3

I4

I5

I6

I7

prob

elo

points

age

age.30

age.int

rank

H1

H2

H3

O1 win

Iutput Layer 
(7 nodes)

Hidden Layer 
( sigmoid Function)

Output Layer 
(logistic activation)

Figure 2.8: Example of a three layers fully connected feedforward neural network.

The thickness of the arrows in the plot indicates the strength of the connection between
neurons, which is based on the estimated weights from the trained neural network model.

The arrows between nodes (connections) represent the weights, which are adjusted dur-
ing training. The thickness of the arrows in the plot indicates the strength of the connection
between neurons, which is based on the estimated weights from the trained neural network
model. Thicker arrows represent higher-weighted connections, while thinner ones indicate
lower weights. Activation functions introduce non-linearity, enabling the network to learn
complex patterns. Sigmoid with outputs values between (0,1) is used as it useful for binary
classification.

2.4 Interpretable machine learning (IML)

Simple models, such as linear regression or decision trees, are inherently more interpretable
due to their straightforward relationships between input and output. However, this is not
the case for more complex models like RF or XGBoost, which lack this inherent transparency.
Interpretability in machine learning refers to the ability to understand and explain how a
model makes its predictions or decisions. In other words, interpretability enables one to
identify the factors influencing a model’s outcomes and the relationships within the data
that shape those results. Therefore, interpretable machine learning (IML) tools, such as
partial dependence plot (PDP) and individual conditional expectation (ICE) plot, are used
to provide insights into the decision-making process of complex models. The goal of using
IML is to create models that not only yield accurate predictions but also offer explanations


CHAPTER 2. STATISTICAL METHODS 46

for those predictions.

2.4.1 Partial dependence plot (PDP)

The partial dependence plot (PDP) is a global interpretation tool that represents expected
values based on the data distribution (Friedman, 2001). It illustrates the average effect of
a specific feature on predictions while averaging out the influence of other features. As a
global interpretation method, it reveals the overall behavior of the model by considering all
instances and providing insights into the relationship between a feature and the predicted
outcome. Consequently, PDP is particularly useful for understanding general patterns in
the data and for diagnosing model performance.

The PDP is a feature effect method, which illustrates the marginal effect of one or two
features on the predicted outcome of a machine learning model (Friedman, 2001). The partial
dependence function for regression is expressed as

f̂S(xS) = ExC[ f̂ (xS, xC)] =
∫

f̂ (xS, xC) dP(xC) .

where P(·) is the distribution of the features in set C and the features in xS are those for which
the partial dependence function is plotted, while xC represents the other features used in
the machine learning model f̂ (·), which are considered as random variables in this context.
Typically, the set S contains only one or two features, namely those whose effect on the
prediction one aims to understand. The feature vectors xS and xC together form the complete
feature space x. Partial dependence operates by marginalizing the model’s output over the
distribution of the features in set C, allowing the function to reveal the relationship between
the features in S and the predicted outcome. By doing so, we obtain a function depending
solely on the features in S, while still accounting for interactions with other features. The
partial function f̂S is given by

f̂S(xS) =
1
n

n∑
i=1

f̂ (xS, xi,C) ,

i.e., it is estimated by calculating averages on the training data. The partial function shows
the average marginal effect on the prediction for given values of the features in S. Here, xi,C

denotes the actual values from the dataset for the features not of interest, and n represents
the number of instances in the dataset. A key assumption of the PDP is that the features in C
are uncorrelated with those in S. If this assumption is not met, the computed averages may
incorporate data points that are very unlikely or even impossible.

In classification tasks where the machine learning model produces probabilities, the
partial dependence plot shows the probability of a specific class based on various values of
the feature(s) in S.

In addtion, Greenwell et al. (2018) introduced a simple measure of feature importance
that relies on partial dependence. The core concept is that a flat PDP implies the feature is


CHAPTER 2. STATISTICAL METHODS 47

not significant, whereas increased variability in the PDP signifies greater importance of the
feature.

The importance for numerical features is determined by the deviation of each unique
feature value from the average curve (Greenwell et al., 2018) and is given by

I(xS) =

√√√
1

K − 1

K∑
k=1

 f̂S(x(k)
S ) −

1
K

K∑
k=1

f̂S(x(k)
S )


2

.

Note that here the x(k)
S

are the K unique values of feature xS. The range of the PDP values
for the unique categories or categorical features is

I(xS) =
maxk( f̂S(x(k)

S )) −mink( f̂S(x(k)
S ))

4
.

This method for calculating deviation is referred to as the range rule. It provides a rough
estimate of deviation when only the range is known. The denominator of four is derived
from the properties of the standard normal distribution.

Finally, this PDP-based feature importance measure only captures the primary effect of
the feature while ignoring any potential interactions with other features. Another, drawback
of this measure is that it is based solely on unique values. Consequently, a unique feature
value with only one instance is assigned the same importance as a value that has many
instances when computing feature importance. For further details and examples see Molnar
(2020).

2.4.2 Individual conditional expectation (ICE)

The individual counterpart to a PDP is known as the individual conditional expectation
(ICE) plot (Goldstein et al., 2015). Unlike partial dependence plots, which show an average
effect across all data points, ICE plots are especially useful for observing individual-level
variations in predictions. ICE plots display separate curves for each instance in the dataset,
illustrating how predicted values change as the feature of interest is varied. They are
particularly effective in cases where feature interactions exist, as they allow us to see how
different instances might respond differently even under the same feature values.

The purpose of examining individual expectations alongside partial dependencies is
to gain a deeper understanding of how a feature affects predictions on an individual basis.
PDPs reveal the average relationship between a feature and the prediction, which works well
when interactions between the target feature and other features are minimal. However, in
cases where interactions are present, ICE plots offer significantly more insight by displaying
how each instance uniquely responds to changes in the feature. This means that in ICE plots,
for each instance in {(x(i)

S , x
(i)
C )}Ni=1, the curve f̂

(i)
S is plotted against x(i)

S , while x(i)
C remains fixed.

A limitation of ICE plots is that it can sometimes be challenging to discern differences
between individual curves due to variations in their starting predictions. A straightforward


CHAPTER 2. STATISTICAL METHODS 48

solution is to center the curves at a specific value of the feature, showing only the difference in
predictions relative to this point. This modified plot is known as a centered ICE plot (c-ICE).
Anchoring the curves at the lower end of the feature range is often effective. According to
Goldstein et al. (2015), the centered curves are then defined as follows

f̂
(i)
cent = f̂ i − 1 f̂ (xa, x(i)

C ),

where 1 is a vector of ones with the appropriate dimensions (typically one or two), f̂ repre-
sents the fitted model, and xa denotes the anchor point.

An alternative approach to enhance the visual identification of heterogeneity is to examine
the individual derivatives of the prediction function concerning a feature. This results in
the development of a derivative ICE plot (d-ICE). The derivatives of a function (or curve)
reveal whether changes are occurring and their direction. By using a derivative ICE plot,
it becomes easier to pinpoint ranges of feature values where the predictions from the black
box model vary for some instances.

If there is no interaction between the feature being analyzed, xS and the other features,
xC, then the prediction function can be formulated as

f̂ (x) = f̂ (xS, xC) = g(xS) + h(xC),

with
δ f̂ (x)
δxS

= g′(xS).

When there are no interactions, the individual partial derivatives should remain consis-
tent across all instances. If there are differences, it indicates the presence of interactions,
which can be observed in the d-ICE plot. Generally, The derivative ICE plot takes a long
time to compute and is rather impractical (Molnar, 2020).


Chapter 3

Summary of the articles

3.1 Article 1: Modeling and prediction of tennis matches at

Grand Slam tournaments

In the following, a summary of Buhamra et al. (2024) is presented. Through this work,
we explore various approaches for modeling and predicting the outcomes of Grand Slam
tennis tournaments from 2011 to 2022, specifically focusing on the probability that the first-
named player wins. In our analysis, we account for covariate differences by calculating the
difference between the corresponding values of the two players. For instance, the ranking
difference is obtained by subtracting the second player’s Rank from the first player’s Rank,
and the same approach is applied to other covariates such as Age and Elo rating.

Our goal is to determine which modeling approaches are best suited for this purpose.
All methods are based on regression models and incorporate multiple potential covariates.

The evaluation of these regression models is conducted using both leave-one-tournament-
out cross validation and external validation, with respect to three performance measures:
classification rate, predictive Bernoulli likelihood, and Brier score.

To predict and compare the results of the 2022 tournaments with actual outcomes, we
apply a “rolling window” strategy with a continuously updating dataset.

Additionally, we investigate the impact of incorporating non-linear effects and additional
court- and player-specific attributes to assess their relevance. Those two specific aspects are
examined in more detail. This analysis aims to leverage the flexibility and interpretability of
modern regression methods to gain insights into key associations and relationships in pro-
fessional tennis. Furthermore, we introduce the concept of “newcomer” players to estimate
the ability effects for those who have participated in only one tournament.

Our main focus is on the logistic regression model, where a binary target variable y and
covariates x1, . . . , xp are given. In our case, y = 1 denotes the occurrence of a particular event
(defined as “win”) and y = 0 that the event does not occur (defined as “lose”). We aim to
model πi appropriately as a function of the feature variables.

49


CHAPTER 3. SUMMARY OF THE ARTICLES 50

πi = P(yi = 1|xi1, . . . , xip) = E[y|xi1, . . . , xip]

hence, the linear predictor ηi is related to the probability πi by a strictly monotonically
increasing function h : R→ [0, 1], we can also write

πi = h(ηi) = h(β0 + β1xi1 + · · · + βkxip) .

When the number of covariates p becomes very large, estimation may become numerically
unstable. To address this issue, regularization is applied using the least absolute shrinkage
and selection operator (LASSO; Tibshirani, 1996). Here, the penalty term is expressed by

pen(β) =
p∑

j=1

|β j|.

It allows model estimation and variable selection to be performed in one step, as small
coefficients are shrunk to 0. In our case, this approach was applied to both surface-specific
player skills and global player skills. For the former, since tennis is played on various
surfaces, players’ performance may vary depending on the surface. To capture this effect, a
factor variable was introduced, leading to a large set of new (dummy-encoded) covariates.
However, this can result in an excessive number of parameters and unstable estimates. To
address this issue, logistic regression was combined with LASSO regularization. A similar
approach was applied to global player skills. Further details can be found in Section 4.1
of the corresponding article. As the effect of covariates on the target variable can also be
non-linear, splines can be used to capture this flexibility. Therefore, we introduce P-splines,
which incorporate a penalty term to enhance regularization. Specifically, we use P-splines
based on B-splines (Eilers and Marx, 1996). Then, all proposed linear and non-linear models,
including the benchmark model based on probabilities calculated from the average odds
of different bookmakers, are compared with respect to their predictive performance using
leave-one-tournament-out cross validation strategy to select the best model in terms of the
perviously mentioned performance measures. Subsequently, an external validation method
“rolling window” is conducted to evaluate the best models identified in the corresponding
cross validation approach. We also consider betting profit as an additional method to
compare the predictive quality of different models.

The results indicate that within the leave-one-tournament-out cross validation approach,
all models performed very similarly in terms of classification rate, likelihood, and Brier
score. However, when comparing betting profits and the number of bets placed, it was
found that only the spline models yielded positive betting returns, though they placed fewer
bets. Therefore, it can be inferred that these models are more conservative and potentially
safer in terms of betting (see Table 3.1).

The 2022 tournaments were then used as an external validation dataset. The findings from
the cross validation comparison for the rolling window approach were largely confirmed,


CHAPTER 3. SUMMARY OF THE ARTICLES 51

Table 3.1: The values of betting profits and the number of bets placed for leave-one-
tournament-out cross validation approach for different models approaches including linear,
splines and LASSO. In brackets is the ratio of all matches in which a bet was placed. The
values of the winner models are highlighted in bold font.

Explanatory Betting Amount of
variables profit bets

Linear Prob −128.8 2696 (57.1%)
Points, Prob −99.9 2395 (50.7%)
Rank, Prob −110.9 2707 (57.3%)
Age, Prob −150.3 2716 (57.5%)
Rank, Points, Prob −104.3 2418 (51.2%)

Splines Prob 113.7 1170 (24.8%)
Rank, Prob 113.7 1170 (24.8%)
Points, Prob 101.3 1298 (27.5%)
Rank, Points, Prob 101.8 1299 (27.5%)
Elo, Prob 106.9 1175 (24.9%)

LASSO Surface specific −225.3 1891 (40.1%)
General skills −199.3 2002 (42.4%)
Benchmark −311.4 1359 (28.8%)

with performance measure values generally showing slight improvements over those from
the CV. Regarding betting returns, only the spline models achieved a profit, except for the
model using ranking Points and betting Odds, which resulted in a small loss. The proportion
of bets placed increased for all models, although the benchmark model placed significantly
fewer bets than in the cross validation comparison. Notably, the benchmark model again
performed no better than the other models across most performance measures. Additionally,
it was observed that among the linear models, the covariates Rank and Prob yielded the best
results. Within the spline approach, all three models performed nearly equally well. The
LASSO models generally performed worse than the other models and therefore are not
recommended.

In summary, we propose several regression models for predicting Grand Slam tennis
tournaments from 2011 to 2022, focusing on the probability that the first-named player
would win. Models were evaluated using leave-one-tournament-out CV and rolling window
approach across particular performance measures. The study also considered non-linear
effects, player and court-specific abilities, and the inclusion of newcomer players. Spline
models using P-splines showed positive betting returns and demonstrated a conservative
approach to betting. Validation on 2022 data confirmed spline models as top performers,
while LASSO models were less effective. Also, it was notable that all models performed at
least as well as the benchmark model.


CHAPTER 3. SUMMARY OF THE ARTICLES 52

3.2 Article 2: Comparing modern machine learning approaches

and different forecast strategies on Grand Slam tennis

tournaments

Machine learning has recently demonstrated promising results in sports prediction. In
Buhamra et al. (2025), we provide various machine learning approaches for modeling and
predicting outcomes of tennis matches in Grand Slam tournaments.

This study focuses on multiple machine learning approaches, including CT, RF, XGBoost,
SVM and ANNs, applied to matches in men’s Grand Slam tournaments from 2011 to 2022.
We then compare these approaches with the regression models examined in our previous
work from Article 1 (Buhamra et al., 2024).

The models are evaluated using the same validation strategies, dataset, and performance
measures as in Article 1. However, within this work, we introduce an additional validation
strategy with a fixed three-year training window.

This analysis considers a similar selection of potential covariates as in Article 1 (Buhamra
et al., 2024), including player Age, ATP Ranking and Points, betting Odds, Elo rating, and
two additional Age variables that take into account the "optimal" age range for tennis players,
which is typically between 28 and 32 years (Weston, 2014).

Similar to Article 1, the first step involves evaluating the models using Grand Slam
tournaments from 2011 to 2021. The 2022 tournaments are initially excluded and later
utilized as an external validation dataset. To validate the models from the preceding CV
phase, we assess their predictive performance on new, unseen data in a more realistic setting
by applying an expanding validation forecasting approach. In this approach, each remaining
tournament is used sequentially as a test dataset in chronological order, while the training
dataset is updated and enlarged continuously. This scheme, as previously discussed in
Buhamra et al. (2024), provides a progressive view of model performance. An alternative
strategy is to use only recent data for prediction through a rolling window approach. Here,
a fixed three-year training window is applied, with the training dataset rolling forward each
year to include only the most recent three years of data. The results show that within the CV
approach, the selected linear and spline models performed similarly in terms of classification
rate, likelihood, and Brier score, whereas Brier score values for the machine learning models
were more volatile.

For the tournaments from 2022, which were used as an external validation set in an ex-
panding window validation approach, the performance measures generally showed slight
improvement. However, performance varied across those performance measures: XGBoost
achieved the largest classification rate, the linear SVM model excelled in predictive likeli-
hood, and the spline model including both Points and Prob covariates attained the lowest
Brier score, outperforming all other models.

In the rolling window approach, with a training dataset of 12 tournaments (three-years
training window) used to predict each subsequent tournament, the same models as in the


CHAPTER 3. SUMMARY OF THE ARTICLES 53

expanding window validation were evaluated. The results were mostly consistent with the
expanding window strategy, although performance measure values were generally slightly
lower. Performance differed slightly between models and performance measures: the linear
SVM model performed best in both classification rate and predictive likelihood, while the
linear regression model with Points and Prob achieved the lowest Brier score. Table 3.2
provides a summarized overview for the results of ML models based on the performance
measures with respect to the proposed validation strategies.

According to the results obtained from the ML models, only minor differences were
observed across the three validation approaches, and the models performed similarly in
terms of the performance measures.

Within the leave-one-tournament-out CV, the ANNs model achieved the best classifica-
tion rate and Brier score among all models corresponding to this validation strategy, while
the largest likelihood value was obtained by another model, namely SVM (linear).

In the expanding window approach, the XGBoost model achieved the best classification
rate among all models and validation strategies, followed closely by the SVM (linear) with a
value of 0.8295. Also, the SVM (linear) model outperformed the others in terms of predictive
log-likelihood and Brier score.

For the rolling approach with a fixed 3-years window, the SVM (linear) model achieved
the largest classification rate and also yielded the best likelihood value. However, the ANNs
model outperformed others with respect to the Brier score measure.

Table 3.2: Results of the different proposed validation approach for the ML models; best
performer in bold font.

ML Class. Likeli- Brier
models rate hood score

Leave-one-tournament-out CV ANNs 0.8021 0.6828 0.1560
Random forest 0.7978 0.6817 0.1584
SVM (Linear) 0.7961 0.7415 0.1675
SVM (Radial kernel) 0.7829 0.6149 0.1807
XGBoost 0.7777 0.7234 0.1901
Classification tree 0.7486 0.6703 0.2178

Expanding validation approach SVM (Linear) 0.8295 0.7637 0.1446
XGBoost 0.8306 0.7526 0.1562
Random forest 0.8162 0.6734 0.1511
SVM (Radial kernel) 0.7956 0.5858 0.1882
ANNs 0.7918 0.6904 0.1466
Classification tree 0.7577 0.6803 0.2061

Rolling approach with a fixed 3-years window SVM (Linear) 0.8012 0.7424 0.1709
Random forest 0.7971 0.6727 0.1608
XGBoost 0.7755 0.7231 0.2074
SVM (Radial kernel) 0.7729 0.5448 0.2139
ANNs 0.7726 0.6837 0.1554
Classification tree 0.7256 0.6900 0.2678

To sum up, this study explores machine learning methods, including CT, RF, XGBoost,
SVM and ANNs, to model and predict outcomes of men’s Grand Slam tennis matches from


CHAPTER 3. SUMMARY OF THE ARTICLES 54

2011 to 2022, the same period as in Article 1 (Buhamra et al., 2024). These models are
evaluated and compared with regression models using the same strategy as in Buhamra
et al. (2024), with the addition of a rolling window validation strategy using a fixed 3-years
window, focusing on the same performance measures. The results of the considered strategy
(expanding window) indicate that XGBoost achieved the largest classification rate, SVM
(linear) performed best in predictive likelihood, and spline models (with Points and Prob
covariates) consistently attained the lowest Brier score.


CHAPTER 3. SUMMARY OF THE ARTICLES 55

3.3 Article 3: Statistical enhanced learning for modeling and

prediction tennis matches at Grand Slam tournaments

Here, we demonstrate how feature engineering techniques can be effectively applied in
sports analytics to enhance predictive models and derive valuable insights from sports data.

As our main focus is to analyze whether “statistically enhanced features” improve pre-
diction performance, we particularly examine the use of covariates such as Elo, Age.30, and
Age.int, which are not directly available but are derived either through separate modeling
techniques (Elo) or meaningful mathematical transformations (Age.30, Age.int). These co-
variates are used to further enhance statistical models for tennis. Note that, the additional
age variables were created to capture the “optimal” age range for a tennis player, typically
considered to be between 28 and 32 years (Weston, 2014). Our objective is to investigate
whether the enhanced variables have substantial potential to improve the performance of
our proposed models, specifically when compared to more classical covariates, which are
directly available, such as e.g. (Rank and Points).

Furthermore, as machine learning algorithms grow increasingly sophisticated, the need
for interpretability grows as well, especially for models like random forests (RF), whose
predictions can be challenging to understand.

Interpretable machine learning (IML) is a branch of machine learning focused on making
complex models more transparent and understandable. In our work, we enhance model
transparency by applying specific IML tools that provide insights into how these models
make predictions, thereby making them more accessible and reliable.

We introduced combined PDP and ICE plots to comprehensively visualize the rela-
tionship between predictor variables and the predicted outcome. This dual perspective is
essential for understanding the complexity of machine learning models: overlaying the PDP
on the ICE plots reveals general trends (captured by the PDP) while also highlighting the
variability in how individual observations respond to changes in each predictor (captured by
the ICE lines). Individual Conditional Expectation (ICE) plots display the effect of a feature
on the predicted outcome for each individual instance by plotting one line per observation.
Each line shows how the prediction changes as the feature value varies, keeping all other
features constant. This allows us to observe heterogeneity in predictions—different instances
may respond differently to changes in the same feature, which is crucial for identifying in-
teractions or non-linear effects (Goldstein et al., 2015). PDP shows the average effect of a
feature on the model’s prediction by averaging over the effects of other features. This helps
to better understand the relationship between each feature and the prediction.

There are a wide range of applications across various domains within IML area, such as
criminal justice Dressel and Farid (2018), Doshi-Velez and Kim (2017), finance Ustun and
Rudin (2019), healthcare and medical fields Hu et al. (2022), Ouyang et al. (2023), and di-
abetes management Payrovnaziri et al. (2020), where IML, particularly Local interpretable
model-agnostic explanations (LIME) by Ribeiro et al. (2016), is leveraged to provide individ-


CHAPTER 3. SUMMARY OF THE ARTICLES 56

ualized insights into factors affecting blood glucose levels. This analysis incorporates various
methods from Buhamra et al. (2024; 2025), including linear regression, splines, and random
forests. These approaches were then compared within an expanding window strategy to
analyze tennis data from Grand Slam tournaments using the same performance measures
as defined before. When analyzing the results, we find that values vary across different
approaches and performance measures. In general, models tend to perform better when at
least one of these enhanced variables is included. Examining PDPs provided insights into
how each predictor variable influences the model’s predictions. For instance, the PDP for
the differences of Points shows a general upward trend, suggesting that accumulating more
Points positively influences the predicted outcome. Likewise, the PDP for Age.30 indicates
that being closer to the optimal age of 30 is slightly associated to large winning probabil-
ities. Figure 3.1 and Figure 3.2 show the PDP and ICE plots for the covariate differences.
These include the differences between both players in age (Age.30), ranking positions (Rank),
ranking points (Points), and Elo ratings (Elo).

Points Rank

Age 30 Elo

−1000 −500 0 500 1000 −500 0 500 1000

−10 −5 0 5 10 −1000 −500 0 500 1000

0.0

0.2

0.4

0.6

0.8

0.0

0.2

0.4

0.6

0.8

P
re

di
ct

ed
 P

ro
ba

bi
lit

y

Figure 3.1: PDP plot for the random fores model includes Age.30, Elo, Points and Rank as
covariates

Points Rank

Age 30 Elo

−1000 −500 0 500 1000 −500 0 500 1000

−10 −5 0 5 10 −1000 −500 0 500 1000

0.0

0.2

0.4

0.6

0.8

0.0

0.2

0.4

0.6

0.8

P
re

di
ct

ed
 P

ro
ba

bi
lit

y

Figure 3.2: ICE plot for the random forest model including the covariates Age.30, Elo, Points
and Rank

To sum up, we tested various models, including linear regression and random forests, on


CHAPTER 3. SUMMARY OF THE ARTICLES 57

tennis data to assess the impact of enhanced variables, such as Age and Elo, on predictive
accuracy. Additionally, we used IML tools, such as PDP and ICE plots, to improve the
interpretability of machine learning models like RF. In sports analytics, these tools, along
with feature engineering, enhance model insights and performance.


Chapter 4

Discussion and outlook

This work presents a comprehensive comparison of classical regression models with modern
machine learning approaches for modeling and predicting outcomes of men’s Grand Slam
tennis matches. Moreover, we analyze the predictive potential of so-called statistically
enhanced covariates, and take procedures from the field of interpretable machine learning
into account to make complex machine learning models more understandable.

As a starting point, logistic regression models with linear effects are presented. Given
that the relationship between covariates and the target variable may not always be strictly
linear, we also incorporate non-linear effects using spline-based approaches, which are based
on penalized B-splines, i.e. P-splines. Additionally, we explore a regularization method,
LASSO, as an extension to regression models. This approach is particularly useful when a
large number of covariates are constructed, as it helps mitigate issues related to an excessive
number of parameters and potential estimator instability. This regularization method enables
model estimation and variable selection to be performed in one step by shrinking irrelevant
coefficients to zero. Therefore, we investigate the inclusion of both surface-specific and global
player skill parameters within this regularization framework. We also define the concept of
a “newcomer”, identifying players who have competed in only one Grand Slam tournament.
In our analysis, we also consider betting returns based on the average probability of a win
for each respective player, calculated from the average odds provided by several different
betting providers, as included in the deucepackage obtained from https://www.oddsportal
.com/. We continue our analysis using classification rate, predictive Bernoulli likelihood,
and Brier score through leave-one-tournament-out cross validation (CV) followed by external
validation.

Our main finding from the classical regression comparison is that all models performed
similarly in terms of classification rate, predictive likelihood, and Brier score within the cross-
validation framework. However, in terms of betting profits and the number of bets placed,
only the spline models generated positive returns, despite placing fewer bets. Regarding
external validation, the performance measure values typically showed slight improvements
compared to those from the cross validation. In terms of betting returns, all spline models
generated a profit, except for the one that included Points and Prob as covariates, which

58


CHAPTER 4. DISCUSSION AND OUTLOOK 59

resulted in a loss. The LASSO models generally performed worse than the other models
and are therefore not recommended. It is worth mentioning that the benchmark model
performed no better than the other models across most performance measures.

Machine learning (ML) has gained enormous popularity and widespread usage in recent
years, becoming an integral part of various applications. By enabling systems to learn
from data and improve their performance over time, ML has demonstrated the ability to
analyze large datasets, identify patterns, and make predictions, driving innovation and
efficiency across sectors. Furthermore, ML is now more accessible than ever, allowing
organizations to harness the power of data for better decision-making and enhanced user
experiences. Therefore, we continue our analysis and extend our work by using the same
dataset and performance measures, this time applying machine learning methods, with the
aim of investigating whether these methods can enhance the predictions from our previous
work in Buhamra et al. (2024). As a result, we conduct a comparative analysis of various
machine learning models, including classification trees, random forests, XGBoost, SVM, and
ANNs alongside the classical regression models used in our previous work. The models
are validated through different strategies, namely cross validation, expanding window, and
rolling window.

The main findings indicate that in the CV strategy, the selected linear and spline mod-
els exhibited similar performance in terms of classification rate and predictive likelihood.
However, the Brier score values for the ML models exhibited larger volatility, except for the
RF and ANN models, which produced acceptable values comparable to those of the other
models. In the expanding window approach, the performance measure values demonstrated
slight improvements. However, variations were observed among these measures: XGBoost
recorded the highest classification rate, the linear SVM model excelled in predictive likeli-
hood, and the spline model that incorporated both the Points and Prob covariates achieved
the lowest Brier score, outperforming all other models. Within the rolling window approach,
the performance measure values generally somewhat deteriorated, with slight differences
across models and measures; this time, the linear SVM model outperformed the others,
achieving higher results in both classification rate and predictive likelihood, whereas the
linear regression model that included Points and Prob attained the lowest Brier score.

Then, in order to improve predictive performance, we investigated enhanced covari-
ates, particularly Elo ratings and Age variables. This approach demonstrates that models
incorporating these enhanced variables yield better accuracy. Furthermore, to enhance the
interpretability of complex ML models, specifically the RF, we focus on IML tools. By
employing techniques like PDPs and ICE plots, we visually clarify the effects of predictor
variables on model predictions, thereby increasing transparency and accessibility. PDPs and
ICE plots provide valuable insights into both the individual and average impacts of features
on outcomes.

Future considerations involve extending the analysis to female tennis data, though in-
formation availability on female players might be limited. Additionally, other types of tour-


CHAPTER 4. DISCUSSION AND OUTLOOK 60

naments beyond the four Grand Slams could be included, such as world cups and smaller
tournaments. Another avenue for exploration is to conduct 1,000 simulation runs of new
tournaments to estimate the probability of a player winning a tournament. Moreover, one
could explore more advanced machine learning models, such as deep learning techniques,
which could be a promising direction for future research (Bishop, 1995; LeCun et al., 2015).
Finally, further statistically enhanced covariates could be developed. For instance, similar to
the historical match abilities created for soccer teams by Ley et al. (2019), analogous abilities
could be formulated for tennis players.


CHAPTER 4. DISCUSSION AND OUTLOOK 61


References

Almarashi, A. et al. (2024). A novel comparative study of nnar approach with li...

Angelini, E., Di Tollo, G., and Roli, A. (2008). A neural network approach for credit risk
evaluation. The quarterly review of economics and finance, 48(4):733–755.

Apley, D. W. and Zhu, J. (2020). Visualizing the effects of predictor variables in black box
supervised learning models. Journal of the Royal Statistical Society Series B: Statistical
Methodology, 82(4):1059–1086.

Arcagni, A., Candila, V., and Grassi, R. (2022). A new model for predicting the winner in
tennis based on the eigenvector centrality. Annals of Operations Research, pages 1–18.

Auret, L. and Aldrich, C. (2012). Interpretation of nonlinear relationships between process
variables by use of random forests. Minerals Engineering, 35:27–42.

Bartz, E., Bartz-Beielstein, T., Zaefferer, M., and Mersmann, O. (2023). Hyperparameter tuning
for machine and deep learning with R: A practical guide. Springer Nature.

Beck, M. W. (2018). Neuralnettools: visualization and analysis tools for neural networks.
Journal of statistical software, 85:1–20.

Bischl, B., Binder, M., Lang, M., Pielok, T., Richter, J., Coors, S., Thomas, J., Ullmann, T.,
Becker, M., Boulesteix, A.-L., et al. (2023). Hyperparameter optimization: Foundations,
algorithms, best practices, and open challenges. Wiley Interdisciplinary Reviews: Data
Mining and Knowledge Discovery, 13(2):e1484.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford university press.

Bishop, C. M. and Nasrabadi, N. M. (2006). Pattern recognition and machine learning,
volume 4. Springer.

Breiman, L. (2001). Random forests. Machine learning, 45:5–32.

Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and regression
trees. wadsworth int. Group, 37(15):237–251.

Buhamra, N. and Groll, A. (2025). Statistical enhanced learning for modeling and
prediction tennis matches at grand slam tournaments. arXiv preprint arXiv:2502.01613.

62


REFERENCES 63

Buhamra, N., Groll, A., and Brunner, S. (2024). Modeling and prediction of tennis matches
at grand slam tournaments. Journal of Sports Analytics. 10(1), 17-33.

Buhamra, N., Groll, A., and Gerharz, A. (2025). Comparing modern machine learning
approaches and different forecasting strategies for modeling tennis matches at grand
slam tournaments. Journal of Sports Analytics. to appear.

Bühlmann, P. and Hothorn, T. (2007). Boosting algorithms: Regularization, prediction and
model fitting.

Bühlmann, P. and Yu, B. (2003). Boosting with the l 2 loss: regression and classification.
Journal of the American Statistical Association, 98(462):324–339.

Bunker, R., Yeung, C., and Fujii, K. (2024a). Machine learning for soccer match result
prediction. arXiv preprint arXiv:2403.07669.

Bunker, R., Yeung, C., Susnjak, T., Espie, C., and Fujii, K. (2024b). A comparative evaluation
of elo ratings-and machine learning-based methods for tennis match result prediction.
Proceedings of the Institution of Mechanical Engineers, Part P: Journal of Sports Engineering
and Technology, 238(4):305–316.

Candila, V. and Palazzo, L. (2020). Neural networks and betting strategies for tennis. Risks,
8(3):68.

Chen, T. and Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings of
the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages
785–794.

Chen, T., He, T., Benesty, M., and Khotilovich, V. (2019). Package ‘xgboost’. R version,
90:1–66.

Clarke, S. R. and Dyte, D. (2000). Using official ratings to simulate major tennis
tournaments. International transactions in operational research, 7(6):585–594.

Conneau, A., Kiela, D., Schwenk, H., Barrault, L., and Bordes, A. (2017). Supervised
learning of universal sentence representations from natural language inference data.
arXiv preprint arXiv:1705.02364.

Contributors, X. (2024). Xgboost documentation.
https://xgboost.readthedocs.io/en/stable/parameter.html. Accessed: 2024-11-08.

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine learning, 20:273–297.

Csáji, B. C. et al. (2001). Approximation with artificial neural networks. Faculty of Sciences,
Etvs Lornd University, Hungary, 24(48):7.


REFERENCES 64

De Seranno, A. (2020). Predicting tennis matches using machine learning. Master of Science
in Computer Science Engineering.

Del Corral, J. and Prieto-Rodríguez, J. (2010). Are differences in ranks good predictors for
grand slam tennis matches? International Journal of Forecasting, 26(3):551–563.

Díaz-Uriarte, R. and Alvarez de Andrés, S. (2006). Gene selection and classification of
microarray data using random forest. BMC bioinformatics, 7:1–13.

Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D., Weingessel, A., and Leisch, M. F. (2009).
Package ‘e1071’. R Software package, avaliable at http://cran. rproject.
org/web/packages/e1071/index. html.

Doshi-Velez, F. and Kim, B. (2017). Towards a rigorous science of interpretable machine
learning. arXiv preprint arXiv:1702.08608.

Dressel, J. and Farid, H. (2018). The accuracy, fairness, and limits of predicting recidivism.
Science advances, 4(1):eaao5580.

Easton, S. and Uylangco, K. (2010). Forecasting outcomes in tennis matches using
within-match betting markets. International Journal of Forecasting, 26(3):564–575.

Eilers and Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Statistical
Science, 11:89–121.

Eilers and Marx, B. D. (2021). Practical smoothing: The joys of P-splines. Cambridge
University Press.

Fahrmeir, L., Kneib, T., Lang, S., and Marx, B. (2013). Regression. Modells, Methods and
Applications. Springer, Berlin.

Fahrmeir, L. and Tutz, G. (2001). Multivariate Statistical Modelling Based on Generalized Linear
Models. Springer-Verlag, New York, 2nd edition.

Felice, F., Ley, C., Groll, A., and Bordas, S. (2023). Statistically enhanced learning: a feature
engineering framework to boost (any) learning algorithms. arXiv preprint
arXiv:2306.17006.

Fernando, T., Denman, S., Sridharan, S., and Fookes, C. (2019). Memory augmented deep
generative models for forecasting the next shot location in tennis. IEEE Transactions on
Knowledge and Data Engineering, 32(9):1785–1797.

Freund, Y., Schapire, R. E., et al. (1996). Experiments with a new boosting algorithm. In
icml, volume 96, pages 148–156. Citeseer.

Friedman (2001). Greedy function approximation: a gradient boosting machine. Annals of
statistics, pages 1189–1232.


REFERENCES 65

Friedman, Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: a statistical
view of boosting (with discussion and a rejoinder by the authors). The annals of statistics,
28(2):337–407.

Friedman, Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear
models via coordinate descent. Journal of Statistical Software, 33(1):1–22.

Gao, Z. and Kowalczyk, A. (2021). Random forest model identifies serve strength as a key
predictor of tennis match outcome. Journal of Sports Analytics, 7(4):255–262.

Goldstein, A., Kapelner, A., Bleich, J., and Pitkin, E. (2015). Peeking inside the black box:
Visualizing statistical learning with plots of individual conditional expectation. journal of
Computational and Graphical Statistics, 24(1):44–65.

Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. (2016). Deep learning, volume 1. MIT
press Cambridge.

Greenwell, B. M., Boehmke, B. C., and McCarthy, A. J. (2018). A simple and effective
model-based variable importance measure. arXiv preprint arXiv:1805.04755.

Groll, A., Hvattum, L. M., Ley, C., Popp, F., Schauberger, G., Van Eetvelde, H., and Zeileis,
A. (2021). Hybrid machine learning forecasts for the uefa euro 2020. arXiv preprint
arXiv:2106.05799.

Groll, A., Ley, C., Schauberger, G., and Van Eetvelde, H. (2019). A hybrid random forest to
predict soccer matches in international tournaments. Journal of quantitative analysis in
sports, 15(4):271–287.

Gu, W. and Saaty, T. L. (2019). Predicting the outcome of a tennis tournament: Based on both
data and judgments. Journal of Systems Science and Systems Engineering, 28(3):317–343.

Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. (2002). Gene selection for cancer
classification using support vector machines. Machine learning, 46:389–422.

Hastie, T., Tibshirani, R., Friedman, J. H., and Friedman, J. H. (2009). The elements of
statistical learning: data mining, inference, and prediction, volume 2. Springer.

Hastie, T., Tibshirani, R., and Wainwright, M. (2015). Statistical learning with sparsity.
Monographs on statistics and applied probability, 143(143):8.

Hecht-Nielsen, R. (1992). Theory of the backpropagation neural network. In Neural
networks for perception, pages 65–93. Elsevier.

Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural
networks, 4(2):251–257.


REFERENCES 66

Hovad, E., Hougaard-Jensen, T., and Clemmensen, L. K. H. (2024). Classification of tennis
actions using deep learning. arXiv preprint arXiv:2402.02545.

Hu, C., Tan, Q., Zhang, Q., Li, Y., Wang, F., Zou, X., and Peng, Z. (2022). Application of
interpretable machine learning for early prediction of prognosis in acute kidney injury.
Computational and Structural Biotechnology Journal, 20:2861–2870.

Hughes, M., Hughes, M. T., and Behan, H. (2007). The evolution of computerised
notational analysis through the example of racket sports. International Journal of Sports
Science and Engineering, 1(1):3–28.

James, G., Witten, D., Hastie, T., Tibshirani, R., et al. (2013). An introduction to statistical
learning, volume 112. Springer.

Klaassen, F. J. and Magnus, J. R. (2003). Forecasting the winner of a tennis match. European
Journal of Operational Research, 148(2):257–267.

Kourou, K., Exarchos, T. P., Exarchos, K. P., Karamouzis, M. V., and Fotiadis, D. I. (2015).
Machine learning applications in cancer prognosis and prediction. Computational and
structural biotechnology journal, 13:8–17.

Kuhn, M. (2008). Building predictive models in r using the caret package. Journal of
statistical software, 28:1–26.

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. nature, 521(7553):436–444.

LeCun, Y., Bottou, L., Orr, G., and Müller, K. (2012). Efficient backprop, lecture notes in
computer science (including subseries lecture notes in artificial intelligence and lecture
notes in bioinformatics), lectu (1998) 7700: 9–48. DOI: https://doi.
org/10.1007/978-3-642-35289-8-3.

Ley, C., Wiele, T. V. d., and Eetvelde, H. V. (2019). Ranking soccer teams on the basis of their
current strength: A comparison of maximum likelihood approaches. Statistical Modelling,
19(1):55–73.

Lv, X., Gu, D., Liu, X., Dong, J., and Li, Y. (2024). Momentum prediction models of tennis
match based on catboost regression and random forest algorithms. Scientific Reports,
14(1):18834.

Ma, S. C., Ma, S. M., Wu, J. H., and Rotherham, I. D. (2013). Host residents’ perception
changes on major sport events. European Sport Management Quarterly, 13(5):511–536.

Markopoulou, C. (2024). Sport analytics algorithms for football performance prediction.

Mayr, A., Binder, H., Gefeller, O., and Schmid, M. (2014). The evolution of boosting
algorithms. Methods of information in medicine, 53(06):419–427.


REFERENCES 67

McCulloch, W. S. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous
activity. The bulletin of mathematical biophysics, 5:115–133.

McHale, I. and Morton, A. (2011). A bradley-terry type model for forecasting tennis match
results. International Journal of Forecasting, 27(2):619–630.

Molnar, C. (2020). Interpretable machine learning. Lulu. com.

Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the
Royal Statistical Society, A 135:370–384.

Orr, G. B. and Müller, K.-R. (1998). Neural networks: tricks of the trade. Springer.

Oshiro, T. M., Perez, P. S., and Baranauskas, J. A. (2012). How many trees in a random
forest? In Machine Learning and Data Mining in Pattern Recognition: 8th International
Conference, MLDM 2012, Berlin, Germany, July 13-20, 2012. Proceedings 8, pages 154–168.
Springer.

Osuna, E., Freund, R., and Girosit, F. (1997). Training support vector machines: an
application to face detection. In Proceedings of IEEE computer society conference on computer
vision and pattern recognition, pages 130–136. IEEE.

Ouyang, Y., Cheng, M., He, B., Zhang, F., Ouyang, W., Zhao, J., and Qu, Y. (2023).
Interpretable machine learning models for predicting in-hospital death in patients in the
intensive care unit with cerebral infarction. Computer Methods and Programs in Biomedicine,
231:107431.

Pan, H., Luan, Q., and Cong, F. (2024). An integrated tennis momentum exploration model
based on logistic regression and xgboost. In Proceedings of the 2024 International Conference
on Smart City and Information System, pages 535–540.

Park, M. Y. and Hastie, T. (2007). L1-regularization path algorithm for generalized linear
models. Journal of the Royal Statistical Society Series B, 19:659–677.

Payrovnaziri, S. N., Chen, Z., Rengifo-Moreno, P., Miller, T., Bian, J., Chen, J. H., Liu, X., and
He, Z. (2020). Explainable artificial intelligence models using real-world electronic health
record data: a systematic scoping review. Journal of the American Medical Informatics
Association, 27(7):1173–1185.

Probst, P., Wright, M. N., and Boulesteix, A.-L. (2019). Hyperparameters and tuning
strategies for random forest. Wiley Interdisciplinary Reviews: data mining and knowledge
discovery, 9(3):e1301.

R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation
for Statistical Computing, Vienna, Austria.

Ramosaj, B. (2020). Analyzing consistency and statistical inference in random forest models.


REFERENCES 68

Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). " why should i trust you?" explaining the
predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international
conference on knowledge discovery and data mining, pages 1135–1144.

Ripley, B. D. (2002). Modern applied statistics with S. springer.

Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and
organization in the brain. Psychological review, 65(6):386.

Sampaio, T., Oliveira, J. P., Marinho, D. A., Neiva, H. P., and Morais, J. E. (2024).
Applications of machine learning to optimize tennis performance: a systematic review.
Applied Sciences, 14(13):5517.

Savareh, B. A., Bashiri, A., Behmanesh, A., Meftahi, G. H., and Hatef, B. (2018).
Performance comparison of machine learning techniques in sleep scoring based on
wavelet features and neighboring component analysis. PeerJ, 6:e5247.

Schapire, R. E. (1990). The strength of weak learnability. Machine learning, 5:197–227.

Schmid, M. and Hothorn, T. (2008). Boosting additive models using component-wise
p-splines. Computational Statistics & Data Analysis, 53(2):298–311.

Scornet, E. (2017). Tuning parameters in random forests. ESAIM: Proceedings and surveys,
60:144–162.

Scornet, E., Biau, G., and Vert, J.-P. (2015). Consistency of random forests.

Sipko, M. and Knottenbelt, W. (2015). Machine learning for the prediction of professional
tennis matches. MEng computing-final year project, Imperial College London, 2.

Somboonphokkaphan, A., Phimoltares, S., and Lursinsap, C. (2009). Tennis winner
prediction based on time-series history with neural modeling. In Proceedings of the
International MultiConference of Engineers and Computer Scientists, volume 1, pages 18–20.
Citeseer.

Sun, X., Wang, Y., and Khan, J. (2023). Hybrid lstm and gan model for action recognition
and prediction of lawn tennis sport activities. Soft Computing, 27(23):18093–18112.

Svensson, W. (2021). Can statistical models beat benchmark predictions based on rankings
in tennis?

Tay, J. K., Narasimhan, B., and Hastie, T. (2023). Elastic net regularization paths for all
generalized linear models. Journal of statistical software, 106.

Therneau, T., Atkinson, B., Ripley, B., and Ripley, M. B. (2015). Package ‘rpart’. Available
online: cran. ma. ic. ac. uk/web/packages/rpart/rpart. pdf (accessed on 20 April 2016).


REFERENCES 69

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal
Statistical Society, B 58:267–288.

Ünal, T. (2023). Predicting tennis match outcome: a machine learning approach using the
srp-crisp-dm framework. Master’s thesis, Middle East Technical University.

Ustun, B. and Rudin, C. (2019). Learning optimized risk scores. Journal of Machine Learning
Research, 20(150):1–75.

Wei, X., Lucey, P., Morgan, S., Carr, P., Reid, M., and Sridharan, S. (2015). Predicting serves
in tennis using style priors. In Proceedings of the 21th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pages 2207–2215.

Wei, X., Lucey, P., Morgan, S., and Sridharan, S. (2013). Predicting shot locations in tennis
using spatiotemporal data. In 2013 International Conference on Digital Image Computing:
Techniques and Applications (DICTA), pages 1–8. IEEE.

Weston, D. (2014). Using age statistics to gain a tennis betting edge.
http://www.pinnacle.com/en/betting-articles/Tennis/

atp-players-tipping-point/LMPJF7BY7BKR2EY.

Whiteside, D., Cant, O., Connolly, M., and Reid, M. (2017). Monitoring hitting load in
tennis using inertial sensors and machine learning. International journal of sports
physiology and performance, 12(9):1212–1217.

Wilkens, S. (2021). Sports prediction and betting models in the machine learning age: The
case of tennis. Journal of Sports Analytics, 7(2):99–117.

Wood, S. N. (2017). Generalized Additive Models: An Introduction with R. Chapman &
Hall/CRC, London, 2nd edition.

Wright, M. N. and Ziegler, A. (2017). ranger: A fast implementation of random forests for
high dimensional data in C++ and R. Journal of Statistical Software, 77(1):1–17.

Yigit, A. T., Samak, B., and Kaya, T. (2020). An xgboost-lasso ensemble modeling approach
to football player value assessment. Journal of Intelligent & Fuzzy Systems, 39(5):6303–6314.

Young, N. (1988). An introduction to Hilbert space. Cambridge university press.

Yue, J. C., Chou, E. P., Hsieh, M.-H., and Hsiao, L.-C. (2022). A study of forecasting tennis
matches via the glicko model. Plos one, 17(4):e0266838.

Zhang, J., Zhu, Q., Xu, L., and Gong, H. (2024). Research on tennis match evaluation and
prediction model based on data mining. In Fourth International Conference on Applied
Mathematics, Modelling, and Intelligent Computing (CAMMIC 2024), volume 13219, pages
1169–1177. SPIE.


Part II

Publications

70


Journal of Sports Analytics 10 (2024) 17–33
DOI 10.3233/JSA-240670
IOS Press

17

Modeling and prediction of tennis matches
at Grand Slam tournaments

N. Buhamra, A. Groll∗ and S. Brunner
Department of Statistics, TU Dortmund University, Dortmund, Germany

Received 7 September 2022
Accepted 11 December 2023
Published 29 February 2024

Abstract. In this manuscript, different approaches for modeling and prediction of tennis matches in Grand Slam tournaments
are proposed. The data used here contain information on 5,013 matches in men’s Grand Slam tournaments from the years
2011–2022. All regarded approaches are based on regression models, modeling the probability of the first-named player
winning. Several potential covariates are considered including the players’ age, the ATP ranking and points, odds, elo rating
as well as two additional age variables, which take into account that the optimal age of a tennis player is between 28
and 32 years. We compare the different regression model approaches with respect to three performance measures, namely
classification rate, predictive Bernoulli likelihood, and Brier score in a 43-fold cross-validation-type approach for the matches
of the years 2011 to 2021. The top five optimal models with highest average ranks are then selected. In order to predict and
compare the results of the tournaments in 2022 with the actual results, a comparison over a continuously updating data set via
a “rolling window” strategy is used. Also, again the previously mentioned performance measures are calculated. Additionally,
we examine whether the assumption of non-linear effects or additional court- and player-specific abilities is reasonable.

Keywords: Grand Slam tournaments, Tennis matches, prediction, model selection, cross validation, penalization

1. Introduction

In recent years, several approaches to the statis-
tical modeling of tennis matches and tournaments
have been proposed and the existing methods for pre-
dicting the probability of winning matches in tennis
have been expanded. Then, when all matches can be
predicted, also winning probabilities for a whole tour-
nament could potentially be calculated. For instance,
Clarke and Dyte (2000) used the official Associa-
tion of Tennis Professionals (ATP) computer tennis
rankings to predict a player’s chance of winning via
logistic regression. Arcagni et al. (2022) extended the
approach of rating calculations to determine the prob-
ability that a player will win a match. The usage of
this centrality measure allows the ratings of the whole
set of players to vary every time there is a new match,

∗Corresponding author: A. Groll, Department of Statistics,
TU Dortmund University, Vogelpothsweg 87, 44221 Dortmund,
Germany. E-mail: groll@statistik.tu-dortmund.de.

and the resulting ratings are then used as a covari-
ate in a simple logit model. Klaassen and Magnus
(2003) used a large (live) data set from the Wimble-
don predictions during the event. Hence, their work is
suitable for the betting market. Easton and Uylangco
(2010) used Klaassen and Magnus’ model and com-
pared it with bookmakers’ odds on a point-by-point
basis. They verified that bookmakers’ odds are a good
predictor of outcomes of both men’s and women’s
tennis matches. Gu and Saaty (2019) predicted the
outcome of tennis matches of Grand slam tourna-
ments as well as of the ATP and the Women’s Tennis
Association (WTA) using both data and (unquali-
fied, subjective) judgments, and this way identified
numerous factors and systematically prioritized them
subjectively and objectively, so as to improve the
accuracy of the prediction. In McHale and Morton
(2011), a Bradley-Terry type model was proposed for
forecasting the top tier of the WTA and ATP com-
petition. They considered surface (hardcourt, carpet,
clay or grass) influence on match outcomes. They

ISSN 2215-020X © 2024 – The authors. Published by IOS Press. This is an Open Access article distributed under the terms
of the Creative Commons Attribution-NonCommercial License (CC BY-NC 4.0).


18 N. Buhamra et al. / Modeling and prediction of tennis matches at Grand Slam tournaments

found that a model incorporating information on
match score, play data and surface can give a higher
accuracy of forecasting than ranking-based mod-
els. However, they did not consider the dependence
between factors. Ma et al. (2013) applied another
logistic regression model on 16 variables representing
player skills and performance, player characteristics
and match characteristics. Yue et al. (2022) proposed
a statistical approach for predicting the match out-
comes of Grand Slam tournaments, using exploratory
data analysis. The proposed approach introduces new
variables via the Glicko rating model, a Bayesian
method commonly used in professional chess.

Recently, machine learning models have been
utilized to predict the winner of tennis matches. Som-
boonphokkaphan et al. (2009) proposed a method
to predict the winner of tennis matches using both
match statistics and environmental data based on
a Multi-Layer Perceptron (MLP) equipped with a
back-propagation learning algorithm. MLP is a basic
sort of Artificial Neural Networks (ANN). ANNs
are a powerful technique to solve real world classi-
fication problems and are particularly effective for
predicting outcomes when the networks rely on a
large database and are able to deal with incom-
plete information or noisy data. In addition, there
are several studies for predictions based on machine
learning approaches (see, for example, Whiteside et
al., 2017). Also Wilkens (2021) focused on machine
learning approaches and extended previous research
by conducting and applying a wide range of machine
learning techniques. He used a variety of models such
as neural networks and random forests in combina-
tion with one of the most extensive data sets in the
area of professional men’s and women’s tennis sin-
gles matches. Moreover, the author showed that the
average prediction accuracy cannot be increased to
more than about 70%. Bayram et al. (2021) defined
a new method based on network analysis to extract a
new feature that represents the player’s skill on each
surface considering the variation of his performance
over time, which is believed to have a big effect on
the match outcome. In addition, advanced machine
learning paradigms such as Multi-Output Regression
and Learning using privileged information have been
applied, and the results were compared with standard
machine learning approaches, such as regression tree-
and forest-based methods as well as single- and multi-
target regression techniques. Evaluating the results
showed that the proposed methods provide more
accurate predictions of tennis match outcomes than
classical approaches frequently used in the literature.

Chitnis and Vaidya (2014) considered performance
assessments of professional tennis players using Data
Envelopment Analysis in historical matches played in
ATP world tour rankings. Radicchi (2011) novel evi-
dence of the utility of tools and methods of network
theory in real application. Del Corral and Prieto-
Rodrıguez (2010) estimated separate probit models
for men and women using Grand Slam tennis match
data from 2005 to 2008. The explanatory variables
are divided into three groups: a player’s past perfor-
mance, a player’s physical characteristics, and match
characteristics. The accuracies of the different models
were evaluated both in-sample and out-of-sample by
computing Brier scores and comparing the predicted
probabilities with the actual outcomes from 2005 to
2008 and from the 2009 Australian Open. In addition,
they used bootstrapping techniques, and evaluated the
out-of-sample Brier scores for the 2005–2008 data.

Statistical and machine learning techniques have
also been applied in other racket sports. For exam-
ple, Lennartz et al. (2021) focused on international
table tennis and analyzed matches of recent holdings
of the Men’s World Cup and the Grand Finals of the
Men’s ITTF World Tour. Also, they applied statis-
tical and machine learning methods on table tennis
tournaments for prediction with a correct classifica-
tion rate of around 75% by a random forest and 74%
by a penalized generalized linear logit model. Even
though both models based their predictive power
mainly on the official table tennis rankings and points,
variables like age, playing handedness or individual
strength turned out to be important additional factors.

In the present work, we concentrate on several
regression-based modeling approaches with a focus
on tennis Grand Slam tournament data. While com-
plex machine learning models often have the capacity
to further increase the predictive performance, they
also come with the substantial draw-back of loos-
ing interpretability. Hence, in this work we want
to fully exploit the flexibility of modern regression
approaches, using their high level of interpretability
to gain some knowledge to understand certain asso-
ciations and relations in professional tennis. For this
purpose, a data set was compiled using the R package
deuce (Kovalchik, 2018). It contained information
on 5,013 matches at men’s Grand Slam tournaments
from 2011 to 2022. Several potential covariates are
considered including the players’ age, the ATP rank-
ing and points, odds, Elo rating as well as two
additional age variables, which take into account that
the “optimal” age of a tennis player is between 28
and 32 years (Weston, 2014). We present different


N. Buhamra et al. / Modeling and prediction of tennis matches at Grand Slam tournaments 19

regression approaches which are then compared with
respect to various performance measures. Two spe-
cific aspects that we will investigate in more detail
are whether it makes sense to (i) allow for non-
linear covariate effects or to (ii) incorporate additional
court-specific abilities.

The rest of the article is structured as follows.
Section 2 introduces the present data set, explains
the variables and defines the objectives. Then, in
Section 3, different modeling approaches are intro-
duced, including logistic regression, regularization
using the Least Absolute Shrinkage and Selection
Operator (LASSO) and non-parametric spline regres-
sion. In Section 4, these modeling approaches are
compared via 43-fold cross-validation (CV) for the
Grand Slam tournaments from 2011 to 2021. For
this comparison, various performance measure are
defined. Then, the performance measures are calcu-
lated on the Grand Slam tournament 2022 using a
rolling window approach. We discuss the obtained
results in Section 5. Section 6 summarizes the main
results and gives a final overview.

2. Data

In the following, both the used data set and the
variables it contains are described in more detail.
Subsequently, the objective of this work is specified.

The underlying data set was compiled using
the R package deuce (Kovalchik, 2018). It con-
tains information on 5,013 matches in men’s Grand
Slam tournaments from 2011 to 2022. The variables
included in the data set are listed and described below.
Unless stated otherwise, the variables were directly
included in the data sets of the package.

Player1: The name of the (randomly chosen)
first-named player.

Player2: The name of the (randomly chosen)
second-named player.

Year: The year when the match took place
(ranging from 2011 to 2022).

Tournament: The Grand Slam tournament where
the players met (Australian Open,
French Open, Wimbledon, US Open).

Surface: A factor variable describing the sur-
face on which the match was played
(either “hard”, “clay” or “grass”).

Victory: A dummy variable capturing whether
the first-named player did win the
match (1: yes, 0: no).

Age: A metric predictor collecting the age differ-
ence of the players in years; age of the 2nd
player was subtracted from the age of the
1st player. Note that players’ ages were not
given directly and had to be calculated from
the player’s date of birth as well as the date
of the relevant match.

Prob: Difference in the probabilities that the
respective player will win. These were
calculated from the average odds for
both players (see AvgProb1 and AvgProb2
below). The probability of the 2nd player
winning was subtracted from the probabil-
ity of the 1st player winning. This variable
is later used for modeling.

Rank: Difference in the players’ ranking positions.
These were calculated by subtracting the
rank of the 2nd player from the rank of the
1st player. For this, the rank of the player
at the start of the tournament was used. The
position in the ranking is based on the ATP
ranking points.

Points: Difference in the ATP ranking points. The
points of the 2nd player were subtracted
from the points of the 1st player. World
ranking points are awarded for each match
won per tournament. Wins in later rounds
of a tournament are valued higher than
wins in the first rounds of a tournament.
Points earned in a tournament expire after
52 weeks.

Elo: Difference of the Elo-numbers. The Elo-
number of the 2nd player was subtracted
from the Elo-number of the 1st player. The
Elo-number takes into account whether a
player played against a higher or lower
ranked player. The Elo-number increases
more if a player wins against a player with
a high Elo-number than if he wins against
a player with a lower Elo-number. It is
updated after each match of a player.

Age.30: To calculate this variable, first the distance
between the age of the players and reference
age 30 was calculated and then the corre-
sponding difference was calculated as for
the variable Age. It is assumed that the stan-
dard Age variable introduced above does
not contain enough information. For exam-
ple, a 25-year-old player typically has an
advantage over a 20-year-old one, while a
40-year-old player typically has a disadvan-
tage over a 35-year-old one. However, in


20 N. Buhamra et al. / Modeling and prediction of tennis matches at Grand Slam tournaments

both cases, the age difference is 5 years.
As Weston (2014) argued, the optimal
age of tennis players is between 28 and
32 years. Therefore, the middle of the
interval, i.e. 30, was used as reference
age here.

Age.int: For this feature, the distance to the closer
limit of the interval was used, i.e. for
players younger than 28 the distance to
28 was calculated and for players older
than 32 the distance to 32 was calculated.
For players between 28 and 32 the dis-
tance was set to 0. Then, the difference
was calculated as for the variable Age.

AvgProb1: Average probability for a win by
Player1, calculated from the aver-
age odds of several different betting
providers, which were included in
the deuce package as obtained from
https://www.oddsportal.com/.

AvgProb2: Average probability for a win by
Player2, calculated from the average
odds of the betting providers. Together
with AvgProb1, it sums up to 1 per match.

B365.1 Winning odds for Player1 obtained from
the specific bookmaker Bet365. For
example, with winning odds of 2.31 for
Player1, one gets back 2.31 money units
if he wins, if previously one money unit
was placed on this event. The odds from
this specific bookmaker are later used to
calculate the betting returns.

B365.2 Same as B365.1, but from the perspective
of Player2.

It should be noted that the data set does not include
matches in which one of the two players retired or
was unable to compete, e.g. due to injury, such that
the other player won without actually playing the
match. These matches do not contain any informa-
tion and could distort the results and are therefore
excluded. Furthermore, the data set does not contain
any missing values.

In addition, it should also be noted that there are
some players which only participated in a single
Grand Slam tournament. For instance, Camilo Ugo
Carabelli participated only at the French Open 2022
and did not participate in any of the Grand Slam
tournaments from 2011 to 2021. Also, as another
example, Jan Satral participated for the very first
time at the US Open 2016 and never again after this.
Altogether, there are 70 players which participated in

only one single Grand Slam tournament. Therefore,
in our leave-one-tournament-out strategy for compar-
ing the predictive power of the different modeling
approaches, for these players no estimates of their
abilities are available, if their matches are part of the
tournament which is currently used as test data. In
order to obtain nonetheless reasonable estimates for
the player ability effects of such players, which can
then be used for the prediction of the currently left-
out Grand Slam tournaments, we group all players
that have only participated in a single tournament in a
group called “newcomer”. Hence, these players share
the same player-specific ability parameters.

Based on this data set, the best possible regression
model for predicting tennis matches at Grand Slam
tournaments is sought. We investigate which model
approaches are particularly suitable for this purpose.
Among other things, it will be examined whether
the assumption of non-linear influences or additional
surface- and player-specific abilities is reasonable.
For the different modeling approaches, we then deter-
mine models which are optimal with respect to certain
performance measures, and compare those with each
other.

3. Statistical methods

In the following, the statistical methods used in this
work are briefly introduced. These include logistic
regression and parameter estimation using maximum
likelihood. Based on this, we shortly motivate regu-
larization and define the so-called LASSO-estimator.
Finally, spline regression with P-splines is described.

3.1. The logistic regression model

For n individuals, let observations
(yi, xi1, . . . , xip), i = 1, . . . , n of a binary tar-
get variable y and covariates x1, . . . , xp be given.
In the logistic regression model, the relationship
between y and metric, categorical or binary covari-
ates is examined. Here, y = 1 denotes the occurrence
of a particular event (typically defined as “success”)
and y = 0 that the event does not occur (also defined
as “failure”). Then,

πi = P(yi = 1|xi1, . . . , xip) = E(y|xi1, . . . , xip)

is the (conditional) probability for the occurrence of
yi = 1, given the covariate values xi1, . . . , xip. The
aim is to model πi appropriately as a function of the
feature variables. Therefore, the linear predictor ηi is


N. Buhamra et al. / Modeling and prediction of tennis matches at Grand Slam tournaments 21

related to the probability πi by a strictly monotoni-
cally increasing function h : R→ [0, 1], i.e.

πi = h(ηi) = h(β0 + β1xi1 + · · · + βpxip) .

The function h(·) is also called the response function.
With the help of the inverse function g = h−1, we can
also write ηi = g(πi).

The estimators for β0, . . . , βp are obtained by
numerical maximization of the log-likelihood, e.g.
by using the Fisher scoring or the Newton-Raphson
method, see, e.g., Nelder and Wedderburn (1972).
Generally, for more details on GLMs, see also
Fahrmeir and Tutz (2001).

3.2. Regularization

If the number of covariates p becomes very large,
estimation becomes numerically unstable (see, e.g.,
Fahrmeir et al., 2013). This can also be the case if
there is some substantial mulitcollinearity between
the columns of the design matrix X = (x1, . . . , xn)�.
To address this problem, a penalty term pen(βββ) is
added to the negative log-likelihood in the logit
model. According to Park and Hastie (2007), the
estimator is then obtained by minimizing

β̂ββpen = arg min
βββ

(−l(βββ) + λ · pen(βββ)),

where λ is the penalty parameter that controls the
influence of the penalty term on the parameters esti-
mated by the ML method.

The Least Absolute Shrinkage and Selection
Operator (LASSO)

One possibility for penalization is provided by the
Least Absolute Shrinkage and Selection Operator
(LASSO; Tibshirani, 1996). Here the penalty term is
given by

pen(βββ) =
p∑

j=1

|βj|.

It allows model estimation and variable selection to be
performed in one step, as small coefficients are shrunk
to 0. There is no closed-form representation for solv-
ing this minimization problem. Therefore, numerical
optimization methods are used to obtain the optimal
LASSO estimator β̂ββLASSO (see, e.g., Friedman et al.,
2010). To optimize the penalty parameter λ typically
K-fold cross validation can be is used.

3.3. Splines

In the methods introduced above, the influence
of the covariates on the target variable is assumed
to be strictly linear. However, often also non-linear
influences are worthwhile. In order to model these
appropriately and flexibly, so-called splines can be
used. Here, the so-called B-splines (Eilers and Marx,
1996) are used.

B-splines
In principle, with B-splines a non-linear effect f (x)

of a metric predictor can be represented as

f (x) =
d∑

j=1

γjBj(x).

As an unpenalized estimation of a non-linear B-spline
effect often overfits, typically the non-linear effect is
smoothed by using penalized B-splines, i.e. P-splines.

P-splines
Beside the problem of potential overfitting, the

goodness-of-fit of the B-spline approach depends on
the number of selected nodes. To avoid this prob-
lem, various penalization methods exist in the form
of P-splines. Here, a penalized estimation criterion,
which is extended by a penalty term, is used instead
of the usual estimation criterion. For P-splines based
on B-splines (see, e.g., Eilers and Marx, 1996), the
function f (x) is first approximated by a polynomial
spline with many nodes (typically about 20 to 40).
The penalty term then results in

λ

∫ (
f ′′(x)

)2
dx .

This is motivated by the fact that the second derivative
is used as a measure of the curvature of a function. If
this becomes too large, it will be penalized by the term
above. For approximation of the second derivative
exist simple representations, so that the penalty term
results in

λ

d∑

j=3

(
�2γj

)2
,

where �2γj = γj − 2γj−1 + γj−2 (see again Eilers
and Marx, 1996).

The optimal smoothing parameter λ is deter-
mined using generalized CV. For more details on
the methodology, see also Eilers and Marx (2021),


22 N. Buhamra et al. / Modeling and prediction of tennis matches at Grand Slam tournaments

and for more details on the corresponding software
implementation in R, see Wood (2017).

4. Evaluation

In the following, some model approaches which
seem to be suitable to adequately model tennis
matches well, are investigated. These are compared
with each other in a CV-type strategy in order to be
able to select the best model with respect to a selection
of performance measures. In an external validation
with previously unused data, the best models from
the preceding CV-type approach are evaluated. All
calculations and evaluations are performed with the
statistical programming software R (R Core Team,
2022).

4.1. Model selection

To model the outcome of a tennis match appro-
priately, different regression models with different
assumptions can be used. Since the target variable
y (win/loss) is a binary variable, the methods are all
based on logistic regression. Therefore, the model can
be generally formulated as

ln

(
πi

1 − πi

)
= ηi = β0 + xi1β1 + · · · + xipβp.

Since the xi1, . . . , xip, i = 1, . . . , n, are differences
of the covariate values of the players, where the value
of the second player is always subtracted from the
value of the first player, β0 would correspond to
a kind of “home effect” for the first-named player.
However, since the data are composed in such a way
that one of the two players is named first randomly,
β0 cannot be meaningfully interpreted here and is
therefore excluded and set to zero in the following
considerations. Furthermore, it is assumed that the
yi|xi1, . . . , xip are independent for i = 1, . . . , n.

Linear effects
The simplest and most straight-forward model

approach is to assume linear covariate effects in the
linear predictor. Here, nevertheless, covariates must
be selected appropriately. To ensure this, all possible
combinations of the available variables are com-
pared in the CV-type approach. Since there are seven
covariates in the data set, there are

∑7
i=1

(7
i

) = 127
different combinations of these. However, of these
seven covariates, three reflect specific age differ-
ences. Therefore, it is additionally assumed that the

combinations always include a maximum of one age
variable. This results in 63 different combinations.

Non-linear effects (splines)
The assumption of linear effects can possibly be

very limiting and lead to insufficient results. There-
fore, spline models based on B-splines are also
considered. Here, all 63 possible combinations are
again compared with respect to various performance
measures. To obtain smoothness, penalization of the
splines is performed and, hence, P-splines are used.
Furthermore, an additional regularization approach is
used, i.e. an additional variable selection is performed
on the spline effects. This is done by setting the
spline coefficients to zero and is implemented in the
gam-function from mgcv (Wood, 2017) via setting
the select argument to TRUE. For the underlying
methodology of this additional penalization and vari-
able selection approach, see e.g. Marra and Wood
(2011). We will later on see that exemplarily the
effect of the variable probs is non-linear for very low
and high values (see Fig. 1), which also seems to be
reasonable (see our explanation in Section 5).

Surface-specific player skills (LASSO)
Since tennis is played on different surfaces (grass,

hard court and clay court) and these surfaces have
specific characteristics, so that each surface is played
differently, it is plausible to assume that not every
player copes equally well on every surface. For
example, the Spaniard Rafael Nadal is considered as
the “king of clay”, as he has won 14 French Open
titles on this surface. At the same time, however, he
was “only” able to win the Wimbledon title twice,
which is played on grass. In contrast, the Swiss
Roger Federer has already won Wimbledon eight
times, but the French Open only once. In order to
take into account these specific features of players
and surfaces, a corresponding factor variable is
created. Technically, this requires effect coding.
Actually, to the data set artificially columns are
added, whose entries are either 0, 1 or –1. Each
column represents a combination between a player
and one of the corresponding surfaces, i.e. for each
player there are a maximum of three such columns.
If a player has not played on one of the three surface
types, the corresponding columns are omitted. Each
row consists of exactly one entry of –1 and 1,
respectively, while all other entries of the row are
0. The entry 1 is in the column of the combination
of the first player and the corresponding surface on
which the match took place. Analogously, the entry


N. Buhamra et al. / Modeling and prediction of tennis matches at Grand Slam tournaments 23

–1 is in the column of the combination of the second
player and the surface. For example, the line

· · · Nadal.Hard Nadal.Grass Nadal.Clay · · · Federer.Hard Federer.Grass Federer.Clay · · ·
· · · 0 1 0 · · · 0 −1 0 · · ·

means that Rafael Nadal played against Roger
Federer on grass, Nadal was named as the first
player and Federer as the second. The remain-
ing entries in the row are 0, because the match
was played on grass and only these two players
were involved. All these columns are appended
to the existing data set. The newly constructed
variables as well as the already existing variables,
i.e. Age, Ranking, Points, Elo, Prob, Age.30 and
Age.int, are then jointly used as covariates.

As a result, a large number of new covariates is
constructed, namely 1,024, which generally leads
to an extremely large number of parameters being
estimated and the associated estimators becoming
unstable. Therefore, for this approach the logistic
regression model is combined with LASSO regular-
ization. The influences of all covariates is assumed
to be linear here. Via the surface-specific player skill
parameters, the model can detect when a player has
won or lost more often than average on a surface.

Global player skills (LASSO)
Analogously, it can be argued that there are also

players who overall perform even better or worse than
the information of their covariate values would sug-
gest, i.e. who have a generally great or substandard
talent and are therefore more likely to be assessed
as winners or losers. For this purpose, similar to the
previous paragraph, a corresponding factor variable
is created, again using effect coding. In this case, we
add only one column per player. The added columns
again only have the entries 0, 1 or –1, where the val-
ues 1 and –1 occur exactly once per row. In each row,
the value 1 appears at the position belonging to the
first-named player and the value –1 at the position
belonging to the second-named player, all remaining
entries are 0. The following exemplary row

· · · Novak.Djokovic · · · Rafael.Nadal · · · Roger.Federer · · ·
· · · 0 · · · −1 · · · 1 · · ·

indicates that Rafael Nadal played as the second
named player against the first named player Roger
Federer. The column for the player Novak Djokovic
(as well as for all other players), for example, is

then 0, since he did not play. Again, those global
player-specific abilities are then added to the design

matrix and used along with the covariates defined
in Section 2. The columns are then appended to the
already existing record.

Again, a large number of new covariates is con-
structed, namely 426, so again a large number of
player-specific skill parameters has to be estimated.
And, hence, again we extend the logistic regression
model with LASSO penalization and assume linear
covariate effects only.

Benchmark model
As a benchmark model for prediction, we solely

use the probabilities (probs) calculated from the aver-
age odds of the different bookmakers included in the
deuce package. The winning probabilities in the
i-th match π̂i1 and π̂i2 for player 1 and player 2,
respectively, can be derived from the two winning
odds, i.e. oddi1 and oddi2, respectively, according to
Schauberger and Groll (2018) as follows:

π̂i1 =
1

oddi1
1

oddi1
+ 1

oddi2

or π̂i2 =
1

oddi2
1

oddi1
+ 1

oddi2

Note that naturally those probs fullfill π̂i1 + π̂i2 = 1.
Moreover, this way, it is automatically adjusted for
the bookmaker’s margin. These margins result from
the fact that the betting providers artificially lower
their betting odds to gain some profit. This means that
when the inverse values of the odds are directly used
as probabilities, they do not sum up to 1, but to a value
slightly larger than 1. In order to calculate the margin,
the sum of the reciprocals in the denominator is used
for normalization. Here, it is implicitly assumed that
the margin is equally distributed to both players.

4.2. Performance measures

To compare the selection of regression models in
terms of their predictive performance on new, unseen

data, the following criteria are considered. First, use
ỹ1, . . . , ỹn for the true binary outcomes of the n

matches, i.e., ỹi ∈ {0, 1}, i = 1, . . . , n. Further, let
π̂i1 =: π̂i denote the probability, predicted by a cer-
tain model, that player 1 wins the i-th match. The


24 N. Buhamra et al. / Modeling and prediction of tennis matches at Grand Slam tournaments

probability that player 2 wins the match is given by
π̂i2 = 1 − π̂i1 = 1 − π̂i, since y is binary.

Classification rate
The (mean) classification rate indicates how many

matches on average are correctly predicted by a cer-
tain model. It is defined by

1

n

n∑

i=1

1(ỹi = ŷi), where ŷi =
{

1, π̂i > 0.5

0, π̂i ≤ 0.5
,

see, e.g., Schauberger and Groll (2018). Here, large
values indicate a good model. A mean classifica-
tion rate of 0.5 would correspond to a random
classification. It is therefore also desirable that the
classification rate is much larger than 0.5.

Predictive Bernoulli likelihood
The predictive Bernoulli likelihood is based on the

predicted probability for the true outcome and for n

predictions is defined as

π̂
ỹi

i (1 − π̂i)
1−ỹi ,

see again Schauberger and Groll (2018). Once again,
a high value is an indicator of a good model. In the
following, the average likelihood is used for model
comparison.

Brier score
The Brier score is based on the squared distances

between the predicted probability and the actual
(binary) output from the i-th match. It is defined
according to Brier (1950) as

1

n

n∑

i=1

(π̂i − ỹi)
2 .

This is an error measure, so low values indicate a
good model.

Betting profit
Another way to compare the predictive quality of

different models is the betting profit. Let oddi1 and
oddi2 be the odds from a specific betting provider for
a win of the first and second player in the i-th match,
respectively. If one bets one monetary unit on a win
of the respective player, the expected betting returns
for the i-th match are given by

E[returni,player1] = π̂i1 · oddi1 − 1 or

E[returni,player2] = π̂i2 · oddi2 − 1,

because, for instance, if player 1 wins the match
(which due to the model at hand happens with pre-
dicted probability π̂i1), the better, who has previously
invested one money unit (hence the –1), would receive
oddi1 money units, if they has bet on this event (see
Schauberger and Groll, 2018). Hence, if the player
on whom the bet was placed wins, the betting return
is calculated by the player’s odds minus the invest
of one monetary unit. If the other player wins, the
better’s loss is −1 monetary unit.

Principally, the bet should of course be placed on
the match outcome with maximum positive expected
return. If the return is not positive for either outcome,
no bet is placed on the corresponding match. In this
work, to calculate actual, realistic betting returns, the
odds of the specific betting provider Bet365 are used,
i.e. it is assumed that the bets are placed with this
provider. The total betting return is then the sum of
all betting returns across all matches.

4.3. Leave-one-tournament-out cross validation

To evaluate the models, all Grand Slam tour-
naments from 2011 to 2021 are used, i.e. the
tournaments that took place in 2022 are initially not
considered and are used as external validation data
later on. So, from each of the 11 years four tour-
naments are used. A 43-fold CV-type strategy is
performed with these tournaments. The data set then
still contains 4,720 of the original 5,013 matches. The
following scheme is used:

1. From the 43 Grand Slam tournaments present in
the data set, a training data set of 42 tournaments
and a test data set of the remaining tournament
are constructed.

2. Then, all regression models introduced above are
fitted:
– For the models with linear influences, the func-

tion glm from the R package stats (R Core
Team, 2022) is used. As described in Sec-
tion 4.1, there are 63 such models, each using
at most one of the three variables for age.

– For the calculation of the spline models the
function gam from the R package mgcv
(Wood, 2004) is used. Again, there are also
63 different models here.

– In order to be able to compute the two LASSO-
penalized models, first the design matrices
have to be constructed as described in Sec-
tion 4.1. For a proper usage of the LASSO, then
all columns of the design matrix of the train-


N. Buhamra et al. / Modeling and prediction of tennis matches at Grand Slam tournaments 25

ing data need to be standardized. The function
cv.glmnet from the R package glmnet
(Friedman et al., 2010) is used to compute both
models. For this, a 10-fold (inner) CV is first
performed on the current training data to find
the optimal λ which provides the minimum
deviance. The corresponding LASSO model is
used afterwards for prediction. 10-fold CV is
also recommended in Friedman et al. (2010).

– For prediction based on average betting odds,
the probabilities calculated from odds are used
as predicted probabilities.

3. After fitting the respective model, for each match
of the test data the probabilities that the first player
wins are predicted.

4. Steps 1–3 are repeated until each of the 43 tour-
naments has served once as a test data set.

5. Finally, the predicted results are compared with
the actual results and the performance measures
defined in Section 4.2 are calculated.

Table 1 on page 9 shows the results of the five best
models with linear effects and the five best mod-
els with spline effects. Additionally, the results of
the LASSO models and the benchmark model are
shown. The top five models were selected from the
63 models by assigning ranks for each of the three
performance measures, classification rate, predictive
Bernoulli likelihood, and Brier score, with the best
model receiving the highest rank. Average ranks were
assigned if the models had equal performances. In
order to select the best models in terms of all three
measures, their ranks were also averaged per model
and the five models with the highest average ranks
were selected. The models are listed below in such a

way that the best model is determined in first place,
the second best in second place and so on. The over-
all betting performance is also given together with
the per match betting return in brackets. Finally, the
number of bets placed is provided in the last column,
together with the ratio of all matches in which a bet
was placed in brackets.

The classification rate is slightly above 77% for
all models (see 1st column). Differences between the
models can only be seen at the third decimal. The
linear regression model with both rank and probs
as covariates here performs best with a value of
0.7776, i.e. this model predicts the correct outcome
for 77.76% of the matches.

The values of the predictive Bernoulli likelihood
(2nd column) differ somewhat more. It is notice-
able that within one group of model approaches the
likelihood is almost the same (differences are only
seen in the fourth decimal). The average likelihood
of the models with both linear and non-linear effects
is slightly more than 0.69, i.e. the models predict the
correct outcome with an average probability of about
69%. The two LASSO models are just below 0.69. It
is noticeable that the benchmark model has the low-
est likelihood, which is 0.6709. The average betting
provider correctly predicts the outcome of a match
with an average probability of 67.09%.

The differences across model groups in the Brier
scores are rather small, similar to the classification
rate. The models with non-linear effects perform
best in this regard with Brier scores between 0.1546
and 0.1547. The models with linear effects produce
slightly larger values, followed by the LASSO mod-
els. In the case of the Brier score, the benchmark
model again performs worst with a value of 0.1555.

Table 1

Results of the cross validation for the (at most) best five models per model class (best performing model in bold font). In brackets are the
betting profits per match and the ratio of all matches in which a bet was placed

Explanatory Class. Likeli- Brier Betting Amount of
variables rate hood score profit bets

Linear Prob 0.7772 0.6919 0.1549 −128.8 (−0.05) 2696 (57.1%)
Points, Prob 0.7772 0.6917 0.1549 −99.9 (−0.04) 2395 (50.7%)
Rank, Prob 0.7776 0.6918 0.1550 −110.9 (−0.04) 2707 (57.3%)
Age, Prob 0.7766 0.6918 0.1550 −150.3 (−0.05) 2716 (57.5%)
Rank, Points, Prob 0.7772 0.6917 0.1550 −104.3 (−0.04) 2418 (51.2%)

Splines Prob 0.7764 0.6912 0.1546 113.7 (0.09) 1170 (24.8%)
Rank, Prob 0.7764 0.6912 0.1546 113.7 (0.09) 1170 (24.8%)
Points, Prob 0.7764 0.6912 0.1547 101.3 (0.08) 1298 (27.5%)
Rank, Points, Prob 0.7764 0.6912 0.1547 101.8 (0.08) 1299 (27.5%)
Elo, Prob 0.7764 0.6911 0.1547 106.9 (0.09) 1175 (24.9%)

LASSO Surface specific 0.7774 0.6816 0.1551 −225.3 (−0.12) 1891 (40.1%)
General skills 0.7770 0.6838 0.1551 −199.3 (−0.10) 2002 (42.4%)

Benchmark 0.7772 0.6709 0.1555 −311.4 (−0.23) 1359 (28.8%)


26 N. Buhamra et al. / Modeling and prediction of tennis matches at Grand Slam tournaments

The betting returns are negative for all models
except for the spline-based approaches, and here there
are substantial differences between the various mod-
eling approaches. If linear effects are assumed, the
losses range from about 100 to 150 monetary units.
For the spline models, gains are achieved which lie
between about 101 and 114 monetary units. In partic-
ular, the model in which either only prob or prob and
rank were included perform best with a gain of about
114 monetary units. The two models with LASSO
perform even worse than the linear models, with a
betting loss of almost 200 and 225 money units. Once
again, the benchmark model performs worst with a
loss of about 311 monetary units. This comparatively
high loss is due to the fact that this model almost
always bets on the underdog, but the underdog rarely
wins. The betting profit must be seen in relation to
the number of bets placed. Here, the tendency can be
seen that models with a high betting loss have the ten-
dency to bet more often. It is particularly noticeable
that bets are placed much less frequently when using
the non-linear models compared to the other models.
It can therefore be assumed that these models are a
little more conservative. This can also be seen from
the betting returns per match (bet). In some cases,
the corresponding losses are significantly lower than
those of the other model approaches.

4.4. External validation

To validate the models from above with respect
to their predictive performance on new, unseen test
data, the three best models from the groups of mod-
eling approaches with linear and non-linear effects
are used, as well as the two LASSO models and the
benchmark model. For this purpose, the performance
measures are calculated on the four Grand Slam
tournaments 2022. The data set then contains 293
matches. The validation is performed using a “rolling
window”-type approach, i.e. one of the remaining
tournaments is used as the test data set in chrono-
logical order. The training data set then continues to
be constantly updated and enlarged, this scheme can
be explained as follows:

1. First, all tournaments prior to 2022 are used as
the training data set and then the models are fitted
in a way as it is described in the second step of
the CV approach described in Section 4.3. With
those, then predictions can be obtained for the
2022 Australian Open, as this is the first Grand
Slam tournament of the year 2022.

2. The new training data set will then contain all tour-
naments up to and including the Australian Open
2022, on which the models are fitted again and
predictions are made for the French Open 2022,
the 2nd Grand Slam tournament of the year 2022.

3. Now the French Open 2022 is added to the train-
ing data set and the models are fitted again. This
will then be used to predict Wimbledon 2022.

4. Then, the Wimbledon 2022 matches will be added
to the training data set, and again, the models are
fitted and predictions are made for the final Grand
Slam tournament, the US Open 2022.

5. Finally, the predicted results for all four tourna-
ments are compared with the actual results and
the performance measures are calculated.

The results of the external validation are shown in
Table 2. Once again, in the last two columns addi-
tionally the average betting returns per match and the
proportion of bets placed are given in brackets.

The classification rate is again above 77% for all
models, and for some models even above 78% and,
hence, slightly better compared to the classification
rates from the CV-type strategy from above. The best
value is 78.16% and is achieved by three models. It is
striking here that now the benchmark model is among
the best.

Similar trends as in Table 1 (cf. page 9) can be seen
for the likelihood, although the results here are some-
what better. The linear models achieve the highest
likelihood, where the model including the covariates
Rank and Prob delivers the best value with 0.7037,
only slightly behind are the models with non-linear
effects (all between 0.7031 and 0.7033). The LASSO
models yield an even slightly smaller likelihood. The
benchmark model again performs worst.

The Brier score for all models is around 0.141–
0.142. In Section 4.3, the values were slightly larger.
The benchmark model again yields the largest, and
thus worst value with 0.1441. The best value is
achieved by the non-linear model including covari-
ates Points and Prob (0.1411).

Looking at the results of betting returns, the larger
part of the results are similar to those of Table 1.
The LASSO models and the linear models yields the
largest betting loss: approximately between 16 and 24
monetary units each, respectively. The models with
non-linear effects again mostly yield profits (about
four monetary units). However, here the benchmark
model yields the largest loss with 39 monetary units.

Also in terms of the number of bets placed, the
results from Table 1 are mostly confirmed. The lin-


N. Buhamra et al. / Modeling and prediction of tennis matches at Grand Slam tournaments 27

Table 2

Results of the external validation for the (at most) best three models per model class from Section 4.3 (best performing model in bold font).
In brackets are the betting profits per match and the ratio of all matches in which a bet was placed

Explanatory Class. Likeli- Brier Betting Amount of
variables rate hood score profit bets

Linear Prob 0.7816 0.7036 0.1419 −21.6 (−0.11) 197 (67.2%)
Points, Prob 0.7747 0.7027 0.1421 −17.4 (−0.09) 191 (65.2%)
Rank, Prob 0.7816 0.7037 0.1418 −20.5 (−0.01) 198 (67.6%)

Splines Prob 0.7782 0.7031 0.1413 4.2 (0.03) 121 (41.3%)
Rank, Prob 0.7782 0.7031 0.1413 4.2 (0.03) 121 (41.3%)
Points, Prob 0.7782 0.7033 0.1411 −0.28 (−0.00) 126 (43.0%)

LASSO Surface specific 0.7782 0.6934 0.1423 −24.4 (−0.15) 161 (54.9%)
General skills 0.7782 0.6969 0.1416 −16.1 (−0.09) 171 (58.4%)

Benchmark 0.7816 0.6810 0.1441 −39 (−1.00) 39 (13.3%)

ear models again bet most frequently (in almost every
third match), the LASSO models bet second most fre-
quently (in about half of the matches). If non-linear
effects are included, betting is even less frequent (in
about 40% of the cases), these models seem to be
again the most conservative ones, but at a higher level,
since the proportion of bets placed has increased for
each of these three model approaches compared to the
results from Section 4.3. The same applies to the bet-
ting returns per match, which is now 0.00 monetary
units of loss in the worst case. It is noticeable that bets
were placed much less frequently with the benchmark
model than with all other models. This was not the
case in Section 4.3. There, a bet was placed in 28.8%
of the possible matches, while here the percentage is
only 13.3%.

5. Discussion

In Section 4.3, a leave-one-tournament-out CV-
type approach was performed with the 43 Grand
Slam tournaments from the years 2011–2021. Since
the tournaments actually were played one after the
other in time, the CV-type strategy in a certain sense
resulted in the setting that the past was predicted
with information from the future. Initially, this argues
against the normal intuition of prediction, since, for
example, players with currently high rankings are
also more likely to have had high rankings in the past
tournament, i.e. the values are correlated. However,
since the crucial assumption that the yi|xi1, . . . , xip

are (conditionally on the covariate information) inde-
pendent for i = 1, . . . , n is still realistic, CV was
used here as a technical tool to compare performance
across many different prediction models.

Instead of a CV-type approach, a performance
comparison over a continuously updating data set

(“rolling window”) could be considered, as in Sec-
tion 4.4. This would have the advantage that the
temporal structure of the data could be preserved.
However, the CV-type strategy here had the advan-
tage that the models could be compared on more data.
Thus, each tournament served once as a test data set
and since this was only used for an initial comparison
to find principally suitable models, CV was preferred
to the rolling window approach.

For the models from Table 1 (page 9) and Table 2
(page 11) it is noticeable that the variable Prob
is always selected. So the bookmaker information
seems to be very important and to have a big influ-
ence on the prediction. This could also be a reason
why the models all perform quite similarly. The num-
ber of ranking points or the rank itself are also partly
selected in the three best linear and spline-based mod-
els. Hence, these variables also appear to be important
to a certain extend, although not quite as influential
as the odds.

It is hardly possible to filter out a clear win-
ner amongst the regarded models, since they differ
little with regard to the performance measures. If
one initially compares only the three regression
modeling approaches, tendencies can be identified.
When considering the classification rate and predic-
tive Bernoulli likelihood, the linear models might be
very slightly preferred. Regarding the Brier score, the
spline-based models perform slightly better.

However, since the two LASSO models never per-
form best with respect to any of the measures, and
since also their respective betting loss is the largest,
these models are rather not to be preferred. Within the
first modeling approach, according to the results from
Table 2 (page 11), the linear model with the ranking
position and the betting odds is best suited to model
a tennis match. Within the spline models, the model
including only the betting odds and the number of


28 N. Buhamra et al. / Modeling and prediction of tennis matches at Grand Slam tournaments

ranking points could be chosen as the winner with
respect to the first three performance measures. But
as all three models almost yield equal results, also the
most sparse and simple model, here the first one just
including the betting odds, could be chosen, as it also
results in positive betting returns.

Between the two models selected in this way (Rank
and Prob as linear effects or only Prob as a non-linear
effect), the betting profit should also be considered.
If this is taken into account, the spline model should
be preferred; if the betting returns are not considered
to be important, the linear model should be selected.

Spline models

Figure 1 shows the fitted spline of the variable Prob
and its pointwise 95% confidence intervals. For this
purpose, the model was fitted with the covariate Prob
on all tournaments.

The estimated effect looks almost linear between
−0.5 and 0.5, and then steeper at both edges. This
suggests that betting providers use somewhat “unfair”
or special odds in these regions. If the absolute differ-
ence in the winning probabilities is more than 0.5, it
can be assumed that a strong player is playing against
an extreme underdog, which occurs particularly often
in the first round of a tournament. For example, the
favorite in a match may have a 99% chance of winning
the match according to the betting companies, while
the underdog has a 1% chance of winning. This would
result in odds of 1.01 for the favorite and 100 for
the underdog (ignoring the bookmaker’s margin for a

moment). However, since an underdog’s win is very
unlikely, the bookmaker probably don’t always with-
hold the same margin from an underdog’s odds. For
higher odds, they probably withdraw larger margins
than for lower odds. Therefore, it seems reasonable
to assume that the effect of Prob is not linear, but
instead (slightly) non-linear.

LASSO models

In determining the optimal penalty strength λopt

for the LASSO models, 10-fold CV was performed
for a sequence of different λ values and the mean
deviance was calculated. Figure 2 shows this process
as an example with the 2011–2021 data for the model
with general player skills.

Here, λopt = 0.0161 provided a minimum
deviance of 0.9504. The LASSO model with this
choice for λopt was then used to predict the Aus-
tralian Open 2022. For the same setting, Fig. 3 on
page 13 shows the corresponding coefficient paths
as a function of λ, together with λopt as the vertical
dashed line.

For any λ between about 0.02 and 0.3, every coeffi-
cient except that of the variable Prob is shrunk to zero.
For decreasing λ, the coefficient of Prob becomes
larger, again showing the importance of this vari-
able. For λ smaller than 0.02, the coefficients of both
variables Points and Elo are also positive. The coef-
ficients estimated by the model can be read at the
location of λopt . Prob has a coefficient estimate of
1.50, Points of 0.06 and Elo of 0.02. Among others,

Fig. 1. Estimated non-linear effect of the variable Prob.


N. Buhamra et al. / Modeling and prediction of tennis matches at Grand Slam tournaments 29

Fig. 2. CV-deviance of LASSO-model as a function of λ; vertical dashed line: λopt .

Fig. 3. Coefficient paths vs. penalty strength λ; vertical dashed line: λopt .

the coefficient estimate of the player Tennys Sand-
gren (plotted in gray), is also positive at this point
with a value of 0.03.

This can be seen in more detail in Fig. 4, where it
is zoomed into the range of the smaller λ values. At
the optimal amount of penalization, the coefficients of
most other players are 0, except for ten different play-
ers. Among those, Tennys Sandgren has the largest
estimated regression coefficient with a value of 0.03,
so if he is one of the two players competing in a match,
the value of 0.03 is added to his linear predictor for

modeling the probability of him winning the match,
so he seems to perform a bit better than his covariate
values indicate. If one chooses λ to be even smaller,
the coefficient estimates of many other players are
also no longer shrunken to zero, and they are given
positive or negative abilities. As the LASSO estimator
for smaller λ gets closer and closer to the maximum
likelihood estimator, these coefficient estimates are
numerically very unstable. This can also be seen in
the path of the variable Prob (see Fig. 3), which shows
a very wiggly behavior for λ close to 0.


30 N. Buhamra et al. / Modeling and prediction of tennis matches at Grand Slam tournaments

Fig. 4. Zoomed fragment of the coefficient paths for lower λ values.

6. Summary and overview

In this work, tennis matches in Grand Slam tour-
naments were modeled within the framework of
regression. For this purpose, a data set that was
compiled using the R package deuce (Kovalchik,
2018). This contained information on 5,013 matches
in men’s Grand Slam tournaments from the years
2011–2022. This included the age difference of both
players (Age), the difference in their ranking positions
(Rank) and ranking points (Points), in Elo numbers
(Elo), in probabilities for the victory of the play-
ers, calculated from the average odds by the betting
providers (Prob), as well as the two additional age
variables Age.30 and Age.int, which should take into
account that the optimal age of a tennis player is
between 28 and 32 years.

Different regression approaches were considered
for modeling and prediction of tennis matches. As
there are only two possible outcomes in tennis (win or
loss), all models were based on a binary outcome and,
hence, on logistic regression, modeling the probabil-
ity of the first named player to win. It was discussed
that modeling with intercept would not be useful as
this would incorporate a kind of home effect for the
first named player – a property which was not desired
here.

The different modeling approaches were compared
in a 43-fold leave-one-tournament-out CV-type strat-
egy. Each of the 43 Grand Slam tournaments from

2011 to 2021 served once as a test data set. The
following models were included:

– Models with linear effects: to find suitable
covariates, all possible combinations of the
seven covariates were considered such that at
most one of the three age variables was incorpo-
rated. This resulted in 63 models.

– Models with non-linear effects (splines): Again,
63 models were considered.

– A model which took into account surface- and
player-specific effects. Due to the large amount
of unknown parameters, here LASSO penaliza-
tion was used.

– A model that considered general player-specific
abilities. Again, LASSO penalization was used.

– A benchmark model, where the predicted proba-
bilities were derived from average betting odds.

Within the CV-type approach, the models were com-
pared in terms of the classification rate, the predictive
Bernoulli likelihood, the Brier score as well as bet-
ting returns. Since 63 different models resulted for
each of the first two approaches, the five best models
were selected in each case.

It was found that all models performed very sim-
ilarly in terms of classification rate, likelihood and
Brier score. The classification rate was slightly above
77% for all models, meaning that the models pre-
dicted the correct outcome in about 77% of the
matches. The predictive Bernoulli likelihood was


N. Buhamra et al. / Modeling and prediction of tennis matches at Grand Slam tournaments 31

about 0.69 for both linear and non-linear models,
while the LASSO models and the benchmark model
were slightly below. The models thus predicted with
an average probability of about 69% the correct out-
come of a match. The Brier score was slightly above
0.15 for all models, differences were mostly found
in the third decimal. When comparing the betting
profits and the amount of bets placed, it was found
that only the spline models achieved positive betting
returns, but also placed fewer bets. Therefore, it can
be assumed that these models are more conservative
(and thus probably safer) in terms of betting.

The tournaments in 2022 were then used as an
external validation data set. The three best mod-
els with each linear and non-linear effects, the two
LASSO models and the benchmark model were then
compared again. The results of the preceding CV-
type competition could be mostly confirmed, with the
values of the performance measures generally being
slightly better than for the CV: the classification rate
was around 0.775 − 0.782, the predictive likelihood
yielded around 0.693 − 0.704 and the Brier score
was between 0.141 − 0.142. With regard to the bet-
ting returns, again only the spline models achieved
a betting profit, except for the model with the num-
ber of ranking points and the betting odds led to loss.
The proportion of placed bets increased for all mod-
els, only the benchmark model placed considerably
fewer bets than in the preceding CV-type competition.
The most striking here was that, again the benchmark
model was no better than the other models for most
performance measures.

In a more detailed discussion, it was pointed out
that the CV-type strategy here was preferable to a
“rolling window approach” in finding models, since
on the one hand the assumption of independence
of the observations of the target variable, given the
covariates, is fulfilled. Secondly, this allowed the
models to be compared on more data, as the initial
aim was to find suitable models. Furthermore, it was
emphasized that the betting odds are very important
for the prediction. In addition, it could be worked
out that within the linear models the covariates Rank
and Prob provided the best results. Within the spline
approach, all three models provided almost equal
results. Due to simplicity, the model that only con-
sidered betting odds would be preferred here. The
LASSO models tended to perform worse than the
other models and therefore could not really be rec-
ommended. The obtained betting returns can be used
to decide between the first two types of approaches.
They turned out to be positive for the best spline

model, while the linear model yielded a loss. How-
ever, the classification rate and predictive likelihood
were better for the latter model. It was also notable
that all models performed at least as well as the bench-
mark model.

Lastly, the spline model with the Prob variable
and the model with general player-specific skills
were examined in more detail. Based on the corre-
sponding fitted smooth effect, the behavior of the
bookmakers in setting odds for an extreme under-
dog were discussed. Using the LASSO model for
general player-specific skills, the CV for finding an
optimal λ via deviance minimization was illustrated.
In addition, the corresponding coefficient paths for
the different covariates were shown and explained.

In future research, an upcoming complete tourna-
ment could also be repeatedly simulated, and then
the probability of a certain player to win the tourna-
ment could be determined. This can take advantage
of the fact that the tournament course is completely
drawn before the start, i.e. it can already be said on
the basis of the tournament tree that two players can
meet at earliest in a certain round. This means that is
not necessary to take into account whether someone
has finished first or second in a certain group stage,
as it is the case in soccer, for example. With such an
approach, however, only the match-specific betting
odds for the first round would be available. Models
that do not use the odds as covariates could then be
preferable, but this could lead to a poorer prediction
performance due to the large influence of this vari-
able. Alternatively, one could look at models that do
not use the odds for individual matches, but instead
use odds set before the tournament on each player to
win the whole tournament.

Moreover, another extension of the approach pro-
posed here would be to allow for more flexible,
time-varying player-specific ability parameters, sim-
ilar e.g. to the approach proposed by Ley et al. (2019)
for modeling soccer. This approach has also been
used successfully and the resulting estimates have
been incorporated as a so-called “hybrid” feature in
a random forest model for predicting the FIFA World
Cup 2018 in Groll et al. (2019), and hence, seems
to be also promising in tennis. And the authors have
already planned to carry over this idea to tennis. To
do so, a different (and much larger) data set has to be
collected and also a separate, quite complex model
specifically designed for historic match data has to
be built up.

Finally, as already stated, in this work only
(directly interpretable) approaches within the frame-


32 N. Buhamra et al. / Modeling and prediction of tennis matches at Grand Slam tournaments

work of regression were analyzed. In future research,
we have planned to compare their performance with
different complex machine learning models, which
often have the capapbility to further increase the
predictive performance, though coming with the sub-
stantial draw-back of loosing interpretability. For that
reason, they typically should be equipped with certain
methods of interpretable machine learning (IML),
such as partial dependence plots Friedman, 2001, ICE
plots Goldstein et al., 2015 and ALE plots Apley and
Zhu, 2020. A first attempt to summarize the limited
selection of available methods for interpretable ML
appears in Molnar (2020).

References

Apley, D.W. & Zhu, J., 2020, Visualizing the effects of predictor
variables in black box supervised learning models, Journal of
the Royal Statistical Society Series B: Statistical Methodol-
ogy, 82(4), 1059-1086.

Arcagni, A., Candila, V. & Grassi, R., 2022, A new model for
predicting the winner in tennis based on the eigenvector cen-
trality, Annals of Operations Research, pages 1-18.

Bayram, F., Garbarino, D. & Barla, A., 2021, Predicting tennis
match outcomes with network analysis and machine learning,
In International Conference on Current Trends in Theory and
Practice of Informatics, pages 505-518. Springer.

Brier, G.W., 1950, Verification of forecasts expressed in terms of
probability, Monthly Weather Review, 78, 1-3.

Chitnis, A. & Vaidya, O., 2014, Performance assessment of tennis
players: Application of dea, Procedia-Social and Behavioral
Sciences, 133, 74-83.

Clarke, S.R. & Dyte, D., 2000, Using official ratings to simu-
late major tennis tournaments, International Transactions in
Operational Research, 7(6), 585-594.

Del Corral, J. & Prieto-Rodrı́guez, J., 2010, Are differences in
ranks good predictors for grand slam tennis matches? Inter-
national Journal of Forecasting, 26(3), 551-563.

Easton, S. & Uylangco, K., 2010, Forecasting outcomes in tennis
matches using within-match betting markets, International
Journal of Forecasting, 26(3), 564-575.

Eilers, P.H. & Marx, B.D., 2021, Practical smoothing: The joys of
P-splines, Cambridge University Press.

Eilers, P.H.C. & Marx, B.D., 1996, Flexible smoothing with B-
splines and penalties, Statistical Science, 11, 89-121.

Fahrmeir, L., Kneib, T., Lang, S. & Marx, B., 2013, Regression,
Modells, Methods and Applications, Springer, Berlin.

Fahrmeir, L. & Tutz, G., 2001, Multivariate Statistical Modelling
Based on Generalized Linear Models, Springer-Verlag, New
York, 2nd edition.

Friedman, J., Hastie, T. & Tibshirani, R., 2010, Regularization
paths for generalized linear models via coordinate descent,
Journal of Statistical Software, 33(1), 1-22.

Friedman, J.H., 2001, Greedy function approximation: a gradient
boosting machine, Annals of Statistics, 29, 337-407.

Goldstein, A., Kapelner, A., Bleich, J. & Pitkin, E., 2015, Peek-
ing inside the black box: Visualizing statistical learning with
plots of individual conditional expectation, Journal of Com-
putational and Graphical Statistics, 24(1), 44-65.

Groll, A., Ley, C., Schauberger, G. & Van Eetvelde, H., 2019,
A hybrid random forest to predict soccer matches in inter-
national tournaments, Journal of Quantitative Analysis in
Sports, 15, 271-287.

Gu, W. & Saaty, T.L., 2019, Predicting the outcome of a tennis
tournament: Based on both data and judgments, Journal of
Systems Science and Systems Engineering, 28(3), 317-343.

Klaassen, F.J. & Magnus, J.R., 2003, Forecasting the winner of
a tennis match, European Journal of Operational Research,
148(2), 257-267.

Kovalchik, S., 2018, deuce: resources for analysis of professional
tennis data, R package version, 1.

Lennartz, J., Groll, A. & van der Wurp, H., 2021, Predicting table
tennis tournaments: A comparison of statistical modelling
techniques, International Journal of Racket Sports Science,
3(2).

Ley, C., Wiele, T.V.d. & Eetvelde, H.V., 2019, Ranking soccer
teams on the basis of their current strength: A comparison
of maximum likelihood approaches, Statistical Modelling,
19(1), 55-73.

Ma, S.C., Ma, S.M., Wu, J.H. & Rotherham, I.D., 2013, Host res-
idents’ perception changes on major sport events, European
Sport Management Quarterly, 13(5), 511-536.

Marra, G. & Wood, S.N., 2011, Practical variable selection for
generalized additive models, Computational Statistics & Data
Analysis, 55(7), 2372-2387.

McHale, I. & Morton, A., 2011, A bradley-terry type model for
forecasting tennis match results, International Journal of
Forecasting, 27(2), 619-630.

Molnar, C., 2020, Interpretable machine learning, Lulu. com.

Nelder, J.A. & Wedderburn, R.W.M., 1972, Generalized linear
models, Journal of the Royal Statistical Society, A, 135, 370-
384.

Park, M.Y. & Hastie, T., 2007, L1-regularization path algorithm
for generalized linear models, Journal of the Royal Statistical
Society Series B, 19, 659-677.

R Core Team, 2022, R: A Language and Environment for Statistical
Computing, R Foundation for Statistical Computing, Vienna,
Austria.

Radicchi, F., 2011, Who is the best player ever? a complex network
analysis of the history of professional tennis, PloS One, 6(2),
e17249.

Schauberger, G. & Groll, A., 2018, Predicting matches in interna-
tional football tournaments with random forests, Statistical
Modelling, 18(5-6), 460-482.

Somboonphokkaphan, A., Phimoltares, S. & Lursinsap, C., 2009,
Tennis winner prediction based on time-series history with
neural modeling, In Proceedings of the International Multi
Conference of Engineers and Computer Scientists, volume 1,
pages 18-20. Citeseer,

Tibshirani, R., 1996, Regression shrinkage and selection via the
Lasso, Journal of the Royal Statistical Society, B, 58, 267-288.


N. Buhamra et al. / Modeling and prediction of tennis matches at Grand Slam tournaments 33

Weston, D., 2014, Using age statistics to gain a tennis betting edge,
http://www.pinnacle.com/en/betting-articles/Tennis/atp-
players-tipping-point/LMPJF7BY7BKR2EY

Whiteside, D., Cant, O., Connolly, M. & Reid, M., 2017, Monitor-
ing hitting load in tennis using inertial sensors and machine
learning, International Journal of Sports Physiology and Per-
formance, 12(9), 1212-1217.

Wilkens, S., 2021, Sports prediction and betting models in the
machine learning age: The case of tennis, Journal of Sports
Analytics, 7(2), 99-117.

Wood, S.N., 2004, Stable and efficient multiple smoothing param-
eter estimation for generalized additive models, Journal of
the American Statistical Association, 99(467), 673-686.

Wood, S.N., 2017, Generalized Additive Models: An Introduction
with R. Chapman & Hall/CRC, London, 2nd edition,

Yue, J.C., Chou, E.P., Hsieh, M.-H. & Hsiao, L.-C., 2022, A study
of forecasting tennis matches via the glicko model, PloS One,
17(4), e0266838.


Comparing modern machine learning approaches
and different forecast strategies on Grand Slam

tennis tournaments

N. Buhamra1 A. Groll2 A. Gerharz3

1Department of Statistics, TU Dortmund University, Vogelpothsweg 87, 44221 Dortmund, email:
nourah.buhamra@tu-dortmund.de

2Corresponding author, Department of Statistics, TU Dortmund University, Vogelpothsweg 87,
44221 Dortmund, email: groll@statistik.tu-dortmund.de

3Department of Statistics, TU Dortmund University, Vogelpothsweg 87, 44221 Dortmund, email:
gerharz@statistik.tu-dortmund.de

Abstract

This manuscript builds upon the approach presented in Buhamra et al. (2024). Differ-
ent modern machine learning and regression approaches for modeling and prediction
of tennis matches in Grand Slam tournaments are investigated. Our data includes in-
formation on 5,013 matches in men’s Grand Slam tournaments from the period 2011-
2022. The investigated methods focus on modeling the probability of the first-named
player to win the respective match. Moreover, different features are considered includ-
ing the players’ age, the ATP ranking and points, bookmakers’ odds, Elo rating as well
as two additional age variables, which take into account the optimal age of a tennis
player. We compare the different regression approaches regarded in Buhamra et al.
(2024) to modern machine learning approaches with respect to various performance
measures. Moreover, we also investigate different forecast strategies. First of all, a
cross-validation-type strategy for all matches between 2011 and 2021. We also use an
“expanding window” strategy by continuously updating the training data to analyze the
predictive performance of the approaches on the tournaments from 2022. Finally, a
“rolling window” strategy is used with only three years of tournaments as training data.
We then select small subsets of best models with largest average ranks, and investigate
those in more detail by the help of interpretable machine learning techniques.

Keywords: Grand Slam tournaments, Tennis matches, machine learning, prediction,
model selection, cross validation, rolling window.

1 Introduction
Recently, various statistical and machine learning methods for modeling both tennis
matches and whole tournaments have been introduced, leading to an expansion of ex-
isting techniques for predicting match-winning probabilities in tennis. Consequently,
once all individual matches can be predicted, it may also be possible to calculate the
winning probabilities for an entire tournament.

In recent years, machine learning (ML) models were employed to forecast the win-
ners of tennis matches. Somboonphokkaphan et al. (2009) introduced an approach

1


that is able to predict match winners by utilizing both match statistics and environ-
mental data, which uses a Multi-Layer Perceptron (MLP) in combination with a back-
propagation learning algorithm. MLP is a fundamental type of Artificial Neural Net-
work (ANN), which are an effective method for addressing real-world classification
task and are especially useful for predicting outcomes when they have access to exten-
sive databases and can handle incomplete or noisy data. Whiteside et al. (2017), devel-
oped an automated stroke classification system in order to quantify the hitting load in
tennis using machine learning models such as a cubic kernel support vector machine.
Wilkens (2021) builds on prior research by exploring various ML techniques, including
neural networks and random forests, using one of the largest datasets for professional
men’s and women’s tennis matches. He found that the average prediction accuracy
cannot exceed approximately 70%.

In Sipko and Knottenbelt (2015), the authors predicted the winner of the tennis
match based on the probability of serve points won which in turn gives the probability
of a player winning the match. They define a method of extracting 22 features from raw
historical data, including abstract features, such as player fatigue and injury. Then, by
using the resulting data set, they developed and optimized models which were able to
outperform Knottenbelt’s Common-Opponent model based on machine learning algo-
rithms such as artificial neural networking. The authors believe that machine learning
can bring new innovation in tennis betting as the neural networking generates a 4.35%
return on investment in the betting market, which is an improvement of about 75% over
the current state-of- the-art stochastic models. Furthermore, Gao and Kowalczyk (2021)
constructed a model that predicts tennis match outcomes with high accuracy over 80%,
much larger than predictions using betting odds alone, and identified serve strength as a
key predictor of match outcome. This was done by using a random forest classifier that
highlights the importance and relevance of simple models in the age of deep learning.
Moreover, a wide variety of features to capture information about physical, psycholog-
ical, court-related, and match-related variables are included in a large data set of tennis
matches, which is compiled and processed from ATP data from 2000 to 2016. Finally,
an overview of modeling and predicting tennis matches at Grand Slam tournaments by
different regression approaches has been presented in Buhamra et al. (2024).

In the present work, we focus on several ML modeling approaches, which are ap-
plied on tennis Grand Slam tournament match data. Additionally, we proceed further
with the comparison of those approaches to the regression models investigated in our
previous work (see Buhamra et al., 2024). To achieve this, we use the same data an-
alyzed in Buhamra et al. (2024), which was created from the deuce R package (Ko-
valchik, 2019), and includes information on 5,013 matches from men’s Grand Slam
tournaments between 2011 and 2022. Several features are available, such as the play-
ers’ age, ATP ranking and points, odds, Elo rating, and two other age-based variables
that account for the “optimal” player’s age, which was found to be in the interval of
[28;32] years (Weston, 2014). Various machine learning and regression approaches will
be introduced that are then compared via different performance measures. Moreover, as
it is a generally known result that feature engineering often also is quite a crucial aspect
for the modeler (and sometimes even more relevant than the choice among different
types of statistical or machine learning models), we analyze the impact of different
types of forecasting strategies: a cross-validation-type strategy for all matches between
2011 and 2021, an “expanding window” strategy by continuously extending the train-
ing data, and a “rolling window” strategy based on only three years of tournaments as
training data.

The rest of the article is structured as follows. Section 2 briefly introduces the data
set and defines the objectives of this work. Then, in Section 3, different machine learn-

2


ing modeling approaches are introduced, including classification trees, random forest,
XGBoost, support vector machine (SVM) and artificial neural networks (ANNs). Be-
sides, some classical regression modeling approaches and corresponding regularization
concepts such as the Least Absolute Shrinkage and Selection Operator (LASSO) and
non-parametric spline techniques are defined. More details on the regression models
and a benchmark model are introduced in Section 4. Moreover, three performance
measure are defined and two approaches from the field of interpretable machine learn-
ing (IML) are described, which are then used to interpret the best machine learning
model. In Section 5, all modeling approaches are compared via the different forecast-
ing strategies and the introduced performance measures, on the one hand on all Grand
Slam tournaments from 2011 to 2021 via a leave-one-tournament-out CV-type strat-
egy. On the other hand, we employ an expanding window approach on the Grand Slam
tournament 2022, as well as a rolling window approach with only three years of tourna-
ments as training data in order to have very recent prediction results. Finally, Section 6
summarizes the main results and concludes.

2 Data
In this section, we shortly introduce the data set from Buhamra et al. (2024), which was
compiled based on the R package deuce (Kovalchik, 2019). It contains the following
information for 5,013 matches in men’s Grand Slam tournaments from 2011 to 2022:

Player1: name of (randomly chosen) first-named player.

Player2: name of (randomly chosen) second-named player.

Year: year when the match took place (ranging from 2011 to 2022).

Tournament: Grand Slam tournament where the respective match took place (Aus-
tralian Open, French Open, Wimbledon, US Open).

Surface: factor variable describing the surface on which the match was played (“hard”,“clay”
or “grass”).

Victory: binary target, determining whether the first-named player won the match (1:
yes, 0: no).

Age: metric predictor capturing the age difference of the players (in years); age of the
2nd player was subtracted from that of 1st player.

Prob: difference in winning probabilities of both players, obtained from their average
odds (see AvgProb1 and AvgProb2 below); winning probability of 2nd player is
subtracted from that of 1st player.

Rank: difference in players’ rankings, calculated by subtracting the rank of 2nd player
from that of 1st player. For this, the rank of the players at the start of the tourna-
ment is used. The position in the ranking is determined by the ATP points.

Points: difference in ATP points; points of 2nd player are subtracted from those of
1st player; points are awarded for each match won per tournament. Wins in
later rounds of a tournament are valued higher than wins in the first rounds of a
tournament. Points expire after 52 weeks. Again, the players’ points at the start
of the tournament are used.

Elo: difference in Elo rating; Elo rating of 2nd player is subtracted from that of 1st
player; Elo rating takes into account whether a player played against a higher or
lower ranked player. It increases more if a player wins against a player with a
large Elo rating than if he wins against a player with a lower one; it is updated
after each new match of a player.

3


Age.30: first, the distance between the age of the players and reference age of 30 years
is calculated, then the corresponding difference between both competing players
is calculated, as e.g. for Age. This is because the standard Age variable from
above does not contain enough information. For example, a 25-year-old player
typically has an advantage over a 20-year-old one, while a 40-year-old player
typically has a disadvantage over a 35-year-old one. However, in both cases, the
age difference is 5 years. Following Weston (2014), who argued that the optimal
age of tennis players is within [28;32] years, we use 30 years as the mid point of
the interval as reference age.

Age.int: first, the distance to the closer limit of the interval was used, i.e. for players
younger than 28 the distance to 28 was calculated, while for players older than 32
the distance to 32 was computed. For players with an age within [28;32] years the
distance was set to 0. Then, the difference between both players was calculated.

AvgProb1: average winning probability of Player 1, based on the average odds from
several different betting providers, which are provided in the deuce package (ob-
tained from https://www.oddsportal.com/).

AvgProb2: Same as AvgProb1, but from the perspective of Player 2. Note that per
match we have AvgProb1 + AvgProb2=1.

B365.1 Winning odds for Player 1 obtained from the specific bookmaker Bet365. Ex-
emplarily, winning odds of 2.31 for Player 1 yield a return of 2.31 money units
if he wins, if previously one money unit is placed on this event. These odds are
later used to compute betting returns.

B365.2 Same as B365.1, but from the perspective of Player 2.

A detailed description of the variables included in the data set can be found in
Section 2 of Buhamra et al. (2024).

It is important to note that we exclude matches in which one player forfeited or
was unable to compete, such as due to injury, resulting in a win for the other player
without an actual match being played. These matches lack informative data and could
skew the results, and hence are omitted. Additionally, the dataset contains no missing
values. For this dataset, we aim to identify the most effective machine learning model
for predicting tennis matches at Grand Slam tournaments and explore which modeling
approaches are especially suitable for this task.

Among other things, it will be examined whether the machine learning approaches
are powerful with regard to prediction of new matches compared to the classical re-
gression approaches used in Buhamra et al. (2024). Then, for the different machine
learning modeling approaches, based on a leave-one-tournament-out CV strategy we
determine models which are optimal with respect to certain performance measures, and
compare those with each other. Additionally, both an expanding window approach with
a starting window of eleven years and a rolling window approach with short window of
three years period are used.

3 Statistical and machine learning methods
In the following, closely following Buhamra et al. (2024), in Section 3.1 classical
statistical methods are introduced, beginning with logistic regression. Based on this,
we shortly motivate the concept of regularization and define the so-called LASSO-
estimator. Moreover, spline regression with P-splines is described. Then, in Section 3.2
several machine learning methods such as classification trees, random forest, XGBoost
and support vector machine (SVM) are introduced.

4


3.1 Regression approaches
Next, we will first introduce the classic logistic regression model for binary outcomes
and then sketch different extensions, which will be used in this work. All of those have
been introduced in detail already in Buhamra et al. (2024), but for better readability of
this manuscript we show them again.

Logistic regression

For n individuals, let observations (yi,xi1, . . . ,xip), i= 1, . . . ,n of a binary target variable
y and covariates x1, . . . ,xp be given. In the logistic regression model, the relationship
between y and metric, categorical or binary covariates is examined. Here, y = 1 denotes
the occurrence of a particular event (typically defined as “success”) and y = 0 that the
event does not occur (also defined as “failure”). Then,

πi = P(yi = 1|xi1, . . . ,xip) = E[yi|xi1, . . . ,xip]

is the (conditional) probability for the occurrence of yi = 1, given the covariate values
xi1, . . . ,xip. The aim is to model πi appropriately as a function of the features.

Therefore, the linear predictor ηi is related to the probability πi by a strictly mono-
tonically increasing function h : R→ [0,1], i.e.

πi = h(ηi) = h(β0 +β1xi1 + · · ·+βkxip) .

The function h(·) is also called the response function. With the help of the inverse
function g = h−1, the so-called link function, we can also write ηi = g(πi). The logistic
regression model is the most famous candidate within the framework of Generalized
Linear Models (GLMs).

The estimators for β0, . . . ,βp are obtained by numerical maximization of the log-
likelihood, e.g. by using the Fisher scoring or the Newton-Raphson method, see, e.g.,
Nelder and Wedderburn (1972). Generally, for more details on GLMs, see also Fahrmeir
and Tutz (2001).

Regularization

If the number of covariates p becomes very large, the estimation becomes numerically
unstable (see, e.g., Fahrmeir et al., 2013). This can also be the case if there is some sub-
stantial mulitcollinearity between the columns of the design matrix X = (x1, . . . ,xn)

>.
To address this problem, a penalty term pen(βββ ) is added to the model’s (negative) log-
likelihood l(·), which in our case corresponds to a Bernoulli logit model. According to
Park and Hastie (2007), the estimator is then obtained by minimizing

β̂ββ pen = argmin
βββ

(−l(βββ )+λ · pen(βββ )),

where λ is the penalty parameter that controls the influence of the penalty term on the
parameters estimated by the ML method.

The Least Absolute Shrinkage and Selection Operator (LASSO)

One possibility for penalization is provided by the Least Absolute Shrinkage and Se-
lection Operator (LASSO; Tibshirani, 1996). The corresponding penalty term is given
by

pen(βββ ) =
p

∑
j=1
|β j|.

5


It allows model estimation and variable selection to be performed in one step, as small
coefficients are shrunk to 0. No closed-form representation for solving this minimiza-
tion problem does exist. Therefore, numerical optimization methods are used to obtain
the optimal LASSO estimator β̂ββ LASSO (see, e.g.,Friedman et al., 2010). To optimize the
penalty parameter λ typically K-fold cross validation is used.

Spline-based approaches

In the methods introduced above, the influence of the covariates on the target variable
is assumed to be strictly linear. However, often also non-linear influences are relevant.
In order to model these appropriately and flexibly, so-called splines can be used. Here,
the so-called B-splines (Eilers and Marx, 1996, 2021) will be employed.

B-Splines

In principle, with B-splines a non-linear effect f (x) of a metric predictor can be repre-
sented as

f (x) =
d

∑
j=1

γ jB j(x) ,

where B j(x) represent the different B-spline basis functions, d denotes the number of
basis functions used, and γ j the corresponding spline coefficients. As an unpenalized
estimation of a non-linear B-spline effect often overfits, typically the non-linear effect
is smoothed by using penalized B-splines, i.e. P-splines.

P-splines

Beside the problem of potential overfitting, the goodness-of-fit of the B-spline approach
depends on the number of selected nodes. To avoid this problem, various penalization
methods exist in the form of P-splines. Here, a penalized estimation criterion, which
is extended by a penalty term, is used instead of the usual estimation criterion. For P-
splines based on B-splines (see, e.g., Eilers and Marx, 1996), the function f (x) is first
approximated by a polynomial spline with many nodes (typically about 20 to 40).

A suitable penalty term guaranteeing smoothness then would be given by

λ
∫ (

f ′′(x)
)2 dx .

This is motivated by the fact that the second derivative is used as a measure of the
curvature of a function. If this becomes too large, it will be penalized by the term
above. For approximation of the second derivative exist simple representations, so that
the simpler penalty term

λ
d

∑
j=3

(
∆2γ j

)2

could be used instead, where ∆2γ j = γ j − 2γ j−1 + γ j−2 (see again Eilers and Marx,
1996).

The optimal smoothing parameter λ is determined using generalized CV. For more
details on the methodology, see also Eilers and Marx (2021), and for more details on
the corresponding software implementation in R, see Wood (2017).

6


3.2 Machine learning approaches
In the following, we’ll introduce different machine learning approaches that are suitable
for binary responses and will be used in this work for comparison against the regression
approaches, which were introduced before in Section 3.1.

Classification trees

Decision trees are a supervised machine learning algorithm. A tree starts with its root
node and then follows binary splits based on the features’ values until a leaf node is
reached and the final binary result is given. The goal is to create a model that predicts
the value of a target variable by learning simple decision rules inferred from the given
features. The splitting of nodes (including the root node) to sub-nodes happens based
on impurity metrics. Hence, decision trees seek to find the best split to subset the data,
and they are typically trained through the Classification And Regression Tree (CART)
algorithm. Metrics, such as Gini impurity and information gain can be used to evaluate
the quality of the splits. CARTs were introduced by Breiman et al. (1984) and use
Gini impurity in the process of splitting the data set into a decision tree. Gini impurity
measures how often a randomly chosen attribute is misclassified and can be represented
as follows

1−
c

∑
i=0

(pi)
2 ,

where pi is the probability of class i with total number of classes c. The values of the
Gini index range from 0 to 1, a lower value being more ideal. It should be noted that in
this manuscript we will focus on the classification trees rather than the regression ones,
as in our case the target is a binary variable (i.e. the target variable is categorical).

However, a fully grown tree typically will overfit the training data and the resulting
model will not generalize well for predicting the outcome of new (unseen) test data.
Therefore, techniques such as pruning (stopping the tree to grow) are used to control
this problem. In the rpart package Therneau et al. (2015), this is controlled by the
complexity parameter (cp), which gives a threshold which an improvement has to pass
that a split stays in the tree. A too small value of cp leads to overfitting, and a too large
cp value will result in a too small tree and, hence, underfitting. Both cases decrease
the predictive performance of the model. An optimal cp value can be estimated by
testing different cp values and using cross-validation (CV) approaches to determine the
corresponding prediction accuracy of the model1. The best cp is then defined as the
one that maximizes the CV accuracy.

Random forest

Individual trees, as Breiman (1996b) pointed out, suffer from instability, which is in-
herent in the tree model and can not be fixed through changing the way a tree is built.
A first solution that does not interfere in the tree construction process is an ensem-
ble method called bootstrapping and aggregating (bagging; Breiman, 1996a), which in
general improves the predictive performance compared to a single regression tree and
is easy to implement.

Similar to bagging, a random forest also consists of multiple decision trees which
are aggregated for a more accurate prediction (Breiman, 2001). The logic behind the
random forest model is that multiple uncorrelated models (i.e., the individual decision

1We self-implemented this CV approach such that for all methods, where tuning was necessary, the same
folds are used and, hence, better comparability is achieved.

7


trees) perform much better as a group than they do alone, as this reduces the variance
and yields more accurate predictions than an individual model. Depending on the type
of problem, the type of prediction will vary. For a regression task, the individual deci-
sion trees will be averaged, and for a classification task, a majority vote, (i.e. the most
frequent categorical variable) will yield the predicted class.

Again, in this manuscript, we will consider the classification task for the random
forest since our target variable is binary. The main extension compared to bagging
is the idea of drawing randomly only a subset of ”mtry” features at each node for
splitting in random forest, which involves selecting a random subset of predictors from
the set of all available features at each decision node in a decision tree. This introduces
diversity among the trees in the ensemble, reducing the risk of overfitting and enhances
the model’s generalization performance by ensuring that each tree is built on a different
subset of features. For fitting random forest models, we use the ranger implementation
in R (Wright and Ziegler, 2017).

Here, in the random forest, mtry and the ntree are the two parameters that most
likely to have a big effect on our final accuracy and are defined as follows:

− mtry: number of variables randomly sampled as candidates at each split.

− ntree: number of trees to grow.

The mtry hyperparameter controls the split-variable randomization feature of the ran-
dom forests and it helps to balance low tree correlation with reasonable predictive
strength. For classification mtry =

√
p is recommended, where p is the number of

predictors. However, when there are fewer relevant predictors a larger value of mtry
tends to perform better because it makes it more likely to select those features with
the strongest signal. When there are many relevant predictors, a lower mtry might
perform better. For this reason, we decided to properly tune this parameter using (self-
implemented) 10-fold CV (see footnote 1 for details).

There are no specific criterias when it comes to choosing the number of trees that
the random forest should use, as long as ntree is large enought. However, there is a
threshold where adding more trees does not give any significant gain to the random for-
est (Oshiro et al., 2012), but computational costs would unnecessarily increase. Here,
we decided to set ntree= 400 (and, hence, did not tune this parameter).

Extreme gradient boosting

Boosting originates in the machine learning community and proved to be a successful
and practical strategy to improve classification procedures. Boosting is a sequential
ensemble method that combines the predictions of multiple weak models to produce a
stronger prediction, resulting in a final, aggregated model. I.e., first, a single model is
built for the task and prediction errors are computed. Then, the second model is built
which tries to correct the errors present in the first model. This procedure is continued
and models are added until either the complete training data set is predicted correctly
or the maximum number of models are added.

Since Freund et al. (1996) have presented their famous Ada-Boost algorithm, many
extensions have been proposed and boosting approaches were updated to estimate pre-
dictors for statistical models (e.g., gradient boosting by Friedman et al., 2000, general-
ized linear and additive regression based on the L2-loss by Bühlmann and Yu, 2003).

Hence, boosting is considered as a sequential process; i.e., if e.g. (simple) trees are
used as learners, those are grown using the information from a previously grown tree
one after the other. This process slowly learns from data and tries to improve its pre-
diction in subsequent iterations. A main advantage of statistical boosting algorithms

8


is their flexibility for high-dimensional data and their ability to incorporate variable
selection in fitting process (Mayr et al., 2014). An extensive and enlightening, gen-
eral overview on gradient boosting algorithms can be found in Bühlmann and Hothorn
(2007). Friedman (2001) introduced the idea of gradient tree boosting, using decision
trees as learners. The decision trees are repeatedly fitted on the residuals of the previous
fit and, hence, are combined to a sequential ensemble.

This technique was then further improved by Chen and Guestrin (2016) via in-
troducing additional regularization in the objective function. The regularization terms
make the single trees weak learners to avoid over fitting. In a certain boosting iter-
ation, the next tree is additively incorporated into the ensemble after multiplication
with a rather small learning rate which makes the learners even weaker. The method
is called extreme gradient boosting, in short XGBoost, and is known in the machine
learning community for its high predictive power. The approach is implemented in the
xgb.train function from the xgboost R package (Chen et al., 2023). One impor-
tant aspect is that XGBoost involves several tuning parameters. For this purpose, we
specified reasonable, discrete grids and again used (self-implemented) 10-fold cross
validation to determine optimal tuning parameters, see footnote 1 for further details.

Support vector machine (SVM)

Support vector machine (SVM) is another supervised machine learning technique, de-
veloped by Cortes and Vapnik (1995). The theory of SVM can be explained by defin-
ing four concepts, namely the hyperplane, the optimal margin, the soft margin and the
kernel function (Noble, 2006): (i) the hyperplane, is the optimal line in a high dimen-
sional space that separates the data points; (ii) the optimal margin is defined as the
distance between the hyperplane and the closest support vectors; (iii) the soft margin
which allows for separation of data points with a minimal number of misclassifications;
(iv) as data points in a classification problem cannot always be separated linearly, the
fourth concept, the kernel function, was developed. The kernel function changes the
shape of space to a higher dimension and makes it possible to find a non-linear decision
boundary. A kernel function that does not suggest too many unnecessary dimensions
is preferred because too many dimensions can lead to over fitting. When the data is
perfectly linearly separable only then we can use Linear SVM. Perfectly linearly sep-
arable means that the data points can be classified into two classes by using a single
straight line. However, in case the data is not linearly separable we can use non-linear
SVM. So if the data points cannot be separated into two classes by using a straight line,
we use kernel functions to classify them. In most applications we do not find linearly
separable data points, hence, we use kernel functions in order to solve them. In this
work, two different SVM types are used for our models, namely the linear SVM and
the non-linear SVM with radial basis function (RBF) kernel.

The mathematical definition of a hyperplane in two dimensions is given by James
et al. (2013) as

β0 +β1x1 +β2x2 = 0 , (3.1)

for parameters β0, β1 and β2. Any x = (x1,x2)
T for which equation (3.1) holds is a

point on the hyperplane.
Equation (3.1) can be extended to the p-dimensional case by

β0 +β1x1 +β2x2 + ...+βpxp = 0 , (3.2)

with x = (x1,x2, ...,xp)
T being a vector of length p. Again, if x satisfies equation (3.2),

then x lies on the hyperplane. In case that x does not satisfy equation (3.2), then this

9


indicates that x lies on one side of the hyperplane, e.g.

β0 +β1x1 +β2x2 + ...+βpxp > 0.

Otherwise, if

β0 +β1x1 +β2x2 + ...+βpxp < 0 ,

then x lies on the other side of the hyperplane.
Suppose that we have an n× p data matrix X that consists of n training observa-

tions in the p-dimensional space with corresponding binary responses falling into two
classes, that is, y1, ...,yn ∈ {−1,1}, where−1 represents one class and 1 the other class.

Then, a separating hyperplane has the property that

yi · (β0 +β1Xi1 +β2Xi2 + ...+βpXip)> 0

for all i = 1, ...,n.
The maximal marginal hyperplane is the separating hyperplane for which the mar-

gin is largest. For a set of n training observations x1, ...,xn ∈ Rp and associated class
labels y1, ...,yn ∈{−1,1}, the maximal margin hyperplane is e.g. defined by James et al.
(2013) as the solution to the following optimization problem. Let M > 0 represent the
margin of our hyperplane. Then we want to maximize M and optimize β0,β1, . . . ,βp in

yi · (β0 +β1xi1 +β2xi2 + ...+βpxip)≥M , ∀i = 1, ...,n ,

subject to ∑p
j=1 β 2

j = 1. Further details about SVM can be found e.g. in James et al.
(2013).

Often classes are not perfectly separable, then so-called slack variables ξ1, . . . ,ξn

are introduced in additional constraints

yi ·
(

x>i β +β0

)
≥M (1−ξi) , i = 1, . . . ,n

ξi ≥ 0, i = 1, . . . ,n
n

∑
i=1

ξi ≤C

with C ≥ 0 constant. Those slack variables allow some leeway compared to perfect
separation, and ξi corresponds to the proportion of M that observation i lies on the
wrong side of its margin line (see, e.g., Hastie et al., 2009). For 0 < ξi ≤ 1, observation
i lies between its margin line and the hyperplane, and for ξi > 1 it lies on the wrong side
of the hyperplane. Observations on the correct side of the margin yield ξi = 0. Hence,
C is a tuning parameter that restricts the total amount of observations that may lie on
wrong sides. Only those observations with ξi ≥ 0 affect the hyperplane and are called
support vectors. Note that in addition to linear SVM, there also exists a non-linear
version known as the radial SVM, which will also be used in the following.

The main advantage of using SVM is that they are more effective in high dimen-
sions where the number of features can be very large. The SVM commonly achieves
better accuracy in comparison to Artificial Neural Networks (ANNs; Sipko and Knot-
tenbelt, 2015). Furthermore, Valero (2016) explains that SVM also gives a sparse solu-
tion, which reduces the risk of overfitting. A detailed overview of the SVM approach
for classification is given in Wang and Casasent (2008) and Steinwart and Christmann
(2008). A manual grid search of hyperparameter values is used. Based on tuning re-
sults of the parameters for the radial basis function (RBF) kernel, the best SVM model
is identified. Then the performance of the tuned SVM model is evaluated using the test-
ing data set. This is done by using the tune.svm function from the e1071 R package
(Dimitriadou et al., 2009).

10


Artificial neural networks (ANNs)

An artificial neural network (ANN) is a computational model inspired by the way bi-
ological neural networks in the human brain process information (Rosenblatt, 1958).
ANNs were developed separately in different fields, e.g. statistics and artificial intel-
ligence, andand reach back to at least 1986 (see, e.g., Rumelhart et al., 1986a,b; Mc-
Clelland et al., 1987). They consist of interconnected nodes, or neurons, organized into
layers: an input layer, one or more hidden layers, and an output layer. Each connec-
tion between neurons has an associated weight that adjusts as learning proceeds. In a
nutshell, an ANN relies on the following components and works as follows:

− Input Layer: Data is fed into the network through the input layer.

− Weighted Sum: Each neuron calculates a weighted sum of its inputs.

− Activation Function: The weighted sum is passed through an activation function
(e.g. ReLU or sigmoid). It can introduce non-linearity and determines the neu-
ron’s output.

− Forward Propagation: The output from one layer becomes the input for the next
layer until reaching the output layer.

− Loss Function: The difference between the predicted output and actual target
values is calculated using a loss function.

− Backpropagation: The network adjusts its weights based on this error using opti-
mization algorithms like gradient descent, effectively “learning” from the data.

Through repeated iterations over training data, ANNs can learn complex patterns and
make predictions or classifications on new data. A sound introduction to ANNs can be
found for example in Hastie et al. (2009).

In this work, we train our ANNs using the R function neuralnet (Fritsch et al.,
2019). The function allows to use backpropagation, resilient backpropagation (with or
without weight backtracking) or a modified globally convergent version. Moreover, it
enables flexible settings through custom-choice of error and activation function.

4 Further modeling details and evaluation aspects
In the following, some more details on the regression model approaches, which seem
to be suitable to adequately model tennis matches, are provided. Then, both the re-
gression and machine learning approaches are investigated and compared with respect
to their predictive performance on new, unseen matches. To achieve this, a leave-one-
tournament-out CV-type approach is employed to select the best model based on various
performance measures. In an external validation using previously unused data, the top
models from the earlier CV-type strategy are assessed through an expanding window
method. Subsequently, a rolling window with a recent three-year training dataset is uti-
lized to make predictions for the selected models. For all computations and evaluations
the statistical programming software R (R Core Team, 2023) was used.

4.1 Modeling details for the regression approaches
In this section, we provide more details for the regression approaches. These can also be
found in Buhamra et al. (2024), but for better readability of this manuscript we describe
them here again.

11


Principally, to model the binary outcome y (win/loss) of a tennis match, the logis-
tic regression models introduced in Section 3.1 can be employed, which generally are
given by

ln
(

πi

1−πi

)
= ηi = β0 + xi1β1 + · · ·+ xipβp .

Since the xi1, . . . ,xip, i = 1, . . . ,n, are differences of the covariates of both players,
i.e. the value of the second player always being subtracted from the one of the first
player, β0 would represent a “home effect” of the first-named player. However, since
the ordering of the two players in our dataset is random, β0 is not meaningful and,
hence, excluded (i.e., β0 is set to zero). Finally, the yi|xi1, . . . ,xip are assumed to be
independent for i = 1, . . . ,n.

Linear effects

The simplest model approach assumes linear covariate effects in the predictor, but re-
quires careful selection of covariates. For this purpose, all possible combinations of
the available variables are compared in the CV-type strategy. With seven covariates
in the dataset, there are initially ∑7

i=1
(7

i

)
= 127 combinations; however, since three of

these reflect specific age differences, we ensure that maximum one age variable can be
incorporated in each combination. This leads to a total of 63 distinct combinations.

Non-linear effects (splines)

As the assumption of strictly linear effects can possibly be very restrictive, spline mod-
els using B-splines are also considered. These enable to incorporate non-linear covari-
ate effects. Again, we regard the same 63 covariate combinations, and again compare
the respective models in terms of various performance measures. To obtain smoothness,
the splines are penalized via using P-splines. Moreover, an additional regularization ap-
proach is used, i.e. additional variable selection is performed on the spline effects by
potentially setting the corresponding spline coefficients to zero. This is implemented in
the gam-function from mgcv (Wood, 2017) (via setting the select argument to TRUE).
The underlying methodology of this additional penalization and variable selection ap-
proach is described e.g. Marra and Wood (2011).

Surface-specific player skills

Tennis matches are played on different surfaces (grass, hard court, and clay), each in-
fluencing match outcomes differently. Hence, it is reasonable to assume that players’
performances vary across these surfaces. For instance, Rafael Nadal, known as the
“king of clay”, has won 14 French Open titles but only two Wimbledon titles on grass.
Conversely, Roger Federer has won Wimbledon eight times but just once at the French
Open. As already described in Buhamra et al. (2024), we account for these player-
surface dynamics by creating a corresponding factor variable using effect encoding.
This involves adding artificial columns to the dataset with entries ∈ {−1,0,1} for each
player-surface combination; a player can have up to three such columns. If a player has
not competed on a surface, the respective column is omitted. Each row contains one
entry of 1 and of -1, while all other entries are 0. The entry 1 indicates the first player’s
combination with the match surface, while -1 corresponds to the second player’s com-
bination. For example, the line

· · · Nadal.Hard Nadal.Grass Nadal.Clay · · · Federer.Hard Federer.Grass Federer.Clay · · ·
· · · 0 1 0 · · · 0 −1 0 · · ·

12


represents a match where Rafael Nadal played against Roger Federer on grass, with
Nadal being first-named and Federer second-named. All other entries in that row are 0,
as the match was played on grass and only these two players were involved. All these
columns are added to our data set, and are used as new features in combination with the
existing variables age, ranking, points, Elo, prob, age.30 and age.int.

This results in a large number of new covariates, namely 1,024, and hence leads to
an extremely large number of parameters being estimated, and the associated estima-
tors become unstable. Hence, we combine the logistic regression model with LASSO
regularization (see Section 3.1). Consequently, this modling approach is restricted to
linear covariate effects. These surface-specific player abilities allow the model to detect
players who have won or lost more often than average on a specific surface.

Global player skills (LASSO)

Analogously, and again following Buhamra et al. (2024), one could simply allow for
global player-specific abilities, to account for players who overall perform even better
or worse than the information of their covariate values would suggest (independent
from a specific surface type). Similar to the previous paragraph, this can be done by
introducing a corresponding factor variable, again via effect coding. Now, only one
column per player is added. These columns again only have entries ∈ {−1,0,1}, with
1 and -1 occurring exactly once per row. Again, in each row, the values 1 and -1
appear at the positions corresponding to the first-named and the second-named player,
respectively, and all remaining entries are 0. The following exemplary row

· · · Novak.D jokovic · · · Ra f ael.Nadal · · · Roger.Federer · · ·
· · · 0 · · · −1 · · · 1 · · ·

indicates that Rafael Nadal played as second-named player against the first-named
player Roger Federer. For example, the entry in Novak Djokovic’s column is then 0 (as
well as for all other players), since he was not involved in that particular match. Again,
those global player-specific abilities are then added to the design matrix and used along
with the basic covariates defined in Section 2. Consequently, again a large number of
new covariates is obtained, namely 426, and, hence, a large number of player-specific
skill parameters has to be estimated. Once more, we extend the logistic regression
model with LASSO penalization and assume linear covariate effects only.

Modeling details for the machine learning approaches

For all the ML models introduced in Section 3.2, no player-specific dummies have
been included. Also note that all tree-based approaches implicitly incorporate both
non-linear as well as interaction effects. While linear SVM only accounts for linear
effects, radial SVM also allows for a certain extend of non-linearity.

4.2 Benchmark model
Similar to Buhamra et al. (2024), we built a benchmark model for prediction based on
the probabilities (probs) calculated from the average odds of the different bookmakers
included in the deuce package. According to e.g. Schauberger and Groll (2018), es-
timates for the winning probabilities of both players, π̂i1 and π̂i2, for match i can be
derived from the two corresponding winning odds, oddi1 and oddi2, as follows:

π̂i1 =

1
oddi1

1
oddi1

+ 1
oddi2

or π̂i2 =

1
oddi2

1
oddi1

+ 1
oddi2

13


Naturally, we have π̂i1 + π̂i2 = 1, adjusting for the bookmaker’s margins, which result
from the fact that the betting providers artificially lower their betting odds to gain some
profit. This means that when the inverse values of the original odds are directly used as
probabilities, they do not sum up to 1, but to a value slightly larger than 1. Hence, the
sum of the reciprocals in the denominator is used for normalization, assuming implicitly
that the margin is equally distributed to both players.

4.3 Performance measures
To compare the predictive performance of the used regression and machine learning
models on new, unseen data, several measures are considered (see also Buhamra et al.,
2024). First, let ỹ1, . . . , ỹn denote the true binary outcomes of the n matches, i.e., ỹi ∈
{0,1}, i = 1, . . . ,n. Second, let π̂i1 =: π̂i denote the probability, predicted by a certain
model, that Player 1 wins the i-th match. Since ỹi is binary, the probability that Player 2
wins the match then simply yields π̂i2 = 1− π̂i1.

Classification rate

The (mean) classification rate gives the proportion of matches correctly predicted by a
certain model, i.e.

1
n

n

∑
i=1

1(ỹi = ŷi), where ŷi =

{
1, π̂i > 0.5
0, π̂i ≤ 0.5

,

see, e.g., Schauberger and Groll (2018). Hence, large values indicate a good predictive
performance.

Predictive Bernoulli likelihood

The (mean) predictive Bernoulli likelihood is based on the predicted probability π̂i for
the true outcome ỹi, and for n observations is defined as

1
n

n

∑
i=1

π̂ ỹi
i (1− π̂i)

1−ỹi ,

see again Schauberger and Groll (2018). Once again, a large value is an indicator of a
good model.

Brier score

The Brier score is based on the squared distances between the predicted probability π̂i

and the actual (binary) output ỹi from match i, and according to Brier (1950) is defined
as

1
n

n

∑
i=1

(π̂i− ỹi)
2 .

As this is an error measure, here low values indicate a good model.

4.4 Interpretable machine learning
Next, we briefly introduce two methods from the field of interpretable machine learn-
ing (IML), namely partial dependence plots (PDP; see Friedman, 2001) and individual
conditional expectation (ICE) plots (Goldstein et al., 2015). These allow to make com-
plex, black box-type machine learning models more interpretable by understanding the

14


marginal effects of individual features (see R packages pdp by Greenwell et al., 2017).
More details on IML are provided in Molnar (2020).

Partial Dependence Plot (PDP)

Following Friedman (2001), PDPs display the marginal effect of one or two predictors
on the predicted response of a ML model, which might be linear, monotonic, or strongly
non-linear. In the notion of Molnar (2020), the partial dependence function is given by

f̂S(xS) = ExC [ f̂ (xS,xC)] =
∫

f̂ (xS,xC)dP(xC) ,

where P(·) is the density of the features in set C, xS are the predictors from the (typi-
cally small) set S for which f̂S(·) is desired, while xC denotes the remaining predictors
included in the underlying ML model f̂ (·). by integrating out the influence of the fea-
tures from C, we obtain the marginal effects of the features of interest in S, still taking
into account potential interactions.

For given training data of size n and an arbitrary value of the features from S, f̂S(·)
can be estimated via

f̂S(xS) =
1
n

n

∑
i=1

f̂ (xS,xi,C) ,

i.e. by their marginal effect on the response, averaging over the observed values xi,C

of the remaining predictors. An important underlying assumptions is that the covari-
ates from the two different set C and S are uncorrelated in order to avoid unrealistic
estimates.

Individual Conditional Expectation (ICE)

Closely related to PDPs, ICE plots illustrate the effect of a certain predictor on the
individual observation level (Goldstein et al., 2015). The plot contains one line per
observation, representing the predicted response for this observation as the covariate
of interest varies. Hence, the ICE plot reveals the variability and heterogeneity of the
covariate effects across different all individual obersvations. This way, interpretabil-
ity and transparency of a complex ML model can be increased and outliers identified.
Moreover, possible interactions between different covariates may be detected.

5 Results
In the following, the proposed regression models along with the machine learning tech-
niques are compared using the various performance measures introduced in Section 4.3.
This is done separately via a leave-one-tournament-out cross validation strategy (Sec-
tion 5.1), as well as both via an expanding window (Section 5.2) and a rolling window
external validation scenario (Section 5.3).

5.1 Leave-one-tournament-out cross validation
In a first step, to evaluate the models only the Grand Slam tournaments from 2011 to
2021 are used, i.e. the tournaments that took place in 2022 are initially not considered
and are used as external validation data later on (see Section 5.2). So, from each of
the eleven years four tournaments are used with the exception of year 2020, where
Wimbledon did not take place due to the COVID-19 pandemic. This results in a sub
data set, which contains 4,720 observations, on which a 43-fold CV-type strategy is
performed based on the following scheme (similar to Buhamra et al., 2024):

15


1. From the 43 Grand Slam tournaments present in the data set, a training data set of
42 tournaments and a test data set of the remaining tournament are constructed.

2. Then, all regression and machine learning models introduced above are fitted:

− For the models with linear influences, the function glm from the R package
stats (R Core Team, 2022) is used. As described in Section 4.1, there are
63 such models, each using at most one of the three variables for age.

− For the calculation of the spline models the function gam from the R package
mgcv (Wood, 2004) is used. Again, there are also 63 different models here.

− In order to be able to compute the two LASSO-penalized models, first the
design matrices have to be constructed as described in Section 4.1. For a
proper usage of the LASSO, then all columns of the design matrix of the
training data need to be standardized (including also those columns corre-
sponding to the surface-specific and global player skills).
The function cv.glmnet from the R package glmnet (Friedman et al.,
2010) is used to compute both models. For this, a 10-fold (inner) CV is
first performed on the current training data to find the optimal λ which pro-
vides the minimum deviance. The corresponding LASSO model is used af-
terwards for prediction. 10-fold CV is also recommended in Friedman et al.
(2010).

− The random forest model is fitted via the R package ranger (Wright et al.,
2019), wheres for the XGBoost model the respective xgboost implemen-
tation is used (Chen et al., 2019). Finally, for SVM the e1071 package is
employed, see Dimitriadou et al. (2006). Note that for the aforementioned
ML models, no player-specific dummies have been included, as the results
did not improve (result not shown). Actually, they even partly deteriorated.
Also note that the tree-based approaches implicitly incorporate both non-
linear and interaction effects.

3. After fitting the respective model, for each match of the test data the probabilities
that the first player wins are predicted.

4. Steps 1–3 are repeated until each of the 43 tournaments has once been used as
test data set.

5. Finally, the predicted results are compared with the actual results and the perfor-
mance measures defined in Section 4.3 are calculated.

Table 1 shows the results of the five best models with linear effects and the five best
models with spline effects2. Additionally, the results of the LASSO models, all machine
learning approaches and the benchmark model are shown. The models are listed below
in such a way that the best model is determined in first place, the second best in second
place and so forth.

The classification rate is above 77% for all models, except for the classification
tree (see 1st column). It is noticeable that within each group of model approaches the
classification rate is almost the same (differences are only seen in the fourth decimal).
However, for the machine learning models the values vary more. The neural network
performs best with a value of 0.8021, i.e. this model predicts the correct outcome for

2For the models with linear effects and the spline models, the top five models were selected from the 63
models by assigning ranks for each of the three performance measures, classification rate, predictive Bernoulli
likelihood, and Brier score, with the best model receiving the highest rank. If the models had equal per-
formances, average ranks were assigned. Moreover, in order to select the best models in terms of all three
measures, the three corresponding ranks were averaged per model. Finally, only the five models with the
highest average ranks were selected.

16


Table 1: Results of the leave-one-tournament-out CV approach for the five best models per regres-
sion model class, as well as for the machine learning, LASSO approaches and the benchmark model;
best performer in bold font.

Explanatory Class. Likeli- Brier
variables rate hood score

Linear

Prob 0.7772 0.6919 0.1549
Rank, Prob 0.7776 0.6918 0.1550
Points, Prob 0.7772 0.6917 0.1549
Rank, Points, Prob 0.7772 0.6917 0.1550
Age, Prob 0.7766 0.6918 0.1550

Splines

Prob 0.7764 0.6912 0.1546
Rank, Prob 0.7764 0.6912 0.1546
Points, Prob 0.7764 0.6912 0.1547
Rank, Points, Prob 0.7764 0.6912 0.1547
Elo, Prob 0.7764 0.6911 0.1547

Machine learning

ANN 0.8021 0.6828 0.1560
Random forest 0.7978 0.6817 0.1584
SVM (Linear) 0.7961 0.7415 0.1675
SVM (Radial kernel) 0.7829 0.6149 0.1807
XGBoost 0.7777 0.7234 0.1901
Classification tree 0.7486 0.6703 0.2178

LASSO
Surface specific 0.7774 0.6816 0.1551
General skills 0.7770 0.6838 0.1551
Benchmark 0.7772 0.6709 0.1555

80.21% of the matches. In second place is the RF model, which has a correct prediction
value of 0.7978. Also, it is noticeable that all spline models perform (almost) equal
with a value of 0.7764. In addition, with respect to the classification rate, four different
models deliver the same value of 0.7772.

For the (average) predictive Bernoulli likelihood (2nd column) we obtain a some-
what similar picture. The mean likelihood of the models with both linear and non-linear
effects is slightly larger than 0.69, i.e. the models predict the correct outcome with
an average probability of about 69%. The two LASSO models are just below 0.69.
It is noticeable that the linear SVM model achieves the largest predictive likelihood,
which is 0.7415. Again, the spline models performed almost identical with a value of
about 0.691.

Across model groups, similar to the classification rate the differences in the Brier
scores are rather small. The models with non-linear effects perform best in this regard,
with Brier scores between 0.1546 and 0.1547. The models with linear effects produce
slightly larger values, followed by the LASSO models and then the machine learning
models. Here, the two LASSO models delivered the same value of 0.1551. The classi-
fication tree performs clearly worst with a value of 0.2178, followed by XGBoost and
the SVM (with radial kernel) with values of 0.1901 and 0.1807, respectively.

Overall, the results show that the spline models with only prob or both rank and prob
as covariates, the random forest and the linear SVM perform best across all performance
measures.

5.2 Expanding window validation
Next, we will validate the models from above with respect to their predictive perfor-
mance on new, unseen test data in a more realistic way. For this purpose, we investigate

17


the three best models from the groups of modeling approaches with linear and non-
linear effects, as well as the two LASSO models, the machine learning models and the
benchmark model. Again, all performance measures are calculated on the four Grand
Slam tournaments 2022, which contains overall 293 matches. To mimic a more realistic
prediction use case, this time the validation is performed using an expanding window
forecasting approach, i.e. each time one of the remaining tournaments is used as the
test data set in chronological order, and the training data set is constantly updated and
enlarged. Altogether, this scheme has already been used in Buhamra et al. (2024) and
can be explained as follows:

1. First, all tournaments prior to 2022 are used as the training data set. Then, all
models are fitted as described in step 2 of the CV approach described in Sec-
tion 5.1. Based on those, predictions are derived for the 2022 Australian Open
matches, as this was the 1st Grand Slam tournament in 2022.

2. Then, the training data is updated, by adding the matches of the Australian Open
2022 on which the models are fitted again and predictions are made for the French
Open 2022, which is the 2nd Grand Slam tournament in 2022.

3. Next, the matches of the French Open 2022 are added to the training data set and
the models are fitted again. Based on those fits, Wimbledon 2022 is predicted.

4. Then, the Wimbledon 2022 matches will be added to the training data set, and
again, the models are fitted and predictions are made for the final Grand Slam
tournament in 2022, the US Open 2022.

Finally, the prediction results for all four tournaments are compared with the actual
match outcomes, and the corresponding performance measures are calculated (see Ta-
ble 2 for results).

Table 2: Results of the expanding window approach for the best three regression models out of
those five best models found Section 5.1, as well as machine learning, LASSO approaches and the
benchmark model; best performer in bold font.

Explanatory Class. Likeli- Brier
variables rate hood score

Linear
Rank, Prob 0.7816 0.7037 0.1418
Prob 0.7816 0.7036 0.1419
Points, Prob 0.7747 0.7027 0.1421

Splines
Points, Prob 0.7782 0.7033 0.1411
Prob 0.7782 0.7031 0.1413
Rank, Prob 0.7782 0.7031 0.1413

Machine learning

SVM (Linear) 0.8295 0.7637 0.1446
XGBoost 0.8306 0.7526 0.1562
Random forest 0.8162 0.6734 0.1511
SVM (Radial kernel) 0.7956 0.5858 0.1882
ANN 0.7918 0.6904 0.1466
Classification tree 0.7577 0.6803 0.2061

LASSO
General skills 0.7782 0.6969 0.1416
Surface specific 0.7782 0.6934 0.1423
Benchmark 0.7816 0.6810 0.1441

Regarding the classification rate, most of the models performed better here com-
pared to the classification rates from the CV-type strategy from above. However, here
the classification tree model delivers the worst results with a value of 0.7577, while

18


the best value is 83.06% and is achieved by the XGBoost model. Now, the benchmark
model is not among the best.

Similar trends as in Table 1 (cf. page 17) can be seen for the predictive likelihood,
although the results here have generally slightly improved. XGBoost together with
linear SVM achieve the largest likelihood, where the linear SVM model delivers the
best value with 0.7637, only slightly behind are the linear and spline regression models
that perform almost equally (all around between 0.7031 and roughly 0.7036): in case
of the linear models the differences appeared in the third decimal, while in case of non-
linear effects the differences can be seen only in the fourth decimal. Also, it should
be noted that two of the spline models deliver the same value of 0.7031. Both the
benchmark model and the LASSO models yield an even slightly smaller likelihood.
The LASSO models perform almost equally with differences that can be seen only in
the third decimal place. The SVM model with radial kernel here performs worst.

For most of the model the Brier score is ∈ [0.1411−0.1441], except for the machine
learning models, where the values differ more. Compared to Section 5.1, the values are
slightly smaller. The classification tree model again yields the largest, and thus worst
value with 0.2061. The best value is 0.1411 and is achieved by the non-linear model
including covariates points and prob.

Overall, interestingly almost all approaches have improved, some even quite sub-
stantially. This indicates that the predictions should occur in chronological order. This
time the results show that the spline models with both points and prob as covariates,
XGBoost and linear SVM perform best, each with respect to one of the performance
measures. Overall, also the linear model with only prob performs quite well. Alto-
gether, the expanding window strategy seems to be clearly superior compared to the
CV-type strategy.

5.3 Rolling window validation
An alternative strategy is to only use recent data for prediction, but in a rolling window
manner. Hence, only 3 years of training data are used, i.e. 12 tournaments in total,
starting from the first tournament in 2011, which is Australian open, until we reach the
last tournament of 2013, which is US open. This means we use the 12 tournaments from
2011 to 2013 as first training data set and then predict the 13th tournament (i.e., the first
tournament of 2014, which is Australian Open). Then, a rolling window approach is
performed and the training data set constantly changes with a fixed short time window
of three years. We again focus on the three best models from the groups of linear and
non-linear regresion models, as well as the two LASSO models, the machine learning
models and the benchmark model. In detail, this scheme can be explained as follows:

1. The tournaments of the first 3 years in our data set, 2011 - 2013, are used as
training data, on which the models are fitted. Then, predictions can be obtained
for the 2014 Australian Open, the 13th tournament in our data set and first Grand
Slam tournament in 2014.

2. The new training data set then shifts by one tournament and starts from the 2nd

tournament for 2011, which is French Open 2011 and ends with the 13th tourna-
ment, i.e. Australian Open 2014. Again, all models are fitted and used to predict
the next tournament, which is now the 14th tournament in our data set (and 2nd

Grand Slam tournament in 2014).

3. This procedure is repeated until finally, the last training data consists tournaments
35-46 from our full data set, on which again all models are fitted and used to
predict the very last tournament in our data, which is the 47th tournament, namely
the final Grand Slam tournament in 2022, US Open 2022.

19


Finally, the predicted results for all tournaments, on which predictions have been made,
starting from Australian open 2014 until the US open 2022, are compared with the
actual results and the respective performance measures are calculated. The results of
the rolling window are shown in Table 3.

Table 3: Results of the rolling window approach for the best three regression models out of those five
best models found Section 5.1, as well as machine learning, LASSO approaches and the benchmark
model; best performer in bold font.

Explanatory Class. Likeli- Brier
variables rate hood score

Linear
Prob 0.7743 0.6893 0.1562
Points, Prob 0.7740 0.6890 0.1546
Rank, Prob 0.7714 0.6884 0.1563

Splines
Rank, Prob 0.7727 0.6895 0.1564
Points, Prob 0.7719 0.6887 0.1566
Prob 0.7708 0.6884 0.1563

Machine learning

SVM (Linear) 0.8012 0.7424 0.1709
Random forest 0.7971 0.6727 0.1608
XGBoost 0.7755 0.7231 0.2074
SVM (Radial kernel) 0.7729 0.5448 0.2139
ANN 0.7726 0.6837 0.1554
Classification tree 0.7256 0.6900 0.2678

LASSO
Surface specific 0.7735 0.6735 0.1564
General skills 0.7727 0.6753 0.1565
Benchmark 0.7743 0.6683 0.1567

Principally, the results are quite similar compared to the previous results, but have
all deteriorated a bit. The classification rate is again slightly above 77% for all mod-
els, with differences mostly only in the third decimals, except for the machine learning
models which partly perform better (random forest and linear SVM). Here, linear SVM
performs best with a value of 0.8012, i.e. this model predicts the correct outcome for
80.12% of the matches. Also, the random forest model here delivered good perfor-
mance with a value of 0.7971. All non-linear models performed very similar, with
differences that can bee found only in the third decimal place. The two LASSO models
also achieved very similar values around 0.773. While three other models deliver the
value 0.774.

Compared to the results in Table 2, the predictive likelihood results are somewhat
lower here. The best value is achieved by the linear SVM model (0.7424), followed
by XGBoost with a value of 0.7231. The differences in both linear and non-linear
regression model can only be seen in the third decimal. Both LASSO models yield
very similar results around 0.67. The SVM with radial kernel and the benchmark model
perform worst with values of 0.5448 and 0.6683, respectively.

Also the Brier score values are slightly worse (i.e., larger) compared to the previous
results from Table 2. This clearly indicates that for a good predictive performance on
Grand Slam matches, training data should be rather large, and a collection of three years
of Grand Slam matches data as used here is not sufficient. Hence, besides the precise
choice of statistical or machine learning approach, a sound basis of informative covari-
ates and large enough sample sizes are very important. The classification tree model
once again yields the worst value with 0.2678. The linear model including covariates
points and prob is achieved the best value of 0.1546. However, all linear and non-linear
regression models, the tow LASSO models and the benchmark model deliver roughly

20


similar values with differences that can be seen only in the fourth decimal. All machine
learning models instead result in rather large values, with linear SVM, random forest
and neural network still being acceptable.

5.4 Interpretable machine learning
In the following, interpretable machine learning (IML) is used to analyze the perfor-
mance of XGBoost model, which achieved the largest classification rate among all
models, with a value of 0.8306, as discussed in Section 5.2. For this purpose, combined
partial dependence plots (PDPs; Friedman, 2001) and Individual Conditional Expecta-
tion (ICE; Goldstein et al., 2015 plots are presented to enhance the interpretability and
visualization of the respective complex model. For the implementation in R, we rely on
the pdp package (Greenwell et al., 2017). Also, we use the ggplot2 (Hadley, 2016) to
create custom PDPs and ICE plots. In Figure 1, a multi-panel plot is presented where
each panel represents the combined PDP and ICE plot for a different covariate (namely,
age.30, Elo, points and rank). The x-axis of each panel represents the respective covari-
ate (recall that we account for differences), while the y-axis shows the corresponding
predicted winning probabilities. This visualization helps in understanding how each
predictor variable individually influences the predicted outcome, highlighting potential
non-linear relationships captured by the model. The ICE plot (gray lines) shows how
predictions vary for individual instances, while the PDP (bold black line) displays the
overall average effect of the variable.

The steep increasing trend for the Elo panel in the PDP line indicates that as Elo
differences increase, the predicted probability of Player 1 winning also increases. Re-
garding the ICE lines, as they vary widely, the impact of Elo difference is not consistent
across all matches, possibly due to interactions with other features. For the rank, PDP
line decreases with increasing rank difference, which means that when Player 1 has
a worse ranking, his predicted probability of winning decreases. The PDP for age.30
shows a slightly increasing trend at the beginning, then a slight decrease, indicating that
age difference does not have a substantial impact on predictions. Toward the end, we
observe another obvious fluctuating trend. This could mean that the impact of extreme
age differences is inconsistent.

For covariate points, the PDP line also fluctuates rather strongly, indicating that the
effect of the points difference on the predicted probability of winning is not straightfor-
ward, i.e., there could be ranges where having a larger points advantage helps, but other
ranges where the effect is weaker or even reversed. Another explanation could be that
the impact of the points difference correlates with other covariates like surface, age, or
Elo rating. For example, a player with a small points advantage might have a higher
winning probability on clay courts, but not necessarily on grass courts (i.e., there is an
interaction effect with other features).

6 Summary and overview
In this manuscript, we continued the work from Buhamra et al. (2024) and modeled ten-
nis matches in Grand Slam tournaments within the framework of regression and modern
machine learning approaches. For this purpose, the same data set as used in Buhamra
et al. (2024) was analyzed, which was build based on the deuce R package (Kovalchik,
2019). The final data set contains information on 5,013 matches in 47 men’s Grand
Slam tournaments from the years 2011-2022, and includes covariate information on the
age difference of both players, the difference in their rank positions and ATP points, in
Elo rating, in probabilities for the victory of the players, calculated from the average

21


points rank

age.30 elo

−1
00

00 0

10
00

0
−5

00 0
50

0
10

00

−1
0 −5 0 5 10

−1
00

0
−5

00 0
50

0
10

00

0.0

0.2

0.4

0.6

0.8

1.0

0.0

0.2

0.4

0.6

0.8

1.0

P
re

di
ct

ed
 P

ro
ba

bi
lit

y

Figure 1: Combined PDP (thick black line) and ICE plots (gray lines) for XGBoost model in com-
bination with the expanding window approach, including covariates age.30, Elo, points and rank

odds by the betting providers (prob), as well as the two additional age variables, age.30
and age.Int, which take into account that the optimal age of a tennis player is between
28 and 32 years.

Different regression models, which were already considered in Buhamra et al. (2024),
were compared to modern machine learning approaches for modeling and prediction of
tennis matches. As the outcome in tennis is binary (win or loss), the probability of the
first-named player to win is modeled. It was discussed that for the logistic regression
approaches, modeling with intercept is not suitable, as this would incorporate a kind of
home effect for the first-named player - a property which was not desired here.

The different modeling approaches were compared in (i) a 43-fold leave-one-tourna-
ment-out CV-type strategy, (ii) an expanding window, and (iii) a rolling window valida-
tion strategy (with a short window for a period of three years). The following regression
and ML approaches were included:

− Logistic regression with linear effects: all possible combinations of the seven
covariates, such that at most one of the three age variables was incorporated,
were considered. This resulted in 63 models.

− Logistic regression with non-linear effects (splines): Again, the same 63 models
were considered.

− A model taking into account surface- and player-specific effects. Due to the large

22


amount of unknown parameters, here LASSO penalization was used.

− A model taking into account general player-specific abilities. Again, LASSO
penalization was used.

− Different machine learning approaches: classification trees, random forest, XG-
Boost, and support vector machine with linear and radial kernel.

− A benchmark model, where the predicted probabilities were derived from average
betting odds.

Within the CV-type approach, the models were compared in terms of the classifi-
cation rate, the predictive Bernoulli likelihood, and the Brier score. Since 63 different
models resulted for each of the first two approaches, the five best models were selected
in each case.

It was found that the selected linear and spline models performed very similarly
in terms of classification rate, likelihood and Brier score. The classification rate was
slightly above 77% for all models, meaning that the models predicted the correct out-
come in about 77% of the matches. The predictive Bernoulli likelihood was about
0.69, while the LASSO models and the benchmark model were slightly below. The
models thus predicted with an average probability of about 69% the correct outcome of
a match. The Brier score was above 0.154 for all models, and differences were mostly
found either in the third or fourth decimal. Regarding the machine learning models the
values for the Brier score were more volatile.

The tournaments in 2022 were then used as an external validation data set in an ex-
panding window validation approach. The three best models with each linear and non-
linear effects, the two LASSO models, all machine learning models and the benchmark
model were then compared again. Principally, the results of the preceding CV-type
competition could be mostly confirmed, but surprisingly the expanding window strategy
turned out to be clearly superior compared to the CV-type strategy: the classification
rate was around 0.757−0.831, the predictive likelihood yielded around 0.681−0.764,
and the Brier score was between 0.141−0.206. Obviously, performances of the models
vary across measures. The XGBoost has the best performance among all others when
considering the classification rate. The (linear) SVM model achieves the best predic-
tive performance in terms of the predictive likelihood. Meanwhile, the spline model,
including both covariates points and prob, outperformed all other models by yielding
the lowest Brier score value.

In order to use more recent data sets for prediction, we also investigated a rolling
window approach, which was based on a training data containing 12 tournaments and
which was used to always predict the next tournament. The same selection of mod-
els as for the expanding window validation approach was analyzed. Principally, the
results of the preceding expanding window-strategy competition could be mostly con-
firmed, with the values of the performance measures generally being slightly worse.
The classification rate was around between 0.771− 0.801, the predictive likelihood
yielded around 0.545−0.742, and the Brier score was between 0.154−0.268. Again,
the values somewhat differ between models and proposed performance measures. If
we consider the classification rate and the predictive likelihood, then the (linear) SVM
model outperforms all other models. However, for the Brier score, the linear regression
model including points and prob achieves the best performance.

So, interestingly, our analyses show that for the prediction of Grand Slam tennis
matches, while there are slight differences across different statistical and machine learn-
ing approaches, the specific forecast strategy used seems to play an even more important
role. It turned out that on our data, an expanding window approach was clearly supe-
rior compared to two other forecast strategies. The results show that on the one hand,

23


it seems to be important to do the predictions in chronological order, as the expand-
ing window approach clearly outperforms the CV-type strategy. On the other hand,
the training data needs to be rather large, and three years of training data as used in
our rolling window approach are not sufficient. Particularly, complex machine learning
methods such as SVM and XGBoost, which involve an extensive tuning process, seem
to benefit from these two aspects to fully exploit their potential, as they exhibit rather
good predictive performance (see Table 2). These findings are consistent with the gen-
erally known result that feature engineering tends to be the more critical choice for the
modeler than the choice among strongly related models.

In the same direction heads a certain feature engineering-type approach, namely
statistical enhanced learning (SEL), which is based on so-called “statistical enhanced
covariates”. These are covariates which are not directly observed but obtained by per-
forming certain steps of feature engineering, or even as statistical estimators from sep-
arate statistical model, see Felice et al. (2023) for more details. For example, in the
context of modeling soccer, Ley et al. (2019) proposed a model to estimate flexible,
time-varying team-specific ability parameters. The resulting estimates were then added
to the set of (conventional) features in a random forest model, which turned out to
be quite sucessful for predicting the FIFA World Cup 2018 in Groll et al. (2019). A
first attempt to carry over this idea to tennis is sketched in Buhamra and Groll (2025).
Moreover, we are currently working on extending the approach from Ley et al. (2019)
to professional tennis (see Bartmann et al., 2025). In future research, we plan to inves-
tigate the predictive potential of such “statistical enhanced covariates” in more detail
and to combine the methods proposed here with the player-specific ability parameters
obtained with approach from Bartmann et al. (2025).

References
Bartmann, H., Groll, A., and Michels, R. (2025). Estimation of match abilities for

tennis players via a maximum likelihood approach. Submitted at the IES 2025 con-
ference proceedings.

Breiman, L. (1996a). Bagging predictors. Machine learning, 24:123–140.

Breiman, L. (1996b). Heuristics of instability and stabilization in model selection. The
annals of statistics, 24(6):2350–2383.

Breiman, L. (2001). Random forests. Machine learning, 45:5–32.

Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and regres-
sion trees. wadsworth int. Group, 37(15):237–251.

Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly
Weather Review, 78:1–3.

Buhamra, N. and Groll, A. (2025). Statistical enhanced learning for modeling and pre-
diction tennis matches at grand slam tournaments. arXiv preprint arXiv:2502.01613.

Buhamra, N., Groll, A., and Brunner, S. (2024). Modeling and prediction of tennis
matches at grand slam tournaments. Journal of Sports Analytics, 10(1):17–33.

Bühlmann, P. and Hothorn, T. (2007). Boosting algorithms: Regularization, prediction
and model fitting.

24


Bühlmann, P. and Yu, B. (2003). Boosting with the l 2 loss: regression and classifica-
tion. Journal of the American Statistical Association, 98(462):324–339.

Chen, T. and Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Pro-
ceedings of the 22nd acm sigkdd international conference on knowledge discovery
and data mining, pages 785–794.

Chen, T., He, T., Benesty, M., and Khotilovich, V. (2019). Package xgboost. R version,
90:1–66.

Chen, T., He, T., Benesty, M., Khotilovich, V., Tang, Y., Cho, H., Chen, K., Mitchell,
R., Cano, I., Zhou, T., Li, M., Xie, J., Lin, M., Geng, Y., Li, Y., and Yuan, J. (2023).
xgboost: Extreme gradient boosting. R package version 1.7.5.1.

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine learning, 20:273–
297.

Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D., Weingessel, A., and Leisch, M. F.
(2006). The e1071 package. Misc Functions of Department of Statistics (e1071), TU
Wien, pages 297–304.

Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D., Weingessel, A., and Leisch, M. F.
(2009). Package e1071. R Software package, avaliable at http://cran. rproject.
org/web/packages/e1071/index. html.

Eilers, P. H. and Marx, B. D. (2021). Practical smoothing: The joys of P-splines.
Cambridge University Press.

Eilers, P. H. C. and Marx, B. D. (1996). Flexible smoothing with B-splines and penal-
ties. Statistical Science, 11:89–121.

Fahrmeir, L., Kneib, T., Lang, S., and Marx, B. (2013). Regression. Modells, Methods
and Applications. Springer, Berlin.

Fahrmeir, L. and Tutz, G. (2001). Multivariate Statistical Modelling Based on Gener-
alized Linear Models. Springer-Verlag, New York, 2nd edition.

Felice, F., Ley, C., Groll, A., and Bordas, S. (2023). Statistically enhanced learning:
a feature engineering framework to boost (any) learning algorithms. arXiv preprint
arXiv:2306.17006.

Freund, Y., Schapire, R. E., et al. (1996). Experiments with a new boosting algorithm.
In icml, volume 96, pages 148–156. Citeseer.

Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: a
statistical view of boosting (with discussion and a rejoinder by the authors). The
annals of statistics, 28(2):337–407.

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized
linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22.

Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine.
Annals of statistics, pages 1189–1232.

Fritsch, S., Guenther, F., and Wright, M. N. (2019). neuralnet: Training of Neural
Networks. R package version 1.44.2.

25


Gao, Z. and Kowalczyk, A. (2021). Random forest model identifies serve strength as a
key predictor of tennis match outcome. Journal of Sports Analytics, 7(4):255–262.

Goldstein, A., Kapelner, A., Bleich, J., and Pitkin, E. (2015). Peeking inside the black
box: Visualizing statistical learning with plots of individual conditional expectation.
journal of Computational and Graphical Statistics, 24(1):44–65.

Greenwell, B. M. et al. (2017). pdp: An r package for constructing partial dependence
plots. R J., 9(1):421.

Groll, A., Ley, C., Schauberger, G., and Van Eetvelde, H. (2019). A hybrid random
forest to predict soccer matches in international tournaments. Journal of Quantitative
Analysis in Sports, 15:271–287.

Hadley, W. (2016). Ggplot2: Elegrant graphics for data analysis. Springer.

Hastie, T., Tibshirani, R., Friedman, J. H., and Friedman, J. H. (2009). The elements of
statistical learning: data mining, inference, and prediction, volume 2. Springer.

James, G., Witten, D., Hastie, T., Tibshirani, R., et al. (2013). An introduction to
statistical learning, volume 112. Springer.

Kovalchik, S. (2019). deuce: resources for analysis of professional tennis data. R
package version 1.4.

Ley, C., Wiele, T. V. d., and Eetvelde, H. V. (2019). Ranking soccer teams on the
basis of their current strength: A comparison of maximum likelihood approaches.
Statistical Modelling, 19(1):55–73.

Marra, G. and Wood, S. N. (2011). Practical variable selection for generalized additive
models. Computational Statistics & Data Analysis, 55(7):2372–2387.

Mayr, A., Binder, H., Gefeller, O., and Schmid, M. (2014). The evolution of boosting
algorithms. Methods of information in medicine, 53(06):419–427.

McClelland, J. L., Rumelhart, D. E., Group, P. R., et al. (1987). Parallel distributed pro-
cessing, volume 2: Explorations in the microstructure of cognition: Psychological
and biological models, volume 2. MIT press.

Molnar, C. (2020). Interpretable machine learning. Lulu. com.

Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized linear models. Journal of
the Royal Statistical Society, A 135:370–384.

Noble, W. S. (2006). What is a support vector machine? Nature biotechnology,
24(12):1565–1567.

Oshiro, T. M., Perez, P. S., and Baranauskas, J. A. (2012). How many trees in a random
forest? In Machine Learning and Data Mining in Pattern Recognition: 8th Interna-
tional Conference, MLDM 2012, Berlin, Germany, July 13-20, 2012. Proceedings 8,
pages 154–168. Springer.

Park, M. Y. and Hastie, T. (2007). L1-regularization path algorithm for generalized
linear models. Journal of the Royal Statistical Society Series B, 19:659–677.

R Core Team (2022). R: A Language and Environment for Statistical Computing. R
Foundation for Statistical Computing, Vienna, Austria.

26


R Core Team (2023). R: A Language and Environment for Statistical Computing. R
Foundation for Statistical Computing, Vienna, Austria.

Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage
and organization in the brain. Psychological review, 65(6):386.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986a). Learning representations
by back-propagating errors. nature, 323(6088):533–536.

Rumelhart, D. E., McClelland, J. L., Group, P. R., et al. (1986b). Parallel distributed
processing, volume 1: Explorations in the microstructure of cognition: Foundations.
The MIT press.

Schauberger, G. and Groll, A. (2018). Predicting matches in international football
tournaments with random forests. Statistical Modelling, 18(5-6):460–482.

Sipko, M. and Knottenbelt, W. (2015). Machine learning for the prediction of profes-
sional tennis matches. MEng computing-final year project, Imperial College London,
2.

Somboonphokkaphan, A., Phimoltares, S., and Lursinsap, C. (2009). Tennis winner
prediction based on time-series history with neural modeling. In Proceedings of
the International MultiConference of Engineers and Computer Scientists, volume 1,
pages 18–20. Citeseer.

Steinwart, I. and Christmann, A. (2008). Support vector machines. Springer Science &
Business Media.

Therneau, T., Atkinson, B., Ripley, B., and Ripley, M. B. (2015). Package rpart. Avail-
able online: cran. ma. ic. ac. uk/web/packages/rpart/rpart. pdf (accessed on 20 April
2016).

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the
Royal Statistical Society, B 58:267–288.

Valero, C. S. (2016). Predicting win-loss outcomes in mlb regular season games–a
comparative study using data mining methods. International Journal of Computer
Science in Sport, 15(2):91–112.

Wang, Y.-C. F. and Casasent, D. (2008). New support vector-based design method for
binary hierarchical classifiers for multi-class classification problems. Neural Net-
works, 21(2-3):502–510.

Weston, D. (2014). Using age statistics to gain a tennis betting
edge. http://www.pinnacle.com/en/betting-articles/Tennis/

atp-players-tipping-point/LMPJF7BY7BKR2EY.

Whiteside, D., Cant, O., Connolly, M., and Reid, M. (2017). Monitoring hitting load
in tennis using inertial sensors and machine learning. International journal of sports
physiology and performance, 12(9):1212–1217.

Wilkens, S. (2021). Sports prediction and betting models in the machine learning age:
The case of tennis. Journal of Sports Analytics, 7(2):99–117.

27


Wood, S. N. (2004). Stable and efficient multiple smoothing parameter estimation
for generalized additive models. Journal of the American Statistical Association,
99(467):673–686.

Wood, S. N. (2017). Generalized Additive Models: An Introduction with R. Chapman
& Hall/CRC, London, 2nd edition.

Wright, M. N., Wager, S., Probst, P., and Wright, M. M. N. (2019). Package ranger.
Version 0.11, 2.

Wright, M. N. and Ziegler, A. (2017). ranger: A fast implementation of random forests
for high dimensional data in C++ and R. Journal of Statistical Software, 77(1):1–17.

28


Statistical enhance learning for modeling and
prediction tennis matches at Grand Slam

tournaments

N. Buhamra1 A. Groll2

1Department of Statistics, TU Dortmund University, Vogelpothsweg 87, 44221 Dortmund, email:
nourah.buhamra@tu-dortmund.de

2Corresponding author, Department of Statistics, TU Dortmund University, Vogelpothsweg 87,
44221 Dortmund, email: groll@statistik.tu-dortmund.de

Abstract

In this manuscript, we concentrate on a specific type of covariates, which we call sta-
tistically enhanced, for modeling tennis matches for men at Grand slam tournaments.
Our goal is to assess whether these enhanced covariates have the potential to improve
statistical learning approaches, in particular, with regard to the predictive performance.
For this purpose, various proposed regression and machine learning model classes are
compared with and without such features. To achieve this, we considered three slightly
enhanced variables, namely elo rating along with two different player age variables.
This concept has already been successfully applied in football, where additional team
ability parameters, which were obtained from separate statistical models, were able to
improve the predictive performance.

In addition, different interpretable machine learning (IML) tools are employed to
gain insights into the factors influencing the outcomes of tennis matches predicted by
complex machine learning models, such as the random forest. Specifically, partial de-
pendence plots (PDP) and individual conditional expectation (ICE) plots are employed
to provide better interpretability for the most promising ML model from this work.
Furthermore, we conduct a comparison of different regression and machine learning
approaches in terms of various predictive performance measures such as classification
rate, predictive Bernoulli likelihood, and Brier score. This comparison is carried out
on external test data using cross-validation, rolling window, and expanding window
strategies.

Keywords: Grand Slam tournaments, tennis matches, machine learning, prediction,
statistical enhance covariates, interpretable machine learning, expanding window.

1 Introduction
In recent years, various methodologies for statistical and machine learning based mod-
eling of tennis matches and tournaments have emerged, and the existing methods for
predicting the probability of winning matches in tennis have been expanded. Moreover,
there is potential to calculate winning probabilities for an entire tournament when all
individual matches can be predicted.

1

ar
X

iv
:2

50
2.

01
61

3v
1 

 [
st

at
.A

P]
  3

 F
eb

 2
02

5


Recently, machine learning (ML) models have been employed to predict the out-
comes of tennis matches. Somboonphokkaphan et al. (2009) introduced a method uti-
lizing match statistics and environmental data to predict winners using a Multi-Layer
Perceptron (MLP) with a back-propagation learning algorithm. MLP, a type of Artifi-
cial Neural Network (ANN), is effective for solving real-world classification problems
and predicting outcomes, especially when handling large databases with incomplete or
noisy data. Whiteside et al. (2017) proposed an automated stroke classification system
to quantify hitting load in tennis, using machine learning techniques like a cubic ker-
nel support vector machine. Wilkens (2021) expanded previous research by applying
various ML techniques, including neural networks and random forests, with extensive
datasets from professional men’s and women’s tennis singles matches. Despite these
efforts, he found that the average prediction accuracy does not exceed 70%.

Sipko and Knottenbelt (2015) predicted match winners based on the probability
of winning serve points, which subsequently indicates the overall probability of win-
ning the match. They extracted 22 features from historical data, including abstract
features like player fatigue and injury, and optimized models that outperformed Knot-
tenbelt’s Common-Opponent model using ML approaches such as artificial neural net-
works (ANNs). They suggest that ML-based techniques can innovate tennis betting,
noting that ANNs generated a 4.35% return on investment, a 75% improvement in the
betting market. Moreover, Gao and Kowalczyk (2021) developed a model that predicts
tennis match outcomes with over 80% accuracy, surpassing predictions based on bet-
ting odds alone, and identifying serve strength as a crucial predictor. Their model used
a random forest classifier, highlighting the importance of simple models even in the age
of deep learning. Their comprehensive data set, compiled from ATP data from 2000
to 2016, includes a variety of features capturing physical, psychological, court-related,
and match-related variables. Finally, a comprehensive overview of modeling and pre-
dicting tennis matches at Grand Slam tournaments by different regression approaches
has been presented in Buhamra et al. (2024).

The main focus of this manuscript, however, is to analyze, whether so-called en-
hanced covariates have the potential to improve statistical and machine learning ap-
proaches, in particular, with regard to predictive performance. Generally, in recent
years, there has been a growing interest in feature engineering. Effective feature engi-
neering plays a crucial role in enhancing the performance of machine learning models
by identifying and capturing relevant patterns and relationships within the data. This
enables models to improve their predictive accuracy and extract meaningful insights
from the data. For instance, Felice et al. (2023) introduce the concept of Statistically
Enhanced Learning (SEL), a formalization framework for existing feature engineering
and extraction tasks in ML. This approach has the potential to address challenges in
ML tasks by optimizing feature selection and representation.

For example, in the context of modeling soccer, Ley et al. (2019) proposed a model
to estimate flexible, time-varying team-specific ability parameters. The resulting esti-
mates were then added to the set of (conventional) features in a random forest model,
which turned out to be quite successful for predicting the FIFA World Cup 2018 in
Groll et al. (2019a).

When using modern and complex ML models, another important aspect is inter-
pretability of the fitted model. Hence, several studies have been conducted in the field
of understanding and interpreting complex (black box-type) ML models. For example,
Auret and Aldrich (2012) used variable importance measures, directly generated by
the random forest models, and partial dependence plots, indicating that random forest
models can reliably identify the influence of individual variables, even in the presence
of high levels of additive noise.

2


Moreover, Goldstein et al. (2015) present both individual conditional expectation
(ICE) plots and classical partial dependence plots (PDPs) on three different real data
sets, namely depression clinical trial, white wine and diabetes classification in Pima
Indians. They demonstrate how ICE plots can shed light on estimated models in ways
PDPs cannot. Accordingly, ICE plots refine the PDP by graphing the functional rela-
tionship between the predicted response and the feature for individual observations. In
particular, ICE plots highlight the variation in the fitted values across the range of a
certain selected covariate, suggesting where and to what extent heterogeneities might
exist.

More generally, Molnar et al. (2023) investigated the relationship between Partial
Dependence Plots (PDPs) and Permutation Feature Importance (PFI) methods in un-
derstanding the data generating process in machine learning models. They explored
how these two techniques can provide complementary insights into the importance and
effects of features on model predictions. Consequently, they formalize PDP and PFI as
statistical estimators representing the ground truth estimands rooted from the data gen-
erating process. Their analysis reveals that PDP and PFI estimates can deviate from this
ground truth not only due to statistical biases but also due to variations in learner behav-
ior and errors in Monte Carlo approximations. To address these uncertainties in PDP
and PFI estimation, they introduce the learner-PD and learner-PFI approaches, which
involve model refits, and propose corrected variance and confidence interval estimators.

Unlike traditional black-box models, interpretable machine learning (IML) models
aim to provide insights into the decision-making process, enabling users to make in-
formed decisions and understand the implications of model outputs. Therefore, in this
work, we focused on IML models, such as partial dependency plots (PDP; Friedman,
2001) and Individual Conditional Expectation (ICE; Apley and Zhu, 2020), to make our
employed random forest model more interpretable. Additionally, we demonstrate how
feature engineering techniques can be applied in the context of sports analytics to en-
hance predictive modeling and gain insights from sports data. Specifically, we examine
the use of covariates such as Elo, Age.30 and Age.int, which all are not directly avail-
able, but are obtained from either a separate modeling approach (Elo) or via meaningful
mathematical transformations (Age.30, Age.int), in order to enhance statistical models
for tennis. For this application, various regression and machine learning approaches
were considered, including linear regression, spline models, and random forest. These
approaches were then compared based on various performance measures within an ex-
panding window strategy for analyzing tennis data from Grand Slam tournaments. To
conduct this analysis, a dataset was compiled using the R package deuce (Kovalchik,
2019). This data set included information on 5,013 matches from men’s Grand Slam
tournaments spanning the years 2011 to 2022. Several potential covariates were con-
sidered, including the players’ age, ATP ranking and points, odds, elo rating, as well
as two additional age variables. These additional age variables were designed to reflect
the “optimal” age range of a tennis player, which is typically between 28 and 32 years
(Weston, 2014).

The remainder of the article is structured as follows. Section 2 briefly introduces the
data set and defines the objectives of this work. Then, in Section 3, different modeling
approaches are introduced, including classical regression approaches and ML meth-
ods such as random forest. Besides, some interpretable machine learning techniques
like partial dependence plots (PDP) and individual conditional expectation (ICE) plots
are discussed, and various performance measures are defined. In Section 4, the mod-
eling approaches are compared in terms of various performance measures, using an
expanding window strategy. Additionally, interpretations for different model classes
considering enhanced covariates and IML techniques are provided. Finally, Section 5

3


summarizes the main results and gives a final overview.

2 Data
Next, we shortly introduce our data set, which was compiled using the R package deuce
(Kovalchik, 2019), and contains information on 5,013 matches from men’s Grand Slam
tournaments from 2011 to 2022. It was already used in Buhamra et al. (2024) and
Buhamra et al. (2025), and contains the following variables.

Victory: A dummy variable indicating whether the first-named player won the match
(1: yes, 0: no), used as the response variable in all models.

Conventional covariates

The following three covariates are conventional covariates, which could be extracted
more or less directly from public sources. They are all incorporated in our final data set
in the form of differences, i.e. for each feature the value of the 2nd player is subtracted
from the one of the 1st player.

Age: A metric predictor collecting the age of the players in years. Note that players’
ages were not given directly and had to be deduced from the player’s date of birth
as well as the date of the respective match.

Rank: For each player, the rank at the start of the respective tournament was collected.
The position in the ranking is based on the ATP ranking points.

Points: The ATP world ranking points are awarded for each match won per tournament.
Wins in later rounds of a tournament are valued higher than wins in the first
rounds of a tournament. Points earned in a tournament expire after 52 weeks.

Enhanced covariates

Next, we introduce three covariates which we call enhanced, as they are not directly
available, but are obtained from either a separate modeling approach (Elo) or via mean-
ingful mathematical transformations (Age.30, Age.int), in order to enhance statistical
models for tennis. Again, also these features are all incorporated as differences (value
of 2nd player minus value of 1st player).

Elo: For each player, the overall Elo rating before a certain match is collected. The
idea of the Elo rating system is that one can more accurately track and predict
player performances over time, taking into account the relative strength of op-
ponents and the outcomes of matches. Each player starts with an initial rating
(often set around 1500 points, though this can vary depending on the specific im-
plementation). After each match, the players’ ratings are updated based on the
match outcome. If a higher-rated player wins, they gain fewer points than if a
lower-rated player wins. The amount of points exchanged in a match depends on
the difference in ratings between the two players. Hence, this covariate is consid-
ered “enhanced” as it involves complex calculations and provides a more delicate
measure of player performance. Further details on the Elo rating in tennis can be
found in Angelini et al. (2022) and Vaughan Williams et al. (2021)

Age.30: This variable is given by the absolute distance between the age of the players
and reference age 30. This is based on the assumption that the standard Age
variable introduced above does not contain enough information. For example,

4


while a 25-year-old player typically has an age-advantage over a 20-year-old one,
a 40-year-old player rather has a disadvantage over a 35-year-old one; and, in
both cases, the age difference between the two players equals 5 years. Following
Weston (2014), who argued that the optimal age of tennis players is between 28
and 32 years, we chose the mid-point as the reference age.

Age.int: This feature is based on the assumption that the optimal age of a tennis player
lies within the interval [28;32]. Then, the smaller distance to the limits of this
interval was derived, i.e. for players younger than 28 the distance to 28 was calcu-
lated and for players older than 32 the distance to 32 was calculated. For players
between 28 and 32 the distance was set to 0.

A detailed description of the variables included in the data set can be found in Section 2
of Buhamra et al. (2024) and Buhamra et al. (2025) .

Note at this point that the data set does not include matches in which one of the two
players gave up or was unable to compete, e.g. due to injury, such that the other player
won without actually playing the match. These matches do not contain any relevant
information for the present analysis and, hence, in order not to distort the results, are
excluded. Moreover, the data set does not contain any missing values.

Based on this data set, the best possible statistical enhance learning model for pre-
dicting tennis matches at Grand Slam tournaments is sought. Also, it will be examined
whether a machine learning approach, namely a random forest, incorporating enhanced
statistical covariates, is more powerful in predicting tennis matches compared to classi-
cal regression approaches that also incorporate the corresponding enhanced covariates.
Then, for all proposed modeling approaches, including machine learning and classical
regression methods, optimal models are determined based on certain performance mea-
sures in terms of an expanding window prediction approach. Here, our focus will be
only on the expanding window strategy, which reveals a clear winner model, but also
other technique such as leave-one-tournament-out cross-validation and a rolling win-
dow approach have been considered. The results for those approaches can be found in
the appendix. Our main objectives are (i) to quantify how the predictive performance
can be improved by incorporating enhanced variables, and (ii) to provide better inter-
pretations for the random forest model using IML tools such as PDP plots and ICE
plots.

3 Methods
In the following, first the statistical and machine learning methods used in this work
are briefly introduced in Section 3.1. We focus on both standard logistic regression,
and generalized additive models based on P-splines. Moreover, the random forest as
a representative from the field of machine learning is shortly presented. Then, in Sec-
tion 3.2 several interpretable machine learning methods are explained, including partial
dependence plots (PDP) and individual conditional expectation (ICE) plots. Finally, in
Section 3.3 various performance measures are defined.

3.1 Statistical and machine learning methods
In this section, we introduce the classic logistic regression model for binary outcomes,
followed by its extension to non-linear effects via spline-based approaches.

5


Logistic regression

Given observations (yi,xi1, . . . ,xip) for i = 1, . . . ,n tennis matches,

πi = P(yi = 1|xi1, . . . ,xip) = E[yi|xi1, . . . ,xip]

is the (conditional) probability for yi = 1, i.e. Player 1 winning the match, given covari-
ate values xi1, . . . ,xip.

We further specify a strictly monotonically increasing response function h : R→ [0,1],

πi = h(ηi) = h(β0 +β1xi1 + · · ·+βpxip) , (3.1)

which relates the linear predictor ηi to πi.
The logistic regression model, which uses the sigmoid function as response func-

tion, is the most famous candidate within the framework of Generalized Linear Models
(GLMs).

The corresponding estimates β̂0, . . . , β̂p are obtained by numerical maximization
of the underlying log-likelihood, e.g. by using Fisher scoring or the Newton-Raphson
algorithms (see, e.g., Nelder and Wedderburn, 1972). This approach is implemented
in the glm function from the stats package in R. For more details on GLMs, see
Fahrmeir and Tutz (2001).

Spline-based approaches

In the classic logistic regression model introduced above, the covariates effects are
strictly linear, see equation (3.1). However, in practice also non-linear influences might
be relevant. In order to model these appropriately and flexibly, the GLM from above
be extended to a so-called Generalized Additive Model (GAM; Wood, 2017a). For this
purpose, so-called splines can be used. In this work, we focus on B-splines (see, e.g.,
Eilers and Marx, 2021).

So instead of linear effects like the β jxi j in equation (3.1), with B-splines a non-
linear effect f (x) of a metric predictor can be represented as

f (x) =
d

∑
j=1

γ jB j(x) ,

where B j(x) are different B-spline basis functions, d denotes the number of basis func-
tions used, and γ j the corresponding spline coefficients. As an unpenalized estimation
of a non-linear B-spline effect often overfits, typically the non-linear effect is smoothed
by using penalized B-splines, i.e. P-splines.

Beside the problem of potential overfitting, the goodness-of-fit of the B-spline ap-
proach depends on the number of selected nodes. To avoid this problem, various penal-
ization methods exist in the form of P-splines. Here, a penalized estimation criterion,
which is extended by a penalty term, is used instead of the usual estimation criterion.
For P-splines based on B-splines see, e.g., Eilers and Marx (1996). For this approach,
we rely on the gam function from the mgcv package (Wood, 2017b) in R. Note that such
P-spline based approaches have also been used in Buhamra et al. (2024) and Buhamra
et al. (2025).

Random forest

In the following, a random forest will be used for comparison with the linear and non-
linear regression approaches introduced above. This method is based on a (typically

6


large) ensemble of trees, which were introduced by Breiman et al. (1984). However,
as individual trees suffer from instability (Breiman, 1996b) an ensemble method called
bootstrapping and aggregating (bagging; Breiman, 1996a) was introduced, which in
general improves the predictive performance compared to a single regression tree and
is rather easy to implement.

A random forest aggregates multiple decision trees to enhance prediction accuracy
(Breiman, 2001). The key idea is that combining uncorrelated models (individual trees)
reduces variance and improves predictions compared to using a single model. In this
manuscript, classifications trees are considered in the random forest (where the most
frequent class among the trees determines the outcome), as our target variable is binary.
For this purpose, the ranger package in R by Wright and Ziegler (2017) is used for
fitting the random forest models. Also, the two hyperparameters mtry andntree are
quite important. For optimizing mtry, 10-fold cross-validation is used, and ntree is
set to 400. Further details about these parameters can be found in Buhamra et al. (2025)

3.2 Interpretable machine learning
In the following, we discuss interpretable machine learning (IML) methods such as
partial dependence plots (PDP; see Friedman, 2001) and individual conditional expec-
tation (ICE) plots (Goldstein et al., 2015) in more detail. These methods aim to enhance
the interpretability of complex, black box-type machine learning models. Particularly,
they can be applied to such black box models to provide explanations for individual
predictions or overall model behavior. For this purpose, the implementations from the
R packages pdp by Greenwell et al. (2017) and ggplot2 by Wickham et al. (2016) are
used (the latter one being generally applied for plotting). A nice overview of standard
approaches for IML can be found in Molnar (2020).

Partial Dependence Plot (PDP)

Following Molnar (2020), the partial dependence plot (or PDP) illustrates the marginal
effect of one or two features on the predicted outcome of a machine learning model
(Friedman, 2001). It can reveal whether the relationship between a feature and the
target is linear, monotonic, or more complex. The partial dependence function for
regression is defined as follows

f̂S(xS) = ExC [ f̂ (xS,xC)] =
∫

f̂ (xS,xC)dP(xC) ,

where P(·) is the distribution of the features in set C, xS are the features for which the
partial dependence function is plotted, while xC represents the other features used in the
machine learning model f̂ (·), which are considered as random variables in this context.
Typically, the set S contains only one or two features, namely those whose effect on
the prediction one aims to understand. The feature vectors xS and xC together form the
complete feature space x. Partial dependence operates by marginalizing the model’s
output over the distribution of the features P(·) in set C, allowing the function to reveal
the relationship between the features in S and the predicted outcome. By doing so,
we obtain a function depending solely on the features in S, while still accounting for
interactions with other features. The partial function f̂S is given by

f̂S(xS) =
1
n

n

∑
i=1

f̂ (xS,xi,C) ,

i.e. it is estimated by calculating averages on the training data. The partial function
shows the average marginal effect on the prediction for given values of the features in S.

7


Here, xi,C denotes the actual values from the dataset for the features not of interest, and
n represents the number of instances in the dataset. A key assumption of the PDP is that
the features in C are uncorrelated with those in S. If this assumption does not hold, the
calculated averages may include data points that are highly unlikely or even impossible.
Further details can be found in Molnar (2020).

Individual Conditional Expectation (ICE)

The counterpart to a PDP (Friedman, 2001), which illustrates the average effect of a
feature, is referred to as individual conditional expectation (ICE) plot and is applied for
individual data instances. The approach was first introduced by Goldstein et al. (2015).
ICE plots are used in machine learning to analyze the relationship between a feature and
the predicted outcome for individual instances within a dataset. Unlike PDPs, which
focus on the average effect of a feature, ICE plots offer insight into how changes in a
specific feature or feature set impact the model’s predictions for individual instances.
Each line in an ICE plot represents the predicted outcome for a single instance as the
feature value varies, revealing the variability and heterogeneity in feature effects across
different instances. This helps identifying interactions between features, understand-
ing complex model behaviors, and detecting outliers, thereby improving model inter-
pretability and transparency.

3.3 Performance measures
In the following, performance measures are defined which we use to select the best
model based on the predictive performance with regard to those measures (see also
Buhamra et al., 2024, and Buhamra et al., 2025). Let ỹ1, . . . , ỹn denote the true binary
outcomes of a set of n matches, i.e., ỹi ∈ {0,1}, i = 1, . . . ,n. Moreover, let π̂i1 =: π̂i

denote the probability, predicted by a certain model, that player 1 wins match i. Then,
the probability that player 2 wins the match is directly given by π̂i2 = 1− π̂i1.

Classification rate

The (mean) classification rate is given by the proportion of matches correctly predicted
by a certain model, i.e.

1
n

n

∑
i=1

1(ỹi = ŷi), where ŷi =

{
1, π̂i > 0.5
0, π̂i ≤ 0.5

,

see, e.g., Schauberger and Groll (2018). Hence, large values indicate a good predictive
performance.

Predictive Bernoulli likelihood

Following again Schauberger and Groll (2018), the (mean) predictive Bernoulli likeli-
hood is based on the predicted probability π̂i for the true outcome ỹi. For n observations
it is defined as

1
n

n

∑
i=1

π̂ ỹi
i (1− π̂i)

1−ỹi ,

Once again, large values indicate good predictive performance.

8


Brier score

The Brier score (Brier, 1950) is based on the squared distances between the predicted
probability π̂i and the actual (binary) output ỹi from match i, and is defined as

1
n

n

∑
i=1

(π̂i − ỹi)
2 .

It is an error measure and, hence, low values indicate a good predictive performance.

4 Results
In the next section, the predictive power of the enhanced variables is presented for both
regression and machine learning approaches. For this purpose, we use an expanding
window (EW), a rolling window (RW) as well as a leave-one-tournament-out cross-
validation (CV) approach. The best models are identified with respect to the previously
defined performance measures. Additionally, we provide better interpretability of the
machine learning approach, i.e. the random forest model, by using IML tools such as
partial dependence plot (PDP) and individual conditional expectation (ICE). All calcu-
lations and evaluations were performed using the statistical programming software R
(R Core Team, 2024).

4.1 Enhanced variables predictive power
To investigate the predictive potential of so-called ‘statistically enhanced covariates’ in
more detail, certain promising variables are considered, such as Elo, Age.30 and Age.int,
along with two conventional covariates (Rank or Points). All possible combinations of
these covariates result in a total of 21 models for each of our proposed approaches,
including linear effects, non-linear effects (splines), and the random forest. The results
are given in Tables 1 and 2, with the best performers highlighted in bold. For the
expanding window approach, results are presented here in detail, as this method clearly
identifies the top-performing models among the 21 proposed. The results for the leave-
one-tournament-out cross-validation and rolling window approaches are included in the
appendix.

Expanding window validation

We validated all proposed 21 models with respect to their predictive performance on
new, unseen test data. The validation is performed using an expanding window fore-
casting approach, i.e., each time one of the remaining tournaments is used as the test
data set in chronological order, and the training data set is constantly updated and en-
larged. This scheme has already been previously used in Buhamra et al. (2025) and can
be explained as follows:

1. First, all tournaments prior to 2022 are used as the training data set. Then, all
models are fitted. Based on those, predictions are derived for the 2022 Australian
Open matches, as this was the 1st Grand Slam tournament in 2022.

2. Then, the training data is updated, by adding the matches of the Australian Open
2022. Then, the models are fitted again o the extended training data, and predic-
tions are made for the French Open 2022, which is the 2nd Grand Slam tourna-
ment in 2022.

9


3. Next, the matches of the French Open 2022 are added to the training data set and
the models are fitted again. Based on those fits, Wimbledon 2022 is predicted.

4. Then, the Wimbledon 2022 matches will be added to the training data set, and
again, the models are fitted and predictions are made for the final Grand Slam
tournament in 2022, the US Open 2022.

Finally, the prediction results for all four tournaments are compared with the actual
match outcomes, and the corresponding performance measures are calculated . Table 1
presents the predictive performance for the regression models with linear and non-linear
effects, resulting from all 21 possible combinations of the statistically enhanced vari-
ables. Moreover, in Table 2, the same covariate combinations are presented, but with
respect to the random forest method.

Linear model: The classification rates range from 0.737 to 0.795. The best per-
formance was achieved by the linear regression model including Points, Rank, and Elo,
with a value of 0.795. For the predictive likelihood, the best value is 0.701, achieved
by the model including the covariates Points, Elo and Age.int. The lowest Brier score,
hence the best, is 0.153 and is delivered by eight different models.

Splines model: Unlike the models with linear effects, for the non-linear models are
clear winning model can be identified based on the enhanced variables Elo and Age.30.
Specifically, the splines model including covariates Elo and Age.30 demonstrates the
best performance with respect to all proposed performance measures. This model
achieved a classification rate of 0.792, and for the predictive likelihood and Brier score,
the corresponding values are 0.703 and 0.149, respectively. Additionally, it should be
noted that with respect to the classification rate, the model incorporating Points, Elo
and Age.30 also yielded identical results to the winning model, namely the value 0.792.

Random forest: Similar to the case of the spline-based approaches, the winning
model among the random forests can be clearly identified. The results show that the
random forest based on Points, Rank, Age.30, and Elo covariates outperforms all other
models. It yielded a classification rate of 0.820, a predictive likelihood of 0.667, and a
Brier score of 0.151.

Overall, if we considered both the proposed model types (i.e., linear regression,
spline-based regression and random forests) and different forecasting strategies (such
as EW, CV, and RW), certain models consistently prevail when these enhanced vari-
ables are present. The results of the leave-one-tournament-out CV approach for the
regression approaches (with linear and non-linear effects), as well as for the random
forest are presented in Tables 4 and 5, respectively, in the appendix. Also, the corre-
sponding results of the rolling window approach are available in appendix (see Tables 6
and 7). In general, across all the tables presented in the appendix, the results sug-
gest a nearly identical trend: prediction accuracy improves when at least one of these
enhanced variables (i.e., Elo, Age.30, Age.int) is included.

10


Table 1: Results of the expanding window approach incorporating the statistically enhanced covari-
ates for the regression model class with linear and non-linear effects; best results are highlighted in
bold font.

Specific Class. Likeli- Brier
models rate hood score

Linear

Points 0.754 0.639 0.178
Elo 0.782 0.695 0.153
Rank 0.754 0.642 0.177
Points, Rank 0.754 0.661 0.171
Points, Elo 0.782 0.695 0.153
Rank, Elo 0.775 0.696 0.153
Elo, Age.30 0.778 0.697 0.153
Rank, Age.30 0.747 0.649 0.174
Points, Age.30 0.758 0.643 0.176
Elo, Age.int 0.775 0.696 0.154
Rank, Age.int 0.737 0.647 0.175
Points, Age.int 0.737 0.642 0.177
Points, Rank, Elo 0.795 0.696 0.153
Points, Rank, Age.30 0.758 0.666 0.168
Points, Rank, Age.int 0.754 0.665 0.169
Points, Elo, Age.30 0.785 0.696 0.153
Elo, Rank, Age.30 0.782 0.698 0.153
Points, Elo, Age.int 0.764 0.701 0.1570
Elo, Rank, Age.int 0.761 0.696 0.162
Points, Rank, Age.30, Elo 0.785 0.698 0.153
Points, Rank, Age.int, Elo 0.788 0.697 0.154

Splines

Points 0.741 0.651 0.175
Elo 0.775 0.697 0.153
Rank 0.747 0.643 0.179
Points, Rank 0.744 0.655 0.174
Points, Elo 0.775 0.698 0.153
Rank, Elo 0.778 0.696 0.156
Elo, Age.30 0.792 0.703 0.149
Rank, Age.30 0.751 0.654 0.175
Points, Age.30 0.761 0.659 0.171
Elo, Age.int 0.768 0.699 0.157
Rank, Age.int 0.741 0.649 0.180
Points, Age.int 0.734 0.655 0.178
Points, Rank, Elo 0.778 0.697 0.158
Points, Rank, Age.30 0.768 0.665 0.169
Points, Rank, Age.int 0.741 0.661 0.175
Points, Elo, Age.30 0.792 0.701 0.150
Elo, Rank, Age.30 0.785 0.698 0.156
Points, Elo, Age.int 0.764 0.701 0.157
Elo, Rank, Age.int 0.761 0.696 0.162
Points, Rank, Age.30, Elo 0.778 0.697 0.156
Points, Rank, Age.int, Elo 0.761 0.697 0.163

11


Table 2: Results of the expanding window approach for random forest approach; best results are
highlighted in bold font.

Specific Class. Likeli- Brier
models rate hood score

Random forest

Points 0.773 0.596 0.189
Elo 0.800 0.615 0.174
Rank 0.784 0.596 0.187
Points, Rank 0.779 0.616 0.179
Points, Elo 0.799 0.632 0.168
Rank, Elo 0.804 0.634 0.164
Elo, Age.30 0.793 0.616 0.173
Rank, Age.30 0.797 0.596 0.186
Points, Age.30 0.778 0.598 0.187
Elo, Age.int 0.804 0.618 0.172
Rank, Age.int 0.793 0.595 0.187
Points, Age.int 0.778 0.595 0.188
Points, Rank, Elo 0.807 0.638 0.164
Points, Rank, Age.30 0.791 0.616 0.178
Points, Rank, Age.int 0.788 0.616 0.178
Points, Elo, Age.30 0.779 0.631 0.172
Elo, Rank, Age.30 0.796 0.632 0.166
Points, Elo, Age.int 0.808 0.632 0.162
Elo, Rank, Age.int 0.794 0.634 0.165
Points, Rank, Age.30, Elo 0.820 0.667 0.151
Points, Rank, Age.int, Elo 0.800 0.637 0.166

4.2 Model interpretation
In this section, our interpretation is developed based on the refitting of the top-performing
linear, spline, and random forest models introduced in Section 4.1. Consequently, we
provide deeper insights for these models. Additionally, interpretable machine learn-
ing tools such as partial dependence plot (PDP) and individual conditional expectation
(ICE) are used to enhance our understanding of the random forest model.

4.2.1 Linear model

The linear model can be directly interpreted based on the p estimated regression coef-
ficients β̂ j from Equation 3.1. Their values indicate how much the outcome variable
is expected to change when the corresponding predictor variable changes by one unit,
assuming all other covariates in the model are held constant. Hence, they allow direct
interpretation of both strength and direction of the relationship between the predictors
and the outcome variable. Table 3 shows the coefficient estimates for the best candidate
linear model based on the results from the previous Section 4.1. It incorporates Points,
Rank as conventional covariates, and Elo as enhanced covariate.

Table 3: Estimated coefficients for the candidate linear model based on the results from Table 1

Variables β̂ j

Rank -0.0012
Elo 0.0042
Points 0.0001

12


For the Rank difference variable, the negative sign indicates that as this variable in-
creases (resulting in a larger numerical rank, which corresponds to a lower-ranked first-
named player), the linear predictor decreases by approximately 0.0012 units. Hence,
due to the the used (logistic) response function, also the winning probability for the
respective player decreases.

Analogously, for every one-unit increase in the Elo rating difference, the linear
predictor exhibits an increase of approximately 0.0042 units. Hence, also the winning
probability for the respective first-named player increase. In practical terms, this means
that larger Elo ratings are associated with larger probabilities of winning a tennis match,
as the positive sign of the coefficient estimate indicates a positive relationship.

The same positive relationship also holds for the predictor variable Points. For
every one-unit increase in the difference in points between players (again, from the
perspective of the first-named player), the model predicts an increase of approximately
0.0001 units in the linear predictor and, hence, also an increase of the respective player’s
winning probability.

4.2.2 Splines

Graphical illustrations of spline effects are a useful way to understand and visualize
potential non-linear effects of predictors on the response variable. The spline-based
approaches from Section 3.1 are able to capture non-linear relationships and allow for
a flexible, semi-parametric form of regression using smooth functions (splines) for pre-
dictors. Corresponding ploting functions provide interpretable visualizations. In this
context, the gam function from the mgcv package (Wood, 2017a) in R is used.

Figure 1 displays the effects for the covariates Elo and Age.30, which actually ap-
pear to be linear. The graphs show positive effects for both features, i.e. larger Elo rat-
ings associated with larger winning probabilities. The Age.30 difference has a slightly
positive effect, which however is not significant taking the point-wise confidence band
into account.

4.2.3 Random forest with interpretable machine learning

Next, the results for the random forest model with the best predictive performance
from Section 4.1 are discussed. This model, which included Points, Rank, Age.30, and
Elo as covariates, is refitted to our entire data set. Then, both partial dependence and
Individual Conditional Expectation (ICE) plots are presented for better interpretability
and visualization of the respective complex model.

Since ICE plots—the individual equivalent of Partial Dependence Plots (PDPs)—are
considered crucial in our case, the combined ICE and PDP plots were used to provide
a comprehensive visualization of the relationship between predictor variables and the
predicted outcome.

In Figure 2 we generated a combined ICE and PDP. Each ICE line represents how
the predicted winning probability changes with the predictor value for a single obser-
vation. The spread of the lines indicates variability in the effect of the predictor across
different observations. The thick black line represents the PDP, showing the average
effect of the respective predictor on the predicted outcome. This PDP line provides
context to the ICE lines by highlighting the overall trend across all observations.

For instance, in the Rank panel, we can observe how the predicted outcome is first
very strongly decreasing and then slowly increasing. Moreover, in the Rank panel,
the spread of grey lines illustrates variability in how an increasing Rank difference
affects different observations. After a steep decrease, the general trend of the black line
indicates a almost constant, very slightly increasing relationship between Rank and the

13


−500 0 500

−4

−2

0

2

4

Elo

f(
E

lo
)

−10 −5 0 5 10

−4

−2

0

2

4

Age.30

f(
A

ge
.3

0)

Figure 1: Spline effects for the enhanced covariates Elo and Age.30

predicted probability of winning. This means that as the Rank difference increases (i.e.,
the first-named player is ranked higher, meaning worse performance, when the rank
of the second player is kept constant), the winning probability first strongly decreases,
then increases a bit and finally remains almost constant, before it exhibits a small hill
at the end.

In the Elo panel, the PDP line shows a slightly negative trend, suggesting that larger
Elo differences first decrease the predicted probability of winning, before increasing
again. For the Points, upward slope in the PDP suggests that accumulating more points
correlates positively with increased chances of winning matches. Variability among
ICE lines indicate that while most players benefit from accumulating points, some may
experience diminishing returns at higher point levels or face challenges against specific
opponents. The Age.30 panel shows significant variability in individual responses, as
indicated by the spread of the ICE lines. The ICE lines highlight differences among
individual responses: some (first-named) players with an age closer to the optimal one
of 30 years outperform the (second-named) player substantially. In addition, we can see
a slightly decreasing trend of the Age.30 difference, which then seems to then remain
rather constant for small or positive differences.

Overall, this combined plot effectively conveys both the average influence of each
predictor (PDP) and its varying impact on individual observations (ICE). It allows for
precise interpretation of how predictors influence the outcome across different levels
and individuals.

Finally, in Figure 3, a heatmap plot for the partial dependency of Elo and Age.30 is

14


Points Rank

Age 30 Elo

−1000 −500 0 500 1000 −500 0 500 1000

−10 −5 0 5 10 −1000 −500 0 500 1000
0.0

0.2

0.4

0.6

0.8

0.0

0.2

0.4

0.6

0.8P
re

di
ct

ed
 P

ro
ba

bi
lit

y

Figure 2: Combined PDP (thick black line) and ICE plots (grey lines) for the random forest model
including covariates Age.30, Elo, Points and Rank

age.30

el
o

−500

0

500

−10 −5 0 5 10

−1.0

−0.5

0.0

0.5

Figure 3: Heat map for the random forest model including Elo and Age.30 as covariates

presented. The plot examines how the linear predictor (and, hence, the wining proba-
bility) changes jointly with Age.30 and Elo. If the color changes are not uniform across
the heat map, this suggests a complex interaction between Elo and Age.30. For exam-

15


ple, larger Elo have a stronger impact on the predictor for certain age ranges than for
others (e.g., players with an age closer to 30 benefit more from a larger Elo ratings than
younger or older players).

For example, a region that transitions from blue to yellow as Elo increases indicates
that higher Elo scores are associated with higher winning probabilities, especially when
Age.30 is within a certain range.

5 Summary and overview
In this work, we compared different model approaches for modeling tennis matches
in Grand Slam tournaments focusing on two main aspects: statistically enhanced co-
variates and interpretable machine learning tools. First, we demonstrated how these
enhanced covariates can be applied in the context of sports analytics to improve pre-
dictive modeling performance and gain insights from sports data. Then, to better un-
derstand the interpretation of complex ML models, exemplarily for a random forest,
we presented partial dependence plots (PDPs) to visualize the average partial relation-
ship between the predicted response and one or more features, along with individual
conditional expectation (ICE) plots, a tool for visualizing the model estimated by any
supervised learning algorithm. For this purpose, we used a data set provided and an-
alyzed in Buhamra et al. (2024, 2025), which was compiled based on the R package
deuce (Kovalchik, 2019). It contains information on 5,013 matches in 47 men’s Grand
Slam tournaments from the years 2011-2022. It also includes covariate information on
the age difference of both players (Age), the difference in their ranking positions (Rank)
and ranking points (Points), in Elo numbers (Elo), as well as the two additional age-
based variables, Age.30 and Age.int, which were constructed such that they take into
account that the optimal age of a tennis player is between 28 and 32 years.

Different regression models, which were already considered in Buhamra et al. (2024,
2025), were compared with machine learning approaches, in particular a random forest
model, for modeling and predicting tennis matches. Since there are only two possible
outcomes in tennis (win or loss), all models were based on a binary outcome and thus
focused on modeling the probability of the first-named player winning.

The different modeling approaches were compared with respect to their prediction
performance on unseen matches via an expanding window strategy. The following
regression and ML approaches were included in this comparison:

− Logistic regression with linear effects: all possible combinations of the three en-
hanced covariates along with the conventional variables Point and Rank were
considered. This resulted in 21 models.

− Logistic regression with non-linear effects (splines): Again, the same 21 combi-
nations of enhanced and conventional covariates were considered.

− A random forest model as a machine learning approach: Also here, the same
combinations of enhanced and conventional covariates were considered, resulting
in 21 models.

Via the expanding window approach, the models were compared in terms of classifica-
tion rate, predictive Bernoulli likelihood and Brier score. Since each approach resulted
in 21 different models, the model with the best predictive performance measures was
selected in each case. Overall, the values vary between the different approaches and
over proposed performance measures. The random forest model, including the covari-
ates Points, Rank, Age.30, and Elo achieved the best classification rate among all other
models with a value of 0.8202. In contrast, the spline-based regression model based on

16


Elo and Age.30 only yielded the best predictive performance in terms of predictive like-
lihood and Brier score compared to all other model approaches. Generally, one could
say that models consistently perform better when at least one of the enhanced variables
is included.

Additionally, we investigated a CV-type approach and a rolling window approach.
The rolling window approached was based on a (varying) training dataset always con-
taining 12 tournaments that were used to predict the outcome of the next tournament.
The results for these two approaches are provided in the appendix. Principally, results
are varying among the three different training-test-subdivision approaches, but gener-
ally led to similar results as the expanding window strategy. For example, the winning
random forest model of the rolling window approach is similar to the one of the expand-
ing window strategy, as both include Points, Age.30, Elo, and Rank as the best model
predictors.

To gain a comprehensive understanding and interpretation of each approach, we
analyze the coefficients of the linear regression model. Additionally, we employ spline
graphs to visually represent and interpret the relationship between predictor variables
and the response variable within the context of Generalized Additive Models (GAMs).
These graphs can capture non-linear relationships, which is particularly advantageous
for complex datasets where linear models may be insufficient. Furthermore, we intro-
duce interpretable machine learning (IML) tools such as partial dependence plots (PDP)
and individual conditional expectation (ICE) plots. These tools help in comprehending
and interpreting the predictions made by complex (black box-type) machine learning
models, in the present case a random forest model.

By examining PDP plots, we obtained insights into how each predictor variable
affects the model’s predictions. For instance, the PDP for Points exhibits a general
upward trend, indicating that earning more Points has a positive effect on the predicted
outcome. Similarly, the PDP for Age.30 suggests that being closer to the optimal age
of 30 years is only slightly associated with larger winning probabilities.

In addition, a heat map visualization illustrating the joint effect of the two covariates
Elo and Age.30 on the predicted outcome is presented. By examining the color gradient
and regions in the heat map, we can infer how changes in these two covariates influence
the outcome, highlighting any non-linear interactions and dependencies between those.

In future research, additional IML methods, such as Accumulated Local Effects
(ALE) plots (Apley and Zhu, 2020) and Local Interpretable Model-agnostic Explana-
tions (LIME; Ribeiro et al., 2016) can also be used. Furthermore, one could investigate
more complex machine learning models, such as deep learning approaches (Bishop,
1995; LeCun et al., 2015). Additionally, similar to soccer (Groll et al., 2019b), one
could also focus on tournament outcomes. For example, the probability of a certain
player winning the tournament could be determined. This approach takes advantage
of the fact that the tournament bracket is fully drawn before the start, allowing us to
predict the earliest round in which two players could meet. Unlike in soccer, where
group stage outcomes influence subsequent matches, this setup simplifies predictions.
However, using only the match-specific betting odds for the first round presents a chal-
lenge. Models that exclude odds as covariates might be preferable, despite potentially
lower prediction performance due to the significant influence of odds. Alternatively,
models could be developed that use pre-tournament odds for each player to win the
entire tournament, rather than odds for individual matches. Finally, additional statisti-
cally enhanced covariates could be produced. For example, similar to the historic match
abilities for soccer teams developed by Ley et al. (2019), such abilities could also be
developed for tennis players. We are currently working on such an approach and plan
to include those ability parameters into our models in the future.

17


References
Angelini, G., Candila, V., and De Angelis, L. (2022). Weighted Elo rating for tennis

match predictions. European Journal of Operational Research, 297(1):120–132.

Apley, D. W. and Zhu, J. (2020). Visualizing the effects of predictor variables in black
box supervised learning models. Journal of the Royal Statistical Society Series B:
Statistical Methodology, 82(4):1059–1086.

Auret, L. and Aldrich, C. (2012). Interpretation of nonlinear relationships between
process variables by use of random forests. Minerals Engineering, 35:27–42.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford university
press.

Breiman, L. (1996a). Bagging predictors. Machine learning, 24:123–140.

Breiman, L. (1996b). Heuristics of instability and stabilization in model selection. The
annals of statistics, 24(6):2350–2383.

Breiman, L. (2001). Random forests. Machine learning, 45:5–32.

Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and regres-
sion trees. wadsworth int. Group, 37(15):237–251.

Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly
Weather Review, 78:1–3.

Buhamra, N., Groll, A., and Brunner, S. (2024). Modeling and prediction of tennis
matches at grand slam tournaments. Journal of Sports Analytics, 10(1):17–33.

Buhamra, N., Groll, A., and Gerharz, A. (2025). Comparing modern machine learn-
ing approaches for modeling tennis matches at grand slam tournaments. Journal of
Sports Analytics. under review.

Eilers, P. H. and Marx, B. D. (2021). Practical smoothing: The joys of P-splines.
Cambridge University Press.

Eilers, P. H. C. and Marx, B. D. (1996). Flexible smoothing with B-splines and penal-
ties. Statistical Science, 11:89–121.

Fahrmeir, L. and Tutz, G. (2001). Multivariate Statistical Modelling Based on Gener-
alized Linear Models. Springer-Verlag, New York, 2nd edition.

Felice, F., Ley, C., Groll, A., and Bordas, S. (2023). Statistically enhanced learning:
a feature engineering framework to boost (any) learning algorithms. arXiv preprint
arXiv:2306.17006.

Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine.
Annals of statistics, pages 1189–1232.

Gao, Z. and Kowalczyk, A. (2021). Random forest model identifies serve strength as a
key predictor of tennis match outcome. Journal of Sports Analytics, 7(4):255–262.

Goldstein, A., Kapelner, A., Bleich, J., and Pitkin, E. (2015). Peeking inside the black
box: Visualizing statistical learning with plots of individual conditional expectation.
journal of Computational and Graphical Statistics, 24(1):44–65.

18


Greenwell, B. M. et al. (2017). pdp: An r package for constructing partial dependence
plots. R J., 9(1):421.

Groll, A., Ley, C., Schauberger, G., and Van Eetvelde, H. (2019a). A hybrid random
forest to predict soccer matches in international tournaments. Journal of quantitative
analysis in sports, 15(4):271–287.

Groll, A., Ley, C., Schauberger, G., and Van Eetvelde, H. (2019b). A hybrid random
forest to predict soccer matches in international tournaments. Journal of Quantitative
Analysis in Sports, 15:271–287.

Kovalchik, S. (2019). deuce: resources for analysis of professional tennis data. R
package version 1.4.

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. nature, 521(7553):436–
444.

Ley, C., Wiele, T. V. d., and Eetvelde, H. V. (2019). Ranking soccer teams on the
basis of their current strength: A comparison of maximum likelihood approaches.
Statistical Modelling, 19(1):55–73.

Molnar, C. (2020). Interpretable machine learning. Lulu. com.

Molnar, C., Freiesleben, T., König, G., Herbinger, J., Reisinger, T., Casalicchio, G.,
Wright, M. N., and Bischl, B. (2023). Relating the partial dependence plot and
permutation feature importance to the data generating process. In World Conference
on Explainable Artificial Intelligence, pages 456–479. Springer.

Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized linear models. Journal of
the Royal Statistical Society, A 135:370–384.

R Core Team (2024). R: A Language and Environment for Statistical Computing. R
Foundation for Statistical Computing, Vienna, Austria.

Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). ” why should i trust you?” explaining
the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD interna-
tional conference on knowledge discovery and data mining, pages 1135–1144.

Schauberger, G. and Groll, A. (2018). Predicting matches in international football
tournaments with random forests. Statistical Modelling, 18(5-6):460–482.

Sipko, M. and Knottenbelt, W. (2015). Machine learning for the prediction of profes-
sional tennis matches. MEng computing-final year project, Imperial College London,
2.

Somboonphokkaphan, A., Phimoltares, S., and Lursinsap, C. (2009). Tennis winner
prediction based on time-series history with neural modeling. In Proceedings of
the International MultiConference of Engineers and Computer Scientists, volume 1,
pages 18–20. Citeseer.

Vaughan Williams, L., Liu, C., Dixon, L., and Gerrard, H. (2021). How well do Elo-
based ratings predict professional tennis matches? Journal of Quantitative Analysis
in Sports, 17(2):91–105.

19


Weston, D. (2014). Using age statistics to gain a tennis betting
edge. http://www.pinnacle.com/en/betting-articles/Tennis/

atp-players-tipping-point/LMPJF7BY7BKR2EY.

Whiteside, D., Cant, O., Connolly, M., and Reid, M. (2017). Monitoring hitting load
in tennis using inertial sensors and machine learning. International journal of sports
physiology and performance, 12(9):1212–1217.

Wickham, H., Chang, W., and Wickham, M. H. (2016). Package ‘ggplot2’. Create
elegant data visualisations using the grammar of graphics. Version, 2(1):1–189.

Wilkens, S. (2021). Sports prediction and betting models in the machine learning age:
The case of tennis. Journal of Sports Analytics, 7(2):99–117.

Wood, S. N. (2017a). Generalized additive models: an introduction with R. chapman
and hall/CRC.

Wood, S. N. (2017b). Generalized Additive Models: An Introduction with R. Chapman
& Hall/CRC, London, 2nd edition.

Wright, M. N. and Ziegler, A. (2017). ranger: A fast implementation of random forests
for high dimensional data in C++ and R. Journal of Statistical Software, 77(1):1–17.

20


A Appendix

Table 4: Results of the leave-one-tournament-out CV approach for regression model class; best
results in bold font.

Specific Class. Likeli- Brier
models rate hood score

Linear

Points 0.732 0.623 0.185
Elo 0.747 0.656 0.172
Rank 0.732 0.592 0.199
Points, Rank 0.732 0.634 0.181
Points, Elo 0.748 0.658 0.171
Rank, Elo 0.749 0.657 0.172
Elo, Age.30 0.747 0.656 0.172
Rank, Age.30 0.731 0.592 0.199
Points, Age.30 0.729 0.623 0.185
Elo, Age.int 0.747 0.656 0.172
Rank, Age.int 0.731 0.592 0.199
Points, Age.int 0.731 0.623 0.185
Points, Rank, Elo 0.747 0.659 0.170
Points, Rank, Age.30 0.732 0.634 0.181
Points, Rank, Age.int 0.733 0.634 0.181
Points, Elo, Age.30 0.747 0.658 0.171
Elo, Rank, Age.30 0.747 0.657 0.172
Points, Elo, Age.int 0.745 0.657 0.171
Elo, Rank, Age.int 0.746 0.656 0.172
Points, Rank, Age.30, Elo 0.747 0.659 0.170
Points, Rank, Age.int, Elo 0.748 0.659 0.170

Splines

Points 0.729 0.637 0.181
Elo 0.746 0.656 0.172
Rank 0.729 0.611 0.193
Points, Rank 0.732 0.641 0.179
Points, Elo 0.748 0.657 0.171
Rank, Elo 0.748 0.656 0.172
Elo, Age.30 0.746 0.656 0.172
Rank, Age.30 0.729 0.611 0.193
Points, Age.30 0.729 0.637 0.181
Elo, Age.int 0.745 0.656 0.172
Rank, Age.int 0.728 0.611 0.194
Points, Age.int 0.729 0.637 0.181
Points, Rank, Elo 0.747 0.657 0.171
Points, Rank, Age.30 0.731 0.641 0.179
Points, Rank, Age.int 0.730 0.641 0.179
Points, Elo, Age.30 0.746 0.657 0.171
Elo, Rank, Age.30 0.746 0.656 0.172
Points, Elo, Age.int 0.745 0.657 0.171
Elo, Rank, Age.int 0.746 0.656 0.172
Points, Rank, Age.30, Elo 0.745 0.657 0.171
Points, Rank, Age.int, Elo 0.744 0.657 0.171

21


Table 5: Results of the leave-one-tournament-out CV approach for random forest models; best re-
sults in bold font.

Specific Class. Likeli- Brier
models rate hood score

Random forest

Points 0.757 0.611 0.186
Elo 0.769 0.623 0.179
Rank 0.757 0.604 0.190
Points, Rank 0.759 0.629 0.181
Points, Elo 0.772 0.641 0.174
Rank, Elo 0.773 0.638 0.175
Elo, Age.30 0.771 0.624 0.179
Rank, Age.30 0.751 0.615 0.190
Points, Age.30 0.755 0.608 0.187
Elo, Age.int 0.769 0.620 0.179
Rank, Age.int 0.753 0.606 0.191
Points, Age.int 0.754 0.608 0.187
Points, Rank, Elo 0.772 0.645 0.174
Points, Rank, Age.30 0.758 0.626 0.182
Points, Rank, Age.int 0.760 0.626 0.182
Points, Elo, Age.30 0.769 0.638 0.175
Elo, Rank, Age.30 0.772 0.635 0.176
Points, Elo, Age.int 0.770 0.638 0.176
Elo, Rank, Age.int 0.770 0.636 0.176
Points, Rank, Age.30, Elo 0.770 0.643 0.174
Points, Rank, Age.int, Elo 0.770 0.643 0.175

22


Table 6: Results of the rolling window approach for regression model class; best results in bold font.

Specific Class. Likeli- Brier
models rate hood score

Linear

Points 0.645 0.498 0.309
Elo 0.630 0.496 0.330
Rank 0.645 0.501 0.293
Points, Rank 0.645 0.499 0.317
Points, Elo 0.633 0.496 0.332
Rank, Elo 0.633 0.497 0.331
Elo, Age.30 0.627 0.495 0.331
Rank, Age.30 0.641 0.501 0.294
Points, Age.30 0.642 0.498 0.309
Elo, Age.int 0.631 0.495 0.331
Rank, Age.int 0.644 0.501 0.294
Points, Age.int 0.646 0.498 0.309
Points, Rank, Elo 0.636 0.497 0.332
Points, Rank, Age.30 0.644 0.499 0.317
Points, Rank, Age.int 0.647 0.499 0.317
Points, Elo, Age.30 0.628 0.495 0.332
Elo, Rank, Age.30 0.632 0.496 0.331
Points, Elo, Age.int 0.631 0.495 0.332
Elo, Rank, Age.int 0.635 0.496 0.331
Points, Rank, Age.30, Elo 0.633 0.496 0.332
Points, Rank, Age.int, Elo 0.634 0.496 0.333

Splines

Points 0.648 0.500 0.319
Elo 0.634 0.496 0.331
Rank 0.643 0.501 0.306
Points, Rank 0.646 0.499 0.323
Points, Elo 0.633 0.496 0.332
Rank, Elo 0.634 0.497 0.330
Elo, Age.30 0.633 0.496 0.331
Rank, Age.30 0.645 0.501 0.306
Points, Age.30 0.644 0.499 0.320
Elo, Age.int 0.632 0.496 0.331
Rank, Age.int 0.644 0.501 0.306
Points, Age.int 0.646 0.499 0.319
Points, Rank, Elo 0.633 0.497 0.332
Points, Rank, Age.30 0.643 0.499 0.323
Points, Rank, Age.int 0.644 0.499 0.323
Points, Elo, Age.30 0.630 0.496 0.332
Elo, Rank, Age.30 0.634 0.496 0.331
Points, Elo, Age.int 0.633 0.496 0.333
Elo, Rank, Age.int 0.632 0.496 0.331
Points, Rank, Age.30, Elo 0.632 0.496 0.332
Points, Rank, Age.int, Elo 0.632 0.497 0.332

23


Table 7: Results of the rolling window approach for random forest models; best results in bold font.

Specific Class. Likeli- Brier
models rate hood score

Random forest

Points 0.753 0.600 0.191
Elo 0.771 0.611 0.183
Rank 0.755 0.601 0.194
Points, Rank 0.756 0.620 0.185
Points, Elo 0.771 0.628 0.177
Rank, Elo 0.773 0.627 0.179
Elo, Age.30 0.771 0.610 0.184
Rank, Age.30 0.748 0.600 0.195
Points, Age.30 0.754 0.596 0.192
Elo, Age.int 0.771 0.609 0.184
Rank, Age.int 0.752 0.600 0.195
Points, Age.int 0.755 0.599 0.192
Points, Rank, Elo 0.773 0.636 0.177
Points, Rank, Age.30 0.757 0.618 0.186
Points, Rank, Age.int 0.755 0.616 0.186
Points, Elo, Age.30 0.768 0.626 0.179
Elo, Rank, Age.30 0.774 0.626 0.179
Points, Elo, Age.int 0.765 0.625 0.179
Elo, Rank, Age.int 0.769 0.625 0.179
Points, Rank, Age.30, Elo 0.789 0.659 0.165
Points, Rank, Age.int, Elo 0.771 0.634 0.177

24


Name: Nourah Buhamra Matrikelnummer: 227227

Erklärung

Hiermit erkläre ich, dass ich die vorliegende Dissertation mit dem Titel

"Classic statistical and modern machine learning methods for modeling and prediction of
major tennis tournament"

selbständig verfasst und keine anderen als die angegebenen Quellen und Hilfsmittel verwen-
det sowie die wörtlich oder inhaltlich übernommenen Stellen als solche kenntlich gemacht
habe und die Satzung der Technischen Universität Dortmund zur Sicherung guter wis-
senschaftlicher Praxis in der jeweils gültigen Fassung beachtet habe. Ich versichere außer-
dem, dass ich die beigefügte Dissertation nur in diesem und keinem anderen Promotionsver-
fahren eingereicht habe und dass diesem Promotionsverfahren keine endgültig gescheiterten
Promotionsverfahren vorausgegangen sind. Ferner erkläre ich, dass keine Aberkennung
eines bereits erworbenen Doktorgrades vorliegt. Ich versichere an Eides statt, dass ich nach
bestem Wissen die reine Wahrheit erklärt und nichts verschwiegen habe.

Dortmund, den . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Nourah Buhamra