British Journal of Mathematical and Statistical Psychology (2022), 75, 1–22. © 2021 The Authors. British Journal of Mathematical and Statistical Psychology published by John Wiley & Sons Ltd on behalf of the British Psychological Society. www.wileyonlinelibrary.com

Fisher transformation based confidence intervals of correlations in fixed- and random-effects meta-analysis

Thilo Welz*, Philipp Doebler and Markus Pauly
Department of Statistics, Mathematical Statistics and Applications in Industry, TU Dortmund University, Germany

Meta-analyses of correlation coefficients are an important technique for integrating results from many cross-sectional and longitudinal research designs. Uncertainty in pooled estimates is typically assessed with the help of confidence intervals, which can double as hypothesis tests for two-sided hypotheses about the underlying correlation. A standard approach to constructing confidence intervals for the main effect is the Hedges–Olkin–Vevea Fisher-z (HOVz) approach, which is based on the Fisher-z transformation. Results from previous studies (Field, 2005, Psychol. Meth., 10, 444; Hafdahl and Williams, 2009, Psychol. Meth., 14, 24), however, indicate that in random-effects models the performance of the HOVz confidence interval can be unsatisfactory. To this end, we propose improvements of the HOVz approach, which are based on enhanced variance estimators for the main effect estimate. In order to study the coverage of the new confidence intervals in both fixed- and random-effects meta-analysis models, we perform an extensive simulation study, comparing them to established approaches. Data were generated via a truncated normal and a beta distribution model. The results show that our newly proposed confidence intervals, based on a Knapp–Hartung-type variance estimator or robust heteroscedasticity consistent sandwich estimators in combination with the integral z-to-r transformation (Hafdahl, 2009, Br. J. Math. Stat.
Psychol., 62, 233), provide more accurate coverage than existing approaches in most scenarios, especially in the more appropriate beta distribution simulation model.

1. Introduction

Quantifying the association of metric variables with the help of the Pearson correlation coefficient is a routine statistical technique for understanding patterns of association. It is a basic ingredient of the data analysis of many cross-sectional and longitudinal designs, and is also indispensable for various psychometric and factor-analytic techniques. When several reports are available for comparable underlying populations, meta-analytic methods allow the available evidence to be pooled (Hedges & Olkin, 1985; Hunter & Schmidt, 2004), resulting in more stable and precise estimates.

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.

*Correspondence should be addressed to Thilo Welz, Department of Statistics, Mathematical Statistics and Applications in Industry, TU Dortmund University, Logistik Campus, Joseph-von-Fraunhofer-Straße 2-4, 44227 Dortmund, Germany (email: thilo.welz@tu-dortmund.de).

DOI:10.1111/bmsp.12242

Systematic reviews based on meta-analyses of correlations are among the most cited in industrial and organizational psychology, clinical psychology and educational psychology (e.g., Aldao, Nolen-Hoeksema, & Schweizer, 2010; Barrick & Mount, 1991; Sirin, 2005, each with several thousand citations), and the methodological monograph on pooling correlations of Hunter and Schmidt (2004) is approaching 10,000 citations on Google Scholar at the time of writing. In addition, pooled correlations are the basis for meta-analytic structural equation modelling (e.g., Cheung, 2015; Jak, 2015), and registered replication efforts pool correlations to reassess findings of others (e.g., Open Science Collaboration, 2015).
1.1. The importance of confidence intervals for pooled correlations

Schulze (2004) provides a comprehensive summary of fixed- and random-effects meta-analysis of correlations. The best-known approaches are based on Fisher's z transformation (Field, 2001, 2005; Hafdahl & Williams, 2009; Hedges & Olkin, 1985) or on direct synthesis of correlations via the Hunter–Schmidt (HS) method (Hunter & Schmidt, 1994; Schulze, 2004). Regardless of the method and the purpose of the meta-analysis, the point estimate of the correlation is accompanied by an estimate of its uncertainty, in the form of a standard error (SE) or a confidence interval (CI). Since the absolute value of a correlation is bounded by 1, a CI might be asymmetric in this context, that is, not centred around the point estimate. Also, CIs are often more useful than SEs, because a null hypothesis of the form H_0: ρ = ρ_0 can be rejected at level α if a 100(1 − α)% CI does not include ρ_0 (duality of hypothesis testing and CIs). A CI's coverage is ideally close to the nominal 1 − α level; for example, a multi-centre registered replication report does not want to rely either on an anti-conservative (too narrow) CI that is overly prone to erroneously rejecting previous research, or on a conservative (too wide) CI lacking statistical power to refute overly optimistic point estimates. Despite methodological developments since the late 1970s, the choice of a CI for a pooled correlation should be a careful one: simulation experiments reported in this paper reinforce the finding that CIs are too liberal when heterogeneity is present.
The main objective of this paper is a systematic investigation of competing methods, especially when moderate or even substantial amounts of heterogeneity are present, promising refined meta-analytic methods for correlations, especially those based on the Fisher z transformation. The remainder of this introduction reviews results for (z-transformation-based) pooling, and briefly introduces relevant methods for variance estimation.

1.2. Pooling (transformed) correlation coefficients

A line of research summarized in Hunter and Schmidt (1994) pools correlation coefficients on the original scale from −1 to 1. One of the merits of the HS methodology is a clear rationale for artefact corrections, that is, correlations are disattenuated for differences at the primary report level in reliability or variable range. While this part of the HS methodology is beyond the scope of the current paper, CIs originating from Osburn and Callender (1992) are studied here as an HS-based reference method (see also Field, 2005). Fisher's z transformation (the inverse hyperbolic tangent, artanh) maps the open interval (−1, 1) to the real number line. Working with z values of correlations avoids problems arising at the bounds and makes normality assumptions of some meta-analytic models more plausible (Hedges & Olkin, 1985). Field (2001) presents a systematic simulation study, and describes scenarios with too liberal behaviour of the HS methodology, but also reports problems with z-transformed pooled values. A simulation strategy is also at the core of Field (2005), who places a special emphasis on heterogeneous settings. He finds similar point estimates for z-transformation-based and HS pooling, with the CIs from the HS method too narrow in the small-sample case. The simulation study of Hafdahl and Williams (2009) includes a comprehensive account of random-effects modelling and related sources of bias in point estimates.
Focusing on point estimation, Hafdahl and Williams (2009) defend z-transformed pooling, but Hafdahl (2009) recommends the integral z-to-r transformation as a further improvement. In the spirit of Hafdahl and Williams (2009), the current paper focuses on variance estimators and resulting CIs, especially in the case of heterogeneity.

1.3. Estimating between-study variance

All CIs studied here are of the form g(θ̂ ± c·σ̂_θ̂), for an appropriate back-transformation g (which is not needed in the HS approach), a point estimator θ̂, a suitable quantile c, and the SE estimator σ̂_θ̂, which depends on the between-study variance estimation. The quality of the CI will depend on an appropriate choice. In other words, especially when primary reports are heterogeneous and the underlying study-specific true correlations vary, good estimators of the between-study variance are needed to obtain neither too wide nor too narrow CIs. The comprehensive study of Veroniki et al. (2016) supports restricted maximum likelihood (REML) estimation as a default estimator of the between-study variance. Since large values of the mean correlation cause REML convergence problems, the robust two-step Sidik and Jonkman (2006) estimator is adopted here. Recently, Welz and Pauly (2020) showed that in the context of meta-regression, the Knapp–Hartung (KH) adjustment (Hartung, 1999; Hartung & Knapp, 2001) aided (co)variance estimation, motivating the inclusion of KH-type CIs in the subsequent comparison. Less well known in the meta-analysis literature are bootstrap methods for variance estimation, which are not necessarily based on a parametric assumption for the random-effects distribution. The Wu (1986) wild bootstrap, intended for heteroscedastic situations, is evaluated here. Bootstrapping is complemented by sandwich estimators (heteroscedasticity consistent, HC; White, 1980), which Viechtbauer, López-López, Sánchez-Meca, and Marín-Martínez (2015) introduced in the field of meta-analysis.
Recently, a wide range of HC estimators was calculated by Welz and Pauly (2020), whose comparison also includes the more recent HC4 and HC5 estimators (Cribari-Neto, Souza, & Vasconcellos, 2007; Cribari-Neto & Zarkos, 2004). In sum, the following comparison includes a comprehensive collection of established and current variance estimators and resulting CIs.

In Section 2 we introduce the relevant models and procedures for meta-analyses of correlations in more technical detail, as well as our proposed refinements. In Section 3 we perform an extensive simulation study and present the results. In Section 4 we present an illustrative data example on the association of conscientiousness (in the sense of the NEO-PI-R; Costa Jr and McCrae, 1985, 2008) and medication adherence (Molloy, O'Carroll, & Ferguson, 2013). Section 5 concludes the paper with a discussion of our findings and gives an outlook on future research.

2. Meta-analyses of Pearson correlation coefficients

For a bivariate metric random vector (X, Y) with existing second moments, the correlation coefficient ρ = Cov(X, Y)/√(Var(X)·Var(Y)) is usually estimated with the (Pearson) correlation coefficient

r = ∑_{i=1}^n (x_i − x̄)(y_i − ȳ) / √(∑_{i=1}^n (x_i − x̄)² · ∑_{i=1}^n (y_i − ȳ)²),   (1)

where (x_i, y_i), i = 1, …, n, are independent observations of (X, Y). The Pearson correlation coefficient is asymptotically consistent, that is, for large sample sizes its value converges to the true ρ. It is also invariant under linear transformations of the data. However, its distribution is difficult to describe analytically and it is not an unbiased estimator of ρ, with an approximate bias of E(r − ρ) ≈ −ρ(1 − ρ²)/(2(n − 1)) (Hotelling, 1953).

As correlation-based meta-analyses with r as effect measure occur frequently in psychology and the social sciences, we briefly recall the two standard models (see Schwarzer, Carpenter, & Rücker, 2015): the fixed- and random-effects models. The fixed-effect meta-analysis model is defined as

y_i = μ + ε_i,  i = 1, …, K,   (2)

where μ denotes the common (true) effect, that is, the (transformed) correlation in our case, K the number of available primary reports, and y_i the observed effect in the i-th study. The model errors ε_i are typically assumed to be independently normally distributed, ε_i ~ N(0, σ_i²). In this model the only source of sampling error comes from within the studies. The estimate of the main effect μ is then computed as a weighted mean via

μ̂ = ∑_{i=1}^K (w_i/w) y_i,   (3)

where w := ∑_{i=1}^K w_i and the study weights w_i = σ̂_i⁻² are the reciprocals of the (estimated) sampling variances σ̂_i². This is known as the inverse variance method. The fixed-effect model typically underestimates the observed total variability because it does not account for between-study variability (Schwarzer et al., 2015). However, it has the advantage of being able to pool observations if individual patient data (IPD) are in fact available, allowing for greater flexibility in methodology in this scenario.

The random-effects model extends the fixed-effect model by incorporating a random effect that accounts for between-study variability, such as differences in study population or execution. It is given by

μ_i = μ + u_i + ε_i,  i = 1, …, K,   (4)

where the random effects u_i are typically assumed to be independent and N(0, τ²)-distributed with between-study variance τ², and ε_i ~ N(0, σ_i²) independently. Furthermore, the random effects (u_i)_i and the error terms (ε_i)_i are jointly independent. Thus, for τ² = 0, the fixed-effect model is a special case of the random-effects model. The main effect is again estimated via the weighted mean μ̂ given in equation (3), with study weights now defined as w_i = (σ̂_i² + τ̂²)⁻¹.
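As a minimal numerical sketch of the inverse variance method in equations (3)–(4) (the function name and toy values below are illustrative, not from the paper):

```python
import numpy as np

def pooled_effect(y, v, tau2=0.0):
    """Inverse-variance pooled estimate of the main effect.

    y: observed study effects; v: within-study variances sigma_i^2.
    tau2 = 0 gives the fixed-effect model, tau2 > 0 the random-effects model.
    """
    y, v = np.asarray(y, dtype=float), np.asarray(v, dtype=float)
    w = 1.0 / (v + tau2)                # weights w_i = (sigma_i^2 + tau^2)^(-1)
    mu_hat = np.sum(w * y) / np.sum(w)  # weighted mean, equation (3)
    se = np.sqrt(1.0 / np.sum(w))       # rough standard error of mu_hat
    return mu_hat, se

# With equal within-study variances the pooled estimate is the plain mean;
# adding heterogeneity (tau2 > 0) widens the standard error.
mu_fe, se_fe = pooled_effect([0.2, 0.4], [0.1, 0.1])
mu_re, se_re = pooled_effect([0.2, 0.4], [0.1, 0.1], tau2=0.05)
```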
A plethora of approaches exist for estimating the heterogeneity variance τ². Which estimator should be used has been discussed for a long time, without reaching a definitive conclusion. However, a consensus has been reached that the popular and easy-to-calculate DerSimonian–Laird estimator is not the best option. Authors such as Veroniki et al. (2016) and Langan et al. (2019) have recommended using iterative estimators for τ². We therefore (initially) followed their suggestion and used the REML estimator. However, in some settings, such as large ρ values, the REML estimator had trouble converging, even after the usual remedies of utilizing step halving and/or increasing the maximum number of permitted iterations. We therefore opted to use the two-step estimator suggested by Sidik and Jonkman (SJ), which starts with a rough initial estimate τ̂_0² = (1/K) ∑_{i=1}^K (y_i − ȳ)², which is then updated via the expression

τ̂_SJ² = (1/(K − 1)) ∑_{i=1}^K w_i (y_i − μ̂)²,   (5)

where w_i = τ̂_0²/(σ̂_i² + τ̂_0²) and μ̂ = ∑_{i=1}^K w_i y_i / ∑_{i=1}^K w_i (Sidik & Jonkman, 2005). A comprehensive comparison of heterogeneity estimators for τ² in the context of random-effects meta-analyses for correlations would be interesting but is beyond the scope of this paper. Before discussing different CIs for the common correlation μ within model (4), we take a short excursion on asymptotics for r in the one-group case.

2.1. Background: Asymptotic confidence intervals

Assuming bivariate normality of (X, Y), r is approximately N(ρ, (1 − ρ²)²/n)-distributed for large sample sizes n (Lehmann, 2004). Here, bivariate normality is a necessary assumption to obtain (1 − ρ²)² in the asymptotic variance (Omelka & Pauly, 2012). Plugging in r, we obtain an approximate 100(1 − α)% CI of the form r ± u_{1−α/2}(1 − r²)/√n, where u_{1−α/2} denotes the (1 − α/2) quantile of the standard normal distribution.
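This single-group asymptotic interval can be sketched as follows (function name and inputs are illustrative; standard library only):

```python
from math import sqrt
from statistics import NormalDist

def asymptotic_ci(r, n, alpha=0.05):
    """Approximate CI r +/- u_{1-alpha/2} (1 - r^2)/sqrt(n), assuming bivariate normality."""
    u = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile
    half = u * (1 - r ** 2) / sqrt(n)
    return r - half, r + half

# Example: r = 0.5 from n = 100 observations
lo, hi = asymptotic_ci(0.5, 100)
```

Note that this interval is symmetric around r and is not guaranteed to stay inside [−1, 1], one of the problems the Fisher z transformation below avoids.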
In fixed-effect meta-analyses, when IPD are available, this result can be used to construct a CI based on pooled data: calculating ρ̂_pool, the pooled sample correlation coefficient, we obtain an approximate CI for ρ as

ρ̂_pool ± u_{1−α/2}(1 − ρ̂_pool²)/√N,   (6)

where N := ∑_{i=1}^K n_i is the pooled sample size. As this pooling of observations only makes sense if we assume that each study has the same underlying effect, this approach is not feasible for a random-effects model, even if IPD were available. In any case, even under IPD and a fixed-effect model, this CI is sensitive to the normality assumption and the underlying sample size, as we demonstrate in Table 1 for the case K = 1. We simulated bivariate data from standard normal and standardized lognormal distributions¹ with correlation ρ ∈ {.3, .7} and study size n ∈ {20, 50, 100}. In each setting we performed N = 10,000 simulation runs. For the lognormal data coverage is extremely poor in all cases, ranging from 53% to 80%. For the normally distributed case, coverage was somewhat low at 90% for n = 20 but improved for larger sample sizes. This case study clearly illustrates that alternatives are needed when the data cannot be assumed to stem from a normal distribution or sample sizes are small.

¹ Further details regarding the data generation can be found in the online supplementary materials.

Table 1. Empirical coverage of the asymptotic confidence interval for K = 1, study size n ∈ {20, 50, 100} and correlation ρ ∈ {0.3, 0.7}

Distribution   ρ     n = 20   n = 50   n = 100
Normal         .3    .90      .93      .94
               .7    .90      .92      .94
Lognormal      .3    .79      .80      .79
               .7    .63      .57      .53

After this short excursion we return to model (4) and CIs for ρ.

2.2. The Hunter–Schmidt approach

The aggregation of correlations in the Hunter–Schmidt approach is done by sample size weighting:

r_HS = ∑_{i=1}^K n_i r_i / ∑_{i=1}^K n_i.   (7)

Several formulae have been recommended for estimating the sampling variance of this mean effect size estimate.
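Equation (7) is a plain sample-size-weighted mean, which can be sketched as follows (illustrative values, not from the paper):

```python
def hunter_schmidt_mean(r, n):
    """Sample-size-weighted mean correlation, as in equation (7)."""
    return sum(ni * ri for ri, ni in zip(r, n)) / sum(n)

# Two studies: r = .2 with n = 100 and r = .5 with n = 300;
# the larger study dominates the pooled value.
r_hs = hunter_schmidt_mean([0.2, 0.5], [100, 300])
```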
We opted for a suggestion by Osburn and Callender (1992),

σ̂_HS² = (1/K) · ∑_{i=1}^K n_i (r_i − r_HS)² / ∑_{i=1}^K n_i,   (8)

which is supposed to perform reasonably well in both heterogeneous and homogeneous settings (Schulze, 2004). In the simulation study we will investigate whether this is in fact the case for the resulting CI, r_HS ± u_{1−α/2} σ̂_HS.

2.3. Confidence intervals based on the Fisher z transformation

A disadvantage of the asymptotic confidence interval (6) is that the variance of the limit distribution depends on the unknown correlation ρ. This motivates a variance-stabilizing transformation. A popular choice for correlation coefficients is the Fisher z transformation (Fisher, 1915),

ρ ↦ z = ½ ln((1 + ρ)/(1 − ρ)) = atanh(ρ).   (9)

The corresponding inverse Fisher transformation is z ↦ tanh(z) = (exp(2z) − 1)/(exp(2z) + 1). The variance-stabilizing property of the Fisher transformation follows from the δ-method (Lehmann, 2004); that is, if √n(r − ρ) →d N(0, (1 − ρ²)²), then

√n(ẑ − z) = √n(atanh(r) − atanh(ρ)) →d N(0, 1).

Consequently, it is reasonable to substitute n by n − 3, that is, to approximate the distribution of ẑ by N(atanh(r), 1/(n − 3)), still assuming bivariate normality. Thus, a single-group approximate 100(1 − α)% CI can be constructed via tanh(ẑ ± u_{1−α/2}/√(n − 3)).

In the random-effects model (4), the z transformation may also be used to construct a CI for the common correlation ρ. Here, the idea is again to use inverse variance weights to define

z̄ = ∑_{i=1}^K (1/(n_i − 3) + τ̂²)⁻¹ z_i / ∑_{i=1}^K (1/(n_i − 3) + τ̂²)⁻¹,   (10)

where z_i = atanh(r_i). A rough estimate of the variance of z̄ is given by (∑_{i=1}^K w_i)⁻¹. In the fixed-effect case with τ² = 0 this yields the variance estimate (∑_{i=1}^K (n_i − 3))⁻¹ = (N − 3K)⁻¹.
Then z̄·√(N − 3K) approximately follows a standard normal distribution and an approximate 100(1 − α)% CI is given by tanh(z̄ ± u_{1−α/2}/√(N − 3K)). Proceeding similarly in the random-effects model (4), one obtains the Hedges–Olkin–Vevea Fisher-z (HOVz) CI

tanh(z̄ ± u_{1−α/2}(∑_{i=1}^K w_i)^{−1/2}),   (11)

with w_i = (1/(n_i − 3) + τ̂²)⁻¹ (Hafdahl & Williams, 2009; Hedges & Olkin, 1985; Hedges & Vevea, 1998).

2.3.1. Knapp–Hartung-type CI

The above approximation of the variance of z̄ via (∑_{i=1}^K w_i)⁻¹ can be rather inaccurate, especially in random-effects models. Although this is the exact variance of z̄ when the weights are chosen perfectly as w_i = (σ_i² + τ²)⁻¹, this variance estimate does not protect against (potentially substantial) errors in estimating σ̂_i² and τ̂² (Sidik & Jonkman, 2006). Therefore, we propose an improved CI based on the KH method (Hartung & Knapp, 2001). Knapp and Hartung proposed the following variance estimator for the estimate μ̂ of the main effect μ in a random-effects meta-analysis (REMA):

σ̂_KH² = V̂ar_KH(μ̂) = (1/(K − 1)) ∑_{i=1}^K (w_i/w)(μ̂_i − μ̂)²,   (12)

where again w = ∑_{i=1}^K w_i. They showed that if μ̂ is normally distributed, then (μ̂ − μ)/σ̂_KH follows a t distribution with K − 1 degrees of freedom. Therefore an approximate 100(1 − α)% CI for μ is given by

tanh(z̄ ± t_{K−1,1−α/2} · σ̂_KH),   (13)

where t_{K−1,1−α/2} is the 1 − α/2 quantile of the t distribution with K − 1 degrees of freedom. Because of the approximately normal distribution of z-transformed correlations, the CI (13) seems justified. Various authors have highlighted the favourable performance of the KH approach compared to alternative meta-analytic methods (IntHout, Ioannidis, & Borm, 2014; Viechtbauer et al., 2015; Welz & Pauly, 2020). Analogously to (13), we can construct further CIs by using other variance estimation procedures for Var(μ̂).

2.3.2. Wild bootstrap approach

Another possibility for estimating the variance of z̄ is through bootstrapping.
Bootstrapping belongs to the class of resampling methods. It allows the estimation of the sampling distribution of most statistics using random sampling methods. The wild bootstrap is a subtype of bootstrapping that is applicable in models which exhibit heteroscedasticity. Roughly speaking, the idea of the wild bootstrap approach is to resample the response variables based on the residuals. The idea was originally proposed by Wu (1986) for regression analysis.

We now propose a confidence interval for ρ based on a (data-dependent) wild bootstrap (WBS) approach combined with the z transformation. The idea works as follows. We assume an REMA model with Pearson's correlation coefficient as the effect estimate (and K > 3 studies). Given the estimated study-level correlation coefficients r_i, i = 1, …, K, we transform these using the z transformation to ẑ_i, i = 1, …, K, and estimate z = atanh(ρ) via ẑ = ∑_i (w_i/w) ẑ_i, where again w_i = (σ̂_i² + τ̂²)⁻¹ with σ̂_i² = 1/(n_i − 3) and w = ∑_i w_i. Here, τ̂² may be any consistent estimator of the between-study heterogeneity τ², where we have chosen the SJ estimator. We then calculate the estimated residuals ε̂_i = ẑ_i − ẑ and use these to generate B new sets of study-level effects ẑ*_{1b}, …, ẑ*_{Kb}, b = 1, …, B. Typical choices for B are 1,000 or 5,000. The new study-level effects are generated via

ẑ*_{ib} := ẑ + ε̂_i · v_i,   (14)

where v_i ~ N(0, γ). The usual choice of variance in a WBS is γ = 1. However, we propose a data-dependent choice of either γ_K = (K − 1)/(K − 3) or γ_K = (K − 2)/(K − 3). These choices are based on simulation results, which will be discussed in detail in Section 3. We will later refer to these approaches as WBS1, WBS2 and WBS3, respectively. The corresponding values for γ are 1, (K − 1)/(K − 3) and (K − 2)/(K − 3). This allows us to generate B new estimates of the main effect z by calculating

ẑ*_b = ∑_{i=1}^K w*_{ib} ẑ*_{ib} / ∑_{i=1}^K w*_{ib},   (15)

with w*_{ib} ≡ w_i.
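The resampling steps (14) and (15) can be sketched as follows (a hypothetical implementation; the function name, toy data and the fixed value for τ̂² are ours, and pseudo-effects are generated around the pooled z-scale estimate):

```python
import numpy as np

def wbs_estimates(r, n, tau2, B=1000, gamma=1.0, seed=None):
    """Generate B wild-bootstrap replicates of the pooled effect on the z scale."""
    rng = np.random.default_rng(seed)
    z = np.arctanh(np.asarray(r, dtype=float))       # Fisher z transformation
    w = 1.0 / (1.0 / (np.asarray(n) - 3) + tau2)     # w_i = (sigma_i^2 + tau^2)^(-1)
    z_hat = np.sum(w * z) / np.sum(w)                # pooled estimate z-hat
    resid = z - z_hat                                # estimated residuals
    # v_i ~ N(0, gamma); pseudo-effects z*_ib = z_hat + resid_i * v_i
    v = rng.normal(0.0, np.sqrt(gamma), size=(B, len(z)))
    z_star = z_hat + resid * v                       # B x K pseudo-data sets
    # pooled pseudo-estimates with unchanged weights w*_ib = w_i
    return np.sum(w * z_star, axis=1) / np.sum(w)

z_boot = wbs_estimates([0.2, 0.3, 0.4, 0.5, 0.6], [30] * 5,
                       tau2=0.02, B=4000, seed=1)
```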
We then estimate the variance of ẑ via the empirical variance of ẑ*_1, …, ẑ*_B,

σ̂*_z² := (1/(B − 1)) ∑_{b=1}^B (ẑ*_b − z̄*)²,  with  z̄* = (1/B) ∑_{b=1}^B ẑ*_b.

It is now possible to construct a CI for z as in equation (13), but with this new variance estimate of ẑ. The CI is back-transformed via the inverse Fisher transformation to obtain a CI for the common correlation ρ, given by

tanh(ẑ ± t_{K−1,1−α/2} · σ̂*_z).   (16)

Figure 1. Visual illustration of the wild bootstrap procedure for generating B bootstrap samples of the main effect estimate on the z scale: transform the correlations r_i to z_i, fit the REMA model and calculate the residuals; then, repeated B times, draw v_i ~ N(0, γ), generate pseudo-data, fit a new REMA and save the effect estimate. REMA, random-effects meta-analysis.

2.3.3. HC-type variance estimators

Last but not least, we employ heteroscedasticity consistent variance estimators (sandwich estimators; White, 1980). Different forms (HC0, …, HC5) are in use for linear models (Rosopa, Schaffer, & Schroeder, 2013). The motivation for the robust HC variance estimators is that in a linear regression setting the usual variance estimate is unbiased when unit-level errors are independent and identically distributed. However, when the unit-level variances are unequal, this approach can be biased. If we apply this to the meta-analysis context, the study-level variances are almost always unequal due to varying sample sizes. Therefore, it makes sense to consider variance estimators that are unbiased even when the unit (study) level variances are different. The extension of HC estimators to the meta-analysis context can be found in Viechtbauer et al. (2015) for HC0 and HC1, and in Welz and Pauly (2020) for the remaining HC2, …, HC5. Statistical tests based on these robust estimators have been shown to perform well, especially those of types HC3 and HC4.
In the special case of an REMA they are defined as

σ̂_HC3² = (∑_{i=1}^K w_i)⁻² ∑_{j=1}^K w_j² ε̂_j² / (1 − x_{jj})²,

σ̂_HC4² = (∑_{i=1}^K w_i)⁻² ∑_{j=1}^K w_j² ε̂_j² / (1 − x_{jj})^{δ_j},  δ_j = min{4, x_{jj}/x̄},

with ε̂_j = ẑ_j − ẑ, x_{jj} = w_j/∑_{i=1}^K w_i and x̄ = K⁻¹ ∑_{j=1}^K x_{jj} (see the Appendix S1 of Welz and Pauly, 2020 for details). Plugging them into equation (13) leads to the confidence intervals

tanh(ẑ ± t_{K−1,1−α/2} · σ̂_HCj),  j = 3, 4.   (17)

2.3.4. Integral z-to-r transformation

There is a fundamental problem with back-transforming CIs on the z scale using the inverse Fisher transformation tanh. Consider a random variable ξ ~ N(artanh(ρ), σ²) with some variance σ² > 0 and ρ ≠ 0. Then ρ = tanh(E(ξ)) ≠ E(tanh(ξ)) by Jensen's inequality. This means the back-transformation introduces an additional bias. A remedy was proposed by Hafdahl (2009), who suggested back-transforming from the z scale using an integral z-to-r transformation. This transformation is the expected value of tanh(z), where z ~ N(μ_z, τ_z²); that is,

ψ(μ_z | τ_z²) = ∫_{−∞}^{∞} tanh(t) f(t | μ_z, τ_z²) dt,   (18)

where f is the density of z. In practice we apply this transformation to the lower and upper confidence limits on the z scale, plugging in the estimates ẑ and τ̂_z². For example, for the KH-based CI (13) with z-scale confidence bounds ℓ = z̄ − t_{K−1,1−α/2}·σ̂_KH and u = z̄ + t_{K−1,1−α/2}·σ̂_KH, and an estimated heterogeneity τ̂_z² (on the z scale), the CI is given by

(ψ(ℓ | τ̂_z²), ψ(u | τ̂_z²)).

If the true distribution of ẑ is well approximated by a normal distribution and τ̂_z² is a good estimate of the heterogeneity variance (on the z scale), ψ should improve the CIs as compared to simple back-transformation with tanh (Hafdahl, 2009). Following this argument, we also suggest using ψ instead of tanh. We calculate the integral with Simpson's rule (Süli & Mayers, 2003), which is a method for the numerical approximation of definite integrals. Following Hafdahl (2009), 150 subintervals over ẑ ± 5·τ̂_SJ were used.
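A numerical sketch of ψ via composite Simpson's rule, using 150 subintervals over a ±5 standard deviation grid as described above (function and argument names are illustrative):

```python
import numpy as np

def integral_z_to_r(mu_z, tau2_z, n_sub=150, half_width=5.0):
    """psi(mu_z | tau2_z) = E[tanh(Z)] for Z ~ N(mu_z, tau2_z), via Simpson's rule."""
    tau = np.sqrt(tau2_z)
    # grid of n_sub subintervals (n_sub must be even for Simpson's rule)
    t = np.linspace(mu_z - half_width * tau, mu_z + half_width * tau, n_sub + 1)
    h = t[1] - t[0]
    # integrand: tanh(t) times the N(mu_z, tau2_z) density
    f = np.tanh(t) * np.exp(-(t - mu_z) ** 2 / (2 * tau2_z)) / np.sqrt(2 * np.pi * tau2_z)
    coef = np.ones(n_sub + 1)            # composite Simpson weights 1,4,2,...,2,4,1
    coef[1:-1:2], coef[2:-1:2] = 4.0, 2.0
    return h / 3.0 * np.sum(coef * f)
```

Consistent with the Jensen's inequality argument above, for a positive z-scale bound ψ is slightly smaller than tanh of that bound (e.g., ψ(0.5 | 0.04) < tanh(0.5)), so the back-transformed limits are pulled toward zero.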
Note that the HOVz CI is implemented in its original formulation, using tanh.

3. Simulation study

We have suggested several new CIs for the mean correlation ρ, all based on the z transformation, applicable in both fixed- and random-effects models. In order to investigate their properties (especially coverage of ρ), we perform extensive Monte Carlo simulations. We focus on comparing the coverage of our newly suggested CIs with existing methods.

3.1. Simulation study design

The Pearson correlation coefficient is constrained to lie in the interval [−1, 1]. The typical random-effects model μ_i = μ + u_i + ε_i, assuming a normal distribution for the random effect u_i ~ N(0, τ²) and error term ε_i ~ N(0, σ_i²), needs to be adjusted, since values outside of [−1, 1] could result when sampling without any modification.

3.1.1. Model 1

As a first option for generating the (true) study-level correlations, we consider a truncated normal distribution ρ_i ~ N(ρ, τ²). Sampling of ρ_i is repeated until a sample lies within the interval [−0.999, 0.999]. This type of truncated normal distribution model was also used in Hafdahl and Williams (2009) and Field (2005). A problem with this modelling approach is that the expected value of the resulting truncated normal distribution is in general not equal to ρ. For a random variable X stemming from a truncated normal distribution with mean μ, variance σ², lower bound a and upper bound b,

E(X) = μ + σ·(φ(Δ_1) − φ(Δ_2))/δ,

where Δ_1 = (a − μ)/σ, Δ_2 = (b − μ)/σ and δ = Φ(Δ_2) − Φ(Δ_1) (Johnson, Kotz, & Balakrishnan, 1994). Here φ(·) is the probability density function of the standard normal distribution and Φ(·) its cumulative distribution function. Figure S15 shows the bias in our setting with a = −0.999 and b = 0.999. The bias is equal to σ·(φ(Δ_1) − φ(Δ_2))/δ. In addition to generating a biased effect, the truncation also leads to a reduction of the overall variance, which is smaller than τ².
3.1.2. Model 2

We therefore studied a second model, in which we generate the (true) study-level effects ρ_i from transformed beta distributions: Y_i = 2(X_i − 0.5) with X_i ~ Beta(α, β) for studies i = 1, …, K. The idea is to choose the respective shape parameters α, β such that

E(Y_i) = 2·(α/(α + β) − 0.5) = ρ,
Var(Y_i) = 4αβ/((α + β)²(α + β + 1)) = τ².

The solution to the system of equations above is

α = ((1 − ρ)(1 + ρ) − τ²)(1 + ρ)/(2τ²),  β = ((1 − ρ)/(1 + ρ))·α.

In this second simulation scenario we also truncate the sampling distribution of the correlation coefficients to [−0.999, 0.999], but values outside of this interval are considerably rarer. The second model has the advantage that the expected value and variance are approximately correct, unlike in the first (truncated) model. A disadvantage is that for extreme τ² values, the above solution for α (and thus β) may become negative, which is undefined for parameters of a beta distribution. However, this was not a concern for the parameters considered in our simulation study and only occurs in more extreme scenarios.

3.1.3. Parameter choices

In order to get a broad overview of the performance of all methods, we simulated various configurations of population correlation coefficient, heterogeneity, sample size and number of studies. Here we chose the correlations ρ ∈ {0, .1, .3, .5, .6, .7, .8, .9} and heterogeneity τ ∈ {0, 0.16, 0.4}. We used the same values for τ as Hafdahl and Williams (2009), to enable comparability of our simulation studies. Moreover, we considered small to large numbers K ∈ {5, 10, 20, 40} of studies with different study sizes. For K = 5, we considered n⃗ = (15, 16, 19, 23, 27) as the vector of 'small' study sizes and 4·n⃗ for larger study sizes, corresponding to an average study size (n̄) of 20 and 80 subjects, respectively. For all other choices of K we proceeded similarly, stacking copies of n⃗ behind each other, for example, the sample size vectors (n⃗, n⃗) and 4·(n⃗, n⃗) for K = 10.
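The shape-parameter solution for Model 2 above can be checked numerically (a sketch; the function name is illustrative):

```python
def beta_shape_params(rho, tau2):
    """Shape parameters so that Y = 2(X - 0.5), X ~ Beta(a, b), has mean rho and variance tau2."""
    a = ((1 - rho) * (1 + rho) - tau2) * (1 + rho) / (2 * tau2)
    b = (1 - rho) / (1 + rho) * a
    return a, b

# rho = .5 and tau = 0.16 (one of the simulated settings)
a, b = beta_shape_params(0.5, 0.16 ** 2)
mean_y = 2 * (a / (a + b) - 0.5)                    # should equal rho
var_y = 4 * a * b / ((a + b) ** 2 * (a + b + 1))    # should equal tau^2
```

For these values both shape parameters are positive and the target mean and variance are reproduced; as noted above, very large τ² would drive α (and thus β) negative.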
By way of comparison, Hafdahl and Williams (2009) considered 5 ≤ K ≤ 30. As we wanted to capture the methods' behaviour when many studies are present, we also included the setting K = 40 in our simulation study. Additionally, we accounted for variability in study sizes, which will be present in virtually any meta-analysis in practice. We also considered two special scenarios: the case of few and heterogeneous studies, with study size vector (23, 19, 250, 330, 29), and the case of many large studies, with study size vector (n⃗*, n⃗*), where n⃗* = (210, 240, 350, 220, 290, 280, 340, 400, 380, 290). The latter case corresponds to K = 20 studies with an average of 300 study subjects. Thus, in total we simulated 8 (ρ) × 3 (τ²) × 10 (K, study size vector) × 2 (models) = 480 different scenarios for each type of confidence interval discussed in this paper. For each scenario we performed N = 10,000 simulation runs, where for the WBS CI each run was based upon B = 1,000 bootstrap replications. The primary focus was on comparing empirical coverage, with nominal coverage being 1 − α = .95. For 10,000 iterations, the Monte Carlo standard error of the simulated coverage will be approximately √(.95 × .05/10,000) ≈ 0.218%, using the formula provided in the recent work on simulation studies by Morris, White, and Crowther (2019). All simulations were performed using the open-source software R. The R scripts written by the first author especially make use of the metafor package for meta-analysis (Viechtbauer, 2010).

3.2. Results

For ease of presentation, we aggregated the multiple simulation settings with regard to number and size of studies. The graphics therefore display the mean observed coverage for each confidence interval type and true main effect ρ. Results are separated by heterogeneity τ² and simulation design.
The latter refers to the truncated normal distribution approach and the transformed beta distribution approach, respectively. More detailed simulation results for all settings considered are given in the Appendix S1.

Figure 2. Mean coverage for truncated normal distribution model with τ = 0, aggregated across all number of studies and study size settings. HC, heteroscedasticity-consistent; HOVz, Hedges–Olkin–Vevea Fisher z; HS, Hunter–Schmidt; KH, Knapp–Hartung; WBS, wild bootstrap.

3.2.1. Coverage

We first discuss the results based on the truncated normal distribution (model 1). In the case of no heterogeneity (fixed-effect model), Figure 2 shows that the new methods control the nominal coverage of 95% well. Only the first wild bootstrap (WBS1) CI exhibits liberal behaviour, yielding empirical coverage of approximately 93.5%. The HS approach only provides 90% coverage, and HOVz was slightly conservative with (mean) coverage of around 97–98%. Moreover, in the fixed-effect model the value of ρ did not affect any of the methods. In the truncated normal set-up with moderate heterogeneity of τ = 0.16 in Figure 3, several things change. First, there is a strong drop-off in coverage for higher correlations ρ ≥ .8. For HS this drop-off occurs earlier, at ρ ≥ .7. Second, for ρ ≤ .7, HS is even more liberal than for τ = 0, with coverage around 87.5%. Additionally, HOVz is no longer conservative but becomes more liberal than WBS1, with estimated coverage probabilities around 90–94% for ρ ≤ .7. For all new methods a slight decrease in coverage can be observed for increasing values of ρ from 0 to .7. Moreover, there is a slight uptick at ρ = .8 for HOVz, followed by a substantial drop-off. Overall, the WBS3, HC3, HC4 and KH CIs show the best control of nominal coverage in this setting. We now consider model 2 with a transformed beta distribution model.
In the fixed-effect case (τ² = 0) the two models are equivalent, so we obtain the same coverage as in Figure 2. For moderate heterogeneity (τ = 0.16; see Figure 4), our newly proposed methods clearly outperform HOVz and HS, with good control of nominal coverage. Only for ρ = .9 is their coverage slightly liberal. WBS1 performs just slightly worse than the other new CIs. The observed coverage for HS is around 86–88% for ρ ≤ .7 and drops to just below 80% for ρ = .9. For ρ > .6 the HOVz CI is even worse, with values dropping (substantially) below 75%.

Figure 3. Mean coverage for truncated normal distribution model with τ = 0.16, aggregated across all number of studies and study size settings. HC, heteroscedasticity-consistent; HOVz, Hedges–Olkin–Vevea Fisher z; HS, Hunter–Schmidt; KH, Knapp–Hartung; WBS, wild bootstrap.

Figure 4. Mean coverage for transformed beta distribution model with τ = 0.16, aggregated across all number of studies and study size settings. HC, heteroscedasticity-consistent; HOVz, Hedges–Olkin–Vevea Fisher z; HS, Hunter–Schmidt; KH, Knapp–Hartung; WBS, wild bootstrap.

For ease of presentation, the results for the case of extreme heterogeneity with τ = 0.4 are given in the Appendix S1. Here, we only summarize important points from Figures S13–S14. In the truncated normal distribution model we observe that HS again has unsatisfactory coverage compared with the other approaches. For our new CIs based on the Fisher transformation, for small K, coverage is approximately correct for ρ ≤ .6 and then drops off considerably. HOVz is slightly liberal with coverage around 90% for ρ ≤ .6 and then drops off strongly. This holds for both smaller and larger studies with n̄ ∈ {20, 80}, respectively. For an increasing number of studies K, HOVz remains largely unchanged, whereas coverage of the new methods gets progressively worse (i.e., the drop-off in coverage occurs earlier for an increasing number of studies).
For K = 40 the new CIs only have correct coverage for ρ ≤ .3. In the case of the beta distribution model with τ = 0.4, the new CIs provide correct coverage for ρ ≤ .7 in all scenarios, dropping off after this threshold. HOVz is highly inadequate, with coverage growing progressively worse for increasing K. HOVz only has correct coverage for simultaneously ρ ≤ .1 and large K. For K = 5, HS has coverage of up to 82%, decreasing for increasing values of ρ. However, for an increasing number of studies (whether large or small), HS appears to converge towards nominal coverage. In particular, for K = 40 and ρ > .7, HS provides the most accurate coverage under the beta distribution model.

3.2.2. Interval lengths

We simulated the expected confidence interval lengths for all methods discussed in this paper. The detailed results are provided in Figures S7–S12. The results again depend on both the assumed model and the amount of heterogeneity τ. Generally, we observe that the confidence intervals become increasingly narrow for increasing values of ρ and increasingly wide for larger values of τ. For the truncated normal distribution model and τ = 0, HS (on average) yields the shortest confidence intervals and HOVz the widest, with the other CIs lying in between with quite similar lengths. Only for K = 5 are the CIs based on the wild bootstrap quite wide, indicating that potentially more studies are required to reliably use WBS-based approaches. For τ = 0.16, HS again yields the shortest CIs in all scenarios. For small K, the WBS approaches yield the widest CIs; for more studies, HOVz is the widest when ρ is small, but becomes nearly as narrow as HS when ρ is close to 1. The lengths of the other CIs are nearly identical for K = 40, whereas for fewer studies there are considerable differences. This relative evaluation also holds for τ = 0.4. When the underlying model is the beta distribution model and τ = 0, the results are equivalent to the truncated normal distribution model.
For τ = 0.16 and K = 5 the widths of the new CIs decrease with increasing ρ until ρ = .7. Interestingly, the widths of these CIs then increase again for ρ > .7, which was not observed in the truncated normal model. This effect becomes much less pronounced for an increasing number of studies K. HS is always narrower than the new CIs, and, for K ≥ 20, HOVz is the widest at ρ = 0 but even narrower than HS for ρ ≥ .8. For τ = 0.4 the results are similar, except that the widths of the CIs now decrease monotonically for increasing ρ and HOVz is narrowest for ρ > .5.

3.2.3. Recommendations

We summarize our findings by providing recommendations to practitioners wishing to choose between the methods considered. The recommendations depend on the assumed model and how much heterogeneity is present in the data. We believe the beta distribution model is better suited for random-effects meta-analyses of correlations. Recall that HOVz employs the inverse Fisher transformation, whereas our newly proposed confidence intervals employ the integral z-to-r transformation suggested by Hafdahl (2009).

• τ = 0 (fixed-effect model). HS and HOVz are not recommended. We recommend using KH, HC3 or HC4.
• τ = 0.16. For the truncated normal model, HS and HOVz are not recommended and we recommend using KH, HC3 or HC4. For |ρ| > .7, all methods are unsatisfactory and only in the case of K = 40 may HOVz be preferable. For the beta distribution model, HS and HOVz are not recommended. All new confidence intervals exhibit satisfactory coverage. For small K, WBS approaches yield wider confidence intervals; therefore, preferably use KH, HC3 or HC4.
• τ = 0.4. For the truncated normal model, HS is not recommended. For K = 5 and |ρ| ≤ .7 we again recommend KH, HC3 or HC4. For K ≥ 10 and |ρ| ≤ .7 we recommend HOVz. For |ρ| > .7 none of the methods is satisfactory. For the beta distribution model, HOVz is not recommended. For |ρ| ≤ .7 we recommend KH, HC3 or HC4. For K ≥ 40 and |ρ| > .7 we recommend using HS.
For K ≤ 20 and |ρ| > .7 none of the methods is satisfactory.

4. Illustrative data analyses

Between 25% and 50% of patients fail to take their medication as prescribed by their caregiver (Molloy et al., 2013). Some studies have shown that medication adherence tends to be better in patients who score higher on conscientiousness (from the five-factor model of personality). Table 2 contains data on 16 studies which investigated the correlation between conscientiousness and medication adherence. These studies were first analysed in the form of a meta-analysis in Molloy et al. (2013). The columns of Table 2 contain information on the authors of the respective study, the year of publication, the sample size of study i (n_i), the observed correlation in study i, the number of variables controlled for (controls), study design, the type of adherence measure (a_measure), the type of conscientiousness measure (c_measure), the mean age of study participants (mean_age) and the methodological quality (as scored by the authors on a scale from 1 to 4, with higher scores indicating higher quality). Regarding the measurement of conscientiousness, where NEO (Neuroticism-Extraversion-Openness) is indicated as c_measure, the personality trait of conscientiousness was measured by one of the various types of NEO personality inventories (PIs; Costa Jr and McCrae, 1985, 2008). We performed both a fixed- and random-effects meta-analysis, using all methods considered. For the random-effects model we used the SJ estimator to estimate the between-study heterogeneity variance τ². Combining all available studies yielded r_FE = .130, r_RE = .154 and τ̂²_SJ = 0.012. In addition to a complete-case study, we also examined the cross-sectional and prospective studies separately. In total there were five cross-sectional and 11 prospective studies in the data set. For the cross-sectional studies r_FE = .168 and r_RE = .170 resulted, with slightly lower values for the prospective studies (r_FE = .108, r_RE = .147).
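As a rough illustration of the Fisher-z pooling that underlies HOVz-style estimates such as these, the following Python sketch pools the correlations of the first five studies from Table 2. It is not the authors' R code: for simplicity it uses the DerSimonian–Laird heterogeneity estimator rather than the SJ estimator employed in the paper, and the ordinary inverse Fisher transformation for the back-transform.

```python
import numpy as np

# First five studies from Table 2 (Molloy et al., 2013): observed r_i and n_i
r = np.array([0.19, 0.16, 0.34, 0.32, 0.27])
n = np.array([109, 749, 55, 107, 72])

z = np.arctanh(r)        # Fisher z transform of each study correlation
v = 1.0 / (n - 3)        # large-sample variance of z_i

# Fixed-effect pooled estimate on the z scale (inverse-variance weights)
w = 1.0 / v
z_fe = np.sum(w * z) / np.sum(w)

# DerSimonian-Laird between-study variance (simpler than the SJ estimator)
Q = np.sum(w * (z - z_fe) ** 2)
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - (len(r) - 1)) / c)

# Random-effects pooling and a 95% normal-quantile CI, back-transformed
w_re = 1.0 / (v + tau2)
z_re = np.sum(w_re * z) / np.sum(w_re)
se_re = np.sqrt(1.0 / np.sum(w_re))
r_pooled = np.tanh(z_re)
ci = np.tanh([z_re - 1.96 * se_re, z_re + 1.96 * se_re])
print(r_pooled, ci)
```

The KH- and HC-based intervals proposed in the paper replace the normal quantile and the standard error in the last step with their respective variance estimators.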
Heterogeneity estimates were τ̂²_SJ = 0.007 (cross-sectional) and τ̂²_SJ = 0.016 (prospective), respectively. In Table 3 we provide the values of all CIs discussed in this paper. In the case of all studies (K = 16), all methods yield quite similar CIs except for HS. Additional simulations for this situation (K = 16, τ² = 0.012, n_i as in Table 2) are given in the Appendix S1 and show a coverage of around 80% for HS, while all other methods exhibit fairly accurate coverage of around 95% (HOVz around 94%). Thus, the price paid for the narrow HS CIs is poor coverage. Additional analyses of other data sets are given in the Appendix S1.

Table 2. Data from 16 studies investigating the correlation between conscientiousness and medication adherence

Study i | Authors | Year | n_i | r_i | Controls | Design | a_measure | c_measure | mean_age | Quality
1 | Axelsson et al. | 2009 | 109 | .19 | None | Cross-sectional | Self-report | Other | 22.00 | 1
2 | Axelsson et al. | 2011 | 749 | .16 | None | Cross-sectional | Self-report | NEO | 53.59 | 1
3 | Bruce et al. | 2010 | 55 | .34 | None | Prospective | Other | NEO | 43.36 | 2
4 | Christensen et al. | 1999 | 107 | .32 | None | Cross-sectional | Self-report | Other | 41.70 | 1
5 | Christensen and Smith | 1995 | 72 | .27 | None | Prospective | Other | NEO | 46.39 | 2
6 | Cohen et al. | 2004 | 65 | .00 | None | Prospective | Other | NEO | 41.20 | 2
7 | Dobbels et al. | 2005 | 174 | .17 | None | Cross-sectional | Self-report | NEO | 52.30 | 1
8 | Ediger et al. | 2007 | 326 | .05 | Multiple | Prospective | Self-report | NEO | 41.00 | 3
9 | Insel et al. | 2006 | 58 | .26 | None | Prospective | Other | Other | 77.00 | 2
10 | Jerant et al. | 2011 | 771 | .01 | Multiple | Prospective | Other | NEO | 78.60 | 3
11 | Moran et al. | 1997 | 56 | −.09 | Multiple | Prospective | Other | NEO | 57.20 | 2
12 | O'Cleirigh et al. | 2007 | 91 | .37 | None | Prospective | Self-report | NEO | 37.90 | 2
13 | Penedo et al. | 2003 | 116 | .00 | None | Cross-sectional | Self-report | NEO | 39.20 | 1
14 | Quine et al. | 2012 | 537 | .15 | None | Prospective | Self-report | Other | 69.00 | 2
15 | Stilley et al. | 2004 | 158 | .24 | None | Prospective | Other | NEO | 46.20 | 3
16 | Wiebe and Christensen | 1997 | 65 | .04 | None | Prospective | Other | NEO | 56.00 | 1

n_i, study size; r_i, empirical correlation; controls, number of variables controlled for; design, study design; a_measure, type of adherence measure (self-report or other); c_measure, type of conscientiousness measure (NEO or other); mean_age, mean age of study participants; quality, methodological quality.

Table 3. Random-effects model confidence intervals for all studies and subgroups separated by study design; original data from Molloy et al. (2013)

Approach | All designs | Cross-sectional | Prospective
HOVz | [.081, .221] | [.067, .266] | [.050, .240]
HS | [.073, .174] | [.100, .220] | [.035, .166]
KH | [.080, .218] | [.037, .291] | [.043, .239]
WBS1 | [.086, .213] | [.063, .267] | [.051, .232]
WBS2 | [.079, .219] | [.053, .276] | [.043, .239]
WBS3 | [.084, .215] | [.058, .272] | [.048, .234]
HC3 | [.081, .218] | [.041, .288] | [.041, .241]
HC4 | [.083, .216] | [.054, .276] | [.045, .237]

HC, heteroscedasticity-consistent; HOVz, Hedges–Olkin–Vevea Fisher z; HS, Hunter–Schmidt; KH, Knapp–Hartung; WBS, wild bootstrap.

5. Discussion

We introduced several new methods to construct confidence intervals for the main effect in random-effects meta-analyses of correlations, based on the Fisher z transformation. We compared these to the standard HOVz and Hunter–Schmidt confidence intervals and, following the suggestion by Hafdahl (2009), utilized an integral z-to-r transformation instead of the inverse Fisher transformation. We performed an extensive Monte Carlo simulation study in order to assess the coverage and mean interval length of all CIs.
In addition to the truncated normal distribution model considered by Hafdahl and Williams (2009) and Field (2005), we investigated a transformed beta distribution model which exhibits less bias in the generation of the study-level effects. The results of our simulations show that for low and moderate heterogeneity and correlations of |ρ| ≤ .7, our newly proposed confidence intervals improved coverage considerably over the classical HOVz and Hunter–Schmidt approaches. However, for extreme heterogeneity and |ρ| > .7 all confidence intervals performed poorly. Therefore, further methodological research is necessary in order to fill this gap. Also, the choice of data-generating model (truncated normal or transformed beta distribution) has substantial influence on the results. For various reasons, which we discussed when introducing the two models, the beta distribution model is arguably more appropriate. Based on our findings, we provide recommendations to practitioners looking for guidance in choosing a method for data analysis. These are listed in Section 3.2.3.

5.1. Limitations and further research

In the present paper we focused on the Pearson correlation coefficient, as it is the most commonly used dependence measure. However, a limitation of the Pearson correlation coefficient is that it only considers the linear relationship between variables. If variables are related via some nonlinear function or significant outliers are present, other correlation coefficients such as Spearman's rank correlation may be more appropriate. The Spearman correlation coefficient is the Pearson correlation coefficient of the rank values of the variables considered. Moreover, it shares similar properties with Pearson's correlation, such as taking values in [−1, 1] and even being asymptotically normal under relatively weak assumptions (Schmid & Schmidt, 2007).
The confidence intervals we discussed in this paper can be calculated analogously for Spearman correlation coefficients, for example when dealing with ordinal data. Evaluating their performance in conjunction with Spearman correlations, as we did in our simulation study for Pearson correlations, is a topic for future research. A detailed analysis of Spearman's and more general correlations, as in Schober, Boer, and Schwarte (2018), is however outside the scope of this paper. When dealing with different underlying data than we considered in our paper, it should be kept in mind that although the underlying normal-normal model (4) is often very useful, it has some limitations. For example, when dealing with binomial variables with extreme observations, normal approximations may perform poorly (Agresti & Coull, 1998). A context where this might occur is ceiling or floor effects on questionnaires or ability tests; that is, when many participants obtain a near maximal (or minimal) score on some questionnaire, a normal approximation may be invalid. Count data may also be problematic, due to their ordinal nature and especially when zeros frequently occur. Therefore researchers should carefully consider the data being analysed when choosing a fitting model in practical applications. In real-life data sets model (4) may be improved by including meaningful moderator variables, leading to meta-regression as considered in Viechtbauer et al. (2015) and Welz and Pauly (2020). This can considerably reduce the heterogeneity present in the model. We attempted to further improve the proposed confidence intervals with the help of a bias correction for the Pearson correlation coefficient r, given by r* = r + r(1 − r²)/(2(n − 1)), as the (negative) bias of r is usually approximated by B_r = −ρ(1 − ρ²)/(2(n − 1)) (Hotelling, 1953; Schulze, 2004). However, this bias correction actually made coverage worse in the settings studied.
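For concreteness, this (ultimately unhelpful) correction amounts to adding back a plug-in estimate of the approximate bias. A minimal Python sketch, illustrative rather than the authors' R implementation:

```python
def corrected_r(r, n):
    """Approximately bias-corrected Pearson correlation: adds back the
    plug-in estimate r*(1 - r**2)/(2*(n - 1)) of the (negative) bias of r."""
    return r + r * (1 - r ** 2) / (2 * (n - 1))

# For positive r the corrected value is slightly larger, and the
# correction shrinks as n grows.
print(corrected_r(0.5, 20))
print(corrected_r(0.5, 200))
```

Because the correction is O(1/n), it is tiny for the study sizes in our simulations, which is consistent with it not improving (and here even worsening) coverage.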
Acknowledgments

The authors gratefully acknowledge the computing time provided on the Linux HPC cluster at Technical University Dortmund (LiDO3), partially funded in the course of the Large-Scale Equipment Initiative by the German Research Foundation (DFG) as project 271512359. Furthermore, we thank Marléne Baumeister and Lena Schmid for many helpful discussions, Wolfgang Trutschnig and Sebastian Fuchs for valuable comments on Spearman's rho, and Philip Buczak for finding interesting data sets. This work was supported by the German Research Foundation (Grant no. PA 2409/7-1).

Conflict of interest

All authors declare no conflict of interest.

Author contributions

Thilo Welz, M.Sc. Mathematical Biometry (Conceptualization; Formal analysis; Methodology; Software; Visualization; Writing – original draft); Philipp Doebler (Methodology; Writing – review & editing); Markus Pauly (Funding acquisition; Methodology; Project administration; Supervision; Writing – review & editing).

Data availability statement

The R scripts used for our simulations and data analyses will be made publicly available on osf.io under https://osf.io/t83b7/. The data set from Molloy et al. (2013) can be found in the metafor package in R, and the data sets considered for re-analysis are from Chalkidou et al. (2012) and Santos et al. (2016), respectively.

References

Agresti, A., & Coull, B. A. (1998). Approximate is better than "exact" for interval estimation of binomial proportions. The American Statistician, 52, 119–126. https://doi.org/10.1080/00031305.1998.10480550

Aldao, A., Nolen-Hoeksema, S., & Schweizer, S. (2010). Emotion-regulation strategies across psychopathology: A meta-analytic review. Clinical Psychology Review, 30, 217–237. https://doi.org/10.1016/j.cpr.2009.11.004

Barrick, M. R., & Mount, M. K. (1991). The big five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44, 1–26.
https://doi.org/10.1111/j.1744-6570.1991.tb00688.x

Chalkidou, A., Landau, D. B., Odell, W., Cornelius, V. R., O'Doherty, M. J., & Marsden, P. K. (2012). Correlation between Ki-67 immunohistochemistry and 18F-fluorothymidine uptake in patients with cancer: A systematic review and meta-analysis. European Journal of Cancer, 48, 3499–3513. https://doi.org/10.1016/j.ejca.2012.05.001

Cheung, M. W. L. (2015). Meta-analysis: A structural equation modeling approach. Hoboken, NJ: John Wiley & Sons.

Costa, P. T., Jr, & McCrae, R. R. (1985). The NEO Personality Inventory. Odessa, FL: Psychological Assessment Resources.

Costa, P. T., Jr, & McCrae, R. R. (2008). The Revised NEO Personality Inventory (NEO-PI-R). Thousand Oaks, CA: Sage.

Cribari-Neto, F., Souza, T. C., & Vasconcellos, K. L. (2007). Inference under heteroskedasticity and leveraged data. Communications in Statistics – Theory and Methods, 36, 1877–1888. https://doi.org/10.1080/03610920601126589

Cribari-Neto, F., & Zarkos, S. G. (2004). Leverage-adjusted heteroskedastic bootstrap methods. Journal of Statistical Computation and Simulation, 74, 215–232. https://doi.org/10.1080/0094965031000115411

Field, A. P. (2001). Meta-analysis of correlation coefficients: A Monte Carlo comparison of fixed- and random-effects methods. Psychological Methods, 6, 161–180. https://doi.org/10.1037/1082-989X.6.2.161

Field, A. P. (2005). Is the meta-analysis of correlation coefficients accurate when population correlations vary? Psychological Methods, 10, 444–467. https://doi.org/10.1037/1082-989X.10.4.444

Fisher, R. A. (1915). Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika, 10, 507–521. https://doi.org/10.2307/2331838

Hafdahl, A. R. (2009).
Improved Fisher z estimators for univariate random-effects meta-analysis of correlations. British Journal of Mathematical and Statistical Psychology, 62, 233–261. https://doi.org/10.1348/000711008X281633

Hafdahl, A. R., & Williams, M. A. (2009). Meta-analysis of correlations revisited: Attempted replication and extension of Field's (2001) simulation studies. Psychological Methods, 14, 24–42. https://doi.org/10.1037/a0014697

Hartung, J. (1999). An alternative method for meta-analysis. Biometrical Journal, 41, 901–916. https://doi.org/10.1002/(SICI)1521-4036(199912)41:8<901::AID-BIMJ901>3.0.CO;2-W

Hartung, J., & Knapp, G. (2001). A refined method for the meta-analysis of controlled clinical trials with binary outcome. Statistics in Medicine, 20, 3875–3889. https://doi.org/10.1002/sim.1009

Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. San Diego, CA: Academic Press.

Hedges, L., & Vevea, J. (1998). Fixed- and random-effects models in meta-analysis. Psychological Methods, 3, 486–504.

Hotelling, H. (1953). New light on the correlation coefficient and its transforms. Journal of the Royal Statistical Society, Series B, 15, 193–232. https://doi.org/10.1111/j.2517-6161.1953.tb00135.x

Hunter, J. E., & Schmidt, F. L. (1994). Estimation of sampling error variance in the meta-analysis of correlations: Use of average correlation in the homogeneous case. Journal of Applied Psychology, 79, 171.

Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings. Thousand Oaks, CA: Sage.

IntHout, J., Ioannidis, J. P., & Borm, G. F. (2014). The Hartung-Knapp-Sidik-Jonkman method for random effects meta-analysis is straightforward and considerably outperforms the standard DerSimonian-Laird method. BMC Medical Research Methodology, 14, 1–12. https://doi.org/10.1186/1471-2288-14-25

Jak, S. (2015).
Meta-analytic structural equation modelling. New York: Springer.

Johnson, N. L., Kotz, S., & Balakrishnan, N. (1994). Continuous univariate distributions. New York: John Wiley & Sons.

Langan, D., Higgins, J. P., Jackson, D., Bowden, J., Veroniki, A. A., Kontopantelis, E., … Simmonds, M. (2019). A comparison of heterogeneity variance estimators in simulated random-effects meta-analyses. Research Synthesis Methods, 10, 83–98. https://doi.org/10.1002/jrsm.1316

Lehmann, E. L. (2004). Elements of large-sample theory. New York: Springer.

Molloy, G. J., O'Carroll, R. E., & Ferguson, E. (2013). Conscientiousness and medication adherence: A meta-analysis. Annals of Behavioral Medicine, 47, 92–101. https://doi.org/10.1007/s12160-013-9524-4

Morris, T. P., White, I. R., & Crowther, M. J. (2019). Using simulation studies to evaluate statistical methods. Statistics in Medicine, 38, 2074–2102. https://doi.org/10.1002/sim.8086

Omelka, M., & Pauly, M. (2012). Testing equality of correlation coefficients in two populations via permutation methods. Journal of Statistical Planning and Inference, 142, 1396–1406. https://doi.org/10.1016/j.jspi.2011.12.018

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349. https://doi.org/10.1126/science.aac4716

Osburn, H., & Callender, J. (1992). A note on the sampling variance of the mean uncorrected correlation in meta-analysis and validity generalization. Journal of Applied Psychology, 77, 115–122. https://doi.org/10.1037/0021-9010.77.2.115

Rosopa, P. J., Schaffer, M. M., & Schroeder, A. N. (2013). Managing heteroscedasticity in general linear models. Psychological Methods, 18, 335–351. https://doi.org/10.1037/a0032553

Santos, S., Almeida, I., Oliveiros, B., & Castelo-Branco, M. (2016). The role of the amygdala in facial trustworthiness processing: A systematic review and meta-analyses of fMRI studies. PLoS ONE, 11(11), e0167276.
https://doi.org/10.1371/journal.pone.0167276

Schmid, F., & Schmidt, R. (2007). Multivariate extensions of Spearman's rho and related statistics. Statistics & Probability Letters, 77, 407–416. https://doi.org/10.1016/j.spl.2006.08.007

Schober, P., Boer, C., & Schwarte, L. A. (2018). Correlation coefficients: Appropriate use and interpretation. Anesthesia & Analgesia, 126, 1763–1768. https://doi.org/10.1213/ANE.0000000000002864

Schulze, R. (2004). Meta-analysis: A comparison of approaches. Boston, MA: Hogrefe Publishing.

Schwarzer, G., Carpenter, J. R., & Rücker, G. (2015). Meta-analysis with R. Cham: Springer.

Sidik, K., & Jonkman, J. N. (2005). Simple heterogeneity variance estimation for meta-analysis. Applied Statistics, 54, 367–384. https://doi.org/10.1111/j.1467-9876.2005.00489.x

Sidik, K., & Jonkman, J. N. (2006). Robust variance estimation for random effects meta-analysis. Computational Statistics & Data Analysis, 50, 3681–3701. https://doi.org/10.1016/j.csda.2005.07.019

Sirin, S. R. (2005). Socioeconomic status and academic achievement: A meta-analytic review of research. Review of Educational Research, 75, 417–453. https://doi.org/10.3102/00346543075003417

Süli, E., & Mayers, D. F. (2003). An introduction to numerical analysis. Cambridge, UK: Cambridge University Press.

Veroniki, A. A., Jackson, D., Viechtbauer, W., Bender, R., Bowden, J., Knapp, G., … Salanti, G. (2016). Methods to estimate the between-study variance and its uncertainty in meta-analysis. Research Synthesis Methods, 7, 55–79. https://doi.org/10.1002/jrsm.1164

Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36, 1–48. https://doi.org/10.18637/jss.v036.i03

Viechtbauer, W., López-López, J. A., Sánchez-Meca, J., & Marín-Martínez, F. (2015). A comparison of procedures to test for moderators in mixed-effects meta-regression models. Psychological Methods, 20, 360–374.
https://doi.org/10.1037/met0000023

Welz, T., & Pauly, M. (2020). A simulation study to compare robust tests for linear mixed-effects meta-regression. Research Synthesis Methods, 11, 331–342. https://doi.org/10.1002/jrsm.1388

White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48, 817–838. https://doi.org/10.2307/1912934

Wu, C. F. J. (1986). Jackknife, bootstrap and other resampling methods in regression analysis. Annals of Statistics, 14, 1261–1295. https://doi.org/10.1214/aos/1176350142

Received 10 September 2020; revised version received 15 February 2021

Supporting Information

The following supporting information may be found in the online edition of the article:

Appendix S1. Complete results of simulation study.