A pluralistic framework for measuring, interpreting and stratifying heterogeneity in meta-analysis

A step-by-step illustration

Update

Last update June 2025

Preface

This step-by-step tutorial is a supplement to our paper targeted at quantifying and interpreting heterogeneity in meta-analyses.

Set-up

Our illustrations use R statistical software and existing R packages, which you will first need to download and install.

If you do not have it on your machine, first install R (download). We recommend also downloading RStudio, a popular integrated development environment for coding with R, created by a company named posit (download).

After installing R, you must install several packages that contain necessary functions for performing the analyses in this tutorial. If the packages are archived in CRAN, use install.packages() to install them. For example, to install the metafor package (a common meta-analysis package), you can execute install.packages("metafor") in the console (bottom left pane of R Studio). To install packages that are not on CRAN and archived in Github repositories, execute devtools::install_github(). For example, to install orchaRd (a meta-analysis visualization package) from Github repository, execute devtools::install_github("daniel1noble/orchaRd", force = TRUE).

The package list is as follows:

# install and load necessary library
pacman::p_load(tidyverse, 
               here,
               DT,
               janitor,
               ggpubr,
               readxl, 
               metafor,
               ggplot2,
               ggsignif,
               visdat,
               cowplot,
               patchwork,
               pander,
               rotl,
               ape,
               ggstance,
               ggtree,
               RColorBrewer,
               wesanderson
               )

Custom function

We provide helper functions necessary for our illustrations. To use them, you need to source them. You also can paste the source code into the console, and hit “Enter” to let R “learn” these custom functions.

source(here("function","custom_function.R"))

Case study 1

Load data

Risely et al. (2018) quantitatively reviewed 85 observations extracted from 41 studies examining the relationship between infection status and intensity of infection and changes in body stores, refuelling rates, movement capacity, phenology and survival of migratory hosts in different taxa1.

#load data
dat_Risely <- read.csv(here("data","Risely_2018_data.csv"))

#subset observations on infection status
status <- subset(dat_Risely, infection.measure=="Infection status")
head(status)
   id rowid
1  99   153
6  39    49
7  39    50
8  39    51
13  2     6
14 81   135
                                                                           authors
1                  Altizer, SM; Hobson, KA; Davis, AK; De Roode, JC, Wassenaar, LI
6                                              Arizaga, J; Barba, E; Hernandez, MA
7                                              Arizaga, J; Barba, E; Hernandez, MA
8                                              Arizaga, J; Barba, E; Hernandez, MA
13 Asghar, M; Hasselquist, D; Hansson, B; Zehtindjiev, P; Westerdahl, H; Bensch, S
14                                                                 Bengtsson et al
                                                                                                                                   title
1  Do Healthy Monarchs Migrate Farther? Tracking Natal Origins of Parasitized vs. Uninfected Monarch Butterflies Overwintering in Mexico
6                                DO HAEMOSPORIDIANS AFFECT FUEL DEPOSITION RATE AND FUEL LOAD IN MIGRATORY BLACKCAPS SYLVIA ATRICAPILLA?
7                                DO HAEMOSPORIDIANS AFFECT FUEL DEPOSITION RATE AND FUEL LOAD IN MIGRATORY BLACKCAPS SYLVIA ATRICAPILLA?
8                                DO HAEMOSPORIDIANS AFFECT FUEL DEPOSITION RATE AND FUEL LOAD IN MIGRATORY BLACKCAPS SYLVIA ATRICAPILLA?
13                              Hidden costs of infection: Chronic malaria accelerates telomere degradation and senescence in wild birds
14                                Does influenza A virus infection affect movement behaviour during stopover in its wild reservoir host?
                      journal year
1                    PLOS ONE 2015
6                     ARDEOLA 2009
7                     ARDEOLA 2009
8                     ARDEOLA 2009
13                    SCIENCE 2015
14 ROYAL SOCIETY OPEN SCIENCE 2016
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    abstract
1                                                                                                                                                                                                                                              Long-distance migration can lower parasite prevalence if strenuous journeys remove infected animals from wild populations. We examined wild monarch butterflies (Danaus plexippus) to investigate the potential costs of the protozoan燨phryocystis elektroscirrha爋n migratory success. We collected monarchs from two wintering sites in central Mexico to compare infection status with hydrogen isotope (?2H) measurements as an indicator of latitude of origin at the start of fall migration. On average, uninfected monarchs had lower\xa0?2H values than parasitized butterflies, indicating that uninfected butterflies originated from more northerly latitudes and travelled farther distances to reach Mexico. Within the infected class, monarchs with higher quantitative spore loads originated from more southerly latitudes, indicating that heavily infected monarchs originating from farther north are less likely to reach Mexico. We ruled out the alternative explanation that lower latitudes give rise to more infected monarchs prior to the onset of migration using citizen science data to examine regional differences in parasite prevalence during the summer breeding season. We also found a positive association between monarch wing area and estimated distance flown. Collectively, these results emphasize that seasonal migrations can help lower infection levels in wild animal populations. Our findings, combined with recent declines in the numbers of migratory monarchs wintering in Mexico and observations of sedentary (winter breeding) monarch populations in the southern U.S., suggest that shifts from migratory to sedentary behavior will likely lead to greater infection prevalence for North American monarchs.
6  Do haemosporidians affect fuel deposition rate and fuel load in migratory blackcaps, Sylvia atricapilla? Aims: Fuel deposition rate is one of the main parameters determining bird migration strategies. Accordingly, factors compromising fuel deposition rate, Such as parasite infections, might have relevant effects not only on migration, but also on other life history events that depend on migration Success, Such as breeding. We analysed the effect of haemosporidians on fuel load and fuel deposition rate in a population of migratory blackcaps (Sylvia atricapilla) during stopover in northern Iberia. Locality: Loza lagoon, northern Iberia (42 degrees 50' N, 01 degrees 43' W, 400 m a.s.l.). Methods: From blood samples of recaptured blackcaps we determined haemospiridian content by amplification of 479 bp of the parasite's cytochrome b gene. Independent on body size, of the sampled birds, 35% were infected by Haemoproteus-Plasmodium. Results: Mass deposition rate, fat score and body mass showed similar values in non-infected as in infected blackcaps. No differences in age or body size proportions were detected between infected and non-infected birds. Conclusions: These results may be explained by infected birds with high virulence (compromising fuel accumulation) being unable to migrate south from their breeding areas in central and northern Europe. In contrast, more resistant birds may be able to tolerate the parasitaemia and gain fuel normally. Another possibility is that only birds with a low intensity of parasite infection were captured (presumably birds able to overcome the parasitaemia successfully), since these birds may be able to accumulate fuel at a similar rate to non-infected birds. Furthermore, highly infected birds may have a lower likelihood of capture than non-infected birds, because birds lose mobility with a high intensity of infection and under these circumstances they are less likely to be captured.
7  Do haemosporidians affect fuel deposition rate and fuel load in migratory blackcaps, Sylvia atricapilla? Aims: Fuel deposition rate is one of the main parameters determining bird migration strategies. Accordingly, factors compromising fuel deposition rate, Such as parasite infections, might have relevant effects not only on migration, but also on other life history events that depend on migration Success, Such as breeding. We analysed the effect of haemosporidians on fuel load and fuel deposition rate in a population of migratory blackcaps (Sylvia atricapilla) during stopover in northern Iberia. Locality: Loza lagoon, northern Iberia (42 degrees 50' N, 01 degrees 43' W, 400 m a.s.l.). Methods: From blood samples of recaptured blackcaps we determined haemospiridian content by amplification of 479 bp of the parasite's cytochrome b gene. Independent on body size, of the sampled birds, 35% were infected by Haemoproteus-Plasmodium. Results: Mass deposition rate, fat score and body mass showed similar values in non-infected as in infected blackcaps. No differences in age or body size proportions were detected between infected and non-infected birds. Conclusions: These results may be explained by infected birds with high virulence (compromising fuel accumulation) being unable to migrate south from their breeding areas in central and northern Europe. In contrast, more resistant birds may be able to tolerate the parasitaemia and gain fuel normally. Another possibility is that only birds with a low intensity of parasite infection were captured (presumably birds able to overcome the parasitaemia successfully), since these birds may be able to accumulate fuel at a similar rate to non-infected birds. Furthermore, highly infected birds may have a lower likelihood of capture than non-infected birds, because birds lose mobility with a high intensity of infection and under these circumstances they are less likely to be captured.
8  Do haemosporidians affect fuel deposition rate and fuel load in migratory blackcaps, Sylvia atricapilla? Aims: Fuel deposition rate is one of the main parameters determining bird migration strategies. Accordingly, factors compromising fuel deposition rate, Such as parasite infections, might have relevant effects not only on migration, but also on other life history events that depend on migration Success, Such as breeding. We analysed the effect of haemosporidians on fuel load and fuel deposition rate in a population of migratory blackcaps (Sylvia atricapilla) during stopover in northern Iberia. Locality: Loza lagoon, northern Iberia (42 degrees 50' N, 01 degrees 43' W, 400 m a.s.l.). Methods: From blood samples of recaptured blackcaps we determined haemospiridian content by amplification of 479 bp of the parasite's cytochrome b gene. Independent on body size, of the sampled birds, 35% were infected by Haemoproteus-Plasmodium. Results: Mass deposition rate, fat score and body mass showed similar values in non-infected as in infected blackcaps. No differences in age or body size proportions were detected between infected and non-infected birds. Conclusions: These results may be explained by infected birds with high virulence (compromising fuel accumulation) being unable to migrate south from their breeding areas in central and northern Europe. In contrast, more resistant birds may be able to tolerate the parasitaemia and gain fuel normally. Another possibility is that only birds with a low intensity of parasite infection were captured (presumably birds able to overcome the parasitaemia successfully), since these birds may be able to accumulate fuel at a similar rate to non-infected birds. Furthermore, highly infected birds may have a lower likelihood of capture than non-infected birds, because birds lose mobility with a high intensity of infection and under these circumstances they are less likely to be captured.
13                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Recovery from infection is not always complete, and mild chronic infection may persist. Although the direct costs of such infections are apparently small, the potential for any long-term effects on Darwinian fitness is poorly understood. In a wild population of great reed warblers, we found that low-level chronic malaria infection reduced life span as well as the lifetime number and quality of offspring. These delayed fitness effects of malaria appear to be mediated by telomere degradation, a result supported by controlled infection experiments on birds in captivity. The results of this study imply that chronic infection may be causing a series of small adverse effects that accumulate and eventually impair phenotypic quality and Darwinian fitness.
14                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 The last decade has seen a surge in research on avian influenza A viruses (IAVs), in part fuelled by the emergence, spread and potential zoonotic importance of highly pathogenic virus subtypes. The mallard (Anas platyrhynchos) is the most numerous and widespread dabbling duck in the world, and one of the most important natural hosts for studying IAV transmission dynamics. In order to predict the likelihood of IAV transmission between individual ducks and to other hosts, as well as between geographical regions, it is important to understand how IAV infection affects the host. In this study, we analysed the movements of 40 mallards equipped with GPS transmitters and three-dimensional accelerometers, of which 20 were naturally infected with low pathogenic avian influenza virus (LPAIV), at a major stopover site in the Northwest European flyway. Movements differed substantially between day and night, as well as between mallards returning to the capture site and those feeding in natural habitats. However, movement patterns did not differ between LPAIV infected and uninfected birds. Hence, LPAIV infection probably does not affect mallard movements during stopover, with high possibility of virus spread along the migration route as a consequence.
   type       trait            measure exp.methods            species
1     O    Movement Distance travelled          na   Monarch butterly
6     O  Refuelling        Mass change          na           Blackcap
7     O Body stores          Fat score          na           Blackcap
8     O Body stores          Body mass          na           Blackcap
13    O    Survival           Lifespan          na Great reed warbler
14    O    Movement    Local movements          na            Mallard
               species_latin   taxa         Order         Family
1           Danaus plexippus Insect   Lepidoptera    Nymphalidae
6         Sylvia atricapilla   Bird Passeriformes      Sylviidae
7         Sylvia atricapilla   Bird Passeriformes      Sylviidae
8         Sylvia atricapilla   Bird Passeriformes      Sylviidae
13 Acrocephalus arundinaceus   Bird Passeriformes Acrocephalidae
14        Anas platyrhynchos   Bird  Anseriformes       Anatidae
                        strain parasite.taxa life.history.measured setting
1  Ophryocystis elektroscirrha      Protozoa          Non-breeding   Field
6               Haemoparasites      Protozoa             Migration   Field
7               Haemoparasites      Protozoa             Migration   Field
8               Haemoparasites      Protozoa             Migration   Field
13              Haemoparasites      Protozoa              Breeding   Field
14                       LPAIV         Virus             Migration   Field
   subset Infection.type migratory.leg2  ss ss_infected ss_healthy
1     all         Single              Y 175         100         75
6     all         Single              Y  53          18         35
7     all         Single              Y  53          18         35
8     all         Single              Y  53          18         35
13    all         Single              Y  75          32         43
14    all         Single              N  87          37         50
           slope effect    stat presented.p  adj.p sig0.05 infection.measure
1  Not presented  12.78 Wald X2     <0.0001 0.0001       1  Infection status
6  Not presented    0.1       t         0.9 0.9000       0  Infection status
7  Not presented   0.99       t        0.33 0.3300       0  Infection status
8  Not presented   0.01       F        0.93 0.9300       0  Infection status
13          0.84   2.82       t       0.006 0.0060       1  Infection status
14 Not presented  0.538       t       >0.05 0.5000       0  Infection status
       z var.z     g var.g var.g1                              function.
1  -0.28  0.01 -0.56  0.02   0.02 pes(0.0001,100,75) or chies(12.78,175)
6   0.02  0.02  0.04  0.08   0.08                         pes(0.9,18,35)
7  -0.13  0.02 -0.28  0.08   0.08                        pes(0.33,18,35)
8  -0.01  0.02 -0.03  0.08   0.08                        pes(0.93,18,35)
13 -0.32  0.01 -0.65  0.06   0.06                        tes(2.82,32,43)
14 -0.06  0.01 -0.12  0.05   0.05                       tes(0.538,37,50)
   Effect.direction
1          Negative
6          Positive
7          Negative
8          Negative
13         Negative
14         Negative

We also construct the phylogenetic tree to allow for decomposing heterogeneity at both species and phylogeny levels.

#check species we have
#length(unique(status$species_latin)) #25 unique species names, if no misspelling

#find Open Tree Taxonomy (OTT) IDs for each species
taxa <- tnrs_match_names(names = unique(status$species_latin))

#rough check
#length(taxa$unique_name) # 25 unique species names, which is aligned with the species names in the dataset
#tabyl(taxa,approximate_match) # 0 approximate match

#check whether all otts occur in the synthetic tree
ott_in_tree <- ott_id(taxa)[is_in_tree(ott_id(taxa))]  # all good
#length(ott_in_tree) # 25

#make phylo tree
tree <- tol_induced_subtree(ott_ids = ott_id(taxa))
tree$tip.label <- strip_ott_ids(tree$tip.label, remove_underscores = TRUE)

# test whether a tree is binary
#is.binary(tree)  

#decapitalise species names to match with the search string names in taxa
status <- status %>% mutate(search_string = tolower(species_latin))  

#align data
status <- left_join(status, dplyr::select(taxa, search_string, unique_name, ott_id), by = "search_string")  

#create the variables of spp and phylo
status <- status %>% mutate(spp = search_string, phylo = unique_name)


#estimate branch lengths using APE
tree2 <- compute.brlen(tree, method = "Grafen", power = 1)
##create correlation matrix
phylo_corr <- vcv(tree2, corr=TRUE)
plot(tree2, cex = .8, label.offset = .1, no.margin = TRUE)

Fit data

We fit a phylogenetic multilevel model as illustrated in the main text to the data. Doing so, we can stratify the heterogeneity at different strata: within-study level (obs_id), between-study level (study_id), species level (spp), and phylogeny level (phylo).

\[ ES_{[i]} = \mu + u_{species[k]} + u_{phylogeny[k]} + \mu_{between[j]} + \mu_{within[j]} + e_{[i]} \]

#unify the name of key variables
dat <- status
dat$obs_id <- 1:nrow(dat)
dat$study_id <- dat$id
# run the model
mod_status <- rma.mv(g, var.g, random = list(~ 1 | obs_id, ~1 | study_id, ~ 1 | spp, ~ 1 | phylo), R = list(phylo = phylo_corr), data = dat)

summary(mod_status)

Multivariate Meta-Analysis Model (k = 66; method: REML)

  logLik  Deviance       AIC       BIC      AICc   
-30.9191   61.8382   71.8382   82.7102   72.8552   

Variance Components:

            estim    sqrt  nlvls  fixed    factor    R 
sigma^2.1  0.0072  0.0846     66     no    obs_id   no 
sigma^2.2  0.0000  0.0000     35     no  study_id   no 
sigma^2.3  0.0006  0.0243     23     no       spp   no 
sigma^2.4  0.0296  0.1720     23     no     phylo  yes 

Test for Heterogeneity:
Q(df = 65) = 356.1658, p-val < .0001

Model Results:

estimate      se     zval    pval    ci.lb    ci.ub    
 -0.2134  0.0917  -2.3274  0.0199  -0.3932  -0.0337  * 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Heterogeneity quantification and stratification

We quantify and stratify the heterogeneity using the formulas provided in the main text. All functions used are integrated into orchard package2.

Typical sampling variance (which captures the statistical noise of the data, but is rarely reported in the current meta-analytic practice):

sigma2_v(mod_status)
[1] 0.00131211

Unstandardised heterogeneity metrics \(\sigma^2\):

Variance-standardised heterogeneity metrics \(I^2\):

t3 <- data.frame(stratum = c("total", mod_status$s.names),
           I2 = i2_ml(mod_status) / 100) %>% dfround(1)

t3$stratum <- c("total", "within-study", "between-study", "species", "phylogeny")
rownames(t3) <- NULL

t3 %>% DT::datatable()

Mean-standardised heterogeneity metrics \(CVH2\):

t4 <- data.frame(stratum = c("total", mod_status$s.names),
           CVH2 = cvh2_ml(mod_status)) %>% dfround(2)

t4$stratum <- c("total", "within-study", "between-study", "species", "phylogeny")
rownames(t4) <- NULL

t4 %>% DT::datatable()

Mean- and variance-standardised heterogeneity metrics \(M2\):

t5 <- data.frame(stratum = c("total", mod_status$s.names),
           M2 = m2_ml(mod_status)) %>% dfround(2)

t5$stratum <- c("total", "within-study", "between-study", "species", "phylogeny")
rownames(t5) <- NULL

t5 %>% DT::datatable()

Let’s visualize all heterogeneity measures:

# Visualize heterogeneity
## make dataframe
h_status <- data.frame(sigma2 = t2$sigma2,
                       I2 = i2_ml(mod_status),
                       CVH = cvh2_ml(mod_status),
                       M2 = m2_ml(mod_status))
rownames(h_status) <- c("Total", "obs_id", "study_id", "spp", "phylo")
h_status$levels <- rownames(h_status)
h_status$levels <- dplyr::recode(h_status$levels, "Total" = "Total",  "obs_id" = "Within",  "study_id" = "Between", "spp" = "Spp", "phylo" = "Phylo")
h_status$levels <- as.factor(h_status$levels)
h_status$levels <- factor(h_status$levels, levels = c("Total", "Phylo", "Spp", "Between", "Within"))

## plot
## sigma
p.sigma <- ggplot(h_status, aes(levels, sigma2)) +
  geom_col(alpha = 1, color = wes_palette('GrandBudapest1', 4, type = 'discrete')[1], fill = wes_palette('GrandBudapest1', 4, type = 'discrete')[1]) +
  labs(y = expression("Variance"), x = "Strata" , title = "Unstandardised heterogeneity metrics") + 
  theme_bw() +
  theme(legend.background = element_blank(),
        axis.text = element_text(size = 12, color = "black"),
        axis.title = element_text(size = 12, color = "black")
        ) 

# I2
p.I2 <- ggplot(h_status, aes(levels, I2/100)) +
  geom_col(alpha = 1, color = wes_palette('GrandBudapest1', 4, type = 'discrete')[2], fill = wes_palette('GrandBudapest1', 4, type = 'discrete')[2]) +
  labs(y = expression(paste(italic(I)[]^2)), x = "Strata" , title = "Source of heterogeneity") + 
  theme_bw() +
  theme(legend.background = element_blank(),
        axis.text = element_text(size = 12, color = "black"),
        axis.title = element_text(size = 12, color = "black")
        ) 

# CVH
p.CV <- ggplot(h_status, aes(levels, CVH)) +
  geom_col(alpha = 1, color = wes_palette('GrandBudapest1', 4, type = 'discrete')[3], fill = wes_palette('GrandBudapest1', 4, type = 'discrete')[3]) +
  labs(y = expression(paste(italic(CVH2)[])), x = "Strata" , title = "Magnitude of heterogeneity") + 
  scale_y_continuous(labels = scales::number_format(accuracy = 0.01)) + 
  theme_bw() +
  theme(legend.background = element_blank(),
        axis.text = element_text(size = 12, color = "black"),
        axis.title = element_text(size = 12, color = "black")
        )   
  

# M
p.M <- ggplot(h_status, aes(x = levels, y = M2)) +
  geom_col(alpha = 1, color = wes_palette('GrandBudapest1', 4, type = 'discrete')[4], fill = wes_palette('GrandBudapest1', 4, type = 'discrete')[4]) +
  labs(y = expression(paste(italic(M2)[])), x = "Strata" , title = "Magnitude of heterogeneity") + 
  scale_y_continuous(labels = scales::number_format(accuracy = 0.01)) + 
  theme_bw() +
  theme(legend.background = element_blank(),
        axis.text = element_text(size = 12, color = "black"),
        axis.title = element_text(size = 12, color = "black")
        )  

#png(filename = "Fig. 6.png", width = 9, height = 8, units = "in", type = "windows", res = 400)
p.sigma + p.I2 + p.CV + p.M + plot_layout(ncol =2, nrow = 2) + plot_annotation(tag_levels = "A") & theme(plot.tag = element_text(face = "bold"))

Heterogeneity interpretation

The best way to interpret heterogeneity is to place the interpretation of heterogeneity in the specific context of the research topic. When meta-analysts lack sufficient domain knowledge to implement this context-specific interpretation strategy, empirically derived benchmarks, which compare the heterogeneity of interest with published meta-analyses can provide a starting point for interpreting heterogeneity.

We provide an R helper function (het_interpret()) that can help determine the percentile range in which the heterogeneity estimates for a particular meta-analysis fall, based on the heterogeneity distribution of the published meta-analysis.

The main parameters of het_interpret() are as follows:

observed_value Specify the observed value of heterogeneity.

het_type = "V_bar" Specify the heterogeneity measures: "I2", "CVH", "M", raw heterogeneity ("sigma2"), and typical sampling variance ("V_bar").

es_type Specify the effect size measures: standardised mean difference like Cohen’s d ("SMD"), log response ratio ("lnRR"), Fisher’s r-to-z transformed correlation coefficient ("Zr"), 2-by-2 table denotes often dichotomous (“binary”) effect size measures like log odds ratio, and uncommon measures ("uncommon") represent less frequently used effect size measures like mean difference.

data Specify the dataset containing empirical benchmarks (i.e., the dataset het.benchmark),

First, let’s upload the benchmark data (het.benchmark), which are publicly available in the GitHub repository:

Then, we calculate the percentile range for each type of heterogeneity measure.

het_interpret(observed_value = 0.9, het_type = “CVH”, es_type = “SMD”, data = het.benchmark)

Percentile range of typical sampling variance:

$declaration
[1] "We encourage interpreting heterogeneity in context. These empirical benchmarks are intended to contextualize heterogeneity magnitude relative to published meta-analyses, not to replace domain-specific interpretation."

$interpretation
# A tibble: 1 x 4
  observed_value het_type es_type percentile_range
           <dbl> <chr>    <chr>   <chr>           
1        0.00131 V_bar    Zr      0%-5% percentile

Unstandardised heterogeneity metrics \(\sigma^2\):

$declaration
[1] "We encourage interpreting heterogeneity in context. These empirical benchmarks are intended to contextualize heterogeneity magnitude relative to published meta-analyses, not to replace domain-specific interpretation."

$interpretation
# A tibble: 1 x 4
  observed_value het_type es_type percentile_range  
           <dbl> <chr>    <chr>   <chr>             
1         0.0373 sigma2   Zr      10%-15% percentile

Variance-standardised heterogeneity metrics \(I^2\):

$declaration
[1] "We encourage interpreting heterogeneity in context. These empirical benchmarks are intended to contextualize heterogeneity magnitude relative to published meta-analyses, not to replace domain-specific interpretation."

$interpretation
# A tibble: 1 x 4
  observed_value het_type es_type percentile_range  
           <dbl> <chr>    <chr>   <chr>             
1          0.966 I2       Zr      80%-85% percentile

Mean-standardised heterogeneity metrics \(CVH2\):

$declaration
[1] "We encourage interpreting heterogeneity in context. These empirical benchmarks are intended to contextualize heterogeneity magnitude relative to published meta-analyses, not to replace domain-specific interpretation."

$interpretation
# A tibble: 1 x 4
  observed_value het_type es_type percentile_range  
           <dbl> <chr>    <chr>   <chr>             
1          0.820 CVH      Zr      25%-30% percentile

Mean- and variance-standardised heterogeneity metrics \(M2\):

$declaration
[1] "We encourage interpreting heterogeneity in context. These empirical benchmarks are intended to contextualize heterogeneity magnitude relative to published meta-analyses, not to replace domain-specific interpretation."

$interpretation
# A tibble: 1 x 4
  observed_value het_type es_type percentile_range  
           <dbl> <chr>    <chr>   <chr>             
1          0.450 M        Zr      25%-30% percentile

Case study 2

Load data

Yang et al. (2024) conducted a meta-analysis on an evidence base including 437 effect sizes extracted from 127 experiments investigating the impact of artificial light at night on the melatonin suppression of wildlife3.

#load data
dat_Yang <- read.csv(here("data","Yang_2024_data.csv"))

#effect size computation - use log response ratio as the effect size measure
dat_Yang <- escalc(measure = "ROM", m1i = Exp_Mean, m2i = Con_Mean, sd1i = Exp_SD, sd2i = Con_SD, n1i = Exp_N, n2i = Con_N, data = dat_Yang)

Fit data

We fit a three-level multilevel model to the data. Doing so, we can stratify the heterogeneity at within-study level (obs_id), and between-study level (study_id).

#unify the name of key variables
dat <- dat_Yang
# add row no.
dat$obs_id <- 1:nrow(dat)
dat$study_id <- dat$StudyID
# run the model
mod_alan <- rma.mv(yi, vi, random = list(~ 1 | obs_id, ~1 | study_id), data = dat)

summary(mod_alan)

Multivariate Meta-Analysis Model (k = 372; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-489.9875   979.9750   985.9750   997.7236   986.0404   

Variance Components:

            estim    sqrt  nlvls  fixed    factor 
sigma^2.1  0.5677  0.7534    372     no    obs_id 
sigma^2.2  0.1268  0.3562     35     no  study_id 

Test for Heterogeneity:
Q(df = 371) = 4916.5818, p-val < .0001

Model Results:

estimate      se     zval    pval    ci.lb    ci.ub      
 -0.4815  0.0865  -5.5686  <.0001  -0.6510  -0.3120  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Heterogeneity quantification and stratification

We quantify and stratify the heterogeneity using the formulas provided in the main text. All functions used are integrated into orchard package2.

Typical sampling variance (which captures the statistical noise of the data, but is rarely reported in the current meta-analytic practice):

sigma2_v(mod_alan)
[1] 0.03674805

Unstandardised heterogeneity metrics \(\sigma^2\):

Variance-standardised heterogeneity metrics \(I^2\):

t3 <- data.frame(stratum = c("total", mod_alan$s.names),
           I2 = i2_ml(mod_alan) / 100) %>% dfround(1)

t3$stratum <- c("total", "within-study", "between-study")
rownames(t3) <- NULL

t3 %>% DT::datatable()

Mean-standardised heterogeneity metrics \(CVH2\):

t4 <- data.frame(stratum = c("total", mod_alan$s.names),
           CVH2 = cvh2_ml(mod_alan)) %>% dfround(2)

t4$stratum <- c("total", "within-study", "between-study")
rownames(t4) <- NULL

t4 %>% DT::datatable()

Mean- and variance-standardised heterogeneity metrics \(M2\):

t5 <- data.frame(stratum = c("total", mod_alan$s.names),
           M2 = m2_ml(mod_alan)) %>% dfround(2)

t5$stratum <- c("total", "within-study", "between-study")
rownames(t5) <- NULL

t5 %>% DT::datatable()

Let’s visualize all heterogeneity measures:

# Visualize heterogeneity
## make dataframe
h_status <- data.frame(sigma2 = t2$sigma2,
                       I2 = i2_ml(mod_alan),
                       CVH = cvh2_ml(mod_alan),
                       M2 = m2_ml(mod_alan))
rownames(h_status) <- c("Total", "obs_id", "study_id")
h_status$levels <- rownames(h_status)
h_status$levels <- dplyr::recode(h_status$levels, "Total" = "Total",  "obs_id" = "Within",  "study_id" = "Between")
h_status$levels <- as.factor(h_status$levels)
h_status$levels <- factor(h_status$levels, levels = c("Total", "Between", "Within"))

## plot
## sigma
p.sigma <- ggplot(h_status, aes(levels, sigma2)) +
  geom_col(alpha = 1, color = wes_palette('GrandBudapest1', 4, type = 'discrete')[1], fill = wes_palette('GrandBudapest1', 4, type = 'discrete')[1]) +
  labs(y = expression("Variance"), x = "Strata" , title = "Unstandardised heterogeneity metrics") + 
  theme_bw() +
  theme(legend.background = element_blank(),
        axis.text = element_text(size = 12, color = "black"),
        axis.title = element_text(size = 12, color = "black")
        ) 

# I2
p.I2 <- ggplot(h_status, aes(levels, I2/100)) +
  geom_col(alpha = 1, color = wes_palette('GrandBudapest1', 4, type = 'discrete')[2], fill = wes_palette('GrandBudapest1', 4, type = 'discrete')[2]) +
  labs(y = expression(paste(italic(I)[]^2)), x = "Strata" , title = "Source of heterogeneity") + 
  theme_bw() +
  theme(legend.background = element_blank(),
        axis.text = element_text(size = 12, color = "black"),
        axis.title = element_text(size = 12, color = "black")
        ) 

# CVH
p.CV <- ggplot(h_status, aes(levels, CVH)) +
  geom_col(alpha = 1, color = wes_palette('GrandBudapest1', 4, type = 'discrete')[3], fill = wes_palette('GrandBudapest1', 4, type = 'discrete')[3]) +
  labs(y = expression(paste(italic(CVH2)[])), x = "Strata" , title = "Magnitude of heterogeneity") + 
  scale_y_continuous(labels = scales::number_format(accuracy = 0.01)) + 
  theme_bw() +
  theme(legend.background = element_blank(),
        axis.text = element_text(size = 12, color = "black"),
        axis.title = element_text(size = 12, color = "black")
        )   
  

# M
p.M <- ggplot(h_status, aes(x = levels, y = M2)) +
  geom_col(alpha = 1, color = wes_palette('GrandBudapest1', 4, type = 'discrete')[4], fill = wes_palette('GrandBudapest1', 4, type = 'discrete')[4]) +
  labs(y = expression(paste(italic(M2)[])), x = "Strata" , title = "Magnitude of heterogeneity") + 
  scale_y_continuous(labels = scales::number_format(accuracy = 0.01)) + 
  theme_bw() +
  theme(legend.background = element_blank(),
        axis.text = element_text(size = 12, color = "black"),
        axis.title = element_text(size = 12, color = "black")
        )  



p.sigma + p.I2 + p.CV + p.M + plot_layout(ncol =2, nrow = 2) + plot_annotation(tag_levels = "A") & theme(plot.tag = element_text(face = "bold"))

Heterogeneity interpretation

We also illustrate the use of empirical benchmarks using the data from Yang et al. (2024)3.

Upload the benchmark data (het.benchmark):

Calculate the percentile range for each type of heterogeneity measure.

Percentile range of typical sampling variance:

$declaration
[1] "We encourage interpreting heterogeneity in context. These empirical benchmarks are intended to contextualize heterogeneity magnitude relative to published meta-analyses, not to replace domain-specific interpretation."

$interpretation
# A tibble: 1 x 4
  observed_value het_type es_type percentile_range  
           <dbl> <chr>    <chr>   <chr>             
1         0.0367 V_bar    lnRR    75%-80% percentile

Unstandardised heterogeneity metrics \(\sigma^2\):

$declaration
[1] "We encourage interpreting heterogeneity in context. These empirical benchmarks are intended to contextualize heterogeneity magnitude relative to published meta-analyses, not to replace domain-specific interpretation."

$interpretation
# A tibble: 1 x 4
  observed_value het_type es_type percentile_range  
           <dbl> <chr>    <chr>   <chr>             
1          0.695 sigma2   lnRR    80%-85% percentile

Variance-standardised heterogeneity metrics \(I^2\):

$declaration
[1] "We encourage interpreting heterogeneity in context. These empirical benchmarks are intended to contextualize heterogeneity magnitude relative to published meta-analyses, not to replace domain-specific interpretation."

$interpretation
# A tibble: 1 x 4
  observed_value het_type es_type percentile_range  
           <dbl> <chr>    <chr>   <chr>             
1          0.950 I2       lnRR    45%-50% percentile

Mean-standardised heterogeneity metrics \(CVH2\):

$declaration
[1] "We encourage interpreting heterogeneity in context. These empirical benchmarks are intended to contextualize heterogeneity magnitude relative to published meta-analyses, not to replace domain-specific interpretation."

$interpretation
# A tibble: 1 x 4
  observed_value het_type es_type percentile_range  
           <dbl> <chr>    <chr>   <chr>             
1           3.00 CVH      lnRR    45%-50% percentile

Mean- and variance-standardised heterogeneity metrics \(M2\):

$declaration
[1] "We encourage interpreting heterogeneity in context. These empirical benchmarks are intended to contextualize heterogeneity magnitude relative to published meta-analyses, not to replace domain-specific interpretation."

$interpretation
# A tibble: 1 x 4
  observed_value het_type es_type percentile_range  
           <dbl> <chr>    <chr>   <chr>             
1          0.750 M        lnRR    45%-50% percentile

Additional illustration

In addition, we show how the traditional heterogeneity measure \(I^2\) can distort heterogeneity estimates for some specific effect size measures.

Some meta-analysts may be interested in the difference in variability between two groups. The (log-transformed) ratio of the coefficients of variation of the two groups (also known as the coefficient of variation ratio) can be a useful metric for comparing the variability of the two groups. Due to the properties of the coefficient of variation ratio, it usually has a large sampling variance. This inherently large sampling variance leads to smaller \(I^2\) values, regardless of whether the true heterogeneity is large or not.

Let’s use the example of Yang et al. (2024)3 to illustrate it.

Load data

#load data
dat_Yang <- read.csv(here("data","Yang_2024_data.csv"))

# calculate the coefficient of variation ratio for each observation
dat_Yang <- escalc(measure = "CVR", m1i = Exp_Mean, m2i = Con_Mean, sd1i = Exp_SD, sd2i = Con_SD, n1i = Exp_N, n2i = Con_N, data = dat_Yang)

Fit data

Fit a three-level multilevel model and stratify the heterogeneity at within-study level (obs_id), and between-study level (study_id).

#unify the name of key variables
dat <- dat_Yang
# add row no.
dat$obs_id <- 1:nrow(dat)
dat$study_id <- dat$StudyID
# run the model
mod_alan <- rma.mv(yi, vi, random = list(~ 1 | obs_id, ~1 | study_id), data = dat)

summary(mod_alan)

Multivariate Meta-Analysis Model (k = 372; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-374.6991   749.3982   755.3982   767.1468   755.4636   

Variance Components:

            estim    sqrt  nlvls  fixed    factor 
sigma^2.1  0.1035  0.3217    372     no    obs_id 
sigma^2.2  0.0575  0.2397     35     no  study_id 

Test for Heterogeneity:
Q(df = 371) = 544.9818, p-val < .0001

Model Results:

estimate      se    zval    pval   ci.lb   ci.ub      
  0.2804  0.0605  4.6340  <.0001  0.1618  0.3990  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Heterogeneity quantification and stratification

Calculate heterogeneity for the meta-analysis using the coefficient of variation ratio. As you can see, the \(I^2\) values become much smaller, while \(\sigma^2\), \(CVH2\), and \(M2\) are still large.

# Visualize heterogeneity
## make dataframe
h_status <- data.frame(sigma2 = t2$sigma2,
                       I2 = i2_ml(mod_alan) / 100,
                       CVH = cvh2_ml(mod_alan),
                       M2 = m2_ml(mod_alan))
rownames(h_status) <- c("Total", "obs_id", "study_id")
h_status$levels <- rownames(h_status)
h_status$levels <- dplyr::recode(h_status$levels, "Total" = "Total",  "obs_id" = "Within",  "study_id" = "Between")
h_status$levels <- as.factor(h_status$levels)
h_status$levels <- factor(h_status$levels, levels = c("Total", "Between", "Within"))

## plot
## sigma
p.sigma <- ggplot(h_status, aes(levels, sigma2)) +
  geom_col(alpha = 1, color = wes_palette('GrandBudapest1', 4, type = 'discrete')[1], fill = wes_palette('GrandBudapest1', 4, type = 'discrete')[1]) +
  labs(y = expression("Variance"), x = "Strata" , title = "Unstandardised heterogeneity metrics") + 
  theme_bw() +
  theme(legend.background = element_blank(),
        axis.text = element_text(size = 12, color = "black"),
        axis.title = element_text(size = 12, color = "black")
        ) 

# I2
p.I2 <- ggplot(h_status, aes(levels, I2/100)) +
  geom_col(alpha = 1, color = wes_palette('GrandBudapest1', 4, type = 'discrete')[2], fill = wes_palette('GrandBudapest1', 4, type = 'discrete')[2]) +
  labs(y = expression(paste(italic(I)[]^2)), x = "Strata" , title = "Source of heterogeneity") + 
  theme_bw() +
  theme(legend.background = element_blank(),
        axis.text = element_text(size = 12, color = "black"),
        axis.title = element_text(size = 12, color = "black")
        ) 

# CVH
p.CV <- ggplot(h_status, aes(levels, CVH)) +
  geom_col(alpha = 1, color = wes_palette('GrandBudapest1', 4, type = 'discrete')[3], fill = wes_palette('GrandBudapest1', 4, type = 'discrete')[3]) +
  labs(y = expression(paste(italic(CVH2)[])), x = "Strata" , title = "Magnitude of heterogeneity") + 
  scale_y_continuous(labels = scales::number_format(accuracy = 0.01)) + 
  theme_bw() +
  theme(legend.background = element_blank(),
        axis.text = element_text(size = 12, color = "black"),
        axis.title = element_text(size = 12, color = "black")
        )   
  

# M
p.M <- ggplot(h_status, aes(x = levels, y = M2)) +
  geom_col(alpha = 1, color = wes_palette('GrandBudapest1', 4, type = 'discrete')[4], fill = wes_palette('GrandBudapest1', 4, type = 'discrete')[4]) +
  labs(y = expression(paste(italic(M2)[])), x = "Strata" , title = "Magnitude of heterogeneity") + 
  scale_y_continuous(labels = scales::number_format(accuracy = 0.01)) + 
  theme_bw() +
  theme(legend.background = element_blank(),
        axis.text = element_text(size = 12, color = "black"),
        axis.title = element_text(size = 12, color = "black")
        )  



p.sigma + p.I2 + p.CV + p.M + plot_layout(ncol =2, nrow = 2) + plot_annotation(tag_levels = "A") & theme(plot.tag = element_text(face = "bold"))

License

This documented is licensed under the following license: CC Attribution-Noncommercial-Share Alike 4.0 International.

Software and package versions

                  package loadedversion
ape                   ape         5.8-1
cowplot           cowplot         1.1.1
dplyr               dplyr        1.0.10
DT                     DT          0.19
forcats           forcats         0.5.2
ggplot2           ggplot2         3.4.4
ggpubr             ggpubr         0.4.0
ggsignif         ggsignif         0.6.3
ggstance         ggstance         0.3.5
ggtree             ggtree     3.7.1.001
here                 here         1.0.1
janitor           janitor         2.1.0
knitr               knitr          1.37
Matrix             Matrix         1.5-3
metadat           metadat         1.2-0
metafor           metafor        4.7-53
numDeriv         numDeriv    2016.8-1.1
pander             pander         0.6.4
patchwork       patchwork         1.1.1
purrr               purrr         0.3.4
RColorBrewer RColorBrewer         1.1-3
readr               readr         2.1.2
readxl             readxl         1.3.1
rmdformats     rmdformats         1.0.3
rotl                 rotl        3.0.11
stringr           stringr         1.5.0
tibble             tibble         3.1.8
tidyr               tidyr         1.2.1
tidyverse       tidyverse         1.3.1
visdat             visdat         0.5.3
wesanderson   wesanderson         0.3.6

References

1.
Risely, A., Klaassen, M. & Hoye, B. J. Migratory animals feel the cost of getting sick: A meta-analysis across species. Journal of Animal Ecology 87, 301–314 (2018).
2.
Nakagawa, S. et al. The orchard plot: Cultivating a forest plot for use in ecology, evolution, and beyond. Research Synthesis Methods 12, 4–12 (2021).
3.
Yang, Y. et al. Species sensitivities to artificial light at night: A phylogenetically controlled multilevel meta-analysis on melatonin suppression. Ecology Letters 27, e14387 (2024).