A tutorial of advanced methods for the meta-analysis of animal models

Illustration of multilevel model, meta-regression, multivariate model and robust variance estimation

Citation

If this tutorial is useful to your meta-analysis, please cite the following paper:

Yefeng Yang, Malcolm Macleod, Jinming Pan, Malgorzata Lagisz, Shinichi Nakagawa, 2022. The current practices of meta-analyses using animal models, and underappreciated opportunities using advanced methods: multilevel models and robust variance estimation. Neuroscience & Biobehavioral Reviews.

Code written by:

Dr. Yefeng Yang PhD

Institutions:

School of Biological, Earth and Environmental Sciences, University of New South Wales, Sydney, Australia;

Department of Biosystems Engineering, Zhejiang University, Hangzhou 310058, China;

Jockey Club College of Veterinary Medicine and Life Sciences, City University of Hong Kong, Hong Kong, China

Email:

Code cross-checked by:

Dr. Malgorzata Lagisz PhD

Institutions:

Evolution & Ecology Research Centre, UNSW Data Science Hub;

School of Biological, Earth and Environmental Sciences, University of New South Wales, Sydney, Australia

Email:

Professor Shinichi Nakagawa PhD

Elected Member of Society for Research Synthesis Methodology

Fellow of the Royal Society of New South Wales (FRSN)

Institutions:

Evolution & Ecology Research Centre, UNSW Data Science Hub;

School of Biological, Earth and Environmental Sciences, University of New South Wales, Sydney, Australia

Email:

Credit

We acknowledge that this supplementary tutorial is heavily influenced by the open code from published papers in Prof. Shinichi Nakagawa’s lab (see the full publication list) and by (Associate) Prof. Wolfgang Viechtbauer’s versatile R package metafor (see the documentation).

Updates

This is an easy-to-implement tutorial: you only need to make slight modifications to fit your own meta-analytic data. Note, however, that some arguments or scripts may become out of date as packages are updated or changed in the future. We will periodically update this tutorial to keep it working. The latest version of the tutorial can be found at the GitHub repository (https://github.com/Yefeng0920/advanced_animal_MA_tutorial) and the Zenodo repository (https://zenodo.org/record/6622330#.YqAyH3ZBw2w).

Loading packages

Load the necessary R packages for data manipulation, visualisation and model implementation. Note that if your R installation does not have the following packages, you need to install them using install.packages() for CRAN packages or devtools::install_github() for those archived in GitHub repositories: tidyverse, knitr, DT, readxl, metafor, clubSandwich, orchaRd, MuMIn, patchwork, GoodmanKruskal, networkD3, ggplot2, ggsignif, visdat, ggalluvial, ggthemr, cowplot, grDevices, png, grid, gridGraphics, pander, formatR, rmdformats.
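As an illustration, the core packages used in the worked examples below could be loaded along these lines (a minimal sketch; the GitHub repository path for orchaRd is an assumption and may differ):

# install.packages(c("tidyverse", "metafor", "clubSandwich"))  # CRAN packages (run once)
# devtools::install_github("daniel1noble/orchaRd")             # GitHub package (repository path assumed)
library(tidyverse)    # data manipulation and plotting
library(metafor)      # meta-analytic models: escalc(), rma(), rma.mv()
library(orchaRd)      # orchard plots, bubble plots, i2_ml(), r2_ml()
library(clubSandwich) # cluster-robust variance estimation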

Why animal meta-analyses need advanced models

We profiled the current practices of meta-analyses of animal data by mapping the reporting practices, statistical issues and statistical approaches in papers published over the last 10 years (2011 – 2021; see survey results in a GitHub repository: mlagisz/survey_neurobiology_MA). Animal meta-analyses often combine studies on different species or strains, with different experimental designs (e.g., different dosages or animal sexes), multiple outcomes, and multiple trials, each with multiple arms. These complex data structures often result in two statistical issues: statistical dependence and multiple sources of heterogeneity. Researchers conducting animal meta-analyses mainly use traditional meta-analytic techniques (i.e., fixed- and random-effects models), which are very limited in addressing these two issues. The resulting unreliable meta-analytic evidence (e.g., inflated p-values) can undermine our ability to provide robust meta-analytic insights.

Formulating a meta-analysis within the multilevel model framework allows dependence and heterogeneity to be modeled directly. In the main text, we elaborate on the concepts, rationale, and examples of the multilevel model in the context of meta-analysis. To complement the theory outlined in the main text, here we provide an easy-to-implement tutorial (with R code) to showcase how to conduct animal meta-analyses within the multilevel model framework. We also illustrate how to implement other underappreciated methods, such as robust variance estimation. We encourage meta-analysts to modify the sample R code for their own animal meta-analyses to draw more robust biological (neurological) inferences, reveal new biological insights, and foster better animal-to-human translation (Bahadoran et al. (2020)).

All source files for this tutorial, including Rmarkdown file with the original code can be found at https://github.com/Yefeng0920/advanced_animal_MA_tutorial.

How to implement advanced meta-analytic techniques

This online tutorial has two parts:

Part I: Animal meta-analyses within the multilevel model framework

In Part I, we reproduce a typical animal meta-analysis conducted by Ramsteijn et al. (2020), who employed traditional meta-analytic techniques (i.e., a random-effects model; see 6.1 in the main text). We use this dataset as the worked example to:

  • show how failure to account for non-independence using traditional meta-analytic techniques might lead to spurious conclusions;

  • showcase the implementation of the multilevel meta-analytic framework (sections 6 to 8 in the main text).

Part I consists of six sections. In Section 1, we show how to use R code to fit a traditional meta-analytic model, so that you become familiar with the coding/syntax-based implementation of meta-analytic models and the corresponding model outputs. Building upon the random-effects meta-analytic model (Section 1), we then illustrate the procedures related to the use of multilevel models (Sections 2 to 6). We recommend that researchers employ these procedures as the default analytic pipeline when conducting an animal meta-analysis.

  • Section 1 - Fit a random-effects meta-analytic model

  • Section 2 – Fit a multilevel meta-analytic model to estimate overall pooled effect

  • Section 3 – Partition heterogeneity among effect sizes using the multilevel model

  • Section 4 – Fit multilevel meta-regressions to explain heterogeneity and estimate moderator effects

  • Section 5 – Test publication bias

  • Section 6 – Animal meta-analysis using emerging effect sizes

Part II: Complementary analyses with other advanced methods to handle more complex animal data structures

In Part II, we use a more complex animal dataset to show the implementation of the extended methods outlined in section 9 in the main text. This dataset comes from one of our published neuroscience meta-analyses (Lagisz et al. (2020)). Methods implemented in Part II are not mandatory procedures for performing an animal meta-analysis, but following them can make an animal meta-analysis statistically more sound (with more reliable model coefficients and statistical inferences). These extended methods include:

  • Section 7 – Select an appropriate random-effect structure

  • Section 8 – Fit multi-moderator multilevel meta-regression models

  • Section 9 – Construct variance-covariance matrix to account for correlated/non-independent error

  • Section 10 – Make cluster-robust model inferences

The above four sections broadly align with the order of subheadings of our main text. For each procedure within Part I and Part II, we first briefly explain the necessary statistical concepts and rationale (detailed theoretical explanations can be found in the main text). Then we use existing R packages and custom functions to show the implementation of each procedure.

Section 1 – Fit a random-effects meta-analytic model

We use the animal dataset provided by Ramsteijn et al. (2020) for our first worked example. This dataset comes from one of the meta-analyses included in our survey (Ramsteijn et al. (2020)), where the authors examined the effect of perinatal selective serotonin re-uptake inhibitor (SSRI) exposure on behavioural phenotypes of animals (e.g., exploration, learning, stress coping, social behaviour, sensory processing). We choose one of the outcome data subsets (sensory processing) to show the implementation of the recommended procedures.

1.1 Load the dataset of Ramsteijn et al. (2020)

Table S1
The corresponding coded variables in the worked example (Ramsteijn et al. (2020)).

As shown in Table S1, this dataset includes 12 primary studies with 17 effect sizes. The ratio of the number of observations / effect sizes (k) to the number of studies (N) implies that this dataset suffers from statistical dependence: k = 17 effect sizes, N = 12 unique primary studies, k/N = 1.4166667. This indicates that several studies in this meta-analysis contributed more than one effect size. Moreover, the authors declared in their published paper:

If a study reported separate comparisons for males and females, or animals exposed to different SSRIs, we analyzed these comparisons as if they were separate studies. (page 55)

This sentence indicates that the authors ignored the fact that multiple effect sizes are nested within a study (multiple effect sizes per study). In reality, the effect sizes in this meta-analysis are non-independent. Besides the issue of non-independence among effect sizes, there are also multiple sources of heterogeneity arising from the varied biological and methodological characteristics of the studies in Ramsteijn et al. (2020)’s dataset (Figure S1).

Figure S1 Heterogeneous experimental designs of primary studies included in the worked example: relationship/nestedness between study species, sex, behavioural assay, and types of exposure (SSRIs).

1.2 Fit a random-effects model

Ramsteijn et al. (2020) used standardized mean differences (SMD) as their effect size metric to quantify the effect of SSRI exposure on animals’ sensory processing. They conducted a random-effects meta-analysis to estimate the overall effect of SSRI and a series of subgroup analyses to examine how different moderators moderate the magnitude of the SSRI effect. Below, we reproduce Ramsteijn et al. (2020)’s analyses using their analytic pipeline (i.e., the traditional meta-analytic model).

Effect size calculations

SMD and the corresponding sampling variance can be computed using existing R packages, such as the metafor package (escalc() function) or the meta package. Positive values of SMD indicate that perinatal SSRI exposure has a positive effect on offspring’s sensory processing function. Here, we use escalc() to calculate SMD and the corresponding sampling variance. Note that other commonly used effect sizes (see Section 6) can also be easily calculated using the mentioned packages. The effect size and corresponding sampling variance can be computed as follows (using SMD as an example):

SMD <- escalc(measure = "SMD", # standardised mean difference should be calculated (alternative effect sizes: "ROM" – lnRR, "CVR" – lnCVR, "VR" – lnVR; see below);
              m1i = SSRI_Mean, # mean of treatment group (SSRI);
              m2i = Vehicle_Mean, # mean of control group (Vehicle);
              sd1i = SSRI_SD, # standard deviation of treatment group;
              sd2i = Vehicle_SD, # standard deviation of control group; 
              n1i = SSRI_Nadj, # sample size of treatment group; 
              n2i = Vehicle_Nadj, # sample size of control group; 
              data = dat_sensory, # dataset of our work example;
              append = FALSE)

Table S2
The estimates of effect sizes (SMD) and their sampling variance (SMDV) for each included study and comparison.

Now let’s fit a random-effects model to these data (Equation 1; all notations can be found in the main text):

\[ ES_{j} = \beta_{0} + \mu_{j} + m_{j}, (1)\\ \mu_{j} \sim N(0,\tau^2)\\ m_{j} \sim N(0,\nu_{j}) \]

Ramsteijn et al. (2020) used Review Manager (RevMan v.5.3) to perform their random-effects analysis. Here, we reproduce their random-effects meta-analysis using the rma() function in the metafor package with this syntax:

mod_random_SMD <- rma(yi = SMD, # the variable in your dataset containing calculated effect sizes / estimates of SMD, which can be obtained from R functions like escalc() function; 
                      vi = SMDV, # the variable in your dataset containing the estimates of sampling variance of SMD corresponding to each yi
                      test = "t", # the t-distribution is specified to calculate test statsitic and performs significance test (confidence intervals, and p-value for model coefficient intercept in Equation 1); alternative method: "z", which uses a standard normal distribution;
                      data = dat_sensory2 # your dataset
                      )

The model output looks like this:


Random-Effects Model (k = 17; tau^2 estimator: REML)

  logLik  deviance       AIC       BIC      AICc 
-18.8005   37.6010   41.6010   43.1462   42.5241   

tau^2 (estimated amount of total heterogeneity): 0.3953 (SE = 0.1964)
tau (square root of estimated tau^2 value):      0.6287
I^2 (total heterogeneity / total variability):   75.75%
H^2 (total variability / sampling variability):  4.12

Test for Heterogeneity:
Q(df = 16) = 52.2611, p-val < .0001

Model Results:

estimate      se     tval  df    pval    ci.lb    ci.ub 
 -0.3873  0.1816  -2.1326  16  0.0488  -0.7723  -0.0023  * 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Under ‘Model Results’, we can see that these results are not exactly the same as, but very close to, what Ramsteijn et al. (2020) report in their paper (page 62): the overall pooled SMD is estimated to be \(\beta_{0}\) = -0.387 (original \(\beta_{0}\) = -0.37) with a standard error of SE[\(\beta_{0}\)] = 0.182. The amount of heterogeneity (between-study variance) is \(\tau^2\) = 0.395 and the corresponding \(I^2\) = 75.8% (original \(I^2\) = 68%). The slight difference is caused by the different estimators used in our re-analysis and in Ramsteijn et al. (2020)’s analysis. Ramsteijn et al. (2020) performed their analysis with Review Manager, which uses the DerSimonian-Laird method as its default estimator, whereas rma() uses the restricted maximum-likelihood (REML) method by default, the estimator recommended by simulation studies (Langan et al. (2019); Viechtbauer (2007)). If we specify the DerSimonian-Laird method via the method argument (method = "DL"), the results are much closer to Ramsteijn et al. (2020)’s results:

mod_random_SMD2 <- rma(yi = SMD,  
                       vi = SMDV, 
                       test = "t",
                       method = "DL", # we followed the method used to estimate between-study variance in  Ramsteijn et al. (2020)'s analyses - DerSimonian-Laird method, which is the default estimator of Review Manager;
                       data = dat_sensory2 
                      )

Now, the overall pooled SMD becomes \(\beta_{0}\) = -0.378 (original \(\beta_{0}\) = -0.37) and the degree of heterogeneity becomes \(I^2\) = 69.4% (original \(I^2\) = 68%).

Section 2 – Fit a multilevel meta-analytic model to estimate overall pooled effect

In this section, with Ramsteijn et al. (2020)’s data, we illustrate how to conduct a meta-analysis in the framework of multilevel model to deal with non-independence among effect sizes and subsequently obtain a robust overall pooled effect.

2.1 Concepts and rationale

The random-effects model assumes statistical independence among the effect sizes obtained from a set of studies. According to our survey, 89% of animal meta-analyses violated this assumption in practice (see our main text and survey results in a GitHub repository: mlagisz/survey_neurobiology_MA). As mentioned earlier, many primary studies included in Ramsteijn et al. (2020) contribute more than one effect size (i.e., non-independent effect sizes). The effect sizes are correlated with each other within a ‘clustering’ variable: effect sizes derived from the same study, species, or other clustering variable (e.g., research group) may be more similar to each other than effect sizes from different studies, species, or other clustering variables. We provide a visual summary of the different forms of potential non-independence in animal meta-analytic datasets in the main text.

Using a random-effects model to fit dependent effect sizes ignores the dependency among effect sizes and treats them as if they were statistically independent. Ignoring non-independence can inflate Type I error and lead to underestimation of the standard error of the model coefficient (SE[\(\beta_{0}\)]; see the main text). As a result, the corresponding significance tests of the model coefficients are distorted (i.e., distorted p-values and confidence intervals) - our worked example clearly shows this point (see below). A multilevel model can directly model the statistical dependence among effect sizes. The multilevel model can also accommodate various sources of heterogeneity (e.g., originating from different clusters: studies, species, treatment types). The simplest multilevel model is the 3-level multilevel meta-analytic model, which can be written as (Equation 2; all notations can be found in the main text):
\[ ES_{[i]} = \beta_{0} + \mu_{between[j]} + \mu_{within[i]} + e_{[i]}, (2)\\ \mu_{between[j]} \sim N(0,\sigma_{between}^2)\\ \mu_{within[i]} \sim N(0,\sigma_{within}^2)\\ e_{[i]} \sim N(0,\nu_{[i]}) \]

\(\beta_{0}\) in Equation 2 denotes the overall effect (also known as the overall mean or pooled effect size). Equation 2 is a so-called intercept-only multilevel meta-analytic model because its main model coefficient is the intercept (i.e., \(\beta_{0}\)). The multilevel model can deal with dependent effect sizes because it uses a flexible random-effects structure to capture the non-independence arising from clustering/nesting variables (analogous to nested random-effects terms in a linear mixed-effects model).

To properly handle non-independence, you need to deal with it from the very beginning, namely when preparing your data file (e.g., Excel or CSV files): structure your data in a way that permits the incorporation of non-independence among effect sizes. Specifically, you need to code a unique identifier for each observation/effect size (e.g., Obs_ID: Obs1, Obs2, Obs3), each primary study (e.g., Study_ID: s1, s2, s3) and, if applicable, each strain/species (e.g., Strain_ID or Species_ID). These unique identifiers allow true effect sizes to vary among clustering variables (for example, across primary studies) and within a clustering variable (i.e., multiple effect sizes nested within a study), such that the corresponding non-independence and variation can be modeled (see the sketch below).
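As an illustration, identifiers like these could also be created directly in R before model fitting (a minimal sketch; the data frame dat and the raw column name First_author_year are hypothetical):

# a minimal sketch: create unique identifiers for each primary study and each effect size (row)
dat$Study_ID <- factor(dat$First_author_year)  # one level per primary study (column name assumed)
dat$Obs_ID   <- factor(seq_len(nrow(dat)))     # one unique level per effect size/observation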

2.2 Construct a multilevel meta-analytic model

To fit non-independent effect sizes in Ramsteijn et al. (2020)’s dataset, we need to use the rma.mv() function rather than rma() in the metafor package. The rma.mv() function uses the random argument to deal with non-independence due to clustering. The random argument is specified with a formula (which defines a nested random-effects structure): each random effect starts with ~ 1, followed by a vertical bar |; after |, a clustering variable (e.g., Study_ID, Species_ID) is supplied to define the random effect. In our 3-level multilevel model, we have two random-effects terms: \(\mu_{between[j]}\) and \(\mu_{within[i]}\) (Equation 2). You may wonder why, with only two random-effects terms, Equation 2 is called a "3-level multilevel model". This is because Equation 2 also contains the sampling variance effect (\(e_{[i]}\)) at the "bottom" level (see below).

  • Level 1: sampling variance effect

The sampling variance effect \(e_{[i]}\) is on level 1, which accounts for the sampling/measurement error in each effect size.

  • Level 2: within-study effect

The random effect \(\mu_{within[i]}\) is on level 2, which accounts for the within-study (observation-level/effect-size-level) random effect and uses the corresponding variance component \(\sigma_{within}^2\) to capture within-study heterogeneity. Level 2 can be specified as: random = ~ 1 | Obs_ID.

  • Level 3: between-study effect

The random effect \(\mu_{between[j]}\) is on level 3, which accounts for the between-study (study-specific) random effect and uses the corresponding variance component \(\sigma_{between}^2\) to capture between-study heterogeneity. Level 3 can be specified as: random = ~ 1 | Study_ID.

Because the 3-level model has two random effects, we need to use list() to bind them together: random = list(~ 1 | Study_ID, ~ 1 | Obs_ID). Alternatively, we can use another form of the formula to tell rma.mv() that the effect sizes are non-independent due to clustering (nested random effects): random = ~ 1 | Study_ID / Obs_ID, which adds a random effect corresponding to the clustering variable Study_ID and a random effect corresponding to Obs_ID nested within Study_ID (an equivalent fit using this syntax is sketched after the model code below).

We can use rma.mv() to fit a 3-level meta-analytic model to the calculated SMD with the syntax (implementation of Equation 2):

mod_multilevel_SMD <- rma.mv(yi = SMD, 
                             V = SMDV, 
                             random = list(~1 | Study_ID, # a random effect (clustering variable) that allows the true effect sizes vary across studies (variation between studies);
                                           ~1 | Obs_ID), # a random effect that allows the true effect sizes vary within studies (variation within studies);
                             test = "t",
                             method = "REML", 
                             data = dat_sensory2
                            )
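For completeness, the same model could also be specified with the nested formula syntax mentioned above (a minimal sketch; it should give identical estimates):

mod_multilevel_SMD_nested <- rma.mv(yi = SMD,
                                    V = SMDV,
                                    random = ~ 1 | Study_ID / Obs_ID, # Obs_ID nested within Study_ID;
                                    test = "t",
                                    method = "REML",
                                    data = dat_sensory2)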

2.3 Interpretations of a multilevel meta-analytic model

We can use a forest plot to visualise the model results. The forest() function in the metafor package can be used to make a typical forest plot. From this forest plot (Figure S2), it is hard to tell whether there is a statistically significant overall effect (pooled SMD) or not (the summary polygon shown at the bottom of the figure almost touches the vertical line indicating zero effect). Alternatively, you can use orchaRd, a meta-analysis visualisation package we created (Nakagawa et al. (2021)). The orchard_plot() function can be used to create an orchard plot (a forest-like plot) to visualise the results of a meta-analytic model (Figure S3); a sketch of the plotting code is given after Figure S3. An orchard plot is more informative than a forest plot. For example, it can display the prediction interval of the overall effect (bold whiskers), the number of effect sizes (k) and the number of studies (the number in brackets). An orchard plot is very useful when you have a big dataset (large k; see Figure S9).

Figure S2
Forest plot showing the results of analysing 17 effect sizes with a multilevel model quantifying the effect of SSRI exposure on animal sensory processing (data from Ramsteijn et al. (2020)).

Figure S3
Orchard plot (forest-like plot) showing the results of analysing 17 effect sizes with a multilevel model quantifying the effect of SSRI exposure on animal sensory processing (data from Ramsteijn et al. (2020)). You can use help(orchard_plot) to look at the corresponding arguments and add more code to make it more elegant (Nakagawa et al. (2021)).
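Plots like Figures S2 and S3 could be produced along these lines (a minimal sketch, assuming the fitted model mod_multilevel_SMD from above; orchard_plot() argument names may differ between orchaRd versions):

forest(mod_multilevel_SMD) # conventional forest plot (as in Figure S2)
orchard_plot(mod_multilevel_SMD, mod = "1", group = "Study_ID",
             xlab = "Standardised mean difference (SMD)") # orchard plot (as in Figure S3)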

The output of the fitted multilevel model looks like this:


Multivariate Meta-Analysis Model (k = 17; method: REML)

  logLik  Deviance       AIC       BIC      AICc 
-18.7458   37.4915   43.4915   45.8093   45.4915   

Variance Components:

            estim    sqrt  nlvls  fixed    factor 
sigma^2.1  0.1455  0.3815     12     no  Study_ID 
sigma^2.2  0.2636  0.5134     17     no    Obs_ID 

Test for Heterogeneity:
Q(df = 16) = 52.2611, p-val < .0001

Model Results:

estimate      se     tval  df    pval    ci.lb   ci.ub 
 -0.3887  0.1953  -1.9904  16  0.0639  -0.8026  0.0253  . 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Next, let’s go through the printed model outputs one by one.

  • Multivariate Meta-Analysis Model

In the top output section, k corresponds to the number of effect sizes supplied to the multilevel model. method: REML means that the REML method was used as the estimation procedure to obtain the model estimates (e.g., variance components, model coefficients). All other elements are fit statistics and information-criterion-based statistics, including logLik (restricted log-likelihood of the fitted model), Deviance, AIC (Akaike information criterion), BIC (Bayesian information criterion) and AICc (AIC corrected for small sample sizes). These statistics are used for model selection, that is, to select ‘better’ models (see Section 7).

  • Variance Components

This section of the printed output shows the variance estimated at each level of the fitted 3-level model. The first component, sigma^2.1, represents the level 3 between-study variance, \(\sigma_{between}^2\). Conceptually, this is equivalent to the between-study heterogeneity variance \(\tau^2\) in a random-effects model - Equation 1 (although the values of \(\sigma_{between}^2\) and \(\tau^2\) are not the same; see the next section, which compares the multilevel and random-effects models). The second variance component, sigma^2.2, represents the level 2 within-study variance, \(\sigma_{within}^2\). The estim column shows the estimates of the variance components at levels 3 and 2 (\(\sigma_{between}^2\) and \(\sigma_{within}^2\)). The sqrt column shows their square roots (i.e., standard deviations). The nlvls column shows how many levels each random effect has. factor gives the names of the clustering variables used in the random argument to specify the corresponding random effects.

  • Test for Heterogeneity

This printed output section shows results of Cochran’s Q-test, which is used to test the null hypothesis that all animal studies have the same/equal effect. p-val < 0.05 means that effect sizes derived from the animal studies are heterogeneous. In other words, substantial heterogeneity exists in this animal dataset.

  • Model Results

estimate is the estimate of the overall/pooled effect (i.e., grand mean or meta-analytic mean; \(\beta_{0}\) in Equation 2). se is the standard error of the estimate (in our example, SE[\(\beta_{0}\)]). tval is the value of the test statistic (in our case, a t-value). ci.lb and ci.ub are the lower and upper bounds of the confidence interval (CI).
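These quantities can also be extracted directly from the fitted model object (a minimal sketch, assuming mod_multilevel_SMD from above):

coef(mod_multilevel_SMD)                              # beta_0, the pooled SMD
mod_multilevel_SMD$se                                 # SE[beta_0]
c(mod_multilevel_SMD$ci.lb, mod_multilevel_SMD$ci.ub) # 95% confidence interval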

2.4 Handle non-independence to avoid the distortion of significance test and spurious conclusions

As mentioned in the main text, the traditional statistical models used in animal meta-analyses often fail to handle statistical non-independence, inflating Type I error, distorting significance test results and leading to spurious conclusions. This worked example illustrates exactly this point. As in a random-effects meta-analysis, the first aim of a multilevel meta-analysis is often to estimate the overall effect across all animal studies (overall mean or pooled effect size; \(\beta_{0}\) in Equation 2). Under Model Results, we can find the estimates we need for this aim: the magnitude of the overall effect (\(\beta_{0}\)), its standard error (SE[\(\beta_{0}\)]), the corresponding two-tailed p-value, and the 95% confidence interval (CI). Our multilevel model revealed that animals exposed to SSRIs did not show significantly less efficient sensory processing than those exposed to vehicle (\(\beta_{0}\) = -0.389, 95% CI = [-0.803 to 0.025], p-value = 0.064).

Conflicting results between the multilevel model and random-effect model
When comparing the results of our 3-level model with Ramsteijn et al. (2020)’s original results, we found that the conclusions of our 3-level model conflict with those reached by Ramsteijn et al. (2020) (who used a simple random-effects model). Ramsteijn et al. (2020)’s analysis (page 62) indicates that SSRI exposure has a statistically significant negative effect on sensory processing in animals (our re-analysis using a random-effects model also suggested this). However, the multilevel model suggests that the SSRI effect is actually statistically non-significant when accounting for data non-independence (Table S3).

Table S3
Results of the random-effects and 3-level meta-analytic models.

We note that the magnitude of the overall effect (pooled effect size: \(\beta_{0}\) = -0.389) in the multilevel model is very close to that in the random-effects model (\(\beta_{0}\) = -0.387). But the multilevel model indicates that the overall effect (\(\beta_{0}\)) is not statistically significant, while the random-effects model indicates that it is. This clearly shows that using the random-effects model to fit non-independent effect sizes underestimates the standard error of the model coefficient (SE[\(\beta_{0}\)]) and distorts the corresponding statistical inference (e.g., a spuriously small p-value): SE[\(\beta_{0}\)] = 0.195 in the multilevel model vs. SE[\(\beta_{0}\)] = 0.182 in the random-effects model; p-value = 0.064 in the multilevel model vs. p-value = 0.049 in the random-effects model. Accordingly, the 95% CI from the multilevel model is wider than that from the random-effects model: [-0.803 to 0.025] vs. [-0.772 to -0.002].

Section 3 – Partition heterogeneity among effect sizes using the multilevel model

3.1 Calculate multilevel version of I2 statistic and variance components

As with the traditional meta-analytic models (i.e., fixed- and random-effects models), the multilevel model can also measure the degree of inconsistency among effect sizes (i.e., the amount of heterogeneity). The hierarchical nature of the multilevel model means that an animal meta-analysis can benefit from decomposing heterogeneity across levels, e.g., within- and between-study heterogeneity (and more complex sources, such as species-specific heterogeneity; see our second worked example in Part II). In contrast, the random-effects model can only quantify between-study heterogeneity, so within-study heterogeneity is mistakenly perceived as between-study heterogeneity (i.e., a confounded source of heterogeneity). Below, we use Ramsteijn et al. (2020)’s data to show how to partition multiple sources of heterogeneity using the multilevel version of the \(I^2\) statistic and the variance components (\(\sigma^2\)), such that we can avoid confounded heterogeneity estimates.

The following formulas can be used to calculate the multilevel version of \(I^2\) statistic (Equations 4 to 6 in the main text):

\[ I^2_{between}=\frac{\sigma_{between}^2} {Var[ES_{i}]}=\frac{\sigma_{between}^2} {\sigma_{total}^2+\sigma_{sampling}^2}, (4) \]

\[ I^2_{within}=\frac{\sigma_{within}^2} {Var[ES_{i}]}=\frac{\sigma_{within}^2} {\sigma_{total}^2+\sigma_{sampling}^2}, (5) \]

\[ I^2_{total}=\frac{\sigma_{total}^2} {Var[ES_{i}]}=\frac{\sigma_{total}^2} {\sigma_{total}^2+\sigma_{sampling}^2}, (6) \]

\(\sigma_{between}^2\) and \(\sigma_{within}^2\) are the variance components corresponding to the between- and within-study random effects in the multilevel model (specified by the syntax random = list(~ 1 | Study_ID, ~ 1 | Obs_ID) in rma.mv()). The value of each \(\sigma^2\) can be found under Variance Components. \(\sigma_{total}^2\) is the total variance, which equals the sum of \(\sigma_{between}^2\) and \(\sigma_{within}^2\). \(\sigma_{sampling}^2\) is the "typical" sampling-error variance, which can be calculated as:
\[ \sigma_{sampling}^2=\frac{(k-1)\sum_{i=1}^{k} 1/\nu_{i}} {(\sum_{i=1}^{k} 1/\nu_{i})^2-\sum_{i=1}^{k} 1/\nu_{i}^2}, (7) \]

In practice, we do not need to calculate the multilevel version of the \(I^2\) statistic manually. The i2_ml() function in the orchaRd package (Nakagawa et al. (2021)) is a convenient and user-friendly way to decompose the \(I^2\) statistic across levels. We can calculate the multilevel \(I^2\) statistics using i2_ml() with one line of code:

i2_ml(mod_multilevel_SMD)

Then, \(I^2\) statistics corresponding to each level (including \(I^2_{total}\)) are provided as:

   I2_Total I2_Study_ID   I2_Obs_ID 
   76.37792    27.17122    49.20670 
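For completeness, the same values could be computed by hand from Equations 4 to 7 (a minimal sketch, assuming the fitted model mod_multilevel_SMD and the sampling variances SMDV used above):

w <- 1 / dat_sensory2$SMDV                                          # inverse sampling variances
sigma2_sampling <- (length(w) - 1) * sum(w) / (sum(w)^2 - sum(w^2)) # "typical" sampling variance (Equation 7)
sigma2_between  <- mod_multilevel_SMD$sigma2[1]                     # between-study variance component
sigma2_within   <- mod_multilevel_SMD$sigma2[2]                     # within-study variance component
denom <- sigma2_between + sigma2_within + sigma2_sampling
100 * c(I2_total   = (sigma2_between + sigma2_within) / denom,      # Equation 6
        I2_between = sigma2_between / denom,                        # Equation 4
        I2_within  = sigma2_within / denom)                         # Equation 5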

3.2 Handle multiple sources of heterogeneity to avoid confounded heterogeneity

As shown above, i2_ml() calculates an \(I^2\) statistic for each random effect, corresponding to each variance component. Table S3 above shows the distinctions between the random-effects model and the multilevel model in terms of heterogeneity. We can see that the between-study variance (\(\tau^2\)) and heterogeneity (\(I^2_{between}\)) in the random-effects model are overestimated: the corresponding values in the 3-level model are almost half of those in the random-effects model. This is because the random-effects model incorrectly allocates within-study variance (\(\sigma^2_{within}\)) and heterogeneity (\(I^2_{within}\)) to between-study variance (\(\sigma^2_{between}\)) and heterogeneity (\(I^2_{between}\)). You can easily corroborate this point by comparing the between-study variance (\(\tau^2\)) and heterogeneity (\(I^2\)) in the random-effects model with the total variance (\(\sigma^2_{total}\)) and heterogeneity (\(I^2_{total}\)) in the 3-level model (i.e., the sum of the within- and between-study components). This means that using the random-effects model to fit this dataset leads to the wrong conclusion that the study level accounts for a high amount of heterogeneity (\(I^2_{between}\) = 69.38%). In fact, the study level explains only 27.17% of the total heterogeneity; the remaining 49.21% is due to the within-study level (effect-size/observation level).

3.3 Compute prediction intervals

We recommend that researchers use a complementary statistic to quantify heterogeneity - the prediction interval (PI), defined as an interval (a plausible range of values) within which future measurements (i.e., new effect sizes) are expected to fall if there were no sampling error. This can be calculated as (see Equation 8 in the main text for notation):
\[ \text{95%PI} = \beta_{0} \pm t_{0.975} \sqrt{\sigma^2_{total}+ \text{SE}[\beta_{0}]^2}, (8) \]

All the required elements for the calculation of the 95% PI can be found in the output of rma.mv() (see 2.3 Interpretations of a multilevel meta-analytic model). You can also use an existing function - predict() - to obtain the 95% PI for the overall effect (pooled effect size):

predict(mod_multilevel_SMD)

The corresponding output looks like:


    pred     se   ci.lb  ci.ub   pi.lb  pi.ub 
 -0.3887 0.1953 -0.8026 0.0253 -1.8064 1.0290 

We can see that the 95% PI of the overall SSRI effect is [-1.81, 1.03], which means that 95% of new trials of perinatal SSRI exposure are expected to have an effect size (i.e., SMD) in the range of [-1.81, 1.03] across different experimental contexts (different types of SSRI, animal species, sexes and ages).

Section 4 – Fit multilevel meta-regressions to explain heterogeneity and estimate moderator effects

4.1 Concepts and rationale

A large \(I^2_{total}\) = 76.38% in Ramsteijn et al. (2020)’s dataset indicates that it is desirable to explain (at least part of) this heterogeneity using variables extracted from the primary studies as moderator variables (also known as predictors). In other words, we want to examine how different study-level characteristics (variables/factors) modulate the magnitude of the effect of SSRI exposure. Ramsteijn et al. (2020) used subgroup analysis to explain the heterogeneity in the effect sizes. Here, we recommend using a multilevel meta-regression model to explain the heterogeneity and examine moderator effects:
\[ ES_{[i]} = \beta_{0}' + \beta_{1}x_{between[j]} + \mu_{between[j]} + \mu_{within[i]} + e_{[i]}, (9)\\ \mu_{between[j]} \sim N(0,\sigma_{between}^2)\\ \mu_{within[i]} \sim N(0,\sigma_{within}^2)\\ e_{[i]} \sim N(0,\nu_{[i]}) \]

As we demonstrated in the main text, a subgroup analysis is equivalent to a meta-regression model that uses a dummy-coding strategy for categorical predictors (note that in the context of a meta-analysis, we usually call a predictor a moderator). In the main text, we used an example to show how to dummy-code categorical moderators for a multilevel meta-regression. The rationale of the dummy-coding strategy is to (1) create dummy-coded variables representing the levels of a categorical moderator, and (2) leave one of the dummy variables out of the regression model and set its level as the "reference" level (see details in the main text). Rather than creating the dummy-coded variables by hand, we prefer to let R dummy-code categorical moderator variables automatically.

4.2 Construct a multilevel meta-regression model

A multilevel meta-regression model (e.g., Equation 9) can be fitted using the rma.mv() function. The mods argument in rma.mv() is used to specify categorical (and also continuous) moderators. The syntax for mods is as follows: start with a tilde ~ (a general requirement of a formula in R), followed by the name of the moderator (e.g., mods = ~ moderator1).

In this worked example, Ramsteijn et al. (2020) examined whether the effects of SSRI exposure differ depending on sex (male, female, or both). This categorical moderator is stored in the Sex column of the dataset. A 3-level meta-regression model with Sex as a moderator variable can be fitted using the following code:

mod_multilevel_reg_sex_SMD <- rma.mv(yi = SMD, 
                                     V = SMDV, 
                                     random = list(~1 | Study_ID, 
                                                   ~1 | Obs_ID), 
                                     mods = ~ Sex, # the name of the tested moderator; this can be replaced by other moderators of interests;
                                     test = "t",
                                     method = "REML", 
                                     data = dat_sensory2)

See Interpretations of a multilevel meta-regression model below for a comprehensive interpretation of the outputs for the above fitted meta-regression model.

4.3 Interpretations of a multilevel meta-regression model

We can use orchard_plot() to make an orchard plot visualising the results of a multilevel meta-regression model (a sketch of the plotting code is given after Figure S4). Figure S4 shows that none of the subgroups (male, female, or both) has a statistically significant moderating effect (their CIs overlap).

Figure S4
Orchard plot (forest-like plot) showing the mediated effect of sex in the analysis of SSRI exposure on animal sensory processing (data from Ramsteijn et al. (2020)). You can use help(orchard_plot) to look at the corresponding arguments and add more code to make it more elegant (Nakagawa et al. (2021)).
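A plot like Figure S4 could be produced along these lines (a minimal sketch, assuming the fitted model mod_multilevel_reg_sex_SMD from above; orchard_plot() argument names may differ between orchaRd versions):

orchard_plot(mod_multilevel_reg_sex_SMD, mod = "Sex", group = "Study_ID",
             xlab = "Standardised mean difference (SMD)") # one "trunk" per level of Sex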

Using rma.mv() to fit a multilevel meta-regression model will yield the following output:


Multivariate Meta-Analysis Model (k = 17; method: REML)

  logLik  Deviance       AIC       BIC      AICc 
-15.9626   31.9253   41.9253   45.1206   49.4253   

Variance Components:

            estim    sqrt  nlvls  fixed    factor 
sigma^2.1  0.5488  0.7408     12     no  Study_ID 
sigma^2.2  0.0275  0.1657     17     no    Obs_ID 

Test for Residual Heterogeneity:
QE(df = 14) = 49.0716, p-val < .0001

Test of Moderators (coefficients 2:3):
F(df1 = 2, df2 = 14) = 2.2580, p-val = 0.1413

Model Results:

           estimate      se     tval  df    pval    ci.lb   ci.ub 
intrcpt     -0.7107  0.4069  -1.7465  14  0.1026  -1.5835  0.1621    
Sexfemale    0.8267  0.5356   1.5435  14  0.1450  -0.3220  1.9753    
Sexmale      0.1044  0.5424   0.1924  14  0.8502  -1.0591  1.2678    

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Next, let’s go through the parts of this printed output one by one.

  • Multivariate Meta-Analysis Model

The interpretations of the results under Multivariate Meta-Analysis Model are same as those in the section of Interpretations of a multilevel meta-analytic model.

  • Variance Components

The interpretations of the results under Variance Components are same as those in the section of Interpretations of a multilevel meta-analytic model.

  • Test for Residual Heterogeneity

Results under Test for Residual Heterogeneity are similar to Test for Heterogeneity in the section Interpretations of a multilevel meta-analytic model, but the two are not exactly the same. QE is the test statistic used to test whether the amount of "residual heterogeneity" among the effect sizes is substantial. "Residual heterogeneity" is the heterogeneity not explained by the moderator (here, sex) included in the meta-regression model. From the results, we can see that including sex as a moderator accounts for very little of the heterogeneity (Q is reduced only from 52.2611 to 49.0716). p-val < .0001 indicates that the residual heterogeneity remains statistically significant.

  • Test of Moderators (coefficients 2:3)

Results under Test of Moderators present the omnibus test of the model coefficients. coefficients 2:3 means that an omnibus test of coefficients 2 to 3 is conducted to test the null hypothesis \(H0:\beta_{1}=\beta_{2}=0\) (note that the moderator Sex has three levels [male, female or both], so this fitted meta-regression has three model coefficients). By default, the first coefficient (the intercept, denoted \(\beta_{0}\) in Equation 9) is excluded from this omnibus test. If the intercept is intentionally removed from the meta-regression model (see below for details), the omnibus test will include all three coefficients, and the corresponding null hypothesis becomes \(H0:\beta_{0}=\beta_{1}=\beta_{2}=0\). F(df1 = 2, df2 = 14) = 2.2580 and p-val = 0.1413 indicate that the null hypothesis \(H0:\beta_{1}=\beta_{2}=0\) cannot be rejected. In other words, there is no significant difference between the Sex subgroups (i.e., male, female or both); equivalently, Sex as a whole does not modify the average effect of SSRI exposure (i.e., the sex of the animals cannot explain the heterogeneity in effect sizes).

  • Model Results

Results under Model Results report the estimates of all model coefficients and their significance tests. The moderator Sex has three levels: male, female or both. By default, R orders the levels of the dummy-coded variable (in this case, Sex) alphabetically. The subgroup both is set as the "reference" level and left out of the model (because the letter "b" [both] comes before "f" [female] and "m" [male]). So the intercept (\(\beta_{0}\); intrcpt) is the pooled \(SMD\) for the both subgroup (\(\text{SMD}_{both}\) = -0.7107, \(\text{95%CI}\) = [-1.5835, 0.1621], p-value = 0.1026). The other two coefficients represent how much higher the pooled \(SMD\) is for the female (\(\beta_{1}\)) and male (\(\beta_{2}\)) subgroups, respectively, compared to the "reference" level (the both subgroup, i.e., the intercept \(\beta_{0}\)). We can obtain the pooled \(SMD\) for the female and male subgroups by adding their estimates to the estimate of the "reference" level (i.e., intrcpt). Therefore, the pooled \(SMD\)s for female and male are 0.116 (\(\beta_{0} + \beta_{1}\) = -0.7107 + 0.8267) and -0.6063 (\(\beta_{0} + \beta_{2}\) = -0.7107 + 0.1044), respectively. The corresponding p-values (pval) and \(\text{95%CI}\)s (ci.lb, ci.ub) indicate that neither subgroup differs significantly from the reference level; the results of Test of Moderators confirm this (F(df1 = 2, df2 = 14) = 2.2580 and p-val = 0.1413 indicate that we cannot reject the null hypothesis \(H0:\beta_{1}=\beta_{2}=0\)).

By using a different syntax, we can directly obtain the pooled \(SMD\) for each subgroup. To achieve this, we need to add -1 at the end of mods = ~ Sex (i.e., mods = ~ Sex - 1). The -1 means that the intercept (\(\beta_{0}\)) is removed from the meta-regression model. The full code is now:

rma.mv(yi = SMD, 
       V = SMDV, 
       random = list(~1 | Study_ID, ~1 | Obs_ID), 
       mods = ~ Sex -1, # remove the intercept from the meta-regression;
       test = "t",
       method = "REML", 
       data = dat_sensory2)

The results of the meta-regression after removing the intercept (using -1 in the moderator formula):


Multivariate Meta-Analysis Model (k = 17; method: REML)

  logLik  Deviance       AIC       BIC      AICc 
-15.9626   31.9253   41.9253   45.1206   49.4253   

Variance Components:

            estim    sqrt  nlvls  fixed    factor 
sigma^2.1  0.5488  0.7408     12     no  Study_ID 
sigma^2.2  0.0275  0.1657     17     no    Obs_ID 

Test for Residual Heterogeneity:
QE(df = 14) = 49.0716, p-val < .0001

Test of Moderators (coefficients 1:3):
F(df1 = 3, df2 = 14) = 2.4084, p-val = 0.1106

Model Results:

           estimate      se     tval  df    pval    ci.lb   ci.ub 
Sexboth     -0.7107  0.4069  -1.7465  14  0.1026  -1.5835  0.1621    
Sexfemale    0.1160  0.3482   0.3330  14  0.7441  -0.6309  0.8628    
Sexmale     -0.6063  0.3587  -1.6904  14  0.1131  -1.3757  0.1630    

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now, under Test of Moderators (coefficients 1:3):, coefficients 1:3 indicates that the null hypothesis \(H0:\beta_{0}=\beta_{1}=\beta_{2}=0\) is tested. The results show that this null hypothesis still cannot be rejected: F(df1 = 3, df2 = 14) = 2.4084 and p-val = 0.1106. The results under Model Results: show the pooled \(SMD\) for each subgroup; these values (estimate) are exactly the same as those calculated above by adding the two estimates.

4.4 Calculate the goodness-of-fit index

For a multilevel meta-regression model, the goodness-of-fit index \(R^2\) is also applicable for quantifying the percentage of variance explained by the included moderator variables (Aloe et al. (2010)). Nakagawa and Schielzeth (2013) proposed a general form of \(R^2\) - marginal \(R^2\) - which can be calculated as:

\[ R^2_{marginal}=\frac{\sigma_{fixed}^2} {\sigma_{fixed}^2+\sigma_{between}^2+\sigma_{within}^2}, (10) \]

You can easily calculate \(R^2_{marginal}\) via the function r2_ml() in our R package orchaRd:

r2_ml(mod_multilevel_reg_sex_SMD)

The first column of the output (R2_marginal) shows that animal sex can explain 21.3% of the variation.

   R2_marginal R2_conditional 
     0.2130917      0.9624983 

Section 5 – Test publication bias

As mentioned in the main text, common methods for testing publication bias are invalid if effect sizes are statistically dependent. Therefore, funnel plots, conventional Egger’s regression and trim-and-fill tests are often not suitable for testing publication bias in animal meta-analyses. In this section, we showcase how to properly test two forms of publication bias within the multilevel meta-regression framework: the small-study effect and the decline effect.

5.1 Construct an extended Egger’s regression to test the small-study effect

The first form of publication bias is the small-study effect, which occurs when small studies (with small sample sizes) tend to report large effect sizes. An extended Egger’s regression is equivalent to a multilevel meta-regression with sampling error (\(se_{[i]}=\sqrt{\nu_{[i]}}\)) as a continuous moderator variable. Accordingly, it can be fitted by specifying mods = ~ SMDSE:

rma.mv(yi = SMD, 
       V = SMDV, 
       random = list(~1 | Study_ID, ~1 | Obs_ID), 
       mods = ~ SMDSE, # sampling error (square root of the sampling variance SMDV);
       test = "t",
       method = "REML", 
       data = dat_sensory2)

See Interpretations of the small-study effect test below for how to interpret extended Egger’s regression’s results.

5.2 Interpretations of the small-study effect test

The printed output of a typical extended Egger’s regression model looks like this:


Multivariate Meta-Analysis Model (k = 17; method: REML)

  logLik  Deviance       AIC       BIC      AICc 
-16.8403   33.6805   41.6805   44.5127   45.6805   

Variance Components:

            estim    sqrt  nlvls  fixed    factor 
sigma^2.1  0.2730  0.5225     12     no  Study_ID 
sigma^2.2  0.1505  0.3879     17     no    Obs_ID 

Test for Residual Heterogeneity:
QE(df = 15) = 51.2431, p-val < .0001

Test of Moderators (coefficient 2):
F(df1 = 1, df2 = 15) = 2.5477, p-val = 0.1313

Model Results:

         estimate      se     tval  df    pval    ci.lb   ci.ub 
intrcpt    0.7127  0.7201   0.9897  15  0.3380  -0.8222  2.2476    
SMDSE     -2.8426  1.7809  -1.5962  15  0.1313  -6.6385  0.9533    

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Under Model Results, we can see that the slope of the extended Egger’s regression is SMDSE = -2.8426, which is not statistically different from zero (t-value = -1.5962 and p-val = 0.1313). This means that smaller studies (with larger sampling errors [\(se_{[i]}=\sqrt{\nu_{[i]}}\)]) do not report larger effect sizes (Figure S5), i.e., no small-study effect exists in this dataset. The non-significant slope (SMDSE = -2.8426) also indicates that the data are symmetrically distributed in the funnel plot (Figure S6).

Figure S5
A bubble plot showing the relationship between effect size estimates (SMD) and their sampling error (SE) can be used to detect the small-study effect (funnel plot asymmetry). This bubble plot can be made using bubble_plot() function in orchaRd package (Nakagawa et al. (2021)).
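A plot like Figure S5 could be produced along these lines (a minimal sketch; the model object name egger_SMD is hypothetical, and bubble_plot() argument names may differ between orchaRd versions):

egger_SMD <- rma.mv(yi = SMD, V = SMDV,
                    random = list(~1 | Study_ID, ~1 | Obs_ID),
                    mods = ~ SMDSE, test = "t", method = "REML",
                    data = dat_sensory2)                       # same Egger-type model as above, saved to an object
bubble_plot(egger_SMD, mod = "SMDSE", group = "Study_ID",
            xlab = "Sampling error (SE)", ylab = "SMD")        # effect sizes against their sampling error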

Figure S6
Visual inspection of the funnel plot to identify the small-study effect.

The interpretations of the other outputs of the extended Egger’s regression model are the same as those for a multilevel meta-regression, including:

  • Multivariate Meta-Analysis Model

  • Variance Components

  • Test for Residual Heterogeneity

  • Test of Moderators

  • Model Results

You can refer to Interpretations of a multilevel meta-analytic model (Section 2.3) for thorough descriptions of these interpretations.

Of note, when SMD is used as the effect size metric, Egger’s test for the small-study effect may produce false-positive results. If you look at the formula used to compute the SE of SMD (which can be found elsewhere), you will realise that SMD is artifactually correlated with its SE; that is, the SE of SMD depends on SMD itself. Nakagawa et al. (2022) propose using an adapted SE, based on the effective sample size, when using Egger’s regression to test the small-study effect for SMD:

\[ \sqrt{\frac {1} { \tilde{N} }} = \sqrt{\frac {1} { N_\text{T}} + \frac{1}{N_\text{C}}}, \]

Therefore, it is advisable to conduct a sensitivity analysis using this adapted SE to check the robustness of the small-study test. This can easily be done by replacing the SE with the adapted SE in the extended Egger’s regression model (a sketch of the corresponding code is given below, followed by the output). Under Model Results, we can see that the slope for the adapted SE (SMDSE_c = -1.2521) is still non-significant (t-value = -0.6042 and p-val = 0.5548), which indicates the robustness of the small-study test.
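The adapted SE and the corresponding Egger-type regression could be computed along these lines (a minimal sketch, assuming the group sample-size columns used in the escalc() calls above):

dat_sensory2$SMDSE_c <- sqrt(1 / dat_sensory2$SSRI_Nadj + 1 / dat_sensory2$Vehicle_Nadj) # adapted, sample-size-based SE
rma.mv(yi = SMD, V = SMDV,
       random = list(~1 | Study_ID, ~1 | Obs_ID),
       mods = ~ SMDSE_c, # adapted SE as the moderator;
       test = "t", method = "REML", data = dat_sensory2)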


Multivariate Meta-Analysis Model (k = 17; method: REML)

  logLik  Deviance       AIC       BIC      AICc 
-17.8446   35.6892   43.6892   46.5214   47.6892   

Variance Components:

            estim    sqrt  nlvls  fixed    factor 
sigma^2.1  0.2387  0.4885     12     no  Study_ID 
sigma^2.2  0.2293  0.4789     17     no    Obs_ID 

Test for Residual Heterogeneity:
QE(df = 15) = 52.2585, p-val < .0001

Test of Moderators (coefficient 2):
F(df1 = 1, df2 = 15) = 0.3650, p-val = 0.5548

Model Results:

         estimate      se     tval  df    pval    ci.lb   ci.ub 
intrcpt    0.0728  0.7986   0.0912  15  0.9285  -1.6293  1.7750    
SMDSE_c   -1.2521  2.0724  -0.6042  15  0.5548  -5.6693  3.1651    

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5.3 Construct a multilevel regression model to test the time-lag bias

Time-lag bias occurs when statistically significant ("positive") results tend to be published earlier than statistically non-significant ("negative") results, leading to a decline in reported effect sizes over time (i.e., a decline effect).

Time-lag bias has very important implications; for example, the instability of the cumulative evidence in a given field poses a threat to policy-making and (pre)clinical decision-making. However, this form of publication bias has rarely been tested in practice in animal meta-analyses. The test for time-lag bias is very straightforward: you only need to add the publication year of the effect size as a continuous moderator in a multilevel meta-regression model (equivalent to replacing the sampling error (\(se_{[i]}=\sqrt{\nu_{[i]}}\)) with publication year):

rma.mv(yi = SMD, 
       V = SMDV, 
       random = list(~1 | Study_ID, ~1 | Obs_ID), 
       mods = ~ pub_year, # publication year of the effect sizes (in this case, SMD);
       test = "t",
       method = "REML", 
       data = dat_sensory2) 

See Interpretations of the time-lag bias test below for how to interpret time-lag bias test’s results.

5.4 Interpretations of the time-lag bias test

The printed output of a typical time-lag bias test looks like this:


Multivariate Meta-Analysis Model (k = 17; method: REML)

  logLik  Deviance       AIC       BIC      AICc 
-17.9355   35.8711   43.8711   46.7033   47.8711   

Variance Components:

            estim    sqrt  nlvls  fixed    factor 
sigma^2.1  0.2666  0.5163     12     no  Study_ID 
sigma^2.2  0.2168  0.4656     17     no    Obs_ID 

Test for Residual Heterogeneity:
QE(df = 15) = 51.9927, p-val < .0001

Test of Moderators (coefficient 2):
F(df1 = 1, df2 = 15) = 0.0002, p-val = 0.9898

Model Results:

           estimate        se     tval  df    pval      ci.lb     ci.ub 
intrcpt    -1.9368  119.1649  -0.0163  15  0.9872  -255.9309  252.0573    
pub_year    0.0008    0.0592   0.0130  15  0.9898    -0.1254    0.1269    

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Similar to the small-study effect test, under Model Results we can see that the regression slope is pub_year = 0.0008, which is very small and not statistically different from zero (t-value = 0.0130 and p-val = 0.9898). This means that studies with statistically significant findings do not tend to be published earlier than those with "negative" results, i.e., no time-lag bias exists in this dataset. Figure S7 clearly shows that the estimates of SMD remain roughly consistent across publication years.

Figure S7
A bubble plot showing the relationship between effect size estimates (SMD) and their publication year can be used to detect the time-lag bias (aka decline effect). This bubble plot can be made using bubble_plot() function in orchaRd package (Nakagawa et al. (2021)).

The interpretations of the other outputs of this meta-regression model are the same as for a multilevel meta-regression, including (1) Multivariate Meta-Analysis Model; (2) Variance Components; (3) Test for Residual Heterogeneity; (4) Test of Moderators; (5) Model Results. You can refer to the Interpretations of a multilevel meta-analytic model section (Section 2.3) for thorough descriptions of these interpretations.

Section 6 – Animal meta-analysis using emerging effect sizes

There are three underappreciated standardized effect sizes in the practice of animal data meta-analyses:

  • lnRR

The log-transformed response ratio, which uses the natural logarithm of the ratio of means between two groups to quantify the mean difference or average treatment effect.

  • lnVR

The log-transformed variability ratio, which can quantify the difference in variance (standard deviation) around the mean between two groups and estimate inter-individual variability between two groups (heterogeneity of a treatment effect).

  • lnCVR

The log-transformed coefficient of variation ratio, which is a mean-adjusted version of lnVR. In contrast to lnVR, lnCVR controls for the indirect impact of mean on its variability (i.e., accounting for the mean-variance relationship).

In our main text, we elaborate on the formulas, statistical merits and (neuro)biological meaning of these underutilised effect sizes in the context of animal data meta-analysis. Here we show how to calculate these effect sizes and their sampling variances and how to conduct a meta-analysis on them.

6.1 Meta-analysis of mean (lnRR)

lnRR and its corresponding sampling variance can be computed via the escalc() function. Positive values of lnRR indicate that perinatal SSRI exposure has a positive effect on offspring’s sensory processing function. The R syntax is similar to the calculation of SMD shown in Section 1:

lnRR <- escalc(measure = "ROM", # "ROM" means the ratio of mean differences (log scale) - lnRR;
               m1i = SSRI_Mean, # mean of treatment group (SSRI);
               m2i = Vehicle_Mean, # mean of control group (Vehicle);
               sd1i = SSRI_SD, # standard deviation of treatment group;
               sd2i = Vehicle_SD, # standard deviation of control group; 
               n1i = SSRI_Nadj, # sample size of treatment group; 
               n2i = Vehicle_Nadj, # sample size of control group; 
               data = dat_sensory, # dataset of our work example;
               append = FALSE)

We can use rma.mv() to fit a 3-level meta-analytic model with lnRR as an effect size:

mod_multilevel_lnRR <- rma.mv(yi = lnRR, # lnRR is specified as the effect size measure;
                              V = lnRRV, # sampling variance of lnRR;
                              random = list(~1 | Study_ID, # a random effect (clustering variable) that makes the true effect sizes vary across studies (variation between studies);
                                           ~1 | Obs_ID), # a random effect that makes the true effect sizes vary within studies (variation within studies);
                              test = "t",
                              method = "REML", 
                              data = dat_sensory2)

Let’s have a look at the model results:


Multivariate Meta-Analysis Model (k = 16; method: REML)

  logLik  Deviance       AIC       BIC      AICc   
 -0.4500    0.9001    6.9001    9.0242    9.0819   

Variance Components:

            estim    sqrt  nlvls  fixed    factor 
sigma^2.1  0.0271  0.1646     11     no  Study_ID 
sigma^2.2  0.0098  0.0989     16     no    Obs_ID 

Test for Heterogeneity:
Q(df = 15) = 74.9774, p-val < .0001

Model Results:

estimate      se     tval  df    pval    ci.lb   ci.ub    
 -0.1060  0.0667  -1.5902  15  0.1326  -0.2481  0.0361    

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Under Model Results we can see that the pooled lnRR is not statistically significant (\(\beta_{0}\) = -0.106, 95% CIs = [-0.248 to 0.036], p-value = 0.133), which aligns with the model estimate for SMD effect sizes. By definition, lnRR expresses a multiplicative effect (a mean ratio on the log scale), which differs from the additive effect expressed by SMD. We can therefore ease the interpretation of lnRR by back-transforming it to the original ratio-of-means scale: exp(-0.106) - 1 = -0.10. This means that SSRI exposure reduces sensory function by about 10%, albeit not statistically significantly.
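For instance, the point estimate and its confidence limits can be back-transformed to percentage change in one line (a minimal sketch using the values reported in the output above):

# Back-transform the pooled lnRR and its 95% CI to percentage change in the mean
(exp(c(-0.1060, -0.2481, 0.0361)) - 1) * 100
# roughly -10%, -22% and +4% change in the mean relative to the control group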

We contend that it is good practice to report both additive effects (using SMD) and multiplicative effects (using lnRR) when conducting an animal meta-analysis of means. Moreover, the dual use of these two types of effect size can serve as a sensitivity analysis, examining the robustness of the meta-analytic evidence.

6.2 Meta-analysis of variation (lnVR and lnCVR)

The calculation of the variance-based effect sizes (lnVR and lnCVR) is also implemented in the escalc() function. Positive values of lnVR or lnCVR indicate that perinatal SSRI exposure increases the inter-individual variability in offspring’s sensory function. You may wonder which one to use when quantifying the variance effect. Basically, it depends on your biological question (Nakagawa et al. (2015)). But there is a general rule you can follow: when there is a mean-variance relationship (i.e., a larger mean is accompanied by a larger variance), lnCVR is preferred over lnVR. This is because lnCVR represents a mean-adjusted variance effect (relative variance), whereas lnVR depends solely on the variance (absolute variance). In Ramsteijn et al. (2020)’s dataset, there is a clear mean-variance relationship in both the SSRI exposure group and the vehicle group (a quick visual check is sketched after the next code block). Therefore, we choose to calculate lnCVR and its sampling variance:

lnCVR <- escalc(measure = "CVR", # "CVR" means the log coefficient of variation ratio - lnCVR; lnVR can be computed with measure = "VR";
               m1i = SSRI_Mean, # mean of treatment group (SSRI);
               m2i = Vehicle_Mean, # mean of control group (Vehicle);
               sd1i = SSRI_SD, # standard deviation of treatment group;
               sd2i = Vehicle_SD, # standard deviation of control group; 
               n1i = SSRI_Nadj, # sample size of treatment group; 
               n2i = Vehicle_Nadj, # sample size of control group; 
               data = dat_sensory, # dataset of our work example;
               append = FALSE)
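As an aside, the mean-variance relationship mentioned above can be eyeballed with a quick sketch like the one below (assuming the means and SDs in dat_sensory are positive so that the log transformation is defined):

# A quick visual check of the mean-variance relationship (log scale), per group
par(mfrow = c(1, 2))
plot(log(dat_sensory$SSRI_Mean), log(dat_sensory$SSRI_SD),
     xlab = "log(mean), SSRI group", ylab = "log(SD), SSRI group")
plot(log(dat_sensory$Vehicle_Mean), log(dat_sensory$Vehicle_SD),
     xlab = "log(mean), vehicle group", ylab = "log(SD), vehicle group")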

Let’s use rma.mv() to fit a 3-level meta-analytic model to lnCVR:

mod_multilevel_lnCVR <- rma.mv(yi = lnCVR, # lnCVR is specified as the effect size measure;
                               V = lnCVRV, # sampling variance of lnCVR;
                               random = list(~1 | Study_ID, # a random effect (clustering variable) that makes the true effect sizes vary across studies (variation between studies);
                                             ~1 | Obs_ID), # a random effect that makes the true effect sizes vary within studies (variation within studies);
                               test = "t",
                               method = "REML", 
                               data = dat_sensory2)

Meta-analysis of variation suggests that SSRI exposure tends to increase the inter-individual variability in sensory function, although the effect is not statistically significant (pooled lnCVR: \(\beta_{0}\) = 0.16, 95% CIs = [-0.159 to 0.478], p-value = 0.298). If real, such a pattern would mean that SSRI exposure affects only some animals (an individual-specific effect), and further research on the sources of variability is warranted.


Multivariate Meta-Analysis Model (k = 14; method: REML)

  logLik  Deviance       AIC       BIC      AICc   
-11.2920   22.5841   28.5841   30.2789   31.2507   

Variance Components:

            estim    sqrt  nlvls  fixed    factor 
sigma^2.1  0.0000  0.0000     10     no  Study_ID 
sigma^2.2  0.1900  0.4359     14     no    Obs_ID 

Test for Heterogeneity:
Q(df = 13) = 38.4475, p-val = 0.0002

Model Results:

estimate      se    tval  df    pval    ci.lb   ci.ub    
  0.1599  0.1475  1.0845  13  0.2979  -0.1587  0.4786    

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Section 7 – Select an appropriate random-effect structure

In this section, we use a more complex animal dataset to show how to select an appropriate random-effect structure and, therefore, to account for various types of non-independence and heterogeneity at different levels. This dataset comes from Lagisz et al. (2020), which examined cognitive bias across 22 animal species using 71 studies with 459 effect sizes.

7.1 Concepts and rationale

When specifying a multilevel meta-analytic model, a practical question to consider is which study-level variables to use as random effects. You may wonder what a ‘random effect’ is. There are many formal definitions (Andrew Gelman has a nice blog post on this). Here, we give a non-statistical definition in the context of a meta-analysis: treating a study-level variable (e.g., intervention type, dose or species) as a random effect means that we regard its observed levels as a sample from a larger population of possible levels, and that it contributes noise (variation) around the overall mean. For example, including animal strains/species as a random effect allows us to estimate how much variance exists among strains/species. In contrast, when treating strains/species as a fixed effect, we assume that the strain/species levels are identical across different studies and have a systematic effect on the mean (e.g., we ask questions like: does one species respond more to an intervention than others?). A general rule of thumb for choosing a random-effect factor is that it should contain at least five levels, so that the corresponding variance can be estimated properly (Bolker et al. (2009)).

7.2 Choose a random-effect structure based on AIC

Theoretically, any cluster/group variable with more than five levels can be a random-effect candidate. Should we then include every cluster/group variable with more than five levels as a random effect in a multilevel model? The answer is NO! For a good random-effect candidate, you first need to think about whether it is a plausible true source of heterogeneity, based on your biological expertise. Then you need to investigate whether this random-effect candidate improves the quality of the multilevel model. In this respect, you can resort to information-theoretic approaches alongside likelihood methods (likelihood ratio tests). Here, we use Lagisz et al. (2020)’s dataset to show how to decide on the best random-effects structure based on the Akaike Information Criterion (AIC).

This can be easily done by:

    1. specifying the argument random in the rma.mv() function with different random-effects structures;
    2. using anova.rma() to provide a full versus reduced model comparison in terms of model fit statistics and a likelihood ratio test (log-likelihood, deviance, AIC, BIC, and AICc).

Table S4
Load data of Lagisz et al. (2020).

Lagisz et al. (2020)’s dataset has three random-effects candidates:

Effect size identity (EffectID) - unique ID for each pairwise comparison used to calculate effect sizes; modelling it as a random-effect means to allow true effect sizes to vary within studies, such that the model can estimate within-study (effect size) level variance (\(\sigma_{within}^2\)) and partition within-study (effect size) level heterogeneity (\(I^2_{within}\)).

Study identity (ArticleID) - unique ID for each extracted original experimental paper; modelling it as a random-effect means to allow true effect sizes to vary across studies, such that the model can estimate between-study level variance (\(\sigma_{between}^2\)) and partition between-study level heterogeneity (\(I^2_{between}\)).

Species identity (Species_Latin) - Latin name of an animal species used in the experiment; modelling it as a random-effect means to allow true effect sizes to vary across species, such that the model can estimate species level variance (\(\sigma_{species}^2\)) and partition species level heterogeneity (\(I^2_{species}\)).
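Before fitting any models, a quick way to apply the ‘at least five levels’ rule of thumb to these candidates is to count their levels; a minimal sketch, assuming the data loaded above is stored as dat_cognition (as used in the models below) and contains the column names listed here:

# Count the number of levels of each random-effect candidate (rule of thumb: at least five levels)
sapply(dat_cognition[, c("Species_Latin", "ArticleID", "EffectID")],
       function(x) length(unique(x)))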

Let’s fit a null model without any random-effects candidates as the default reduced model:

meta.null <- rma.mv(yi = d, 
                    V = Vd, 
                    data = dat_cognition, 
                    method = 'ML') # note that when using AIC criteria, maximum likelihood (ML) rather than restricted maximum likelihood (REML) is preferred (for reasons see @anderson2008model)

Use the argument random to specify EffectID as a random-effects term to account for within-study variation (\(\sigma_{within}^2\)):

meta.effectID <- rma.mv(yi = d, 
                        V = Vd, 
                        random = ~ 1 | EffectID, # the random effect EffectID allows effect sizes vary within studies;
                        data = dat_cognition, 
                        method = 'ML')

Examine whether EffectID improves model quality via anova.rma function:

anova.rma(meta.effectID, meta.null)

This will provide full (model with EffectID as a random-effects term) versus reduced model (null model without any random-effects term) comparison in terms of model fit statistics and a likelihood ratio test (log-likelihood, deviance, AIC, BIC, and AICc values):


        df       AIC       BIC      AICc    logLik      LRT   pval        QE 
Full     2 1182.1894 1190.4475 1182.2157 -589.0947                 1465.2543 
Reduced  1 1566.3076 1570.4367 1566.3164 -782.1538 386.1182 <.0001 1465.2543 
        tau^2 
Full       NA 
Reduced    NA 

We can see that adding EffectID as a random-effects term (meta.effectID; Full) results in a much lower AIC value (1182.1894) compared with the null model (meta.null; Reduced). The log-likelihood ratio test shows that adding EffectID as a random-effects term can significantly improve the model fit (LRT = 386.1182, pval = < 0.0001).

Let’s examine the importance of ArticleID using the same procedure. Specify ArticleID as a random-effects term to account for between-study variation (\(\sigma_{between}^2\)):

meta.studyID <- rma.mv(yi = d, 
                       V = Vd, 
                       random = ~ 1 | ArticleID, # the random effect ArticleID allows effect sizes vary between studies;
                       data = dat_cognition, 
                       method = 'ML')

Examine whether ArticleID improves model quality via anova.rma function:

anova.rma(meta.studyID, meta.null)

        df       AIC       BIC      AICc    logLik      LRT   pval        QE 
Full     2 1353.8067 1362.0648 1353.8330 -674.9034                 1465.2543 
Reduced  1 1566.3076 1570.4367 1566.3164 -782.1538 214.5009 <.0001 1465.2543 
        tau^2 
Full       NA 
Reduced    NA 

The value of AIC and log-likelihood ratio test show that adding ArticleID as a random-effects term can significantly improve the model fit (LRT = 214.5009, pval = < 0.0001).

Let’s incorporate both ArticleID and EffectID as the random-effects terms:

meta.study.effectID <- rma.mv(yi = d, 
                              V = Vd, 
                              random = list(~ 1 | ArticleID, ~ 1 | EffectID), # a nested random-effects structure (multiple effect sizes nested within studies) is defined to account for non-independence due to clustering; an alternative syntax is: random = ~ 1 | ArticleID/EffectID;
                              data = dat_cognition, 
                              method = 'ML')

By comparing meta.study.effectID and meta.studyID, we can investigate whether the model with both ArticleID and EffectID as random-effects terms (which defines the nested random-effects structure) is “better” than the model with only ArticleID as the random-effects term.
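Following the same anova.rma() pattern as above, this comparison can be run with:

anova.rma(meta.study.effectID, meta.studyID)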


        df       AIC       BIC      AICc    logLik      LRT   pval        QE 
Full     3 1149.5653 1161.9525 1149.6181 -571.7827                 1465.2543 
Reduced  2 1353.8067 1362.0648 1353.8330 -674.9034 206.2414 <.0001 1465.2543 
        tau^2 
Full       NA 
Reduced    NA 

The above fit statistics and information criteria corroborate our claim in the main text: a multilevel model should incorporate effect size identity and study identity as the default random-effects structure when performing an animal meta-analysis.

Next, let’s explore whether animal species identity is an important random-effects term. First, add Species_Latin as a random-effects term to meta.study.effectID via the argument random:

meta.species.study.effectID <- rma.mv(yi = d, 
                                      V = Vd, random = list(~ 1 | Species_Latin, ~ 1 | ArticleID, ~ 1 | EffectID), # the random effect *Species_Latin* allows effect sizes vary between species;
                                      data = dat_cognition, 
                                      method = 'ML', sparse = TRUE)

Then compare it to meta.study.effectID using anova.rma:
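As with the earlier comparisons, the call would be:

anova.rma(meta.species.study.effectID, meta.study.effectID)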


        df       AIC       BIC      AICc    logLik    LRT   pval        QE 
Full     4 1151.5653 1168.0815 1151.6535 -571.7827               1465.2543 
Reduced  3 1149.5653 1161.9525 1149.6181 -571.7827 0.0000 1.0000 1465.2543 
        tau^2 
Full       NA 
Reduced    NA 

We can see that adding animal species identity as a random-effects term does not contribute more information to the multilevel model: the AIC of meta.species.study.effectID (1151.5653; full) is larger than that of meta.study.effectID (1149.5653; reduced). This indicates that cognitive bias is generally consistent across animal species (i.e., only a small amount of heterogeneity exists between species). We can confirm this point by computing the species-level heterogeneity \(I^2_{species}\):
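One way to obtain these values is the i2_ml() function in the orchaRd package (a minimal sketch; the values printed below correspond to roughly 73.8%, ~0%, 25.9% and 47.8%):

# Multilevel I^2: total heterogeneity and the share at each random-effect level (orchaRd)
orchaRd::i2_ml(meta.species.study.effectID)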

        I2_Total I2_Species_Latin     I2_ArticleID      I2_EffectID 
    7.377069e+01     2.343703e-07     2.592653e+01     4.784416e+01 

Additionally, we will use this dataset to show the advantage of the orchard plot over the forest plot when the number of effect sizes (k) is very large. Let’s use the forest() function and the orchard_plot() function to visualise the results of the multilevel model with the selected random-effects structure (a sketch of the two calls is given below). From Figure S8, we can see that the forest plot does not work well when visualising a large dataset. In contrast, the orchard plot clearly shows each individual data point and the overall estimate (Figure S9).
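A minimal sketch of the two calls (the orchard_plot() argument names follow orchaRd 2.0 and may differ in other versions):

# Conventional forest plot (metafor) - becomes unreadable with 459 effect sizes
forest(meta.study.effectID)

# Orchard plot (orchaRd) - shows every effect size plus the overall estimate
orchard_plot(meta.study.effectID, mod = "1", group = "ArticleID",
             xlab = "Standardised mean difference (SMD)")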

Figure S8
Forest plot showing 459 effect sizes used in a meta-analytic model quantifying animal cognitive bias (data from Lagisz et al. (2020)).

Figure S9
Orchard plot (forest-like plot) showing 459 effect sizes used in a meta-analytic model quantifying animal cognitive bias (data from Lagisz et al. (2020)). You can use help(orchard_plot) to look at the corresponding arguments and add more code to make it more elegant (Nakagawa et al. (2021)).

Section 8 – Fit multivariate multilevel meta-regression models

Adding multiple moderator variables as fixed effects leads to a multi-moderator multilevel meta-regression. In this section, we fit a multi-moderator multilevel meta-regression using the complex animal dataset from Lagisz et al. (2020).

8.1 Concepts and rationale

In contrast to the single-moderator meta-regression (as illustrated earlier), a multi-moderator multilevel meta-regression can provide richer meta-scientific insights, for example, by: (1) investigating the interactive effect between two moderator variables, and (2) correcting for publication bias and estimating the bias-adjusted overall effect size (Kvarven et al. (2020)).

A more general form of multilevel meta-regression model can be written as:
\[ ES_{[i]} = \beta_{0}' + \sum \beta_{mod}x_{[m]} + \mu_{between[j]} + \mu_{within[i]} + e_{[i]}, (12)\\ \mu_{between[j]} \sim N(0,\sigma_{between}^2)\\ \mu_{within[i]} \sim N(0,\sigma_{within}^2)\\ e_{[i]} \sim N(0,\nu_{[i]}) \]

But for an easy start, let’s fit the simplest form of a multi-moderator multilevel meta-regression model (two moderator variables without any interactive term; see next section):
\[ ES_{[i]} = \beta_{0}' + \beta_{1}x_{within[i]} + \beta_{2}x_{between[j]} + \mu_{between[j]} + \mu_{within[i]} + e_{[i]}, (14)\\ \mu_{between[j]} \sim N(0,\sigma_{between}^2)\\ \mu_{within[i]} \sim N(0,\sigma_{within}^2)\\ e_{[i]} \sim N(0,\nu_{[i]}) \]

8.2 Fit a multilevel meta-regression to examine the interactive effects

Lagisz et al. (2020) has tested five moderator variables. We choose two of them for illustrative purposes:

animal sex (Sex) - sex of tested animals in the compared groups with three levels: female = only female animals were used; male = only male animals were used; both = both female and male animals were used.

test task type (TaskType) - type of the task used during behavioural trials with two levels: active choice = go/go tasks in which an animal is required to make an active response to cues perceived as positive and to cues perceived as negative; go/no-go = tasks in which an animal is required to suppress a response to cues perceived as negative and actively respond only to cues perceived as positive.

Let’s create a cross tabulation (contingency table) to display the combination of factor levels of Sex and TaskType (Table S5).

Table S5
A cross tabulation displaying the combinations of factor levels of Sex and TaskType.

Counts:
            active choice   go/no-go
female                 21        204
male                   30         88
mixed-sex              19         97

Row percentages (%):
            active choice   go/no-go   Total     n
female                9.3       90.7     100   225
male                 25.4       74.6     100   118
mixed-sex            16.4       83.6     100   116
All                  15.3       84.7     100   459

Column percentages (%):
            active choice   go/no-go     All
female               30.0       52.4    49.0
male                 42.9       22.6    25.7
mixed-sex            27.1       24.9    25.3
Total               100.0      100.0   100.0
n                    70.0      389.0   459.0

Pearson residuals:
            active choice   go/no-go
female              -2.27       0.96
male                 2.83      -1.20
mixed-sex            0.31      -0.13

X-squared = 15.6584, df = 2, p = 0.0003979

By specifying the argument mods with the formula ~ Sex + TaskType - 1 (remember the trick of -1?), we can fit a multilevel multi-moderator meta-regression model with these two moderators:

maregression_sex.task <- rma.mv(yi = d, 
                                V = Vd, 
                                random = list(~ 1 | ArticleID, ~ 1 | EffectID), 
                                mods = ~ Sex + TaskType - 1, # model animal sex and test task type simultaneously;
                                data = dat_cognition, 
                                method = 'REML', # remember to change the method used to estimate variance back to restricted maximum likelihood (REML); 
                                sparse = TRUE)

Multivariate Meta-Analysis Model (k = 459; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-563.4496  1126.8992  1138.8992  1163.6210  1139.0867   

Variance Components:

            estim    sqrt  nlvls  fixed     factor 
sigma^2.1  0.1408  0.3752     71     no  ArticleID 
sigma^2.2  0.2614  0.5112    459     no   EffectID 

Test for Residual Heterogeneity:
QE(df = 455) = 1438.1072, p-val < .0001

Test of Moderators (coefficients 1:4):
QM(df = 4) = 23.2763, p-val = 0.0001

Model Results:

                     estimate      se     zval    pval    ci.lb   ci.ub      
I(Sex)female           0.2303  0.1679   1.3718  0.1701  -0.0987  0.5593      
I(Sex)male             0.5488  0.1543   3.5559  0.0004   0.2463  0.8512  *** 
I(Sex)mixed-sex        0.3679  0.1726   2.1315  0.0330   0.0296  0.7062    * 
I(TaskType)go/no-go   -0.1609  0.1588  -1.0128  0.3112  -0.4722  0.1505      

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Under Model Results, we can see that for the moderator Sex, only “male” and “mixed-sex” show a statistically significant effect on cognitive bias (\(\beta_{0}\) = 0.549, 95% CIs = [0.246 to 0.851] and \(\beta_{0}\) = 0.368, 95% CIs = [0.030 to 0.706], respectively). Below we show that “female” also has a statistically significant effect on cognitive bias once the interaction with TaskType is modelled.

Suppose that the relationship between animal sex and effect size estimates differs for different types of behavioral assay. We can test this hypothesis by modelling the interaction between the two moderator variables:

\[ ES_{[i]} = \beta_{0}' + \beta_{1}x_{within[i]} + \beta_{2}x_{between[j]} + \beta_{3}x_{within[i]}x_{between[j]} + \mu_{between[j]} + \mu_{within[i]} + e_{[i]}, (15)\\ \mu_{between[j]} \sim N(0,\sigma_{between}^2)\\ \mu_{within[i]} \sim N(0,\sigma_{within}^2)\\ e_{[i]} \sim N(0,\nu_{[i]}) \]

You can use the argument mods in rma.mv() to define the interaction between animal sex (Sex) and types of behavioral assay (TaskType). The core syntax is to use * to connect the two moderator variables (mod = ~ Sex*TaskType -1):

maregression_interaction <- rma.mv(yi = d, 
                                V = Vd, 
                                random = list(~ 1 | ArticleID, ~ 1 | EffectID), 
                                mods = ~ Sex * TaskType - 1, # model the interaction;
                                data = dat_cognition, 
                                method = 'REML', 
                                sparse = TRUE)

The printed output of the multilevel model with the interaction term is similar to that of a “normal” meta-regression shown earlier:


Multivariate Meta-Analysis Model (k = 459; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-558.1527  1116.3055  1132.3055  1165.2326  1132.6298   

Variance Components:

            estim    sqrt  nlvls  fixed     factor 
sigma^2.1  0.1294  0.3597     71     no  ArticleID 
sigma^2.2  0.2609  0.5108    459     no   EffectID 

Test for Residual Heterogeneity:
QE(df = 453) = 1416.6432, p-val < .0001

Test of Moderators (coefficients 1:6):
QM(df = 6) = 30.5065, p-val < .0001

Model Results:

                                     estimate      se     zval    pval    ci.lb    ci.ub      
I(Sex)female                           0.7931  0.2853   2.7798  0.0054   0.2339   1.3522   ** 
I(Sex)male                             0.4613  0.1915   2.4091  0.0160   0.0860   0.8366    * 
I(Sex)mixed-sex                        0.0331  0.2720   0.1218  0.9031  -0.5000   0.5662      
I(TaskType)go/no-go                   -0.7752  0.2985  -2.5973  0.0094  -1.3601  -0.1902   ** 
I(Sex)male:I(TaskType)go/no-go         0.7325  0.3724   1.9672  0.0492   0.0027   1.4623    * 
I(Sex)mixed-sex:I(TaskType)go/no-go    1.0063  0.4194   2.3993  0.0164   0.1843   1.8283    * 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Under Model Results, we now see that “female” has a statistically significant effect on cognitive bias (\(\beta_{0}\) = 0.793, 95% CIs = [0.234 to 1.352]). Moreover, the last two lines provide the model estimates for the interactive effects. We can see that the interaction between animal sex and type of behavioural assay is statistically significant. The two moderator variables, including their interaction, explain 10.6% of the variation among effect sizes (marginal \(R^2\)).
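The marginal and conditional \(R^2\) printed below can be obtained with the r2_ml() function (a minimal sketch, assuming the orchaRd package and the model object name used above):

# Marginal and conditional R^2 for the interaction model
orchaRd::r2_ml(maregression_interaction)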

   R2_marginal R2_conditional 
     0.1063452      0.4026581 

8.3 Correct for publication bias to estimate adjusted effect size

We have illustrated how to use univariate multilevel meta-regression to identify two forms of publication bias: small-study effect and decline effect (aka time-lag bias). A multi-moderator multilevel meta-regression with effect size’s sampling error (\(se_{[i]}=\sqrt{\nu_{[i]}}\)) and centred publication year (\(c(year_{[i]})\); see below for explanations) can be used to correct for the impacts of the two forms of publication bias:

\[ ES_{[i]} = \beta_{0}' + \beta_{1}se_{[i]} + \beta_{2}c(year_{[i]}) + \mu_{between[j]} + \mu_{within[i]} + e_{[i]}, (16)\\ \mu_{between[j]} \sim N(0,\sigma_{between}^2)\\ \mu_{within[i]} \sim N(0,\sigma_{within}^2)\\ e_{[i]} \sim N(0,\nu_{[i]}) \]

The intercept \(\beta_{0}'\) is the expected effect size estimate when the sampling error (\(se_{[i]}=\sqrt{\nu_{[i]}}\)) and the centred publication year (\(c(year_{[i]})\)) of an effect size are equal to zero. \(se_{[i]}\) = 0 means the precision of an effect size is infinitely large (precision = \(\frac {1} {se_{[i]}}\)), which indicates there is no small-study effect. \(c(year_{[i]})\) = 0 corresponds to the average publication year, at which no decline effect (i.e., time-lag bias) adjustment applies. Therefore, the intercept \(\beta_{0}'\) can be interpreted as the publication-bias-corrected effect size (Nakagawa et al. (2022); Stanley et al. (2017)).
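The two moderators can be created from the data beforehand; a minimal sketch, in which the publication-year column name (Year) is hypothetical:

# Sampling error and mean-centred publication year used as publication-bias moderators
dat_cognition$sed    <- sqrt(dat_cognition$Vd)                          # sampling error of SMD
dat_cognition$Year.c <- dat_cognition$Year - mean(dat_cognition$Year)  # centred year ('Year' is a hypothetical column name)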

We can fit model 16 to Lagisz et al. (2020)’s dataset using this syntax:

maregression_pb <- rma.mv(yi = d, 
                          V = Vd, 
                          random = list(~ 1 | ArticleID, ~ 1 | EffectID), 
                          mods = ~ sed + Year.c, # sed = the sampling error (square root of the sampling variance Vd); Year.c = the mean-centred publication year (mean year set to 0), such that the model intercept can be interpreted as a bias-corrected overall effect;
                          data = dat_cognition, 
                          method = 'REML', # remember to change the method used to estimate variance back to restricted maximum likelihood (REML); 
                          sparse = TRUE)

Under Model Results, sed and Year.c are the regression slopes of the sampling error (\(se_{[i]}=\sqrt{\nu_{[i]}}\)) and the centred publication year (\(c(year_{[i]})\)): \(\beta_{1}\) and \(\beta_{2}\) in model 16, respectively. We can see that the slope for sed = 1.1561 is statistically significant (p-value = 0.0003), indicating a small-study effect (Figure S10). The slope for Year.c = -0.0002 is not statistically significant (p-value = 0.9931), suggesting that there is no decline effect (Figure S11).


Multivariate Meta-Analysis Model (k = 459; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-562.4264  1124.8529  1134.8529  1155.4653  1134.9862   

Variance Components:

            estim    sqrt  nlvls  fixed     factor 
sigma^2.1  0.2012  0.4486     71     no  ArticleID 
sigma^2.2  0.2415  0.4914    459     no   EffectID 

Test for Residual Heterogeneity:
QE(df = 456) = 1457.0561, p-val < .0001

Test of Moderators (coefficients 2:3):
QM(df = 2) = 12.8205, p-val = 0.0016

Model Results:

         estimate      se     zval    pval    ci.lb   ci.ub      
intrcpt   -0.2813  0.1566  -1.7957  0.0725  -0.5882  0.0257    . 
sed        1.1561  0.3232   3.5768  0.0003   0.5226  1.7897  *** 
Year.c    -0.0002  0.0217  -0.0087  0.9931  -0.0427  0.0423      

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Figure S10
The relationship between effect size estimates (SMD) and their sampling errors indicates existence of a small-study effect.

Figure S11
The relationship between effect size estimates (SMD) and their publication year indicates no decline effect (time-lag bias).

One point of note here: a simulation study indicates that \(\beta_{0}'\) tends to underestimate the “true” (bias-corrected) effect if there is a nonzero treatment effect (i.e., when \(\beta_{0}'\) is statistically significant at the 10% significance level; Stanley et al., 2017). In such a case, replacing the sampling error (\(se_{[i]}=\sqrt{\nu_{[i]}}\)) with the sampling variance (\(se_{[i]}^2=\nu_{[i]}\)) can reduce the bias of the estimated “true” (bias-corrected) effect:

\[ ES_{[i]} = \beta_{0}' + \beta_{1}se_{[i]}^2 + \beta_{2}c(year_{[i]}) + \mu_{between[j]} + \mu_{within[i]} + e_{[i]}, (17)\\ \mu_{between[j]} \sim N(0,\sigma_{between}^2)\\ \mu_{within[i]} \sim N(0,\sigma_{within}^2)\\ e_{[i]} \sim N(0,\nu_{[i]}) \]

Given that \(\beta_{0}'\) is statistically significant at the 10% significance level (p-value = 0.0725 < 0.1), we now fit model 17 to correct for the publication bias:

maregression_pb2 <- rma.mv(yi = d, 
                           V = Vd, 
                           random = list(~ 1 | ArticleID, ~ 1 | EffectID), 
                           mods = ~ Vd + Year.c, # replacing the sampling error (sed) with the sampling variance (Vd) avoids a downwardly biased estimate of the bias-corrected overall effect (i.e., the model intercept);
                           data = dat_cognition, 
                           method = 'REML', 
                           sparse = TRUE)

Let’s have a look at the model outputs:


Multivariate Meta-Analysis Model (k = 459; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-560.1260  1120.2520  1130.2520  1150.8645  1130.3853   

Variance Components:

            estim    sqrt  nlvls  fixed     factor 
sigma^2.1  0.1835  0.4284     71     no  ArticleID 
sigma^2.2  0.2456  0.4956    459     no   EffectID 

Test for Residual Heterogeneity:
QE(df = 456) = 1455.6896, p-val < .0001

Test of Moderators (coefficients 2:3):
QM(df = 2) = 16.6698, p-val = 0.0002

Model Results:

         estimate      se     zval    pval    ci.lb   ci.ub      
intrcpt   -0.0060  0.0848  -0.0712  0.9432  -0.1723  0.1602      
Vd         1.0668  0.2616   4.0783  <.0001   0.5541  1.5795  *** 
Year.c     0.0003  0.0210   0.0120  0.9904  -0.0410  0.0415      

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

intrcpt under Model Results is the estimate of model intercept (\(\beta_{0}'\)), indicating that the estimated bias-corrected overall effect is negligible (\(\beta_{0}'\) = -0.006, 95% CIs = [-0.172 to 0.160], p-value = 0.943).

In our main text, we also mention that it is best to account for the potential heterogeneity when testing publication bias (because high heterogeneity may invalidate publication bias test):

\[ ES_{[i]} = \beta_{0}' + \beta_{1}se_{[i]}^2 + \beta_{2}c(year_{[i]}) + \sum \beta_{mod}x_{[m]} + \mu_{between[j]} + \mu_{within[i]} + e_{[i]}, (18)\\ \mu_{between[j]} \sim N(0,\sigma_{between}^2)\\ \mu_{within[i]} \sim N(0,\sigma_{within}^2)\\ e_{[i]} \sim N(0,\nu_{[i]}) \]

Such a complex model can be fitted as:

maregression_pb3 <- rma.mv(yi = d, 
                           V = Vd, 
                           random = list(~ 1 | ArticleID, ~ 1 | EffectID), 
                           mods = ~ sed + Year.c + Sex + TaskType + CueTypeCat + ReinforcementCat - 1, # add other important moderator variables to accommodate the potential heterogeneity in the dataset;
                           data = dat_cognition, 
                           method = 'REML', 
                           sparse = TRUE)

The regression slopes of the sampling error (\(se_{[i]}=\sqrt{\nu_{[i]}}\)) and the centred publication year (\(c(year_{[i]})\)) are similar to those from the model that does not account for heterogeneity, although the exact values differ.


Multivariate Meta-Analysis Model (k = 459; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-545.5197  1091.0393  1119.0393  1176.4752  1120.0116   

Variance Components:

            estim    sqrt  nlvls  fixed     factor 
sigma^2.1  0.2205  0.4696     71     no  ArticleID 
sigma^2.2  0.2391  0.4889    459     no   EffectID 

Test for Residual Heterogeneity:
QE(df = 447) = 1411.0252, p-val < .0001

Test of Moderators (coefficients 1:12):
QM(df = 12) = 37.2702, p-val = 0.0002

Model Results:

                     estimate      se     zval    pval    ci.lb    ci.ub      
sed                    1.2319  0.3334   3.6946  0.0002   0.5784   1.8854  *** 
Year.c                 0.0123  0.0238   0.5165  0.6055  -0.0344   0.0590      
Sexfemale             -0.8188  0.3612  -2.2667  0.0234  -1.5268  -0.1108    * 
Sexmale               -0.4800  0.3480  -1.3791  0.1679  -1.1621   0.2022      
Sexmixed-sex          -0.6282  0.3666  -1.7135  0.0866  -1.3468   0.0904    . 
TaskTypego/no-go       0.2531  0.3004   0.8424  0.3995  -0.3357   0.8419      
CueTypeCatolfactory    0.1495  0.3845   0.3887  0.6975  -0.6041   0.9030      
CueTypeCatspatial     -0.0991  0.2345  -0.4225  0.6726  -0.5587   0.3605      
CueTypeCattactile      0.4913  0.3816   1.2875  0.1979  -0.2567   1.2393      
CueTypeCatvisual      -0.1417  0.2359  -0.6006  0.5481  -0.6041   0.3207      
ReinforcementCatR-P    0.2512  0.1731   1.4511  0.1468  -0.0881   0.5905      
ReinforcementCatR-R    0.3700  0.3182   1.1629  0.2449  -0.2536   0.9937      

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Section 9 – Construct variance-covariance matrix to account for correlated/non-independent errors

As shown in the main text, multiple effect sizes from the same study can result in two types of data non-independence: non-independence between effect size estimates (i.e., correlated effect sizes) and non-independence between sampling errors (i.e., correlated errors). A multilevel model with an appropriate random-effects structure can only capture the non-independence between effect size estimates. In this section, we show how to use a variance-covariance (VCV) matrix to capture the non-independence between sampling errors.

9.1 Concepts and rationale

Non-independence among sampling errors means that sampling errors within the same study (e.g., \(se_{1}\) and \(se_{2}\)) are correlated with each other (\(\rho_{12}\neq0\)), resulting in non-zero covariances (e.g., \(Cov[\nu_{1},\nu_{2}]=\rho_{12}se_{1}se_{2}\)):

\[ \boldsymbol{VCV} = \begin{bmatrix} se_{1}^2 & \rho_{12}se_{1}se_{2} & 0 \\ \rho_{12}se_{1}se_{2} & se_{2}^2 & 0 \\ 0 & 0 & se_{3}^2 \end{bmatrix}, (19) \]

A rough rule of thumb can be used to check whether the sampling errors in your dataset are non-independent: when the calculation of effect sizes repeatedly uses the same animal data (e.g., a shared control group), the effect sizes’ sampling errors will be correlated with each other (see Figure 4 in the main text for a visual summary). If your dataset has this type of dependency, the proper way to handle it is to construct a VCV matrix. In reality, constructing an exact VCV matrix is often challenging because the within-study sampling correlations (e.g., \(\rho_{12}\)) are rarely reported in primary animal studies. We next provide a simple solution for constructing an approximate VCV matrix (see Section 9.2 below for the implementation).

9.2 Impute a VCV matrix

Although we are not able to construct an exact VCV matrix (due to the missing \(\rho\)), we can impute a VCV matrix by assuming a constant sampling correlation \(\rho\) across studies (\(\rho_{ik}=\cdots=\rho_{jh}\equiv\rho\)). In our previously published meta-analyses, we often assume \(\rho\) to be 0.5, which seems to be a plausible and safe assumption in many situations (see the main text for explanations). Importantly, you should conduct a sensitivity analysis to explore the extent to which the model coefficients (e.g., \(\beta_{0}\)) are sensitive to the choice of \(\rho\). The imputation of a VCV matrix can be implemented with the impute_covariance_matrix() function in clubSandwich. The argument cluster specifies the cluster or grouping variable (in our case, study identity ArticleID) within which effect sizes’ sampling errors (\(se_{[i]}=\sqrt{\nu_{[i]}}\)) are treated as correlated. The argument r defines the assumed constant sampling correlation (\(\rho\)) between the \(se_{[i]}\).

We assume that the sampling errors (\(se_{[i]}=\sqrt{\nu_{[i]}}\)) within studies in Lagisz et al. (2020)’s dataset are correlated with \(\rho\) = 0.5. Let’s use impute_covariance_matrix() to impute a VCV matrix:

VCV <- impute_covariance_matrix(vi = dat_cognition$Vd, # sampling variance;
                                cluster = dat_cognition$ArticleID, # define the grouping variable within which the sampling variances are treated as correlated with each other at the assumed value;
                                r = 0.5) # the assumed correlation.
                                

Alternatively, using the vcalc() function in metafor version 3.4-0 or later (already released on CRAN), we can also easily approximate a VCV matrix:

VCV <- vcalc(vi = Vd, 
             cluster = ArticleID, 
             obs = EffectID, 
             data = dat_cognition, 
             rho = 0.5)

Now, let’s have a look at the constructed VCV matrix for two example studies, Bateson and Matheson (2007) and Walker et al. (2014):

           [,1]       [,2]       [,3]       [,4]      [,5]       [,6]
 [1,] 0.3142764 0.12419993 0.11445591 0.11453637 0.4753658 0.00000000
 [2,] 0.1241999 0.19633190 0.09046441 0.09052800 0.3757228 0.00000000
 [3,] 0.1144559 0.09046441 0.16673417 0.08342569 0.3462457 0.00000000
 [4,] 0.1145364 0.09052800 0.08342569 0.16696867 0.3464891 0.00000000
 [5,] 0.4753658 0.37572275 0.34624569 0.34648909 2.8761009 0.00000000
 [6,] 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.09573561
 [7,] 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.04954690
 [8,] 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.04481705
 [9,] 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.04488780
[10,] 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.04468489
            [,7]       [,8]       [,9]      [,10]
 [1,] 0.00000000 0.00000000 0.00000000 0.00000000
 [2,] 0.00000000 0.00000000 0.00000000 0.00000000
 [3,] 0.00000000 0.00000000 0.00000000 0.00000000
 [4,] 0.00000000 0.00000000 0.00000000 0.00000000
 [5,] 0.00000000 0.00000000 0.00000000 0.00000000
 [6,] 0.04954690 0.04481705 0.04488780 0.04468489
 [7,] 0.10256980 0.04638914 0.04646237 0.04625233
 [8,] 0.04638914 0.08392147 0.04202698 0.04183699
 [9,] 0.04646237 0.04202698 0.08418664 0.04190304
[10,] 0.04625233 0.04183699 0.04190304 0.08342723

We can see that the VCV matrix is a block-diagonal (symmetric) matrix, with the sampling variance (\(se_{[i]}^2=\nu_{[i]}\)) along the diagonal, and the covariance (\(Cov[\nu_{i},\nu_{k}]=\rho_{ik}se_{i}se_{k}\)) along the off-diagonal.

Next, re-run the multilevel intercept-only meta-analytic model using the constructed VCV matrix (VCV) with:

meta.study.effectID_VCV <- rma.mv(yi = d, 
                                  V = VCV, # replace sampling variance with a vcv matrix
                                  random = list(~ 1 | ArticleID, ~ 1 | EffectID), 
                                  data = dat_cognition, 
                                  method = 'REML')

Examine the model outputs of rma.mv object:


Multivariate Meta-Analysis Model (k = 459; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-584.4066  1168.8133  1174.8133  1187.1939  1174.8661   

Variance Components:

            estim    sqrt  nlvls  fixed     factor 
sigma^2.1  0.0000  0.0001     71     no  ArticleID 
sigma^2.2  0.4469  0.6685    459     no   EffectID 

Test for Heterogeneity:
Q(df = 458) = 2127.6189, p-val < .0001

Model Results:

estimate      se    zval    pval   ci.lb   ci.ub      
  0.1961  0.0516  3.8020  0.0001  0.0950  0.2972  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model estimates indicate that there is a statistically significant overall effect (\(\beta_0\) = 0.20, 95% CI = [0.10 to 0.30]). But keep in mind that it is important to conduct a sensitivity analysis to explore the extent to which this overall effect is sensitive to the \(\rho\) value assumed when constructing the VCV matrix (see Section 9.3 Sensitivity analysis below).

9.3 Sensitivity analysis

Let’s first assume a range of values for \(\rho\) (i.e., 0.3, 0.5, 0.7, 0.9):

rho_range <- c(0.3, 0.5, 0.7, 0.9)

Next, we loop over this range of \(\rho\) values, re-imputing the VCV matrix and re-fitting the model each time (the sensitivity analysis):

meta.study.effectID_VCV_range <- list()
for (i in 1:length(rho_range)) {
  VCV_range <- vcalc(vi = Vd, 
                     cluster = ArticleID, 
                     obs = EffectID, 
                     rho = rho_range[i], 
                     data = dat_cognition)
  meta.study.effectID_VCV_range[[i]] <- rma.mv(yi = d, 
                                               V = VCV_range, # VCV matrix with varying values of rho
                                               random = list(~ 1 | ArticleID,
                                                             ~ 1 | EffectID), 
                                               data = dat_cognition, 
                                               method = 'REML')
}

Nice! The overall effect (pooled effect size: \(\beta_{0}\) = 0.20, 95% CIs = [0.10 to 0.30], p-value = 0.0001) remains statistically significant after accounting for the covariance between effect sizes (i.e., correlated errors). Moreover, sensitivity analysis shows that the model estimates are robust to different values of \(\rho\) (Table S6).

Table S6 The use of sensitivity analysis to examine the sensitivity of model estimates to the choice of \(\rho\) values.

Sampling correlation (ρ)   Overall effect (pooled SMD)   Standard error   p-value   Lower CI   Upper CI
                     0.3                          0.21              0.1    0.0001       0.10       0.31
                     0.5                          0.20              0.1    0.0001       0.10       0.30
                     0.7                          0.19              0.1    0.0011       0.08       0.30
                     0.9                          0.18              0.1    0.0049       0.05       0.30
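For completeness, a minimal sketch of how such a summary table could be collated from the list of fitted models (using standard components of rma.mv objects, e.g., se, pval, ci.lb and ci.ub):

# Assemble the sensitivity-analysis results (one row per assumed rho)
sens_tab <- data.frame(
  rho      = rho_range,
  estimate = sapply(meta.study.effectID_VCV_range, function(m) as.numeric(coef(m))),
  se       = sapply(meta.study.effectID_VCV_range, function(m) m$se),
  pval     = sapply(meta.study.effectID_VCV_range, function(m) m$pval),
  ci.lb    = sapply(meta.study.effectID_VCV_range, function(m) m$ci.lb),
  ci.ub    = sapply(meta.study.effectID_VCV_range, function(m) m$ci.ub))
round(sens_tab, 3)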

Section 10 – Fit a multivariate meta-analytic model

10.1 Concepts and rationale

Multivariate models can model two or more different types of response variables/outcomes within studies, so they can account for the non-independence due to multivariate effect sizes (i.e., correlations among the true effects and among the sampling errors). It is worth noting that, when dealing with this non-independence, multilevel and multivariate models are identical if the correlations of the underlying true effects are assumed to follow a compound-symmetric structure (a common variance and a single correlation for the random effects). But multivariate models allow more flexibility in certain circumstances. For example, if we assume the random effects have an unstructured variance-covariance matrix, we can allow the amount of variability to differ between outcomes/response variables (e.g., for depression and anxiety) within studies. We acknowledge that the multilevel model cannot accommodate this. In such a case, we would rather call it a ‘multi-response’ or ‘multi-outcome’ model, as the term ‘multivariate’ can be confusing here.

10.2 Construct a multivariate meta-analytic model

Under the dependency condition where the same outcome (e.g., depression) is measured on multiple occasions, namely, where multiple effect sizes are different observations of the effect on the same outcome, the multilevel model we fitted in Section 2 – Fit a multilevel meta-analytic model to estimate overall pooled effect can be reformulated via a multivariate parameterization. Below, we illustrate this case. We do not illustrate how to use a multivariate model to account for the dependency of effect sizes due to different outcomes measured within the same study, because such a multivariate data structure is not typical in animal meta-analyses (but see Section 12 for a fabricated example).

A multivariate model with random effects taking on a compound-symmetric structure can be fitted with:
mod_multivariate <- rma.mv(yi = d, 
                           V = Vd, 
                           random = ~ EffectID | ArticleID, 
                           method = "REML", 
                           data = dat_cognition)

Multivariate Meta-Analysis Model (k = 459; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-570.6489  1141.2977  1147.2977  1159.6783  1147.3506   

Variance Components:

outer factor: ArticleID (nlvls = 71)
inner factor: EffectID  (nlvls = 459)

            estim    sqrt  fixed 
tau^2      0.4090  0.6395     no 
rho        0.3584             no 

Test for Heterogeneity:
Q(df = 458) = 1465.2543, p-val < .0001

Model Results:

estimate      se    zval    pval   ci.lb   ci.ub      
  0.2257  0.0581  3.8819  0.0001  0.1118  0.3397  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we said, all the parameter estimates and model-fit statistics are identical to those of the corresponding multilevel model; note that the multivariate tau^2 equals the sum of the two multilevel variance components (0.1466 + 0.2624 = 0.4090), and rho equals the proportion of that total attributable to the between-study component (0.1466 / 0.4090 = 0.358):


Multivariate Meta-Analysis Model (k = 459; method: REML)

   logLik   Deviance        AIC        BIC       AICc   
-570.6489  1141.2977  1147.2977  1159.6783  1147.3506   

Variance Components:

            estim    sqrt  nlvls  fixed              factor 
sigma^2.1  0.1466  0.3829     71     no           ArticleID 
sigma^2.2  0.2624  0.5123    459     no  ArticleID/EffectID 

Test for Heterogeneity:
Q(df = 458) = 1465.2543, p-val < .0001

Model Results:

estimate      se    zval    pval   ci.lb   ci.ub      
  0.2257  0.0581  3.8819  0.0001  0.1118  0.3397  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Section 11 – Apply cluster-robust inference methods

11.1 Concepts and rationale

There is an alternative method to account for statistical non-independence: robust variance estimation (RVE). As shown above, the multilevel model uses a multilevel random-effects structure to account for non-independence among effect sizes and a VCV matrix to account for non-independence among sampling errors. In contrast, RVE estimates the sampling covariances from the meta-analytic data itself and subsequently adjusts the associated standard errors (so-called robust standard errors) to avoid inflated Type I error rates (anticonservative p-values) for the model coefficients (e.g., the overall effect or pooled effect size, \(\beta_{0}\)).

11.2 Meta-analysis with RVE with the multilevel model framework

Some researchers recommend using the multilevel model to account for data non-independence, while others endorse the use of RVE. Rather than choosing between the two, we prefer to take advantage of both (a so-called hybrid strategy): implementing a multilevel model within the framework of RVE. By doing so, RVE provides robust significance tests and confidence intervals for the model coefficients, while the multilevel model provides extra model estimates, for example, the partitioning of variance components across levels (e.g., \(\sigma_{between}^2\) and \(\sigma_{within}^2\)). This hybrid strategy might sound difficult to implement, but the combination of the metafor and clubSandwich packages provides an elegant solution.
The implementation is straightforward. First, construct a multilevel meta-analytic model via the rma.mv() function in the metafor package:

multilevl.ma.model <- rma.mv(yi = d, 
                             V = Vd, 
                             random = list(~ 1 | ArticleID, ~ 1 | EffectID),
                             data = dat_cognition, 
                             method = 'REML')

Then, use the coef_test() function in the clubSandwich package to compute robust standard errors and use them for subsequent model inference (i.e., significance tests):

mod_RVE <- coef_test(multilevl.ma.model, # fitted multilevel model for which to make robust model inference (an object of class "rma.mv");
                     vcov = "CR2", # "bias-reduced linearization" is specified to approximate variance-covariance;
                     test = "Satterthwaite", # small-sample correction method used to approximate the degrees of freedom;
                     cluster = dat_cognition$ArticleID
                     )

As shown below, the output of the coef_test() function focuses on model inference - significance tests of the model coefficients (e.g., \(\beta_{0}\)):

    Coef. Estimate     SE t-stat d.f. p-val (Satt) Sig.
1 intrcpt    0.226 0.0581   3.88 65.8       <0.001  ***
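Equivalently, the same cluster-robust (CR2, Satterthwaite) inference can be obtained with metafor’s robust() function, as also used in Section 12 below; a minimal sketch:

# Cluster-robust inference for the same multilevel model, wrapping the clubSandwich estimator
robust(multilevl.ma.model, cluster = dat_cognition$ArticleID, clubSandwich = TRUE)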

Section 12 – Addition after peer review

During peer review, our reviewers gave thoughtful comments and suggestions. Inspired by one of these useful comments, we decided to add the multivariate model to our paper, and as a complement we added the corresponding illustration in this section. However, we did not find a suitable available dataset to demonstrate the multivariate model (this is also part of the reason why we did not include the multivariate model in our original paper). Here, for illustrative purposes, we fabricated a fictitious dataset.

This fictitious dataset looks like:

dat_mv <- data.frame(assay = c(1,1,2,2,3,3,4,4,5,5),
                     studyID = c("s1","s1","s2","s2","s3","s3","s4","s4","s5","s5"),
                     outcome = rep(c("sensory", "anxiety"),5),
                     yi = c(0.37,-0.22,0.1,-0.5,0.3,-0.02,0.16,-0.21,0.46,-0.29),
                     vi = c(0.0065,0.0067,0.0047,0.0007,0.0011,0.0004,0.0019,0.0005,0.0048,0.0204),
                     v1i = c(0.0065,0.0020,0.0047,0.0008,0.0011,0.0006,0.0019,0.0008,0.00048,0.00062),
                     v2i = c(0.002,0.0067,0.0008,0.0007,0.0006,0.0004,0.0008,0.0005,0.0062,0.00204))

Before we can proceed with the model fitting, we need to construct the full (block-diagonal) variance-covariance matrix across all studies from these two variables (v1i and v2i). We can do this using the vcalc() function in one line of code:

VCV <- vcalc(vi = 1, cluster = studyID, rvars = c(v1i, v2i), data = dat_mv)

Then, fit the multivariate random-effects model with:

model_MV <- rma.mv(yi = yi, 
                   V = VCV, 
                   random = ~ outcome | assay, 
                   struct="UN",
                   mods = ~ outcome - 1,  
                   method = "REML",
                   test = "t",
                   data = dat_mv)

The output looks like:

summary(model_MV)

Multivariate Meta-Analysis Model (k = 10; method: REML)

  logLik  Deviance       AIC       BIC      AICc   
  3.4643   -6.9285    3.0715    3.4687   33.0715   

Variance Components:

outer factor: assay   (nlvls = 5)
inner factor: outcome (nlvls = 2)

            estim    sqrt  k.lvl  fixed    level 
tau^2.1    0.0304  0.1744      5     no  anxiety 
tau^2.2    0.0204  0.1429      5     no  sensory 

         rho.anxt  rho.snsr    anxt  snsr 
anxiety         1                 -     5 
sensory    0.3468         1      no     - 

Test for Residual Heterogeneity:
QE(df = 8) = 535.3431, p-val < .0001

Test of Moderators (coefficients 1:2):
F(df1 = 2, df2 = 8) = 20.0793, p-val = 0.0008

Model Results:

                estimate      se     tval  df    pval    ci.lb    ci.ub     
outcomeanxiety   -0.2464  0.0804  -3.0655   8  0.0155  -0.4318  -0.0611   * 
outcomesensory    0.2802  0.0680   4.1214   8  0.0033   0.1234   0.4371  ** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that a multivariate random-effects model can meta-analyse the two outcomes simultaneously. Moreover, the unstructured random-effects structure (struct = "UN") allows the amount of variability to differ between the two outcomes while also estimating their correlation.

To guard against a misspecified VCV matrix, we can additionally apply cluster-robust inference to this model:

robust(model_MV, cluster = studyID, clubSandwich = TRUE)

Multivariate Meta-Analysis Model (k = 10; method: REML)

Variance Components:

outer factor: assay   (nlvls = 5)
inner factor: outcome (nlvls = 2)

            estim    sqrt  k.lvl  fixed    level 
tau^2.1    0.0304  0.1744      5     no  anxiety 
tau^2.2    0.0204  0.1429      5     no  sensory 

         rho.anxt  rho.snsr    anxt  snsr 
anxiety         1                 -     5 
sensory    0.3468         1      no     - 

Test for Residual Heterogeneity:
QE(df = 8) = 535.3431, p-val < .0001

Number of estimates:   10
Number of clusters:    5
Estimates per cluster: 2

Test of Moderators (coefficients 1:2):1
F(df1 = 2, df2 = 2.97) = 14.7715, p-val = 0.0286

Model Results:

                estimate     se1    tval1   df1   pval1   ci.lb1   ci.ub1    
outcomeanxiety   -0.2464  0.0808  -3.0492  3.98  0.0383  -0.4713  -0.0216  * 
outcomesensory    0.2802  0.0682   4.1108  3.96  0.0150   0.0902   0.4703  * 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

1) results based on cluster-robust inference (var-cov estimator: CR2,
   approx. t/F-tests and confidence intervals, dfs = Satterthwaite method)

License

This document is licensed under the following license: CC Attribution-Noncommercial-Share Alike 4.0 International.

Software and package versions

R version 4.0.3 (2020-10-10)

Platform: x86_64-w64-mingw32/x64 (64-bit)

locale: LC_COLLATE=Chinese (Simplified)_China.936, LC_CTYPE=Chinese (Simplified)_China.936, LC_MONETARY=Chinese (Simplified)_China.936, LC_NUMERIC=C and LC_TIME=Chinese (Simplified)_China.936

attached base packages: grid, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: formatR(v.1.11), pander(v.0.6.4), gridGraphics(v.0.5-1), png(v.0.1-7), cowplot(v.1.1.1), ggthemr(v.1.1.0), ggalluvial(v.0.12.3), visdat(v.0.5.3), ggsignif(v.0.6.3), plotly(v.4.9.4.1), networkD3(v.0.4), GoodmanKruskal(v.0.0.3), patchwork(v.1.1.1), MuMIn(v.1.43.17), orchaRd(v.2.0), clubSandwich(v.0.5.3), metafor(v.3.4-0), metadat(v.1.2-0), Matrix(v.1.2-18), readxl(v.1.3.1), DT(v.0.19), here(v.1.0.1), forcats(v.0.5.1), stringr(v.1.4.0), dplyr(v.1.0.7), purrr(v.0.3.4), readr(v.2.0.1), tidyr(v.1.1.3), tibble(v.3.1.4), ggplot2(v.3.3.5), tidyverse(v.1.3.1), rmdformats(v.1.0.3) and knitr(v.1.37)

loaded via a namespace (and not attached): ggbeeswarm(v.0.6.0), TH.data(v.1.1-0), colorspace(v.2.0-0), ellipsis(v.0.3.2), rprojroot(v.2.0.2), estimability(v.1.3), fs(v.1.5.0), rstudioapi(v.0.13), farver(v.2.1.0), fansi(v.0.5.0), mvtnorm(v.1.1-2), lubridate(v.1.7.10), mathjaxr(v.1.2-0), xml2(v.1.3.2), codetools(v.0.2-18), splines(v.4.0.3), jsonlite(v.1.7.2), broom(v.1.0.1), dbplyr(v.2.1.1), latex2exp(v.0.9.4), shiny(v.1.6.0), compiler(v.4.0.3), httr(v.1.4.2), emmeans(v.1.6.3), backports(v.1.2.1), assertthat(v.0.2.1), fastmap(v.1.1.0), lazyeval(v.0.2.2), cli(v.3.0.1), later(v.1.3.0), htmltools(v.0.5.2), tools(v.4.0.3), igraph(v.1.2.6), coda(v.0.19-4), gtable(v.0.3.0), glue(v.1.4.2), Rcpp(v.1.0.8.3), cellranger(v.1.1.0), jquerylib(v.0.1.4), vctrs(v.0.3.8), nlme(v.3.1-151), crosstalk(v.1.1.1), xfun(v.0.29), rvest(v.1.0.1), mime(v.0.11), miniUI(v.0.1.1.1), lifecycle(v.1.0.0), pacman(v.0.5.1), MASS(v.7.3-54), zoo(v.1.8-9), scales(v.1.1.1), promises(v.1.2.0.1), hms(v.1.1.0), sandwich(v.3.0-1), yaml(v.2.2.1), sass(v.0.4.0), labelled(v.2.9.0), stringi(v.1.7.4), highr(v.0.9), rlang(v.0.4.10), pkgconfig(v.2.0.3), evaluate(v.0.14), lattice(v.0.20-41), htmlwidgets(v.1.5.3), labeling(v.0.4.2), tidyselect(v.1.1.1), plyr(v.1.8.6), magrittr(v.2.0.1), bookdown(v.0.24), R6(v.2.5.1), generics(v.0.1.0), multcomp(v.1.4-17), DBI(v.1.1.1), mgcv(v.1.8-33), pillar(v.1.6.2), haven(v.2.4.3), withr(v.2.4.2), survival(v.3.2-7), questionr(v.0.7.7), modelr(v.0.1.8), crayon(v.1.4.1), utf8(v.1.2.2), tzdb(v.0.1.2), rmarkdown(v.2.11), data.table(v.1.14.0), reprex(v.2.0.1), digest(v.0.6.27), xtable(v.1.8-4), httpuv(v.1.6.2), stats4(v.4.0.3), munsell(v.0.5.0), beeswarm(v.0.4.0), viridisLite(v.0.4.0), vipor(v.0.4.5) and bslib(v.0.3.0)

References

Aloe, A.M., Becker, B.J., Pigott, T.D., 2010. An alternative to R2 for assessing linear models of effect size. Research Synthesis Methods 1, 272–283.
Bahadoran, Z., Mirmiran, P., Kashfi, K., Ghasemi, A., 2020. Importance of systematic reviews and meta-analyses of animal studies: Challenges for animal-to-human translation. Journal of the American Association for Laboratory Animal Science 59, 469–477.
Bateson, M., Matheson, S., 2007. Performance on a categorisation task suggests that removal of environmental enrichment induces ‘pessimism’ in captive European starlings (Sturnus vulgaris). Animal Welfare 16, 33.
Bolker, B.M., Brooks, M.E., Clark, C.J., Geange, S.W., Poulsen, J.R., Stevens, M.H.H., White, J.-S.S., 2009. Generalized linear mixed models: A practical guide for ecology and evolution. Trends in ecology & evolution 24, 127–135.
Kvarven, A., Strømland, E., Johannesson, M., 2020. Comparing meta-analyses and preregistered multiple-laboratory replication projects. Nature Human Behaviour 4, 423–434.
Lagisz, M., Zidar, J., Nakagawa, S., Neville, V., Sorato, E., Paul, E.S., Bateson, M., Mendl, M., Løvlie, H., 2020. Optimism, pessimism and judgement bias in animals: A systematic review and meta-analysis. Neuroscience & Biobehavioral Reviews 118, 3–17.
Langan, D., Higgins, J.P., Jackson, D., Bowden, J., Veroniki, A.A., Kontopantelis, E., Viechtbauer, W., Simmonds, M., 2019. A comparison of heterogeneity variance estimators in simulated random-effects meta-analyses. Research synthesis methods 10, 83–98.
Nakagawa, S., Lagisz, M., Jennions, M.D., Koricheva, J., Noble, D.W., Parker, T.H., Sánchez-Tójar, A., Yang, Y., O’Dea, R.E., 2022. Methods for testing publication bias in ecological and evolutionary meta-analyses. Methods in Ecology and Evolution 13, 4–21.
Nakagawa, S., Lagisz, M., O’Dea, R.E., Rutkowska, J., Yang, Y., Noble, D.W., Senior, A.M., 2021. The orchard plot: Cultivating a forest plot for use in ecology, evolution, and beyond. Research Synthesis Methods 12, 4–12.
Nakagawa, S., Poulin, R., Mengersen, K., Reinhold, K., Engqvist, L., Lagisz, M., Senior, A.M., 2015. Meta-analysis of variation: Ecological and evolutionary applications and beyond. Methods in Ecology and Evolution 6, 143–152.
Nakagawa, S., Schielzeth, H., 2013. A general and simple method for obtaining R2 from generalized linear mixed-effects models. Methods in ecology and evolution 4, 133–142.
Ramsteijn, A.S., Van de Wijer, L., Rando, J., Luijk, J. van, Homberg, J.R., Olivier, J.D., 2020. Perinatal selective serotonin reuptake inhibitor exposure and behavioral outcomes: A systematic review and meta-analyses of animal studies. Neuroscience & Biobehavioral Reviews 114, 53–69.
Stanley, T.D., Doucouliagos, H., Ioannidis, J.P., 2017. Finding the power to reduce publication bias. Statistics in medicine 36, 1580–1598.
Viechtbauer, W., 2007. Confidence intervals for the amount of heterogeneity in meta-analysis. Statistics in medicine 26, 37–52.
Walker, J.K., Waran, N.K., Phillips, C.J., 2014. The effect of conspecific removal on the behaviour and physiology of pair-housed shelter dogs. Applied Animal Behaviour Science 158, 46–56.