A High-Throughput Pipeline Identifies Robust Connectomes But Troublesome Variability

Modern scientific discovery depends on collecting large heterogeneous datasets with many sources of variability, and applying domain-specific pipelines from which one can draw insight or clinical utility. For example, macroscale connectomics studies require complex pipelines to process raw functional or diffusion data and estimate connectomes. Individual studies tend to customize pipelines to their needs, raising concerns about their reproducibility, which add to a longer list of factors that may differ across studies and result in failures to replicate (including sampling, experimental design, and data acquisition protocols). Mitigating these issues requires multi-study datasets and the development of pipelines that can be applied across them. We developed NeuroData’s MRI to Graphs (NDMG) pipeline using several functional and diffusion studies, including the Consortium for Reliability and Reproducability, to estimate connectomes. Without any manual intervention or parameter tuning, NDMG ran on 25 different studies (≈6,000 scans) from 19 sites, with each scan resulting in a biologically plausible connectome (as assessed by multiple quality assurance metrics at each processing stage). For each study, the connectomes from NDMG are more similar within than across individuals, indicating that NDMG is preserving biological variability. Moreover, the connectomes exhibit near perfect consistency for certain connectional properties across every scan, individual, study, site and modality; these include stronger ipsilateral than contralateral connections and stronger homotopic than heterotopic connections. Yet, the magnitude of the differences varied across individuals and studies—much more so when pooling data across sites, even after controlling for study, site, and basic demographic variables (i.e., age, sex, and ethnicity). This indicates that other experimental variables (possibly those not measured or reported) are contributing to this variability, which if not accounted for can limit the value of aggregate datasets, as well as expectations regarding the accuracy of findings and likelihood of replication. We therefore provide a set of principles to guide the development of pipelines capable of pooling data across studies while maintaining biological variability and minimizing measurement error. This open science approach provides us with an opportunity to understand and eventually mitigate spurious results for both past and future studies.


Introduction
Recent developments in technology enable experimentalists to collect increasingly large, complex, and heterogeneous data. Any single study can include both raw multimodal data and extensive metadata, including sample design, experimental protocols, data acquisition, and subject specific demographics/phenotypics. Each of these variables adds di erent sources of variability, which can hamper our ability to interpret and/or generalize results [1; 2]. Moreover, often only a subset of the potential sources of variability are documented or reported [3]. Interpreting these data therefore requires deep data processing pipelines. Such pipelines are particularly important when attempting enlarge sample size and increase power by aggregating data across multiple studies. These pipelines, however, can introduce additional sources of variability, if di erent pipelines are used on di erent datasets, or the same pipeline is applied across datasets but requires substantial tuning or manual intervention, or is run using di erent operating systems [4]. These sources of variability can collectively swamp the signal of interest, yielding studies with questionable reproducibility, scientific validity, and/or clinical utility.
Studies in neuroimaging exemplify these properties. The data from a single study consist of structural, functional and/or di usion magnetic resonance imaging (sMRI, fMRI, and dMRI) scans, from multiple individuals. The metadata associated with a study includes the time, date, location of the scans, the make and model of the scanner hardware and software, scanner acquisition protocols, as well as the demographic information from each individual, only some of which may be recorded and reported. A number of investigators have developed processing pipelines for one or more of these modalities [5][6][7][8][9][10][11][12][13][14][15][16], but none of the pipelines are designed to address the variability alluded to above, or work across both fMRI and dMRI.
For example, a number of studies have identified frequently uncontrolled variables that can radically alter the results of downstream inferences, such as menstrual cycle status [3]. Others have demonstrated that some statistics and normalization procedures result in stable parameter estimates across fMRI measurements within an individual and study, but those analyses lacked a coherent statistical model, did not compare across studies, and did not consider dMRI data [3]. A few investigations have pooled data across studies, with mixed results. With enough samples, certain properties are apparently preserved [17; 18]. Alternately, the use of sophisticated machine learning techniques can mitigate some of these issues [19]. Nonetheless, many studies continue to fail to replicate [20].
To rigorously identify and quantify the sources of variability both within and across multimodal neuroimaging requires (1) data and (2) a pipeline. The requisite data includes numerous datasets with multiple measurements per individual-including data that both conserve and vary a number of di erent factors. The requisite pipeline must be able to fully process each sample and study, and analyze the results both within and across studies using a coherent statistical model.
The Consortium of Reliability and Reproducability (CoRR) consists of about 30 di erent studies from nearly 20 di erent institutions around the world, which provides the necessary data [21]. But no existing pipeline can successfully estimate meaningful process every scan in this datasetincluding both functional and di usion data-while also quantifying the magnitude and source of variability amongst them. We therefore established several principles and metrics to guide the development of pipelines. We developed an approach, "Neuro Data MRI to Graphs" (NDMG), that meets or exceeds standards along each of the above men-tioned principles. We validated our pipeline by running NDMG on 11 dMRI studies comprising 3,227 individuals with 4,347 scans, and 17 fMRI studies comprising 714 individuals with 1,646 scans. For each scan NDMG estimates a "connectome" (a functional or structural map of connectivity) at 24 di erent spatial resolutions-yielding a total of > 100, 000 estimated connectomes-all of which are publicly available from http://m2g.io. This is the largest open database of connectomes [22], and one of the largest mega-analyses (inference across studies) of multi-modal connectomics data to date [19; 23].
These connectomes provided the data that led us to develop statistical connectomics methods to quantify various connectome properties, such as the relative probability of ipsilateral vs. contralateral connections and homotopic vs. heterotopic connections. While these properties have been previously noted in single studies [24][25][26], this work demonstrates that aspects of these properties are preserved both across individuals and studies upon harmonizing analysis. However, even after optimizing and harmonizing the pipeline, and controlling for site, study, sex, and other basic demographic variables, we observed a large degree of variability across individuals and studies. While some of this variability reflects legitimate biological heterogeneity, not all of it can be explained away. This excessive variability perhaps underlies recent failures of replicability in neuroimaging [20]. Moreover, the degree of variability calls into question many single-study neuroimaging results, and partially explains the lack of clinically useful neuroimaging based biomarkers [27], suggesting that further e orts are required for statistical connectomics to produce more accurate and reliable p-values and clinical biomarkers. Table 1 depicts NDMG's performance with regard to each of the below described principles for reproducible pipelines. For each principle, we outline a procedure for evaluating the degree to which a pipeline adheres to that principle, enabling the principle-based evaluation of disparate approaches.

Statistical Principles
The statistical principles are designed to evaluate the empirical quality of the Table 1: Comparing M3R Processing Pipelines. NDMG is designed with both algorithmic and implementation principles in mind. This table compares existing pipelines along these principles, demonstrating that for each, NDMG performs at least as well as the current state of the art. A is given for pipelines that satisfy the respective desiderata, a for pipelines that partially satisfy the respective desiderata, and a is given for pipelines that do not satisfy the respective desiderata. Appendix A provides more details.

Statistical Principles
Computational Principles Connectome Principles pipeline a c c u r a t e r e l i a b l e r o b u s t e x p e d i e n t s c a l a b l e p o r t a b l e t u r n - Accuracy quantifies the distance between "the truth' and an estimate-also known as "validity"and is closely related to the statistical concepts of "unbiased" and "consistent". Because the truth is unknown for these data, no pipeline can be meaningfully evaluated in terms of accuracy. Until the field develops the technological capability to obtain "ground truth" data, we rely on surrogate information. Specifically, NDMG incorporates the quality assessment (QA )from CPAC [5] as well as other pipelines, and also adds several novel QA figures, with at least one figure per processing stage, yielding a total of 11 QA figures per di usion scan and 32 per functional scan (see Appendix A for details). NDMG does not filter any data based on QA thresholds, as the choice of threshold remains both subjective and task dependent.
Reliability colloquially refers to methods that produce a similar result given a similar input, also called "stability" in the statistics literature [31]. To evaluate a method's reliability, Wang et al. [32] developed a metric called "discriminability" that quantifies the fraction of measurements from the same individual that are closer to one another than they are to the measurement of any other individual (details below). NDMG's discriminability over all scans was nearly 0.98 for dMRI data and over 0.88 for fMRI.
Robustness quantifies accuracy and reliability across a wide range of studies with di erent properties, including di erent experimental design, measurement devices, etc. We therefore ran NDMG on 11 dMRI studies and 17 fMRI studies using di erent hardware and acquisition parameters (see Table 2 for details). In some of these studies samples were not filtered to discard outliers or samples with poor signal-to-noise properties. Nonetheless, for each study, NDMG's QA suggested accuracy, and each study achieved a score of discriminability > 0.8.
Computational Principles Adhering to the computational principles lowers the barrier for use. In practice, this means that both domain experts and researchers from other disciplines (such as machine learning and statistics), can more easily use the tools.
Expediency refers to the time it takes to run an approach on a typical sample. In practice, we suspect that users are more likely to adopt pipelines that run in about an hour per scan as compared to those that run overnight per scan, for example. The NDMG runtime is about 20 minutes for a functional scan, and 1.5 hours for di usion scan.
Parallelized refers to the ability of the method to parallelize the code across multiple computers. NDMG enables parallelization across scansusing for example the commercial cloud or a highperformance cluster-enables NDMG to run many thousands of scans in only 1.5 hours.
Portability-meaning the ability to be run on different platforms, from laptop and cloud, with minimal installation and configuration energy-enables di erent analysts using di erent hardware resources to seamlessly use the code. We have tested NDMG on multiple hardware and operating system setups, including Windows, OSX, and Linux laptops, multi-core workstations, singularity clusters, and the Amazon cloud. Moreover, we have deployed NDMG on both openneuro [33] and CBRAIN [34], making it possible for anybody to run NDMG for free on their own or other's computational resources.
Turn-Key methods do not require the user to specify parameters and settings for each stage of processing or for each new study. This feature reduces the time for researchers to get a pipeline running, and enables pooling data across multiple pipelines because the analysis is harmonized (conducted identically across studies). NDMG parameters have been tuned to yield accurate and reliable connectome estimates across nearly 30 di erent studies. Moreover, NDMG is fully compliant with Brain Imaging Data Structure (BIDS)-a recently proposed specification for organizing multi-scan, multiindividual, multi-modality studies [35; 36].
Open, referring to both open source code and open access data derivatives processed using the code, enables anybody with Web-access to contribute to the scientific process. NDMG leverages open source packages with permissive licenses, and is released under the Apache 2.0 open source license. Our website, http://m2g.io contains links to download all of the data derivatives and quality assurance figures from each scan, the largest open database of connectomes and other data derivatives to our knowledge. By developing NDMG according to these statistical and computational design principles, and running it on many diverse studies, we can evaluate individuals, studies, and the collection of studies at an unprecedented scale. Below, we describe the nuts and bolts of the pipeline, followed by a set of NDMG -enabled scientific findings. Our hope is that NDMG and the data products derived from it will be useful for a wide variety of discoveries.

Connectome Principles
We also consider a pair of principles that are specific for reproducable pipelines in connectomics.
dMRI and fMRI pipelines operate on di usion or functional MRI data respectively. Raw-to-Graph refers to the pipeline performing all steps of analysis required to acquire connectomes (graphs) given raw, unprocessed M3R scans with no user input. The NDMG-d pipeline was built to take raw dMRI and T1w images and produce a di usion connectome, and the NDMG-f pipeline takes raw fMRI and T1w images and produces a functional connectome.

Individual-Level Analysis
In the individual-level analysis, each individual undergoes some number of sessions, during which multiple di erent modalities can be collected. The input to NDMG for a given individual is the collection of scans and some metadata for each scan, including a structural scan, and either or both of (1) a diffusion scan, including the di usion parameters files, and (2) a functional scan, including the slice acquisition sequence. The individual-level of NDMG analysis leverages existing open source tools, including the fMRI Software Library (FSL) [37][38][39], Dipy [40], the MNI152 atlas [41], and a variety of parcellations defined in the MNI152 space [42][43][44][45][46][47][48][49][50] (see Appendix G for details about built-in parcellations included. All algorithms requiring hyper-parameter selection were initially set to the suggested parameters for each tool, and tuned to improve the accuracy, reliability, expediency, and robustness of the results. The output of each processing stage includes data derivatives and QA figures to enable individualive accuracy assessments. The QA figures at many stages include cross-sectional images at di erent depths in the three canonical planes (sagittal, coronal, and axial) of images or overlays. Example QA figures are provided in Appendix B and Appendix C.

Individual-Level Diffusion Analysis
The NDMG-d pipeline consists of four key components: (1) registration, (2) tensor estimation, (3) tractography, and (4) graph generation (see Figure 1 for an illustration, and Appendix B for further details): Registration uses FSL's "standard" linear registration pipeline to register the structural and di usion images to the MNI152 atlas [37-39; 41]. Nonlinear registration, though possibly more accurate [51; 52], was not su ciently robust to include (it often failed on some studies without manual intervention). The individual-level NDMG-d pipeline transforms raw dMRI data into sparse structural connectomes, whereas the NDMG-f pipeline transforms raw fMRI data into dense functional connectomes. Each pipeline consists of four key steps, and each step generates both data derivatives and quality assurance figures to enable both qualitative assessments and quantitative comparisons (see Appendix B and Appendix C for details).
Tensor Estimation uses DiPy [40] to obtain an estimated tensor in each voxel.
Tractography uses DiPy's EuDX [53], a deterministic tractography algorithm closely related to FACT [54] to obtain a streamline from each voxel. Probabilistic tractography, while possibly more accurate, requires orders of magnitude more computational time. Visual QA of the generated fibers suggested that deterministic was su ciently accurate for our purposes.
Graph Generation uses our custom Python code that places an edge between a pair of regions of interest (ROIs) whenever a streamline passes through both, yielding a weighted voxel-wise graph. Downsampling and QA are performed as in the functional pipeline, described below.
Individual-level analysis in NDMG-d takes approximately 1.5 hours to complete using 1 CPU core and 12 GB of RAM at 1 mm 3 resolution. The individuallevel analysis was run on 11 studies, including 3,227 individuals and 4,347 scans. Each dataset generated connectomes across each of the parcellations in Appendix G, resulting in 104,328 functional braingraphs.

Individual-Level Functional Analysis
The NDMG-f pipeline can be broken up into four key components: (1) preprocessing, (2) registration, (3) nuisance correction, and (4) graph generation (see Figure 1for an illustration and Appendix C for further details). The NDMG-f pipeline was constructed starting with the optimal processing pipeline identified in Wang et. al [32] using CPAC [5]. Hyperparameters and further algorithm selection was optimized for reliability based on multiple measurement studies (including test-retest). Below, we provide a brief description of each step.
Registration uses FSL [37; 58; 59] to perform a nonlinear boundary based registration of the fMRI into the MNI152 space [41]. The registration pipeline implemented is "standard" when working with functional data and FSL's tools [5].
Nuisance Correction uses custom Python code to implement a general linear model incorporating regressors for quadratic detrending [60; 61], top five white-matter and cerebrospinal fluid Principal Components (CompCor) [62; 63], and the Friston 24parameter regressors [64]. Low-frequency drift is then removed below 0.01 Hz [65], and the first 15 seconds of the fMRI sequence are discarded [66].
Graph Generation uses custom Python code to compute the average timeseries for all voxels within an ROI, then compute correlations between all pairs of ROIs. The functional connectome is then ranktransformed by replacing the magnitude of the correlation with its relative rank, from smallest to largest [32].
Individual-level analysis in NDMG-f takes approximately 20 minutes to complete using 1 CPU core and 3 GB of RAM at 2 mm 3 resolution. The individuallevel analysis was run on 714 individuals with 1,646 scans from 17 studies; each generating connectomes across each of the 24 parcellations in NDMG-d, resulting in 39,504 total brain-graphs.
The QA for graph generation includes a heatmap of the adjacency matrix, the number of non-zero edges, and several multivariate graph statistics (one statistic per vertex in the graph) including: betweenness centrality, clustering coe cient, hemisphereseparated degree sequence, edge weight, eigenvalues of the graph laplacian, and locality statistic-1 [67]. We developed the hemisphere-separated degree sequence to indicate the ipsilateral degree and contralateral degree for each vertex, which we found quite useful for QA. Appendix C.4 includes definitions and implementation details for each of the statistics. Supplementary Figure S10 shows, for a single individual, the graph summary statistics for the multi-connectome (including both functional and di usion) across the set of atlases described above.

Group-Level Analysis
We ran NDMG on the 11 di usion and 17 functional studies, listed in Table 2. For each, NDMG group-level Table 2: NDMG pipeline robustness and reliability. We ran NDMG on over 20 di erent studies, including both fMRI and dMRI data, spanning multiple di erent scanners, acquisition parameters, and population demographics. Nonetheless, for both fMRI and dMRI data, across all datasets with multiple measurements, NDMG always achieved > 0.8 discriminability, and NDMG-d's discriminability was typically > 0.98 on the dMRI data. D=discriminability.
analysis computes and plots group-level graph summary and reliability statistics.

Group Level Implementation Strategy
Leveraging previous e orts developed in "Science in the Cloud" [71] manuscript, multiple sessions and datasets are evaluated in parallel for participantlevel analysis, and the derivatives are pooled for group-level analysis, much like typical map-reduce approaches (consistent with the BIDS app specifi-Figure 2: NDMG's group level analysis computes and plots eight connectome-specific summary statistics per modality for each scan, providing immediate quality assurance for the entire study. In theory, any connectome could be an outlier for any of these statistics, so plotting all of them together is particularly useful (see Appendix D for details). The top panels show results for the BNU1 dMRI connectomes, the bottom panels show the results for the BNU1 fMRI connectomes.
cation [36]). The parallel implementation uses the Amazon Web Services (AWS) cloud, in particular leveraging their storage (S3) and high performance computing (Batch) services. Datasets are stored in an S3 bucket, and the NDMG cloud-API enabled connectome estimation for all sessions of data on Amazon Batch in parallel. In practice, AWS limits the number of parallel threads users are allowed to launch (to prevent accidental spending). After connectome estimation is complete, the same cloud-API exposed by NDMG enables group-level analysis to be launched and parallelized across each parcellation for all sessions within each study. We were able to compute di usion connectomes at 24 scales for each of the publicly available 2,861 sessions of data (totaling 68,664 connectomes) in under one day and $1,000. Had we processed each session in parallel, cost would have remained the same but only taken 1.5 hours.

Group-Level Graph Summary Statistics
Each scan's estimated connectome can be summarized by a set of graph statistics, as described above.
For group-level analysis, we visualize each session's summary statistics overlaid on one another. For example, Figure 2 demonstrates that each di usion graph from the BNU3 dataset using the Desikan atlas has relatively similar values for the statistics (we use the Desikan atlas for the remainder of the analyses unless otherwise specified). Moreover, It is clear from both the degree plot and the mean connectome plot that the dMRI connectomes from this study tend to have more connections within a hemisphere than across a hemisphere, as expected. For the fMRI connectomes, however, the homotopic connectionsthat is, connections from one region on a hemisphere to the same region on the other hemisphere-seem particularly strong.

Group-Level Discriminability
Group-level results from NDMG that include repeated measurements are quantitatively assessed using a statistic called discriminability [32]. The group's sample discriminability estimates the probability that two observations within the same class are more similar to one another than to objects belonging to a di erent class: In the context of reliability in NDMG, each connectome, a ij , is compared to other connectomes belonging to the same individual, a ij ′ , and to all connectomes belonging to other individuals, a i ′ j ′ . A perfect discriminability score indicates that for all observations within the group, each connectome is more like connectomes from the same individual than to others. Table 2 lists the discriminability score of each study with repeated measurements (five dMRI studies and sixteen fMRI). NDMG-d achieves a discriminability score of nearly 0.99 or greater on most studies (the lowest score was nearly 0.9). NDMGf achieves a discriminability score of around 0.9 on all studies. These high discriminability scores were achieved by optimizing both NDMG-d and NDMG-f on a subset of the data. Specifically, NDMG-d was optimized using only the Kirby21 study, to achieve a perfect discriminability score. Nonetheless, the other studies achieved comparably high discriminability scores, despite the fact that the Kirby21 used a Philips scanner, unlike any of the other studies. Moreover, Kirby21's age distribution is substantially di erent from several of the other dMRI studies (see Table 2). Similarly, NDMG-f was optimized using only 13 of the 17 studies, and yet discriminability on the remaining studies remained equally high [32], even though they also exhibited large study demographic and acquisition protocol variability. That NDMG's discriminability score is robust to data acquisition details and study demographics across modalities suggests that scientific results may also be robust to such sources of variability.

Mega-Level Analysis
Many sources of variability contribute to the observed summary statistics, including individual-or population-level variation, and di erent types of measurement or analysis techniques. By virtue of harmonizing the analysis across individuals and studies, we are able assess the remaining degrees of variability due to measurement and demographicspecific e ects. Although population-level e ects are expected when comparing two di erent populations with di erent demographics, variability across measurements must be relatively small for inferences based on neuroimaging to be valid across these populations. Therefore, we conducted megaanalysis to address these remaining sources of variation. Site-specific mean connectomes (top two rows) and mega-mean (bottom row), using the Desikan parcellation, using all the data for which we have both functional and di usion studies (> 900 scans of each), with edges and vertices colored depending on hemisphere. In the top 2 rows, graphs are shown with vertex position determined by the coronally-projected center of mass for each region in the desikan parcellation. The bottom shows radial plots organized by hemisphere. Same-modality connectomes appear qualitatively similar to one another across sites, but some di erences across modalities are apparent. For example, homotopic and contra-lateral connections both seem stronger in functional than di usion mean connectomes, both within site and after pooling all sites. Figure 3 (top) shows the mean estimated connectome computed from each dataset for which we have both dMRI and fMRI data. We also calculate the megamean connectomes, that is, the mean derived by pooling all the studies (Figure 3, bottom). Several group-specific properties are readily apparent simply by visualizing the connectomes:

Consistency Within and Across Studies
1) The dMRI connectomes have significantly stronger ipsilateral connections than contralateral connections, meaning connections are more prevalent within hemispheres, on average.
2) The fMRI connectomes have significantly stronger homotopic connections than heterotopic connections, meaning region A on the left hemisphere tends to correlation with region A on the right hemisphere, more than other regions, on average.
To formally test these initial assessments we developed statistical connectomics models and methods. Specifically, we developed a structured independent edge random graph model that generalizes the more commonly used stochastic block model (SBM) [72] and the independence edge random graph model [73]. In this new model, each edge is sampled independently, but not identically. Rather, there are K possible probabilities of an edge between a pair of vertices, and we have a priori knowledge of which edges are in which groups. Unlike the SBM model, in which each vertex is in a group, here each edge is in a group. We then developed test statistics that are consistent and ecient for this model. Specifically, with su ciently large sample sizes, the power (the probability of correctly rejecting a false null hypothesis) of our test approaches unity for the model under consideration. Moreover, no other test can achieve higher power with fewer samples under this model (see Appendix E for details).
Using the above described approach, we first quantify the degree to which ipsilateral connections tend to be stronger than contralateral connections ( Figure 4A). 99.9% of dMRI connectomes and 6.4% of fMRI connectomes, exhibit stronger ipsilateral than contralateral connections. The dMRI connectomes typically have a larger di erence between ipsilateral and contralateral connections, as evidenced by the magnitudes of di erences ( Figure 4A.i). and the cumulative probability of a connectome having a di erence that is statistically significantly at any level (Figure 4A.ii). We subsequently investigated whether for a given individual, the di erence between ipsilateral and contralateral was stronger in the dMRI data than the fMRI data. Out of 907 individuals with both dMRI and fMRI, 99.5% exhibit stronger ipsilateral versus contralateral connections in the dMRI dataset as compared to their corresponding fMRI dataset (Figure 4A.iii), with 99.5% significant at the 0.05 level, for example ( Figure 4A.iv).
We applied the same strategy to test whether homotopic connections tend to be stronger than heterotopic connections ( Figure 4B). In this case, the results are essentially opposite. Nearly all of the fMRI scans (99.4%) show stronger homotopic than heterotopic connections, whereas nearly none of the dMRI scans exhibit this property (7.7%) ( Figure 4B.ii). Similarly, for nearly all individuals (99.0%) with both functional and di usion scans, the relative strength of homotopic versus heterotopic connections was stronger in his or her fMRI data than the dMRI data ( Figure 4B.iv).
In other words, there is a marked consistency across all individuals in all studies (16 fMRI studies and 9 dMRI studies) for these must basic statisti-cal properties of multimodal connectomes. Notably, this result therefore provides compelling evidence that certain connectomic discoveries that utilize only a single study, or even a single individual, without even addressing batch e ects, can reasonably be expected to be repeatable across studies. Note, however, that repeatable does not mean correct; the true relative probabilities of ipsilateral, contralateral, homotopic, and heterotopic in humans remains an open question.

Significant Variability Within and Across
Studies While the above analysis demonstrates preservation of certain connectome properties both within and across studies, it is insu cient to determine the extent of "batch e ects"-sources of variability including di erences in participant recruitment, screening, and demographics as well as scanner, acquisition sequence, and operator. Typically, investigators prefer that their results are robust to these additional sources of variance. If the batch effects are larger than the signal of interest (for example, whether a particular individual is su ering from a particular psychiatric disorder), then inferences based on individual studies are prone to be unreliable. Figure 5 evaluates the variability of individual scans both within and across studies for dMRI (left) and fMRI (right). For simplicity, we focus on a single parameter for each modality: dMRI data is evaluated in terms of its average within hemisphere connectivity and fMRI is evaluated in terms of its average homotopic connectivity. Prior to analysis each graph is "rank normalized", meaning that its edge weights are converted into the numbers 1, 2, ..., N , where N is the total number of potential edges, where the smallest value is assigned a 1, the next smallest is assigned a 2, etc. By virtue of this normalization, each network has the same mean and variance, therefore mitigating certain kinds of batch e ects. Moreover, by partitioning edges into only two groups, there is only a single degree of freedom: the average connection strength from one group determines that of the other, and vice versa (see Appendix F for details). The fact that the most salient features when visualizing these connectomes e ectively yields a onedimensional characterization of this property justifies the importance of studying the batch e ects of this parameter. Dramatic variability in this parame-(C) ter suggests that more "fine" parameters such as the connection strength between individual regions of interest, will typically necessarily have even greater variability.  ii) evaluate the statistical significance of batch e ects across studies, demonstrating that many studies do exhibit severe batch e ects, though not all. In this case, batch e ect was quantified by determining whether the distribution of connection strengths for one study was significantly different from those of another study.
To further understand the source of this variabil-ity, we conducted an extensive within and between study analysis of individuals, including the following six di erent cases ( Figure 5(A.iii) and (B.iii)): 1) Across sessions for a given individual within studies: If session 1 scans tended to be di erent from session 2 scans within a study, then the scan order itself can meaningfully impact inferences. While dMRI sessions never resulted in significant scan order e ects, 9% fMRI studies were impacted by this e ect.
2) Across individuals within studies: This analysis quantifies the fraction of individuals in a given study whose scans are less similar to one another than they are to any other scan in the dataset. 6% and 15% of individuals within studies have significantly di erent (C) ii) The pairwise significance of the di erences between studies demonstrates that many pairs of studies exhibit meaningful batch e ects. (A.iii) and (B.iii) Several common pooling strategies are investigated, including pooling across multiple sessions within a single study (Sessions), pooling across subjects within a single study (Subjects), pooling across sex within a single study (Sexes), pooling across studies taken from a single scanning site (Sites), pooling across studies with a similar demographic (Demographics), and pooling across studies with no consideration of demographic (All studies). In many cases, even when controlling for several factors, significant batch e ects remain. (C) The fraction of samples showing significant batch e ect at α = 0.05. Pooling across sessions, subjects, and sexes, only a small fraction of the samples show significant batch e ect in both the dMRI and fMRI connectomes. Pooling across demographics and across any study in general shows a large fraction of samples with significant batch e ects. This suggests that pooling strategy has a significant impact on the generalizability of neuroscience models, and underscores the prevalence of the batch e ect in MRI investigations. A careful discussion of the model and hypothesis testing employed can be found in Appendix F. connection strengths.

3) Across sexes within studies:
Connection strengths are the same at this coarse level across sexes for each dMRI study, whereas 30% of fMRI studies demonstrated significant di erences across sexes.

4)
Across studies within sites: Two dMRI and three fMRI sites provided data from multiple studies, thus while the location and scanner were the same across these studies, certain variability in demographic and acquisition details persisted. Both dMRI and fMRI demonstrated a mix: sometimes these di erences were significant, but not always.

5)
Across studies within demographic: X dMRI and Y fMRI studies were demographically essentially identical, including all Chinese people, about 50% female, and all college age. Even preserving demographics was insu cient typically to mitigate batch e ects, with 2/3 and 1/2 of comparisons yielding significant batch e ects for dMRI and fMRI, respectively.
6) Across all studies: 10 dMRI and 17 fMRI studies were compared, ignoring acquisition and demographic detail. Over 60% of pairwise comparisons for both dMRI and fMRI demonstrated significant di erences, often the maximal significance level given our permutation test.
Taken together, the above results indicate the existence of many di erence sources of variability, even upon harmonizing analyses. Such variability suggests that the reproducibility of both dMRI and fMRI can benefit by further understanding, quantifying, and mitigating these sources of variability.

Discussion
Variability of sample demographics, data acquisition details (for example, number of repetitions for fMRI, number of directions for dMRI), analysis methods, measurement error, questionable research practices, or statistical errors can each contribute to recent troubles in psychology [20] and neuroimaging [74]. Our principles for pipeline development enabled a rigorous high-throughput analysis of multistudy, multi-site, and multi-connectome data, to identify and quantify di erent sources of variability.
While perhaps seemingly at odds, the results from Figures 4 and 5 are complementary. Specifically, Figure 4 demonstrates that essentially all connectomes share a particular property: stronger connections in one set of edges than another. On the other hand, Figure 5 demonstrate that although some sets of edges tend to be larger than others, the magnitude of the di erences is highly variable both within and across studies. That the magnitude of certain parameters can di er while the sign of those parameters can be constant parallels recent suggestions from the statistics literature to move away from "Type I" and "Type II" errors, to Type M (magnitude) and Type S (sign) errors [75]. Moreover, it suggests a strategy to understand and address batch e ects: reporting the precision for which e ects are preserved or variable. For example, in this study, when using the coarsest possible precision (larger versus smaller as in Figure 4), no batch e ects arise, whereas when using an extremely fine precision (the di erence in magnitude as in Figure 5 ), batch e ects are pervasive. The natural question then becomes: "at which precision, for a given parameter and source of variability, do the studies still agree?" Such analyses could then be incorporated in downstream analyses to preserve results across studies.
The design criteria for NDMG required certain trade-o s, including robustness under variability versus optimality under idealized data. Nonetheless, NDMG, could be improved along several dimensions. First, recent advances in registration [52] and tractography [76] could be incorporated. When implementations for these algorithms achieve suitable expediency and robustness, it will be natural to assess them. Second, recently several more sophisticated batch e ect strategies have been successfully employed in dMRI data [77]. Such strategies could possibly help here as well, especially if they are modified appropriately to work on binary graphs [78]. Third, there is some evidence machine learning approaches to mitigating batch e ects can be e ective as well, but so far only in fMRI data and only using four, rather than 17 studies [79]. Fourth, pre-processing strategies have been employed to improve multi-site reliability [80], so implementing methods such as these within NDMG could possibly mitigate some batch effect, at the risk of reducing accuracy and/or reliability [81].
It may be that analysis methods on their own are insu cient to mitigate replicability issues, and that further improving data acquisition and/or data acquisition harmonization may be required. Indeed, a recent study by Noble et al. [82] found relatively few batch e ects in fMRI data, although it employed only two datasets with enhanced and harmonized data acquisition protocols.
Because the methods developed herein are open source and easy to use, and the data are open access, this work enables further studies to asses measurement errors as well as variability of sample de-mographics and experimental protocols. For example, data could be sub-sampled to only include scans that pass stringent quality assurance standards, or duration [? ]. Alternately, this analysis could be repeated on data that is "perfectly" harmonized. In general, further work developing and applying experimental and theoretical tools to parse the relative impact of various source of batch e ects, as well as batch e ect mitigation strategies, will be crucial for neuroimaging to achieve its greatest potential scientific and clinical value.

Appendix A Pipeline Comparison Technology Evaluation
The scoring criteria for the principles defined in the main text are defined as follows: Accurate : pipeline results compared with ground truth to assess accuracy, : quality control figures and metrics are produced.
Reliable : reliability has been assessed, using either discriminability or intra-class correlation, by running the pipeline on at least one study, : no published results demonstrating reliability Robust : pipeline has been run successfully on multiple disparate studies, : no published results on multiple studies.
Expedient : pipeline runs on a single individual ≤ 2 hour per scan, : pipeline requires ≥ 2 hour per scan.
Parallelized : can run in parallel locally, and can scale to cloud infrastructure (AWS EC2 or AWS Batch). : can run in parallel locally or scale to cloud infrastructure. : can neither run in parallel locally nor scale to cloud infrastructure.
Portable : can run on, and is deployed on, multiple di erent platforms, : can run on multiple platforms, but is not deployed on any publicly available resources, : is platform specific.
Turn-Key : runs without requiring any tuning parameters or configuration files, : given a configuration file, runs without requiring any further tuning, : requires tuning for each run.
Open : both source code and data derivatives are open, : only source code is available, : neither source code nor data derivatives are publicly available.
dMRI& fMRI operates on dMRI data (left), operates on fMRI (right), on either side indicates the opposite.
Raw-to-Graph : outputs estimated networks, : does not.

Appendix B Diffusion Pipeline
Here we take a deep-dive into each of the modules of the NDMG-d pipeline. We will explain algorithm and parameter choices that were implemented at each step and the justification for why they were used over alternatives. All QA/QAX figures shown are from participant 0025864 from the BNU1 [21] study.  Registration in NDMG leverages FSL and the Nilearn Python package. NDMG uses linear registrations because non-linear methods had higher variability across studies and increased the time requirements of the pipeline dramatically (not shown). The first step in the registration module is eddy-current correction and dMRI self-alignment to the volumestack's B0 volume ( Figure S1). NDMG uses FSL's eddy_correct, rather than the newer eddy function, because eddy requires substantially longer to run or relies on GPU acceleration, which would reduce the accessibility of NDMG. Once the dMRI data is self-aligned, it is aligned to the same-individual T1w image through FSL's epi_reg mini-pipeline. This tool performs a linear alignment between each image in the dMRI volume-stack and the T1w volume. The T1w volume is then aligned to the MNI152 template using linear registration computed by FSL's flirt. This alignment is computed using the 1 millimeter (mm) MNI152 atlas, to enable higher freedom in terms of the parcellations that may be used, such as near-voxelwise parcellations that have been generated at 1 mm. FSL's non-linear registration, fnirt, is not used in NDMG as the performance was found to vary significantly based on the collection protocol of the T1w images, often resulting in either slightly improved or significantly deteriorated performance.
The transform mapping the T1w volume to the template is then applied to the dMRI image stack, resulting in the dMRI image being aligned to the MNI152 template in stereotaxic-coordinate space. However, while flirt aligns the images in stereotaxic space, it does not guarantee an overlap of the data in voxelspace. Using Nilearn's Figure S1: NDMG-d detailed pipeline. The NDMG-d pipeline consists of 4 main steps: Registration (D1), Tensor Estimation (D2), Tractography (D3), and Graph Generation (D4). Each of these sections leverages pubicly available tools and data to robustly produce the desired derivative of each step. Alongside derivative production, NDMG produces QA figures at each stage, as can be seen in D1-4, that enable qualitative evaluation of the pipeline's performance.  resample, NDMG ensures that images are aligned in both voxel-and stereotaxic-coordinates so that all analyses can be performed equivalently either with or without considering the image a ne-transforms mapping the data matrix to the real-world coordinates.
Finally, NDMG produces a QA plot showing three slices of the first B0 volume of the aligned dMRI image overlaid on the MNI152 template in the three principle coordinate planes ( Figure S2).

Appendix B.2 Tensor Estimation
Once the dMRI volumes have been aligned to the template, NDMG begins di usion-specific processing on the data. All di usion processing in NDMG is performed using the Dipy Python package [40]. The di usion processing in NDMG is performed after alignment to ease cross-connectome comparisons.
While high-dimensional di usion models, such as orientation distribution functions (ODFs) or q-ball, enable reconstruction of crossing fibers and complex fiber trajectories, these methods are designed for images with a large number of di usion volumes/directions for a given image [83; 84]. Because NDMG is designed to be robust across a wide range of dMRI studies, including di usion tensor imaging, NDMG uses a lower-dimensional tensor model. The model, described in detail on Dipy's website (http://nipy.org/dipy/examples_bcouilt/ reconst_dti.html), computes a 6-component tensor for each voxel in the image. This reduces the dMRI image stack to a single 6-dimensional image that can be used for tractography. NDMG generates a QA plot showing slices of the FA map derived from the tensors in nine panels, as above ( Figure S3).

Appendix B.3 Tractography
NDMG uses DiPy's deterministic tractography algorithm, EuDX [53]. Integration of tensor estimation and tractography methods is minimally complex with this tractography method, as it has been designed to operate on the tensors produced by Dipy in the previous step. Probabilistic tractography would be significantly more computationally expensive, and it remains unclear how well it would perform on data with a small number of di usion directions. The QA figure for tractography shows a subset of the resolved streamlines in an axial projection of the brain mask with the fibers contained therein ( Figure S4). This QA figure allows the user to verify, for example, that streamlines are following expected patterns within the brain and do not leave the boundary of the mask.

Appendix B.4 Graph Estimation
NDMG uses the fiber streamlines to generate connectomes across multiple parcellations. The connectomes generated are graph objects, with nodes in the graph representing regions of interest (ROIs), and edges representing connectivity via fibers. An undirected edge is added to the graph for each pair of ROIs a given streamline passes through. Edges are undirected because dMRI data lacks direction information. Edge weight is the number of streamlines which pass through a given pair of regions. NDMG uses 24 parcellations, including all standard public dMRI parcellations known by the authors. Users may run NDMG using any additional parcellation defined in MNI152 space simply by providing access to it on the command-line. To package an additional parcellation with NDMG, please contact the maintainers. The QA for graph generation depicts a number of graph statistics for each of the parcellation schemes. We typically generate this figure at the population level, as depicted in Figure 2. The description of all the graph statistics we used is provided in Appendix D. Table 4: NDMG-f IO Breakdown. Below, we look at the inputs, outputs, and QA figures produced by NDMG-f. .
Step Inputs  Outputs  QA figures Preprocessing raw fMRI, raw T w preprocessed fMRI, preprocessed T w

Appendix C Functional Pipeline
Here we take a deep-dive into each of the modules of the NDMG-f pipeline. We will explain algorithm and parameter choices that were implemented at each step, and the justification for why they were used over alternatives. All QA/QAX figures shown are from participant 0025864 from the BNU1 [21] study.

Appendix C.1 Preprocessing
fMRI pre-processing consists of several steps described below ( Figure S6).

Slice Timing Correction
Slice-timing is accomplished using the slicetimer utility provided by FSL [57]. To collect an individual 4D EPI sequence, a 3D volume is constructed as a combination of individual 2D slices. The 2D slices are collected incrementally; that is, we collect each 2D slice for approximately 10 milliseconds, and the entire 3D volume is complete in about 30 2D slices. This gives a repetition time, or TR, for the volume of seconds, depending on the scanner. The data are essentially a "sliding snapshot" over the course of one TR; observations are not all at a fixed point in time. NDMG-f accounts for slice timing aliasing by accepting a user-provided acquisition sequence (the order in which slices are collected for a single time point). Given the acquisition sequence of the 2D slices, we can compute the TR shift of each 2D slice using the TR information from the header of the brain image. A slice that occurs first in a TR will have a shift of 0, while a slice that occurs at the end of a TR will have a shift of 1. A slice that occurs exactly in the middle of a TR has a shift of 0.5. For each voxel in an individual slice, interpolation is used to re-center our observations to all have a TR shift of 0.5. For slice-timing correction, NDMG-faccepts a slicetiming configuration file, or one of the canonical slice-timing orientations (interleaved, sequential ascending, or sequential descending).

Motion Correction
Motion correction is implemented using the mcflirt utility [56], which is a simplification of FSL's FLIRT registration tool. During an fMRI session, participants sit in a small, cramped scanner, often for 5 to 10 minutes. During the course of a study, it is fairly common for participants to move their heads, even if only small amounts. Small shifts will lead to a person's head being in di erent spatial positions at each timestep [64], which hampers our e orts to standardize the spatial properties of each individual's brain down the line through registration. This is because registrations are performed by estimating the registration on the first volume [37][38][39], after which the estimated transformation is applied across the temporal dimension. This means that if each 3D volume is not aligned spatially, inconsistencies in registration will generally decrease functional connectome quality [85].
Fortunately, given that the individual's brain structure for each 3D volume is relatively constant (the brain shape itself is not changing in time), a 6 degree of freedom rigid a ne transformation for each 3D volume (1 translational and 1 rotational parameter per x, y, z direction the individual could move his/her head) using the mean fMRI slice as the reference, adequately addresses this issue.

Anatomical Preprocessing
To preprocess the anatomical t1w image, AFNI's 3dSkullstrip [55] is used. 3dSkullstrip provides modifications to the BET algorithm [86] to make it more robust without hyperparameters. Note that 3dSkullstrip renormalizes intensities, so to regain the original intensities, the result is binarized and fed as a step function (essentially making it a mask) through 3dcalc and multiplied voxelwise with the original image, yielding the original image intensities of the brain and excluding the regions determined to be skull.

Appendix C.2 Registration
Self Registration To register our input fMRI to our reference T1w image, a three degree of freedom (DOF) a ne transformation is estimated with x, y, and z translational parameters with FSL's FLIRT [87] using the sch3Dtrans3dof schedule file provided as part of the FSL package. The schedule file centers the functional brain on the T1w brain and tends to improve the initialization of registrations in later steps. Next, a locallyoptimized transformation from the fMRI brain to the T1w brain is estimated. Again, this transformation is robust, and its hyperparameters are tuned to focus on local features of the input (fMRI) and reference (T1w) spaces in the simple3d.sch FLIRT schedule file. This schedule file is chosen due to the input fMRI potentially having a narrow field of view, resolution constraints, or tearing that will perform poorly using a more global alignment.
A third alignment is then estimated using the locally-aligned fMRI to the structural T1w image using the boundary-based registration (bbr) cost function provided by FSL [58]. In fMRI, the white-matter/gray-matter border is fairly apparent because the gray-matter generally shows higher intensities than the white-matter regions. Leveraging this observation, the white-matter boundary can be aligned between the fMRI and the T1w scan with high accuracy [58]. A six DOF transformation from the fMRI to the T1w image is estimated, and the T1w is then segmented to produce a white-matter mask using FSL's FAST algorithm [88]. FLIRT is used with the (e) (f) Figure S7: NDMG-f Registration QAX. For registration, NDMG-f produces summary figures showing the preprocessed epi image overlaid on the t1w, the registered epi image overlaid on the template (S7a), the registered t1w image overlaid on the template (S7b), the white-matter mask used in FLIRT-bbr (S7c), the voxelwise mean intensity (3), the voxelwise signalto-noise ratio (S7e), and the voxelwise contrast-to-noise ratio (S7f).
bbr cost-function to align the boundaries of the white-matter in the fMRI and T1w scans optimally. This provides a high-quality alignment for intra-modal registration from an EPI to a T1w image.
Template Registration A gentle linear transformation of the T1w brain to the MIN152 template [89] is estimated using the local-optimization schedule file from before. We use this local-optimization registration as the starting point for a more extensive 12-DOF global FLIRT alignment than the self-registration case. Given that the template brain will theoretically be less similar than simple translations, rotations and scalings can provide, a non-linear registration is estimated from the T1w to the template space. This is accomplished using FSL's FNIRT algorithm [59], with hyper-parameter tuning specific for the MNI152 template. This non-linear transformation is applied first to the T1w image. The non-linear transformation is then combined with the result of the self-alignment step and applied to the functional volume. Applying the transformation only once prevents unnecessary fixed-precision multiplications, which can induce numerical errors. QA figures for the registration step include the preprocessed epi image overlaid on the t1w, the registered epi image overlaid on the template, the registered t1w image overlaid on the template, the white-matter mask used in FLIRT-bbr, the voxelwise mean intensity, the voxelwise signal-to-noise ratio, and the voxelwise contrastto-noise ratio.

Appendix C.3 Nuisance Correction
General Linear Model Over the course of an fMRI scanning session, many sources of noise arise that must be corrected for in order to make quality data inferences. The scanner heats up during a scanning session (producing a high strength magnetic field for sessions lasting up to ten or more minutes, which in turn produces an enormous amount of heat). As the scanner heats, the signal recorded tends to drift (first demonstrated by [61] who showed that a heated scanner detected "brain activity" in cadavers). This drift has been shown to be approximately quadratic [60], so NDMG uses a second-degree polynomial regressor.
While spatial motion correction removes the visual impact of head motion, spurious signal artifacts remain present. These artifacts can be characterized by the position of the brain in the scanner over time [64]. This temporal relationship can be e ectively captured by the current volume and the preceding volume, as well as their squares, so 24 regressors are estimated where we have four regressors (current frame, shifted frame, squaredcurrent frame, square-shifted frame) for each of our six (x, y, z, translation and rotation) motion regressors. These regressors are known as the Friston 24 parameter regressors.
Finally, fMRI signal is often corrupted by physiological noise, such as blood flow or vessel dilation [62]. The top 5 principal components from the white-matter and lateral-ventricle signal capture these additional sources of variance [62; 63]. We estimate CSF and white-matter masks using the FAST algorithm, [88] with priors obtained from the MNI152 parcellation [41]. This estimated mask is eroded by 2 voxels on all sides to avoid any potential signal distortion from the gray-matter signal, since gray-matter signal is expected to correlate with our stimuli. Any signal bleeding into the white-matter voxels (since the gray-matter/white-matter boundary has a slight bleed-over region) that could get removed by our PCA could be detrimental to our downstream inferences.
The regressors are incorporated into the design matrix X of the general linear model (GLM) shown in (2). For our n voxels, the t timestep BOLD signal, we can decompose Y raw ∈ R t×n as: where ϵ ∈ R t×n is a noise term, X ∈ R t×r is the design matrix, and β ∈ R r×n are the regression coe cients.
Minimizing the squared-error loss of Y raw with respect to Xβ will provide an estimate of the coe cientsβ of the regressors in X:β Using the estimateβ for our regressors X, the GLM-corrected timeseries is [60]: Low Frequency Drift Removal Using the GLM-corrected timeseries, low-frequency drift that may still be present in the functional volume can then be removed . Any physiological response due to a stimulus will have a period of around 16 seconds (or a frequency of 0.063 Hz) and will not exceed the period of any stimuli present, as has been shown in [65]. Using this information, sinusoidal fourier modes with frequencies lower than most brain stimuli are removed. Conservatively, we set a threshold of 0.01 Hz for highpass-filtering out low-frequency noise (this should not remove task-dependent signal as long as our task has a period less than about 100 seconds). Although in this manuscript we only applied NDMG to resting state data, we have also applied it to task data, motivating these thresholds.

T1 Effect Removal
During the fMRI session, the first few volumes may appear to have brighter intensities as the T1 e ects are not fully saturated [66]. NDMG therefore discards the first 15 seconds of the fMRI sequence.

Appendix C.4 Graph Estimation
The ROI timeseries can be used to estimate a functional connectome. NDMG computes the pairwise correlation between each pair of regions. The delination of regions is provided by the 24 parcellations described above. For each parcellation, the correlations are converted to relative ranks-the lowest correlation gets a rank 1, the second lowest rank 2, etc.-and define the ranks as the edge-weights of the resulting functional connectome [32]. QA depicts Figure S9.

Appendix D Multi-scale Multi-Connectome Analysis
NDMG computes eight node-or edge-wise statistics of each connectome. Each illustrates a non-parametric graph property. The graph statistics are primarily computed with NetworkX and Numpy, and all implementations for NDMG live within the graph_qa module. Table 5 provides further information for each statistic. Table 5: Graph statistics. Each of the graph statistics computed by NDMG. The binarized graphs for NDMG-d were formed by thresholding the non-zero edges. The binarized graphs for NDMG-f were formed by thresholding edges with correlations greater than 0.1, which was identified in [32] as having the highest discriminability for functional connectomes. For each parcellation, vertex statistics are normalized by dividing them into number of vertices in the parcellation, and then smoothed via kernel-density estimation to enable comparison across scales. The kernel-width was computed using Scott's Rule, the default mode for Scipy (https://docs.scipy.org/doc/scipy-0.19.1/ reference/generated/scipy.stats.gaussian_kde.html). For most of the statistics, the "shape" of the distributions are relatively similar across scales, though their actual magnitudes can vary somewhat dramatically.

Appendix D.2 Multi-Study Analysis
Figure S11 top panel shows a variety of uni-and multi-variate statistics of the average di usion connectome from each of the studies enumerated in Table 2 with di usion data using the Desikan parcellation. The bottom panel shows the same statistics computed on the average functional connectome from each of the studies enumerated in Table 2 with functional data using the Desikan parcellation. In both the di usion and functional connectomes, each dataset largely appears to have similar trends across each of the statistics shown.
For the within-modality tests, we first define p 1 as the connection probability within a hemisphere, and p 2 as the connection probability between hemispheres, and τ therefore indicates whether a given edge corresponds to an ipsilateral or a contralateral connection. We then redefine p 1 as the connection probability for homotopic edges, and p 2 as the connection probability for heterotopic edges, and τ indicates whether an edge corresponds Figure S10: Multi-Scale Connectome Analysis. NDMG produces connectomes at a variety of scales, enabling investigation of graph properties between parcellation schemes. We can observe that the statistics are qualitatively similar in shape across scales, however, they are quantitatively significantly di erent, for both the di usion (top) and functional (bottom) connectomes. This suggests that claims made or analyses performed on a given scale may not hold when applied to another scale. This is impactful, as the choice of parcellation has significant bearing on the results of a scientific study. Figure S11: Multi-Study Connectome Analysis. Average connectomes from ten di usion studies processed with NDMGd are qualitatively compared by way of their summary statistics on the Desikan parcellation in the top figure. The Desikan atlas used in NDMG has been modified to include two additional regions, one per hemisphere, which fills in a hole in the parcellation near the corpus callosum. The nodes in this plot have been sorted such that the degree sequence of the left hemisphere (Desikan nodes 1-35) of the BNU1 dataset is monotonically non-decreasing, and that corresponding left-right nodes are next to one another. Each line shows the average for each statistic over all individuals within the study. On the bottom, we repeat the same analysis on the functional connectomes from seventeen di erent studies. Like the statistics computed in for the di usion connectomes, the statistics are again qualitatively similar but quantitatively disparate. This suggests that claims made or analyses performed on a given scale may not hold when applied to another scale. Again, we see that parcellation choice has an impact on the implications of a study. Information on the graph statistics computed can be found in Appendix D.