recount-brain: a curated repository of human brain RNA-seq datasets metadata

The usability of publicly-available gene expression data is often limited by the availability of high-quality, standardized biological phenotype and experimental condition information (“metadata”). We released the recount2 project, which involved re-processing ∼70,000 samples in the Sequence Read Archive (SRA), Genotype-Tissue Expression (GTEx), and The Cancer Genome Atlas (TCGA) projects. While samples from the latter two projects are well-characterized with extensive metadata, the ∼50,000 RNA-seq samples from SRA in recount2 are inconsistently annotated with metadata. Tissue type, sex, and library type can be estimated from the RNA sequencing (RNA-seq) data itself. However, more detailed and harder-to-predict metadata, like age and diagnosis, must ideally be provided by labs that deposit the data. To facilitate more analyses within human brain tissue data, we have complemented phenotype predictions by manually constructing a uniformly-curated database of public RNA-seq samples present in SRA and recount2. We describe the reproducible curation process for constructing recount-brain that involves systematic review of the primary manuscript, which can serve as a guide to annotate other studies and tissues. We further expanded recount-brain by merging it with GTEx and TCGA brain samples as well as linking to controlled vocabulary terms for tissue, Brodmann area and disease. Furthermore, we illustrate how to integrate the sample metadata in recount-brain with the gene expression data in recount2 to perform differential expression analysis. We then provide three analysis examples involving modeling postmortem interval, glioblastoma, and meta-analyses across GTEx and TCGA. Overall, recount-brain facilitates expression analyses and improves their reproducibility as individual researchers do not have to manually curate the sample metadata.
recount-brain is available via the add_metadata() function from the recount Bioconductor package at bioconductor.org/packages/recount.


Introduction
RNA sequencing (RNA-seq) is a valuable method for measuring gene expression across the transcriptome. The widespread availability and falling cost of high-throughput sequencing have led to massive amounts of biological information being accumulated in public data repositories (Denk, 2017). The Sequence Read Archive (SRA), which was established in 2007 by the National Center for Biotechnology Information (NCBI) and functions as the National Institutes of Health's (NIH) primary archive of high-throughput sequencing data, hosts raw sequencing data for >50,000 human RNA-seq samples (Leinonen et al., 2011) and is rapidly expanding (Kodama et al., 2012; Langmead and Nellore, 2018). This is a tremendous resource that allows genomic researchers to answer biological questions using already-sequenced reads from other laboratories. Deposition of data in the SRA is mandated by most funding agencies and open access journals, resulting in an expansive range of biological samples. However, missing information on biological phenotype (sex, age, disease status and type) and experimental condition (library selection, brain bank), in short sample metadata, significantly reduces its utility to researchers. In fact, critical sample phenotype information is missing or incomplete for many samples (92.7%) within the SRA (Ellis et al., 2018), limiting researchers' ability to answer biological questions using gene expression data.
Because the SRA is made up of individual submissions, data are not provided in a consistent format and annotations (such as methodology and technical sequencing details) are often unclear or missing (Bernstein et al., 2017). We previously developed recount2 (Collado-Torres et al., 2017c, 2017a), a public resource with over 70,000 uniformly processed human RNA-seq samples, enabling comparisons across studies for human expression data in the public repository. However, inconsistent phenotype annotation remains a barrier to taking advantage of public uniformly processed reads. Several efforts have been made to improve the sample metadata for SRA samples by predicting metadata from abstracts (Kingsford, 2016), automatically normalizing the available metadata with ontologies (Bernstein et al., 2017), and predicting sample metadata from expression values (Ellis et al., 2018). Here, we describe a reproducible curation process that complements phenotype predictions and automatic ontology inference. This curation process can be adapted and applied to other studies and tissues, thus expanding the use of the public RNA-seq data.
We applied our reproducible curation process to create recount-brain, a freely available human brain sample metadata database for SRA samples present in recount2 with unified metadata variables for brain Genotype-Tissue Expression (GTEx) (GTEx Consortium, 2015) and The Cancer Genome Atlas (TCGA) samples (Cancer Genome Atlas Research Network, 2008; Hutter and Zenklusen, 2018). By accessing the sample metadata and expression data via the recount Bioconductor package at bioconductor.org/packages/recount (Collado-Torres et al., 2017b, 2017c), researchers may study transcriptomic changes in neurological diseases.
Moreover, we offer a streamlined and efficient curation method for researchers and students to contribute critical sample metadata and enhance reproducibility in genomics. We outline our curation process here, including the insights gained and lessons learned, so that others may reproduce similar results or analyze public human brain RNA-seq data in a fraction of the time.

Results
Ready to use human brain sample metadata
recount-brain hosts sample metadata for 4,431 human brain tissue samples from 62 projects from the SRA, of which 3,214 (72.5%) samples have expression data available from recount2 (Collado-Torres et al., 2017c). recount-brain supports search and filter capabilities by tissue phenotype, spanning 3,600 neurological controls, with 2,900 from the SRP025982 study (SEQC/MAQC-III Consortium, 2014); 15 neurological diseases and information on brain tumor subtype, grade, and stage; 3 levels of detailed anatomic tissue site information; 5 developmental stages (Fetus, Infant, Child, Adolescent, Adult); demographic data (age, sex, race); technical sequencing information (RIN, PMI, sequencing layout, library source, etc.); and Brodmann area, tissue and disease ontologies (Methods: Reproducible curation process, Ontology mapping). The SRA samples in recount-brain are complemented by 1,409 GTEx (GTEx Consortium, 2015) and 707 TCGA (Brennan et al., 2013; Cancer Genome Atlas Research Network et al., 2015) samples covering 13 healthy regions of the brain and 2 tumor types, respectively (Methods: Merging recount-brain with GTEx and TCGA). In total, there are 6,547 samples with metadata in recount-brain with 5,330 (81.4%) present in recount2 (Collado-Torres et al., 2017c) and the curation process is reproducible. Of the samples present in recount-brain, 58.7% are absent from MetaSRA (Bernstein et al., 2017) brain samples and conversely MetaSRA lists samples absent from recount-brain, showcasing how these approaches complement each other (Methods: MetaSRA comparison). Figure 1 outlines the variables that were used and a list of sample attributes found in the database. The complete list of variables and descriptions is available in Table S1.
The add_metadata() function in recount (Collado-Torres et al., 2017b) makes it easy to access the complete recount-brain metadata in RNA-seq analyses. This function can be used to access the full recount-brain metadata to find samples and studies of interest as illustrated in Figure 2 (purple box). Alternatively, users can interactively explore recount-brain via https://jhubiostatistics.shinyapps.io/recount-brain/ to identify samples of interest (Methods: Interactive display). Once a study of interest has been identified, the user can download the gene expression data from recount2 (Collado-Torres et al., 2017c) and append the recount-brain metadata as shown in Figure 2; this process is equivalent to appending custom metadata from Figure 2 of the recount workflow (Collado-Torres et al., 2017a). Once the expression data from recount2 and the sample metadata from recount-brain have been combined, the user can proceed to perform analyses such as identification of differentially expressed genes and enriched gene ontologies; examples are illustrated in Figure 2 with data from SRA study SRP027383 (Bao et al., 2014).

Figure 2. Uses of recount-brain and its relationship with recount2. recount-brain facilitates identifying project(s) of interest (purple box) programmatically or interactively through https://jhubiostatistics.shinyapps.io/recount-brain/. After downloading expression data from recount2, recount-brain can enrich the sample metadata for brain studies. This information can be used to perform analyses to find differentially expressed genes and enriched gene sets such as those exemplified with SRA Study SRP027383 (Bao et al., 2014), where the top differentially expressed gene among glioblastoma samples in recount2 is SMC4. Black boxes represent R code with functions highlighted in blue, input arguments in green, and R objects in white.
Differential gene expression analysis using recount-brain
To exemplify how recount-brain facilitates re-analysis of public human brain RNA-seq data, we selected an SRA study with almost no sample information available from the NCBI SRA with data for 272 gliomas (Bao et al., 2014). Using the gene expression data from study SRP027383, we identified 6,116 and 6,438 genes with significant (FDR <1%) decreasing and increasing expression associations, respectively, with linear tumor grade progression (II, III and IV), while adjusting for sex, age and pathology (IDH1 mutation status) using the 258 (94.9%) samples that had complete sample metadata (Methods: Differential expression by tumor grades with data from SRP027383). SMC4, the top differentially expressed gene (Figure 2), plays a role in the structural maintenance of chromosomes. The genes with increased expression as tumor grade progresses are enriched for DNA replication and chromosome segregation biological processes (Figure 2). SMC4 is a core component of the condensin complexes, which have recently been associated with aggressive glioblastoma phenotypes (Jiang et al., 2017). Furthermore, SMC4 mRNA and protein expression levels are associated with poor prognosis and could potentially be a therapeutic target in gliomas (Jiang et al., 2017).
The full code for reproducing this example analysis is available in Supplementary File 1 and at http://LieberInstitute.github.io/recount-brain/ . Subject-matter experts could further examine the results and guide analyses like the one we carried out. Without recount-brain and recount2 (Collado-Torres et al., 2017c) for this analysis one would have had to process the raw expression data and obtain the relevant sample metadata, which would likely have taken a considerable amount of time. Furthermore, the processed RNA-seq data from recount2 , the curated sample metadata from recount-brain and the analysis code provided in Supplementary File 1 are all public resources that enable the full reproducibility of the analysis we described.
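The FDR threshold used in the example above can be illustrated with a small sketch. The text does not state which multiple testing adjustment was applied, so the Python code below assumes the Benjamini-Hochberg procedure; the function names, coefficients and p-values are made up for illustration, and this is not the R code from Supplementary File 1.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def count_by_direction(coefficients, pvalues, fdr=0.01):
    """Count significant genes with negative and positive coefficients."""
    adjusted = benjamini_hochberg(pvalues)
    decreasing = sum(1 for c, q in zip(coefficients, adjusted) if q < fdr and c < 0)
    increasing = sum(1 for c, q in zip(coefficients, adjusted) if q < fdr and c > 0)
    return decreasing, increasing


# Toy linear-term coefficients and p-values for five hypothetical genes.
toy_coefficients = [-1.2, 0.8, 0.3, -0.1, 0.05]
toy_pvalues = [0.001, 0.0015, 0.039, 0.041, 0.6]
toy_adjusted = benjamini_hochberg(toy_pvalues)
toy_counts = count_by_direction(toy_coefficients, toy_pvalues)
```

In this toy input only the first two genes pass the 1% threshold, one with a decreasing and one with an increasing association.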

Effects of post-mortem interval on transcription
Researchers with their own human brain datasets can use recount-brain to assess the replication of their results, regardless of the publication status of their projects. Recently (Ferreira et al., 2018). To exemplify how recount-brain can be used to replicate findings, we identified 10 SRA studies with variable post-mortem interval (PMI) data for a total of 252 samples from 9 publications (Boudreau et al., 2014; Dumitriu et al., 2016; Khrameeva et al., 2014; Labadorf et al., 2015; Magistri et al., 2015; Pardo et al., 2013; Voineagu et al., 2011; Wu et al., 2012). In their analysis of GTEx and PMI data, Ferreira et al. identified genes with significant temporal changes across tissues (Ferreira et al., 2018). Among them, Ferreira et al. (their Figure 2D) found no significant association between RNASE2, EGR3, HBA1 and CXCL2 expression and PMI in the brain cerebellum or cortex (Ferreira et al., 2018). We replicated their findings for EGR3, HBA1 and CXCL2 but found a significant decrease in RNASE2 expression with PMI progression across the four intervals we had data for. Ferreira et al. (2018) also observed that certain tissues, such as brain cerebellum and cortex, exhibited a positive relationship between PMI and mitochondrial transcription, while other tissues demonstrated a negative relationship. We replicate this finding here by taking advantage of the fact that recount2 provides base-pair coverage data in the form of BigWig files (Collado-Torres et al., 2017a, 2017c). We used this BigWig data to compute the percent of mitochondrial transcription and observed a positive association with PMI (in hours, Figure 4A).
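As a rough illustration of the quantity described above, the percent of mitochondrial transcription can be thought of as the share of total base-pair coverage that falls on the mitochondrial chromosome. The Python sketch below assumes the per-chromosome coverage sums have already been extracted from a BigWig file; the function name, the "chrM" label and the toy numbers are assumptions, and the exact denominator used in the analysis is documented in Supplementary File 2 rather than here.

```python
def percent_mitochondrial(coverage_by_chrom, mito_name="chrM"):
    """Percent of the summed base-pair coverage that maps to the mitochondrial chromosome."""
    total = sum(coverage_by_chrom.values())
    if total == 0:
        raise ValueError("sample has no coverage")
    return 100.0 * coverage_by_chrom.get(mito_name, 0) / total


# Toy summed coverage for one hypothetical sample.
toy_coverage = {"chr1": 600, "chr2": 300, "chrM": 100}
toy_percent = percent_mitochondrial(toy_coverage)
```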
We also computed the percent of transcription that overlaps genes from Gencode v25 (Harrow et al., 2012) excluding the mitochondrial genes. We expected a negative association between mitochondrial and global gene transcription; however, we found instead that the 10 studies clustered into two distinct sets (Figure 4B). We examined differences between the two sets of studies across disease status, demographic, biological, technical and quality covariates as well as SHARQ beta (Kingsford, 2016) and phenopredict sample metadata (Ellis et al., 2018) and found no clear-cut differences. Most of the samples from the second study set are 100 base-pair paired-end reads and have a 0.63 higher mean RIN (95% CI 0.35 to 0.92, p-value 4.804e-05). However, these differences do not predict the two types of relationships observed in Figure 4B. Overall, our replication analysis revealed two types of associations between mitochondrial and genome transcription (excluding mitochondrial genes) that could change the interpretation of the implications of PMI.
GTEx (GTEx Consortium, 2015) and TCGA (Brennan et al., 2013; Hutter and Zenklusen, 2018) have more detailed sample metadata than the rest of the SRA samples. We thus expect that users will be interested in combining the human brain SRA samples from recount-brain with either GTEx or TCGA. To exemplify this process, we selected the two glioblastoma studies with at least 20 samples: SRP027383 (Bao et al., 2014) and SRP044668 (Gill et al., 2014). We then combined these SRA samples with those from TCGA listed as primary glioblastoma tumors (Brennan et al., 2013; Cancer Genome Atlas Research Network, 2008). Using only the tumor samples, we normalized the studies and removed variation across them to make them comparable. This example analysis demonstrates how recount-brain can be used with TCGA data to facilitate meta-analyses.
To further simplify the process of comparing SRA, GTEx and TCGA human brain data, we adapted the GTEx and TCGA human brain sample metadata and merged it into recount-brain (see Methods). The full code for reproducing this example analysis is available in Supplementary File 3 and at http://LieberInstitute.github.io/recount-brain/.
Figure color key: pink, SRP027383 versus SRP044668; green, SRP027383 versus TCGA brain; purple, SRP044668 versus TCGA brain; blue, TCGA brain versus TCGA kidney.

Discussion
Massive amounts of sequencing data are accumulating in public repositories, but unlabeled or unannotated variables limit the ability of researchers to analyze these data. We present recount-brain, a freely available human brain sample metadata database that pairs with the uniformly-processed RNA-seq data from recount2, enabling researchers to study transcriptomic changes in neurological disease. Our metadata database and reproducible curation approach show that freely-available data can be cleaned and curated to encourage and facilitate data reuse and increase the value of small studies, such as SRP017933 (Pardo et al., 2013), and large studies, such as GTEx and TCGA (Hutter and Zenklusen, 2018), alike.
In our methods, we described a semi-automated reproducible process through the pseudo-algorithm provided (Figure 6) that offers a streamlined, efficient method for researchers interested in genomics to contribute critical data and enhance reproducibility in the field. Prior experience and knowledge are useful but not necessary as the curation process has been distilled down to an easy-to-replicate step-by-step method. We envision future students, who may have a particular interest in a disease or organ system, collaborating with genomic data scientists to learn about the field in an interactive hands-on way while simultaneously contributing valuable and publishable work. Improving the metadata of public data will benefit everyone and facilitate the creation of major data search engines including the new public data Google website (Castelvecchi, 2018). Enabling researchers to take advantage of deposited data will increase reproducibility in genomics research as "taking no notice of deposited data is similar to ignoring several independently published replication experiments" (Denk, 2017). Our refined database and reproducible model reuse public data, enhance reproducibility among genomic researchers, and enable translational discovery. recount-brain further improves the usability of the RNA-seq data present in recount2 (Collado-Torres et al., 2017c). We showed how one can combine the two resources to perform analyses, explore different biological questions, replicate findings and expand analyses from other studies. recount-brain also facilitates meta-analyses across SRA, GTEx, and TCGA samples such as our example analysis on the expression variability in glioblastoma tumor samples.
The sample metadata in recount-brain can be combined with other projects that enhance recount2 beyond gene expression. For example, Snaptron (Wilks et al., 2018) provides rapid querying of splice junctions and splicing patterns from samples in recount2 and could be used together with recount-brain for studying human brain splicing. Furthermore, recount-brain metadata could be utilized together with the recently released transcript abundance estimates (Fu et al., 2018) using the same add_metadata() functionality that we demonstrated with our example use cases. The add_metadata() function can be easily expanded to cover other tissues if others follow our annotation workflow ( Figure 6 ) to improve the sample metadata of other studies. Furthermore, curation efforts such as recount-brain will likely be useful in the creation of more refined sample metadata prediction algorithms (Ellis et al., 2018) .
Curation efforts complement prediction and automatic ontology inference approaches as there is always some uncertainty attached to predictions and inferences (Bernstein et al., 2017;Ellis et al., 2018) .
We documented our curation process with detailed notes and code at http://LieberInstitute.github.io/recount-brain/ . We believe that recount-brain will save time for other researchers since they can immediately bypass this time-consuming process: from extracting the information from the articles to merging the variables with GTEx and TCGA, as well as matching with ontology databases. By using the R package versioned framework and constructing a unified resource we are also promoting reproducibility of downstream analyses.
Furthermore, we invite researchers to contribute to recount-brain and the add_metadata() framework in recount by curating more human datasets and submitting them via https://github.com/LieberInstitute/recount-brain/issues/new . We envision that researchers will follow similar curation processes to ours or compute sample metadata predictions and contribute them to the community via sample metadata unification projects.

Reproducible curation process
A summary of the curation process we followed is shown by the pseudo-algorithm in Figure 6 . First, it was necessary to select the studies that would eventually compose recount-brain . For the 50,099 human RNA-seq SRA samples for which recount2 has expression data, we used the v0.0.03 tissue predictions from the phenopredict R package (Ellis et al., 2018) to list all studies predicted to have brain tissue samples ( Figure 6 A ). We then identified all SRA studies that were made up of >70% brain tissue samples and had at least 4 samples to increase our yield of total brain samples for our refined database. We downloaded study metadata from  ( Figure 6 B-C ).
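The study-selection step can be sketched as follows. This is a minimal illustration in Python rather than the R code used to build recount-brain. The function name, the input layout (one pair of study accession and predicted tissue per sample) and the toy accessions are assumptions made for the example; the thresholds (more than 70% brain samples, at least 4 samples) come from the text above, and the sketch assumes the minimum of 4 applies to the total number of samples in a study.

```python
from collections import defaultdict


def select_brain_studies(predictions, min_samples=4, min_brain_fraction=0.70):
    """Return study IDs with enough samples that are mostly predicted as brain.

    predictions: iterable of (study_id, predicted_tissue) pairs, one per sample.
    """
    totals = defaultdict(int)
    brain = defaultdict(int)
    for study_id, tissue in predictions:
        totals[study_id] += 1
        if tissue == "brain":
            brain[study_id] += 1
    selected = []
    for study_id in sorted(totals):
        n = totals[study_id]
        # Strictly more than 70% brain samples and at least 4 samples in total.
        if n >= min_samples and brain[study_id] / n > min_brain_fraction:
            selected.append(study_id)
    return selected


# Toy predictions with made-up study accessions.
toy_predictions = (
    [("STUDY_A", "brain")] * 8 + [("STUDY_A", "blood")] * 2    # 80% brain, 10 samples
    + [("STUDY_B", "brain")] * 3                               # 100% brain, only 3 samples
    + [("STUDY_C", "brain")] * 7 + [("STUDY_C", "liver")] * 3  # exactly 70% brain
)
selected_studies = select_brain_studies(toy_predictions)
```

Only the first toy study passes both criteria: the second has too few samples and the third does not exceed the 70% threshold.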
Based on an exploratory analysis of the common annotated variables across included studies, we developed a common set of variables that we believed would be most useful to investigators for downstream analyses ( Figure 6 D ). We then carefully and systematically searched article text and supplementary materials for specific information on the biological samples. Novel tissue attributes found in publication text but not included in the original metadata were added directly to the project-specific metadata tables. These included demographic data, technical sequencing information, clinical and pathological characteristics, and anatomical details. If downloaded metadata already included data for one or more of our uniform set of variables, then we aligned it with our uniform set of variables.
The most important aspect of curation is the search, identification, and transfer of tissue attributes not included in the original metadata. It is also, by far, the most time-consuming, which is why we have described our recommendations for the process (Figure 6E). The key was to develop a structured and systematic approach to each study, which allowed us to gather the most information possible while ensuring reproducibility. We adhered strictly to a uniform method for annotation and documentation, recording where the sample information was found. The full reproducibility document is available at http://LieberInstitute.github.io/recount-brain/ and is organized by SRA Study.

Materials and Methods
Updated metadata tables were saved as CSV files and merged across projects into a single table (recount-brain-v1). We introduced the add_metadata() function to the recount (version >= 1.5.6) Bioconductor package (Collado-Torres et al., 2017b, 2017c) to facilitate merging the expression data from recount2 with the sample metadata from recount-brain. This annotation workflow can be applied to other tissues by selecting candidate projects using predictions by phenopredict (Ellis et al., 2018) or other efforts (Bernstein et al., 2017; Kingsford, 2016).
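The step of aligning project-specific tables to a uniform set of variables and merging them into one table can be sketched in a few lines. The Python code below is only an illustration of the idea, not the curation code itself; the uniform column list, the project-specific column names, the run accessions and the helper name are all hypothetical.

```python
import csv
import io

# Hypothetical uniform variable set; the real list is given in Table S1.
UNIFORM_COLUMNS = ["run", "sex", "age", "tissue_site_1"]


def read_project_csv(text, column_map):
    """Read one project's CSV text and return rows restricted to the uniform columns."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        renamed = {column_map.get(name, name): value for name, value in row.items()}
        # Empty strings and absent columns both become None (missing).
        rows.append({col: (renamed.get(col) or None) for col in UNIFORM_COLUMNS})
    return rows


# Two toy projects whose original column names differ.
project_a_csv = "Run,gender,age_years\nRUN001,female,34\nRUN002,male,\n"
project_b_csv = "run,sex,tissue\nRUN101,male,Cerebellum\n"

merged_table = (
    read_project_csv(project_a_csv, {"Run": "run", "gender": "sex", "age_years": "age"})
    + read_project_csv(project_b_csv, {"tissue": "tissue_site_1"})
)
```

Keeping one explicit column map per project makes every renaming decision visible and repeatable.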
Merging recount-brain with GTEx and TCGA
We combined recount-brain-v1 with the brain metadata from TCGA (Brennan et al., 2013; Cancer Genome Atlas Research Network, 2008; Cancer Genome Atlas Research Network et al., 2015) and GTEx (GTEx Consortium, 2015) and created recount-brain-v2. We found 707 and 1,409 brain samples in TCGA and GTEx, respectively. The metadata in TCGA and GTEx was relatively complete; however, the formats differed between them and from recount-brain-v1.
We compiled the TCGA and GTEx metadata and converted them into the format used in recount-brain-v1 when creating recount-brain-v2. Some variables, such as the "Brain tissue repository source", were directly combined between the three datasets; however, most involved some minor alterations or were not comparable. For example, the "Nature of Disease (Disease / Control)" variable indicates whether a sample is a "Solid Tissue Normal" or a tumor of any type (i.e., primary or recurrent) in TCGA, and was adapted from the Death Classification: 4-point Hardy Scale in GTEx (GTEx Consortium, 2015). All alterations generated from TCGA and GTEx metadata can be found in Table S2. Furthermore, study name, count-file identifier, and drug information pertaining to TCGA samples were added to recount-brain-v2. Additional information about these metadata variables can be found after the "Columns that are not from recount_brain" row of Table S2. If there are other metadata variables within TCGA or GTEx that are not part of recount-brain-v2, all_metadata(subset = "gtex") or all_metadata(subset = "tcga") can be used to download this metadata, which can then be merged using the count_file_identifier variable from recount-brain-v2.
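The merge on the count_file_identifier variable described above amounts to a left join that adds extra columns without overwriting curated ones. The Python sketch below shows that logic on toy rows; in practice this would be done in R on the tables returned by all_metadata(). The helper name, the "file_id" and "extra_variable" column names and the identifiers are hypothetical.

```python
def left_join(primary_rows, extra_rows, primary_key, extra_key):
    """Add columns from extra_rows to primary_rows, matching primary_key to extra_key.

    Rows without a match are kept unchanged and existing columns are never overwritten.
    """
    extra_by_key = {row[extra_key]: row for row in extra_rows}
    merged = []
    for row in primary_rows:
        combined = dict(row)
        for column, value in extra_by_key.get(row[primary_key], {}).items():
            if column != extra_key:
                combined.setdefault(column, value)
        merged.append(combined)
    return merged


# Toy recount-brain rows and a toy table of additional metadata.
recount_brain_rows = [
    {"count_file_identifier": "FILE_A", "sex": "female"},
    {"count_file_identifier": "FILE_B", "sex": "male"},
]
extra_rows = [{"file_id": "FILE_A", "extra_variable": "value_1", "sex": "unknown"}]
merged_rows = left_join(recount_brain_rows, extra_rows, "count_file_identifier", "file_id")
```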

Ontology mapping
To expand recount-brain and take advantage of curated ontologies, we matched recount-brain to ontologies available via the BioPortal project (Whetzel et al., 2011) such as UBERON (Mungall et al., 2012). For the Brodmann area, we used the curated brodmann_area variable for matching to UBERON's preferred label. The diseases were matched to ontologies manually. For tissues, we constructed a hierarchical tissue variable from tissue_site_3 > tissue_site_2 > tissue_site_1 such that the more detailed information is used when available. We then searched UBERON's preferred labels to identify the ontology entries that best matched the tissue descriptions. For Brodmann area and tissue ontology terms, we extracted their synonyms, parent term IDs and parent term labels to facilitate identifying samples of interest through text-based searches. The code for the ontology mapping is available at http://LieberInstitute.github.io/recount-brain/.
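The hierarchical tissue variable is essentially a fallback from the most detailed level to the least detailed one. A minimal Python sketch of that rule is shown below; the function name and the example tissue labels are assumptions, while the three variable names are those given in the text.

```python
def most_detailed_tissue(sample):
    """Return the most detailed non-missing tissue label for one sample, or None."""
    for key in ("tissue_site_3", "tissue_site_2", "tissue_site_1"):
        value = sample.get(key)
        if value is not None and value.strip() != "":
            return value
    return None


# Toy samples annotated at different levels of detail.
toy_samples = [
    {"tissue_site_1": "Cerebral cortex", "tissue_site_2": "Frontal lobe",
     "tissue_site_3": "Dorsolateral prefrontal cortex"},
    {"tissue_site_1": "Cerebral cortex", "tissue_site_2": "Temporal lobe", "tissue_site_3": None},
    {"tissue_site_1": "Cerebellum", "tissue_site_2": "", "tissue_site_3": None},
]
toy_labels = [most_detailed_tissue(sample) for sample in toy_samples]
```

The resulting label is the text that would then be compared against the ontology's preferred labels.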

MetaSRA comparison
We downloaded the MetaSRA (Bernstein et al., 2017) data for UBERON term 0000955 (brain) on April 15, 2019. The table we downloaded included information for 17,890 brain samples from 342 studies. Of these 342 studies, 100 included at least one sample present in recount2 that is absent from recount-brain. Based on our selection criteria of at least 4 samples in recount2 and more than 70% brain samples in a study, 28 studies would match the criteria based on MetaSRA data. Of these 28 studies, 5 are supported by the latest phenopredict predictions (version 0.0.06) and 7 by the SHARQ prototype tissue predictions (Kingsford, 2016). Conversely, 17 (26.6%) of the 64 studies and 3,841 (58.7%) of the samples in recount-brain are absent from MetaSRA (including GTEx and TCGA). The code and full comparison results are available at http://LieberInstitute.github.io/recount-brain/. Table S3 contains the list of the 100 studies with at least one brain sample according to MetaSRA that are present in recount2 and absent from recount-brain.
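The overlap statistics in this comparison reduce to set differences between two lists of identifiers. The Python sketch below illustrates the calculation with made-up sample identifiers; the function name is hypothetical and the actual comparison code is on the supplementary website.

```python
def absent_from(own_ids, other_ids):
    """Return the IDs in own_ids that are missing from other_ids, with their percentage."""
    own = set(own_ids)
    missing = own - set(other_ids)
    return sorted(missing), round(100 * len(missing) / len(own), 1)


# Toy identifiers standing in for the two resources.
toy_recount_brain = ["S1", "S2", "S3", "S4"]
toy_metasra = ["S2", "S4", "S5"]
missing_ids, missing_pct = absent_from(toy_recount_brain, toy_metasra)

# The same percentage calculation applied to the reported study counts (17 of 64).
reported_study_pct = round(100 * 17 / 64, 1)
```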

Differential expression by tumor grades with data from SRP027383
We downloaded the gene expression data for SRA study SRP027383 (Bao et al., 2014) from recount2 (Collado-Torres et al., 2017c) using recount v1.5.9, added the recount-brain-v1 sample metadata and retained the 258 samples that have sex, age, pathology (IDH1 mutation either + or -) and tumor grade progression (II, III and IV) recorded. We then filtered out the genes with a mean RPKM < 0.24 as suggested by the expression_cutoff() function from jaffelab v0.99.18 (Collado-Torres and Jaffe, 2018), retaining 25,649 genes. Next we computed library size adjustments with edgeR v3.21.9 (McCarthy et al., 2012; Robinson et al., 2010) and performed the differential gene expression analysis using limma-voom v3.35.12 (Law et al., 2014; Ritchie et al., 2015). The model we used was ordered(clinical_stage_1) + sex + age + pathology such that we fitted a linear and a quadratic term for the tumor grade.
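The linear and quadratic terms arise because R encodes an ordered factor with orthogonal polynomial contrasts by default; with three levels (grades II, III and IV) this yields exactly two columns. The Python sketch below reconstructs those two contrast vectors for three equally spaced levels to show what the terms look like. The function name is hypothetical and the code is an illustration of the encoding, not part of the analysis.

```python
import math


def polynomial_contrasts_3():
    """Orthonormal linear and quadratic contrasts for three equally spaced levels."""
    scores = [1.0, 2.0, 3.0]
    mean = sum(scores) / 3
    linear = [s - mean for s in scores]             # [-1, 0, 1]
    squares = [v * v for v in linear]               # [1, 0, 1]
    square_mean = sum(squares) / 3
    # Centering the squares makes them orthogonal to the constant term; they are
    # already orthogonal to the linear term because the scores are symmetric.
    quadratic = [v - square_mean for v in squares]  # [1/3, -2/3, 1/3]

    def normalize(vector):
        norm = math.sqrt(sum(v * v for v in vector))
        return [v / norm for v in vector]

    return normalize(linear), normalize(quadratic)


linear_term, quadratic_term = polynomial_contrasts_3()
```

A gene whose expression rises steadily from grade II to IV loads on the linear term, while one that peaks or dips at grade III loads on the quadratic term.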

Variation in gene expression across multiple glioblastoma studies
In order to illustrate how recount-brain metadata can be utilized with expression data from more than one study in an expression analysis, we used the two glioblastoma studies with at least 20 samples present in recount-brain-v1 : SRP027383 (N=270) (Bao et al., 2014) and SRP044668 (N=93) (Gill et al., 2014) . We then combined these SRA samples with those from TCGA listed as primary glioblastoma tumors (N=157) (Brennan et al., 2013;Cancer Genome Atlas Research Network, 2008) for a total of 520 samples. We used expression_cutoff() from jaffelab v0.99.21 (Collado-Torres and Jaffe, 2018) to filter the genes with mean RPKM < 0.21, retaining a total of 26,499 genes. As we are specifically interested in assessing variability across glioblastoma tumors in this particular analysis, control samples were dropped, retaining a total of 320 tumor samples: SRP027383 (N=99), SRP044668 (N=74), TCGA (N=157). We normalized for the data source effect by using a linear regression with an indicator variable differentiating the SRA and the TCGA samples and then regressing out this effect. We then removed the top 6 principal components (PCs) computed on the log2(RPKM+0.5) data to facilitate cross-study comparisons.
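The data source adjustment can be sketched for a single gene as follows. With an intercept and one 0/1 indicator, the least squares coefficient of the indicator equals the difference between the two group means, so regressing out the effect subtracts that difference from the TCGA samples. This Python sketch assumes the intercept is retained; the function names and toy values are made up, and the removal of principal components is not shown.

```python
import math


def log2_rpkm(rpkm_values, offset=0.5):
    """Transform RPKM values to the log2(RPKM + 0.5) scale."""
    return [math.log2(v + offset) for v in rpkm_values]


def regress_out_source(values, is_tcga):
    """Remove the fitted SRA-versus-TCGA effect from one gene's expression values."""
    sra = [v for v, flag in zip(values, is_tcga) if not flag]
    tcga = [v for v, flag in zip(values, is_tcga) if flag]
    if not sra or not tcga:
        raise ValueError("both data sources are needed to estimate the effect")
    # Indicator coefficient from the fit values ~ 1 + is_tcga.
    effect = sum(tcga) / len(tcga) - sum(sra) / len(sra)
    return [v - effect * (1 if flag else 0) for v, flag in zip(values, is_tcga)]


# Toy expression values for one gene: two SRA samples followed by two TCGA samples.
toy_log_values = log2_rpkm([0.5, 1.5, 3.5])
toy_adjusted = regress_out_source([1.0, 3.0, 6.0, 8.0], [False, False, True, True])
```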
We then estimated the variance of each gene for each of the three datasets using colVars() from matrixStats v0.53.1 (Bengtsson, 2018) and compared the most variable gene ranking using concordance at the top plots. We also processed primary tumor kidney TCGA data (Davis et al., 2014) given its biological dissimilarity with human brain. We used a similar normalization procedure with the combined brain and kidney data (top 4 PCs removed) to produce a background pairwise comparison for the concordance at the top plots. Code and full results are provided in Supplementary File 3 .
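A concordance at the top curve reports, for each k, the fraction of the k highest-ranked genes that two rankings share. The Python sketch below ranks genes by variance in two toy datasets and computes that curve; the function names, gene names and expression values are invented for illustration, and the actual analysis code is in Supplementary File 3.

```python
import statistics


def rank_by_variance(expression_by_gene):
    """Return gene names ordered from most to least variable (ties broken by name)."""
    variances = {gene: statistics.variance(values) for gene, values in expression_by_gene.items()}
    return sorted(variances, key=lambda gene: (-variances[gene], gene))


def concordance_at_top(ranking_a, ranking_b):
    """For k = 1..n, the fraction of the top-k genes shared by the two rankings."""
    n = min(len(ranking_a), len(ranking_b))
    return [len(set(ranking_a[:k]) & set(ranking_b[:k])) / k for k in range(1, n + 1)]


# Toy expression values for four genes in two hypothetical datasets.
dataset_1 = {"g1": [1, 5, 9], "g2": [2, 4, 6], "g3": [3, 3, 4], "g4": [1, 1, 1]}
dataset_2 = {"g1": [2, 4, 6], "g2": [1, 5, 9], "g3": [1, 1, 1], "g4": [3, 3, 4]}
ranking_1 = rank_by_variance(dataset_1)
ranking_2 = rank_by_variance(dataset_2)
toy_concordance = concordance_at_top(ranking_1, ranking_2)
```

Comparing such curves against a biologically dissimilar background pair shows how much of the concordance is specific to the tissue.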

Data Access
There are four different ways to access the recount-brain dataset (both version 1 and 2). Both an R version and a text version (csv) are available from the recount-brain GitHub repository at https://github.com/LieberInstitute/recount-brain. recount-brain can also be explored interactively from https://jhubiostatistics.shinyapps.io/recount-brain/ and subsets of interest can be downloaded to csv files from that website. However, we mainly recommend using the add_metadata(source = "recount_brain_v2") function from the recount R package.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Conflict of Interest: none declared.

Supplementary Material
Supplementary Website: http://LieberInstitute.github.io/recount-brain/ contains all the R code as well as the supplementary files that can be used to reproduce the entire recount-brain project.

Table S1: List of variables present in recount-brain and description of each variable saved in a CSV file. This file is also available at https://github.com/LieberInstitute/recount-brain/blob/master/SupplementaryTable1.csv.

Table S2: Detailed notes on how the GTEx and TCGA variables were processed when creating recount-brain-v2 in order to merge them with recount-brain-v1. This file is also available at https://github.com/LieberInstitute/recount-brain/blob/master/SupplementaryTable2.csv.

Table S3: List of studies present in MetaSRA and recount2 with at least one brain sample that are absent in recount-brain. Includes the study abstract, whether the abstract mentions brain, number of brain samples listed, percent of brain samples for the project, and whether the study would pass the selection criteria used for this study. This file is also available at http://LieberInstitute.github.io/recount-brain/metasra_comp/discrepant_studies.csv.

Supplementary File 2: Full example analysis on the effects of post-mortem interval on transcription. This file is also available at https://github.com/LieberInstitute/recount-brain/blob/master/example_PMI/example_PMI.pdf.