Human milk variation is shaped by maternal genetics and impacts the infant gut microbiome

Human milk is a complex mix of nutritional and bioactive components that provide complete nutrition for the infant. However, we lack a systematic knowledge of the factors shaping milk composition and how milk variation influences infant health. Here, we used multi-omic profiling to characterize interactions between maternal genetics, milk gene expression, milk composition, and the infant fecal microbiome in 242 exclusively breastfeeding mother-infant pairs. We identified 487 genetic loci associated with milk gene expression unique to the lactating mammary gland, including loci that impacted breast cancer risk and human milk oligosaccharide concentration. Integrative analyses uncovered connections between milk gene expression and infant gut microbiome, including an association between the expression of inflammation-related genes with IL-6 concentration in milk and the abundance of Bifidobacteria in the infant gut. Our results show how an improved understanding of the genetics and genomics of human milk connects lactation biology with maternal and infant health.


RNA extraction and sequencing
We extracted RNA from whole milk cell pellets to capture gene expression from both mammary epithelial cells and immune cells in milk. Previous studies that have performed bulk RNA-sequencing from human milk have used RNA extracted from the milk fat layer 15 . This procedure enriches for milk fat globule RNA, which originates from mammary epithelial cells 15,16 . Our approach allowed us to use computational approaches to estimate the contribution of different cell types to our milk transcriptome, and explore genetic influences on gene expression that could be specific to the immune cells in milk, in addition to mammary epithelial cells.
Nucleic acid extractions and RNA-seq library preparation and sequencing was performed at the University of Minnesota Genomics Center (UMGC). Frozen 2 mL whole milk aliquots from 245 milk samples were thawed and split in two, with each 1mL half used for either RNA or DNA extraction. RNA was extracted from the cell pellet using the RNeasy Plus Universal HTP following the manufacturer's instructions. We used the TakaraBio Stranded Total RNA Pico Mammalian kit for RNA-seq library preparation. RNA libraries were sequenced on an Illumina NovaSeq 6000 S2 flow cell with 2x150 paired-end reads to a median depth of 34.2 million reads per sample (Table S1).
RNA-seq pre-processing and quantification RNA-seq reads were trimmed with Trimmomatic and aligned with STAR 69 v2.7.1a to the GRCh38 human reference genome. Gene-level quantification was performed with RNA-SeQC 70 v2.3.4 using a Gencode v36 gene model annotation that was collapsed to a single transcript model per gene using a script provided by GTEx ("collapse_annotation.py" from https://github.com/broadinstitute/gtex-pipeline/tree/master/gene_model). To assess the gene-level quantifications, TPM spearman correlations were calculated between each pair of samples with the 'rcorr' function from the 'Hmisc' R package 71 . One outlier sample was removed for having low median pairwise correlations with other samples and fewer than 10,000 genes detected (Fig. S1). There were two participants that each contributed two milk samples, from two separate pregnancies. We included only one milk sample from each of these participants in our analyses, leaving 242 milk transcriptomes from 242 different participants (Table S1).
Whole-genome sequencing and quality control DNA was extracted from the cell pellet using the QIAamp 96 DNA Blood Kit at UMGC following the manufacturer's instructions. Low-pass whole genome sequencing (WGS) at ~1x and genotype imputation was performed by Gencove. Low-pass WGS and imputation provides comparable or improved accuracy and variant discovery to array-based genotyping 72,73 . 173 milk samples successfully underwent WGS and imputation from the original 1 mL aliquot DNA extraction. 72 samples had insufficient DNA extracted from the initial 1 mL sample, or failed Gencove's quality control. Of these 72 samples, 62 had an additional 15 mL frozen aliquot that was shipped to Gencove and DNA was extracted using a mag Nucleic Acid Purification Kit (Biosearch Technologies), and ~1x low-pass WGS was performed. 11 of these samples failed Gencove's quality control and 51 samples successfully underwent WGS and imputation, resulting in a total of 224 samples with genotype information. 2 participants contributed 2 milk samples (from 2 separate pregnancies), and we included only one sample per participant in our analyses, leaving 222 unique individuals with genotype information. Sample-level details of extraction and sequencing are in Table S1.
BCFtools 74 was used to combine VCFs into a BCF file for all individuals, filtering for minor allele frequency >1% and maximum missing genotypes of 5%. A genetic relatedness matrix was generated with the PLINK 75 (v1.90b6.10) '--make-rel' command, and pairs of individuals with relatedness coefficient >0.05 were pruned, leaving 206 individuals for genetic analyses. Genetic PCs for these 206 individuals were calculated with PLINK using 67,878 SNPs with minor allele frequency >5% after pruning for linkage disequilibrium (PLINK command '--indep-pairwise 500 5 0.5'). Genetic ancestry proportion estimates provided by Gencove showed that individuals included in genetic analyses were mostly of European genetic ancestry, with only 17 of 206 individuals with estimated European ancestry <95% (Fig. S2).
We checked for sample swaps by performing genotype calling from RNA-seq reads aligned to chromosome 2 using 'bcftools mpileup', and using 'bcftools gtcheck' to compare genotypes from RNA-seq to Gencove variant calls from low-pass WGS 74 . We did not detect any sample swaps: for all samples included in eQTL analysis, the DNA sample with matching sample ID had the lowest average concordance, compared to all DNA samples with a different sample ID (Fig. S3). function in edgeR to downsample each GTEx and milk sample to 5 million read counts. We took the resulting count matrix as a DESeq2 object and performed variance stabilizing transformation (VST). We then took the VST matrix of only GTEx samples, selected the 1000 most variable genes, and performed principal component analysis in R with the 'prcomp' function. We then projected the milk samples onto the PCA scatter plots by calculating each milk sample's values from the GTEx-only PCA.
To compare TPM values across milk and GTEx samples (Fig. 1B), we downloaded gene-level TPM values from the GTEx portal (GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz). We filtered to include only female GTEx samples and filtered to protein-coding genes (as annotated in EnsDb.Hsapiens.v86) and removed histone genes. We then rescaled the TPM for each GTEx and milk sample to again sum to 1 million and calculated each gene's median TPM across a tissue type.

Correlations between milk gene expression and maternal/infant traits
We used edgeR 76 to test for correlations between milk gene expression and maternal/milk traits, including all tested traits and technical covariates. Included traits were: Milk CRP concentration, milk glucose concentration, milk IL-6 protein concentration, milk insulin concentration, milk leptin concentration, milk volume expressed, milk fat %, gestational weight gain, milk lactose %, maternal pre-pregnancy BMI, maternal age, and parity (Table S2). We scaled each trait to a mean of zero and standard deviation of one, except binary traits and parity, for which we use the integer number of previous births. The milk RNA-seq count matrix was filtered to include samples with measurements for all tested traits, leaving 158 samples. The count matrix and metadata were loaded into an edgeR object and the "filterByExpr" was used to remove lowly expressed genes, leaving 12,584 genes. We then used the 'estimateDisp' function on a design matrix regressing gene expression across all 12 traits. This model accounted for potential confounding technical effects, including sample mass RNA and sample RIN, by including them as covariates. We then used 'glmQLFit' to fit a quasi-likelihood negative binomial generalized log-linear model to the count matrix and design model, and 'glmQLFTest' to perform a quasi-likelihood F-test of each gene against each tested trait. We used Benjamin-Hochberg correction of P-values across all 12,584 genes x 12 traits.
We tested for gene ontology enrichment of significant genes (q-value<10%) for each trait using the R package topGO 77 , with all tested genes as the background gene list. We used the 'resultFisher' function to run a classic Fisher's exact test for each gene ontology, and used a Benjamini-Hochberg correction 78 for all ontologies (N=14,226) across the 9 traits with significant genes (128,034 tests). We report pathways with q-value<10%, fewer than 500 annotated genes, and an overlap of more than 5 genes with the significantly associated gene list for each trait (Table S4). 6 traits (milk glucose, milk IL6, milk insulin, milk volume expressed, milk lactose, and parity) had enriched ontologies that met these criteria.

Examination of PER2 expression and milk traits
Circadian rhythm genes were defined as those in KEGG pathway 'hsa04710'. To test if the time of day of the milk sample collection study visit explained the relationship between PER2 expression and expressed milk volume, we transformed the time of the study visit into a quantitative variable with the R package 'lubridate' 79 . PER2 expression values from a variance-stabilizing transformation of the sample-by-gene count matrix in DESeq2 80 were used, including sample RNA mass and RIN as covariates. Regression models were calculated with 'lm' in R. Study time of day was correlated with PER2 expression in a simple linear regression (B=0.07, P=0.04), but not with milk volume expression (B=-0.07, P=0.3) We then ran the following linear models: (A) milk volume ~PER2 expression (B) milk volume ~PER2 expression + time of study visit The linear models were compared with the 'anova' command in R, with a non-significant P-value suggesting that adding the time of study visit variable did not provide a better fit to the data. The same analysis was performed with normalized milk fat percentage replacing milk volume to assess the relationship between milk fat, PER2 expression, and time of study visit.

Deconvolution of bulk transcriptomes with Bisque
Raw gene counts (MIT_Milk_Study_Raw_counts.txt.gz) and metadata (MIT_milk_study_metadata.csv.gz) were downloaded for the Nyquist et al. study 17 from the Broad Insitute Single Cell Portal (https://singlecell.broadinstitute.org/single_cell/study/SCP1671/cellular-and-transcriptional-diversity-ove r-the-course-of-human-lactation) on 6/3/2022. Count data was filtered to keep just one sample per participant, requiring samples to have been collected >14 days and <3 months postpartum, leaving 10 samples. The count matrix and associated metadata was then put into a the Bioconductor 'ExpressionSet' object format, combining the two macrophage cell type annotations from Nyquist et al. into one cell type called just "Macrophage" and resulting in 8 cell type annotations. The milk gene-level count data was then loaded into an ExpressionSet object and Cell type deconvolution was run with the R package "BisqueRNA" and the function 'ReferenceBasedDecomposition', with parameters "markers=NULL" and "use.overlap=F". Bisque 30 used 27,889 genes present in both the bulk and single-cell expression sets.

Milk eQTL analysis
Gene-level quantifications were filtered for the 206 individuals with RNA-seq and genotype data. Genes were filtered to retain those with ≥ 6 counts and and TPM > 0.1 in at least 20% of samples, leaving 17,351 genes of the original 44,747. TPM quantifications were then rank-normalized with the 'RankNorm' function in R package RNOmni 81 , and gene coordinates were added using annotations from R package 'EnsDb.Hsapiens.v86'. Genes without coordinate annotations, mitochondrial, and Y chromosome genes were removed, leaving 16,999 genes used in eQTL analyses.
The APEX toolkit was used for cis-eQTL analysis (https://corbinq.github.io/apex/doc/) 82 . First, 50 latent factors from the gene expression matrix were calculated using command 'apex factor' with 10 iterations. cis eQTL analysis was run with the command 'apex cis' with 5 genetic PCs and 25 gene expression latent factors as covariates, using the LMM model with a genetic relatedness matrix calculated as above in PLINK, and with distance to start site weighting for eGene p-values (ACAT-dTSS). SNPs with minor allele frequency >1%, missing genotype information <5%, and within 1Mb of the gene transcription start site were included. The command used was as follows: apex cis --bcf [genotypes bcf file] --bed [gene expression bed file] --cov [genetic PCs + gene expr. LFs covariate file] --grm [genetic relatedness matrix] --prefix [output file prefix] --long --dtss-weight 0.00001 APEX uses an aggregated Cauchy association test to calculate a gene-level P-value, and can use the distance to TSS weighting to improve discovery power (parameter '--dtss-weight' in the command above). eGene P-values were adjusted for multiple tests using a Benjamini-Hochberg correction 78 .
To compare the fraction of milk colocalized eQTLs across GTEx tissues, we needed to take into account that the number of eQTLs for a given tissue is a linear function of the sample size 84 , and that eQTLs discovered in a smaller sample size are more likely to be those shared across tissues 85 . We observed a clear linear relationship between the number of eGenes included in our colocalization analysis for each GTEx tissue with the fraction of 'colocalized' eQTLs with milk ( Fig. S6), and so we used the residual fraction of eQTLs colocalized in this regression to compare across tissues (i.e. in Fig.  2E). Binomial confidence intervals for this fraction were estimated with a normal approximation. eQTLs were denoted as milk-specific if either of these criteria were met: (1) there were no GTEx tissues with a significant eQTL for the gene (q-value<5%), or (2) there were no tissues with an eQTL that colocalized with milk, and at least 75% of tested tissues' eQTLs were categorized as not-colocalized. Enrichment analysis of genes with milk-specific eQTLs or tissue-shared eQTLs was performed with the enrichR R package 86 with reference database "GO_Biological_Process_2021". Significantly enriched pathways (those with a reported 'Adjusted.P.value'<0.1) were filtered to retain only pathways whose gene overlap with the tested gene set did not intersect any other enriched pathways' overlapping genes, retaining the pathways with the smallest P-values. This approach yielded 2 enriched pathways for each gene set (Fig. 2D).
Overlap between milk eGenes and dairy cattle QTL Cattle gene coordinates for ARS_UCD1.2 genome were downloaded from https://bovinegenome.elsiklab.missouri.edu/downloads/ARS-UCD1.2, filtered for mRNAs, and for each gene with multiple entries the entry with the largest region was retained. Dairy cattle QTL were downloaded from the animalQTLdb (https://www.animalgenome.org/cgi-bin/QTLdb/index) by selecting "All data by bp (on ARS_UCD1.2 in bed format)".
For each of 4 milk-related traits, we selected QTL with the following trait labels: milk yield (Milk yield, 305-day milk yield, Average daily milk yield), milk somatic cell count (Somatic cell score, Somatic cell count), milk protein (Milk protein percentage, Milk protein yield, Milk protein content), and milk fat (Milk fat percentage, Milk fat yield, Milk fat content). To identify a smaller list of genes identified in QTL from multiple studies, as some of these traits' QTL overlapped thousands of genes, we identified genes that overlapped at least 1 QTL for all 4 dairy cattle milk traits (N=1,035 genes, Table S7).
To test for enrichment of milk-specific vs. tissue-shared eQTL genes in these lists, we filtered genes for those that were called as milk-specific vs. shared and that were present in the dairy cattle genome annotation above (N=837 genes). We performed a two-sided Fisher's exact test where the 2x2 contingency table axes were: (A) milk-specific vs. tissue-shared eGenes (from our human milk eQTL analysis), and (B) cattle QTL overlapping genes vs. cattle QTL nonoverlapping (from the gene lists identified above), using the 'fisher.test' command in R.
Colocalization of milk eQTLs with breast cancer GWAS summary statistics GWAS summary statistics from Zhang et al. 2020 40 (icogs_onco_gwas_meta_overall_breast_cancer_summary_level_statistics.txt.gz) were downloaded from the BCAC website (https://bcac.ccge.medschl.cam.ac.uk/bcacdata/oncoarray/oncoarray-and-combined-summary-result/g was-summary-associations-breast-cancer-risk-2020/). Coordinates were converted to hg38 with LiftOver, and the meta-analysis summary statistics for all breast cancers were used (column names 'Beta.Meta', 'p.meta', etc.). For each milk eGene, colocalization was performed if there was a breast cancer GWAS hit of P< 5x10 -8 within the eQTL window (within 1 Mb of gene TSS). Breast cancer GWAS and milk eQTL summary statistics were filtered to variants within 500 kb of the smallest milk eQTL P-value, and statistics harmonized so the reference/alternative alleles matched, then colocalization was run with the command 'coloc.abf' using the coloc R package 83 .
We used edgeR to test for correlations between milk gene expression and HMO concentrations. The count matrix and metadata were loaded into an edgeR object and "filterByExpr" was used to remove lowly expressed genes, leaving 12,471 genes. For each HMO, we then used the 'estimateDisp' function on a design matrix regressing gene expression across HMO concentration, sample mass RNA, and sample RIN. We then used 'glmQLFit' to fit a quasi-likelihood negative binomial generalized log-linear model to the count matrix and design model, and 'glmQLFTest' to perform a quasi-likelihood F-test of each gene against each tested HMO. We used Benjamin-Hochberg 78 correction of P-values across all HMO-gene pairs. We tested for gene ontology enrichment of significant genes (q-value<10%) for each trait using the R package topGO, with all tested genes as the background gene list. We used the 'resultFisher' function to run a Fisher's exact test for each gene ontology, and used a Benjamini-Hochberg correction 78 for all ontologies (N=14,178) across the 6 HMOs/HMO categories with at least 10 significant genes (DSLNT, LSTb, LSTc, 3'SL, sialylated sum, all HMOs sum). We report pathways enriched with q-value<10% for each HMO/HMO category (Table S11).
Genetic associations at milk eQTLs with milk oligosaccharides The list of candidate genes to test for effects of milk eQTLs on HMO concentrations was downloaded from Supplementary Dataset 2 in Kellman et al, 2022 42 . From this gene list, we identified 8 genes with significant eQTLs in our dataset (q-value<10%). To test for genetic associations between the lead SNP (smallest P-value) at milk eQTLs and HMO concentrations using the rank-normalized HMO concentrations described above. For 2'FL and DFLac, which were absent in non-secretors (Fig. S7), we rank-normalized the concentrations within secretors and scaled concentrations in non-secretors to have mean -3 and s.d. 0.1, to avoid introducing variation that did not exist in non-secretors. We used 'glm' in R to fit a model with HMO concentrations as the outcome, including genotype, secretor status and the first five genetic PCs as covariates: HMO ~ genotype + secretorStatus + PC1 + PC2 + PC3 + PC4 + PC5 For models of HMOs vs. FUT2 eQTL genotype, we excluded the secretor status term.
To test for potential causal effects between milk gene expression and HMO concentrations, we used a Wald Ratio, which estimates the causal effect between an exposure (milk gene expression) and outcome (HMO concentration) by dividing a single genetic variant's effect on outcome by the genetic effect on the exposure 87 . We used the Wald Ratio rather than a two-stage least squares approach because we had a much larger sample size for our eQTL analysis (N=206) than HMOs (N=41), and we wanted to leverage the eQTL effect estimates from this larger sample size. Two-stage least squares, which is used in a one-sample study design such as ours, would require the same participants to have both exposure and outcome data, limiting the sample size for estimates of genetic variant effects on gene expression to N=41.

Processing of infant fecal metagenomes
Infant fecal collection and storage, and metagenomic DNA extraction were described previously 63 . Briefly, feces were collected from diapers either during study visits and frozen at -80°C immediately, or collected at home, stored in 2 ml cryovials with 600 µl RNALater (Ambion/Invitrogen, Carlsbad, CA), and stored at -80°C after shipping to the lab at the University of Minnesota. DNA was extracted using the PowerSoil kit (QIAGEN, Germantown, MD), eluted with 100 µl of the provided elution solution, and stored in microfuge tubes at -80°C.
Extracted DNA was used to construct libraries for metagenomics sequencing using the Illumina Nextera XT ¼ kit (Illumina, San Diego, CA, United States). Metagenomics libraries were then sequenced on an Illumina NovaSeq system (Illumina, San Diego, CA) using the S4 flow cell with the 2x150 bp paired end V4 chemistry kit by the University of Minnesota Genomics Center, achieving a sequencing depth of 4.5 million reads per sample.
Microbial taxon abundances were generated by first processing metagenomic fastq files with Shi7 version 1.0.1 (Al-Ghalith, et al., 2018), which learns optimal quality control parameters from the data. Sequences were then trimmed, filtered by quality scores, and stitched per the learned parameters in Shi7. Sequences from all samples were multiplexed into a single fasta file for downstream processing. Processed sequences were aligned to reference databases using BURST version 0.99.7f, (Al-Ghalith and Knights, 2017) using a reference genome database generated from GTDB r95 (https://gtdb.ecogenomic.org/stats/r95). A 95% identity cutoff and forward/reverse complement flag were used. Resulting .b6 files were converted to reference and taxonomy tables using embalmulate (Al-Ghalith and Knights, 2017) with 'GGtrim' activated. To generate microbial pathway abundances, metagenomic sequences were run through the MetaPhlAn version 3.0.7 pipeline, with BowTie2 version 2.4.2 64-bit, DIAMOND version 0.9.24, and MinPath version 1.5.
To generate the PCA of infant metagenomes in Fig. 4A, data were filtered to include only taxa with relative abundance >0.001 in at least 10% of 1-month or 6-month samples. A centered log-ratio transformation was performed on the relative abundances of each sample, and principal components were calculated with the 'prcomp' command in R.

Sparse CCA of human milk transcriptomes and infant fecal metagenomes
Input datasets were prepared as follows: Milk gene expression: To prepare gene expression data for this analysis, the sample-by-gene count matrix was loaded into DESeq2 80 , filtered to keep only protein-coding genes with count>0 in at least half the participants (13,157 genes), and transformed using the variance stabilizing transformation. After this transformation, the variance of each gene was calculated across all samples and genes in the lowest 25% variance were removed, leaving 9,868 genes.

Infant fecal metagenomes:
Taxon abundances and pathway abundances from 1-month infant fecal samples were processed separately. The taxon relative abundance matrix was filtered to retain species-level taxa only, keeping only 96 species with a relative abundance >1x10 -3 in at least 10% of samples. A centered log-ratio transformation was then performed on each sample's relative abundances. For microbial pathways, species-specific and unclassified pathways were removed, leaving 145 pathways. The species and pathway level information was then combined into one matrix.
Each dataset was filtered for the 108 individuals with both 1-month infant fecal metagenomes and 1-month milk gene expression. Sparse canonical correlation analysis (sparse CCA), to identify sparse components maximizing correlation between the milk gene expression and infant fecal metagenome datasets, and enrichment analyses of genes in each sparse component, were performed as previously described 88 , using k=15 components. Code was downloaded from https://github.com/blekhmanlab/host_gene_microbiome_interactions. Significance of the sparse components was calculated with leave-one-out cross-validation, and 10 components were retained at Benjamini-Hochberg q-value<10%. Pruning significant components whose scores across mother-infant pairs were correlated at Pearson's r>0.75 left 6 remaining sparse components (Fig. 4B).
To generate network interaction plots between milk-expressed genes and infant fecal microbes identified in the sparse CCA analysis, for each significantly enriched pathway (q-value<10%) in a component, we (1) filtered for overlapping genes between the component and pathway; (2) generated a pairwise correlation matrix of mother-infant pairs' trait values for those genes, the top 3 microbiome traits in the component with positive weights, and top 3 microbiome traits with negative weights; (3) pruned for correlations with Pearson's r>0.3; (4) generated a network plot from the pairwise correlation matrix using the 'ggnetwork' package in R 89 .
B. infantis growth rates were estimated using Compute PTR (CoPTR) 90 . We aligned the infant gut metagenomic shotgun reads to the B. infantis ATCC 15697 reference genome, downloaded from NCBI, using bowtie2 v2.2.4 91 . We then used CoPTR to get coverage information for each mapped sample, filtering for samples with at least 75% coverage and at least 3000 mapped reads to the B. infantis genome. For samples that passed these filters, CoPTR was used to estimate the peak-to-trough ratio (PTR) from the coverage information, an estimate of the bacterial growth rate.

Associations between maternal genotype at eQTLs and HMOs or the infant fecal microbiome
For each milk-specific eQTL, a tag SNP was defined as that with the smallest P-value, requiring there to be at least 10 individuals in the smallest diploid genotype class. 1-and 6-month infant microbiome taxa abundances and pathway abundances were processed as described above for sparse CCA. Microbiome data was retained for mother-infant pairs with maternal genotype data and 1-or 6-month metagenomic data, leaving N=105 mother-infant pairs. Within each time point and microbial taxon or pathway dataset, microbiome traits were pruned by removing one trait from trait pairs with the highest Pearson correlation until no pairs remained with r>0.75. As we were hoping to detect genetic effects from gene expression in milk, we regressed microbiome traits against infant delivery mode (cesarean vs. vaginal), and for 6-month traits whether the mother-infant pair was exclusively breastfeeding, and if complementary foods were introduced. At 1 month postpartum all pairs were exclusively breastfeeding with no complementary foods. We also regressed microbiome traits against maternal 10 genetic PCs, using the residuals of the following regression models using 'glm' in R: We then performed genetic association testing of maternal genotype at each milk-specific eQTL tag SNP against each infant microbiome trait, using a linear mixed model Wald test in GEMMA v0.98.5 92 . There were 232 unique eQTL tag SNPs tested against 313 microbiome traits (46 1-month taxa, 103 1-month pathways, 58 6-month taxa, 104 6-month pathways) (Table S16). Figure S1 Distribution of median pairwise correlations of TPM counts for each sample against all other samples (y-axis), and the number of genes with non-zero counts per sample in RNA-seq data. The red dot indicates an outlier sample that was removed from further analysis.

Figure S2
Distributions of genetic ancestry estimates for individuals included in the eQTL analysis. Within each panel, representing a continental ancestry group, is a histogram displaying the distribution of estimated ancestry proportions for that group for all samples. e.g. all samples have an estimated European ancestry proportion >0.4, with the majority ~1; while no samples have estimated African ancestry proportion > 0.3.

Figure S3
Distribution of discordance between genotypes estimated from RNA and DNA samples. Each dot represents a single sample ID, with the x-axis showing the genotype discord between genotype calls between the RNA and DNA samples with the same sample ID. The y-axis is the minimum discord between that sample ID's RNA sample and any DNA sample. All points are above the x=y line (dashed line), showing that the DNA sample with the matching sample ID always had the most similar genotype calls for each RNA sample, and that there were no sample label mix-ups.

Figure S4
Principal component analysis of transcriptomes from a subset of GTEx tissues and milk. PCs were calculated using the 1000 most variable genes within GTEx, then milk samples were projected onto the GTEx samples.

Figure S5
Spearman correlations among the 12 maternal/milk traits tested for relationships with milk gene expression. An "X" signifies P>0.05, i.e. not significant. Figure S6 (A) (B) (A) For each GTEx tissue, the figure shows the number of genes with eQTLs in the tissue and milk (x-axis), vs. the fraction of colocalized eQTLs between milk and the given tissue (y-axis). The dashed line illustrates a significant linear regression slope (P = 2.7x10 -9 ). (B) Sharing of eQTLs between milk and 45 GTEx tissues, measured through statistical colocalization. Each bar shows each tissue's similarity to milk, measured by residual fraction of eQTLs colocalized with milk, after regressing out tissue sample size. Error bars represent a 95% confidence interval.

Figure S10
Correlation between normalized GCTN3 expression and normalized FLNH concentration. Secretors are shown in orange and non-secretors in light green. To visualize the positive correlation in both secretors and non-secretors, regression lines are shown for secretors and non-secretors separately, but the relationship was assessed using all individuals with secretor status as a covariate in edgeR as described in Materials and Methods. Beta= 0.67, P=1.26x10 -6 , q-value=0.04. There was a significant difference in PC1 score between the two delivery mode groups, using a linear mixed effects model with sample time point and delivery mode as fixed effects and subject ID as a random effect (delivery mode: est. = -3.2, P = 5.1x10 -3 ).            Starting from a list of 54 candidate glycosyltransferase genes 42 , eight genes had for 8 of these genes with significant milk eQTLs in our data. gene_name: gene name; gene: Ensembl gene ID; egene_pval: eQTL gene-level P-value if gene was included in eQTL analysis; FDR: eQTL gene-level q-value if gene was included in eQTL analysis; median_tpm: median transcript per million of gene in our milk transcriptomes, NA if not included in eQTL analysis; chrom: chromosome of eQTL tag SNP; pos: base position of eQTL tag SNP; ID: SNP ID of eQTL tag SNP; ref: reference allele of eQTL tag SNP; alt: alternative allele of eQTL tag SNP; beta: estimated SNP effect on gene expression; se: standard error of SNP effect on gene expression; pval: P-value of SNP effect on gene expression.  Table S15

Supplemental table descriptions
Pairwise Pearson correlations between expression levels of JAK/STAT pathway genes in milk, infant fecal B. infantis growth rate and relative abundance, and milk IL-6 concentration. t1: trait 1; t2: trait 2; r: correlation coefficient; P: correlation P-value; N: sample size of correlation estimate; FDR: Benjamini-Hochberg q-value.