Phenotyping of Klf14 mouse white adipose tissue enabled by whole slide segmentation with deep neural networks

White adipose tissue (WAT) plays a central role in metabolism, and multiple diseases and genetic mutations cause its remodeling, most notably obesity, which has reached pandemic levels. WAT is present in subcutaneous (SAT) and visceral (VAT) depots, and its main components are white adipocytes. Quantitative analysis of white adipocyte size and counts is of great interest for understanding physiology and disease, due to intra- and inter-depot heterogeneity, as well as the better prognosis of hyperplasia compared to hypertrophy, and of SAT expansion compared to VAT expansion. H&E histology of whole depot cross-sections provides an excellent approximation of cell morphology and preserves spatial information. Previous studies have been limited to window subsampling of whole slides and to cell size analysis. In this paper, we present a deep learning pipeline that can segment overlapping white adipocytes on whole slides and filter out other cells. We also propose a statistical framework based on linear models to study WAT phenotypes at three interconnected levels (body weight BW, depot weight DW and quartile cell area). Applying it to find Klf14 phenotypes in mice using 147 whole slides of WAT H&E histology, we show sexual dimorphism, and different effects between depots, heterozygous parent of origin for the KO allele and genotype (WT vs. Het). In particular, we assess whether variables are correlated (DW vs. BW and cell area vs. DW), and test for statistical differences between fitted linear models. We also find significant differences between hand-traced or window subsampled datasets and whole slide analysis. Finally, we provide heatmaps of cell size for all the slides, showing substantial spatial heterogeneity and local spatial correlations.


Introduction
White adipose tissue (WAT) provides the body's main long-term energy storage and plays a central role in metabolism. It is distributed in subcutaneous (SAT) and visceral (VAT) depots, with SAT depots located abdominally, gluteofemorally, intramuscularly and in the upper body. The laboratory mouse is the leading model organism for the study of human disease, due to 99% of its genes having human orthologs, a wide catalogue of inbred strains and mutant models, and ease of breeding 25 . Depending on the strain, 16 week old mice on regular diets vary from 16.4% to 43.1% fat percentage in males and from 16.7% to 43.5% in females 26 . For the C57BL/6NTac strain that we use in this work, fat percentage varies from 29.0% in males to 23.2% in females on control diets, and from 43.3% in males on a 12 week high fat diet (HFD) to 31.6% in females on a 2 week HFD 27 . As in humans, mice present inter-depot heterogeneity, with visceral gonadal rather than SAT or visceral mesenteric expansion being associated with metabolic disorders 28 . Common mouse models to study obesity include numerous genetic models, such as the ob/ob, db/db, POMC, MC4 knockout, ectopic agouti and AgRP, FABP4-Wnt10b, LXRβ−/−, Sfrp5−/− and Timp−/− models, as well as surgical/chemical models and diet induced obesity 2,29 . Other models to study WAT phenotypes are R6/2 and CAG140 for Huntington's disease 30 and CRH for Cushing's syndrome 31 . In this paper, as an exemplar we focus on the C57BL/6NTac Klf14 knockout model (Klf14 tm1(KOMP)Vlcg ) previously developed by Small et al. 13 . The KLF14 (Krüppel-like family 14) transcription factor is associated with metabolic syndrome and regulates gene expression in adipose tissue. This single exon gene is imprinted and only expressed from the maternally inherited allele 32 , and is expressed more highly in females than males 22 . 
Homozygous females with the KLF14 risk allele had significantly larger SAT white adipocytes in the Oxford Biobank (BMI-matched) 13 and GTEx (unmatched) 11 datasets, compared to non-risk homozygous females. The ENDOX dataset showed the same trend, but no statistical significance, possibly due to its smaller sample size 11 . By contrast, VAT did not show significant adipocyte size differences in the GTEx 11 dataset. Therefore, for human or animal model analysis, unbiased, high-throughput, global WAT quantitative analysis is of great interest. Adipocytes can be collagenase-isolated, counted and measured with a series of methods: microscope 33,34 , hemocytometer 35 , Coulter counter 36 and flow cytometry 37 . However, those approaches consume the tissue and discard its spatial information, whereas whole slide imaging using fluorescent or conventional brightfield microscopy preserves tissue architecture within the sections. Given a sufficiently large field of view, ideally surveying the entire depot area, 2D assessment provides an excellent approximation of cell morphology. Hence our work focuses on the analysis of digitised Hematoxylin and Eosin (H&E) stained tissue sections.
In the first part of this work, we tackle adipocyte segmentation. Manual segmentation of adipocytes is highly accurate 6 , but also labour intensive and slow. Larger scale studies require semi- or fully automatic instance segmentation methods that need to be robust against preanalytical variation in histopathology (e.g. staining variability, tissue deformations, tears) and imaging artefacts (e.g. out of focus regions, bubbles). Early approaches were either based on hand-tailored features combined in ad hoc pipelines, including colour conversion, median filters, mathematical morphology operators, thresholding and watershed algorithms [38][39][40][41][42] ; or based on training a pixel classifier with feature vectors extracted by a set of predetermined general-use filters 43 .
Advances in deep learning have revolutionised biomedical image analysis, for instance cell detection and segmentation 44,45 . In addition, these methods provide new ways for phenotyping specific cell types 46 . DeepCell 47,48 replaces predetermined filters 43 by deep convolutional neural networks (DCNNs) that learn optimal feature extraction for the target cell and microscopy modality, although its fully connected layers restrict input images to a pre-arranged size. To overcome this, Adipocyte U-Net 11 uses a Fully Convolutional Network (FCN) 49 , so that images of variable size can be processed, up to the GPU memory limit. DCNNs successfully tackle stain variability and other colour variations using their generalisation capabilities, transfer learning, data augmentation (geometric transformations and colour) or a combination thereof. Even so, a DCNN-based whole H&E slide WAT segmentation pipeline needs to address several design decisions that we briefly review next.
Segmentation by pixel classification (typically as background, cell boundary or cell interior) is widely used 11,46,47 , but the results may need to be regularised, for example with an active contour 47 . In addition, segmentation results tend to be worse where membranes touch, have less definition or are damaged. Segmentation results have been shown to improve by replacing pixel classification with regression of the Euclidean distance transform (EDT) 50,51 , i.e. the distance of each pixel to the closest membrane point (see Supplementary material for insight into pixel classification vs. EDT). Then, watershed seeds can be computed with peak detection 51 or with a DCNN trained as an object detector on the EDT 50 , although neither can differentiate between adipocytes and other objects in the EDT. In this paper, we propose an EDT DCNN followed by a contour detector DCNN and watershed for full segmentation of H&E images.
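The EDT-plus-watershed stage described above can be sketched as follows; here `edt` stands for the distance map regressed by the first DCNN and `contour` for the contour detector's output, and the function name, seed threshold and combination of the two maps are our illustrative choices, not the pipeline's actual implementation.

```python
# Sketch of watershed segmentation from an estimated EDT and contour map.
# `edt` and `contour` are same-size float arrays; names and the 0.8 seed
# threshold are illustrative assumptions.
import numpy as np
from scipy import ndimage
from skimage.segmentation import watershed

def segment_from_edt(edt, contour, seed_threshold=0.8):
    # Seeds: connected components where the EDT is high (cell centres)
    seeds, n_seeds = ndimage.label(edt > seed_threshold * edt.max())
    # Flood from the seeds over the inverted EDT; adding the contour map
    # deepens the troughs between touching cells
    labels = watershed(-edt + contour, markers=seeds, mask=edt > 0)
    return labels
```

In practice, the watershed seeds would come from the trough structure found by the Contour CNN rather than a fixed threshold.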
The aforementioned segmentation approaches do not tackle white adipocyte overlap (see Fig. OVERLAP). Cell overlap has been tackled with approaches ranging from a physics model of light attenuation through the cytoplasm 52 to ISOODL 53 , which rotates the plane of each cell in 3D space so that cells do not overlap; however, these increase the problem's complexity. A more general solution uses a Single Shot MultiBox Detector followed by individual cell segmentation with a U-Net 46 , but this approach does not suit our whole-tile segmentation. Instead, we propose a DCNN based on QualityNet1 (Huang, Wu, and Meng 2016) that corrects each segmented object to account for cell overlap.
Efficient computation requires pre-segmenting tissue areas to avoid processing large areas of empty background 54,55 ; we tackle this problem with traditional image processing techniques. In addition, tissue areas are typically too large at full resolution to fit in random-access or GPU memory, and need to be tiled and the results stitched together. Uniform tiling is commonly used 11,55,56 as it is simple to implement, but the tile overlap needed to avoid dropping cells on the tile seams produces redundant computations; in this work, we propose an adaptive tiling algorithm to reduce that burden. Furthermore, it is necessary to differentiate between the cells of interest (mature white adipocytes) and other WAT components and image artefacts. For this, tiles can be accepted or rejected as a whole by an InceptionV3 network 11 . This, however, has poor granularity and favours areas away from tissue edges and where other components are less prevalent. For full granularity, one can first detect valid cells and then segment them 46 , or, as we do in this paper, first compute a whole-tile segmentation and then classify each object as a valid or invalid white adipocyte.
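A coarse tissue mask of the kind mentioned above can be sketched with traditional image processing on a downsampled slide; the brightness threshold, structuring element size and minimum component area below are illustrative assumptions, not the pipeline's tuned values.

```python
# Minimal coarse tissue mask on a downsampled RGB slide (e.g. 16x smaller).
# Thresholds and sizes are illustrative assumptions.
import numpy as np
from scipy import ndimage

def coarse_tissue_mask(rgb_small, bg_brightness=220, min_area_px=500):
    grey = rgb_small.mean(axis=2)            # simple luminance proxy
    mask = grey < bg_brightness              # tissue is darker than background
    mask = ndimage.binary_closing(mask, structure=np.ones((5, 5)))
    mask = ndimage.binary_fill_holes(mask)   # fill adipocyte interiors etc.
    # remove small speckle components
    labels, n = ndimage.label(mask)
    areas = ndimage.sum(mask, labels, index=np.arange(1, n + 1))
    keep = np.flatnonzero(areas >= min_area_px) + 1
    return np.isin(labels, keep)
```

Because the mask is computed at low resolution, its cost is negligible compared to segmentation of the tissue tiles.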
A training dataset for training/validating DCNNs can be built by cropping small histology windows, labelling them as valid/invalid for processing, and/or manually hand tracing the cells they contain. Previous multiple-cells-per-window approaches 11,47,50 require that all training pixels be labelled as background / membrane / cell interior, because they all contribute to the network's loss function. In practice, however, windows often contain ambiguous pixels, due to damaged, overlapping or unclear membranes. Furthermore, windows with non-adipocytes must be excluded, or those objects need to be labelled as background or as a new class, which is laborious if they present intricate boundaries. Alternatively, one-cell-per-window approaches 46 can be trained granularly, as each training image contains a single cell. However, this introduces redundant computation, as each training image must allocate space around the cell to provide spatial context. In this paper we adopt a compromise: a multiple-cells-per-window approach that can leave pixels unlabelled.
We address the challenges above to propose a whole slide white adipocyte segmentation pipeline called DeepCytometer. The challenges include: 1) whole slide processing of all tissue, ignoring the background; 2) tiling overlap compromise between segmenting all cells and reducing redundant computations; 3) colour variability in the slide; 4) cell segmentation considering that white adipocytes present as mostly background surrounded by a thin membrane, and that membranes can touch, overlap, be damaged or have poor definition; 5) differentiate between adipocytes and other WAT components, as well as image artefacts; 6) choice of training scheme, e.g. one-window multiple-cells vs. one-window one-cell, full vs. partial segmentation. We validate the segmentation results with summary statistics (Dice coefficient and relative area error) as in previous literature, but also propose to examine segmentation errors as a function of cell size, to assess whether all subpopulations are equally well segmented. We integrate DeepCytometer with the open source web-based application AIDA (github.com/alanaberdeen/AIDA) to navigate whole slides with the segmentation results.
In the second part of this paper, we phenotype Klf14 tm1(KOMP)Vlcg C57BL/6NTac mice WAT, analysing DeepCytometer segmentations of 147 whole slides of H&E histology. We extend previous approaches that quantify median cell area from BMI-matched subjects 13 or mean area 11 , and propose a phenotype framework with three interconnected linear model levels (body weight, depot weight and quartile cell area). Finally, we provide heat maps of cell area in whole slides, for qualitative assessment of spatial heterogeneity of subpopulations.
We provide all the code for the pipeline and experiments, and trained weights (github.com/MRC-Harwell/cytometer). We also provide the histology images, hand-traced and pipeline segmentations (TODO: upload to zenodo.org).

Results
In this section we present the DeepCytometer pipeline and its validation. Then, we present an analysis of WAT from Klf14 tm1(KOMP)Vlcg C57BL/6NTac mice, using tissue samples and additional data generated as part of the Small et al. 2018 study 13 . This analysis is based on cell areas derived from automatic segmentations from our pipeline and body and depot weight. It should be noted that the single exon Klf14 gene is imprinted and only expressed from the maternally inherited allele 32 . However, our current preliminary analysis of this dataset does not take into account this Klf14 monoallelic expression, which we will include in our next round of analysis.
A main contribution of this paper is our DeepCytometer pipeline to segment white adipocytes from whole H&E histology slides. The pipeline ( Fig. PIPE(a)) performs a coarse segmentation of the tissue, uses an adaptive tiling algorithm to select image blocks that fit in GPU memory, performs colour correction and feeds the blocks to a white adipocyte segmentation sub-pipeline ( Fig. PIPE(b)) based on DCNNs. The sub-pipeline works by segmenting all objects in the image block, then classifying which ones are white adipocytes, and correcting their outlines to account for cell overlap. Pixels that belong to cropped cells on the edges are flagged to be processed in neighbouring image blocks. A detailed description of slide preprocessing can be found in Methods -Coarse tissue mask and adaptive tiling for full processing of whole slides. The segmentation sub-pipeline is described in Methods -Segmentation sub-pipeline, with details of its constituent DCNNs in Methods -Deep CNN architectures.
In this subsection we present three groups of experiments: "Adaptive tiling computational load reduction" quantifies the reduction in the number of pixels that need to be processed with our adaptive tiling algorithm compared to uniform tiling. "Tissue CNN validation" shows that the segmentation sub-pipeline correctly identifies white adipocytes and rejects other types of elements in the histology image with high sensitivity and specificity. "Segmentation validation" shows that the segmentation sub-pipeline outlines white adipocytes with low area errors, such that it provides a good representation of the adipocyte population in the slide.

Adaptive tiling computational load reduction
We measured the reduction of computational load provided by our adaptive tiling compared to uniform tiling of the tissue region with overlapping blocks, by comparing the number of pixels each approach needs to process. Uniform tiling was produced by splitting the image into (L max , L max ) square blocks, where L max =2,751 pixels is the maximum tile length allowed in our adaptive algorithm. Blocks overlapped on each side by a margin determined by the radius of the largest circular cell accepted by the pipeline and the maximum Effective Receptive Field of the CNNs. The sum of the areas of all the uniform blocks containing tissue in an image was compared to the sum of the areas of the adaptive blocks (Fig. ADAPTBLOCK). The average adaptive-to-uniform ratio over the 147 whole slides used in the phenotyping experiments was 0.86 ± 0.13, corresponding to a reduction of 16.59% in the number of processed pixels (and, correspondingly, processing time), from an average of 2.11 · 10 9 pixels (uniform) to 1.81 · 10 9 pixels (adaptive) per slide.

Tissue CNN validation
The Tissue CNN was validated on 126 training images using 10-fold cross validation. We calculated the receiver operating characteristic (ROC) curve for the classification of white adipocytes vs. "other" objects, weighted by the number of pixels in each object (experiment details in Methods -Segmentation sub-pipeline, curve in Fig. CLASS_ROC(a)). The classifier performs very well, with area under the curve = 99.59%, and pixel-wise false positive rate (FPR) = 1.80% and true positive rate (TPR) = 97.71% at the chosen white-adipocyte classification threshold. The low FPR means that cell population studies will contain a negligible number of false objects, and the high TPR indicates that the vast majority of white adipocytes will be detected in the slides. We also provide TPR and FPR values for other thresholds in Fig. CLASS_ROC(b).
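The pixel-weighted ROC described above can be sketched directly in NumPy; the per-object labels, scores and pixel counts below are illustrative, and the weighting simply replaces object counts with pixel counts in the TPR/FPR cumulative sums.

```python
# Pixel-weighted ROC curve for object classification. `y_true` marks white
# adipocytes (1) vs. "other" objects (0), `score` is the classifier output,
# and `n_pixels` the object sizes used as weights. Illustrative sketch.
import numpy as np

def weighted_roc(y_true, score, n_pixels):
    # sort objects by decreasing classifier score
    order = np.argsort(-score)
    y, w = y_true[order], n_pixels[order]
    tp = np.cumsum(w * y)          # pixel-weighted true positives
    fp = np.cumsum(w * (1 - y))    # pixel-weighted false positives
    tpr = np.r_[0.0, tp / tp[-1]]
    fpr = np.r_[0.0, fp / fp[-1]]
    # trapezoidal area under the curve
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)
    return fpr, tpr, auc
```

The weighting balances the contribution of small but numerous adipocytes against scarcer but larger "other" objects.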

Segmentation sub-pipeline validation
We validated the segmentation sub-pipeline on 55 hand-segmented images using 10-fold cross validation, both for the Auto and Corrected methods (experiment details in Methods -Segmentation sub-pipeline). We computed the Dice Coefficient (DC) between pairs of DeepCytometer and hand-traced (ht) cell contours, matched as described in Methods. For the Auto method we obtained DC Auto = 0.89 (median), 0.85 (mean), 0.10 (std), and for the Corrected method, DC Corrected = 0.91 (median), 0.87 (mean), 0.10 (std). We also computed the median relative errors, which were -10.19% (Auto) and 4.19% (Corrected) (Fig.  SEG_VALIDATION(b)). Both DC and relative error suggest that DeepCytometer segments white adipocytes with an acceptable area error, and that the Corrected method performs better than the Auto method.
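The two validation measures used above, the Dice coefficient and the relative area error, can be written compactly for a pair of boolean masks (pipeline segmentation vs. hand-traced contour); the function names are ours.

```python
# Dice coefficient and relative area error between a pipeline segmentation
# mask and a hand-traced (ht) mask, both boolean arrays of the same shape.
import numpy as np

def dice(a, b):
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def relative_area_error(seg, ht):
    # percentage error of the segmented area relative to the hand-traced area
    return 100.0 * (seg.sum() - ht.sum()) / ht.sum()
```

A negative relative error (as for the Auto method) indicates undersegmentation; a positive one (as for the Corrected method) indicates slight oversegmentation.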
Furthermore, as discussed in Methods, we went beyond summary statistics commonly found in the current literature to evaluate whether segmentation errors remain constant across the cell population. For this, we plotted Auto and Corrected relative errors vs. hand traced cell area, the rolling median and interquartile range curves and the global median relative error (Fig. SEG_VALIDATION(c)-(f)). The curves suggest that relative errors remain constant for Area ht ≥ 780 μm 2 , but shift towards more positive values for smaller cells. This would suggest less reliable phenotyping results for cells with Area ht < 780 μm 2 , which comprise the bottom 15.9% of the population. This highlights the need to report segmentation errors by cell size in future literature.
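The rolling median curves of relative error vs. hand traced area can be sketched as follows; the window size is an illustrative assumption, and interquartile curves would be obtained the same way with `np.percentile` in place of `np.median`.

```python
# Rolling median of relative segmentation error as a function of hand traced
# cell area, as in the error-vs-size validation plots. Window is illustrative.
import numpy as np

def rolling_median(area, rel_err, window=101):
    order = np.argsort(area)              # sort cells by area
    err = rel_err[order]
    half = window // 2
    med = np.array([np.median(err[max(0, i - half):i + half + 1])
                    for i in range(err.size)])
    return area[order], med
```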

Phenotype study of WAT using DeepCytometer segmentations
We broke down the study of Klf14 phenotypes into three interconnected levels: the mouse level (body weight, BW), the depot level (depot weight, DW) and the cell level (quartile cell area). The mice were stratified by sex, and we tested separately for genotype (WT or Het) and "parent" (heterozygous parent of origin for the KO allele: father, PAT or mother, MAT) effects as exploratory analysis (the number of mice did not allow us to include both variables and their interactions in a model). Body and depot weight were measured in the laboratory, and cell areas were computed both for hand traced and DeepCytometer segmentations. (Methodology details are provided in Methods, Phenotype framework for WAT).

Mouse level
We studied the sex effect on BW, as well as genotype and parent. (We also checked that cull age had no significant effect, see Suppl. Cull age effect on BW).

Sex effect on BW
To assess the effect of sex on BW, we fitted a Robust Linear Model (RLM) (BW~sex) to the PAT mice (n female =n male =18). We used all PAT mice instead of only WT as controls because they do not display Klf14 phenotypes. The RLM was preferred to an Ordinary Least Squares (OLS) model to moderate the leverage of a large male outlier. The model calculated a mean BW of 25.06 g for females and 38.30 g for males (males 52.82% larger, β=13.24 g, p=6.00e-20). This sexual dimorphism is larger than the genotype or parent effects that we found in the next sections at the BW, DW or cell level, so we stratify the data by sex in the rest of the study.
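A model of this form can be sketched with a statsmodels-style robust fit; the data frame below is synthetic and for illustration only (the study's actual records are described in Methods), and by default statsmodels' RLM uses Huber's M-estimator, which moderates outlier leverage as intended above.

```python
# Sketch of the mouse-level RLM (BW ~ sex), assuming statsmodels' formula
# API; the data below are synthetic, not the study's measurements.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'sex': ['f'] * 18 + ['m'] * 18,
    'BW': np.r_[rng.normal(25, 2, 18), rng.normal(38, 3, 18)],  # grams
})

model = smf.rlm('BW ~ sex', data=df).fit()
beta_sex = model.params['sex[T.m]']   # mean male-female BW difference (g)
```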

Genotype and parent effect on BW
Next we looked at the effects of genotype and parental inheritance on BW. We fitted separate OLS models (BW~genotype) and (BW~parent) to 76 mice stratified by sex (n female =n male =38). T-tests of the β genotype (Het) coefficients do not show any significant genotype effect in females (p=0.22) or males (p=0.79) (Fig. BW(a)). On the other hand, there is a highly significant parent effect for females (β parent =4.48 g, p=0.0061), where MATs are on average 4.48 g / 25.10 g = 17.86% larger than PATs (Fig. BW(c)).

Depot level
As discussed in the Introduction, SAT and VAT have different impacts on disease and phenotypes. In this section we study genotype or parent effects on DW of inguinal subcutaneous (for SAT) and gonadal (for VAT) depots adjusting for BW, as well as DW correlation with BW.

BW, genotype and parent effects on depot weight (DW)
We fitted OLS models (DW~genotype * BW/BW̄) and (DW~parent * BW/BW̄) to the same 76 mice stratified by sex and depot, where BW̄=33.44 g is the mean BW of all animals, used as a normalisation factor to lower the condition number of the linear model. We then used Likelihood ratio tests (LRTs) to compare those models to null models (DW~BW/BW̄) in order to test for genotype or parent effects. Even though the (DW~effect * BW/BW̄) models already provide one fitted line for the control group (WT or PAT) and another for the effect group (Het or MAT), this assumes similar data variance and overlap in both groups. To be free from that requirement, in order to assess correlation we fitted new models (DW~BW/BW̄) separately to the control and effect groups, and then used a t-test of their slopes to evaluate correlation between DW and BW. These individual linear models are shown in Tables DW_BW_RLM_GENOTYPE and DW_BW_RLM_PARENT. The p-values of the slopes from the 8 genotype or parent models were jointly corrected using Benjamini-Krieger-Yekutieli 57 .
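The LRT comparison of a full model against its null model can be sketched as follows; the data are synthetic, `BW_norm` is a hypothetical column standing for BW divided by the mean body weight, and the chi-squared reference distribution uses the difference in model degrees of freedom.

```python
# Likelihood ratio test of a genotype effect: full model (DW ~ genotype *
# BW_norm) vs. null model (DW ~ BW_norm). Synthetic data for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

rng = np.random.default_rng(1)
n = 40
df = pd.DataFrame({
    'genotype': ['WT'] * (n // 2) + ['Het'] * (n // 2),
    'BW': rng.normal(33, 4, n),
})
df['BW_norm'] = df['BW'] / df['BW'].mean()
df['DW'] = 0.3 + 1.2 * df['BW_norm'] + rng.normal(0, 0.1, n)

full = smf.ols('DW ~ genotype * BW_norm', data=df).fit()
null = smf.ols('DW ~ BW_norm', data=df).fit()
lr = 2 * (full.llf - null.llf)                       # likelihood ratio statistic
p = chi2.sf(lr, df=int(full.df_model - null.df_model))
```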
Parent effect: LRTs reveal significant parent effects in females (gonadal LR=5.23, p=0.022; subcutaneous LR=5.42, p=0.020). According to the individual models stratified by parent (Table DW_BW_RLM_PARENT), the p-values of the slopes suggest that BW is positively correlated with DW in female gonadal depots both for PATs and MATs (β BW/BW̄ (PAT)=2.58, p(PAT)=0.0048; β BW/BW̄ (MAT)=1.40, p(MAT)=0.010). In female subcutaneous depots, the trend is the same as in gonadal ones, but does not reach significance. In both depots, female MATs have lower DW for the same BW than PATs, and thus lower fat percentages. In males, LRTs show very significant parent effects (gonadal LR=7.48, p=0.0063; subcutaneous LR=11.29, p=0.00078). However, what those LRTs are detecting is a significant difference between a non-significant positive slope β BW/BW̄ (PAT) and a non-significant negative slope β BW/BW̄ (MAT). Thus, we consider that male DW is uncorrelated with BW under parent stratification, and that a parent effect on male DW is inconclusive.
To summarise, DW is positively correlated with BW in female gonadal depots, but not in female subcutaneous, or either male depot. In addition, there is no genotype effect on DW, but there is a parent effect in females, with MAT females having lower fat percentages than PAT ones.

Cell level
First, we calculate probability density functions (pdfs) for the hand traced dataset and for the DeepCytometer segmented whole slides, to compare hand traced populations to DeepCytometer whole slide ones. Second, we study genotype or parent effects on cell area in the same depots as before, adjusting for DW, as well as cell area correlation with DW. Finally, we present heatmaps of cell areas computed from the DeepCytometer segmentations. These provide a clear picture of whether there are local correlations or a uniform spatial distribution of cell sizes across whole slides, as well as whether slides from the same stratum have similar cell population spatial distributions.

Area population distributions of hand traced cells
The hand traced dataset consists of 1,903 cells pooled from 60 subcutaneous windows and 20 mice (see Data). The cells measured between 66.0 μm 2 (321 pixels) and 19,058.2 μm 2 (92,544 pixels). To represent cell populations, we estimated probability density functions (pdfs) of the areas of hand traced cells ( Fig. MANUAL_POPULATION_HISTOS(a)). We also calculated the cell area Harrell-Davis (HD) quartiles (Q1, Q2, Q3) with 95%-CIs ( Fig. MANUAL_POPULATION_HISTOS(b)). Male cells were notably larger than female cells for each quartile. On the other hand, for each sex and quartile, PAT and MAT areas were similar, with overlapping 95%-CIs (although there is a trend for smaller MAT cells in each quartile).
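Harrell-Davis quartiles and their standard errors are available in SciPy's masked-statistics module; the sketch below uses a synthetic lognormal area sample of the same size as the hand traced dataset, purely for illustration.

```python
# Harrell-Davis quartiles (Q1, Q2, Q3) with standard errors for a cell-area
# sample; `areas` here is synthetic, not the hand traced data.
import numpy as np
from scipy.stats.mstats import hdquantiles, hdquantiles_sd

areas = np.random.default_rng(2).lognormal(mean=7.5, sigma=0.6, size=1903)
q = hdquantiles(areas, prob=[0.25, 0.5, 0.75])       # quartile estimates
se = hdquantiles_sd(areas, prob=[0.25, 0.5, 0.75])   # jackknife standard errors
```

Unlike plain order-statistic quantiles, the Harrell-Davis estimator is a weighted combination of all order statistics, which makes it less noisy for moderate sample sizes and provides the standard errors used in the meta-analysis below.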

Area population distributions of DeepCytometer segmented cells
We estimated pdfs from the areas of DeepCytometer segmented cells (with the Corrected method), one pdf per slide ( Fig. SEG_POPULATION_HISTOS(a)-(b)). We segmented 75 inguinal subcutaneous and 72 gonadal whole histology slides, corresponding to 73 females and 74 males, to produce 2,560,067 subcutaneous and 2,467,686 gonadal cells. We combined all the pdfs by computing pdf HD quantiles q={2.5%, Q1, Q2, Q3, 97.5%} (i.e. quantiles of density values instead of cell areas), and displayed them as shaded areas and solid curves in Fig. SEG_POPULATION_HISTOS(c)-(d): 2.5%-97.5% form the 95%-interval (light shaded area) and Q1-Q3 form the interquartile range (dark shaded area). The Q2 curve (solid blue) provides a median histogram for each stratum. In addition, we computed the cell area quartiles Q1, Q2, Q3 for each pdf, and their standard errors. We combined those estimates using the inverse-variance meta-analysis method to produce one combined quartile estimate per stratum.
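The fixed-effect inverse-variance combination of per-slide quartile estimates can be sketched as follows, assuming each slide contributes an estimate with its standard error; the function name is ours.

```python
# Fixed-effect inverse-variance meta-analysis: combine per-slide estimates
# q_i with standard errors se_i into one estimate per stratum.
import numpy as np

def inverse_variance_combine(q, se):
    q, se = np.asarray(q, float), np.asarray(se, float)
    w = 1.0 / se ** 2                     # inverse-variance weights
    q_comb = np.sum(w * q) / np.sum(w)    # weighted mean estimate
    se_comb = np.sqrt(1.0 / np.sum(w))    # combined standard error
    return q_comb, se_comb
```

Slides with noisier quartile estimates (larger standard errors, e.g. from fewer cells) thus contribute less to the stratum estimate.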

Comparison of hand traced vs. DeepCytometer segmented cell populations
In section Results -Segmentation sub-pipeline validation, we showed that DeepCytometer automatic segmentation approximates hand tracing in training windows, and is faster. In this section, we test whether whole slide segmentation also adds valuable population information compared to training window segmentation. First, we compared hand tracing of the training windows sampled from 20 subcutaneous whole slides against the DeepCytometer segmentations of those same 20 whole slides. For the purpose of this experiment, it is enough to stratify by sex and parent, omitting genotype. We computed HD quartiles (Q1, Q2, Q3) for each mouse and combined them using inverse-variance meta-analysis as in the previous section. Pdfs and quartiles are plotted in Fig. . In all but one stratum, the DeepCytometer segmentation quartiles are larger than the hand traced ones. Whereas for males the area difference is between -8.72% and +20.42%, for female MATs it ranges between +57.65% and +65.15%. This suggests that our 1,903 hand traced cells from 60 windows and 20 mice, despite forming a rather large training dataset, misrepresent the whole slide populations in the four strata, especially for female MATs. Namely, they undersample the long right-hand tails of the pdfs, i.e. the larger cells in the population. These errors are not systematic, and vary between strata. This could be partly due to hand tracing sampling a relatively small number of cells per mouse and pooling them. This highlights the need for whole slide analysis.
Furthermore, to test whether 20 whole slides are enough to represent cell populations, we computed pdfs from the other 55 subcutaneous DeepCytometer whole slide segmentations for a total of 75 subcutaneous pdfs (as well as from the 72 gonadal slides, for completeness) (Fig. SEG_POPULATION_HISTOS). The quartile area difference between the 20 and 75 whole subcutaneous slides is shown in Fig. SEG_POPULATION_HISTOS(g). Although the quartile areas are similar for females (from -4.10% to +9.00%), the subpopulation estimates change substantially for males (from -14.28% to +22.36%). Therefore, both whole slide analysis and the analysis of more mice significantly changed the cell population pdfs. Both increases are enabled by DeepCytometer's segmentation.

DW, genotype and parent effects on cell area quartiles
So far, our comparison of cell area quantiles has not taken into consideration confounders such as BW or DW. As we have previously studied DW as a function of BW, in this section we complete the three level phenotyping analysis with OLS models (area q~genotype * DW) and (area q~parent * DW), stratified by sex and depot, where area q are the area quartiles q={Q1, Q2, Q3} (Table AREAQ_DW_LRT). (We removed 2 gonadal and 1 subcutaneous slides from the analysis due to missing BW and DW records for two mice.) We apply a similar approach as before, using LRTs to assess genotype and parent effects, and slopes to assess correlation. The p-values of the slopes β q =β DW (q) were jointly corrected using Benjamini-Krieger-Yekutieli across the 24 models that correspond to 3 quartiles, 2 sexes, 2 genotypes/parents and 2 depots. Individual linear models stratified by genotype are shown in Table AREAQ_DW_GENOTYPE_LINREG, and LRTs in Table AREAQ_DW_LRT(a). For the gonadal depot, the individual linear models for WT and Het are visually very close, and LRT comparisons show no significant difference between WT and Het for females or males. For the individual models, there is a statistically significant correlation between DW and cell area in female WTs (β Q2 =2360.7 μm 2 /g, p Q2 =0.030; β Q3 =4420.8 μm 2 /g, p Q3 =0.030), but not in female Hets. However, this could be due to slightly higher variance for Het values. Both visual assessment and LRT p-values suggest a trend of WT and Het female gonadal cell area increasing with DW. By contrast, in male gonadal depots, visual assessment, the slopes and their p-values suggest constant cell area regardless of DW.
Subcutaneous depots visually show that cell area increases with DW in female and male WTs, and that cell areas are smaller in Hets, at least as a trend. However, in females, slopes β q are not statistically significantly different from zero according to their t-test p-values. There is no evidence of a difference between female WT and Het models according to LRT p-values either. On the other hand, p-values for male WT slopes are very significant (p Q1 =p Q2 =p Q3 =0.0011) and indicate that cell area increases with DW (β Q1 =720.9 μm 2 /g, β Q2 =1735.2 μm 2 /g, β Q3 =2744.8 μm 2 /g), whereas Het slopes are not significantly different from 0. The difference between WT and Het models is significant according to LRTs (p Q1 =0.0091, p Q2 =0.0058, p Q3 =0.014). Thus, there is evidence that cell area increases with DW for WTs, but there is no evidence of correlation with DW for Hets. Individual linear models stratified by parent are shown in Table AREAQ_DW_PARENT_LINREG, and LRTs in Table AREAQ_DW_LRT(b). The plots display positively correlated cell area to DW. However, the residuals reveal heteroscedasticity and autocorrelation, and t-tests of the β q coefficients return non-significant p-values after multitesting correction due to the variance of area values. Thus, it is inconclusive whether DW and cell area are truly correlated. Nonetheless, the LRTs suggest a very significant parent effect in female cells, both gonadal (p Q1 =0.01, p Q2 =0.0023, p Q3 =0.0022) and subcutaneous (p Q1 =0.00082, p Q2 =0.0015, p Q3 =0.0021). Visual inspection of the OLS plots suggests that this difference arises from larger MAT than PAT cell areas for a given DW. However, due to the aforementioned issues with the residuals, we qualify this phenotype as inconclusive.
For males, we have a case analogous to the depot level above, as in some cases the LRTs show a significant difference between PAT and MAT, but all the gonadal and subcutaneous slopes β q are non-significant. Thus, we conclude that cell area and DW are uncorrelated, and there is effectively no parent effect.
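The joint multiple-testing correction used for the slope p-values above is available in statsmodels as the two-stage FDR method; the p-values below are illustrative, and in our setting the 24 slope p-values (3 quartiles × 2 sexes × 2 groups × 2 depots) would be corrected in one call.

```python
# Two-stage Benjamini-Krieger-Yekutieli FDR correction of slope p-values,
# via statsmodels' multipletests; the p-values are illustrative.
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.02, 0.04, 0.2, 0.5, 0.8])
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method='fdr_tsbky')
```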

Quantile colour map plots to assess cell size heterogeneity
In order to gain insight into the spatial distribution of adipocyte populations, we used the area-to-colour map described in Methods -Quantile colour map plots to assess cell size heterogeneity to visualise cell area distribution in all whole slides processed by DeepCytometer, both in AIDA, to visually assess the segmentation, and in figures for this paper ( Fig. COLORMAP_F_GWAT-Fig. COLORMAP_M_SCWAT). The colour map is linear with area quantile, rather than area, and we use separate colour maps for females and males, due to sexual dimorphism. The results clearly show local subpopulations or clusters of white adipocytes. These clusters are irregular in shape, and present high inter- and intra-slice variability, even within the same sex and depot stratum. They illustrate the challenges for statistical studies of cell populations performed on subsamples of whole slides, such as our hand traced dataset. Namely, cells within clusters are correlated observations, so although spatial analysis is outside the scope of this paper, we conjecture that an apparently large number of cells (~2,000 in our hand traced dataset) may not properly represent the mixture of subpopulations in the original whole slides.
The main findings we report are genotype/parent effect (significant difference between WT and Het, or PAT and MAT, respectively) and correlation (β slope coefficient significantly different from zero) between trait and covariate (BW or DW). N.e.: no effect (genotype or parent, respectively). E: effect. N.c.: no correlation between continuous covariate and dependent variable. C: correlation. n.s.: not significant.
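The quantile-linear mapping used for these heatmaps can be sketched as follows; the function name is ours, and the final colour would come from applying any perceptually uniform colormap to the returned values, with separate female and male reference populations to reflect the sexual dimorphism reported above.

```python
# Map each cell area to its empirical quantile within a reference population,
# so that colour varies linearly with quantile rather than area.
import numpy as np

def area_to_quantile(areas, reference):
    # fraction of reference cells with area <= each query area
    ref = np.sort(np.asarray(reference, float))
    return np.searchsorted(ref, areas, side='right') / ref.size
```

Because cell area distributions are heavily right-skewed, a colour scale linear in area would compress most cells into a narrow colour band; the quantile mapping spreads the population evenly over the colormap.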

Discussion
This paper comprised two parts: first, we presented DeepCytometer, a pipeline to segment white adipocytes from high resolution H&E histology, together with visualisation tools for the results. Second, we presented a phenotyping framework for white adipose tissue that we applied to Klf14 mouse data. Unlike previous methods, which have been limited to processing small windows containing mostly white adipocytes with good image quality, DeepCytometer can process whole slides containing cell overlaps, other types of tissue, image artifacts and variations in image quality. In the first part of the paper, we addressed several problems that naturally arise in whole slide segmentation: 1) Coarse tissue mask. We applied traditional image processing techniques to segment tissue areas on the 16× downsampled slide. This resolution was enough to obtain a tissue mask, avoided processing large empty background areas, and its computation time was negligible compared to segmentation. 2) Adaptive tiling for whole slide processing. We improved on previous tiling approaches by proposing a method that does not discard cells cropped by tile edges, and chooses each tile's location and size to reduce overlap and redundant computation. We found a 16.59% reduction in the number of processed pixels on 147 histology images, compared to uniform tiling. As processing time is linear in the number of pixels, this resulted in an overall speedup of whole slide segmentation. 3) White adipocyte segmentation. We built on previous cell segmentation work, with a combination of FCNs and post-processing methods, in particular watershed and mathematical morphology (we call this our Auto method). As in previous work, we estimated an EDT from adipocyte membranes with a CNN, but followed it by a Contour CNN to find the EDT troughs, because we found that in practice, previously proposed simple post-processing methods did not work consistently on whole slides. 
In addition, we proposed the Correction CNN to correct Auto labels to account for cell overlap (Corrected method). Our median Dice coefficients were 0.89 (Auto) and 0.91 (Corrected), showing good agreement with the hand traced segmentation. The median relative area errors were -10.19% (Auto) and 4.19% (Corrected). Following previous practice in the literature, these measures would validate our segmentation method. However, in this paper we also proposed looking at segmentation errors over the population distribution. We found that the segmentation area error was roughly proportional to cell area for cells ≥ 780 μm 2 , which is desirable, but shifted towards more positive values for smaller cells, the bottom 15.9% of the population. This illustrated how segmentation errors in subpopulations can go undetected by summary statistics, and highlighted the need to validate segmentation errors as a function of cell size. This did not affect our phenotyping evaluation, as we used the cell area population quartiles (25%, 50% and 75%) and not the smallest cells. 4) White adipocyte classification. To determine which segmented objects are white adipocytes, first we classify each histology pixel, and then accept objects that contain ≥ 50% white adipocyte pixels. We weighted the object classifier's ROC by the number of pixels per object, to balance the contribution to classification errors of white adipocytes (small but numerous) against other objects (scarcer but large). The 0.97 area under the ROC indicated overall good performance of the classifier. With the 50% acceptance threshold, the false positive rate (FPR) = 15% and true positive rate (TPR) = 95%, which we considered acceptable for phenotyping. If a lower FPR were required in other work, the 50% threshold could be raised. Given the large cell counts in the slides, the resulting lower TPR may be an acceptable trade-off. 5) Ground truth / training data. 
We created a ground truth / training data set for the EDT, Contour and Correction CNNs with 55 random windows from 20 mice, totalling 2,117 white adipocyte hand traced contours, for 10-fold cross validation.
For the Tissue CNN, we added another 71 windows containing only non-white adipocyte regions. This was necessary to create a balanced set of ≈ 23.7 · 10 6 white adipocyte pixels and ≈ 45.1 · 10 6 non white adipocyte pixels. Non-white adipocyte areas were easier to manually segment because they are larger and do not need precise outlining. For the cell population studies, we added another 5 windows (for a total of 60) from two mice with undersampled populations, but removed 214 segmented objects (leaving a total of 1,903 cells) that we doubted were white adipocytes. We hope that the data set will be a useful resource in the field, and have made it available for download at XXXXXX. 6) Segmentation results visualisation. Integration of our pipeline with AIDA allowed us to review whole slide segmentation results in real-time. AIDA was launched on the same GPU server as the pipeline, and enabled real-time review of the results from a desktop or laptop using a regular browser. The pipeline can save each block as a separate layer, or all blocks in the same layer for display. AIDA also allows manual correction of labels. The Results -Comparison of hand traced vs. DeepCytometer segmented cell populations experiment returned very significant differences between the distributions obtained from the hand traced data set and the whole slide automatic segmentations used as ground truth. This suggests that even though the hand traced data set contained 1,903 cells from 60 random windows, it failed to represent the true distribution of white adipocyte areas. Although spatial analysis is outside the scope of this work, we conjecture that intra-slice clustering introduces strong spatial local correlations in cell size, thus reducing the effective size of the training dataset ( Fig. COLORMAP_F_GWAT-Fig. COLORMAP_M_SCWAT). 
This difference between hand traced and DeepCytometer populations highlights the need for whole slide segmentation and analysis, with pipelines that can run on dozens or hundreds of slides. Furthermore, it raises the question for future work of whether a single whole slide provides an appropriate representation of a whole depot's cell population, whether fat phenotype studies should include multiple slices covering each depot, or whether our work should be extended to 3D modalities such as fluorescent microscopy, considering the increase in financial and computational cost.
In the second part of this paper, we performed an explanatory study of Klf14 phenotypes in B6NTac mice at three nested levels -animal (body weight, BW), depot (depot weight, DW) and cell (quartile cell area)-in four strata defined by sex (female/male) and depot (gonadal/subcutaneous). Cell areas were obtained by DeepCytometer from 75 inguinal subcutaneous and 72 gonadal full histology slides. At the BW level, control mice presented marked sexual dimorphism, with males weighing 52.8% more than females on average. Thus, the rest of the study was stratified by sex. At each level, we considered two phenotype assessments: 1) correlation between variables (e.g. BW vs DW) via t-tests of linear model slopes and 2) genotype (WT/Het) or parent (PAT/MAT) effects using LRTs. In addition, at the cell level we computed pdfs of cell areas.
At the animal level (BW), there was no genotype effect, but we found a parent effect, as MAT females were 14.7% heavier than PATs. This would suggest a phenotype where females with mothers that carry the KO allele are heavier, regardless of whether the daughter carries it herself. At the depot level (DW ~ BW), males had no genotype or parent effects, and their BW and DW were uncorrelated. Females showed no genotype effect either. Nonetheless, there was an interesting non-phenotypic difference between VAT and SAT: stratified by genotype, BW and DW were correlated in gonadal depots, but not in subcutaneous depots. Females showed a depot parent effect, with MAT depots being smaller than PAT ones. This is remarkable, because female MATs are then heavier but have smaller fat depots than their PAT counterparts, i.e. they are both larger and leaner mice. Furthermore, when stratifying by parent, we observed a similar non-phenotypic difference as above: BW and DW were correlated in gonadal depots, but the slope in subcutaneous depots was statistically non-significant. Thus, both at the animal and depot weight levels, there are phenotypes for female MATs. At the cell level (area_q ~ DW), it is noteworthy that males exhibited a phenotype, but only as a subcutaneous genotype effect, with cell area in Hets being smaller than in WTs. In that depot, male area_q is correlated with DW in WTs but not in Hets. These two observations, together with the fact that there was no genotype effect in male DW itself, would suggest that male WTs and Hets have similar subcutaneous DW, with the distinction that in WTs, larger DW is achieved by white adipocyte enlargement, whereas in Hets, it is achieved by cell multiplication. In female gonadal depots, there is no genotype effect according to the LRT. 
However, DW and area_q are correlated in WTs but not in Hets; because at the depot level, DW and BW were correlated for both WTs and Hets, this would suggest that WT female gonadal white adipocytes grow with BW, but in Hets, cell growth is limited and instead DW increases by cell multiplication. The female gonadal parent effect analysis is inconclusive, due to high data variance and poor suitability for a linear model. Finally, female subcutaneous depots show no correlation between DW and area_q with genotype stratification, and only a weak trend of Het areas being smaller than WT ones. The parent effect is inconclusive as before, although there is a trend for MAT areas being larger than PAT ones. In summary, our exploratory analysis reveals interesting phenotype leads by breaking down the study into animal, depot and cell levels, and assessing parent and genotype effects. Of particular interest are strata where depot size can be explained in terms of white adipocyte size vs. multiplicity, which would be directly connected to hypertrophy vs. hyperplasia. However, further investigation with more mice is necessary, to be able to examine genotype:parent interactions and provide explanatory mechanisms for the phenotypes.
In future work, the pipeline could also be redesigned with alternative deep learning approaches that first find candidate objects and then segment them (e.g. based on Faster R-CNN 58 , Mask R-CNN 59 or TensorMask 60 ), as opposed to DeepCytometer, which segments all objects and then classifies them. Although we have shown that DeepCytometer can analyse hundreds of slides with common resources, we would like to improve its architecture to scale it to studies with thousands of slides, without needing large cloud computing resources.

Data
Mouse description, tissue acquisition and imaging
To develop and evaluate our methods we analysed adipose tissue histological samples from mice carrying (Het) or not (WT) a Klf14 gene knockout on either maternally (MAT) or paternally (PAT) inherited chromosomes, described previously 13 . The summary of the characteristics of the 20 Klf14-C57BL/6NTac (B6NTac) mice used for training and testing the DeepCytometer pipeline, as well as the hand traced population experiments, is shown in Table MICE. The mean ± standard deviation mouse weights were 26.6 ± 3.7 g (female PAT), 26.2 ± 2.9 g (female MAT), 37.6 ± 1.9 g (male PAT), 39.9 ± 3.8 g (male MAT).
The histopathology screen involved fixing, processing and embedding in wax, sectioning, and staining with Hematoxylin and Eosin (H&E) of both inguinal subcutaneous and gonadal adipose depots. For paraffin-embedded sections, all samples were fixed in 10% neutral buffered formalin (Surgipath) for at least 48 hours at RT and processed using an Excelsior™ AS Tissue Processor (Thermo Scientific). Samples were embedded in molten paraffin wax and 8 μm sections were cut through the respective depots using a Finesse™ ME+ microtome (Thermo Scientific). Sampling was conducted at two sections per slide and three slides per depot block, onto charged slides, stained with haematoxylin Gill 3 and eosin (Thermo Scientific) and scanned using an NDP NanoZoomer Digital pathology scanner (RS C10730 Series; Hamamatsu).

Ground truth hand traced dataset for CNN training and cell population studies
We created two slightly different hand traced datasets. For DeepCytometer training and validation, we randomly sampled each of the 20 training histology slides to extract a total of 55 histology training windows with size 1001×1001 pixels and traced white adipocytes, other types of tissue (connective, vessels, muscle, brown adipocytes, etc.) and background areas. Another 71 windows were manually selected to add more examples of only other types of tissue to train the tissue classifier. This first dataset is summarised in Table MICE in black. For cell population studies, we added 5 random windows from 2 mice whose cell population was undersampled, and we removed those white adipocyte objects with ambiguous interpretation, namely fully overlapped, suffering from image artifacts or comprising very small gaps between clear adipocytes. This second dataset is summarised in Table MICE in blue.
The total number of hand traced white adipocytes is 2,117 and 1,903, respectively, with a roughly balanced split between the four groups formed by the female/male and PAT/MAT partitions (Table CELL-NUM). The total number of "other tissue" objects is 232, and of "background", 24. (Note that cell objects tend to be much smaller than "other" or "background" objects). Each window contained one or more of these types of objects. Hand tracing was performed with the image editor Gimp (www.gimp.org) over a month. Automatic levels correction and manual curves adjustment were temporarily applied to each window to improve image contrast for the human operator. Contours were drawn as linear polygons on the outermost cell edge, accounting for the cell overlaps shown in Fig. OVERLAP(b), and exported with an ad hoc plugin as SVG files for further processing. For "other" or "background", representative linear polygons were drawn, avoiding complex boundaries. This approach produced partially labelled training windows.

Methods
Coarse tissue mask and adaptive tiling for full processing of whole slides

Coarse tissue mask
Histology slides in Hamamatsu NDPI format can be read by blocks of arbitrary size at a fixed number of precomputed resolution levels with OpenSlide 61 . At the highest resolution level, our images have pixel size 0.454 μm. We read the whole histology slide at the precomputed ×16 downsampling level (7.26 μm pixel size), applied contrast enhancement and computed the mode and standard deviation of each RGB channel of the image. We assume that the background colour is centred around the channel modes, as background pixels are more numerous than tissue pixels. To segment the tissue, we thresholded the downsampled image for pixels that are darker than the mode minus a multiple of the standard deviation in all RGB channels (see Fig.  RMABb). Then, we applied morphological closing with a 25×25 kernel at the ×16 downsampling level, filled holes smaller than 8,000 pixels (421,759 μm 2 ), and removed connected components smaller than 50,000 pixels (2,635,994 μm 2 ).
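The mask computation can be sketched as follows. The threshold factor `k` and the clean-up sizes are illustrative stand-ins, not the paper's values, and `scipy.ndimage` replaces the original implementation:

```python
import numpy as np
from scipy import ndimage

def coarse_tissue_mask(img, k=0.25):
    """Sketch: keep pixels darker than (channel mode - k * channel std)
    in every RGB channel, then clean up with morphological closing,
    hole filling and small-component removal.  `k` and the size
    thresholds are toy values for this illustration."""
    mask = np.ones(img.shape[:2], dtype=bool)
    for c in range(3):
        channel = img[..., c]
        vals, counts = np.unique(channel, return_counts=True)
        mode = vals[np.argmax(counts)]      # background is the most common value
        mask &= channel < mode - k * channel.std()
    mask = ndimage.binary_closing(mask, structure=np.ones((25, 25)))
    mask = ndimage.binary_fill_holes(mask)  # stand-in for the size-limited hole fill
    labels, n = ndimage.label(mask)
    sizes = ndimage.sum(mask, labels, range(1, n + 1))
    keep = np.flatnonzero(sizes >= 50) + 1  # toy component-size threshold
    return np.isin(labels, keep)
```

At ×16 downsampling the whole-slide mask fits comfortably in memory, which is why this step is negligible next to segmentation.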

Adaptive tiling
When using uniform tiling, to guarantee that any cell will be processed whole in at least one tile, adjacent tiles need to overlap by at least R + d_max, where R is the receptive field (the diameter of input pixels that affect each output pixel) and d_max is the diameter of the largest cell. This overlap introduces repeated processing of the same pixels and multiple segmentations of the same cells from adjacent tiles, but when it is ignored 11 , cells cropped by tile edges need to be discarded. To ameliorate this problem, we propose an iterative tiling algorithm that adapts the block size and overlap of each new tile according to the local cell size and tissue mask, such that the whole coarse tissue mask is covered, all cells are segmented, and redundant computations are reduced. For this, first we flag all pixels in the coarse tissue mask as "to be processed". To find the location of the first block, the mask is convolved with two small linear kernels. Pixels > 0 in both outputs are potential locations for the block's top-left corner. Any of those locations guarantees that the block has at least one mask pixel on its top and left borders. The algorithm chooses the first of the candidate pixels, in row-column order. The bottom-right corner of the block is initially chosen to obtain a block with maximum size 2751×2751 pixels (the maximum allowed by GPU memory). The block's right side and bottom are then cropped to remove empty columns or rows, producing an adaptive size. Finally, the block is extended by half the effective receptive field on each side, to prevent border effects. (This extra border is discarded after image processing operations.) If the block overflows the image, it is cropped accordingly. The image block is then segmented; pixels from cells cropped by the edges keep their "to be processed" flag, so they will be included in another block. The rest of the mask pixels are cleared, and the process is repeated, iteratively choosing new adaptive blocks until the whole tissue mask is cleared. 
Pseudocode and details for the algorithm are provided in Suppl. Pseudocode for adaptive tiling, and an example of its behaviour is shown in Fig. RMAB.
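A minimal sketch of the corner search, under our reading of the two-kernel step (toy block size; the paper's kernels and sizes differ):

```python
import numpy as np

def first_tile_corner(mask, block=8):
    """Sketch: a pixel qualifies as the next tile's top-left corner when
    the block anchored there would have at least one "to be processed"
    mask pixel on both its top row and its left column; the first such
    pixel in row-column order is chosen.  `block` is a toy block size."""
    m = mask.astype(int)
    top = np.zeros_like(m)    # mask pixels along the prospective top border
    left = np.zeros_like(m)   # mask pixels along the prospective left border
    for d in range(block):    # shifted sums emulate the line-kernel convolutions
        top[:, :m.shape[1] - d] += m[:, d:]
        left[:m.shape[0] - d, :] += m[d:, :]
    rows, cols = np.nonzero((top > 0) & (left > 0))
    if rows.size == 0:
        return None           # mask fully processed: tiling is done
    return int(rows[0]), int(cols[0])
```

In the full algorithm this search is repeated after each segmented block clears its mask pixels, so tiles are only placed where unprocessed tissue remains.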

Deep CNN architectures
In this section we describe the function of the four different DCNNs used in this work, with illustrative examples (Fig. DMAP-Fig. CORRECT), a summary of their architectures (Table CNN), and the calculation of their effective receptive fields (ERFs) (Table ERF). Training and validation details are provided in Supplementary material. These networks are components of the Segmentation sub-pipeline that we describe in the next section. The networks are fully convolutional 49 , so that tile size can be adjusted to available GPU memory and the needs of the adaptive tiling. We used a stride of 1 to avoid downsampling followed by deconvolution, and thus preserve high-resolution segmentation following 62 . We used atrous or dilated convolution 63-65 to enlarge the ERF following 47,62 . RGB 8-bit unsigned integer values in the image files were converted to 32-bit float type and scaled to [0, 1] or [-1, 1] depending on the specifications of the network they are being fed to.

Histology to EDT regression CNN (EDT CNN)
This network (see architecture in Table CNN(a)), based on previous work by 51,66 and similar to 50,67 -as discussed in the Introduction-takes an input histology image and estimates the Euclidean Distance Transform (EDT) as the Euclidean distance of each pixel to the closest label boundary point (see EDT CNN and Contour CNN training dataset below for details). This produces an image similar to an elevation map (Fig. DMAP), where each "hill" defines an object (e.g. a white adipocyte, an area of muscle tissue or a vessel cross-section). If the object is a white adipocyte, troughs represent the cell's membrane boundary or a compromise boundary between overlapping cells. Trough points are all critical points (extrema or saddle points), but have different values or "elevations". We found that troughs in our whole slides could not be segmented with simple segmentation methods as in 50,67 . Instead, we trained the following Contour network for that task.

EDT to Contour detection CNN (Contour CNN)
This network (see architecture in Table CNN(b)) classifies each EDT pixel as either belonging or not to a trough (Fig. CONT).

Pixel-wise tissue classifier CNN (Tissue CNN)
This network (see architecture in Table CNN(c)) classifies each pixel from the histology block as "other type of tissue" vs. "white adipocyte" or "background" (see Fig. CLASS). "White adipocyte" and "background" are combined in one class because a gap in the tissue and the inside of a white adipocyte have the same appearance in the histology. The output of this network is used to classify a segmented object as white adipocyte or not, according to the proportion of "white adipocyte" / "background" pixels it contains (see section DeepCytometer pipeline below for details).

Segmentation correction regression CNN (Correction CNN)
This network (see architecture in Table CNN(d)) takes the cropped and scaled histology of a single object, multiplied by a mask derived from its segmentation, and estimates which pixels underestimate (detection error) or overestimate (false positive) the segmentation. This output is then used to correct the object's segmentation (see section DeepCytometer pipeline below for details). Because each object's segmentation is corrected separately, corrected boundaries can overlap. To create an input for the network, the histology tile is cropped using a square box at least twice the size of the segmentation mask's bounding box, and scaled to a fixed size of 401×401 pixels to make the network blind to cell size. The scaling factor is k = 401/w, where w × w is the size of the cropping window. Then, the histology RGB values are multiplied by +1 within the segmentation mask, and by -1 outside it. By contrast, 68 's QualityNet1 multiplied the histology RGB values by 0 outside the mask. Our approach preserves the information outside the segmentation, while still partitioning the histology image into two sets of inside/outside pixels. Moreover, instead of estimating one single quality measure for the whole input image as in QualityNet1, we estimate whether the segmentation is correct per pixel, computing a value from -1 (undetected pixel) to +1 (false positive pixel) through 0 (correctly segmented pixel).
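The input preparation can be sketched as below; the helper name, nearest-neighbour resizing and simplified border handling are ours, not the paper's code:

```python
import numpy as np

def correction_input(histology, mask, out_size=401):
    """Sketch of the Correction CNN input: crop a square window at least
    twice the mask's bounding box, multiply RGB by +1 inside the
    segmentation and -1 outside, and scale to a fixed 401x401 input.
    Nearest-neighbour resizing keeps the sketch dependency-free."""
    rows, cols = np.nonzero(mask)
    r0, r1 = rows.min(), rows.max() + 1
    c0, c1 = cols.min(), cols.max() + 1
    side = 2 * max(r1 - r0, c1 - c0)           # window >= 2x the bounding box
    cr, cc = (r0 + r1) // 2, (c0 + c1) // 2
    r0 = max(0, cr - side // 2)
    c0 = max(0, cc - side // 2)
    crop = histology[r0:r0 + side, c0:c0 + side].astype(np.float32)
    sign = np.where(mask[r0:r0 + side, c0:c0 + side], 1.0, -1.0)
    signed = crop * sign[..., None]            # +1 inside, -1 outside
    # nearest-neighbour scaling to the fixed network input size
    idx_r = np.arange(out_size) * signed.shape[0] // out_size
    idx_c = np.arange(out_size) * signed.shape[1] // out_size
    return signed[np.ix_(idx_r, idx_c)], out_size / side   # input, scaling factor k
```

The ±1 multiplication keeps the outside histology visible to the network (unlike zeroing it out), while still encoding which pixels the candidate segmentation claims.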

Effective receptive field (ERF)
The theoretical receptive field (span of input pixels that contribute to an output pixel) can be computed considering the properties of convolutions, downsampling and pooling. However, the weight of an input pixel's contribution decreases quickly towards the edges of the span, and thus the effective receptive field (ERF) is much smaller than the theoretical one 69 . To estimate the ERF, we used gradient backpropagation 69 , but replaced ReLU activations by Linear activations and Max Pooling by Average Pooling to avoid numerical instabilities. The ERF was around 131×131 pixels (Table ERF), or 37.1% of maximum cell diameter (160 μm or 353 pixels), causing the EDT CNN to clip the estimated distance to the membrane for distances larger than the ERF. However, the segmentation validation showed no performance drop for large cells. This is because distant points do not contribute essential information to the Contour CNN. This was convenient, as increasing the ERF is computationally expensive, generally requiring deeper networks.

Segmentation sub-pipeline
The segmentation sub-pipeline combines the EDT, Contour, Tissue and Correction CNNs with traditional image processing methods ( Fig. PIPE(b)) to segment an input histology image tile, producing one label per white adipocyte.

Histology colour correction
To estimate the typical background colour of the training data set, we computed density histograms with 51 bins between 0-255 for each RGB channel of the 126 training images (Table MICE). The 50% HD quantile was computed for each bin, producing a median histogram for each channel. The modes of each median channel were taken as the typical background colour: mode_R = 232.5, mode_G = 217.5, mode_B = 217.5. Colour correction was applied for inference but not for training, to reduce overfitting. To apply colour correction to a histology slide, we estimated the mode intensity m_R, m_G, m_B of each channel.
Then, each colour channel was corrected by shifting its mode onto the corresponding training background mode, e.g. R' = R + (232.5 - m_R), and similarly for G and B.
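A sketch of the correction, assuming an additive per-channel shift of the slide's background mode onto the training modes (the paper's exact correction formula is not reproduced here):

```python
import numpy as np

# Assumed training background modes (232.5, 217.5, 217.5), taken from the text.
TRAIN_MODE = np.array([232.5, 217.5, 217.5])

def colour_correct(img):
    """Shift each RGB channel so that its 51-bin histogram mode lands on
    the training-set background mode.  The additive-shift form is an
    assumption of this sketch."""
    out = img.astype(np.float64)
    for c in range(3):
        hist, edges = np.histogram(img[..., c], bins=51, range=(0, 255))
        mode_c = edges[np.argmax(hist)] + (edges[1] - edges[0]) / 2  # bin centre
        out[..., c] += TRAIN_MODE[c] - mode_c
    return out
```

Because the background dominates the pixel counts, the histogram mode is a robust background estimate even on slides with large tissue areas.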

White adipocyte label segmentation without overlap (Auto)
The colour-corrected histology image was used as input to the EDT CNN, and its output to the Contour CNN. To conservatively detect pixels inside objects, we thresholded the resulting image with a zero threshold (pixels on or near EDT troughs have values > 0). This produced one connected component inside each cell or object. We filled holes with fewer than 10,000 pixels, and each connected component was given a different label. Components with fewer than 400 pixels were removed. The colour-corrected histology image was also fed to the Tissue CNN, and pixels with score > 0.5 were labelled as white adipocyte pixels. All non-white adipocyte pixels that did not already belong to a seed were labelled as a single new seed. Seeds were expanded using a watershed algorithm on the negative of the EDT surface (the negative sign turns hills into basins). Each watershed basin corresponded to a candidate object. Objects were rejected if they: 1) were smaller than 1,500 pixels (308.9 μm 2 ), 2) did not overlap at least 80% with the coarse tissue mask, 3) did not contain at least 50% white adipocyte pixels, 4) touched an edge, or 5) were larger than 200,000 pixels (41,187.4 μm 2 ) -the largest training cell was 92,544 pixels (19,058.2 μm 2 ). Objects that were inside another object were merged into the surrounding object. Each surviving label was considered to segment one white adipocyte, but without overlap.
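The rejection rules can be condensed into a single predicate. This sketch works on boolean masks and omits the watershed and object-merging steps:

```python
import numpy as np

def accept_object(label_mask, tissue_mask, wa_pixels,
                  min_area=1500, max_area=200000):
    """Sketch of the acceptance rules for one candidate label: size
    window, >=80% overlap with the coarse tissue mask, >=50% white
    adipocyte pixels, and no contact with the tile edge."""
    area = label_mask.sum()
    if not (min_area <= area <= max_area):
        return False                              # rules 1 and 5
    if (label_mask & tissue_mask).sum() < 0.8 * area:
        return False                              # rule 2
    if (label_mask & wa_pixels).sum() < 0.5 * area:
        return False                              # rule 3
    edge = np.zeros_like(label_mask)
    edge[0, :] = edge[-1, :] = edge[:, 0] = edge[:, -1] = True
    return not (label_mask & edge).any()          # rule 4
```

Rule 4 is what interacts with the adaptive tiling: edge-touching objects keep their "to be processed" flag and are segmented whole in a later tile.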

Segmentation label correction with overlap (Corrected)
The colour-corrected histology image was cropped and resized around each valid white adipocyte label, and passed through the Correction CNN as described above. Output pixels flagged as undetected were added to the label, and pixels flagged as false positives were removed, using a ±0.5 threshold on the correction score. Label holes were filled, and the connected component with the best overlap with the input label was kept as the corrected label. Finally, the corrected label was smoothed using a closing operator with an 11×11 pixel square kernel. (See Fig. CORRECTc).

Segmentation output to AIDA user interface
Each corrected label was converted to a linear polygon using Marching Squares 70 . Point coordinates were converted from the processing window to the whole histology image as (x, y) = k (x', y') + (x_0, y_0), where k is the scaling factor and (x_0, y_0) are the coordinates of the processing window's top-left pixel within the whole histology image. The contours were then written to a JSON file that can be read by the browser interface AIDA. The results are illustrated in Fig. AIDA.

Tissue CNN validation
The Tissue CNN was applied to each of the 126 training images (see Table MICE) according to the 10-fold split, and classification scores > 0.5 were labelled as white adipocyte pixels. Then, for each hand traced contour we computed the white adipocyte score #WA / (#WA + #NWA), where #WA stands for the number of white adipocyte pixels within the object, and #NWA for the number of non white adipocyte pixels.
This was compared to the ground truth score to compute the receiver operating characteristic (ROC) curve, weighting each object by its number of pixels. This way, the ROC takes into account the fact that white adipocyte objects tend to be much smaller and more numerous than other objects. Namely, this allows us to interpret the classification error in terms of the more balanced tissue area classification error (our training contours contain ≈ 23.7 · 10 6 white adipocyte pixels and ≈ 45.1 · 10 6 non white adipocyte pixels).
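The pixel-weighted ROC can be computed with scikit-learn's `sample_weight` argument; the object scores, labels and sizes below are toy numbers, not the paper's data:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Each object contributes its white-adipocyte score, its hand-traced class
# and a weight equal to its pixel count, so large-but-scarce "other"
# objects are not swamped by small-but-numerous adipocytes.
scores = np.array([0.95, 0.80, 0.60, 0.30, 0.10])   # #WA / (#WA + #NWA) per object
truth  = np.array([1, 1, 1, 0, 0])                  # 1 = white adipocyte
pixels = np.array([900, 1100, 800, 50000, 42000])   # object sizes in pixels

fpr, tpr, thresholds = roc_curve(truth, scores, sample_weight=pixels)
auc = roc_auc_score(truth, scores, sample_weight=pixels)
```

Weighting by pixel count means the ROC can be read as a tissue-area classification error rather than an object-count error.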

Segmentation sub-pipeline validation
We applied the segmentation sub-pipeline to each of the 55 colour-corrected images with hand traced cells (see Table MICE), using 10-fold cross validation. This produced Auto and Corrected contours for each image that we compared to the hand traced ones for validation.
The literature relies on summary statistics for segmentation validation. For instance, per-image average diameter 40 , mean Dice coefficient (DC) and mean Jaccard Index 47 , median cell area/volume per subject 13 , IoU and F1-score 71 and mean cell area 11 . As summary statistics, we used the DC and relative area error between each pipeline-produced contour and the hand traced (ht) contours in the image. The DC = 2 A(ht ∩ seg) / (A(ht) + A(seg)), where seg is the Auto or Corrected contour and the areas A(·) are computed with polygon operations. The highest DC was considered the best match. DC ≤ 0.5 was considered no match, as such contours usually segment an adjacent object. To compare cell populations, cell area distributions were computed using box-and-whisker plots with median notches for the hand traced, Auto and Corrected contours with a match (Fig. SEG_VALIDATION). The lower whisker was set at the lowest datum above Q1 - 1.5 (Q3 - Q1), and the upper whisker at the highest datum below Q3 + 1.5 (Q3 - Q1). Data outside the whiskers were displayed as circles. The relative area error was computed as (A(Auto) - A(ht)) / A(ht) and (A(Corrected) - A(ht)) / A(ht), respectively.
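For illustration, the two summary measures can be computed on boolean pixel masks instead of polygons (a simplification of the paper's polygon operations):

```python
import numpy as np

def dice_and_area_error(seg, ht):
    """Dice coefficient DC = 2|A ∩ B| / (|A| + |B|) and relative area
    error (|A| - |B|) / |B|, with A the pipeline (Auto or Corrected)
    mask and B the hand-traced mask."""
    inter = (seg & ht).sum()
    dc = 2 * inter / (seg.sum() + ht.sum())
    rel_err = (seg.sum() - ht.sum()) / ht.sum()
    return dc, rel_err
```

Note that the relative area error is signed, which is what lets the validation distinguish under-segmentation (negative) from over-segmentation (positive) across the population.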
Although they are the standard, summary statistics hide important subpopulation information. Ideally, we would like a constant relative error for all cell sizes, so that segmentation errors do not create spurious changes in cell subpopulations. To assess this for our algorithm, we plotted Auto and Corrected relative errors vs. hand traced cell area. We then sorted the errors by hand traced cell area and computed rolling Harrell-Davis (HD) estimates for quartiles={Q1, Q2, Q3} of the relative errors using a rolling window of 100 points that shrinks as it overflows at the extremes down to a minimum size of 20 points. The rolling median and interquartile range Q1-Q3 were plotted as a solid curve and red shaded area, respectively (Fig. SEG_VALIDATION(c)-(f)). Finally, we computed the global HD estimate for Q2 and plotted it as a horizontal green line.
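A sketch of the rolling quartile curves using SciPy's Harrell-Davis estimator (the window-shrinking rule here is a simplified stand-in for the paper's):

```python
import numpy as np
from scipy.stats.mstats import hdquantiles

def rolling_hd_quartiles(errors, areas, window=100, min_win=20):
    """Sort relative errors by hand-traced cell area, then estimate
    (Q1, Q2, Q3) of the errors inside a window of `window` points
    centred on each cell, never shrinking below `min_win` points."""
    order = np.argsort(areas)
    err = np.asarray(errors)[order]
    n = len(err)
    out = np.empty((n, 3))
    for i in range(n):
        lo, hi = max(0, i - window // 2), min(n, i + window // 2)
        if hi - lo < min_win:   # enforce the minimum window at the extremes
            lo, hi = max(0, hi - min_win), min(n, lo + min_win)
        out[i] = hdquantiles(err[lo:hi], prob=[0.25, 0.5, 0.75])
    return np.asarray(areas)[order], out
```

Plotting the middle column (Q2) as a curve and shading between columns 0 and 2 reproduces the median-plus-interquartile-band display described above.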

Phenotype framework for WAT
We use two main tools to study WAT: cell area probability density functions (pdfs) to represent cell populations, and linear models to phenotype body and depot weight, and cell area.
To estimate cell area pdfs, we applied Kernel Density Estimation with a Gaussian kernel and bandwidth = 100 μm 2 to preserve sharp peaks in the distributions. Quartiles (Q1, Q2, Q3) or other quantiles were computed with Harrell-Davis (HD) quantile estimation (Harrell and Davis 1982). For a weighted average of quantiles from multiple mice, for each p-quantile q_i(p) of the i-th mouse we computed the corresponding HD standard error se_i(p) using jackknife, and applied the meta-analysis inverse-variance method 72,73 . With weights w_i(p) = 1/se_i(p)^2, the combined quantile estimate is q(p) = Σ_i w_i(p) q_i(p) / Σ_i w_i(p), and the combined standard error is se(p) = (Σ_i w_i(p))^(-1/2). The 95%-CI for the combined quantile estimate is 95%-CI(p) = q(p) ± 1.96 se(p). Previous KLF14 phenotype studies stratified datasets by sex 11,13 and compared two groups (sex or risk allele vs. non-risk allele) with summary statistics, namely median adipocyte area with a Wilcoxon signed-rank test 13 , or mean area with inverse variance fixed effects meta-analysis 11 . Neither adjusted for BW (the Wilcoxon signed-rank test compared pairs of BMI-matched subjects, and the meta-analysis study selected subjects within the normal BMI range), although BW is a known general phenotype confounder 74,75 , in particular of mouse adipocyte diameter and DW 28 . In addition, summary statistics could miss changes in cell subpopulations that become apparent when comparing population quantiles 76 .
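The inverse-variance combination of per-mouse quantiles amounts to a few lines (the function name is ours; the formulas follow the standard fixed-effects meta-analysis):

```python
import numpy as np

def combine_quantiles(q, se):
    """Fixed-effects inverse-variance combination of per-mouse quantile
    estimates q_i with standard errors se_i: weights w_i = 1/se_i^2,
    combined q = sum(w q)/sum(w), combined se = sqrt(1/sum(w)),
    95%-CI = q +/- 1.96 se."""
    q, se = np.asarray(q, float), np.asarray(se, float)
    w = 1.0 / se**2
    q_bar = np.sum(w * q) / np.sum(w)
    se_bar = np.sqrt(1.0 / np.sum(w))
    return q_bar, se_bar, (q_bar - 1.96 * se_bar, q_bar + 1.96 * se_bar)
```

Mice with noisier quantile estimates (larger jackknife standard errors) thus contribute less to the combined depot-level quantile.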
To tackle equivalent issues in the mouse model, we propose a Klf14 phenotype framework with three interconnected levels: the mouse level (body weight, BW), the depot level (depot weight, DW) and the cell level (quantile cell area), to remove confounding effects. Also, we study cell sizes at different quantiles; for the sake of simplicity, we use the three quartiles in this work, but the model could be applied to any other quantiles. In each phenotype level, we used linear models to quantify the trait (e.g. cell area) vs. categorical effect (e.g. genotype) and continuous covariate (e.g. depot weight). For this, we built upon the mixed model proposed by Karp et al. 74,75 (trait~genotype*sex + body_weight + (1|batch)), where "batch" groups animals processed the same day, and adipocyte size or DW vs. BW non-linear models by van Beek et al. 28 . Our approach is different in six ways: 1) as we have a smaller number of animals per stratum, we did not consider "batch", and thus replaced mixed models with simpler Ordinary Least Squares (OLS) models or Robust Linear Models (RLMs). In larger studies, batch and other random effects (litter, mother) could be considered. 2) Due to sexual dimorphism that would dominate other effects, we stratified the study by sex, rather than include sex as a covariate. 3) We considered two effects, genotype and parent, instead of just genotype. Due to limited data (~18 mice per sex/depot/effect stratum), we performed separate exploratory analysis for each effect rather than combining both in the same model. 4) We considered two covariates to adjust for, body weight (BW) and depot weight (DW), instead of just BW. 5) We added an interaction, "effect * covariate = effect + covariate + effect:covariate", to allow for different slopes in the trait vs. covariate relationship. 6) We use adipocyte area instead of adipocyte diameter 28 to linearise the relationship with DW. 
Accordingly, the three-level linear models we propose are: (BW ~ effect), (DW ~ effect * BW/B̄W) and (area_q ~ effect * DW), where B̄W = 33.44 g is the mean BW of all animals, effect ∈ {genotype, parent} and q ∈ {Q1, Q2, Q3} are quartiles. BW, DW and area_q are continuous variables, and effect (genotype or parent) is a categorical variable. In addition, an intercept is included in all models, but omitted in the formulas for simplicity.
We extract two results from the models. First, the two-tailed t-test of the continuous variable's slope β tells us whether trait and variable are correlated (due to the equivalence between slope and Pearson's coefficient tests); for a significant correlation with p-value ≤ 0.05, β also provides e.g. the rate of increase of DW with BW/B̄W (BW normalised by its mean). Second, a Likelihood Ratio Test (LRT) evaluates whether the categorical variable has a significant effect on the trait. The LRT compares the null model without the effect to the model with the effect, e.g. (area_q ~ DW) vs. (area_q ~ genotype * DW), both fitted to the same data. (Details are provided in Suppl. Likelihood Ratio Test.)
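The nested-model comparison above can be sketched with plain NumPy/SciPy, computing the Gaussian OLS log-likelihood by hand; the data, helper names and effect sizes below are invented for illustration only:

```python
import numpy as np
from scipy.stats import chi2

def ols_loglik(X, y):
    """Gaussian log-likelihood of an OLS fit (MLE variance estimate)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    n = len(y)
    return -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)

def lrt(X_null, X_full, y):
    """Likelihood Ratio Test for nested linear models; returns (lambda, p)."""
    lam = 2.0 * (ols_loglik(X_full, y) - ols_loglik(X_null, y))
    dof = X_full.shape[1] - X_null.shape[1]   # extra parameters in full model
    return lam, chi2.sf(lam, dof)

# toy data mimicking (area_q ~ genotype * DW) vs. (area_q ~ DW)
rng = np.random.default_rng(42)
n = 40
dw = rng.uniform(0.3, 1.5, n)                 # depot weight (g), synthetic
g = rng.integers(0, 2, n)                     # 0 = WT, 1 = Het, synthetic
area = 1000 + 800 * dw + 400 * g * dw + rng.normal(0, 50, n)
ones = np.ones(n)
X_null = np.column_stack([ones, dw])          # area_q ~ DW
X_full = np.column_stack([ones, g, dw, g * dw])   # area_q ~ genotype * DW
lam, p = lrt(X_null, X_full, area)
```

Because the models are nested and fitted to the same data, λ is always non-negative, and here the strong synthetic genotype:DW interaction yields a small p-value.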
Body and depot weight were measured with Sartorius BAL7000 scales. For cell area quantification, we applied DeepCytometer to 75 inguinal subcutaneous and 72 gonadal full histology slides (with one or two tissue slices each), including the 20 slides sampled for the hand traced data set. Each slide belongs to an animal and depot (gonadal or subcutaneous), stratified by sex (female, f, or male, m), heterozygous parent (father, PAT, or mother, MAT) and genotype (wild type, WT, or heterozygous, Het). For segmentation of each training slide, we used the pipeline instance that had not seen any part of that slide in training. Slides not used for training were randomly assigned to one of the 10 pipeline instances.
When fitting multiple linear models, e.g. 8 models for 2 sexes, 2 depots and 2 effect categories, the p-values of all slopes undergoing t-tests were corrected using the FDR 2-stage Benjamini-Krieger-Yekutieli method 57 for a significance level α=0.05.
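For reference, the underlying step-up idea can be sketched with the simpler one-stage Benjamini-Hochberg procedure; the two-stage Benjamini-Krieger-Yekutieli refinement used in the paper is available as `statsmodels.stats.multitest.multipletests(..., method='fdr_tsbky')`, and this NumPy sketch shows only the base procedure:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """One-stage Benjamini-Hochberg FDR step-up.

    Returns a boolean array marking rejected hypotheses.  The paper uses the
    two-stage Benjamini-Krieger-Yekutieli variant; this is the simpler base
    procedure it refines.
    """
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresh = alpha * np.arange(1, m + 1) / m      # i/m * alpha for sorted p
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])          # largest i with p_(i) <= i/m * alpha
        reject[order[: k + 1]] = True             # reject all smaller p-values too
    return reject
```

Note the step-up nature: every p-value below the largest qualifying sorted p-value is rejected, even if it individually exceeds its own threshold.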
Quantile colour map plots to assess cell size heterogeneity

Because of skewness and long tails in the histogram (Fig. COLORMAP_F(a) and Fig. COLORMAP_M(a)), overlaying a colour map proportional to cell area or radius on the tissue sample produces a low-contrast image that offers very little visual information about spatial patterns. In this section, we propose a colour map that provides a much clearer picture, by making the colour scale proportional to the ECDF of cell area.
We applied the pipeline to whole slides to produce the scatter map {(x_i, y_i, a_i)}, where (x_i, y_i) are the coordinates of the i-th white adipocyte centroid and a_i is the white adipocyte's area. We created a mask with all the pixels that belong to at least one white adipocyte. We used Delaunay triangulation and linear interpolation to rasterise the scatter map to pixel coordinates within the mask. Finally, area values were mapped to quantiles p_i, where the quantile function is the HD quantile estimate computed on the hand traced training data set. Because of the significant sex effect in cell area (Fig. SEX_EFFECT), we stratified the training data set into females and males and computed separate ECDFs and colour maps for each group. Areas smaller or larger than those in the hand traced data set were clipped to p_i = 0 or p_i = 1, respectively. The colour map was created in Hue/Saturation/Lightness (HSL) mode, with the colour determined by the quantile. To sum up, each cell area a_i was mapped to a colour through its quantile p_i.
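The area-to-quantile mapping and the Delaunay-based rasterisation can be sketched with SciPy's `LinearNDInterpolator`, which performs linear interpolation over a Delaunay triangulation; here the reference ECDF is simplified to a rank-based ECDF of the hand-traced areas rather than HD quantile estimates, and the function names are ours:

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

def area_to_quantile(areas, ref_areas):
    """Map each cell area to its quantile p in a reference (hand-traced)
    area sample; areas outside the reference range are clipped to 0 or 1."""
    ref = np.sort(np.asarray(ref_areas, dtype=float))
    p = np.searchsorted(ref, areas, side='right') / ref.size   # rank-based ECDF
    return np.clip(p, 0.0, 1.0)

def rasterise_quantiles(centroids, p, mask):
    """Linear (Delaunay-based) interpolation of per-cell quantiles to every
    pixel inside the white-adipocyte mask; pixels outside stay NaN."""
    interp = LinearNDInterpolator(centroids, p)    # centroids as (x, y)
    yy, xx = np.nonzero(mask)
    img = np.full(mask.shape, np.nan)
    img[yy, xx] = interp(np.column_stack([xx, yy]))
    return img
```

The resulting quantile image can then be fed through any colour map (e.g. an HSL one) to produce the heatmaps described in the text.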

Supplementary material
Relation between pixel-classification and EDT regression.
We note that the pixel-classification 47 and EDT 50,51,66 approaches reviewed in the Introduction are related, as pixel-classification can be seen as a simplification of signed EDT regression, obtained by applying the sign function, class(x) = sgn(d(x)), where d(x) is the signed EDT at pixel x. Thus, the EDT representation is richer, because it not only estimates whether a pixel is inside a cell, but also how far from the membrane it is. This could account for the better segmentation results found by Wang et al. 50 .

Effective receptive field (ERF) of CNNs
The ERF of each CNN and fold was computed and is shown in

Deep CNNs training methodology

Training for 10-fold cross validation
All CNNs were trained with the same 10-fold cross validation described in Table MICE: 20 mice randomly partitioned into 10 folds, each with 18 mice for training and 2 for validation.

Training with a combination of labelled and unlabelled pixels
As advanced in the Introduction, it is convenient to use training images with a combination of labelled ('White adipocyte', 'Background', 'Other tissue') and unlabelled pixels. Instead of becoming a new class ('Void'), unlabelled pixels should not contribute to the training process, as if they were not part of the training dataset at all. This functionality is not available in Keras 2.2. Thus, we implemented an extension that enables element-wise weighting of pixel-wise scores. Let S be the score matrix of size R×C, where each element s_ij is the contribution of an output pixel to the loss. Let W be a weighting matrix such that the loss is L = (1/N) Σ_ij w_ij s_ij, where N is the number of output pixels where w_ij ≠ 0. If w_ij ∈ {0, 1}, then L is the average score of the labelled pixels.
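The weighted loss is straightforward to express in NumPy; the actual implementation was a Keras 2.2 extension operating on tensors, but this sketch captures the arithmetic:

```python
import numpy as np

def masked_mean_loss(scores, weights):
    """Element-wise weighted loss L = (1/N) * sum(w_ij * s_ij), where N is
    the number of output pixels with w_ij != 0.  With binary weights this is
    the average score over labelled pixels; unlabelled ('Void') pixels with
    w_ij = 0 contribute nothing to training."""
    w = np.asarray(weights, dtype=float)
    n = np.count_nonzero(w)                 # N: pixels that actually count
    return float(np.sum(w * scores) / n)
```

Note the normalisation by the number of non-zero weights, not by the image size, so that images with few labelled pixels are not artificially down-weighted.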

EDT CNN and Contour CNN training dataset
We used the 55 histology images with 2,117 ground truth white adipocyte (WA) hand traced contours from Table MICE. The corresponding SVG files containing the description of the hand traced WA contours (as described in Ground truth hand traced dataset for CNN training) were read. Each contour was rasterised as a closed polygon. Pixels that belonged to a single polygon were labelled as seeds. Then a watershed algorithm expanded the seeds over areas where polygons overlap. This effectively found a compromise boundary between overlapping cells. Boundaries were computed as pixels between two labels or between a label and background. Ground truth EDTs were computed with respect to those boundaries (Fig. DMAP(c)). Boundaries were then dilated with a 3×3 kernel because we found that the Contour CNN training was unsatisfactory on 1-pixel thick boundaries. Masks of labelled pixels were computed for the loss function from the union of all polygons, and then were dilated with a 3×3 kernel. The histology windows, EDTs, boundaries and masks dataset was then 10× augmented with random rotations up to ±90º, random scaling, and horizontal and vertical flips. The augmented images were split into 4 blocks due to GPU memory restrictions. Blocks where the mask had fewer than 1,900 pixels were discarded.
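Given a label image (the output of the watershed step), the EDT and dilated-contour targets can be sketched with SciPy. This simplification marks as boundary any pixel whose 4-neighbour carries a different label (including label/background transitions) and omits the polygon rasterisation and watershed steps; `edt_ground_truth` is our own helper name:

```python
import numpy as np
from scipy.ndimage import binary_dilation, distance_transform_edt

def edt_ground_truth(labels):
    """Ground-truth EDT with respect to cell boundaries, plus the dilated
    boundary target for the Contour CNN.  A boundary pixel is any pixel with
    a 4-neighbour of a different label (or background).  np.roll wraps at
    image edges, which is fine for this toy sketch."""
    boundary = np.zeros(labels.shape, dtype=bool)
    for shift, axis in [(1, 0), (-1, 0), (1, 1), (-1, 1)]:
        boundary |= labels != np.roll(labels, shift, axis=axis)
    edt = distance_transform_edt(~boundary)            # distance to boundary
    # dilate 1-pixel boundaries with a 3x3 kernel, as in the text
    contour_target = binary_dilation(boundary, np.ones((3, 3), dtype=bool))
    return edt, contour_target
```

The EDT is zero exactly on boundary pixels and grows towards cell centres, which is the regression target described above.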

EDT CNN training
We trained the 10-fold CNNs using the augmented histology windows (input), ground truth EDTs (output), masks of labelled pixels (loss function masks), Adadelta optimisation 77 with He uniform variance scaling initialisation 78 , mean absolute error (MAE) loss, MAE and mean squared error (MSE) metrics for validation with the left-out data, a batch size of 10 and 350 epochs until the loss and metrics converged.

Contour CNN training
We trained the 10-fold CNNs using the EDTs estimated by the previous CNN (input), ground truth dilated boundaries (output), masks of labelled pixels (loss function masks), Adadelta optimisation 77 with He uniform variance scaling initialisation 78 , binary cross entropy loss, accuracy metric for validation with the left-out data, a batch size of 10 and 500 epochs until the loss and metric converged.

Tissue CNN training dataset
We used all 126 histology images with hand segmentations in Table MICE, containing a total of 2,117 "white adipocyte" (WA) objects and 232 "other" (NWA) objects. All WA and NWA contours were read. Each contour was rasterised as a polygon. Pixels within each polygon were labelled as "1" for WA and background, and "0" for other types of tissue to create the classifier ground truth. Pixels in overlap areas between a WA and an NWA were considered NWA. The histology windows, classifier ground truth and masks dataset were 10× augmented with random rotations up to ±90º, random scaling, horizontal and vertical flips and random shear, with an output shape of 1,416×1,416 to avoid cropping out training pixels. The augmented images were split into 4 blocks due to GPU memory restrictions.

Tissue CNN training
We trained the 10-fold CNNs using the augmented histology windows (input), classifier ground truth (output), masks of labelled pixels (loss function masks), Adadelta optimisation 77 with He uniform variance scaling initialisation 78 , binary focal loss 79 , accuracy metric for validation with the left-out data, a batch size of 8 and 37 epochs until the loss and metric converged. We used a cyclical learning rate 80 with a triangular cycle that halves its amplitude each cycle between the initial learning rate and the upper boundary, and a number of training iterations per half cycle equal to 8× the training iterations per epoch.
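The triangular cycle with halving amplitude corresponds to the 'triangular2' policy of Smith 80 ; a minimal sketch follows, where the learning-rate values in the usage are placeholders, since the paper's exact values are not reproduced here:

```python
import numpy as np

def clr_triangular2(iteration, base_lr, max_lr, half_cycle):
    """Cyclical learning rate, triangular schedule whose amplitude is halved
    every cycle ('triangular2').  half_cycle is the step size in training
    iterations (here 8x the iterations per epoch, as in the text)."""
    cycle = np.floor(1 + iteration / (2 * half_cycle))
    x = np.abs(iteration / half_cycle - 2 * cycle + 1)   # position in cycle
    scale = 1.0 / (2.0 ** (cycle - 1))                   # halve each cycle
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x) * scale

# placeholder values: base/max learning rates are assumptions, not the paper's
lr_start = clr_triangular2(0, 0.001, 0.006, 100)      # at the base rate
lr_peak = clr_triangular2(100, 0.001, 0.006, 100)     # at the upper boundary
```

The peak of the second cycle reaches only half the original amplitude above the base rate, and so on for subsequent cycles.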

Correction CNN training dataset
We used the same 55 histology images, hand traced contours and rasterised polygons as for the EDT and Contour CNNs. Let A be the hand traced contour polygon area, and r = √(A/π) the radius of a circle with the same area. We generated a series of incorrect segmentations by eroding and dilating the ground truth segmentation using a square kernel with length proportional to r, scaled by a parameter t (t < 0 for erosion and t > 0 for dilation). Thus, we established a correspondence between a histology cropped image, the eroded/dilated segmentation mask M and the segmentation error E = M − M_GT, where M_GT is the rasterised hand traced contour polygon. Note that E is −1 for pixels that underestimate the segmentation, 0 for correctly segmented pixels and +1 for pixels that overestimate the segmentation. Next, we multiplied the histology by +1 within M and by −1 without. We also computed a mask for the loss function by dilating the segmentation. Finally, the resulting image, the segmentation error and the loss function mask were cropped and resized according to the corresponding bounding box, as described above in the Correction CNN architecture section.
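The synthetic error generation can be sketched with SciPy morphology; the kernel-size formula (side = |t|·r) is an assumption for illustration, as the paper's exact scaling was not reproduced here, and `perturb_segmentation` is our own helper name:

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def perturb_segmentation(mask, t):
    """Generate an incorrect segmentation by eroding (t < 0) or dilating
    (t > 0) the ground-truth mask with a square kernel whose side scales
    with the equivalent-circle radius r = sqrt(A / pi).

    Returns the perturbed mask and the error map: -1 for under-segmented
    pixels, 0 for correct pixels, +1 for over-segmented pixels."""
    r = np.sqrt(mask.sum() / np.pi)             # radius of equal-area circle
    k = max(1, int(round(abs(t) * r)))          # kernel side (assumption)
    kernel = np.ones((k, k), dtype=bool)
    bad = binary_erosion(mask, kernel) if t < 0 else binary_dilation(mask, kernel)
    error = bad.astype(int) - mask.astype(int)  # -1 / 0 / +1 error map
    return bad, error
```

Sweeping t over a range of negative and positive values yields the series of under- and over-segmentations used as Correction CNN training targets.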

Correction CNN training
We trained the 10-fold CNNs using the cropped and resized -1/+1 masked histology (input), segmentation error (output) and loss function mask, the same cyclical learning rate as in the Tissue CNN, Adadelta optimisation 77 with He uniform variance scaling initialisation 78 , mean squared error (MSE) loss, mean absolute error (MAE) and MSE metrics for validation with the left-out data, a batch size of 12 and 100 epochs until the loss and metrics converged.

DeepCytometer running times
We computed running times on a random sample of 95 automatically segmented whole slides. We measured tissue area as the area covered by the coarse tissue mask. An Ordinary Least Squares model (time ~ tissue area) showed a linear relationship with intercept β 0 = 297.1 ± 2,495.8 s and slope β 1 = 111.9 ± 10.6 s/mm². Tissue areas started at around 41 mm², with running times of (3.3, 5.5, 8.7) h. It should be noted that each slide contained two tissue slices. Segmentation times were calculated applying the Corrected method to the 60 training images for population studies. The Auto part of the pipeline took 43.9% ± 1.6% of the total time, and overlap correction took the other 56.1% ± 1.6%. Thus, overlap correction increased the Auto computation time by a factor of 2.28 ± 0.08.

Likelihood Ratio Test
To assess whether an independent variable X (parent, genotype) produces a significant effect on a dependent measure Y (body weight, depot weight) in a linear model, we use the Likelihood Ratio Test (LRT). With the LRT, we compare the fit of the data to a null model (without X) with the fit to an alternative model (with X). The LRT statistic is λ_LR = 2 (ln(L1) − ln(L0)), where ln(L0), ln(L1) are the log-likelihoods of the null and alternative models, respectively. The λ_LR statistic follows a χ² distribution with 1 degree of freedom. The test's null hypothesis (H0) is that the data is fully specified under the null model (λ_LR = 0). The alternative hypothesis (H1) is that the alternative model, with variable X, is significantly better (λ_LR > 0). The test produces a p-value, p = P(χ²_1 ≥ λ_LR), to reject H0 in favour of H1. Using the LRT is equivalent to using the Akaike Information Criterion (AIC). The AIC is a measure that combines the goodness of fit of the model and its parsimony, AIC = 2k − 2 ln(L), where k is the number of parameters of the model and ln(L) is the log-likelihood as above, so a lower AIC corresponds to a better model. Comparing the null model to the alternative model yields AIC_1 − AIC_0 = 2 − 2 (ln(L1) − ln(L0)), where by convention X produces a worthwhile improvement if AIC_1 − AIC_0 < −2 ⇒ ln(L1) − ln(L0) > 2. Note that under a χ² distribution with 1 degree of freedom, this corresponds to p = P(χ²_1 ≥ 4) = 0.046. Thus, the LRT with significance threshold α = 0.050 is just slightly more lenient than the AIC criterion AIC_1 − AIC_0 < −2.
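The p-value at the AIC threshold can be verified directly with the χ² survival function from SciPy; `lrt_pvalue` is a hypothetical helper name:

```python
from scipy.stats import chi2

def lrt_pvalue(ll_null, ll_full, dof=1):
    """p-value of the Likelihood Ratio Test, p = P(chi2_dof >= lambda_LR),
    with lambda_LR = 2 * (ln(L1) - ln(L0))."""
    lam = 2.0 * (ll_full - ll_null)
    return chi2.sf(lam, dof)

# AIC_1 - AIC_0 < -2 means ln(L1) - ln(L0) > 2, i.e. lambda_LR > 4;
# under chi2 with 1 dof that threshold corresponds to p ~= 0.046
p_at_aic_threshold = lrt_pvalue(0.0, 2.0)
```

This confirms numerically that the α = 0.05 LRT threshold is marginally more lenient than the AIC_1 − AIC_0 < −2 convention.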