A Bayesian Nonparametric Approach to Discover Clinico-Genetic Associations across Cancer Types

Motivation Personalized medicine aims at combining genetic, clinical, and environmental data to improve medical diagnosis and disease treatment, tailored to each patient. This paper presents a Bayesian nonparametric (BNP) approach to identify genetic associations with clinical/environmental features in cancer. We propose an unsupervised approach to generate data-driven hypotheses and bring potentially novel insights about cancer biology. Our model combines somatic mutation information at gene-level with features extracted from the Electronic Health Record. We propose a hierarchical approach, the hierarchical Poisson factor analysis (H-PFA) model, to share information across patients having different types of cancer. To discover statistically significant associations, we combine Bayesian modeling with bootstrapping techniques and correct for multiple hypothesis testing. Results Using our approach, we empirically demonstrate that we can recover well-known associations in cancer literature. We compare the results of H-PFA with two other classical methods in the field: case-control (CC) setups, and linear mixed models (LMMs).


Introduction
Cancer encompasses not one, but a large group of genetic diseases involving abnormal cell growth with the potential to invade or spread to other parts of the body. Although a small set of universal underlying principles were identified, the so-called "hallmarks" of cancer [19,20], each type of cancer presents unique properties, making this disease very hard to treat [37,22,43].
Genetic association studies have been successful in relating somatic mutation to carcinogenesis, but has been limited to the detection of common large-effect variants in the presence of only small cohorts [9,27,8]. Finding somatic driver mutations is even more challenging since these mutations are often rare. The large phenotypic heterogeneity, which reduces statistical power in the discovery method, causes some associations to remain hidden [38,29,40]. Cohort sizes tend to be small, especially in rare cancers, which makes the discovery of small effect size associations difficult [3]. Additionally, carcinogenesis is driven by the accumulation of mutations that may act epistatically or pleiotropically during the disease, further reducing the power of typical approaches [48,11,42,12]. To overcome these difficulties, new approaches for interpreting genetic variation across different cancer types are required.
In recent years, efforts to mine electronic health records (EHRs) show promise to impact nearly every aspect of healthcare [26]. The adoption of EHRs in hospitals has increased dramatically and has become a powerful resource for phenotyping [2,32], with the potential for establishing new patient-stratification principles and for revealing unknown disease correlations. Integrating EHRs data with genetic data will also give a better understanding of genotypephenotype relationships [49]. EHRs consist of both structured and unstructured information. Structured data is a valuable source of information that includes billing codes, laboratory reports, physiological measurements, and demographic information, among others. Yet, most of the clinical data comes as unstructured notes, e.g., around 98% of the EHRs [26]. These include a broad spectrum of clinically-relevant information which might be useful to identify novel phenotypic relationships so far unknown by the clinicians [32,26].
This work presents a joint generative model to discover associations between somatic mutations and clinical features in cancer that deals with phenotype heterogeneity, small cohort size, epistasis and pleiotropy in a straightforward way. Our method infers latent topics from the clinical text and genetic information, capturing complex interactions between groups of genes and clinical features. It is directly inspired by the Poisson factorization model for recommendation systems [16], with three important differences.
First, we introduce confounding effects as conditional variables, i.e., variables that might cause spurious associations to appear. In particular, our model considers multiple types of cancer together (the type of cancer is treated as a confounder) and shares information among all patients in a hierarchical fashion. Indeed, most cancers are known to share common pathogenesis despite specificities of the cell type and tissue origin [43]. By doing so, specific effects for each type of cancer can be isolated, and additional (less well-known) associations with somatic mutations of smaller effect size can be obtained.
Second, we force sparsity on the textual and genetic topics by using shape parameters smaller than one in the Gamma distribution priors. Sparsity is crucial to find meaningful, easy-tointerpret associations; those can be validated either through previous studies by looking in the literature, or subsequent tests in the lab.
Third, we present a nonparametric alternative model to [16] by replacing the continuous patient weights with a binary matrix whose probability distribution is induced by a hierarchical extension of the Indian buffet process (IBP). Also in the literature, the authors in [17] propose a nonparametric Poisson factorization model, but they rely on a stick-breaking construction different from the IBP, and the weights are continuous, which renders interpretability of the latent variables more tedious. The discrete nature of the IBP helps in terms of interpretability and allows combining the proposed Bayesian model with classical frequentist approaches for statistical testing between the inferred patient partitions. An efficient Markov chain Monte Carlo (MCMC) procedure based on a slice sampler for the hierarchical IBP is presented.
Bayesian modeling has already been proven useful for epistasis [53], pleiotropy [53,52] or sub-phenotyping applications [35,28]. To our knowledge, the proposed model is the first one to deal with clinical text data and genetic information jointly in a Bayesian nonparametric way, capturing phenotypic heterogeneity, epistasis and pleiotropy in a straightforward way while correcting for the cancer type as confounder. We consider multiple cancers jointly in order to increase statistical power, allow for the analysis of rare cancers, and identify fundamental mechanisms shared across different types of cancer.

Study design and setting
The study was designed as a retrospective cohort study for the development and analysis of techniques to analyze clinical narratives in the context of somatic mutations. It was performed at Memorial Sloan Kettering Cancer Center (MSKCC). The institutional review board of MSKCC provided a Waiver of Authorization (WOA; WA0426-13) for this study. Clinical notes were provided by the IT services group at MSKCC. The Center for Molecular Oncology at MSKCC provided the information about somatic mutations from the MSK Impact panels of patient tumors. We included 1,946 patients for which we had MSK Impact panel data and at least one clinical narrative available at the time of delivery. Data analyses were performed on the HPC compute systems at MSKCC. Additional statistical analyses were performed at University Carlos III and ETH Zürich.

Database Description
So far, genomic testing of tumors has been done routinely only for certain solid cancer tumors, such as melanoma, lung, or colon cancer. For most cancers, the available tests have been limited to analyzing one or a handful of genes at a time, and within each gene, only the most common mutations could be detected. 1 A new targeted tumor sequencing test called MSK-IMPACT (Integrated Mutation Profiling of Actionable Cancer Targets) is able to detect somatic mutations and other critical somatic aberrations in both rare and common cancers [10,51].
Using the MSK-IMPACT panels [51], somatic mutations regarding specific screened genes can be obtained as follows. For each patient, tumor cells are compared with healthy cells of that same patient, extracted from the blood stream, as illustrated in Figure 1. In this work, a gene is said to be mutated when there exists at least one difference in the sequence between the tumor cells and healthy cells for that particular gene. 2 We finally obtain a binary matrix for N = 1946 patients and G = 410 genes where "1" encodes for a mutated gene and "0" otherwise. The screened genes have been shown to play a role in the development or behavior of tumors, although their individual relation to specific phenotypes remains obscure [10,51].
Concerning the clinical information, based on all EHRs, we build a bag-of-word representation of unified medical language system (UMLS) terms, extracted using the Metamap 3 processing tool [4]. The UMLS refers to a standardized, comprehensive thesaurus and ontology for biomedical concepts, whose objective is to provide facilities for natural signal processing tasks [7]. Since each patient can have a varying number of records, we group all clinical history into a single EHR, and only consider the appearance or absence of each UMLS term, i.e., binarized clinical features. We compute the tf-idf score for each UMLS term, and only keep the 300 clinical terms with highest score.
The database includes clinical and genetic information for N = 1946 patients and 5 different cancer types: bladder cancer, breast carcinoma, colorectal cancer, non-small cell lung cancer, and prostate cancer. We consider genes and UMLS terms that are present in at least 1% of the patient population, resulting in D = 249 dimensions, including 72 genes and 177 clinical terms. Even if the dataset is binary, we can use a Poisson likelihood because of the high sparsity degree of such matrix (7.28% of non-zero values). In such scenario, the Poisson distribution is a good approximation of a Bernoulli, and we adopt it by mathematical convenience in the inference process [14,46]. In the following, we use this data for: a) discovering latent factors and, b) testing for significant features (either genetic or phenotypical) associated to each factor.

Classical Approach: Case-Control Setup
The most common approach for genetic association studies is the case-control (CC) setup, which compares two large groups of individuals, one case group presenting a particular phenotype, and one control group without such phenotype.All individuals in each group are genotyped to identify somatic mutations in a panel of genes. For each of these genes, it is then investigated if the somatic mutation is significantly associated with the phenotype of interest. In such setups, the fundamental unit for reporting effect sizes is the odds ratio. The odds ratio in this case refers to the odds of exhibiting the phenotype for individuals having a specific somatic mutation and the odds of exhibiting the phenotype for individuals who do not present such somatic mutation. A p-value for the significance of the odds ratio is typically computed using a simple χ-squared test or Fisher test. Finding odds ratios that are significantly different from one is the objective of an association study because this shows that there is statistical evidence that the somatic mutation is associated with the phenotype.

Confounder Correcting Approach: Linear Mixed Model
Linear mixed models (LMMs) have proved particularly useful for genetic association studies due to its capacity to account for confounding effects and limit the number of false associations [31,30]. Let X be the observation matrix, where each element x ng corresponds to an indicator variable for a particular patient n ∈ {1, . . . , N } and somatic mutation in gene g ∈ {1, . . . , G}, and x g ∈ {0, 1} N ×1 is the indicator vector for gene g across all patients. The binary variable x ng indicates whether any somatic mutation occurred in the corresponding gene. Let y nq be the binary indicator variable of the presence of a certain clinical feature q ∈ {1, . . . , Q} for a given patient n, and y q ∈ R N ×1 the indicator vector for clinical feature q across all patients. Finally, let us define c nl as the binary indicator variable of patient n to the cancer type ∈ {1, . . . , L}, and c n ∈ {0, 1} 1×L the cancer type assignment vector, where c n = 1 (for simplicity, we only consider patients having one single type of cancer). For each pair of gene g and clinical feature q, a LMM can be defined as follows: where β qg ∈ R refer to the fixed effect of feature q and gene g, and u qg , ε qg ∈ R N ×1 are the random effects (structured noise and observational noise, respectively). The prior assumptions for the structured and uniform noises are u qg ∼ N (0, σ 2 u K) and ε qg ∼ N (0, σ 2 e I), where K refers to a similarity matrix between the patients, for instance, the cosine similarity of the cancer type assignment vectors c i and c j , K = CC T . The LMM assumes that the output y q is Gaussian-distributed: When the data is binary or count data, a common practice is to apply a standard rank-based inverse normal transformation beforehand as a preprocessing step [24], although this has become controversial more recently [5]. LMMs are discriminative since they try to model the conditional probability p(y q |x g ). In this paper, we propose an alternative generative approach that models the joint distribution p(y q , x g ) and captures complex correlations via latent factors.

Hierarchical Bayesian Nonparametric Approach
Let X ∈ N N ×D be the observation matrix of count data for N patients and D dimensions, where D includes both clinical and genetic information, i.e., D = G + Q, where G is the number of genes and Q is the number of clinical terms. In the following, we propose two Poisson factor analysis (PFA) approaches to model the joint observation matrix X of genetic information and clinical data. In these models, patients will be represented by binary feature activation vectors, and each of these features will capture common correlation patterns among the somatic mutations and clinical term occurrences.

Poisson Factor Analysis (PFA)
We first consider a nonparametric non-negative matrix factorization model with Poisson likelihood and Gamma-distributed factors: where α is the concentration parameter of the IBP controlling the a priori number of ones in matrix Z (i.e., the a priori expected number of latent features), and µ A , α A are the prior mean and shape parameter for each element of matrix A. Sparsity can be induced easily in the factors by choosing α A 1. Inference is performed using an MCMC approach based on a semi-ordered stick-breaking representation of the IBP prior [44].

Hierarchical Poisson Factor Analysis (H-PFA)
Although different types of cancer are known to share similar phenotypes and underlying mechanisms (shared activation of certain pathways), the mutation rate and phenotype occurrence might vary in different proportions, according to each type of cancer. Given this premise, we propose a hierarchical Bernoulli process Poisson factor analysis model to allow for different feature activation levels depending on each type of cancer. In the following, we will shorten the name of this model to hierarchical Poisson factor analysis (H-PFA).
Let r n ∈ [1, . . . , L] be a categorical variable indicating the type of cancer of patient n among the total number of cancer types L (r n corresponds to the index of the non-zero value in vector c n defined in Section 2.4). A hierarchical construction can be formulated based on the finite representation of the IBP and letting K → ∞, such that different levels of feature activation are allowed for each type of cancer. Let ρ k be the global activation probability of feature k, and π k be the specific activation probability of feature k for cancer type ∈ [1, . . . , L]. We can then assume that each specific activation probability is Beta-distributed such that E [π k ] = ρ k : where the feature activation variables in vector Z n are drawn from different activation probability vectors {π 1 , . . . , π L } depending on the type of cancer r n of patient n. When K → ∞, this prior over Z is equivalent to a hierarchical Beta process (BP) construction [45] on top of Bernoulli processes (BePs) in the De Finetti representation. In the same way that a hierarchical Dirichlet process (HDP) allows for atom sharing with varying weights across different groups of data, the H-PFA allows for feature sharing with different activation weights across different types of cancer.

Statistical Methodology
Once the model has been trained (samples from an approximate posterior distribution can be drawn), we proceed with a classical frequentist approach 4 to identify statistically significant clinico-genetic associations across cancer types. First, we take M posterior samples from the posterior distribution of Z given A fixed. For each sample, patients that have the same feature assignment vector (activation pattern of features) can be grouped together in the same subpopulation. For instance, subpopulation (1001) refers to all patients having the first and fourth feature active. Let P refer to the total number of inferred subpopulations across the M posterior samples. By considering multiple posterior samples, we obtain slightly different partitions of patients in subpopulations. This can be seen as performing soft-clustering of patients, i.e., patients that are in-between subgroups might be assigned to different subpopulations in different posterior samples. Thus, the method is more robust against model inaccuracies at clustering patients. This is an important benefit of the Bayesian framework. Next, to make our method robust against outliers (e.g., patients with rare features), we perform bootstrapping B times for each subpopulation and posterior sample. Bootstrapping relies on random sampling with replacement. It is a technique used for computing robust estimators against outliers by sampling from an approximating distribution, which is particularly useful for hypothesis testing when the model assumptions are in doubt or unknown [50]. The standard bootstrapping approach relies on the construction of an estimator for hypothesis testing based on a number B of resamples with replacement of the observed dataset (and of equal size to the observed dataset), i.e., sampling with replacement from the empirical distribution of the observed data.
Finally, given M posterior samples and B bootstrapping instances for each sample, we end up with M B different subpopulation instances. Measures of effect size (quantitative measure of the difference between two subpopulations) and statistical significance can be computed for each instance and then averaged across them, so that partition inaccuracies and outlier effects are mitigated. To identify statistically significant dimensions for each latent feature k = 1, . . . , K in sample m and bootstrap b, we split the whole patient population in two subgroups, S k (m, b) and S −k (m, b), corresponding to patients whose latent feature k is active or inactive respectively, and perform two-sample statistical tests for each dimension d.

Effect size
For each latent feature k and dimension d (either clinical or genetic), we compute the effect size ∆ kd as: where where µ d (G) is the mean value of variable d for a given subpopulation G.

Statistical significance
To measure how significant an effect size δ kd (m, b) is, for each posterior sample m and bootstrap instance b, we compute a statistical significance value υ kd (m, b) as the p-value resulting from a Fisher test, which is a standard test for discrete variables [50]. We define the K × D matrix of statistical significance Υ, for each latent feature k and input dimension d as the median p-value across the M samples and B bootstrapping instances: where υ kd denote the M × B matrix of statistical significance values υ kd (m, b) for each posterior sample m and bootstrapping instance b. Finally, we follow the Benjamini Hochberg procedure for multiple hypothesis testing to adjust the statistical significance threshold α s such that a certain false discovery rate (FDR) is guaranteed [6]. An input dimension d (either clinical or genetic) is said to be statistically significant for latent feature k if its significance value Υ kd (the median p-value across posterior samples and bootstrapping instances) is smaller than the adjusted threshold, i.e., Υ kd < α s . The whole procedure is summarized in Algorithm 1.
Algorithm 1 Statistical approach for discovery of clinico-genetic associations (post-processing procedure).
Require: M posterior samples from Z, where P is the number of subpopulations, and K is the number of inferred latent features. bootstrap for each subpopulation B times 3: end for 4: for k = 1, . . . , K do 5: compute effect size according to Eq. 8 and 9.

6:
compute statistical significance (p-value) according to the Fisher test adjusting for multiple hypothesis testing [6]. 7: end for Ensure: effect size matrix ∆ and significance matrix Υ, both of dimensions K × D

Experimental Setup
We compare the proposed H-PFA approach with a LMM and a standard case-control set-up for each potential clinico-genetic association. The model parameters for each LMM are found by maximizing the log likelihood using standard optimization techniques within a python platform called LIMIX [30]. In the final step, we obtain p-values for each pair (y q , x g ) using likelihood ratio tests. Regarding the case-control analysis, for each clinical term we consider a case and control group corresponding to the patients having that clinical term active or inactive respectively. Given such a partition, we perform an individual Fisher test for each gene. For all methods, we correct for multiple hypothesis testing based on the Benjamini-Hochberg approach [6]. To quantify the statistical significance of the features discovered by the H-PFA model, we follow the statistical procedure described in Section 2.6. To increase interpretability of the H-PFA model, we force one of the latent features to be active for all patients. This is a common practice in BNP models to capture mean effects [41,47,46], which in this case corresponds to phenotypical attributes common to all types of cancer. Finally, we have set the hyperparameters of the proposed H-PFA as α A = 0.01 and µ A = 1, while we infer the values for the concentration parameter α. Figure 2 represents the number of associations found by each method, and how many overlap across techniques. LMM found 14 clinico-genetic associations, CC found 178, and H-PFA found 95. 5 LMM finds the least number of associations, since it corrects for the cancer type as a confounder effect, and only gets less well-known associations that are present across all types of cancer. CC discovered the highest number of associations, from which only 30% are shared with H-PFA. Out of the 95 associations discovered by the H-PFA approach, 63% were also present in any of the other methods. Figure 3 lists the associations that are shared across methods. Tables 1 and 2 present the list of clinico-genetic associations found by LMM and CC methods (for CC, we only report a random selection of associations, but the complete list can be found in the Appendix).

Identification of Clinico-Genetic Associations
Next, Table 3 shows the list of inferred latent features by the H-PFA model. The bias term F0 reflects the high rate mutation of the TP53 gene which occur across all types of cancer. The TP53 gene is essential for the production of a protein called tumor protein p53. This protein acts as a tumor suppressor, which means that it regulates cell division by keeping cells from growing and dividing too fast or in an uncontrolled way. Because p53 is essential for regulating cell division and preventing tumor formation, it has been nicknamed the "guardian of the genome" [34]. On top of the bias term F0, H-PFA inferred 19 other latent features. Features F3, F5, and F17 capture complex phenotypes (no somatic mutations involved), whereas F4 and F18 mostly capture somatic mutations. Interestingly, F18 relates Esophagogastroduodenoscopy (a test to examine the lining of the esophagus, stomach, and the beginning of the small intestine) to multiple somatic mutations, which was already revealed by LMM in Table 2. The remaining 14 features capture co-ocurrence of somatic mutations and clinical UMLS terms. Some latent features reflect well known relationships in oncology research. To name a few, mutations of gene PIK3CA (captured by F1 and F16) are present in over one-third of breast cancers; such mutations are nowadays known to be oncogenic and also implicated in cervical cancers [18]. Somatic mutations in the triad APC-KRAS-TP53 genes (captured by F0 and F6 together) are prominent in colon cancer [1]. Finally, previous studies have found direct physiological and molecular evidence for a role of gene FOXA1 in controlling cell proliferation in prostate cancer [23], which is accounted for in factor F12. Figure 4 depicts the cancer-specific activation weights π l for each type of cancer , as de-       and significance. Our method is able to provide concise grouping of both clinical terms and somatic mutations. Among the clinical terms, we find both phenotypical terms, as well as names of chemotherapy medications (Adriamycin, Irinotecan, or Leucovorin). Table 4 shows cancer-specific clinico-genetic associations. We recover well-known associations (such as APC gene mutation being prominant in colorectal cancer, or STK11 to lung carcinoma), but other associations are more surprising, such as GATA3 gene with bone mineral density. Finally, H-PFA found several statistically significant sets of associations involving somatic mutations in gene TERT, as shown in Table 5. Somatic mutations in the gene promoter of telomerase reverse transcriptase (TERT) have been found in 70-79% of bladder tumors in a multi-institutional study published in European Urology [36]. Table 5 shows that TERT muta-      Table 5: Clinico-genetic associations found by the H-PFA (2/2). All these associations involve gene TERT. One same set (depicted in bold) appears in all associations.
tions are associated to not only malignant neoplasm of urinary bladder (which is not surprising), but also hematuria and hydronephrosis. Hematuria refers to the presence of red blood cells in the urine. Also, hydronephrosis is a condition that typically occurs when the kidney swells due to the failure of normal drainage of urine from the kidney to the bladder. Hydronephrosis is not a primary disease, but results from some other underlying disease (cancer in this case) as the result of a blockage or obstruction in the urinary tract. H-PFA points out to interesting gene relationships (KDM6A, CREBBP, and ARID1A genes) with TERT, which have been partially studied in the literature [33,39,25].

Conclusion
This paper proposes a novel Bayesian nonparametric approach for discovering clinico-genetic associations between somatic mutations and EHR-based clinical features. We present a hierarchical Bernoulli process Poisson factor analysis model based on a hierarchical construction of Beta processes and Bernoulli processes. Our approach is not specific to cancer data nor Electronic Health Records, but can be broadly used to discover associations between arbitrary count data features. Compared to other approaches, our model delivers group-associations instead of pairwise ones, accounting for epistatic and pleiotropical effects straightforwardly. The delivered associations are statistically significant after correction for multiple hypothesis testing combined with a bootstrapping procedure, to better account for false positives. These associations give potentially interesting insights for future research in oncology. Under the proposed model, we hopefully open the door to find new associations that give rise to hypotheses, and if those are validated, then we may get new insights about cancer biology. Ultimately, studies like this one have the potential to lead us towards more accurate diagnosis, and inform us about actionable pathways when considering cancer therapy, where interventions through drug administration can be designed.     Table 9: Additional clinico-genetic associations found by the H-PFA involving EGFR gene. This group of associations was found in a subgroup of 100% non-small cell lung cancer patients.

Appendix: Inference details for Poisson Factor Analysis
Poisson factorization models have been successfully applied for recommendation systems [15], topic modeling [14], and analysis of Electronic Health Records among others [21]. Let X ∈ N N ×D be a sparse matrix of count-data observations with N samples and D dimensions. The generative model for the Bernoulli process Poisson factor analysis (PFA) is given by: where Z is a N × K matrix of binary weights, and A is a K × D matrix of non-negative hidden factors. Direct inference in such models is intractable, but we can easily solve the problem using MCMC techniques. For each observation x nd , we introduce the auxiliary variables x nd,1 , . . . , x nd,K such that x nd = K k=1 x nd,k , and x nd,k ∼ Poisson(θ nk A kd ) for k = 1, . . . , K. Each Poisson count is separated in a sum of Poisson contributions corresponding to each latent factor. Given such auxiliary variables, the model is conditionally conjugate, and a Gibbs sampler can be derived straightforwardly. In particular, we use the following theorem: Using Theorem 1, x nd,• can be sampled from a Multinomial given x nd , θ n• and A •d . In the following, we propose two MCMC algorithms: a collapsed Gibbs sampler where matrix A is marginalized out using a Laplace approximation, and an uncollapsed slice sampler version which allows for parallel sampling of both the elements in Z and A given the auxiliary variables x nd,1 , . . . , x nd,K . The results shown in this work have been generated using the uncollapsed Gibbs Sampler.

Collapsed Gibbs Sampler
We first propose a collapsed Gibbs sampler where matrix A is marginalized out, and we only need to sample the elements of matrix Z. We need to compute its posterior distribution: p(z nk |X, Z ¬nk ) ∝ p(z nk |Z ¬nk )p(X|Z) In order to approximate the integral in (16), we resort to a Laplace approximation, which assumes that: e ψ(A •d ) dA •d (17) has a peak at a certain value of A MAP In order to find the maximum value A MAP •d , we can use either Newton's method or gradient descent. Where applicable, Newton's method might converge faster towards a local maximum or minimum than gradient descent. Newton's method is an iterative method for optimization where each value A (t) •d at iteration t is computed as: where γ ∈ (0, 1] is the step-size of the algorithm. Note that for the Laplace approximation to work properly, −∇∇ψ(A •d ) should be a positive semi-definite matrix. This is guaranteed only if α B > 1, so the collapsed Gibbs sampler will only work for shape parameters bigger than one, resulting in non-sparse A matrices.

Uncollapsed Gibbs Sampler
Inference for the PFA model can be performed using an uncollapsed Gibbs sampler together with a slice sampler for semi-ordered stick-breaking representation of the IBP [44]. For the sake of completeness, the slice sampling procedure for matrix Z is described in Algorithm 2.
Using the auxiliary random variables described at the beginning of this Appendix, the complete conditionals can be easily derived as follows: x nd,k , d + p(x nd,• |x nd ,