Abstract
Teratogenicity poses severe threats to patient safety. Stem-cell-based in vitro systems are promising tools to predict human teratogenicity. However, current in vitro assays are limited because they either capture effects on a certain germ layer, or focus on a subset of predictive markers. Here we report the characterization and critical assessment of TeraTox, a newly developed multi-lineage differentiation assay using 3D human induced pluripotent stem cells. TeraTox probes stem-cell derived embryoid bodies with two endpoints, one quantifying cytotoxicity and the other inferring the teratogenicity potential with gene expression as a molecular phenotypic readout. To derive teratogenicity potentials from gene expression profiles, we applied both unsupervised machine-learning tools including factor analysis and supervised tools including classification and regression. To identify the best predictive model for the teratogenicity potential that is explainable, we systematically tested 64 machine-learning model architectures and identified the optimal model, which uses expression of 77 representative germ-layer genes, summarized by 10 latent germ-layer factors, as input for random-forest regression. We combined measured cytotoxicity and inferred teratogenicity potential to predict concentration-dependent teratogenicity profiles of 33 approved pharmaceuticals and 12 proprietary drug candidates with known in vivo data. Compared with the mouse embryonic stem cell test, which has been in routine use for more than a decade, the TeraTox assay shows higher sensitivity, particularly towards teratogens impairing ectodermal development or stem-cell renewal, and a more balanced prediction performance. We envision that further refinement and development of TeraTox has the potential to reduce and replace animal research in drug discovery and to improve preclinical assessment of teratogenicity.
1. Introduction
Teratogenicity, the ability of a chemical to cause defects in a developing fetus, has gained wide and continuous attention since the thalidomide tragedy in the 1960s (1). To assess the teratogenic potential of drug candidates, pharmaceutical companies must perform embryo-fetal-development studies (EFD studies hereafter) in at least one rodent and one non-rodent species (2, 3). There is an urgent need to develop alternative, humanized in vitro assays for early assessment of teratogenicity, because they can potentially better mimic human physiology, reduce animal use in drug discovery, and lower the attrition rate of drug development by filtering out potential teratogens early (3–5).
The current industrial standard in vitro assay for teratogenicity assessment is the mouse embryonic stem cell test (mEST). It measures both the differentiation of embryoid bodies (EB) derived from D3 mouse embryonic stem cells (mESC) by quantifying beating cardiac tissue, and the cytotoxicity in both mouse D3 ESCs as well as mouse 3T3 fibroblasts (6–8).
The mEST assay offers several advantages compared with other assays, including the zebrafish model (9, 10) and other stem-cell-based in vitro models (11–18). It uses two well-characterized, stable cell lines as the biological model that recapitulates early embryogenesis and no animal experiments are required. The cells are easy to acquire and handle. The protocol is well established and the assay is widely adopted. Importantly, the assay is validated by the European Centre for the Validation of Alternative Methods (7,8,19–21), and therefore trusted by many laboratories.
However, the mEST assay has both conceptual and practical limitations as a predictive model of human teratogenicity. First, it uses murine cells, which fail to recapitulate human teratogenicity for some chemical classes, for instance phthalimide-based molecules including thalidomide (22). Second, because the stem cells are differentiated into cardiomyocytes, the assay preferentially quantifies impairment of mesodermal germ-layer development. Third, the EB differentiation is a lengthy process of ten days and the manual counting of beating cardiomyocytes is both time-consuming and error-prone, which limits the throughput of the assay. Finally, and critically, the predictive algorithm relies on ID50, the concentration at which half of the maximal inhibition of differentiation is achieved. For strongly cytotoxic compounds, IC50, the concentration at which half of the maximal cytotoxicity is observed, commonly coincides closely with ID50, which causes false-negative predictions. Since the assay is used to pre-select developmental compounds prior to regulatory EFD studies, misclassifications lead to avoidable animal use in EFD studies and, if the teratogenicity is specific to humans, pose severe threats to patient safety.
Given the limitations of the mEST assay, we developed a new, humanized in vitro teratogenicity assay. The new assay, which we call TeraTox, uses ethically non-restricted human induced pluripotent stem cells (hiPSC). The cells form three-dimensional embryoid bodies (EBs) and differentiate spontaneously into all three germ layers – ectoderm, mesoderm, and endoderm – with expression of representative developmental markers of each layer. We previously documented the technical details of the assay and demonstrated its feasibility with four teratogens and four non-teratogens (23). However, a systematic assessment of its performance using a larger compound set has not been conducted yet and the prediction algorithm is missing.
To fill these gaps, here we describe the optimization and critical assessment of the TeraTox assay and the setup of a predictive model for human teratogenicity evaluation. We compiled a panel of 45 drug-like molecules with known teratogenicity profiles and tested them in six-point concentration response, generating the largest dataset published so far in a single study of in vitro modelling of teratogenicity with reference to clinical/animal in vivo data. Because both the cell amount and the workload required by digital PCR would be prohibitive, we adapted Molecular Phenotyping, a technology based on amplicon-based RNA sequencing, to quantify expression of germ-layer genes. Using gene expression data as input, we built machine-learning models with varying architectures and identified the best-performing model using factor analysis and random-forest regression. Using a leave-one-out training-testing strategy, we classified the 45 compounds as either teratogenic or non-teratogenic, thereby considering both concentration-dependent cytotoxicity and teratogenicity potential. We found that TeraTox features a lower specificity but outperforms mEST with regard to sensitivity and balanced prediction considering precision and sensitivity. Finally, we augmented the model with biological and pharmacological interpretations as well as simulation studies that explain how it works. In summary, our assessment highlights both the advantage of TeraTox over the standard mEST assay for preclinical teratogenicity assessment and directions of its future development.
2. Material and Methods
2.1. Human iPSC derived TeraTox Assay
The TeraTox assay is built upon commercially available human induced pluripotent stem cells (hiPSC, Gibco, A18945) with indistinguishable gene expression profiles compared with embryonic stem cells (16, 24). The cells form 3D EBs and undergo multi-lineage differentiation into all three germ layers (23). Prior to the assay, the hiPSC were tested with the TaqMan ScoreCard assay (Thermo Fisher) to confirm sufficient levels of pluripotency (25). The EBs were spontaneously differentiated and treated with several reference substances over a time course of seven days in Elplasia 96w micro-well plates (Corning, 4442) using the ViaFlo 96 automated microplate pipetting device (Integra) for liquid handling. Compounds were applied to the EBs on day 0, day 3 and day 5 at six concentrations, together with EB medium and 0.25% DMSO solvent controls as the negative reference. Cell viability was determined on day 7 by measuring ATP release in supernatants with the CellTiter-Glo 3D assay (Promega, G9681) to pre-specify appropriate testing ranges. All cell culture media and reagents were obtained from Gibco (Thermo Fisher) unless otherwise specified. The overall cell culture and cytotoxicity protocol was previously described in detail by Jaklin et al., 2020 (23).
Targeted gene expression profiling was performed with the molecular phenotyping platform that we developed previously (26–28). In total, 1,055 samples of differentiated EBs were lysed after 7 days of differentiation in 350 μl MagNA Pure LC RNA Buffer (Roche Diagnostics) and purified by using the automated MagNA Pure 96 system (Roche Diagnostics). The total RNA was quantified using the Qubit RNA Assay Kit (Thermo Fisher) on the Fluorometer Glomax (Promega). Total RNA with a maximum of 10 ng from each biological replicate was reverse transcribed to cDNA using Superscript IV Vilo (Thermo Fisher). Libraries were generated with the AmpliSeq Library Plus Kit (Illumina) according to the reference guide. Pipetting steps for target amplification, primer digestion, and adapter ligation were done with the mosquito automatic pipettor (SPT Labtech) in miniaturization. For the purifications before and after final library amplification, solid phase reversible immobilization magnetic bead purification (Clean NGS, LABGENE Scientific SA) was performed on the multidrop automated pipetting station (Thermo Fisher).
We measured both amplicon sizes and cDNA concentrations using an Agilent High Sensitivity DNA Kit (Agilent Technologies) according to the manufacturer’s recommendation. Prior to sequencing, cDNA contents of the samples were normalized and pooled to 2 nM final concentration on a Biomek FXP workstation. The libraries were sequenced on the NovaSeq 6000 Instrument (Illumina) with the sequencing-by-synthesis technology. Sequencing with 75 cycles yielded a minimum of 2 million reads per sample for analysis. We used molecular phenotyping with 1,215 detectable pathway reporter genes including a subset of 87 early developmental markers (germ-layer genes, Suppl. Tab. S3) and genes representative of toxicological pathways to identify differentially expressed genes induced by the compounds at pre-specified concentration levels (25, 29).
2.2. Mouse Embryonic Stem Cell Test
The protocol of the mEST was adapted from the original publication from Genschow et al., 2004 (7) into an industry compliant format (8). We used the pluripotent mouse embryonic stem cell line ES-D3 (ATCC, CRL-1934) and the somatic mouse 3T3 fibroblast line (Balb/c 3T3 cell clone A31 from ATCC, CCL-163). Most manual steps of the assay, such as cell seeding, dilution and addition of compounds, centrifugation and incubation of the EBs, are standardized and automated to gain reproducible data (30). The only non-automated assay procedures were cell maintenance and the manual count of beating cardiomyocytes.
The mEST assay is performed in two steps. First, the MTT cytotoxicity assay (3-(4,5-Dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide) is conducted with both differentiated 3T3 fibroblasts and pluripotent D3 ESCs in monolayer cultures. Second, EBs derived from D3 ESCs are differentiated into cardiomyocytes over a total time course of 10 days, with compound treatment in six different concentrations on day 4 and day 7. The endpoints measured are the concentration at which 50% inhibition of growth of 3T3 (IC50 3T3) and D3 cells (IC50 D3) is achieved, and the concentration at which 50% inhibition of differentiation into cardiomyocytes (ID50 D3) is achieved, compared to DMSO solvent controls, respectively (Suppl. Fig. S1a).
A modified discriminant function analysis was used to classify the test chemicals into two groups based on the calculated predictive score (PS) for low potential of teratogenicity (negative, PS <0.6) and high potential of teratogenicity (positive, PS ≥0.6).
A possible prediction result is ‘borderline’, if calculated predictive scores are below the cut-off of 0.6 but above 0.5. Inconclusive results are also possible, for example, if solubility limits the concentration ranges tested to an extent that no IC50 or ID50 values can be reliably determined for one or more concentration response curves (Suppl. Fig. S1b).
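The cut-offs described above can be sketched as a small decision function. This is only an illustration of the thresholds stated in the text, not the discriminant function itself; the function name is hypothetical, and reporting 'borderline' as a separate label (rather than as a flagged negative) is an assumption.

```python
def classify_predictive_score(ps):
    """Map an mEST predictive score (PS) to a call using the cut-offs in the text.

    `ps` is None when no score could be calculated, e.g. because no IC50 or
    ID50 value could be determined (treated here as 'inconclusive').
    """
    if ps is None:
        return "inconclusive"
    if ps >= 0.6:
        return "positive"    # high potential of teratogenicity
    if ps > 0.5:
        return "borderline"  # below the 0.6 cut-off but above 0.5
    return "negative"        # low potential of teratogenicity


for score in (0.72, 0.55, 0.31, None):
    print(score, classify_predictive_score(score))
```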
2.3. Assessing characteristics of differentiated hiPSC with BioQC
We applied the BioQC software that we developed previously to characterize the identity of the differentiated samples across all treated compound concentrations (including vehicle controls) at the endpoint on day 7 (31). We used raw gene expression data derived from molecular phenotyping and compared these profiles with tissue-preferential gene signatures derived from organ, tissue, and cell-type-specific gene expression data collected from public compendia (32, 33). BioQC performs Wilcoxon-Mann-Whitney tests comparing expression of genes in a set, for instance genes preferentially expressed in one tissue, versus genes that are not in the set. The enrichment scores (log-10 transformed P-values) reported by BioQC are used to assess the similarity between the expression profile of interest and cell-type- and tissue-specific expression profiles.
2.4. Analysis and modelling of the TeraTox data
We performed differential gene expression analysis comparing compound-treated samples with DMSO controls using the generalized linear model implemented in the edgeR package in R/Bioconductor (34). To generate features for machine-learning models, we transformed the P-values associated with the coefficients of compound treatment to z-scores using the quantile function (the inverse of the cumulative distribution function) of the Gaussian distribution, with the sign of each z-score given by the sign of the log2 fold-change (logFC). The vectors of z-scores of all genes (N=1,215) were used as raw features for machine-learning models, based on which further feature selection and engineering work was performed.
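A minimal sketch of such a P-value-to-z-score transformation is given below, using only the Python standard library. It assumes two-sided P-values, so that the magnitude of z is the Gaussian quantile of the upper tail probability p/2; the text does not state this detail, and the function name and the lower bound on p are illustrative choices.

```python
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def signed_z_score(p_value, log_fc, floor=1e-300):
    """Convert a two-sided P-value and a logFC into a signed z-score.

    Assumption: the magnitude is the Gaussian quantile at 1 - p/2, computed
    via the lower tail (-quantile(p/2)) for numerical stability.
    """
    p = min(max(p_value, floor), 1.0)            # keep p inside (0, 1]
    magnitude = -_STANDARD_NORMAL.inv_cdf(p / 2.0)
    if log_fc > 0:
        return magnitude
    if log_fc < 0:
        return -magnitude
    return 0.0                                    # no direction, no signal


print(signed_z_score(0.05, 1.2))    # up-regulated gene
print(signed_z_score(0.001, -0.8))  # down-regulated gene
```

Unlike logFC, the resulting feature reflects both effect size and variance, because a large fold change with a weak P-value maps to a small z-score.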
We also tested the possibility of using the effect size, logFC, as features. However, we found that using z-scores as features delivered better generalizability between training and testing datasets. Therefore, we report the performance of models using z-scores unless otherwise specified.
Besides the raw feature set of z-scores of all genes, we used three knowledge- and data-driven approaches to engineer the features in order to improve the performance of the machine-learning algorithms. First, we confined ourselves to the subset of germ-layer genes, because our work and that of others confirmed that their expression is specific to germ layers of embryogenesis and is modulated by teratogenic compounds (Suppl. Tab. S3) (23,25,29,35). Second, we used the germ-layer associations reported by Tsankov et al. to derive a reduced feature set defined by five germ-layer classes, including both germ layers (ectoderm, endoderm, mesoderm, mesendoderm) and pluripotency, by taking the median z-scores of germ-layer genes associated with each germ-layer class (25). Finally, we used factor analysis, a dimension-reduction approach that derives latent variables from the correlation structure of observed variables, to identify latent biological germ-layer factors (germ-layer factors for short), which reflect linear combinations of transcription factors, epigenetics, and other gene regulatory mechanisms that control embryogenesis.
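The median-based aggregation of gene-level z-scores into class-level features can be sketched as follows. The gene-to-class mapping and the z-score values are toy data for illustration only (the placeholder names MESO_1 and MESO_2 are hypothetical); the real mapping would come from the published germ-layer associations or from the factor analysis.

```python
from statistics import median


def class_level_features(z_scores, gene_to_class):
    """Summarize gene-level z-scores by the median within each class.

    z_scores: dict mapping gene name -> z-score for one sample.
    gene_to_class: dict mapping gene name -> class (or factor) label.
    Genes without a class assignment are ignored.
    """
    grouped = {}
    for gene, z in z_scores.items():
        label = gene_to_class.get(gene)
        if label is not None:
            grouped.setdefault(label, []).append(z)
    return {label: median(values) for label, values in grouped.items()}


# Toy example with illustrative values and assignments.
gene_to_class = {"PAX3": "ectoderm", "WNT1": "ectoderm", "MAP2": "ectoderm",
                 "MESO_1": "mesoderm", "MESO_2": "mesoderm"}
z_scores = {"PAX3": -3.1, "WNT1": -2.4, "MAP2": -0.5,
            "MESO_1": 0.2, "MESO_2": 1.0, "UNASSIGNED": 4.0}
print(class_level_features(z_scores, gene_to_class))
```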
We predicted the teratogenicity potential in two ways. One way was to treat teratogenicity as a binary variable and to perform binary classification. The other way was to convert concentration-response teratogenicity into numeric metrics and to construct regression models. For the latter case, we define a compound-specific Teratogenicity Score (TS hereafter). For non-teratogens, the TS is defined as 0 independent of the tested concentration. For teratogens, the TS is defined as the 0-1-bounded cosine similarity between the differential expression profile induced by a particular concentration of a certain compound and the differential expression profile induced by the highest non-cytotoxic concentration of the same compound.
The highest non-cytotoxic concentration was defined as the highest tested concentration associated with an average viability equal to or larger than 80%.
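The definition of the Teratogenicity Score can be sketched in a few lines. This is a simplified illustration under stated assumptions: the 0-1 bounding is implemented by clipping negative cosine similarities to 0 (the text does not specify how the bound is applied), profiles are plain lists of gene-level features, and all function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equally long numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def highest_noncytotoxic_concentration(viability, min_viability=80.0):
    """Highest tested concentration with average viability >= 80%."""
    tolerated = [conc for conc, v in viability.items() if v >= min_viability]
    return max(tolerated) if tolerated else None


def teratogenicity_scores(profiles, viability, is_teratogen):
    """Return a dict concentration -> TS for one compound.

    profiles: concentration -> differential expression profile (list).
    viability: concentration -> average viability in percent.
    """
    if not is_teratogen:
        return {conc: 0.0 for conc in profiles}   # TS is 0 for non-teratogens
    reference_conc = highest_noncytotoxic_concentration(viability)
    if reference_conc is None:
        raise ValueError("no non-cytotoxic concentration available")
    reference = profiles[reference_conc]
    return {conc: min(1.0, max(0.0, cosine_similarity(profile, reference)))
            for conc, profile in profiles.items()}
```

By construction, the reference concentration itself receives a score of 1 (up to floating-point error), and concentrations whose profiles point in an unrelated or opposite direction receive scores near 0.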
The models were trained and validated using the Leave-One-Out (LOO) scheme. We iterated over all compounds, leaving one compound out at a time and using the remaining compounds to build machine-learning models. Then we compared teratogenicity scores predicted by the models with the observation of the left-out compound with the Spearman correlation coefficient.
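The leave-one-out scheme operates on compounds rather than on individual samples, so that all concentrations of the held-out compound are excluded from training. A minimal sketch of the splitting and of the Spearman correlation used for evaluation is shown below; the sample representation (a list of dicts with a "compound" key) and the function names are assumptions made for illustration, and model fitting is omitted.

```python
import math


def leave_one_compound_out(samples):
    """Yield (held_out, train, test) with one compound held out per fold."""
    compounds = sorted({s["compound"] for s in samples})
    for held_out in compounds:
        train = [s for s in samples if s["compound"] != held_out]
        test = [s for s in samples if s["compound"] == held_out]
        yield held_out, train, test


def _average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = _average_ranks(x), _average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0.0 or sy == 0.0:
        return float("nan")  # undefined when one vector is constant
    return cov / (sx * sy)
```

In each fold, a model would be fitted on `train`, used to predict the scores of `test`, and the predictions compared with the observed scores of the held-out compound via `spearman`.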
As an alternative to LOO, we also tested repeated 80%/20% splitting of data into training sets and test sets. However, it cannot be used to predict teratogenic scores for any particular compound without using its data in both training and testing sets. Therefore, we report only results derived from the LOO scheme unless otherwise stated.
In short, we considered two types of features (z-scores and logFC), four sets of features (all genes, germ-layer genes, median z-scores or logFC of germ-layer classes defined by Tsankov et al., and median z-scores or logFC of germ-layer factors defined by factor analysis), two methods (linear regression with elastic net regularization and random forest, implemented in the caret package, version 6.0-88), two types of target variables (binary classification and regression), and two training/testing schemes (LOO and 80%/20% splitting). We tested all combinations exhaustively to build machine-learning models for teratogenicity scores and identified the best-performing models.
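The exhaustive search space can be enumerated explicitly. The sketch below only lists the combinations described above and confirms that they amount to 64 architectures; the option labels are shorthand chosen for illustration.

```python
from itertools import product

# Options as described in the text; labels are illustrative shorthand.
options = {
    "feature_type": ["z-score", "logFC"],
    "feature_set": ["all genes", "germ-layer genes",
                    "germ-layer classes", "germ-layer factors"],
    "method": ["elastic net", "random forest"],
    "target": ["binary classification", "regression"],
    "scheme": ["leave-one-out", "80%/20% split"],
}

architectures = [dict(zip(options, combination))
                 for combination in product(*options.values())]

print(len(architectures))  # 2 * 4 * 2 * 2 * 2 combinations
print(architectures[0])
```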
Besides predicting teratogenicity scores, we also exhaustively probed all options to build regression models for cytotoxicity (100%-viability), which was measured as part of the TeraTox assay. The same set of model architectures was tested; however, the combinations yielding the best-performing models differ from those for teratogenicity scores (further discussed in the Results).
All data analysis was performed with R (version 4.0.1) or Python (version 3.8.1) unless otherwise specified.
2.5. Test chemicals for validation
In total we tested 27 positive and 18 negative reference substances in six-point concentrations in the mEST and the human TeraTox assay (Tab. 1). This compound panel consisted of both commercial and developmental pharmaceuticals with known teratogenicity profiles available from either human evidence-based information, where unambiguous warnings have been found and use during pregnancy is explicitly contraindicated by the FDA, or from in vivo EFD studies in rats and/or rabbits (3,36–47). Compounds without existing human or in vivo animal data were classified as teratogens based on known teratogenic risks associated with their mode of action (18,48–55). Some compounds have been taken by cohorts of pregnant women and did not lead to any observed increase in the frequency of malformations during early pregnancy. We considered these drugs as non-teratogenic in humans, at least in the physiologically relevant concentrations of exposure (56–72) (Suppl. Tab. S1).
The commercial compounds were obtained from Merck, Germany. We also included 12 developmental small molecules RO-1 to RO-12 provided by F. Hoffmann – La Roche, Switzerland (compound structures are not disclosed due to confidentiality and intellectual property). Those compounds have unknown human teratogenicity profiles, but in vivo data are available from EFD studies performed either in rats, or in rabbits, or in both (Suppl. Tab. S2).
Based on the outcome of the in vivo studies, we classified RO-1, RO-3, RO-8, RO-9 and RO-10 as teratogens, and RO-2, RO-4, RO-5, RO-6, RO-7, RO-11, RO-12 as non-teratogens (73). All compounds were serially diluted in DMSO (0.25%) from a stock solution to six test concentrations.
We used the following metrics to compare the performance of the TeraTox assay and that of the mEST assay. We calculated assay sensitivity as the proportion of correctly predicted teratogens. Assay specificity was calculated as the proportion of correctly predicted non-teratogens. Overall accuracy was taken as the proportion of all correct predictions, and F1 scores are calculated as the harmonic mean of precision and recall. When we denote True Positive, True Negative, False Positive, and False Negative with TP, TN, FP, and FN, respectively, the metrics of performance are defined in Equations 1-5.
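Assuming the standard confusion-matrix definitions of these metrics, they can be computed from the four counts as in the following sketch. The function name and the convention of returning 0 when a denominator is zero are illustrative choices, and the counts in the example are made up.

```python
def performance_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, precision, accuracy and F1 from counts."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    sensitivity = ratio(tp, tp + fn)   # recall: correctly predicted teratogens
    specificity = ratio(tn, tn + fp)   # correctly predicted non-teratogens
    precision = ratio(tp, tp + fp)
    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy, "f1": f1}


# Made-up counts, for illustration only.
print(performance_metrics(tp=8, tn=6, fp=2, fn=4))
```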
To identify the threshold of TS that maximizes the performance (F1 score) of the TeraTox Score model, we used grid search between 0 and 1 with a step size of 0.01. The best threshold (TS=0.38) was chosen manually by inspecting the performance metrics defined in Equations (1)-(5).
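The grid search over thresholds can be sketched as follows. The sketch assumes that a compound is called teratogenic when its score is greater than or equal to the threshold, and it returns the first threshold with maximal F1; in the study the final value was selected by manual inspection of all metrics, which this simplified example does not reproduce. The scores and labels are toy data.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 score when compounds with score >= threshold are called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def grid_search_threshold(scores, labels, step=0.01):
    """Evaluate thresholds 0, step, ..., 1 and return (best, best_f1, grid)."""
    n_steps = round(1 / step)
    grid = [(round(i * step, 10),
             f1_at_threshold(scores, labels, round(i * step, 10)))
            for i in range(n_steps + 1)]
    best_threshold, best_f1 = max(grid, key=lambda item: item[1])
    return best_threshold, best_f1, grid


# Toy data: two non-teratogens followed by two teratogens.
toy_scores = [0.1, 0.2, 0.5, 0.9]
toy_labels = [False, False, True, True]
best_threshold, best_f1, grid = grid_search_threshold(toy_scores, toy_labels)
print(best_threshold, best_f1)
```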
2.6. Model explainability and interpretation
We used the Type I importance measure of features (mean decrease in accuracy) of random-forest models to compare the importance of germ-layer genes in the teratogenicity model and in the cytotoxicity model.
Pharmacology data of publicly available compounds were downloaded from ChEMBL (version 26). We only used human targets and affinities derived from high-quality dose-response data. Binary distances were used to cluster the compounds by their pharmacological profiles.
To construct a Bayesian network model of regulations between factors, we first discretized differential gene expression data of the first six germ-layer factors into three levels using Hartemink’s pairwise mutual information method implemented in the bnlearn package (74). We generated 1,000 bootstrap replicates using Hill Climbing, a score-based learning algorithm, and the Bayesian Dirichlet equivalent (uniform) score (bde, with the imaginary sample size set to 10). Edges that persist in more than 85% of the bootstrap replicates are deemed significant and reported.
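The final filtering step, retaining edges found in more than 85% of the bootstrap replicates, can be illustrated independently of the structure-learning algorithm. The sketch below assumes that each learned network is available as a collection of directed edges (pairs of factor names); it does not reproduce the discretization or the hill-climbing search, and the factor labels are placeholders.

```python
from collections import Counter


def stable_edges(bootstrap_networks, min_fraction=0.85):
    """Return edge -> frequency for edges in more than min_fraction of networks.

    bootstrap_networks: list of iterables of (source, target) tuples,
    one per bootstrap replicate.
    """
    n_networks = len(bootstrap_networks)
    if n_networks == 0:
        return {}
    counts = Counter(edge for network in bootstrap_networks
                     for edge in set(network))
    return {edge: count / n_networks
            for edge, count in counts.items()
            if count / n_networks > min_fraction}


# Placeholder replicates: one edge appears in 90 of 100 networks, another in 85.
replicates = ([{("F1", "F3"), ("F2", "F4")}] * 85
              + [{("F1", "F3")}] * 5
              + [set()] * 10)
print(stable_edges(replicates))
```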
The beta regression model used for sensitivity analysis was built with the glmmTMB package (75). Scores outside the boundaries [0.01, 0.99] are set to the boundary values to allow beta regression. All ten factors and significant interaction terms identified in the Bayesian network are used as the model input, and compounds are modelled as random effects to capture between-concentration correlations. For better interpretability, input variables are scaled to zero mean and unit standard deviation. Simulation was performed with the ggeffects package (76).
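The two preprocessing steps, bounding the response away from 0 and 1 and standardizing the inputs, can be sketched as follows. Beta regression requires responses strictly inside the open interval (0, 1), which motivates the clipping. The use of the sample standard deviation (n-1 denominator) is an assumption, and the function names are illustrative; the regression itself is not reproduced.

```python
from statistics import mean, stdev


def clip_scores(scores, lower=0.01, upper=0.99):
    """Set scores outside [lower, upper] to the nearest boundary value."""
    return [min(upper, max(lower, s)) for s in scores]


def standardize(values):
    """Scale values to zero mean and unit (sample) standard deviation."""
    center = mean(values)
    spread = stdev(values)
    if spread == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - center) / spread for v in values]


print(clip_scores([0.0, 0.5, 1.0]))
print(standardize([1.0, 2.0, 3.0]))
```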
3. Results
3.1. Gene expression quantification by molecular phenotyping
We described previously that differential expression of a set of 87 genes preferentially expressed in different germ layers (germ-layer genes hereafter), which both determine and reflect embryonic development, is in principle able to distinguish between teratogenic and non-teratogenic compounds (23, 25). To validate our findings, we compiled a large set of well-documented teratogens, partially with label information for drug-use, and non-teratogens that are challenging to predict and/or known to cause false-positives using animal studies (Suppl. Tab. S1, S2). The compounds cover a broad spectrum of chemical classes and a wide range of effective concentrations. This large compilation of compounds with solid clinical and animal data anchoring is a useful resource for further model development.
We interrogated our human stem-cell model with the compilation of compounds, adapting the experimental workflow that we developed previously (Fig. 1a and 1b). We identified the assay throughput as a major challenge due to the high number of samples for gene expression profiling (>1,000). It would be particularly cost- and labor-intensive if we used the digital PCR technique, which we established in our previous work to quantify gene expression (23). To address this challenge, we used molecular phenotyping as an alternative readout. Molecular phenotyping is an amplicon-based targeted sequencing approach, which delivered quantitative expression data of 1,215 detectable genes. Notably, all germ-layer specific genes used in our previous work were included. In this way, we were able to characterize both general pathway activity modulations and germ layer-specific changes as potential features associated with teratogenicity (26–28).
We performed extensive quality control of the data. In particular, we addressed the questions whether results of molecular phenotyping are comparable to those of RT-qPCR, and whether the hiPSC used show the expected reproducibility based on their gene expression profile. We compared the differential expression profiles of germ-layer genes obtained by RT-qPCR in previous studies with newly generated data of molecular phenotyping and observed highly similar results (Pearson correlation coefficient R=0.9, p<2.2E-16) (Fig. 1c). This suggests that targeted RNA sequencing with molecular phenotyping delivers highly comparable results, at least for germ-layer genes. The comparison is not feasible for other pathway reporter genes because they were not quantified by digital RT-qPCR.
A unique advantage of quantifying pathway reporter genes along with germ-layer genes is that we can use them to assess cell-type-specific gene expression patterns. To this end, we applied BioQC analysis, a method that we developed to identify sample heterogeneity and tissue comparability using gene sets preferentially expressed in cells and tissues (31). We observed that the expression profiles of the cells used in the TeraTox assay at day 7 resemble a mix of gene signatures specific for astrocytes, epithelial cells, and iPSC-derived neurons (Fig. 1d). This suggests that the hiPSC used for the assay show a preferred differentiation propensity into the neuroectodermal lineage, which is in agreement with previous time-series gene expression studies that demonstrated pronounced expression of ectodermal markers at day 7, followed by meso- and endodermal expression (23, 25).
3.2. Unsupervised learning from gene expression data with factor analysis
Before applying supervised learning techniques to differentiate teratogens from non-teratogens, we applied several unsupervised learning algorithms to analyze the gene expression data, including principal component analysis (PCA) and factor analysis. PCA revealed experimental plate effects that we could successfully correct with linear regression models for differential gene expression (data not shown). Unexpectedly, factor analysis revealed both biological insights and, as further discussed below, a feature engineering technique that contributed to the best-performing model. Since this is the first time to our knowledge that factor analysis is applied in the context of gene-expression-based toxicity prediction, we highlight its concepts and unique advantages.
Factor analysis, sometimes called exploratory factor analysis to differentiate it from confirmatory factor analysis, is a statistical method to discover latent (unobserved) variables that account for the correlations observed between features. Useful for both dimension reduction and feature engineering, factor analysis has been particularly powerful in building predictive machine-learning models in biology using highly correlated features such as cell morphology in the context of high-content screening (32,33,77,78). With respect to gene expression, factor analysis reduces the data dimension from genes to factors, each of which is usually associated with multiple genes. Genes in each factor show correlated gene expression profiles across samples (Fig. 2a, b). These factors, therefore, can be thought of as being a representation of all biological processes influencing gene expression, for instance epigenetic profiles, transcription factor activities, microRNA abundances, etc. Despite the fact that most of these variables are not directly observable, latent factor analysis offers a possibility to infer their total contribution to detected variation in gene expression profiles.
Conceptually, factor analysis is similar to other correlation-based methods, for instance Relevance Networks (79) and Weighted Correlation Network Analysis (WGCNA) (80). We preferred factor analysis to these alternatives because it makes no assumptions beyond the common, minimal ones underlying correlation analyses (homogeneity, completeness, etc.), whereas other methods do, for instance the scale-free network structure assumed by WGCNA, an assumption that is often challenged (81, 82). In addition, because we have many more samples than factors, factor analysis is feasible with the maximum-likelihood method. We therefore decided to use factor analysis following the principle of Occam’s Razor.
We applied factor analysis to raw gene expression data and identified intriguing patterns. Since factor analysis is based on inter-gene correlations, we visualize the correlation matrix of germ-layer genes in Figure 2a (the full matrix is visualized in Supplementary Figure S2a). Genes that strongly correlate with each other form clusters, which correspond to latent factors.
Although factor analysis is a correlation-based statistical method into which we injected no prior knowledge, it revealed biologically meaningful patterns. Using the maximum likelihood method, we decomposed the covariance matrix of gene expression into factors. The heatmap in Figure 2b shows the loadings, i.e. how strongly factors influence the expression of germ-layer genes, of the first ten factors, which collectively explain more than 70% of the covariance (Suppl. Fig. S2b and S2c). To the left of the heatmap, colors indicate germ-layer classes that were distilled from biological knowledge. We found that the first six factors (ranked by explained covariance of the data) are significantly enriched with signatures of individual germ layers or signatures of stem-cell self-renewal (Fig. 2c, p<0.01, Fisher’s exact test). This significant enrichment is both intriguing and novel: while it is established that germ-layer genes are highly expressed at different stages of embryogenesis, we failed to find any previous study reporting that their expression is strongly correlated in 3D embryoid bodies formed by hiPSC, with or without compound treatment. Given that the cells in TeraTox are all grown up to day 7, it is unlikely that the correlations are caused by temporal changes of embryogenesis. Instead, factor analysis suggests that besides being correlated across time in development, expression of germ-layer genes is also correlated across treatment conditions in 7-day spontaneously differentiated EBs.
Detailed analysis of the results from the factor analysis revealed more insights. The strongest correlation of the germ-layer genes was observed among genes in Factor 1, many of which are markers of the ectodermal layer, e.g., WNT1, POU4F1, OLFM3, CDH9, LMX1A, DMBX1, PAX3, MAP2, and TRPM8 (Fig. 2a). While BioQC analysis revealed that ectodermal genes are highly expressed at the endpoint on day 7, factor analysis further indicates that their expression is strongly correlated across conditions, too, which is neither sufficient nor necessary for their high expression. Factors 2-6 mainly consist of genes representing the mesodermal layer (Factor 2), stem-cell self-renewal (Factor 3), and the endodermal layer (Factors 4-6), respectively. The remaining factors (Factors 7-10) are smaller and more heterogeneous (Fig. 2b). Genes associated with each factor belong mainly, but not exclusively, to the same germ-layer class.
3.3. Training and testing of a predictive model for the TeraTox assay
To build a quantitative predictive model of concentration-dependent teratogenicity potential with gene expression as input, we explored all combinations of the following options exhaustively (Fig. 3a):
Feature type: We tested both log2 fold change (logFC), the point-estimate of the effect size, and z-scores transformed from the sign of logFC and p-value reported by the edgeR model, which considers both effect size and variance of differential gene expression.
Feature engineering: We used all detectable pathway reporter genes (N=1,215), detectable germ-layer genes (N=87), germ-layer classes defined by Tsankov et al. (N=7), and germ-layer factors derived from factor analysis (N=10). For both germ-layer classes and factors, we used the median value of the genes belonging to each group as the engineered feature.
Model construction: We used and benchmarked two methods of different nature, Elastic Net (linear regression with regularization) and Random Forest (ensemble decision trees), to construct machine-learning models. We chose them based on the size of the dataset and the relatively good explainability of both methods (83).
Target variable: We used both binary classification (teratogen or non-teratogen) and regression (the teratogenicity score, defined below and further detailed in the Material and Methods section) for teratogenicity and regression alone for cytotoxicity.
Data splitting: We tried both repeated splitting into an 80% training set and a 20% test set, and the leave-one-out (LOO) scheme. In the first case, we used 80% of the compounds (stratified sampling from non-teratogens and teratogens) as the training set to train a model, which was then used to predict the teratogenicity scores of the remaining 20% of the compounds, the test set. In the latter case, all compounds except one were used to train the model, which then predicted the teratogenicity scores for the left-out compound; we repeated the procedure for all compounds, so that teratogenicity scores were predicted for each compound based on data from the other compounds. In either case, model performance was assessed by F1 scores for binary classification models, and by Spearman correlation coefficients of teratogenicity scores for teratogens for regression models. The best model parameters were searched by 10-fold cross-validation of the training set.
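The options listed above can be illustrated with a short sketch that builds signed z-scores, summarizes them by the median per germ-layer factor, and evaluates a random-forest regressor with a leave-one-compound-out scheme. Everything in it is an assumption made for illustration: the data are random, the z-score transform (sign of logFC times the normal quantile of the two-sided p-value) is one common convention rather than a confirmed detail of the study, and the hyperparameter search by 10-fold cross-validation is omitted.

```python
import numpy as np
from scipy.stats import norm, spearmanr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut


def signed_z(log_fc, p_value):
    """z-score with the sign of logFC and the magnitude implied by a two-sided p-value."""
    return np.sign(log_fc) * norm.isf(np.asarray(p_value) / 2.0)


def factor_features(gene_values, gene_to_factor, n_factors):
    """Median of the gene-level values within each factor (samples x factors)."""
    return np.column_stack([
        np.median(gene_values[:, gene_to_factor == k], axis=1)
        for k in range(n_factors)
    ])


# Synthetic stand-in data: 12 compounds, 5 concentrations each, 20 genes.
rng = np.random.default_rng(1)
n_compounds, n_conc, n_genes, n_factors = 12, 5, 20, 4
n_samples = n_compounds * n_conc
compound = np.repeat(np.arange(n_compounds), n_conc)  # group label per sample
gene_to_factor = np.arange(n_genes) % n_factors       # hypothetical gene-to-factor map

log_fc = rng.normal(size=(n_samples, n_genes))
p_value = rng.uniform(1e-4, 1.0, size=(n_samples, n_genes))
X = factor_features(signed_z(log_fc, p_value), gene_to_factor, n_factors)
y = rng.uniform(0.0, 1.0, size=n_samples)  # placeholder teratogenicity scores

# Leave-one-compound-out: all concentrations of one compound are held out together.
predicted = np.empty(n_samples)
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=compound):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    predicted[test_idx] = model.predict(X[test_idx])

# Rank correlation between target and held-out predictions.
rho = spearmanr(y, predicted)[0]
```

Grouping by compound keeps all concentrations of the held-out compound out of training, which matches the description that scores are predicted from data of other compounds only.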
While all other technical terms are used in their common sense, we explain the motivation and definition of the Teratogenicity Score in detail. A key challenge for building a predictive model of teratogenicity is that the potential of a compound inducing teratogenicity varies by its concentration. A concentration-response relationship can be assumed, namely a treatment with a higher concentration is more likely to induce teratogenicity than that with a low concentration. However, the concrete functional form between the potential and the concentration is not known. This motivated us to define the Teratogenicity Score as the ‘0-1 cosine bounded similarity’ between differential gene expression profiles induced by any given concentration and the profiles induced by the maximum non-cytotoxic concentration. The teratogenicity scores of teratogens are defined between 0 and 1, and those of non-teratogens are fixed as 0 at all concentrations (Fig. 3b). By defining teratogenicity scores, we effectively transformed the binary classification problem into a regression problem.
Two important technical details require clarification. The first is the range of the teratogenicity score. Mathematically, cosine similarity ranges between −1 and 1; we bounded it to 0-1 by setting negative similarities to zero, which did not change the performance of the models (data not shown) but helped with human understanding. The teratogenicity score can be interpreted as an estimate of the probability of inducing teratogenicity, which would be a real number between 0 and 1. The true probability, however, is unknown to us because we are working with an in vitro system only, and the probability estimated in our system may differ significantly from that in vivo.
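As a minimal sketch of the definition above, the bounded cosine similarity between a differential expression profile and the reference profile at the maximum non-cytotoxic concentration could be computed as follows. The function name and the handling of all-zero profiles (returning 0) are assumptions of this example.

```python
import numpy as np


def teratogenicity_score(profile, reference):
    """Cosine similarity between two profiles, with negative values set to zero."""
    profile = np.asarray(profile, dtype=float)
    reference = np.asarray(reference, dtype=float)
    denom = np.linalg.norm(profile) * np.linalg.norm(reference)
    if denom == 0.0:
        # Assumption: a profile without any signal is scored as 0.
        return 0.0
    cosine = float(profile @ reference) / denom
    return min(1.0, max(0.0, cosine))


# Toy profiles (made-up values): same direction, opposite direction, orthogonal.
reference = [1.0, 2.0, -1.0]
same = teratogenicity_score([2.0, 4.0, -2.0], reference)
opposite = teratogenicity_score([-1.0, -2.0, 1.0], reference)
orthogonal = teratogenicity_score([1.0, 0.0, 1.0], reference)
```

A profile pointing in the same direction as the reference scores 1, while opposite or orthogonal profiles score 0.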
The second technical detail is the selection of regression models. Given the truncated domain where the teratogenicity score is defined, we tried both simple linear regression and generalized linear models with beta regression. However, beta regression was computationally intensive and much slower, and its use led to similar results as simple linear regression for predicting teratogenicity scores. Therefore, we used simple linear regression throughout the study except in the last part of model explainability, because only one model is required there and the boundary consideration is important for simulation studies.
We observed the following patterns as we tried all options of model building:
The feature type has minimal impact on the performance, though models trained with z-scores perform better on the test set than models trained with logFC (data not shown).
The combination of feature engineering and machine-learning model is important and the best combination depends on the prediction task (Fig. 3c and 3d, contrasted with Fig. 5a). For teratogenicity prediction, the combination of germ-layer factors and random-forest regression works the best.
With regard to the target variable, the performance of the regression-based teratogenicity-score prediction model is slightly better than the model for binary classification (data not shown).
Performance is comparable between the two modes of data splitting (data not shown). However, the leave-one-out training-testing scheme is preferable because it yields a single threshold of the teratogenicity score that applies to all compounds, regardless of whether a compound would fall into the training set or the test set under 80%/20% data splitting.
Based on these observations, we decided to use germ-layer factors as features, random-forest regression as the machine-learning model, and teratogenicity score as the target variable to build the predictive model for teratogenicity with gene expression data.
3.4. Assay performance of the TeraTox assay compared to mEST prediction
Based on the best-performing machine-learning model, we defined the following predictive model for teratogenicity. First, we determined the maximal non-cytotoxic threshold concentration (NCCmax), i.e. the maximal concentration at which cell viability measured by the CellTiter Glo assay remained at least 80%. Next, we defined the minimal teratogenic concentration (TCmin) as the concentration at which the threshold of the teratogenicity score was met (TS=0.38, defined by grid search, Fig. 4a). If no NCCmax or TCmin could be determined because values did not exceed these thresholds, the maximal tested concentrations were used for NCCmax and TCmin. The predictive score, which we named TeraTox Score to avoid confusion with Teratogenicity Score, is defined as the logarithm of the ratio between NCCmax (the threshold concentration at 20% viability impairment) and TCmin (the minimal teratogenic concentration). Negative TeraTox scores classify compounds as negative, whereas positive scores classify compounds as positive (Fig. 4b).
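The decision rule described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the study's implementation: it uses a base-10 logarithm, takes NCCmax as the highest tested concentration with viability of at least 80%, takes TCmin as the lowest tested concentration with a teratogenicity score of at least 0.38, and treats a score of zero as negative. The case in which every tested concentration is cytotoxic is not covered by the text and raises an error here.

```python
import math


def teratox_score(concentrations, viability, terato_scores,
                  viability_threshold=80.0, ts_threshold=0.38):
    """log10(NCCmax / TCmin) from tested concentrations (assumed definitions)."""
    max_conc = max(concentrations)
    non_cytotoxic = [c for c, v in zip(concentrations, viability)
                     if v >= viability_threshold]
    if not non_cytotoxic:
        # Edge case not described in the text; handled explicitly in this sketch.
        raise ValueError("all tested concentrations are cytotoxic")
    teratogenic = [c for c, s in zip(concentrations, terato_scores)
                   if s >= ts_threshold]
    ncc_max = max(non_cytotoxic)
    # Fall back to the maximal tested concentration if the threshold is never met.
    tc_min = min(teratogenic) if teratogenic else max_conc
    return math.log10(ncc_max / tc_min)


def is_predicted_teratogen(score):
    """Positive scores are called positive; zero or negative scores negative."""
    return score > 0.0


# Made-up concentration series (arbitrary units) and viability in percent.
conc = [0.1, 1.0, 10.0, 100.0]
viab = [100.0, 95.0, 85.0, 40.0]
positive_example = teratox_score(conc, viab, [0.10, 0.50, 0.80, 0.90])
negative_example = teratox_score(conc, viab, [0.05, 0.10, 0.20, 0.30])
```

In the first example the teratogenicity threshold is met one decade below NCCmax, giving a positive score; in the second it is never met, so TCmin falls back to the maximal tested concentration and the score is negative.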
We plotted the concentration-response curves of both measured cytotoxicity and predicted teratogenicity scores induced by each compound (Fig. 4c, see Suppl. Fig. S4 for all compounds). In general, teratogenicity levels increased while cell viability decreased with rising concentrations. Correctly predicted negative compounds were unlikely to induce teratogenicity within non-cytotoxic concentrations, which means the calculated TeraTox score was negative or zero (e.g., Doxycycline, RO-4, RO-6). Positive compounds (e.g., Bosentan, Carbamazepine, Retinoic Acid, RO-1) or false positive predicted compounds (e.g., Cetirizine) were more likely to induce teratogenicity under non-cytotoxic concentrations, which was indicated by positive TeraTox scores (Fig. 4c).
We compared predictions of 45 reference compounds by TeraTox scores with classifications from FDA or in vivo EFD studies (Suppl. Tab. S4). Classification with TeraTox Scores achieved an overall accuracy of 68% and outperformed mEST (60%). The two assays show different sensitivity and specificity profiles: while mEST is more specific (specificity 78%), TeraTox is more sensitive (sensitivity/recall 78%). Among 18 negative reference compounds, 9 were classified as false positives (FP) by TeraTox, and only 4 by the mEST. Conversely, among 27 positive reference compounds, 21 were predicted as true positives (TP) by the human TeraTox and only 13 by the mEST (Tab. 2, Fig. 4d). It is noteworthy that among the 26 compounds misclassified by at least one assay, seven are wrongly predicted by both assays: cyproheptadine, RO-11, 5-FU, methotrexate, misoprostol, RO-8, and warfarin. Given the distinct sensitivity and specificity profiles of the two assays, we asked whether we can achieve even better prediction results by using them in a sequential mode. Specifically, we first let mEST classify the compounds, and for the compounds that mEST predicts as negative, we accept the predictions by TeraTox. The intuition is that we may benefit both from the high specificity of mEST and the high sensitivity of TeraTox. Indeed, we found that the overall accuracy of the combined prediction increased to 78%. This suggests that it may be possible to achieve better prediction results by combining the existing mEST assay with the novel TeraTox assay.
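The performance metrics and the sequential mode can be illustrated with a small sketch. The helper names are hypothetical, and the sequential rule follows a literal reading of the description above (mEST positives are kept, and TeraTox decides for mEST negatives), which is equivalent to a logical OR of the two calls; the exact rule used in the study may differ. The toy labels at the end are made up.

```python
def metrics_from_counts(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


def classification_metrics(truth, predicted):
    """Same metrics computed from paired boolean labels."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    return metrics_from_counts(tp, fn, tn, fp)


def sequential_call(mest_positive, teratox_positive):
    """Keep mEST positives; for mEST negatives, accept the TeraTox call."""
    return [m if m else t for m, t in zip(mest_positive, teratox_positive)]


# Counts stated in the text: 27 positives and 18 negatives per assay.
teratox = metrics_from_counts(tp=21, fn=6, tn=9, fp=9)
mest = metrics_from_counts(tp=13, fn=14, tn=14, fp=4)

# Toy example with made-up labels for four compounds.
truth = [True, True, False, False]
mest_calls = [True, False, False, True]
teratox_calls = [False, True, False, False]
combined = sequential_call(mest_calls, teratox_calls)
combined_metrics = classification_metrics(truth, combined)
```

With the counts given in the text, the sketch returns a TeraTox sensitivity of 21/27 and an mEST specificity of 14/18, both of which round to the reported 78%.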
3.5. Model interpretation and explanation
A model’s explainability is crucial because understanding how it works allows inspection and further improvement (84). We performed additional in-depth analysis and collected data orthogonal to TeraTox, thereby implementing three independent approaches to interpret and explain how the TeraTox model, and in particular the teratogenicity score prediction model, works.
First, we followed up on previous work and asked the question whether the cytotoxicity quantified by the phenotypic assay can be predicted by gene expression data as well, and whether teratogenicity scores are confounded by general cytotoxicity (85, 86). For this purpose, we followed the same scheme as described in Figure 3a while using cytotoxicity instead of teratogenicity scores as the target variable. Interestingly, an exhaustive search showed that using all pathway reporter genes and the elastic net model, instead of using germ-layer factors and random forest as in the case of teratogenicity prediction, gives the best result (Fig. 5a).
Given that the combination of germ-layer genes and random forest gives reasonable performance in both cases, and that random forest allows inquiry of feature importance by accuracy, we compared the feature importance of germ-layer genes in predicting both target variables (Fig. 5b). The prediction of cytotoxicity and teratogenicity by molecular phenotyping relies on expression changes of distinct genes. The distinction shows that teratogenicity of a compound is not a determinant for cytotoxicity whereas a compound that shows cytotoxicity at a specific concentration can still be teratogenic at lower, non-cytotoxic concentrations and that pathways for cytotoxicity and teratogenicity may be independently regulated. This is well in line with several previous findings (87–89).
The second approach addressed the question whether a compound’s pharmacology, namely its target profile (protein targets and binding affinities), suffices to predict its teratogenicity potential. If so, one may hope to predict teratogenicity potential based on target profiles or even based on the chemical structure alone. While some teratogens indeed have similar target profiles, we observe close clustering of teratogens and non-teratogens that have similar target profiles as well (Fig. 5c, Suppl. Fig. S5a). The potential of teratogenicity, therefore, may be associated with off-target effects or effects through targets that are not captured in ChEMBL, especially at the relatively high concentrations approaching cytotoxicity levels that we tested. Corroborating this, we found almost no correspondence between clustering of average differential gene expression across concentrations per compound and that of pharmacological profiles (Suppl. Fig. S5b). Therefore, we conclude that while knowing the target and off-target profile of a compound is essential for de-risking its safety liabilities including teratogenicity, pharmacology data alone, at least in their current state, cannot predict a compound’s teratogenicity potential. In vitro assays, for instance with TeraTox and other advanced cellular models, are indispensable for preclinical teratogenicity assessment.
The third approach was to use a simpler generalized linear regression model for sensitivity analysis, which would allow us to analyze how the model responds to changes of the input. Given that random forest is an ensemble method and the contribution of each germ-layer factor can be therefore difficult to interpret, we built an alternative model using beta linear regression. To identify interaction terms in the linear regression, we made the assumption that germ-layer factors regulate each other by forming a directed acyclic graph (DAG). Under this assumption, we built a Bayesian network using the differential expression data of germ-layer factors (Fig. 5d). The network reveals potential influences on both mesoderm and endoderm by the ectoderm, influences on endoderm by mesoderm, and influences on stem-cell renewal by endoderm.
The Bayesian network topology prompted us to build a beta regression model including all germ-layer factors and interactions identified in the Bayesian network (Fig. 5e, Suppl. Fig. S6). The model provides both interpretable coefficients and a tool for sensitivity analysis, because we can quantify prediction uncertainty much more easily with a linear model than with the random forest model, at the price of assuming linear regulatory relationships. For the sensitivity analysis, we kept all other parameters fixed and varied one input parameter at a time to simulate its impact on predicted teratogenicity scores. We observed that the model is likely sensitive to impairment of either the ectoderm layer or stem-cell self-renewal, while being relatively robust to changes to either mesoderm or endoderm (Fig. 5e). The results of the sensitivity analysis further underlined the prominent ectodermal nature of the model at the endpoint on day 7.
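The one-at-a-time sensitivity analysis described above can be sketched generically: all inputs are held at a baseline while a single input is varied, and the change in the prediction is recorded. The mean model below, a logit-link linear predictor as commonly used for beta regression, has entirely made-up coefficients and no interaction terms; it only serves to make the example runnable and does not reproduce the fitted model of the study.

```python
import numpy as np


def one_at_a_time(predict, baseline, index, values):
    """Vary one input over `values` while keeping all others at `baseline`."""
    baseline = np.asarray(baseline, dtype=float)
    outputs = []
    for value in values:
        x = baseline.copy()
        x[index] = value
        outputs.append(float(predict(x)))
    return np.array(outputs)


# Hypothetical coefficients for four inputs, in this order:
# ectoderm, mesoderm, self-renewal, endoderm (values invented for the sketch).
coef = np.array([-1.5, 0.1, -1.0, 0.05])
intercept = 0.0


def predict(x):
    """Mean of a logit-link model: maps the linear predictor into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(intercept + coef @ x)))


baseline = np.zeros(4)
grid = np.linspace(-2.0, 0.0, 5)  # simulated down-regulation of one factor
ranges = [np.ptp(one_at_a_time(predict, baseline, i, grid)) for i in range(4)]
```

Inputs with a large coefficient in absolute terms produce a wide range of predicted scores over the grid, whereas inputs with a small coefficient barely move the prediction, which is the pattern the sensitivity analysis is designed to expose.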
In summary, we explain how the TeraTox model works by complementing the machine-learning model with feature importance analysis, biological and pharmacological interpretation, and sensitivity analysis.
4. Discussion
This study characterizes the optimization of TeraTox, a newly developed human teratogenicity assay. TeraTox quantifies drug-like molecules’ cytotoxicity and teratogenicity profiles in concentration response using a hiPSC derived embryoid body model that spontaneously differentiates into all three germ layers over seven days. It thus extended and standardized earlier embryoid body models, and fully leveraged their predictive potential by adding a toxicological prediction model (87, 90). We challenged the TeraTox assay with a selection of 45 reference substances with teratogenic profiles based on high-quality data. We identified latent germ-layer factors that influence germ-layer gene expression, and identified the best machine-learning model that predicts the teratogenicity potential based on germ-layer factors as input and random forest as the regression model. We demonstrated that TeraTox outperforms mEST in both sensitivity and balanced prediction performance, though having lower specificity. Furthermore, we explored the interpretation and explainability of the TeraTox model with three independent approaches. We found that teratogenicity can be distinguished from cytotoxicity, that pharmacological profiles are not sufficient for predicting teratogenicity, and that the TeraTox assay is particularly sensitive towards teratogens impairing ectoderm development and stem-cell self-renewal. The study embodies a comprehensive and critical assessment of the TeraTox assay and its predictive algorithm, addressing important open questions for its practical use.
The TeraTox model presents a promising companion and an alternative to mEST as a humanized in vitro model for preclinical teratogenicity assessment. The two assays differ in cellular origin (human iPSC versus mouse ESC and fibroblasts), final endpoints (differential gene expression from all germ layers versus direct differentiation into mouse cardiomyocytes), and the prediction model. Both assays are anchored to a specific cytotoxicity threshold that determines the non-cytotoxic yet teratogenic effects. Contrary to the mEST assay, where cytotoxicity is inferred from IC50 values of D3 and 3T3 cells grown in monolayers, we anchored the TeraTox assay to a much lower cytotoxicity threshold (NCCmax, viability >80%) in a three-dimensional system, which is more physiologically relevant. With the exception of a few compounds (dexamethasone, bosentan, dorsomorphin, hydroxyurea, imatinib, and isotretinoin), TeraTox determined cytotoxicity and/or teratogenicity LOAEL (lowest observed adverse effect levels) at lower concentrations than the mEST. We therefore believe that TeraTox may be a more relevant in vitro assay for human teratogenicity assessment.
Our analysis of the TeraTox data revealed its three unique advantages over mEST. First, TeraTox is more sensitive than the mEST assay. We believe the higher sensitivity is due to several factors, including the use of human induced pluripotent stem cells, cytotoxicity determination in 3D EBs and using gene expression as readout. In this study, we carefully selected concentration ranges based on drug-specific maximum plasma concentrations (Cmax) from either human data whenever possible or model species otherwise (Suppl. Tab. S1, S2). Retrospective comparison of the TeraTox readout with the human therapeutic Cmax data showed that TeraTox captured relevant in vivo doses for teratogenicity for most compounds, except for bosentan, isotretinoin, imatinib, and warfarin. The higher sensitivity to detect teratogens is particularly important for preclinical drug discovery to remove potential teratogens from the pipeline as early as possible.
The second advantage of TeraTox over mEST is that it allows the detection of human-specific teratogens. Generally, using a model species such as the mouse or the rabbit to predict toxicity may lead to misclassifications if the toxicity is specific for either the species or for humans. For this reason, when we compiled our compound panel, we chose preferentially those compounds that are either known to be species-specific or known to be misclassified by alternative methods. And when we assigned labels to the compounds, we relied on human data whenever possible. An example for species-specific teratogenicity is thalidomide, which was correctly identified as positive by TeraTox. At the same time, it shows a high level of cytotoxicity at concentrations that are 80-fold lower than human Cmax. It is well established that the mouse system is insensitive to the teratogenic effects of thalidomide due to the lack of cereblon-mediated degradation of the SALL4 transcription factor, which has been shown to result in agenesis of the limb buds in rabbit embryos and was recapitulated by a species-specific false-negative response of the mEST (55,91,92).
The third advantage of TeraTox is that it is less of a phenotypic black box and more of an interpretable and explainable model. We used factor analysis, an established unsupervised, generative data-analysis method, to reveal clustering patterns in correlations between expression of germ-layer genes. Although these clustering patterns, which we termed germ-layer factors, were derived statistically from the raw gene expression data without any biological prior knowledge, they correlated surprisingly well with the known biology of germ-layer development. Specifically, germ-layer factors were enriched with genes preferentially expressed in one of the three germ layers or in stem-cell renewal. Interestingly, averaging differential gene expression of germ-layer genes by germ-layer factors provided the best features for the prediction of teratogenicity. The latent factors can be seen as a sum of the output of gene regulatory networks in germ-layer development and stem-cell self-renewal. Therefore, TeraTox informs predictions not only based on statistical data patterns: it builds upon biological mechanisms and thus may reflect disturbed functionalities, similar to those leading to teratogenicity in vivo. This feature puts TeraTox conceptually in a group of other assays that use phenotypic changes or disturbed functionalities as readouts (17,93–95). The model consolidates our previous call to ‘focus on germ layers’ and corroborates our recent work exploring gastruloid models that profile morphological changes of germ layers for teratogenicity prediction (23, 96).
Besides factor analysis, we tried several ways to shed light on how the model works (or does not). Most importantly, we could distinguish cytotoxicity from teratogenicity. We explored machine-learning model variants for both teratogenicity and cytotoxicity predictions and made the intriguing observation that the best model depends on the target variable. Whereas germ-layer factors and random forest performed best for teratogenicity prediction, the combination of all pathway reporter genes and regularized linear regression with elastic net showed the best prediction for cytotoxicity. We speculate that there might be two explanations for this. First, the molecular phenotyping platform contains well-curated genes that reflect cytotoxicity and cell death, which were highlighted in a previous drug screening study using iPSC-derived cardiomyocytes (26). Therefore, we can anticipate that these genes are used by linear regression to predict cytotoxicity. Second, teratogenicity is notably complex. It can be caused in many different subtle ways, with many different perturbations leading to different downstream changes that are collectively known as teratogenicity. Therefore, a change in the total output of the germ-layer regulatory network as summarized by germ-layer factors is probably a more robust readout than individual genes, and random forest, which is an ensemble learning method, is better at detecting heterogeneous signals than linear regression.
Furthermore, we used pharmacological data to show that knowing the target profiles of drug candidates is likely not sufficient to predict their teratogenicity potential; an in vitro assay like TeraTox is therefore necessary. Last but not least, we combined Bayesian network analysis, beta linear regression, and sensitivity analysis to show that while TeraTox is sensitive to damage to ectoderm development, further work is required to better model mesoderm and endoderm development.
Given the advantages of TeraTox over mEST, and considering the distinct sensitivity and specificity profiles of the two assays, we can imagine three possible scenarios for their routine use in drug discovery: TeraTox replacing mEST, TeraTox running alongside mEST, or the two assays running sequentially. We believe that while the first option is our long-term goal, the last option of running them sequentially may currently be the best solution. Our analysis showed that if we use the mEST assay first, and next run the TeraTox assay for compounds predicted negative by mEST, we gain improved prediction accuracy, sensitivity, and specificity. Further real-world testing is planned to validate the performance of this approach.
Further studies are warranted to explore several parallel paths to further optimize the TeraTox assay, which can be divided into three categories: paths leading to better characterization of EBs, paths leading to better predictive and explanatory algorithms, and paths leading to better biological models of human embryo development. To better characterize EBs, one apparent way is to perform multi-modal characterization of the EBs, including bulk and single-cell omics and morphological profiling. Extension of the assay duration to more than 7 days or use of other differentiation protocols may further improve TeraTox’s capacity to model mesoderm and endoderm development. Omics profiling of EBs may reveal the best condition.
There are several viable options to further improve the predictivity and the explainability of the TeraTox model. To better distinguish between non-teratogens and teratogens, we may test the compounds with the TeraTox assay at lower concentrations (especially for non-teratogens), where the lowest concentration should be predicted to have a teratogenicity score equal to or close to zero. Multi-modal data, if available, can be used to identify further relevant features beyond germ-layer genes and factors. As more and more data are collected, we may also optimize the prediction algorithm, for instance using nearest-neighbor prediction or other variants, to benefit from the data.
Finally, the TeraTox assay may benefit from better modelling of human embryo development. We may use alternative morphology-based assays of gastruloids to complement the TeraTox readout (96, 97). Alternatively, sophisticated microphysiological systems may better mimic the maternal-placenta-embryo axis and thereby recapitulate true embryo exposure levels (98–100). In the future they may replace the 3D embryoid bodies in TeraTox. At their current throughput, though, such systems will probably be more powerful as a secondary assay to spot-check a few compounds of particular interest. For this purpose, continuous integration and modelling of data on human embryogenesis, for instance from omics, imaging, and perturbation studies, is required to guide further optimization of the TeraTox assay (96,101,102).
5. Conclusion
In summary, we demonstrate that the TeraTox assay addresses several limitations of the industry-standard mEST assay regarding performance, species-specificity, and explainability. We believe that further optimization of the TeraTox assay and its routine use in drug-screening processes will lead us towards better preclinical assessment of teratogenicity.
7. Funding
This work was supported by CEFIC, the BMBF, EFSA, and the DK-EPA (MST-667-00205). It has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreements No. 681002 (EU-ToxRisk), No 964537 (RISK-HUNT3R), No. 964518 (ToxFree) and No. 825759 (ENDpoiNTs) and from Horizon Europe.
8. Disclosure Statement
Some authors (MJ, JDZ, SPL, NS, PB, EK, NC and SK) are employees of F. Hoffmann-La Roche Ltd, and all authors have nothing to disclose.
6. Acknowledgement
We thank Kevin Michaelsen, Claudia Bossen, and Jean-Christophe Hoflack for their generous support. JDZ thanks colleagues of the Bioinformatics and Exploratory Data Analysis (BEDA) team for their input and discussions.