## Summary

Apart from discriminative models for classification and object detection tasks, the application of deep convolutional neural networks to basic research using natural image data has been somewhat limited, particularly where a set of interpretable features for downstream analysis is needed, a key requirement for many scientific investigations. We present an algorithm and training paradigm designed specifically to address this: decontextualized hierarchical representation learning (DHRL). By combining a generative model chaining procedure with a ladder network architecture and latent space regularization for inference, DHRL addresses the limitations of small datasets and encourages a disentangled set of hierarchically organized features. In addition to providing a tractable path for analyzing complex hierarchical patterns using variational inference, this approach is generative and can be directly combined with empirical and theoretical approaches. To highlight the extensibility and usefulness of DHRL, we demonstrate this method in an application to a question from evolutionary biology.

## 1 Introduction

The application of deep convolutional neural networks (CNNs^{1}) to supervised tasks is quickly becoming ubiquitous, even outside of standardized visual classification tasks.^{2} In the life sciences, researchers are leveraging these powerful models for a broad range of domain-specific discriminative tasks such as automated tracking of animal movement,^{3–6} the detection and classification of cell lines,^{7–9} and mining genomics data.^{10}

A key motivation for the expanded use of deep feed-forward networks lies in their capacity to capture increasingly abstract and robust representations. However, building interpretability into these representations, beyond the objective function they were optimized on, is often difficult: networks naturally absorb all correlations found in the sample data, and the features which are useful for defining class boundaries can become highly complex (Figure S1). For many investigations, the main objective falls outside of a clearly defined detection or classification task, e.g. identifying a set of descriptive features for downstream analysis, and interpretability and generalizability are much more important. Because of this, in contrast to many traditional computer vision algorithms,^{11–14} the application of more expressive approaches built on CNNs and other deep networks to research has been limited^{15} (Figure 2).

Unsupervised learning, a family of algorithms designed to uncover unknown patterns in data without the use of labeled samples, offers an alternative for compression, clustering, and feature extraction using deep networks. Generative modeling techniques, e.g. generative adversarial networks (GANs^{16}) and variational autoencoders (VAEs^{17,18}), have been especially effective in capturing the complexity of natural images. VAEs in particular offer an intuitive way to analyze data. As an extension of variational inference, VAEs combine an inference model (typically a CNN) which performs amortized inference to approximate the true posterior distribution and encode samples into a set of latent variables (*q*_{ϕ}(*z*|*x*)), and a generative model which generates new samples from those latent variables (*p*_{θ}(*x*|*z*)). Instead of optimizing on a discriminative task, the objective function in VAEs is less strictly defined but typically seeks to minimize the reconstruction error between inputs *x* and outputs *p*_{θ}(*x*|*z*), *z* ∼ *q*_{ϕ}(*z*|*x*) (reconstruction loss), as well as the divergence between the distribution of latent variables *q*_{ϕ}(*z*|*x*) and the prior distribution *p*(*z*) (latent regularization).
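To make these two terms concrete, the following minimal sketch (in TensorFlow, which we use throughout) computes both losses for a standard Gaussian-prior VAE; `encoder` and `decoder` are hypothetical stand-ins, and the KL regularizer shown is the textbook variant (the models presented here instead use MMD regularization, Methods 4.2.5).

```python
# Minimal VAE loss sketch: `encoder` and `decoder` are hypothetical networks;
# the encoder returns the mean and log-variance of q_phi(z|x).
import tensorflow as tf

def vae_loss(x, encoder, decoder):
    mu, log_var = encoder(x)                      # amortized inference, q_phi(z|x)
    eps = tf.random.normal(tf.shape(mu))
    z = mu + tf.exp(0.5 * log_var) * eps          # reparameterization trick
    x_hat = decoder(z)                            # generative model, p_theta(x|z)
    # Reconstruction loss: pixel-wise error between inputs and outputs
    recon = tf.reduce_mean(tf.square(x - x_hat))
    # Latent regularization: divergence between q_phi(z|x) and the prior p(z)
    kl = -0.5 * tf.reduce_mean(1.0 + log_var - tf.square(mu) - tf.exp(log_var))
    return recon + kl
```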

### 1.1 Overcoming Hurdles to Application

In VAEs, two problems often arise which are of primary concern to researchers using natural image data. *1*) The mutual information between *x* and *z* can become vanishingly small, resulting in an uninformative latent code and overfitting to the sample data (the information preference problem^{22,24}); this is particularly true when using the powerful convolutional decoders needed to create realistic model output.^{20,23,24} *2*) In contrast to the hierarchical representations produced by deep feed-forward networks used for discriminative tasks, generative models emphasize local feature contexts at the cost of large-scale spatial relationships. This is a product of the restrictive mean-field assumption of pixel-wise comparisons, and it produces generative models capable of reproducing complex image features using only local feature contexts, without capturing higher-order spatial relationships within the latent encoding.^{22}

The basis of a more expressive and robust approach for investigating natural image data has some key requirements: *1*) provide a useful representation which disentangles factors of variation along a set of interpretable axes; *2*) capture feature contexts and hierarchical relationships; *3*) incorporate existing knowledge of feature importance and relationships between samples; *4*) allow for statistical inference of complex traits; and *5*) provide direct connections between analytical, virtual, and experimental approaches. Here we integrate meta-prior enforcement strategies taken from representation learning^{19} to specifically address the requirements of researchers using natural image data (Table 1).

To meet these requirements, we combine several such meta-prior enforcement strategies. VAEs with a ladder network architecture have been shown to better capture a hierarchy of features by mitigating the explaining-away problem of lower-level features, allowing for both bottom-up and top-down feedback.^{22} Additionally, combining pixel-wise error with a perceptual loss function^{25} adapted from neural style transfer^{26,27} may reduce the restrictive assumptions of amortized inference and pixel-wise reconstruction error by balancing them against more abstract measures of visual similarity.

In terms of latent regularization, a disentangled representation of causal factors requires an information-preserving latent code. Choosing a regularization technique which mitigates the trade-off between inference and data fit^{21} can encourage the disentanglement of generative factors along a set of variables in an interpretable way. We also propose a novel training paradigm inspired by GAN chaining, *decontextualized learning*, which further relaxes the natural covariances in the data and turns the restrictive assumptions of GAN generator networks to our advantage, overcoming the limitations of the small datasets typical of many studies in the natural sciences and further increasing the disentanglement of generative factors (Figure 1, Methods 4.2).

While several metrics have been proposed for assessing interpretability and disentanglement,^{28–30} these metrics rely heavily on associated labels, well-defined features, or stipulations from classification or detection competitions, e.g.^{31} In addition to being highly domain specific, for most practical investigations in the natural sciences these types of labels do not exist, and we must often rely on fundamentally qualitative assessments. In many cases, labeled data are not available, and interpreting traversals of the latent code (Figure S2) may introduce our own perceptual biases. Here, we adapt an approach from explainable AI, integrated gradients,^{32} to latent variable exploration, providing a direct assessment of latent variables and quantifying latent feature attributions without the need for labeled data or additional human biases (Methods 4.3).

We demonstrate the proposed framework using two example datasets: male guppy ornamentation and butterfly wing patterns from the discipline of sensory ecology and evolution (see Appendix A for motivation and background on existing approaches).

## 2 Results

While biological datasets are typically small, they are usually highly structured and standardized compared to large classification datasets (e.g. ImageNet^{33}). This provides an advantage for controlling noise and uninformative covariates in the data. Using a modified InfoGAN^{23} architecture, we incorporate prior knowledge about the structure of our sample data to generate realistic samples from the complex image distribution conditioned on a set of latent variables. Here, we incorporate prior knowledge about our samples of male guppy ornamentation images by providing a 32-class categorical latent code (Figure 3b, top right). These 32 classes represent the 32 individual tanks, unique subsets of the overall sample with shared traits related to guppy ornamentation patterns. The categories learned by the trained model possess unique features which also covary in the sample data, e.g. a distinct black bar and orange stripe which characterizes one guppy species, *P. wingei* (Figure S2, a). While generated samples share characteristics and even resemble known varieties, they possess decontextualized combinations of features across examples (Figure S2, a). We use these decontextualized samples as input to our variational (VAE) model for our “decontextualized” training paradigm.

GAN training and VAE training are performed in separate steps so that the models are not jointly optimized. The generated samples from the trained GAN model are used as training data for a variational model (Figure 1) with a hierarchical model architecture^{22} consisting of four latent codes (*z*_{1}, …, *z*_{4}) of 10 variables each, with increasing expressivity (Methods 4.2.2). We observe distinct clusters in the latent space of the trained model which correspond to sample categories and differ qualitatively from two existing methods (raw pixel and perceptual loss embeddings using tSNE,^{34,35} Figure 2). The four latent encodings capture unique factors of variation in the sample data in a scale-dependent way (Figure S2, Figure S3). In this model *z*_{1}, the latent code with the lowest capacity, captures local traits such as the color and intensity of discrete patches, e.g. *z*_{1,1} encodes variation in the intensity of an orange spot (Figure S2, b left). At higher levels (*z*_{2}, …, *z*_{4}), latent variables encode complex traits which combine multiple elements (Figure S2, b right). We use this same latent representation to describe the relationship between samples and calculate likelihood estimates. Samples with rare traits, such as the distinctly melanated “Tr5” strain in our sample data, cluster together in the embedded space and have a low sample likelihood (Figure 3).

Embedding the four 10-dimensional latent codes reveals scale-dependent relationships between elements. In *z*_{1}, color values and local features dominate the relationship between points (Figure S3, left). Nearest-neighbor samples (Minkowski distance^{36} in the 10-dimensional space, Figure S3, b) show color similarity, whereas higher-order features, e.g. patterning and morphology, determine the relationships between samples in the more expressive latent spaces (*z*_{2}, …, *z*_{4}). Though we find strong covariance between features across scales, in some cases the nearest-neighbor samples differ greatly depending on the scale and feature context (Figure S3, b).

We assess the level of disentanglement of our trained variational models using the metric established in,^{30} with known class labels as attribute classes (butterfly species, learned class from InfoGAN pre-training, and guppy strain varieties). Across models, we find that the most expressive latent codes (*z*_{4}) provide the highest degree of disentanglement between known classes, with the highest overall disentanglement score achieved by our decontextualized DHRL method (see Table 2).

We also provide a qualitative approach for attributing latent variables to image features using network gradients when labels are unknown (Methods 4.3). In Figure 4, a-d we visualize one variable (*z*_{1,3}) of *z*_{1}, the least expressive latent code of the DHRL-trained guppy latent variable model. We find that this latent variable controls the relative intensity of green color patches across individuals. Looking at a single variable of the more expressive latent codes, *z*_{2,7} of the trained butterfly model (Figure 4, e-h), we find that this latent variable controls the size of yellow patches on the lower wings relative to the size of yellow patches on the upper wings (when patches are not present, this variable has no effect; Figure 4, f). Further investigation of latent variables can be performed using the provided tool (https://github.com/ietheredge/VisionEngine/notebooks/IntegratedGradients.ipynb).

Using the latent representation, *z*, of our DHRL-trained variational model of guppy ornaments as input, we apply an evolutionary algorithm (Figure 5) defined by a fitness function from the guppy literature: more orange, higher-contrast males are preferred by females.^{37} Starting from a parent population initialized by our sample embedding (900 samples), we simulate 500 generations under these selective forces. We observe exaggerated and more numerous orange and black patches in novel configurations compared to the initial population (Figure 5, b). Projecting the latent representations of generations 1, 250, and 500, we find that instead of a single peak, after several generations many novel solutions are optimized (Figure 5, a). Investigating the values of the latent variables over generations reveals two distinct latent factors driven to fixation in the population under these selective forces (Figure S4). We also observe the population-level optimization of latent factors over time in Movie S5. Using a single Titan Xp GPU with 12GB memory, we could simulate a population size of 1000 individuals in an average of 19.5 seconds per generation.

## 3 Discussion

Supervised discriminative learning algorithms are already becoming an integral tool for researchers across disciplines, whereas unsupervised generative modeling approaches remain a relatively young and active area of machine learning research. Highly expressive generative models like the ones presented here are already transforming the way we interact with image data. By solving problems in a more general way, generative modeling approaches provide more direct connections between hypothesis testing and observation. Here, we demonstrate how these approaches may serve as an engine for more integrative studies of animal coloration patterns, and natural image data more generally, by directly connecting analytical, virtual, and empirical approaches.

Analytically, our approach captures important hierarchical features across spatial scales that existing approaches do not account for (Figure 2, Figure S3, Appendix A.1), it removes the inherent biases of predefined filters by learning features directly from the sample data, and it disentangles complex factors of variation into a useful, meaningful representation (Figure S2, Figure 4). More than compressing data into a low-dimensional space, this approach is generative and can create novel out-of-sample examples with high fidelity. This is a potentially transformative extension for researchers in the natural sciences not offered by existing approaches, allowing researchers to test analytical results both virtually, with simulated experiments, and empirically, using virtual reality playback experiments or observational studies (see Movie S6).

These techniques can be adapted to many domain-specific questions (see A.1 for a specific discussion regarding the potential impact of this approach on the study of color pattern evolution). As the latency between input and output decreases in video playback experiments, integrating instantaneous behavioral feedback and in-the-loop methods for hypothesis testing may be used to design complex real-time assays. More sophisticated virtual experiments may also incorporate agent-based models and evolutionary algorithms working directly on the latent representation to create complex simulations (e.g. as in,^{38} Figure 5). In our demonstration, we are able to simulate 1000 individuals in under 20 seconds per generation with very little optimization, and asynchronous approaches may already be possible. Analytically, as research in machine learning aimed at understanding how information is organized and used by algorithms advances, a growing theoretical framework with a basis in statistical mechanics^{39} and information theory^{40} may provide additional avenues for investigating the statistical properties of color pattern spaces and their evolution.

## 4 Experimental Procedures

### 4.1 Materials Availability

Guppy images were collected from a maintained stock at the University of Wuerzburg under authorization 568/300-1870/13 of the Veterinary Office of the District Government of Lower Franconia, Germany, in accordance with the German Animal Protection Law (TierSchG). Individuals were imaged on a white background with fixed lighting conditions^{41} using a Canon D600 digital camera. Images were downsampled and center-cropped to a final size of 256 × 256 pixels. The dataset consists of 977 standardized RGB images across three species and 13 individual strains.

Butterfly images were downloaded from the Natural History Museum, London under a creative commons license (DOIs: https://doi.org/10.5519/qd.gvq3p7xq, https://doi.org/10.5519/qd.pw8srv43). This dataset consists of 9531 RGB images.

For each dataset, we segmented samples from the background using a customized object segmentation network adapted from^{42}. For each dataset we annotated 8 samples to train the segmentation network. All samples were cropped and resized to 256 × 256 pixels and placed on a transparent background (RGBA). For calculating the perceptual loss during training, images were translated to 3-channel images with a white background using alpha blending.

Updated links to original data repositories can be accessed here: https://github.com/ietheredge/VisionEngine/README.md.

#### 4.1.1 Data and Code Availability

All models were implemented using Tensorflow 2.2 and can be accessed here: https://github.com/ietheredge/VisionEngine, including installation and evaluation scripts to reproduce our results. Instructions for creating new data loaders for training new datasets using this method can be found at https://github.com/ietheredge/VisionEngine/data_loaders/datasets/README.md.

### 4.2 Key Methods

DHRL relies on a three-step process of sequential training: first, a generative adversarial network is trained to transform a noise sample into realistic out-of-sample examples; next, a variational autoencoder is pre-trained on the generated samples; finally, the pretrained variational model is fine-tuned on the original samples.

#### 4.2.1 InfoGAN

We use an unsupervised approach to disentangle discrete and continuous latent factors adapted from^{23} (InfoGAN), which modifies the minimax game typically used for training GANs such that:

$$\min_{G,Q}\max_{D}\; V_{\text{InfoGAN}}(D, G, Q) = V(D, G) - \lambda L_{I}(G, Q)$$

where *V*(*D, G*) is the original GAN objective introduced in^{16} and *L*_{I}(*G, Q*) approximates the lower bound of the mutual information *I*(*c*; *G*(*z, c*)) using Monte Carlo sampling such that *L*_{I}(*G, Q*) ≤ *I*(*c*; *G*(*z, c*)).^{23} Like the generator *G* and discriminator *D*, *Q* is parameterized as a neural network and shares all convolutional layers with *D*.

Both discrete, *Q*(*c*_{d}|*x*), and continuous, *Q*(*c*_{c}|*x*), latent codes are provided, with continuous latent codes treated as factored Gaussian distributions. Importantly, InfoGAN does not require supervision and no labels are provided, cf.^{29}
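A hedged sketch of the mutual-information term for a discrete code follows; `generator` and `q_network` are hypothetical stand-ins for *G* and *Q*, and the continuous-code (Gaussian log-likelihood) term is omitted.

```python
# Sketch of the InfoGAN mutual-information term L_I(G, Q) for a discrete code;
# `generator` and `q_network` are hypothetical stand-ins for G and Q.
import tensorflow as tf

def mutual_info_loss(c_onehot, z_noise, generator, q_network):
    """Monte Carlo estimate of the (negated) lower bound L_I(G, Q)."""
    x_fake = generator(tf.concat([z_noise, c_onehot], axis=-1))
    q_logits = q_network(x_fake)   # Q(c|x); Q shares all conv layers with D
    # Cross-entropy between the sampled code c and Q's posterior over c;
    # minimizing it maximizes the lower bound on I(c; G(z, c)).
    return tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=c_onehot, logits=q_logits))
```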

We substitute the original generator and discriminator models from^{23} with the architecture described in^{44} and increase the flexibility of the latent code, providing additional continuous and discrete latent codes. For guppy experiments, we provide two continuous and 19 discrete codes (samples were drawn from 19 paternal lines). For the basis noise vector input to the generator, we used a 100-unit random noise vector.

#### 4.2.2 Variational Ladder Autoencoder

In contrast to hierarchical architectures, e.g.,^{45,46} we learn a hierarchy of features by using multiple latent codes with increasing levels of abstraction,^{22} i.e. *q*_{ϕ}(*z*_{1}, …, *z*_{L}|*x*). The expressivity of *z*_{i} is determined by its depth. The encoder *q*_{ϕ}(*z*_{1}, …, *z*_{L}|*x*) consists of four blocks such that:

$$h_{\ell} = G_{\ell}(h_{\ell-1}), \qquad z_{\ell} \sim \mathcal{N}\!\left(\mu_{\ell}(h_{\ell}),\, H_{\ell}(h_{\ell})\right), \qquad h_{0} = x$$

where *H*_{ℓ}, *G*_{ℓ}, and *µ*_{ℓ} are neural networks. For our encoder model, *G*_{ℓ} is a stack of convolutional, batch normalization, and leaky rectified linear unit activation layers (Conv-BN-LeakyReLU); we stack four Conv-BN-LeakyReLU blocks for each *G*_{ℓ} with an increasing number of channels for each subsequent convolutional layer, i.e. N-channels/2, N-channels, N-channels, N-channels*2 where N-channels is 16, 64, 256, 1024 for *G*_{1}, *G*_{2}, *G*_{3}, *G*_{4} respectively. We apply spectral normalization to all convolutional layers (see below). Because we want to preserve feature localization, we use average pooling followed by a squeeze-excite block to apply a context-aware weighting to each channel (see below).
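The following sketch outlines one encoder rung *G*_{ℓ} under the channel scheme described above; spectral normalization wrappers and the squeeze-excite and variational heads (see below) are omitted for brevity, and layer hyperparameters not stated in the text (e.g. kernel size) are assumptions.

```python
# Sketch of one encoder rung G_l: four Conv-BN-LeakyReLU blocks with
# n/2, n, n, 2n channels, followed by average pooling; the squeeze-excite
# and variational heads are applied afterwards. Kernel size is an assumption.
import tensorflow as tf
from tensorflow.keras import layers

def encoder_rung(n_channels):
    block = tf.keras.Sequential()
    for ch in (n_channels // 2, n_channels, n_channels, n_channels * 2):
        block.add(layers.Conv2D(ch, 3, padding='same'))
        block.add(layers.BatchNormalization())
        block.add(layers.LeakyReLU())
    block.add(layers.AveragePooling2D())  # downsample while preserving localization
    return block

# e.g. the four rungs G_1..G_4 with N-channels = 16, 64, 256, 1024:
rungs = [encoder_rung(n) for n in (16, 64, 256, 1024)]
```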

Similarly, the decoder, *p*_{θ}(*x*|*z*_{1}, …, *z*_{L}), is composed of blocks such that:

$$\tilde{z}_{L} = U_{L}(z_{L}), \qquad \tilde{z}_{\ell} = U_{\ell}\!\left([\tilde{z}_{\ell+1};\, z_{\ell}]\right), \qquad \hat{x} = \tilde{z}_{1}$$

where [·; ·] denotes channel-wise concatenation. Parallel to *G*_{ℓ}, the decoder blocks *U*_{ℓ} are composed of Conv-BN-ReLU blocks (note the use of ReLU and not LeakyReLU in the decoder) with a decreasing number of channels in each convolutional layer, i.e. N-channels*2, N-channels, N-channels, N-channels/2 where N-channels is 1024, 256, 64, 16. No spectral normalization wrappers or squeeze-excite layers are applied in the decoder.

#### 4.2.3 Squeeze-Excite Layers

Squeeze-and-Excitation Networks^{47} were proposed to improve feature interdependence by adaptively weighting each channel within a feature map based on filter relevance, applying a channel-wise recalibration. Here we apply squeeze-excite (SE) layers prior to the variational layer such that each embedding *z*_{i} captures features with cross-channel dependencies. Each SE layer consists of a global average pooling layer which averages channel-wise features, followed by two fully connected layers with ReLU activations, the first with size channels/16 and the second with the same size as the number of input channels. Finally, a sigmoid “excite” layer assigns channel-wise weights which are then multiplied channel-wise with the original inputs.
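A minimal functional sketch of this block follows; apart from the structure described above, details such as bias usage are assumptions.

```python
# Minimal squeeze-excite block: global average pool ("squeeze"), two fully
# connected layers, and a sigmoid channel weighting ("excite").
import tensorflow as tf
from tensorflow.keras import layers

def squeeze_excite(x, ratio=16):
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                # channel-wise means
    s = layers.Dense(channels // ratio, activation='relu')(s)
    s = layers.Dense(channels, activation='sigmoid')(s)   # channel-wise weights
    s = layers.Reshape((1, 1, channels))(s)
    return x * s                                          # recalibrate each channel
```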

#### 4.2.4 Reconstruction Loss

We minimize the negative log likelihood of the sample data by minimizing the mean squared error between input and output, jointly optimizing the reconstruction loss for each sample *x*:

$$\mathcal{L}_{\text{pixel}} = \lVert x - \hat{x} \rVert_{2}^{2}$$

To relax the restrictive mean-field assumption which is implicit in minimizing the pixel-wise error, we jointly optimize the similarity between inputs and outputs using intermediate layers of a pretrained network, VGG16,^{43} as feature maps.^{25–27} Here we calculate the Gram matrices of feature maps, which match the feature distributions of real and generated outputs for each layer as:

$$\mathcal{L}_{\text{perceptual}} = \sum_{\ell} \left\lVert G^{\ell}(x) - G^{\ell}(\hat{x}) \right\rVert_{F}^{2}, \qquad G^{\ell}_{ab} = \sum_{c,d} F^{\ell}_{a,cd}\, F^{\ell}_{b,cd}$$

for feature maps *F*_{a} and *F*_{b} in layer *ℓ* across locations *c, d*. This measures the correlation between image filters and is equivalent to minimizing the distance between the distribution of features across feature maps, independently of feature position.^{48}
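The following sketch illustrates this Gram-matrix perceptual loss; the specific VGG16 layers selected here are assumptions, not necessarily the set used in our implementation, and inputs are expected to be preprocessed (alpha-blended to RGB, VGG-style scaling) beforehand.

```python
# Sketch of the Gram-matrix perceptual loss on intermediate VGG16 feature maps;
# the chosen layer names are assumptions, and input preprocessing is omitted.
import tensorflow as tf

vgg = tf.keras.applications.VGG16(include_top=False, weights='imagenet')
layer_names = ['block1_conv2', 'block2_conv2', 'block3_conv3']  # assumed set
feature_model = tf.keras.Model(
    vgg.input, [vgg.get_layer(n).output for n in layer_names])

def gram_matrix(features):
    # Correlations between filters, summed over spatial locations c, d
    gram = tf.einsum('bijc,bijd->bcd', features, features)
    n_locations = tf.cast(tf.reduce_prod(tf.shape(features)[1:3]), tf.float32)
    return gram / n_locations

def perceptual_loss(x, x_hat):
    loss = 0.0
    for f_real, f_fake in zip(feature_model(x), feature_model(x_hat)):
        loss += tf.reduce_mean(tf.square(gram_matrix(f_real) - gram_matrix(f_fake)))
    return loss
```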

The combined reconstruction loss is a weighted sum of the perceptual loss and pixel-wise error:

$$\mathcal{L}_{\text{reconstruction}} = \alpha\, \mathcal{L}_{\text{perceptual}} + \beta\, \mathcal{L}_{\text{pixel}}$$

where *α* and *β* are Lagrange multipliers controlling the influence of each loss term. Here we set *α* = 1e-6 and *β* = 1e5 to balance the contribution of the reconstruction terms with the variational loss (see below).

#### 4.2.5 Maximum Mean Discrepancy

We use the maximum mean discrepancy approach (MMD)^{21} to maximize the similarity between the statistical moments of *p*(*z*) and *q*_{ϕ}(*z*|*x*) using the kernel embedding trick:

$$\text{MMD}(p \,\|\, q) = \mathbb{E}_{p(z),\,p(z')}\!\left[k(z, z')\right] - 2\,\mathbb{E}_{q(z),\,p(z')}\!\left[k(z, z')\right] + \mathbb{E}_{q(z),\,q(z')}\!\left[k(z, z')\right]$$

using a Gaussian kernel, *k*(*z, z*′), such that

$$k(z, z') = \exp\!\left(-\frac{\lVert z - z' \rVert^{2}}{2\sigma^{2}}\right)$$

to measure the similarity between *p*_{θ}(*z*) and *q*_{ϕ}(*z*) in Euclidean space. We measured similarity using multiple kernels with varying degrees of smoothness, controlled by the value of *σ*^{2}, i.e. multi-kernel MMD,^{49} with varying bandwidths: *σ*^{2} = 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 5, 10, 15, 20, 25, 30, 35, 100, 1e3, 1e4, 1e5, 1e6.
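A minimal sketch of the multi-kernel estimator follows; the abbreviated bandwidth list is illustrative, standing in for the full set of *σ*^{2} values above.

```python
# Minimal multi-kernel MMD sketch; the bandwidth list here is a short
# illustrative subset of the sigma^2 values given above.
import tensorflow as tf

def gaussian_kernel(a, b, sigma_sq):
    # k(z, z') = exp(-||z - z'||^2 / (2 sigma^2)) for all pairs of rows
    sq_dist = tf.reduce_sum(tf.square(a[:, None, :] - b[None, :, :]), axis=-1)
    return tf.exp(-sq_dist / (2.0 * sigma_sq))

def mmd(z_prior, z_posterior, bandwidths=(1e-3, 1e-1, 1.0, 10.0, 1e3)):
    loss = 0.0
    for s in bandwidths:
        loss += (tf.reduce_mean(gaussian_kernel(z_prior, z_prior, s))
                 - 2.0 * tf.reduce_mean(gaussian_kernel(z_prior, z_posterior, s))
                 + tf.reduce_mean(gaussian_kernel(z_posterior, z_posterior, s)))
    return loss
```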

The influence of the MMD term on the combined objective function is controlled by a Lagrange multiplier *λ* applied across each latent code, giving the combined objective:

$$\mathcal{L} = \mathcal{L}_{\text{reconstruction}} + \lambda \sum_{i=1}^{L} \text{MMD}\!\left(p(z_{i}) \,\|\, q_{\phi}(z_{i})\right)$$

where *L* is the number of hierarchical latent codes, *z*_{i} is the n-dimensional latent code, the prior *p*(*z*_{i}) = 𝒩(0, **I**), and ℒ_{reconstruction} is defined above. Here, we set *λ* = 1.

#### 4.2.6 Denoise Training

In addition to further relaxing the contribution of pixel-wise error, adding a denoising criterion has been shown to yield better sample likelihoods by learning to map both training data and corrupted inputs to the true posterior, providing more robust training for out-of-sample data.^{50} We implement this with the addition of a noise layer which samples a corrupted input from input *x* before passing it to the encoder. We apply random binomial noise (salt and pepper) to ten percent of pixels.
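A sketch of this corruption layer follows, assuming whole pixels (all channels) are corrupted and split evenly between salt and pepper.

```python
# Sketch of the salt-and-pepper corruption applied before encoding; we assume
# whole pixels (all channels) are corrupted, half set to 1.0 and half to 0.0.
import tensorflow as tf

def salt_and_pepper(x, rate=0.10):
    pixel_shape = tf.shape(x)[:-1]                       # (batch, height, width)
    corrupt = tf.random.uniform(pixel_shape) < rate      # ~10% of pixels
    salt = tf.cast(tf.random.uniform(pixel_shape) < 0.5, x.dtype)
    noise = tf.expand_dims(salt, -1) * tf.ones_like(x)   # 1.0 (salt) or 0.0 (pepper)
    return tf.where(tf.expand_dims(corrupt, -1), noise, x)
```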

#### 4.2.7 Spectral Normalization

Spectral normalization has been proposed as a method to stabilize GAN training by preventing exploding gradients when using rectified linear units; rather than gradient clipping, it applies a global regularization to the weight matrix of each layer to provide bounded first derivatives (the Lipschitz constraint).^{51}
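The sketch below shows the core power-iteration step for a single weight tensor; in practice a prebuilt wrapper (e.g. TensorFlow Addons' `tfa.layers.SpectralNormalization`) can apply the same normalization to each convolutional layer, and `u` is the persistent power-iteration vector carried between updates.

```python
# Power-iteration sketch of spectral normalization for one weight tensor;
# `u` has shape (prod(kernel dims), 1) and persists across training steps.
import tensorflow as tf

def spectral_normalize(w, u, n_iters=1):
    w_mat = tf.reshape(w, [-1, w.shape[-1]])    # flatten kernel to a 2-D matrix
    for _ in range(n_iters):
        v = tf.math.l2_normalize(tf.matmul(w_mat, u, transpose_a=True))
        u = tf.math.l2_normalize(tf.matmul(w_mat, v))
    # Estimate of the largest singular value (spectral norm) of w_mat
    sigma = tf.matmul(tf.matmul(u, w_mat, transpose_a=True), v)
    return w / sigma, u   # weights rescaled to satisfy the Lipschitz constraint
```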

### 4.3 Latent Feature Attribution and Disentanglement

Understanding the importance of features for model predictions is an active area of research. Integrated gradients, introduced by,^{32} assigns feature importance, determining causal relationships between predictions and image features by summing the gradients along paths between a baseline *x*′ and *x*. We adapt this procedure to investigate the contribution of each latent variable, using as a baseline *z*′ the encoding of a single sample *x*, iterating *z*_{j} while holding all other *z*_{l} constant and summing the gradients of the decoder *p*_{θ}(*x*|*z*) such that:

$$A_{ij} = (z_{j} - z_{j}')\, \sum_{k=1}^{m} \frac{\partial\, p_{\theta}\!\left(x \,\middle|\, z' + \tfrac{k}{m}(z - z')\right)_{i}}{\partial z_{j}} \cdot \frac{1}{m}$$

where *j* is the axis of the latent code being interpolated, *i* is the individual feature (pixel), *p*_{θ}(*x*|*z*) is the reconstructed output, *p*_{θ}(*x*|*z*′) is the baseline reconstructed output, *k* is the perturbation constant, and *m* is the number of steps in the approximation of the integral. We use the Riemann sum approximation of the integral over the interpolated path *P*, which involves computing the gradient in a loop over the inputs for *k* = 1, …, *m*. Here, we use *m* = 300 and *k* = 2 max(|*z*|) for each *z*_{j}, starting from a baseline *p*_{θ}(*x*|*z*′) : *z*_{j}′ = − max(|*z*|).
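A sketch of this latent attribution procedure is given below; it computes per-pixel attributions for a single latent axis *j* using a full Jacobian at each step, which favors clarity over speed, and the function name is a placeholder.

```python
# Sketch of latent integrated gradients for a single latent axis j; uses a
# full Jacobian per step for per-pixel attributions (clear, not fast).
import tensorflow as tf

def latent_attribution(decoder, z, j, z_min, z_max, m=300):
    """z: (1, n_latents); interpolate axis j from z_min to z_max."""
    grads = []
    for k in range(1, m + 1):
        z_step = tf.tensor_scatter_nd_update(
            z, [[0, j]], [z_min + (k / m) * (z_max - z_min)])
        with tf.GradientTape() as tape:
            tape.watch(z_step)
            x_hat = decoder(z_step)              # p_theta(x|z) along the path
        jac = tape.jacobian(x_hat, z_step)       # d x_hat_i / d z for every pixel i
        grads.append(jac[..., 0, j])             # keep only the interpolated axis
    # Riemann-sum approximation of the path integral, scaled by the path length
    return (z_max - z_min) * tf.reduce_mean(tf.stack(grads), axis=0)
```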

We use the technique developed in^{30} for assessing disentanglement, measuring the relative entropy of latent factors for predicting class labels. The disentanglement *D*_{i} of each latent code is measured by *D*_{i} = (1 − *H*_{K}(*P*_{i})), where *H*_{K} is the entropy and *P*_{i} is the relative importance of the generative factor. We also include a metric of completeness, *C*_{j}, approximating the degree to which the generative factor is captured by a single latent variable, where *C*_{j} = (1 − *H*_{D}(*P*_{j})) and *P*_{j} is the unweighted contribution of generative factors.^{30} Here, in the absence of labeled features, we use species (butterflies), breeding line variants (guppies), and the predicted class of the generative model (generated guppies, Methods 4.2.1, above) for each model as approximate class labels (one class). This approximation naturally overestimates *D*_{i} and underestimates *C*_{j}, as there is some overlap between classes in terms of visual features (see Figure 2, Figure S3). While^{30} proposes a third term, informativeness *I*, to evaluate representations, we found that this value was highly coupled to the choice of the Lagrange multiplier *λ* used for latent regularization (above).
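The following sketch illustrates the entropy-based disentanglement score under our one-class-label approximation, assuming (as one common instantiation of^{30}) that random-forest feature importances stand in for the relative importance matrix *P*.

```python
# Sketch of the entropy-based disentanglement score D = 1 - H_K(P) under the
# one-class-label approximation; random-forest importances stand in for P.
import numpy as np
from scipy.stats import entropy
from sklearn.ensemble import RandomForestClassifier

def disentanglement_score(z, labels):
    """z: (n_samples, n_latents) latent code; labels: approximate class labels."""
    clf = RandomForestClassifier(n_estimators=100).fit(z, labels)
    p = clf.feature_importances_          # relative importance of each latent
    p = p / p.sum()
    # Entropy with base K (number of latents) normalizes H_K(P) to [0, 1];
    # low entropy means the factor loads on few latent variables.
    return 1.0 - entropy(p, base=len(p))
```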

### 4.4 Simulating Evolution on the Latent Space

For demonstrating an example virtual experiment, we use a genetic algorithm with a parent population of 1000 random samples, evolved over 500 generations. Parent samples are randomly initialized across the latent variables of each latent code. Fitness was calculated as an equally weighted sum of the total percentage of pixels within two ranges (orange rgb(0.9, 0.55, 0.) > rgb(1., 0.75, 0.1) and black rgb(0., 0., 0.) < rgb(0.2, 0.2, 0.2)) measured on the generated output, a simplification of empirical results from the literature.^{37,52} During each generation, the predicted fitness of each sample in the population was measured by the fitness of the nearest neighboring value in a reference table (for processing speed). To simulate weak selective pressure on the fitness function, we drew 500 random parent subsamples weighted by their proportional fitness. An additional 200 samples were drawn without the proportional fitness weighting. From the resulting 700 subsamples in each generation we drew 300 random pairs; the “alleles” from each sample (the specific latent variable values) were chosen randomly with equal probability to create a combined offspring of the two samples. Each combined offspring then had two alleles randomly mutated, one by drawing from a random normal distribution and the other by replacing an existing value with zero (similar to destabilizing and stabilizing mutations). The next generation thus consisted of 1000 samples: 700 parent samples + 300 offspring. This process was repeated for 500 generations.
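A condensed sketch of this loop follows; `fitness_fn` stands in for the orange/black pixel-percentage fitness measured on decoded outputs (via the reference table above), the latent dimensionality of 40 follows from the four 10-dimensional codes, and all unstated details are simplifications.

```python
# Condensed sketch of the latent-space genetic algorithm; `fitness_fn` returns
# non-negative fitness per row, and bookkeeping follows the text above.
import numpy as np

def evolve(fitness_fn, n_latents=40, pop_size=1000, generations=500):
    pop = np.random.randn(pop_size, n_latents)        # random parent population
    for _ in range(generations):
        fit = fitness_fn(pop)
        # Weak selection: 500 fitness-weighted draws + 200 unweighted draws
        sel = np.random.choice(len(pop), 500, p=fit / fit.sum())
        lucky = np.random.choice(len(pop), 200)
        parents = pop[np.concatenate([sel, lucky])]
        # 300 offspring; each "allele" comes from either parent with p = 0.5
        pairs = np.random.randint(len(parents), size=(300, 2))
        mask = np.random.rand(300, n_latents) < 0.5
        kids = np.where(mask, parents[pairs[:, 0]], parents[pairs[:, 1]])
        rows = np.arange(300)
        # Two random mutations per offspring: destabilizing and stabilizing
        kids[rows, np.random.randint(n_latents, size=300)] = np.random.randn(300)
        kids[rows, np.random.randint(n_latents, size=300)] = 0.0
        pop = np.concatenate([parents, kids])         # next generation: 700 + 300
    return pop
```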

## 5.1 Author contributions

RIE conceived the approach and designed the methodology; MS and RIE collected sample data. RIE wrote the manuscript. AJ secured funding. All authors contributed to editing and approving the manuscript.

## 5.2 Declaration of Interests

The authors have no financial or non-financial competing interests.

## 5.3 Acknowledgments

We would like to thank members of the Dept. of Collective Behavior, Max Planck Institute of Animal Behavior and Centre for the Advanced Study of Collective Behaviour, University of Konstanz for comments on earlier versions of the manuscript as well as the Max Planck Computing and Data Facility for use of computational resources.

## Appendix A Example Application to the Evolution of Color Patterns: Background

The incredible variety of color patterns seen in nature evolved under the selective forces imposed by the environment, and the visual experience of their receivers.^{53–59} Quantifying this diversity, and reliably testing the functional significance of these traits is fundamental to understanding fitness landscapes^{60} and underlies many subdisciplines of sensory ecology, cognitive neuroscience, collective behavior, and evolution.

Creating quantitative descriptions of color patterns which take into account the unique sensory and semiotic worlds of their receivers^{61} has been a central challenge in visual ecology. Many tools have been developed: Quantitative Colour Pattern Analysis,^{62} PAVO,^{63} Natural Pattern Match,^{64} among others.^{65–73} Each of these tools uses one or an ensemble of complementary metrics from image analysis and computer vision, e.g. image statistics, edge detection, and landmark-based filters.^{14}

Still, fundamental gaps remain. One of these gaps is the difficulty of building quantitative descriptions of complex features with multiple subelements. Most existing approaches fail to capture the full complexity of many color patterns; the algorithms themselves are insufficiently expressive. This is particularly true when spatial or scale-dependent relationships between features exist, e.g. the irregular patterns of male guppy ornamentation or butterfly wing patterns, where similar sets of elements are arranged in species-specific configurations.^{74} Recently, researchers have begun employing machine learning algorithms such as non-linear dimensionality reduction, e.g. t-distributed stochastic neighbor embedding (t-SNE,^{34,35} Figure 2), and deep neural networks (Figure 2,^{15} Figure S1). Still, while these techniques can better represent more complex relationships between pixel values within an image, current implementations do not disentangle features across scales or provide extensions to downstream experiments.

While complex traits may be difficult to quantify, they are nonetheless biologically relevant in terms of feature context^{75–80} and the perception of shape, motion, and attention.^{57–59,81–83} In the brain, we know that perception is hierarchically organized,^{84} and representations made at higher levels of the visual cortex and its homologs heavily influence the perception of low-level features.^{85,86} While measuring local features across an image provides important insight on regularity and the nature of wide-field variation, a collection of local feature descriptions across space is fundamentally different from a feature description built across scales.

Another gap is in building direct connections between approaches. Establishing spectral sensitivity, acuity, and feature importance is typically done using stimulus playback experiments or behavioral assays. However, beyond using statistical descriptions of features to guide researchers in the creation of stimuli, there are few explicit connections between analysis and experiment. The current state of the art, immersive virtual reality (VR) and low-latency playback experiments with fully animated, photo-realistic 3D models, provides a rich experimental basis for investigating the relationship between visual inputs, neural activity, and behavior.^{87–90} VR systems are also beginning to better account for species-specific sensory biases including photoreceptor sensitivity, flicker fusion rate, acuity, and depth perception.^{89,91} Still, these approaches currently rely on human-in-the-loop interventions for creating stimuli with even moderate complexity.

Additionally, because color pattern traits have evolved under selective pressure from multiple receivers, establishing these types of evolutionary trade-offs is important to our understanding. However, experimental approaches often require large, highly disruptive manipulations such as translocation experiments or large-scale crossbreeding experiments. Simulations and virtual experiments may better allow researchers to be explicit about the stimulus being tested and greatly reduce the number of subjects needed (Methods 4.4).

### A.1 The potential impacts of this approach on the study of evolution

This platform may be used to address many outstanding questions regarding the functional significance of color pattern traits; here, we discuss some of these questions. *1*) What are the constraints on the evolvability of a given trait? By identifying the topographical relationship between different traits within the color pattern space, we can test predictions about the selective forces acting on them related to their geometric relationships, e.g. the axes of variation in traits meant to communicate viability should show increased orthogonality compared to co-occurring traits which have evolved under a Fisherian process.^{92–97} *2*) Categorical perception is an important perceptual mechanism for understanding the evolution of color signals.^{98} But in systems where color patterns are used for mimicry^{53,55,99} or novelty, investigating the boundaries between complex traits is fundamental. By performing traversals across the distribution of the latent variables, interpolating between samples can allow for tests of continuous^{100} versus categorical perception^{101} of complex traits. *3*) Many color pattern traits have evolved under selective pressure from multiple receivers, e.g. both females and predators shape the diversity of male guppy ornaments.^{102} Establishing these types of evolutionary trade-offs is difficult and often requires large, highly disruptive manipulations such as translocation experiments. Using evolutionary models similar to the ones presented here, researchers can simulate multiple fitness landscapes and evolutionary trajectories simultaneously to perform a broad range of virtual experiments. Importantly, while each of these examples places either analytical, experimental, or virtual results at the center, by using the platform presented here they maintain direct connections across approaches. Furthermore, they can incorporate existing techniques^{67–73} as image preprocessing routines, during playback, or as constraints on virtual experiments.

## Appendix B Supplemental Figures

Figure S5: *Movie 1: The combined pattern space over 500 generations, visualized in 2D using tSNE*

Figure S6: *Movie 2: VR animation of learned coloration pattern models to an animated guppy for virtual playback.*