Augmentation of Virtual Endoscopic Images with Intra-operative Data using Content-Nets

Medical imaging applications are challenging for machine learning and computer vision methods, in general, for two main reasons: it is difficult to generate reliable ground truth, and databases are usually too small for training state-of-the-art methods. Virtual images obtained from computer simulations could be used to train classifiers and validate image processing methods if their appearance were comparable (in texture and color) to the actual appearance of intra-operative medical images. Recent works focus on style transfer to generate artistic images by combining the content of one image and the style of another. A main challenge is the generation of pairs with similar content that ensure preservation of anatomical features, especially across multi-modal data. This paper presents a deep-learning approach to content-preserving style transfer of intra-operative medical data for realistic virtual endoscopy. We propose a multi-objective optimization strategy for Generative Adversarial Networks (GANs) to obtain content-matching pairs that are blended using a siamese u-net architecture (called Content-net) that uses a measure of the content of activations to modulate skip connections. Our approach has been applied to transfer the appearance of bronchoscopic intra-operative videos to virtual bronchoscopies. Experiments assess images in terms of both content and appearance and show that our simulated data can substitute intra-operative videos for the design and training of image processing methods.

State-of-the-art methods in computer vision need huge amounts of data with unambiguous annotations for their training. In the context of medical imaging this is, in general, a very difficult task due to limited access to clinical data, the time required for manual annotations and variability across experts. The particular field of intervention guiding has the extra difficulty that intra-operative recordings probably require the alteration of standard protocols. These facts have encouraged the development of computational methods for the simulation of realistic medical imaging data from the available scans. In this context, virtual endoscopy could be useful to train intervention support methods.

In order to obtain realistic data useful for data augmentation and validation of machine learning and image processing methods, simulations should resemble intra-operative recordings. We consider that style transfer could be used to endow virtual endoscopic images with the content and texture of intra-operative videos using modern techniques for artistic style transfer.

The basic idea of style transfer [1,2] is to transform the appearance of input images according to one or more style images while preserving the structure and content of the input images. Recent works [3,4] have shown the power of Generative Adversarial Networks (GANs) and Convolutional Neural Networks (CNNs) in general for artistic style transfer.

The main difference between artistic style transfer and realistic simulations of endoscopic procedures is that, in the latter, stylized images should preserve the structure of simulated data. This is because such structure encodes the anatomical content of the image and, thus, should be preserved.

This work addresses the generation of realistic endoscopic images using intra-operative video data to augment the appearance of virtual endoscopy. We present a two-stage method based on CNNs that maps virtual images to the intra-operative domain while preserving their anatomical content.

Related work

State-of-the-art techniques for artistic style transfer based on CNNs use a system of two different neural networks (a generative network and a discriminative one) to obtain stylized images that achieve a compromise between preservation of the input image content and the texture and appearance of the style images. The first network is an auto-encoder that generates stylized images from the input images. The output of this auto-encoder is the input of a discriminative network that classifies between style and input images to assess how well the appearance of stylized images matches the actual style. This general scheme has several variants concerning, especially, the kind of loss function used to train the network.

In [3,5] the generative network minimizes a loss function that includes two terms. One term (feature reconstruction loss) penalizes stylized images that deviate in content from input images, while the second one (style reconstruction loss) measures the similarity between stylized images and style appearance. The feature reconstruction loss is given by the $L_2$ difference between feature maps of input and stylized images. The [...] not necessarily preserve its spatial structure (as illustrated in Fig 1(b)).
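
The feature reconstruction loss described above can be illustrated with a minimal sketch. It assumes the feature maps are already available as nested lists of equal shape and takes the plain Euclidean norm of their difference; the function name `feature_reconstruction_loss` and the toy values are hypothetical, and any normalization used in [3,5] is not reproduced here.

```python
import math
from typing import Sequence

# One 2-D channel: rows of activation values (assumed layout).
FeatureMap = Sequence[Sequence[float]]


def feature_reconstruction_loss(input_maps: Sequence[FeatureMap],
                                stylized_maps: Sequence[FeatureMap]) -> float:
    """L2 norm of the difference between two stacks of feature maps."""
    if len(input_maps) != len(stylized_maps):
        raise ValueError("both stacks need the same number of channels")
    squared_sum = 0.0
    for chan_in, chan_st in zip(input_maps, stylized_maps):
        for row_in, row_st in zip(chan_in, chan_st):
            for a, b in zip(row_in, row_st):
                squared_sum += (a - b) ** 2
    return math.sqrt(squared_sum)


# Toy single-channel feature maps (illustrative values only).
identical = [[[0.5, 1.0], [2.0, 0.0]]]
shifted = [[[0.5, 1.0], [2.0, 3.0]]]
print(feature_reconstruction_loss(identical, identical))  # no content change
print(feature_reconstruction_loss(identical, shifted))    # one activation differs
```

The loss is zero only when the stylized image reproduces the feature maps of the input, which is why it acts as a content-preservation penalty.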

Other approaches [6] are based on deep Markovian models and transform images locally instead of globally. To do so, feature maps are split into patches, which are the input of the classifier that discriminates between real and virtual appearances. As in [3,5], the loss function also includes a content regularization term to preserve the spatial structure of images. However, the fact that style is locally transferred also leads to a loss of anatomical content in stylized images (Fig 1(c)).

New approaches such as [4] use Generative Adversarial Networks (GANs) in order to transform images from one domain A (like virtual simulations) into a domain B (like interventional videos). The novelty of [4] is that a cyclic term is added in order to make the domain transfer bijective (A → B → A and B → A → B). Although the method also adds a regularization term to preserve the spatial structure of the stylized virtual images, content information is still lost, as shown in Fig 1(d).

Obtaining realistic textured endoscopic images is a very complicated task, since it is usually very difficult to have exact correspondences between real and virtual images. Besides, methods for unpaired mapping fail to preserve the anatomical content of the original domain.

Our strategy for intra-operative virtual endoscopy is a two-step method. In the first step, we generate pairs of virtual and intra-operative images sharing anatomical content using GANs. In the second step, the content and appearance of such pairs are blended using a siamese u-net architecture trained to modulate the amount of content and texture that is taken from each pair.

Given two domains, Virtual, $V$, and Real, $R$, a GAN learns two (bijective) maps $(G_r, G_v)$ from one domain onto the other one:

$$G_r : V \to R, \qquad G_v : R \to V$$

with the map compositions $G_r(G_v)$ and $G_v(G_r)$ being the identity on each domain.

Following [4], maps are given by auto-encoders trained to optimize: [...] The term $\mathcal{L}_{GAN}$ measures how good $G_v$, $G_r$ are at transferring images from one domain to the other one, while $\mathcal{L}_{cyc}$ is a "cycle consistency loss" introduced to force bijective mappings. The minimization problem is given by adversarial training as: [...] This way, $G^*_r$ and $G^*_v$ are optimized so that $G_r$, $G_v$ minimize (2) while the adversarial $D_r$, $D_v$ maximize it. These conditions (minimize while maximizing at the same time) might be in conflict when they are considered as a single optimization process. Such a conflict is prone to introduce an oscillating behavior in back-propagation iterations, which might hinder the convergence of cycleGAN training and the selection of the stopping epoch.

We propose to consider separately the optimization of each of the terms in the objective function and pose adversarial training as the following multi-objective optimization [8] problem: [...] Since a multi-objective optimization problem involves the optimization of multiple [...] with $\mathcal{L}_1, \dots, \mathcal{L}_k$ being the set of functions to be optimized.
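
For orientation, the single-objective cycle-consistent adversarial training that this paragraph departs from is commonly written as in the sketch below. It uses this section's symbols and assumes that [4] follows the usual cycleGAN form; the weight $\lambda$, the arguments of each term and the unspecified norm $\|\cdot\|$ are assumptions made for illustration, not formulas taken from the text.

```latex
\begin{aligned}
\mathcal{L}(G_r, G_v, D_r, D_v) &= \mathcal{L}_{GAN}(G_r, D_r) + \mathcal{L}_{GAN}(G_v, D_v)
  + \lambda\, \mathcal{L}_{cyc}(G_r, G_v), \\
\mathcal{L}_{cyc}(G_r, G_v) &= \mathrm{mean}_{v \in V}\, \| G_v(G_r(v)) - v \|
  + \mathrm{mean}_{r \in R}\, \| G_r(G_v(r)) - r \|, \\
(G_r^*, G_v^*) &= \arg\min_{G_r, G_v}\, \max_{D_r, D_v}\, \mathcal{L}(G_r, G_v, D_r, D_v).
\end{aligned}
```

In this form the generators and discriminators pull a single scalar in opposite directions, which is the conflict the multi-objective formulation is meant to avoid.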

The condition of the Pareto front can be used to select the cycleGAN epoch as follows. Let $G^k := (G^k_r, G^k_v)$ be the transformation maps at the k-th epoch and [...] be the set of epochs belonging to the Pareto front. Such Pareto maps can be iteratively computed from the values of the objective functions as: [...] for $D_1$ the set of maps for all epochs and $D_i$, $i > 2$, the set of maps dominating $GP_{i-1}$.

In our case, $G^i \in D_i$ if it satisfies the following conditions: [...] for $\|\cdot\|_2$ the $L_2$-norm and $\mathrm{mean}_{v \in V}$, $\mathrm{mean}_{r \in R}$ the average values for the training sets of virtual, $V$, and real, $R$, images. The epoch selected, $G^*_P$, is the one in the Pareto front achieving the minimum value of $\mathcal{L}_{cont}$: [...] We will denote the images transformed by these maps by $v^* := G^*_r(v)$ and $r^* := G^*_v(r)$.
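
The epoch selection idea can be sketched with a generic non-dominated filter. This is a minimal illustration under stated assumptions, not the iterative construction of the text: each epoch is summarized by a hypothetical tuple of loss values to be minimized, the helper names (`dominates`, `pareto_front`, `select_epoch`) are invented here, and the numbers are made up.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every loss and strictly better in one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(losses: Sequence[Sequence[float]]) -> List[int]:
    """Indices of the epochs whose loss vector no other epoch dominates."""
    return [i for i, li in enumerate(losses)
            if not any(dominates(lj, li)
                       for j, lj in enumerate(losses) if j != i)]


def select_epoch(losses: Sequence[Sequence[float]], content_index: int) -> int:
    """Epoch on the Pareto front with the smallest content loss."""
    front = pareto_front(losses)
    return min(front, key=lambda i: losses[i][content_index])


# Hypothetical per-epoch values: (adversarial, cycle, content) losses.
epoch_losses = [
    (0.90, 0.50, 0.40),
    (0.70, 0.60, 0.30),
    (0.80, 0.70, 0.35),  # dominated by epoch 1
    (0.60, 0.90, 0.25),
    (0.95, 0.55, 0.45),  # dominated by epoch 0
]
print(pareto_front(epoch_losses))                   # non-dominated epochs
print(select_epoch(epoch_losses, content_index=2))  # chosen stopping epoch
```

Because domination only compares losses component by component, no loss needs to be rescaled against the others before epochs are ranked.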

The contrastive loss function that Content-net minimizes is given by a content loss, $\mathcal{L}_{Cont}$, defined as: [...] for $v$ and $v^*$ an image pair produced by our multi-objective cycleGAN and $C(v, v^*)$ denoting the output of Content-net, with $v$ and $v^*$ being the inputs of, respectively, the first and second siamese encoders.

The function $\rho$ weighting skip connections is learned from a training set of intra-operative images by comparing the input image to the activation of each neuron in an encoder trained to yield the identity map. The similarity measure chosen to compare an input image to its neuron activations is their mutual information [9]. Mutual information measures the statistical dependence between random variables and is used in multimodal registration to compare images with equal content but different appearance.
If $v$ denotes the random variable given by input images and $a$ the random variable given by each neuron activation, then their mutual information, $I(v, a)$, is given by: [...] Given an input image $v$, the mutual information evaluated for all neurons' activations, $a^j_i$, of a given layer, $j$, defines a strictly positive random variable: [...] The distribution of $I^j_v$ is bimodal and allows categorizing neurons' activations into two groups sharing high and low content with the input image $v$.
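
As an illustration of the similarity measure, the sketch below estimates mutual information from a joint histogram using the standard discrete definition, $\sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$. It assumes the image and the activation map have been flattened to sequences of equal length (for instance after resampling the activation); the binning scheme and the names `quantize` and `mutual_information` are choices made for this example only.

```python
import math
from collections import Counter
from typing import List, Sequence


def quantize(values: Sequence[float], bins: int) -> List[int]:
    """Map values to integer bin indices in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant signal: everything falls in one bin
        return [0] * len(values)
    scale = bins / (hi - lo)
    return [min(int((x - lo) * scale), bins - 1) for x in values]


def mutual_information(image: Sequence[float],
                       activation: Sequence[float],
                       bins: int = 8) -> float:
    """Histogram estimate of I(image, activation) in nats."""
    if len(image) != len(activation):
        raise ValueError("image and activation must have the same length")
    q_img, q_act = quantize(image, bins), quantize(activation, bins)
    n = len(q_img)
    joint = Counter(zip(q_img, q_act))
    p_img, p_act = Counter(q_img), Counter(q_act)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((p_img[x] / n) * (p_act[y] / n)))
    return mi


pixels = [float(i) for i in range(16)]
rescaled = [2.0 * p + 1.0 for p in pixels]  # same content, different appearance
flat = [5.0] * 16                           # activation carrying no content
print(mutual_information(pixels, rescaled, bins=4))
print(mutual_information(pixels, flat, bins=4))
```

The rescaled signal keeps the full information of the input even though its intensities differ, while the constant one shares none, which mirrors the "equal content but different appearance" property noted above.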

The function $\rho$ weighting each neuron activation is defined as the probability of a neuron sharing high content with input images and is computed as the percentage of times each activation belongs to the high-content class for a set of training images: [...]

[...] intra-operative appearance and preservation of virtual anatomical content. Content-net was fine-tuned on the set of intra-operative recordings from an auto-encoder trained to yield the identity map on the Real domain. The weighting function $\rho$ was also learned from the latter auto-encoder.
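
The frequency-based definition of $\rho$ can be sketched as follows. The text does not state how the bimodal distribution is split into its two classes, so this example assumes a simple two-means threshold per image; `two_class_threshold`, `skip_weights` and the mutual information values are hypothetical.

```python
from typing import List, Sequence


def two_class_threshold(values: Sequence[float], iters: int = 50) -> float:
    """Midpoint between the two cluster means of a 1-D sample (two-means)."""
    lo, hi = min(values), max(values)
    for _ in range(iters):
        t = (lo + hi) / 2
        low = [x for x in values if x <= t]
        high = [x for x in values if x > t]
        if not low or not high:
            break
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):  # cluster means stopped moving
            break
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2


def skip_weights(mi_per_image: Sequence[Sequence[float]]) -> List[float]:
    """rho[i]: fraction of training images where neuron i is high-content."""
    n_images = len(mi_per_image)
    counts = [0] * len(mi_per_image[0])
    for mi_values in mi_per_image:
        threshold = two_class_threshold(mi_values)
        for i, mi in enumerate(mi_values):
            if mi > threshold:
                counts[i] += 1
    return [c / n_images for c in counts]


# Rows: training images; columns: mutual information of four neurons.
mi_table = [
    [0.9, 0.8, 0.1, 0.2],
    [1.0, 0.2, 0.1, 0.9],
    [0.8, 0.9, 0.2, 0.1],
    [0.9, 0.1, 0.2, 0.8],
]
print(skip_weights(mi_table))
```

A neuron that is high-content for every training image receives weight 1, one that never is receives weight 0, and the remaining neurons are weighted by how often they carry content.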

We evaluated the quality of the enhanced virtual images in terms of intra-operative [...] (Fig 4(a)), the 200th epoch network (Fig 4(b)) and the least cost one (Fig 4(c)). For each case, we show two consecutive frames of the enhanced virtual sequence, which should be very similar in appearance and content. GANLeast images have sudden dark artifacts, while GAN200 yields highly unstable images that do not always match [...]

For the second experiment, we applied the lumen center detector [12] to ContentNet, GAN200, GANLeast and the non-enhanced original virtual images to verify that the original lumen position and structure are preserved. The center detector was applied using two different sets of parameters, one learned on interventional videos and the other one learned on simulated bronchoscopies. Interventional parameters were used on enhanced images, while simulation parameters were applied to original virtual images. As in [12], detections were plotted on original virtual images and shown to two independent observers for the identification of false detections and missed centers. Ground truth was produced by intersecting the experts' annotations and used to compute precision and recall.
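
The scoring step can be sketched as below. The example assumes that "intersecting" means keeping only the annotations on which both observers agree, and that detections and missed centers are identified by integer ids; the helper names and the counts are hypothetical.

```python
from typing import Set, Tuple


def consensus(observer_a: Set[int], observer_b: Set[int]) -> Set[int]:
    """Annotations on which both observers agree (set intersection)."""
    return observer_a & observer_b


def precision_recall(n_detections: int,
                     false_detections: Set[int],
                     missed_centers: Set[int]) -> Tuple[float, float]:
    """Precision and recall from consensus false and missed annotations."""
    true_pos = n_detections - len(false_detections)
    n_actual = true_pos + len(missed_centers)
    precision = true_pos / n_detections if n_detections else 0.0
    recall = true_pos / n_actual if n_actual else 0.0
    return precision, recall


# Toy annotations: ids of detections flagged as false, ids of missed centers.
false_dets = consensus({3, 7, 11}, {3, 11, 15})
missed = consensus({101, 102, 103}, {102, 103, 104})
prec, rec = precision_recall(20, false_dets, missed)
print(prec, rec)
```

Precision drops with every agreed false detection and recall with every agreed missed center, so the two scores separate spurious lumen detections from lost anatomical structure.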

Scores obtained for ContentNet, GAN200 and GANLeast were compared to the ones obtained for virtual non-enhanced images using a Student t-test for paired data. As in the first experiment, we computed p-values and 95% confidence intervals, and a p-value < 0.05 was considered statistically significant.
Tables 2 and 3 summarize statistics for Prec and Rec reported as in Table 1.
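
The paired comparison can be illustrated with a standard-library sketch. Instead of computing a p-value, it compares the t statistic with a tabulated two-sided critical value, which gives the same decision at the 0.05 level; the value 2.776 corresponds to 4 degrees of freedom, and the function name and scores are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_test(scores_a: Sequence[float],
                  scores_b: Sequence[float],
                  t_critical: float) -> Tuple[float, Tuple[float, float], bool]:
    """t statistic, confidence interval of the mean difference, significance."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    if std_err == 0:
        raise ValueError("all paired differences are identical")
    t_stat = mean_diff / std_err
    half_width = t_critical * std_err
    interval = (mean_diff - half_width, mean_diff + half_width)
    return t_stat, interval, abs(t_stat) > t_critical


T_CRIT_DF4 = 2.776  # two-sided 95% critical value for n - 1 = 4

# Hypothetical recall scores on five sequences.
virtual = [0.91, 0.84, 0.89, 0.90, 0.88]
enhanced = [0.90, 0.85, 0.88, 0.92, 0.87]
distorted = [0.70, 0.66, 0.69, 0.72, 0.68]

_, ci_close, sig_close = paired_t_test(enhanced, virtual, T_CRIT_DF4)
_, ci_far, sig_far = paired_t_test(distorted, virtual, T_CRIT_DF4)
print(ci_close, sig_close)
print(ci_far, sig_far)
```

A confidence interval that contains zero corresponds to a non-significant difference, which is the outcome one would want when enhancement preserves the anatomical content of the virtual images.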

ContentNet outperforms the two cycleGANs in terms of both precision and recall.

[...] the appearance of enhanced images compares to intra-operative videos for their use in classification problems. The second experiment assessed whether the anatomical content of the virtual images extracted from the patient's CT is preserved after their enhancement for their use in image processing problems.

Results obtained for the first experiment show that, like cycleGAN, Content-net images have an appearance close enough to intra-operative videos to be classified as real by a discriminative network. This validates our method for data augmentation in classification problems. The second experiment shows that images enhanced using both cycleGANs have a significant distortion in anatomical content (see Fig 5) and larger temporal artifacts (see Fig 4) in comparison to Content-net. In fact, according to t-tests, the anatomical structure of Content-net images is not significantly different from that of the original virtual images extracted from the patient's CT anatomy. This validates our method for data augmentation in image processing problems.

The artifacts of cycleGAN images might be partially attributed to the adversarial training. On the one hand, the combination of two loss functions with different (indeed, opposite) goals (minimization and maximization) introduces an oscillating behavior across training epochs and, thus, consecutive epochs might produce very different results. On the other hand, it is not guaranteed that both losses will have equal influence during training, since the magnitude of one of the two might be predominant in the back-propagation of their gradients.

The proposed multi-objective approach allows the joint optimization of both losses, ensuring equal influence on the cycleGAN regardless of their magnitude or gradient. This way, our multi-objective cycleGAN produces stylized images that share enough anatomical structure with virtual images to be the input for a network blending both image pairs. The weighted skip connections of ContentNet provide selective blending of the structure and texture of these image pairs. This allows enhancing the patient-specific anatomical content acquired by CT scans, while keeping an intra-operative appearance. In this context, it is worth noting that the ContentNet precision and recall ranges achieved in the detection of airway structure (centers) are very close to the ranges obtained in intra-operative videos [12].

In summary, two interesting conclusions can be inferred from our experiments. First, the use of multi-objective optimization strategies can be an effective alternative to back-propagation for the optimization of adversarial networks and other networks relying on multiple loss functions. In this context, the Pareto front condition could also be adapted for the selection of the most appropriate task in sequential multi-task learning. Second, the structure of neuron activations can be measured by the amount of information shared with input images. This measure of their content provides a description that is easy to interpret in terms of classical computer vision. We envision that this could be useful to define more specific and interpretable representation spaces based on CNNs.