A single latent channel is sufficient for biomedical image segmentation

Glottis segmentation is a crucial step in quantifying endoscopic footage from laryngeal high-speed videoendoscopy. Recent advances in deep neural networks for glottis segmentation enable a fully automatic workflow. However, exact knowledge of the integral parts of these segmentation networks remains unknown. Here, we show using systematic ablations that a single latent channel as bottleneck layer is sufficient for glottal area segmentation. We further show that the latent space is an abstraction of the glottal area segmentation relying on three spatially defined pixel subtypes. We provide evidence that the latent space is highly correlated with the glottal area waveform, can be encoded with four bits, and can be decoded using lean decoders while maintaining a high reconstruction accuracy. Our findings suggest that glottis segmentation is a task that can be highly optimized to yield very efficient and clinically applicable deep neural networks. In the future, we believe that online deep learning-assisted monitoring will be a game changer in laryngeal examinations.

Data and preprocessing

To train and evaluate deep neural networks, we used the Benchmark for Automatic Glottis Segmentation (BAGLS, [9]). We used the full training and test datasets, containing 55,750 and 3,500 images, respectively. For training, we resized all images to 512×256 px, the native resolution of most images in the dataset. All endoscopic images were converted to grayscale. Input image intensities were normalized to the range -1 to 1; segmentation masks were normalized to 0 and 1, where 0 is background and 1 the glottal area. We randomly applied data augmentation to the training data using Gaussian blur, rotation (-30 to 30°), horizontal flips, and gamma adjustments. We also used the short video snippets of 30 frames available in the BAGLS dataset for time-variant data analysis; videos were processed on a single-frame basis.

Glottal area waveform (GAW)

The glottal area waveform (GAW) is a one-dimensional representation of the vocal fold oscillation behavior. Each time point of the GAW is computed as the sum of foreground pixels in the glottal area segmentation mask at the given time point [20].

Architecture

The baseline glottis segmentation network is based on the U-Net architecture [21], modified as described in [10]. Briefly, we rely on an encoder-decoder architecture to change the image domain from endoscopic image to glottal area segmentation (see Figure 1). Initially, we use skip connections between encoder and decoder to pass mid-level information by concatenation. We set up the deep neural networks in TensorFlow 2.6 using the Keras high-level package. We trained for 25 epochs at a constant learning rate of 10^-4 using the Adam optimizer [22]. Each convolutional layer used a kernel size of 3×3 and f_L convolutional filters following equation 1, i.e., f_L = f_base · 2^L, where L is the encoder level (the filter count doubles with each downsampling step). As activation function, we used ReLU6(x) = min(max(0, x), 6).
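As a minimal sketch of the filter schedule and activation, assuming the reconstruction f_L = f_base · 2^L for equation 1 (an assumption, but consistent with f_base = 16 yielding 256 bottleneck channels and f_base = 64 yielding 1024), the two definitions can be written as:

```python
import numpy as np

def filters_at_level(f_base: int, level: int) -> int:
    # Equation 1 (as reconstructed here): the number of convolutional
    # filters doubles with each downsampling level of the encoder.
    return f_base * 2 ** level

def relu6(x):
    # ReLU6 activation used throughout the network: clips values to [0, 6].
    return np.minimum(np.maximum(0.0, x), 6.0)

# With f_base = 16, the bottleneck (level 4) has 256 channels,
# matching the reference implementation described in the text.
print([filters_at_level(16, L) for L in range(5)])  # [16, 32, 64, 128, 256]
```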
During training, we minimized the Dice loss [23], defined in equation 4 as L_Dice = 1 - 2|ŷ ∩ y| / (|ŷ| + |y|), by comparing the predicted glottis segmentation mask ŷ to the ground-truth segmentation mask y.
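The Dice loss of equation 4 is the standard soft Dice formulation; a minimal NumPy sketch (the eps smoothing term and function name are our assumptions, not taken from the paper):

```python
import numpy as np

def dice_loss(y_true, y_pred, eps=1e-7):
    # Soft Dice loss: 1 - Dice coefficient, computed over the whole mask.
    # y_true and y_pred are arrays in [0, 1]; eps avoids division by zero.
    intersection = np.sum(y_true * y_pred)
    denom = np.sum(y_true) + np.sum(y_pred)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)

y = np.array([[0, 1], [1, 1]], dtype=float)
print(dice_loss(y, y))  # ~0 for a perfect prediction
```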
The latent space Ψ is a high-level representation of the initial endoscopy image at the end of the encoder and serves as input to the decoder (Figure 1). It can be interpreted as an image with f_L "color" channels. For latent space investigations, we changed f_L from its initial value (here: 256), as defined by equation 1, to a fixed value ranging from 1 to f_L. When f_L = 1, we refer to the latent space as the latent space image Ψ1.

Decoder experiments
The initial decoder is constructed as described in the section Architecture. For the decoder experiments, we used different strategies to construct the decoder (Figure 5A). The latent space images were generated using the network configuration without skip connections and with f_L = 1. We converted each latent space image to uint8, as we have shown that eight bits are sufficient for high-level encoding (Figure 4A,B).

Bit encoding

For the evaluation of the bit encoding, we created a histogram of a given latent space image and divided it into 2^bits bins. We then set each pixel in a given bin range to the average value of that bin (Figure 4C). The resulting new latent space image is provided to the decoder, and the reconstructed image is compared to the ground-truth segmentation mask. We used the mean squared error (MSE) and the intersection over union (IoU) score (see Evaluation) as evaluation metrics.
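The bin-averaging procedure can be sketched as follows; `quantize_latent` is a hypothetical helper name, and the equal-width binning over the observed value range is our assumption:

```python
import numpy as np

def quantize_latent(psi, bits):
    # Bit-encoding evaluation: split the value range of the latent space
    # image into 2**bits equal-width bins and replace every pixel by the
    # average value of its bin.
    n_bins = 2 ** bits
    edges = np.linspace(psi.min(), psi.max(), n_bins + 1)
    idx = np.clip(np.digitize(psi, edges) - 1, 0, n_bins - 1)
    out = psi.astype(float).copy()
    for b in range(n_bins):
        in_bin = idx == b
        if in_bin.any():
            out[in_bin] = psi[in_bin].mean()
    return out

psi = np.random.default_rng(0).uniform(0.0, 1.45, size=(16, 16))
psi4 = quantize_latent(psi, bits=4)  # at most 2**4 = 16 distinct values
```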

Evaluation
We evaluated the segmentation quality using the intersection over union (IoU) score [24], defined in equation 5 as IoU(ŷ, y) = |ŷ ∩ y| / |ŷ ∪ y|.
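A minimal NumPy version of the IoU score from equation 5 (the eps smoothing term and function name are our additions):

```python
import numpy as np

def iou_score(y_true, y_pred, eps=1e-7):
    # Intersection over union: |y ∩ ŷ| / |y ∪ ŷ|, computed on
    # binarized masks; eps guards against an empty union.
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    intersection = np.logical_and(y_true, y_pred).sum()
    union = np.logical_or(y_true, y_pred).sum()
    return (intersection + eps) / (union + eps)

a = np.array([1, 1, 0, 0])
b = np.array([1, 0, 1, 0])
print(iou_score(a, b))  # 1 overlapping pixel / 3 in the union ≈ 0.333
```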
We further computed the correlation between the latent space image Ψ1 (in the equation referred to as x) and the GAW (y) as follows:

r = 1/(n-1) · Σ_{i=1}^{n} ((x_i - x̄)/s_x) · ((y_i - ȳ)/s_y),

where n is the number of time points/samples, x̄ and ȳ are the averages of x and y, respectively, and s_x and s_y are the sample standard deviations of x and y, respectively.
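The sample Pearson correlation can be written out directly from these definitions; `pearson_r` is a hypothetical helper name:

```python
import numpy as np

def pearson_r(x, y):
    # Sample Pearson correlation between the latent space signal x and the
    # GAW y, written out term by term as in the Methods equation:
    # r = 1/(n-1) * sum(((x_i - mean(x)) / s_x) * ((y_i - mean(y)) / s_y)).
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    sx = x.std(ddof=1)  # sample standard deviation (n - 1 denominator)
    sy = y.std(ddof=1)
    return np.sum((x - x.mean()) * (y - y.mean())) / ((n - 1) * sx * sy)

t = np.linspace(0, 2 * np.pi, 50)
print(round(pearson_r(np.sin(t), 2 * np.sin(t) + 1), 3))  # 1.0
```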

115
Code and data availability

We provide all relevant code at https://github.com/ankilab/latent.

Results

To understand which components are crucial in a segmentation deep neural network, we performed an ablation study on a modified U-Net architecture (see Methods). We trained a full-size, complete U-Net to perform glottis segmentation (Figure 1A), similar to previous works [9,10]. The latent space Ψ, the ultimate bottleneck that connects encoder and decoder in the full U-Net, initially has 1024 channels (Figure 1A) when 64 filters are used in the first layer (f_base = 64). In this work, we use a reference implementation with 16 filters in the first layer (f_base = 16) and thus 256 channels in the latent space, as this has previously been shown to provide performance comparable to f_L = 1024 [10]. We systematically reduced the number of channels in the latent space to determine the minimum viable latent space. We found that even a single latent space channel is sufficient to encode the glottal area segmentation (Figure 1B).

134
However, we hypothesized that the skip connections in the U-Net could rescue the strong limitation in the bottleneck. Hence, we removed the skip connections and found that the segmentation accuracy was reduced across configurations (Figure 1C). However, the network architecture is still able to provide accurate glottis segmentations (Figure 1D) and performs on the test set similarly to higher latent space encodings with enabled skip connections. In summary, we show that a single latent space channel is sufficient for glottis segmentation. We will refer to this single latent space channel as the latent space image Ψ1.

142
The latent space encodes glottis location and shape

Next, we investigated the properties of the latent space. We encoded all images of the BAGLS training dataset to obtain a collection of latent space Ψ images. We first determined whether any pixel is correlated with the glottal area and found that pixels defining the glottal area take values above 0.8 (Figure 2B). Interestingly, although we clipped the available value space in the latent space between 0 and 6 (see Methods), the largest value we observed was 1.45, indicating that we are not constrained by our activation function.

153
To understand the meaning behind these values, we found that values around 0.8 seem to encode the background (referred to as β pixels), whereas values higher than 0.8 define the glottal area (γ pixels); a third subtype (α pixels) refines the glottis.

Figure 2 caption: A: The 95% confidence interval used for defining γ pixels is indicated. B: The average latent space image Ψ1 across 30 frames, with the three pixel subtypes indicated: α for glottis-refining, β for background-defining, and γ for glottal area-defining pixels. C: The average reconstruction obtained by feeding Ψ1 from panel B into the decoder.
Thresholded latent space is highly correlated with the glottal area waveform

The glottal area waveform (GAW) is a time-variant signal important for assessing vocal fold physiology [2,25]. We therefore asked if the latent space image Ψ1 is a good proxy for the GAW. To answer this question, we used short video fragments from the BAGLS dataset and converted the provided ground-truth segmentation masks to GAWs (see Methods). We followed two approaches: (1) summing all values in Ψ1, and (2) thresholding Ψ1 at the 95% confidence interval (value = 0.8) and then summing the supra-threshold pixels. In Figure 3A, we show that approach (1) is correlated with GAWs only to a limited extent (on average 0.03 ± 0.56), whereas approach (2) is highly correlated with the GAW (on average 0.84 ± 0.18). We show two exemplary videos with the corresponding ground-truth GAW, the GAW generated from the segmentation masks reconstructed by the decoder, and the thresholded Ψ1 waveform in Figure 3B.
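The two GAW proxies can be sketched as follows (function names are illustrative; the 0.8 threshold follows the 95% confidence interval stated in the text):

```python
import numpy as np

def gaw_from_masks(masks):
    # Ground-truth GAW: number of foreground pixels per segmentation frame.
    return masks.reshape(masks.shape[0], -1).sum(axis=1)

def gaw_from_latent(psi_stack, threshold=0.8):
    # Approach (2): threshold each latent space image Psi_1 at the 95%
    # confidence value (0.8) and count the supra-threshold pixels.
    return (psi_stack > threshold).reshape(psi_stack.shape[0], -1).sum(axis=1)

# Toy 3-frame example: a growing glottal area yields a rising waveform.
masks = np.zeros((3, 4, 4))
masks[1, :2, :2] = 1
masks[2, :3, :3] = 1
print(gaw_from_masks(masks))  # [0. 4. 9.]
```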

179
A low bit encoding is sufficient for glottis reconstruction

Given the very limited value range in the latent space (Figure 2B) and the existence of α, β, and γ pixels, we hypothesized that a low bit depth is sufficient for encoding the latent space Ψ1 for accurate glottal area reconstruction. By reducing the bit depth from 32-bit floating point to a range of 1 to 8 bit, we found that 4-bit encoding is sufficient for high-quality reconstructions (Figure 4A-C). Specifically, with 4-bit encoding the IoU score becomes stable and shows a low error that is negligible with 8-bit encoding (Figure 4A). The mean squared error (MSE) between the full 32-bit reconstruction and the low-bit reconstruction declines with increasing bit depth, as expected, but still shows a significant deviation from the original reconstruction (Figure 4B). We are further able to reproduce the high correlation of Ψ1 with the glottal area waveform (Figure 4D). In summary, we show that 4-bit encoding is sufficient for subjectively similar glottis reconstructions.

As the latent space image Ψ is easily interpretable and shows low complexity, we hypothesized that the decoder architecture can be largely simplified. Hence, we investigated how many convolutional filters and how many upsampling steps are necessary for decoding (Figure 5A). Further, we were interested in whether the upsampling strategy (nearest neighbours vs. bilinear interpolation) and multiple convolutional layers affect the decoding (Figure 5A). When using a single convolutional layer in each upsampling step, we found that one and two convolutional filters are not sufficient for decoding, and that four convolutional filters are sufficient only in a single configuration (4x upsampling and bilinear interpolation), as shown in Figure 5B.
The best results were achieved using eight convolutional filters together with 4x upsampling, which resulted in decent IoU scores (0.817, Figure 5B). Using two convolutional layers in each upsampling step, however, allowed 2x upsampling to be competitive in the eight convolutional filter configuration. In general, two and four convolutional filters show inferior performance (Figure 5D), while the decoder has a relatively stable file size of 99 kB (Figure 5E). It is astonishing that even configurations with fewer than 200 trainable parameters achieve IoU scores higher than 0.4 (Table 1).
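To make the sub-200-parameter claim concrete, a quick back-of-the-envelope count for a hypothetical lean decoder (the exact layer layout is our assumption, not a configuration taken from Table 1):

```python
def conv_params(kernel: int, c_in: int, c_out: int) -> int:
    # Trainable parameters of a 2D convolution with bias terms:
    # kernel * kernel * input_channels * output_channels + biases.
    return kernel * kernel * c_in * c_out + c_out

# Hypothetical lean decoder on Psi_1 (one input channel): a 3x3 convolution
# with two filters, a parameter-free upsampling step, and a 3x3 single-filter
# output convolution. Upsampling (nearest/bilinear) adds no parameters.
total = conv_params(3, 1, 2) + conv_params(3, 2, 1)
print(total)  # 39 trainable parameters, well below 200
```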

Discussion
In this work, we found that a single channel in the latent space of an encoder-decoder 217 architecture is sufficient for glottal area reconstruction. We further show that the latent 218 space forms an image that has interpretable properties, such as background (β), glottal 219 area defining (γ) and refining (α) pixels. Our findings suggest that encoder-decoder 220 frameworks are not only suitable for glottis segmentation, but also provide a 221 higher-order approximation of the glottal area sufficiently encoded in a significantly 222 smaller, single channel image. Together with a low bit encoding (Figure 4), it may serve 223 as an efficient data storage system for glottis segmentations. The latent space image can 224 be easily reconstructed using efficient decoders as presented in Figure 5.

In this study, we particularly focused on the latent space and its sufficiency for glottis segmentation. We found that removing the U-Net-specific skip connections yielded lower IoU scores on the validation set, whereas we did not find any differences on the test set, i.e., on independent, unseen data (Figure 1D). This is in line with a previous study [10], where the authors showed that the kind of skip connection is not

Glottis segmentation is a straightforward task that was previously approached using thresholding-based techniques [7,20,26]. It is therefore likely that the encoder-decoder architecture learns a non-linear thresholding algorithm. However, other modalities, such as anterior-posterior point prediction for midline estimation [19] and vocal fold localization for paralysis analysis [27], may not benefit from this very constrained latent space. Future studies should address these limitations and investigate whether an increased latent space is crucial for multitask architectures, as the latent space has been shown to be useful for midline estimation [19].

243
The U-Net is a very powerful starting point for biomedical image segmentation tasks, including glottis segmentation [4,9,28]. Modifications to this architecture, such as convolutional layers with LSTM memory cells [29] as shown by [28], may improve glottal segmentation accuracy. Also, more sophisticated encoding backbones, such as the ResNet [30] and EfficientNet [31] architectures, show superior performance in glottis segmentation, especially on more dissimilar data sources [20]. Future research should investigate whether these architectures are able to detect and encode better high-level features in the latent space, such that a potentially higher dimensionality in the latent space yields further performance improvements.