FollicleFinder: automated three-dimensional segmentation of human ovarian follicles

In vitro fertilization (IVF) treatment protocols require frequent monitoring of the ovarian follicle growth process. We report FollicleFinder, an open source pipeline for the automated, 3D segmentation of ovarian follicles. FollicleFinder also accurately measures clinically relevant morphological properties such as diameter, surface area, and volume.
Availability: The FollicleFinder pipeline is available at https://git.bsse.ethz.ch/iber/ovary-analysis and the graphical user interface is available at https://git.bsse.ethz.ch/iber/follicle-tracker.


Introduction
Assisted reproductive technology (ART) offers a range of treatment options to couples having trouble conceiving. In vitro fertilization (IVF) treatment protocols include the collection of gametes. Successful IVF requires a set of multiple oocytes, and ovarian hyperstimulation is therefore used to support the maturation of a larger number of ovarian follicles. During hyperstimulation, the ovarian response is monitored by repeated transvaginal ultrasound (TVUS) examination, and clinical decisions regarding the dosage and timing of hormonal stimulation are based on the follicle number and size.
Previously, 2D slices were imaged in clinical settings, but 3D segmentation algorithms have since been developed [1], and several commercial software packages now offer the 3D segmentation of ovarian follicles [2]. General Electric (GE) offers two 3D segmentation algorithms, VOCAL and SonoAVC [3], directly with its ultrasound scanner. Both software packages measure the volume and surface area of follicles [4]. Unfortunately, both packages produce too many segmentation errors requiring manual correction for either to be used routinely in a clinical setting [5]. Manual measurements are, however, slow and introduce user-dependent variability. Accordingly, there is a need for better algorithms that enable the automatic, accurate detection and 3D segmentation of ovarian follicles.
Deep learning approaches have improved image segmentation across scientific domains, and have recently been explored in the context of ovarian follicle detection and segmentation. CR-UNet, which combines a spatial Recurrent Neural Network (RNN) with a standard 2D U-Net, yields promising results with 2D TVUS images [2,1]. A subsequent study used an S-Net architecture for the simultaneous segmentation of ovary and follicles in 3D TVUS volumes [6]. These deep learning approaches provide more accurate segmentation of follicles in ultrasound images. However, the source code and models were not made available, making it difficult to build upon and reproduce this work.
To facilitate the translation of image-derived features into the clinic and the development of new models of follicular selection, there is a need for open source software that will serve as the basis for these innovations. To fill this gap, we have developed FollicleFinder, an open source platform for the detection and measurement of follicles from transvaginal ultrasound images.

Approach
FollicleFinder performs instance segmentation of the follicles from transvaginal ultrasound images using a 3D UNet via a two-step segmentation pipeline: first the ovary is segmented, and then the ovary is extracted and the follicles are segmented within the extracted ovary volume (Figure 1A). See the Methods section in the Supplementary Information for details on image acquisition and network training. We constrain the segmentation to the biologically plausible region by first segmenting the ovary. Since obtaining ground truth labels is laborious, we have implemented a self-supervised denoising preprocessing step. Denoising reduces the amount of training data required because denoising and semantic segmentation are related tasks [7]. Indeed, our segmentations are on par with state-of-the-art results while training on only 4% as much data (Supplemental Information, section 2). Morphological properties such as the volume, surface area, and effective diameter are directly measured from the segmented follicles. Thus, follicle growth dynamics can be tracked over the course of the menstrual cycle (Figure 1C).
To facilitate usage of the FollicleFinder pipeline in research and clinical settings, we have created a graphical user interface as a napari plugin called FollicleTracker (Figure 1B). The FollicleTracker plugin loads images from standard image formats (e.g., TIFF, DICOM) and renders them in 2D and 3D in the napari viewer. Via the graphical user interface, the user can use FollicleFinder to segment the follicles and measure their properties. Since FollicleTracker is written as a napari plugin with standard Python libraries, we expect that others will be able to extend it for usage with other analyses.

Dataset
To help others further improve follicle segmentation algorithms, we are releasing our image dataset of raw images plus ground truth masks. The dataset contains 94 3D transvaginal ultrasound images annotated by a medical doctor. These data were collected by vaginosonographic examination as part of the Bicycle study [8]. The data can be accessed via the ETH openBIS instance.

Measurement of clinically relevant metrics
First, we compared the number of detected follicles to the number of follicles in the ground truth. Follicle count is a clinical measure that can be used to guide IVF treatment. In particular, the antral follicle count, or the count of follicles 2-5 mm in diameter, can be used to assess ovarian reserve. FollicleFinder counts are well correlated with the ground truth and have low error (1.2 follicle mean error, Figure S2). Next, we considered how accurately the size of follicles was measured. To do so, we compared the effective diameters and volumes measured on the FollicleFinder predictions to those measured in the ground truth. For both effective diameter and volume, the measurements were highly correlated (Pearson's correlation coefficient, r=0.99, Figure S3). Finally, we calculated the localization error by comparing the centroid detected by FollicleFinder to the centroid calculated from the ground truth. FollicleFinder had a median localization error of 0.15 mm, which is within one voxel (Figure S4).

Conclusion
In conclusion, we have developed a segmentation algorithm and software that achieve clinically relevant, state-of-the-art performance while using only about 4% as much training data as the previous best-performing method, CR-Unet. We expect that model performance can be further improved by increasing the amount of training data. By releasing both our model and dataset as an open resource, we anticipate other researchers will be able to build upon and improve segmentation algorithms and integrate the rich 3D information into clinical practice.

Acknowledgements
We thank Ting-Yu Ho and Steve Runser for helpful discussions.

Image Acquisition
Transvaginal ultrasound images of both ovaries were acquired in 50 women every second day during two natural cycles by a medical doctor using a GE Healthcare Voluson E8 + RIC5-9-D ultrasound imaging system.

Image Export and Training Dataset
The 3D images were exported in the DICOM format using GE 4D View software and converted to TIFF images. For training and testing, a clinical expert hand-segmented 94 ovaries using 3D Slicer (https://www.slicer.org, [9]).

Pre-processing
To prepare the ultrasound images for processing, we first rescaled the images to isotropic voxels of 0.157288 mm × 0.157288 mm × 0.157288 mm. Then we performed self-supervised denoising using the aydin Noise2SelfFGR restoration function [10]. Finally, we replaced the padding voxels with the mean intensity of the image. The padding voxels are the voxels added to yield a rectangular image from the cone-shaped detection volume.

3D UNet
We segmented the ovary using a 3D UNet [11] and extracted the voxels corresponding to the ovary. Then, we used a second 3D UNet to segment the follicles from the extracted image of the ovary. Finally, we identified individual follicles via connected component analysis.
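The connected component step can be illustrated with a minimal, dependency-free sketch. It assumes 6-connectivity (voxels sharing a face) and represents the binary follicle mask as a set of voxel indices; neither choice is stated in the text, and in practice a library routine such as scipy.ndimage.label or skimage.measure.label would typically be used instead. The function name and toy mask are hypothetical.

```python
from collections import deque

# Assumed 6-connectivity: two voxels are neighbours when they share a face.
NEIGHBOUR_OFFSETS = [
    (1, 0, 0), (-1, 0, 0),
    (0, 1, 0), (0, -1, 0),
    (0, 0, 1), (0, 0, -1),
]


def label_components(foreground):
    """Assign an integer label (1, 2, ...) to each connected group of voxels.

    `foreground` is a set of (z, y, x) index tuples taken from a binary mask.
    Returns a dict mapping each voxel to its instance label.
    """
    labels = {}
    next_label = 0
    for seed in sorted(foreground):
        if seed in labels:
            continue  # already reached from an earlier seed
        next_label += 1
        labels[seed] = next_label
        queue = deque([seed])
        # Breadth-first flood fill from the seed voxel.
        while queue:
            z, y, x = queue.popleft()
            for dz, dy, dx in NEIGHBOUR_OFFSETS:
                neighbour = (z + dz, y + dy, x + dx)
                if neighbour in foreground and neighbour not in labels:
                    labels[neighbour] = next_label
                    queue.append(neighbour)
    return labels


# Toy mask with two separate blobs (hypothetical data, not from the study).
mask = {(0, 0, 0), (0, 0, 1), (0, 1, 1), (2, 2, 2), (2, 2, 3)}
instance_labels = label_components(mask)
n_follicles = max(instance_labels.values())
```

Each distinct label then corresponds to one candidate follicle, so the follicle count is simply the number of labels.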

Training ovary segmentation
We trained the ovary segmentation network using the Adam optimizer. We set the initial learning rate to 0.0004 and the weight decay to 0.00001. We reduced the learning rate by a factor of 5 after 10 validation runs without improvement. We set the batch size to 1 and validated after 1000 iterations. We used the intersection over union as our validation metric. To augment our training data, we applied random flips, random 90 degree rotations, random rotations (± 30 degrees), and random elastic deformations. We stopped training when the learning rate was reduced to below 1E-6.

Training follicle segmentation
We trained the follicle segmentation network using the Adam optimizer. As an auxiliary task, the network predicted the follicle boundaries in addition to the follicles themselves. We set the initial learning rate to 0.0002 and the weight decay to 0.00001. We reduced the learning rate by a factor of 5 after 10 validation runs without improvement. We used intersection over union as our validation metric. We set the batch size to 1 and validated after 1000 iterations. To augment our training data, we applied random flips, random 90 degree rotations, random rotations (± 30 degrees), and random elastic deformations. We stopped training when the learning rate was reduced to below 1E-6.
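The learning rate schedule and stopping rule used for both networks can be sketched as follows. This is an illustrative stand-alone implementation, not the training code itself: the class name is hypothetical, and the exact patience semantics (reduce on the tenth consecutive validation run without a new best score) are an assumption.

```python
class PlateauSchedule:
    """Divide the learning rate by `factor` after `patience` validation runs
    without improvement; signal a stop once it falls below `min_lr`."""

    def __init__(self, initial_lr, factor=5.0, patience=10, min_lr=1e-6):
        self.lr = initial_lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best_score = float("-inf")
        self.runs_without_improvement = 0

    def step(self, validation_score):
        """Record one validation run; return True while training should continue."""
        if validation_score > self.best_score:
            self.best_score = validation_score
            self.runs_without_improvement = 0
        else:
            self.runs_without_improvement += 1
            if self.runs_without_improvement >= self.patience:
                self.lr /= self.factor
                self.runs_without_improvement = 0
        return self.lr >= self.min_lr


# Follicle network settings from the text. The constant validation score is a
# hypothetical worst case in which the IoU never improves after the first run.
schedule = PlateauSchedule(initial_lr=0.0002)
validation_runs = 0
keep_training = True
while keep_training:
    keep_training = schedule.step(validation_score=0.5)
    validation_runs += 1
```

Under these assumptions, a starting rate of 0.0002 passes through 4E-5, 8E-6, and 1.6E-6 before the fourth reduction takes it to 3.2E-7, below the 1E-6 threshold, at which point training stops.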

Volume and surface area measurements
We measured the volume of the segmented follicles with the scikit-image [12] regionprops_table function. To measure the surface area of the follicles, we generated meshes from the segmented objects. First, we generated a mesh using the scikit-image marching cubes algorithm. Then we cleaned up the mesh (i.e., made it watertight) using pymeshfix [13]. Next, we smoothed the mesh using the trimesh [14] filter_taubin function (a Laplacian filter). Finally, we calculated the surface area using the Trimesh.area property.
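The relation between voxel count, volume, and effective diameter can be sketched as below. The text does not define the effective diameter, so this example assumes the common convention of the diameter of a sphere with the same volume as the follicle; the function names are hypothetical, and the voxel size is the isotropic value given in the Pre-processing section.

```python
import math

VOXEL_SIZE_MM = 0.157288  # isotropic voxel edge length after rescaling


def volume_mm3(voxel_count, voxel_size_mm=VOXEL_SIZE_MM):
    """Convert a voxel count into a physical volume in cubic millimetres."""
    return voxel_count * voxel_size_mm ** 3


def effective_diameter_mm(volume):
    """Diameter of the sphere with the given volume (assumed definition)."""
    return (6.0 * volume / math.pi) ** (1.0 / 3.0)


# Sanity check on a sphere of known size: a 10 mm sphere should map back to 10 mm.
sphere_volume = math.pi / 6.0 * 10.0 ** 3
diameter = effective_diameter_mm(sphere_volume)
```

Because the diameter scales with the cube root of the volume, a relative error in the segmented volume translates into roughly one third of that relative error in the effective diameter.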

FollicleTracker Graphical User Interface
We developed the graphical user interface (GUI) as a plugin for napari [15], a GPU-accelerated nD data viewer. Within the plugin, users can explore the data in both 2D and 3D. Our plugin is structured in three parts: IO, GUI and Model (Figure S1). Napari allows users to import 3D images in various formats by dragging and dropping them into the GUI. This includes images saved in the TIFF or standard DICOM format. Note that some GE ultrasound machines have a feature to save 3D ultrasound images as "DICOM", but these files do not follow the universal DICOM standard and instead use a proprietary image file format [16]. The plugin allows users to save images as hdf5 files with separate datasets for the "left" and "right" ovaries. The plugin saves the follicle segmentations and all intermediate results from each side. These files can then be opened in napari using the plugin.
Figure S1: FollicleTracker is implemented as a napari plugin and has a modular architecture.
Through the controls on the napari plugin GUI, users can start the segmentation of ovarian ultrasound images and the analysis of morphological features, such as the volume and surface area of all detected follicles. The morphological measurements of the follicles in each ovary are then displayed in a table similar to those in existing reproductive electronic health record programs. Users can select a morphological feature to plot so they can understand the distribution of follicle sizes. The default morphological feature is volume.
The model that processes and analyses the input ovary images is implemented separately from the GUI to ensure all variables, images and settings only have one state. The model computation is run in a separate thread from the GUI, so that the viewer remains responsive during segmentation and analysis. Any intermediate results are sent directly back to the user interface thread and added to the viewer. The intermediate results are the denoised image, the initial ovary segmentation, the processed ovary segmentation, the initial follicle segmentation and the processed follicle segmentation. The final results are the morphological follicle measurements.
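The separation between the model thread and the interface thread can be illustrated with a minimal sketch using only the Python standard library. This is not the plugin's code or the napari threading API: the function name, the queue-based hand-off, and the string payloads are all assumptions made for illustration, while the stage names follow the list above.

```python
import queue
import threading


def run_pipeline(image_name, results):
    """Stand-in for the model: each stage posts its result as soon as it is ready.

    The string payloads are placeholders for the real computations.
    """
    stages = [
        "denoised image",
        "initial ovary segmentation",
        "processed ovary segmentation",
        "initial follicle segmentation",
        "processed follicle segmentation",
        "follicle measurements",
    ]
    for stage in stages:
        results.put((stage, f"{stage} of {image_name}"))
    results.put(None)  # sentinel: the pipeline has finished


results = queue.Queue()
worker = threading.Thread(target=run_pipeline, args=("left ovary", results))
worker.start()

# For simplicity the main thread blocks on the queue here. A real GUI would
# instead receive each result through its event loop and add it to the viewer.
received = []
while True:
    item = results.get()
    if item is None:
        break
    stage, payload = item
    received.append(stage)
worker.join()
```

Keeping all state inside the worker and passing only finished results across the thread boundary is one way to ensure that each variable has a single owner.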

Data Availability
All images and corresponding ground truth annotations are available via the ETH openBIS instance.

Installation and Code Availability
All source code used to produce the quantitative plots is released under the 3-clause BSD license, and is publicly available as a git repository. Installation instructions are available on the readme of each code repository.

Quantitative Benchmarking of Ovarian Follicle Segmentation
To validate our segmentation algorithm, we performed a 10-fold cross-validation with each fold using an 80-10-10 train-validate-test split. We evaluated the precision (true positives / (true positives + false positives)) and recall (true positives / (true positives + false negatives)). We classified the error modes as described in Greenwald et al. [17]. A true positive is assigned when the predicted object has an intersection over union greater than 0.5 with a matching object in the ground truth. A false positive is when a predicted object does not match any ground truth objects. A false negative is when a ground truth object does not match any predicted objects. A merge error is when one predicted object matches multiple ground truth objects. A split error is when multiple predicted objects match a single ground truth object.
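The precision and recall definitions above can be made concrete with a short sketch. It assumes that each object is represented as a set of voxel indices and that objects within the prediction (and within the ground truth) do not overlap; merge and split errors are not handled, and the function names and toy data are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two voxel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def precision_recall(predicted, truth, threshold=0.5):
    """Count detections and return (precision, recall).

    `predicted` and `truth` are lists of voxel sets, one set per object.
    With non-overlapping objects and a threshold of at least 0.5, each object
    can match at most one counterpart, so a greedy search is sufficient.
    """
    matched_truth = set()
    true_positives = 0
    for pred in predicted:
        for index, gt in enumerate(truth):
            if index not in matched_truth and iou(pred, gt) > threshold:
                matched_truth.add(index)
                true_positives += 1
                break
    false_positives = len(predicted) - true_positives
    false_negatives = len(truth) - true_positives
    precision = true_positives / (true_positives + false_positives) if predicted else 0.0
    recall = true_positives / (true_positives + false_negatives) if truth else 0.0
    return precision, recall


def row_of_voxels(start, stop):
    """Hypothetical helper: a one-voxel-thick row of voxels along x."""
    return {(0, 0, x) for x in range(start, stop)}


# Toy example: three ground truth objects, two predictions.
truth = [row_of_voxels(0, 10), row_of_voxels(20, 30), row_of_voxels(40, 50)]
predicted = [row_of_voxels(0, 9), row_of_voxels(25, 35)]
precision, recall = precision_recall(predicted, truth)
```

In this toy case the first prediction overlaps its ground truth object with an IoU of 0.9 and counts as a true positive, while the second reaches an IoU of only 1/3 and counts as a false positive, leaving two ground truth objects undetected.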
To benchmark the performance, we compared to the results of a previous study [18], which reports results from a neural network architecture called CR-Unet. CR-Unet is a network architecture that incorporates a spatial recurrent neural network unit between the encoder and decoder arms of a Unet and is among the top performers of recently published algorithms (Table 1). A more recent study has achieved a recall of 0.91 for 4-12 mm diameter follicles with a network architecture called S-Net [19], but the authors do not report the precision, so we did not include it in the benchmarking. Similarly, a direct comparison with CR-Unet is difficult because the authors did not report the method used to call detection events for calculating precision and recall. For large follicles (>10 mm diameter), FollicleFinder achieves 14% higher precision and similar recall when compared to CR-Unet. For medium follicles (5-10 mm diameter), FollicleFinder achieves 9% higher recall and 7% lower precision. Finally, for small follicles, CR-Unet has higher precision and recall (18% and 7% higher, respectively). As we explore in the following section, in spite of the difference in precision, FollicleFinder is still able to make accurate clinical measurements (e.g., counts and sizes). In summary, when compared to CR-Unet, FollicleFinder achieves better performance on large follicles, similar performance on medium follicles, and slightly worse performance on small follicles. Finally, we note that we have trained on far fewer images than CR-Unet (90 versus 2509 images), and we anticipate that FollicleFinder's performance will be further improved as more training data is collected. Taken together, in spite of training on less data than CR-Unet, FollicleFinder achieves similar performance.

Clinically-relevant Measurements
To characterize the performance of FollicleFinder on clinically relevant metrics, we compared the values obtained from our predictions from fold 0 of our 10-fold cross-validation to the ground truth values. In particular, we compared the follicle counts (Figure S2), follicle sizes (Figure S3), and follicle localization (Figure S4). We see that for all measurements, the FollicleFinder values closely match the ground truth.
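The localization error can be sketched as the distance between the centroids of a predicted follicle and its ground truth counterpart, converted to millimetres. The example assumes that predicted and ground truth follicles have already been paired and that centroids are unweighted means of voxel indices; the function names and toy data are hypothetical.

```python
import math
from statistics import median

VOXEL_SIZE_MM = 0.157288  # isotropic voxel edge length after rescaling


def centroid(voxels):
    """Mean (z, y, x) position of a collection of voxel index tuples."""
    n = len(voxels)
    return tuple(sum(v[axis] for v in voxels) / n for axis in range(3))


def localization_error_mm(pred_voxels, truth_voxels, voxel_size_mm=VOXEL_SIZE_MM):
    """Euclidean distance between the two centroids, in millimetres."""
    return math.dist(centroid(pred_voxels), centroid(truth_voxels)) * voxel_size_mm


# Toy matched pairs (hypothetical data): one perfect prediction and one
# prediction shifted by exactly one voxel along x.
truth = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)]
pred_shifted = [(0, 0, 1), (0, 0, 2), (0, 1, 1), (0, 1, 2)]
errors = [
    localization_error_mm(truth, truth),
    localization_error_mm(pred_shifted, truth),
]
median_error = median(errors)
```

A shift of one voxel corresponds to 0.157288 mm, which is why a median error of 0.15 mm lies within one voxel.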