Generative adversarial networks in cell microscopy for image augmentation. A systematic review

Cell microscopy is the main tool that allows researchers to study microorganisms and plays a key role in observing and understanding their morphology, interactions, and development. However, there exist limitations in both the techniques and the samples that restrict the amount of data available for study. Generative adversarial networks (GANs) are a deep learning alternative to alleviate this data availability limitation by generating nonexistent samples that resemble the probability distribution of the real data. The aim of this systematic review is to find trends, common practices, and popular datasets, and to analyze the impact of GANs on image augmentation of cell microscopy images. We used ScienceDirect, IEEE Xplore, PubMed, bioRxiv, and arXiv to select English research articles that employed GANs to generate any kind of cell microscopy images, independently of the main objective of the study. We conducted the data collection using 15 selected features from each study, which allowed us to analyze the results from different perspectives using tables and histograms. 32 studies met the eligibility criteria, of which 18 had image augmentation as the main task. Moreover, we retrieved 21 publicly available datasets. The results showed a lack of consensus on performance metrics, baselines, and datasets. Additionally, we evidenced the relevance of popular architectures such as StyleGAN and of losses including the Vanilla and Wasserstein adversarial losses. This systematic review presents the most popular configurations used to perform image augmentation. It also highlights the importance of good design practices and gold standards to guarantee comparability and reproducibility. This review implemented the ROBIS tool to assess the risk of bias, and it was not registered in PROSPERO.


Introduction
… personnel from mechanical tasks, and simplifying their complexity. It is well known that model performance is proportional to the amount of data available for training; normally, a significant amount of training data is required to achieve the best model generalization possible, and even though data is not a limitation for most fields, specialized domains such as cell microscopy struggle in this regard [2].

Fortunately, generative modeling is an area of machine learning whose task is to understand the distribution of data in order to produce synthetic samples resembling the real data distribution. Generative adversarial networks (GANs), a well-established generative model and the focus of this study, comprise a family of DL algorithms based on the intuition of training two neural networks (NNs) simultaneously in a competitive fashion [3]. In computer vision, the evolution of GANs allowed researchers to propose promising solutions to several tasks comprising image classification, segmentation, augmentation, enhancement, and domain translation [4]. There are reviews analyzing GANs, their general foundations, and applications [4] [5], and specialized reviews of GANs in medical imaging [6] [7] [8] [9] [10]. Although some of these reviews compile some microscopy image datasets, we could not find a review focused on cell microscopy generation.

Considering the lack of data availability, we present a systematic review of GANs using cell microscopy imaging for image augmentation. The objectives of this work are:

• Analyze the studies using GANs to perform cell microscopy image augmentation.
• Analyze popular publicly available datasets to train generative models for cell microscopy image augmentation.
• Identify the most popular architectures, losses, and methods when using GANs in the field.
• Identify common practices of experimental design related to image augmentation.

We first screened different databases and selected relevant publications based on eligibility criteria, then analyzed and summarized them to discover the main trends, challenges, and limitations present in the field. Moreover, we analyzed the publicly available datasets used in the selected publications and summarized their principal characteristics. Finally, we also briefly discuss the most representative GAN architectures and definitions for those readers with a biological background or enthusiastic researchers who are introducing themselves to the field of GANs. The main contributions of this work are:

• A detailed compilation of 32 selected studies from 5 different databases, using 15 selected features to facilitate their comparison.
• A list and detailed description of the publicly available datasets used in the studies considered in this review.
• Important notions and considerations in the experimental design of generative modeling studies that are essential to produce research with robust methodology and results.

Materials and methods

Before defining the guidelines for the systematic review, we include a brief section to present and discuss the basic notions of GANs for those readers who are not so familiar with them. We also present some representative losses and architectures that will be relevant in this study.
Generative Adversarial Network (GAN)

The goal of GANs is to learn the probability distribution of a training dataset and use such distribution to generate new synthetic data that cannot be differentiated from the original dataset [11]. A GAN is traditionally composed of a generator that tries to learn the training data distribution, while the discriminator has to differentiate between synthetic and authentic data [3]. The training procedure of GANs can be conceived as a competition where the generator tries to fool the discriminator with realistic samples while the discriminator is trained to detect the generated ones. The formal definition of this problem is described as follows: with input noise z ∈ Z following the probability distribution p_z and training data x ∈ X with probability distribution p_data, two differentiable functions (generator and discriminator) represented as artificial neural networks (ANNs) are trained simultaneously.

The generator G takes z as input and transforms it into a synthetic sample x̂ ∈ X̂ that follows the generated probability distribution p_g. The discriminator D takes input samples from both X and X̂ and returns a scalar representing the probability that its input belongs to X; if D outputs 0, the discriminator concludes that the input does not belong to X and is, therefore, a synthetic sample.

The goal of G is to approximate p_g to p_data as much as possible using the information that D retrieves. Thus, it will be harder for D to find the synthetic samples once G is well trained.

GANs have developed rapidly since they were first introduced by Goodfellow et al. in 2014 [3]; several loss variants, architectures, and even additional components have been proposed to boost their performance. The Vanilla GAN loss implements the cross-entropy loss, which has interesting properties that are discussed later. The objective is presented in the first row of Table 1.

The Vanilla loss can be interpreted as the sum of the expected value of correctly classifying the real training samples and the expected value of correctly classifying the generated samples. Ideally, an optimal G causes D to output an average value of 0.5 no matter the data source (50% chance of being a real sample).

Additionally, Mirza and Osindero [12] extended the capacity of GANs with the conditional GAN (cGAN). The idea of cGAN is simple yet powerful: it uses additional information h that encourages G to produce synthetic data containing the conditional information h. The benefit of cGAN is that h can be any data type used as additional input to both G and D, as seen in Eq (1). One of the most popular conditional settings is to use an image as the generator input. In this way, the GAN performs Image-to-Image (I2I) translation, a process where G aims to transform images from an input domain into a target domain while preserving the meaning of the input image, e.g., transforming photographs into sketches. The most iconic architectures are Pix2Pix [13] and CycleGAN [14].
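For reference, the Vanilla minimax objective introduced by Goodfellow et al. [3] can be written as

\[
\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))],
\]

and conditioning both networks on the additional information h yields the cGAN objective of Mirza and Osindero [12], the form referred to above as Eq (1):

\[
\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x \mid h)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z \mid h)))]. \tag{1}
\]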
Subsequently, new adversarial losses have been proposed with the goal of addressing the limitations of the Vanilla GAN loss. The most representative adversarial losses are Least-square GAN (LSGAN) [15] in 2017, Wasserstein GAN (WGAN) [16] in 2017, and Relativistic GAN (RGAN) [17] in 2019. Table 1 summarizes the main characteristics of these losses. For those readers interested in the formal definitions, training dynamics, and theoretical implications of GANs, we refer them to Arjovsky and Bottou's work [18], which discusses all these topics thoroughly.
Table 1. Summary of popular adversarial losses. For each loss we list its standard objective, the divergence or distance it approximates, and its main attributes.

Vanilla
Equation: min_G max_D E_{x∼p_data}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))]
Approximates to: Jensen-Shannon divergence (D_JS) with optimal discriminator.
• Robust against instability due to gradient explosion.
• Instability due to gradient vanishing in case of disjoint supports.

Least-square
Equation: min_D ½ E_{x∼p_data}[(D(x) − 1)²] + ½ E_{z∼p_z}[D(G(z))²]; min_G ½ E_{z∼p_z}[(D(G(z)) − 1)²]
Approximates to: Pearson χ² divergence (D_χ²) with optimal discriminator.
• The least-square objective encourages G to push the generated samples to the decision boundary of D.
• Instability due to gradient vanishing in case of disjoint supports.

Wasserstein
Equation: min_G max_{C ∈ K-Lipschitz} E_{x∼p_data}[C(x)] − E_{z∼p_z}[C(G(z))]
Approximates to: Wasserstein-1 distance with a K-Lipschitz critic.
• This objective measures the distance between distributions, making it robust against disjoint supports.
• The discriminator is renamed critic C since it no longer performs classification; C now outputs a distance.
• The critic output is correlated with image quality, a property that allows monitoring of the training progress.

Relativistic
Equation: min_D −E[log σ(C(x) − C(G(z)))]; min_G −E[log σ(C(G(z)) − C(x))]
Approximates to: —
• This objective aims to measure how much more realistic a generated sample is compared to a real one.
• Any discriminator performing classification can be transformed into its relativistic counterpart.
• Considers prior knowledge related to the data and the training scheme that other losses do not consider.
StyleGAN [19] is one of the most popular GAN architectures for image augmentation. The authors proposed a unique generator architecture inspired by style transfer to better understand and control the synthesis process from the latent space where the generator samples its inputs. G is composed of two networks: a mapping network and a synthesis network. The mapping network is an 8-layer multilayer perceptron (MLP) that embeds the random noise input z ∈ Z into an intermediate latent space denoted as W. Unlike other generators, the synthesis network's input is a learned constant, and each convolutional block is fed with broadcast noise to provide stochasticity, and with styles to control its adaptive instance normalization (AdaIN). Based on the hypothesis that producing realistic outputs is easier from a disentangled space, the generator encourages the mapping network to produce a disentangled latent space W with more linear subspaces that facilitate the image synthesis. Each w ∈ W passes through learned affine transformations to produce the styles; each style is then fed into a specific block of the synthesis network.
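To make the mapping network and AdaIN mechanism concrete, the following is a minimal PyTorch sketch (an illustration under the 512-dimensional defaults described above, not the official StyleGAN implementation; the (1 + scale) parameterization is a common implementation convention):

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """8-layer MLP that embeds the input noise z into the intermediate
    latent space W, as described for the StyleGAN generator."""
    def __init__(self, z_dim=512, w_dim=512, n_layers=8):
        super().__init__()
        layers, dim = [], z_dim
        for _ in range(n_layers):
            layers += [nn.Linear(dim, w_dim), nn.LeakyReLU(0.2)]
            dim = w_dim
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)  # w in W

class AdaIN(nn.Module):
    """Adaptive instance normalization: a learned affine transformation of w
    produces a per-channel scale and bias (the 'style') that modulates the
    normalized feature maps of one synthesis block."""
    def __init__(self, w_dim, channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels)
        self.affine = nn.Linear(w_dim, 2 * channels)

    def forward(self, x, w):
        scale, bias = self.affine(w).chunk(2, dim=1)
        scale = scale[:, :, None, None]
        bias = bias[:, :, None, None]
        return (1 + scale) * self.norm(x) + bias
```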

This review was carried out following the guidelines of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA 2020) [25] as closely as possible (S3 Table). Even though PRISMA 2020 is mainly designed to structure systematic reviews related to medicine, the results of this review could lead to ideas in the context of DL that could benefit the health field. Additionally, we believe that implementing these guidelines enhances the process and outcomes of a review independently of its context. This systematic review is not registered in PROSPERO [26] and neither has nor follows any review protocol.
Eligibility criteria and search strategy

The only sources of information selected for the review are papers written in English whose main topic was the implementation or design of GANs. Additionally, the purpose of the GAN must be image augmentation of cell microscopy images. Even though I2I translation could be considered an image augmentation method, we decided to define image augmentation as the transformation of random noise into a generated sample (as in the traditional GAN definition). Papers are eligible as long as they perform image augmentation, independently of whether it was the main study goal; i.e., an image classification study that used a GAN architecture to augment the dataset is eligible and is therefore selected for this review.

The selected databases are IEEE Xplore, ScienceDirect, PubMed, arXiv, and bioRxiv. We retrieved studies without limit on the year of publication, meaning that we considered any publication since the introduction of GANs in 2014. Additionally, we did not include publications cited in the retrieved papers or publications from other sources, considering that our search criteria already cover the full period since the original GAN publication.
Table 2 summarizes the search query and other search parameters used in each database.

We summarized the publicly available datasets used in the studies that met the eligibility criteria similarly to the table described before. The features extracted from each dataset are microscopy modality, number of samples, image resolution, the task for which the dataset was designed, other published applications of the dataset, a binary feature indicating whether the dataset is annotated, cell type, URL, and notes. Moreover, we included a detailed description of each dataset.

Results

Our methodology retrieved a total of 321 publications, of which 32 met the eligibility criteria excluding duplicates: 10 (31.25%) from arXiv, 10 (31.25%) from IEEE Xplore, 6 (18.75%) from PubMed, 4 (12.5%) from ScienceDirect, and 2 (6.25%) from bioRxiv.

Year of publication

We observed an increased interest in data augmentation from 2020, the year with the most publications (8). After 2021 (with the same amount as 2020), the next most prolific year is 2022 (6), followed by 2023 and 2018, with 4 publications each (Fig 2A). The peak in 2018 could be attributed to the interest gained in the field after the introduction of Pix2Pix and CycleGAN in 2017. The oldest publication on cell image augmentation is from 2017. Considering that only a single publication used the Vanilla GAN architecture [28], and that Deep Convolutional GAN (DCGAN) [29] was published in 2016, we suppose that GANs were not powerful enough to produce realistic microscopy images before 2016.

GAN loss

The GAN loss feature has 13 different classes throughout the publications. It is possible to classify losses into two big groups, adversarial and auxiliary, which respectively guide the architecture training towards the specific task and boost the general performance. Independently of the data domain and the computer vision task, most GANs use both adversarial and auxiliary losses in their implementation. In this case, however, only 11 publications presented architectures that rely on both types of losses.

The most frequent auxiliary losses are the pixel-wise losses: the L2 norm (4 publications) and the L1 norm (2 publications). Concerning the adversarial losses, the Vanilla GAN loss was the most popular, present in 13 publications, followed by the WGAN variants. Fig 2C depicts the distribution of adversarial losses across the selected studies.
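As an illustration of how the two loss types are combined, the sketch below adds a pixel-wise L1 auxiliary term to a Vanilla adversarial term (a PyTorch sketch assuming a paired target, as in Pix2Pix-style training; the weight of 100 follows Pix2Pix but is a tunable hyperparameter):

```python
import torch
import torch.nn.functional as F

def generator_loss(disc_fake_logits, fake, target, pixel_weight=100.0):
    """Adversarial term plus a pixel-wise L1 auxiliary term. The auxiliary
    term rewards per-pixel fidelity to a paired target, while the adversarial
    term rewards realism; pixel_weight balances the two."""
    adversarial = F.binary_cross_entropy_with_logits(
        disc_fake_logits, torch.ones_like(disc_fake_logits))
    auxiliary = F.l1_loss(fake, target)
    return adversarial + pixel_weight * auxiliary
```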

Vanilla loss

The Vanilla adversarial loss, defined by the cross-entropy loss, is implemented so that G aims to maximize it while D tries to minimize it. Interestingly, this loss approximates a function dependent on the Jensen-Shannon divergence (D_JS) when D reaches its theoretical optimum.

Different from the Kullback-Leibler divergence (D_KL), a popular metric for distribution comparison, D_JS is a symmetric statistical distance, and its output is bounded to [0, 1], an essential factor for stabilizing neural network training. However, similar to D_KL, D_JS cannot measure the distance between two distributions with disjoint supports. If, for instance, the real and generated distributions do not overlap, D_JS = 1 independently of the distance between them, leading to possible gradient vanishing during training.
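For completeness, Goodfellow et al. [3] showed that with the optimal discriminator the generator objective reduces to

\[
C(G) = -\log 4 + 2\, D_{JS}(p_{\mathrm{data}} \,\|\, p_g)
\]

(in nats; normalizing D_JS by log 2 gives the [0, 1] bound mentioned above). With disjoint supports, D_JS sits at its maximum regardless of how far apart the distributions are, so G receives no useful gradient.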

Architectures

Most studies used traditional architectures such as CNN-based or Vanilla generators and discriminators. Only one used alternative architectures: a StyleGAN generator and a U-Net discriminator [31]. In that work, the authors designed the GAN to generate time-series images of two different microscopy modalities simultaneously using a two-branched generator. Another publication presented a two-branched generator that produces images and segmentation masks in a single run, with a discriminator that takes a 4-channel input. Kastaniotis et al. used attention maps from a trained classifier to perform knowledge transfer in the discriminator and focus the generator on details [33]. Some publications presented pipelines that combine several GAN architectures [34] [35] [36] [37], and one study aimed to compare the performance of different GAN architectures [38].
Wasserstein adversarial loss

Given that conventional adversarial losses suffer from training instability by their intrinsic definition, the authors of Wasserstein GAN designed an adversarial loss robust against this instability. As its name implies, WGAN exploits the Wasserstein-1 distance as a loss function. The Wasserstein-1 distance can be understood as the minimum amount of work required to move an input distribution to match a target distribution. The main benefit of this measure is that its output is proportional to the distance between the distributions, independently of whether they have disjoint supports. However, its computational complexity makes its calculation infeasible for high-dimensional data. Because of this, Arjovsky et al. [16] proposed to use the Kantorovich-Rubinstein duality and reformulate the Wasserstein-1 distance as a maximization problem over K-Lipschitz functions [39]. The main challenge of this reformulation is to encourage the neural network architecture to represent K-Lipschitz functions during training.
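A minimal sketch of a critic update under this formulation is shown below, assuming a PyTorch critic. Note that the original WGAN enforced the Lipschitz constraint by weight clipping; the gradient penalty used here is the widely adopted follow-up approach:

```python
import torch

def wgan_critic_loss(critic, real, fake, gp_weight=10.0):
    """One critic objective: minimizing this maximizes the Wasserstein-1
    estimate E[C(real)] - E[C(fake)]. `fake` should be detached from the
    generator graph. The gradient penalty softly enforces the 1-Lipschitz
    constraint on random interpolates between real and generated samples."""
    w_estimate = critic(real).mean() - critic(fake).mean()
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(critic(interp).sum(), interp,
                                create_graph=True)[0]
    penalty = ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    return -w_estimate + gp_weight * penalty
```

Because the critic output estimates a distance rather than a probability, its value can be logged to monitor training progress, as noted in Table 1.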

Relativistic adversarial loss

The relativistic GAN builds on the observation that, as the generator improves, the generated images start to get classified as real. A benefit of RGAN is that it can be easily applied to any adversarial loss using a regular discriminator. Analogous to WGAN, the discriminator can be seen as a critic with a final sigmoid layer: the critic measures how realistic an input is, while the activation function transforms that value into a probability. The authors take this critic, subtract the outputs for a real and a generated image, and feed the difference into the activation layer. In this way, the output of the whole discriminator tells how much more realistic one sample is compared to another. This implementation defines independent objectives for the discriminator and the generator, each of which is trained to maximize its own objective. Moreover, the generator now has an active role in both the discriminator and generator objectives. Empirical results suggest that RGAN is more efficient in terms of performance, stability, and complexity compared to Vanilla GAN, LSGAN, and WGAN [17].
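A minimal sketch of the relativistic standard GAN objectives described above (assuming a PyTorch critic that returns raw logits; the function names are ours):

```python
import torch
import torch.nn.functional as F

def rsgan_d_loss(critic, real, fake):
    """The discriminator maximizes the log-probability that a real sample is
    more realistic than a generated one (written here as a BCE minimization)."""
    diff = critic(real) - critic(fake)
    return F.binary_cross_entropy_with_logits(diff, torch.ones_like(diff))

def rsgan_g_loss(critic, real, fake):
    """The generator plays the symmetric objective; note that real samples now
    appear in the generator loss, giving G an active role in both objectives."""
    diff = critic(fake) - critic(real)
    return F.binary_cross_entropy_with_logits(diff, torch.ones_like(diff))
```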
The only two studies using relativistic GAN are based on PathologyGAN, a model combining BigGAN, StyleGAN, and RGAN. In [51], the authors included an "inverse generator" to transform images back into the input latent space, extracting significant features and generating samples at the same time. The second study implemented an unmodified version of PathologyGAN and selected images based on their latent representation to boost a classification network [52].

… ablation study, and 19 did not use any baseline (Fig 2D). One could argue that, since around 60% of the studies used unmodified GANs and 43% of the publications focused on a different computer vision task, there is no need to do an ablation study, use a baseline, or share a code implementation. However, we believe that a reference point (baseline and ablation study) is essential to assess the performance of an approach …

Table 3 collects the included studies with the most relevant features extracted from each paper. A full version of the table is presented in S1 Table.

HEp-2 dataset

… images (13,596 training images and 54,833 test images). The images were automatically segmented with the DAPI channel and manually annotated by experts. They are labeled with 6 different classes (homogeneous, centromere, speckled, nucleolar, mitotic spindle, and Golgi), and each sample also has metadata including label, cell intensity, cell mask, and ID. This dataset is only available upon request.

Arabidopsis thaliana dataset

The A. thaliana dataset is the product of a publication by Willis et al. [88], in which they studied the size and growth regulation of this plant with a Python pipeline. To do so, the authors designed an A. thaliana strain with fluorescence markers targeting different genes. There are three channels; the authors used the yellow and red channels to study the cell membranes and nuclei, while the green channel was not analyzed in the study. Each sample corresponds to a stack of volumetric data, which is automatically segmented and partly curated. The dataset has a total of 125 3D stacks (of variable resolution) of six different A. thaliana apical meristems.

Eschweiler et al. [53] only used the zebrafish data to train their GAN architecture, since they also used the Arabidopsis thaliana dataset.

BCCD dataset

The BCCD dataset is a blood cell dataset designed to train cell detection models. It has three labels (red cell, white cell, and platelet) and is composed of 365 light microscopy 640 × 480 images. The images are available in a GitHub repository, which also contains data preparation scripts for abnormality recognition.

Following the guidelines of ROBIS, we present the concern level and rationale for each domain proposed there:

Concerns regarding specification of study eligibility criteria

Low Concern. All signalling questions were answered as "Yes" or "Probably Yes", so no potential concerns about the specification of eligibility criteria were identified.

Concerns regarding methods used to identify and/or select studies

High Concern. Some eligible studies are likely to be missing from the review, since no additional searching was performed beyond the selected databases. Moreover, only a single person was responsible for screening the title, abstract, and whole body to classify a study as eligible. Finally, only papers written in English were considered.

Concerns regarding methods used to collect data and appraise studies

High Concern. Some bias may have been introduced since only a single person was responsible for data collection, and the nature of the studies does not allow a suitable risk of bias assessment.

One could argue that designing a robust loss objective could be an option to condition the generator and achieve the requirements of this domain. But again, the collected studies indicate that the Vanilla GAN loss is the most popular regardless of its theoretical limitations. It is possible that more sophisticated losses like WGAN require more resources, but this is not clear from the evidence gathered here. Moreover, most of the auxiliary losses are pixel-wise losses, which focus entirely on the pixel level, excluding neighborhood, shape, texture, and other content features that are essential when the purpose of the architecture is to tell whether a sample looks realistic or not.

There are perceptual losses that aim to overcome such limitations [101], but they are more popular for tasks such as I2I translation since they are supervised losses. It would be beneficial to assess the need for losses capable of capturing features that regular pixel-wise losses cannot for these applications.

Additionally, it was surprising to us that PatchGAN was present in only two studies. In most cases, a single cell represents a small percentage of the whole image; it is likely that a discriminator evaluates the image as a whole, considering only global features like cell distribution over the medium or cell-to-cell interactions while skipping local features that could be important, such as cell anatomy. Nevertheless, it is still hard to tell whether PatchGAN or a different architecture has the power to encourage the generator to reproduce those key features, since we evidenced a lack of consensus in the experimental designs.
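For reference, a minimal sketch of a Pix2Pix-style 70 × 70 PatchGAN discriminator is shown below (an illustration with the common default channel widths, not a prescription):

```python
import torch.nn as nn

def patchgan_discriminator(in_channels=3, base=64):
    """A Pix2Pix-style 70x70 PatchGAN: a fully convolutional classifier whose
    output is a grid of logits, each judging the realism of one local patch
    (receptive field) instead of the image as a whole."""
    def block(cin, cout, stride):
        return [nn.Conv2d(cin, cout, 4, stride, 1),
                nn.InstanceNorm2d(cout),
                nn.LeakyReLU(0.2)]
    return nn.Sequential(
        nn.Conv2d(in_channels, base, 4, 2, 1), nn.LeakyReLU(0.2),
        *block(base, base * 2, 2),
        *block(base * 2, base * 4, 2),
        *block(base * 4, base * 8, 1),
        nn.Conv2d(base * 8, 1, 4, 1, 1),  # one logit per patch
    )
```

Because each logit only sees a local receptive field, such a discriminator penalizes unrealistic local structure, e.g., cell anatomy, that a whole-image discriminator may ignore.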

Challenges, limitations, and opportunities
Experimental design

The experimental design is one of the biggest flaws that we encountered in the eligible studies. It is not possible to reproduce, compare, and assess the models presented in the selected studies: each model is trained and tested with different datasets, each study employs different baselines with distinct performance metrics, and a significant part do not share a code implementation. Even though data augmentation is generally not the main task of the studies, we believe that relying solely on downstream tasks like classification or object detection to assess generative performance should be considered carefully. Such downstream tasks are likely to be performed by other ANNs, which are susceptible to bias during training. Supporting the image augmentation evaluation with additional measurements will give a better insight into image quality and, therefore, ensure adequate performance of downstream architectures.

Performance metrics, evaluation, and realism

Evaluating and comparing GAN studies is currently not feasible. Another limitation we perceived in this study is how to evaluate GAN performance. One of the reasons for this difficulty is the performance metrics; this study shows that there are many different metrics, and although some can be transformed into others, the high number of options limits the assessment between models. Additionally, measuring realism is an intrinsic challenge of generative modeling. Tasks like I2I translation and image augmentation heavily rely on producing realistic images. If it is not possible to accurately measure the quality of the generated data, it will be unfeasible to compare models or estimate their generalization.

Medical and microscopy imaging require more robust metrics than other fields. Some of the most popular metrics, such as IS and FID, are based on ANNs trained on a general-purpose dataset, ImageNet [102], raising the concern of whether such metrics can capture and evaluate the features describing the quality of images that are likely to be significantly different from everyday objects. Considering this concern, Tronchin et al. [103] assessed and proposed a framework for evaluating GANs in the context of medical imaging. We believe that it is essential to focus attention on how results are measured. A good starting point could be the work of Borji [104], who presents and evaluates different qualitative and quantitative performance metrics for GANs.
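As a reference for what FID actually computes, the following is a minimal sketch of the Fréchet distance over precomputed Inception feature vectors (feature extraction is omitted; in practice, an established implementation should be preferred to guarantee comparability across studies):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feat_real, feat_gen):
    """Frechet distance between two Gaussians fitted to feature vectors
    (rows = samples): ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2(S_r S_g)^(1/2))."""
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    s_r = np.cov(feat_real, rowvar=False)
    s_g = np.cov(feat_gen, rowvar=False)
    covmean = linalg.sqrtm(s_r @ s_g)
    if np.iscomplexobj(covmean):  # discard numerical imaginary noise
        covmean = covmean.real
    return float(((mu_r - mu_g) ** 2).sum()
                 + np.trace(s_r + s_g - 2.0 * covmean))
```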

Even though GANs have been the state of the art for image generation for a considerable amount of time, they have important limitations compared to other generative models. First, GANs are known for their training instability; keeping a good balance between the generator and discriminator is mandatory to preserve stable gradients, and they are highly sensitive to hyperparameters as well. Second, an intrinsic limitation of these models is mode collapse [105], i.e., when the model cannot learn all the modes of the data distribution and ends up generating samples with limited variability. The current main alternative is diffusion models [106]. Compared to GANs, diffusion models are more stable and better capture the data distribution, at the cost of a considerably higher computational cost during training and generation. There is already a review of diffusion models in medical imaging [107]. Although it might seem that GANs are losing the interest of the community, the adversarial concept is still relevant in computer vision, and there is active research on whether merging features of GANs and diffusion models can yield improvements in the field [108].

The joint effort between biology and DL experts is of vital importance. Developing good practices, such as including ablation studies and baseline comparisons, and defining gold standards for metrics and datasets facilitates knowledge transfer. Finally, the risk of bias in this review could be high, considering that only a single person was responsible for the searching, screening, and analysis of the studies, and that we only used 5 databases. However, we must remember that the PRISMA guidelines were not designed for systematic reviews in the field of DL, and some criteria are not applicable. The systematic review is a practice that should be extended to fields beyond health, but an adaptation is required to make that possible.

Conclusion
This publication is, to our knowledge, the first systematic review of GANs with cell microscopy for image augmentation. In this study, we examined, summarized, and discussed the most popular methods, together with the current trends in how researchers approach problems in the domain of cell microscopy imaging with GANs. We also compiled and compared the public cell microscopy datasets that have been used for image augmentation since the introduction of GANs to give a brief overview of the available options. Finally, we found some inconsistencies related to the experimental set-up that should be considered not only in studies with GANs, but in any generative modeling study. Reproducibility and comparability are essential to speed up research progress in any field.