DiversityScanner: Robotic discovery of small invertebrates with machine learning methods

Invertebrate biodiversity remains poorly explored although it comprises much of the terrestrial animal biomass, more than 90% of species-level diversity, and supplies many ecosystem services. The main obstacle is the processing of specimen- and species-rich samples. Traditional sorting techniques require manual handling and are slow, while molecular techniques based on metabarcoding struggle with obtaining reliable abundance information. Here we present a fully automated sorting robot that detects each specimen, images and measures it, and moves it from a mixed invertebrate sample to a well of a 96-well microplate in preparation for DNA barcoding. The images are then used by a newly trained convolutional neural network (CNN) to assign the specimens to 14 particularly common, usually family-level "classes" of insects in Malaise trap samples and an "other" class (N=15 classes in total). The average assignment precision for the classes is 91.4% (75-100%). In order to obtain biomass information, the specimen images are also used to measure specimen length and estimate body volume. We outline how the DiversityScanner robot can become a key component for tackling and monitoring invertebrate diversity: the robot generates large numbers of images that become training sets for CNNs once the images are labelled with identifications based on DNA barcodes. In addition, the robot allows for taxon-specific subsampling of large invertebrate samples by removing only the specimens that belong to one of the 14 classes. We conclude that a combination of automation, machine learning, and DNA barcoding has the potential to tackle invertebrate diversity at an unprecedented scale.

would receive next to no attention because they only contribute very little biomass [2]. Indeed, endangered species would receive the least attention because many are functionally extinct. The same conclusion is supported when one adopts a species-diversity perspective: the largest numbers of multicellular species are fungi and invertebrates. These groups would also be research priorities if one were to adopt a functional or an evolutionary point of view, given that many fungal and invertebrate clades are much older and more diverse than the taxa that contain most of the charismatic species. All these points of view suggest that it will be critical to have efficient tools for assessing and monitoring non-charismatic taxa that provide numerous ecosystem services.

One major obstacle to pivoting attention towards the taxa that are important from a quantitative point of view is the lack of biodiversity data on many of the relevant taxa. More than 10 years ago, Robert May [3] summarized the state of affairs as follows: "We are astonishingly ignorant about how many species are alive on earth today, and even more ignorant about how many we can lose (and) yet still maintain ecosystem services that humanity ultimately depends upon." He highlighted that the discovery and description of earth's biodiversity is one of the large, outstanding tasks in biology, but he also anticipated that neglecting this task is perilous. Most of the undiscovered and undescribed diversity is in those invertebrate clades that are nowadays often called "dark taxa". Hartop et al. [4] recently defined these clades as those "for which the undescribed fauna is estimated to exceed the described fauna by at least one order of magnitude and the total diversity exceeds 1,000 species." They dominate many biodiversity samples and contribute most of the undescribed species-level diversity.
Species discovery in these taxa is particularly difficult because it requires the sorting of thousands of usually very small specimens that need to be dissected for careful morphological examination.
Fortunately, there are three technical developments that promise relief. The first is already widely used: cost-effective DNA sequencing with second- and third-generation sequencing technologies, which have revolutionized microbial ecology but can also be applied to invertebrate specimens [5]–[7]. In particular, portable nanopore sequencers by Oxford Nanopore Technologies are in the process of democratizing access to DNA sequence data [8]–[10]. However, the two remaining developments remain underutilized in biodiversity science: automation and data processing with neural networks. Currently, automation mostly exists in the form of pipetting robots in molecular laboratories, while data processing with neural networks is only widely used for the monitoring of charismatic species. Bulk invertebrate samples, which include most of the undiscovered and unmonitored biodiversity, remain orphaned although thousands of samples are collected every day. These samples include plankton and can be affected by taxonomic bias [16]. New systems are needed that yield comprehensive information. Fortunately, computer-based identification systems for invertebrates are starting to yield promising results [17]–[19]. Particularly attractive are deep convolutional neural networks with transfer learning [15], but they require reasonably large sets of training images, which are hard to obtain for invertebrates given that most species are undescribed and/or difficult to identify. It is here that robotics can have an important impact, if robotic handling of specimens can be combined with taxonomic identifications based on DNA barcodes. First steps in this direction have been taken. One system was developed for processing macroinvertebrate samples that are routinely obtained for freshwater quality assessment; it can size and identify stoneflies (Plecoptera) [20].
Another system focused on soil mesofauna [21]. However, these systems used a robotic arm, which made them comparatively expensive. Many other insect-sorting robots have been designed for more specific purposes. Some sort mealworm larvae (Tenebrio molitor) and can separate healthy larvae from skins, feces, and dead worms. Another, commercially available robot can sort mosquitoes [22] and is capable of distinguishing the sex of target species. However, all these machines lack the ability to recognize a wide variety of insect specimens preserved in ethanol. A machine that is closer to achieving this goal is the BIODISCOVER, a "robot-enabled image-based identification machine" by Ärje et al. [23], which can identify ethanol-preserved specimens that, however, have to be fed into the machine manually, one by one. After identification, all specimens are returned to the same container.
We here describe a new system that overcomes some of these shortcomings. It recognizes insect specimens based on an overview image of a sample. Specimens below 3 mm body length are then imaged and moved into the wells of a 96-well microplate. We demonstrate that the images are of sufficient quality for training convolutional neural networks to identify common taxa.

Furthermore, the images are used to derive length measurements and a coarse estimate of biomass based on specimen volume.
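As a sketch of how such a biomass proxy can be computed, the snippet below converts a pixel length to millimetres and models the body as a prolate spheroid. Both the calibration constant and the spheroid model are illustrative assumptions, not the calibration or volume formula actually used by the robot.

```python
import math

# Scale of the detail camera (C2): with a 1x telecentric lens the pixel
# pitch maps directly to object size. The constant below is a hypothetical
# calibration value for illustration only.
MM_PER_PIXEL = 0.0052

def length_mm(length_px: float) -> float:
    """Convert a body length measured in pixels to millimetres."""
    return length_px * MM_PER_PIXEL

def ellipsoid_volume_mm3(length: float, width: float) -> float:
    """Coarse biomass proxy: model the body as a prolate spheroid with the
    measured length as major axis and the width as both minor axes."""
    a = length / 2.0  # semi-major axis (mm)
    b = width / 2.0   # semi-minor axes (mm)
    return (4.0 / 3.0) * math.pi * a * b * b
```

Under this model, a 3 mm long and 1 mm wide specimen would be assigned a volume of roughly 1.57 mm³.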
Please note that we use the term "classification" in the machine-learning sense: assigning objects (specimens) to different classes.
For insect handling, a petri dish with ethanol-preserved insects is placed in the robot; the insects are then classified, measured, and sorted into a microwell plate. The setup consists of a 50 x 50 x 50 cm main frame in which all components except the control panel are located. Figure 1 shows the sorting robot with its x-, y-, and z-axes as well as the petri dish and the microwell plate. For operating the robot, a touch screen with a graphical user interface (GUI) is mounted on the front side.

The transportation system is based on a three-axis robot that transfers insects from a petri dish to a microwell plate and positions a camera for a detailed view (C2) of a single specimen. The x- and y-axes of the robot are realised by LEZ1 linear drives (Isel AG, Eichenzell, Germany) connected to the outer frame of the robot at half height. Both linear drives are driven by high-precision stepper motors with little tolerance to ensure good positioning accuracy. The y-axis is connected orthogonally to the shaft slide of the x-axis and is transported by it. The shaft slide of the y-axis transports both the camera (C2) and the z-axis with the suction hose. In order to move the suction hose in the z-direction (up and down), the z-axis is driven by an AR42H50 spindle drive with a stepper motor (Nanotec Electronic GmbH & Co. KG, Feldkirchen, Germany). The transportation system with its three axes is illustrated in Figure 1.
All three axes are controlled by a single TMCM-3110 motor controller (Trinamic, Hamburg, Germany) that allows for precise, fast, and smooth movements. The motor controller is located in a box at the bottom of the robot along with other electronics, so that it is protected from water and ethanol droplets. The transport system is controlled by a Raspberry Pi single-board computer running Python software specially developed for the sorting robot. In order to pick up insects from the petri dish and discharge them into a well of a 96-well microplate, a suction hose with a pipette tip is positioned by the transportation system. The hose is connected to an LA100 syringe pump (Landgraf Laborsysteme HLL GmbH, Langenhagen, Germany) that is also controlled by the Raspberry Pi. The sorting process is illustrated in Figure 2. The sorting system includes two cameras with different lenses: the overview camera (C1) and the detailed-view camera (C2).
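The pipetting step can be illustrated with a minimal pick-and-place sketch. The `Axes` and `SyringePump` classes below are hypothetical stand-ins for the TMCM-3110 and LA100 drivers; all method names, coordinates, and volumes are assumptions for illustration, not the robot's actual firmware interface.

```python
class Axes:
    """Hypothetical wrapper around the three-axis transport system."""
    def __init__(self):
        self.pos = (0.0, 0.0, 0.0)
    def move_xy(self, x, y):
        self.pos = (x, y, self.pos[2])
    def move_z(self, z):
        self.pos = (self.pos[0], self.pos[1], z)

class SyringePump:
    """Hypothetical wrapper around the syringe pump."""
    def __init__(self):
        self.actions = []
    def aspirate(self, ul):
        self.actions.append(("aspirate", ul))
    def dispense(self, ul):
        self.actions.append(("dispense", ul))

def transfer_specimen(axes, pump, specimen_xy, well_xy,
                      z_pickup=-40.0, z_travel=0.0, volume_ul=150):
    """Pick one detected specimen from the petri dish and discharge it
    into the target well of the 96-well microplate."""
    axes.move_xy(*specimen_xy)  # position pipette tip above the specimen
    axes.move_z(z_pickup)       # lower the tip into the ethanol
    pump.aspirate(volume_ul)    # suck up the specimen with some ethanol
    axes.move_z(z_travel)       # retract before travelling
    axes.move_xy(*well_xy)      # move above the target well
    axes.move_z(z_pickup)
    pump.dispense(volume_ul)    # release the specimen into the well
    axes.move_z(z_travel)
```

In the real system the same sequence would be issued as motor-controller and pump commands over the Raspberry Pi's interfaces.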
The first camera (C1) is a Ximea MQ042CG-CM camera with a CK12M1628S11 lens (Lensation GmbH, Karlsruhe, Germany) with a focal length of 16 mm and an aperture of 2.8; it is positioned directly above the petri dish to take an overview image of all insects inside. This image is used for detecting the insects and their positions within the petri dish for the sorting process. The second camera (C2) is a Ximea MQ013CG-E2 camera with a telecentric Lensation TCST-10-40 lens with a magnification of 1x. This camera is moved by the x- and y-axes of the robot above the position of an insect to take a detailed image for classification and length measurement. Figures 4 and 6 show exemplary images from the detail camera.

Figure 1. The insect sorting robot with 1: x-axis; 2: y-axis; 3: z-axis; 4: petri dish; 5: microwell plate; 6: overview camera (C1); 7: detail camera (C2). The electronics box with the Raspberry Pi, the motor control unit, and the syringe pump is in the lower part of the sorting robot and therefore not visible in this view. The status of the insect position determination and of the sorting process is displayed on a touch screen, where the sorting process can also be started and stopped.

(bioRxiv preprint, this version posted May 18, 2021; https://doi.org/10.1101/2021.05.17.444523; made available under a CC-BY 4.0 International license.)

Image Processing Software
Three different software algorithms are used: the first determines the position of each object within the square petri dish; the second measures the length and volume of each insect; the third, based on an artificial neural network, classifies insects into different classes.

After the overview image is taken, several image processing operations are performed to detect the objects: (1) a median filter removes noise from the image, (2) the RGB image is converted to a grayscale image, (3) an adaptive threshold filter segregates the objects, and (4) a contour finder identifies the boundaries of all objects. Two conditions must be met for objects to be detected: first, their size must be within a specified interval, and second, the distance between an object and neighboring objects must exceed a minimum threshold value. If a cluster of objects is present, the objects in the cluster fall below the specified minimum distance and are therefore not considered until they are separated. This ensures that only a single object is picked up during pipetting. Additionally, an accessible area within the petri dish has been defined that keeps a distance of ten millimeters from the edge to ensure that the insects can be reached (blue line in Figure 3).

Each sample contains a wide variety of insect taxa, but only the common ones can be covered by the trained CNN. To be able to process images of insects that do not belong to any of the 14 classes, an additional residual class is created.
This class consists of different taxa and of images of body parts (mainly legs and wings), each of which has too few images for a class of its own.
In total there are 693 images in this residual class.
Data Augmentation: Since the database consists of only 5018 images for training the CNN, data augmentation was performed to increase both the number of images and the invariance within a class. The following image processing operations were applied randomly to the images: rotation, width shift, height shift, shear, zoom, horizontal flip, and nearest-neighbor filling of exposed pixels.
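The listed operations correspond to the options of common augmentation libraries (e.g., the Keras `ImageDataGenerator`). As a self-contained illustration, the NumPy sketch below implements two of them, a random shift with nearest-edge fill and a horizontal flip; rotation, shear, and zoom would be chained in the same way. The parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_shift(img, max_frac=0.1):
    """Shift the image by up to max_frac of its size, filling exposed
    pixels with the nearest edge value ('fill mode nearest')."""
    h, w = img.shape[:2]
    my, mx = max(1, int(max_frac * h)), max(1, int(max_frac * w))
    dy = int(rng.integers(-my, my + 1))
    dx = int(rng.integers(-mx, mx + 1))
    # clipping the index grids replicates the nearest edge row/column
    ys = np.clip(np.arange(h) - dy, 0, h - 1)
    xs = np.clip(np.arange(w) - dx, 0, w - 1)
    return img[np.ix_(ys, xs)]

def random_flip(img, p=0.5):
    """Mirror the image horizontally with probability p."""
    return img[:, ::-1] if rng.random() < p else img

def augment(img):
    """Produce one randomly transformed copy of a training image."""
    return random_flip(random_shift(img))
```

Each training image can be passed through `augment` several times to enlarge the training set.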

all insects were correctly classified, whereas insects of the class Diptera Dolichopodidae had the lowest correct classification rate. Two different automated sorting processes are possible: either one insect after the other is classified and sorted until the last well of the 96-well microplate is filled, or only insects of a predefined class are pipetted into the well plates until no insect of this class can be found.
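The two sorting modes can be expressed as a single control loop. `classify` and `transfer` below are hypothetical stand-ins for the CNN inference and the pick-and-place routine; the function itself is an illustrative sketch, not the robot's actual control code.

```python
def sort_sample(specimens, classify, transfer, target_class=None, wells=96):
    """Mode 1 (target_class is None): sort every specimen until the
    96-well plate is full. Mode 2 (target_class given): transfer only
    specimens of one predefined class and skip everything else."""
    filled = 0
    for specimen in specimens:
        if filled >= wells:
            break  # plate is full
        label = classify(specimen)
        if target_class is None or label == target_class:
            transfer(specimen)
            filled += 1
    return filled
```

With `target_class=None` the plate is filled with whatever is detected; with a class name, only matching specimens are transferred.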

The use of CNNs for the identification of charismatic species is starting to be routine [26]–[28]. However, these methods are largely unavailable for small invertebrates although they comprise most of the multicellular animal species and contribute many ecosystem services. The main problem is not the availability of invertebrate samples, but the lack of the training images with which CNNs could be trained. We believe that the best strategy for changing this undesirable situation is to combine automated imaging with DNA barcoding. Each "DiversityScanner" robot can process several invertebrate samples per day, each containing thousands of specimens that can be imaged with minimal manual labor. After imaging, the specimens are moved into microplates for DNA barcoding. Once barcoded, the images can be re-labeled with approximately species-level identifications, given that most animal species have species-specific barcodes, or they can be assigned to family or genus level based on DNA sequence similarities. Common species, genera, and families rapidly acquire sufficiently large sets of images that can then be used for training CNNs. Indeed, for the most common "classes" of insects in Malaise traps, we already had enough images for creating such networks after partially imaging only five Malaise trap samples.
Some biologists doubt that CNNs will be sufficiently powerful to yield species-level identifications for closely related species, and we agree that it remains unclear whether species-level identifications can be achieved [15], [19]. However, we believe that the main limitation is not the CNN but the image quality and the orientation of the insects, and these limitations can be addressed. The design of the robot focused on reproducibility and low cost (<€5,000), so that many robots can sort a large number of insects simultaneously. This makes the robot an attractive alternative to manual identification and sorting. After modification, the DiversityScanner will also be suitable for many additional purposes. For example, larger specimens could be handled by modifying the suction tip diameters or by installing a gripper with a sensor-based feedback system that ensures that the specimens are not damaged. A particularly attractive modification would be the ability to subsample a sample. For example, some invertebrate samples are dominated by a few taxa whose exhaustive treatment may not be needed for monitoring; the robot could then be instructed to only fill/identify 2-3 microplates' worth of specimens for these taxa. Conversely, the user could specify that only certain taxa should be moved to microplates, or that different taxa should be moved to different microplates.

The latter would be particularly useful if the specimens are supposed to be barcoded with different molecular markers, or if taxon-specific DNA extraction or PCR recipes should be used. Many additional modifications are conceivable. For example, only specimens belonging to one sex could be selected, given that often only the morphology of one sex is species-specific.
Thus, we believe that robots like the DiversityScanner have the potential to solve some of the problems that were outlined by Robert May. Biodiversity discovery and monitoring can be greatly expedited, in particular for the "dark taxa" that have been largely ignored in the past because of the problems associated with their handling and identification.
Of course, the DiversityScanner can only address some of the challenges. For example, newly discovered species will still have to be described, and described species matched to types. Even when all species have been described or identified, we will still know very little about the ecological roles they play within ecosystems. Fortunately, molecular approaches to diet analysis and life-history-stage matching can help [29], [30], but ecosystems routinely consist of thousands of species. This means that automation and data analysis with the tools of AI will become increasingly important.