Abstract
The accuracy of machine learning models is critically dependent on high-quality ground truth data. Producing such data typically requires trained professionals, which can be costly in time, effort, and money. Here we explore the use of crowdsourcing to generate large volumes of training data of good quality. We study an image analysis task involving the segmentation of corn tassels from images taken in a field setting. We compare the accuracy, speed, and other quality metrics when this task is performed by students for academic credit, Amazon MTurk workers, and Master Amazon MTurk workers. We conclude that the Amazon MTurk and Master MTurk workers perform significantly better than the for-credit students, with no significant difference between the two MTurk worker types. The quality of the segmentation produced by Amazon MTurk workers rivals that of an expert. We provide best practices for assessing the quality of ground truth data and for comparing data quality produced by different sources. We conclude that properly managed crowdsourcing can be used to establish large volumes of viable ground truth data at low cost and high quality, especially in the context of high-throughput plant phenotyping. We also provide several metrics for assessing the quality of the generated datasets.
1 Introduction
Crop genetics includes basic research (what does this gene do?) and efforts to effect agricultural improvement (can I improve this trait?). Geneticists are primarily concerned with the former and plant breeders with the latter. A major difference in perspective between these groups is their interest in learning which genes underlie a trait of interest: whereas geneticists are generally interested in what genes do, breeders can treat the underlying genetics as opaque, selecting for useful traits by tracking molecular markers or via phenotypic selection directly (1).
Historically, the connections between plant genotype and phenotype were investigated through forward genetics approaches, which involve identifying a trait of interest, then carrying out experiments to identify which gene is responsible for that trait. With the advent of convenient mutagens, molecular genetics, bioinformatics, and high-performance computing, researchers were able to associate genotypes with phenotypes more easily via a reverse genetics approach: mutate genes, sequence them, then look for an associated phenotype.
However, forward genetics approaches are back on the table, given the more recent availability of inexpensive image data collection and storage coupled with computational image processing and analysis. In addition, breeders can now compute on phenotypes directly, allowing the scope and scale of breeding gains to be driven by computational power. While high-throughput collection of forward genetic data is now feasible, the analysis of phenotypic data must also be enabled in a high-throughput way. The first step in such analysis is to identify regions of interest as well as quantitative phenotypic traits from the images collected. Tang et al. described a model to extract tassels from a single corn plant photo through color segmentation (21). However, when images are taken under field conditions, classifying images using the same processing algorithm can yield sub-optimal results. Changes in illumination, perspective, or shading, as well as occlusion, debris, precipitation, and vibration of the imaging payload, all result in large fluctuations in image quality and information content. Machine learning (ML) methods have shown exceptional promise in extracting information from such noisy and unstructured image data. Kurtulmuş and Kavdir adopted a machine learning classifier, the support vector machine (SVM), to identify tassel regions based on the binarization of color images (10). An increasing number of methods from the field of computer vision are being recruited to extract phenotypic traits from field data (18, 24). For example, fine-grained algorithms have been developed not only to identify tassel regions, but also to quantify tassel traits such as total tassel number, tassel length, width, etc. (13, 22)
A necessary requirement for training ML models is the availability of labeled data. Labeled data consist of a large set of representative images with the desired features labeled or highlighted (hence the term labeled data). A large and accurate labeled data set, the ground truth, is required for training the algorithm. The focus of this project is the identification of corn tassels, which are complex structures, in field-acquired images (Figure 2). For this task, the labeling process consists of defining a minimum rectangular bounding box around each tassel. While seemingly simple, drawing a bounding box does require effort to ensure accuracy (20), and a good deal of time to generate a sufficiently large training set. Preparing such a dataset by a single user can be laborious and time consuming. To ensure accuracy, such a set should ideally be proofed by several people, adding more time, labor, and expense to the task.
Overall schema of datasets (boxes) and processes (arrows) that led to the analyses (red).
Example image used during training to demonstrate correct placement of bounding boxes around tassels
One solution to this problem is to recruit a large cohort of untrained individuals to perform the task, and to compile a plurality or majority of their answers into a training set. This approach, known as crowdsourcing, has been used successfully many times to provide image-based information in fields as diverse as astronomy, zoology, and computational chemistry (2, 3, 9, 14).
Crop genetics research has a long history of crowdsourcing large-scale efforts. For geneticists interested in identifying a single individual plant with a particular mutation of interest, large screening fields are grown and groups of student workers are sent into the fields to identify phenotypes of interest. Rates of success are often a single instance among thousands of plants. Similarly, plant breeders have used student workers in their fields to plant, carry out crosses, de-tassel, etc.
Students participate in experiments to learn about the research process and gain first-hand experience acting as participants. To manage these large university participant pools, cloud-based software such as the Sona system (www.sona-systems.com) is routinely used to schedule experiment appointments and to link to web-based research materials before automatically granting credit to participants. University participant pools provide a unique opportunity for crowdsourcing on a minimal budget because participants are compensated with course credit rather than money.
In addition to students, workers can be recruited through commercial platforms, for instance the Amazon Mechanical Turk (MTurk) platform (https://www.mturk.com/). MTurk is a popular venue for crowdsourcing data due to the large number of available workers and the relative ease with which tasks can be uploaded and payments disbursed. Methods for crowdsourcing data and estimates of its quality have been available for years, and several recommendations have emerged from past work. For example, collecting multiple responses per image can account for natural variation and the relative skill of the untrained workers (19). Furthermore, a majority vote of MTurk workers can label images with accuracy similar to that of experts (16). Although those studies were limited to labeling categorical features of stock images, other studies have shown success with more complex stimuli. For example, MTurk workers were able to diagnose disease and identify the clinically relevant areas in images of human retinas with accuracy approaching that of medical experts (14). Amazon’s MTurk is a particularly valuable tool for researchers because it provides incentives for high-quality work. The offering party can restrict their task to workers with a particular work history, or to those holding a more general qualification known as ‘Master Turk’ status. The Master title is a status given to workers by Amazon based on a series of criteria that Amazon believes to represent the overall quality of the worker; Amazon does not disclose those criteria.
The time and cost savings of using crowdsourced data are obvious, but crowdsourcing is only a viable solution if the output is sufficiently accurate. The goal of the current project was to test whether crowdsourcing image labels (also called tags) could yield a sufficient positive-data training set for ML from image-based phenotypes in as little as a single day. We focus on corn tassels for this effort (see Figure 3), but findings are anticipated to extend to other similar tasks in plant phenotyping.
Left: sample participant-drawn boxes. Right: the red box is the gold standard box and the black box is a participant-drawn box
In this project, we recruited three groups of people for our crowdsourced tassel identification task from two online platforms, Sona and MTurk. The first group consisted of students recruited through Sona (the course credit group). The second group consisted of Master-status Mechanical Turk workers who were paid (the Master MTurkers group), and the third group consisted of non-Master Mechanical Turk workers who were paid (the non-Master MTurkers group). The accuracy of each group’s tassel identification was evaluated against an expert-generated gold standard. These crowdsourced labelled images were then used as training data for a “bag-of-features” machine learning algorithm. The overall scheme of this project is shown in Figure 1.
We found that under the same fee structure, the performance of Master and non-Master MTurkers was not significantly different, with a median performance accuracy (defined in Section 3.1) of 0.79. The course credit group performed less well, with a median accuracy of 0.69. No group showed a practically meaningful decline in accuracy over time, and all groups completed later images more quickly, showing an increase in speed over time. The ML algorithm trained separately on each of the three training sets performed equally well regardless of source, achieving an accuracy of 0.88. We conclude that crowdsourcing via MTurk can be useful for establishing ground truth for complex image analysis tasks in a short amount of time and that MTurkers’ performance exceeds that of students working for course credit.
2 Methods
2.1 Recruiting Participants
The course credit group included 30 participants recruited from the undergraduate psychology participant pool at Iowa State University. Individuals in the course credit group were recruited through the subject pool software Sona (www.sona-systems.com) and were compensated with course credit. The Master MTurkers group included 65 Master-qualified workers recruited through MTurk. The exact qualifications for Master status are not published by Amazon, but are known to include work experience and employer ratings of completed work. Master MTurkers were paid $8.00 to complete the task, and the total cost was $572.00. Finally, the non-Master MTurkers group included 66 workers recruited through the Amazon Mechanical Turk website with no qualification restriction. Due to the nature of Amazon’s MTurk system, it is not possible to recruit only participants who are not Master qualified. However, the purpose of including the non-Master MTurkers was to evaluate workers recruited without the additional fee imposed by Amazon for recruitment of Master MTurkers. Non-Master MTurkers were also paid $8.00 to complete the task, and the total cost was $568.00. Note that the costs include Amazon’s fees.
2.2 Pilot Study
A short cropping task was initially administered to university students and Master MTurkers as a pilot study to test the viability of this project and the task instructions. Each participant was presented with a participant-specific set of 40 images randomly chosen from 393 total images. Based on the accuracy of participant labels in the pilot, 40 images were classified as “easy to crop” and 40 as “hard to crop”. The expert who created the gold standard boxes then adjusted the Easy/Hard classifications based on personal experience. These 80 images were selected for the main study. Unlike in the pilot study, participants in the main study each received the same set of 80 images, with image order randomized separately for each participant. The results of the pilot study indicated that at least 40 images could be processed without evidence of fatigue, so the number of images included in the main experiment was increased to 80. The pilot study also indicated, via user feedback, that a compensation rate of $8.00 for the set of 80 images was acceptable to the MTurk participants.
2.3 Gold Standard
We define the gold standard box for a given tassel as the box with the smallest area among all bounding boxes that contain the entire tassel, i.e., a minimum bounding box. Gold standard boxes were generated by a trained and experienced researcher: an expert cropped all 80 images, and the resulting boxes were then computationally minimized to be minimum bounding. These boxes were used to evaluate the labelling performance of crowdsourced workers, and should not be confused with the ‘ground truth’, which refers to the labeled boxes used to train the ML model.
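The minimization step can be sketched as follows. Assuming a binary foreground mask marking the tassel pixels inside an expert-drawn box (the mask representation is our assumption; the study only states that boxes were computationally minimized), the minimum bounding box is simply the tightest rectangle around the nonzero pixels:

```python
import numpy as np

def minimum_bounding_box(mask: np.ndarray):
    """Tightest axis-aligned box around the nonzero (tassel) pixels of a
    binary mask. Returns (row_min, row_max, col_min, col_max), inclusive.
    The mask is a hypothetical representation of the expert's segmentation;
    the original minimization procedure is not described in detail."""
    rows = np.any(mask, axis=1)
    cols = np.any(mask, axis=0)
    if not rows.any():
        raise ValueError("mask contains no foreground pixels")
    row_min, row_max = np.where(rows)[0][[0, -1]]
    col_min, col_max = np.where(cols)[0][[0, -1]]
    return int(row_min), int(row_max), int(col_min), int(col_max)
```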
2.4 Materials and Procedure
We randomly selected the images used in this study from a large pool of images obtained as part of an ongoing maize phenomics project. The field images focused on a single row of corn captured by cameras set up as part of the field phenotyping of the maize Nested Association Mapping population (23), using 456 cameras simultaneously, each camera imaging a set of 6 plants. Each camera took an image every 10 minutes during a two-week growing period in August 2015 (12). Some image features varied, for example, due to weather conditions and visibility of corn stalks, but the tassels were always clearly visible. Images were presented through a Java applet linked by a web page hosted by Qualtrics (www.qualtrics.com). After providing informed consent, participants viewed a single page with instructions detailing how to identify corn tassels and how to create a minimum bounding box around each tassel. Participants were first shown an example image with the tassels correctly bounded with boxes (Figure 2). Below the example, participants read instructions on how to create, modify, and delete bounding boxes using the mouse. These instructions explained that an ideal bounding box should contain the entire tassel with as little additional image detail as possible. Additional instructions indicated that overlapping boxes and boxes containing other objects would sometimes be necessary and were acceptable as long as each box accurately encompassed the target tassel. Participants were also instructed to only consider tassels in the closest plant row, ignoring tassels from plants that appear to be more distant. After reading the instructions, participants clicked to progress to the actual data collection. No further feedback or training was provided.
For each image, participants created a unique bounding box for each tassel by clicking and dragging the cursor. Participants could subsequently adjust the vertical or horizontal size of any drawn box by clicking and dragging on a box corner, and could adjust the position of any drawn box by clicking and dragging in the box body. Participants were required to place at least one box on each image before moving on to the next image. No upper limit was placed on the number of boxes. Returning to previous images was not allowed. The time required to complete each image was recorded in addition to locations and dimensions of user-drawn boxes.
3 Crowdsourcing Accuracy Evaluation
3.1 Defining Precision and Recall
Consider any given participant-drawn box and gold standard box as in the right panel of Figure 3. Let PB be the area of the participant box, let GB be the area of the gold standard box, and let IB be the area of the intersection between the participant box and the gold standard box. Precision (Pr) is defined as IB/PB, and recall (Rc) is defined as IB/GB. Both Pr and Rc range from a minimum value of 0 (when the participant box and gold standard box fail to overlap) to a maximum value of 1. As an overall measure of performance for a participant box as an approximation to a gold standard box, we use the harmonic mean of precision and recall given by
F1 = (2 × Pr × Rc) / (Pr + Rc).
Each participant box was matched to the gold standard box that maximized F1 across all gold standard boxes within the image containing the participant box. In the event that more than one participant box was matched to the same gold standard box, the participant box with the highest F1 value was assigned the Pr, Rc, and F1 values for that match, and the other participant boxes matching that same gold standard box were assigned Pr, Rc, and F1 values of zero. In the usual case of a one-to-one matching between participant boxes and gold standard boxes, each participant box was assigned the Pr, Rc, and F1 values associated with its matched gold standard box.
To summarize the performance of a participant on a particular image, F1 values across participant-drawn boxes were averaged to obtain a measure referred to as Fmean. This provides a dataset with one performance measurement for each combination of participant and image that we use for subsequent statistical analysis.
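As a concrete illustration, the following sketch implements these definitions for axis-aligned boxes represented as (x_min, y_min, x_max, y_max) tuples (a representation we assume here; any equivalent one works), including the matching rule and the Fmean average:

```python
from itertools import groupby

def pr_rc_f1(pbox, gbox):
    """Precision, recall, and F1 of a participant box against a gold box.
    Boxes are (x_min, y_min, x_max, y_max) tuples with non-zero area."""
    ix_min, iy_min = max(pbox[0], gbox[0]), max(pbox[1], gbox[1])
    ix_max, iy_max = min(pbox[2], gbox[2]), min(pbox[3], gbox[3])
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)  # IB
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    pr = inter / area(pbox)  # Pr = IB / PB
    rc = inter / area(gbox)  # Rc = IB / GB
    f1 = 0.0 if inter == 0 else 2 * pr * rc / (pr + rc)
    return pr, rc, f1

def fmean(participant_boxes, gold_boxes):
    """Average F1 across a participant's boxes for one image. Each participant
    box is matched to the gold box maximizing its F1; if several participant
    boxes match the same gold box, only the best keeps its score and the
    others are assigned zero, as described in Section 3.1."""
    matches = []
    for i, pbox in enumerate(participant_boxes):
        best_f1, best_j = max((pr_rc_f1(pbox, g)[2], j)
                              for j, g in enumerate(gold_boxes))
        matches.append((best_j, best_f1, i))
    scores = [0.0] * len(participant_boxes)
    matches.sort()
    for _, group in groupby(matches, key=lambda m: m[0]):
        _, f1, i = max(group, key=lambda m: m[1])
        scores[i] = f1
    return sum(scores) / len(scores)
```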
3.2 Dataset Description
Of the 30 students recruited, 26 completed all 80 images. Of the 65 Master MTurkers recruited, 49 completed all images. Of the 66 non-master MTurkers recruited, 51 completed all images. Data collected from participants who did not complete the survey were not included in subsequent analyses.
As described in Section 3.1, precision and recall were calculated for each participant-drawn box. The density of precision-recall pairs by group, based on 61,888 participant-drawn boxes, is shown in the heatmap visualizations of Figures 4a, 4b and 4c.
Density of precision, recall and Fmean for all three groups
High value precision-recall pairs are more common than low value precision-recall pairs in all three groups. Perfect recall values were especially common because participants tended to draw boxes that encompassed the minimum bounding box, presumably to ensure that the entire tassel was covered. Figure 4d shows the distribution of Fmean for the three groups.
3.3 Testing for Performance Differences among Groups
We used a linear mixed-effects model analysis to test for performance differences among groups with the Fmean value computed for each combination of image and user as the response variable. The model included fixed effects for groups (Master MTurker, non-Master MTurker, course credit), random effects for participants nested within groups, and random effects for images. The mixed procedure available in SAS software was used to perform this analysis with the Kenward-Roger method (8) for computing standard errors and denominator degrees of freedom. The analysis shows significant evidence for differences among groups (p-value < 0.0001). Furthermore, pairwise comparisons between groups (Table 1) show that both Master and non-Master MTurkers performed significantly better than undergraduate students performing the task for course credit. There was no significant performance difference between Master and non-Master MTurkers.
Parameter estimates from the ANOVA with the Master MTurk group as baseline.
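The original analysis was run in SAS PROC MIXED with the Kenward-Roger adjustment. For readers without SAS, a roughly equivalent model can be sketched in Python with statsmodels, treating participant and image as crossed random effects via variance components; note that statsmodels does not provide Kenward-Roger degrees of freedom, and the column names below (Fmean, group, participant, image) are our assumptions about how the data might be organized:

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per participant-image combination (hypothetical file and columns).
df = pd.read_csv("fmean_by_participant_and_image.csv")
df["all"] = 1  # single dummy group so participant and image act as crossed random effects

model = smf.mixedlm(
    "Fmean ~ C(group)",          # fixed effect for group
    data=df,
    groups="all",                # one top-level group
    re_formula="0",              # no ordinary random intercept
    vc_formula={                 # crossed random effects as variance components
        "participant": "0 + C(participant)",
        "image": "0 + C(image)",
    },
)
result = model.fit()
print(result.summary())
```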
3.4 Time Usage and Fatigue
Participants took a median time of 26.43 seconds to complete an image, with median times of 30.02 seconds for the Master MTurker group, 29.40 seconds for the non-Master MTurkers, and 16.86 seconds for the course credit group; the course credit group generally spent less time than both MTurker groups. It is worth noting that there is a large variance in time spent per image, with the longest time for a single image at 15,484.63 seconds and the shortest at 0.88 seconds. The very long time was probably due to a participant taking a break after cropping part of the image and returning later to finish it. Figure 5a shows the histogram of time per image on a log scale.
Time usage and accuracy trends for all three groups: (a) distribution of time per image on a log scale; (b) time per image over the course of the task; (c) Fmean over the course of the task; (d) Fmean versus time spent per image
There is a general downward trend in the time spent on each image over the course of the task. The trend is shown in Figure 5b, via a mixed-effects linear regression on log time with random effects for user and image. The trend is statistically significant in all three groups, with similar effect sizes: with each additional image completed, a participant’s expected time per image is reduced by about 1%, as shown in Table 2. By examining the interaction term between participant group and question index, we conclude that the reduction in time is not significantly different between the Master MTurker and non-Master MTurker groups (p=0.6003), but does differ between the course credit group and the Master MTurker group (p=0.0431). The difference is weaker for the course credit group versus non-Master MTurkers, with a p-value of 0.1086.
Parameter estimates from the linear mixed-effects regression of time spent per image
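The same machinery can sketch the time-trend model: regress log time on question index, group, and their interaction, with participant and image again entering as variance components (the seconds and question_index columns are hypothetical names for the recorded time per image and presentation order):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# df as in the previous sketch, with hypothetical columns `seconds` (time
# spent on each image) and `question_index` (presentation order).
df = pd.read_csv("fmean_by_participant_and_image.csv")
df["all"] = 1
df["log_time"] = np.log(df["seconds"])

time_model = smf.mixedlm(
    "log_time ~ question_index * C(group)",  # trend, group, and interaction
    data=df,
    groups="all",
    re_formula="0",
    vc_formula={"participant": "0 + C(participant)",
                "image": "0 + C(image)"},
)
time_result = time_model.fit()
# A question_index coefficient near -0.01 on the log scale corresponds to
# roughly a 1% reduction in time per additional image.
print(time_result.summary())
```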
We also analyzed the change in accuracy, as measured by Fmean, as the task progressed. Figure 5c shows that Fmean decreases slightly as the task progresses. The decreases are statistically significant (p < 0.05) for all three groups. However, the effect sizes (average decrease in Fmean per additional image) for both MTurker groups are almost negligible, with the Master MTurk group showing a 0.00080 decrease per image and the non-Master group a 0.00027 decrease. The decrease in Fmean for the course credit group is only slightly more noticeable, at 0.00095.
The decreasing Fmean trend is statistically significant in all three groups, as shown in Table 3. The table was obtained by fitting an interaction term in addition to the fixed effects of question ordinal index and group, as well as the random effects.
Type 1 Test of Fixed Effects
To summarize the effect of image order, there was a subtle decline in Fmean and a larger decrease in image completion time as the survey progressed.
Another question of interest was whether accuracy on an image correlates with the time spent completing it. Indeed, accuracy increased slightly when a participant spent more time on an image, as shown in Figure 5d. This correlation is statistically significant in all three groups, but the effect sizes are again too small to conclude that spending more time on a single image meaningfully improves accuracy on that image.
In conclusion, all three groups of participants spent less time on each image as the survey progressed, suggesting growing familiarity with the task. Although their performance also decreased slightly over time, the effects were almost negligible. This fatigue effect was most evident in the course credit group. These observations are consistent with the positive correlation between time spent per image and accuracy.
3.5 Image Difficulty
We obtained the Best Linear Unbiased Predictor (BLUP) (5) for each image in the above analyses to assess whether each image contributes to increased or decreased accuracy and time. BLUPs can be viewed as estimates corresponding to random effects, in our case the eighty images. Figure 6 is a scatter plot, with each point representing an image. The x-axis shows the BLUPs with respect to log time: the higher the BLUP, the more that particular image contributes to increased time spent per image. Similarly, the y-axis shows the BLUPs with respect to Fmean: the higher the BLUP, the more the image contributes to increased accuracy. We also obtained a difficult/easy classification of all eighty images from the expert who manually curated the gold standard boxes, shown by the two different colors on the plot.
Best Linear Unbiased Predictors for each image in the analyses of Fmean and log time. Color represents image difficulty as determined by the expert
It is interesting that longer time spent on an image positively correlates with accuracy: the linear regression fit, shown as the red line on the plot, has a slope estimate of 0.1003 (p=0.00136) and an adjusted R² of 0.1127, suggesting a weak correlation. It is even more interesting that the images our expert considered difficult did not take participants longer to complete, nor did they yield significantly lower accuracy. The images were shown to participants in a random order, eliminating the possibility that fatigue, rather than image content, drives differences in completion time between images. Since the previous analysis showed that participants tend to spend less time on images shown to them later (Figure 5b), this may argue for ordering the images so that more difficult images are shown to participants first, taking advantage of the fact that participants spend more time per image at the beginning, to ensure optimal accuracy.
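As a simplified illustration of how such BLUPs can be extracted (using image as the only random effect rather than the full model above, and the same hypothetical column names as in the earlier sketches), one can fit a random-intercept model per response and read off the estimated per-image effects:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

df = pd.read_csv("fmean_by_participant_and_image.csv")  # hypothetical file
df["log_time"] = np.log(df["seconds"])

def image_blups(response):
    """Per-image random-intercept estimates (BLUPs) for a response variable,
    from a simplified model with image as the only random effect."""
    m = smf.mixedlm(f"{response} ~ C(group)", data=df, groups="image").fit()
    return pd.Series({img: effects.iloc[0]
                      for img, effects in m.random_effects.items()})

# Regress accuracy BLUPs on time BLUPs, analogous to the red line in Figure 6.
both = pd.concat({"time": image_blups("log_time"),
                  "fmean": image_blups("Fmean")}, axis=1)
slope, intercept, r, p, se = stats.linregress(both["time"], both["fmean"])
print(slope, p, r**2)
```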
4 Machine Learning Accuracy Evaluation
Each of the 126 participants who completed this study labelled a set of 80 images. Each of these sets was used as a training set for a bag-of-features (15) machine learning model. The resulting models were then tested on a new set of labelled images to assess accuracy. Due to algorithmic differences, the accuracy metric used to evaluate ML performance is not comparable with the Fmean accuracy. Accuracy was calculated as the average of the true positive rate and true negative rate for each participant’s set of images. Overall, the algorithm achieved an accuracy of 0.8811. For the Master and non-Master MTurker groups, the average accuracy rates were 0.8851 and 0.8781, respectively; for the course credit group, it was 0.8795. A linear regression was performed to determine whether model performance differed by group. The F test yielded a p-value of 0.7325, indicating no detectable difference in model performance across the three sources of training data.
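The exact bag-of-features implementation is not detailed here; the sketch below shows one generic way such a classifier could be built with scikit-learn, assuming local descriptors (e.g., SIFT-like vectors) have already been extracted for each labelled region, and reports accuracy as the mean of the true positive and true negative rates, matching the metric described above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC
from sklearn.metrics import balanced_accuracy_score

def bag_of_features_histograms(descriptor_sets, vocabulary):
    """Encode each region's local descriptors as a normalized histogram of
    visual words (the 'bag of features')."""
    hists = []
    for descriptors in descriptor_sets:
        words = vocabulary.predict(descriptors)
        h = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
        hists.append(h / max(h.sum(), 1.0))
    return np.vstack(hists)

def train_and_score(train_desc, train_labels, test_desc, test_labels, k=500):
    """Fit a visual vocabulary and a linear classifier on one participant's
    labelled regions, then score held-out regions using the mean of the
    true positive and true negative rates (balanced accuracy)."""
    vocabulary = KMeans(n_clusters=k, n_init=3).fit(np.vstack(train_desc))
    X_train = bag_of_features_histograms(train_desc, vocabulary)
    clf = LinearSVC().fit(X_train, train_labels)
    predictions = clf.predict(bag_of_features_histograms(test_desc, vocabulary))
    return balanced_accuracy_score(test_labels, predictions)
```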
5 Discussion
Machine learning methods have proven useful for processing images for inclusion in various databases. However, these algorithms still require an initial training set created by expert individuals before structures can be automatically extracted from images and labeled. This project has identified crowdsourcing as a viable method for creating these initial training sets without the time-consuming and costly work of an expert. These results indicate that straightforward tasks, such as tassel cropping, do not benefit from the extra fee assessed to hire Master rather than non-Master MTurkers. Performance between the two groups was not significantly different, and non-Master MTurkers can safely be hired without compromising data quality.
The MTurk platform allows for fast collection of data, within a day instead of one to two weeks. While MTurk may be one of the most popular crowdsourcing platforms, many universities maintain a research participant pool that compensates students with class credit instead of cash for their work. If the image tagging task meets Institutional Review Board (IRB) approval, students could be tapped to tag images for course credit and further reduce the cost of sourcing data. However, the undergraduate student participant pool performed more poorly than either of the MTurker groups. While it is possible that MTurk workers are simply more conscientious than college students, it is also possible that monetary compensation is a better motivator than course credit. In addition to the direct monetary reward, both groups of MTurkers were also motivated by working towards or maintaining “Master” status. Such implicit motivational mechanisms might be useful in setting up a long-term crowdsourcing platform. The distinction in labelling performance between MTurkers and students does not persist when considering the actual outcome of interest: how well the machine learning algorithm identifies corn tassels when supplied with each of the three training sets. Indeed, the accuracy of the machine’s performance was not affected by which crowdsourced, manually labelled training set it was given. Therefore, a student participant pool with a non-monetary reward system offers an alternative model that lowers the overall cost of image tagging, allowing additional features to be tagged, or a larger number of responses to be sourced, with existing funding levels, enabling further database expansion.
Indeed, many non-monetary crowdsourcing projects already exist: for example, the Backyard Worlds: Planet 9 project hosted by NASA for the search for planets and star systems (9), the Phylo game (http://phylo.cs.mcgill.ca/) for multiple sequence alignment (7), and fold.it (http://fold.it) (3) for protein folding. These projects do not offer monetary rewards but instead attract participants by offering the chance to contribute to real scientific research. This concept has been categorized as citizen science, in which nonprofessional scientists participate in crowdsourced research efforts. In addition to the attraction of the subject matter, these projects often have interactive and entertaining interfaces to quickly engage people’s interest and attention, and they provide extensive demonstrations. Some are even designed as games, with competition mechanisms such as rankings providing extra motivation. Another important purpose of such citizen science projects is to educate the public about the subject matter. Given the current climate regarding Genetically Modified Organisms (GMOs), crowdsourcing efforts in crop phenomic and phenotypic research could potentially serve as a gateway to better public understanding of plant research. A recent effort has shown that non-experts can perform accurate image-based plant phenomics annotation tasks (4). However, the authors pointed out the challenge of sustaining a large-scale annotation effort with non-monetary rewards.
Phenomics is the quantitative and qualitative study of phenomes, the full set of traits of a given organism that vary in response to genetic mutations and environmental influences (6). An important field of research in phenomics is the development of high-throughput technology, analogous to high-throughput sequencing in genetics and genomics, to enable the collection of large-scale data with minimal effort. Many phenotypic traits can be recorded as images, and databases such as BioDIG (17) connect such image data with genomic information, providing genetics researchers with tools to examine the relationship between the two types of data directly. Hence, the computation and manipulation of such phenomic image data becomes essential. In plant biology, maize is central to both basic biological research and crop production (reviewed in (11)). As such, phenotypic information derived from the ear (female flowers) and tassel (male flowers) is key to both the study of genetics and crop productivity: flowers are where meiosis and fertilization occur, and they are the source of grain. To add new features such as tassel emergence, size, branch number, branch angle, and anthesis to systems such as BioDIG, tassel locations and structures must first be identified; our solution to this task is to combine crowdsourcing with machine learning to reduce the cost and time of such a pipeline while expanding its utility. Our findings and suggested crowdsourcing methods can be applied generally to other phenomic analysis tasks.
We hope our study will help establish some best practices for researchers in setting up such a crowd-sourcing study. Given the ease and relatively low cost of obtaining data through Amazon’s Mechanical Turk, we recommend it over the undergraduate research pool. That being said, student research pools would be a suitable method for obtaining proof of concept or pilot data to support a grant proposal.
6 Funding
This work was supported primarily by an award from the Iowa State University Presidential Interdisciplinary Research Initiative to support the D3AI (Data-Driven Discovery for Agricultural Innovation) project. For more information, see http://www.d3ai.iastate.edu/. Additional support came from the Iowa State University Plant Sciences Institute Faculty Scholars Program and the USDA Agricultural Research Service. IF was funded, in part, by National Science Foundation award ABI 1458359. DN, BG and CJLD gratefully acknowledge Iowa State University’s Plant Sciences Institute Scholars program funding.