Data science competition for cross-site delineation and classification of individual trees from airborne remote sensing data

Delineating and classifying individual trees in remote sensing data is challenging. Many tree crown delineation methods struggle in closed-canopy forests and do not leverage multiple datasets. Methods to classify individual species are often accurate for common species but perform poorly for less common species and when applied to new sites. We ran a data science competition to help identify effective methods for the delineation of individual crowns and for classification to determine species identity. The competition included data from multiple sites to assess the methods' ability to generalize learning across multiple sites simultaneously, and to transfer learning to novel sites on which the methods were not trained. Six teams, representing 4 countries and 9 individual participants, submitted predictions. Methods from a previous competition were also applied and used as a baseline to understand whether methods are changing and improving over time. The best delineation method was based on an instance segmentation pipeline, closely followed by a Faster R-CNN pipeline, both of which outperformed the baseline (a region growing algorithm), although the baseline still performed well. All delineation methods generalized well and transferred effectively to novel forests. The best species classification method was based on a two-stage fully connected neural network, which significantly outperformed the baseline (a random forest and gradient boosting ensemble). The classification methods generalized well, with all teams training their models using multiple sites simultaneously, but the predictions from these trained models generally failed to transfer effectively to a novel site. Classification performance was strongly influenced by the number of field-based species IDs available for training the models, with most methods predicting common species well at the training sites.
Classification errors (i.e., species misidentification) were most common between similar species in the same genus and between different species that occur in the same habitat. The best methods handled class imbalance well and learned unique spectral features even with limited data. Most methods performed better than the baseline in detecting new (untrained) species, especially at the site with no training data. Our experience further shows that data science competitions are useful for comparing different methods through the use of a standardized dataset and set of evaluation criteria, which highlights promising approaches and common challenges, and therefore advances the ecological and remote sensing fields as a whole.


Data science competitions are a unique way to advance image processing methods for particular researchers. The 2017 event was the first data science competition using NEON data, and it was instrumental in advancing methods and in establishing a framework for providing data and evaluating submissions. The competition identified the most effective methods for delineating individual tree crowns.

For classification, the task is challenging due to highly imbalanced multi-species datasets combined with differences in the species present at different sites. Finally, differences in the remote sensing data, due to variability in brightness and shadows caused by differences in sun position and to phenological differences in the vegetation related to the time of year, affect the ability of methods to generalize and transfer patterns. While the tasks of delineation and classification remain critical to the needs of ecologists, an expansion in the number of sites and the diversity of data is required to effectively achieve these tasks.

To address these needs we ran a new iteration of the IDTReeS competition focused on the tasks of delineation and classification, using a new set of data that allowed for within-site and cross-site evaluation. This iteration of the competition used data from three NEON sites in the southeastern United States to compare 1) how well methods generalize by incorporating information from different sites, and 2) how well methods transfer to make predictions at sites for which the algorithms have not been trained. Here we present the results of the competition, including delineation and classification scores from participating teams, details about the methods used, and a discussion of how this competition advances our ability to delineate and classify individual trees using existing inventory and remote sensing data. More details of the approaches taken by one of the teams are included in a companion paper (Scholl et al., 2021).

Study sites

We used multiple NEON data products from three sites in combination with data collected by members of our team. The three NEON sites are located in the southeastern United States (Fig. 1).

The field data providing species labels in the competition training data were collected through the NEON Terrestrial Observation System (TOS). The data contain information on individual tree identifiers, the location of trees relative to sampling locations (i.e., distance and azimuth from a central location), species and genus labels, and measures of salient structural attributes. The field attribute directly used in this competition is the taxonomic species information, which is described by a scientific name comprising a genus and species classification. To simplify this information, each scientific name is reduced to its unique taxonomic identification code. More information about the data products, the field data, and the list of species classes and taxonomic codes is provided in Appendix A.

Individual tree crown data

Individual tree crown (ITC) data are spatial delineations that represent the extent of an individual tree in remote sensing data and are fundamental to both tasks in the competition. For the delineation task, participants were given ITCs in the training dataset and generated ITCs for the test dataset (Fig. 2). For the classification task, participants were given ITCs labeled with the taxonomic species and generated species predictions for unclassified ITCs. ITC delineation data are not a standard NEON product and were generated by the IDTReeS research team. Each ITC is represented by a 2-dimensional rectangular bounding box that geographically defines an individual crown. The delineation represents the maximum crown boundary or extent.

Six teams participated (Appendix B, Table B1). Three teams completed the delineation task and four teams completed the classification task; of the six teams, two completed both tasks.

All teams were allowed up to 4 submissions per task. Submissions made prior to the final submission were evaluated and scores were returned. These pre-submissions ensured that submissions were properly formatted and provided teams with feedback on model performance.

The final submission deadline was extended by 2 months after the train and test data were released. This was done to allow teams more time to work with the data given the challenges.

Delineation task

The delineation task required detecting ITCs and defining their boundaries using remote sensing data. Participants were given remote sensing data and submitted spatial vector data that defined the location and size of all predicted ITCs. The data were split into training and testing datasets (Fig. 2). Training data were provided for the development and self-evaluation of methods. Participants could use any remote sensing and field data for training their methods. No TALL data were provided in the training data. The testing data were separate plots with associated remote sensing data and ITC delineations. Only the remote sensing data for the 153 test plots were provided to the participants. The ITC delineations for the test plots, which were labelled in the field, were withheld from the teams. Participants submitted delineated ITCs for all crowns in each plot. A total of 262 ground truth ITCs across all three sites (a subset of 1-9 trees per plot) were used by the competition organizers to score the team submissions.

We evaluated each possible pair of predicted and ground-truth ITC delineations using two metrics: the Intersection over Union (IoU) and a modification of the Rand index we refer to as RandCrowns (Stewart et al., 2021). The IoU metric, also known as the Jaccard Index (Jaccard, 1908), is defined as the area of overlap between the ground-truth and predicted ITC delineations divided by the area of their union. The RandCrowns measure is designed to provide robust detection scores despite spatial uncertainty in ground truth and label data. RandCrowns modifies the commonly used Rand index using a set of "halos" around each ground truth box to account for variability in crown size and uncertainty in the ground-truth ITC delineations. Scores for both IoU and RandCrowns range from 0 (no overlap) to 1 (perfect overlap), but in general, RandCrowns scores are higher than IoU scores.
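For two rectangular bounding boxes, the IoU can be computed directly from corner coordinates. The following is a minimal sketch (the function name and box layout are illustrative, not the competition's evaluation code):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (xmin, ymin, xmax, ymax)."""
    # Width and height of the intersection rectangle (0 if boxes do not overlap)
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, two 2 m x 2 m crowns offset by 1 m in each direction share a 1 m² overlap, giving IoU = 1/7.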

Since the test data included only a subset of ITC delineations for the trees in each plot, while participants submitted all possible ITC delineations, the first step was assigning a single predicted ITC delineation to each ground-truth ITC delineation in the test data. This step was important because participant submissions could include many delineations that overlap a test ITC. In cases of ambiguity caused by multiple predictions corresponding to a single test ITC delineation, we used the one-to-one mapping that provides the best RandCrowns score for the submission, computed using the Hungarian assignment algorithm: each possible pair of predicted and ground-truth delineations was scored with the RandCrowns measure, and the highest-scoring prediction for each ground-truth delineation was used for evaluation. The IoU measure was computed on each assigned pair of submitted and ground-truth ITCs. For each metric, the mean over all assigned pairs was calculated to produce the final evaluation scores for participants.
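The optimal one-to-one mapping is typically computed with the Hungarian algorithm (e.g., scipy.optimize.linear_sum_assignment). To illustrate why a greedy match is not enough, here is a brute-force exact version for small score matrices (the function name and layout are hypothetical):

```python
from itertools import permutations

def best_one_to_one(scores):
    """Exhaustive one-to-one assignment of predictions (rows) to ground-truth
    ITCs (columns) maximizing the total score, e.g., RandCrowns values.
    Assumes at least as many predictions as ground truths. For realistic
    sizes, use the Hungarian algorithm rather than enumerating permutations."""
    n_pred, n_gt = len(scores), len(scores[0])
    best_total, best_pairs = float("-inf"), None
    for perm in permutations(range(n_pred), n_gt):
        total = sum(scores[p][g] for g, p in enumerate(perm))
        if total > best_total:
            best_total = total
            best_pairs = [(p, g) for g, p in enumerate(perm)]
    return best_pairs
```

With scores [[0.9, 0.85], [0.8, 0.1]], greedily matching prediction 0 to ground truth 0 (0.9) forces the poor pair (1, 1) at 0.1; the optimal assignment is [(1, 0), (0, 1)] with total 1.65.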

Delineation algorithms

Four unique approaches were used for the delineation task (Appendix B). The baseline delineation method was the approach that performed best in the first iteration of the competition on a single site (Dalponte, Frizzera & Gianelle, 2019) and is available in the itcSegment R package (Dalponte, 2018). It is a region growing method that identifies seed points in a single HSI band (at ~810 nm) using a moving window to find the high reflectance associated with the tallest parts of a tree crown, and then grows the ITC based on reflectance differences of neighboring pixels. Parameters from the first iteration were used rather than optimizing parameters on the training data.
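The core of a region growing delineator can be sketched in a few lines: start from a seed pixel (in the baseline, a local reflectance maximum found with a moving window) and add neighboring pixels while their reflectance stays close to the seed's. This is a toy sketch, not the itcSegment implementation; the tolerance and 4-neighborhood are arbitrary choices for illustration:

```python
from collections import deque

def region_grow(band, seed, tol=0.15):
    """Grow a crown region from `seed` (row, col) in a single reflectance
    band (list of lists of floats): 4-connected neighbors join the region
    while their reflectance is within `tol` of the seed value."""
    rows, cols = len(band), len(band[0])
    seed_val = band[seed[0]][seed[1]]
    region, queue = {seed}, deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in region
                    and abs(band[nr][nc] - seed_val) <= tol):
                region.add((nr, nc))
                queue.append((nr, nc))
    return region
```

In practice the stopping rule and window size are the parameters that were carried over from the first competition iteration rather than re-tuned.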

The Fujitsu Satellite team used an instance segmentation pipeline. First, the spatial resolution of the HSI data was increased using the RGB images, and bands were iteratively selected based on the best performance. Second, a two-branch backbone neural network was used.

Another approach delineated crowns from the LiDAR point cloud data (Tusa et al., 2020). Critical to this approach was the use of a probability density function (PDF) to identify clusters of points that define an ITC. The PDF was based on a superellipsoid kernel whose crown shape and kernel size parameters were derived from allometric equations using tree height, crown radius, and crown depth.

Classification task

The goal of the classification task was to classify ITCs to their taxonomic species. Remote sensing-based species identification over large spatial scales would have major benefits for ecology, but this task is challenging due to low inter-species variance in spectral properties and the highly imbalanced nature of these types of datasets (Graves et al., 2016).

As in the delineation task, the data were split into training and testing datasets, where the training data allowed for the development and self-evaluation of models and the testing data were used to evaluate the team methods. The characteristics of the training data for the classification task were similar to those for the delineation task. Training data included all remote sensing data products (clips of 20 x 20 meters around each plot) for 85 OSBS and MLBS plots, and ITC delineations with taxonomic species labels for 1057 ITCs (Fig. 2). Participants could use any of the remote sensing and field data for training their models, since this represents a common scenario in which models are developed using data from inventory plots. No TALL data were provided in the training data.

The testing data were 353 separate plots with associated remote sensing data, 585 ITC delineations, and ground truth species labels (withheld from the teams) at the OSBS, MLBS, and TALL sites. Providing the ITC delineations kept this task focused on classification methods rather than having participants also incorporate delineation approaches before or after classification. Participants submitted taxonomic species predictions for the ITCs in the test data.

The predictions were submitted as a probability from 0 to 100% that an ITC belonged to the associated species class. The ground truth species labels for each ITC were used by the competition organizers to score the submissions.

Significant features of this dataset, and of forest remote sensing data in general, are class imbalance in the training data and a difference in species composition and relative abundances between the training and test data. Given these characteristics, the ability to train on imbalanced data and predict species whose identities and abundances differ between the training and testing datasets is an important challenge addressed in this competition. The training dataset for the OSBS and MLBS sites had a total of 33 distinct species classes, ranging from 1 to 302 individuals per species class (Fig. 3). This distribution represents the composition and relative abundance of canopy trees in the NEON plots, and therefore the data available from forest inventory plots that are used to develop and test classification models. The test data for OSBS and MLBS both show unequal distributions of data among species classes. The test data for both sites include only 15 of the species in the training data, and both sites include species in the test data that are not part of the training data (OSBS: 11 species, MLBS: 5 species). Furthermore, while the test data for TALL have less imbalance across species classes than the training data at OSBS and MLBS, they include only 10 of the species from the training data and introduce 11 new species that are not part of the training data (shown as "Other" in Fig. 3). In this way, the external TALL site tests not only the ability of the models to be applied to new remote sensing data, but also to a new site with a different species composition.
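A common first response to this kind of imbalance is to weight training losses inversely to class frequency. A small illustrative helper, not a technique attributed to any specific team:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class weights proportional to 1/frequency, normalized so that a
    perfectly balanced dataset would give every class a weight of 1.0."""
    counts = Counter(labels)
    total, n_classes = sum(counts.values()), len(counts)
    return {cls: total / (n_classes * n) for cls, n in counts.items()}
```

For example, with 3 PIPA2 crowns and 1 QUAL crown, the rare class receives weight 2.0 and the common one 2/3, so misclassifying the rare species costs more during training.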

To account for species classes in the test dataset that are not present in the training data, participants were also allowed to include a species class with the label "Other". The Other class can be used to indicate the probability that an ITC belongs to a species not represented in the training data and is therefore likely a new species.

Evaluation of the classification task

The primary metric for evaluating submissions was the macro F1 score. This score is the arithmetic mean over species classes of the per-class F1 score (the harmonic mean of precision and recall) and is given by equation (1), where C is the set of species classes, P_c is the precision of species class c, R_c is the recall of species class c, and |C| is the cardinality of set C:

Macro F1 = (1 / |C|) Σ_{c ∈ C} 2 P_c R_c / (P_c + R_c)    (1)

We evaluated participant models on two additional metrics, log-loss and weighted F1 score, to better understand their performance. While these metrics were not used to select the strongest submission, they provide another measure of model robustness and allow for a more in-depth comparison between models. Cross-entropy loss, also known as log loss, is a performance metric for probabilistic predictions.

Finally, full confusion matrices were calculated for each submission to allow for further analysis, discussion, and comparison, particularly to identify classes that are commonly confused across methods (e.g., species within a genus).

were not trained (Fig. 4).
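Macro F1 can be computed from label lists in a few lines; because each class contributes equally to the mean, rare species affect the score as much as common ones. A sketch that matches sklearn's f1_score(..., average='macro') in the common case:

```python
def macro_f1(y_true, y_pred):
    """Macro F1: unweighted mean of per-class F1 scores over all classes
    appearing in either the true or predicted labels."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```

A rare species predicted entirely wrong contributes a per-class F1 of 0, dragging the macro average down far more than it would a weighted F1.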

Across all methods, delineation scores varied similarly as a function of crown size. IoU scores of individual delineations tended to be lowest for the smallest crowns, peak at crown sizes from 50 to 100 square meters, and then decline with increasing size. This pattern was especially apparent in the baseline delineations, with a peak IoU score for crowns of approximately 75 square meters and a steady decline with increasing crown size (Fig. 5).

The multi-sensor neural network approach differed from the other methods in the delineations it produced. This is also likely to increase the scores of this method relative to other approaches.

Performance on sites without training data

The Fujitsu Satellite team's two-stage fully connected neural network approach had the strongest performance in the classification task for two evaluation metrics (Macro F1 = 0.32 and Weighted F1 = 0.53) and the second-best performance on the cross-entropy metric (Log-Loss = 3.6, Table 2).

The rank of models from best to worst performance was consistent for both Macro F1 (a measure

Results by species

Model performance varied widely for predictions of individual species classes, with a general pattern of better performance for the most common species and poorer performance for the least common species (Fig. 7). Because of the high number of taxonID classes, we focus on a subset of classes that highlight important patterns seen across teams and how predictions compare for untrained sites.

In addition, the Fujitsu Satellite team had 100% recall for 2 less abundant species, QUAL and TSCA, though this came at the expense of lower precision, as the algorithm overpredicted the prevalence of QUAL and TSCA (precision data not shown, but see Appendix B for team confusion matrices).

In addition to comparing classification methods, the aggregated confusion matrix (Fig. 8) shows patterns of misclassification, which is important for understanding the data and how to improve models in the future. For example, inaccuracy in PIPA2 predictions was mostly due to confusion with a taxonomically and structurally similar species or with a species that co-occurs in the same habitat.

This approach shows promise if coupled with a method to select the best delineation from the sets of overlapping delineations for each tree. An example is recent work on the DeepForest method, which generates multiple delineations and then uses non-max suppression to output a single final delineation for each tree (Weinstein et al., 2020a). A limitation of operationalizing the instance segmentation pipeline approach is its reliance on three remote sensing data types (HSI, RGB, and CHM).

While these datasets are available for the NEON sites, this is not typical of many areas. The second-highest score was earned by the Intellisense CAU team, which used only RGB data with a Faster-RCNN detector approach. While the RCNN approach is complex, requiring only RGB data is encouraging because RGB data are much more widely available than LiDAR and hyperspectral data, potentially allowing RGB-based methods to be applied more broadly.

, 2002). Second, the kernel shape implemented in the algorithm is a superellipsoid, whose profile is more suitable for delineating conifers than hardwood trees. Although proper parameter setting can improve tree delineation, prior information concerning the spatial distribution of the tree species is critical to adapting the kernel profile shape and size.

The results from this competition, as well as recent methods comparisons (Aubry-Kientz et al.,

One lesson learned from this competition is the importance of selecting an evaluation metric that accurately reflects the strength of delineations and produces valuable outputs for ecological studies. For each ground ITC, our evaluation pipeline selected the best of all predicted delineations and used this best delineation to calculate two evaluation metrics (IoU and RandCrowns). As a result, our evaluation favored methods with multiple candidate predictions for each crown, since there was no penalty for multiple predicted delineations per crown.

This evaluation pipeline may have increased the apparent strength of the winning method. While we do believe that the instance segmentation pipeline by the Fujitsu Satellite team is promising, especially coupled with non-max suppression, the results illustrate that evaluation metrics should be chosen carefully to focus on outputs relevant to the application, in this case mapping canopy trees for ecological analysis. Standard metrics such as IoU can be used, but evaluation pipelines should penalize excessive overlapping delineations, for example by scaling the IoU or RandCrowns by the number of candidate detections, selecting the worst rather than the best intersecting detection, or penalizing the score for each "extra" detection. This may be advantageous when evaluating applications where detecting the approximate location and size is more important than accurate delineation for measuring crown size parameters. In addition to metrics that assess the delineation accuracy of individual crowns, metrics such as detection rate that assess the accuracy of the number of ITC delineations at the plot level could be used to reward methods that produce the correct number of delineations over larger areas, as well as to assess metrics of interest to ecologists and foresters (Yin & Wang, 2016).
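The non-max suppression step mentioned above can be sketched as a greedy loop: keep the highest-scoring box, discard any remaining candidate that overlaps it too much, and repeat. A minimal version (the threshold and box layout are illustrative, not DeepForest's actual code):

```python
def non_max_suppression(boxes, scores, iou_thresh=0.5):
    """Greedy NMS over boxes given as (xmin, ymin, xmax, ymax).
    Returns indices of the boxes kept, in descending score order."""
    def iou(a, b):
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop remaining candidates that overlap the kept box too strongly
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```

Applying such a step before evaluation would remove the advantage that methods gained from submitting many overlapping candidates per crown.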

A challenge worth noting is that the training and testing data were generated in two different ways. The training data were generated by the IDTReeS research team using multiple remote sensing images and NEON field inventory data as support. Every apparent individual tree crown in the plots was delineated, even those that were uncertain due to their small size or adjacency to other crowns. This introduced some uncertainty into the training data because the delineations were not generated or validated in the field. The test data, in contrast, were crown delineations generated in the field, and only crowns that were clearly identifiable in the remote sensing imagery were delineated. Therefore, the test data had greater certainty and did not include very small crowns or crowns that were not apparent in the imagery. In this way, the performance of the methods in this competition may be inflated, because operational tree crown delineation requires delineating all crowns, even small crowns or crowns that are not fully in the canopy.

Classification

Most remote sensing approaches to tree species classification focus on a single site where all species classes are known. In this iteration of the competition, we asked participants to grapple with more challenging classification tasks, specifically building models using training data from multiple sites and applying those models to a novel location. Our results show the promise of all methods to generalize by learning from data from multiple sites, but all methods had a limited ability to produce strong predictions for the untrained site. Two classification methods using neural networks stood out due to their ability to predict common classes and learn unique spectral features of uncommon species classes. The first-ranked team based on F1 scores (Fujitsu Satellite) used a two-stage fully-connected neural network approach, and the second-ranked team (Intellisense CAU) used a 1-D convolutional neural network. These methods also performed well based on the Log-Loss score, but the first-ranked team based on

The classification task for this competition was challenging due to limitations and complexities of the data; complexities that reflect the characteristics of data in real-world applications for which robust methods are needed. One common challenge for ecological applications is that the amount of field data for training and testing is often smaller than the amount needed to train and robustly evaluate algorithms. We believe the most accurate data for training and evaluating

identified as belonging to one species may in fact belong to another, presenting challenges to classification models. Therefore, an extra challenge of this competition is learning from data with uncertain labels. We emphasize that this is an inherent challenge in ecological studies, since high-quality data, such as the field-delineated ITCs, will always be limited, and therefore methods are needed that account for this source of potential uncertainty. Future efforts should be made to support improved alignment between field and remote sensing datasets. In generating datasets, field data could indicate whether a tree has a position in the canopy and is therefore viewable

by Fujitsu), and that scores were similarly low for both the trained and untrained sites (Fig. 6).
A low Log-Loss score means that a method was confident in its correct predictions and unconfident in its incorrect predictions. Methods that score low in Log-Loss could be most

delineation methods, especially for small and large crowns, but methods generally worked equally well regardless of the sites on which they were trained and then applied. Participants also developed a wide range of approaches to classifying tree crowns to their taxonomic species, many of which were significantly better at cross-site prediction than the best method from the previous competition.

Most methods can predict common classes well, even across sites, but more work is needed on methods that can handle imbalanced data, learn from rare species (i.e., those with lower relative abundances), and remain robust to new species when applied to an untrained site. This competition has highlighted the value of comparing methods on a standardized dataset, implementing approaches from a broad range of expertise, and highlighting areas where we can continue to improve.