Synthesizing Disparate LiDAR and Satellite Datasets through Deep Learning to Generate Wall-to-Wall Regional Forest Inventories

Light detection and ranging (LiDAR) has become a commonly-used tool for generating remotely-sensed forest inventories. However, LiDAR-derived forest inventories have remained uncommon at a regional scale due to varying parameters between LiDAR datasets, such as pulse density. Here we develop a regional model using a three-dimensional convolutional neural network (CNN), a form of deep learning capable of scanning a LiDAR point cloud as well as coincident satellite data, identifying features useful for predicting forest attributes, and then making a series of predictions. We compare this to the standard modeling approach for making forest predictions from LiDAR data, and find that the CNN outperformed the standard approach by a large margin in many cases. We then apply our model to publicly available data over New England, generating maps of fourteen forest attributes at a 10 m resolution over 85 % of the region. Our estimates of attributes that quantified tree size were most successful. In assessing aboveground biomass for example, we achieved a root mean square error of 36 Mg/ha (44 %). Our county-level mapped estimates of biomass were in good agreement with federal estimates. Estimates of attributes quantifying stem density and percent conifer were moderately successful, with a tendency to underestimate of extreme values and banding in low density LiDAR acquisitions. Estimate of attributes quantifying detailed species groupings were less successful. Ultimately we believe these maps will be useful to forest managers, wildlife ecologists, and climate modelers in the region.


Overview
Over the past two decades light detection and ranging (LiDAR) has become a common tool for developing spatially-explicit forest inventories (Naesset 1997). Measurements of point cloud datasets derived from LiDAR can be used to predict useful forest inventory attributes such as biomass, stem volume, tree count, and species (Means et al. 2000, Jensen et al. 2006, Lim and Treitz 2004. These inventories are useful for a wide range of applications, including assessing carbon stocks (Patenaude et al. 2004), assisting in precision forestry (Woods et al. 2011), and predicting wildlife habitat (Wulder et al. 2008, García-Feced et al. 2011. Forest inventories are typically developed using the area-based approach (White et al. 2013), where the forest is segmented into a series of grid cell-areas, ranging from 10 m to 1 ha in size. First, the LiDAR point cloud and the desired forest attribute (e.g. stem density) within each grid cell are each measured. Next, predictive models are then developed relating the field measurements to the LiDAR measurements. Finally, these models can then be applied to every grid cell across a landscape. The resulting maps are referred to as enhanced forest inventories (EFIs).
While LiDAR data are becoming increasingly available to the public, few studies have emphasized mapping whole regions, focusing instead on specific municipalities or individual parcels. One example of regional LiDAR modeling occured in Sweden, which recently completed a project to develop nation-wide forest inventory maps at a 12.5 m resolution (Nilsson et al. 2017).
Similar maps have also been generated in Finland (Kangas et al. 2018). In Canada, large portions of Alberta's forests have had their vegetative functional groups mapped (Guo et al. 2017), and in New Brunswick, a province-wide effort has resulted in near wall-to-wall LiDAR inventories (Dick 2019). We note that each of these examples make use of largely homogenous LiDAR datasets.

The current approach
One common difficulty in generating regional LiDAR inventories is that many regions are comprised of a patchwork of LiDAR datasets acquired with various specifications, and with forest analytics as a secondary objective. This is particularly problematic because the standard approach for measuring a LiDAR point cloud for the development of an EFI model is to take a series of summary statistics describing point heights and vertical distributions. These include measures of the mean, variance, and vertical quantiles, as well as proportions of points that fall above certain height thresholds (McGaughey 2009, Silva et al. 2017. Unfortunately these traditional features suffer from several drawbacks, such as (1) a high degree of autocorrelation, (2) a propensity to change based on LiDAR sampling design (e.g. pulse density), and (3) a propensity to change based on phenology.
Several software suites exist for extracting height features from LiDAR, each producing upwards of fifty features including the heights of every 10 th percentile (Silvia et al. 2015). While powerful predictors, many of these measurements are also highly correlated, and so there is a risk of model overfitting without careful model selection (Junttila et al. 2015, Naesset et al. 2005. Unfortunately, many studies make use of every predictor and report on those most important, despite most modelling techniques being unreliable for ranking autocorrelated features (,).
A standard measure for assessing LiDAR quality is pulse density, which refers to the number of laser pulses landing within a given area (pls/m 2 ). Pulse density can vary both between and within LiDAR acquisitions, and frequently regional LiDAR collections consist of many acquisitions in which pulse density varies by up to an order of magnitude. Many studies have found that varying pulse density can adversely affect EFI predictions across different LiDAR data sets.
For example, in Gobakken and Naesset (2008) noted that the area-based approach was strongly effected by pulse density. Hansen et al. (2015) determined that EFI estimates could be subject to biases if predictions were made on point clouds of different densities than those used to train the model. Other differences in acquisition parameters related to LiDAR sampling designsuch as sensor type, pulse frequency, and flight altitudecan also result in different height features being generated over the same area of forest (Naesset 2009, Goodwin et al. 2006. Seasonality also has a major impact on LiDAR height features (Naesset 2005, White et al. 2015, Villikka et al. 2012).
In deciduous forests, the presence of leaves can result in LiDAR beams being intercepted higher in the canopy (Ørka et al. 2010). Generally, models developed using one LiDAR point cloud are often not applicable to another, prohibiting regional LiDAR modelling without unusually consistent LiDAR datasets (White et al. 2013).

Deep learning
Here we use deep learning to overcome these obstacles by developing a single model for predicting forest attributes that is applicable across many disparate LiDAR and satellite datasets.
Deep learning is a form of machine learning, and primarily refers to artificial neural networks of a sufficient complexity so as to interpret raw data, without a need for human-derived explanatory variables. These differ from simpler neural networks (such as perceptrons) which make estimates using a set of features derived from the data (e.g. height percentiles). Recently, deep learning has proven itself successful at classifying imagery despite varying contextual information, such as light levels and background subject matter (Krizhevski et al. 2012, LeCun et al. 2015. We posit that deep learning will improve EFI modeling by identifying useful spatial features in the LiDAR point cloud without the need for human-derived explanatory variables such as height metrics. These features can be complex shapes and gradients in 3-D space that are less subject to change relative to one another with different acquisition parameters, such as the edges of tree crowns (Ayrey and Hayes 2018).
We implemented a type of spatial deep learning model called a convolutional neural network (CNN). A CNN works by passing a series of moving kernels over spatial data. As the model trains, the weights of those kernels are tuned to identify features useful for predicting the dependent variables (such as the edges of objects). Deep CNNs stack many of these moving windows atop one another, allowing the algorithm to quantify complex features. The threedimensional CNN developed here uses a volumetric window to quantify a LiDAR point cloud that has been binned into voxel-space. The 3D CNN is thus able to quantify vertical as well as horizontal features, and shapes such as tree crowns, providing a level of complexity not captured by height metrics.
Early CNNs were first developed in the late 1990s and were used to classify hand written digits (Lecun and Bengio 2005). The technique was largely underrepresented in data science until advances in computing power, technique, open-source tools popularized them in 2012 . Since then, CNNs of increasing complexity have consistently outperformed models based on feature extraction for computer vision tasks (Szegedy et al. 2015, Tiagman et al. 2014, He et al. 2016). More recently, CNNs have been becoming a popular option for remote sensing problems. There has been much success in using 2D CNNs to classify aerial imagery, hyperspectral, and LiDAR data (Rizaldy et al. 2018, Ghamisi et al. 2017, Castelluccio et al. 2015. In relation to forestry, some have used segmentation algorithms to isolate individual trees from LiDAR, then used 2D CNNs to classify tree species (Guan et al. 2015, Ko et al. 2018. Work is also being done using 2D CNNs to identify individual tree crowns from high resolution imagery (Weinstein et al. 2019, Li et al. 2016. Progress has also been made in adapting CNNs to scan LiDAR point clouds and three dimensional space. Similar CNNs that make use of voxels to quantify point clouds have been used to identify household objects and malignancies in medical scans (Qi et al. 2016, Maturana andScherer 2015). In 2017 Qi et al. introduced PointNet, designed to interpret LiDAR data without voxels, although this technique does not make full use of the spatial relationship between neighboring points. In a remote sensing context, Maturana and Scherer (2015) used a 3D CNN to identify helicopter landing zones in LiDAR. Most similar to this study, Ayrey and Hayes (2018) tested a variety of CNN architectures to interpret LiDAR for the estimation of forest attributes.

Objectives
The primary objective of this study is to develop a regional EFI over the Acadian/New England Forest region, with a total of 85 % coverage of the New England states using publicallyavailable LiDAR. We overcome several challenges faced by having disparate LiDAR datasets through the use of deep learning, and we incorporate other remote sensing products such as spectral data, phenology, and disturbance history to improve model accuracy. A secondary objective is to assess the value of deep learning, and compare it to standard approaches for LiDAR modelling. A final objective is to compare our mapped estimates of biomass, percent conifer, and tree count to estimates derived via the design-based U.S. national forest inventory program. We will compare estimates made at the plot-level and at the county-level using both inventory techniques, and identify any obvious shortcomings. The end result will be a series of near wall-to-wall mapped forest inventory estimates of the region, with an accurate assessment of error across space and forest type. This will provide forest managers, ecologists, and climate modelers in the region with an unprecedented amount of detailed information about the forest.

Forest attributes
Our goal was to estimate many forest attributes which may be useful to ecologists, forest managers, and modelers. The complete list is found in Table 1. For brevity, at points throughout this manuscript we will highlight only the results of the aboveground biomass (BIOAG), percent conifer (PC), and tree count (TC) estimations. All other attributes can also be considered measurements of tree size, density, or species; and are often represented by these three.

Training data
For most applications deep learning requires very large datasets (Russakovsky et al. 2015).
To meet this requirement we combined thirteen different forest inventories collected over thirty two sites (Appendix Table A1). Within each inventory all trees greater than 10 cm in diameter were measured spatially with species and diameter recorded. In several inventories, tree heights were only measured on only a subset of trees, so species-specific non-linear height-diameter models were generated using a Chapman-Richards model form to impute tree height using the existing height measurements at each site. Some inventories were also measured up to ten years prior to LiDAR acquisition. In instances in which the temporal discrepancy between LiDAR and field data exceeded two years, trees were projected forward in time using the Forest Vegetation Simulator's Acadian Variant ).
Volume was estimated using regional taper equations (Li et al. 2012). A 10 cm cutoff was used for merchantable volume. Biomass was estimated using the component ratio method developed for the US Forest Inventory and Analysis (FIA) program (Woodall et al. 2011).
Each of the stem-mapped inventories were aligned visually with the LiDAR to correct for GPS error in plot location, and then segmented into 10x10 m grid cell plots. We selected this resolution in order to amplify the amount of unique plots available, while retaining plots large enough to contain several whole tree crowns. While they provide highly spatially-explicit models, this plot size is small compared to those used in other studies, and so can be prone to edge effects.
We accounted for edge effects by using regional diameter-to-crown width equations to project each tree's crown in space (Russell and Weiskittel 2011). Tree level basal area, biomass, and volume allometry were then multiplied by the proportion that each tree's crown overlapped the plot. Trees were therefore treated as areas containing biomass, rather than points which could lie one or another side of a plot boundary (,). This mimics the method by which the remote sensing instrument measures the trees, as LiDAR or imagery has no means of measuring the precise location of a tree's stem. Figure 1 demonstrates this correction. Preliminary testing using a subset of the data indicated that this correction greatly improved model performance. Figure 1. The percent of crown overlap of each tree in and around the plot was used as a modifier on that tree's basal area, biomass, and volume. This allowed for the development of models which more closely reflected what was visible to the remote sensing systems, while remaining unbiased across multiple cells.
We also augmented the sample size of our training data by allowing plots to overlap one another by a maximum of 25 %, and by including multiple LiDAR acquisitions of the same plot in the training dataset. The precise configuration of LiDAR returns will always vary between acquisitions. Similar augmentation techniques, such as transforming or rotating input images multiple times, and using adjacent still frames of videos have been successfully used in deep learning for many years (). Lastly, we included 500 plots with no trees on them to allow for better predictions in low-vegetation environments, which were sampled randomly across Northern New England using the 2011 National Land Cover Database (). The associated LiDAR point clouds were then manually checked for trees and discarded if trees were found.
Ultimately we assembled 24,606 plots for model training and preliminary evaluation. Of these, we randomly withheld 4,000 plots, including 1,000 for model validation (to determine the optimal stopping point during deep learning training), and 3,000 for model testing (used for model comparison and selection). Augmented plots related to withheld plots were removed from the training dataset, thus the final training set was comprised of 17,432 plots.

LiDAR
We aggregated 49 public LiDAR datasets from across the region, combining acquisitions of varying pulse density and seasonality. Although much of the pubic LiDAR in the US Northeast is flown leaf-off, we chose to develop models capable of functioning in either state to allow for potential future integration with leaf-on Canadian Maritime data. The training data ultimately consisted of a 53 to 47 % split between leaf-off and leaf-on respectively.
The majority of the LiDAR used for this study was funded and hosted by the US Geological Survey's national 3D Elevation Program (3DEP). These data were captured in leaf-off conditions between 2006 and 2018 at resolutions ranging from 0.5 to 10 pls/m 2 . We also incorporated LiDAR data acquired by NASA Goddard's LiDAR Hyperspectral and Thermal Imager (G-LIHT) as well as the National Ecological Observatory Network (NEON). These data were acquired in leaf-on conditions over several of our training sites with pulse densities ranging from 5 to 16 pls/m 2 .
Finally, we incorporated several private LiDAR datasets for training, including one each over the

Penobscot Experimental Forest, Baxter State Park in Maine, and Noonan Research Forest in New
Brunswick. Each of these had an average pulse density of 5-6 pls/m 2 , where the first two were leaf-off and the second leaf-on. Both pulse density and seasonality were included as model predictors.

Satellite variables
We also chose to include satellite derived spectral indices, disturbance metrics, and phenology data in our models. Each of these are spatially contiguous across our study area, and have proven useful for predicting forest attributes (Zheng et al. 2004, Pflugmacher et al. 2012, Clerici et al. 2012). All satellite data processing was done in Google Earth Engine (Gorelick et al. 2017 Crist 1985). This imagery was acquired between 2015 and 2017, imagery from between the 150 th and 244 th Julian days were used. All images were cloud-masked, then a single median composite was developed for the study area.
Resolutions greater than 10 m were resampled to match our plot size.
We also incorporated disturbance history using Landsat 5-8 and the LandTrendr disturbance detection algorithm (Kennedy et al. 2010). LandTrendr works by fitting a maximum of seven linear segments to the yearly medians of a spectral band within each Landsat pixel.
Vertices identify dramatic changes in the spectral characteristics of that pixel in time, and often correspond to disturbances. We ran LandTrendr over the greenness and wetness tassel cap indices, and NBR. Instances in which LandTrendr identified a vertex in two or more of the three bands, within two years of one another were retained as disturbances. We then condensed these data into the year of last disturbance and the magnitude of that disturbance (as a percent of vegetation index change).
Finally, we estimated growing season length across our study area using Moderate Resolution Imaging Spectroradiometer (MODIS) data with a resolution of 500 m. This was derived by subtracting the mean dates of greenness onset and senescence, and has been demonstrated to correlate well with site quality, and so may aid models which primarily use height to infer tree size  Figure 2. Inception-V3 consists of a series of preliminary convolution and pooling layers, followed by inception layers. Inception layers consist of a number of convolutions of varying sizes which are passed over the incoming data, each designed to pick up on different features, which are then concatenated. Inception-V3 consists of nine inception layers back to back, with intermittent pooling to reduce dimensionality. The final inception layer is fed into a fully connected layer for a classification or regression prediction. Each convolution was followed by a ReLU activation layer and batch normalization.
The model was first trained to estimate only BIOAG using LiDAR. We used a process called transfer learning to initialize the weights of a more complex model using the weights from the simpler one, which was designed to simultaneously predict all 14 of our forest attributes. Each forest attribute was standardized using z-scores so that their values fell upon the same range. A single loss function was then used to optimize the model to predict all forest variables (Equation 1). In which, the mean of the squared error of the k standardized forest attributes is summarized to a batch level mean of n training samples.
We included the satellite data as side-channel information by first developing a multi-layer perceptron to estimate BIOAG directly from the satellite variables. We took the weights from this model and used them to initialize a subcomponent of the larger model, which produced a 40 x 40 tensor that was then concatenated onto the LiDAR voxels (Zhou et al. 2017).

Training
Deep learning models were developed in Python using Google's Tensorflow version 1.4 (Abadi et al. 2016). The model training process took place in three stages using transfer learning to build upon each stage (1) A BIOAG model using only LiDAR, (2) A model predicting all fourteen attributes using only LiDAR, and (3) a model predicting all fourteen forest attributes using LiDAR and satellite metrics. Training time for the first stage was approximately five days using an NVIDIA Tesla k80, the following stages were trained more rapidly. This lengthy multistep training process made cross-fold validation highly impractical.

Traditional modelling
Deep learning models were compared to models trained using the standard suite of LiDAR height metrics, derived using the Rlidar package (Silva et al. 2015). This produces a series of summary statistics of the LiDAR return's heights and proportions above certain height thresholds.
We discarded metrics which made use of LiDAR intensity and return counts, as these could not be normalized between the different LiDAR acquisitions. We filtered out points lower than 0.5 m above ground, and used a 2 m threshold for many of the proportional metrics. Other studies in the region have used similar cutoffs (Hayashi et al. 2014). We also included the aforementioned satellite-derived metrics, as well as pulse density, and seasonality. In total 41 covariates were derived from the LiDAR and satellite data.
Random forest imputation in regression mode was used to model each of the forest attributes (Breiman 2001 (Genuer et al. 2015). New models were then developed using 2000 decision trees and one-third variable selection at each node-split. These hyper-parameters were fine-tuned using a subset of the data.
The random forest models were trained and validated using the withheld test plots.
Although accuracy can be assessed using out-of-bag sampling, we used the same validation scheme as the deep learning models due to data augmentation and consistency.

Validation
The training, validation, and testing data derived from the thirteen individual forest inventories are likely not fully representative of the landscape, leading to problems with spatialautocorrelation at the regional scale. We therefore performed two phases of validation. The first phase of validation made use of the 4,000 withheld plots. This was used for model comparisons between deep learning and traditional modelling, and to settle on the final model form.
The second phase of validation made use of an independent dataset, and was used to assess true model performance across the landscape. For this we used the United States Forest Service's FIA national inventory plots. These consisted of approximately 7,500 stratified-random plots with a nested sampling design within the states of Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, and Vermont. We used these to determine the maps' error and bias, assess spatialautocorrelation across the landscape, and compare our inventory estimates to FIA county-level estimates. We assessed error in Connecticut and Rhode Island separately as their forests increasingly represent a Mid-Atlantic forest type not fully represented by the training data. We affected by errors in plot location, as these were more subject to intra-canopy variability. A preliminary finding that the center plots (on which the GPS point is taken) produced a lower error than the subplots reinforced this conclusion. The aggregate plot-level measurements were less prone to location errors, but did not necessarily represent the entire range of variability that one would expect in a 10 m grid cell. It is important for validation plots to have roughly the same area as the grid cells being validated so that each have a similar range in values. We therefore assessed accuracy at a subplot-and plot-level, but used the plot-level errors to perform additional analyses.
This is roughly equivalent to assessing error were the map resampled to a 20 m resolution.
We decided not to use FIA plots for model training for several reasons: (1)

Phase one validation and model comparison
The first validation phase made use of withheld plots to compare fourteen random forest models using height and satellite metrics to two Inception-V3 CNNs.

Phase two validation
In the second phase of validation each of the mapped forest attributes were validated across Northern New England using FIA plots. Table 2 displays the results of this validation in terms of RMSE, RMSE as a percent of mean (nRMSE), and bias at both the subplot-and plot-level. We did not assess random forest performance in this phase because we did not apply these models at a regional level; such an effort would have been computationally costly in the extreme and was unnecessary given our model selection process in the first phase. Model error at the subplot-level was considerably higher than error at a plot level with each forest attribute. This is to be expected given that smaller areas are more likely to contain extreme values, and the subplot values are more likely to be affected by GPS inaccuracy. The opposite trend could be seen in bias, with 10/14 attributes exhibiting greater bias at the plot-level.
Results of the second phase of validation indicated that two phases were in fact necessary to obtain a more representative assessment of regional model performance.   England. In contrast, tree count had a nRMSE of 56 %. Model performance was worst in volume estimates of species groups. Estimates of the deciduous volume had a relatively high nRMSE, 61 %. Estimates of spruce/fir, and pinus strobus volume both had nRMSEs above 150 %, and could generally be considered not useful. We did not assess nRMSE of the percent species group estimates. The RMSE and biases of percent spruce/fir and percent pinus strobus were lower than that of percent conifer, but this is likely because their average values are smaller. Qualitatively, the maps of percent conifer appeared better, with the others suffering from more banding and local biases.
We also assessed model performance of BIOAG, PC, and TC in Northern New England spatially and by plotting their predicted verses observed values. Figure 2 illustrates the plot-level residuals of each of the three forest attributes. We note that the BIOAG residuals appear to be fairly evenly distributed across the landscape, with consistent model biases not immediately apparent. Positive PC residuals appear to be clustered mostly in Eastern Maine, where greater number of conifers are likely to be found, indicating that the model is underestimating in areas of proportionally higher conifers. Likewise, negative residuals in Vermont are an indication that the model is underestimating in areas with proportionally fewer conifers. Tree count followed a similar trend, greater residuals were encountered in more northern areas, which correspond to both greater stem density and more industrial forests.
These trends can also be observed in the predicted versus observed plots (Figure 3).
Biomass residuals fall relatively tightly along the 1:1 line. Little to no attenuation is observed at higher biomass values. Percent conifer residuals seemed to indicate a tendency to overestimate in low conifer environments, and underestimate in high conifer environments. Finally, tree count residuals appeared to follow the 1:1 line in low-medium density conditions, but often severely angle across the landscape a-priori as we did pulse density.

County-level comparisons
With the region mapped, we compared country level estimates derived by summing the values of our map with FIA county-level design-based estimates. This data is summarized in Table   3. We chose the 38 counties in Northern New England with complete LiDAR coverage. Initially we used all FIA plots within a county measured within two years of the LiDAR acquisition. We discovered however, that a large number of FIA plots without measured trees on them were located in suburban environments with trees. This resulted in underestimates by the FIA data of each forest attribute, so plots that were denoted as having no trees that fell within semi-forested suburbs were removed after manual aerial photo interpretation.
Examining aboveground biomass, our predictions fell within the 95 % confidence interval of the FIA's in 31/38 counties, and within 97.5 % confidence interval in 33/38 counties. Across the landscape of these counties the FIA estimated 4 % more biomass then our maps. This is to be expected given that our maps frequently had gaps between LiDAR acquisitions and occasionally had missing LiDAR tiles. The FIA's lack of urban tree sampling likely also played a large role. Eight of the counties were classified by the US Census Bureau as having an urban population greater than 50 %. In these urbanized counties FIA estimates were an average of 13 % lower than our own when including empty plots in suburban forested areas, and 11 % greater than our own after they were removed.
In estimating percent conifer, 25/38 of our estimates fell within the 95 % confidence interval of the FIA's estimate. In agreement with the map of residuals, the percentage of conifers was significantly underestimated in 6/11 counties in Maine, and significantly overestimated in 4/8 counties in Vermont. In 16/19 counties in Massachusetts and New Hampshire, our estimate of percent conifer was within the 95 % confidence interval of the FIA's. Across the entire landscape the two estimates were within 0.2 % of one another.
County level tree count estimates fell within the FIA's 95 % confidence interval 30/38 times. Once again, counties in Maine with greater numbers of small trees were more likely to be underestimated. In 5/12 Maine counties, mapped estimates of tree count were significantly lower than the FIA's estimates. Outside of Maine 24/27 counties had mapped estimates in agreement with FIA estimates. This is likely owing to an overall greater stem density in Maine due to more intensive industrial forestry, and a general proclivity of boreal forests to be denser.

Discussion
Our results indicate that 3D convolutional neural networks can be used to estimate forest attributes from disparate LiDAR and satellite data. These models outperformed random forest models which use the standard approach for generating forest inventories from LiDAR. They could also be effectively scaled to make regional high resolution maps/estimates which were often statistically equivalent to traditional forest inventories.

Model comparison
Our first objective was to compare LiDAR derived inventory estimates made using CNNs to estimates made using height metrics and random forest. We assessed this in our first phase of validation, in which several models were developed from training data and assessed using withheld plots. Random forest models trained using traditional height metrics and satellite data nearly always had considerably greater error than the two CNNs we trained. This mirrors the results found by Ayrey and Hayes (2018), in which 3D CNNs of varying architectures often outperformed generalized linear models and random forest. These results indicate that deep learning can be a more effective way of modelling LiDAR than the standard approach of height metrics.
In the estimation of species (percent conifer and percent spruce/fir), random forest did outperform the CNNs. These species breakdowns were more likely to rely on satellite spectral data than LiDAR structural data, and so we speculate that random forest did a better job of making use of the satellite covariates than our CNN. The CNN was initially trained to scan LiDAR voxels, the addition of satellite covariates was made after this training, and in such a ways as to concatenate satellite data onto voxel space. This may have been less than ideal. Zhou and Hauser (2017) outlined several methods for including side-channel data into a CNN, and we tried each of them with mixed results before settling on our concatenation method.
The CNNs also did not outperform the random forest models in terms of bias. Half of the random forest models had a lower absolute bias than the CNNs, indicating that both types of models did approximately as well as one another. We did not observe any notable trends in biases by forest attribute. Overall, we believe that the differences in absolute bias between models was often low enough to have been owing to the random variations of the testing data.
The comparison between the CNNs with and without satellite metrics highlighted that the CNN did benefit from spectral and disturbance information. Error decreased when estimating every forest attribute, and bias nearly always decreased as well. In terms of magnitude this improvement was often small, and our results indicate that even without the satellite metrics, a CNN could be trained which outperforms random forest with height and satellite metrics. We chose not to explore which satellite metrics were most useful, as deep learning models of this size run no risk of overfitting with extraneous inputs. However, such an analysis would be possible through a process similar to random forest's derived importance, and might be useful in narrowing down necessary remote sensing datasets.
We noted that there were numerous advantages to working with a single deep learning model aside from better performance. Our Inception-V3 model took a considerable amount of data and time to train, however, once trained the model could be quickly be applied to large regional LiDAR datasets. A single model predicting all 14 forest attributes presented less of a datamanagement challenge than 14 separate models. We also hypothesize that our CNN would be less likely to produce estimates in conflict with one another than 14 separate unconstrained models (e.g. more merchantable volume than total volume). Finally, there is a considerable amount of precedence in the field of deep learning for making use of pre-trained model weights to solve novel problems (Pan andYang 2010, Shin et al. 2016,). The rapid retraining of our CNNs indicates that this model can easily be fine-tuned with local data, retuned with non-local data, or applied to different problems to save modelers the effort of training a large CNN to interpret voxel space with randomized weights. The weights from our CNN could be used to initialize CNNs with other LiDAR-related objectives, such as individual tree crown segmentation or LiDAR classification.

Assessing performance
With the model form decided upon, we mapped all fourteen forest attributes across the study area (Figure 4). We assessed the maps' performance with our second phase of validation, which made use of independent FIA plots. We assessed accuracy at a supplot-, plot-, and countylevel. Our subplot-level error estimates were consistently quite high. The FIA plot locations in this region are subject to considerable error, and the FIA notes that plot location error can be up to 100 m (although in practice most plots are located within 12 m of their measured location, Hoppus and Lister 2007). Examining pixel-level accuracy using these subplots was problematic given the high degree of intra-canopy variation present in 10 m pixels. Our observation that the center subplot (on which the GPS location was taken) resulted in lower map error is a further indication that the locational accuracy of the surrounding subplots is suspect.
Our assessment of plot-level error (the aggregate of the subplots) produced more favorable error results. We achieved nRMSE values of between 40-50 % for those attributes quantifying tree size, which we consider to be a success given the small size of the grid cells used. Estimates of tree count, percent conifer, and deciduous volume we consider to have been made with moderate success with nRMSE values of between 56-61 %. Estimates of pinus strobus and spruce/fir species breakdowns we consider to be a failure, with nRMSEs exceeding 100 %. Despite this, these maps may still be of some use to some practitioners when aggregated to a more coarse resolution and binned into categories. We assessed model performance at the county-level and in space using aboveground biomass, percent conifer, and tree count. Our map of biomass residuals across Northern New England appeared to be relatively uniform. We noted no major trends in residuals in space, indicating that we are representing biomass across the landscape well. Notably, we did not experience any saturation of high biomass areas as is often the case with regional remote sensing studies (Zhao et al. 2016 Our model's performance can be summarized as having done a good job at predicting attributes closely related to tree size, having done a moderate job of predicting attributes related to tree density and percent conifer, and having done a poor job of predicting attributes related to species groupings. Other studies modeling forest attribute using LiDAR have likewise had more difficulty in estimating stem density and species (Treitz et al. 2012, Hayashi et al. 2014, Shang et al. 2017, Hudak et al. 2008. We can think of several reasons for why our models have also underperformed here. (1) Although we incorporated satellite spectral indices useful for species estimation, we believe that the model may have not made full use of them.
(2) Stem density was often underestimated in high density stands, but qualitatively the maps seemed to suffer from banding in areas with low pulse density LiDAR (<3 pls/m 2 ). It is possible that the model was making use of horizontal structural features in the canopy which could not be resolved in low density LIDAR. Ayrey and Hayes (2018) determined that 3D CNNs do make use of horizontal canopy structure (such as the edges of tree crowns).
(3) Finally, in re-examining our loss function, we find that 7/14 of our forest attributes were in some way related to tree size, while only one attribute estimated stem density. It is possible that our unweighted loss function inadvertently favored attributes estimating tree size, resulting in a model that identified features in the LiDAR that were more predictive of size.

Mapping errors
Unfortunately, the regional maps of forest attributes suffered from several types of errors.
One source of these errors were the LiDAR acquisitions themselves, which often didn't entirely overlap, or were collected improperly. Missing areas can often be observed between the gaps of the nearly 40 different LiDAR acquisitions over the region. In central Connecticut and Eastern Maine portions of the LiDAR were acquired with improper settings, resulting in vegetation being severely under-represented.
Banding errors occurred with forest attributes that had moderate/worse performance (tree count, percent conifer, and species/volume estimates). These bands were more likely to occur in areas in which pulse density fell below 3 pls/m 2 , and followed scan angle trends. In environments of low pulse density and high scan angle, these attributes were often underestimated, possibly owing to less horizontal structure being captured by the LiDAR. Maps estimating tree size, such as BIOAG, VOL, and BA, had few banding errors.

Our results in context
Several other studies have mapped aboveground biomass in this region and can be used to place our model's performance in context. trees/ha, 3.68 cm, and 13 m 2 /ha for tree count, QMD, and basal area respectively using 20 m cells at an experimental forest in Maine (Hayashi et al. 2014). Our regional models achieved errors of 293 trees/ha, 4.5 cm, and 7.9 m 2 /ha for these attributes; outperforming the local models in estimating tree count and basal area. Another study at an experimental forest in Massachusetts used large footprint LiDAR and radar to estimate biomass, achieving a RMSE of 66.6 Mg/ha with 25 m cells (Ahmed et al. 2010). Taken collectively, our regional models seem to perform on par or better than local modeling efforts.

Conclusions
In this study we mapped the forests of New England at a 10 m resolution, making estimates of fourteen different forest inventory attributes. We did so through the use of disparate LiDAR datasets as well as satellite spectral, phenological, and disturbance data. Our method of modeling these attributes was somewhat novel, and made use of three dimensional convolutional neural networks which are a form of deep learning. These CNNs outperformed standard modeling techniques, and proved themselves useful for large-scale mapping, making use of disparate data, and increasing data management and computational efficiency.
We validated our maps using an external dataset derived from the USFS's forest inventory and analysis program. We concluded that the most successful estimates were those attributes that quantified tree size, moderately successful estimates were those that quantified tree density or percent coniferous, and less successful estimates were those that quantified more specific species groupings. In particular, we found our biomass estimates to be in good agreement with the FIA's across the region.
We believe that both the models and the maps generated by this study will be of use to many other scientists. The weights from the CNN trained here could be used as a starting point to train models making estimates over different forests, or to other LiDAR-related remote sensing problems. Likewise, the maps developed here can assist with wildlife mapping, precision forestry, and carbon stock estimation in the region.