A dataset of 5 million city trees: species clustering and climate effects in urban forests

Sustainable cities depend on urban forests. City trees improve our health, clean the air, store CO2, and cool local temperatures. Comparatively less is known about urban forests as ecosystems, particularly their spatial composition, nativity statuses, biodiversity, and tree health. Here, we assembled and standardized a new dataset of N=5,132,890 trees from 63 of the largest US cities with detailed information on location, health, nativity status, and species. We further designed new tools to analyze the ecosystem structure of urban forests, including spatial clustering and abundance of native trees, and validate these tools in comparison to past methods. We show that city trees are significantly clustered by species in 93% of cities, potentially increasing pest vulnerability (even in cities with biodiverse urban forests). Further, non-native species significantly homogenize urban forests across cities, while native trees comprise 0.44%-85.6% (median=45.6%) of city tree populations. Native trees are less frequent in drier cities, and indeed climate significantly shapes both nativity and biodiversity in urban forests. Parks are more biodiverse than urban settings. Compared to past work which focused primarily on canopy cover and species richness, we show the importance of analyzing spatial composition and nativity statuses in urban forests (and we created new datasets and tools to do so). This dataset could be analyzed in combination with citizen-science datasets on bird, insect, or plant biodiversity; social and demographic data; or data on the physical environment. Urban forests offer a rare opportunity to intentionally design biodiverse, heterogenous, rich ecosystems.

and manually corrected misspellings in all species names (see Data S9, and Materials and Methods, for full details). We manually coded all tree locations as being in a green space or in an urban environment to enable comparisons between location types. Finally, we referenced data from the Biota of North America Project on native status to classify each tree as native or not. The resulting dataset (Figure 1, Data S1) comprised 63 city datasheets each with 28 standardized columns (Table S4).

Pittsburgh; not shown are 22,647 trees belonging to other species (black points in (B)). The dataset includes information on species, exact location, whether or not a tree is native, tree height, tree diameter, location type (green space or urban setting), tree health/condition, and more (Data S1). Source data is Data S1 and Data S2; source code is Data S4.

factors were significantly correlated with effective species count but socio-cultural variables were not (supporting Table S1); biodiversity is negatively related to temperature seasonality (captured through environmental PC1). (D) The most abundant genus in each city is labeled here; see most common species by city in the supporting information. Table S1 is a supporting table. Source data is Data S1 and Data S2; source code is Data S4; and an associated tool to calculate effective species is Data S3.

We next investigated the spatial arrangement of biodiversity in urban forests. Biodiverse, rather than species-poor, city tree communities offer many well-documented benefits. Biodiverse forests are more effective in resisting diseases (Laćan & McBride, 2008), are more resilient in the face of climate change (Roloff et al., 2009), and confer greater mental health benefits (Fuller et al., 2007).
Compared to biodiversity, the spatial arrangement of trees is less well understood, even though clusters of same-species trees may be more susceptible to pest outbreaks (Greene & Millward, 2016; Raupp et al., 2006).

We found that city trees were nonrandomly clustered by individual species in 43 of 46 cities (Fig. 3A). Additionally, a city's clustering score was not significantly correlated with biodiversity metrics and is therefore a separate metric of interest (Fig. 3B, Fig. S4).

clustering in a city was not correlated with biodiversity (supporting Fig. S4). Ncities=46. Fig. S4 is a supporting figure for this figure. Source data is Data S1 and Data S5; source code is Data S4.

Our new dataset allows researchers and urban foresters to consider the utility of native vs. non-native trees. Whether or not a city decides to plant native as opposed to introduced species is a growing topic of interest (along with whether nativity matters, and how to define "nativity" (Berthon et al., 2021; Gould, 1998; Sjöman et al., 2016)). We classify plants as "native" if they occur in a particular region without direct or indirect human intervention. This definition does not account for the substantial effects of Indigenous peoples on plant communities before European contact, nor does this paper address the flaws with a "native-or-not" ecological approach (see discussion of an alternative Indigenous ecology in (Grenz, 2020; McKay, 2021)).

Here we found that the percent of trees that were native varied across cities from 0.44% to 85.6% with a median of 45.6% (Fig. 4). Wetter, cooler climates correlated with significantly higher percentages of native trees (Fig. 4A-B). However, it is important to note a strong east-to-west gradient, by which fewer native trees were present in western states (Figure 4A).
Thus, some social factor may have influenced the planting of native trees (Roman et al., 2018; Steenberg, 2018). However, after accounting for climate, younger cities had a higher percentage of native trees (Table S2); perhaps urban forestry practitioners have been more likely to consider nativity status in recent years. The observed east-to-west gradient deserves further research attention.

In general, native plant species support richer local ecosystems (e.g., more diverse and numerous bird and butterfly communities (Burghardt et al., 2009, 2010

Urban foresters must consider tree hardiness when selecting species to plant, and our dataset provides standardized metrics of tree health across many cities. Our preliminary analyses suggest that whether or not a tree was native had no clear impact on tree health (Table S3). Trees are generally healthier (i.e., have better condition) when they are smaller and/or in an urban setting rather than in parks (Table S3), possibly because city arborists quickly remove unhealthy trees in densely populated areas where they pose a fall risk. Further work is needed on within-species trends.

Are city tree communities more similar to each other than we would expect based on geography and climate? Indeed, we found that non-native trees drive similar species compositions between cities (Fig. 4C), reflecting the phenomenon of "biotic homogenization" (M. L. McKinney & Lockwood, 1999). Unsurprisingly, environment is a significant driver of tree community similarity between cities, but this association is stronger for native trees (Fig. 4D).

Beyond the analyses demonstrated above, our dataset could also be combined with social, economic, and physical variables for new analyses (Fig. 5). Simple maps of biodiversity in the Washington, DC area (Fig.
5A-B) show that high biodiversity qualitatively overlaps with high median household income (Fig. 5C). In other words, not only do "trees grow on money" (

correlates with a more biodiverse community of insects, birds, and even non-tree plants. Likewise, an analysis could consider whether the abundance of native trees correlates with other important measures of ecosystem health (such as insect abundance). Since citizen-science datasets typically include exact location, future work could assess these trends over fine scales (e.g., within particular parks or in bounded neighborhoods) as well as across cities.

Conclusion
Urban forests are ecosystems over which humans exercise precise control. We have an opportunity to design city tree communities that are biodiverse, spatially heterogenous, and include native species, thereby building resilience against climate change (

urban heat islands with temperatures > 95ºF, tend to overlap with less-richly-forested areas. Source data is Data S1 and open-access data available from the US Census and the DC Open Data Portal (see Materials and Methods) and source code is Data S4.

Materials and Methods

Data Acquisition
We limited our search to the 150 largest cities in the USA (by census population). To acquire raw data on street tree communities, we used a search protocol on both Google and Google Datasets Search (https://datasetsearch.research.google.com/). We first searched the city name plus each of the following: street trees, city trees, tree inventory, urban forest, and urban canopy (all combinations totaled 20 searches per city, 10 each in Google and Google Datasets Search). We then read the first page of Google results and the top 20 results from Google Datasets Search. If a same-named city in a different state appeared in the results, we redid the 20 searches adding the state name. If no data were found, we contacted a relevant state official via email or phone with an inquiry about their street tree inventory. Datasheets were received and transformed to .csv format (if they were not already in that format). We received data on street trees from 64 cities. One city, El Paso, had data only in summary format and was therefore excluded from analyses.

Data Cleaning
All code used is in the zipped folder Data S5. Before cleaning the data, we ensured that all reported trees for each city were located within the greater metropolitan area of the city (for certain inventories, many suburbs were reported, some within the greater metropolitan area and others not).

First, we renamed all columns in the received .csv sheets, referring to the metadata and according to our standardized definitions (Table S4).

for trees of a particular species. Wherever diameter was reported, we assumed it was DBH. We created a column called "location_type" to label whether a given tree was growing in the built environment or in green space. All of the changes we made, and decision points, are preserved in Data S9.

Third, we checked the scientific names reported using gnr_resolve in the R library taxize (Chamberlain & Szöcs, 2013), with the option best_match_only set to TRUE (Data S9). Through an iterative process, we manually checked the results and corrected typos in the scientific names until all names were either a perfect match (n=1771 species) or a partial match with threshold greater than 0.75 (n=453 species). BGS manually reviewed all partial matches to ensure that they were the correct species name, and then we programmatically corrected these partial matches (for example, Magnolia grandifolia, which is not a species name of a known tree, was corrected to Magnolia grandiflora, and Pheonix canariensus was corrected to its proper spelling of Phoenix canariensis). Because many of these tree inventories were crowd-sourced or generated in part through citizen science, such typos and misspellings are to be expected.

Some tree inventories reported species by common names only. Therefore, our fourth step in data cleaning was to convert common names to scientific names.
We generated a lookup table by summarizing all pairings of common and scientific names in the inventories for which both were reported. We manually reviewed the common-to-scientific name pairings, confirming that all were correct. Then we programmatically assigned scientific names to all common names (Data S9).

Fifth, we assigned native status to each tree through reference to the Biota of North America Project (Kartesz, 2018), which has collected data on all native and non-native species occurrences throughout the US states. Specifically, we determined whether each tree species in a given city was native to that state, not native to that state, or whether we did not have enough information to determine nativity (for cases where only the genus was known).

Sixth, some cities reported only the street address but not latitude and longitude. For these cities, we used the OpenCageGeocoder (https://opencagedata.com/) to convert addresses to latitude and longitude coordinates (Data S9).

Seventh, we trimmed each city dataset to include only the standardized columns we identified in

tree city USA), city age * tree city USA, and the log-transformed number of trees in a given city. We identified the best-fitting four models and report statistics in Table S1. We ran all models with and without the two strongest outlier cities, Miami, FL and Honolulu, HI.

Spatial Structure

We wanted to quantify the degree to which trees were spatially clustered by species within a city (rather than randomly arranged). To do so, we first clustered all trees within each city using hierarchical density-based spatial clustering through the hdbscan library in Python (McInnes et al., 2017). HDBSCAN, unlike typical methods such as "k nearest neighbors", takes into account the underlying spatial structure of the dataset and allows the user to modify parameters in order to find biologically meaningful clusters.
For city trees, which are often organized along grids or the underlying street layout of a city, this method can more meaningfully cluster trees than merely calculating the meters between trees and identifying nearest neighbors (which may be close as the crow flies but separated from each other by tall buildings).

We converted latitude and longitude values within a city to their planar projection equivalents (in Universal Transverse Mercator (UTM)) using the from_latlon function in the Python package utm (Bieniek et al., 2016). In total we had N = 59 cities with spatial information about their trees.

We then clustered all the trees in a given city using HDBSCAN with parameters (min_cluster_size=30, min_samples=10, metric='manhattan', cluster_selection_epsilon=0.0004, cluster_selection_method='eom'); we arrived at these parameters through trial and error with a sample set of cities.

Once we had all trees in a city assigned to spatial clusters (or, for trees far from the clusters, notated as "noise" and eliminated from further analysis), we used a bootstrapping method to quantify the degree of homogenization within spatial clusters. For each cluster of trees (e.g., a cluster of 120 trees in Pittsburgh, PA) we (i) calculated the observed effective species number; (ii) randomly resampled 120 trees from Pittsburgh's entire 45,703-tree dataset and calculated the effective species number of that random group of 120 trees; (iii) repeated step (ii) 500 times; (iv) recorded the mean, median, and interquartile range of effective species counts from those 500 samples; and (v) divided the observed effective species count in the actual spatial cluster of 120 trees by the expected effective species (median effective species count from all 500 samples).
The resulting value therefore quantifies the degree to which a spatial cluster is a random set of that city's tree species (values close to 100%) or a nonrandom set of same-species clusters (values less than 100%).
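The bootstrap in steps (i) to (v) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code (which is in the supplementary data). It makes three assumptions: effective species number is taken to be the exponential of Shannon entropy, resampling is done without replacement, and the score is the observed value divided by the expected value so that same-species clusters fall below 100%. The function names and the toy city are hypothetical.

```python
import math
import random
import statistics
from collections import Counter


def effective_species(species):
    """Effective species number, assumed here to be exp(Shannon entropy)."""
    counts = Counter(species)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)


def clustering_score(cluster_species, city_species, n_resamples=500, seed=0):
    """Observed effective species in one spatial cluster divided by the median
    effective species of random same-size draws from the whole city."""
    rng = random.Random(seed)
    observed = effective_species(cluster_species)
    k = len(cluster_species)
    # Assumption: draws are made without replacement from the city's trees.
    expected = statistics.median(
        effective_species(rng.sample(city_species, k)) for _ in range(n_resamples)
    )
    return observed / expected


# Hypothetical toy city: 4 species, 300 trees each.
city = ["oak"] * 300 + ["maple"] * 300 + ["elm"] * 300 + ["ash"] * 300
# Hypothetical spatial cluster of 120 trees dominated by one species.
cluster = ["oak"] * 110 + ["maple"] * 10
score = clustering_score(cluster, city)
print(f"clustering score: {score:.0%}")
```

A cluster drawn at random from the city would score near 100%, while the oak-dominated cluster above scores well below it. In practice the cluster membership would come from the HDBSCAN labels, with noise points dropped first.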

Nativity Status
To determine whether or not a tree was native to the state in which it appeared, we referred to the state-specific lists of native species from the Biota of North America Project. Each tree species was therefore coded as native = TRUE, native = FALSE, or native = no_info. Some tree records included only genus-level data, which were coded as "no_info".

We performed beta regression models with a logit link function using the package betareg in R (Zeileis et al., 2019), with percent native trees in a given city as the dependent variable. We assumed the precision parameter ϕ did not depend on any regressors. We started with a model incorporating only environmental variables, based on the substantial evidence that climate impacts native species biodiversity, and then added one variable at a time to determine whether the additional variables improved the model's performance (tested through the lrtest() function from the package lmtest (Hothorn et al., 2015)). The best model incorporated the following independent variables: environmental PCA1, environmental PCA2, log(number trees), and city age with no interaction terms. We identified one major outlier, Honolulu, HI, and re-ran the model excluding the outlier (results did not significantly change).
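The three-way native coding described at the start of this section can be illustrated with a short Python sketch. It is a hedged example rather than the authors' implementation: the function names, the rule for spotting genus-only records, the small native list, and the choice to compute percent native only over trees with known status are all assumptions made for illustration.

```python
def code_native(scientific_name, state_native_species):
    """Return "TRUE", "FALSE", or "no_info" for one tree record."""
    parts = scientific_name.split()
    # Assumption: genus-only records look like "Quercus" or "Quercus sp.".
    if len(parts) < 2 or parts[1].rstrip(".") in {"sp", "spp"}:
        return "no_info"
    binomial = f"{parts[0]} {parts[1]}"
    return "TRUE" if binomial in state_native_species else "FALSE"


def percent_native(codes):
    """Percent native among trees with known status (assumed denominator)."""
    known = [c for c in codes if c != "no_info"]
    if not known:
        return float("nan")
    return 100.0 * known.count("TRUE") / len(known)


# Hypothetical native list for one state, standing in for a BONAP state list.
native_in_state = {"Quercus rubra", "Acer rubrum"}
trees = ["Quercus rubra", "Ginkgo biloba", "Acer rubrum", "Quercus", "Platanus sp."]
codes = [code_native(name, native_in_state) for name in trees]
print(codes, percent_native(codes))
```

The per-city percentage computed this way would then be rescaled to a proportion between 0 and 1 before being used as the response in a beta regression.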

Condition and Health
We asked whether a tree's condition within a given city was correlated with size (DBH), location type (whether in the built environment or in green space such as a park), and nativity status. Fifteen cities had two or more of these variables with adequate sample sizes, and we ran separate logistic regression models by city because cities do not always score condition on comparable scales. We coded tree condition as a binary variable, where "excellent", "good", or "fair" condition trees were coded as 1 and "poor", "dead", and "dead/dying" trees were coded as 0. We used function glm2() in the R package glm2 (Marschner et al., 2011), and for each model determined whether it was a better fit than an empty model. We calculated odds ratios, confidence intervals, and p-values (see Table S3).
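The binary coding of condition can be sketched as follows, together with an unadjusted odds ratio for a single binary predictor. This is an illustrative Python sketch only: the paper's models were fitted in R with several predictors, whereas the odds ratio below comes from a simple 2x2 table (which matches a logistic regression with that one predictor alone). The function names, the handling of unrecognized labels, and the toy records are assumptions.

```python
GOOD = {"excellent", "good", "fair"}
POOR = {"poor", "dead", "dead/dying"}


def code_condition(label):
    """Map a condition label to 1 (good), 0 (poor), or None if unrecognized."""
    if label is None:
        return None
    key = label.strip().lower()
    if key in GOOD:
        return 1
    if key in POOR:
        return 0
    return None  # assumption: unrecognized labels are left uncoded


def odds_ratio(records, group):
    """Unadjusted odds ratio of good condition for trees in `group` vs. all others."""
    good_in = poor_in = good_out = poor_out = 0
    for label, location_type in records:
        y = code_condition(label)
        if y is None:
            continue
        if location_type == group:
            good_in += y
            poor_in += 1 - y
        else:
            good_out += y
            poor_out += 1 - y
    if poor_in == 0 or good_out == 0:
        raise ValueError("odds ratio undefined: empty cell in the 2x2 table")
    return (good_in * poor_out) / (poor_in * good_out)


# Hypothetical records: (condition label, location_type).
records = [
    ("Good", "built_environment"), ("Fair", "built_environment"),
    ("Excellent", "built_environment"), ("Poor", "built_environment"),
    ("Good", "green_space"), ("Dead", "green_space"),
    ("Poor", "green_space"), ("Fair", "green_space"),
    ("unknown", "green_space"),
]
print(odds_ratio(records, "built_environment"))
```

An odds ratio above 1 here would mean that trees in the built environment have higher odds of being in good condition than trees in green space, before adjusting for size or native status.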

Similarity Between Tree Communities
For N = 1953 city-city comparisons of street tree communities, we could calculate weighted measures of similarity because we had frequency data. We used chi-square distance metrics on species frequency data (because the actual count of trees reflected differences in sampling efforts between cities). Chi-square similarity is calculated following Equation 2, where n is the total number of species present in either city, x and y are vectors of species frequencies for the two cities being compared, and for each species i, xi is the frequency of that species in city x and yi the frequency of the same species in city y. Chi-square similarity is one minus the chi-square distance.

Equation 2

We calculated environmental similarity as one minus the normalized Euclidean distance in our PCA plot of environmental variables.

To determine whether city species similarity was driven by native species, non-native species, or neither, we performed a two-sample paired t-test using the function t.test in R between the native species chi-squared similarity scores and the all-species chi-squared similarity scores. Because the variables were not perfectly normally distributed (although they were even and symmetric), we also performed a non-

Corresponding Author: Ben Goulet-Scott
Email: bgoulet@g.harvard.edu

This PDF file includes:
Table S1 (supporting main-text Figure 2)
Table S2 (supporting main-text Figure 4)
Table S3
Table S4
Legends for Datasets S1 to S9

Other supplementary materials for this manuscript include the following:
Datasets S1 to S9:
DataS1_City_Trees_Data_63_Files

The most common species in each city is labeled (organized by family for display purposes).

Table S3. We ran logistic regression models to identify correlations between condition and (i) tree size, (ii) tree location, and (iii) whether or not a tree was native.
For size, smaller trees (lower diameter at breast height) tended to have better condition (9 of 12 cities). For location type, trees in the built environment tended to have better condition than those in parks (6 of 11 cities). For native status, results were mixed (native trees had no difference in condition for 8 of 15 cities, worse condition in 5 cities, and better condition in 2 cities).

biological name of the tree species (Quercus rubra)
greater_metro: greater metro area in which the city is found, which will match the "city name" in the filename (Atlanta, for example, which includes Decatur and other suburbs)
city: city name, as it is properly spelled (Las Vegas)
state: state name (as it is properly spelled, not abbreviation)
longitude_coordinate: exact location of tree species (longitude)
latitude_coordinate: exact location of tree species (latitude)
location_type: where the tree is located (green_space, built_environment, no_info)
zipcode: zipcode of the location
address: address where the data was collected
neighborhood: neighborhood of the location of the tree
location_name: if the location is named without being an address, such as Smith Cemetery or Route 11 Median
ward: city ward
district: the district in which the tree is located
overhead_utility: Is there an overhead utility (yes, no, conflicting)?
diameter_breast_height_CM: trunk diameter in cm at breast height
condition: tree condition as coded by the city-specific protocol
height_M: height of tree in meters
native: Is the tree native to the state (TRUE), not native (FALSE), or of unknown status due, for example, to being only genus level (no_info)? Assignments according to BONAP data.
height_binned_M: range of heights into which the tree falls, converted from feet
diameter_breast_height_binned_CM: range of diameters into which the tree falls, often converted from inches
percent_population: some sheets are in "summary" format with only percent of population represented for each species

Table S4. Here we define the standardized columns used herein.