GreenScan: Towards large-scale terrestrial monitoring of the health of urban trees using mobile sensing

Healthy urban greenery is a fundamental asset for mitigating climate change phenomena such as extreme heat and air pollution. However, urban trees are often affected by abiotic and biotic stressors that hamper their functionality and, when not managed in a timely manner, even their survival. While current greenery inspection techniques can help in taking effective measures, they often require a large amount of human labor, making frequent assessments infeasible at city-wide scales. In this paper, we present GreenScan, a ground-based sensing system designed to provide health assessments of urban trees at high spatio-temporal resolutions and low cost. The system fuses thermal and multi-spectral imaging sensors using a custom computer vision model in order to estimate two tree health indexes. The system was evaluated through data collection experiments in Cambridge, USA. Overall, this work illustrates a novel approach for autonomous, mobile, ground-based tree health monitoring at city-wide scales with high temporal resolution and low cost.


I. INTRODUCTION
Urban greenery improves the resilience of cities to climate change. Nowadays, protecting, managing, and restoring greenery ecosystems is fundamental for climate-resilient development, given the multiple risks posed to humanity and nature by global warming and climate change as per the latest UN-IPCC report [1]. In cities, tree canopies and vegetation provide a wide range of ecosystem services such as air filtering, carbon sequestration, reduced energy consumption, increased biodiversity, and decreased local temperatures [2], [3]. However, urban trees are experiencing a wide range of abiotic stressors (e.g. soil salinity, heat waves) and biotic stressors (caused by living agents such as insects and bacteria) that are exacerbated by climate change [4]-[6]. As a result, their functionality, productivity, and survival are of increasing concern [7]. Trees in poor health cannot provide most of their beneficial ecosystem services [8], [9]. For instance, trees with low transpiration rates do not cool the environment sufficiently, and trees with low growth rates have a reduced shading effect. By 2050, it is expected that about two-thirds of urban tree species worldwide will fail to provide the desired climate-positive benefits [10].
The practice of measuring and monitoring urban trees began over a century ago [11]. Today, the health of trees can be monitored through manual inspection by arborists with good-quality results [12]; yet the high labor cost leads to assessments being performed infrequently, at very low temporal resolutions such as once every 3-5 years. Technology-assisted monitoring methods can complement manual inspections [13]. However, these methods are impeded by variable data quality, low spatial granularity (remote sensing), or high operational costs (airborne sensing) [14]. Further, most of these methods are unable to quantify the vegetation elements below the tree canopy such as green walls, short trees, or shrubs [15], [16]. All these challenges lead to a lack of urban tree health data in cities and hamper appropriate urban forest management. For instance, adverse health conditions in trees are often discovered only after severe damage has already been inflicted. Further, from an urban planning perspective, the intricate relationships of urban trees with other micro-scale ecosystem services, such as air quality improvements and benefits to public health, are difficult to quantify. For instance, inappropriate placement of trees in outdoor environments can be detrimental as they can serve to trap air pollutants [17].

This work was supported by the MIT Senseable City Lab Consortium. A. Gupta would like to thank the Renswoude Foundation, FAST Delft and EFL Stichting for their financial support. Akshit Gupta is the corresponding author. He was a visiting student researcher at the MIT Senseable City Lab at the time of this work, while studying at TU Delft (email: a.gupta-5@tudelft.nl).
Recently, several projects have investigated novel alternatives for environmental sensing: for instance, applying AI (Artificial Intelligence) based methods to GSV (Google Street View) images to detect the presence of trees [18], [19], or using drive-by strategies to measure air pollution [20], [21] in a cost-efficient way. It was demonstrated, for instance, that just ten random taxis could capture data over one-third of the streets in Manhattan (New York City) in a single day using drive-by sensing [22]. Additionally, citizen science-based approaches [23] have also been successful in measuring urban environmental parameters [24]. All these methods are set within the domain of opportunistic sensing and are aimed at developing platforms that can be deployed and operated without the need for expensive or dedicated infrastructure, thus allowing democratic access even to under-resourced cities, which are affected by climate change in a disproportionate manner [25].
Following this trend and the critical need for protecting and managing urban forestry, in this work we develop a novel system, named GreenScan, which measures the health of urban trees at city-wide scales from ground level (terrestrially). The system fuses high-quality data from low-cost thermal and multispectral imaging sensors using custom computer vision models to generate two complementary tree health indexes, namely NDVI (Normalised Difference Vegetation Index) and CTD (Canopy Temperature Depression), which respectively indicate the photosynthetic capacity and water stress levels of a tree. GreenScan was designed both for deployment in citizen-science paradigms, by being carried by pedestrians, and in drive-by sensing approaches, by being mounted on urban vehicles such as taxis and garbage trucks; thus enabling terrestrial urban tree health measurements with high spatial and temporal resolutions at low cost for cities and municipalities around the world.
In this paper, we first give a brief overview of the state-of-the-art technology tools and methods to monitor the health of urban greenery, with a focus on low-cost solutions. We then present the design of GreenScan and describe the implementation of the hardware and software components. We evaluate the system with forty urban trees in uncontrolled outdoor environments and analyse its performance. Finally, we conclude by identifying the immediate future research that can be enabled through large-scale deployment of GreenScan, while discerning the limitations of this work.

II. RELATED WORK
Currently, the health of trees is monitored through manual inspection by human experts, remote and airborne sensing through satellites or UAVs, direct installation of embedded sensors on or near the tree, handheld imaging-based sensing, or opportunistic sensing using street-view imaging [13]. A comparison of these methods in terms of working mechanism, cost, and quality of assessment is shown concisely in Table I.
Manual inspection involves arborists (human experts) examining trees visually, often with the aid of tools such as borers (to extract a wooden core sample from the tree for laboratory analysis) or resistographs (to measure the drilling resistance of the trunk). These methods usually provide a high-quality assessment, but they are time-consuming due to the amount of human labor involved in performing a tree-by-tree assessment. Further, although effective, methods that require drilling into and penetrating the living wood may create an entry path for pathogens or may alter the structural integrity of a tree. For a review of these methods, see [12].
Embedded sensing involves the deployment of sensors in the bark of a tree (the outer wooden part of a tree) or in the soil. These sensors can rely on physical, chemical or electrical phenomena to detect the presence of parasites, e.g. detecting sudden minimal bark vibrations produced by parasite locomotion and feeding [26], as well as water uptake and transpiration, e.g. measuring electrical impedance using a pair of electrodes placed at opposite positions in the trunk [27]. These methods generate data at high temporal resolutions with little or no human supervision required, yet at the cost of installing and maintaining one or more sensors per tree. For a review of these methods, see [28].
Imaging-based methods involve the use of optical sensors such as thermal imaging sensors, HMI (hyperspectral or multispectral imaging) sensors, or LiDAR (Light Detection and Ranging). Thermal imaging is based on the IR (InfraRed) radiation emitted by materials and is mainly used to (i) measure cavities and physical damage in the living wood [29], [30], (ii) detect infections caused by insects and bacteria [31], [32], and (iii) calculate water stress levels by measuring the temperature of the leaves in the canopy [14]. HMI sensors, on the other hand, capture various bands of the electromagnetic spectrum, usually near-infrared and parts of the visible spectrum. The captured data is used to calculate various vegetation indexes, the most popular being NDVI (Normalized Difference Vegetation Index). HMI sensors are often used for remote sensing applications [33], although static sensors also exist [34]; calibration methods are critical to achieve quality results [35]. LiDAR sensors can be used to measure geometrical parameters, such as the leaves surrounding a branch or the trunk diameter, and to estimate the LAI (Leaf Area Index) [36]. However, findings on the usage of LiDAR are contradictory, with some works such as [37] claiming no increase in health classification performance from its addition. LiDAR and HMI sensors are often deployed in tandem in both airborne [36] and ground-based [38] approaches.
Recently, street-view based methods relying on RGB (Red, Green, Blue) images, usually Google Street View images, have become popular. These methods are used to quantify the presence of urban greenery [18], [39], catalog species [40], and estimate shading effects [41]. While these approaches are cost-effective and scalable, they are only able to quantify the extent of urban greenery at a terrestrial level rather than its health.
When imaging-based sensors are deployed on satellites, airplanes or UAVs (Unmanned Aerial Vehicles), high spatial coverage can be achieved. However, satellites have a low temporal resolution due to infrequent re-visit times, and data quality depends on the availability of clear skies [14]. Data collection using UAVs and airplanes involves high operational costs and is unsuitable for highly urbanised environments due to aviation regulations. Most importantly, both airborne sensing and satellite imagery can only capture an overhead view of urban tree canopies. As a result, lower vegetation elements such as green walls, short trees, or shrubs are often missed or misinterpreted [15].
For a systematic review of the technological methods and tools for greenery health monitoring, see [13].

TABLE I:
A comparison of sensing approaches for analysing tree health, along with their working mechanism, cost, and quality of assessment (* refers to relative cost, where $ is the lowest and $$$$ the highest cost for large-scale evaluation of multiple trees, based on the scale in [12])

A. Research gaps and influence on design
We seek to provide a scalable system that delivers high-quality data at low cost. Comparing the different approaches in Table I, it emerges that ground-based (terrestrial) sensing approaches combined with imaging-based methods can look at vegetation elements in a holistic manner, with high-quality data gathered either through drive-by sensing or citizen-science paradigms. Additionally, the advances in deep learning based computer vision models for imaging data over the past decade enable the development of a system that is broad in scope. Narrowing this down, past studies measuring tree health from ground level using low-cost imaging sensors are shown concisely in Table II. These studies are limited by requiring manual analysis of images by humans [42], outputting only raw data without ground truth validation [14], [43], or requiring controlled system deployment and operation [14], [44], [45].
Our work builds upon all these insights. Hence, in GreenScan, we use HMI and thermal imaging sensors to autonomously measure two health indexes, namely NDVI and CTD, from ground level. GreenScan is designed to be completely autonomous (by utilising a computer vision model based on deep learning), is suitable for deployment in non-controlled environments, and its early results are compared with a ground truth dataset provided by a municipality. While data from HMI and thermal imaging can be used to generate a number of health indices, we carefully chose the NDVI and CTD indices after explicit consideration. Our choices are driven by the following: 1) NDVI remains one of the most important and popular indices used in the domain [35], and the ground truth dataset provided by the municipality contains remotely sensed NDVI, enabling a fair scientific comparison; 2) CTD is one of the relatively simpler metrics for assessing properties of a tree such as water consumption and its resilience to drought and heat stress events [46], and it uses different wavelengths than NDVI (thermal imaging sensors instead of HMI sensors), in turn generating two complementary health parameters for urban trees.

III. METHODOLOGY
The GreenScan system integrates low-cost thermal and multispectral imaging sensors attached to a single-board computer. The system processes the imaging data generated by these sensors using a custom computer vision model to generate the two tree health indexes, namely CTD and NDVI. All these components are encased in a 3D-printed case, as shown in Figure 2a. The case was designed such that it can be attached to moving vehicles without any alterations using magnets, as shown in Figures 2b, 2c and 2d.
In this section, we aim to explain all the major modules of GreenScan.

A. System Architecture
The block diagram of the entire GreenScan system architecture is shown in Figure 1. The first five modules are related to hardware, while the remaining four are related to software.
1) Hardware Modules: In the following, we first provide a generic description of each hardware module, followed by its concrete implementation in the GreenScan system.

1. Thermal Imaging Sensor: A thermal imaging sensor with radiometric calibration (to measure true temperature) is attached to the central single-board computer and captures long-wave infrared images normalised to a suitable temperature range with low pixel resolution. A narrow temperature range is preferred to decrease the effect of non-linear noise across the sensor, as low-cost thermal imaging sensors are constrained in terms of resolution. This long-wave infrared imaging data is used for the generation of the CTD (Canopy Temperature Depression), which indicates the water stress levels of a tree. For the concrete implementation, we used a FLIR Lepton 3.5 (spectrum: long-wave infrared @ 8 µm - 14 µm) attached to an OpenMV Cam H7 using a FLIR Lepton adapter module. This captured thermal images with a pixel resolution of 160 × 120, which are normalised to a suitable temperature range (−10 °C to 40 °C). This temperature range was chosen based on the lowest and highest temperatures (±10 °C) of trees found during the data collection experiments (Section IV). The OpenMV Cam H7 communicates with the Raspberry Pi via RPC (remote procedure call) over USB.
2. Multispectral Imaging Sensor: A multispectral imaging sensor is attached to the central single-board computer and captures RGN (Red, Green and Near-infrared) imaging data with high pixel resolution. The near-infrared and red imaging data is used for the generation of the NDVI (Normalised Difference Vegetation Index). Further, this high-resolution imaging data is used for segmenting the tree canopy from the images using the custom deep learning model described in the Image Segmentation Module. For the implementation, a MAPIR Survey 3W (spectrum: red @ 660 nm, green @ 550 nm, near-infrared @ 850 nm) was attached to the Raspberry Pi over USB and captured RGN imaging data with a pixel resolution of 4000 × 3000. To trigger the capture and transfer of images on the MAPIR Survey 3W, PWM (pulse width modulation) signals over its micro-HDMI port are utilised.
3. GNSS Receiver: A GNSS (Global Navigation Satellite System) receiver with support for GPS, GLONASS, and Galileo is used to find the current location of the system and geo-tag all the captured tree images. For this prototype, the RGN images were geo-tagged using the standard GPS adapter available for the MAPIR Survey 3W.
4. Single-Board Computer with/without Edge TPU: A single-board computer, with or without an onboard edge TPU (Tensor Processing Unit) or a USB edge TPU accelerator (such as the USB TPU accelerator from coral.ai), acts as the central brain of the system.

5. Power Supply / Solar Panel: A lithium-ion battery (10,000 mAh) is used to ensure an uninterrupted power supply to the system, with support for charging over a solar panel or a standard power adapter (5 V / 2 A).
2) Software Modules: Herewith, we provide a description of each software module followed by its concrete implementation in the GreenScan system. A visualization of the images after processing by each software module is shown in Figure 3.

1. Event Trigger: The event trigger signals the beginning of processing on the Raspberry Pi; in the current prototype, a press of a push button is used as the event trigger for the data collection experiments. The event trigger could also be the co-location of the system with particular GPS coordinates fetched from a tree inventory database. For the thermal imaging sensor, triggering involves initiating callbacks requesting the transfer of the current image frame from the FLIR Lepton 3.5. For the multispectral imaging sensor, it involves generating PWM signals to capture an image, mounting the memory card installed in the MAPIR Survey 3W on the Raspberry Pi, transferring the captured image to the Raspberry Pi, and finally unmounting the memory card.
2. Image Registration: Image registration involves matching or aligning images taken by two different sensors into a single coordinate system for further analysis [49]. It includes detecting key points in one image and mapping them to another image. Since the multispectral and thermal imaging sensors have different FoVs (fields of view) and are not aligned, this module aligns the multispectral images to the thermal images through linear translation in both the horizontal and vertical directions. Also, to compensate for the wider FoV of the multispectral sensor, this module handles zooming in on the multispectral images. For the current prototype, the values of translation in the X and Y directions were found to be +50 (right) and +150 (upwards) pixels respectively, and the zoom scale was found to be 0.57 (where 1 indicates no magnification and 0 indicates 100% magnification) to perfectly overlay the thermal and RGN images. These parameters were found by manually taking multiple RGN and thermal images and overlaying them. An instance of the inputs and outputs of this module is shown in Figure 3. Further, automatic image registration using three image registration algorithms, namely SIFT, SURF, and ORB [50], was also tested. However, these algorithms were not able to detect useful keypoints or features in the thermal images, possibly due to their low resolution (160 × 120).
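The fixed crop-and-translate registration described above can be sketched as follows. This is a minimal NumPy-only illustration, not the paper's implementation: the exact order of operations (zoom before translation) and the scaling of the pixel offsets to the thermal grid are assumptions, while the calibration constants (+50 px right, +150 px up, zoom scale 0.57) are the values reported in the text.

```python
import numpy as np

def register_rgn_to_thermal(rgn, thermal_shape=(120, 160),
                            dx=50, dy=150, scale=0.57):
    """Sketch of the manual registration: centre-crop the wider-FoV RGN
    image by `scale`, resample it to the thermal resolution (nearest
    neighbour), then shift by the calibrated offsets (assumed to be
    expressed in full-resolution RGN pixels)."""
    h, w = rgn.shape[:2]
    ch, cw = int(h * scale), int(w * scale)            # crop window size
    top, left = (h - ch) // 2, (w - cw) // 2
    crop = rgn[top:top + ch, left:left + cw]
    # nearest-neighbour resize down to the thermal grid
    th, tw = thermal_shape
    rows = (np.arange(th) * ch / th).astype(int)
    cols = (np.arange(tw) * cw / tw).astype(int)
    resized = crop[rows][:, cols]
    # translate: +dx right, +dy up, rescaled to thermal-grid pixels
    sx, sy = int(dx * tw / w), int(dy * th / h)
    return np.roll(np.roll(resized, sx, axis=1), -sy, axis=0)
```

After this step, a pixel (i, j) in the returned array corresponds to pixel (i, j) in the 160 × 120 thermal image, so a single canopy mask can index both.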
3. Image Segmentation: This is the most computationally intensive software module of the system. Recall that the aim of our system is to calculate the NDVI and CTD values for each tree in the images. However, these values should be calculated only for the leaves in the tree canopy, excluding the wooden parts such as the trunk and branches. This is solved using a fusion of a custom-developed Mask R-CNN (Mask Regional Convolutional Neural Network) and pixel-wise NDVI analysis. Mask R-CNN [51] is an object detection and instance segmentation model that identifies an object and then draws a precise mask around it. Given a multispectral RGN image captured using the multispectral imaging sensor, the task can be broken into two sub-problems as follows: • Detect the canopy part of the trees, even in cases where the image contains multiple trees: This is solved using a custom-developed Mask R-CNN model, trained using transfer learning and discussed in more detail in Section III-B. It segments the instances of the tree canopies in the RGN image by generating a mask (segmentation) over them, as shown in Figure 5.
• Remove noise: Once the canopy of the tree is detected, only the leaves of the tree must be segmented, excluding the wooden branches and the sky. Non-vegetation elements such as trunks, branches, and sky have very low NDVI values compared to vegetation elements, which have significantly higher NDVI. Thus, we employ a thresholding-based method which first calculates the individual NDVI of each pixel in the segmentation mask generated by Mask R-CNN and then eliminates pixels with NDVI values below a certain threshold. The NDVI for each pixel is computed by plugging the raw values of the red and near-infrared channels of the pixel into (4). In order to eliminate noise along the edges of the tree canopy, median filtering is also employed. The end result of this two-stage approach is a segmentation of only the leaves present in the tree canopy, eliminating the sky, wooden branches, trunk and other street objects such as buildings and cars in the multispectral image. Since the thermal and multispectral images are registered, the same leaf mask can also be used for the thermal images. With the MAPIR Survey 3W employed in the GreenScan system, a value of 0.02 was used as the cutoff to eliminate non-vegetation elements in the image. This value was derived from an analysis of the images captured during the data collection experiments. An instance of the inputs and outputs of this module is also shown in Figure 3.
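The thresholding stage of this noise-removal step can be sketched as below. This is an illustrative NumPy sketch, not the production code; the 0.02 cutoff is the value reported in the text, the median-filtering step along canopy edges is omitted for brevity, and the small epsilon guarding division by zero is an implementation detail added here.

```python
import numpy as np

def leaf_mask(nir, red, canopy_mask, thresh=0.02):
    """Keep only pixels that lie inside the Mask R-CNN canopy mask AND
    whose per-pixel NDVI exceeds `thresh` (0.02 for the MAPIR Survey 3W),
    removing sky, trunk, and branch pixels from the segmentation."""
    ndvi = (nir - red) / (nir + red + 1e-9)   # per-pixel NDVI, Eq. (4)
    return canopy_mask & (ndvi > thresh)
```

Because the thermal and RGN images are registered, the boolean mask returned here can be applied unchanged to the thermal image when computing the CTD.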
4. Analysis and Calculation Module: This module handles the calculation of the final NDVI and CTD for each tree in the field of view of the imaging sensors. The CTD value is computed by calculating the raw temperature value for each pixel in the grayscale thermal image as per (2), computing the mean temperature over all pixels in the canopy, and subtracting the ambient air temperature from the mean canopy temperature as per (3). CTD is calculated as:

CTD = T_canopy − T_air    (1)

where T_canopy and T_air are the canopy temperature and air temperature respectively, in °C. The temperature of each pixel is calculated as:

T_pixel = T_min + (P_value / 255) · (T_max − T_min)    (2)

where P_value is the pixel value in the normalised (8-bit grayscale) thermal image, and T_min and T_max are the bounds of the configured temperature range for the thermal imaging sensor (−10 °C and 40 °C in our case). Then, as per (1), CTD is calculated as:

CTD = (1/N) Σ_i T_pixel,i − T_air    (3)

where the mean of T_pixel over all N segmented canopy pixels gives the average canopy temperature and T_air is the air temperature.
To calculate the NDVI, each pixel in the RGN image is split into its three constituent channels (red, green and near-infrared). The raw NDVI value for each pixel is calculated from the red and near-infrared channels as per (4). To compensate for the aperture adjustment, the focal adjustment and other mechanical adjustments performed by the multispectral imaging sensor, the raw NDVI is normalised by applying a correction factor similar to the dynamic range of a camera [52], as shown in (5). NDVI is calculated as:

NDVI_raw = (NIR − Red) / (NIR + Red)    (4)

where NIR and Red are the values of the near-infrared and visible red channels for each pixel respectively. The corrected NDVI is calculated as:

NDVI_corrected = (NDVI_raw − NDVI_min) / (NDVI_max − NDVI_min)    (5)

where NDVI_raw is the raw NDVI of a pixel, and NDVI_max and NDVI_min are the maximum and minimum NDVI values over all pixels in the segmented image. Finally, the NDVI for the entire canopy is computed by taking the mean of the corrected NDVI values over all leaf pixels in the segmented tree canopy.
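The NDVI pipeline of Eqs. (4)-(5) can be sketched as follows. This is an illustrative sketch, assuming the correction in (5) is a min-max normalisation over the segmented pixels; the epsilon terms guarding division by zero are additions for numerical safety, not part of the paper's formulation.

```python
import numpy as np

def corrected_canopy_ndvi(nir, red, leaf_mask):
    """Per-pixel raw NDVI (Eq. 4), min-max normalisation over the
    segmented leaf pixels to compensate for the camera's automatic
    exposure adjustments (Eq. 5), then the canopy mean."""
    raw = (nir - red) / (nir + red + 1e-9)        # Eq. (4)
    vals = raw[leaf_mask]
    lo, hi = vals.min(), vals.max()
    corrected = (vals - lo) / (hi - lo + 1e-9)    # Eq. (5)
    return corrected.mean()                       # canopy-level NDVI
```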

B. Development of Custom Mask R-CNN
For the system to operate autonomously, images will be captured in an unsupervised fashion. Thus, in addition to multiple trees in a single image, they may contain other objects such as cars, buildings, grass, and snow. Hence, it is imperative to individually identify all the tree canopies in an image and feed them to the Analysis and Calculation Module. The custom Mask R-CNN part of the image segmentation module solves this by providing instance segmentation of the tree canopies in the image. To our knowledge, there is no pre-existing model available for instance segmentation of tree canopies, or even of trees, in standard RGB images. The problem is further complicated as our input consists of RGN (Red, Green, Near-infrared) images from the multispectral imaging sensor instead of standard RGB images. For instance, we found that pre-trained models like DeepLabv3 [53], which can perform semantic segmentation of trees and vegetation on standard RGB images, perform poorly on RGN images.
1) Training Data: Any deep learning model requires training data in order to optimise the weights of its layers. However, no dataset exists with labels for instances of trees or tree canopies in RGN images. Hence, we manually created a dataset using the RGN images collected during the data collection experiments (see Section IV-B). Each tree canopy in the images was manually annotated using LabelMe [54], a popular image annotation tool. During annotation, only tree canopies that were completely present in the image were labelled. After this process, our dataset consisted of 51 annotated RGN images with two classes, namely tree canopies and background.
2) Training Process and Training Curve: Our dataset consists of a relatively small number of images, too few to train a deep learning model like Mask R-CNN from scratch. Transfer learning combined with data augmentation was therefore employed to develop a custom model from an existing model pre-trained on a different dataset. For this, we used a Mask R-CNN pre-trained [55] on COCO [56] (a dataset with 330K images) with ResNet101 as the backbone. We re-trained only the head layers (the top layers, excluding the backbone) on our dataset. The batch size was configured as 4 and the number of epochs was 10.
The training was performed on the Google Cloud Platform on an N1 instance with 13 GB memory and 2 vCPUs. We also generated synthetic data by augmenting the original dataset with flips in the horizontal and vertical directions and by applying Gaussian blur. This increased the training dataset size by 50% and acted as a regularizer. The manually annotated dataset (see Section III-B.1) consisting of 51 images was split in the ratio of 70:30 for training and testing. During re-training, each epoch took approximately 3 hours on the N1 instance. The training curve of the model is shown in Figure 4. It can be seen from the training curve that only a small number of epochs is sufficient to reach the optimal validation loss on the test set, owing to the re-training of only the head layers. Visual output results from our model are shown in Figure 5.
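The flip-based part of the augmentation recipe can be sketched as below. This is an illustrative sketch only: the Gaussian blur augmentation also used in the paper is omitted for brevity, and the key point shown is that each geometric transform must be applied identically to the image and its annotation mask so the labels stay aligned.

```python
import numpy as np

def augment(image, mask):
    """Return the original (image, mask) pair plus horizontally and
    vertically flipped copies, with the annotation mask flipped in
    lockstep with the image."""
    pairs = [(image, mask)]
    pairs.append((np.flip(image, axis=1), np.flip(mask, axis=1)))  # horizontal flip
    pairs.append((np.flip(image, axis=0), np.flip(mask, axis=0)))  # vertical flip
    return pairs
```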
3) Model Quantization: Mask R-CNN is a relatively heavy model from both the training and inference points of view. Hence, the developed Mask R-CNN was optimized to run on the edge at the cost of a possible minor performance reduction. For this,

IV. EVALUATION
The system was evaluated using a dataset obtained from the municipality of Cambridge, USA as a ground truth reference. We also conducted three data collection experiments to collect data on urban trees using the GreenScan system. In this section, we elaborate on this dataset and the data collection experiments, followed by the obtained results.

A. (Ground truth) Tree Health Dataset
Municipalities obtain tree health data through city-wide surveys performed over the years. For instance, in the city of Cambridge, USA, a survey is performed every 5 years, whereas in the city of Delft, The Netherlands, a survey is performed every 1 to 2 years depending on the previously rated condition of the tree. For the evaluation of our work, we obtained the tree health dataset for the city of Cambridge, USA through the Cambridge Urban Forest Master Plan [16] to be used as a ground truth reference. A 2018 dataset was obtained from the municipality. This dataset was created through a combination of manual in-person arborist visits, satellite-based remote sensing and aerial LiDAR [16]. The dataset classifies the health conditions of trees into three categories, namely good, fair, and poor. It contains information about 47,063 trees, of which 35,821 are in good health, 5,176 in fair health and 6,066 in poor health. Hence, most of the trees (> 75%) are rated as being in good health. In addition, the dataset contains information about the tree species, common name, satellite-based NDVI, latitude and longitude, location, shape length and shape area of the canopy, and other parameters. The dataset was provided as Shapefiles (.shp, a data format used by GIS (Geographical Information Systems)) and was loaded into the online platform CARTO [58] (a GIS and spatial analysis tool). On a side note, the staleness of this data also underscores the need for advances in tree health monitoring.

B. Data Collection Experiments
We collected multispectral (RGN) and thermal images with the developed system on three separate days in Cambridge, USA during February 2022. A push button was used as the event trigger for the system. We employed the GreenScan system in its 3D-printed casing in a citizen-science fashion, with a pedestrian moving at walking speed in a straight line at a distance of 8 m - 20 m from the tree. In total, we collected data for 49 trees spread over two species, namely Red Pine and Eastern White Pine. The multispectral imaging sensor was configured with a shutter speed of 1/60 s and an ISO of 50. The thermal imaging sensor was configured to measure temperatures in the range of −10 °C to 40 °C. The sites of the data collection experiments, chosen based on species and accessibility, are shown in Figure 6.
Species Constraints: Trees are broadly of two types, evergreen and deciduous. During winter, deciduous trees lose their leaves, which hampers NDVI calculation. Hence, since data collection took place in winter, our analysis was constrained to evergreen trees. The species Red Pine and Eastern White Pine were selected because they are evergreen and, according to CARTO, the most widespread and easily accessible evergreen trees in the city of Cambridge.
Data Cleaning: During the first day of the data collection experiments, the Raspberry Pi froze for unknown reasons, requiring a forced restart. On the third day, owing to cold temperatures, the power supply had to be changed during data collection. These interruptions and restarts resulted in unstable canopy temperature values from the thermal imaging sensor for a sequence of readings. As a result, these 11 data points were removed from the dataset generated by the data collection experiments, reducing it to 40 trees. The distribution of the data collected for each tree species after cleaning is shown in Table III.

C. Performance of custom Mask R-CNN
To measure the performance of our custom Mask R-CNN model, we calculated the standard evaluation metrics used by COCO [59]. Specifically, we measured mAP (mean Average Precision) / AP (Average Precision) at different IoU (Intersection over Union) thresholds (as per [59]). The performance of our custom Mask R-CNN with and without quantization is shown in Table IV, along with a comparison of the inference time and model size of the full and quantized models. In order to measure the stability of our results, k-fold cross-validation [60] with k=3 was also performed to evaluate the performance of the model on different training and test splits, as shown in Table V. The results for the different train-test splits showcase the reliability of our model.
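The building block behind these AP metrics is the mask IoU, which can be sketched as below. This is a generic illustration of the standard definition, not the COCO evaluation code itself: a predicted canopy mask counts as a true positive at threshold t only if its IoU with a ground-truth mask is at least t.

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection over Union of two boolean segmentation masks:
    |pred AND gt| / |pred OR gt|. Returns 0.0 for two empty masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0
```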
From Table IV, it can be seen that quantization causes no significant reduction in performance, while halving the inference time and reducing the model size. An example of segmentation outputs generated by the full and quantized models on the same image is shown in Figure 7, visually confirming their similar performance.
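To illustrate why quantization shrinks the model roughly fourfold with little accuracy loss, here is a sketch of post-training affine int8 quantization of a single weight tensor (a pure NumPy illustration of the principle; the actual system would quantize the full network with a framework toolchain):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map a float32 tensor onto int8 with a per-tensor scale and zero point."""
    scale = (w.max() - w.min()) / 255.0           # 256 representable levels
    zero_point = np.round(-w.min() / scale) - 128
    q = np.clip(np.round(w / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.default_rng(0).normal(size=(128, 256)).astype(np.float32)
q, scale, zp = quantize_int8(w)
w_hat = dequantize_int8(q, scale, zp)

assert q.nbytes * 4 == w.nbytes                   # int8 storage is 4x smaller
assert np.abs(w - w_hat).max() <= scale           # error bounded by one step
```

The accuracy cost is bounded by the quantization step `scale`, which is why Table IV shows essentially unchanged AP alongside the reduced size and inference time.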
From Table IV, the AP (IoU = 0.5:0.95:0.05) [59] of the quantized model appears slightly higher than that of the full model. On exploring this anomaly, we found that it arises from our annotated dataset, in which most images contain only one full tree canopy as ground truth. A model (here, the non-quantized one) that generalises better and detects partially visible tree canopies in addition to the full canopy is therefore penalised in precision (as false positives). Further, Figure 8 shows that the performance of the quantized model drops off at a lower IoU threshold than the full model (IoU = 0.85 versus 0.90), indicating that it is slightly poorer at object localisation.

Fig. 8: The AP scores with increasing IoU thresholds, as per the COCO metrics [59], for the full and quantized models

D. Results for the health of trees
From all the parameters present in the ground truth dataset, we extracted three: Ground Truth Condition (Health), Remote NDVI, and Area of the tree (measured using aerial LiDAR).
A comparison of our system-measured NDVI and the Remote NDVI is shown in Figure 9. As seen there, our measured NDVI is distributed similarly to the Remote NDVI for each individual tree (denoted by Tree Index in Figure 9). However, two datasets can be highly correlated yet strongly disagree. We therefore used a Bland-Altman plot [61], which is widely used to assess the agreement between two monitoring methods measuring the same attribute: it plots the difference between corresponding measurements against their average. The Bland-Altman plot in Figure 10 shows strong agreement between the two methods (Remote NDVI and our measured NDVI), with all points (one per tree) except one lying within the 95% limits of agreement (mean difference ± 1.96 standard deviations of the differences, since the differences follow a normal distribution).
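The limits-of-agreement construction behind a Bland-Altman plot can be sketched as follows (a minimal illustration; variable names are ours):

```python
import numpy as np

def bland_altman(method_a: np.ndarray, method_b: np.ndarray):
    """Bias and 95% limits of agreement between two measurement methods.

    Plotting `diff` against `avg`, with horizontal lines at `bias`,
    `lower`, and `upper`, reproduces a Bland-Altman plot.
    """
    diff = method_a - method_b
    avg = (method_a + method_b) / 2.0
    bias = diff.mean()                 # mean difference between methods
    sd = diff.std(ddof=1)              # sample SD of the differences
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    return avg, diff, bias, (lower, upper)
```

Points falling inside `(lower, upper)` are within the 95% limits of agreement, mirroring the all-but-one-tree result reported above.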
Pearson's correlation coefficient (r) was also computed to quantify the strength of the linear relationship between our measured values and the ground truth data. The correlation matrix relating all of our measured values to the three ground truth parameters (Ground Truth Condition, Remote NDVI, and Area) is shown in Figure 11. The correlations of the measured NDVI and CTD with the ground truth parameters are shown in Table VI.
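For reference, Pearson's r used throughout this analysis can be computed as follows (a plain NumPy sketch; in practice a statistics library would also report the p-value):

```python
import numpy as np

def pearson_r(x, y) -> float:
    """Pearson's correlation coefficient between two 1-D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    # Covariance normalised by the product of standard deviations
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))
```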
The distribution of CTD and NDVI with respect to ground truth health conditions is shown in Figure 12. From the NDVI distribution in Figure 12a, it can be seen that the extent of agreement of NDVI with the ground truth health conditions varies by species. For red pine trees, those in good health have higher measured NDVI values than trees in poor and fair condition. The mean measured NDVI and CTD across species and health conditions are summarised in Table VII.
1) High-level tree health analysis: From Figure 11, it is clear that there is almost no correlation between NDVI and CTD. They therefore measure two independent attributes related to tree health, and both are worth incorporating in the system. In recent work such as [62], the correlation between remote NDVI measured by two different satellites was found to be 0.74 (moderately strong). Here, Table VI shows a moderately strong correlation (r = 0.54, p < 0.05) between our ground-based measured NDVI and the remote NDVI. This moderately strong correlation supports the validity of our approach to ground-based NDVI measurement using multispectral imaging sensors.
Since there is no ground truth reference attribute for CTD, which indicates the water stress of trees, we checked the correlation of CTD with the ground truth health condition, as shown in Table VI. The weak-to-moderate correlation (r = 0.28, p < 0.05) between CTD and ground truth tree health condition can be attributed to the skewed distribution of the dataset, in which more trees are rated in good condition than in poor or fair condition. Further analysis of the CTD distribution for all trees in Figure 12b shows high variability of CTD for trees in poor condition, which leads to this overall weak-to-moderate correlation.
2) Species-wise tree health analysis: From the NDVI distributions for Eastern White Pine in Figure 12a, it can be seen that trees in good condition generally have higher NDVI values than trees in poor or fair condition. Thus, a simple threshold-based classification algorithm can easily flag trees that may not be in good health. At the scale of tens of thousands of trees in a city, this can yield significant cost savings.
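Such a threshold rule could look like the following sketch (the cut-off values here are placeholders; real thresholds would be calibrated per species from the measured NDVI distributions in Figure 12a):

```python
# Hypothetical per-species NDVI cut-offs, to be calibrated from field data
NDVI_THRESHOLD = {"eastern_white_pine": 0.35, "red_pine": 0.30}

def flag_for_inspection(species: str, measured_ndvi: float) -> bool:
    """Flag a tree for closer inspection when its NDVI falls below the
    species-specific threshold, i.e. it may not be in good condition."""
    return measured_ndvi < NDVI_THRESHOLD[species]
```

Applied city-wide, a rule this simple lets arborists prioritise the flagged subset of trees instead of inspecting every tree.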
From Figure 12b, while red pine trees in poor condition show a higher CTD than those in good and fair condition, the same pattern does not hold for Eastern White Pine trees. This observation about CTD is consistent with earlier works such as [44], [47], where the tree species under observation had a significant influence on the results obtained from thermal imaging. Hence, further studies with varied species are required to assess the stability of CTD with respect to ground truth health conditions.

V. LIMITATIONS AND FUTURE WORK
Subsequent investigations stemming from this work, using our approach as a foundational framework, are expected to illuminate and solve several novel challenges, a subset of which we delineate below.
1) Feasibility of modelling-based classification: From the correlation matrix in Figure 11, it can be seen that there is no correlation between CTD and NDVI values. Hence, both measured parameters are useful features for autonomous models that classify tree health. A scatter plot of NDVI versus CTD for red pine trees is shown in Figure 13. From the scatter plot, most of the fair- and poor-condition trees are concentrated in a cluster between NDVI 0.20-0.35 and CTD 0-7. Hence, simple white-box machine learning algorithms, such as kernel SVMs (support vector machines) [63] or logistic regression classifiers [64], could autonomously distinguish good-condition trees from poor- or fair-condition trees. The methodology could further be expanded with human-in-the-loop validation at intermediate steps to enhance the performance of the system.
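As a proof-of-concept of such a white-box classifier, the sketch below trains a pure NumPy logistic regression by gradient descent on synthetic (NDVI, CTD) pairs mimicking the cluster described above (the data here are simulated for illustration, not our field measurements):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated features: poor/fair trees cluster at NDVI ~0.20-0.35, CTD ~0-7
good = np.column_stack([rng.normal(0.45, 0.05, 60), rng.normal(-2.0, 2.0, 60)])
flag = np.column_stack([rng.normal(0.28, 0.04, 60), rng.normal(3.5, 2.0, 60)])
X = np.vstack([good, flag])
y = np.concatenate([np.zeros(60), np.ones(60)])   # 1 = poor/fair condition

# Standardise features, then fit logistic regression by gradient descent
X = (X - X.mean(axis=0)) / X.std(axis=0)
w, b = np.zeros(2), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))        # sigmoid probabilities
    w -= 0.1 * (X.T @ (p - y)) / len(y)           # gradient of the log-loss
    b -= 0.1 * (p - y).mean()

accuracy = ((1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5) == y).mean()
```

Because the two measured indexes are uncorrelated, both contribute independent signal, and even this two-feature linear model separates the simulated classes well.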
2) Direction of movement and robust positioning: At present, our methodology is contingent on aligning the system containing the imaging sensors with the trees' orientation. In real-world deployments, however, the angle between the tree canopy and the camera's direction could affect the segmentation of tree canopies. This can be improved with a simple image selection algorithm that prefers frames in which the majority of pixels belong to a tree canopy. Furthermore, our current positioning method relies on GPS coordinates sourced from the tree survey dataset and the GNSS module within the system, both of which carry inherent positioning uncertainty. An alternative positioning approach using RTK (Real-Time Kinematics) could enhance positioning robustness.
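The image selection algorithm mentioned above could be as simple as the following sketch (our own illustration; `masks` are assumed to be boolean canopy masks produced by the segmentation model for consecutive frames):

```python
import numpy as np

def canopy_fraction(mask: np.ndarray) -> float:
    """Fraction of the frame occupied by tree-canopy pixels."""
    return float(mask.mean())

def select_best_frame(masks) -> int:
    """Index of the frame whose canopy covers the largest share of the image."""
    return max(range(len(masks)), key=lambda i: canopy_fraction(masks[i]))
```

Keeping only the best frame per tree would make the downstream NDVI and CTD estimates less sensitive to the viewing angle of a moving vehicle.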
3) Scalability across weather conditions and geographical boundaries: The effects of different weather conditions, including reduced visibility and sunlight directly facing the imaging sensor lenses, need further exploration. Furthermore, deploying and validating the system in cities with different topographies and under geographical domain shifts can help enhance the generalisation of the approach through large-scale training and validation datasets.

VI. CONCLUSION
Urban greenery provides various environmental services, such as carbon sequestration and cooling, making it essential for building climate-adaptive cities. Currently, urban trees experience atypical amounts of natural and human-induced stress, leading to volatile health. Yet high costs make frequent large-scale inspections infeasible for cities, so adverse health conditions are often discovered only after severe damage. The currently popular methods for monitoring the health of urban trees rely on in-person inspection by arborists or on remote sensing based on satellite or airborne imagery; all of these methods face challenges involving scalability, spatio-temporal resolution, and quality of assessment. In this work, we developed GreenScan, a novel system that measures tree health autonomously from the ground level in urban cities. GreenScan fuses data from low-cost thermal and multispectral imaging sensors using custom computer vision models optimised for efficiency to generate two tree health indexes, NDVI and CTD. A custom Mask R-CNN model, fine-tuned using transfer learning, was employed to fuse the data collected by the imaging sensors on the edge device. Deployment can follow either a drive-by sensing paradigm on moving vehicles such as taxis and garbage trucks, or a citizen-science sensing paradigm with human participants. An initial evaluation of the system was performed through data collection experiments in Cambridge, USA. The custom Mask R-CNN performed admirably, with AP (IoU = 0.50) = 0.938 despite the small dataset used for training. The tree health analysis revealed a moderately strong correlation between our measured NDVI and the remote NDVI obtained from the ground truth dataset. Further, our measured NDVI distributions can be used to flag trees that are not in good health. For the measured CTD, a pattern in theoretical agreement was observed for one of the species; further large-scale evaluation studies over multiple species would help improve the generalisability of the system. In essence, this work illustrates the potential of autonomous ground-based urban tree health monitoring at city-wide scales and high temporal resolutions, and motivates future research at the intersection of environmental science and computer science.

Fig. 1: Architecture diagram of the GreenScan system. (a) All hardware components encased in the 3D-printed case; dimensions in inches: 7.32" × 2.50" × 1.74". (b) Concept casing with magnets. (c) The system attached to the top of a car. (d) Close-up view of the system on the roof of a car.

Fig. 2: A visualization of the current GreenScan system and the concept casing

Fig. 3: A visualisation of processing the images at each software module on the Raspberry Pi

Fig. 4: The training curve of Mask R-CNN with epochs = 10 and batch size = 4. The red point indicates the point of minimum loss; training losses are as defined in [51].

Fig. 5: Performance of our custom Mask R-CNN. Notice how the model detects each instance of the tree canopy in the image and treats all other objects as background.

Fig. 6: Locations of the analysed trees. The red boxes indicate Red Pine trees and the blue boxes indicate Eastern White Pine trees.

Fig. 7: Outputs from the full and quantized custom Mask R-CNNs

Fig. 9: Variation of measured NDVI vs Remote NDVI for trees observed during the data collection experiments. The tree index refers to an individual tree ID.

Fig. 10: The Bland-Altman plot showing the agreement between the measured NDVI and Remote NDVI. The dashed middle line shows the mean difference; the topmost and bottommost lines mark the upper and lower 95% limits of agreement, respectively.

Fig. 11: Correlation matrix between our measured values (in bold) and parameters from ground truth dataset

Fig. 12: The distribution of NDVI and CTD for the trees with respect to health

Fig. 13: Scatter plot between NDVI and CTD (in °C) for red pine trees. The colour of the points indicates the ground truth health: red denotes poor, yellow fair, and green good condition trees.

A concise comparison of our work with earlier works in the field measuring tree health terrestrially

TABLE IV: Performance of the custom Mask R-CNN model (full and quantized)

Results of 3-Fold Cross Validation of custom Mask R-CNN model

The correlation between our measured values and ground truth parameters

The mean of measured NDVI and CTD across species and health