Camera-based Light Emitter Localization using Correlation of Optical Pilot Sequences

Visual identification of objects using cameras requires precise detection, localization, and recognition of the objects in the field-of-view. The visual identification problem is very challenging when the objects look identical and features between distinct objects are indistinguishable, even with state-of-the-art computer vision techniques. The problem becomes significantly more challenging when the objects themselves do not carry rich geometric and photometric features, for example, in visual identification and tracking of light emitting diodes (LEDs) for visible light communication (VLC) applications. In this paper, we present a camera-based visual identification solution where objects or regions of interest are tagged with an actively transmitting LED. Motivated by the concept of pilot symbols, typically used for synchronization and channel estimation in radio communication systems, the LED actively transmits unique pilot symbols which are detected by the camera across a series of image frames using our proposed spatio-temporal correlation based algorithm. We set up visual identification as a problem of localizing the LED on the camera image, which involves identifying the pixels and the unique ID corresponding to the LED. In this paper, we present the algorithm and a trace-based evaluation of the identification accuracy under real-world conditions including indoor, outdoor, static and mobile scenarios. In addition to micro-benchmarking the localization accuracy of our technique across different parameter configurations, we show that our technique outperforms comparative techniques, including color based detection, Support-Vector Machine (SVM) based machine learning, and You Only Look Once (YOLO), a state-of-the-art Convolutional Neural Network (CNN) deep learning based object identification tool.


I. INTRODUCTION
The advent of camera-based automation in mobile systems, advances in autonomous robotic systems, and the pervasive use of visual perception as an essential modality in cyber-physical systems have heightened the need for visual identification of objects in a given scene with high accuracy and precision. Fundamentally, this problem has long been studied and addressed along the dimensions of object detection/recognition and localization using computer vision. Advancements in deep learning have improved vision based recognition fidelity. Localization, along with 3D environment mapping, has improved significantly using visual SLAM (Simultaneous Localization and Mapping) [1], [2], where computer vision is used with SLAM to build a map of an unknown environment and to locate the object or robot (self) inside the generated map.
Vision based techniques fundamentally reach a bottleneck when the objects of interest are identical, making it impossible to differentiate objects using visual features alone, and when the environment is dynamic and mobile, which causes problems for matching features across time for reliable visual SLAM. For example, an autonomous driving vehicle mapping the 3D environment struggles to distinguish identically looking buildings and other roadside objects. The constantly changing scenery, due to motion, further complicates the process as the visual features are 'available' only for a short duration (even shorter depending on the speed of the vehicle). To address this issue, we propose that such objects in the scene, particularly those which can lead to such vision bottlenecks, be tagged with a light emitting diode (LED) which constantly transmits a unique ID (mapped to the object of interest in the scene), and a camera is used to localize this LED. The unique ID serves as a differentiator between objects, and the localization problem boils down to precisely identifying the pixels in the camera images that correspond to the LED. To this end, we propose a novel correlation localization technique that is fundamentally motivated by the concept of pilot-symbol correlation used in radio packet reception. The pilot information, in the form of Barker code binary sequences, is transmitted by the LED, detected and demodulated on camera image pixels, and the resulting sequence of digital data is cross-correlated with the known pilot (Barker code) sequence. A high correlation indicates that the particular camera image pixels correspond to the LED.

FIGURE 1: Depiction of different LED localization application scenarios: (left) indoor robot localization application, (right) outdoor 3D mapping, V2V and pedestrian localization using LED and camera.
Correlation Localization. We set up visual identification as a problem of localizing the LED on the camera image, which involves identifying the pixels corresponding to the LED. We assume that the unique ID of each LED in the vicinity is registered in the camera system's database. Note that the purpose of these unique IDs is to differentiate the objects of interest within the scene in the immediate vicinity of the camera. Thus these IDs can be reused, and the number of IDs within a spatial region is finite and will scale linearly with the number of tagged objects of interest. Motivated by the concept of pilot symbols, typically used for synchronization and channel estimation in radio communication systems, the LED actively transmits unique IDs, or pilot symbols, which are detected by the camera across a series of image frames using our proposed spatio-temporal correlation based algorithm. This algorithm takes a window of image frames, registers the scene using a computer vision image alignment technique, and performs a one-dimensional n-block correlation across the image, treating the image matrix of pixel intensities as a linear array of numbers. Here, n is the parameter that represents the number of elements in the array used for correlation. The fundamental idea is that only the pixels corresponding to the LED will follow an intensity variation pattern in accordance with the pilot symbols, while the other background pixels do not change significantly or are mostly static. This way, the pixels corresponding to the LED alone will reveal a high correlation output, which helps isolate the LED pixel region with high accuracy and precision.
Applications. LED localization can be very helpful in a plethora of applications, particularly those relying on location based services and those which use cameras. As depicted in Figure 1 (left), LED localization can significantly assist in autonomous robot navigation and scene mapping. Active transmission using LEDs and decoding using cameras is the fundamental concept of visible light communication (VLC). Hence, localizing an LED in itself fundamentally solves the key issue of transmitter identification and tracking in VLC. The concept of visible light positioning (VLP) has gained much interest in the research community for localizing ground objects based on locating LEDs and identifying them by decoding bits from LED transmissions. However, VLP depends on prior knowledge of the map or blueprint of LED placements and fundamentally tries to solve the dual problem (localizing the camera device with respect to the local space based on detected LED positions using geometrical analysis). Accurate localization of the LED in the camera image will enhance VLP system fidelity. This is applicable even in outdoor scenarios (Figure 1 (right)) such as mapping infrastructure (e.g., buildings), localizing safety critical events such as a pedestrian crossing the road, and tracking a target vehicle (transmitter and/or receiver) for vehicle-to-vehicle (V2V) communication (using VLC and/or radio wireless).
In summary, the key contributions of this paper are as follows:
1) Design and implementation of the correlation localization algorithm for localizing an LED on camera images.
2) Real-world trace based experimental evaluation of the correlation algorithm in different indoor, outdoor, static and motion (car driving) cases.
3) Performance comparison of the optical correlation decoding algorithm against color based thresholding and support-vector machine learning based techniques using localization accuracy metrics.
4) Comparative evaluation based discussion of the advantages and disadvantages of using deep learning techniques for LED localization in camera images.

II. DESIGN MOTIVATION: CHALLENGES IN VISION FEATURE EXTRACTION
Features play a fundamental role in computer vision based algorithms; they are used for object detection, recognition, tracking, matching, classification and many more applications. Visual features in images, also referred to as key points, are essentially visual markers in the regions of interest (e.g., an object) that help characterize the particular image region. Computer vision algorithms for localization and tracking are fundamentally dependent on feature extraction from the scene, and everything that follows is largely determined by the quality of the features. Some of the most prominent features used in computer vision applications include ORB [3], Scale Invariant Feature Transform (SIFT) [4], SURF [5], Histogram of Gradients (HOG) [6], and Harris [7]. Other features that have gained prominence include BRISK [8], MSER [9], EIGEN [10] and KAZE [11]. As a motivating experiment to demonstrate the challenge of LED localization using traditional vision techniques, we conducted a feasibility test extracting all the features listed above. We used the MATLAB [12] computer vision toolbox to run each feature extractor on a sample image of a red (monochrome) LED placed on a chair in a room with some sunlight through the windows and no ambient artificial lighting. We can observe from Figure 2 that most of the feature extractors are not able to find even a single key point on or close to the LED. Those that do detect key points in this LED scene, such as SURF, KAZE and ORB, are very noisy, as they mark multiple areas not representative of the LED as key points. Differentiating such key points is extremely challenging without much additional information, which is not available. We can observe that HOG does detect some key points in a systematic manner; however, the problem of differentiating/cleaning the outliers is very challenging. In more complex environments (backgrounds) the challenge only becomes harder, as the key points will largely be concentrated on other aspects of the background that may have more visual characteristics than the LED. Clearly, the failure of traditional vision based feature extraction is attributed to the lack of knowledge or ability to define features pertinent to the LED, as it bears no clear and unique geometric or photometric characteristic.
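To make this feasibility check easy to reproduce outside MATLAB, the sketch below shows how a similar keypoint count could be obtained with OpenCV in Python; the image filename, the LED bounding box, and the detector set are illustrative assumptions rather than our exact configuration.

# Sketch: count keypoints that several OpenCV detectors place on a known LED region.
# 'led_scene.jpg' and the hand-marked LED box are hypothetical placeholders.
import cv2

img = cv2.imread("led_scene.jpg", cv2.IMREAD_GRAYSCALE)
led_box = (310, 220, 18, 18)  # (x, y, w, h) of the LED region, marked manually

detectors = {
    "ORB": cv2.ORB_create(),
    "SIFT": cv2.SIFT_create(),
    "BRISK": cv2.BRISK_create(),
    "KAZE": cv2.KAZE_create(),
}

def inside(kp, box):
    x, y, w, h = box
    return x <= kp.pt[0] <= x + w and y <= kp.pt[1] <= y + h

for name, det in detectors.items():
    kps = det.detect(img, None)
    hits = sum(inside(kp, led_box) for kp in kps)
    print(f"{name}: {len(kps)} keypoints total, {hits} on the LED region")

Running such a script on LED scenes typically confirms the behavior in Figure 2: few or no keypoints land on the LED itself.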
As an additional measure, we tested SIFT feature extraction and matching on the indoor LED scene, which worked better than the others, albeit still noisy. However, when the same LED was placed in a different setting (outdoor sunlight with a background of trees), the SIFT feature matching algorithm could not identify any credible key point on the LED in the outdoor setting and instead (wrongly) matched the indoor LED with a leaf region on one of the trees. This example is additional evidence of the challenge in using feature extraction based techniques for LED localization.

III. SYSTEM ARCHITECTURE
The proposed system considers that objects or regions of interest in the space are tagged with an LED transmitter that serves as the meta identifier for the object and is representative of where it is located within the scene. The LED is set to actively transmit unique IDs as a sequence of bits using ON-OFF keying (OOK), where bit 1 is mapped to a high intensity level (ON state) and bit 0 is mapped to a low intensity level (OFF state) of the LED. We consider that each LED is set to a unique ID sequence; however, this sequence can be programmatically changed.
At the receiver, a camera perceiving the scene registers the LED signals as long as the LED is within the camera's field of view, typically at narrow (±30-50 deg) or wide (±50-80 deg) angles for traditional cameras. We consider that the camera receiver is operated at a frame (sampling) rate following the Nyquist criterion, i.e., at least 2x the transmission rate. Thus, the LED signals are sampled by the camera such that each transmitted bit has at least 2 image frames with a set of pixels registering a pixel intensity corresponding to that bit's transmit signal intensity. If the sampling is clean, the pixels corresponding to the LED region will register a high pixel intensity when the LED transmits a bit 1, and a low pixel intensity when the LED transmits a bit 0. Each camera image at each instant of time registers a single state of the LED. Hence, for an N-bit ID sequence, we take 2N consecutive frames as input to our correlation localization algorithm to identify the LED's exact location in the camera pixel domain.
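As a small illustration of this sampling model (the 5-bit sequence and variable names below are hypothetical examples, not the registered IDs used in our experiments), the expected per-frame intensity template for an N-bit ID under 2x sampling can be formed as follows:

# Sketch: expand an N-bit ID into the 2N-element ON/OFF template expected across frames.
id_bits = [1, 1, 1, 0, 1]          # example N = 5 bit ID (a Barker-like sequence)
frames_per_bit = 2                 # camera sampled at >= 2x the LED transmission rate

# Each bit is held for two consecutive frames: 1 -> high intensity, 0 -> low intensity.
template = [b for b in id_bits for _ in range(frames_per_bit)]
print(template)   # [1, 1, 1, 1, 1, 1, 0, 0, 1, 1] -> compared against 2N image frames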

A. CORRELATION LOCALIZATION ALGORITHM
We define the localization problem as identifying at least one 3 x 3 pixel block in the camera sampled images that overlaps with the pixels that have registered the LED. The algorithm is set up as a two-phased approach. Phase 1 extracts the data from the images and prepares it for the LED pixel location identification performed using correlation calculations in Phase 2.

1) Phase 1: Data preparation
Image formatting. Each sampled image at the receiver, regardless of the original resolution, is resized to VGA resolution (640 x 480 pixels). This is to minimize the image processing computation time. To ensure the transmissions from the LED do not create disturbing flickering effects, we operate the LEDs at a minimum of 50 Hz, which thus requires the cameras to operate with at least 100 frames-per-second (FPS) sampling. Today's off-the-shelf mobile cameras can reach 100 FPS and beyond, but only at VGA resolution. The images are processed further in grayscale; we use the grayscale version of the sampled color images for post-processing only. The camera capture in our experiments is set to full high-definition (1920 x 1080) resolution in RGB color in uncompressed format.
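A minimal per-frame preprocessing sketch consistent with this description, using OpenCV (the function name is ours for illustration), is:

# Sketch: per-frame preprocessing -- downscale to VGA and convert to grayscale.
import cv2

def preprocess(frame_bgr):
    vga = cv2.resize(frame_bgr, (640, 480))           # standardize processing resolution
    return cv2.cvtColor(vga, cv2.COLOR_BGR2GRAY)      # correlation runs on grayscale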
Registering the Images. When the transmitter and/or receiver is in motion, the images sampled at each instant (separated by 1/FPS seconds) may not be aligned spatially. This means that the actual pixel position(s) of the LED will not be the same across successive image frames. To account for this and to ensure the pixel positions of the LED can be spatially overlapped, we register the images using traditional computer vision based image alignment [2] techniques. The alignment is achieved over an image pair, where one image is the reference and the other is the motion frame. The effective 'movement of the scene in pixels' is estimated and corrected (inverted) using a homography (pixel-to-pixel spatial relationship between image pairs) calculation. In our algorithm, we take a set of 2N (for an N-bit ID) consecutive image frames and conduct the image alignment for each sequential pair; that is, (img1, img2), then (img2, img3), and so on. Each pair of aligned images is then virtually superimposed onto the reference image's pixel domain. If the image alignment were ideal, the LED pixel regions (and other objects in the scene) would precisely overlap. Inefficiencies in practical alignment algorithms can lead to slight mismatches in registration; however, these can be considered insignificant as the primary goal is to overlap as much of the LED pixels across the 2N frames as possible, with allowance for small errors. An example of the registration using the image alignment process for a series of three image pairs is shown in Figure 5.
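For concreteness, a generic pairwise registration sketch using ORB features and a RANSAC homography in OpenCV is given below; this is one standard way to implement the alignment described above and is not necessarily the exact routine used in our pipeline.

# Sketch: register a "moving" frame onto a reference frame with an ORB + RANSAC homography.
import cv2
import numpy as np

def register_pair(ref_gray, mov_gray):
    orb = cv2.ORB_create(1000)
    kp1, des1 = orb.detectAndCompute(ref_gray, None)
    kp2, des2 = orb.detectAndCompute(mov_gray, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des2, des1), key=lambda m: m.distance)[:200]

    src = np.float32([kp2[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # Homography mapping the moving frame into the reference frame's pixel domain.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = ref_gray.shape
    aligned = cv2.warpPerspective(mov_gray, H, (w, h))
    return aligned, H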

2) Phase 2: LED pixel localization using correlation
Correlation. The raw pixel intensity (P) at each pixel coordinate (x, y) across each set of 2N images is collected as a single 2N-element row vector. We prepare another 2N-element row vector containing the N ID sequence bits (I), with each ID bit repeated in two consecutive positions (matching the two frames per transmitted bit). These two row vectors are correlated, and the resulting correlation value is recorded as the correlation pixel intensity at row x and column y of a correlation image matrix. We use the following definition of cross-correlation between the image pixel intensities and the bit sequence values:

corr(\vec{P}, \vec{I})[k] = \sum_{n} \vec{P}[n] \, \vec{I}[n+k],

where \vec{P} (the pixel intensities) and \vec{I} (the ID bits) are the input vectors and corr(\vec{P}, \vec{I})[k] is the k-th element of their cross-correlation.
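A vectorized NumPy sketch of this per-pixel correlation is shown below. It evaluates the zero-lag inner product as one simple instantiation of the definition above; the array names are illustrative assumptions.

# Sketch: per-pixel correlation of a stack of 2N registered frames against the ID template.
# 'frames' is assumed to be a (2N, H, W) float array of registered grayscale intensities,
# and 'template' the 2N-element ON/OFF sequence (each ID bit repeated twice).
import numpy as np

def correlation_image(frames, template):
    t = np.asarray(template, dtype=float)                            # shape (2N,)
    # Zero-lag inner product per pixel: corr(P, I)[k] evaluated at k = 0.
    corr = np.tensordot(t, frames.astype(float), axes=([0], [0]))    # shape (H, W)
    return corr

# Usage: pixels with the largest correlation are LED candidates. Because pixel
# intensities are non-negative, bright but static regions can also score high;
# these are removed by the filtering step described next.
# corr_img = correlation_image(frames, template)
# y, x = np.unravel_index(np.argmax(corr_img), corr_img.shape)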
Considering that pixel intensities are non-negative values, bright but largely static regions can also produce high correlation outputs, which motivates the filtering described next.
Localization after filtering. Ideally, only the pixels corresponding to the LED in the image will yield higher correlation values compared to other regions. However, in reality, imperfections in image sampling, image artifacts (e.g., blur) and the possibility of other objects in the scene that look similar to the LED can result in multiple pixel regions having high correlation values that lie too close together to separate with a single general detection threshold. To address this issue, we first run a correlation and flag the pixels whose high correlation values are within 10% of each other. We set all the other pixels to 0. From this coarse-filtered set, we further flag all the pixels that show the least variation in intensity across the 2N images. We identify these by setting a 25% gradient threshold on pixel intensity changes across the high-to-low transitions and vice-versa. We flag the pixels with less than this threshold of variation and set their values to 0, keeping the raw pixel intensity values intact for the others. Then we run the correlation calculation on the modified vectors and choose the pixel(s) with the maximum correlation value (within 1% difference) as the LED pixels.
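The following simplified sketch illustrates the spirit of this coarse-to-fine filtering using the 10%, 25% and 1% thresholds mentioned above; it approximates, rather than reproduces verbatim, the procedure in our implementation.

# Sketch: coarse-to-fine filtering of correlation candidates (an approximation of the
# procedure described in the text, with simplified threshold interpretations).
import numpy as np

def filter_candidates(corr_img, frames):
    peak = corr_img.max()
    # Keep pixels whose correlation is within 10% of the peak; treat the rest as 0.
    coarse = corr_img >= 0.9 * peak

    # Among the coarse set, discard pixels whose intensity swing across the 2N frames
    # falls below a 25% threshold (too static to be a blinking LED).
    swing = frames.max(axis=0) - frames.min(axis=0)
    dynamic = swing >= 0.25 * swing[coarse].max()

    mask = coarse & dynamic
    # Final decision: surviving pixel(s) with the maximum correlation (within 1%).
    best = (corr_img >= 0.99 * corr_img[mask].max()) & mask
    return best   # boolean (H, W) map of LED pixel candidates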
Unwarping. The registration process essentially warps the set of images to a common pixel-domain spatial reference. The LED pixel localization achieved in the previous step should therefore be read as the LED pixel location on the reference image. The actual LED pixel location on the other images in the set is computed by remapping the pixel coordinates across the registered images through an unwarping process. In this way, through a one-shot correlation process, the LED pixel can be spatially and temporally tracked continuously on each sampled image frame, without any additional computer vision feature extraction.
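A small sketch of this unwarping step, assuming the homographies H_k estimated during registration map each original frame into the reference frame, is:

# Sketch: map the LED pixel found on the reference image back onto an original frame
# by inverting the homography estimated during registration.
import cv2
import numpy as np

def unwarp_point(pt_xy, H):
    # H maps the original frame into the reference frame; invert it to go back.
    p = np.array([[pt_xy]], dtype=np.float32)            # shape (1, 1, 2)
    return cv2.perspectiveTransform(p, np.linalg.inv(H))[0, 0]

# led_on_frame_k = unwarp_point((x_ref, y_ref), H_k)     # per-frame LED location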

B. ASSUMPTIONS AND POTENTIAL SOLUTIONS
The fundamental assumption in our system is that the camera receiver has knowledge of the dataset of transmission IDs (bit sequences). We justify this assumption by the fact that such knowledge can be generated using multiple techniques depending on the application scenario: (a) the transmitter and receiver can agree a priori on the set of IDs (example use-case: robot navigation and mapping in finite spaces with a small number of LEDs); (b) the LED can transmit, using the VLC channel, a data packet appended to the bit sequence, with the sequence serving for coarse spatial detection of the LED region and the data packet containing the unique ID; the camera receiver can acknowledge reception of the unique ID using a feedback radio channel (example use-case: localizing, in a conference setting, a large number of mobile devices fitted with LEDs); (c) the transmitter and receiver can both be connected to a common cloud server (wired to infrastructure or cellular) and be commonly informed of the unique IDs allotted to each LED at a specific location in a specific time-slot (example use-case: LEDs attached to buildings or road infrastructure and cameras on vehicles used for scene perception).

IV. EVALUATION
We evaluate the performance of the optical correlation based localization method through an experimental trace-based analysis. We set up an LED and camera in indoor (home) and outdoor settings, conducted experiments by varying different parameters in each experimentation trial, and collected data traces. Each data trace or sample is a camera image frame of video footage recorded at a specific resolution and video capture frame-rate. In our evaluations, we consider a single LED and a single camera setup, where we used a solid-state 1 Watt LED indoors and a 10 Watt brake/trail light LED outdoors, both modulated at 60 Hz. We used a GoPro Hero 6 as the camera, set at 120 frames-per-second. Each trace of our experiments was 1 min of footage. Overall, our dataset for LED localization evaluation contains about 15000 non-repetitive images (the LED location on each frame differs from the others by at least 3-5 pixels). We evaluate our system across four different real-world LED-camera settings. We evaluate the performance of our localization method using the average localization accuracy as the metric, defined as the ratio of the total number of camera image frames with successful localization to the total number of image frames in the data trace, averaged across multiple experimentation trials. We define a successful localization as when the localization algorithm detects at least one (non-overlapping) 3 x 3 pixel region that intersects the LED region-of-interest (ROI). An LED ROI is the rectangular pixel region that completely houses the LED in the particular camera image.
This heuristic choice of a 3 x 3 pixel block corresponds to a strict threshold for the localization accuracy evaluation. It is common practice in computer vision analysis to require any detection ROI to be larger than a single pixel. This creates a tradeoff: a large ROI leads to more outliers, while a strict ROI can lead to low detection accuracy. We chose to use a strict threshold of 3 x 3 pixels in our evaluation, at a processing resolution of 640 x 480 pixels. Recall from the earlier section that, regardless of the camera capture resolution, we convert all image frames to 640 x 480 to standardize the processing method as well as optimize for real-time performance. The ROI is the set of pixels over a rectangular region in the image such that the pixels in the ROI encompass the LED. Pixels that correspond to partial registration of the LED due to the curvature of the LED shape are not considered part of the ROI.
The ROI changes with the distance between the camera and the LED; at shorter distances the ROI is larger, providing a larger number of 3 x 3 pixel regions intersecting the ROI, and this number reduces significantly as the distance increases. For example, the ROIs of the brake/trail light LED at 5, 10, 15 and 20 m on the GoPro camera at VGA resolution are listed in Table 1. We observed that even at 20 m range, there are at least three non-overlapping 3 x 3 LED regions that can be marked for localization of the LED.
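For clarity, a minimal sketch of this success criterion (a detected 3 x 3 block counting as a hit only if it intersects the labeled ROI) is given below; the rectangle representation is an assumption for illustration.

# Sketch: a frame counts as a successful localization if any detected 3x3 block
# intersects the hand-labeled LED ROI rectangle (x, y, w, h).
def block_intersects_roi(block_xy, roi):
    bx, by = block_xy            # top-left corner of a 3x3 detected block
    rx, ry, rw, rh = roi
    return not (bx + 3 <= rx or rx + rw <= bx or by + 3 <= ry or ry + rh <= by)

def frame_success(detected_blocks, roi):
    return any(block_intersects_roi(b, roi) for b in detected_blocks)

# accuracy = (frames with frame_success == True) / (total frames in the trace)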
In summary, we include the following evaluation results:
1) Comparative evaluation of the localization accuracy of our optical correlation method indoors and outdoors, under static and motion cases. We compare with (i) LED detection using color based thresholding, (ii) a computer vision based technique that uses aggregate channel features (ACF) and support vector machine (SVM) machine learning, and (iii) a customized version of the Convolutional Neural Network (CNN) based YOLO v3 deep learning object recognition model.
2) Micro-benchmark evaluation of our optical correlation method across variable (i) distance between LED and camera, and (ii) number of images used for correlation.
TABLE 3: Localization average precision (P), recall (R) and F1-score based comparative evaluation of optical correlation localization against color thresholding, the ACF-ML detector and the YOLO v3 deep learning classifier. True Positive (TP) is when an LED location is accurately localized for a given frame. False Positive (FP) is when the LED is not present in the scene but the system provides an erroneous LED localization output. True Negative (TN) is when the system reports no LED and no LED is actually present. False Negative (FN) is when the system reports LED localized pixels when no LED is actually present. To serve as NEGATIVE data, we captured images in the different experiment settings used for our evaluation, without the LED transmitter.
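For reference, the precision, recall and F1-score values in Table 3 follow the standard definitions computed from these frame-level counts; a minimal sketch is:

# Sketch: precision, recall and F1 from frame-level TP/FP/FN counts as defined above.
def prf1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1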

A. COMPARATIVE EVALUATION
We compare the localization accuracy of our optical correlation localization method with traditional techniques. In particular, we consider color thresholding as a basic technique typically used in detection processes using computer vision. Next, we consider a more advanced feature based LED detection technique, the ACF detector, which marks a set of structural features on the object; the features are then learned using an SVM machine learning model. Finally, we compare with state-of-the-art deep learning classification techniques, in particular YOLO v3, which essentially functions as a single-shot classifier.

1) Baseline for comparison
In each of the comparative methods, we use the traditional implementations and make slight modifications to fit our experimentation, so as to set a common baseline for evaluation.
Color thresholding. Considering that the color of the LED lies predominantly in the red space, we set a threshold on the average intensity of a pixel for it to be detected as an LED. We calibrate the threshold for each experiment trace by selecting the average intensity of the HIGH (LED ON) and LOW (LED OFF) pixels across the images in each 1 min trace.
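A simplified sketch of this baseline is shown below; the fixed threshold and the red-dominance ratios are illustrative placeholders for the per-trace calibrated values described above.

# Sketch: color-thresholding baseline, flagging "red enough" pixels as LED candidates.
import cv2
import numpy as np

def red_threshold_detect(img_bgr, threshold=150):
    r = img_bgr[:, :, 2].astype(float)            # red channel (OpenCV stores BGR)
    g = img_bgr[:, :, 1].astype(float)
    b = img_bgr[:, :, 0].astype(float)
    # Flag pixels that are both bright in red and dominated by red.
    mask = (r > threshold) & (r > 1.2 * g) & (r > 1.2 * b)
    return np.argwhere(mask)                      # (row, col) candidates for the LED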
Machine learning with Aggregate Channel Features (ACF) [13]. This method is a supervised machine learning approach. The ACF detector uses an effective sliding-window detector to extract the variations in the structural features in the scene. During data labeling, we specified a bounding box for the LED region in each image. The output of the ACF detector is the estimated region of LED detection pixels. The intersection over union (IoU) for the region is set to 0.5 (50%).
Deep learning with YOLO v3 [14]. The training images were annotated using the labeling tool in [15], and the labels were exported in the desired YOLO format. To train the model on the custom dataset, a transfer learning approach was adopted. YOLO v3 uses a variant of Darknet, which originally is a 53-layer network trained on ImageNet. For the detection task, 53 more layers are stacked onto it, along with residual skip connections and upsampling layers, forming a 106-layer fully convolutional underlying architecture for YOLO v3. The pretrained convolutional weights of darknet53-conv74 were used to initialize the first 74 layers of the custom YOLO model, while the remaining layers were trained from scratch on the dataset we collected. We considered three types of evaluation for the YOLO v3 model. First, we trained entirely on the indoor images and tested on the same. Next, we trained on the outdoor images and tested on the same.
Last, we trained on the entire dataset and tested on the entire dataset. We used a 60:40 distribution for training:test sets, and randomized the test set for a total of 5 trials. We computed the average localization accuracy across this evaluation.

2) Results
We summarize the performance of our approach compared with the baseline techniques using the average localization accuracy metric in Table 2. We observe that our optical correlation technique outperforms the comparative techniques in general. We make the following specific observations from the evaluation results:
• We observe that the localization accuracy of our approach is relatively lower in motion cases. Upon analysis, we learned that the localization errors in motion cases are primarily due to errors in the image registration process, which may not necessarily be 100% accurate. However, even with a simple off-the-shelf image registration technique used in computer vision, our algorithm outperforms the comparative techniques.
• The comparative techniques perform poorly in locating the LED in the scenes, especially for those frames where the LED is in the 'OFF' state. Extracting LED locations in 'OFF' frames is challenging, as an LED in general is not a feature-rich object. The LED OFF state further adds to the challenge, as the intensity of the pixel region is very low, making geometric and photometric feature dependent analysis, such as color thresholding and ACF, very challenging. The lack of features also fails to effectively train the YOLO v3 deep learning model for LED OFF states.
• The YOLO v3 deep learning model performs best when it is trained and tested across the entire dataset. When trained and tested on a specific setting, such as only indoor or only outdoor, the model performs poorly. This is attributed to the lack of variation in features across the dataset, which limits the efficiency of the learning process. We observe that no clear insight can be gained about the learning process of YOLO v3 for LED detection, as the accuracy numbers do not follow any clear trend. In this work, we set up a baseline deep learning LED recognition, which shows some potential but is not better than optical correlation. We posit that these evaluation results reveal the need to further explore machine/deep learning models for LED localization. We provide examples of success and failure cases of YOLO v3 LED localization performance in Appendix A.
• We also present the average precision, average recall, and average F1-score values for our evaluation in Table 3. The fidelity of the optical correlation method is reflected in its high average recall values and F1-scores.

B. MICRO-BENCHMARK EVALUATION
1) Distance between transmitter and receiver
From Figure 8, it can be observed that for both indoor and outdoor static cases, the average LED localization accuracy is nearly perfect when the LED-camera distance is up to 10 m. However, at 15 m distance in the outdoor experiment, the average accuracy is about 94%. In the outdoor setup, when the LED is placed in a spot where the sun/ambient light shines brightly on it (right image of Figure 9), accuracy degrades due to the presence of saturated regions in the image that do not correspond to the LED. The intensity changes in the ON and OFF patterns are impacted, leading to detection errors, and such saturated regions might have higher correlation values under optical correlation, leading to LED localization outliers. We consider both bright-spot and shaded-spot outdoor setups (shown in Figure 9) with varying distance in our analysis, and present the accuracy results in Figure 8. We report that with 20 input frames, when the LED is placed at the bright spot at 15 m distance, the average accuracy is about 88.5%, while with the same specifications at the shaded spot the LED can be almost perfectly localized. These results clearly illustrate the impact of LED-camera distance and sunlight reflections on optical correlation localization accuracy.

2) Number of input frames for correlation
For all the outdoor static setups, we also test our system by varying the number of input image frames (from 10 to 100 images) in each execution of the correlation and report the results in Figure 10. We consider both the shaded and bright spot LED cases while changing the correlation input frames in our analysis and show both results in the left and middle illustrations of Figure 10. We notice that with an increase in the number of input frames for correlation, the optical correlation method reaches near-perfection, even when the LED is kept in extremely bright spot scenarios. Having more images during correlation helps generate a robust correlation value that can be easily delineated from outliers, as there are more bits (values) being multiplied in the cross-correlation process. Also, with a larger number of images, the chance that the scene can precisely mimic the variations in the ON/OFF (1/0) intensities becomes lower. In particular, we observe that with 10 frames, the accuracy is sub-par, especially at distances beyond 10 m in static cases. However, just by increasing the input frames to 20, the accuracy can be significantly improved. In contrast to the characteristics and results of the static experiments, in motion driving cases the accuracy is higher when the number of input image frames in the correlation is smaller. We report this behavior for all four driving patterns in Figure 10 (right). Under motion, a smaller number of frames considered for alignment is better, as the amount of actual physical motion in the scene may then be low (almost insignificant). For example, 10 frames at 120 FPS span about 83 ms. The amount of motion that can happen within such a duration is typically low, except when the vehicle is driven at highway speeds. We observed from our analysis that the drop in accuracy with increasing frames is primarily due to registration errors, which are in turn a function of vehicle speed.

3) Car driving speed variation in localization accuracy
To evaluate our system's LED localization performance in outdoor motion cases, we extend the experimentation by varying the car driving speed from 5 mph to 30 mph towards the LED emitter and include the results in Figure 12. With the LED transmitter placed statically on a tripod stand, we drive the car towards the LED with the camera attached to the wing (side-view) mirror of the car, as shown in Figure 11. We observe that the average LED localization accuracy is about 98% while driving the car at 5 mph and about 87% when the car speed increases to 30 mph. As mentioned earlier, the localization accuracy can be lower in motion cases due to the dependency on image registration performance. With higher driving speed, the movements in pixels are also greater compared to the static or slow driving cases, so misalignment still exists even after registering the motion frames. As shown in Figure 13, the misalignment in registered frames also increases when the car drives faster (30 mph) compared to a slower speed (5 mph). We observe that such misalignments are fairly small and are within the range that can be handled by state-of-the-art camera motion stabilization, such as inverting the motion artifacts using motion vectors generated by inertial measurement units (IMU) or using computer vision optical flow methods. We plan to incorporate such techniques in our future work.

4) Timing analysis of correlation algorithm
We present the execution time of each of the steps in our algorithm in Table 4. In static cases, the algorithm does not need to perform image registration and hence runs faster than in the motion cases. We report that our algorithm takes on average 0.29 seconds to locate the LED for each input set of motion frames and processes each static input within an average of 0.22 seconds. We also compared the average per-image LED localization processing time of our algorithm with the other techniques used in our baseline localization performance comparison. We report the execution time of each of the implemented algorithms in Table 5 and note that our correlation algorithm takes less processing time than the simple color thresholding and machine learning based techniques. However, while our technique is slower than YOLO v3 based LED detection, we recall that YOLO v3 has much lower localization accuracy. This creates a tradeoff between computation and accuracy, and we hypothesize that future work could use a hybrid method that integrates YOLO v3 with correlation to achieve the best of both worlds.

V. RELATED WORKS
In this section, we survey related works on object detection and localization.
Feature extraction based Computer Vision. Conventional computer vision techniques that use feature descriptors such as SIFT [4], HOG [6], SURF [5], and Haar [16] to detect and localize objects in scenes [17] are commonplace. However, feature extraction based architectures are not robust enough to identify objects accurately from scenes, owing to constant changes in image backgrounds, illumination conditions and the appearance of the objects. LEDs in particular are featureless objects, making feature definitions for LEDs in real-world settings very challenging.
Visible light positioning (VLP). LED beacons can enable precise object localization through Visible Light Positioning (VLP) [18]. Prior work has explored VLP across different applications such as indoor localization, wearable devices, and target tracking [19]-[22]. In VLP, the transmitter LED needs to send its location information to the corresponding receiver (photodiodes or imaging sensors), which estimates the localization parameters including the distance and direction of the light signals. However, this dependency on knowing position-related parameters beforehand makes VLP systems challenging, especially in scenarios where the object's location and environment are unknown.
Learning based tracking and Re-identification. In Intelligent Transportation Systems (ITS), identifying, locating, and tracking the same or similar types of vehicles is still challenging for computer vision applications [23]. Recently, deep convolutional neural network based approaches have been extensively used to solve the vehicle re-identification problem, in works such as the PROVID framework [24], the DRDL model [25], CityFlow [26], and VeRi-Wild [27]. For example, in DEx [28], a CNN based dual embedding expansion technique was implemented to create unique representations from each of the images. However, all these techniques require large and diverse datasets of the object in question, which can be a bottleneck.
Multi-sensor fusion based object detection. Fusing information or data [29] from different sensors to detect and locate objects has been one of the common research trends in the community over the last few years. Sensor data from different 3D detectors such as cameras (both monocular and stereo) [30]-[33], LiDAR [34], [35], and Radar [36], [37] have been fused in several experiments to tackle the object detection problem.
In [38]-[42], the authors propose different fusion techniques, either cascading the camera and LiDAR information or fusing the region of interest (ROI) features from the sensor information. In ContFuse [40], the system uses a convolutional neural network based deep learning technique [43] to fuse the camera and LiDAR sensor data ROI-wise. To achieve full multi-sensor fusion, both point-wise and ROI-wise feature fusion have been implemented in [44]. However, fusing multi-sensor information is not an easy task, as there are challenges in every step of data association and modality alignment, requiring a rigorous processing framework and resulting in higher computational complexity.

VI. CONCLUSION
We designed a novel optical correlation based localization technique to precisely and accurately locate LED emitters in camera images. We designed and implemented the optical correlation algorithm and evaluated it using real-world experiment traces. Upon evaluation in indoor, outdoor, static and motion cases, and comparison with traditional ML and non-ML techniques for LED detection, we showed that optical correlation outperforms the comparative techniques. We showed that traditional feature based techniques fail due to the lack of features in LED image regions. We learned from the evaluation that our optical correlation technique's localization accuracy presents a trade-off between static and driving cases in the choice of the number of input correlation frames. Our evaluation also revealed that state-of-the-art classification using YOLO v3 deep learning does not necessarily solve the problem, as the training process does not reveal any evidence that the model is able to learn unique characteristics of the LEDs. We posit that further exploration of optical correlation assisted deep learning models may be useful for improving optical camera reception fidelity, particularly in visible light and camera communication applications. We note that scalability is a problem when it comes to creating unique blinking sequences for each LED in the field-of-view of the camera. We note to the reader that this can be resolved by using a finite set of sequences and reusing the sequences, but at different frequencies and different dynamic ranges (difference between ON and OFF intensities). The scalability question generates an interesting problem of recognizing the LED after it has been detected. We propose that uniqueness in ID (sequence), frequency and intensity can be used as parameters, which can, overall, scale the number of options considering the number of permutations possible. Further, it is possible to use the contextual relevance of the LEDs (what they are attached to and what objects/entities are detected and recognized in their vicinity), relying on state-of-the-art computer vision object detection. We believe our current results present a foundation for future work that can incorporate such techniques and their variations for addressing scalability.

APPENDIX A FAILURE CASES OF LED DETECTION WITH YOLO V3
The YOLO v3 model's success examples are in Figure 14.
The model fails (Figures 15 and 16) on inter-dataset testing. This signifies that performance is highly dependent on shape, lighting condition and data distribution, which raises questions about the feasibility of deep learning for LED detection tasks. The failure scenarios primarily occur in cases with varying lighting conditions (Figure 17).
There are a significant number of failures when the LED is in the OFF state, since the OFF state of the LED is transparent, which reduces the features available for the model to learn.