Compression Ratio Learning and Semantic Communications for Video Imaging

It is crucial to improve data acquisition and transmission efficiency for mobile robots with limited power, memory, and bandwidth. For efficient data acquisition, we design a novel video compressed-sensing system with spatially-variant compression ratios, which offers high imaging quality at low sampling rates. To improve data transmission efficiency, we leverage semantic communication to reduce the bandwidth requirement, providing high image recovery quality at low transmission rates. In particular, we focus on the trade-off between rate and quality, and use neural networks to decide the optimal rate allocation policy for a given quality requirement. Because the rate term is non-differentiable, we train the networks with policy-gradient-based reinforcement learning. Numerical results show the superiority of the proposed methods over existing baselines.


I. INTRODUCTION
Cameras have become ubiquitous in robotic systems, enabling robots to evaluate the environment, infer their status, and make intelligent decisions. However, modern cameras still suffer from the age-old problem of limited dynamic range, caused by the full-well capacity, dark-current noise, and read noise [1], and from motion blur, resulting either from object motion during the image exposure or from camera motion due to robot movement.
To overcome some of these challenges, computational imaging techniques have been widely used to generate high-dynamic-range (HDR) images [2], [3], [4] or high-speed videos [5], [6]. These methods use spatially-varying pixel exposure or pixel-wise coded exposure, together with optimization techniques, to improve imaging performance. Among them, snapshot compressive imaging (SCI) [7], [8], [9] captures a set of consecutive video frames in a single exposure and is popular for realizing sparse measurements with low requirements on memory, bandwidth, and power [10].
Due to hardware limitations of conventional cameras, existing SCI systems capture an image by exposing the photosensitive elements for a fixed exposure time, leading to a fixed temporal compression ratio for all pixels. This fixed compression ratio severely limits their ability to record natural scenes in a measurement-efficient way. With a small compression ratio, more measurements are sampled, increasing the power, memory, and related resource consumption. With a large compression ratio, fewer samples are needed, but the reconstructed video frames have poor quality. If pixels within SCI systems could be sampled under different compression ratios, the video compressive sensing system could maintain high-quality reconstruction while achieving efficient measurement. However, such a system not only has special hardware requirements, but also requires an optimal pixel-wise compression-ratio assignment policy, which is non-trivial since both shot/read noise and object/camera motion affect the choice of compression ratios.
Fortunately, programmable sensors or focal-plane sensor-processors [11], [12], [13] can vary compression ratios spatially through pixel-level control of exposure time and read-out operations. Recent works in deep optics, on the other hand, demonstrate the superiority of jointly learning optical parameters and image processing methods over traditional computational imaging techniques with heuristic optical designs, for applications in HDR imaging [14], video compressive sensing [15], [16]^1, and motion deblurring [17]. Inspired by these pioneering works, we focus on minimizing the number of measurements for various applications and develop a novel video compressive sensing system with pixel-wise compression ratios, where the ratio allocation policy and the video reconstruction algorithms are optimized through a combination of supervised learning and policy-gradient reinforcement learning (RL) [18], [19].
In addition to sensor development, we also study the optimal transmission method, in terms of source and channel coding, for the data generated by programmable sensors, since the raw sensor data collected at the robots need to be sent to the admins for human-robot interaction. For sensing systems with image-type data [17], [15], [20], off-the-shelf codecs (e.g., BPG [21] or JPEG [22]) as well as deep image compression methods [23] can serve as the source coding, and the fifth-generation (5G) standardized LDPC codes [24] or polar codes [25] can be chosen for channel coding. However, it is sub-optimal to design the source and channel coding for sensor data independently from the deep optics and video reconstruction algorithms, for two reasons. First, under the influence of deep optics, the distribution of sensor data will differ from that of regular images. Second, different pixels in the sensor data contribute differently to the subsequent video reconstruction process and should be compressed differently.

^1 In these deep optic-based works on video compressive sensing, the coded aperture and the exposure time for generating a snapshot image are optimized, but the compression ratio, i.e., the number of measurements over a time window, is fixed. Different from these works, we study the shutter speed and read-out frequencies to adjust the per-pixel compression ratio. The most significant difference is that, in the sensor data captured by our system, some pixel locations have more measurements than others.
To address these challenges, semantic communications [26], [27], [28], [29] are promising solutions. The deep optics and video reconstruction networks define special message generation and interpretation processes between transceivers, which are called semantic encoders and decoders in semantic communications, respectively. The communication systems should be optimized to ensure that the interpretations of the messages through semantic decoders are correct, rather than to guarantee the delivery of the raw sensor data itself. To realize semantic communications, one potential approach is to introduce task-aware compression [30] for programmable sensors and design the compression methods under the guidance of the video reconstruction process. In this case, the channel coding is still designed separately, whereas earlier studies on joint source and channel coding (JSCC) [31] have demonstrated the benefits of co-designing the source and channel coding processes. In this work, a semantic communication framework that achieves the joint optimization of deep optics, data compression, channel coding, and reconstruction is designed, which significantly improves the transmission efficiency of sensor data.
Specifically, our contributions can be summarized as follows:
• We introduce an RL-based method for adjusting the parameters in programmable sensors, which differs from existing deep optic methods based on differentiable models or functions [17].
• We build a novel video compressive sensing system with spatially-variant compression ratios, where the ratio allocation policy is learned through an explicit rate-distortion function.
• We introduce an RL-based method for the explicit trade-off between transmission rates and task accuracy in semantic communications, where the rate allocation policy is trained jointly with the coding modules.
• We propose a semantic communication system for programmable sensors, realizing the co-design of deep optics, data compression, channel coding, and reconstruction algorithms.

II. VIDEO COMPRESSED SENSING WITH SPATIALLY-VARIANT RATIOS
As shown in Fig. 1, the principle of the proposed video compressed sensing system is to learn spatially-variant compression ratios for better imaging quality with fewer measurements. This idea can be further combined with existing deep optic-based SCI systems with pixel-wise coded aperture and shuttering functions [20]. In SCI systems with a fixed compression ratio, a fixed number of measurements is generated for all spatial locations, while in the proposed system with spatially-variant ratios, some locations have more measurements than others. For example, given four video frames, two snapshots (m1, m2) will be captured in current SCI systems with a 1/2 ratio. By contrast, the proposed system will generate one snapshot with dense measurements (m1) and three other snapshots with sparse measurements (m2, m3, m4). Specifically, spatial locations with a 1/4 ratio will only have measurements on m1 after one readout operation, while locations with a ratio of 1 will have values on all four snapshots after four readout operations.

Fig. 1: The illustration of spatially-variant compression ratios in video compressed sensing.
To improve the sensing efficiency, the ratio map should be jointly designed with the deep optics and video reconstruction algorithms. In this section, we introduce the details of the proposed system, including the forward model and the training losses.

A. System overview
We show the overall pipeline of the proposed sensing system in Fig. 2. Denote H and W as the height and width of the video frames. We first generate a compression ratio map, M ∈ R^{H×W×1}, from a small trainable matrix using a ratio generation network with one convolution layer, three residual blocks, and one transposed convolution layer. The spatial size of the trainable matrix is set to 1/8 that of the ratio map. We then simulate the capture of a scene, S ∈ R^{H×W×T}, with a programmable sensor using M and, optionally, an extra random binary mask, B ∈ R^{H×W×T}, which is referred to as a coded aperture in earlier works [20], [10]. The captured sensor data, I, stacked with M, is then fed into a video reconstruction network to produce a reconstructed video, V̂ ∈ R^{H×W×T}. Two training losses (L_1, L_2) are designed to update the learnable matrix, the ratio generation network, and the video reconstruction network, improving the reconstruction quality while at the same time reducing the average compression ratio.

B. Compression ratio generation
Given T video frames, five kinds of compression ratios are considered during the ratio map design, i.e., 0, 1/T, 2/T, 4/T, and 8/T. Therefore, each pixel in M takes one of these five discrete values. To generate the ratio map, the feature maps produced by the ratio generation network are designed to have five channels. After applying a Softmax function along the channel dimension of the feature maps, each channel indicates the probability of taking the corresponding ratio. Denote the feature maps representing the discrete probability distributions as P ∈ R^{H×W×5}. The final ratio map is obtained through a sampling operation according to P.
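As a concrete sketch of this sampling step, the following minimal NumPy illustration draws each pixel's ratio from the categorical distribution obtained by a channel-wise Softmax. This is not the paper's implementation: the network producing the logits is omitted, and the function name is our own.

```python
import numpy as np

def sample_ratio_map(logits, ratios=(0, 1/16, 2/16, 4/16, 8/16), rng=None):
    """Sample a per-pixel compression-ratio map from network logits.

    logits: (H, W, 5) array; the five channels correspond to the five
    candidate ratios. A Softmax over the channel dimension gives the
    categorical distribution P, from which one ratio is drawn per pixel.
    """
    rng = np.random.default_rng(rng)
    # Softmax over the channel (ratio) dimension.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    P = e / e.sum(axis=-1, keepdims=True)
    H, W, K = P.shape
    # Per-pixel categorical sampling via the inverse-CDF trick.
    u = rng.random((H, W, 1))
    idx = (P.cumsum(axis=-1) < u).sum(axis=-1)
    return np.asarray(ratios)[idx], P

logits = np.zeros((4, 4, 5))          # uniform distribution over the ratios
M, P = sample_ratio_map(logits, rng=0)
assert M.shape == (4, 4) and np.allclose(P.sum(axis=-1), 1.0)
```

During training, the sampled ratio map is what the non-differentiable rate term operates on, which is why the policy-gradient training of Sec. II-E is needed.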

C. Sensor forward model with varying ratios
The learned ratio map guides the behavior of the programmable sensor. The exposure time and readout operations under different ratios are summarized in Fig. 3. We consider that each measurement is read out after a continuous exposure ends and before a new round of exposure begins. Also, the generated "signal" electrons are cleared after each read-out operation. For example, when the ratio is 2/T, each exposure lasts T/2 frame times, generating 2 measurements in total. In each exposure, the signal integrates from 0. For a specific ratio, r, the equivalent measurement matrix, A_r, is of size rT × T, with each row representing one measurement base. For the i-th (i = 1, 2, ..., rT) row, [A_r]_{i:} is a binary vector whose columns (i − 1)/r + 1 through i/r take the value 1 while all other columns are 0. For example, when r = 2/T,

A_r = [1 ... 1 0 ... 0; 0 ... 0 1 ... 1],

where each run of ones has length T/2. The first row indicates that the first measurement is the signal integrated from frame time 1 to T/2, while the second row indicates that the second measurement is integrated from time T/2 + 1 to T. With the measurement matrices, the sensing process at spatial location p with ratio r_p can be modelled as

I_p = U(A_{r_p} S_p),   (1)

where S_p ∈ R^{T×1} is the signal at spatial location p, I_p ∈ R^{r_p T×1} is the captured sensor data at location p, and U(·) represents the camera exposure function capturing the noise effects of the camera [17].
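The construction of A_r and the noise-free part of the sensing model can be sketched as follows, a NumPy illustration under the definitions above (the helper name `measurement_matrix` is our own):

```python
import numpy as np

def measurement_matrix(r, T):
    """Build the rT x T binary measurement matrix A_r described in Sec. II-C.

    Row i (1-indexed) integrates frames (i-1)/r + 1 ... i/r, i.e. each of
    the rT measurements sums a contiguous block of T/(rT) frames.
    """
    m = int(round(r * T))            # number of measurements, rT
    block = T // m                   # frames integrated per measurement
    A = np.zeros((m, T))
    for i in range(m):
        A[i, i * block:(i + 1) * block] = 1.0
    return A

T = 16
A = measurement_matrix(2 / T, T)     # ratio 2/T -> two measurements
assert A.shape == (2, T)
assert np.array_equal(A[0], np.r_[np.ones(8), np.zeros(8)])
# Noise-free sensing at one pixel: A_r S_p integrates the signal per block.
S_p = np.ones(T)
I_p = A @ S_p
assert np.allclose(I_p, [8.0, 8.0])
```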
If the random binary masks, B ∈ {0,1}^{H×W×T}, are used to modulate the signal [20], the sensing process can be modelled as

I_p = U(A_{r_p} diag(B_p) S_p),   (2)

where B_p ∈ {0,1}^{T×1} is the mask at spatial location p.
Following [17], we consider two kinds of noise, i.e., shot noise n_s ~ N(0, σ_s) and read noise n_r ~ N(0, σ_r), in the camera exposure function. The level of shot noise is proportional to the strength of the signal electrons: for a signal e in the [0, 1] range, σ_s = √e σ_ss, where σ_ss is independent of the signal. By contrast, the level of read noise is fixed for a given camera and independent of the photon flux. We use the same settings for σ_ss and σ_r as [17], that is, σ_ss = 4.95 × 10^{-3} and σ_r = 7.25 × 10^{-3}. With these definitions, U(e) = e + n_s + n_r.
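A minimal simulation of this exposure function, using the σ_ss and σ_r values quoted above (the function name and vectorized form are our own):

```python
import numpy as np

# Camera exposure function U(e) = e + n_s + n_r from Sec. II-C, with
# signal-dependent shot noise and fixed read noise (sigma values from [17]).
SIGMA_SS, SIGMA_R = 4.95e-3, 7.25e-3

def exposure(e, rng=None):
    """Apply shot and read noise to a clean measurement e in [0, 1]."""
    rng = np.random.default_rng(rng)
    sigma_s = np.sqrt(e) * SIGMA_SS                    # shot noise grows with signal
    n_s = rng.normal(0.0, 1.0, np.shape(e)) * sigma_s  # per-element shot noise
    n_r = rng.normal(0.0, SIGMA_R, np.shape(e))        # signal-independent read noise
    return e + n_s + n_r

e = np.full(100_000, 0.25)
noisy = exposure(e, rng=0)
assert noisy.shape == e.shape
assert abs(noisy.mean() - 0.25) < 1e-3   # both noise terms are zero-mean
```

Because σ_s scales with √e, halving the exposure time lowers the signal faster than it lowers the shot noise, which is exactly the effect discussed in the next paragraph.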
Note that increasing r decreases the signal strength of each measurement because of the shorter exposure. Since the distortion caused by shot and read noise grows as the signal strength decreases, the imaging quality will not increase indefinitely as r increases. Also, a long exposure is more beneficial for stationary scenes, as the signal strength can accumulate, but is not good for moving objects.

The best choice of ratios should be a mix of small and large compression ratios at different locations, so that the captured sensor data can benefit from both the long exposure, with a larger signal-to-noise ratio, and the short exposure, with less motion blur.

D. Video reconstruction network
After obtaining I, the next step is to reconstruct the target video using a deep neural network. The overall architecture of the proposed video reconstruction network is shown in Fig. 4. Our network consists of three components: an initial reconstruction stage (IR), a fusion network (FN), and a deep reconstruction network (DRN) built on the single-stage spectral-wise transformers (SST) proposed in [32]. The IR and FN are introduced to mitigate the influence of the spatially-variant M and B. Specifically, if B is not considered, we obtain the initial reconstruction, V0 ∈ R^{H×W×T}, in IR from Eq. (1) as

V0_p = A_{r_p}^T (A_{r_p} A_{r_p}^T)^{-1} I_p,   (3)

where (A_{r_p} A_{r_p}^T)^{-1} is fortunately a diagonal matrix, given the definition of A_{r_p} in Sec. II-C. By contrast, if B is considered, we first rewrite Eq. (2) as

I_p = U(Â_{r_p} S_p),   (4)

where Â_{r_p} = A_{r_p} diag(B_p) and diag(B_p) denotes the T × T diagonal matrix constructed from B_p. In this case, some rows of Â_{r_p} may be all zero, making Â_{r_p} Â_{r_p}^T non-invertible. Since the row vectors of Â_{r_p} are orthogonal to each other, we reconstruct V0 by using the non-zero row vectors of Â_{r_p} independently,

V0_p = Σ_{i: [Â_{r_p}]_{i:} ≠ 0} ([I_p]_i / ||[Â_{r_p}]_{i:}||_2^2) [Â_{r_p}]_{i:}^T.   (5)

Following the coarse-to-fine criterion, we further use a shallow FN to deal with the spatially-variant coded exposure and ratios; it takes V0_p, M, and possibly B as inputs and outputs the second-level reconstruction result, V1 ∈ R^{H×W×T}. The architecture of FN is shown in Fig. 4.
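The per-pixel initial reconstruction from the non-zero rows of Â_{r_p} can be sketched as follows; this is an assumed realization of the row-wise inverse described above, with illustrative names:

```python
import numpy as np

def initial_reconstruction(I_p, A_hat):
    """Per-pixel initial reconstruction (a sketch of the IR stage).

    A_hat = A_r @ diag(B_p) has mutually orthogonal binary rows; rows that
    the mask zeroed out entirely are skipped, so no singular Gram matrix is
    ever inverted. Each surviving measurement is spread back uniformly over
    the frames it integrated.
    """
    T = A_hat.shape[1]
    V0 = np.zeros(T)
    for a_i, y_i in zip(A_hat, I_p):
        norm2 = a_i @ a_i
        if norm2 > 0:                      # skip all-zero rows
            V0 += (y_i / norm2) * a_i
    return V0

# With no mask (B = 1) and a constant signal, each frame recovers the
# average of the measurement block it contributed to.
A = np.kron(np.eye(2), np.ones((1, 8)))    # A_r for r = 2/T, T = 16
S = np.ones(16)
V0 = initial_reconstruction(A @ S, A)
assert np.allclose(V0, 1.0)
```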
Next, as illustrated in Fig. 4, the DRN takes V1 as input and reconstructs the final video, V̂, by cascading three SSTs [32]. Each SST follows the design of U-Net [33], with an encoder, a bottleneck, and a decoder. The embedding and mapping blocks are convolutional layers (conv) with 3×3 kernels. The feature maps in the encoder sequentially pass through one spectral-wise attention block (SAB), one conv with stride 2 (for downsampling), one SAB, and one conv with stride 2. The bottleneck is one SAB. The decoder has an architecture symmetric to the encoder. Following the spirit of U-Net, skip connections are used for feature aggregation between the encoder and decoder to alleviate the information loss from the downsampling operations. The basic unit of SST is the SAB, whose architecture is also shown in Fig. 4. It has one feed-forward network (FFN), one spectral-wise multi-head self-attention (S-MSA), and two layer-normalization layers. Different from the original MSA, which calculates self-attention along the spatial dimension, S-MSA regards each feature map as a token and calculates self-attention along the channel dimension, making it computationally efficient. More details of S-MSA are explained in [32].

E. Training losses
We use L_1 and L_2 to train the ratio generation parts and the video reconstruction network, respectively. Specifically, the learnable matrix and the ratio generation network are trained based on policy gradients. In our system, each spatial location in P is regarded as an agent, and its action space is the set of available compression ratios. Earlier works have proved the global convergence of policy-gradient RL in multi-agent settings [34]. We define the reward of each location, Q_p(r_p), under action r_p according to the rate-distortion trade-off theory,

Q_p(r_p) = −||V_p − V̂_p||^2 − λ r_p T,   (6)

where V denotes the target videos, ||V_p − V̂_p||^2 denotes the mean-squared error (MSE, i.e., the distortion) of the reconstructed videos at spatial location p, r_p T denotes the number of measurements employed at location p, which can also be viewed as the compression rate, and λ is a parameter introduced for the rate-distortion trade-off. Increasing λ penalizes the compression rate more, leading to a smaller average compression ratio. As this is not a sequential decision problem, there is no need to define future rewards. Although ||V_p − V̂_p||^2 can only approximately evaluate the effect of action r_p on V̂_p, as the DRN aggregates information from neighbouring pixels, it is also the most direct way to evaluate action r_p. At the same time, the expected reward, J_p = E[Q_p(r_p)], is the value function for spatial location p, where the expectation is w.r.t. r_p with probability P_p. Note that Min L_1 = Max Σ_p J_p. Following [19], we can approximate the gradient of J_p with respect to the parameters θ in the ratio generation parts with samples generated from P_p,

∇_θ J_p ≈ Q_p(r_p) ∇_θ log P_p(r_p),   (7)

where P_p(r_p) denotes the probability of choosing action r_p from distribution P_p. Note that the update of θ is based on the average value of ∇_θ J_p over different p.
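A per-pixel sketch of this REINFORCE-style update, using the standard Softmax-gradient identity for the score function ∇ log P_p (the reward follows the rate-distortion form described above; λ and the distortion value are placeholders, and the logits stand in for the ratio-generation network's output at one pixel):

```python
import numpy as np

LAMBDA, T = 0.05, 16
RATIOS = np.array([0, 1, 2, 4, 8]) / T

def reward(distortion, r_p):
    """Rate-distortion reward Q_p = -distortion - lambda * (r_p * T)."""
    return -distortion - LAMBDA * r_p * T

def policy_gradient(logits, action, Q):
    """Estimate grad_theta J_p = Q_p(r_p) * grad_theta log P_p(r_p)
    w.r.t. the pre-Softmax logits of one spatial location."""
    e = np.exp(logits - logits.max())
    P = e / e.sum()
    # d log P[action] / d logits = one_hot(action) - P  (Softmax identity)
    g = -P.copy()
    g[action] += 1.0
    return Q * g

logits = np.zeros(5)                            # uniform initial policy
Q = reward(distortion=0.01, r_p=RATIOS[2])      # a sampled action, r_p = 2/T
grad = policy_gradient(logits, action=2, Q=Q)
assert grad.shape == (5,) and abs(grad.sum()) < 1e-12
```

In practice the gradient is averaged over all spatial locations p before the SGD step, as stated above.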
On the other hand, the video reconstruction network is trained in a supervised way based on the MSE between V and V̂,

L_2 = ||V − V̂||_2^2.   (8)

III. EFFICIENT TRANSMISSION OF DATA FROM PROGRAMMABLE SENSORS
Sometimes the sensor data captured by robots' on-device sensors need to be sent to remote servers/devices through wireless channels for data storage or video reconstruction. Efficient transmission of these sensor data requires advanced data compression techniques to reduce the data rates and bandwidth requirements. As discussed above, existing communication systems focusing on reproducing the raw sensor data are sub-optimal. As a form of semantic communication, designing the compression methods with respect to the video reconstruction network, based on task-aware compression [30], is promising. After compressing the original sensor data into compact bit streams, the transmitter adds parity bits and modulates the streams for robust transmission over unreliable channels using off-the-shelf methods. The number of modulated symbols after channel coding and modulation is the communication cost.
Nevertheless, there is an increasing belief in the communication community that the classic framework based on the Shannon separation theorem needs to be upgraded for joint designs [29]. For semantic communication systems, jointly optimizing the channel coding and modulation with the other components may lead to better video reconstruction quality with fewer communication resources.
In this section, we design semantic communication frameworks for the proposed video compressed sensing system, based on both task-aware compression and semantic communications with joint design.
A. Task-aware compression

1) Architecture: The overall architecture of the deployed deep compression method for the proposed video compressed sensing system is shown in Fig. 5. It is mainly based on the architecture used in [30], substituting the basic feature extraction unit from convolutional blocks to SABs; some other changes are also made for simplicity^3. For sensor data I, we first reshape its dimension to H × W × 8, as each spatial location has at most eight measurements; zero-padding is used for locations with fewer measurements. An encoder, g_a, then takes the reshaped sensor data as input and generates a latent representation, Y = g_a(I), which is then quantized to Ŷ by a quantizer. The embedding and mapping blocks are 3×3 convs. The features inside the encoder subsequently go through five SAB×4-downsampling×1 pairs. Each downsampling operation reduces the spatial size of the features to 1/4 but doubles the channel dimension. Next, a hyper-encoder, h_a, takes Y as input and generates image-specific side information, Z; the channel dimension of the features stays constant in h_a. The quantized side information Ẑ = Q(Z) is then saved as a lossless bitstream through a factorized entropy model and entropy coding. After that, Ẑ is forwarded to the hyper-decoder, h_s, to draw the parameters (μ, σ) of a Gaussian entropy model, which approximates the distribution of Ŷ and is used to save Ŷ as a lossless bitstream. To reconstruct I, a decoder, g_s, operates on Ŷ and generates Î = g_s(Ŷ). At last, the reconstructed Î is used for video reconstruction. Note that g_s and g_a, and h_s and h_a, have symmetric architectures, respectively.
2) Training losses: The goal of task-aware compression is to minimize the length of the bitstreams and the distortion between V and V̂. This raises the optimization problem of minimizing ||V − V̂||^2 + β(−log_2 P_Ŷ − log_2 P_Ẑ), where ||V − V̂||^2 denotes the video reconstruction distortion, −log_2 P_Ŷ − log_2 P_Ẑ represents the number of bits used to encode Ŷ and Ẑ, and β is a trade-off parameter between coding rate and video distortion. Increasing β increases the compression ratio but decreases the video reconstruction quality.
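This objective can be sketched as follows, assuming a unit-width quantization bin for the Gaussian entropy model of Ŷ (the hyperprior bits for Ẑ are omitted for brevity, and all names are illustrative):

```python
import numpy as np
from math import erf, sqrt

def gaussian_bits(y_hat, mu, sigma):
    """Bits to code integer-quantized symbols under a Gaussian entropy model:
    P(y_hat) = Phi(y_hat + 0.5) - Phi(y_hat - 0.5), rate = -log2 P(y_hat)."""
    Phi = np.vectorize(lambda x: 0.5 * (1.0 + erf(x / sqrt(2))))
    p = Phi((y_hat + 0.5 - mu) / sigma) - Phi((y_hat - 0.5 - mu) / sigma)
    return -np.log2(np.maximum(p, 1e-12))

def rd_loss(V, V_hat, y_hat, mu, sigma, beta):
    """Task-aware rate-distortion objective ||V - V_hat||^2 + beta * rate.
    The distortion is measured on the reconstructed *video*, not on I."""
    distortion = np.mean((V - V_hat) ** 2)
    rate = gaussian_bits(y_hat, mu, sigma).sum()
    return distortion + beta * rate

rng = np.random.default_rng(0)
V, V_hat = rng.random(64), rng.random(64)
y_hat = np.round(rng.normal(0.0, 1.0, 32))       # quantized latent Y-hat
loss = rd_loss(V, V_hat, y_hat, mu=0.0, sigma=1.0, beta=1e-3)
assert loss > 0 and np.isfinite(loss)
```

Note the key difference from ordinary learned image compression: the distortion term compares videos V and V̂ downstream of the reconstruction network, which is what makes the compression task-aware.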
3) Communication costs: After obtaining compression systems with different compression ratios, we can estimate the number of modulated symbols required to transmit the resulting bit streams under different channel conditions. Specifically, the number of modulated symbols (in complex numbers) used for transmitting a bitstream depends on the implemented channel coding rate r_c (e.g., 1/3, 1/2, 2/3) and the modulation order.

^3 As the learned compression ratio is universal to all scenes, the spatial importance of I should also be independent of the contents of I; therefore, the side links conditioned on a quality map used in [30] can be safely removed. Also, instead of estimating the quality map by a task loss function, we directly change the distortion metric in the training loss of [30] to the task loss, which should be more precise in describing the spatial importance.

B. Semantic communications with joint design
Although the aforementioned deep compression method enables the co-design of deep optics, algorithms, and source coding, the widely used entropy coding methods [30], [23] prohibit the co-design of communication-specific components with deep compression methods. When entropy coding is used, every bit from the entropy coder needs to be transmitted error-free; otherwise, errors propagate through the entropy decoding process. This property leaves the communication systems no choice but to treat each bit equally and carefully. Semantic communications with joint design, on the other hand, allow the effect of channel noise to be considered during the compression process.
1) Architecture: As shown in Fig. 6, the framework consists of three components: semantic coders, semantic-channel coders, and a rate allocation network (RAN). Semantic coders define a special message generation and interpretation method between transceivers based on a shared knowledge base. Semantic-channel coders directly learn the end-to-end mappings between semantic messages and modulated symbols. The RAN is responsible for controlling the transmission rates. Different from deep compression methods, which focus on the source coding rate only, the transmission rates in semantic communications depend on the source coding rate, channel coding rate, and modulation order. Specifically, the deep optic methods are special semantic encoders, which encode a natural scene S into sensor data I in a predefined way, and the video reconstruction networks, which decode the target video V from I, are special semantic decoders. The sensor data, I, is the semantic message of a scene to be shared between transmitters and receivers. Note that conventional communication systems emphasize the accurate transmission of I; with the semantic decoder (defining I → V), the semantic communication system should instead be optimized to maximize the quality of V under limited communication costs. To our knowledge, this is also the first attempt to define the semantic coders in the generation process of the source data.
Given I, the transmitter uses a semantic-channel encoder (SCE) to generate a predefined maximum number of modulated symbols, X, from which some symbols are chosen by rate control techniques to transmit I through noisy channels. This process can be modelled as X = SCE(I), where the number of modulated symbols is measured in real numbers. The SCE is composed of an embedding layer, three consecutive SAB×4-downsampling×1 pairs, two stacked SAB×1-3×3Conv×1 pairs, and a mapping layer. The size of the feature maps after each operation is shown in Fig. 6.
To adjust the communication costs according to the semantic content of the message I, the RAN takes X as input, which is not only the generated modulated symbols but also a high-level representation of I in feature space, and generates a spatially-variant coding rate map, F. Each element of F indicates how many of the 48 symbols of X at the same spatial location will be kept for transmission. Specifically, F takes four discrete values f ∈ {1, 2, 3, 4}, and only the first 12f symbols of X in the channel dimension will be transmitted. The generation of F is similar to the generation of M in Sec. II-B: the RAN generates a feature map representing the discrete probability distribution of taking each value from {1, 2, 3, 4}, and a sampling operation is conducted to generate F. With F, a mask operation is conducted on X to delete the unnecessary symbols. The remaining symbols are then reshaped into a vector Z ∈ R^{n_z×1}, where n_z denotes the number of remaining symbols. Next, power normalization, Ẑ = √(n_z) Z/||Z||_2, is applied to Z to satisfy the power constraint. Finally, Ẑ is directly transmitted through the wireless environment. The effect of channel noise on Ẑ can be represented as

Z̃ = hẐ + n,

where h ~ CN(0, 1) is the multiplicative fading coefficient and n ~ N(0, σ_n I_{n_z×n_z}) is additive noise. In the additive white Gaussian noise (AWGN) channel, h = 1, and the value of σ_n depends on the current signal-to-noise ratio (SNR). Simultaneously, each element of F is quantized using 2 bits, and the resulting bitstream from F is also transmitted. Different from the transmission of Ẑ, the bitstream of F is transmitted in an error-free way, as in the deep compression methods, because a minor error in F would make it impossible to reshape the flattened vector Z back to X.
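The power normalization and the AWGN case of the channel can be sketched as follows (the normalization Ẑ = √n_z Z/||Z||_2 is an assumed form of the power constraint, and the SNR-to-σ_n mapping assumes unit signal power):

```python
import numpy as np

def transmit_awgn(Z, snr_db, rng=None):
    """Power-normalize the kept symbols and pass them through an AWGN channel.

    Z_hat = sqrt(n_z) * Z / ||Z||_2 enforces unit average symbol power;
    the channel then adds zero-mean Gaussian noise whose variance is set
    by the target SNR (h = 1 in the AWGN case).
    """
    rng = np.random.default_rng(rng)
    n_z = Z.size
    Z_hat = np.sqrt(n_z) * Z / np.linalg.norm(Z)   # unit average power
    sigma_n = 10 ** (-snr_db / 20)                 # noise std for unit-power signal
    Z_tilde = Z_hat + rng.normal(0.0, sigma_n, n_z)
    return Z_tilde, Z_hat

rng = np.random.default_rng(1)
Z = rng.normal(size=1024)                          # masked-and-flattened symbols
Z_tilde, Z_hat = transmit_awgn(Z, snr_db=10, rng=2)
assert np.isclose(np.mean(Z_hat ** 2), 1.0)        # power constraint holds
```

For the fading case described above, Z̃ = hẐ + n would additionally multiply Ẑ by a draw h ~ CN(0, 1) before adding the noise.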
At the receiver side, X̃ and F are reshaped from Z̃ and the bitstream, respectively. After concatenation, they are fed into a semantic-channel decoder (SCD), which has an architecture symmetric to the SCE. This process can be modelled as Ĩ = SCD(X̃, F). The recovered sensor data Ĩ is then used to generate the videos with the video reconstruction network.
2) Training losses: Similar to Sec. II-E, we use L_1 and L_2 to train the RAN and the SCE/SCD, respectively. Specifically, the RAN is trained by policy-gradient RL, where X is the state, F describes the actions, and each spatial location of F is an agent. Note that the spatial size of V is 8 times that of F and X, so the action taken at each spatial location of F strongly affects the reconstruction quality of an 8 × 8 area of V. Considering this, we first define U = (V − Ṽ)^2 and apply an 8×8 average pooling to U, obtaining Ũ. We then define the reward of location p, Q_F^p(f_p), under action f_p as

Q_F^p(f_p) = log(1/Ũ_p) − μ f_p,

where log(1/Ũ_p) describes the distortion caused by the agent at spatial location p, f_p is the transmission rate (also the action) at p, and μ is a parameter introduced for the rate-distortion trade-off. Increasing μ penalizes the transmission rate more, leading to fewer modulated symbols being transmitted. We adjust the communication costs by tuning μ.
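The pooled reward can be sketched as follows (an illustrative NumPy version; μ and the video sizes are placeholder values):

```python
import numpy as np

MU = 0.1

def pooled_distortion(V, V_hat):
    """8x8 average-pool the squared error U = (V - V_hat)^2 so that each
    rate-map location p sees the distortion of its own 8x8 patch of V."""
    U = (V - V_hat) ** 2
    H, W = U.shape
    return U.reshape(H // 8, 8, W // 8, 8).mean(axis=(1, 3))

def reward_F(U_tilde_p, f_p):
    """Per-location reward log(1 / U_tilde_p) - mu * f_p: lower distortion
    raises the reward, a larger transmission rate f_p lowers it."""
    return np.log(1.0 / U_tilde_p) - MU * f_p

rng = np.random.default_rng(0)
V, V_hat = rng.random((16, 16)), rng.random((16, 16))
U_tilde = pooled_distortion(V, V_hat)
assert U_tilde.shape == (2, 2)
# Raising the rate f_p lowers the reward for the same distortion.
assert reward_F(U_tilde[0, 0], 4) < reward_F(U_tilde[0, 0], 1)
```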
At the same time, the expected reward, J_F^p, is the value function for spatial location p. Min L_1 = Max Σ_p J_F^p, and the gradient of J_F^p with respect to the parameters δ in the RAN can be approximated as

∇_δ J_F^p ≈ Q_F^p(f_p) ∇_δ log P_F^p(f_p),

where P_F^p(f_p) denotes the probability of choosing action f_p. Note that the update of δ is based on the average value of ∇_δ J_F^p over different locations. On the other hand, the SCE and SCD are trained in a supervised way based on the MSE between V and Ṽ, L_2 = ||V − Ṽ||_2^2.

3) Communication costs: The communication costs consist of two parts: the transmission of Ẑ and of the bitstream B generated from F. As Ẑ is of shape n_z × 1, the number of complex modulated symbols used to transmit Ẑ is l_s^(1) = n_z/2. On the other hand, F has n_B = HW/64 elements and each element is represented by 2 bits, so the bitstream length is l_b = 2n_B. If the channel coding rate is r_c and the modulation order is r_m, the number of modulated symbols for B can be calculated as l_s^(2) = l_b/(r_c r_m), and the total communication cost is l_s^(1) + l_s^(2).
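Putting the two cost terms together (the formula for the second term, l_b/(r_c · r_m), is our reconstruction of the truncated expression above; here r_m counts bits per symbol, e.g., 2 for QPSK):

```python
def communication_cost(n_z, H, W, r_c=1/2, r_m=2):
    """Total complex modulated symbols for one sensor-data message:
    l_s1 = n_z / 2 for the analog symbols Z_hat, plus the digitally coded
    2-bit rate map F of size H/8 x W/8 (l_b = 2 * n_B bits), which needs
    l_s2 = l_b / (r_c * r_m) symbols after channel coding and modulation."""
    l_s1 = n_z / 2
    n_B = (H // 8) * (W // 8)
    l_b = 2 * n_B
    l_s2 = l_b / (r_c * r_m)
    return l_s1 + l_s2

# Example: 8192 kept real symbols for a 256 x 256 sensor map, rate-1/2
# coding with QPSK for the rate-map bitstream.
cost = communication_cost(n_z=8192, H=256, W=256)
assert cost == 8192 / 2 + (2 * 32 * 32) / (0.5 * 2)
```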

IV. EXPERIMENTS
In this section, we first demonstrate the superiority of the proposed video compressed sensing system in terms of sampling rate. Based on the developed sensors, we then evaluate the performance of different communication systems in terms of communication costs.

Fig. 7: The performance comparison between the learned spatially-variant ratio method and the fixed-ratio method in video imaging systems without coded aperture.

A. Video imaging with spatially-variant compression ratios
The following experiments are conducted to evaluate the effectiveness of the method proposed in Sec. II.
1) Dataset: Following [17], we use the Need for Speed (NfS) dataset [35] to train the network and evaluate its performance. The NfS dataset was collected with significant camera motion and is suitable for representing scenes captured by a moving robot's on-board camera. The dataset consists of 100 videos obtained from the internet, of which 80 are used for training and 20 for testing. Each video is captured at 240 frames per second (fps) with a 1280 × 720 resolution, which we center-crop to 256×256. For each video, we select 80 random 16-frame segments, so T = 16 in our experiments. The images are converted to grayscale and normalized to [0, 1]. The input and output of our end-to-end model are both 16-frame video segments.
2) Implementation details: Our model is implemented in PyTorch. The ratio generation parts are trained with the SGD optimizer (momentum = 0.9) at a learning rate of 5 × 10^{-3}. The video reconstruction network is trained with the Adam optimizer at a learning rate of 5 × 10^{-5}. The training process is as follows. We first fix the ratio at 8/T for all spatial locations and train the video reconstruction network for 100 epochs. After that, we gradually increase λ in Eq. (6) from 5 × 10^{-3} to 0.5 and jointly train the ratio generation parts and the video reconstruction network; for each λ, we train for about 100 epochs. For the baselines with a fixed compression ratio for all locations, we set the ratio from 1/T to 8/T and train the video reconstruction network for 100 epochs at each fixed ratio. Furthermore, we consider the cases where the binary mask, B, is used and not used. When B is used, it is randomly initialized and then fixed during the experiments.
3) Results without binary mask: We first compare our method with its fixed-ratio version when B is not used. We use the peak signal-to-noise ratio (PSNR), 10 log₁₀(1/MSE(V, V̂)), as the performance metric. The results are shown in Fig. 7, where r_avg = (1/(HW)) ∑_{i=1}^{H} ∑_{j=1}^{W} M_{i,j} denotes the average compression ratio over all spatial locations. As shown in Fig. 7, the imaging quality increases with r_avg for both methods until r_avg reaches 0.5 (8/T), where the effects of shot noise and read noise outweigh the benefit of additional measurements. From the figure, the proposed method with learned ratios has a significant performance gain over the method with fixed ratios. For example, when r_avg = 0.0625 (1/T), the learned-ratio method has nearly a 5 dB gain over the fixed-ratio method, demonstrating the superiority of learning a compression ratio map in reducing the number of measurement samples.

Fig. 8: The performance comparison between the learned spatially-variant ratio method and fixed-ratio methods in video imaging systems with coded aperture.

4) Results with binary mask: We then consider B and show the performance comparison in Fig. 8. Due to the use of the binary mask, the performance of the fixed-ratio method increases when r_avg = 0.0625 (1/T), showing the positive effect of coded apertures. However, as using B decreases the signal strength, the performance of the fixed-ratio method drops as r_avg increases, compared with the same method without B in Fig. 7. From the figure, the proposed method with learned ratios also has a steady performance gain over the fixed-ratio method, further proving the effectiveness of the developed sensors.

Fig. 9: The learned ratio maps in a 50 × 50 area of an image when B is used.
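The two evaluation quantities used above, PSNR for [0, 1]-normalized videos and the average compression ratio r_avg over the H × W ratio map M, can be computed as:

```python
import torch

def psnr(v: torch.Tensor, v_hat: torch.Tensor) -> float:
    """PSNR = 10 * log10(1 / MSE(V, V_hat)) for videos in [0, 1]."""
    mse = torch.mean((v - v_hat) ** 2)
    return float(10 * torch.log10(1.0 / mse))

def avg_ratio(m: torch.Tensor) -> float:
    """r_avg = (1/HW) * sum_{i,j} M_{i,j}: the mean compression
    ratio over all H x W spatial locations of the ratio map."""
    return float(m.float().mean())
```

For instance, a uniform reconstruction error of 0.1 gives MSE = 0.01 and hence a PSNR of 20 dB.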

5) Learned compression ratio maps:
We show the learned ratio maps when B is used in Fig. 9. As discussed above, there are only five ratio choices from 0 to 8/T, whose colors are shown in the color bar. From the figure, the learned ratio maps have fixed patterns within a local area, and the patterns differ across r_avg values. Besides, the learned ratio maps mix low and high ratios so that the benefits of both long and short exposures can be exploited.
6) Visual results: Fig. 10 shows the restored video frames from the different methods. We use the fixed-ratio method with r_avg = 1/T and the learned-ratio method with r_avg ≃ 1/T in Fig. 8. From the figure, the frames from the learned-ratio method have more texture details and fewer artifacts.

B. Semantic communications for programmable sensors
In this subsection, we first evaluate semantic communication frameworks based on task-aware compression and on the joint design, respectively. After that, we compare semantic communications with existing transmission methods.
In the following experiments, we first restore the network parameters related to the sensors from the models pretrained in the previous subsection and fix them when training the semantic communication frameworks. We use the sensor network with r_avg = 0.156 from Fig. 7. These experiments are called fixed-sensing experiments. Next, we jointly train the sensing network with the communication parts, which we call joint-sensing experiments.
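The fixed-sensing setup, where the pretrained sensor parameters are frozen while the communication parts keep training, can be sketched as below; `sensor_net` is a placeholder for the pretrained sensing network.

```python
import torch

def freeze(module: torch.nn.Module) -> None:
    """Fix pretrained sensor parameters so only the semantic
    communication parts receive gradient updates."""
    for p in module.parameters():
        p.requires_grad = False
    module.eval()

sensor_net = torch.nn.Linear(4, 4)   # placeholder for the sensing network
freeze(sensor_net)
trainable = [p for p in sensor_net.parameters() if p.requires_grad]
```

After `freeze`, the optimizer for the communication parts is simply built without the sensor parameters, so joint-sensing experiments amount to skipping this call.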
1) Channel condition: We assume the sensor data is transmitted over an AWGN channel with SNR = 10 dB.
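The AWGN channel at a given SNR can be simulated by scaling Gaussian noise to the empirically measured signal power, as in this minimal sketch:

```python
import torch

def awgn(x: torch.Tensor, snr_db: float = 10.0) -> torch.Tensor:
    """Add white Gaussian noise so the per-symbol SNR equals snr_db,
    measuring the signal power empirically from x."""
    sig_pow = x.pow(2).mean()
    noise_pow = sig_pow / (10 ** (snr_db / 10))
    return x + torch.sqrt(noise_pow) * torch.randn_like(x)
```

During end-to-end training this layer is differentiable with respect to `x`, so gradients pass straight through the additive noise.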
2) Implementation details of different semantic communication frameworks: We now describe the implementations of the different semantic communication frameworks in more detail.
• Task-aware compression plus capacity-achieving channel coding (Compr+Cap): We first convert the sensor data into bitstreams and then assume the transmission of the bitstreams can reach the Shannon channel capacity. In the considered channel condition, l_s/l_b = 0.289. Note that it is hard to achieve the Shannon capacity in real systems, so this performance can only be regarded as an ideal reference for compression methods.

... the first 6f symbols of X rather than 12f. Also, exploration strategies are used when sampling f. Specifically, the sampling probability is set to F + 0.4p, where we set p(1) = p(2) = p(3) = p(4) = 0.25 to encourage the RAN to explore other actions so that the SCE/SCD can perform well on all actions.

3) Comparison of semantic communications in fixed-sensing experiments: We show the performance comparison of the different semantic communication methods in the fixed-sensing experiment in Fig. 11, where l_avg denotes the average number of modulated symbols l_s used for the video clips in the test dataset. From the figure, 'Compr+LDPC' performs slightly better than 'SemCom+noRAN', while the proposed 'SemCom' method outperforms both by a relatively large extent, showing the advantage of jointly designing the channel coding and modulation. It also demonstrates the effectiveness of directly implementing the rate-distortion trade-off on the modulated symbols through the proposed RAN. Furthermore, the proposed 'SemCom' achieves a performance similar to 'Compr+Cap', showing that semantic communications with joint designs are a promising way to approach the Shannon capacity.
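The exploration rule above can be read as sampling the action from a mixture of the RAN's output distribution F and the uniform prior p = 0.25 over the four actions; the exact mixture weight is our interpretation of the "F + 0.4p" rule, so the sketch below is an assumption rather than the paper's exact implementation.

```python
import torch

def explore_sample(f: torch.Tensor, eps: float = 0.4) -> int:
    """Sample one of four actions from a mixture of the RAN's
    distribution f and a uniform prior p = 0.25 (mixture weight
    eps = 0.4 is our reading of the exploration rule)."""
    p = torch.full_like(f, 1.0 / f.numel())
    probs = (1 - eps) * f + eps * p
    return int(torch.multinomial(probs, 1).item())
```

The uniform term keeps every action's sampling probability bounded away from zero, so the SCE/SCD sees all rate choices during training.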
4) Comparison of semantic communications in joint-sensing experiments: The performance comparison of the different semantic communication methods in the joint-sensing experiment is shown in Fig. 12. From the figure, 'SemCom' performs significantly better than 'Compr+LDPC' and even surpasses 'Compr+Cap' for large l_avg, further proving the benefits of joint designs. However, we cannot conclude that semantic communications surpass the Shannon capacity, as the performance of 'Compr+Cap' depends on our implementation of the task-aware compression methods.
5) Implementation of conventional communication methods: We now explain how the sensor data can be transmitted in conventional communication systems and introduce their implementation details.

• Transmit raw sensor data by deep JSCC (Sensordata+JSCC): Conventional systems transmit the raw sensor data regardless of its usage. To simulate this process, the raw sensor data need to be compressed and channel-coded. Recent works on deep JSCC [31] have shown that training a network for joint source and channel coding can perform better than using standardized image compression methods and channel coding methods. Therefore, we follow its design and build a deep network similar to the SCE/SCD to transmit the raw sensor data. The network is optimized by the mean-squared error (MSE) between the original and reconstructed sensor data.

• Transmit reconstructed video by H.264 video coding plus LDPC plus QAM (Video+H.264+LDPC): Another choice is to first reconstruct the video via a locally-deployed video reconstruction network and then transmit the reconstructed video. In this way, the computationally costly reconstruction network needs to be run at the transmitter. To transmit the video, we use H.264 [36] for video source coding, LDPC for channel coding, and QAM for modulation.

6) Comparison of semantic communications with conventional communication systems: The performance comparison between semantic communications and conventional communication systems is shown in Fig. 13. From the figure, 'Sensordata+JSCC' performs the worst. This is because the data that is important for the video reconstruction network does not receive targeted protection during transmission. 'Video+H.264+LDPC' performs better than 'Sensordata+JSCC' as the videos have already been reconstructed at the transmitter side. However, this method is still not as effective as 'SemCom', which shows that we can achieve efficient transmission of sensor data without running the time-consuming and resource-intensive reconstruction algorithm at the transmitter side.

V. CONCLUSION
In this work, we propose a novel video imaging system with learned compression ratios. We also propose a semantic communication framework for programmable sensors with task-oriented reconstruction algorithms. From the perspective of algorithm development, we show that by combining policy-gradient learning and supervised learning, we can achieve an explicit (compression or transmission) rate-distortion trade-off in different settings. The proposed training pipelines can be extended to many other applications.

Fig. 2: The overall architecture of the proposed video compressed sensing system with spatially-variant ratios.

Fig. 3: The exposure time and readout operations under different compression ratios.

Fig. 4: The architectural details of the video reconstruction network.

Fig. 10: Examples of restored video frames from sensing methods with or without learned ratios when the compression ratio equals 1/16.

Fig. 11: The performance comparison among different semantic communication frameworks in fixed-sensing experiments.