Associating Latent Representations With Cognitive Maps via Hyperspherical Space for Neural Population Spikes

Recently, there has been a focus on leveraging progress in representation learning to obtain more identifiable and interpretable latent representations for spike trains, which helps analyze neural population activity and understand neural mechanisms. Most existing deep generative models adopt carefully designed constraints to capture meaningful latent representations. For neural data involving navigation in cognitive space, based on insights from studies on cognitive maps, we argue that good representations should reflect this directional nature. Due to manifold mismatch, models that use a Euclidean latent space learn a distorted geometric structure that is difficult to interpret. In the present work, we explore capturing the directional nature in a simple yet efficient way by introducing the hyperspherical neural latent variable model (SNLVM). SNLVM is an improved deep latent variable model that models neural activity and behavioral variables simultaneously with a hyperspherical latent space. It bridges cognitive maps and latent variable models. We conduct experiments on modeling a static unidirectional task. The results show that while SNLVM has competitive performance, a hyperspherical prior naturally provides more informative and significantly better latent structures that can be interpreted as spatial cognitive maps.


Yicong Huang and Zhu Liang Yu, Member, IEEE
Index Terms-Latent variable models, neural population spikes, hyperspherical latent space, cognitive navigation.

I. INTRODUCTION
The brain encodes external input and spontaneous activity with neural population spikes [1], [2], [3]. Previous neural electrophysiology and computational neuroscience studies suggest that while large-scale neural population activities are high-dimensional, they have low-dimensional latent factors governing the coordinated actions of neurons [4]. Discovering the latent factors is beneficial to revealing the unobserved and intriguing nature of the neural mechanisms underlying complex perceptual or cognitive processes [5], [6], [7]. Numerous studies have developed latent variable models to capture the low-dimensional structure of high-dimensional neural activity. These approaches range from traditional machine learning techniques like principal component analysis (PCA) [8] and Gaussian processes [9] to deep neural networks like sequential variational auto-encoders (VAEs) [10] and neural ordinary differential equations [11]. The community has shown that deep generative models provide more promising results in finding the latent representations and latent factors in terms of accuracy and efficiency [10], [12]. Most deep generative models are based on VAEs [13]. While they are theoretically solid upon a well-studied probabilistic framework, these methods also inherit some intrinsic drawbacks of VAEs. In particular, the clusters of latent representations of different neural patterns may become entangled and thus difficult to identify and interpret. One main reason for the entanglement issue is that the default Gaussian prior distribution in VAEs has a limitation of encouraging points to concentrate towards the origin [14] and may introduce undesirable local optima [15]. Some works have tried to disentangle the latent representations by carefully designing additional constraints and the corresponding network architecture [16], [17]. These approaches significantly improve the separability between clusters and hence the interpretability of the data.
However, the structure of the inferred latent distributions remains obscure, which limits their use in analyzing neural data.
We investigated how to learn a better latent structure that helps explain neural data. Since most neural data are recorded based on navigating cognition [18], [19], [20], [21], [22], an insightful latent structure should mirror this directional nature. Nevertheless, a Euclidean latent space without further constraints struggles to capture this characteristic through optimization. Due to manifold mismatch, existing models may lose topological information in the Euclidean latent space [23]. From the standpoint of representation learning, one likely loses information about important properties of the data because of this geometric distortion. Ultimately, it harms the analysis of neural data and the discovery of latent factors.
Noticing these facts and inspired by recent insights on deep learning on manifolds [24], we attempt to address the geometric structure issue by considering better priors. Specifically, in the present work, we draw on recent advances in non-Euclidean deep learning and introduce the hyperspherical neural latent variable model (SNLVM). SNLVM improves existing deep generative models by simply replacing the Gaussian distribution on a Euclidean space with the Power Spherical distribution on a hyperspherical space [25]. We validate SNLVM on a reaching dataset, a task in which the subject maintains a unitary cognitive mode within each trial. Compared with existing deep generative models, our model provides a more informative latent structure that can capture the navigating nature of cognition.

A. Notations
Suppose we have simultaneously recorded spike trains from N neurons and the corresponding behavior at T time steps. Let X ∈ ℕ^{N×T} be the observed Poisson spike-count matrix with elements x_{i,t} denoting the spike count of neuron i ∈ {1, . . . , N} at time t ∈ {1, . . . , T}. Let Y ∈ ℝ^{B×T} be the observed corresponding behavior, with B being the dimension of the behavior. We consider latent variable models to find unobserved latent factors Z ∈ ℝ^{M×T} that well explain X, with M ≪ N. Generally, the literature assumes that the latent factors evolve according to a dynamical system.
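As a concrete illustration of this notation, the sketch below generates a synthetic spike-count matrix under a hypothetical log-linear Poisson readout; the sizes, the readout matrix, and the rate model are our assumptions for illustration, not properties of any dataset in this paper.

```python
import numpy as np

# Hypothetical sizes: N neurons, T time steps, M latent dims (M << N), B behavior dims.
rng = np.random.default_rng(0)
N, T, M, B = 100, 50, 8, 2

Z = rng.standard_normal((M, T))         # latent factors, shape (M, T)
C = 0.1 * rng.standard_normal((N, M))   # assumed linear readout to log-rates
rates = np.exp(C @ Z)                   # nonnegative firing rates, shape (N, T)
X = rng.poisson(rates)                  # Poisson spike-count matrix, shape (N, T)
Y = rng.standard_normal((B, T))         # behavior recording, shape (B, T)
```

The only structural assumptions here are the matrix shapes from the notation above and Poisson-distributed counts.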
B. LFADS and TNDM

LFADS is an extension of variational auto-encoders (VAEs) [13] for neural spiking data. The encoder and the decoder are both recurrent neural networks (RNNs). The model is trained by maximizing the evidence lower bound (ELBO) of the marginal log-likelihood of the parameterized distribution, utilizing the reparameterization trick:

$$\mathcal{L}(\theta, \phi; X) = \mathbb{E}_{q_\phi(Z|X)}\left[\log p_\theta(X|Z)\right] - \beta\, \mathrm{KL}\!\left(q_\phi(Z|X)\,\|\,p(Z)\right),$$

where q_φ(Z|X) is a variational approximation of the intractable posterior p(Z|X) parameterized by φ, p_θ(X|Z) is the generative model parameterized by θ, β > 0 is a hyperparameter, and KL(·‖·) is the Kullback-Leibler divergence. In a nutshell, LFADS first encodes the observed spikes into latent representations g ∈ ℝ^L and then generates latent factors Z with the decoder from the given initial states g. Finally, a linear operator maps Z to the inferred firing rates. LFADS enables one to conveniently evaluate the latent factors via the static latent representations g. Like vanilla VAEs, LFADS assumes that the encoded variable g follows a Gaussian distribution whose parameters are inferred from X. Studies have shown that LFADS significantly outperforms previous approaches based on probabilistic models.

TNDM is inspired by the recently proposed Preferential Subspace Identification (PSID) [31]. It decomposes the latent dynamics into behaviorally relevant and behaviorally irrelevant dynamics. However, PSID is a linear state-space model and cannot capture more complex nonlinear neural dynamics. TNDM combines PSID and LFADS: it decomposes the latent space and draws on recent progress in unsupervised representation learning [33], [34], introducing an additional independence penalty to encourage the latent subspaces to capture different information from the data. TNDM shows the merit of explicitly introducing behavior modeling. It can be seen as improving LFADS with behavior modeling and an unsupervised learning constraint.
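The β-weighted ELBO with a Poisson observation model can be sketched numerically. The snippet below assumes a diagonal Gaussian posterior over the static representations with parameters mu and sigma; the function names and shapes are hypothetical stand-ins for the encoder and decoder outputs, not LFADS's actual implementation.

```python
import math
import numpy as np

def poisson_log_likelihood(x, rates):
    """log p(X | Z) for Poisson counts, summed over neurons and time bins."""
    lgamma = np.vectorize(math.lgamma)
    return float(np.sum(x * np.log(rates) - rates - lgamma(x + 1.0)))

def kl_gaussian_to_standard_normal(mu, sigma):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), summed over dimensions."""
    return float(0.5 * np.sum(mu ** 2 + sigma ** 2 - 1.0 - 2.0 * np.log(sigma)))

def elbo(x, rates, mu, sigma, beta=1.0):
    """beta-weighted evidence lower bound for a single trial."""
    return poisson_log_likelihood(x, rates) - beta * kl_gaussian_to_standard_normal(mu, sigma)
```

The KL term vanishes exactly when the posterior equals the standard normal prior, which is the concentration-toward-the-origin pressure discussed in the introduction.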

C. Hyperspherical VAEs
To extract better latent representations that retain the underlying geometric structure of the data, some previous work has explored distributions defined on non-Euclidean spaces and applied them to deep generative models. One such distribution that is convenient to implement, yet has been shown to yield more benefits than the standard Gaussian assumption, is the von Mises-Fisher (vMF) distribution, i.e., the analogue of a Gaussian distribution on a hypersphere. S-VAEs first systematically discussed the drawbacks of the Gaussian prior and proposed using the vMF distribution to capture data based on the intuition of manifold matching [23]. However, this requires a rejection sampling scheme to sample from a vMF distribution; thus, despite its generalizability, it may suffer from scalability issues. An alternative is to leverage the Power Spherical (PS) distribution [25], which preserves important properties and aspects of the vMF distribution but is more stable and efficient.

A. Cognitive Maps
The very first step of a cognitive process is to perceive and learn the environment. The capacity to organize knowledge, generalize to new information, and respond to perception accordingly is often referred to as a cognitive map [35], [36], [37]. The most famous studies around the concept of cognitive maps are based on investigating spatial behaviors [36]. One of the most vital aspects is spatial navigation [20], [38], [39], i.e., planning routes based on knowledge of real-world space. Specifically, in the hippocampal-entorhinal system, neurons fire with distinct patterns at different spatial locations [40]. Studies have shown that, for example, while place cells respond to the current location of the subject [41], the firing patterns of grid cells can represent relationships between locations [42]. These activities encode knowledge about different spatial variables and their relationships. Eventually, in a navigation-based cognitive process, this system gives spatial cues that drive spontaneous output, such as motor control.
Navigation is all about directions. Mathematically, a spherical space is perhaps the best choice for describing vector directions, like a compass. For neural population activities arising from behavior that involves spatial navigation, a good LVM is supposed to capture this nature. In other words, there should be intermediate representations reflecting a circular structure that can be interpreted as navigation-based cognition. Unsurprisingly, existing improved LVMs may evaluate whether their latent representations preserve the so-called "task structure" [16], [17], [53]. However, vector magnitude in the widely used Euclidean space is redundant and even detrimental for modeling navigation, and it distorts the latent distribution structure. Even with carefully designed constraints, it is difficult to restore the geometric structure via parameter optimization, which ultimately hinders interpreting neural data. Therefore, if we expect the latent representations to mirror the navigational nature, the widely used Euclidean space is a suboptimal choice.

B. Manifold Mismatch
Nickel and Kiela [54] is one of the first works proposing non-Euclidean latent spaces for deep representation learning. It shows that hyperbolic space is inherently superior for modeling hierarchically structured data such as trees and graphs, and it inspired subsequent studies to choose a latent topology more suitable for the data [55], [56].
Generally, it has been of significant interest to explore deep generative models with non-Euclidean latent spaces. In practice, most data points are generated by latent variables lying on a non-trivial M-dimensional manifold ℳ and observed in a high-dimensional space ℝ^N with M ≪ N. Deep generative models try to recover the unobserved latent manifold ℳ with a smooth parameterized map, the encoder f_enc : ℝ^N → ℝ^D, where ℝ^D is the latent space. However, ℳ is commonly not homeomorphic to ℝ^D. Therefore, if D ≤ M, then f_enc can never be a homeomorphism, which means that information about the topological structure of ℳ is lost to some degree. Conversely, if D > M, then ℳ can be embedded in ℝ^D when D is sufficiently large; however, sampling points in the latent space then becomes problematic, since not the whole space is feasible for generating valid data points.
For one thing, due to limitations of a Gaussian prior, the inferred representations tend to be entangled and thus difficult to interpret. A feasible improvement is to introduce additional prior knowledge, for example, constraints for learning disentangled and more meaningful representations. Existing works such as Swap-VAE [16] and TNDM [17] assimilate recent advances in deep unsupervised learning to isolate different information in different subspaces. These approaches successfully separate stable target-specific neural patterns from instance-specific individual fluctuations. A simpler solution is to model external variables [17], [53], which significantly improves the identifiability of LVMs by introducing a strong constraint. Hence, the issue of disentanglement is already somewhat alleviated.
For another thing, the manifold mismatch issue also damages interpretability. Deep neural networks learn representations favored by the objective function, but existing models have not considered the geometric structure of the latent representations. Through gradient-based parameter optimization, neural networks cannot naturally obtain latent representations with a circular structure. The main reason is that the Euclidean latent space is unlikely to be homeomorphic to ℳ, the cognitive space that generates the observed neural data. A better alternative is thus to specify a latent space homeomorphic to ℳ and to use distributions on it. The existence of a homeomorphism ensures that the model can preserve the latent structure without losing topological information, such that we can interpret the latent representations as spatial navigation in cognitive space. This motivates us to adopt a hyperspherical latent space, which is naturally suited to modeling directional vectors.

C. Hyperspherical Latent Space
We introduce hyperspherical neural latent variable models (SNLVM), a deep generative model utilizing a hyperspherical latent space and behavior modeling to learn more interpretable representations for neural dynamics. Unlike previous works, the latent representations by SNLVM can be directly interpreted as the cognitive space. We outline the neural network architecture in Figure 1. SNLVM improves LFADS with two ingredients, hyperspherical latent space and behavior modeling. Note that unlike mGPLVM defining latent factors directly on a manifold [57], we assume that Euclidean latent factors are generated from latent variables on the hypersphere. We also include a comparison of network architectures between SNLVM and the baseline models in Figure 2.
For the community, there are additional desirable properties that favor learning representations on a unit hypersphere for neural networks. For example, when the intermediate features have unit norm, the model becomes more stable, since matrix multiplication is ubiquitous. Moreover, when the features within a category are well clustered, they lie on a hyperspherical crown and are linearly separable from the remaining features, which is one of the most desired attributes [58].
A simple yet widely adopted distribution on the unit hypersphere S^{L−1} in ℝ^L is the von Mises-Fisher (vMF) distribution, the analogue of a Gaussian distribution on a hypersphere. The vMF distribution is parameterized by a mean direction parameter μ ∈ S^{L−1} and a concentration parameter κ ≥ 0. For the special case of κ = 0, the vMF distribution becomes a uniform distribution on S^{L−1}. The probability density function of the vMF distribution for a random vector x ∈ S^{L−1} is

$$q(x; \mu, \kappa) = C_L(\kappa)\, e^{\kappa \mu^\top x}, \qquad C_L(\kappa) = \frac{\kappa^{L/2-1}}{(2\pi)^{L/2}\, I_{L/2-1}(\kappa)},$$

where C_L(κ) is the normalizing constant and I_v is the modified Bessel function of the first kind at order v. However, evaluating the Bessel function is usually inefficient and numerically unstable for large parameters. A better alternative is the Power Spherical (PS) distribution [25]. Analogous to the vMF distribution, the PS distribution is also a distribution on S^{L−1} parameterized by a mean direction parameter μ ∈ S^{L−1} and a concentration parameter κ ≥ 0.¹ In most aspects, the PS distribution behaves the same as the vMF distribution. Its density function is defined as

$$p(x; \mu, \kappa) = N_{L,\kappa}^{-1}\, (1 + \mu^\top x)^{\kappa}, \qquad N_{L,\kappa} = 2^{a+b}\, \pi^{b}\, \frac{\Gamma(a)}{\Gamma(a+b)},$$

with a = (L−1)/2 + κ and b = (L−1)/2. Besides, despite its seeming complexity, the sampling algorithm for a PS distribution is sufficiently efficient.

¹ One may consider using a Gaussian prior followed by normalization to obtain unit vectors. This mostly works. However, the effects of the variational parameters μ and σ would be entangled, because a Gaussian distribution with a smaller μ becomes less concentrated (a smaller effective κ) after normalization.
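The PS density is straightforward to evaluate in log space using only the log-gamma function, with no Bessel functions involved. A minimal sketch (the function name is ours), using a and b as defined above:

```python
import math
import numpy as np

def log_ps_density(x, mu, kappa):
    """Log density of the Power Spherical distribution PS(mu, kappa) on S^{L-1}."""
    L = mu.shape[0]
    a = (L - 1) / 2.0 + kappa
    b = (L - 1) / 2.0
    # Normalizing constant N = 2^{a+b} * pi^b * Gamma(a) / Gamma(a+b), in log space.
    log_norm = ((a + b) * math.log(2.0) + b * math.log(math.pi)
                + math.lgamma(a) - math.lgamma(a + b))
    return kappa * math.log1p(float(mu @ x)) - log_norm
```

At κ = 0 the density reduces to the uniform density on the hypersphere, e.g. 1/(2π) on the circle.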

Algorithm 1 Sample From a Power Spherical Distribution
With the uniform distribution U(S^{L−1}) as the prior, the KL divergence is

$$\mathrm{KL}\!\left(PS(\mu,\kappa)\,\|\,U(S^{L-1})\right) = \log A_{L-1} - \log N_{L,\kappa} + \kappa\left(\log 2 + \psi(a) - \psi(a+b)\right),$$

where N_{L,κ} = 2^{a+b} π^{b} Γ(a)/Γ(a+b) is the normalizing constant of the PS density with a = (L−1)/2 + κ and b = (L−1)/2, A_{L−1} = 2π^{L/2}/Γ(L/2) is the surface area of S^{L−1}, and ψ(x) = ∂/∂x log Γ(x) is the digamma function. Note that the KL divergence term depends only on κ, which is consistent with the intuition that PS(·, 0) is equivalent to U(S^{L−1}). In this work, we treat κ as a hyperparameter and fine-tune it to investigate how the PS distribution and the hyperspherical latent space work. Given a fixed κ, the KL divergence term is a constant independent of the network parameters. The same consideration is also adopted in Xu and Durrett [14], who showed that such a setting helps prevent posterior collapse, i.e., the regularization term forcing the model to estimate posterior distributions close to the prior.
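For a fixed κ this constant can be computed once. The sketch below evaluates the closed-form KL; to stay dependency-free it approximates the digamma function by a central difference of log Γ, which is an implementation choice of ours, not of the paper.

```python
import math

def digamma(x, h=1e-5):
    """Central-difference approximation of psi(x) = d/dx log Gamma(x)."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)

def kl_ps_uniform(kappa, L):
    """KL(PS(mu, kappa) || U(S^{L-1})); depends only on kappa and L, not mu."""
    a = (L - 1) / 2.0 + kappa
    b = (L - 1) / 2.0
    # log of the PS normalizing constant N = 2^{a+b} * pi^b * Gamma(a) / Gamma(a+b)
    log_norm = ((a + b) * math.log(2.0) + b * math.log(math.pi)
                + math.lgamma(a) - math.lgamma(a + b))
    # log surface area of S^{L-1}: A = 2 * pi^{L/2} / Gamma(L/2)
    log_area = math.log(2.0) + (L / 2.0) * math.log(math.pi) - math.lgamma(L / 2.0)
    return log_area - log_norm + kappa * (math.log(2.0) + digamma(a) - digamma(a + b))
```

As expected, the KL is zero at κ = 0 and grows monotonically as the posterior concentrates.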

D. Sampling Procedure
To leverage the reparameterization trick to efficiently optimize the objective function with gradient-based methods, we follow the scheme proposed in De Cao and Aziz [25] to sample from a PS distribution.
We summarize the sampling procedure for a PS distribution in Algorithm 1. In a nutshell, the sampling technique first decomposes a point on S^{L−1} into its tangential and normal components as

$$y = t\mu + \sqrt{1 - t^2}\, v,$$

where v is a vector sampled uniformly from the subsphere S^{L−2} tangent to the hypersphere S^{L−1} at μ. The problem then reduces to sampling a magnitude t defined by an affine transformation of a Beta distribution. The sampling module has a well-defined gradient w.r.t. the parameters μ and κ. The predominant difference is that for the vMF distribution, one needs to sample the magnitude t from a more complex distribution with an unnormalized density

$$p(t) \propto e^{\kappa t}\, (1 - t^2)^{(L-3)/2},$$

where t ∈ [−1, 1]. One needs an acceptance-rejection sampling scheme to sample from this distribution. In contrast, sampling from a Beta distribution is more stable and efficient.
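The procedure can be sketched as follows, assuming the common Householder-reflection construction for rotating a sample drawn around the first basis vector onto μ (variable names are ours; this is a sketch of Algorithm 1, not the authors' exact code):

```python
import numpy as np

def sample_ps(mu, kappa, rng):
    """Draw one sample from PS(mu, kappa) on S^{L-1}."""
    L = mu.shape[0]
    a = (L - 1) / 2.0 + kappa
    b = (L - 1) / 2.0
    t = 2.0 * rng.beta(a, b) - 1.0        # magnitude via an affine-mapped Beta draw
    v = rng.standard_normal(L - 1)        # uniform direction on the subsphere S^{L-2}
    v /= np.linalg.norm(v)
    y = np.concatenate(([t], np.sqrt(1.0 - t * t) * v))  # sample concentrated around e_1
    u = np.zeros(L)
    u[0] = 1.0
    u -= mu                               # Householder direction e_1 - mu
    norm = np.linalg.norm(u)
    if norm > 1e-12:                      # the reflection maps e_1 onto mu
        u /= norm
        y -= 2.0 * np.dot(u, y) * u
    return y
```

Every step (Beta draw, normalization, reflection) is differentiable w.r.t. μ and κ, which is what makes the reparameterization trick applicable.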

E. Regularization and Objective Function
Previous work has shown that involving behavior information in neural dynamics analysis is a simple yet promising improvement for LVMs. This strategy of modeling the relationship between the latent representations and task variables is also known as a hybrid scheme [53]. Both discrete labels [53] and continuous recordings [17], [31] help regularize the model. Specifically, maximizing the distribution likelihood of task variables serves as a strong prior, promoting latent representations aligned with the actual unobserved latent manifold that generates the neural spikes based on the behavior. Consequently, we follow previous studies and define the ultimate objective function for our model as

$$\mathcal{L}(\theta, \phi; X, Y) = \mathbb{E}_{q_\phi(Z|X)}\left[\log p_\theta(X|Z)\right] + \lambda_b\, \mathbb{E}_{q_\phi(Z|X)}\left[\log p_\theta(Y|Z)\right],$$

where λ_b > 0 is a hyperparameter acting as a Lagrange multiplier for the additional constraint of behavior decoding. Since the KL divergence is a constant given a fixed κ, we simply omit this term from the objective function. The model computes hyperspherical variational posterior distributions with an RNN given the neural population spikes X. We sample hyperspherical latent representations g from the posterior distributions and then generate latent factors Z from g. Two linear models project Z to the firing rates and to the behavior trajectory, respectively. Finally, we compute the objective function: the Poisson likelihood of the inferred firing rates and the likelihood of the inferred behavior.
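The combined objective can be sketched as follows, assuming a unit-variance Gaussian likelihood for the behavior term (the specific behavior likelihood is our assumption for this sketch) and dropping the KL term, which is constant for a fixed κ:

```python
import math
import numpy as np

def snlvm_objective(spikes, rates, behavior, behavior_hat, lam_b=0.1):
    """Objective to maximize: Poisson log-likelihood of the spikes plus a
    lam_b-weighted behavior log-likelihood (unit-variance Gaussian assumed)."""
    lgamma = np.vectorize(math.lgamma)
    ll_spikes = float(np.sum(spikes * np.log(rates) - rates - lgamma(spikes + 1.0)))
    ll_behavior = float(-0.5 * np.sum((behavior - behavior_hat) ** 2))
    return ll_spikes + lam_b * ll_behavior
```

The behavior term only rewards accurate trajectory reconstruction; a perfect reconstruction contributes zero penalty.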

A. Dataset
To verify the validity of SNLVM, we apply our model to data from a recently published rhesus macaque reaching experiment [59]. The data were recorded while a macaque performed a center-out reaching task toward eight different directions on a computer screen. The dataset used in the present work comprises simultaneously recorded neural population spikes from the primary motor cortex and the associated hand position as the behavior. Because each trial is unidirectional, the spatial navigation is static within each trial. We evaluate our model and compare it with LFADS and TNDM on single-session data from one of the six trained macaques. The dataset contains 176 trials in total; we use 80% as the training set (136 trials), 10% as the validation set (17 trials), and 10% as the test set (17 trials).

B. Model Configurations
We implement SNLVM, LFADS, and TNDM as sequential VAEs and fix the network architectures to be the same. After fine-tuning the hyperparameters, we fix all shared hyperparameters to be the same as well. We set the latent space to be 64-dimensional and implement the RNN encoder and decoder as GRUs with 64 hidden states. Based on the principle of β-VAE [60], [61], we fine-tune and fix the weight β of the KL divergence term to 0.1 for LFADS and TNDM. We fix the weight λ_b of the behavior decoding term to 0.1 for TNDM and SNLVM. We train all models on a single Nvidia Titan RTX GPU for 1000 epochs with early stopping and a batch size of 16, using the Adam optimizer with a learning rate starting at 0.01 with adaptive decay. We implement our model based on the code released with TNDM [17]. 2

A. Quantitative Evaluation
We first ask whether our model can well explain neural population spikes, since the very first requirement for LVMs is to provide promising results in inferring single-trial neuronal firing rates. The latent variables are meaningful and provide insights into the neural data only if they faithfully explain the observed neural activities. To assess the goodness of fit quantitatively, we compute the Poisson log-likelihood as the primary metric. We also compute the root mean square error (RMSE) between the inferred firing rates and the empirical ground-truth firing rates, i.e., the peri-stimulus time histogram (PSTH), obtained by averaging neural spikes across trials within one particular condition. We evaluate behavioral decoding accuracy by fitting a linear regression model from the latent factors to the behavior and computing R² between the decoded and ground-truth behavior. These three metrics and their improved variants are common tools for evaluating LVMs [62]. All metrics are computed on held-out data.
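These metrics are simple to compute. The sketch below implements RMSE against a PSTH and the linear behavior-decoding R²; the least-squares fit stands in for the linear regression model, and the array layouts follow the notation above (function names are ours).

```python
import numpy as np

def rmse(rates_hat, psth):
    """Root mean square error between inferred rates and the empirical PSTH."""
    return float(np.sqrt(np.mean((rates_hat - psth) ** 2)))

def behavior_r2(Z, Y):
    """Fit a linear map from latent factors Z (M x T) to behavior Y (B x T)
    by least squares and report R^2 of the decoded trajectories."""
    W, *_ = np.linalg.lstsq(Z.T, Y.T, rcond=None)
    Y_hat = (Z.T @ W).T
    ss_res = np.sum((Y - Y_hat) ** 2)
    ss_tot = np.sum((Y - Y.mean(axis=1, keepdims=True)) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

R² approaches 1 exactly when the behavior is a linear function of the latent factors.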
We compare SNLVM with LFADS and TNDM across different numbers of latent factors and summarize the results in Figure 3. All models improve gradually with more latent factors, since more factors can cover more variance of the neural population activity. We also observe that SNLVM and TNDM slightly but consistently outperform LFADS in fitting neural activity and reconstructing behavior, and that the performances of SNLVM and TNDM are relatively close. The results indicate that SNLVM is competitive with previous deep LVMs. Note that SNLVM need not significantly outperform the others on these metrics, since its superiority lies in capturing latent structure.

B. Latent Representations Topology and Latent Factors
To elucidate the merit of adopting the hyperspherical latent space, we compare the estimated latent representations and the generated latent factors by LFADS, TNDM, and SNLVM. We train the models with 4 latent factors and visualize the results in Figure 4.

1) Latent Representations by t-SNE:
As shown by t-SNE, LFADS clusters the representations of neural population spikes with the same direction well, owing to the expressiveness of deep neural networks. Nevertheless, its latent variables are more sensitive to noise, and hence the clusters appear entangled. On the contrary, the latent variables inferred by TNDM and SNLVM show significantly better separability between different neural patterns and aggregation within the same ones. TNDM can be seen as LFADS + behavior modeling, while SNLVM can be seen as LFADS + hypersphere + behavior modeling. Thus we deduce that this improvement is achieved by involving behavior decoding in modeling. Behavior modeling encourages the representations to fit neural spikes and external variables simultaneously and hence serves as a strong regularization, resulting in a more constrained model. Accordingly, the posterior distributions estimated by TNDM and SNLVM are more compact and identifiable than those of LFADS, a benefit of hybrid modeling [53].
2) Latent Representations by PCA: Notably, PCA reveals the most important distinction in latent topology between the models. Since t-SNE is a nonlinear dimensionality reduction technique that attempts to preserve the local structure of neighboring points, it may lose important global structure and thus be misleading. Although PCA is generally not suitable for revealing the complex structure of a nonlinear manifold, it provides more authentic geometric information about the data, as it maintains most distance properties. Therefore, we examine the latent space structure by applying PCA as well. We observe that all models provide a circular or nearly circular latent structure, consistent with other models on the same task [16], [53]. Recall that each trial only involves one target direction. In terms of neurophysiology, the subject maintains a unitary neural state of cognitive navigation to perform one particular reaching trial, which implies that the navigational direction in cognitive space is static within each trial. Therefore, the latent representations are supposed to preserve this cognitive nature. For LFADS, the Gaussian distributional restriction encourages the points to cluster in the center, eventually leading to unavoidable entanglement. TNDM introduces additional priors, especially behavior decoding, which significantly improve the separability of different classes, yet the geometric structure of the latent representations remains vague. The results suggest that behavior modeling cannot address the latent structure issue. Eventually, for these models, it is difficult to explain why the model learns such a structure and what it means in neurophysiology. Compared with LFADS and TNDM, SNLVM precisely and elegantly captures a better structure reflecting spatial navigation in cognitive space. The results imply that the hyperspherical latent space introduces a meaningful structure we can interpret.
Besides, as a byproduct, the clusters by SNLVM are more compact and separable.
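The PCA view described above can be reproduced with a plain SVD projection of the static latent representations; the sketch below assumes trials are stacked as rows (a generic sketch, not the authors' plotting code).

```python
import numpy as np

def pca_project(G, k=2):
    """Project latent representations G (n_trials x L) onto the top-k
    principal components via SVD of the centered matrix."""
    Gc = G - G.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Gc, full_matrices=False)
    return Gc @ Vt[:k].T
```

Because PCA is a rigid rotation plus truncation, a circular arrangement on the hypersphere survives into the 2-D projection, which is what makes the circular structure visible.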
3) Latent Factors: We find that the latent factors by TNDM and SNLVM have smaller variance and clearer separation than those by LFADS, which again verifies the merit of involving task variables [53]. Moreover, TNDM introduces a prior to separate behaviorally relevant and behaviorally irrelevant factors, so half of its factors contain only homogeneous information. These factors are trivial because they repeat similar dynamics and carry little information about the neural data. However, this prior may be problematic, because it explicitly limits the model's ability to capture more complex dynamics. By contrast, all 4 factors by SNLVM capture non-trivial neural dynamics. These factors are crucial to revealing more information about the neural manifold. Moreover, the process by which spatial cognitive states drive observed neural activity matches how RNNs generate dynamics from static initial states. Thus a more accurate latent structure helps the model discover appropriate initial states, which further helps the decoder find the correct latent factors.

C. Analysis on Held-Out Direction
To further illustrate that SNLVM provides more informative representations, we conduct an experiment with one held-out direction. In this experiment, we hold out the neural spikes and the corresponding behavior of one reaching direction during training. We investigate whether the models can infer meaningful latent representations and correct behavior reconstructions for neural population spikes they had no access to during training. The results are shown in Figure 5. We observe that LFADS, TNDM, and SNLVM all obtain a decent latent structure without the held-out trials. Given the held-out neural population spikes, the inferred latent representations are close to the results obtained when training with all data. As shown by the behavior reconstruction results, all models provide imperfect yet roughly correct behavioral trajectories. This indicates that these deep generative models successfully capture the latent structure and generalize to unknown patterns.
Nevertheless, SNLVM shows the distinct merit of learning a better topological structure than LFADS and TNDM. For LFADS and TNDM, the latent structure is irregular, and it is thus difficult to locate the held-out data: their latent representations mix up with the rest completely. On the contrary, for SNLVM, compared with training on the whole dataset, the distribution of the clusters in the latent space shows an unnatural asymmetry around the position where the held-out data should be. The results are consistent with the cognitive process. When a subject searches for a location it has never seen before, it naturally refers to familiar sites and identifies a reliable direction accordingly. SNLVM reproduces this process elegantly. In comparison, models with a Euclidean space and a Gaussian distribution find it more difficult to obtain such a topological structure. Thus it is hard to interpret their latent representations with neurophysiological meaning, even though we understand faintly that they are relevant to the task.
We also investigate the results quantitatively. We fit a linear model to classify the held-out direction against the remaining ones given the latent representations and evaluate the result via the mean classification accuracy (ACC) for the held-out direction. The results show that SNLVM provides much more separable representations and therefore illustrate the significant superiority of the hyperspherical latent space over the Euclidean space.
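A minimal version of this evaluation, assuming a least-squares linear classifier with a bias term (the paper does not specify the linear model beyond fitting it to the representations, so this choice is ours):

```python
import numpy as np

def holdout_direction_accuracy(G, labels, held_out):
    """Linear one-vs-rest classifier for the held-out direction, fit by least
    squares on representations G (n_trials x L) with per-trial labels."""
    y = np.where(labels == held_out, 1.0, -1.0)     # +1 held-out, -1 the rest
    X = np.hstack([G, np.ones((G.shape[0], 1))])    # append a bias column
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean(np.sign(X @ w) == y))
```

If the held-out direction occupies its own region of the latent space, a linear boundary separates it and the accuracy approaches 1.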

A. Effects of Different κ
In our experiments, we find it beneficial to fix κ as a hyperparameter. Using a fixed κ for all the approximate posterior distributions may seem to reduce the flexibility and expressiveness of the model. However, we find that when κ is sufficiently large, the model performance is insensitive to it, because the behavior decoding scheme provides a robust inductive bias; a wide range of κ values works well in this task. Moreover, compared with the vMF distribution, the PS distribution is more stable given large κ. We perform a sweep over the setting of κ and visualize the inferred latent variables. As shown in Figure 6, we can see how the concentration parameter κ influences the latent structure. We find that with behavior modeling, when κ is larger than 10², our model is insensitive to it and the latent structure stays the same. Consequently, to further understand κ empirically, we also zero out the behavior modeling term in the objective function to eliminate its gain. We observe that when κ is small, since all the posteriors distribute more uniformly on the hypersphere, the representations are blended. On the contrary, the latent representations show clear separability and a better circular structure when κ is large enough. With an extremely large κ, one can imagine that all posterior distributions concentrate on a hyperspherical crown. Because the PS distribution provides numerical stability, the model works well even when κ is unusually large compared with the model parameters.
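A small closed-form property of the PS distribution makes this concentration effect concrete: under PS(μ, κ) the expected alignment E[μᵀx] equals (a − b)/(a + b) = κ/(κ + L − 1), so it vanishes at κ = 0 (uniform) and approaches 1 as κ grows (a tight hyperspherical crown). A one-line sketch of ours:

```python
def ps_mean_dot(kappa, L):
    """E[mu^T x] under PS(mu, kappa) on S^{L-1}: kappa / (kappa + L - 1)."""
    return kappa / (kappa + L - 1.0)
```

This also shows why larger latent dimensions L require larger κ to achieve the same concentration.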
Furthermore, we notice that the optimal κ is considerably larger than the trainable parameters of a model with proper L2 regularization; it can be seen as an outlier of the parameter distribution. From the perspective of optimization, it may therefore be difficult to learn the best value due to training stability issues, which explains why manually setting κ provides better results.

B. Ablation Study on Behavior Modeling
Both SNLVM and TNDM show the advantages of involving behavior decoding in the training process. As mentioned above, modeling task variables serves as an influential prior and helps the model learn a more identifiable latent distribution. We investigate whether our model can still provide a more interpretable latent structure than LFADS without it; if so, the result further supports the use of a non-Euclidean latent space. To do so, we set λ_b in the objective function to zero.
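A minimal sketch of how such a weighted objective might be composed is shown below. The exact term forms (Poisson reconstruction, KL, mean-squared behavior error) and all names are illustrative assumptions, not the paper's precise losses; the point is only that λ_b = 0 removes behavior modeling while leaving the rest of the objective intact:

```python
import numpy as np

def snlvm_objective(rates, spikes, behav_pred, behav_true, kl, lambda_b=1.0):
    """Illustrative training objective: Poisson reconstruction NLL plus a
    KL term (the usual ELBO pieces), and a behavior-decoding term scaled
    by lambda_b. Setting lambda_b = 0 ablates behavior modeling."""
    # Poisson NLL up to the constant log(spikes!) term.
    recon = np.mean(rates - spikes * np.log(rates + 1e-8))
    behav = np.mean((behav_pred - behav_true) ** 2)  # behavior reconstruction error
    return recon + kl + lambda_b * behav

rng = np.random.default_rng(0)
rates = rng.uniform(0.1, 5.0, size=(10, 20))
spikes = rng.poisson(rates)
v_pred, v_true = rng.standard_normal((10, 2)), rng.standard_normal((10, 2))

full = snlvm_objective(rates, spikes, v_pred, v_true, kl=0.05)
ablated = snlvm_objective(rates, spikes, v_pred, v_true, kl=0.05, lambda_b=0.0)
print(f"full: {full:.4f}, ablated (lambda_b=0): {ablated:.4f}")
```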
We first compare SNLVM and LFADS to evaluate our model quantitatively. The results are in Table I. We observe that the performance of SNLVM without behavior modeling is almost equivalent to that of LFADS. This is unsurprising: without behavior modeling, the only difference between LFADS and SNLVM is the assumption on the latent variables. Although this assumption may change the latent topological structure, it does not introduce additional inductive bias, so the quantitative results are close.
Then we compare the learned latent structures qualitatively; the results are summarized in Figure 7.

Fig. 7. Ablation study on behavior modeling. We visualize the normalized inferred latent variables via PCA and compare the results of LFADS and SNLVM with and without behavior decoding. We do not include TNDM in this comparison since behavior modeling is one of its most crucial aspects.

We observe that without behavior modeling, the geometric structure of the latent variables inferred by SNLVM still preserves the circular geometry of the reaching directions. The predominant difference is that without behavior variables, noise in the neural activity affects the latent representations more strongly, so the variance of the approximate posterior distributions is relatively larger. These results are promising indicators that behavior decoding does regularize the model to learn more compact representations for the same neural pattern, making it more robust to noise. Moreover, whether behavior decoding is included or not, the latent structure learned by SNLVM preserves the circular structure of the task consistently better than that of LFADS with a Gaussian prior in Euclidean space. The results agree with previous work and verify the advantages of a hyperspherical latent space. Note that with behavior modeling removed, our model is almost identical to LFADS except for the assumption on the latent space; we neither apply additional constraints to the latent space nor intentionally encourage the model to learn this property.

C. Computational Efficiency
Experiments have shown that sampling from a PS distribution is at least 6× faster and more stable than sampling from a vMF distribution [25]. In our experiments, the sampling algorithm accounts for only a minor part of the model; most of the computational cost lies in the encoder and decoder, so the acceleration is less pronounced. Across different experimental settings, SNLVM with the PS distribution is about 2.5× faster on average than with the vMF distribution. The effect varies with the computing device; for example, in our test, the speedup decreases to 1.9× on a single laptop NVIDIA RTX 2060 GPU.
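To illustrate where the sampling-time difference comes from, the sketch below contrasts Wood's (1994) rejection sampler for the vMF cosine with the single, rejection-free Beta draw that suffices for the PS distribution. This is our own toy comparison with made-up dimensions, not the paper's benchmark:

```python
import numpy as np

def vmf_cosine_wood(d, kappa, rng):
    """Sample the cosine t = <x, mu> of a d-dimensional vMF draw via
    Wood's (1994) rejection scheme; returns (t, proposals_used)."""
    b = (d - 1) / (np.sqrt(4.0 * kappa**2 + (d - 1) ** 2) + 2.0 * kappa)
    x0 = (1.0 - b) / (1.0 + b)
    c = kappa * x0 + (d - 1) * np.log(1.0 - x0**2)
    n_proposals = 0
    while True:  # rejection loop: each iteration needs a Beta and a uniform draw
        n_proposals += 1
        z = rng.beta((d - 1) / 2.0, (d - 1) / 2.0)
        t = (1.0 - (1.0 + b) * z) / (1.0 - (1.0 - b) * z)
        if kappa * t + (d - 1) * np.log(1.0 - x0 * t) - c >= np.log(rng.uniform()):
            return t, n_proposals

def ps_cosine(d, kappa, rng):
    """The Power Spherical cosine needs exactly one Beta draw: no rejection."""
    return 2.0 * rng.beta((d - 1) / 2.0 + kappa, (d - 1) / 2.0) - 1.0

rng = np.random.default_rng(0)
d, kappa, n = 64, 500.0, 1000
total = sum(vmf_cosine_wood(d, kappa, rng)[1] for _ in range(n))
print(f"vMF: {total / n:.2f} proposals per accepted sample; PS: exactly 1")
```

The per-proposal logarithms and the data-dependent loop are what make the vMF sampler slower and harder to vectorize on a GPU, consistent with the speedups reported above.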

VII. CONCLUSION
In this work, we discuss the neurophysiological implications of the latent representations of LVMs. At first glance, it seems natural and obvious that the latent structure should resemble the task setting; however, interpreting latent representations neurophysiologically is crucial to understanding more complex cognitive processes, and previous work fails to answer this question. Based on cognitive maps and spatial cognitive space, we point out that latent representations should preserve the nature of cognitive navigation. By introducing a simpler yet more effective prior, a hyperspherical latent space, we improve existing VAE-based deep generative models to capture the latent topological structure. The resulting geometric structure of the latent variables is naturally more informative and can be interpreted as cognitive navigation. Our insight introduces cognitive maps into LVMs for the first time. Nevertheless, it is also worth mentioning that our model needs further evaluation on other tasks: current experiments only verify tasks with static cognitive navigation, not dynamics within a single trial, and how our model can help model abstract cognitive maps remains underexplored. We hope our theory provides more insights for the community in understanding complex cognitive and perceptual processes.