Unsupervised Multimodal Word Discovery Based on Double Articulation Analysis With Co-Occurrence Cues

Human infants acquire their verbal lexicon with minimal prior knowledge of language based on the statistical properties of phonological distributions and the co-occurrence of other sensory stimuli. This study proposes a novel fully unsupervised learning method for discovering speech units using phonological information as a distributional cue and object information as a co-occurrence cue. The proposed method can acquire words and phonemes from speech signals using unsupervised learning and utilize object information based on multiple modalities—vision, tactile, and auditory—simultaneously. The proposed method is based on the nonparametric Bayesian double articulation analyzer (NPB-DAA) discovering phonemes and words from phonological features, and multimodal latent Dirichlet allocation (MLDA) categorizing multimodal information obtained from objects. In an experiment, the proposed method showed higher word discovery performance than baseline methods. Words that expressed the characteristics of objects (i.e., words corresponding to nouns and adjectives) were segmented accurately. Furthermore, we examined how learning performance is affected by differences in the importance of linguistic information. Increasing the weight of the word modality further improved performance relative to that of the fixed condition.


I. INTRODUCTION
Human infants can acquire their verbal lexicon with minimal prior knowledge based on the statistical properties of phonological distributions and co-occurrence of other sensory stimuli [1]-[3]. Regarding the importance of fundamental statistical regularity in lexical acquisition by infants, Saffran et al. observed that there are three key elements: (1) distributional cues, (2) co-occurrence cues, and (3) prosodic cues [4]. Here, distributional cues are the statistical relationships regarding the phonological information in utterances, and co-occurrence cues are the information provided by the sensory stimulus that co-occurs with a specific utterance. Prosodic cues are information (such as intonation) included in utterances and the silent sections generated between utterances. A study of infant statistical learning [5] reported that infants are sensitive to statistical regularities in many domains such as speech, music, behavior, and spatial vision. Statistical learning mechanisms allow infants to discover statistical regularities in the environment, such as the words contained in utterances.
As co-occurrence cues are described as one of the important factors in lexical acquisition, infants observe various other types of sensory stimulus simultaneously when hearing speech [6]. Humans can classify things into categories by observing various types of sensory information from early in childhood, and these categories play an important role in human cognitive function [7]. Additionally, it is considered that infants can change the type of important information to which they are attending depending on the progress of learning [8], [9]. However, how to specifically change the importance of given information remains an open issue. Therefore, this study focuses on the importance of co-occurrence cues in the lexical acquisition process and the effect of changes in their importance for efficient learning.
There is an approach that aims to elucidate the lexical acquisition process by imitating the function of humans and expressing it via machine-learning methods [10]-[15]. This type of approach is referred to as a constructive approach. The findings obtained from these computational models that partially imitate human language learning functions contain clues for elucidating human language learning functions. This approach can be used to develop robots with functions that more closely approximate those of humans. In this study, we focused on the language learning function of infants, who can discover speech units from spoken utterances. This function is expressed as a speech unit discovery method via unsupervised learning that does not use labeled data for its machine learning [14]. In speech unit discovery using computational models, words and phonemes are often considered as speech units [13]-[15]. There have been various approaches in this area of research, such as those that assume that both words and phonemes are speech units [13], [14], or those that focus on only words [16] or only phonemes [17]. However, the segmentation accuracy of the latter methods is low due to several factors. For example, an over-segmentation of words can occur based on the recognition error of phonemes. Therefore, it is important to consider both words and phonemes as speech units, using a double articulation analyzer (DAA).
Several computational models for the discovery of speech units have been proposed to utilize other types of information that co-occur with linguistic information [18]-[23]. There are various types of co-occurrence cues and the relationships between co-occurrence cues and linguistic information that have some explanatory value. Many studies have assumed a set of images and linguistic captions that explain the image [21]-[23]. With such methods, the accuracy of speech unit discovery is improved by learning the association between the object in the image and the speech unit, as compared with cases in which no image is given. However, these studies are not aimed at lexical acquisition and use only one type of co-occurrence information. When considering the imitation of human statistical learning, it is desirable to use co-occurrence information based on multiple types of sensory stimuli simultaneously. There has been some previous research that meets this requirement [18]. In this study, multiple modalities, specifically, the image of the object, the tactile feel when the object is grasped, and the sound when the object is shaken, are handled as co-occurrence cues for the spoken utterances that express the characteristics of the object. Similar studies [19], [20] used the position of a robot and the image at its place as co-occurrence cues for spoken utterances that express places. However, these studies assume that phonemes and syllables have already been acquired, and thus cannot conclude that lexical acquisition is completely achieved via unsupervised learning.

Fig. 1: Overview of this study. Phonemes and words are discovered by simultaneously using human utterances representing objects' characteristics and the multimodal object information that co-occurs with utterances.
In this study, we propose co-occurrence DAA, a novel fully unsupervised learning method that discovers phonemes and words using phonological information as distributional cues and multiple other forms of sensory information as co-occurrence cues. The proposed method is based on the probabilistic generative model, HDP-HLM+MLDA, which integrates a hierarchical Dirichlet process hidden language model (HDP-HLM) [14] and multimodal latent Dirichlet allocation (MLDA) [24]. The integration of the two models is based on the concept of Symbol Emergence in Robotics Tool Kit (SERKET) [25], [26] using the sampling importance resampling (SIR) method [27]. SERKET is the theoretical framework for the integration of probabilistic generative models, and the construction and demonstration of the integrated inference algorithm with SIR is one of the novelties of this study. The overview of this study is presented in Fig. 1. We investigate how co-occurrence cues affect phoneme and word discovery performance and compare learning results depending on the importance of co-occurrence cues. Hence, the main contributions are as follows:
1) We construct a fully unsupervised learning method that uses not only distributional cues but also co-occurrence cues for phoneme and word discovery.
2) We show that using co-occurrence cues improves word segmentation performance (mainly shown in Table II of Experiment 1).
3) We suggest that co-occurrence cues regarding objects facilitate the discovery of words regarding objects (mainly shown in Figure 7 of Experiment 1).
4) Performances of word discovery and object categorization are further improved by increasing the weight of the word modality (mainly shown in Table IV and Figure 9 of Experiment 2).
This study is novel, and its results can be applied in basic research domains focused on a better understanding of language acquisition, as well as in a variety of practical applications using language (e.g., human-robot interactions). Here, we open the source code of the proposed method and speech dataset on GitHub (footnote 1).
The remainder of this paper is structured as follows. First, Section II describes previous research that examines lexical acquisition and categorization by infants, computational models for phoneme and word discovery, and word discovery methods using co-occurrence cues. Next, Section III introduces the conventional methods, MLDA and NPB-DAA, as the background for the proposed method. Section IV describes the proposed method. Then, we describe the experiments in Sections VI and VII. Finally, we provide a conclusion and directions for future work in Section VIII.

II. RELATED WORK
We describe four types of related work considered in this study. Section II-A describes studies on lexical acquisition and categorization in infants. Section II-B describes how the constructive approach has been applied to lexical acquisition. Section II-C describes unsupervised speech unit discovery methods that work from speech data only. Section II-D describes word discovery methods that use co-occurrence cues.

A. Lexical Acquisition and Categorization in Infants
Various approaches have been studied to date to elucidate the factors influencing lexical acquisition in infants [3]-[5], [8], [9], [28]. In general, it is believed that infants discover words using speech statistical distribution information. For example, in one study on the role of distributional cues in word segmentation [4], a word segmentation experiment was conducted using an artificial language and adult subjects. These experiments suggested that distributional cues play an important role in the early word segmentation of language learners. However, word segmentation using distributional cues alone is difficult owing to biases and deficiencies in observed words during language learning. Moreover, the language input may vary due to factors such as dialect, accent, speaking rate, and external environment and context changes. Therefore, Saffran et al. contended that not only distributional cues but also multimodal sensory information, such as prosodic and co-occurrence cues, are important in lexical acquisition [4].
Although experiments with infants have reported some results, there are some remaining issues. Pelucchi et al. [2] showed that infants are sensitive to syllable transition probabilities in natural language stimuli and that statistical learning is robust enough to support lexical acquisition in the real world. One study on word segmentation for infants learning English [8] focused on accents during speech and showed that distributional cues play an important role in the early stages of word segmentation learning. Therefore, it is considered that infants change the importance assigned to each source of information depending on their progress in the language learning process. Kuhl et al. [9] investigated whether the importance of information affects perceptual accuracy as learning progresses. Previous experimental evaluations [2], [8], [9] are widely used to assess language learning. However, because these behavioral experiments were conducted after learning, they are susceptible to various external factors. For example, they cannot observe the dynamic progress of learning and similarly cannot target adult subjects. Choi et al. [28] proposed using the measurement results of electroencephalograms worn during the experiment to overcome these problems. However, this introduces new issues, such as the costly nature of electroencephalogram measurements. Therefore, a constructive approach based on a computational model, as introduced in Section II-B, is useful.
Co-occurrence cues have been described as one of the most important factors in lexical acquisition. Infants can observe various other types of sensory stimuli and hear speech simultaneously. In fact, it has been reported that 10-month-old infants can discover simple categories from visual information [6]. In this way, it has been observed that humans can classify things into categories by observing various types of sensory information from early childhood and that these categories play an important role in human cognitive function [7]. One study of language acquisition by infants reported that infants tend to understand a word as the name of a category to which the target object belongs, rather than as a proper noun [29]. However, the nature and details of the relationship between lexical acquisition and category formation in infant development remain unresolved and controversial [30]. Additionally, Okanoya et al. presented a hypothesis of string-context mutual segmentation in language evolution [31]. Our proposed method can be interpreted as a computational model that represents the functions an agent should have to embody this hypothesis.

B. Constructive Approach to Lexical Acquisition
Several studies have used a constructive approach to imitate the functions and developmental processes of humans and express them as machine learning methods to further elucidate the process. The advantage of this approach is that it can be analyzed relatively easily, with the learning results and parameters as a computational process. Of course, it cannot be concluded at this stage whether the machine learning methods used in this approach accurately represent the human developmental process. However, the knowledge obtained from methods that have functions similar to human lexical acquisition can and will be used to understand the human lexical acquisition process better. In addition, the knowledge obtained from machine learning methods will be used to develop robots that have functions closer to humans.
In recent years, several studies on lexical acquisition using a constructive approach have been conducted [12]-[15], [32]. Several machine learning methods imitate the lexical acquisition process of infants using a constructive approach, such as the phoneme and word discovery method known as the nonparametric Bayesian double articulation analyzer (NPB-DAA). NPB-DAA is an unsupervised learning method based on Bayesian inference, which assumes that the time-series data has a double articulation structure. Here, double articulation refers to a structure in which the time-series data can be segmented into a certain unit, and each unit can also be segmented into chunks. For example, human utterances can be segmented into units of words, and each word can be segmented into another unit called phonemes; thus, speech has a double articulation structure. One of the main features of NPB-DAA is that not only can words and phonemes be acquired through fully unsupervised learning, but it can also be applied to relatively small datasets. Therefore, in this study, NPB-DAA was used as the base of the proposed method, and the outline is described in Section III.
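The double articulation structure described above can be illustrated with a small sketch. The lexicon, the phoneme symbols, and the function name `articulate` are hypothetical and chosen only for illustration; NPB-DAA infers both levels from speech features rather than from a given dictionary.

```python
# Hypothetical toy lexicon: each word maps to its phoneme sequence.
LEXICON = {
    "this": ["dh", "ih", "s"],
    "is": ["ih", "z"],
    "a": ["ah"],
    "ball": ["b", "ao", "l"],
}


def articulate(words, lexicon):
    """Expand a word sequence into (word, phonemes) chunks and a flat phoneme string."""
    # First articulation: the utterance is a sequence of words.
    chunks = [(word, list(lexicon[word])) for word in words]
    # Second articulation: each word is a sequence of phonemes.
    flat = [phoneme for _, phonemes in chunks for phoneme in phonemes]
    return chunks, flat


chunks, flat = articulate(["this", "is", "a", "ball"], LEXICON)
print(chunks)
print(flat)
```

An unsupervised learner observes only a continuous signal corresponding to the flat sequence and has to recover both the chunk boundaries and the units themselves.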
In the constructive approach for lexical acquisition, many studies only used speech signals. For example, a lexical discovery method [13] that supports variously changing phoneme and word expressions extended adaptor grammars [33], a nonparametric Bayesian morphological analysis method. However, owing to limitations in the noisy-channel model used for modeling variability, sufficient lexical acquisition performance could not be achieved in that study. Studies on syllables are common in the field of lexical acquisition and speech recognition tasks by infants [8], [34], [35]. In contrast, machine-learning studies focusing on syllables are relatively rare because it is difficult to achieve unanimity in the detection and definition of syllables [12]. Previous experimental results, however, have shown high accuracy, especially in word segmentation, which is useful for future reference in the approach of word discovery. Audio Word2vec [15], which is an extension of Word2Vec [36] applied to speech data, segments speech utterances at the word level and then converts those words into vector representations. However, in these studies, speech unit discovery was performed using only the information obtained from the speech signal, and other sensory co-occurrence cues were not used.

C. Unsupervised Speech Unit Discovery Methods from Speech Data Only
Lexical acquisition by unsupervised learning is cost-effective because it does not require a large amount of labeled data to be prepared for learning. Studies on discovering units, such as words, from spoken utterances using unsupervised learning have been conducted using various approaches [16], [17], [37]-[40]. The main purpose of these previous studies was to enable automatic speech recognition to be applied to languages with few resources for learning, rather than imitating the infant statistical learning process via the constructive approach, as was introduced in the previous section. Kamper et al. [16] proposed a method that used acoustic word embeddings in word segmentation by unsupervised learning; however, they did not explicitly deal with phoneme or syllable segmentation but focused only on word segmentation.
In research on speech unit discovery using unsupervised learning, a method based on a variational auto-encoder (VAE) was proposed [38]-[40]. The Bayesian hidden Markov model VAE [38] is a speech unit discovery method that extends VAE by embedding a Bayesian framework in the hidden Markov model VAE [41]. Specifically, by assuming the Dirichlet process (DP) as a prior distribution for the distribution of speech units, it is possible to automatically infer the total number of speech units. Our proposed method can also automatically infer the number of phonemes and words using DP. Neural network-based speech representation learning [39] can obtain discrete representations using vector quantized VAE (VQ-VAE) [42]. Additionally, it can retain a significant amount of linguistic information and the invariance of the speaker. Niekerk et al. [40] investigated the usefulness of vector quantization in learning representations that separate speech content and the characteristics peculiar to the speaker. Recently, wav2vec-U [43], a method for unsupervised speech recognition using phonemized unlabeled text via generative adversarial networks [44], has been developed. These prior studies showed high performance in word discovery. However, the models used did not use purely unsupervised learning from only speech data and functioned with preliminary assumptions such as the use of texts or codebooks of phonemes. Therefore, they are different from developmental models that imitate the lexical acquisition processes of infants with the aim of understanding their functions.

D. Word Discovery Methods Using Co-occurrence Cues
Some studies have taken the approach of using co-occurrence cues other than utterances simultaneously in word discovery [18], [20]-[23]. The motivations for using information other than utterances include improving the performance of word discovery and providing linguistic connections to co-occurrence cues. Nakamura et al. [18] proposed a word discovery method using multimodal sensor data that can be observed from an object as co-occurrence cues in word discovery. In that study, object categorization was performed using utterance information and multimodal sensor data, and each word was linked to the object category. Specifically, the nested Pitman-Yor language model (NPYLM) [45] was used for word discovery, and multimodal latent Dirichlet allocation (MLDA) [24] was used for object categorization. NPYLM can discover words via unsupervised learning, but because the input needs to be in text format, it is assumed that phonemes or syllables can be recognized. MLDA is an unsupervised categorization method that can handle multiple modalities simultaneously (for details, see Section III-A). In addition, SpCoA++ [20], SpCoSLAM [19], and ReSCAM [46] can learn the place category and the lexicon based on the syllable recognition lattices and the sensor observations about the place as co-occurrence cues. These studies reported that simultaneous learning of categories and the lexicon improves accuracy in both word segmentation and categorization. In addition to the situational context co-occurring with speech, leveraging a top-down grammar learning process also improves word segmentation performance [47]. However, these studies assumed a certain level of prior knowledge of phonemes and syllables. In our study, we propose a method that can simultaneously discover phonemes and words by building on the above approach. This can be interpreted as a machine learning method that imitates the process of an infant acquiring a lexicon.
Although not focused on lexical acquisition or speech unit discovery, some studies link speech and images [21]-[23]. The visually linked speech recognition model projects speech utterances and images into a common semantic space [21]. The method for finding a word and associating that word with an object in an image uses both the image and its speech caption [22]. Such a method does not use existing speech recognition devices or prior linguistic annotations. However, word segmentation using this method is insufficient for sections of speech that are not sufficiently associated with images. The hybrid model comprising a deep neural network and a hidden Markov model discovers words from images representing objects and their audio captions [23]. Because the above model does not consider word-level information, there remains the problem of confusing different words that share phonemes.

III. FOUNDATIONAL METHODS
The proposed method is based on NPB-DAA, which is an unsupervised phoneme and word discovery method from phonological features, and MLDA, which is an unsupervised object categorization method for multimodal information obtained from objects. This section provides an overview of the two foundational methods: MLDA in Section III-A and NPB-DAA in Section III-B. The integrated model and inference in the proposed method are described in Section IV.
MLDA can discover the category of an object by clustering observation data obtained when a robot sees, grasps, and makes sounds with various objects, without requiring hand-labeled category labels. NPB-DAA can segment speech into phonemes and words based on only features of the speech, without any hand-labeled segmentation boundary. Therefore, MLDA and NPB-DAA may not always produce accurate results, and different variations of results may be obtained depending on the ambiguity of the observation data. For example, a robot may observe various objects including coins and buttons, but MLDA may classify them into the same category owing to their visual similarity. On the other hand, humans may verbally say "This is a coin found on the street corner" or "A round button fell", and NPB-DAA may segment the speech data into words such as "/coi/ /nf/" and "/but/ /to/ /nf/". In our study, we integrate these methods by connecting the features of objects and speech in robots and making them refer to each other's learning results, to teach the robot that "coin" and "button" are different words and belong to different categories.
A. Multimodal Latent Dirichlet Allocation: MLDA
Nakamura et al. [24] extended the latent Dirichlet allocation (LDA) [48] as an MLDA that can generalize multiple types of sensory observation simultaneously to enable object categorization using multimodal data. For instance, images may have certain colors and shapes while audio data may have specific acoustic features. By extracting common categories across these different modalities, MLDA can understand the relationships between multiple modalities. The word distributions are obtained based on the observed frequencies of words for each category. It has been shown that object categorization using MLDA is closer to human senses than categorization using a single modality. For more details on this approach, refer to the original MLDA paper [24].
LDA [48] is a representative method for topic modeling. The original LDA was developed to estimate the potential topic, that is, the latent category, for each word from text documents including many sentences. Figure 2a shows the graphical model of LDA. $o^{w}_{d,i^{w}}$ is a bag-of-words representation in a document. The bag-of-words representation is a way to represent text as a collection of words and their frequencies, without considering the order in which they appear (footnote 2). $\theta^{w}$ represents the word appearance probability for each category, and $\beta^{w}$ is a hyperparameter of the Dirichlet prior distribution. $z^{w}$ refers to the index of the category assigned to each word.
$\pi$ is a parameter of the multinomial distribution representing the probability of the appearance of each category, and the Dirichlet prior distribution with $\alpha$ as a hyperparameter is used as the prior distribution of this multinomial distribution. $D$ is the number of documents, $I^{w}$ is the number of words in a document, and $K$ is the number of topics.
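As a minimal illustration of the bag-of-words representation used as the LDA observation, the sketch below counts word occurrences over a fixed vocabulary. The vocabulary, the example document, and the function name `bag_of_words` are hypothetical.

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Return word counts in vocabulary order, ignoring word order in the document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


vocabulary = ["ball", "frog", "round", "soft"]
document = ["round", "ball", "ball", "soft"]
print(bag_of_words(document, vocabulary))  # [2, 0, 1, 1]
```

Any permutation of the same tokens yields the same vector, which is exactly the information that the representation discards.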
Figure 2b shows the graphical model of MLDA. Here, the superscripts $w$, $a$, $h$, and $v$ represent different modalities, indicating linguistic, auditory, tactile, and image information, respectively. $o^{*}$ refers to the features of each modality. $\theta^{*}$ represents the appearance probability of different features for each category in each modality, and each follows the Dirichlet prior distribution with $\beta^{*}$ as a hyperparameter. $z^{*}$ refers to the index of the category assigned to each feature in each modality. $\pi$ is a parameter of the multinomial distribution, and $\alpha$ is a hyperparameter of the Dirichlet prior distribution. The number of objects is $D$, the number of observed features is $I^{*}$, and the number of categories is $K$. The features of each modality are represented as bag-of-features.
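The generative story implied by this graphical model can be sketched for a single object as follows. This is a minimal sketch, assuming symmetric Dirichlet priors and toy dimensions; the function names, the modality sizes, and the hyperparameter values are illustrative and are not taken from the paper.

```python
import random


def sample_dirichlet(rng, concentration):
    """Draw from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [x / total for x in draws]


def generate_object(rng, alpha, theta, n_features):
    """Generate bag-of-features histograms for one object.

    theta[m][k] is the feature distribution of category k in modality m,
    and n_features[m] is the number of observed features I^m.
    """
    n_categories = len(next(iter(theta.values())))
    pi = sample_dirichlet(rng, [alpha] * n_categories)  # pi ~ Dir(alpha)
    histograms = {}
    for m, theta_m in theta.items():
        dim = len(theta_m[0])
        hist = [0] * dim
        for _ in range(n_features[m]):
            z = rng.choices(range(n_categories), weights=pi)[0]  # z^m ~ Mult(pi)
            o = rng.choices(range(dim), weights=theta_m[z])[0]   # o^m ~ Mult(theta^m_z)
            hist[o] += 1
        histograms[m] = hist
    return pi, histograms


rng = random.Random(0)
n_categories = 2
dims = {"w": 4, "v": 3, "a": 3, "h": 3}  # assumed histogram sizes per modality
beta = 0.5                               # assumed symmetric hyperparameter
theta = {m: [sample_dirichlet(rng, [beta] * dim) for _ in range(n_categories)]
         for m, dim in dims.items()}
counts = {"w": 5, "v": 20, "a": 10, "h": 10}
pi, histograms = generate_object(rng, 1.0, theta, counts)
print(histograms)
```

Because all modalities share the same category proportions $\pi$ for an object, evidence from one modality (for example, words) constrains the categories inferred for the others.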
The collapsed Gibbs sampler is used for MLDA parameter estimation, as shown in Algorithm 1. The collapsed Gibbs sampler uses conditional probabilities, marginalized over $\pi$ and $\theta^{m}$, for the category $z^{m}_{di}$ assigned to the $i$-th feature of the $d$-th object in modality $m$, as given in Eq. (1), where $\mathrm{Dim}(m)$ is the number of dimensions of the histogram of modality $m$. The subtraction subscript in Eq. (1) indicates that the data at that index are excluded from the histogram.
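For reference, the standard collapsed Gibbs update for LDA-style models takes the following form when written with the symbols used in this section. This is a hedged reconstruction under that assumption, not a verbatim copy of the paper's equations; in particular, the marginal counts $N_{kd}$, $N_{mko^{m}}$, and $N_{mk}$ are assumed here to be sums of the count $N_{mkdo^{m}}$ (the number of features $o^{m}$ of modality $m$ in object $d$ assigned to category $k$), and the superscript $-mdi$ is assumed to mark exclusion of the feature being resampled.

```latex
\begin{aligned}
P\left(z^{m}_{di}=k \mid \mathbf{z}^{-mdi}, \mathbf{o}^{m}, \alpha, \beta^{m}\right)
  &\propto \left(N^{-mdi}_{kd}+\alpha\right)
     \frac{N^{-mdi}_{mko^{m}}+\beta^{m}}{N^{-mdi}_{mk}+\mathrm{Dim}(m)\,\beta^{m}}, \\
N_{kd} &= \sum_{m}\sum_{o^{m}} N_{mkdo^{m}}, \qquad
N_{mko^{m}} = \sum_{d} N_{mkdo^{m}}, \qquad
N_{mk} = \sum_{d}\sum_{o^{m}} N_{mkdo^{m}}, \\
\hat{\theta}^{m}_{ko^{m}} &= \frac{N_{mko^{m}}+\beta^{m}}{N_{mk}+\mathrm{Dim}(m)\,\beta^{m}}.
\end{aligned}
```

The first factor favors categories already frequent in object $d$ across all modalities, while the second favors categories that frequently emit the observed feature in modality $m$.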
In addition, $N_{mkdo^{m}}$ is the count of data assigned to category $k$ and feature $o^{m}$ of modality $m$ in the $d$-th object. Among the related counts, $N_{mk}$ is the count assigned to category $k$ over all features in all objects for modality $m$. The global parameters of MLDA are then obtained from these counts as the estimation result.
[Algorithm 1: collapsed Gibbs sampler for MLDA; sampling is repeated until a predetermined exit condition is satisfied.]
[...] phoneme and the acoustic features that make up the phoneme.
The word model consists of a phoneme bigram model and a word dictionary. The bigram language model is the probability of transitioning to the word that appears after each word, and the word dictionary is the probability of the phonemes that make up each word. For more details, refer to previous research [14]. NPB-DAA estimates latent phonemes, latent words, language models, and acoustic models, which are the latent variables of HDP-HLM, using the blocked Gibbs sampler (footnote 3). NPB-DAA uses a nonparametric Bayesian method (specifically, the stick-breaking process (SBP) [49], which is based on the Dirichlet process) to automatically estimate the appropriate number of categories (i.e., the number of phonemes and word types) for the data. In practice, a weak-limit approximation [50] of the SBP is used to specify the maximum number of categories for implementation. For the inference algorithm, refer to the paper in which it was defined [14].
Footnote 3: HDP-HLM is the name of the probabilistic generative model, and NPB-DAA is the name of the inference algorithm for finding phonemes and words in HDP-HLM by the blocked Gibbs sampler.
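The weak-limit approximation mentioned above can be sketched as a finite Dirichlet construction: a global weight vector over at most `max_states` categories is drawn first, and each transition row is then drawn around it. This is a minimal sketch of the commonly used weak-limit form for hierarchical Dirichlet process models; the function names and the values of `gamma`, `alpha`, and `max_states` are illustrative assumptions, not the settings of NPB-DAA.

```python
import random


def sample_dirichlet(rng, concentration, floor=1e-12):
    """Draw from a Dirichlet distribution by normalizing gamma draws."""
    # The floor keeps every shape parameter strictly positive.
    draws = [rng.gammavariate(max(a, floor), 1.0) for a in concentration]
    total = sum(draws)
    if total == 0.0:  # numerical safeguard; practically never triggered
        return [1.0 / len(draws)] * len(draws)
    return [x / total for x in draws]


def weak_limit_transitions(rng, gamma, alpha, max_states):
    """Finite approximation of HDP transitions with at most max_states categories."""
    # Global weights: beta ~ Dir(gamma / L, ..., gamma / L)
    beta = sample_dirichlet(rng, [gamma / max_states] * max_states)
    # Each row: pi_j ~ Dir(alpha * beta_1, ..., alpha * beta_L)
    pi = [sample_dirichlet(rng, [alpha * b for b in beta]) for _ in range(max_states)]
    return beta, pi


beta, pi = weak_limit_transitions(random.Random(0), gamma=5.0, alpha=10.0, max_states=8)
print([round(b, 3) for b in beta])
```

Categories that receive negligible global weight are rarely visited by any row, which is how the effective number of phonemes or words can stay below the specified maximum.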
Figure 3 shows the graphical model representation of HDP-HLM, which is a generative model. The generative process is omitted in this paper. In HDP-HLM, latent words continuously generate observation data for a certain period. In addition, the latent word corresponds to the word $z_s$, and the $i$-th word $z_s = i$ has the phoneme string $w_i = (w_{i1}, \dots, w_{iL_i})$. Here, $L_i$ represents the number of phonemes of the $i$-th word $w_i$. The superscripts LM and WM represent the language model and the word model, respectively. A word model is part of a language model that expresses the kind of phonemes each word is composed of and is referred to as a word dictionary. $\beta^{LM}$ and $\beta^{WM}$ are base measures for the language model and the word model, respectively. In addition, $\alpha^{LM}$, $\gamma^{LM}$, $\alpha^{WM}$, and $\gamma^{WM}$ are the hyperparameters of the language model and the word model, respectively. $\pi^{LM}_{j}$ is the output from $\mathrm{DP}(\alpha^{LM}, \beta^{LM})$, which expresses the transition probability. $\pi^{WM}_{j}$ is the output from $\mathrm{DP}(\alpha^{WM}, \beta^{WM})$, which expresses the transition probability of the next latent character string from a latent phoneme $j$. $w_{ik}$ represents the $k$-th latent phoneme in the $i$-th latent word. In addition, $l_{sk}$ is the $k$-th latent character in the $s$-th latent word $z_s$. $\omega_{l_{sk}}$ is a parameter of the duration distribution of the latent character $l_{sk}$. In HDP-HLM, the latent word $z_s$ is generated by the previous latent word $z_{s-1}$ and the language model. The duration $D_{sk}$ of $l_{sk}$ is sampled based on the determined sequence $w_{z_s}$. The observation data $y_t$ is generated from the output distribution $h(\theta_{x_t})$ corresponding to $x_t = l_{s(t)k(t)}$. Here, the map functions $s(t)$ and $k(t)$ represent the word and phoneme indicators of the latent word string at time $t$, respectively. Here, the observation time-series data $y_t$ is associated with the feature vector obtained from the audio signal at time $t$.
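Since the generative process is omitted here, the following toy sketch may help fix the roles of the variables. It assumes fixed (not sampled) language model, dictionary, and acoustic parameters, a duration of one frame plus a Poisson draw, and one-dimensional Gaussian emissions; all of these choices, and every name and number in the code, are assumptions made for illustration rather than the settings of HDP-HLM.

```python
import math
import random


def sample_poisson(rng, lam):
    """Knuth's method for a Poisson draw; adequate for small lam."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def generate_utterance(rng, n_words, bigram, dictionary, duration_rate,
                       emission_mean, noise_std=0.1):
    """Return latent words z_s, frame-level phoneme labels x_t, and observations y_t."""
    words, labels, frames = [], [], []
    z = rng.choice(sorted(dictionary))  # uniform first word (assumption)
    for _ in range(n_words):
        words.append(z)
        for phoneme in dictionary[z]:  # l_sk follows the phoneme string w_{z_s}
            duration = 1 + sample_poisson(rng, duration_rate[phoneme])  # D_sk >= 1
            for _ in range(duration):
                labels.append(phoneme)  # x_t = l_{s(t)k(t)}
                frames.append(rng.gauss(emission_mean[phoneme], noise_std))  # y_t
        next_probs = bigram[z]  # z_{s+1} depends on z_s via the language model
        z = rng.choices(list(next_probs), weights=list(next_probs.values()))[0]
    return words, labels, frames


# Toy parameters (hypothetical).
dictionary = {"ball": ["b", "o", "l"], "soft": ["s", "o", "f", "t"]}
bigram = {"ball": {"ball": 0.2, "soft": 0.8}, "soft": {"ball": 0.7, "soft": 0.3}}
duration_rate = {p: 2.0 for p in "bolsft"}
emission_mean = {p: float(i) for i, p in enumerate("bolsft")}

words, labels, frames = generate_utterance(
    random.Random(0), 3, bigram, dictionary, duration_rate, emission_mean)
print(words, len(frames))
```

Inference in NPB-DAA runs in the opposite direction: only the frames are observed, and the word sequence, phoneme labels, durations, and all model parameters are latent.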

IV. PROPOSED METHOD: UNSUPERVISED PHONEME AND WORD DISCOVERY METHOD WITH CO-OCCURRENCE CUES
In this section, we describe the proposed method, which performs unsupervised phoneme and word discovery using multimodal sensor data obtained by a robot. The integration of the two methods involves one module sending inference results to the other module, and iterative learning improves overall learning. Based on the information about the object category, word segmentation is more accurately corrected. Using word segmentation results, better object categories are formed. Initially, uncertain categories or incorrect words are gradually self-organized and corrected.
The proposed model, HDP-HLM+MLDA, is the integration of HDP-HLM (NPB-DAA) and MLDA (Fig. 5). An overview of the proposed inference algorithm, co-occurrence DAA (footnote 4), is shown in Fig. 4. The inference algorithm is realized by sampling importance resampling (SIR) [27], which samples candidates of word sequences using NPB-DAA and weights the candidates using the MLDA. In other words, this algorithm performs iterative learning with NPB-DAA and MLDA.

[Figure: overview diagram with panels "Phoneme and word discovery" and "Spoken utterances regarding objects"; example utterances: "There is a stuffed frog.", "It's a small ball.", "The stuffed animal is soft.", "The round one is a ball."]

A. HDP-HLM+MLDA: Building an Integrated Probabilistic Generative Model
To integrate HDP-HLM and MLDA, we adopt the idea of the Symbol Emergence in Robotics Tool KIT (SERKET) [25], [26], which is an integration framework for probabilistic generative models. SERKET makes it possible to easily construct a large-scale generative model and its inferences by hierarchically connecting the base models, which are its constituent units, while maintaining the independence of each program that is the integration source. By constructing the integrated model according to the SERKET framework, it is possible to optimize the parameters of the integrated model, even if the parameters estimated independently in each base model are used. In the proposed method, $o^{w}$ corresponding to the word sequences is shared by HDP-HLM and MLDA.
A graphical model of the proposed method is shown in Figure 5. Each variable follows the definition in the graphical models of the base models shown in Section III. In the proposed graphical model, the part corresponding to HDP-HLM is expressed compactly, and some variables are renamed to avoid duplication. Here, the word sequences $o^w$, the language model (LM), the word model (WM), including the word dictionary, and the acoustic model (AM) correspond to $z_s$, $\pi^{LM}$, $\{\pi^{WM}, W\}$, and $\{\omega, \theta\}$ in the graphical model of HDP-HLM, respectively.
The probability distribution that generates a word sequence, $P(o^w \mid z^w, \theta^w, G)$, can be defined using the unigram rescaling (UR) approximation [51]. The UR approximation expresses the category-dependent N-gram word probability (Eq. (7)), where the global parameters of HDP-HLM related to the proposed method are denoted as G = {AM, WM, LM}.
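For orientation, unigram rescaling is usually written in the following general form. This is a sketch of the standard technique, not a reconstruction of Eq. (7); the symbols $w_i$ (the i-th word), $h_i$ (its N-gram history), and $z$ (the category) are notation assumed here for illustration only.

```latex
P(w_i \mid h_i, z) \;\propto\; P(w_i \mid h_i)\,\frac{P(w_i \mid z)}{P(w_i)}
```

Under this reading, the first factor would play the role of the N-gram word probability estimated by HDP-HLM, and the ratio would play the role of the category-dependent correction supplied by MLDA.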

B. Co-occurrence DAA: Procedure of the Inference Algorithm by NPB-DAA and MLDA
The inference algorithm applies SIR to the UR approximation based on the SERKET framework [25]. The target distribution is the posterior category-dependent word probability distribution $P(o^w \mid y, z^w, \theta^w, G)$. The proposal distribution is the N-gram word probability distribution $P(o^w \mid y, G)$, which is estimated by NPB-DAA. Resampling is performed according to the weights provided by the word distribution in the object category by MLDA. This learning procedure enables the proposed method to acquire the lexicon while considering the object categorization results of MLDA and to categorize objects using the word sequences estimated by NPB-DAA.
Specifically, the following procedures and formulas are used for learning. First, the model parameters are initialized in the same way as in the previous NPB-DAA paper [14]. The set of initial parameters $G^{(q)}_0$ (t = 0) is sampled independently for each of the Q candidates. Next, Steps I to V below are iterated T times (t ∈ {1, 2, ..., T}).
I. To generate word sequences that take the object categories into account using SIR, we sample Q candidates from the candidates obtained at the (t − 1)-th iteration, with probability proportional to their respective weights. Then, the parameters of each candidate are updated and the word sequences are estimated by NPB-DAA(·), which denotes one iteration of the blocked Gibbs sampler of NPB-DAA. Here, the set of hyperparameters of HDP-HLM is H = {G, H, $\gamma^{LM}$, $\alpha^{LM}$, $\gamma^{WM}$, $\alpha^{WM}$}, the q-th global parameter candidate of HDP-HLM at the t-th iteration is $G^{(q)}_t$, and the global parameter candidate resampled according to the weight W is $G^{(q)}_{\mathrm{pre}}$. In addition, δ(·) is the Dirac delta mass in Eq. (8). Note that at the first iteration (t = 1), no weight from the (t − 1)-th iteration exists, so the initial value $G^{(q)}_0$ is copied to $G^{(q)}_{\mathrm{pre}}$.
II. Object categorization by MLDA uses the q-th word sequence candidate $o^{w(q)}$ and the co-occurrence cues $o^{a,h,v}$, and yields the global parameters $\theta^{w(q)}$ and $\pi^{(q)}$. Here, the set of word sequences for each candidate $o^{w(q)}$ is converted to a bag-of-words representation. We apply the collapsed Gibbs sampler until the MLDA categorization has sufficiently converged. This process is performed for each set of word sequences $o^{w(q)}$ estimated by all Q parameter candidates.
III. The category k assigned to each word is sampled from the probability distribution based on $\theta^{w(q)}$ and $\pi^{(q)}$. The probability that category k is assigned to each word $o^{w(q)}_{d,s,i_w}$ is computed from these parameters.
IV. The weight of each HDP-HLM parameter candidate, which corresponds to the second term on the right side of Eq. (7), is calculated. The weight is then normalized so that it can be used as a probability. This normalized weight is held until the (t + 1)-th iteration and used to sample candidates.
V. The candidate with the largest weight is adopted as the estimation result at the t-th iteration. The iteration counter is incremented (t ← t + 1), and the procedure returns to Step I.
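The resampling part of Steps I, IV, and V can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the candidates are placeholder strings instead of HDP-HLM parameter sets, and the log-weights are made-up numbers standing in for the quantities that Step IV would compute from MLDA.

```python
import math
import random


def normalize_log_weights(log_weights):
    """Turn unnormalized log-weights into probabilities that sum to 1."""
    largest = max(log_weights)  # subtract the maximum for numerical stability
    exps = [math.exp(lw - largest) for lw in log_weights]
    total = sum(exps)
    return [e / total for e in exps]


def resample_candidates(candidates, weights, rng):
    """Draw len(candidates) candidates with probability proportional to weight."""
    return rng.choices(candidates, weights=weights, k=len(candidates))


def best_candidate(candidates, weights):
    """Return the candidate with the largest normalized weight."""
    best_index = max(range(len(weights)), key=lambda q: weights[q])
    return candidates[best_index]


# Hypothetical example with Q = 4 candidates (placeholder names and values).
rng = random.Random(0)
candidates = ["G1", "G2", "G3", "G4"]
log_weights = [-12.0, -10.0, -15.0, -10.5]

weights = normalize_log_weights(log_weights)
estimate = best_candidate(candidates, weights)             # result at iteration t
resampled = resample_candidates(candidates, weights, rng)  # input to iteration t + 1
```

Working in the log domain is a design choice of this sketch: a weight formed from many per-word factors can underflow if it is multiplied out directly.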

C. How to Weight Each Modality for Multimodal Observations
Similar to MLDA, the proposed method introduces weighting to adjust the degree of influence of each modality on categorization. Weighting in MLDA increases the quantity of the feature itself (i.e., it increases the frequency of each occurrence in the histogram). By changing the word modality weight, it is possible to investigate the effect on categorization in lexical acquisition.
The weighting for each modality is computed from hist_m and modality_weight_m, where hist_m is the original observation feature histogram and modality_weight_m is the weight value of modality m.
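One plausible reading of this weighting is sketched below. The exact formula is not given in the text above, so the rule used here (rescaling a histogram so that its total count equals the modality weight) is an assumption, and the function name `weight_histogram` is hypothetical.

```python
def weight_histogram(hist, modality_weight):
    """Rescale a feature histogram so that its counts sum to modality_weight.

    Assumption: the weight acts as the target total count of the histogram.
    A weight of 0 removes the modality from categorization.
    """
    total = sum(hist)
    if total == 0 or modality_weight == 0:
        return [0] * len(hist)
    return [round(count * modality_weight / total) for count in hist]


# Hypothetical word histogram for one object, weighted with the fixed value 200.
word_hist = [1, 3, 0, 4]
weighted = weight_histogram(word_hist, 200)
```

Because of rounding, the rescaled counts may differ from the target total by a few units for other inputs; the relative proportions of the original histogram are preserved approximately.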

V. SPOKEN UTTERANCE AND MULTIMODAL DATASET
This section describes the dataset of spoken utterances for phoneme and word discovery.

A. Overview
To evaluate the performance of speech unit discovery with co-occurrence cues, we generated a speech dataset corresponding to the multimodal object sensor data. We used the Multimodal Object Dataset 165 [52], which is an open dataset that includes vision, haptic, and audio sensory data, as multimodal co-occurrence cues (see footnote 6). Here, we used 24 objects from the dataset for experiments. See Nakamura et al.'s paper [52] for details regarding the observation process for each modality.
Figure 6 shows the image list for objects in the dataset. Objects were categorized into one of seven potential categories: stuffed toys, sweets, bottles, balls, spray cans, food cans, or cup noodles. Table I shows examples of the spoken sentences, with the Japanese phonemes and the English translation. The speech dataset consists of a total of 75 Japanese sentences that teach the characteristics and names of the objects. Each utterance had a duration of approximately 2-3 seconds.

B. Procedure for Creating Speech Dataset
A Japanese speaker was recorded in an anechoic chamber using an omnidirectional microphone (SHURE PG27-USB). The speech was uttered as clearly as possible to avoid speech recognition errors. The speech was saved as a 1-channel wav file with a sampling frequency of 16 kHz. The silence intervals before and after each utterance were removed using the automatic speech recognition system Julius [53] because they would greatly affect the accuracy of phoneme and word discovery if they remained in the dataset. Next, the Mel-frequency cepstral coefficients (MFCC) and the first and second derivatives of the MFCC were extracted from the speech data as features. The MFCC features were extracted with the frame width set to 25 msec and the frame shift set to 10 msec. The MFCC and the first and second derivatives of the MFCC were 12-dimensional features. In this study, we used a deep sparse auto-encoder with parametric bias in the hidden layer (DSAE-PBHL) [54] to extract 12-, 8-, 5-, and 3-dimensional features in a stepwise manner. The features were compressed using DSAE-PBHL because (i) the experimental results of a previous study [54] showed higher word discovery accuracy than with MFCC, and (ii) dimensionality reduction reduces the computational cost. Using the above procedure, each speech utterance was represented by 9-dimensional acoustic features.
Footnote 6: Multimodal Object Dataset 165: http://hp.naka-lab.org/subpages/mod165.html
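As a rough check on the amount of data per utterance, the number of feature frames implied by the frame width and frame shift above can be estimated as follows. This is a sketch that assumes no padding at the utterance boundaries; the function name `num_frames` is hypothetical, and actual feature extraction tools may count frames slightly differently.

```python
def num_frames(duration_ms, frame_width_ms=25, frame_shift_ms=10):
    """Number of full analysis frames that fit in an utterance (no padding)."""
    if duration_ms < frame_width_ms:
        return 0
    return 1 + (duration_ms - frame_width_ms) // frame_shift_ms


# Utterances in the dataset last roughly 2 to 3 seconds.
frames_2s = num_frames(2000)
frames_3s = num_frames(3000)
```

Under these assumptions, each utterance yields a few hundred frames, which is the sequence length that the phoneme and word discovery has to segment.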

VI. EXPERIMENT 1: PHONEME AND WORD DISCOVERY USING CO-OCCURRENCE CUES OBTAINED BY OBSERVING REAL OBJECTS
In this experiment, we compared the performance of the proposed method, which uses speech and its co-occurrence cues, with that of NPB-DAA, which uses only speech signals.
We investigated the hypothesis that the use of co-occurrence information contributes to the improvement of phoneme and word discovery accuracy in situations close to real-world environments. We also investigated whether the word sequences discovered by exploiting co-occurrence with object information also affect the performance of object categorization. The co-occurrence information is described in Section V, and the experiments were conducted using a multimodal object dataset.

A. Condition
This experiment used the dataset described in Section V. The hyperparameters of the language model LM of HDP-HLM were $\alpha^{LM} = 10.0$ and $\gamma^{LM} = 10.0$. The limit on the number of words in the weak-limit approximation was 50. The hyperparameters of the word model WM of HDP-HLM were $\alpha^{WM} = 10.0$ and $\gamma^{WM} = 10.0$. The limit on the number of phonemes in the weak-limit approximation was 50. The duration distribution was assumed to be a Poisson distribution with $\alpha_0 = 200$ and $\beta_0 = 10$. The emission distribution of the acoustic features was assumed to be a multivariate Gaussian distribution. The prior distribution was a normal inverse Wishart distribution with $\mu_0 = 0$, $\Sigma_0 = I$ (identity matrix), $\kappa_0 = 0.01$, and $\nu_0 = 14$ (= Dimension + 5 = 9 + 5). Here, we used NPB_DAA as the source code for the NPB-DAA implementation. We also set the number of candidate parameters for HDP-HLM in each iteration to Q = 10.
MLDA uses the histograms of the four modalities as input, and the number of object categories to be set in the prior was taken as reference values from Experiment 2 in Okuda et al.'s paper [56], where a similar dataset is used.

C. Evaluation Metrics
We prepared latent letters, that is, phonemes, and word ground truth labels for all datasets and evaluated the relationship between the ground truth labels and estimated latent letters and words as word discovery performance. We used the automatic annotation tool provided by Julius GMM to prepare the ground truth labels. We also evaluated the relationship between the ground truth of object categorization by the tutor and the estimated object categorization results as categorization performance.
The evaluation metrics were as follows:
Normalized Mutual Information (NMI) [57]: NMI is one of the most widely used evaluation metrics in clustering tasks for unsupervised learning. NMI is obtained by normalizing the mutual information between the correct clustering result and the estimated clustering result so that it takes a value ranging from 0.0 to 1.0. NMI is evaluated for phoneme, word, and object categories.
Adjusted Rand Index (ARI) [58]: ARI is one of the most widely used evaluation metrics in clustering tasks for unsupervised learning. ARI takes 1.0 when the clustering result matches the correct labels and 0.0 in expectation when the clustering is random. ARI is evaluated for phoneme, word, and object categories.
Object categorization accuracy (ACC): ACC is a metric used to evaluate the performance of object categorization in a series of studies on MLDA [18], [24]. This metric is the matching rate obtained after relabeling the estimated clusters so that they agree with the correct labels as closely as possible.
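ARI and ACC as described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the evaluation code used in the experiments: the function names are hypothetical, labels are assumed to be integers, and the best relabeling for ACC is found by brute force over one-to-one mappings, which is only practical for a small number of categories such as the seven used here.

```python
from collections import Counter
from itertools import permutations
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """ARI computed from the contingency counts of two labelings."""
    n = len(true_labels)
    sum_pairs = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)


def categorization_accuracy(true_labels, pred_labels):
    """Matching rate under the best one-to-one relabeling of predicted clusters."""
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(pred_labels))
    # Pad with None so that surplus predicted clusters map to "no category".
    size = max(len(true_ids), len(pred_ids))
    targets = true_ids + [None] * (size - len(true_ids))
    best_hits = 0
    for perm in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        hits = sum(mapping[p] == t for t, p in zip(true_labels, pred_labels))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


# Toy example: cluster ids differ from the true ids, and one object is misplaced.
truth = [0, 0, 1, 1, 2, 2]
perfect = [1, 1, 0, 0, 2, 2]
noisy = [1, 1, 0, 2, 2, 2]
acc_perfect = categorization_accuracy(truth, perfect)
acc_noisy = categorization_accuracy(truth, noisy)
ari_perfect = adjusted_rand_index(truth, perfect)
ari_noisy = adjusted_rand_index(truth, noisy)
```

Both metrics are invariant to how the estimated clusters are numbered, which is why the permuted but otherwise correct labeling in the toy example receives a perfect score.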

D. Results
Table II shows the evaluation results for phonemes and words at the end of training. The proposed method has a higher word discovery performance than the baseline methods. This shows that using co-occurrence cues improves word discovery performance in lexical acquisition. In contrast, the phoneme discovery performance was almost the same for all methods. This is likely because phonemes are the smallest speech units without meaning, whereas words are the speech units that can be assigned meaning.
Table III shows the performance of object categorization at the end of training. The proposed method showed better categorization performance than MLDA. The proposed method showed higher values with the modality parameters that were empirically set by Nakamura et al. [18] than with the modality parameters determined by preliminary experiments. HDP-HSSM+MLDA performed poorly in speech segmentation and categorized inaccurately because it did not assume double articulation. Thus, more accurate word discovery resulted in higher categorization performance. The results also suggest that more accurate object categorization leads to higher word discovery performance.
Figure 7 shows examples of the results of word segmentation of speech. These figures were drawn from the results at the last iteration of the trial with the highest word ARI.
Iterative estimation | 0.700 ± 0.069 | 0.772 ± 0.053 | 0.517 ± 0.112
Here, words that describe the characteristics of an object are nouns and adjectives such as "sweets" and "soft." Most of the word segmentation results of NPB-DAA resulted in over-segmentation. The word segmentation results of the proposed method reduce over-segmentation for words that describe the object characteristics. For example, the word /nuigurumi/ is segmented as multiple words in Figure 7a, while it is correctly segmented as a single word in Figure 7d. Additionally, Figure 7e is an example of almost exact word segmentation. The proposed method correctly segmented the word /okashi/, a feature of the object in this utterance. In Figure 7f, the over-segmentation of the word /ju:su/ was suppressed, but the words /botoru/ and /dayo/ were under-segmented. In summary, the existing methods tended to split a certain percentage of utterances into several words, whereas the proposed method could segment words that represented object features more accurately.

VII. EXPERIMENT 2: EFFECTS OF THE WEIGHT OF WORD MODALITY ON THE PERFORMANCE OF WORD DISCOVERY AND OBJECT CATEGORIZATION
By varying the word modality weights, we aimed to investigate the implications for lexical acquisition and object categorization. The hypothesis tested in this experiment is "Uncertain word segmentation results in the early stage of learning can have a negative impact on classification and hinder overall performance improvement." The proposed method is realized by the coupling of two modules and their mutual iteration. The weight of a modality in object categorization controls the degree of influence, that is, the importance of the modality when merging the two modules. Therefore, adjusting the weight of the word modality may affect classification performance. Word discovery performance can also be affected by categorization because word discovery exploits co-occurrences with object information. Therefore, in this experiment, we focused on the importance of the word modality, modality_weight_w, described in Section IV-C.
As an additional evaluation for comparison, we also evaluated a variant that uses mutual information (MI) [20] for the weighting instead of unigram rescaling (UR). The weighting by MI is equivalent to the logarithm of the weighting by UR. MI provides a softer resampling of candidates than UR.

A. Condition
The dataset and hyperparameter settings were the same as those described in Section VI. For the weight settings of the word modality, we applied variable weight settings in addition to the fixed value (200) used in the experiments in Section VI.
In the increase condition, the weighting did not use the uncertain word segmentation results for categorization in the initial stage of learning and increased the weight after some progress in the NPB-DAA iterations. This means that lexical acquisition and concept formation take place through separate mechanisms in the early stage, followed by the integration of knowledge from both. The increase condition of the word modality weight was set according to the number t of iterations of the blocked Gibbs sampler of NPB-DAA, as follows: modality_weight_w(t) = max(0, min(30 + 10(t − 10), 200)).
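The increase schedule can be evaluated directly to see when the word modality starts to contribute and when it saturates. The sketch below is a straightforward transcription of the formula; the function name `increase_weight` is hypothetical.

```python
def increase_weight(t):
    """Word modality weight at NPB-DAA iteration t under the increase condition."""
    return max(0, min(30 + 10 * (t - 10), 200))


# Weight at a few iterations: zero early on, then a linear ramp, then the cap.
schedule = {t: increase_weight(t) for t in (1, 7, 8, 10, 20, 27, 100)}
```

With this schedule, the word modality is excluded from categorization through iteration 7, grows by 10 per iteration from iteration 8, and reaches the fixed-condition value of 200 at iteration 27.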
In the decrease condition, we deliberately set a change in weight that is considered inappropriate, for comparison. This weighting strongly uses the word segmentation results for categorization in the early stages of learning but gives them less credence as learning progresses. The decrease condition of the word modality weight, modality_weight_w(t), was also set as a function of the number t of iterations of the blocked Gibbs sampler of NPB-DAA.

B. Results
Table IV shows the performance of word discovery and categorization at the end of training for each weight setting. As in Experiment 1, there was no difference in phoneme performance, so we omitted that evaluation in this experiment. The performance was higher when the word modality weight was gradually increased than when it was fixed.
Overall, the co-occurrence DAA * 2 of the increase condition had the highest performance. The increase condition does not use the word modality in the early stages of learning, when word discovery is uncertain, but uses it for categorization after some learning has progressed. This suggests that the categorization improved by the increase condition can serve as a co-occurrence cue for lexical acquisition, potentially enhancing word discovery performance.
In contrast, the decrease condition, which reduced the word modality weight, slightly decreased the word discovery performance and lowered the categorization performance. The experimental results support our hypothesis that uncertain word segmentation results in the early learning stages may negatively impact classification.
Figure 9 shows examples of the results of word segmentation of speech. Figures 9a, 9b, and 9c show examples that the fixed condition could not segment accurately. In these examples, the increase condition improved word segmentation performance. In addition, compared with the results of NPB-DAA in Experiment 1, over-segmentation was significantly alleviated for words that describe the characteristics of an object.
In the comparison of weighting in SIR, the word modality weights were higher in UR for the increase condition and higher in MI for the fixed condition. UR worked well with the increase condition because it could focus on more appropriate candidates. MI worked to retain various candidates, but in some cases, it failed to narrow down the appropriate candidates. The results suggest that not only MI, which is used in the conventional method, but also UR, which is mathematically consistent with the proposed method, can be effective as a criterion.

VIII. DISCUSSION AND CONCLUSION
In this paper, we proposed an unsupervised phoneme and word discovery method that exploits phonological and co-occurrence cues to imitate the lexical acquisition process of infants using statistical learning. The main features of the proposed method are the following two points: (i) It integrates HDP-HLM, a probabilistic generative model for simultaneous phoneme and word discovery, and MLDA, a probabilistic generative model for multimodal object categorization. (ii) Multimodal sensor observations of image, tactile, and audio stimuli can be simultaneously used as co-occurrence cues for lexical acquisition. Experimental results showed that the proposed method improved word discovery performance for the entire utterance compared to the existing methods that do not use co-occurrence cues. These results suggest that the proposed method can find words more accurately than existing methods, insofar as the words express the characteristics of the object. In addition, increasing the modality weight of the words in the categorization improved the categorization and word discovery performance.
The study focused on the critical elements for language acquisition (i.e., co-occurrence cues) claimed by Saffran et al. [3]-[5]. Therefore, perception/cognition and learning, rather than utterance and production, were considered. A future direction is to have the system use language to utter (reproduce) learned words and sentences on its own. Other essential factors involved in the development of language acquisition include perceptual reorganization and the vocabulary spurt [59]-[62]. Constructing a unified computational model that accounts for multiple developmental stages in language acquisition remains a challenging and unresolved issue.
The proposed method has a limitation in that the number of object categories must be set in advance. However, HDP-MLDA [63], which extends MLDA with a nonparametric Bayesian method (specifically, the Dirichlet process), automatically estimates the appropriate number of categories for the data. It is easy to extend the MLDA part of the proposed method to HDP-MLDA. In the future, the limitation on the number of object categories could thus be resolved.
Our future study prospects include (i) enabling word discovery that incorporates prosodic cues [56] into the proposed method, and (ii) using the acquired words for human-robot interaction and feedback learning through speech synthesis.
In this study, we employed NPB-DAA as an unsupervised phoneme and word discovery method and MLDA as a categorization method using multimodal object information. The essence of the proposed method is the integration procedure that exploits the co-occurrence of phonological and object information in probabilistic generative models. It does not depend on any particular model as long as it can be represented by a probabilistic generative model. Therefore, in the future, it will be possible to reconstruct the proposed method based on various speech unit discovery and categorization models.

A. Preliminary experiment I
In this experiment, we confirmed categorization accuracy using multimodal information when the speech dataset was recognized accurately. First, the modality weight for categorization was determined by categorizing each modality. Then, the categorization performance when the weight of the word modality was changed was measured. The result of this experiment can be interpreted as the upper limit (that is, the topline) of the categorization using this dataset.
1) Condition: The word modality uses the word histogram created from the transcript, which is generated from the utterance content of the dataset as word sequences. The multimodal information of an object includes all three modalities of vision, haptic, and audio. The number of categories and the values of the hyperparameters were the same as those described in Section VI. Here, the weight of the word modality took a value from {0, 10, 20, ..., 300}, and the weights were 100 for vision, 100 for haptic, and 50 for audio. The latter three values were fixed. The weight value 0 indicates that categorization was performed while excluding the word modality. The weights for the modalities other than the word modality were set to be proportional to the categorization accuracy obtained when categorization was performed with each modality alone (see Table V).
Because the seed value of the random numbers was fixed in the implementation of MLDA, one trial was performed for each weight value. The Gibbs sampler of MLDA was run for 1000 iterations. The number of iterations of the Gibbs sampler was determined in advance by investigating the number of learning iterations at which the categorization converged sufficiently.
2) Result: Figure 10 shows the categorization accuracy for each weight of the word modality. Comparing the categorization accuracy across weights, the accuracy at weight 0, that is, without the word modality, was low, indicating that linguistic information contributed to the improvement of the categorization accuracy in this dataset. In addition, the weights of 10 to 30 showed almost the same accuracy as when the word modality was not used. The accuracy tended to increase slightly from 40 to 200, and there was no significant change in accuracy from a weight of 210 onward. Therefore, it can be inferred that an appropriate weight for the word modality is between 40 and 200.

B. Preliminary experiment II: Relationship of categorization performance and modality weight using word sequences by NPB-DAA
In this experiment, we examined the appropriate weight value setting of the word modality using the word sequences estimated by unsupervised learning. Specifically, the results of the phoneme and word discovery experiments by NPB-DAA were used as MLDA inputs, and the categorization accuracy was compared by setting various weight values. In addition, by performing categorization using the word sequences estimated in each iteration of NPB-DAA, we investigated how the categorization accuracy changed when using word sequences that had not been sufficiently learned, and how to set the weight value at that time.
1) Condition: The multimodal object and speech dataset is described in Section V. The values of the hyperparameters and the modality weights for object categorization were the same as those described in Section A-A. The iteration numbers for the Gibbs samplers of MLDA and NPB-DAA were 1000 and 100, respectively. Ten trials were performed by NPB-DAA for each modality weight setting.
2) Result: Figure 11 shows the accuracy of object categorization for each word modality weight. Figure 12 shows the changes in the accuracy of object categorization for each word modality weight when using the word sequences estimated at each iteration of NPB-DAA. From Figures 10 and 11, it can be inferred that a word modality weight of 40 to 200 is appropriate because the relationship between the weight of the word modality and the categorization accuracy was similar to that of Section A-A. Furthermore, from Figure 12, the number of iterations of NPB-DAA did not significantly affect the categorization accuracy for each weight of the word modality, except for the initial stage of NPB-DAA.

Fig. 2: The graphical models of (a) LDA and (b) MLDA (a version consisting of four modalities; from the top: word, audio, haptic, and vision modalities).
B. Nonparametric Bayesian Double Articulation Analyzer: NPB-DAA
NPB-DAA is an unsupervised phoneme and word discovery method proposed to computationally imitate the lexical acquisition process of human infants. NPB-DAA uses an acoustic model and a word model to express phonemes and words. The acoustic model stochastically represents the duration of each
Algorithm 1 Collapsed Gibbs sampler for MLDA.
repeat
  for all m, d, i do
    u ← random value between [0, 1]

Fig. 4: Flow of the iterative estimation of co-occurrence DAA. The following procedure was implemented to utilize the co-occurrence of object information and phonological information. (i) Phonemes and words are learned in each iteration of NPB-DAA; word sequences are estimated for each of the multiple model candidates. (ii) Object categorization is performed using each candidate of estimated word sequences and multimodal object information. (iii) The probability distribution of words that are likely to appear in each category is obtained from the categorization results. (iv) Each model candidate is weighted based on the appearance probability of words in the category assigned to each word included in the word sequences estimated from the utterance. If the candidate has a higher weight, the words frequently appearing in the category to which the object belongs can be estimated. (v) The model to be trained in the next iteration is selected by sampling based on the weighting. The model to be updated is resampled with a probability proportional to the calculated weight value.

Fig. 5: Graphical model representation of HDP-HLM+MLDA, which is the integrated model of HDP-HLM and MLDA. Some of the variables corresponding to HDP-HLM are collectively shown as one variable.

Fig. 7: Examples of word segmentation results (Experiment 1): The upper part of each sub-figure shows the waveform of the target speech. The lower part of each sub-figure shows the segmentation position estimated by learning, color-coded for each word. The correct segmentation position of the word is overlaid as a gray line. The horizontal axis represents the number of speech frames and the vertical axis represents the training iteration. The list of numbers at the bottom of the lower part of each sub-figure corresponds to the index sequence of the words estimated by training. The sub-caption shows the actual phoneme sequence of the utterance. The underlined part is a characteristic word for an object.

Fig. 9: Examples of word segmentation results (Experiment 2): The upper sub-figures (a, b, c) show the results of the co-occurrence DAA (Fixed condition). The lower sub-figures (d, e, f) show the results of the co-occurrence DAA (Increase condition).

Fig. 10: Relationship between the weight of word modality and categorization accuracy when using transcription sentences, that is, ground truth. (Preliminary experiment I)

Fig. 11: Relationship between the weight of word modality and categorization accuracy when using word sequences estimated by NPB-DAA. (Preliminary experiment II)

TABLE I: Examples of uttered sentences.
Uttered sentence (Japanese phonemes) | English
/kore wa omocha/ | This is a toy.
/jyuusu no botoru dayo/ | It's a bottle of juice.
/supureikaN wa katai/ | The spray can is hard.

TABLE II: Phoneme and word discovery performance. (Experiment 1)

TABLE IV: Phoneme and word discovery performance and object categorization performance. (Experiment 2)

TABLE V: Categorization accuracy when vision, haptic, or audio modalities were used individually. (Preliminary experiment I)