Double Articulation Analyzer with Prosody for Unsupervised Word and Phoneme Discovery

Infants acquire words and phonemes from unsegmented speech signals using segmentation cues, such as distributional, prosodic, and co-occurrence cues. Many existing computational models of this process focus on either distributional or prosodic cues. This paper proposes a nonparametric Bayesian probabilistic generative model called the prosodic hierarchical Dirichlet process-hidden language model (Prosodic HDP-HLM). Prosodic HDP-HLM, an extension of HDP-HLM, considers both prosodic and distributional cues within a single integrative generative model. We conducted three experiments on different types of datasets and demonstrated the validity of the proposed method. The results show that the Prosodic DAA successfully uses prosodic cues and outperforms a method that solely uses distributional cues. The main contributions of this study are as follows: 1) We develop a probabilistic generative model for time series data including prosody that potentially has a double articulation structure; 2) We propose the Prosodic DAA by deriving the inference procedure for Prosodic HDP-HLM and show that Prosodic DAA can discover words directly from continuous human speech signals using statistical information and prosodic information in an unsupervised manner; 3) We show that prosodic cues contribute more to word segmentation when words are naturally distributed, i.e., when their frequencies follow Zipf's law.


Yasuaki Okuda, Ryo Ozaki, and Tadahiro Taniguchi
Abstract: Word and phoneme discovery are important tasks in language development for human infants. Infants acquire words and phonemes from unsegmented speech signals using segmentation cues, such as distributional, prosodic, and co-occurrence cues. Many existing computational models of this process focus on either distributional or prosodic cues. This paper proposes a nonparametric Bayesian probabilistic generative model called the prosodic hierarchical Dirichlet process-hidden language model (Prosodic HDP-HLM) for simultaneous phoneme and word discovery from continuous speech signals. Prosodic HDP-HLM, an extension of HDP-HLM, considers both prosodic and distributional cues within a single integrative generative model. We propose a prosodic double articulation analyzer (Prosodic DAA) by deriving an inference procedure for Prosodic HDP-HLM. We conducted three experiments on different types of datasets, i.e., Japanese vowel sequences, utterances for teaching object names and features, and utterances following Zipf's law, and demonstrated the validity of the proposed method. The results show that the Prosodic DAA successfully uses prosodic cues and outperforms a method that solely uses distributional cues. In contrast, the phoneme discovery performance did not improve. We also show that prosodic cues contributed more to word discovery performance when the word frequency was distributed more naturally, i.e., following Zipf's law.

I. INTRODUCTION
SPEECH signal segmentation problems that identify word and phoneme boundaries from continuous speech using segmentation cues, e.g., distributional cues and prosodic cues, are essential for human infant language acquisition. This task is easy if speech signals are always given as a single word. However, many infant-directed utterances are known to consist of multiple words [1], [2]. Nevertheless, human infants can discover words and phonemes from raw continuous speech signals. This word discovery from continuous speech signals is a difficult task because infants cannot use any information that explicitly identifies the boundaries of words and can only use cues contained in the continuous speech signals, i.e., it is an unsupervised learning problem. In addition, phoneme discovery needs to be performed using speech signals in an unsupervised manner as well.
Human infants can exploit numerous cues to discover words from continuous speech signals in the language acquisition process [3]. These cues include 1) distributional, 2) prosodic, and 3) co-occurrence cues, among others. 1) The distributional cues represent the statistical relationships by which one element of speech sound follows another. 2) The prosodic cues rely on acoustic information, such as silent pauses, stressed syllables, rhythmic bias, and suprasegmental features, e.g., pitch [4]-[8]. 3) Co-occurrence cues represent events and objects observed in accordance with the utterance of a word. It has been reported that 8-month-old infants can discover words from fluent speech based solely on distributional cues [9], [10]. It has also been reported that 7-month-old infants can discover words from fluent speech based on distributional cues rather than prosodic cues [11]. In contrast, it has also been reported that 2-month-old infants can perform word discovery using prosodic information [12]. Therefore, considering both distributional and prosodic cues is crucial for developing an unsupervised phoneme and word discovery model. Several computational models for simultaneous unsupervised word and phoneme discovery using distributional cues have been developed [13]-[15]. Taniguchi et al. proposed a probabilistic generative model for simultaneous word and phoneme discovery, called the hierarchical Dirichlet process hidden language model (HDP-HLM), and its inference algorithm [15]. The HDP-HLM is a probabilistic generative model for time series data that potentially has a double articulation structure, i.e., hierarchically organized latent words and phonemes embedded in speech signals. An unsupervised learning method called the nonparametric Bayesian double articulation analyzer (NPB-DAA) was proposed based on HDP-HLM. The NPB-DAA can estimate the double articulation structure and discover words and phonemes simultaneously, thereby acquiring acoustic and language models.
However, their performance is still limited.
Prosodic cues are also essential for word discovery [16] and have been reported to be effective in machine learning methods for word segmentation [17], [18]. However, a probabilistic generative model that effectively integrates prosodic and distributional information has not yet been proposed. Therefore, in this study, we focus on the introduction of prosodic cues that are believed to help word discovery as supplemental cues. We develop an unsupervised machine learning method that can discover words and phonemes directly from continuous speech signals using distributional and prosodic cues by extending the NPB-DAA. Specifically, we propose a probabilistic generative model called the prosodic hierarchical Dirichlet process hidden language model (Prosodic HDP-HLM), its inference algorithm, and an unsupervised learning method based on them called the prosodic double articulation analyzer (Prosodic DAA).
In what cases do prosodic cues especially contribute to word segmentation from the viewpoint of a statistical model of word discovery? This is an important question to ask. If the words are distributed evenly enough, distributional cues may be sufficient for word segmentation. However, it is widely known that word distribution follows Zipf's law [19]. Zipf's law is an empirical law found in many types of social, physical, and other scientific domains. The frequency of a word is inversely proportional to its rank in the frequency table of words. This means that there are so many words that are far less frequently observed than other frequently observed words. It is naturally considered that capturing distributional cues from such data is more difficult than capturing data in which every word is observed in an equally frequent manner. We hypothesized that prosodic cues contribute to word discovery performance more when words are distributed more naturally than artificially prepared datasets, i.e., in adherence to Zipf's law.
The main contributions of this paper are as follows: 1) We develop a probabilistic generative model for time series data including prosody that potentially has a double articulation structure and propose the Prosodic DAA by deriving the inference procedure for Prosodic HDP-HLM. 2) We show that the Prosodic DAA can discover words directly from continuous human speech signals using statistical information and prosodic information in an unsupervised manner. 3) We show that prosodic cues contribute more to word segmentation when words are naturally distributed, i.e., when their frequencies follow Zipf's law.
The remainder of the paper is structured as follows. Section II describes the background of the proposed method. Section III presents the prosodic HDP-HLM by extending HDP-HLM, describes the inference procedure of prosodic HDP-HLM, and proposes Prosodic DAA. Sections IV, V, and VI evaluate the performance of the proposed method using Japanese vowel sequences, utterances for teaching object names and features, and utterances following Zipf's law, respectively. Section VII concludes the paper.
II. BACKGROUND
Human infants can use prosodic cues to discover words in the language acquisition process, as mentioned above. Prosodic cues have been shown to help word segmentation in language acquisition [20]. Based on this, Ludusan et al. extended the unsupervised word segmentation method [21] to use prosodic cues and showed that prosodic cues help unsupervised word segmentation [17], [18]. However, a computational model based on a probabilistic generative model, which makes use of distributional and prosodic cues jointly for simultaneous phoneme and word discovery, has not been proposed.
The unsupervised learning of an acoustic model, i.e., phoneme discovery, is a clustering task of feature vectors obtained from continuous speech signals. Mixture models, such as the Gaussian mixture models and hidden Markov models, have been used to categorize feature vectors of phonemes [22]- [26]. Phoneme acquisition is a complex categorization task in a feature space because of the overlap of the distribution of the feature vectors of each phoneme. The actual sound of a phoneme depends on its context. The importance of feedback information from segmented words in phoneme acquisition has been reported [27]. Therefore, simultaneous word and phoneme discovery is essential.
Statistical unsupervised simultaneous learning methods of acoustic and language models have been proposed [13], [14], [28], [29]. Word segmentation and phoneme categorization are mutually dependent, as pointed out in [27]. Therefore, an integrated probabilistic generative model for unsupervised simultaneous learning of acoustic and language models is preferable. The unsupervised phoneme and word discovery from raw continuous speech signals can be regarded as an analysis of the double articulation of time series data. Double articulation is a hierarchical latent structure in which the higher-level states, corresponding to words in language, transition among themselves stochastically, whereas the lower-level states, e.g., phonemes, transition deterministically inside each higher-level state; for example, a word has a deterministic sequence of phonemes. Therefore, the phoneme and word discovery problem can be regarded as a double articulation analysis problem [15], [30], [31]. For double articulation analysis, Taniguchi et al. developed the NPB-DAA, which integrates the phoneme and word discovery processes into a single inference process of a unified generative model called HDP-HLM [15]. They showed that it could achieve phoneme and word discovery to some extent. However, HDP-HLM only models distributional cues and does not use prosodic cues, such as silent pauses and fundamental frequency. This paper proposes Prosodic HDP-HLM, which models distributional and prosodic cues related to word segments by extending the HDP-HLM.
The Zero Resource Speech Challenges, which aim to construct a system that learns an end-to-end spoken dialog system using only the information available in the language acquisition process, have been organized [33], [34]. Probabilistic computational models that achieved unsupervised direct word discovery from continuous speech signals were proposed in the Zero Resource Speech Challenges. Kamper et al. proposed an unsupervised word segmentation system that segments and clusters speech data into units, such as words [35]. Recently, methods involving representation learning have also been developed [36]. However, an integrative probabilistic generative model, especially based on Bayesian nonparametrics, involving prosodic and distributional cues, has not been proposed.
In robotics, unsupervised word discovery methods have been studied to achieve online lexical acquisition and overcome the out-of-vocabulary problem [37]- [44]. Several models use co-occurrence cues (e.g., object and place categories) to improve word discovery performance [38], [41], [42], [44]. However, phoneme discovery and prosodic cues have rarely been considered.
Based on the above background, we propose a probabilistic generative model called the prosodic HDP-HLM for time series data, including prosody, that potentially has a double articulation structure. The model contains both an acoustic model and a language model. The latent double articulation structure of time series data can be analyzed in an unsupervised manner by assuming prosodic HDP-HLM as a generative model of observation data and inferring latent variables of the prosodic HDP-HLM. The unsupervised machine learning method for double articulation analysis is called Prosodic DAA. An overview of the proposed method is presented in Fig. 1. The Prosodic DAA uses distributional cues and prosodic cues simultaneously in an explicit manner.

III. PROSODIC DAA
A. Generative Model: Prosodic HDP-HLM
This section describes a probabilistic generative model, the prosodic HDP-HLM, by adding auxiliary observations corresponding to prosody to HDP-HLM, which is a nonparametric Bayesian probabilistic generative model for time series data that potentially has a double articulation structure.
A graphical model of the prosodic HDP-HLM is shown in Fig. 2. Notably, most of the generative process is the same as that of HDP-HLM, except for the variables related to prosodic features. The generative process of the prosodic HDP-HLM is described as follows:

where GEM and DP represent the stick-breaking and Dirichlet processes, respectively. The superscripts LM and WM indicate the language model and the word model, respectively. The parameters $\gamma^{WM}$ and $\alpha^{WM}$ are hyperparameters of the word model, $\beta^{WM}$ is a global transition probability that becomes the base measure of the transition probability distributions, and $\pi^{WM}_j$ represents the transition probability from latent letter $j$ to the next latent letter. The parameters $\gamma^{LM}$ and $\alpha^{LM}$ are hyperparameters of the language model, $\beta^{LM}$ is a global transition probability that becomes the base measure of the transition probability distributions, and $\pi^{LM}_i$ represents the transition probability from latent word $i$ to the next latent word.
Each letter $w_{ik}$ of the latent letter sequence of the $i$-th latent word $w_i$ is sampled from $\pi^{WM}_{w_{i,k-1}}$. The duration distribution $g$ and the observation distribution $h$ have parameters $\omega_j$ and $\theta_j$, respectively, for the $j$-th latent letter, generated from the base measures $G$ and $H$. In addition, the prosodic observation distribution $h^{Prosody}$ has parameters $\phi_q$ generated from the base measures $H^{Prosody}_q$. The variable $z_s$ is the $s$-th latent word in the latent word sequence and corresponds to the superstate in the hierarchical Dirichlet process hidden semi-Markov model (HDP-HSMM) [45]. The duration time $D_s$ is the frame duration of the $s$-th latent word $z_s$. The latent letter $l_{sk} = w_{z_s k}$ corresponds to the $k$-th latent letter of the $s$-th latent word. The duration time $D_{sk}$ is the frame duration of the latent letter $l_{sk}$. The duration time $D^{sum}_s$ is the frame duration from $t = 1$ to the end point of the word $z_s$. The variables $x_t$ and $y_t$ indicate the hidden state and observation data at time $t$, respectively. In word and phoneme discovery, we assume that $y_t$ represents a spectral feature representation, for example, the mel-frequency cepstral coefficients (MFCC). The time frames $t^1_{sk}$ and $t^2_{sk}$ are the start and end points of the segment corresponding to $l_{sk}$, respectively.
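The bookkeeping among the letter durations, word durations, and segment end points described above can be illustrated with a minimal sketch. The function name `letter_segments` and the convention of 1-indexed, inclusive frame indices are assumptions made for illustration and are not taken from the original implementation.

```python
def letter_segments(letter_durations):
    """Derive segment boundaries from letter durations.

    letter_durations[s][k] is the frame duration of the k-th latent letter
    of the s-th latent word (hypothetical input layout).
    Frames are assumed to be 1-indexed and inclusive.
    """
    segments = []        # segments[s][k] = (start frame, end frame) of letter k
    word_durations = []  # frame duration of each latent word
    cumulative = []      # frames from t = 1 to the end point of each word
    t = 0
    for durations in letter_durations:
        word_segments = []
        for d in durations:
            word_segments.append((t + 1, t + d))
            t += d
        segments.append(word_segments)
        word_durations.append(sum(durations))
        cumulative.append(t)
    return segments, word_durations, cumulative


if __name__ == "__main__":
    segs, d_word, d_sum = letter_segments([[2, 3], [1]])
    print(segs, d_word, d_sum)
```

Under these assumptions, the end point of each word coincides with the cumulative duration, which is the frame at which a word boundary occurs.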
The duration time $D_{sk}$ of the $k$-th latent letter $l_{sk}$ of the $s$-th latent word $z_s$ in the word sequence is drawn from the duration distribution $g(\omega_{l_{sk}})$. The duration of the latent word $z_s$ is $D_s = \sum_{k=1}^{L_{z_s}} D_{sk}$, where $L_{z_s}$ is the number of latent letters in $z_s$. Assuming that $g$ is a Poisson distribution, the duration of a latent word $z_s$ also follows a Poisson distribution because of the reproductive property of the Poisson distribution. In this case, the Poisson parameter of the duration of the latent word is $\sum_{k=1}^{L_{z_s}} \omega_{l_{sk}}$. In addition to the variables described above, which are also in the HDP-HLM, the prosodic HDP-HLM has additional prosody-related variables. The variable $Y_t$ is the prosodic observation at time $t$, and $F_t$ is a boundary indicator: $F_t = 1$ indicates that a new word begins at $t + 1$, i.e., $F_t = 1$ when $t = D^{sum}_s$. The index $q \in \{0, 1\}$ corresponds to the value of the indicator $F_t$. We assume that $Y_t$ is a prosodic feature observed in accordance with the word boundaries, i.e., $F_t = 1$.
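For reference, the generative process described in this subsection can be written out as a sketch. The display equations are not available in this text, so the following is reconstructed from the definitions above and from the structure of HDP-HLM [15]. The exact form, ordering, and numbering in the original may differ, and the dependence of $Y_t$ on $\phi_{F_t}$ as well as the explicit expressions for $t^1_{sk}$ and $t^2_{sk}$ are assumptions.

```latex
\begin{align*}
\beta^{LM} &\sim \mathrm{GEM}(\gamma^{LM}), \qquad
\pi^{LM}_i \sim \mathrm{DP}(\alpha^{LM}, \beta^{LM}), \\
\beta^{WM} &\sim \mathrm{GEM}(\gamma^{WM}), \qquad
\pi^{WM}_j \sim \mathrm{DP}(\alpha^{WM}, \beta^{WM}), \\
w_{ik} &\sim \pi^{WM}_{w_{i,k-1}}, \qquad
(\theta_j, \omega_j) \sim H \times G, \qquad
\phi_q \sim H^{Prosody}_q, \\
z_s &\sim \pi^{LM}_{z_{s-1}}, \qquad
l_{sk} = w_{z_s k}, \qquad
D_{sk} \sim g(\omega_{l_{sk}}), \\
t^1_{sk} &= \sum_{s' < s} D_{s'} + \sum_{k' < k} D_{sk'} + 1, \qquad
t^2_{sk} = t^1_{sk} + D_{sk} - 1, \\
x_t &= l_{sk} \quad (t^1_{sk} \le t \le t^2_{sk}), \qquad
y_t \sim h(\theta_{x_t}), \\
F_t &= \begin{cases} 1 & \text{if } t = D^{sum}_s \text{ for some } s, \\ 0 & \text{otherwise,} \end{cases} \qquad
Y_t \sim h^{Prosody}(\phi_{F_t}).
\end{align*}
```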

B. Inference procedure
The approximated blocked Gibbs sampler for the prosodic HDP-HLM can be derived in the same way as the approximated blocked Gibbs sampler for the HDP-HLM. The inference procedure of HDP-HLM, called NPB-DAA, can estimate the double articulation structure from time series data. The Prosodic HDP-HLM can find latent words and letters from time series data, including prosody, in an unsupervised manner, by inferring the latent local and global parameters of prosodic HDP-HLM.
In the HDP-HLM, we adopted the backward filtering forward-sampling procedure, which is the inference method of HDP-HSMM adapted to HDP-HLM. By extending the backward filtering forward-sampling procedure of HDP-HLM, we can obtain an inference procedure for prosodic HDP-HLM. The calculation of the backward messages of the latent word $z_s = i$ in prosodic HDP-HLM is as follows:

where $z_{s(t)}$ represents the latent word $z_s$ at time $t$ and $D_{t+1}$ represents the duration of the latent word beginning at time $t + 1$. The probability $\beta_t(i)$ is obtained by marginalizing all latent words $j$ at time $t + 1$. The probability $\beta^*_t(i)$ is the probability that the latent word $i$ begins at time $t + 1$. This probability $\beta^*_t(i)$ is obtained by marginalizing all duration frames $d$. The probability $p(y_{t+1:t+d}, Y_{t+1:t+d} \mid i, d)$ in (17) is the probability that observations $y_{t+1:t+d}$ and prosodic observations $Y_{t+1:t+d}$ are generated by the latent word $i$. The likelihood of the latent word $p(y_{t+1:t+d}, Y_{t+1:t+d} \mid i, d)$ is as follows:

where the variable $R(L_i, d)$ is a set of $L_i$-dimensional natural number vectors whose elements sum to $d$. The value of (19) can be calculated efficiently using dynamic programming. The forward message $\alpha_t(k)$ can be recursively calculated as follows:

where the forward message $\alpha_t(k)$ is defined as the probability that the $k$-th latent letter in the latent word $w_i$ transitions to the next latent letter at time $t$. As a result, $\beta_t(i)$ and $\beta^*_t(i)$ can be calculated. The backward filtering forward-sampling procedure allows the blocked Gibbs sampler to directly sample latent words from observation data without explicitly sampling latent letters in prosodic HDP-HLM, similar to HDP-HLM. In the forward-sampling procedure, the latent word $z_{s(t+1)}$ and the duration $D_{s(t+1)}$ of the latent word $z_{s(t+1)}$ are sampled iteratively using backward messages as follows:

where $D^{sum}_{1:s} = \sum_{s' < s} D_{s'}$.
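The display equations for the messages are not available in this text. As a sketch only, the backward messages and the sampling distributions can be written in the standard form used for hidden semi-Markov models, with the prosodic observations added to the segment likelihood. The conditioning on $F_t = 1$, the upper limit of the duration sum, and the indexing are assumptions, and the original equations may differ in detail.

```latex
\begin{align*}
\beta_t(i) &= p(y_{t+1:T}, Y_{t+1:T} \mid z_{s(t)} = i, F_t = 1)
  = \sum_{j} \beta^*_t(j)\, p(z_{s(t+1)} = j \mid z_{s(t)} = i), \\
\beta^*_t(i) &= p(y_{t+1:T}, Y_{t+1:T} \mid z_{s(t+1)} = i, F_t = 1) \\
  &= \sum_{d=1}^{T-t} \beta_{t+d}(i)\, p(D_{t+1} = d \mid z_{s(t+1)} = i)\,
     p(y_{t+1:t+d}, Y_{t+1:t+d} \mid i, d), \\
\beta_T(i) &= 1, \\
p(z_{s(t+1)} = i \mid z_{s(t)} = j, y, Y) &\propto
  p(z_{s(t+1)} = i \mid z_{s(t)} = j)\, \beta^*_t(i), \\
p(D_{s(t+1)} = d \mid z_{s(t+1)} = i, y, Y) &=
  \frac{p(D_{t+1} = d \mid i)\, p(y_{t+1:t+d}, Y_{t+1:t+d} \mid i, d)\, \beta_{t+d}(i)}{\beta^*_t(i)}.
\end{align*}
```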
From the calculation formula shown above, the latent word z s(t+1) and the duration D s(t+1) of the latent word z s(t+1) can be sampled using β t (i) and β * t (i). Once the latent words and their duration are sampled, the other parameters, e.g., model parameters and a latent letter sequence for each latent word, can be sampled in exactly the same way as the original HDP-HLM [15]. For more details, please refer to the original paper [15].
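The backward filtering forward-sampling idea can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes that the word-level segment likelihoods, which combine the spectral and prosodic observations, have already been computed and stored in a table `lik`. It works with plain probabilities rather than log probabilities and uses a finite maximum duration. All function and argument names are hypothetical.

```python
import numpy as np


def backward_messages(trans, dur, lik):
    """Backward filtering for a toy explicit-duration model.

    trans[i, j]  : probability of word j following word i
    dur[i, d-1]  : probability that word i lasts d frames
    lik[i, t, d-1]: likelihood of frames t+1..t+d (spectral and prosodic)
                    under word i, assumed to be precomputed
    """
    n_words, T, d_max = lik.shape
    beta = np.zeros((T + 1, n_words))
    beta_star = np.zeros((T + 1, n_words))
    beta[T] = 1.0
    for t in range(T - 1, -1, -1):
        # Marginalize over the duration d of the word starting at t + 1.
        for d in range(1, min(d_max, T - t) + 1):
            beta_star[t] += beta[t + d] * dur[:, d - 1] * lik[:, t, d - 1]
        # Marginalize over the word j that starts at t + 1.
        beta[t] = trans @ beta_star[t]
    return beta, beta_star


def forward_sample(init, trans, dur, lik, beta, beta_star, rng):
    """Sample a sequence of (word, duration) pairs using the messages."""
    n_words, T, d_max = lik.shape
    t, prev, segments = 0, None, []
    while t < T:
        prior = init if prev is None else trans[prev]
        p_word = prior * beta_star[t]
        word = rng.choice(n_words, p=p_word / p_word.sum())
        ds = np.arange(1, min(d_max, T - t) + 1)
        p_dur = dur[word, ds - 1] * lik[word, t, ds - 1] * beta[t + ds, word]
        d = int(rng.choice(ds, p=p_dur / p_dur.sum()))
        segments.append((int(word), d))
        t += d
        prev = word
    return segments


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_words, T, d_max = 2, 6, 3
    trans = np.full((n_words, n_words), 0.5)
    init = np.full(n_words, 0.5)
    dur = np.full((n_words, d_max), 1.0 / d_max)
    lik = rng.random((n_words, T, d_max)) + 0.1  # stand-in likelihood table
    beta, beta_star = backward_messages(trans, dur, lik)
    print(forward_sample(init, trans, dur, lik, beta, beta_star, rng))
```

A practical implementation would typically work in the log domain to avoid underflow on long utterances; the plain-probability form is used here only to keep the correspondence with the messages visible.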

C. Prosodic DAA and prosody features
The inference procedure of Prosodic HDP-HLM allows the estimation of the double articulation structure from time series data. Therefore, we call the unsupervised machine learning method based on Prosodic HDP-HLM Prosodic DAA, in the same way as the unsupervised learning method that is based on the original HDP-HLM is called NPB-DAA.
Generally, Prosodic DAA does not specify a feature extraction method for prosody features. Any prosody features that are informative for word boundaries can be used. In the experiments described later in this study, we use the fundamental frequency $F_0$ and silent pauses as prosody observations. We focus on these two prosodic cues because they are likely universal cues for word discovery [46]. Instead of simply discarding the silent pauses, we extract their durations from the audio. The second-order differential of the fundamental frequency $F_0$ and the duration of the silent pause are given as the prosodic feature observations $Y_t := (Y_{1,t}, Y_{2,t})$, respectively. Further details of the feature extraction are described in the experimental section.
However, if another prosody feature co-occurring with a word boundary is prepared, such additional prosody features can be easily introduced into Prosodic DAA without any extension of the model.

IV. EXPERIMENT 1: CONTINUOUS JAPANESE VOWEL SPEECH SIGNAL
In the first experiment, we evaluated Prosodic DAA using Japanese vowel speech signals to verify the applicability of the proposed method to actual human continuous speech signals.
The speech utterances in the dataset do not have rich prosody features and are monotonous. In addition, the word distributions are artificially designed and the distributional cues are relatively easy to find. This dataset was used to examine whether the method could find words and phonemes using distributional cues. In this experiment, we evaluated whether the Prosodic DAA can perform word and phoneme discovery in the same way as NPB-DAA on this dataset.
In this experiment, we compared the proposed method Prosodic DAA and NPB-DAA [15], i.e., statistical word and phoneme discovery with and without prosodic cues.

A. Conditions
We used the same dataset as in [15], [47]. The data consisted of 60 audio files: a native female Japanese speaker read 30 artificial sentences aloud twice at a natural speed, and the utterances were recorded. The sentences comprised five words {aioi, aue, ao, ie, uo}, which consisted of the five Japanese vowels {a, i, u, e, o}, representing {ä, i, ɯᵝ, e̞, o̞} in phonetic symbols, respectively. The 30 sentences were prepared by combining the five words and include 25 two-word sentences, e.g., "aioi aioi," "aue ie," and "uo ao," and five three-word sentences, i.e., "aioi uo ie," "aue ao ie," "ao ie ao," "ie ie uo," and "uo aue ie." The set of two-word sentences consisted of all possible word pairs.
The data were encoded into 12-dimensional MFCC time-series data as observation data for spectral features. The frame size of MFCC was set to 25 ms, and the frame shift of MFCC was set to 10 ms. We used DSAE as an adaptive feature extractor in the same way as [47] and extracted 3-dimensional data as observation data, with the DSAE parameters set to α = 0.003, β = 0.7, and η = 0.5. For more details, please refer to the original paper on NPB-DAA with DSAE [47].
The prosody features were extracted as follows. The second-order differential of the fundamental frequency $F_0$ ($Y_{1,t}$) and the duration of the silent pause ($Y_{2,t}$) were extracted and used as time series data for prosody feature observations. Robust Epoch and Pitch EstimatoR (REAPER) was used to extract the fundamental frequency $F_0$, and the frame size and the minimum and maximum $F_0$ were set to 0.01 s, 40.0 Hz, and 300.0 Hz, respectively. A section in which the volume stays below a threshold continuously for a certain period is defined as a silent pause. Notably, the silent pauses were removed from the audio data. When a silent pause of duration $d_{sil}$ was detected after time frame $t$ and extracted, we set $Y_{2,t} = d_{sil}$, representing the silent pause between the current frame and the next frame. The volume threshold and the minimum period of a silent pause were set to -8 dB and 0.01 s, respectively.
In this experiment, we employed an open-source large vocabulary continuous speech recognition engine, Julius [48], as a baseline method for comparison with the proposed Prosodic DAA. The acoustic model of Julius was trained using a large speech dataset in a supervised manner. For the experiment, we used a GMM-based triphone model and a DNN-based triphone model.
We prepared two different groups of conditions for Julius. The first group used Julius as a generic speech recognition system. In this group, Julius used a generic word dictionary. The second group used the true word dictionary of the dataset. Therefore, in this group, Julius used the true word list and true phoneme list for continuous speech recognition.
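The two prosodic features described above can be sketched in NumPy as follows. This is an illustrative approximation rather than the extraction code used in the experiments: it assumes that a per-frame volume in dB and a per-frame $F_0$ contour are already available, that silence is detected by simple thresholding, and that the pause duration is attached to the last frame before the pause. The function names and the zero padding of the first two frames are assumptions.

```python
import numpy as np


def second_order_diff(f0):
    """Second-order difference of an F0 contour, zero-padded at the start."""
    f0 = np.asarray(f0, dtype=float)
    out = np.zeros_like(f0)
    out[2:] = np.diff(f0, n=2)
    return out


def extract_silent_pauses(volume_db, threshold_db=-8.0, min_frames=1,
                          frame_shift=0.01):
    """Remove silent pauses and record their durations.

    Returns (keep, pause_after):
      keep        : boolean mask of frames that are retained
      pause_after : for each retained frame, the duration in seconds of the
                    silent pause that immediately followed it (0 if none)
    """
    volume_db = np.asarray(volume_db, dtype=float)
    silent = volume_db < threshold_db
    keep = np.ones(len(silent), dtype=bool)
    pause = np.zeros(len(silent))
    t = 0
    while t < len(silent):
        if not silent[t]:
            t += 1
            continue
        end = t
        while end < len(silent) and silent[end]:
            end += 1
        run = end - t
        if run >= min_frames:
            keep[t:end] = False            # drop the pause frames
            if t > 0:
                pause[t - 1] = run * frame_shift
        t = end
    return keep, pause[keep]


if __name__ == "__main__":
    volume = [0, 0, -20, -20, -20, 0, -20, 0]
    keep, pause_after = extract_silent_pauses(volume, min_frames=2)
    print(keep, pause_after)
    print(second_order_diff([1.0, 2.0, 4.0, 7.0]))
```

With a 10 ms frame shift, a minimum period of 0.01 s corresponds to `min_frames=1` in this sketch.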

B. Results
The phoneme and word discovery task, i.e., double articulation analysis, can be regarded as an unsupervised clustering task. We evaluated the experimental results using the adjusted Rand index (ARI), which quantifies the performance of a clustering task. ARI becomes 1 when the clustering result matches the ground truth and becomes zero when the data are clustered randomly. We prepared ground truth labels for phonemes, i.e., latent letters, and for words for all data and used them to evaluate the experimental results.
Table I shows the phoneme ARI (i.e., the average ARI for latent letters) and the word ARI (i.e., that for latent words) estimated by NPB-DAA and by the proposed Prosodic DAA using only silent pauses, only $F_0$, and both $F_0$ and silent pauses. The ARI for estimated latent letters and words shows how accurately each method estimated latent letters and words, which correspond to phonemes and words in speech signals. A higher ARI indicates a more accurate estimation of latent variables. The reported values are ARIs averaged over 20 trials.
Table I also shows the ARI of latent letters, i.e., phonemes, and latent words for the two groups of conditions for Julius. The results show that the proposed Prosodic DAAs in all conditions outperformed NPB-DAA in word ARI. In contrast, there is almost no difference between Prosodic DAAs in all conditions and NPB-DAA in phoneme ARI. We compared the Prosodic DAAs in all conditions with NPB-DAA using t-tests at a significance level of 0.05. Prosodic DAA using both $F_0$ and silent pauses and Prosodic DAA using only silent pauses were statistically significantly different from NPB-DAA in word ARI ($p = 3.6 \times 10^{-4}$ and $p = 3.4 \times 10^{-2}$, respectively). There were no statistically significant differences in any combination of DAAs in phoneme ARI.
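As a concrete illustration of the metric, the ARI can be computed from pair counts in the contingency table of two labelings. The following is a minimal standard-library sketch. It assumes that ground truth and estimated labels are compared per frame, which is an assumption about the evaluation setup, and the function name is chosen for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0
    # Pair counts within contingency cells and within each marginal cluster.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    # Cluster identities are arbitrary, so a relabeled perfect match scores 1.
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
    print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because the index depends only on which items are grouped together, the estimated latent letters and words do not need to be mapped onto the ground truth label names before evaluation.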
The statistically significant differences in word ARI between Prosodic DAA using only silent pause and NPB-DAA showed that the word segmentation performance improved when using prosodic cues. Moreover, the statistically significant differences in word ARI between Prosodic DAA using both F 0 and silent pause and Prosodic DAA using only silent pause showed that the word segmentation performance improved by using both F 0 and silent pause instead of only silent pause. The results of Julius in all conditions were lower than the results of DAAs. This is likely because the acoustic and language models assumed in the dataset in Experiment 1 and that in Julius are very different. These results show that the Prosodic DAA improved the word discovery performance of NPB-DAA even when the target data had moderate prosody features.

V. EXPERIMENT 2: JAPANESE CONTINUOUS SPEECH SIGNAL INCLUDING PROSODY
In the second experiment, we evaluated our proposed method using naturally spoken Japanese speech signals, which contain consonants, ordinary Japanese vocabulary, and richer prosody features than the speech signals used in Experiment 1. This experiment was conducted to verify the applicability of the proposed method to actual human continuous Japanese utterances.

A. Conditions
We prepared a dataset of continuous Japanese utterances. The data consisted of 70 audio files: a native male Japanese speaker read 70 artificial sentences aloud once at a natural speed, and the utterances were recorded. The data consisted of sentences that teach names and features of objects, for example, "kore wa omocha" ("This is a toy" in English) and "yawarakai yo" ("It's soft" in English). The sentences comprised 26 words, consisting of 26 Japanese phonemes. We prepared ground truth labels for phonemes, i.e., latent letters, and for words for all data and evaluated the relationship between the ground truth labels and the estimated latent letters and words using the ARI. We used the automatic annotation tool provided by Julius GMM to prepare the ground truth labels.
All the feature extraction methods and hyperparameters were the same as in Experiment 1, except for the following. The data were encoded as observation data into 36-dimensional MFCC time series data, which is a concatenation of the 12-dimensional MFCC, its differential, and its second-order differential. We used DSAE as an adaptive feature extractor in the same way as [49] and extracted 9-dimensional data as observation data. For more details, please refer to the original paper on NPB-DAA with DSAE for natural speech signals [49]. To extract prosody features, the volume threshold and the minimum period of a silent pause were set to -10 dB and 0.03 s, respectively.
Regarding the hyperparameters for HDP-HLM and Prosodic HDP-HLM, the maximum number of letters and words and the maximum frame duration of words were set to 50 and 120 for weak-limit approximation.
In this experiment, we employed two different groups of conditions for Julius as a baseline method, similar to Experiment 1.

B. Results
Table II shows the ARI of latent letters, i.e., phonemes, and latent words for the two groups of conditions for Julius, NPB-DAA, and the proposed Prosodic DAA using only silent pauses, only $F_0$, and both $F_0$ and silent pauses. The ARI for estimated latent letters and words shows how accurately each method estimated latent letters and words, which correspond to phonemes and words in speech signals. There were no statistically significant differences in any combination of DAAs in phoneme ARI.
The statistically significant differences in word ARI between Prosodic DAA using either F 0 or silent pause and NPB-DAA showed that the word segmentation performance improved when using prosodic cues. The statistically significant differences in word ARI between Prosodic DAA using both F 0 and silent pause and Prosodic DAA using either F 0 or silent pause showed that the word segmentation performance improved by using both F 0 and silent pause instead of either F 0 or silent pause. In addition, the statistically significant difference in word ARI between Prosodic DAA using only silent pause and Prosodic DAA using only F 0 showed that the word segmentation performance was improved by using silent pause instead of F 0 . Notably, the results of several ARIs of Julius GMM outperform Julius DNN likely because the dataset used in Experiment 2 as true labels was annotated using the automatic annotation tools of Julius GMM.
These results show that the proposed Prosodic DAA is a more effective machine learning method than NPB-DAA for estimating the latent double articulation structure of time series data that include prosody.
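For readers unfamiliar with the evaluation measure, the adjusted Rand index (ARI) can be computed from two label sequences as in the following minimal sketch. It assumes that the true and estimated labels are given per frame as equal-length sequences; the function name is hypothetical, and the code is not the evaluation script used in the experiments.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index between two labelings of the same items."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    n = len(true_labels)
    if n < 2:
        return 1.0  # no pairs to compare

    # Contingency counts and the marginal counts of each labeling.
    pair_counts = Counter(zip(true_labels, pred_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)

    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)  # chance-level agreement
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (sum_pairs - expected) / (max_index - expected)


# Label names do not matter: a pure relabeling scores 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

Because the index is corrected for chance, it is invariant to how the discovered latent letters or words are numbered, which is what makes it suitable for unsupervised discovery.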

VI. EXPERIMENT 3: CONTINUOUS JAPANESE SPEECH SIGNALS FOLLOWING ZIPF'S LAW
In the third experiment, we evaluated the Prosodic DAA using continuous Japanese speech signals that were more naturalistic than those in Experiment 2 in terms of their distributional properties. It is widely known that words in documents and utterances are distributed following Zipf's law, which is a power law [19]. Zipf's law is an empirical law found in many social, physical, and other scientific domains. In data satisfying Zipf's law, the rank-frequency distribution has an inverse relation: the frequency of a word is inversely proportional to a power of its rank in the frequency table of words,
$$f(r) \propto \frac{1}{r^{\alpha}},$$
where $f(r)$ is the frequency of the word of rank $r$ and α is a positive constant. In natural language, the word rank-frequency distribution follows α = 1 in many cases [50]. The datasets used in Experiments 1 and 2 did not follow Zipf's law, as the log-log plot of their rank-frequency distributions in Figure 3 shows. Mathematically, a corpus following Zipf's law contains more words that appear infrequently, so the distributional cues for word segmentation are more difficult to capture. Therefore, we hypothesize that prosodic cues contribute significantly to word and phoneme discovery for such a corpus.
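Whether a word-frequency table follows Zipf's law can be checked by fitting a line to the log-log rank-frequency plot, whose negated slope estimates α. The following is a minimal sketch of that check; the function names, the scale constant, and the toy frequencies are hypothetical and are not taken from the datasets.

```python
import math


def zipf_frequencies(num_words, alpha=1.0, scale=1000.0):
    """Ideal Zipfian frequencies scale / r**alpha for ranks 1..num_words."""
    return [scale / r ** alpha for r in range(1, num_words + 1)]


def estimate_alpha(frequencies):
    """Negated least-squares slope of log frequency against log rank.

    Requires at least two strictly positive frequencies.
    """
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


zipfian = zipf_frequencies(50, alpha=1.0)
uniform = [20.0] * 50  # every word equally frequent, far from Zipf's law
print(estimate_alpha(zipfian), estimate_alpha(uniform))
```

An estimate near 1 corresponds to the natural-language case, whereas an estimate near 0 indicates a flat distribution in which every word is observed about equally often.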

A. Conditions
The dataset used in Experiment 3 was of the same type as that used in Experiment 2, except for the word frequency distribution, which was adjusted to follow Zipf's law. Figure 3 shows the log-log plot of the rank-frequency distribution of the dataset used in Experiment 3. The figure shows that the dataset follows Zipf's law, where α = 1. A native Japanese male speaker read aloud 42 sentences at a natural speed, and the utterances were recorded. Table III shows the ARIs of the discovered phonemes and words; each value is an average over 20 trials. The baseline methods are the same as those in Experiment 2.

B. Results
The experimental results show that the Prosodic DAA significantly improved the word ARI compared with NPB-DAA in every condition at the 0.05 significance level with a t-test ($p = 7.2 \times 10^{-21}$, $2.1 \times 10^{-6}$, and $1.3 \times 10^{-8}$ for Prosodic DAA with both features, F0, and silent pause, respectively). In contrast, no significant differences in phoneme ARI were found in any condition.
This result shows that the prosodic cue contributes to the word segmentation task even when the target data follow Zipf's law. Comparing Tables II and III, we find that the performance of NPB-DAA, which only uses distributional cues, deteriorated. However, we also find that the introduction of prosodic cues improved the ARIs more in Experiment 3 than in Experiment 2. This suggests that prosodic cues contribute to word discovery in natural speech signals when the distributional cues are statistically hard to capture.
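As an illustration of how two sets of per-trial ARIs can be compared, the following minimal sketch computes a two-sample t statistic. The text does not state which variant of the t-test was used, so Welch's unequal-variance form is assumed here; the function name and the toy ARI values are hypothetical and are not results from Table III.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and its approximate degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means.
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Made-up per-trial word ARIs for two methods (illustration only).
method_a = [0.60, 0.62, 0.58, 0.61, 0.59]
method_b = [0.50, 0.52, 0.48, 0.51, 0.49]
t_value, dof = welch_t(method_a, method_b)
print(t_value, dof)
```

The p-value then follows from the t distribution with the computed degrees of freedom, for which a statistics library would normally be used.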

VII. CONCLUSION
In this study, we proposed the Prosodic DAA for discovering words directly from continuous human speech signals using statistical information and prosodic information in an unsupervised manner. For this purpose, we proposed a probabilistic generative model called the Prosodic HDP-HLM by extending the HDP-HLM. Based on the generative model, we derived an inference procedure by expanding the blocked Gibbs sampler proposed for HDP-HLM. To evaluate the performance of the proposed method, we conducted three experiments. In the first experiment, we applied the proposed method to actual human Japanese vowel speech signals. In the second experiment, the proposed method was applied to actual human Japanese utterances. In the third experiment, the proposed method was applied to Japanese utterances whose word distribution follows Zipf's law. The results showed that the proposed method could make use of prosody information and outperformed NPB-DAA in word segmentation performance. However, the phoneme discovery performance did not improve. This suggests that prosodic cues, i.e., second derivatives of F0 and silent pauses, do not contribute to phoneme discovery. In addition, the third experiment suggests that prosodic cues contribute to word segmentation if the distributional cues are difficult to capture, for example, when the word distribution follows Zipf's law.
Word and phoneme discovery from more natural speech signals will be a crucial challenge in our future work. We performed word and phoneme discovery from speech signals. However, we limited the number of words and phonemes in the experiments. Therefore, we did not test our method on the open-ended learning of words and phonemes. Language acquisition from speech signals that emphasize prosody, such as infant-directed speech by human parents, and from natural speech signals, such as daily conversation, is a topic for our future work.
Computational cost remains a problem for our method. The current inference procedure has a computational cost of $O(T N_{\max}^2 L_{\max} d_{\max}^2)$, where $T$ is the number of frames, $L_{\max}$ is the maximum number of latent letters in a latent word, $d_{\max}$ is the maximum frame length of a latent word, and $N_{\max}$ is the maximum number of words [51]. For 100 iterations of Gibbs sampling on two Intel Xeon E5-2650 v2 CPUs (2.60 GHz, 8 cores, 16 threads), inference took approximately 4 min on the dataset in Experiment 1, where the maximum number of letters and words was set to 10, and approximately 30 min on the dataset in Experiment 2, where it was set to 50. The inference time depends on the maximum number of words. Therefore, an improvement in the computational cost is still required for a large dataset with many words. Introducing a neural network for inference, i.e., amortized inference [52], is a feasible approach to reducing the computational cost and would allow us to make use of GPUs.
One direction of our future research is to develop a robot that automatically learns words and phonemes simply by speaking and responding to them. In this study, we focused on language acquisition from statistical information and prosodic information and proposed a mathematical model for language acquisition. Several studies have suggested that using co-occurrence information improves the accuracy of language acquisition [53], [54]. Another direction of our future research is to incorporate co-occurrence cues into the double articulation analyzer and obtain a mathematical model for more accurate word and phoneme discovery.
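The effect of the maximum number of words on the stated order of computation can be made concrete with a minimal sketch. It assumes that the cost is quadratic in the maximum number of words and in the maximum word duration and linear in the other two terms, with constant factors dropped; the function name and all parameter values except the word limits 10 and 50 are hypothetical.

```python
def relative_cost(num_frames, n_max, l_max, d_max):
    """Operation count proportional to T * N_max^2 * L_max * d_max^2.

    Constant factors are dropped, so only ratios between calls are meaningful.
    """
    return num_frames * n_max ** 2 * l_max * d_max ** 2


# Hold every term fixed except the maximum number of words (toy values).
base = relative_cost(num_frames=1000, n_max=10, l_max=5, d_max=100)
more_words = relative_cost(num_frames=1000, n_max=50, l_max=5, d_max=100)
ratio = more_words / base
print(ratio)  # (50 / 10) ** 2 = 25.0
```

Under this assumption, raising the word limit from 10 to 50 alone multiplies the cost by 25. The measured times for Experiments 1 and 2 are not expected to match this ratio exactly, because the two datasets also differ in their other terms.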