This work presents a new interactive learning method for spoken-word acquisition through a human-machine multimodal interface. During learning, the machine decides, using both speech and visual cues, whether an orally input word already belongs to the lexicon it has acquired. Learning proceeds on-line and incrementally, combining active and unsupervised learning principles. If the machine judges with high confidence that its decision is correct, it learns the statistical model of the word and a corresponding image class as its meaning in an unsupervised way; otherwise, it actively asks the user a question. The function used to estimate the degree of confidence is itself adapted on-line. Experimental results show that the method enables the machine and the user to adapt to each other, making the learning process more efficient.
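The decision loop described above, choosing between an unsupervised update and an active query based on estimated confidence, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the lexicon representation (a simple count dictionary), and the fixed threshold are all assumptions.

```python
def process_utterance(word, confidence, lexicon, threshold=0.8, query_user=None):
    """Confidence-gated learning step (illustrative sketch).

    If the machine's confidence that `word` matches a lexicon entry is
    high, update that entry's statistics without supervision; otherwise
    fall back to asking the user (active learning). In the paper the
    confidence estimator itself is also adapted on-line; here the
    threshold is fixed for simplicity.
    """
    if confidence >= threshold:
        # High confidence: unsupervised update of the word's statistics
        # (here just an observation count as a stand-in for the
        # statistical word and image-class models).
        lexicon[word] = lexicon.get(word, 0) + 1
        return "unsupervised"
    # Low confidence: actively query the user instead of updating.
    if query_user is not None:
        query_user(word)
    return "active"
```

A usage example: a confident observation of "apple" updates the lexicon directly, while a low-confidence observation of "pear" triggers a question to the user.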