It is elusive how the skull-enclosed brain enables spatio-temporal multimodal developmental learning. By multimodal, we mean that the system has at least two sensory modalities, e.g., visual and auditory in our experiments. By spatio-temporal, we mean that the behavior from the system depends not only on the spatial pattern in the current sensory inputs, but also those of the recent past. Traditional machine learning requires humans to train every module using hand-transcribed data, using handcrafted symbols among modules, and hand-link modules internally. Such a system is limited by a static set of symbols and static module performance. A key characteristic of developmental learning is that the “brain” is “skull-closed” after birth - not directly manipulatable by the system designer - so that the system can continue to learn incrementally without the need for reprogramming. In this paper, we propose an architecture for multimodal developmental learning - parallel modality pathways all situate between a sensory end and the motor end. Motor signals are not only used as output behaviors, but also as part of input to all the related pathways. For example, the proposed developmental learning does not use silence as cut points for speech processing or motion static points as key frames for visual processing.