Interest in music retrieval has been growing steadily. The increasing amount of music available both online and offline, together with a broadening user spectrum, calls for more efficient query methods. We believe that a parallel, multimodal combination of different input modalities forms the most intuitive way for any user to access the desired media. In this paper we introduce a query system based on humming, speaking, writing, and typing. The strengths of each modality are combined synergetically by soft decision fusion. Songs can be referenced by their melody, artist, title, or other specific information. Furthermore, recognizing the user's current emotion and exploiting external contextual knowledge helps to build an expectation of the intended song at a given time. This constrains the hypothesis space of possible songs and leads to more robust recognition, or even a suggestive query. A combination of artificial neural networks, hidden Markov models, and dynamic time warping, integrated in a Bayesian belief network framework, forms the mathematical backbone of the chosen hybrid architecture. We discuss the implementation of a working system and the results achieved with the introduced methods.
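To make the melody-matching ingredient of the hybrid architecture concrete, the following is a minimal sketch of dynamic time warping applied to query by humming. The pitch contours, song names, and scoring rule are hypothetical illustrations, not taken from the system described above.

```python
def dtw_distance(query, reference):
    """Align two pitch sequences and return their cumulative DTW cost."""
    n, m = len(query), len(reference)
    INF = float("inf")
    # cost[i][j]: best alignment cost of query[:i] against reference[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(query[i - 1] - reference[j - 1])  # local pitch distance
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match/substitution
    return cost[n][m]

# Rank candidate songs by DTW distance to the hummed query contour.
# Pitches are hypothetical MIDI note numbers.
hummed = [60, 62, 64, 65, 64, 62, 60]
candidates = {
    "song_a": [60, 62, 64, 65, 64, 62, 60],
    "song_b": [55, 57, 55, 53, 55, 57, 59],
}
best = min(candidates, key=lambda s: dtw_distance(hummed, candidates[s]))
```

In the full system such a distance would be only one evidence source, fused with the speech, handwriting, and typed-text hypotheses rather than used in isolation.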