The success of query-by-concept, proposed recently to cater to video retrieval needs, depends greatly on the accuracy of concept-based video indexing. Unfortunately, it remains a challenge to recognize the presence of concepts in a video segment or to extract an objective linguistic description from it because of the semantic gap, that is, the lack of correspondence between machine-extracted low-level features and human high-level conceptual interpretation. This paper studies three issues with the aim to reduce such a gap: 1) how to explore cues beyond low-level features, 2) how to combine diverse cues to improve performance, and 3) how to utilize the learned knowledge when applying it to a new domain. To solve these problems, we propose a framework that jointly exploits multiple cues across multiple video domains. First, recursive algorithms are proposed to learn both interconcept and intershot relationships from annotations. Second, all concept labels for all shots are simultaneously refined in a single fusion model. Additionally, unseen shots are assigned pseudolabels according to their initial prediction scores so that contextual and temporal relationships can be learned, thus requiring no additional human effort. Integration of cues embedded within training and testing video sets accommodates domain change. Experiments on popular benchmarks show that our framework is effective, achieving significant improvements over popular baselines.