A top-down, task-dependent model guides attention to likely target locations in cluttered scenes. Here, a novel, biologically plausible top-down auditory attention model is presented to capture such task-dependent influences. First, multi-scale features are extracted based on the processing stages of the central auditory system and converted to low-level auditory "gist" features, which capture rough information about the overall scene. Then, the top-down model learns the mapping between the auditory gist features and the scene categories. The proposed model is tested on a prominent syllable detection task in speech. On broadcast news-style read speech from the BU Radio News Corpus, it achieves 85.8% prominence detection accuracy at the syllable level. This result compares well with the reported human performance on the task.
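The pipeline summarized above (multi-scale features, then coarse gist features, then a learned mapping to scene categories) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the model itself: the 4-by-5 averaging grid, the use of time-axis subsampling as a crude stand-in for multi-scale auditory processing, the names `gist_features`, `scene_gist`, and `NearestCentroidClassifier`, and the nearest-centroid learner, which is swapped in only because it is the simplest way to show a mapping from gist vectors to categories.

```python
import numpy as np


def gist_features(feature_map, grid=(4, 5)):
    """Summarize a 2-D feature map by its mean over a coarse grid.

    The 4x5 grid is a hypothetical choice for this sketch.
    """
    cells = []
    for row_block in np.array_split(feature_map, grid[0], axis=0):
        for cell in np.array_split(row_block, grid[1], axis=1):
            cells.append(cell.mean())
    return np.array(cells)


def scene_gist(spectrogram, scales=(1, 2, 4)):
    """Concatenate gist vectors computed at several temporal scales.

    Subsampling the time axis is an assumed stand-in for the multi-scale
    feature extraction stage; it is not the auditory front end itself.
    """
    return np.concatenate([gist_features(spectrogram[:, ::s]) for s in scales])


class NearestCentroidClassifier:
    """Assumed stand-in learner: assign each gist vector to the closest class mean."""

    def fit(self, X, y):
        self.labels_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.labels_])
        return self

    def predict(self, X):
        # Euclidean distance from every sample to every class centroid.
        dists = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.labels_[dists.argmin(axis=1)]
```

In this sketch, each scene (for example, a syllable-length time-frequency patch) would be passed through `scene_gist`, and the resulting vectors with their category labels (prominent or non-prominent) would be given to `fit`. The input patch needs enough time frames for the coarsest scale to fill the grid.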