We propose a two-stage method for detecting abnormal behaviours, such as aggression and fights, in urban environments. The method is intended for operator support in surveillance applications and is based on the fusion of evidence from audio and optical sensors. In the first stage, a number of modality-specific detectors recognise low-level events. Their outputs serve as input to the second stage, which fuses and disambiguates the first-stage detections. Experimental evaluation on scenes from the outdoor part of the PROMETHEUS database demonstrates the practical viability of the proposed approach. We report a fight detection rate of 81% when both audio and optical information are used. Performance drops when evidence from audio data is excluded from the fusion process. Finally, when evidence from only one camera is used to detect fights, recognition performance is poor.