This paper proposes a general framework that explicitly models both local and global information in a conditional random field. The proposed method extracts global image features in addition to local ones and uses them to predict the scene of the input image. Scene-based top-down information is then generated from the predicted scene; it represents the global spatial configuration of labels and the category compatibility over the image. Incorporating this global information helps to resolve local ambiguities and yields locally and globally consistent image recognition. Despite the model's simplicity, the proposed method demonstrates good performance in image labeling on two datasets.
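As an illustration of how scene-based top-down information might enter the labeling, the following is a minimal sketch and not the paper's actual model. It assumes that local scores, a scene-conditioned spatial prior, and a scene-conditioned label compatibility table are already available as arrays. The function name `label_image`, the weights `w_global` and `w_pair`, the 4-connected grid, and the use of iterated conditional modes (ICM) for approximate inference are all assumptions made for this example.

```python
import numpy as np

def label_image(local_scores, scene_probs, spatial_prior, compat,
                w_global=1.0, w_pair=1.0, n_iters=5):
    """Hypothetical sketch: grid labeling with scene-based top-down terms.

    local_scores : (H, W, K) log-scores from local features
    scene_probs  : (S,) scene distribution predicted from global features
    spatial_prior: (S, H, W, K) log-prior of each label per position and scene
    compat       : (S, K, K) log-compatibility of neighboring labels per scene
    """
    # Pick the most likely scene (assumption: a hard scene decision).
    s = int(np.argmax(scene_probs))

    # Unary term: local evidence plus the scene's spatial label layout.
    unary = local_scores + w_global * spatial_prior[s]
    labels = unary.argmax(axis=2)
    H, W, _ = unary.shape

    # ICM: greedily re-label each site given its current neighbors.
    for _ in range(n_iters):
        for i in range(H):
            for j in range(W):
                score = unary[i, j].copy()
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        # Scene-specific compatibility with the neighbor's label.
                        score += w_pair * compat[s][:, labels[ni, nj]]
                labels[i, j] = int(score.argmax())
    return s, labels
```

In this sketch, a site whose local scores only weakly favor one label can be overruled by the spatial prior and by compatibility with its neighbors, which is the sense in which global information resolves local ambiguities. Setting `w_global` and `w_pair` to zero reduces it to purely local labeling.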