This study proposes a multi-criteria, hierarchical evaluation system for building extraction from remotely sensed data. Most current evaluation methods focus on classification accuracy, while other dimensions of extraction accuracy are usually ignored. The proposed evaluation system consists of three components: 1) the matched rate, which comprises the traditional classification accuracy metrics (e.g., completeness, correctness, and quality); 2) the shape similarity, which describes the resemblance between reference and extracted buildings using image-based and polygon-based metrics; and 3) the positional accuracy, which is measured by distances at feature points such as the building centroid. The system also evaluates extracted buildings hierarchically at the per-building, per-scene, and overall levels. To reduce redundancy among the metrics, principal component analysis and correlation analysis are employed for metric selection and aggregation. Four building extraction methods, using high-resolution optical imagery and/or LiDAR data, are implemented to test the proposed system. The experiment demonstrates that the proposed system is more consistent with human visual judgment than traditional classification accuracy metrics. The system can highlight perceptible differences between the extracted building footprints and the reference data, even when these differences are insignificant as measured by the traditional metrics.
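As an illustration of the matched-rate component, the following minimal sketch computes completeness, correctness, and quality from counts of true positives, false positives, and false negatives. The abstract names these metrics without defining them, so the formulas below are the definitions commonly used in the building extraction literature and are assumed here, not taken from this study. The function name `matched_rate_metrics` and the zero-denominator convention are hypothetical choices made for the example.

```python
from typing import Tuple


def matched_rate_metrics(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (completeness, correctness, quality) from match counts.

    Assumed definitions (not stated in the abstract):
      completeness = TP / (TP + FN)       share of reference that was extracted
      correctness  = TP / (TP + FP)       share of extraction that is correct
      quality      = TP / (TP + FP + FN)  combined measure
    The counts may be pixels or matched buildings, depending on the level
    at which the evaluation is run.
    """
    def ratio(numerator: int, denominator: int) -> float:
        # Assumption for this sketch: an empty denominator yields 0.0.
        return numerator / denominator if denominator else 0.0

    completeness = ratio(tp, tp + fn)
    correctness = ratio(tp, tp + fp)
    quality = ratio(tp, tp + fp + fn)
    return completeness, correctness, quality


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    c, r, q = matched_rate_metrics(tp=80, fp=10, fn=20)
    print(f"completeness={c:.3f} correctness={r:.3f} quality={q:.3f}")
```

Because quality includes both error types in its denominator, it can never exceed either completeness or correctness. None of the three values describes the shape or position of an extracted footprint, which is the gap the shape similarity and positional accuracy components are meant to fill.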