Sort-last parallel rendering is an efficient technique to visualize huge datasets on COTS clusters. The dataset is subdivided and distributed across the cluster nodes. For every frame, each node renders a full resolution image of its data using its local GPU, and the images are composited together using a parallel image compositing algorithm. In this paper, we present a performance evaluation of standard sort-last parallel rendering methods and of the different improvements proposed in the literature. This evaluation is based on a detailed analysis of the different hardware and software components. We present a new implementation of sort-last rendering that fully overlaps CPU(s), GPU and network usage all along the algorithm. We present experiments on a 3 years old 32-node PC cluster and on a 1.5 years old 5-node PC cluster, both with Gigabit interconnect, showing volume rendering at respectively 13 and 31 frames per second and polygon rendering at respectively 8 and 17 frames per second on a 1024 x 768 render area, and we show that our implementation outperforms or equals many other implementations and specialized visualization clusters.