Clustering is a technique to decentralize the design of future wide issue VLIW cores and enable them to meet the technology constraints in terms of cycle time, area and power dissipation. In a clustered design, registers and functional units are grouped in clusters so that new instructions are needed to move data between them. New aggressive instruction scheduling techniques are required to minimize the negative effect of resource clustering and delays in moving data around. In this paper we present a novel software pipelining technique that performs instruction scheduling with reduced register requirements, register allocation, register spilling and inter-cluster communication in a single step. The algorithm uses limited backtracking to reconsider previously taken decisions. This backtracking provides the algorithm with additional possibilities for obtaining high throughput schedules with low spill code requirements for clustered architectures. We show that the proposed approach outperforms previously proposed techniques and that it is very scalable independently of the number of clusters, the number of communication buses and communication latency. The paper also includes an exploration of some parameters in the design of future clustered VLIW cores.