Skip to Main Content
Until recently, a steadily rising clock rate and other uniprocessor micro architectural improvements could be relied upon to consistently deliver increasing performance for a wide range of applications. Current difficulties in maintaining this trend have lead microprocessor manufacturers to add value by incorporating multiple processors on a chip. Unfortunately, since decades of compiler research have not succeeded in delivering automatic threading for prevalent code properties, this approach demonstrates no improvement for a large class of existing codes. To find useful work for chip multiprocessors, we propose an automatic approach to thread extraction, called decoupled software pipelining (DSWP). DSWP exploits the finegrained pipeline parallelism lurking in most applications to extract long-running, concurrently executing threads. Use of the nonspeculative and truly decoupled threads produced by DSWP can increase execution efficiency and provide significant latency tolerance, mitigating design complexity by reducing intercore communication and per-core resource requirements. Using our initial fully automatic compiler implementation and a validated processor model, we prove the concept by demonstrating significant gains for dual-core chip multiprocessor models running a variety of codes. We then explore simple opportunities missed by our initial compiler implementation which suggest a promising future for this approach.