This paper implements a technique that improves the parallel execution of auto-generated OpenMP programs by taking the architecture of on-chip cache memory into account. It avoids false sharing in 'for' loops by generating OpenMP code that dynamically schedules chunks sized so that the data handled by each core lies one data cache line apart. Par4All, an open-source parallelization tool, has been analyzed and enhanced with the aim of maximizing hardware utilization. Several computationally intensive programs from PolyBench have been tested on different architectures with different data sets, and the results show that the OpenMP code generated by the enhanced technique achieves considerable speedup.
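To make the idea concrete, the following is a minimal sketch of a dynamically scheduled loop whose chunk size is derived from the cache line size. It is not the code emitted by Par4All or by the paper's enhanced generator. The 64-byte line size and the names `CACHE_LINE_BYTES`, `chunk_for_line`, and `scale` are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed data cache line size in bytes. 64 is common on current
 * processors but not universal; query the target machine in practice. */
#define CACHE_LINE_BYTES 64

/* Hypothetical helper: number of elements of the given size that fill
 * one cache line (at least 1 for elements larger than a line). */
int chunk_for_line(size_t elem_size)
{
    assert(elem_size > 0);
    size_t n = CACHE_LINE_BYTES / elem_size;
    return n > 0 ? (int)n : 1;
}

/* Hypothetical kernel: b[i] = a[i] * factor.
 * Each dynamically scheduled chunk covers one cache line's worth of
 * doubles, so two threads do not write into the same line of b,
 * provided b itself starts on a cache line boundary. */
void scale(const double *a, double *b, int n, double factor)
{
    int chunk = chunk_for_line(sizeof(double));
    int i;

    #pragma omp parallel for schedule(dynamic, chunk)
    for (i = 0; i < n; i++) {
        b[i] = a[i] * factor;
    }
}
```

Compiled with an OpenMP-enabled compiler (for example `-fopenmp` with GCC), the loop runs in parallel; without OpenMP support the pragma is ignored and the loop runs serially with the same result. The chunk boundaries coincide with cache line boundaries only if the output array is aligned to the line size, which this sketch leaves to the caller.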