Non-blocking communication is widely used in parallel applications to hide communication overhead by overlapping computation with communication. While most existing implementations provide non-blocking versions of point-to-point communication, there is no portable and efficient implementation of non-blocking collectives, partly because application execution contexts need to be interrupted by dependent communications. This paper presents a portable and efficient user-level implementation technique for non-blocking communication. It allows users to design non-blocking collectives by declaring their operations and dependencies through the provided APIs, without being concerned with the complicated management of their progression. Although user-level implementations can be less efficient than kernel-level ones because of the cost of OS context switches, we solve this problem by employing the Marcel user-level lightweight thread library when invoking communication operations. More specifically, each communication operation is mapped to one Marcel thread and is scheduled for execution once its dependencies are satisfied by certain events. All executable operations and the main user thread run concurrently without any explicit invocation. Performance evaluations with microbenchmarks demonstrate the effectiveness of the proposed technique. Compared with an existing OS-thread-based method, it reduces CPU load to less than 10% while achieving a similar level of communication latency. We also discuss and compare the descriptive power of internal expressions for non-blocking communications.
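The idea of declaring operations with dependencies and letting them progress alongside the main thread can be illustrated with a minimal sketch. It is not the paper's API: the names `Operation`, `Schedule`, and `run_demo` are hypothetical, the "communication" steps are placeholder functions, and Python's OS-backed `threading.Thread` stands in for Marcel user-level threads, so the sketch shows only the dependency-driven scheduling and not the CPU-load benefit reported above.

```python
import threading


class Operation:
    """One declared step (hypothetical stand-in for a send or receive)."""

    def __init__(self, name, action, deps=()):
        self.name = name
        self.action = action
        self.deps = list(deps)
        self.dependents = []            # operations waiting on this one
        self.pending = len(self.deps)   # unsatisfied dependencies
        self.done = threading.Event()
        for dep in self.deps:
            dep.dependents.append(self)


class Schedule:
    """Runs each operation in its own thread once its dependencies finish."""

    def __init__(self, ops):
        self.ops = list(ops)
        self.lock = threading.Lock()
        self.threads = []

    def start(self):
        # Launch operations that have no dependencies, then return
        # immediately so the caller can keep computing.
        with self.lock:
            for op in self.ops:
                if op.pending == 0:
                    self._launch(op)

    def _launch(self, op):
        # Caller must hold self.lock.
        thread = threading.Thread(target=self._run, args=(op,))
        self.threads.append(thread)
        thread.start()

    def _run(self, op):
        op.action()
        op.done.set()
        # Completion is the event that may make dependents executable.
        with self.lock:
            for nxt in op.dependents:
                nxt.pending -= 1
                if nxt.pending == 0:
                    self._launch(nxt)

    def wait(self):
        for op in self.ops:
            op.done.wait()
        for thread in self.threads:
            thread.join()


def run_demo():
    """Tree-style forwarding: receive once, then forward to two children."""
    log, log_lock = [], threading.Lock()

    def record(label):
        def action():
            with log_lock:
                log.append(label)
        return action

    recv = Operation("recv", record("recv"))
    left = Operation("send_left", record("send_left"), deps=[recv])
    right = Operation("send_right", record("send_right"), deps=[recv])

    schedule = Schedule([recv, left, right])
    schedule.start()            # non-blocking: operations progress on their own
    record("compute")()         # main thread overlaps its own work
    schedule.wait()             # completion point, analogous to a wait call
    return log


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the only ordering guarantee is that `recv` completes before either send starts; the relative order of the two sends and of the main thread's `compute` entry is left to the thread scheduler.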