Collective communication operations provided by the Message Passing Interface (MPI) are heavily used by scientific applications at large scale. The current MPI standard, MPI-2.2, defines only blocking collective communication calls, which preclude the overlap of computation and communication. MPI-3 is expected to introduce non-blocking collective communication. The newly introduced ConnectX-2 InfiniBand adapter from Mellanox features an offload mechanism that enables the Network Interface Card (NIC) to perform a series of communication and reduction operations without involving the host processor. Current-generation MPI stacks implement each collective operation using point-to-point operations. To take advantage of the offload feature in a rapidly changing architectural environment, MPI collectives must be re-designed using flexible, generalized primitives from which the various collective algorithms can be composed. These primitives must deliver increased overlap on adapters with offload capabilities across varying collective group sizes and communication message sizes. In this paper, we take on the challenge of designing collective communication primitives with good overlap characteristics and evaluate their performance using the ConnectX-2 offload feature. We also show how collectives such as Barrier can be designed using our communication primitives. Our evaluation reveals that these primitives achieve near-perfect (94%-100%) overlap of computation and communication. Additionally, we observe a performance improvement of up to 5% when using the Recv-Replicate primitive for data transfer.