This paper shows the effectiveness of using optimized MPI calls for MPI based applications on different architectures. Using optimized MPI calls can result in reasonable performance gain for most of MPI based applications running on most of high-performance distributed systems. Since relative performance of different MPI function calls and system architectures can be uncorrelated, tuning system-dependent MPI applications by exploring the alternatives of using different MPI calls is the simplest but most effective optimization method. The paper first shows that for a particular system, there are noticeable performance differences between using various MPI calls that result in the same communication pattern. These performance differences are in fact not similar across different systems. The paper then shows that good performance optimization for an MPI application on different systems can be obtained by using different MPI calls for different systems. The communication patterns that were experimented in this paper include the point-to-point and collective communications. The MPI based application used for this study is the general-purpose transient dynamic finite element application and the benchmark problems are the public domain 3D car crash problems. The experiment results show that for the same communication purpose, using alternative MPI calls can result in quite different communication performance on the Fujitsu HPC2500 system and the 8-node AMD Athlon cluster, but very much the same performance on the other systems such as the Intel Itanium2 and the AMD Opteron clusters.