Fast recovery from software and hardware failures is essential to communication systems, especially when they are used for mission-critical applications such as public safety systems. A failure in the network infrastructure can affect a large number of users and may result in loss of life. The infrastructure software applications, which provide services to mobile stations according to defined communication protocols, play a key role in system availability. The real-time, peer-to-peer nature of these protocols makes it challenging to develop a recovery mechanism that works in such environments. In this paper, we introduce a new recovery method that takes into account the layered architecture of the communication protocols and their peer-to-peer communication pattern. The method is based on communicating extended finite state machines and does not assume that failures are transient or fail-stop. We have implemented an experimental testbed to evaluate the approach. The experimental results show that the infrastructure applications can reliably recover and quickly restore the level of service that the system was providing immediately before the failure. Moreover, the failure-free overhead of the approach is low: it was measured experimentally at less than 5%.