In high-performance clusters, the communication needs of current parallel applications, such as traffic pattern and communication volume, change over time and are difficult to know in advance. Such needs often exceed or fail to match the available resources, causing resource-use imbalance, network congestion, reduced throughput, and increased message latency, and thus degrading overall system performance. Studies of parallel applications show repetitive behavior that can be characterized by a set of representative phases. This work presents Predictive and Distributed Routing Balancing (PRDRB), a new technique developed to control network congestion gradually, based on path expansion, traffic distribution, the repetitiveness of application patterns, and speculative adaptive routing, in order to keep latency low. PRDRB monitors message latencies at routers and logs the solutions it finds to congestion so that it can respond quickly to similar situations in the future. Traffic congestion experiments were conducted to evaluate the performance of the method, and improvements were observed.