Because the computers in a GPU cluster can differ in CPU, memory size, GPU devices, and so on, the nodes are heterogeneous and unreliable, and dynamic load balancing is a difficult problem that a GPU cluster system needs to solve. In this paper, we discuss a method that dispatches the appropriate tasks to each node to achieve load balancing. We assume that each node has an initial hyper-computing capability; this capability is updated dynamically according to the number of tasks the node completes in each cycle. We also show how tasks are resent when some nodes disconnect, which improves the system's reliability. In our experiments, the load of each computing node was balanced within a few minutes, and when some nodes disconnected, the computing tasks still completed normally.
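The scheme described above can be illustrated with a minimal sketch. The abstract does not give the actual update rule or dispatch policy, so everything here is an assumption made for illustration: the `Dispatcher` class, its method names, the proportional split of tasks, and the exponential-smoothing update of each node's capability are all hypothetical and should not be read as the authors' implementation.

```python
from collections import deque


class Dispatcher:
    """Hypothetical dispatcher; names and update rule are illustrative assumptions."""

    def __init__(self, initial_capability, smoothing=0.5):
        # node name -> assumed initial capability (tasks per cycle)
        self.capability = dict(initial_capability)
        # assumed weight given to the newest per-cycle measurement
        self.smoothing = smoothing
        self.pending = deque()  # tasks waiting to be dispatched
        self.in_flight = {node: [] for node in self.capability}

    def submit(self, tasks):
        self.pending.extend(tasks)

    def dispatch_cycle(self, batch_size):
        """Hand out up to batch_size tasks, split in proportion to capability."""
        total = sum(self.capability.values())
        if total <= 0:
            return {}
        budget = min(batch_size, len(self.pending))
        assigned = {}
        for node, cap in self.capability.items():
            share = int(budget * cap / total)  # rounded down
            tasks = [self.pending.popleft() for _ in range(share)]
            self.in_flight[node].extend(tasks)
            assigned[node] = tasks
        # Tasks left over from rounding go to the most capable node.
        leftover = budget - sum(len(t) for t in assigned.values())
        if leftover > 0:
            best = max(self.capability, key=self.capability.get)
            extra = [self.pending.popleft() for _ in range(leftover)]
            self.in_flight[best].extend(extra)
            assigned[best].extend(extra)
        return assigned

    def report_completed(self, node, done_tasks):
        """Update a node's capability from the tasks it finished this cycle."""
        done = set(done_tasks)
        self.in_flight[node] = [t for t in self.in_flight[node] if t not in done]
        a = self.smoothing
        self.capability[node] = (1 - a) * self.capability[node] + a * len(done)

    def disconnect(self, node):
        """Drop a node and put its unfinished tasks back at the front of the queue."""
        lost = self.in_flight.pop(node, [])
        self.capability.pop(node, None)
        self.pending.extendleft(reversed(lost))  # keeps the original task order
```

In this sketch, a node that completes more tasks in a cycle receives a larger share in the next one, and the tasks held by a disconnected node are dispatched again to the remaining nodes. A real system would also need some way to detect disconnection, such as a heartbeat or timeout, which is omitted here.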