BPCM: A Flexible High-Speed Bypass Parallel Communication Mechanism for GPU Cluster

With the increasing complexity of the computational tasks faced by artificial intelligence, the scale of machine learning models continues to expand, and the volume and frequency of parameter synchronization grow with it. Communication bandwidth within the GPU cluster therefore becomes the biggest bottleneck for distributed model training. Many existing solutions cannot be widely adopted because they require specialized equipment, are costly, and are difficult to use. To solve this problem, this paper proposes a multi-network-card bypass parallel communication mechanism based on Intel DPDK technology, which increases the bandwidth within the GPU cluster at low cost and makes full use of the idle CPU resources of the GPU server to accelerate data transmission. First, we propose a data transmission model based on multiple network cards and design a port load balancing algorithm to keep the load of the network cards balanced. Second, a CPU multi-core scheduling model and algorithm are implemented to reduce CPU energy consumption, resource occupation, and the impact on other applications. Furthermore, for multiple application scenarios, a rate adjustment model and algorithm are designed and implemented to ensure fair use of bandwidth across applications. Finally, the experimental results show that this mechanism can provide high bandwidth for GPU clusters with inexpensive network cards, offers the superimposed bandwidth of multiple network cards in a single connection, achieves high reliability and transmission efficiency, and is simple to use and flexible to extend.


I. INTRODUCTION
In recent years, artificial intelligence technology has achieved unprecedented breakthroughs in many application fields, relying on powerful computing power and massive training data. With the continuous expansion of artificial intelligence applications, the problems to be solved grow more difficult, computing tasks become more complex, and machine learning models grow larger, placing higher demands on computing power. Training a model on a machine equipped with a single GPU may take weeks, so extending model training to multiple GPUs is the trend [1]. The currently popular single-machine multi-GPU server holds a limited number of GPUs at high cost, and it does not perform well on larger-scale model training. Distributed machine learning frameworks spanning multiple machines and multiple GPUs have therefore become a hot topic in the industry, and there are many open-source distributed machine learning systems, such as Spark, PyTorch, and TensorFlow [2]. With the continuous growth of machine learning models and the rapid development of computing engines such as GPUs, not only has the amount of training data required by model training grown, but the amount and frequency of parameter synchronization data have increased as well. The surge in data exchanged among GPU nodes during model training makes network communication among nodes the bottleneck of distributed training. The remaining components of the BPCM mechanism are designed and implemented by other members of the team. Because of BPCM's high data transmission rate, the traditional TCP reliable transmission protocol is not applicable; the BPCM reliable transmission protocol will be introduced in a follow-up paper. This paper only discusses end-to-end unreliable data transmission at high speed.
In the next section, we describe the background and motivation of BPCM. The third section introduces the physical topology, cluster structure, and system framework of BPCM. The fourth section presents the design and implementation of BPCM in detail. The fifth section verifies multiple aspects of BPCM's performance through several sets of experiments, and the sixth section concludes the paper.

II. BACKGROUND AND MOTIVATIONS
A. BACKGROUND
In a distributed machine learning system, the communication overhead among GPU nodes easily becomes the bottleneck of the entire distributed training task [9]. To address this, researchers have proposed many excellent solutions from the aspects of reducing data transmission volume, increasing communication bandwidth, improving network communication protocols, and redesigning the underlying physical topology.
In the distributed training process, a good parameter synchronization architecture or algorithm can reduce the amount of communication data and the delay. The parameter synchronization architectures widely used in large-scale distributed machine learning systems currently include the Parameter Server (PS) architecture, the Ring architecture [10], and Hierarchical Parameter Synchronization (HiPS) [11]. The PS architecture is widely adopted for its simple deployment, good scalability, and robustness. However, since the number of parameter servers is far smaller than the number of worker machines, the parameter servers easily become network bottlenecks; introducing them also increases the hardware cost of the system. Stronger hardware can greatly reduce communication delay, but the cost rises accordingly. In the Ring architecture, each node communicates only with its two adjacent nodes, avoiding a centralized communication bottleneck; however, parameters are passed sequentially around the ring, which often takes more rounds to complete synchronization and increases communication overhead [12]. Hierarchical Parameter Synchronization (HiPS) [13] divides nodes into multiple groups and performs parameter synchronization layer by layer and in parallel, thereby reducing the impact of communication delay [14], [15]. DAIET proposed using programmable switches to implement In-Network Aggregation (INA), moving the logic of parameter aggregation from a dedicated parameter server to a high-throughput switch [16]. Parameter Hub proposed PBox, which equips a centralized PS with multiple network cards to match I/O performance with memory bandwidth [17].
Network transmission protocols also have an important impact on parameter synchronization performance. Although the traditional TCP/IP stack has been successful in the WAN, it is not well suited to dedicated distributed machine learning systems because of its shortcomings in congestion control and system implementation. To improve network performance, RDMA (Remote Direct Memory Access) has been applied to distributed machine learning systems [18]. GDR (GPU Direct RDMA) technology makes full use of the zero-copy feature, greatly reducing the transmission delay among GPUs, with end-to-end performance improvements of up to 2.43 times [18]. MPI + RDMA takes advantage of the All-Reduce primitive provided by MPI and has been widely used [19], [20]. NVIDIA introduced a new bus and its communication protocols, NVLink and NVSwitch [21], [22], which greatly improve the communication bandwidth among multiple GPUs and reduce latency, so that PCIe bandwidth is no longer the communication bottleneck among GPUs [23]. NCCL (NVIDIA Collective multi-GPU Communication Library) is a multi-GPU collective communication library introduced and heavily optimized by NVIDIA [24], [25]. NCCL uses the Ring communication architecture and can use PCIe, NVLink, Socket, or RDMA as the underlying transport. Blink [26], launched by a research team from Microsoft Research, UC Berkeley, and the University of Wisconsin-Madison, can make full use of all homogeneous and heterogeneous network links to achieve optimal data aggregation among GPUs.
The underlying physical network topology also directly affects communication performance. Fat-Tree [27] is a network topology widely used in data centers, but it causes problems when used in distributed machine learning systems. Fat-Tree is a switch-centric topology that builds a non-blocking network through rich connections between adjacent layers of switches; each server accesses the network through only one network card, which limits the network bandwidth of the node. BCube [27] builds the network through multiple layers of switches, with each server directly connected to each layer through multiple network cards, so parameters can be updated in parallel.
For smaller machine learning models, good parameter synchronization architectures and algorithms perform well, but they prove inadequate for training larger models [9]. The INA model migrates parameter server tasks to high-throughput switches but requires programmable switch support, increasing cost and difficulty of use. RDMA technology also requires specialized hardware, such as InfiniBand, iWARP (Internet Wide Area RDMA Protocol), or RoCE (RDMA over Converged Ethernet), at a higher price. NVLink performs well on NVIDIA DGX servers but cannot scale to ordinary multi-machine GPU clusters [28], [29]. Blink is likewise only suitable for data exchange among multiple GPUs in a single machine. Although a 10-gigabit network card provides 10 Gbps of bandwidth, compared to the 100 Gbps available among GPUs within a single machine it still easily becomes the communication bottleneck between two nodes in the GPU cluster. Superimposing the bandwidth of multiple network cards through the Linux bonding mode not only fails to let a single connection exceed the theoretical upper limit of one network card, but also requires the switch to support IEEE 802.3ad, and the configuration is cumbersome.

B. MOTIVATIONS
To achieve high-speed data transmission at low cost, we can draw on many aspects of previous research. RDMA bypasses the system kernel and lets user space exchange data directly with the RNIC, which greatly reduces the protocol stack's packet-processing delay. The traditional TCP/UDP stack is deficient in congestion control and system implementation; its transmission efficiency is insufficient, a problem that is even more prominent in the local area network. The idea of multi-network-card bandwidth superposition proposed by Parameter Hub and BCube can use multiple low-cost gigabit network cards to provide the bandwidth of a 10-gigabit network card, or multiple 10-gigabit network cards to provide 100 Gbps of bandwidth, at a lower price.
Intel's DPDK technology provides the support needed to realize the above requirements. BPCM is built on DPDK. BPCM uses DPDK's user-mode driver to move packet processing into user space: it not only bypasses the kernel protocol stack, enabling network cards to approach line-rate transmission, but also lets users define custom communication protocols directly on the data link layer. BPCM further combines DPDK's multi-core and multi-port characteristics to achieve parallel high-speed transmission over multiple network ports, superimposing their bandwidth. The advantages of BPCM are as follows: 1) Ordinary network cards and Layer 2 switches can be used, so the cost is low. 2) Bypass transmission does not conflict with the traditional network architecture. 3) Multiple network cards are supported, making it easy to expand network bandwidth. 4) Bypassing the traditional kernel protocol stack makes data processing faster. 5) Custom protocols on top of the data link layer are implemented in user space, with high transmission efficiency and security. 6) It exploits the CPU's multi-core characteristics, making full use of idle CPU resources in a distributed machine learning environment. 7) Based on a Layer 2 switch topology, the cluster scale is easy to expand, simple to use, and flexible to configure. 8) A single connection is provided with the superimposed bandwidth of multiple network cards.

III. DESIGN OF BPCM
The BPCM proposed in this paper is a high-speed transmission mechanism based on Intel's DPDK technology. It is designed from the underlying physical topology, cluster structure, system framework, and communication model upward, and it makes full use of the server CPU's multiple cores and multiple network ports to provide user applications with high-speed communication interfaces similar to the traditional Socket model. The detailed design of BPCM is given below.

A. PHYSICAL TOPOLOGY
BPCM is a bypass parallel communication mechanism built on the data link layer in combination with DPDK's multi-NIC technology, and it has no impact on the traditional network structure. It is realized simply by connecting the server's multiple network cards to a Layer 2 switch with ordinary network cables. Expanding the BPCM cluster does not conflict with traditional switch cascading: it is only necessary to connect the network cards of the server to be added to the Layer 2 switch. As shown in Fig. 1, node1, node2, and node3 are connected through a Layer 2 switch to realize the physical topology of BPCM without affecting the traditional network structure.

B. CLUSTER STRUCTURE
The BPCM cluster adopts a simple master-slave architecture, as shown in Fig. 2. The master node regularly collects the real-time information of each node and synchronizes it to every slave node, ensuring that each slave node holds a record of the current information of every node in the cluster. This real-time information includes the number of network cards, the real-time load of each network card, and the node's CPU utilization. High reliability and high availability of the BPCM master-slave architecture are not discussed in this paper; that work is the responsibility of other members of the team.

C. SYSTEM FRAMEWORK
The encapsulation of the communication protocol is completed by BPCM, which packages its functions on top of DPDK technology. The core layer includes modules such as multi-NIC load balancing, energy consumption control, rate adjustment, and reliable transmission, as shown in Fig. 3. The port load balancing module schedules the tasks of the network cards evenly to avoid overloading any single card. The energy consumption control module dynamically schedules CPU cores according to the task scale: when the load is heavy, it wakes other cores to transmit data; when the load is light, it puts cores to sleep and frees up CPU resources. The rate adjustment module matches the transmission rate to the receiving capability of the destination node and the sending capability of the local machine, avoiding useless transmissions, and it also ensures fairness when multiple users transmit data. The reliable transmission module implements a custom transport protocol that guarantees reliable delivery during high-speed transmission; because of BPCM's high-speed parallel communication characteristics, traditional reliable transmission protocols are not applicable and the protocol has to be redesigned.
The reliable transmission of BPCM will be introduced in detail in a follow-up paper; this paper studies only unreliable datagram transmission.

IV. IMPLEMENTATION OF BPCM
A. COMMUNICATION MODEL
BPCM aims to provide users with a simple, easy-to-use communication interface similar to the traditional Socket, keeping the low-level communication process transparent. Users only need to call the interface provided by BPCM to send and receive data, without caring about the underlying implementation. BPCM implements a full-duplex protocol: the sender and receiver can communicate with each other simultaneously during the communication process.

1) DATA TRANSFER MODEL
Similar to the traditional Socket communication process, user applications in BPCM communicate through bound ports. A user application binds a receiving port, and BPCM allocates an independent receiving buffer for it. To send, the application specifies the destination host number and receiving port and writes the data into the common sending buffer. The BPCM sending core takes the data of all users from the public sending buffer in turn and then selects a network card port for transmission according to the port load balancing policy. The BPCM receiving core receives packets from the network port, parses them, and places each into the corresponding receiving buffer according to its port number. The data transmission model of BPCM is shown in Fig. 4, where UA_1 of Node_1 and UA_1 of Node_2 communicate with each other.

2) MEMORY MODEL
BPCM is implemented on DPDK and uses many of the memory objects DPDK provides to process packets. Fig. 5 shows the BPCM memory model. When a user sends data, the data is first placed into an Mbuf object allocated from the DPDK memory pool and the header information is encapsulated; the Mbuf pointer is then added to the public sending buffer ring. The sending core of BPCM reads the Mbuf pointers from the public sending buffer ring in turn, parses the destination node number, calculates the source send port and destination receive port, fills in the Ethernet header MAC information, and adds the pointer to Send_mbuf_table; when Send_mbuf_table is full, the DPDK port sending function is called for batch transmission. The receiving core of BPCM receives packets in batches from the network port into Recv_mbuf_table, processes them sequentially, and adds each packet's Mbuf pointer to the corresponding receiving ring according to the port number in the packet. The user takes the Mbuf pointer from its private receiving buffer ring and reads the data. Passing pointers throughout the whole process reduces the number of memory copies and the processing delay.

3) MULTIPLE PHYSICAL CHANNELS
As the number of computers in the BPCM cluster and the network scale grow, a single switch can no longer meet the access requirements of servers with multiple network cards; moreover, the failure of a single switch would render the entire BPCM unusable. The BPCM cluster therefore interconnects multiple switches in place of a single switch. To achieve high availability of the BPCM cluster, a reasonable network connection topology is necessary. As shown in Fig. 6, the multiple network cards of a server can be connected to multiple switches to provide redundant physical links, so the connection remains available when a switch fails. For example, when Layer 2 switch 2 in Fig. 6 fails, node 1 and the other nodes can still communicate through the remaining switches, shown as the solid lines in Fig. 6. For connecting multiple switches, users can choose cascading, stacking, or clustering as needed.

B. PORT LOAD BALANCING
Because BPCM communicates over multiple ports in parallel, port load balancing is essential. Traditional port distribution strategies include the random method, the polling method, the balanced round-robin method, and the weighted round-robin method. Because these strategies consider only the fairness of data transmission and reception, not the actual operating environment (such as the load and performance of the receiver's ports), they are prone to port congestion and heavy packet loss. Since each node in BPCM can obtain the receiver's current information, this paper combines the port load information of sender and receiver to propose a dynamic load balancing mechanism that ensures stable data transmission.

1) PORT LOAD BALANCING MODEL
Only an efficient port load balancing algorithm can guarantee high-speed data transmission, and overly complex algorithms are not suitable for BPCM. The dynamic load balancing model of this paper is shown in Fig. 7. The sending core takes the mbuf to be sent from the public sending buffer and extracts the five-tuple Quintuple (s_node, d_node, s_pump, d_pump, ptype), where s_node is the source node number, d_node the destination node number, s_pump the source port number, d_pump the destination port number, and ptype the packet type. A hash value is computed over the five-tuple with the Toeplitz hash algorithm (as used in Microsoft RSS), and a modulo operation with the number of ports on the sending and receiving sides selects a sending network port and a receiving network port. Depending on the load of the selected ports, the adjustment function may be called to reselect a suitable port. Finally, an optimal quadruple Quad (s_port, s_mac, d_port, d_mac) is obtained; after the corresponding information is filled into the Ethernet header, the DPDK port is called to send the data.

2) PORT LOAD BALANCING ALGORITHM
In this section, the network port load information model is established from the dynamic network card parameters obtained through the port statistics functions provided by DPDK. Each symbol is described in detail in Table 1. For a general full-duplex network card, six basic dynamic parameters are available: N_ip_port_i, N_op_port_i, N_ib_port_i, N_ob_port_i, N_imp_port_i, and N_iep_port_i. By sampling them over a period T, the rate values of these parameters for the network port during the period can be calculated, as shown in (1)-(7).
Over the period T, the average size PS_ib_port_i of received packets and the average size PS_ob_port_i of sent packets processed by network port port_i can be calculated according to (8) and (9). Packets discarded because the NIC queue is full and erroneous packets should be included in the total number of packets received by the NIC. The receiving load factor of the NIC is therefore calculated as (10), where V_imb_port_i is the number of bytes discarded because the queue is full and V_ieb_port_i is the number of bytes discarded due to reception errors. Similarly, incorrectly sent packets should be included in the total number of packets sent by the network card, and the sending load factor is calculated by (14), where V_oeb_port_i is the number of bytes dropped due to transmission errors.
To prevent the load information from changing abruptly due to sampling interference and to improve its accuracy, this paper smooths the calculated network port load rate. The smoothing formulas are shown in (15) and (16), where r is the adjustment factor, usually set to 0.8.
To measure whether the current load rate of a network port is too high compared with the other ports, the load balancing degree of a network port is proposed. It reflects the difference in load rate among the ports and serves as the reference for whether to call the adjustment function. The receiving and sending load balancing degrees of a network port are defined by (19) and (20). For the selection of the sending port, when its sending load balancing degree exceeds the user-defined threshold δ, the adjustment function is called to select a more suitable sending port; likewise, for the selection of the receiving port, when its receiving load balancing degree exceeds the user-defined threshold δ, the adjustment function is called to select a more suitable receiving port. Finally, the best network port information quadruple Quad (s_port, s_mac, d_port, d_mac) is returned.

C. ENERGY CONSUMPTION CONTROL
Since DPDK polls continuously, CPU core utilization can reach 100%; when there is no data to send or receive, this seriously wastes CPU resources and causes unnecessary energy consumption. Although DPDK itself provides power management functions, including user-controlled scaling of the CPU frequency and adaptive algorithms that put CPU cores into sleep states of different depths, these are not suitable for the scenario studied in this paper: lowering the CPU frequency on a GPU server would affect other service processes, and the adaptive sleep algorithm cannot cope with sudden bursts of traffic. We therefore designed a dedicated energy consumption control module for this scenario.

1) LCORE MODEL
In the energy consumption control module, we divide the CPU cores into one master core and multiple slave cores, as shown in Fig. 8. The master core is responsible for scheduling the slave cores to work or sleep. A core's working state is one of transmit-and-receive mode, receive mode, or transmit mode, and its sleep state is one of long sleep, short sleep, or user-defined sleep. Across the whole operating state, we divide the cores into three areas: a work area, a light sleep area, and a deep sleep area, as shown in Fig. 9. Sleep times in the light sleep area are on the order of milliseconds, and sleep times in the deep sleep area are on the order of seconds. A core can only be in one area at a time. The master core always stays in the work area in transmit-and-receive mode, monitors the traffic status in real time, and schedules the slave cores. Whenever not all cores are working, at least one slave core remains in the light sleep area. When the master core changes a slave core's state, the core can only move step by step from the work area to the light sleep area and then to the deep sleep area, or from the deep sleep area to the light sleep area and then to the work area; it cannot jump directly between the work area and the deep sleep area. This limits wake-up delay and better copes with traffic bursts. A core that stays in the light sleep area beyond a certain time is automatically moved to the deep sleep area; all other transitions between areas require scheduling by the master core.

2) LCORE SCHEDULING ALGORITHM
This section builds an energy consumption control model based on the dynamic parameters of each network card obtained previously and monitors the traffic parameters of each logical core to schedule the logical cores dynamically and adaptively. Each symbol is described in detail in Table 2.
During the whole logical core scheduling process, since each logical core can be in only one of the three areas, (21) always holds. Changes in sent and received traffic play the decisive role in logical core scheduling, so this paper models both. For data transmission, a logical core takes the data to be transmitted from the public transmission buffer, whose load can be calculated by (24). When the load of the public buffer exceeds the user-set threshold β, all cores need to be called for sending.
When the byte rate V_user_b_send at which all users submit data exceeds the byte rate V_ib_lcore processed by all working cores, the byte growth rate V_used_b_psb of the public send buffer is greater than zero, indicating that more cores are needed for transmission; when V_user_b_send equals V_ib_lcore, V_used_b_psb is zero, indicating that the current working cores can just complete the sending task; when V_user_b_send is less than V_ib_lcore, V_used_b_psb is negative, indicating that the working cores have spare sending capacity.
The variable V_used_b_psb is thus instructive for estimating the sending traffic. The byte rate received by all working cores equals the byte rate successfully received by all network cards, as shown in (33). However, estimating the incoming traffic from the byte rate received by the working cores alone is inaccurate; we instead use V*_ib_port, which includes not only the bytes successfully received by the network cards but also the discarded and erroneously received bytes.
The maximum transmission and reception rate a single core can sustain is taken as the theoretical limit rate of a single network card. Therefore, the number of additional cores N_lcore_need_send needed to send the current traffic is obtained by rounding up (39), where V_max_b_port is the theoretical rate limit of a single network card. The number of cores N_lcore_required_recv required to receive the current traffic is calculated by rounding up (40), and the number of cores to be added N_lcore_need_recv follows from (41), where k is the number of cores already working.
The signs of N_lcore_need_send and N_lcore_need_recv indicate whether cores should be woken up or put to sleep for sending and for receiving. The decision to wake or hibernate cores, and how many, is then determined comprehensively from the sending and receiving demands, as in (44).
When waking up cores, if the number of cores to be woken N_lcore_need_wakeup exceeds the number l of cores in the light sleep area, all l cores in the light sleep area are woken to work, and then max{1, N_lcore_need_wakeup − l} cores in the deep sleep area are promoted to the light sleep area. As for the working mode of the woken cores: if both sending and receiving need cores, the larger of the two demands is woken and the cores run in transmit-and-receive mode; if only one of sending and receiving needs cores, the required number is woken and the cores run in the corresponding mode; if both sending and receiving call for sleeping cores, the smaller of the two numbers is put to sleep, and the sleeping cores enter the light sleep area. If the master core cannot find enough cores to wake, that is, most cores are already working, all working cores are switched to transmit-and-receive mode.

D. RATE ADJUSTMENT
When multiple user applications send data, the transmission bandwidth of each application must be limited to ensure fair bandwidth sharing. Moreover, to reduce failed transmissions, the receiving capability of the receiver must also be considered. When a user application sends data, it can dynamically adjust its real-time sending rate according to the value calculated by the rate adjustment module.

1) RATE ADJUSTMENT MODEL
In a BPCM cluster, the master node periodically synchronizes real-time information, such as the number of user applications currently running on each node and the current transmission rate of each network port, to all nodes in the cluster. Each node receives the synchronization information and passes it through a private channel to the master core for processing. The master core extracts the synchronization information from the private pipeline, parses the relevant fields, updates its local cache table, and at the same time publishes the information relevant to the rate adjustment module in the global monitoring information table for use by every user application. When a user application sends data, it obtains from the global monitoring information table the number of applications on the local and receiving nodes and the current transmission rate of each network port, and calculates its sending rate according to the rate adjustment algorithm, as shown in Fig. 10.

2) RATE ADJUSTMENT ALGORITHM
This section establishes a user application rate adjustment model based on the dynamic parameters of each network card of the sending server and receiving server. The symbols are described in detail in Table 3.
From the global monitoring information, we can obtain the maximum theoretical total bandwidth V_max_ib_port and the available total bandwidth V_available_ib_port that the network cards of the receiving node can receive, the maximum theoretical total bandwidth V_max_ob_port and the available total bandwidth V_available_ob_port that the sending node can send, the number of applications on the receiving node N_app_recv, and the number of applications on the sending node N_app_send. For the maximum sending rate supported for a sending application on the sending node, as in (51), we choose the maximum of the average sending bandwidth and the available sending bandwidth; for the maximum receiving rate supported for the receiving application on the receiving node, as in (52), we choose the maximum of the average receiving bandwidth and the available receiving bandwidth. The final sending bandwidth V_max_b_send of the sending application, as in (53), is the minimum of the maximum sending rate V*_max_b_send supported for the sending application and the maximum receiving rate V*_max_b_recv supported for the receiving application. Finally, the sending rate V_max_p_send of the current sending application is obtained according to (54). Equations (51) and (52) reflect our idea of ensuring that each user uses the bandwidth fairly while making full use of it. Equation (53) minimizes the possibility of transmission failure on the premise of making full use of the bandwidth.
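The rules of (51)–(53) can be written compactly as below; (54), which maps the final bandwidth to the current application's packet sending rate, depends on details not reproduced in this excerpt and is therefore omitted. Variable names mirror the symbols in the text, and this is a sketch of the stated rules, not the BPCM source:

```python
def final_send_bandwidth(v_max_ob, v_avail_ob, n_app_send,
                         v_max_ib, v_avail_ib, n_app_recv):
    """Eqs. (51)-(53): per-application send bandwidth (sketch)."""
    # (51) sender side: larger of the fair average share and the
    # currently available outbound bandwidth
    v_send = max(v_max_ob / n_app_send, v_avail_ob)
    # (52) receiver side: the same rule on the inbound direction
    v_recv = max(v_max_ib / n_app_recv, v_avail_ib)
    # (53) never exceed what the receiving application can absorb
    return min(v_send, v_recv)
```

For example, with four sending applications sharing a 4 Gb/s sender and two receiving applications on a 2 Gb/s receiver, the fair sender share dominates and caps the application at its average share.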

V. EVALUATION
Below, we experimentally verify the algorithms of each BPCM module. We use six servers and a 48-port Gigabit Layer 2 switch to build a BPCM cluster; the server configuration details are shown in Table 4. Each server has an 8-core Intel Xeon E3-1230 V2 @ 3.3 GHz CPU and two four-port Intel I350-T4 Gigabit network cards, for a total of 8 network ports. The BPCM system is developed on DPDK 17.11.3 and is configured with 8 GB of huge-page memory; each port has one receive queue and one send queue of size 1024, which can be adjusted.

A. TRANSMISSION PERFORMANCE EVALUATION
To verify the transmission performance of BPCM, we designed several sets of experiments. Because this paper only studies the performance of BPCM based on unreliable packet transmission, traditional UDP transmission is selected for the comparison experiments. We compare the transmission performance of UDP and BPCM with a single network card and with four network cards. We send 5 GB of data multiple times with packet sizes of 64 B, 128 B, 256 B, 512 B, and 1024 B, and record the transmission rate, packet loss rate, and transmission delay. Fig. 11 shows the statistics of UDP and BPCM data transmission between the sending server and the receiving server when using a single Gigabit network card. As can be seen from Fig. 11 (a), with a single network card, BPCM benefits from DPDK's user-mode driver, which removes the processing time of the traditional kernel protocol stack. When the packets are small, its transmission rate is significantly better than UDP's, and the reduced processing time also makes BPCM's packet loss rate significantly lower than UDP's. Fig. 11 (b) shows that the transmission delay of BPCM is lower than that of UDP.
The Linux system is not friendly to parallel transmission over multiple network cards. Although the kernel provides a bonding module for multiple network cards, it is mainly used for network load balancing and redundancy; to increase bandwidth, the switch must also support IEEE 802.3ad dynamic link aggregation. Even then, although the bonding module can increase the total bandwidth, it cannot break the bandwidth limit of a single connection (the theoretical rate of one Gigabit network card), so reaching the total bandwidth additionally requires the cooperation of multi-connection and multi-threading techniques. To compare the performance of parallel transmission over multiple network cards, we bind the four network cards of the sending server and the receiving server into mode 4 (802.3ad) through the Linux bonding module, enable the LACP function of the switch in the LAN, and establish four connections through four threads for UDP data transmission. The transmission performance of UDP with a single connection and with multiple connections is compared against BPCM, as shown in Fig. 12.
The single-connection transmission test in Fig. 12 (a) shows that UDP in bonding mode cannot break the Gigabit limit of a single connection, while a single BPCM connection can make full use of the total bandwidth of the four network cards. Due to the influence of the underlying bonding mode, the packet loss rate of UDP in bonding mode increases significantly. Compared with BPCM on a single network card, the packet loss rate of BPCM on four network cards increases slightly, but it remains significantly lower than that of UDP in bonding mode. Fig. 12 (b) shows that although UDP in bonding mode can make full use of the bandwidth of the four network cards through multi-threading and multiple connections, its transmission rate is still lower than that of BPCM and its packet loss rate increases significantly; with multiple BPCM connections, the packet loss rate changes little.
The above experiments show that, whether with a single network card or multiple network cards, BPCM is superior to traditional UDP in transmission performance. Due to the tedious configuration of the Linux bonding module and the bandwidth limit of a single connection, the utilization of multiple network cards is not high, and the multi-thread, multi-connection programming model places high demands on developers. BPCM can superimpose the bandwidth of multiple network cards on a single connection, making it easier to use.

B. PORT LOAD BALANCING PERFORMANCE EVALUATION
This section verifies the performance of the port load balancing algorithm proposed in this paper and compares it with a random algorithm. With equal and unequal numbers of network ports, 5 GB of data is sent multiple times with packet sizes of 64 B, 128 B, 256 B, 512 B, and 1024 B, and the load rate and packet loss rate of each network port of the sending server and the receiving server are recorded. Fig. 13 shows the load of each network card under the different network port selection algorithms when the sending server and the receiving server both use four Gigabit network cards. As shown in Fig. 13 (a), on the sending server the load-balanced network port selection algorithm makes the load rate of each network card more uniform than the random algorithm: the load rates of the network ports are almost identical. The receiving server in Fig. 13 (b) shows the same result. The load-balanced algorithm yields a more uniform load rate on the sending server than on the receiving server, because there is a certain delay in synchronizing the load information of the receiving server's network ports to the sending server, which causes errors when the sending server selects the receiving port. Fig. 14 shows the load of each network card under the different network port selection algorithms when the numbers of network cards of the sending server and the receiving server are unequal: the sending server uses four Gigabit network cards and the receiving server uses two. Fig. 14 (a) and Fig. 14 (b) again show that the load-balanced network port selection algorithm makes the load rate of each network card on the sending server more uniform than the random algorithm. Due to the bandwidth limit of the receiving server's two network cards, the load rate of the sending server's network ports is also limited; this is the effect of the BPCM rate adjustment module.
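The two policies being compared can be read minimally as follows. The names and the simple lowest-load rule are illustrative assumptions and may differ from the exact selection criterion defined in the port load balancing section:

```python
import random

def pick_port_balanced(port_loads):
    """Load-balanced policy: choose the network port with the
    lowest current load rate."""
    return min(range(len(port_loads)), key=lambda i: port_loads[i])

def pick_port_random(port_loads):
    """Baseline policy: choose a network port uniformly at random."""
    return random.randrange(len(port_loads))
```

The delay noted above appears in this reading as staleness of `port_loads` for the remote node, which makes the balanced choice slightly less accurate for receiving ports than for local sending ports.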
Similarly, for both equal and unequal numbers of network ports on the sending server and the receiving server, we also recorded the packet loss rate of the two network port selection algorithms with different packet sizes, as shown in Fig. 15. In both cases, the packet loss rate of the random network port selection algorithm is higher than that of the load-balanced algorithm, and the packet loss rate of the load-balanced algorithm changes little between the two cases.

C. ENERGY CONSUMPTION EVALUATION
This section experimentally verifies the effectiveness of the logical core scheduling algorithm of the BPCM energy control module. The tests repeatedly send a total of 5 GB of data in 64 B packets. While the data is being sent, the CPU utilization of the sending server and the receiving server and the working state of each logical core are recorded at every moment. The cases of equal and unequal numbers of network ports on the sending server and the receiving server are discussed separately.
In Fig. 16, the sending server and the receiving server have the same number of network ports: four Gigabit ports each. The CPU utilization of both servers is always consistent with the changing trend of the number of working cores, and the light sleep area always keeps at least one core. When the sending server starts to send data, the number of working cores jitters; this is because the user writes data into the common sending buffer, a high-speed memory operation, which affects core scheduling.
In Fig. 17, the sending server has four Gigabit Ethernet ports and the receiving server has two. The overall trend of the parameters of the sending server and the receiving server is consistent with Fig. 16. However, due to the limit of the receiving server's receiving bandwidth, the number of working cores on both servers decreases, because the BPCM rate adjustment module limits the sending rate of the sending server.

D. RATE ADJUSTMENT EVALUATION
The BPCM rate adjustment module is designed to ensure bandwidth fairness and bandwidth matching for multi-user applications. In the experiment, five user applications sequentially send 5 GB of data with a packet size of 64 B. For equal and unequal numbers of network ports on the sending server and the receiving server, the sending bandwidth of each user application at each moment is recorded and finally compared with BPCM data transmission without the rate adjustment module. Fig. 18 shows the change in user application bandwidth under the rate adjustment module when the numbers of server network ports are the same or different. With the same number of network ports, the total bandwidth is almost fully utilized and each user application receives an even share, as shown in Fig. 18 (a). With different numbers of network ports, the total bandwidth of the sending server is limited by the total bandwidth of the receiving server, but each user application still receives an even share, as shown in Fig. 18 (b).
For both the same and different numbers of server network ports, we monitor the final allocated bandwidth and the packet loss rate of each user application with and without the rate adjustment module, as shown in Fig. 19 and Fig. 20. In both cases, without the rate adjustment module the bandwidth allocation across user applications is extremely uneven and the packet loss rate increases significantly, which is more pronounced when the numbers of network ports are unequal.

E. ANALYSIS OF COMMUNICATION RELIABILITY OF BPCM
As a high-speed parallel communication system, BPCM involves not only software design but also the underlying physical equipment. To achieve reliable and durable high-speed communication, a great deal of work has gone into the design of the BPCM system, as described in detail below.

1) HARDWARE RELIABILITY
The underlying physical devices of BPCM include switches, network cards, and network cables. On the one hand, the reliability of the physical devices themselves is one of the determinants of reliable communication. For physical devices, the indicators that measure reliability include MTBF (Mean Time Between Failures), MTTR (Mean Time To Repair), and MTTF (Mean Time To Failure), where MTBF = MTTF + MTTR. For today's electronic products, under normal working conditions, the MTTF can reach 10 years. On the other hand, achieving reliability through redundant devices is another approach. As described in Section 4.1.3, BPCM uses multiple network cards, multiple switches, and multiple lines to achieve connectivity reliability on the physical links among nodes; even if a device or line fails, nodes can still communicate through other channels. High reliability at the physical level is the basis of BPCM's high reliability.
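For illustration, the MTBF/MTTF/MTTR relation combines with the standard steady-state availability formula as follows; the MTTR value is an assumed figure, not one measured for BPCM hardware:

```python
HOURS_PER_YEAR = 8760

mttf = 10 * HOURS_PER_YEAR   # 10-year MTTF cited for electronic devices
mttr = 24.0                  # assumed mean repair time: one day

mtbf = mttf + mttr           # MTBF = MTTF + MTTR
availability = mttf / mtbf   # fraction of time the device is usable
```

Even a single device is then available well over 99.9% of the time, and the redundant cards, switches, and lines described above push the effective connectivity availability higher still.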
Software-level reliability is also critical for the BPCM system. Firstly, the master-slave architecture of the BPCM cluster enables the master node to monitor the connectivity of the network cards and links of each node through the heartbeat mechanism and synchronize this information to each node in real time, so that every node can quickly shield faulty network cards and links (this part of the work is done by other team members). Secondly, the port load balancing module prevents individual network ports and links from being overloaded or failing. Finally, a reliable transmission protocol is essential. Because BPCM provides a higher transmission rate, the traditional TCP protocol is not suitable and the reliable transmission protocol must be redesigned; this work will be introduced in the next article.

VI. CONCLUSION
As the hardware platform of distributed machine learning systems, GPU clusters play a decisive role in the training speed of machine learning models. The ever-expanding scale and parameters of machine learning models place higher demands on the network bandwidth within GPU clusters, and in small-scale GPU clusters existing solutions cannot be applied because they require professional equipment support and high costs. To address this issue, this paper first proposes a multi-NIC bypass parallel communication mechanism based on DPDK, making full use of the idle CPU resources of the GPU server to accelerate data transmission. It then proposes a port load balancing algorithm and a CPU multi-core scheduling algorithm for multiple NICs and CPU cores to ensure balanced network card load and low CPU energy consumption. In addition, a rate adjustment algorithm is proposed to ensure fair use of bandwidth in multi-user scenarios. Finally, the experiments show that BPCM can make full use of idle CPU resources, exploit the characteristics of multiple network cards, and achieve a multiplied increase in communication bandwidth. Moreover, operating at layer two of the network makes BPCM simple to use and flexible to expand. The high bandwidth, high efficiency, and low cost of BPCM show that it has promising application prospects.
Future work will focus on BPCM support for 10 Gigabit network cards and the optimization of data transmission efficiency. On this basis, we will design a dedicated reliable transmission protocol and an upper-layer communication protocol for distributed model training.