Enhancing 5G/IoT Transport Security Through Content Permutation

Internet of Things (IoT) or massive machine-type communications (mMTCs) is one of the essential aspects addressed by 5G mobile telecommunications. Some 5G core networks will be deployed with software-defined networking (SDN) where security is an important issue. The overhead of offering security in 5G is high, which is typically incurred in encoding and decoding. This paper proposes the enhanced content permutation algorithm (eCPA) that can effectively implement secret permutation in a 5G/IoT transport network with the P4 SDN switches to protect the 5G packets (especially for IoT). The encoding/decoding speed, including packet routing, can be up to 6.4 Tb/s, which is the fastest in the world.


I. INTRODUCTION
5G mobile telecommunications networks are being deployed in many cities in the world. It is claimed that 5G has made significant enhancements over 4G LTE technologies in three aspects: Ultra-reliable and Low Latency Communications (URLLC), enhanced Mobile Broadband (eMBB) and massive Machine Type Communications (mMTC). To support these three aspects, both the radio system (Figure 1 (1) and (2)) and the core network (Figure 1 (3) and (4)) have been advanced. In the control plane of the core network (Figure 1 (3)), several network functions are connected through the Service Based Interface (SBI). Among these network functions, the Access and Mobility Management Function (AMF; Figure 1 (5)) controls the User Equipment (UE; Figure 1 (1)) through the N1 interface and the Radio Access Network (RAN; Figure 1 (2)) through the N2 interface. The Session Management Function (SMF; Figure 1 (6)) controls the User Plane Function (UPF; Figure 1 (4)) through the N4 interface.
The user data, in particular Internet of Things (IoT) packets for mMTC [1], are delivered between the UE and the IoT server located at the external Data Network (DN; Figure 1 (7)) through the RAN and the UPF via the N3, N9, and N6 interfaces. An example of an IoT server is IoTtalk, developed to support smart campus [2] and smart farming [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Ilsun You.
In the advanced 5G core network, Software Defined Networking (SDN) is used to support network function virtualization (NFV) and network slicing [4], [5], [20]. Unlike a traditional data network, SDN decouples the data and the control planes, where the SDN controller in the control plane is responsible for issuing instructions to the SDN switches in the data (user) plane. Therefore, as illustrated in Figure 2, SDN perfectly matches the 5G core network architecture. Specifically, the SDN controller (Figure 2 (1)) is used in the control plane of the 5G core network, and the SDN switches (Figure 2 (2)-(4)) are used in the UPF.
The user data are delivered between the UE and the DN through the path (5)-(6)-(2)-(3)-(4)-(7) in Figure 2. In [5] we have shown how to aggregate the IoT packets through the SDN network. In this paper, we extend the previous work to investigate the security aspect of the SDN-based 5G transport network.
SDN provides programmability, centralized policy management and global network state visibility to the 5G system. To maintain these additional advantages of SDN in the 5G core network, we need to properly secure the communication channels to prevent threats as well as to protect data privacy. Several studies have focused on the 5G security issues. The study in [6] surveys 5G authentication and privacy preservation. The study in [7] surveys privacy, replay, bidding down, man-in-the-middle, and attacks on control and user planes. The study in [8] provides a formal analysis of 5G authentication. The study in [9] pointed out that 5G specifications have made some unrealistic assumptions that render 5G systems vulnerable to adversarial attacks unless the optional security features are enforced. The study in [10] analyzes threats and provides solutions for 5G security; it pointed out that in LTE, IPsec is the most commonly used security protocol, and with slight modifications, IPsec tunneling can be used to secure 5G communication by integrating authentication, integrity, and encryption. In recent years, secret permutation (transposition) has been used for protecting different types of multimedia and IoT data, including speech files, digital images, videos, and sensor data [11], [12]. These existing security schemes may consume a significant amount of network resources and execution overhead. To support a low-cost permutation cipher, we propose a new permutation mechanism in the SDN switches to secure the user data delivery in the UPF of the 5G core network without incurring extra packet processing overhead. Specifically, we show how a permutation cipher can be achieved in the UPF network (Figure 2 (2)-(4)).
Suppose that a packet is delivered from the UE to the DN. In secret permutation, we encrypt (encode) the payload of the packet at SDN Switch 1 (Figure 2 (2)) and decrypt (decode) it at SDN Switch 2 (Figure 2 (4)). The process consists of two parts. A permutation cipher scheme [12] is used in the SDN controller (Figure 2 (1)) to produce a permutation cipher key. The key is sent to both SDN Switches 1 and 2. We partition the payload of the packet into n portions (codewords), and the key is used at Switch 1 to shuffle all codewords of the payload around. When the packet is received at Switch 2, the same key is used to recover the original order of the codewords. The overhead of permutation for ciphering and deciphering is high in a CPU architecture [13]. In an SDN network, the switches are designed based on a pipeline architecture, and therefore, the permutation mechanism cannot be implemented the same way as that for the CPU architecture. The OpenFlow SDN switches were originally designed for fast packet processing with limited intelligence. These OpenFlow switches cannot manipulate the header fields and therefore cannot perform permutation in the data plane. Fortunately, the user is allowed to describe how to manipulate the packet payloads in an SDN switch based on the P4 (Programming Protocol-independent Packet Processors) technology [4], [5]. P4 is a reconfigurable, multi-platform, protocol-independent and target-independent packet processing language, which is used to dynamically facilitate programmable and extensible packet processing in the SDN data plane. A P4 program describes how to parse and manipulate the packet headers through operations that may modify the header fields of packets or the content of the metadata registers in the switch.
Based on the SDN P4 architecture, this paper proposes a permutation mechanism to run on Switches 1 and 2 at line rate, assuming that the UPF (Figure 2 (2)-(4)) is deployed with the P4 switches. The paper is organized as follows. Section 2 describes the content permutation algorithm (CPA) proposed in [14]. Section 3 describes how CPA is implemented in a P4 switch and proves that our implementation is correct. Section 4 describes how to handle permutation of a large packet payload through interleaving. Section 5 illustrates the experiment environment and summarizes our findings.

II. CODEWORD PERMUTATION BY THE CONTENT PERMUTATION ALGORITHM
In this section we introduce the Content Permutation Algorithm (CPA) [14] to demonstrate the feasibility of codeword permutation on a P4 switch; i.e., to use a binary vector as a cipher key to conduct permutation on a packet payload. For $n \ge 2$, we partition the payload of a packet into $n$ codewords. Denote the payload of $n$ codewords (portions) as $s_n = \langle \pi_1, \ldots, \pi_n \rangle$ where $\pi_i$ is the $i$-th codeword of the packet payload. Let $\{(1, \ldots, 1), \ldots, (0, \ldots, 0)\}$ be the set of all binary vectors of length $n$, which are used to generate all members in $P(s_{n+1})$.
CPA permutes $s_n$ with the key $k_{n-1} = (x_1, \ldots, x_{n-1})$, and stores the result in another storage $s_n^* = \langle \pi_1^*, \ldots, \pi_n^* \rangle$. Figure 3 lists the pseudo code of Algorithm CPA($s_n$, $k_{n-1}$) that uses $k_{n-1}$ to map $s_n$ to $s_n^*$. This algorithm manipulates two pointers max and min. Initially, max $= n$ and min $= 1$. In the $i$-th iteration, the key bit $x_i$ determines which codeword is moved. If $x_i = 1$, the codeword pointed to by max is placed into $\pi_i^*$ and then max moves to the next remaining codeword of $s_n$ (one position toward the front). Otherwise, the codeword pointed to by min is placed into $\pi_i^*$ and then min moves to the next remaining codeword of $s_n$ (one position toward the end). In [14], the correctness of CPA is guaranteed by the following theorem.
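Theorem 1: After the execution of CPA on $s_n$, for $1 \le i \le n - 1$, the resulting codeword $\pi_i^*$ is determined by the key bits $x_1, \ldots, x_i$; the closed-form expression is stated in [14]. The proof of Theorem 1 is omitted, and the reader is referred to [14] for the details.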
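The pointer-based procedure described above can be sketched in Python as follows. This is a minimal illustration and not the pseudo code of Figure 3: it assumes that max steps toward the front after a codeword is taken from it, that min steps toward the end, and that the one remaining codeword fills $\pi_n^*$. The names `cpa` and `cpa_inverse` and the example key are hypothetical; `cpa_inverse` is only one plausible way to undo the shuffle with the same key, since the decoding details are not given in the text. Under this reading, $\pi_i^* = \pi_{n-\theta_i+1}$ when $x_i = 1$ and $\pi_i^* = \pi_{i-\theta_i}$ when $x_i = 0$, where $\theta_i$ is the number of ones among $x_1, \ldots, x_i$; this closed form is derived from the sketch and is not a quotation of Theorem 1.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def cpa(codewords: Sequence[T], key: Sequence[int]) -> List[T]:
    """Permute n codewords with an (n-1)-bit key (assumed reading of CPA)."""
    n = len(codewords)
    if n < 2 or len(key) != n - 1:
        raise ValueError("need n >= 2 codewords and a key of n - 1 bits")
    lo, hi = 0, n - 1              # 0-based counterparts of min and max
    out: List[T] = []
    for bit in key:
        if bit == 1:               # take the codeword pointed to by max
            out.append(codewords[hi])
            hi -= 1
        else:                      # take the codeword pointed to by min
            out.append(codewords[lo])
            lo += 1
    out.append(codewords[lo])      # lo == hi: the single codeword left
    return out


def cpa_inverse(permuted: Sequence[T], key: Sequence[int]) -> List[T]:
    """Recover the original order by replaying the same pointer walk."""
    n = len(permuted)
    if n < 2 or len(key) != n - 1:
        raise ValueError("need n >= 2 codewords and a key of n - 1 bits")
    lo, hi = 0, n - 1
    original = list(permuted)      # placeholder; every slot is overwritten
    for i, bit in enumerate(key):
        if bit == 1:
            original[hi] = permuted[i]
            hi -= 1
        else:
            original[lo] = permuted[i]
            lo += 1
    original[lo] = permuted[n - 1]
    return original


payload = list("ABCDE")            # five one-letter codewords, n = 5
key = [0, 1, 0, 1]                 # illustrative key, not taken from Figure 4
shuffled = cpa(payload, key)
print(shuffled)                               # ['A', 'E', 'B', 'D', 'C']
print(cpa_inverse(shuffled, key) == payload)  # True
```

The sketch allocates a second list for the output, which mirrors the separate storage $s_n^*$ used by CPA.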
Figure 4 illustrates how the codewords of a packet payload are manipulated by CPA with the pointers min and max, where $n = 5$. Figure 4(a) shows that min initially points to the first codeword of $s_5$ and max initially points to the 5-th codeword (Line 4 in Figure 3). Lines 5-9 are a loop that is executed 4 times in our example. In the first iteration,

III. CPA FOR P4 SWITCH
The CPA algorithm is typically implemented in a CPU architecture using a general-purpose programming language such as C, where the variables min and max are declared as pointers. Each of them points to a specific header field (codeword $\pi_i$, $1 \le i \le n$) in the payload (bit sequence $s_n$), and $2n|\pi_i|$ units of memory are used by CPA to permute the codewords, where $|\pi_i|$ is the size of $\pi_i$.
The pseudocode in Figure 3 cannot be directly implemented in a real P4 switch due to two limitations of the pipeline architecture.First, the pointer data structure is not supported to index an array.Second, the number of header fields and metadata that can interact with each other is limited.Therefore, the size of a packet payload cannot be too large.We will elaborate more on this issue in the next section.
The first limitation comes from the nature of the P4 language, which only allows the user to instantiate variables as data fields (metadata fields or header fields) rather than pointers; these fields are allocated at specific memory locations in the P4 switch when the program is compiled.
To reduce the space complexity of CPA and to avoid the usage of pointers, we propose enhanced CPA (eCPA), a space-efficient and pointer-free improvement of CPA, which only utilizes $(n+1)|\pi_i|$ memory units. A P4 switch can manipulate the header fields of a packet through the pipeline architecture. By utilizing this feature, eCPA treats the payload of a packet as the ''header'', and the codewords of the payload are treated as the header fields.
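A pointer-free permutation of this kind can be sketched in Python as below. The sketch is an assumption-laden reading, not the pseudo code of Figure 5: it keeps the $n$ codewords plus one spare field $\pi_{n+1}$ (so $n+1$ fields in total), and for each key bit $x_i = 1$ it shifts $\pi_n, \ldots, \pi_i$ one field to the right and then copies the spare field into $\pi_i$, which is how Lines 7 and 8 are described later in this section. The function name `ecpa` and the example key are hypothetical.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def ecpa(codewords: Sequence[T], key: Sequence[int]) -> List[T]:
    """In-place, index-only permutation using n + 1 fields (assumed eCPA reading)."""
    n = len(codewords)
    if n < 2 or len(key) != n - 1:
        raise ValueError("need n >= 2 codewords and a key of n - 1 bits")
    # fields[j - 1] plays the role of pi_j; fields[n] is the spare pi_{n+1}.
    fields: List[Optional[T]] = list(codewords) + [None]
    for i, bit in enumerate(key, start=1):       # iterations i = 1 .. n-1
        if bit == 1:
            # Shift pi_n .. pi_i into pi_{n+1} .. pi_{i+1}, highest index first.
            for j in range(n, i - 1, -1):
                fields[j] = fields[j - 1]
            # Move the spare field (the old pi_n) into pi_i.
            fields[i - 1] = fields[n]
        # When the bit is 0, no field changes in this iteration.
    return fields[:n]


print(ecpa(list("ABCDE"), [0, 1, 0, 1]))   # ['A', 'E', 'B', 'D', 'C']
```

Every access uses a fixed field index that depends only on the loop counter, which is the property that lets each iteration be unrolled into fixed header-field moves in a pipeline stage.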
In Figure 5, the content of $\pi_{n+1}$ at Line 8 was modified at Line 7. We prove that eCPA is correct; i.e., with the same binary cipher vector, eCPA shuffles the codewords to produce the same result as CPA does. The following theorems show that $s_n(n-1)$ is the resulting permutation of $s_n$ after eCPA is executed, which satisfies Theorem 1 (i.e., the execution of eCPA outputs the same result as CPA).

VOLUME 7, 2019
Theorem 2: Let $\theta_i = |\{l \mid x_l = 1, 1 \le l \le i\}|$. Consider the execution of eCPA. For $1 \le j \le i < n$, after the $i$-th iteration is executed, rearrangement of codewords for $s_n(i)$ satisfies the following loop invariants. For $1 \le j \le i$, For $i < j \le n$, For $j = n + 1$,
Proof: We prove by induction on $i$ that the loop invariants (1)-(5) hold after the $i$-th iteration of the loop in Lines 5-9 is executed, where $1 \le i < n$.
Similarly, for $j = n + 1$, Eq. (7) implies that and invariant (5) is satisfied. Since $i = 1$, the execution of Line 8 results in Substitute Eq. (7) into the above equation to yield and Eq. (8) satisfies invariant (2) for $j = 1$. Therefore, invariants (1)-(5) are satisfied in the Base Case.
Inductive Case: Assume that $i > 1$ and after the $i$-th iteration is executed, the codewords in $s_n(i)$ satisfy all loop invariants. Now we show that the invariants hold for the $(i+1)$-th iteration. If $x_{i+1} = 0$, no action is taken. We have $s_n(i+1) = s_n(i)$. That is, for $1 \le j \le n$, $\pi_j(i+1) = \pi_j(i)$. Since both invariants (1) and (3) hold for the $i$-th iteration, the above equation can be re-written as Since $x_{i+1} = 0$, $\theta_{i+1} = \theta_i$, the above equation is re-written as Eq. (9) indicates that both invariants (1) and (3) hold for the $(i+1)$-th iteration. For $j = n + 1$, since invariants (4) and (5) hold for the $i$-th iteration, we have Since $\theta_i = \theta_{i+1}$, Eq. (10) can be rewritten as which indicates that invariants (4) and (5) hold at the end of the $(i+1)$-th iteration.
For $x_{i+1} = 1$, we first note that the loop at Line 7 does not affect $\pi_j(i)$ for $1 \le j < i + 1$ at the $(i+1)$-th iteration. That is, Since invariant (2) holds for the $i$-th iteration, the above equation is re-written as and invariant (2) holds for the $(i+1)$-th iteration except for the case where $j = i + 1$. We will elaborate on this case in deriving Eq. (15) later.
Q.E.D.
Immediately from Theorem 2, we have the following theorem.
Theorem 3: The execution of eCPA has the same result as CPA.
Proof: From Theorem 2, after the $(n-1)$-th iteration of eCPA is executed, we have for $j < n - 1$, and Therefore, $s_n(n-1)$ satisfies Theorem 1, and the execution of eCPA has the same result as CPA.
Q.E.D.
Figure 6 illustrates how the example in Figure 4 is executed by eCPA in a real P4 switch. The P4 switch processes a packet based on the rules of a set of match-action tables. When a packet arrives at the P4 switch, it is handled by the parser, the ingress pipeline, the traffic manager (queues), the egress pipeline and the deparser. The parser extracts the header fields and the payload of the packet following the parse graph defined in the P4 program. The ingress pipeline consists of match-action tables arranged in the stages of the pipeline, which manipulate the packet headers and generate an egress specification to determine the set of ports to send out the packet. The traffic manager queues the packets before they are sent to the egress pipeline. The egress pipeline may further modify the packet header. At the deparser, the headers are assembled back into a well-formed packet.
In Figure 6, an iteration is represented by a colored rounded rectangle, and a stage of the egress pipeline is represented by a dashed rectangle. The figure shows that stage $i$ consists of two parts, where the first part overlaps with the $i$-th iteration of eCPA and the second part overlaps with the $(i+1)$-th iteration. The first stage does not have the first part, and the last stage does not have the second part. To adapt to the P4 pipeline architecture with its read-before-write field access constraint, the pipeline actions in Figure 6 are slightly different from the eCPA pseudocode in Figure 5. Specifically, when $x_i = 0$ (e.g., $i = 1$ and 3 in Figure 6), the second part of stage $i$ moves the codeword $\pi_i$ into the spare field $\pi_{n+1}$, and the first part of stage $i + 1$ moves it back from $\pi_{n+1}$ to $\pi_i$. These additional actions result in $\pi_i(i) = \pi_i(i-1)$, and therefore do not affect the execution result of eCPA. In this way, no matter what value $x_i$ is, the first-part executions of all stages are the same.
The first part of stage $i$ executes a Stateless Banzai atom to implement Line 8 of Figure 5, which moves the codeword from $\pi_{n+1}(i-1)$ to $\pi_{i-1}(i-1)$. When $x_i = 0$, the second part of a stage executes a Stateless Banzai atom to move the codeword $\pi_i(i-1)$ back to $\pi_{n+1}(i)$. On the other hand, when $x_i = 1$, the second part of a stage executes multiple Stateless Banzai atoms to implement Line 7 of Figure 5, which shifts codewords from $\pi_n(i-1), \pi_{n-1}(i-1), \ldots, \pi_i(i-1)$ to $\pi_{n+1}(i), \pi_n(i), \ldots, \pi_{i+1}(i)$. By comparing the initial and the final states in Figures 4 and 6, it is clear that the result of eCPA is the same as that for CPA.
We note that in the last iteration (Figure 6 (e)), the last two codewords are swapped if $x_{n-1} = 1$. Therefore, we can use the swap atom of P4 to perform this task in the second part of the fourth stage (see Figure 7 (e)). If $x_{n-1} = 0$, then no action is taken. With this minor modification, the last stage (Stage 5) in Figure 6 is eliminated.
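The stage-by-stage behavior described above, including the final swap, can be simulated with the following Python sketch. It rests on several assumptions that are not confirmed by the figures: each stage is modeled as a set of simultaneous field moves that all read the values held before the stage (our interpretation of the read-before-write constraint), stage $i$ first completes iteration $i-1$ by copying $\pi_{n+1}$ into $\pi_{i-1}$ and then starts iteration $i$, and the last stage replaces its second part by a swap of $\pi_{n-1}$ and $\pi_n$. The name `ecpa_pipeline` is hypothetical and no P4 or Banzai API is modeled.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def ecpa_pipeline(codewords: Sequence[T], key: Sequence[int]) -> List[T]:
    """Simulate n - 1 pipeline stages with a swap in the last stage (assumed model)."""
    n = len(codewords)
    if n < 2 or len(key) != n - 1:
        raise ValueError("need n >= 2 codewords and a key of n - 1 bits")
    # f[j - 1] plays the role of pi_j; f[n] is the spare field pi_{n+1}.
    f: List[Optional[T]] = list(codewords) + [None]
    for i in range(1, n):                 # stages i = 1 .. n-1
        old = f.copy()                    # all moves in a stage read pre-stage values
        if i > 1:
            f[i - 2] = old[n]             # first part: pi_{i-1} <- pi_{n+1}
        x = key[i - 1]
        if i == n - 1:                    # last stage: swap instead of shift
            if x == 1:
                f[n - 2], f[n - 1] = old[n - 1], old[n - 2]
        elif x == 1:                      # second part, x_i = 1: shift right
            for j in range(i, n + 1):     # pi_{j+1} <- pi_j for j = i .. n
                f[j] = old[j - 1]
        else:                             # second part, x_i = 0: park pi_i in the spare
            f[n] = old[i - 1]
    return f[:n]


print(ecpa_pipeline(list("ABCDE"), [0, 1, 0, 1]))   # ['A', 'E', 'B', 'D', 'C']
```

For $n = 5$ the loop runs four stages, matching the elimination of Stage 5 once the swap is used.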

IV. PARALLEL EXECUTION OF CPA
As mentioned in the previous section, the number of codewords in the payload of a packet is limited due to the memory architecture of the P4 switch. In our example, eCPA is implemented in Inventec D5264 series switches [15], where the data fields (metadata and headers) used in the pipeline are divided into 8 groups. At every pipeline stage, multiple arithmetic units are executed to manipulate the data fields in the same group, where each of the arithmetic units takes two data fields as the inputs and the output is stored in a data field. In our example, at most $\varepsilon_m = 16$ data fields can be
In this figure, for $1 \le n^* \le n$, $1 \le l \le \varepsilon = 12$, and $1 \le m \le 8$, the $n^*$-th codeword $\pi_{n^*}$ is mapped to $\pi_{m,l}$, the $l$-th codeword in the $m$-th group m,8 , where
With the fine-interleaved arrangement (16), we execute eCPA( m,8 , $k_{\varepsilon-1}$) for $1 \le m \le 8$ simultaneously.
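The idea of splitting a large payload into 8 groups and permuting every group with the same key can be sketched in Python as follows. The mapping used here is an assumption, since arrangement (16) is not reproduced in the text: codeword $n^*$ (1-based) is placed in group $m = ((n^* - 1) \bmod 8) + 1$ at position $l = \lfloor (n^* - 1)/8 \rfloor + 1$, i.e. a round-robin interleave. The helper `permute_group` is a compact pointer-based stand-in for the per-group permutation, and the names `interleaved_permute` and `GROUPS` are hypothetical. The groups are processed one after another here, whereas the switch is described as handling them simultaneously.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")
GROUPS = 8          # number of data-field groups mentioned in the text


def permute_group(group: Sequence[T], key: Sequence[int]) -> List[T]:
    """Stand-in per-group permutation: take from the end on 1, from the front on 0."""
    lo, hi = 0, len(group) - 1
    out: List[T] = []
    for bit in key:
        if bit == 1:
            out.append(group[hi])
            hi -= 1
        else:
            out.append(group[lo])
            lo += 1
    out.append(group[lo])
    return out


def interleaved_permute(codewords: Sequence[T], key: Sequence[int],
                        groups: int = GROUPS) -> List[T]:
    """Round-robin split into groups, permute each with the same key, re-interleave."""
    n = len(codewords)
    if n % groups != 0:
        raise ValueError("this sketch assumes n is a multiple of the group count")
    eps = n // groups                       # codewords per group
    if eps < 2 or len(key) != eps - 1:
        raise ValueError("need a key of eps - 1 bits with eps >= 2")
    out: List[T] = list(codewords)          # placeholder; every slot is overwritten
    for m in range(groups):
        # Assumed mapping: group m holds codewords m, m + groups, m + 2*groups, ...
        out[m::groups] = permute_group(codewords[m::groups], key)
    return out


payload = list(range(96))                   # 8 groups x 12 codewords
key = [1] * 11                              # illustrative key of eps - 1 = 11 bits
result = interleaved_permute(payload, key)
print(result[:8])                           # [88, 89, 90, 91, 92, 93, 94, 95]
```

With 8 groups of $\varepsilon = 12$ codewords each, the sketch covers a payload of up to 96 codewords while every group needs only a key of 11 bits.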

V. DISCUSSION AND CONCLUSION
We have implemented eCPA in Inventec D5264 series switches [15].To evaluate the performance of eCPA, the hardware of the UPF network in Figure 2 is shown in Figure 9(a), and the function block diagram is shown in Figure 9(b), which is similar to the configuration in [16].
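In our setup, the sender (Figure 9(b) (1)) is a Supermicro server (Intel(R) Xeon(R) CPU E5-2675 v3 @ 1.80GHz (16 Cores 32 Threads)) that uses the Mellanox ConnectX-5 100G NIC to connect to the first Inventec P4 switch (Figure 9(b) (2)). By using Mellanox ConnectX-5 with DPDK [17] and Pktgen [18], the Supermicro server generates 100G bits per second (bps) traffic to emulate the data streams from multiple RANs (Figure 2 (6)). The first P4 switch connects to the second P4 switch (Figure 9(b) (3)) through a 100G QSFP+ cable. The first switch performs the permutation encoding, and the second P4 switch (Figure 9(b) (3)) performs decoding at the line rate. The SDN WAN (Figure 2 (3)) is not included in our experiments. The decoding process similarly reverses the eCPA algorithm, and the details are omitted. The receiver is another Supermicro server (Figure 9(b) (4)) that emulates the IoT server in the DN (Figure 2 (7)), which receives the packets from the second P4 switch. Each of the P4 switches based on Barefoot Tofino chips handles the packets without queueing, and its maximum processing capability per output port is given by its ''line rate'', i.e., 100 Gbps per port (and there are 64 ports in the switch, which results in 6.4 Tera bps). When the packets arrive, they can be processed with eCPA at the line rate, and no packet is dropped at the switch. We have also implemented eCPA in Edgecore Wedge100BF series switches [19]. Similar results are observed.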
We implemented the permutation cipher in the ingress pipeline but did not utilize the egress pipeline for the following reason. In the P4 switch hardware architecture, the ingress and the egress pipelines are not physically independent; they share the same hardware pipeline and resources. In our case, at most 8 exact match table lookups can be triggered in a hardware pipeline stage, and when we use them for ingress, they cannot be used for egress.
Depending on the P4 switch model, one or more parallel pipelines can be ''folded'', i.e., chained together. For example, the top-of-the-line P4 chip offers 4 pipelines, each one divided into ingress and egress, for a total of 4 × 2 × 12 = 96 stages. Such a feature is not used in the current eCPA implementation. Since each group of memory only contains 16 codewords, it can be handled by 15 pipeline stages in eCPA. Chaining 4 pipelines together to access 96 pipeline stages does not help to increase the total number of data fields for the permutation cipher because the total number of available memory groups is limited to 8 (e.g., six 16-bit groups and two 32-bit groups, and the remaining 6 groups are already used for other systematic purposes). These 8 groups of memory are manipulated by 12 pipeline stages unfolded (or 15 folded pipelines) for 8 eCPAs executed in parallel. There are no extra memory groups left for the remaining 84 stages to operate on.
The eCPA implementation can perform the permutation and forwarding functions simultaneously. At the first stage of the ingress pipeline, eCPA reads the key bits from stateful memory into metadata, and uses the remaining resources (idle MAUs) to perform the forwarding function. Then, in stages 2-12, the permutation is performed.
Our study indicates that eCPA can encode and decode the IoT packet payloads at line rate.In conclusion, we have shown that the expensive computation of secret permutation can be performed at the speed of 6.4 Tera bps in a P4 switch, which is the fastest in the world.

FIGURE 3. The pseudo code of Algorithm CPA($s_n$, $k_{n-1}$).

FIGURE 4. An example of CPA execution for n = 5.

FIGURE 5. The pseudo code of Algorithm eCPA($s_n$, $k_{n-1}$).

FIGURE 6. The eCPA execution for the example in Figure 4.

FIGURE 7. The eCPA execution with the swapping atom.

FIGURE 9. The experiment setup with 2 Supermicro servers and 2 Inventec P4 switches.