Synchronization of Variable-Length Constrained Sequence Codes

We study the ability of recently developed variable-length constrained sequence codes to determine codeword boundaries in the received sequence, both upon initial receipt of the sequence and when errors in the received sequence cause synchronization to be lost. We first investigate construction of these codes based on the finite state machine description of a given constraint, and develop new construction criteria to achieve high synchronization probabilities. Given these criteria, we propose a guided partial extension algorithm to construct variable-length constrained sequence codes with high synchronization probabilities. With this algorithm we construct new codes and determine the number of codewords and coded bits that are needed to recover synchronization once synchronization is lost. We consider a large variety of constraints, including the runlength limited (RLL) constraint, the DC-free constraint, the Pearson constraint, and constraints for inter-cell interference mitigation in flash memories. Simulation results show that the codes we construct exhibit excellent synchronization properties, often resynchronizing within a few bits.


I. INTRODUCTION
Constrained sequence (CS) codes have been widely used to increase the efficiency and reliability of data storage and digital communication systems such as optical recording, magnetic recording, flash memories, DNA-based storage, cable transmission, visible light communications and wireless energy harvesting, among other applications [1]–[4]. Since Shannon's 1948 paper [5], the design of CS codes has been an active research area in which efficient CS codes that satisfy a great variety of constraints have been proposed. Although most CS codes in the literature are fixed-length codes [1]–[4], [6]–[16], recent advances show that variable-length CS codes have the potential to achieve higher code rates with simpler codebooks [17]–[26]. Since CS codes typically do not have strong error-correction capabilities, decoding of CS codes may result in error propagation. In practical systems CS codes are commonly used in conjunction with an interleaver and an error control code (ECC) such as an LDPC code [27], a polar code [28]–[31] or a product code [32]–[35] to overcome error propagation that may occur during CS decoding. Alternatively, automatic repeat request (ARQ) could be used such that when errors are detected at the output of the CS decoder, the system requests retransmission until the decoded sequence is error free.
Despite their advantages, variable-length CS codes have the drawback that, because of noise during transmission, erroneously received sequences may cause loss of codeword-boundary synchronization at the decoder. Mis-synchronization typically results in insertion or deletion errors, which may cause burst errors at the output of the CS decoder that are difficult for the ECC to correct. To enable practical implementation of variable-length CS codes, we aim to develop codes with good synchronization properties, making it feasible for ECCs to correct the errors that occur at the output of the CS decoder. Alternatively, with ARQ, it is desirable that error propagation be limited to the current received packet, to ensure that subsequent packets are unaffected.
In this paper, we consider the variable-length CS codes constructed by the approaches in [20]–[25], where the design guideline has been to achieve the highest possible code rate. We show that different strategies should be considered when we aim to develop codes that achieve both high efficiency and good synchronization properties. We consider a variety of constraints, including the runlength limited (RLL) constraint, constraints that mitigate inter-cell interference (ICI) in flash memories, the Pearson constraint, and the DC-free constraint. The contributions of this paper are summarized as follows.
• We show that states selected based on the criteria in [21]–[25] may not result in the best synchronization properties, and we develop criteria for selecting the specified state that results in superior synchronization.
• We develop an algorithm to perform partial extensions that result in a set of codewords with high code efficiency along with high synchronization probability.
• We evaluate the synchronization properties of the proposed codes, including the number of codewords and number of coded bits that the receiver typically requires in order to regain synchronization once it is lost.
• We show that our proposed codes have good synchronization properties: they regain synchronization quickly, thereby reducing the error control capabilities required of the outer ECCs.

We begin with necessary background information.

II. PRELIMINARIES
A. CS CODING THEORY
RLL and DC-free codes are two widely used classes of CS codes. RLL coded sequences have the property that the number of bits between transitions is bounded. They can be constructed by first generating a (d, k) sequence, where d and k denote the minimum and maximum number of logic zeros between consecutive logic ones, followed by differential encoding that encodes a logic one as a change in value and a logic zero as no change [1]. DC-free codes are designed so that the spectral components at low frequency are suppressed to match the characteristics of the physical channel. In the time domain, the running digital sum (RDS) of the DC-free encoded sequence is limited to N different values [1]. RDS is the ongoing summation of encoded bit weights in the sequence, where a logic one has weight +1 and a logic zero has weight −1. Other constraints include the Pearson constraint that is immune to unknown channel gain and offset [13]–[15], and constraints that mitigate inter-cell interference in flash memories [23], [36]–[39].
It is well known that a constraint can be described with a finite state machine (FSM) that contains states, edges and labels. For an FSM with S states, the directed graph underlying the constraint is described by an S × S adjacency matrix D = {d ij }, where d ij is the number of edges transitioning from state i to state j. The transition probability matrix is denoted by an S × S matrix Q = {q ij }, where q ij is the probability of transitioning from state i to state j. Based on D, the maxentropic transition probabilities and steady-state distribution can be evaluated to describe the statistical properties when the maximum amount of information is represented by the FSM [1], [40]. Based on the maxentropic transition probabilities and steady-state distribution, it is possible to obtain the maxentropic probability of each codeword, which is the occurrence probability of the codeword in the coded sequence when the maximum amount of information is conferred.
As outlined in [1], the maxentropic transition probabilities in an FSM are given by

q_{ij} = \frac{d_{ij}}{\lambda_{\max}} \cdot \frac{p_j}{p_i}, \quad 1 \le i, j \le S, \qquad (1)

where p = [p_1, . . . , p_S]^T is the eigenvector of D associated with the eigenvalue λ_max, which is the largest real root of the determinant equation

\det(D - zI) = 0, \qquad (2)

where I is an identity matrix and z represents the roots of the determinant equation [1]. The maximum amount of information that can be carried in a sequence that satisfies the constraint is the capacity of the constraint C, which is defined as [5]

C = \lim_{m \to \infty} \frac{\log_2 N(m)}{m}, \qquad (3)

where N(m) is the number of constraint-satisfying sequences of length m. Given the FSM description of the constraint, the capacity can be evaluated as C = log_2 λ_max. We denote by C_M the maximum possible code rate of a constrained coded sequence constructed with variable-length codewords with lengths from the set M. L_M = {l_1, l_2, . . . , l_{|M|}} denotes the set of word lengths in M, and C_M is determined as

C_M = \log_2 \lambda_{\max}, \qquad (4)

where λ_max is the largest real root of the characteristic equation

\sum_{i=1}^{|M|} z^{-l_i} = 1 \qquad (5)

[20]–[22]. We denote the maximum length of words in the minimal set as l_max, and |M| as the number of words in M. Note that a word length l_i could appear more than once in L_M, since there may be multiple words with the same length. The corresponding maximum possible efficiency is η_max = C_M / C.
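To make (1)–(5) concrete, the following sketch computes the capacity, the maxentropic transition probabilities, and C_M for the word lengths of M_1 from Example 1 below; the state numbering of the (d = 1, k = 3) adjacency matrix is our own assumption rather than taken from Fig. 1.

```python
import numpy as np

def capacity_and_maxentropic_Q(D):
    """Capacity C = log2(lambda_max) and maxentropic transition
    probabilities q_ij = d_ij * p_j / (lambda_max * p_i), per (1)-(2)."""
    evals, evecs = np.linalg.eig(D)
    k = int(np.argmax(evals.real))        # Perron root of the nonnegative matrix D
    lam = evals[k].real
    p = np.abs(evecs[:, k].real)          # eigenvector associated with lambda_max
    Q = (D / lam) * np.outer(1.0 / p, p)  # q_ij = d_ij * p_j / (lam * p_i)
    return np.log2(lam), Q

# (d = 1, k = 3) RLL constraint; state i tracks i-1 zeros since the last one.
D = np.array([[0., 1, 0, 0],   # state 1: may only emit 0
              [1., 0, 1, 0],   # state 2: emit 1 -> state 1, or 0 -> state 3
              [1., 0, 0, 1],
              [1., 0, 0, 0]])  # state 4: k = 3 zeros reached, must emit 1
C, Q = capacity_and_maxentropic_Q(D)
print(round(C, 4))             # ~0.5515 bits/symbol
print(Q.sum(axis=1))           # each row of Q sums to 1

# C_M for word lengths L_M = {2, 3, 4} of M_1 = {01, 001, 0001}, per (4)-(5):
roots = np.roots([1, 0, -1, -1, -1])     # z^4 = z^2 + z + 1
lam_M = max(r.real for r in roots if abs(r.imag) < 1e-9)
print(round(np.log2(lam_M), 4))          # ~0.5515: here C_M equals C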

B. CONSTRUCTION OF VARIABLE-LENGTH CS CODES
1) MINIMAL SETS AND EXTENSIONS
We now briefly review the construction of single-state capacity-approaching variable-length constrained sequence codes introduced in [20]–[24]. As discussed in [20]–[22], a critical step in construction of these codes is the formation of a minimal set from which words can be concatenated to generate constraint-satisfying codewords. A minimal set M_i can be established by enumerating all words exiting and re-entering a specific state i in the FSM. Codes constructed based on different states in the FSM have different maximum possible code rates; therefore, as outlined in [21], [22], one needs to adhere to certain criteria to select the specified state that results in the minimal set with the highest maximum possible code rate. Let M_i be a minimal set that contains K words. A partial extension of M_i, denoted M_{i,p}, is formed by concatenating all K words in M_i to any single word in M_i, generating a set of 2K − 1 words. Subsequent partial extensions can be performed by appending the K words from M_i onto any word from the previous extension. We denote word U = W + V in the updated partial extension set as the result of concatenating word V in M_i to word W in the current set, where + denotes concatenation. After performing partial extensions, we obtain a set of codewords that are instantaneously decodable, due to the fact that no codeword in the minimal set M_i is a prefix of another, a property that also holds for any of its partial extensions M_{i,p} [21]–[25].
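A single partial extension step is simple to express in code. The sketch below (the function name is ours) replaces one word of a prefix-free set with its |M| one-word extensions; extending the word 01 of M_1 from Example 1 yields the five-word set that reappears in Section V-B.

```python
def partial_extension(current, word, minimal_set):
    """Replace `word` with {word + v : v in minimal_set}; the set grows by
    |minimal_set| - 1 words and remains prefix-free."""
    assert word in current
    return (set(current) - {word}) | {word + v for v in minimal_set}

M1 = {"01", "001", "0001"}
print(sorted(partial_extension(M1, "01", M1), key=lambda w: (len(w), w)))
# ['001', '0001', '0101', '01001', '010001']
```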

2) NORMALIZED GEOMETRIC HUFFMAN (NGH) CODING
NGH coding [41]–[43] is used to assign codewords in M_{i,p} to corresponding source words such that the maximum information density is approached. Starting with the desired codeword probabilities as the input probabilities, NGH coding merges the two smallest probabilities q_i and q_j, where q_i ≥ q_j, according to the following rule to obtain the merged probability:

q' = \begin{cases} 2\sqrt{q_i q_j}, & q_j > q_i/4 \\ q_i, & q_j \le q_i/4. \end{cases}

The smaller probability is pruned from the Huffman tree when the lower condition is satisfied. As in the well-known Huffman construction technique, this process is repeated until a single value remains, and source words are assigned based on the merging pattern. Given a one-to-one correspondence between variable-length source words and variable-length codewords, and assuming independent and equiprobable logic values in the source stream, the average code rate R is

R = \frac{\sum_i 2^{-s_i} s_i}{\sum_i 2^{-s_i} o_i},

where s_i is the length of the i-th source word, which is mapped to the i-th codeword of length o_i. The efficiency of a variable-length code is defined as η = R/C. After obtaining R, NGH coding repeats the above process with updated input probabilities and with C replaced by R when calculating the maxentropic probabilities in (1)–(2), until R converges.
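The merging rule is compact enough to state directly in code. The following sketch traces the NGH merges for a list of input probabilities; the tree bookkeeping needed to actually assign source words from the merge pattern is omitted.

```python
import heapq, math

def ngh_merge_trace(probs):
    """Repeatedly merge the two smallest probabilities q_j <= q_i:
    the merged value is 2*sqrt(q_i*q_j) if q_j > q_i/4, otherwise q_i
    (the smaller probability is pruned).  Returns the merge trace."""
    heap = list(probs)
    heapq.heapify(heap)
    trace = []
    while len(heap) > 1:
        qj = heapq.heappop(heap)   # smallest probability
        qi = heapq.heappop(heap)   # second smallest, qi >= qj
        merged = 2 * math.sqrt(qi * qj) if qj > qi / 4 else qi
        trace.append((qi, qj, merged))
        heapq.heappush(heap, merged)
    return trace

# Maxentropic probabilities lam**(-l) for M_1 = {01, 001, 0001}:
lam = 1.4656
print(ngh_merge_trace([lam ** -2, lam ** -3, lam ** -4]))
```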
Since different partial extensions result in different codebooks with different η, we establish parameters such as the maximum number of source words in the codebook n max , or maximum codeword length l max , and exhaustively search over all codebooks that satisfy these limits to find the one with the highest η.
Example 1: ((d = 1, k = 3) RLL code [25]): the FSM of the (d = 1, k = 3) RLL constraint is shown in Fig. 1. According to [21], [22], we choose state 1 as the specified state upon which to construct the code; its minimal set is established as M_1 = {01, 001, 0001}. We may choose to directly perform NGH coding over the minimal set to construct the simple codebook shown in Table 1, which has efficiency η = 98.9%. By performing extensions as illustrated in Fig. 2, we construct the code shown in Table 2.

C. SYNCHRONIZATION
We wish to develop variable-length CS codes with good codeword synchronization properties, whose codebooks contain as many synchronizing codewords as possible. A synchronizing codeword guarantees that whenever it is received, the decoder correctly identifies the end of this codeword and therefore maintains or recovers synchronization, regardless of the correctness of the previously received codewords [44]. According to [44], a synchronizing codeword C = c_1 c_2 . . . c_n in the set of codewords from codebook C must satisfy the following two conditions:

Condition 1: whenever C appears as a substring of a codeword in C, it appears only as a suffix of that codeword.

Condition 2: ∀ j < n such that c_1 c_2 . . . c_j is a suffix of any codeword in C, c_{j+1} c_{j+2} . . . c_n is a valid codeword in C.
In particular, Condition 1 indicates that if a synchronizing codeword C is an internal bit string of any codeword X , the last bit of C must also be the last bit of X . This guarantees that correct identification of C results in correct recognition of the end of a codeword, hence synchronization is maintained or recovered. Condition 2 ensures that whenever mis-synchronization occurs that results in the receiver incorrectly identifying c j as the end of a codeword, resynchronization is guaranteed to occur at the end of codeword C because c n will be identified as the end of C. It is shown in [44] that some Huffman codes exhibit good synchronizing properties because their codewords re-synchronize the decoder regardless of the previous synchronization status.
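These two conditions translate directly into a mechanical test. The following sketch (our own helper, representing codewords as Python strings) checks Conditions 1 and 2 for a candidate codeword against a codebook; applied to the minimal set of Example 1, it confirms that all three words are synchronizing.

```python
def is_synchronizing(c, codebook):
    """Test the two conditions from [44] for codeword c."""
    n = len(c)
    # Condition 1: wherever c occurs inside a codeword x, it must end x.
    for x in codebook:
        for start in range(len(x) - n + 1):
            if x[start:start + n] == c and start + n != len(x):
                return False
    # Condition 2: for every proper prefix of c that is the suffix of some
    # codeword, the remaining bits of c must themselves form a codeword.
    for j in range(1, n):
        if any(x.endswith(c[:j]) for x in codebook) and c[j:] not in codebook:
            return False
    return True

M1 = {"01", "001", "0001"}                 # minimal set of Example 1
print([w for w in sorted(M1) if is_synchronizing(w, M1)])
# ['0001', '001', '01'] : every word is synchronizing
```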
We define the synchronization probability (sync probability) P of a codebook C as the sum of the occurrence probabilities of all synchronizing codewords. Synchronizing codewords appear more frequently in coded sequences constructed from codebooks with higher values of P, and therefore decoders will typically re-synchronize more quickly once synchronization is lost, resulting in fewer burst errors for the outer ECC to correct. Our goal is therefore to construct codes with high synchronization probability.
To construct these codes, we first consider the minimal set, and then consider partial extensions of this minimal set to achieve high efficiency and high sync probability. Note that, prior to NGH coding, a minimal set and its partial extensions are not codebooks because their words have not been assigned source words, and therefore it is not possible to obtain their occurrence probabilities. However, because we expect good codes to have probabilities that are close to maxentropic, in our construction approach we approximate the sync probability P of a minimal set M and its partial extensions M p as the sum of the maxentropic probabilities of each synchronizing word.
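Under this approximation, computing P for a word set reduces to summing the maxentropic word probabilities λ_max^{−l} over the synchronizing words, where λ_max = 2^C for the constraint. Reusing is_synchronizing from the sketch above (for the minimal set of Example 1, λ_max ≈ 1.4656):

```python
def sync_probability(words, lam):
    """Approximate sync probability: sum of maxentropic probabilities
    lam**(-len(w)) over the synchronizing words of the set."""
    return sum(lam ** -len(w) for w in words if is_synchronizing(w, words))

print(round(sync_probability({"01", "001", "0001"}, 1.4656), 3))  # ~1.0 (100%)
```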
Lastly, we note that existence of synchronizing codewords is a sufficient but not necessary condition for the receiver to regain synchronization, since in some situations the receiver may correctly regain synchronization on nonsynchronizing words, as we will show in our simulation results. Thus, we expect there to be a strong correlation, but not necessarily a direct relationship, between the sync probability and synchronization performance.

III. MINIMAL SET CONSTRUCTION
Three criteria are introduced in [21], [22] to select the specified state that results in the minimal set with the highest possible code rate. In situations where we wish to construct codes with good synchronization properties, criteria should be developed that result in codebooks with high sync probabilities. In this section we consider selection of the specified state from the minimal set, and illustrate our selection criteria with a variety of constraints. In the next section we consider partial extensions of this minimal set.
Observation 1: In the FSM description of a constraint, if a state has a loop associated with itself, then selection of this state as the specified state is likely to correspond to a minimal set with a small sync probability, unless the loop corresponds to a synchronizing word.
Observation 1 arises when a single bit 1 or 0 that is associated with the specified state appears as a word in the minimal set. The word 1 or 0 is likely to violate Condition 1, therefore this case should be avoided unless the codeword 0 or 1 is a synchronizing word. We demonstrate Observation 1 with the following examples.

Example 2: ((d = 2, k = ∞) RLL constraint): the FSM of the (d = 2, k = ∞) RLL constraint is shown in Fig. 3. To maximize the efficiency of the code, in accordance with the criteria in [21], [22], we would select state 1 as the specified state, and find the minimal set to be M_1 = {0, 100}; more generally, as is evident from Fig. 3, the same structure arises for any (d, k = ∞) constraint. It can be verified that the second word is a synchronizing word. However, the word 0 in M_1 is not a synchronizing word since it does not satisfy Condition 1. It can also be verified that if we select any other state as the specified state, the minimal set will have a sync probability of 100%. For example, in the FSM of the (d = 2, k = ∞) RLL constraint, M_2 = {001, 0001, 00001, 000001, . . .}, in which every word is a synchronizing word. However, this minimal set contains an unlimited number of words, and therefore will result in a less efficient code than that constructed using M_1 [21], [22]. This demonstrates that a code optimized for efficiency may not be optimized for sync probability, and vice versa.

Example 3: ((d = 0, k = 3) RLL constraint): the FSM of the (d = 0, k = 3) RLL constraint is shown in Fig. 4. To maximize the efficiency of the code, in accordance with the criteria in [21], [22], we select state 1 as the specified state, and establish the minimal set as M_1 = {1, 01, 001, 0001}. It can be observed that for the (d = 0, k = 3) RLL constraint, and more generally for any (d = 0, k) constraint, every word in M_1 is a synchronizing word, and hence the sync probability is 100%. In this case, the selection criteria in [21], [22] also result in the minimal set with the highest sync probability. Note that this is in agreement with Observation 1, since the single-bit word 1 associated with state 1 is a synchronizing word.
Observation 2: In the FSM description of a constraint, if the outgoing edges and incoming edges of a state correspond to the same bit sequence, it is likely that selection of this state as the specified state results in a minimal set with few synchronizing words. This occurs because the prefix of a codeword W will be a suffix of one or more codewords in the minimal set, and the remaining bits of W may not form a valid codeword, thus violating Condition 2.
Observation 2 arises when a word in the minimal set (that corresponds to the outgoing edges) can be divided into a suffix of another word (that corresponds to the incoming edges) plus the remaining bits, and therefore it might occur that these remaining bits are not a valid word, which violates Condition 2. We explain Observation 2 with the following example, in which we use the notation W\ζ to represent a substring of codeword W where the prefix ζ of W is excluded, and denote M \W as the set of words in the minimal set M where word W is excluded.
Example 4: (general (d, k) constraints): we consider general (d, k) constraints where d ≠ 0 and k ≠ ∞. The corresponding FSM is shown in Fig. 5. According to the criteria in [21], [22], any state from states 1 to d + 1 can be selected as the specified state. With state 1, the minimal set is M_1 = {0^d 1, 0^{d+1} 1, . . . , 0^k 1}. Since every word is a synchronizing word, the sync probability is 100%. However, different from the criteria in [21], [22], as introduced in Observation 2, state σ, 1 < σ ≤ d, should not be selected as the specified state. Consider state 2 as an example. One of the words W in its minimal set is 00 · · · 10. The bit 0 is a prefix of W and is also a suffix of other words in the minimal set, but the remaining bits W\0 do not form a valid word. Therefore, Condition 2 is not satisfied and hence W is not a synchronizing word, making the sync probability of M_2 less than 100%. A similar observation can be made for states σ, 2 ≤ σ ≤ d, where these states have both an outgoing edge and an incoming edge associated with bit zero, matching the situation described in Observation 2: W could be a synchronizing word only if W\0 were a valid word, which is not the case.
Based on the above observations and examples, we propose the following criteria to select the specified state that results in high sync probability.
Criterion 1: if a state has a loop associated with itself, this state should not be selected as the specified state unless the loop corresponds to a synchronizing word.
Criterion 2: if outgoing edges and incoming edges of a state correspond to the same bit sequence, this state should not be selected as the specified state.
Note that these two criteria are guidelines for selecting the specified state for general constraints; one should also consider the specific characteristics of each constraint when making this selection. In addition to the (d, k) RLL constraints that we have discussed, we now illustrate state selection for a variety of other constraints.

A. CONSTRAINTS THAT MITIGATE ICI IN FLASH MEMORIES
In flash memories, one of the dominant errors is due to ICI [23], [36]–[39], because variations of the electrical charge on one floating-gate transistor impact the voltages of its neighboring transistors via the parasitic capacitance-coupling effect. Many constraints have been exploited in ICI mitigation [23], [36]–[39]. CS codes that forbid the pattern 101 have been designed to limit ICI [23], [36]–[39]; the FSM describing this constraint is shown in Fig. 6. According to the criteria in [21], [22], states 1 and 3 are equally preferred as the specified state, due to the fact that they both result in the highest maximum possible code rate. However, according to Criterion 1, state 3 is inferior to state 1 in terms of sync probability, since state 3 has a self-loop that corresponds to the word 0, and the word 0 is not a synchronizing word. A closer look at M_3 = {0, 100, 1100, 11100, . . .} confirms that the word 0 does not satisfy Condition 1. In contrast, every word in M_1 = {1, 001, 0001, 00001, . . .} is a synchronizing word, including the one-bit word on the self-loop at state 1, so the sync probability of M_1 is 100%. Therefore, considering the criteria in both [21], [22] and in this paper, state 1 is preferred.

Now consider ICI mitigation in multi-level cell (MLC) flash memories, where each coded symbol can be 0, 1, 2 or 3 and the pattern 303 is forbidden by the constraint. The FSM that represents this constraint is shown in Fig. 7. Different from the FSM in Fig. 6, according to the study in [21], [22], state 3 is better than state 1 in terms of maximum possible code rate. As in the previous example, however, based on Criterion 1, M_3 contains the word 0, which is not a synchronizing word, whereas, as we demonstrate below, every word in M_1 is a synchronizing word, so the sync probability of M_1 is 100%. Therefore, we note that in general there is a tradeoff between the maximum possible code rate and the sync probability, and it is the choice of the system designer as to which to optimize when selecting the specified state and constructing the corresponding minimal set.
To show that the sync probability of M 1 is 100%, consider first the quaternary bit 3 in M 1 . It can be observed that 3 does not violate the synchronization conditions and hence is a synchronizing word. Consider the set M 1 \3. All words in M 1 end with a quaternary bit 3, and no word in M 1 \3 starts with a quaternary bit 3, therefore Condition 2 is satisfied for all words in M 1 \3. Furthermore, since the quaternary bit 3 only appears at the end of each word, Condition 1 is satisfied for all words in M 1 \3. Thus, M 1 has a sync probability of 100%.

B. THE PEARSON CONSTRAINT
The Pearson constraint that is immune to unknown channel gain and offset can be regarded as a type of T -constrained code where each of the T pre-defined symbols appears at least once in every codeword [45]. As discussed in [14], [18], [24], a known construction for q-ary Pearson codes is to ensure that every q-ary codeword has at least one symbol ''0'' and one symbol ''1''.
The FSM of the binary Pearson constraint is shown in Fig. 8 [24]. According to the criteria in [21], [22], we select state N as the specified state to achieve the highest maximum possible code rate. However, it can be observed that state N does not satisfy Criterion 2, since both an outgoing edge and an incoming edge correspond to bit 0 (and similarly for bit 1). Therefore, state N results in a sync probability lower than 100%. The minimal set is M_N = {01, 10, 001, 110, 0001, 1110, . . .}. It can be verified that every word in M_N other than the words 01 and 10 is a synchronizing word, and the sync probability is 50%.
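This 50% figure can be checked numerically with the sketches from Section II, taking λ_max = 2 (a value consistent with the 50% and 62.5% figures quoted here). Truncating M_N at growing maximum word lengths (the truncation is our own device) drives the approximate sync probability toward 50%:

```python
# M_N truncated at maximum length L: words 0...01 and 1...10 of lengths 2..L.
for L in (4, 8, 16):
    words = {"0" * (l - 1) + "1" for l in range(2, L + 1)} | \
            {"1" * (l - 1) + "0" for l in range(2, L + 1)}
    print(L, sync_probability(words, 2.0))   # 0.375, 0.4922, ... -> 0.5
```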
States N − 1 and N + 1 are inferior to state N in terms of the maximum possible code rate [21], [22], [24]; however, it can be verified that only the words 01 and 101 in M_{N−1} are not synchronizing words, and the sync probability of M_{N−1} (and M_{N+1}) is 62.5%. Therefore, it is again observed that a tradeoff exists between the maximum possible code rate and the sync probability.

C. DC-FREE CONSTRAINTS
We consider DC-free constraints with N different RDS values, as depicted in the FSM shown in Fig. 9. According to [21], [22], state N/2 should be selected as the specified state since its minimal set has the highest possible code rate. We first consider the DC-free constraint with N = 5. We have M_3 = {10, 01, 1100, 0011, 110100, 001011} with l_max = 6. Unfortunately, it can be verified that M_3 does not contain a synchronizing word, thus M_3 has a sync probability of 0%. Now we consider the general case where N > 5. The detailed construction algorithm for minimal sets can be found in [22]. To illustrate, in Table 3 we list a minimal set for N = 7; this is Table 8 in [22]. Note that since state N/2 has the sequences 11 and 00 associated with both its outgoing and incoming edges, it does not satisfy Criterion 2 and thus may not be preferred in terms of sync probability. In fact, we can prove that for N ≥ 5, the minimal set M_{N/2} has sync probability 0%. The proof is as follows. It is clear that the words 01 and 10 are not synchronizing words, since they violate Condition 1. Furthermore, we observe that each of the remaining words W in M_{N/2} ends with 00 or 11. Therefore, W is a synchronizing word only if W\00 is a valid word when 00 is the prefix of W, and similarly only if W\11 is a valid word when 11 is the prefix of W. However, W\00 cannot be a valid word, since a valid word in M_{N/2} must have an equal number of zeros and ones, while W\00 has two more ones than zeros. A similar argument can be made for W\11. Hence none of the words in M_{N/2} is a synchronizing word, and therefore the sync probability is 0%.
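This claim is easy to verify for the N = 5 case with the is_synchronizing checker from Section II-C:

```python
M3 = {"10", "01", "1100", "0011", "110100", "001011"}   # N = 5, l_max = 6
print([w for w in M3 if is_synchronizing(w, M3)])        # [] : sync probability 0%
```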
However, we note that state 1 does not violate Criterion 2, and the minimal set associated with state 1 has a nonzero sync probability. For N = 5 and l_max = 6, M_1 = {10, 1100, 110100, 111000}, where it can be verified that 110100 and 111000 are synchronizing words, resulting in a sync probability of 7.4%, which is equal to what is possible with M_5 and higher than what can be achieved with the other states.
We also note that with additional prior knowledge at the decoder, performance can be improved in terms of sync probability. For example, assume the decoder exploits its knowledge that all DC-free codewords are of even length. With N = 5, the sync probability of M 1 is improved since one more word, 1100, becomes a synchronizing word. In Section V we present more results assuming that the decoder has additional knowledge related to the constraint.
In this section we investigated the sync properties of a variety of constraints, and demonstrated that states that satisfy Criteria 1 and 2 can result in minimal sets with high sync probabilities. In the next section, we show that variable-length CS codes constructed via partial extensions of minimal sets can achieve both high efficiency and high sync probability.

IV. PARTIAL EXTENSIONS
Using words in the minimal set as the set of codewords may not result in capacity-approaching codes, since higher efficiency is often achieved with larger codebooks. Therefore, we perform partial extensions over the minimal sets to generate larger codebooks. However, different from [20]–[26], where partial extensions are exhaustively performed and the one with the highest efficiency is selected, in this section we introduce an algorithm that efficiently guides the partial extension process such that the resulting codebook has a high sync probability. Note that performing partial extensions without care can reduce the sync probability, as we show in the following example.
Example 5: ((d = 1, k = 3) RLL constraint): as discussed in the previous section, we select state 1 in Fig. 5 as the specified state, with minimal set M_1 = {01, 001, 0001}, in which every word is a synchronizing word. However, a careless extension can reduce the sync probability: for instance, extending the word 0001 causes the words 01 and 001 to appear internally in the extended words, violating Condition 1. This example motivates us to design a guided partial extension algorithm that aims to simultaneously achieve both high code efficiency and high sync probability.

A. EXTENDING SYNCHRONIZING VERSUS NONSYNCHRONIZING WORDS
We first consider whether to extend synchronizing words or nonsynchronizing words when our aim is to keep the sync probability high. We consider the following proposition and observation.
Proposition 1: Extending a synchronizing word lowers sync probability.
Proof: Consider a synchronizing word W in M. If we extend W, the |M| resulting words from the extension of W cannot all be synchronizing words, since one of the resulting words is W′ = W + W. W becomes a suffix of W′ after this extension, and W′\W = W is no longer a valid word, because W has been removed from the set by the extension. Therefore, W′ violates Condition 2 and is not a synchronizing word, which lowers the sum of the maxentropic probabilities of all synchronizing words.
Observation 3: Extending a nonsynchronizing word may increase the sync probability.
It is often the case that a word W′ constructed through an extension of a nonsynchronizing word W is a synchronizing word, since W′ = W + V consists of W concatenated with a valid word V. Because V, which remains a valid word, becomes the suffix of the extended word W′, W′ is likely to satisfy Condition 2.
For example, consider the ICI constraint in Fig. 7: the word 0 in M_3 is not a synchronizing word, while the words formed by extending it end with a valid word and are likely to be synchronizing.

Based on Proposition 1 and Observation 3, we propose extending nonsynchronizing words whenever possible. Only when the minimal set does not contain a nonsynchronizing word do we recommend extending synchronizing words. Selection of the synchronizing words to be extended is discussed below.

B. THE GUIDED PARTIAL EXTENSION ALGORITHM
We start from the minimal set M, i.e., M_p = M. In each partial extension we first obtain the set of suffixes S, where each S ∈ S is a suffix of a synchronizing word W in M_p, i.e., W = ζ + S, where ζ denotes any sequence that is a suffix of a word in M_p. We then search for a nonsynchronizing word N with N ∉ S as the target word that we would like to extend. The reason for this is as follows. Suppose a synchronizing word W can be represented as W = ζ + N, where N ∈ M_p ∩ S. In this case, extending N would make W nonsynchronizing, since N would no longer be a valid word, which would reduce the sync probability. Therefore, we extend a nonsynchronizing word N ∉ S. At the same time, N should not have a synchronizing word as its suffix, i.e., N = ξ + W (where ξ denotes any nonempty sequence) should not be extended, since extension of N would result in W appearing internally in the extended words, violating Condition 1 and becoming a nonsynchronizing word. If we cannot find such a word N, then any partial extension will result in a synchronizing word becoming nonsynchronizing. In this case we perform the extension for each of the nonsynchronizing words and choose the one that results in the highest synchronization probability. If there are no nonsynchronizing words in M_p, we must extend a synchronizing word. We note that extending a longer synchronizing word is more likely to reduce the sync probability than extending a shorter synchronizing word, since a longer synchronizing word W may correspond to a greater number of valid words W′ with W = ξ + W′, and extension of W results in these W′ violating Condition 1. Returning to Example 5, it is straightforward to confirm that extension of the word 0001 is not preferred, since it results in the words 01 and 001 violating Condition 1. On the other hand, extension of the word 001 only excludes the word 01 from being a synchronizing word, and extension of the word 01 does not result in any word violating Condition 1. Therefore, we choose to extend the shortest synchronizing word, and we propose Proposition 2 based on the following lemma.
Lemma 1: Under the condition that M p does not contain nonsynchronizing words, the shortest synchronizing word cannot be represented as a suffix plus a valid word.
Proof: The proof is straightforward and is omitted.

Proposition 2: Under the condition that M_p does not contain nonsynchronizing words, extending the shortest synchronizing word W reduces the sync probability by λ_max^{−2l_W}, where l_W is the length of word W.
Proof: Since W is a synchronizing word, it satisfies Condition 1. Therefore, words resulting from the extension of W also satisfy Condition 1. From Lemma 1 we know that these words also satisfy Condition 2, except for the extended word W′ = W + W. It can easily be checked that all words in M_p\W remain synchronizing words, because they satisfy both Conditions 1 and 2. The reduction in sync probability as a result of extending W is therefore the maxentropic probability of the word W′, which is λ_max^{−2l_W}.

Based on the above discussion, the proposed guided partial extension algorithm is shown in Algorithm 1. The algorithm is initialized with M_p = M and J = 0, where J is the recursion depth, and is recursively called until J exceeds the pre-established limits. At each recursion depth the algorithm outputs the codebook with the highest synchronization probability at that depth.
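The listing of Algorithm 1 is not reproduced in this excerpt; the sketch below is our own condensed rendering of one recursion step of the selection rules just described, reusing partial_extension, is_synchronizing and sync_probability from the earlier sketches (the fallback ordering and helper names are our reading of the rules, not the paper's listing).

```python
def guided_extension_step(Mp, M, lam):
    """One guided partial-extension step (sketch; Algorithm 1 also manages
    the recursion depth J and codebook-size limits, omitted here)."""
    sync = {w for w in Mp if is_synchronizing(w, Mp)}
    nonsync = set(Mp) - sync
    # Suffixes S that synchronizing words rely on for Condition 2
    # (W = zeta + S with zeta a suffix of some word in Mp).
    relied = set()
    for w in sync:
        for j in range(1, len(w)):
            if any(x.endswith(w[:j]) for x in Mp):
                relied.add(w[j:])
    def extendable(n):
        # n must not be relied upon, and must not end with a sync word.
        return n not in relied and not any(n.endswith(s) for s in sync)
    candidates = ([n for n in nonsync if extendable(n)]
                  or list(nonsync)
                  or [min(sync, key=len)])   # last resort: shortest sync word
    return max((partial_extension(Mp, c, M) for c in candidates),
               key=lambda s: sync_probability(s, lam))

# First step for Example 1: no nonsynchronizing words exist, so the shortest
# synchronizing word 01 is extended, reproducing the J = 1 set of Section V-B.
M1 = {"01", "001", "0001"}
print(sorted(guided_extension_step(M1, M1, 1.4656), key=len))
```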

V. NUMERICAL RESULTS
In this section, we present results regarding the efficiency and sync probability of codes constructed with the procedures outlined above. We evaluate, for a binary symmetric channel (BSC), the sync properties in terms of the average number of bits and the average number of codewords that the decoder requires to regain synchronization once synchronization is lost. We consider the BSC because it is a general channel model that, with appropriate extension, can represent a wide range of scenarios in which coded bits are corrupted in digital transmission due to factors such as additive noise, fading and interference. The decoding algorithm that we consider is the conventional bit-by-bit decoding algorithm described in the Appendix of [26], which is reproduced here as Algorithm 2.
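The listing of Algorithm 2 is likewise not reproduced in this excerpt. The sketch below is a generic bit-by-bit decoder of the kind described, with the resynchronization behavior (sliding the buffer forward one bit when it can no longer be the prefix of any codeword) our own assumption rather than the paper's exact procedure.

```python
def bit_by_bit_decode(bits, codebook):
    """Accumulate bits until the buffer matches a codeword; on a dead end
    (buffer is not a prefix of any codeword), drop the leading buffered
    bit, which models the decoder's search for the next codeword boundary."""
    out, buf = [], ""
    for b in bits:
        buf += b
        while buf and not any(c.startswith(buf) for c in codebook):
            buf = buf[1:]
        if buf in codebook:
            out.append(buf)
            buf = ""
    return out

codebook = {"01", "001", "0001"}
print(bit_by_bit_decode("010010001", codebook))  # ['01', '001', '0001']
```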

A. UPPER BOUNDS OF THE AVERAGE NUMBER OF CODEWORDS AND BITS BEFORE RESYNCHRONIZATION
We first derive upper bounds on the number of codewords and the number of coded bits that are required for the decoder to regain synchronization once synchronization is lost. We denote the upper bounds on the number of codewords N_c and the number of coded bits N_b as N̄_c and N̄_b, respectively.
To evaluate N̄_c, under the condition that there are no errors in the received symbols during synchronization, we consider the case when synchronization occurs only as a result of the occurrence of a synchronizing codeword. We note that, after loss of synchronization, if the next received codeword is a synchronizing codeword (which occurs with probability P), then the receiver will regain synchronization. However, if the next codeword is not a synchronizing codeword (with probability 1 − P) but the subsequent word is a synchronizing codeword, then synchronization will occur after two words with probability (1 − P)P. Continuing, we have that

\bar{N}_c = \sum_{k=1}^{\infty} k (1-P)^{k-1} P = \frac{1}{P}. \qquad (7)

Now consider the case when errors on the binary symmetric channel occur with probability p_c. The probability that a codeword is received without error is, on average, (1 − p_c)^{ō}, where ō denotes the average length of codewords, i.e., ō = \sum_i 2^{-s_i} o_i. Therefore, we have

\bar{N}_c = \frac{1}{P (1-p_c)^{\bar{o}}}. \qquad (8)

In a similar fashion, N̄_b is derived as

\bar{N}_b = \frac{\bar{o}}{P (1-p_c)^{\bar{o}}} + \bar{o} - 1, \qquad (9)

where ō − 1 is, on average, the maximum number of bits of the currently received codeword that caused the mis-synchronization.
As noted above, in the derivation of (7)–(9), we assume that resynchronization occurs only on synchronizing codewords. However, it can be observed in Algorithm 2 that it is possible for resynchronization to also occur on nonsynchronizing codewords. Therefore, N̄_c and N̄_b are indeed upper bounds on the average number of codewords and the average number of bits that the receiver requires to regain synchronization, since the actual number of codewords the receiver requires to resynchronize can be smaller. As we will show in the simulation results, good synchronization properties can be observed even with a small value of P and correspondingly large values of N̄_c and N̄_b.
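For concreteness, a small helper evaluating (8) and (9) from a codebook's source/codeword length pairs (s_i, o_i); the closed forms follow our reconstruction of (7)–(9) above, and bound, rather than predict, the simulated behavior.

```python
def resync_upper_bounds(P, pc, length_pairs):
    """Evaluate (8)-(9).  length_pairs = [(s_i, o_i), ...]; source words
    are equiprobable, so word i occurs with probability 2**-s_i."""
    o_bar = sum(2 ** -s * o for s, o in length_pairs)     # average codeword length
    Nc = 1.0 / (P * (1 - pc) ** o_bar)                    # upper bound (8)
    Nb = o_bar * Nc + (o_bar - 1)                         # upper bound (9)
    return Nc, Nb
```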

B. SIMULATION RESULTS
We now consider simulation results for synchronization with several classes of constrained sequence codes.

1) RLL CONSTRAINTS
Consider, for example, the (d = 1, k = 3) RLL constraint. In accordance with the discussion in Section III, we select state 1 as the specified state, hence M_1 = {01, 001, 0001}. We perform the guided partial extension algorithm over M_1 with J recursions according to Algorithm 1, J = 0 → 9. The code efficiency and sync probability of the resulting codes are shown in Fig. 10, where J = 0 on the horizontal axis represents the code constructed with codewords from the minimal set. From this figure we can see that after several iterations of extension, we can construct codes with code efficiency near 99% and sync probability near 100%. The decrease at J = 1 is due to the fact that M_1 contains no nonsynchronizing words, so we must extend a synchronizing word in the first extension, resulting in M_{1,p} = {001, 0001, 0101, 01001, 010001}, where 0101 is not a synchronizing word since it does not satisfy Condition 2.
For comparison purposes, consider the codebook that corresponds to J = 4, shown in Table 4. This code achieves η = 98.90% and P = 96.88%. Note that this codebook has the same number of codewords and an efficiency very close to the code given in Table 2; however, the sync probability of the code in Table 2 is only 21.88%, much lower than that of the code in Table 4. This demonstrates that the guided partial extension algorithm can effectively generate codebooks with high sync probabilities.

We now consider the sync properties of the (d = 1, k = 3) RLL codes we constructed. The coded sequence is transmitted over a BSC with crossover probability 0.1. A source sequence of 50000 bits is randomly generated and encoded into a constrained sequence, with the number of codewords ranging from ∼9000 to ∼18000 for J = 0 → 9, according to the codebook. Once synchronization is lost due to errors that occur during transmission, we use Algorithm 2 to obtain the number of bits and the number of codewords received before the receiver regains synchronization. Considering all occurrences where synchronization is lost, we report the average number of bits and the average number of codewords that the receiver receives before it resynchronizes; the results are shown in Figs. 11 and 12. It can be seen that the receiver generally requires less than one codeword to regain synchronization, demonstrating that these codes have good synchronization properties in the sense that once synchronization is lost, they recover it quickly. N_c first increases from J = 0 → 1 and then decreases from J = 1 → 9, which is consistent with the sync probability shown in Fig. 10. Fig. 12 shows that N_b is around 8 for J = 1 → 9. This is because codebooks with larger J have longer codewords, so N_b does not decrease as dramatically as N_c. Note that with smaller crossover probabilities the receiver would need fewer codewords and coded bits to recover synchronization once synchronization is lost, and vice versa.
In Fig. 13 we show the ratio of the number of events in which synchronization is achieved on synchronizing codewords to the total number of synchronization events, for J = 0 → 9. As demonstrated in Fig. 13, these ratios are less than 100%, indicating that some synchronization events occur on nonsynchronizing codewords. This demonstrates that Algorithm 2 permits synchronization to occur on nonsynchronizing codewords, as we mentioned above. This explains why in Figs. 11 and 12 the actual average number of codewords and average number of bits the receiver requires for resynchronization are lower than N̄_c and N̄_b, since N̄_c and N̄_b assume that synchronization occurs only on synchronizing codewords.
Finally, as is clear from these figures, for the (d = 1, k = 3) RLL constraint there is no significant advantage to using a codebook other than the minimal set because it satisfies Criteria 1 and 2, and hence has excellent sync properties, while also having high efficiency. In contrast, in the next subsection we examine situations in which J = 0 is not the best choice when we compare with other codebooks constructed using our guided partial extension algorithm.

2) FLASH MEMORIES
We consider constraints that mitigate ICI in flash memories, including single-level cell (SLC) and multi-level cell (MLC) flash memories. For SLC flash memories, the constraint was shown previously in Fig. 6; the discussion in Section III-A reveals that it is sufficient to use M_1 as the minimal set. Therefore, we construct our codebooks based on state 1. The code efficiency and sync probability are shown in Fig. 14, where it can be seen that, similar to the situation with the (d = 1, k = 3) RLL constraint, the sync probability decreases as J = 0 → 1, since the synchronizing word 1 is extended, resulting in the nonsynchronizing word 11. For J = 1 → 9, the sync probability increases up to 99.6%. The sync performance is shown in Figs. 15 and 16, where it is evident that on average less than one codeword and fewer than 8 bits are required for the receiver to regain synchronization.

In situations where code rate is of high priority and the specified state is selected according to the criteria in [21], [22], the guided partial extension algorithm will likely help improve the sync probability. For example, we consider MLC flash memories and the FSM of the constraint that forbids the pattern 303, as shown in Fig. 7. The capacity of this constraint is 1.978 bit/symbol. According to our criteria, state 1 has the best sync probability, but a lower code rate than states selected according to the criteria described in [21]–[25]. We instead consider selecting state 3 as the specified state, which has the best maximum possible code rate but worse sync probability compared to state 1. Note that state 3 does not satisfy Criterion 1, because the word 0 is not a synchronizing word in M_3. If we directly perform NGH coding over M_3, the resulting codebook, shown in Table 5, achieves 99.6% of the capacity and has a sync probability of P = 75%, since codeword 0, which occurs with 25% probability, is not a synchronizing codeword. We now show that the sync probability increases with the proposed guided partial extension algorithm.

Fig. 17 shows the code efficiency and sync probability of the codebooks we have constructed for J = 0 → 9. It can be seen that, along with small increases in efficiency, the sync probability increases from 75% to 99.99%, which demonstrates the effectiveness of the proposed algorithm. Figs. 18 and 19 show the average number of codewords and the average number of bits that the decoder requires to regain synchronization. It can be seen that on average fewer than 0.4 codewords and fewer than 5 bits are needed to regain synchronization, which illustrates that the constructed codebooks have good synchronization properties. N_c is smaller than 1 because, even when errors exist in the received bit sequence, Algorithm 2 usually correctly identifies the end of the current codeword.

3) THE PEARSON CONSTRAINT
With the Pearson constraint, we present results for codes constructed using state N in Fig. 8 as the specified state. Fig. 20 shows the code efficiency and sync probability of the constructed codebooks with J = 0 → 9. It can be seen that the sync probability increases from 50% to 61.5%. Figs. 21 and 22 show the average number of codewords and the average number of bits that the decoder needs to receive to regain synchronization. It can be seen that, even though the sync probabilities are relatively low, on average approximately one codeword and fewer than 7 bits are needed to regain synchronization.

4) THE DC-FREE CONSTRAINT
We present results for the DC-free constraint with N = 5, which has been adopted in visible light communication systems for flicker reduction and dimming control [3], [46], [47], and consider the tradeoff between code efficiency and sync probability. According to the discussion in Section III-C, state 1 is better than state 2, which is better than state 3, in terms of sync probability. However, according to the study in [21]–[25], the opposite is true when attempting to maximize the code rate.
In Fig. 23 we present the code efficiency and sync probability for codes constructed with states 1, 2 and 3 as the specified state, with J = 0 → 9. It can be seen that the above-mentioned conclusion is verified, and the expected tradeoff between code efficiency and sync probability that arises from selecting different states as the specified state is clearly seen. Figs. 24 and 25 show the average number of codewords and the average number of bits before the decoder regains synchronization. It can be seen that on average between 2 and 7 codewords and between 10 and 35 bits are needed to regain synchronization. It is also demonstrated in Figs. 23–25 that state 3 is the best in terms of code efficiency while state 1 is the worst, but the opposite is observed in terms of synchronization properties; we therefore once again observe the tradeoff between code efficiency and synchronization properties. Note that N̄_c and N̄_b for states 2 and 3 are infinite since P = 0. However, because it is possible for the decoder to regain synchronization on nonsynchronizing codewords, these codes demonstrate relatively good synchronization properties even though P = 0.

To improve the synchronization properties, we use the prior knowledge that the codewords are of even length, and process two bits per decoding attempt. In this case, the sync probabilities improve as shown in Fig. 26, and the average number of codewords and binary coded bits needed to regain synchronization is reduced, as shown in Figs. 27 and 28. It can be seen that on average approximately one codeword and fewer than 7 bits are needed to regain synchronization.
Note that in this case the results do not indicate that state 1 outperforms state 2, which outperforms state 3, in synchronization performance. The reason is illustrated in Fig. 29 and is explained as follows. Codeword 2 in the minimal set with state 1 as the specified state is not a synchronizing word, since it does not satisfy Condition 1. However, reception of codeword 2 results in mis-synchronization only when the quaternary symbol before 2 (which can be 2 or 0) is incorrectly detected as 3 and the quaternary symbol after 2 (which can be 2 or 3) is incorrectly detected as 0. The probability of this case is low, hence codeword 2 can be regarded as an ''almost synchronizing codeword''. It follows that the sync probability of the minimal set of state 1 can be regarded as ''almost 100%''. Similar reasoning holds for states 2 and 3, making their sync probabilities ''almost 100%'', and the number of codewords needed to regain synchronization is similar for all three states.
We also note that, with prior knowledge that all codewords have even length, if the receiver misses a single bit in the detection process, the above-mentioned decoding process with two bits per decoding attempt will never re-synchronize. Therefore, we propose starting the two-bit grouping at both odd and even positions and performing decoding with both alternatives. In situations where the receiver misses a bit or mistakenly clocks in an extra bit, the decoding attempt that starts at odd positions will re-synchronize; otherwise, the decoding attempt that starts at even positions will re-synchronize the received sequence.

VI. CONCLUSION
In this paper, we have discussed initially establishing and re-establishing codeword boundaries in the decoding of variable-length constrained sequence codes. We studied criteria for selection of the specified state in the minimal set that achieves a high synchronization probability, and showed that these criteria can lead to selection of different specified states compared to the criteria developed in [21], [22] that aimed to maximize code rate. We then proposed the guided partial extension algorithm to increase code efficiency while maintaining high sync probability. Finally, we presented simulation results that demonstrate the code efficiency and sync probability of the codes we constructed; these results included the average number of received codewords and the average number of coded bits that the decoder requires to regain synchronization should synchronization be lost. We demonstrated that it is possible to construct highly efficient variable-length CS codes that exhibit good synchronization properties such that very few codewords and very few bits are required to regain synchronization.