SECTION I

## INTRODUCTION

Cryptographic hash functions provide a basic data authentication mechanism and are routinely used as building blocks in other cryptographic constructions. For a given input $m$, a cryptographic hash function $H$ outputs a digest $H(m)$ of some small fixed length. For most tasks, it is required that finding distinct inputs with the same digest—a collision—be difficult. However, recent research has demonstrated that widely used hash functions, including SHA-1 and MD5, are vulnerable to collision attacks [29], [36], [37]. In response to these concerns, the U.S. National Institute of Standards and Technology (NIST) started in November 2007 a public competition to develop new cryptographic hash functions to augment a set of standard functions that includes the SHA-1 and SHA-2 algorithms. This competition, commonly known as the SHA-3 competition, motivated a growing interest in developing cryptographic hash functions and in rigorously scrutinizing their security.

Verified security [8], [10] is an emerging approach to security proofs of cryptographic systems. It adheres to the same principles as provable security, but revisits its realization from a formal verification perspective. When taking a verified security approach, proofs are mechanically verified and built with the aid of state-of-the-art verification tools, such as SMT solvers, automated theorem provers and interactive proof assistants. EasyCrypt [8] is an automated framework that aims to make verified security accessible to cryptographers with a limited background in formal methods; it has been successfully applied to verify exact security bounds of several digital signature and encryption schemes.

In this paper, we report on an extension of EasyCrypt and its application to build and verify exact security proofs of the Merkle-Damgård construction [23], [31], which underlies the design of many cryptographic hash functions. In its simplest formulation, Merkle-Damgård iterates a compression function $f : \{0,1\}^{k} \times \{0,1\}^{n} \rightarrow\{0,1 \}^{n}$ over the blocks of an input message padded to a block boundary. For a fixed public initialization vector IV, the digest of a padded message with blocks $x_{1} \ \Vert \cdots \Vert \ x_{\ell}$ is computed as $$f(x_{\ell}, f(x_{\ell-1}, \ldots f(x_{1}, {\rm IV})\ldots))$$

One way of arguing that iterated constructions like Merkle-Damgård are secure is to show that they preserve security properties of the underlying compression function. The seminal works of Merkle [31] and Damgård [23] show that if messages are padded in some specific way, finding two colliding messages for the above iterated construction is at least as hard as finding two colliding inputs for the compression function $f$; said otherwise, that the construction preserves the collision resistance of the compression function. We present a proof of a generalization of this result in EasyCrypt. Our proof applies when the padding function is suffix-free, i.e. the padding of a message $m$ is not a suffix of the padding of any other message $m^{\prime}$.

An alternative method for proving the security of a hash function is to show that it behaves as a random oracle when the compression function, or some other lower-level building block, is assumed to be ideal. The indifferentiability framework of Maurer et al. [30] provides a rigorous simulation-based definition that captures this intuition and implies a strong composability result. Glossing over technical subtleties [33], a hash function $H$ indifferentiable from a random oracle can be plugged into a cryptosystem proven secure in the random oracle model for $H$ without compromising the security of the cryptosystem. We present a proof in EasyCrypt of the indifferentiability of the Merkle-Damgård construction from a random oracle. Our proof, which follows the proof of Coron et al. [22], applies when the padding function is prefix-free, i.e. the padding of a message $m$ is not a prefix of the padding of any other message $m^{\prime}$.

### Organization of the Paper

Section II overviews the foundations and verification mechanisms implemented in our extension to EasyCrypt; Section III describes the Merkle-Damgård construction and its security properties; Section IV describes a machine-checked proof that Merkle-Damgård preserves collision resistance when used with a suffix-free padding, while Section V describes a machine-checked proof of its indifferentiability from a random oracle when the padding is prefix-free; Section VI discusses the applicability of our results to generalizations of the Merkle-Damgård construction and to the finalists of the NIST SHA-3 competition. We conclude in Section VII.

SECTION II

## A PRIMER ON EASYCRYPT

Building a cryptographic proof in EasyCrypt is a process that can be decomposed into the following steps:

• Defining a formal context, including types, constants and operators, and giving it meaning by declaring axioms and stating derived lemmas.
• Defining a number of games, each of them composed of a collection of procedures (written in the probabilistic imperative language described below) and adversaries declared as abstract procedures with access to oracles.
• Proving logical judgments that establish equivalences between games. This may be done fully automatically, with the help of hints from the user in the form of relational invariants, or interactively using basic tactics and automated strategies.
• Deriving inequalities between probabilities of events in games, either by using previously proven logical judgments or by direct computation.

In the remainder of this section, we briefly overview some key aspects of the process of building an EasyCrypt proof. Note that the work reported in this article benefited from several extensions of the tool with respect to [8]; these extensions include:

1. Support for reasoning about programs with loops. Loops were used to represent iteration in the Merkle-Damgård construction.
2. Mechanization of the Failure Event Lemma of [11], implemented in EasyCrypt as an extension to the mechanism that directly computes probability bounds. This was used to bound the success probability of the distinguisher in the proof of indifferentiability presented in Sect. V.
3. Proof engineering mechanisms to manage the size of proof obligations and the theories that external solvers use. These mechanisms were essential for the successful verification of the proofs presented in this paper.

Probabilistic experiments are defined as programs in pWHILE, a strongly-typed imperative probabilistic programming language. The grammar of pWHILE commands is defined as follows: $$\begin{array}{rcll} {\cal C} & ::= & {\rm skip} & \text{nop} \\ & \vert & {\cal V} \leftarrow {\cal E} & \text{deterministic assignment} \\ & \vert & {\cal V} \mathop{\leftarrow}^{\$} {\cal D}{\cal E} & \text{probabilistic assignment} \\ & \vert & {\rm if}\ {\cal E}\ {\rm then}\ {\cal C}\ {\rm else}\ {\cal C} & \text{conditional} \\ & \vert & {\rm while}\ {\cal E}\ {\rm do}\ {\cal C} & \text{loop} \\ & \vert & {\cal V} \leftarrow {\cal P}({\cal E}, \ldots, {\cal E}) & \text{procedure call} \\ & \vert & {\cal C};\ {\cal C} & \text{sequence} \end{array}$$ The only non-standard feature of the language is the probabilistic assignment; an assignment $x \mathop{\leftarrow}^{\$} d$ evaluates the expression $d$ in the current state to a distribution $\mu$ on values, samples a value according to $\mu$ and assigns it to variable $x$.

The key to the flexibility of EasyCrypt is that the base language of expressions and distribution expressions can be extended by the user to suit the needs of the verification task. The rich base language includes expressions over Booleans, integers, fixed-length bitstrings, lists, finite maps, and option, product and sum types. User-defined operators can be axiomatized or defined in terms of other operators. In the following, we let $\{0,1\}^{\ell}$ denote the uniform distribution on bitstrings of length $\ell$.

A program (equivalently, a game) in EasyCrypt is represented as a set of global variables together with a collection of procedures. Some of these procedures are concrete and given a definition as a command $c \in {\cal C}$, while some others may be abstract and left undefined.

Quantification over adversaries in cryptographic proofs is achieved by representing them as abstract procedures parametrized by a set of oracles; these oracles must be instantiated as other procedures in the program.

Commands operate on program memories, which map local and global variables to values; we let ${\cal M}$ denote the set of memories. The semantics of a command $c \in {\cal C}$ is a function $[\![c]\!] : {\cal M} \rightarrow {\cal D}({\cal M})$ from program memories to sub-distributions on program memories. Note that programs that do not terminate with probability 1 generate sub-distributions with total probability less than 1. We refer the reader to [9] for a detailed description of the semantics of pWHILE as it has been formalized in the Coq proof assistant. In what follows, we denote by ${\rm Pr}[c, m:A]$ the probability of event $A$ w.r.t. the distribution $[\![c]\!]\,m$ and often omit the initial memory $m$ when it is not relevant.

Although EasyCrypt is not tied to any particular cryptographic model, it provides good support to reason about proofs developed in the random oracle model. A random oracle ${\cal O} : X \rightarrow Y$ is modelled in EasyCrypt as a stateful procedure that maps values in $X$ into uniformly and independently distributed values in $Y$. The state of a random oracle can be represented as a global finite map $\mathbf{L}$ that is initially empty. Queries are answered consistently so that identical queries are given the same answer: on a query $x$ that is not yet in the domain of $\mathbf{L}$, a fresh value is sampled uniformly from $Y$ and stored in $\mathbf{L}[x]$; in either case the oracle returns $\mathbf{L}[x]$.

### B. Probabilistic Relational Hoare Logic

The foundation of EasyCrypt is a probabilistic Relational Hoare Logic (pRHL), whose judgments are quadruples of the form: $$\vdash c_{1}\sim c_{2}:\Psi \Rightarrow \Phi$$ where $c_{1}, c_{2}$ are programs and $\Psi, \Phi$ are first-order relational formulae.

Relational formulae are defined by the grammar: $$\Psi, \Phi ::= e \ \vert \ \neg\Phi \ \vert \ \Psi\wedge\Phi \ \vert \ \Psi\vee\Phi \ \vert \ \Psi\Rightarrow\Phi \ \vert \ \forall x.\ \Phi \ \vert \ \exists x.\ \Phi$$ where $e$ stands for a Boolean expression over logical variables and program variables tagged with either $\langle 1\rangle$ or $\langle 2 \rangle$ to denote their interpretation in the left or right-hand side program; the only restriction is that logical variables must not occur free. The special keyword res denotes the return value of a procedure and can be used in the place of a program variable. We write $e\langle i\rangle$ for the expression $e$ in which all program variables are tagged with $\langle i \rangle$. A relational formula is interpreted as a relation on program memories. For example, the formula $x\langle 1\rangle+1\leq y\langle2\rangle$ is interpreted as the relation $$R=\{(m_{1}, m_{2}) \ \vert \ m_{1}(x)+1\leq m_{2}(y)\}$$

The validity of a pRHL judgment is defined in terms of a lifting operator ${\cal L}:{\cal P}(A\times B)\rightarrow {\cal P}({\cal D}(A)\times {\cal D}(B))$. Concretely, $$\models c_{1} \sim c_{2}:\Psi \Rightarrow \Phi \ \mathop{=}^{\rm def} \ \forall m_{1}, m_{2}.\ m_{1}\,\Psi\, m_{2}\Rightarrow([\![c_{1}]\!]\,m_{1})\ {\cal L}(\Phi)\ ([\![c_{2}]\!]\, m_{2})$$ Formally, let $\mu_{1}$ be a probability distribution on a set $A$ and $\mu_{2}$ a probability distribution on a set $B$. We define the lifting $\mu_{1}\,{\cal L}(R)\,\mu_{2}$ of a relation $R \subseteq A \times B$ to $\mu_{1}$ and $\mu_{2}$ by the clause: $$\exists\mu:{\cal D}(A \times B).\ \pi_{1}(\mu)=\mu_{1}\wedge \pi_{2}(\mu)=\mu_{2}\wedge {\rm supp} (\mu) \subseteq R$$ where $\pi_{1}(\mu)$ (resp. $\pi_{2}(\mu)$) denotes the projection of $\mu$ on its first (resp. second) component and ${\rm supp}(\mu)$ is the support of $\mu$ as a sub-probability measure; if $\mu$ is discrete, this is just the set of pairs with positive probability.

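
To make the lifting clause concrete, the following minimal Python sketch checks whether a candidate joint distribution witnesses $\mu_{1}\,{\cal L}(R)\,\mu_{2}$ for finite discrete distributions. It is an illustration only and not part of EasyCrypt; the names `marginal`, `is_lifting_witness` and the example couplings are hypothetical, and distributions are assumed to be dictionaries with exact rational weights.

```python
from fractions import Fraction


def marginal(mu, index):
    """Project a joint distribution {(a, b): p} on its first (0) or second (1) component."""
    out = {}
    for pair, p in mu.items():
        out[pair[index]] = out.get(pair[index], Fraction(0)) + p
    return out


def _nonzero(dist):
    """Drop zero-weight entries so that equal distributions compare equal."""
    return {k: p for k, p in dist.items() if p != 0}


def is_lifting_witness(mu, mu1, mu2, rel):
    """Check the three conjuncts of the lifting clause for the candidate coupling mu."""
    support_in_rel = all(rel(a, b) for (a, b), p in mu.items() if p > 0)
    return (
        support_in_rel
        and _nonzero(marginal(mu, 0)) == _nonzero(mu1)
        and _nonzero(marginal(mu, 1)) == _nonzero(mu2)
    )


half = Fraction(1, 2)
uniform_bit = {0: half, 1: half}


def flipped(a, b):
    """Example relation R: the right value is the negation of the left one."""
    return a == 1 - b


# Coupling induced by the bijection b -> 1 - b on {0, 1}.
coupling_flip = {(0, 1): half, (1, 0): half}
# The identity coupling has the right marginals but its support is outside R.
coupling_id = {(0, 0): half, (1, 1): half}
```

In this example `coupling_flip` is a valid witness, which mirrors how a 1-1 function on the sampling set is used to relate two uniform samplings, whereas `coupling_id` fails the support condition.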
Figure 1 shows some selected rules that can be used to derive valid pRHL judgments. There are two kinds of rules: two-sided rules, which require that the related programs have the same syntactic form, and one-sided rules, which do not impose this requirement. One-sided rules are symmetric in nature and admit a left and a right variant. We briefly comment on some rules. The two-sided rule [Rnd] for random assignments requires that the distributions from which values are sampled be uniform on some set $X$; to apply the rule one must exhibit a function $f:X\rightarrow X$ that may depend on the state and is 1-1 if the precondition holds. The one-sided rule $[{\rm Rnd}\langle 1\rangle]$ for random assignments simply requires that the post-condition is established for all possible outcomes; in effect, this rule treats random assignment as a non-deterministic assignment. Similarly to Hoare logic, the rules for while loops require exhibiting an appropriate relational invariant $\Phi$. The two-sided rule [While] applies when the loops execute in lockstep and thus requires proving that the guards are equivalent. The one-sided rule $[{\rm While}\langle 1\rangle]$ further requires exhibiting a decreasing variant $v$ and a lower bound $m$. The premises ensure that the loop is absolutely terminating, which is crucial for the soundness of the rule.

The relational Hoare logic also allows capturing the well-known cryptographic argument “$x$ is uniformly distributed and independent of the adversary's view”, which is certainly one of the most difficult to formalize. We formalize this argument in EasyCrypt by proving that re-sampling $x$ preserves the semantics of the program. Suppose we want to prove that in a program $c$, a variable $x$ used in an oracle ${\cal O}$ is uniformly distributed and independent of the view of an adversary ${\cal A}^{\cal O}$. Let ${\cal O}^{\prime}$ be the same as ${\cal O}$ except that it re-samples $x$ when needed.
We identify a condition used that holds whenever $\cal A$ obtained some information about $x$ (and thus, re-sampling would not preserve the semantics). We then prove that the conditional statement $c^{\prime}\ \mathop{=}^{\rm def}\ {\rm if}\ \neg {\rm used}\ {\rm then}\ x \mathop{\leftarrow}^{\$} X$ can swap with calls to ${\cal O}$ and ${\cal O}^{\prime}$, i.e. $$\vdash c^{\prime};y \leftarrow {\cal O}(\vec{e})\sim y \leftarrow {\cal O}^{\prime}(\vec{e});c^{\prime}: \Phi \Rightarrow \Phi$$ where $\Phi$ implies equality over all global variables. From this, we can conclude that $c^{\prime}$ can also swap with calls to ${\cal A}^{\cal O}$ and ${\cal A}^{{\cal O}^{\prime}}$, and hence that the semantics of the program $c$ is preserved when ${\cal O}$ is replaced by ${\cal O}^{\prime}$. The advantage of using this kind of reasoning is that it is generally much easier to reason about a game where $x$ is sampled lazily, since its distribution is locally known.
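
The random oracle model introduced earlier in this section is the canonical instance of lazy sampling: answers are drawn only when a query is first made and are then remembered. The Python sketch below illustrates this behaviour under stated assumptions; the class name `RandomOracle`, the method `query` and the use of `secrets.randbits` as the uniform sampler are choices made for this example and do not come from EasyCrypt.

```python
import secrets


class RandomOracle:
    """Lazily sampled random oracle with outputs in {0,1}^out_bits."""

    def __init__(self, out_bits):
        self.out_bits = out_bits
        self.L = {}  # finite map from past queries to answers, initially empty

    def query(self, x):
        # Sample a fresh uniform answer only for queries not seen before,
        # so that identical queries always receive the same answer.
        if x not in self.L:
            self.L[x] = secrets.randbits(self.out_bits)
        return self.L[x]
```

Because a value is sampled at the moment it is first needed, its distribution is known at that program point, which is what makes games written in this style convenient to reason about.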

We conclude with some observations on the mechanization of reasoning in pRHL. We implement in EasyCrypt several variants of two-sided and one-sided rules of pRHL in the form of tactics that can be applied in a goal-oriented fashion to prove the validity of judgments. For instance, instead of implementing rule $[{\rm Rnd}\langle1\rangle]$, we combine it with the [Seq] rule to obtain the following more easily applicable rule: $$\frac{\vdash c_{1}\sim c_{2} : \Psi\Longrightarrow\forall v\in{\rm supp}(d).\ \Phi\{v/x\langle1\rangle\}}{\vdash c_{1};\ x \mathop{\leftarrow}^{\$} d\sim c_{2}:\Psi\Longrightarrow\Phi}$$

Figure 1. Selected pRHL rules

The application of a tactic may generate additional verification subgoals, and logical side conditions that are checked using SMT solvers, automated theorem provers and, as a last recourse, interactive proof assistants. Depending on their nature, application of the tactics can be fully automated or require user input. For instance, applying the tactics that mechanize the rules for while loops requires the user to provide an adequate invariant. In the case of the two-sided rule, a new subgoal is generated to prove the correctness of the user-provided invariant, whereas the equivalence of the loop guards is checked automatically as a logical side-condition.

In addition to tactics that mechanize basic rules of pRHL, EasyCrypt implements automated strategies that combine the application of a weakest precondition transformer wp with heuristics to apply basic tactics. The wp transformer operates on deterministic loop-free programs. These strategies can often be used to deal automatically with large fragments of proofs, letting the user focus on the parts that require ingenuity.

Since cryptographic results are stated as inequalities on probabilities rather than pRHL judgments, it is important to derive probability claims from pRHL judgments. This can be done mechanically by applying rules in the style of $$\frac{m_{1}\,\Psi\, m_{2} \qquad \vdash c_{1} \sim c_{2}:\ \Psi \Longrightarrow \Phi \qquad \Phi \Rightarrow (A\langle 1 \rangle\Rightarrow B\langle 2\rangle)}{{\rm Pr}[c_{1},m_{1}:A] \leq {\rm Pr}[c_{2},m_{2}:B]}$$

Game-based proofs often argue that two programs $c_{1}$ and $c_{2}$ behave identically unless a failure event $F$ is triggered. This is used to conclude that the difference in probability of any event between the two programs is bounded by the probability of $F$ in one of them. Although a syntactic characterization of this lemma is often used (in which failure is represented by a Boolean flag), it can be conveniently expressed and implemented in EasyCrypt in a more general form using pRHL.

#### Lemma 1

(Fundamental Lemma). Let $c_{1}$ and $c_{2}$ be two terminating commands and $A, B, F$ events such that $$\vdash c_{1}\sim c_{2} : \Psi\Longrightarrow (F\langle1\rangle\Leftrightarrow F\langle2\rangle)\wedge(\neg F\langle1\rangle\Rightarrow (A\langle1\rangle\Leftrightarrow B\langle2\rangle))$$ Then, if the initial memories of both games satisfy $\Psi$, $$\vert {\rm Pr}[c_{1}:A]-{\rm Pr}[c_{2}:B]\vert \leq {\rm Pr}[c_{1}:F]={\rm Pr}[c_{2}:F]$$

In most applications of the above lemma, the failure event $F$ can only be triggered in oracle queries made by an adversary. When the adversary can only make a known bounded number of queries, the following lemma, which we implemented in EasyCrypt, provides a means to bound the probability of failure. (We describe its hypotheses informally, but note that most of them can be captured by pRHL judgments.)

#### Lemma 2

(Failure event lemma). Consider a program $c_{1};c_{2}$, an integer expression $i$, an event $F$, and $u \ \in \ {\Bbb R}$. Assume the following:

• Free variables in $F$ and $i$ are only modified by $c_{1}$ or by oracles in some set $O$;
• After executing $c_{1}$, $F$ does not hold and $0 \leq i$;
• Oracles ${\cal O} \in O$ do not decrease $i$ and strictly increase $i$ when $F$ is triggered;
• For every oracle ${\cal O}$ in $O$, $\neg F\Rightarrow {\rm Pr}[{\cal O}:F]\leq u$.

Then, ${\rm Pr}[c_{1};c_{2}:F\wedge i\leq q]\leq q\cdot u$.
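
As a sanity check of the shape of the bound $q\cdot u$, the following Python sketch compares it against an exact computation in one special case. The special case is an assumption made purely for illustration: every oracle call triggers $F$ independently with probability exactly $u$, which is stronger than what the lemma requires. The function names are hypothetical.

```python
from fractions import Fraction


def exact_failure_probability(q, u):
    """Probability that F is triggered in at least one of q calls,
    assuming each call triggers it independently with probability exactly u."""
    return 1 - (1 - u) ** q


def failure_event_bound(q, u):
    """The bound q * u given by the failure event lemma."""
    return q * u
```

For instance, with $q = 2$ and $u = 1/4$ the exact probability is $7/16$, which is below the bound $1/2$.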

Finally, EasyCrypt implements a simple mechanism to directly compute bounds for the probability of an event in a program. This mechanism can establish, for instance, that the probability that a value uniformly chosen from a set $X$ equals an expression that does not depend on it is exactly $1/\vert X \vert$, or that the probability that the same uniformly sampled value belongs to a list of $n$ values that does not depend on it is at most $n/\vert X\vert$.
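
The two facts mentioned above can be checked by exhaustive enumeration on a small set. The sketch below is only an illustration of the claims, not of the EasyCrypt mechanism itself; the helper `pr_uniform`, the set `X` and the list `values` are hypothetical names chosen for this example.

```python
from fractions import Fraction


def pr_uniform(domain, event):
    """Exact probability of `event` for a value sampled uniformly from `domain`."""
    domain = list(domain)
    favourable = sum(1 for x in domain if event(x))
    return Fraction(favourable, len(domain))


X = range(16)

# Equality with a fixed value that does not depend on the sample: exactly 1/|X|.
pr_equal = pr_uniform(X, lambda x: x == 5)

# Membership in a list of n = 4 values: at most n/|X|; the inequality is strict
# here because the list contains a repeated element.
values = [3, 5, 5, 9]
pr_member = pr_uniform(X, lambda x: x in values)
```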

SECTION III

## THE MERKLE-DAMGÅRD CONSTRUCTION

Merkle-Damgård is a method for building a variable input-length (VIL) hash function from a fixed input-length (FIL) compression function. In its simplest form, the digest of a message is computed by first padding it to a block boundary and then iterating a compression function $f$ over the resulting blocks starting from an initial chaining value IV. A compression function $f$ maps a pair of bitstrings of length $k$ and $n$ (equivalently, a bitstring of length $k+n$) to a bitstring of length $n$: $$f : \{0,1\}^{k}\times\{0,1\}^{n}\rightarrow\{0,1\}^{n}$$ A padding function pad converts an arbitrary length message into a list of bitstrings of block size (where $k$ is the block size): $${\rm pad} : \{0,1\}^{\ast}\rightarrow(\{0,1\}^{k})^{\ast}$$

#### Definition 3

(Merkle-Damgård). Let $f$ be a compression function and pad a padding function as above, and let IV $\in \{0,1\}^{n}$ be a public value, known as the initialization vector. The hash function MD is defined as follows: $$\begin{aligned} {\rm MD} &: \{0,1\}^{\ast}\rightarrow\{0,1\}^{n} \\ {\rm MD}(m) &\mathop{=}^{\rm def} f^{\ast}({\rm pad}(m), {\rm IV}) \end{aligned}$$ where $f^{\ast} : (\{0,1\}^{k})^{\ast}\times\{0,1\}^{n} \rightarrow\{0,1\}^{n}$ is recursively defined by the equations $$f^{\ast}({\rm nil}, y) \mathop{=}^{\rm def} y \qquad f^{\ast}(x::xs, y)\mathop{=}^{\rm def}f^{\ast}(xs, f(x, y))$$
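
Definition 3 translates almost directly into code. The sketch below is a minimal Python rendering in which blocks and chaining values are small integers; the toy compression function `toy_f`, the padding `toy_pad` and the value `TOY_IV` are assumptions made for illustration and offer no security whatsoever.

```python
def f_star(f, blocks, y):
    """Iterate f over the blocks: f*(nil, y) = y and f*(x::xs, y) = f*(xs, f(x, y))."""
    for x in blocks:
        y = f(x, y)
    return y


def md(f, pad, iv, m):
    """MD(m) = f*(pad(m), IV)."""
    return f_star(f, pad(m), iv)


# Toy instantiation (hypothetical): k = n = 8, blocks and digests are bytes as ints.
TOY_IV = 7


def toy_f(x, y):
    """Toy compression function mapping (block, chaining value) to a chaining value."""
    return (3 * y + x) % 256


def toy_pad(m):
    """Split a byte string into one-byte blocks and append its length as a last block."""
    return list(m) + [len(m) % 256]
```

For example, `md(toy_f, toy_pad, TOY_IV, b"\x01")` pads the message to the blocks `[1, 1]` and computes `toy_f(1, toy_f(1, 7))`.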

The security properties of the compression function preserved by the Merkle-Damgård construction greatly depend on an adequate choice of padding to thwart certain types of attacks. In the remainder, we consider prefix- and suffix-free padding functions.

#### Definition 4

(Prefix- and suffix-free padding). A padding function pad is prefix-free (resp. suffix-free) iff for any distinct messages $m, m^{\prime}$, there is no $xs$ such that ${\rm pad} (m^{\prime})= {\rm pad}(m)\ \Vert \ xs$ (resp. ${\rm pad}(m^{\prime})= xs \ \Vert \ {\rm pad}(m)$).
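
Definition 4 can be tested by brute force on a finite set of messages. The Python sketch below does so for two hypothetical paddings over one-byte blocks: `pad_length_last` appends the message length as a final block, and `pad_length_first` prepends it. Both paddings and all function names are assumptions made for this example; passing the check on a finite set is of course not a proof for all messages.

```python
from itertools import product


def is_prefix(xs, ys):
    """True iff ys = xs || zs for some (possibly empty) zs."""
    return ys[:len(xs)] == xs


def is_suffix(xs, ys):
    """True iff ys = zs || xs for some (possibly empty) zs."""
    return len(xs) <= len(ys) and ys[len(ys) - len(xs):] == xs


def prefix_free(pad, messages):
    return not any(is_prefix(pad(m), pad(m2))
                   for m in messages for m2 in messages if m != m2)


def suffix_free(pad, messages):
    return not any(is_suffix(pad(m), pad(m2))
                   for m in messages for m2 in messages if m != m2)


def pad_length_last(m):
    return list(m) + [len(m)]


def pad_length_first(m):
    return [len(m)] + list(m)


# All byte strings over the alphabet {0, 1} of length at most 2.
messages = [bytes(t) for n in range(3) for t in product([0, 1], repeat=n)]
```

On this message set, appending the length gives a suffix-free padding that is not prefix-free (the padding `[0]` of the empty message is a prefix of the padding `[0, 1]` of the message `b"\x00"`), while prepending the length gives a prefix-free padding that is not suffix-free.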

Security properties of hash functions are stated as claims about the difficulty of an attacker in achieving certain goals. Collision resistance states that it is hard to find distinct $a, b$ such that $H(a)=H(b)$. Preimage resistance states that given a digest $h$, it is hard to find $a$ such that $H(a)=h$. Second preimage resistance states that given $a$, it is hard to find $b\neq a$ such that $H(a) = H(b)$. Finally, resistance to length-extension attacks states that it is hard to compute $H(a \ \Vert \ b)$ from $H(a)$. The precise formulation of these notions and their relationship is addressed in detail in [34].

An established method for proving the security of domain extenders, like MD above, is to show that they are property preserving: for instance, the seminal works of Merkle [31] and Damgård [23] show that if the compression function $f$ is collision resistant, then the hash function MD with some specific padding function is also collision resistant. Property preservation also applies for other notions; a representative panorama of property preservation for collision resistance, preimage and second preimage resistance appears in [5]. In Section IV we use EasyCrypt to reduce the collision resistance of suffix-free MD to the collision resistance of the underlying compression function.

An alternative method for proving the security of domain extenders is to show that they preserve ideal functionalities, i.e. that when applied to ideal functionalities they yield an ideal functionality. The notion of indifferentiability of Maurer et al. [30] provides an appropriate framework.

#### Definition 5

(Indifferentiability). A procedure ${\cal C}$ with oracle access to an ideal primitive ${\cal G}$ is $(t_{\cal S}, q, \epsilon)$-indifferentiable from ${\cal F}$ if there exists a simulator ${\cal S}$ with oracle access to ${\cal F}$ and executing within time $t_{\cal S}$, such that for any distinguisher ${\cal D}$ that makes at most $q$ oracle queries, the following inequality holds: $$\vert {\rm Pr}[b\leftarrow {\cal D}^{{\cal C},{\cal G}}():b]-{\rm Pr}[b\leftarrow {\cal D}^{{\cal F},{\cal S}}():b]\vert \leq \epsilon$$

Intuitively, the distinguisher is either given access to ${\cal C}^{\cal G}$ and ${\cal G}$, or it is given access to ${\cal F}$ and ${\cal S}^{\cal F}$ (see Figure 2). The probability that it succeeds in distinguishing the two scenarios must be small.

Figure 2. Indifferentiability of ${\cal C}$ from an ideal functionality ${\cal F}$

In the application considered in this paper, ${\cal C}$ represents the Merkle-Damgård construction, ${\cal G}$ represents the compression function and ${\cal F}$ represents an idealized hash function. Thus, the role of $\cal S$ is to simulate the behavior of the compression function, i.e. it should behave towards ${\cal F}$ like ${\cal G}$ behaves towards the Merkle-Damgård construction. In Section V, we use EasyCrypt to define a simulator $\cal S$ that proves indifferentiability of MD from a VIL random oracle when the compression function ${\cal G}$ is modeled as a FIL random oracle—random oracles [13] are functions that map values in the input domain into uniformly and independently distributed values in the output domain; see Section II for a precise definition.

We conclude this section with two observations on the two proof methods. First, indifferentiability from random oracles provides weaker guarantees than initially anticipated (see [19] and [33] respectively for discussions on the random oracle model and on the notion of indifferentiability), but nevertheless remains a useful heuristic to gain confidence in the design of hash functions. Second, the two methods are complementary. On the one hand, indifferentiability from a VIL random oracle entails resistance against collision, preimage, second preimage, and length-extension attacks. On the other hand, property preservation is often established under weaker hypotheses and moreover, exact security bounds derived from indifferentiability proofs are sometimes looser than bounds delivered by direct proofs of property preservation.

SECTION IV

## COLLISION RESISTANCE

We show that finding collisions for MD with a suffix-free padding is at least as hard as finding collisions for $f$. A collision for the compression function $f$ is a pair of inputs $xy_{1}, xy_{2}$ satisfying the predicate $${\rm coll} (xy_{1}, xy_{2})\mathop{=}^{\rm def}xy_{1}\neq xy_{2}\wedge f(xy_{1})=f(xy_{2})$$

#### Theorem 6.

Let MD be a Merkle-Damgård hash function with compression function $f$ and a suffix-free padding pad. For any algorithm ${\cal A}$ finding collisions for MD of length at most $p$, there exists an algorithm $\cal B$ that finds collisions for $f$ with the same probability and with an overhead of $O(p\cdot t_{f})$, where $t_{f}$ is a bound on the time needed for one evaluation of $f$.

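Consider the experiment ${\rm CR}^{\rm MD}$, in which an adversary $\cal A$ performs a collision attack against MD: $\cal A$ outputs two messages $m_{1}, m_{2}$, their digests $h_{1}, h_{2}$ are obtained by calling an oracle $\sf F$ that implements MD, and the experiment returns whether $m_{1}\neq m_{2}$ and $h_{1}=h_{2}$.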

We prove in EasyCrypt that the algorithm $\cal B$ shown in Fig. 3 finds collisions for $f$ in the experiment ${\rm CR}^{f}$ with at least the same probability as ${\cal A}$ finds collisions for MD in ${\rm CR}^{\rm MD}$, i.e. $${\rm Pr}\left[{\rm CR}^{{\rm MD}}:{\rm res}\right]\leq {\rm Pr}\left[{\rm CR}^{f}:{\rm res}\right] \tag{1}$$ (Recall that res is a keyword that stands for the value returned by the main procedure of the games.) Algorithm $\cal B$ obtains from $\cal A$ a pair of messages $m_{1}, m_{2}$, pads them, and iterates the compression function over the first blocks of the longer padded message until the remaining suffix is the same length as the other padded message. It then performs the remaining iterations needed to compute ${\rm MD}(m_{1})$ and ${\rm MD}(m_{2})$ in parallel. If both messages collide, a collision for $f$ must occur in one of these parallel iterations.

Figure 3. A collision-finder $\cal B$ for the compression function $f$
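
The prose description of $\cal B$ can be sketched in Python as follows. This is a reconstruction from the description in the text, not the listing of Figure 3; the toy compression function `toy_f` (with 4-bit outputs so that collisions are easy to find), the suffix-free padding `toy_pad`, the value `TOY_IV` and all function names are assumptions made for illustration.

```python
from itertools import product

TOY_IV = 0


def toy_f(x, y):
    """Toy compression function with 4-bit chaining values (deliberately weak)."""
    return (5 * y + 3 * x + 1) % 16


def toy_pad(m):
    """Suffix-free on messages shorter than 256 bytes: the last block is the length."""
    return list(m) + [len(m)]


def md(m):
    y = TOY_IV
    for x in toy_pad(m):
        y = toy_f(x, y)
    return y


def collision_finder(f, pad, iv, m1, m2):
    """Return a pair of distinct inputs of f with equal outputs, or None."""
    xs1, xs2 = list(pad(m1)), list(pad(m2))
    y1 = y2 = iv
    # Consume the extra leading blocks of the longer padded message.
    while len(xs1) > len(xs2):
        y1 = f(xs1.pop(0), y1)
    while len(xs2) > len(xs1):
        y2 = f(xs2.pop(0), y2)
    # Iterate in parallel, looking for distinct inputs with equal outputs.
    while xs1:
        in1, in2 = (xs1.pop(0), y1), (xs2.pop(0), y2)
        y1, y2 = f(*in1), f(*in2)
        if in1 != in2 and y1 == y2:
            return in1, in2
    return None


# Brute-force adversary: 40 short messages, 16 digests, so some pair collides.
messages = [bytes(t) for n in range(4) for t in product(range(3), repeat=n)]
colliding = [(a, b) for a in messages for b in messages
             if a != b and md(a) == md(b)]
```

Running `collision_finder(toy_f, toy_pad, TOY_IV, a, b)` on any pair in `colliding` yields two distinct inputs of `toy_f` with the same output, in line with Theorem 6.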

In order to show (1) it suffices to prove the relational judgment: $$\vdash {\rm CR}^{{\rm MD}}\sim {\rm CR}^{f} : {\rm true} \Longrightarrow {\rm res}\langle1\rangle\Rightarrow {\rm res}\langle2\rangle \tag{2}$$ Proving this judgment involves non-trivial relational reasoning because equivalent computations in the related games are not performed in lockstep. We begin by inlining the call to ${\cal B}$ in ${\rm CR}^{f}$ and showing that the relational post-condition $$\begin{aligned} & (m_{1}, m_{2})\langle1\rangle=(m_{1}, m_{2})\langle 2\rangle\ \wedge \\ & (h_{1}={\rm MD}(m_{1})\wedge h_{2}={\rm MD}(m_{2}))\langle 1\rangle \end{aligned}$$ holds after the call to ${\cal A}$ in both programs and the two calls to $\sf F$ in ${\rm CR}^{\rm MD}$. To show this, we prove that oracle $\sf F$ correctly implements function MD using the one-sided rule for loops; the needed invariant is simply $f^{\ast}(xs, y)={\rm MD}(m)$.

At this point, note that if $m_{1}=m_{2}$, judgment (2) holds trivially (we only have to check that $\cal B$ terminates). We are left with the case $m_{1}\neq m_{2}$. Assume w.l.o.g. that $\vert {\rm pad} (m_{2})\vert \leq\vert {\rm pad} (m_{1}) \vert$, in which case $\cal B$ never enters its second loop and the following invariant holds for the first: $$\begin{aligned} & f^{\ast}(xs_{1}, y_{1})={\rm MD}(m_{1})\wedge f^{\ast}(xs_{2}, y_{2})={\rm MD}(m_{2})\ \wedge \\ & m_{1}\neq m_{2}\wedge\vert xs_{2}\vert \leq\vert xs_{1}\vert \wedge xs_{2}={\rm pad}(m_{2})\ \wedge \\ & \exists xs^{\prime}.\ xs^{\prime}\ \Vert\ xs_{1}={\rm pad}(m_{1}) \end{aligned} \tag{3}$$

We prove that if the messages $m_{1}, m_{2}$ output by $\cal A$ collide, the last loop necessarily exits because a collision is found. This can be shown by means of the following loop invariant: $$\begin{aligned} & f^{\ast}(xs_{1}, y_{1})={\rm MD}(m_{1})\wedge f^{\ast}(xs_{2}, y_{2})={\rm MD}(m_{2})\ \wedge \\ & \vert xs_{2}\vert =\vert xs_{1}\vert\ \wedge \\ & (xs_{1}=xs_{2}\Rightarrow y_{1}\neq y_{2}) \end{aligned}$$ Note that (3) and the negation of the guard of the first loop imply that the above invariant holds initially. In particular, the last implication holds because if $xs_{1}$ and $xs_{2}$ were equal, there would exist a prefix $xs^{\prime}$ such that $xs^{\prime} \ \Vert \ {\rm pad} (m_{2})= {\rm pad} (m_{1})$, contradicting the fact that pad is suffix-free. Finally, observe that the last loop can exit either because a collision for $f$ is found or because $xs_{1} = {\rm nil}$. In this latter case, it must be the case that $xs_{2} = {\rm nil}$ and therefore $y_{1}={\rm MD}(m_{1})={\rm MD}(m_{2})=y_{2}$. However, from the last implication in the invariant we also have $y_{1}\neq y_{2}$, which leads to a contradiction that renders this case trivial.

SECTION V

## INDIFFERENTIABILITY

We prove the indifferentiability of the MD construction from a random oracle in $\{0,1\}^{\ast}\rightarrow\{0,1\}^{n}$ when its compression function $f$ is modeled as a random oracle in $\{0,1\}^{k}\times\{0,1\}^{n}\rightarrow\{0,1\}^{n}$ and its padding function is prefix-free. Our proof is based on [22].

#### Theorem 7

(Indifferentiability of MD). The Merkle-Damgård construction MD with an ideal compression function $f$, prefix-free padding pad, and initialization vector IV is $(t_{\cal S}, q_{\cal D}, \epsilon)$-indifferentiable from a variable input-length random oracle $F:\{0,1\}^{\ast}\rightarrow\{0,1\}^{n}$ where $$\epsilon={3 \ell^{2} \ q_{\cal D}^{2} \over 2^{n}} \qquad t_{\cal S}=O(\ell \ q_{\cal D}^{2})$$ and $\ell$ is an upper bound on the block-length of ${\rm pad}(m)$ for any message $m$ appearing in a query of the distinguisher.

In what we call the real scenario, a distinguisher $\cal D$ has access to an oracle $F_{q}$ implementing the function MD and to a random oracle $f_{q} : \{0,1\}^{k} \times \{0,1\}^{n} \rightarrow \{0,1\}^{n}$ that models the compression function. In contrast, in the ideal scenario, $\cal D$ has access to a random oracle $F_{q}:\{0,1\}^{\ast}\rightarrow\{0,1\}^{n}$ and $f_{q}$ is simulated. See Fig. 4 for a formulation of these two scenarios as games. To prevent $\cal D$ from making more than $q$ oracle queries, we enforce a bound $q=\ell \ q_{\cal D}$ on the counter ${\bf q}_{f}$, which counts the number of evaluations of the compression function in game ${\rm G}_{\rm real}$. Note that this is more permissive than the proof of Coron et al. [22], since it allows the distinguisher to trade queries to $F_{q}$ for queries to $f_{q}$. Indeed, if $\cal D$ makes $n_{f}$ queries to $f_{q}$ and $n_{F}$ queries to $F_{q}$, we require $${\bf q}_{f}\leq n_{f}+\ell \ n_{F} \leq \ell(n_{f}+n_{F}) \leq \ell \ q_{\cal D}=q$$

We show that the simulator $f_{q}$ in ${\rm G}_{\rm ideal}$ behaves consistently with a random oracle. Whenever the distinguisher makes a query $(x, y)$ to oracle $f_{q}$, the simulator looks among all previous queries for a sequence that could be the chain of inputs to the compression function used to compute the hash of some message $m$, for which $x$ is the last block of ${\rm pad}(m)$. We call such a sequence a complete chain, and we define it formally below. When such a sequence is found, the simulator queries $F$ for the hash of $m$ and forwards the answer to the distinguisher. Otherwise, the simulator answers with a uniformly distributed random value. Figure 5 shows how this simulator would react to a sequence of queries $$y_{2}\leftarrow f_{q}(x_{1}, {\rm IV});\ y_{3}\leftarrow f_{q}(x_{2}, y_{2});\ y_{4}\leftarrow f_{q}(x_{3}, y_{3})$$ where $x_{1} \ \Vert \ x_{2} \ \Vert \ x_{3}= {\rm pad} (m)$. The first two queries will be answered with random values, while the third completes a chain and is answered by forwarding ${\rm pad}^{-1}(x_{1} \ \Vert \ x_{2} \ \Vert \ x_{3})$ to $F$; this maintains the consistency with the real scenario.
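The query accounting of the real scenario can be sketched as follows. This is only an illustration under stated assumptions: the toy sizes `K` and `N_BITS`, the length-prefix padding `pad`, and the way the budget is enforced (returning `None` once it would be exceeded) are hypothetical choices and are not taken from the games in Fig. 4.

```python
import secrets

K = 4          # block length in bytes (toy value)
N_BITS = 32    # chaining-value length in bits (toy value)
IV = 0         # initialization vector (toy value)

def pad(m: bytes) -> list:
    """Hypothetical prefix-free padding: a length block, then zero-padded data."""
    body = m + b"\x00" * (-len(m) % K)
    return [len(m).to_bytes(K, "big")] + [body[i:i + K] for i in range(0, len(body), K)]

class RealWorld:
    def __init__(self, q: int):
        self.q = q      # budget q = ell * q_D
        self.q_f = 0    # evaluations of the compression function so far
        self.T = {}     # lazily sampled table of the random function f

    def _f(self, x: bytes, y: int) -> int:
        self.q_f += 1
        if (x, y) not in self.T:
            self.T[(x, y)] = secrets.randbits(N_BITS)
        return self.T[(x, y)]

    def f_q(self, x: bytes, y: int):
        if self.q_f + 1 > self.q:
            return None                 # budget exhausted
        return self._f(x, y)

    def F_q(self, m: bytes):
        blocks = pad(m)
        if self.q_f + len(blocks) > self.q:
            return None                 # budget exhausted
        y = IV
        for x in blocks:                # Merkle-Damgard iteration over pad(m)
            y = self._f(x, y)
        return y

world = RealWorld(q=16)
digest = world.F_q(b"hello world")      # one F_q query costs len(pad(m)) evaluations
y = IV
for x in pad(b"hello world"):           # direct f_q queries rebuild the same chain
    y = world.f_q(x, y)
```

Here one query to `F_q` on a four-block padded message consumes as much of the budget as four queries to `f_q`, which is the trade-off the inequality above expresses.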

Figure 5. An example illustrating how the simulator works

#### Definition 8

(Complete chain). A complete chain in a map $T : \{0,1\}^{k}\times\{0,1\}^{n}\rightarrow\{0,1\}^{n}$ is a sequence $(x_{1}, y_{1})\ldots(x_{i}, y_{i})$ such that $y_{1}={\rm IV}$ and

1. $\forall j=1\ldots i-1.\ (x_{j}, y_{j})\in {\rm dom} (T)\wedge T[x_{j}, y_{j}]=y_{j+1}$
2. $x_{1}\Vert\ldots\Vert x_{i}$ is in the domain of ${\rm pad}^{-1}$

The function findseq $((x,y),T^{\prime})$ used by the simulator searches in $T^{\prime}$ for a complete chain of the form $(x_{1}, y_{1})\ldots(x_{i}, y_{i})(x, y)$ and returns $x_{1} \Vert \ldots \Vert x_{i}$, or $\perp$ if no such chain is found.
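The simulator and its chain search can be sketched as below. This is a rough illustration rather than the code of the paper's games: the toy sizes, the length-prefix padding `pad` with inverse `unpad`, and the class name `IdealWorld` are assumptions. For convenience this `findseq` returns the unpadded message of the chain (or `None` in place of $\perp$) instead of the block sequence.

```python
import secrets

K = 4          # block length in bytes (toy value)
N_BITS = 32    # chaining-value length in bits (toy value)
IV = 0         # initialization vector (toy value)

def pad(m: bytes) -> list:
    """Hypothetical prefix-free padding: a length block, then zero-padded data."""
    body = m + b"\x00" * (-len(m) % K)
    return [len(m).to_bytes(K, "big")] + [body[i:i + K] for i in range(0, len(body), K)]

def unpad(blocks: list):
    """Inverse of pad; None when the blocks are not a valid padded message."""
    if not blocks:
        return None
    length = int.from_bytes(blocks[0], "big")
    m = b"".join(blocks[1:])[:length]
    return m if pad(m) == list(blocks) else None

def findseq(x: bytes, y: int, T1: dict):
    """Search T1 backwards for a complete chain ending in the query (x, y)."""
    def search(y_cur, suffix, seen):
        if y_cur == IV:                       # chain starts at IV: check padding
            m = unpad(suffix)
            if m is not None:
                return m
        for (xp, yp), out in T1.items():      # entries whose answer is y_cur
            if out == y_cur and (xp, yp) not in seen:
                m = search(yp, [xp] + suffix, seen | {(xp, yp)})
                if m is not None:
                    return m
        return None
    return search(y, [x], frozenset())

class IdealWorld:
    def __init__(self):
        self.R = {}     # lazily sampled random oracle F
        self.T1 = {}    # T': answers the simulator has already given

    def F(self, m: bytes) -> int:
        if m not in self.R:
            self.R[m] = secrets.randbits(N_BITS)
        return self.R[m]

    def f_q(self, x: bytes, y: int) -> int:
        if (x, y) not in self.T1:
            m = findseq(x, y, self.T1)
            # Complete chain found: forward to F; otherwise answer at random.
            self.T1[(x, y)] = self.F(m) if m is not None else secrets.randbits(N_BITS)
        return self.T1[(x, y)]

world = IdealWorld()
y = IV
for x in pad(b"hello"):      # three blocks, as in the example of Figure 5
    y = world.f_q(x, y)
```

In this run the first two queries are answered with random values and only the third one reaches `F`, so `y` agrees with `world.F(b"hello")`.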

To help SMT solvers and automated provers check logical side-conditions arising in our proofs, we needed to derive several auxiliary lemmas: e.g., if a finite map $T$ is injective and does not map any entry to the value IV, every complete chain is determined by its last element; that is, for any given $(x, y)$, the value of findseq $((x,y),T^{\prime})$ is uniquely determined. All of these lemmas have been mechanically verified based solely on the axiomatization and definitions of elementary operations. In many cases, EasyCrypt is able to verify the validity of these lemmas automatically. The more involved lemmas have been manually verified in the Coq proof assistant.

Figure 4. The games ${\rm G}_{\rm real}$ and ${\rm G}_{\rm ideal}$

The proof proceeds by stepwise transforming the game ${\rm G}_{\rm real}$ into the game ${\rm G}_{\rm ideal}$, upper-bounding the probability that the outcomes of consecutive games differ. By summing up over these probabilities, we obtain a concrete bound for the advantage of the distinguisher in telling apart the initial and final games. Specifically, we prove: $$\vert {\rm Pr}[{\rm G}_{\rm real}: b]-{\rm Pr}[{\rm G}_{\rm ideal}:b]\vert \leq{3q^{2}\over 2^{n}} \eqno{\hbox{(4)}}$$

Figure 6. The game ${\rm G}_{{\rm real}^{\prime}}$

We begin by considering the game ${\rm G}_{{\rm real}^{\prime}}$ defined in Fig. 6. We introduce events ${\bf bad}_{1}$, ${\bf bad}_{2}$ and ${\bf bad}_{3}$ that will be needed later. First, we introduce a copy of oracle $f$, which we call $f_{\bf bad}$. Both use the same map $T$ to store previously answered queries; the difference is that $f_{\bf bad}$ may trigger events ${\bf bad}_{1}$ and ${\bf bad}_{2}$. We also introduce the lists ${\mbi Y}$ and ${\bf Z}$ that allow us to detect when these events occur. In addition, we modify the simulator $f_{q}$ to maintain a map $T^{\prime}$ of queries known to the distinguisher. Observe that $T^{\prime}\subseteq T$, because queries to $F_{q}$ result in entries being added only to $T$, whereas queries to $f_{q}$ result in the same entries being added to both $T$ and $T^{\prime}$. Additionally, the simulator $f_{q}$ behaves in two different ways depending on whether findseq $((x, y), T^{\prime})\neq\perp$. If this condition holds, there is a complete chain in map $T^{\prime}$ ending in $(x, y)$. In this case, in game ${\rm G}_{\rm ideal}$ the simulator should call oracle $F$ to maintain consistency with the random oracle; otherwise the simulator could just sample a fresh random value. In this game, oracle $f_{q}$ returns the same answer in both cases, but sets ${\bf bad}_{\{1,2,3\}}$ accordingly. Lastly, we also unroll the last iteration of the loop in $F_{q}$.

Note that instrumenting the game with the additional map $T^{\prime}$ and the failure events ${\bf bad}_{\{1,2,3\}}$ does not change the observable behavior. Therefore, $${\rm Pr}[{\rm G}_{\rm real}:b]={\rm Pr}[{\rm G}_{{\rm real}^{\prime}}\ :\ b]$$

In game ${\rm G}_{\rm realRO}$, defined in Fig. 7, we introduce a random oracle ${\rm RO} : \{0,1\}^{\ast}\rightarrow\{0,1\}^{n}$ and replace every call $f_{\bf bad}(x, y)$ in game ${\rm G}_{{\rm real}^{\prime}}$ where $(x, y)$ ends a complete chain in $T$ with a call to ${\rm RO}(m, y)$, where $m$ is the unpadded message of the chain. That is, in oracle $f_{q}$ we call RO if findseq is successful, and in oracle $F_{q}$ we call RO instead of the last call to $f_{\bf bad}$. We also introduce the map ${\mbi I}:\Bbb{N}\rightarrow\{0,1\}^{n}\times {\Bbb B}$, which enumerates all sampled chaining values and includes a tainted flag to keep track of values known to the distinguisher. We introduce an indirection in maps $T$ and $T^{\prime}$ through the use of map ${\mbi I}$. This allows us to keep track of the order in which queries were made and to know which answers we could re-sample without introducing inconsistencies in the view of the distinguisher.

The failure events that were introduced in the last step capture certain dependencies on previous queries that the distinguisher may exploit to tell apart games ${\rm G}_{{\rm real}^{\prime}}$ and ${\rm G}_{\rm realRO}$. We prove that games ${\rm G}_{{\rm real}^{\prime}}$ and ${\rm G}_{\rm realRO}$ behave the same provided these failure events do not occur.

1. ${\bf bad}_{1}$ is triggered whenever oracle $f_{\bf bad}$ samples a random value that is either IV or has already been sampled for a distinct query before. The role of this event is twofold. On the one hand, if IV is sampled as a random value, then there could exist a complete chain in $T$ that is a suffix of another complete chain in $T$, as illustrated in the first example of Figure 8 (here $T[x_{2}, y_{2}]= {\rm IV}$). The problem is that oracle $F_{q}$ in the game ${\rm G}_{\rm real}$ will generate the same values for the two messages corresponding to those two chains, while $F_{q}$ in the game ${\rm G}_{\rm ideal}$ most likely will not. On the other hand, if a sampled value has been sampled for another query before, then there could exist two complete chains in $T$ that collide at some point and are identical from that point on, as illustrated in the second example of Figure 8. Again the two corresponding messages would yield the same answer in ${\rm G}_{\rm real}$ but most likely not in ${\rm G}_{\rm ideal}$ on queries to $F_{q}$. By requiring that event ${\bf bad}_{1}$ does not occur, we guarantee that in game ${\rm G}_{{\rm real}^{\prime}}$ the map $T$ is injective and does not map any value to IV.
2. ${\bf bad}_{2}$ is triggered whenever oracle $f_{\bf bad}$ samples a random value that has already been used as a chaining value in a previous query. This means that this query may be part of a chain of which the distinguisher has already queried later points, which should not be possible. The event also captures that no fixed-points (i.e. entries of the form ${\mbi T}[x, y]=y$) should be sampled.
3. ${\bf bad}_{3}$ is triggered whenever a chaining value $y$ in a query has already been sampled as a random value and is in the range of $T$ for some previous query $(x^{\prime}, y^{\prime})$, but $(x^{\prime}, y^{\prime})$ does not appear in the domain of $T^{\prime}$ and $(x^{\prime}, y^{\prime})$ is not the last element of a complete chain in $T$. Intuitively, this means that $y$ was never returned by $f_{q}$ or $F_{q}$ and hence the distinguisher managed to guess a random value.
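The first two events can be pictured with the following sketch of a sampling step that raises the flags. The figure defining the game is not reproduced in this text, so the bookkeeping is an assumption: the class name `BadTracker`, the injectable `draw` function, and the exact way the lists ${\bf Z}$ and ${\mbi Y}$ are maintained are hypothetical; ${\bf bad}_{3}$ depends on the maps $T$ and $T^{\prime}$ and is left out.

```python
import secrets

class BadTracker:
    def __init__(self, iv: int, n_bits: int, draw=None):
        # draw is injectable so that the sampling can be replayed in tests
        self.draw = draw or (lambda: secrets.randbits(n_bits))
        self.Z = [iv]        # IV together with every value sampled so far
        self.Y = []          # chaining values that appeared in queries
        self.bad1 = False
        self.bad2 = False

    def sample(self, y: int) -> int:
        """Sample the answer to a fresh query whose chaining value is y."""
        self.Y.append(y)
        z = self.draw()
        if z in self.Z:
            self.bad1 = True     # z is IV or repeats an earlier sample
        if z in self.Y:
            self.bad2 = True     # z was already used as a chaining value
        self.Z.append(z)
        return z

tracker = BadTracker(iv=0, n_bits=8, draw=iter([5, 5]).__next__)
tracker.sample(1)     # fresh value 5: no flag is raised
tracker.sample(2)     # 5 is sampled again: bad1 is raised
```

Because the queried chaining value is recorded in `Y` before sampling, a fixed point (an answer equal to the chaining value of the same query) also raises `bad2` in this sketch.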

In order to relate games ${\rm G}_{{\rm real}^{\prime}}$ and ${\rm G}_{\rm realRO}$ in case that findseq $((x,y),T^{\prime})$ in $f_{q}$ succeeds in both games, we need to show that the call $f_{\bf bad}(x, y)$ in ${\rm G}_{{\rm real}^{\prime}}$ and the call ${\rm RO} (m, y)$ in ${\rm G}_{\rm realRO}$ behave similarly. For this we show that the following invariant is preserved in both games: for all complete chains $\cal C$ in the map $T$ of game ${\rm G}_{{\rm real}^{\prime}}$ with ${\rm last}({\cal C}) \in {\rm dom} (T)$, it holds that $\cal C$'s associated message is in ${\rm dom}({\mbi R})$ of game ${\rm G}_{\rm realRO}$ and, vice versa, every message in ${\rm dom}({\mbi R})$ of game ${\rm G}_{\rm realRO}$ has a corresponding complete chain $\cal C$ in the map $T$ of game ${\rm G}_{{\rm real}^{\prime}}$ with ${\rm last}({\cal C}) \in {\rm dom} (T)$. This invariant allows EasyCrypt to prove this case by inferring that $(x, y) \in {\rm dom} ({\mbi T})$ in game ${\rm G}_{{\rm real}^{\prime}}$ if and only if $m \in {\rm dom} ({\mbi R})$ in game ${\rm G}_{{\rm realRO}}$.

Proving that the aforementioned invariant is preserved in the games requires several other invariants. Most of them merely relate the representation of maps in both games; we omit these technical details. The essential invariant is that the distinguisher queries $f_{q}$ for points in a chain only if it has already queried the preceding part of the chain. This is important as it implies that each chain will be completed by a query for its last element, in which case findseq will detect this query and the corresponding message will be added to ${\mbi R}$. In game ${\rm G}_{{\rm real}^{\prime}}$, the predicate ${\rm set\_bad3}$ enforces this ordering by triggering event ${\bf bad}_{3}$. The probability of this event is negligible, because it means that $y$ was never output by $f_{q}$ or $F_{q}$ and hence is not known to the distinguisher. In game ${\rm G}_{\rm realRO}$, we use the map ${\mbi I}$ to iterate over all chaining values in order to check for the ordering mentioned above.

In oracle $F_{q}$ of game ${\rm G}_{\rm realRO}$, the computation of the Merkle-Damgård construction is split into three stages due to the different usage of the maps $T^{\prime},T^{\prime}_{i}$, and $T$. The first loop computes the construction for values that were already queried by the distinguisher and are therefore in ${\rm dom}(T^{\prime})$. The restriction that the distinguisher may only query chains in order implies that such values occur only in the prefix of a chain. The second loop handles values that were already used before by oracle $F_{q}$, and the third loop samples fresh chaining values. Relating the final call to $f_{\bf bad}$ in game ${\rm G}_{{\rm real}^{\prime}}$ and the final call to RO in game ${\rm G}_{\rm realRO}$ is similar to the corresponding case in oracle $f_{q}$. We prove that the advantage in differentiating between games ${\rm G}_{{\rm real}^{\prime}}$ and ${\rm G}_{\rm realRO}$ is upper bounded by the probability of any of ${\bf bad}_{1},{\bf bad}_{2},{\bf bad}_{3}$ occurring in game ${\rm G}_{\rm realRO}$: $$\vert {\rm Pr}[{\rm G}_{{\rm real}^{\prime}} : b]- {\rm Pr}[{\rm G}_{\rm realRO}:b]\vert \leq {\rm Pr}[{\rm G}_{\rm realRO} : {\bf bad}_{1} \vee {\bf bad}_{2}\vee {\bf bad}_{3}]$$

Figure 7. The game ${\rm G}_{\rm realRO}$

Figure 8. Two examples illustrating the necessity of event ${\bf bad}_{1}$

To finish the proof, we have to relate ${\rm Pr}[{\rm G}_{\rm realRO}:b]$ with ${\rm Pr}[{\rm G}_{\rm ideal}:b]$ and bound the probability of the failure events in game ${\rm G}_{\rm realRO}$. We first focus on the probability of ${\bf bad}_{1}$ and ${\bf bad}_{2}$. Event ${\bf bad}_{1}$ (resp. ${\bf bad}_{2}$) is set when a freshly sampled value $z$ is in the list $\bf Z$ (resp. ${\mbi Y}$); since the size of both lists is bounded by $q$, this occurs with probability at most $q \ 2^{-n}$ for each of the possible $q$ queries.

Note that oracles $F_{q}$, RO, and $f_{q}$ in game ${\rm G}_{\rm realRO}$ use the same code to detect the failure events ${\bf bad}_{1}$ and ${\bf bad}_{2}$ when sampling a fresh value $z$. We can wrap this code in a new oracle that meets the conditions of Lemma 2: we take $u=q\ {2^{-n}}$ and $i=\vert {\bf Z} \vert$ (resp. $\vert {\mbi Y} \vert$). We get $${\rm Pr}[{\rm G}_{{\rm realRO}}: {\bf bad}_{1}] \leq{q^{2}\over 2^{n}} \quad {\rm Pr}[{\rm G}_{{\rm realRO}}: {\bf bad}_{2}] \leq{q^{2}\over 2^{n}}$$
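As a rough sanity check of where the factor $q^{2}$ comes from, and assuming that Lemma 2 has the usual failure-event form (an event that is set with probability at most $u$ in each of at most $q$ oracle calls occurs with probability at most $q \ u$), the bound for ${\bf bad}_{1}$ unfolds as $${\rm Pr}[{\rm G}_{\rm realRO} : {\bf bad}_{1}] \leq \sum_{j=1}^{q} {\vert {\bf Z} \vert \over 2^{n}} \leq \sum_{j=1}^{q} {q \over 2^{n}} = q \ u = {q^{2} \over 2^{n}}$$ The same computation with ${\mbi Y}$ in place of ${\bf Z}$ gives the bound for ${\bf bad}_{2}$.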

We are left to bound the probability of ${\bf bad}_{3}$ and relate ${\rm Pr}[{\rm G}_{\rm realRO}:b]$ with ${\rm Pr}[{\rm G}_{\rm ideal}:b]$. Note that in game ${\rm G}_{\rm realRO}$ chaining values are sampled eagerly, i.e. for a query $m$, oracle $F_{q}$ samples chaining values $z$ that are independent of the distinguisher's view (their associated flag is set to true). These values might later on become known to the distinguisher if it recomputes the Merkle-Damgård construction for $m$ using oracle $f_{q}$ (we identify this case by setting found = true). We want to transform the game so that chaining values are sampled lazily (as in game ${\rm G}_{\rm ideal}$).

The same kind of argument can be used for ${\bf bad}_{3}$. This event is set whenever the distinguisher makes a query $(x, y)$ to $f_{q}$ with $y$ coinciding with a value uniformly and independently distributed w.r.t. its view.

Figure 9. The games ${\rm G}_{\rm idealEager}$ and ${\rm G}_{\rm idealLazy}$

We modify game ${\rm G}_{\rm realRO}$ in order to prepare for the transition from eagerly to lazily sampled chaining values: the body of game ${\rm G}_{\rm idealEager}$ (see Figure 9) contains a loop that re-samples all chaining values that are unknown to the adversary, i.e., the values for which the second component in map ${\mbi I}$ is set to true. Furthermore, game ${\rm G}_{\rm idealEager}$ drops the failure events ${\bf bad}_{\{1,2,3\}}$, but introduces a new failure event ${\bf bad}_{4}$. We show that if ${\bf bad}_{3}$ is triggered in game ${\rm G}_{\rm realRO}$, then in ${\rm G}_{\rm idealEager}$ either ${\bf bad}_{4}$ is set to true or there exists an $i$ such that ${\mbi I}[i]=(v, {\rm true})$ and $v \ \in \ {\mbi Y}$. We get $$\eqalignno{ {\rm Pr}[{\rm G}_{\rm realRO}:b] &= {\rm Pr}[{\rm G}_{\rm idealEager}:b] \cr {\rm Pr}[{\rm G}_{\rm realRO} :{\bf bad}_{3}] &\leq {\rm Pr}[{\rm G}_{\rm idealEager} : {\bf bad}_{4}\vee {\rm I}_{\exists}]}$$ where ${\rm I}_{\exists}=\exists i.\ 0\leq i\leq {\bf q}^{\prime}_{f}\wedge {\rm snd} ({\mbi I}[i])\ \wedge \ {\rm fst} ({\mbi I}[i]) \ \in \ {\mbi Y}$.

In game ${\rm G}_{\rm idealLazy}$ (see Figure 9), the loop we introduced in the last game is swapped with the call to the distinguisher and oracle $f_{q}$ samples the chaining values lazily (the branch found re-samples the value of $z$). In order to prove the equivalence with the previous game, we need to show that the loop that resamples the values unknown to the adversary swaps with calls to oracles $F_{q}$ and $f_{q}$ in games ${\rm G}_{\rm idealEager}$ and ${\rm G}_{\rm idealLazy}$. We obtain $$\eqalignno{ {\rm Pr}[{\rm G}_{\rm idealEager}:b] &= {\rm Pr}[{\rm G}_{\rm idealLazy}:b] \cr {\rm Pr}[{\rm G}_{\rm idealEager} : {\bf bad}_{4}\vee {\rm I}_{\exists}] &= {\rm Pr}[{\rm G}_{\rm idealLazy} : {\bf bad}_{4}\vee {\rm I}_{\exists}]}$$ It is easy to see that games ${\rm G}_{\rm idealLazy}$ and ${\rm G}_{\rm ideal}$ are equivalent w.r.t. $b$; the global variable ${\bf q}_{f}$ and the maps ${\mbi R}$ and ${\mbi T}^{\prime}$ are equivalent in both games. The other variables in game ${\rm G}_{\rm idealLazy}$ and its loops do not influence the behavior of its oracles. We show that $${\rm Pr}[{\rm G}_{\rm idealLazy}:b]={\rm Pr}[{\rm G}_{\rm ideal}:b].$$
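The principle behind the eager-to-lazy transition can be checked exhaustively on a toy example. The sketch below is not a model of the games in Figure 9: the three-element domain, the adaptive `distinguisher`, and the function names are assumptions chosen only to show that sampling a random function up front or on first use yields the same distribution of transcripts.

```python
from collections import Counter
from itertools import product

D = 3  # size of the toy domain and range (hypothetical)

def distinguisher(oracle):
    """Adaptive queries: each query depends on the previous answer."""
    a = oracle(0)
    b = oracle(a)
    return (a, b, oracle(b))

def run_eager(coins):
    table = list(coins)               # every value is sampled before the run
    return distinguisher(lambda x: table[x])

def run_lazy(coins):
    stream = iter(coins)              # a value is sampled only on first use
    table = {}
    def oracle(x):
        if x not in table:
            table[x] = next(stream)
        return table[x]
    return distinguisher(oracle)

# Enumerate every possible outcome of the random choices and compare the
# resulting distributions over transcripts.
all_coins = list(product(range(D), repeat=D))
eager = Counter(run_eager(c) for c in all_coins)
lazy = Counter(run_lazy(c) for c in all_coins)
print(eager == lazy)
```

The two counters coincide because a value that is never queried has no influence on the transcript, whichever moment it is sampled at.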

We still have to bound the probability of ${\bf bad}_{4} \ \vee \ {\rm I}_{\exists}$ in game ${\rm G}_{\rm idealLazy}$. To do this, we simply modify the while loop in the code of the game by replacing the instruction $z\displaystyle\mathop{\leftarrow}^{\$}\{0,1\}^{n}$ with $$z\displaystyle\mathop{\leftarrow}^{\$}\{0,1\}^{n};\ {\bf bad}_{4} \leftarrow {\bf bad}_{4}\vee z \in {\mbi Y}$$ This leads to a game ${\rm G}_{{\rm idealLazy}^{\prime}}$, for which we show $${\rm Pr}[{\rm G}_{\rm idealLazy} : {\bf bad}_{4}\vee {\rm I}_{\exists}]\leq {\rm Pr}[{\rm G}_{{\rm idealLazy}^{\prime}} : {\bf bad}_{4}]$$ We finally use the same technique as for ${\bf bad}_{1}$ to bound the probability of ${\bf bad}_{4}$ in game ${\rm G}_{{\rm idealLazy}^{\prime}}$, and obtain $${\rm Pr}[{\rm G}_{{\rm idealLazy}^{\prime}}:{\bf bad}_{4}]\leq{q^{2}\over 2^{n}}$$ Putting the (in-)equalities proved above together, each of ${\bf bad}_{1}$, ${\bf bad}_{2}$ and ${\bf bad}_{3}$ contributes at most $q^{2}/2^{n}$ by a union bound, which yields (4) and completes the proof of Theorem 7.

SECTION VI

## SECURITY PROOFS OF GENERALIZED MERKLE-DAMGÅRD

To avoid inheriting structural weaknesses in the original Merkle-Damgård construction, existing hash functions employ instead slight variants of it. One well-known variant is the wide-pipe design, which uses an internal state larger than the final output [22], [28]. Many variants are subsumed by the following Generalized Merkle-Damgård construction.

#### Definition 9

(Generalized Merkle-Damgård). Let ${\rm IV} \in \{0,1\}^{n}$ be a public initialization vector and $f, g$ be two compression functions of type $$f, g : \{0,1\}^{k}\times\{0,1\}^{n}\rightarrow\{0,1\}^{n}$$ Consider a function ${\rm pad} : \ \{0,1\}^{\ast} \ \rightarrow \ (\{0,1\}^{k})^{\ast}\times\{0,1\}^{k}$ that converts an arbitrary length message into a non-empty list of blocks of length $k$, singling out the last block. The hash function GMD is defined as follows: $$\eqalign{ & {\rm GMD} \qquad: \quad \{0,1\}^{\ast}\rightarrow\{0,1\}^{\ell}\cr & {\rm GMD}(m)\ \mathop{=}^{def} \quad {\bf let}\ (x, y)=\ {\rm pad} (m)\ {\bf in}\ [g(y, f^{\ast}(x, {\rm IV}))]^{\ell}}$$ where $f^{\ast}$ is defined as in Def. 3 and $[x]^{\ell}$ chops off the $n-\ell$ least significant bits from $x$, i.e. discards all but the leading $\ell$ bits.
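The shape of this definition can be sketched in a few lines. Everything concrete below is an assumption made for illustration: SHA-256 with a one-byte domain separator stands in for the abstract compression functions $f$ and $g$, the padding is a toy rule, and the block and state sizes are arbitrary; none of this corresponds to an actual SHA-3 candidate.

```python
import hashlib

K = 64                 # block length in bytes (hypothetical)
N = 256                # chaining-value length in bits (hypothetical)
IV = bytes(N // 8)     # all-zero initialization vector (hypothetical)

def f(x: bytes, y: bytes) -> bytes:
    """Stand-in for the compression function f (not an ideal function)."""
    return hashlib.sha256(b"f" + x + y).digest()

def g(x: bytes, y: bytes) -> bytes:
    """Stand-in for the final compression function g."""
    return hashlib.sha256(b"g" + x + y).digest()

def pad(m: bytes):
    """Toy padding: 0x80, then zeros up to a block boundary.

    Returns (all blocks but the last, last block)."""
    data = m + b"\x80"
    data += b"\x00" * (-len(data) % K)
    blocks = [data[i:i + K] for i in range(0, len(data), K)]
    return blocks[:-1], blocks[-1]

def f_star(blocks, y: bytes) -> bytes:
    """Iterate f over the blocks, starting from chaining value y."""
    for x in blocks:
        y = f(x, y)
    return y

def chop(z: bytes, ell: int) -> int:
    """Keep the ell leading bits of the N-bit string z, as an integer."""
    return int.from_bytes(z, "big") >> (N - ell)

def gmd(m: bytes, ell: int) -> int:
    xs, last = pad(m)
    return chop(g(last, f_star(xs, IV)), ell)

digest = gmd(b"abc", ell=128)
```

Taking `g = f` and `ell = N` in this sketch removes both the final transformation and the chopping, leaving a plain iteration of `f` over all blocks of the padded message.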

The NIST SHA-3 competition started in November 2007 with the objective of selecting new cryptographic hash functions to augment the set specified by the U.S. Federal Information Processing Standard (FIPS) 180-3, which includes the SHA-1 and SHA-2 algorithms. After receiving 64 entries, NIST selected 51 candidates for the first round, further narrowed down the list to just 14 candidates for the second round, and announced 5 finalists in December 2010: BLAKE [6], Grøstl [26], JH [38], Keccak [14], and Skein [25]. A public comment period started after this announcement, and the winner is expected to be selected before the end of 2012.

The security of all SHA-3 finalists, and of many second round candidates, has been thoroughly scrutinized. Two survey articles summarize known results [3], [4]. While the algorithmic descriptions of the finalists and their exact security bounds fit in one page (see [4]), the corresponding security proofs are technically involved and need to be cautiously adapted to account for the specificities of each function. As a consequence, it is difficult to assess the validity of security claims for individual candidates and machine checking their proofs is an appealing perspective. In the remainder of this section we discuss the applicability of the proofs presented in Sections IV and V to SHA-3 finalists.

The five SHA-3 finalists are based on the iterated hash function design that underlies the Merkle-Damgård construction, but incorporate some variations such as round-dependent tweaks, counters, final transformations, and chopping. We observe that, in a more or less contrived way, all the finalists can be considered as variants of the Generalized Merkle-Damgård construction (Definition 9). The compression functions of the finalists are either block-cipher based (BLAKE, Skein) or permutation-based (Grøstl, JH, Keccak). Moreover, all finalists use suffix-free padding rules, while the padding rules of BLAKE and Skein are additionally prefix-free [4].

Our formalization models compression functions as functions of two arguments: a message block and a chaining value. This represents a deviation with respect to the compression functions of BLAKE and Skein. The compression function of BLAKE additionally takes a counter and a random salt value, whereas the compression function of Skein builds on a tweakable block cipher and takes as additional input a round-specific tweak. The additional arguments of the compression functions of BLAKE and Skein could be formalized as an integral part of the padding rule; the padding function can compute the appropriate round-specific values and append them to the message blocks. This alternative description would have the advantage of matching the model that we use in our results about the MD hash function. However, all finalists except BLAKE use chopping or a final transformation, which are formalized neither in our proof of collision resistance nor in our proof of indifferentiability. This rules out a direct application of our results, with the exception of BLAKE, for which Theorem 6 does apply. We leave it for future work to formalize this instantiation in EasyCrypt.
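The idea of folding an extra compression-function argument into the padding rule can be sketched as follows. This is a purely hypothetical encoding and not BLAKE's or Skein's actual format: the sizes `B` and `C`, the padding rule, and the choice of a running bit counter as the round-specific value are assumptions made for illustration.

```python
B = 56   # message bytes carried by each block (hypothetical)
C = 8    # bytes reserved for the round-specific value (hypothetical)

def pad_with_counter(m: bytes) -> list:
    """Pad m and append to each block the number of message bits consumed so far.

    Every returned block has length B + C, so a two-argument compression
    function f(block, chaining_value) receives the counter inside the block."""
    data = m + b"\x80"
    data += b"\x00" * (-len(data) % B)
    blocks = []
    for i in range(0, len(data), B):
        consumed_bits = 8 * min(i + B, len(m))
        blocks.append(data[i:i + B] + consumed_bits.to_bytes(C, "big"))
    return blocks

blocks = pad_with_counter(b"a" * 60)
```

For the 60-byte message above this yields two blocks of 64 bytes whose trailing counters are 448 and 480.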

NIST requirements for the SHA-3 competition include collision resistance, preimage resistance and second preimage resistance. All the candidates selected as finalists satisfy these properties and (in most cases) even achieve optimal bounds for them when the underlying block-ciphers or permutations used to build their compression functions are assumed to be ideal [4]. Although the original NIST requirements did not include the property of indifferentiability from a random oracle, this notion has also been considered in the literature and is achieved by all five finalists [1], [2], [12], [15], [16], [20]. These indifferentiability proofs hold in an idealized model for some of the building blocks of the hash function: the ideal-cipher model for block-cipher based hash functions, or the ideal-permutation model for permutation based hash functions. Indifferentiability seems to be an excellent target for security proofs because it ensures that the high-level design of the hash function has no structural weaknesses, but also because it implies bounds for all of the classical properties enumerated above. Unfortunately, the assumption that some underlying primitive is ideal is at best unrealistic and at worst plainly wrong. Proofs of indifferentiability should be taken only as an indication for the security and as a palliative for the lack of security proofs in the standard model.

Compared to our result of Theorem 7, which assumes that the compression function is ideal, the indifferentiability of all the finalists has been proved in an ideal model for lower building blocks. We point out that assuming ideality of a lower building block is weaker than assuming ideality of the entire compression function and thus these results are stronger. Indeed, assuming ideality of the compression function seems to be inappropriate for all the finalists:

• The compression functions of JH and Keccak are trivially non-random, as collisions and preimages can be found in only one query to the underlying permutation [4], [17];
• Finding fixed-points for the compression function of Grøstl is trivial [26];
• The compression function of BLAKE has been recently shown to exhibit non-random behavior [1], [20];
• Non-randomness has been shown for reduced-round versions of Threefish, the underlying block-cipher of Skein [27].

The only two finalists that use a prefix-free padding rule, and for which our proof of indifferentiability can apply, are BLAKE and Skein. However, our proof of indifferentiability of prefix-free Merkle-Damgård relies on the assumption that the underlying compression function behaves like an ideal primitive. Thus, it cannot be applied to BLAKE, as this assumption has been invalidated. As for Skein, the assumption that its compression function is ideal is seriously weakened by the attacks on Threefish mentioned above.

Although Theorem 7 cannot be directly applied to any of the SHA-3 finalists, it constitutes a non-trivial result about the Merkle-Damgård construction and a good starting point for formalizing more complex proofs. Indeed, indifferentiability proofs based on weaker assumptions and general enough to apply to SHA-3 finalists are not significantly different from the proof we have formalized and use essentially the same techniques. We see no impediment to formalizing them in EasyCrypt.

SECTION VII

## CONCLUSION

Despite their widespread use, the formal verification of hash functions has received little attention. To the best of our knowledge, Toma and Borrione [35] were the first to use theorem provers to formally verify properties of SHA-1, but their focus is on functional properties rather than security properties. The first machine-checked proof of security for a hash design appears in [7], where the authors use the CertiCrypt framework to verify that the construction from Brier et al. [18] yields a hash function indifferentiable from a random oracle into ordinary elliptic curves. More recently, Daubignard et al. [24] develop a method to permute dependencies between oracles in a game, and apply their method to prove indifferentiability of hash functions from random oracles. Their method is not implemented, although the underlying framework has been machine-checked [21].

The prevailing method for building hash functions is to iterate a compression function on a pre-processed input message. In this paper, we have considered the Merkle-Damgård construction, which pioneered this design, and proved that the resulting hash function preserves collision resistance and is indifferentiable from a random oracle. Our results demonstrate that state-of-the-art verification tools can be used for proving the security of hash designs, and not only for cryptanalysis [32]. We will further this line of research by exploring the formalization of more general security proofs that apply to a wider range of hash functions, including finalists of the SHA-3 competition.

### ACKNOWLEDGEMENTS

The authors want to thank Martín Abadi and the anonymous CSF reviewers for insightful feedback on the paper.
