


Cryptographic hash functions provide a basic data authentication mechanism and are routinely used as building blocks in other cryptographic constructions. For a given input $m$, a cryptographic hash function $H$ outputs a digest $H(m)$ of some small fixed length. For most tasks, it is required that finding distinct inputs with the same digest (a collision) be difficult. However, recent research has demonstrated that widely used hash functions, including SHA-1 and MD5, are vulnerable to collision attacks [29], [36], [37]. In response to these concerns, the U.S. National Institute of Standards and Technology (NIST) started in November 2007 a public competition to develop new cryptographic hash functions to augment a set of standard functions that includes the SHA-1 and SHA-2 algorithms. This competition, commonly known as the SHA-3 competition, motivated a growing interest in developing cryptographic hash functions and in rigorously scrutinizing their security.

Verified security [8], [10] is an emerging approach to security proofs of cryptographic systems. It adheres to the same principles as provable security, but revisits its realization from a formal verification perspective. When taking a verified security approach, proofs are mechanically verified and built with the aid of state-of-the-art verification tools, such as SMT solvers, automated theorem provers and interactive proof assistants. EasyCrypt [8] is an automated framework that aims to make verified security accessible to cryptographers with a limited background in formal methods; it has been successfully applied to verify exact security bounds of several digital signature and encryption schemes.

In this paper, we report on an extension of EasyCrypt and its application to build and verify exact security proofs of the Merkle-Damgård construction [23], [31], which underlies the design of many cryptographic hash functions. In its simplest formulation, Merkle-Damgård iterates a compression function $f : \{0,1\}^{k} \times \{0,1\}^{n} \rightarrow\{0,1 \}^{n}$ over the blocks of an input message padded to a block boundary. For a fixed public initialization vector IV, the digest of a padded message with blocks $x_{1} \ \Vert \cdots \Vert \ x_{\ell}$ is computed as $$ f(x_{\ell}, f(x_{\ell-1}, \ldots f(x_{1}, {\rm IV})\ldots))$$

One way of arguing that iterated constructions like Merkle-Damgård are secure is to show that they preserve security properties of the underlying compression function. The seminal works of Merkle [31] and Damgård [23] show that if messages are padded in some specific way, finding two colliding messages for the above iterated construction is at least as hard as finding two colliding inputs for the compression function $f$; in other words, the construction preserves the collision resistance of the compression function. We present a proof of a generalization of this result in EasyCrypt. Our proof applies when the padding function is suffix-free, i.e. the padding of a message $m$ is not a suffix of the padding of any other message $m^{\prime}$.

An alternative method for proving the security of a hash function is to show that it behaves as a random oracle when the compression function, or some other lower-level building block, is assumed to be ideal. The indifferentiability framework of Maurer et al. [30] provides a rigorous simulation-based definition that captures this intuition and implies a strong composability result. Glossing over technical subtleties [33], a hash function $H$ indifferentiable from a random oracle can be plugged into a cryptosystem proven secure in the random oracle model for $H$ without compromising the security of the cryptosystem. We present a proof in EasyCrypt of the indifferentiability of the Merkle-Damgård construction from a random oracle. Our proof, which follows the proof of Coron et al. [22], applies when the padding function is prefix-free, i.e. the padding of a message $m$ is not a prefix of the padding of any other message $m^{\prime}$.

Organization of the Paper

Section II overviews the foundations and verification mechanisms implemented in our extension to EasyCrypt; Section III describes the Merkle-Damgård construction and its security properties; Section IV describes a machine-checked proof that Merkle-Damgård preserves collision resistance when used with a suffix-free padding, while Section V describes a machine-checked proof of its indifferentiability from a random oracle when the padding is prefix-free; Section VI discusses the applicability of our results to generalizations of the Merkle-Damgård construction and to the finalists of the NIST SHA-3 competition. We conclude in Section VII.



Building a cryptographic proof in EasyCrypt is a process that can be decomposed into the following steps:

  • Defining a formal context, including types, constants and operators, and giving it meaning by declaring axioms and stating derived lemmas.
  • Defining a number of games, each of them composed of a collection of procedures (written in the probabilistic imperative language described below) and adversaries declared as abstract procedures with access to oracles.
  • Proving logical judgments that establish equivalences between games. This may be done fully automatically, with the help of hints from the user in the form of relational invariants, or interactively using basic tactics and automated strategies.
  • Deriving inequalities between probabilities of events in games, either by using previously proven logical judgments or by direct computation.

In the remainder of this section, we briefly overview some key aspects of the process of building an EasyCrypt proof. Note that the work reported in this article benefited from several extensions of the tool with respect to [8]; these extensions include:

  1. Support for reasoning about programs with loops. Loops were used to represent iteration in the Merkle-Damgård construction.
  2. Mechanization of the Failure Event Lemma of [11], implemented in EasyCrypt as an extension to the mechanism that directly computes probability bounds. This was used to bound the success probability of the distinguisher in the proof of indifferentiability presented in Section V.
  3. Proof engineering mechanisms to manage the size of proof obligations and the theories that external solvers use. These mechanisms were essential for the successful verification of the proofs presented in this paper.

A. Input Language

Probabilistic experiments are defined as programs in pWHILE, a strongly-typed imperative probabilistic programming language. The grammar of pWHILE commands is defined as follows: $$\begin{array}{rcll} {\cal C} & ::= & {\rm skip} & {\rm nop} \\ & \vert & {\cal V} \leftarrow {\cal E} & {\rm deterministic\ assignment} \\ & \vert & {\cal V}\ \mathop{\leftarrow}^{\$}\ {\cal D}{\cal E} & {\rm probabilistic\ assignment} \\ & \vert & {\rm if}\ {\cal E}\ {\rm then}\ {\cal C}\ {\rm else}\ {\cal C} & {\rm conditional} \\ & \vert & {\rm while}\ {\cal E}\ {\rm do}\ {\cal C} & {\rm loop} \\ & \vert & {\cal V} \leftarrow {\cal P}({\cal E}, \ldots, {\cal E}) & {\rm procedure\ call} \\ & \vert & {\cal C};\ {\cal C} & {\rm sequence} \end{array}$$ The only non-standard feature of the language is the probabilistic assignment; an assignment $x \ \mathop{\leftarrow}^{\$} \ d$ evaluates the expression $d$ in the current state to a distribution $\mu$ on values, samples a value according to $\mu$ and assigns it to variable $x$. The key to the flexibility of EasyCrypt is that the base language of expressions and distribution expressions can be extended by the user to suit the needs of the verification task. The rich base language includes expressions over Booleans, integers, fixed-length bitstrings, lists, finite maps, and option, product and sum types. User-defined operators can be axiomatized or defined in terms of other operators. In the following, we let $\{0,1\}^{\ell}$ denote the uniform distribution on bitstrings of length $\ell$.

A program (equivalently, a game) in EasyCrypt is represented as a set of global variables together with a collection of procedures. Some of these procedures are concrete and given a definition as a command $c \in {\cal C}$, while some others may be abstract and left undefined. Quantification over adversaries in cryptographic proofs is achieved by representing them as abstract procedures parametrized by a set of oracles; these oracles must be instantiated as other procedures in the program.

Commands operate on program memories, which map local and global variables to values; we let ${\cal M}$ denote the set of memories. The semantics of a command $c \in {\cal C}$ is a function $[\![c]\!] : {\cal M} \rightarrow {\cal D}({\cal M})$ from program memories to sub-distributions on program memories. Note that programs that do not terminate with probability 1 generate sub-distributions with total probability less than 1. We refer the reader to [9] for a detailed description of the semantics of pWHILE as it has been formalized in the Coq proof assistant. In what follows, we denote by ${\rm Pr}[c, m:A]$ the probability of event $A$ w.r.t. the distribution $[\![c]\!]\, m$ and often omit the initial memory $m$ when it is not relevant.

Although EasyCrypt is not tied to any particular cryptographic model, it provides good support to reason about proofs developed in the random oracle model. A random oracle ${\cal O} : X \rightarrow Y$ is modelled in EasyCrypt as a stateful procedure that maps values in $X$ into uniformly and independently distributed values in $Y$. The state of a random oracle can be represented as a global finite map ${\bf L}$ that is initially empty. Queries are answered consistently so that identical queries are given the same answer:

Algorithm 1
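A lazily sampled random oracle of this kind can be sketched in Python as follows. This is only an illustration and not the pWHILE code of Algorithm 1; the class name `RandomOracle`, the method `query`, and the choice of $Y = \{0,1\}^{n}$ encoded as an integer are assumptions made for the example.

```python
import secrets

class RandomOracle:
    """Illustrative lazily sampled random oracle O : X -> {0,1}^n."""

    def __init__(self, out_bits):
        self.out_bits = out_bits
        self.L = {}  # finite map from past queries to answers, initially empty

    def query(self, x):
        # A fresh query gets a uniformly sampled answer, which is recorded.
        if x not in self.L:
            self.L[x] = secrets.randbits(self.out_bits)
        # A repeated query is answered consistently from the map.
        return self.L[x]
```

Keeping the map as the only state is what makes lazy sampling convenient in proofs: the distribution of an answer is known locally, at the point where the query is first made.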

B. Probabilistic Relational Hoare Logic

The foundation of EasyCrypt is a probabilistic Relational Hoare Logic (pRHL), whose judgments are quadruples of the form: $$ \vdash c_{1}\sim c_{2}:\Psi \Rightarrow \Phi $$ where $c_{1}, c_{2}$ are programs and $\Psi, \Phi$ are first-order relational formulae. Relational formulae are defined by the grammar: $$ \Psi, \Phi ::= e \ \vert \ \neg\Phi \ \vert \ \Psi\wedge\Phi \ \vert \ \Psi\vee\Phi \ \vert \ \Psi\Rightarrow\Phi \ \vert \ \forall x.\ \Phi \ \vert \ \exists x.\ \Phi $$ where $e$ stands for a Boolean expression over logical variables and program variables tagged with either $\langle 1\rangle$ or $\langle 2 \rangle$ to denote their interpretation in the left or right-hand side program; the only restriction is that logical variables must not occur free. The special keyword res denotes the return value of a procedure and can be used in the place of a program variable. We write $e\langle i\rangle$ for the expression $e$ in which all program variables are tagged with $\langle i \rangle$. A relational formula is interpreted as a relation on program memories. For example, the formula $x\langle 1\rangle+1\leq y\langle2\rangle$ is interpreted as the relation $$ R=\{(m_{1}, m_{2}) \ \vert \ m_{1}(x)+1\leq m_{2}(y)\} $$

The validity of a pRHL judgment is defined in terms of a lifting operator ${\cal L}:{\cal P}(A\times B)\rightarrow {\cal P}({\cal D}(A)\times {\cal D}(B))$. Concretely, $$ \eqalign{ \models c_{1} & \sim c_{2}:\Psi \Rightarrow \Phi \ \mathop{=}^{\rm def} \cr & \forall m_{1}, m_{2}.\ m_{1} \mathrel{\Psi} m_{2}\Rightarrow([\![c_{1}]\!]m_{1}) \mathrel{{\cal L}(\Phi)} ([\![c_{2}]\!] m_{2})} $$ Formally, let $\mu_{1}$ be a probability distribution on a set $A$ and $\mu_{2}$ a probability distribution on a set $B$. We define the lifting $\mu_{1} \mathrel{{\cal L}(R)} \mu_{2}$ of a relation $R \subseteq A \times B$ to $\mu_{1}$ and $\mu_{2}$ by the clause: $$ \exists\mu:{\cal D}(A \times B).\ \pi_{1}(\mu)=\mu_{1}\wedge \pi_{2}(\mu)=\mu_{2}\wedge {\rm supp} (\mu) \subseteq R $$ where $\pi_{1}(\mu)$ (resp. $\pi_{2}(\mu)$) denotes the projection of $\mu$ on its first (resp. second) component and ${\rm supp}(\mu)$ is the support of $\mu$ as a sub-probability measure; if $\mu$ is discrete, this is just the set of pairs with positive probability.
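For finite discrete distributions, the three conditions in the lifting clause can be checked directly on a candidate witness $\mu$. The sketch below is an illustration only; the function names `marginal` and `witnesses_lifting` and the dictionary encoding of distributions are assumptions, and the example reuses the relation $x\langle 1\rangle+1\leq y\langle2\rangle$ on plain integers rather than on memories.

```python
from fractions import Fraction

def marginal(mu, index):
    """Project a joint distribution {(a, b): p} on component 0 or 1."""
    out = {}
    for pair, p in mu.items():
        out[pair[index]] = out.get(pair[index], Fraction(0)) + p
    return out

def witnesses_lifting(mu, mu1, mu2, R):
    """Check pi_1(mu) = mu1, pi_2(mu) = mu2 and supp(mu) included in R.

    Assumes the dictionaries list only outcomes with positive probability.
    """
    support_in_R = all(R(a, b) for (a, b), p in mu.items() if p > 0)
    return support_in_R and marginal(mu, 0) == mu1 and marginal(mu, 1) == mu2

half = Fraction(1, 2)
mu1 = {0: half, 1: half}            # uniform on {0, 1}
mu2 = {1: half, 2: half}            # uniform on {1, 2}
R = lambda a, b: a + 1 <= b         # the relation a + 1 <= b

# A coupling that pairs 0 with 1 and 1 with 2 stays inside R.
coupling = {(0, 1): half, (1, 2): half}
# The independent product has the right marginals but puts mass on (1, 1).
product = {(a, b): p * q for a, p in mu1.items() for b, q in mu2.items()}
```

Here `coupling` is a valid witness while `product` is not, which shows that the lifting only asks for the existence of some joint distribution; a failed candidate does not by itself refute the lifting.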

Figure 1 shows some selected rules that can be used to derive valid pRHL judgments. There are two kinds of rules: two-sided rules, which require that the related programs have the same syntactic form, and one-sided rules, which do not impose this requirement. One-sided rules are symmetric in nature and admit a left and a right variant. We briefly comment on some rules. The two-sided rule [Rnd] for random assignments requires that the distributions from which values are sampled be uniform on some set $X$; to apply the rule one must exhibit a function $f:X\rightarrow X$ that may depend on the state and is 1–1 if the precondition holds. The one-sided rule $[{\rm Rnd}\langle 1\rangle]$ for random assignments simply requires that the post-condition is established for all possible outcomes; in effect, this rule treats random assignment as a non-deterministic assignment.

Similarly to Hoare logic, the rules for while loops require exhibiting an appropriate relational invariant $\Phi$. The two-sided rule [While] applies when the loops execute in lockstep and thus requires proving that the guards are equivalent. The one-sided rule $[{\rm While}\langle 1\rangle]$ further requires exhibiting a decreasing variant $v$ and a lower bound $m$. The premises ensure that the loop is absolutely terminating, which is crucial for the soundness of the rule.

The relational Hoare logic also allows capturing the well-known cryptographic argument “$x$ is uniformly distributed and independent of the adversary's view”, which is certainly one of the most difficult to formalize. We formalize this argument in EasyCrypt by proving that re-sampling $x$ preserves the semantics of the program. Suppose we want to prove that in a program $c$, a variable $x$ used in an oracle ${\cal O}$ is uniformly distributed and independent of the view of an adversary ${\cal A}^{\cal O}$. Let ${\cal O}^{\prime}$ be the same as ${\cal O}$ except that it re-samples $x$ when needed. We identify a condition ${\rm used}$ that holds whenever $\cal A$ has obtained some information about $x$ (and thus, re-sampling would not preserve the semantics). We then prove that the conditional statement $c^{\prime}\ \mathop{=}^{\rm def}\ {\rm if}\ \neg {\rm used}\ {\rm then}\ x \ \mathop{\leftarrow}^{\$} \ X$ can swap with calls to ${\cal O}$ and ${\cal O}^{\prime}$, i.e. $$ \vdash c^{\prime};y \leftarrow {\cal O}(\vec{e})\sim y \leftarrow {\cal O}^{\prime}(\vec{e});c^{\prime}: \Phi \Rightarrow \Phi $$ where $\Phi$ implies equality over all global variables. From this, we can conclude that $c^{\prime}$ can also swap with calls to ${\cal A}^{\cal O}$ and ${\cal A}^{{\cal O}^{\prime}}$, and hence that the semantics of the program $c$ is preserved when ${\cal O}$ is replaced by ${\cal O}^{\prime}$. The advantage of this kind of reasoning is that it is generally much easier to reason about a game where $x$ is sampled lazily, since its distribution is locally known.

We conclude with some observations on the mechanization of reasoning in pRHL. We implement in EasyCrypt several variants of two-sided and one-sided rules of pRHL in the form of tactics that can be applied in a goal-oriented fashion to prove the validity of judgments. For instance, instead of implementing rule $[{\rm Rnd}\langle1\rangle]$, we combine it with the [Seq] rule to obtain the following more easily applicable rule: $$ {\vdash c_{1}\sim c_{2} : \Psi\Longrightarrow\forall v\in{\rm supp}(d).\ \Phi\{v/x\langle1\rangle\} \over \vdash c_{1};\ x \mathop{\leftarrow}^{\$} d\sim c_{2}:\Psi\Longrightarrow\Phi} $$

Figure 1. Selected pRHL rules

The application of a tactic may generate additional verification subgoals and logical side conditions, which are checked using SMT solvers, automated theorem provers and, as a last resort, interactive proof assistants. Depending on their nature, application of the tactics can be fully automated or require user input. For instance, applying the tactics that mechanize the rules for while loops requires the user to provide an adequate invariant. In the case of the two-sided rule, a new subgoal is generated to prove the correctness of the user-provided invariant, whereas the equivalence of the loop guards is checked automatically as a logical side-condition.

In addition to tactics that mechanize basic rules of pRHL, EasyCrypt implements automated strategies that combine the application of a weakest precondition transformer wp with heuristics to apply basic tactics. The wp transformer operates on deterministic loop-free programs. These strategies can often be used to deal automatically with large fragments of proofs, letting the user focus on the parts that require ingenuity.

C. Reasoning about Probabilities

Since cryptographic results are stated as inequalities on probabilities rather than pRHL judgments, it is important to derive probability claims from pRHL judgments. This can be done mechanically by applying rules in the style of $${m_{1} \mathrel{\Psi} m_{2} \qquad \vdash c_{1} \sim c_{2}:\ \Psi \Longrightarrow \Phi \qquad \Phi \Rightarrow (A\langle 1 \rangle\Rightarrow B\langle 2\rangle) \over {\rm Pr}[c_{1},m_{1}:A] \leq {\rm Pr}[c_{2},m_{2}:B]}$$

Game-based proofs often argue that two programs $c_{1}$ and $c_{2}$ behave identically unless a failure event $F$ is triggered. This is used to conclude that the difference in probability of any event between the two programs is bounded by the probability of $F$ in one of them. Although a syntactic characterization of this lemma is often used (in which failure is represented by a Boolean flag), it can be conveniently expressed and implemented in EasyCrypt in a more general form using pRHL.

Lemma 1

(Fundamental Lemma). Let $c_{1}$ and $c_{2}$ be two terminating commands and $A, B, F$ events such that $$\vdash c_{1}\sim c_{2} : \Psi\Longrightarrow (F\langle1\rangle\Leftrightarrow F\langle2\rangle)\wedge(\neg F\langle1\rangle\Rightarrow (A\langle1\rangle\Leftrightarrow B\langle2\rangle))$$ Then, if the initial memories of both games satisfy $\Psi$, $$\vert {\rm Pr}[c_{1}:A]-{\rm Pr}[c_{2}:B]\vert \leq {\rm Pr}[c_{1}:F]={\rm Pr}[c_{2}:F] $$

In most applications of the above lemma, the failure event $F$ can only be triggered in oracle queries made by an adversary. When the adversary can only make a known bounded number of queries, the following lemma, which we implemented in EasyCrypt, provides a means to bound the probability of failure. (We describe its hypotheses informally, but note that most of them can be captured by pRHL judgments.)

Lemma 2

(Failure event lemma). Consider a program $c_{1};c_{2}$, an integer expression $i$, an event $F$, and $u \in {\Bbb R}$. Assume the following:

  • Free variables in $F$ and $i$ are only modified by $c_{1}$ or oracles in some set $O$;
  • After executing $c_{1}$, $F$ does not hold and $0 \leq i$;
  • Oracles ${\cal O} \in O$ do not decrease $i$ and strictly increase $i$ when $F$ is triggered;
  • For every oracle ${\cal O}$ in $O$, $\neg F\Rightarrow {\rm Pr}[{\cal O}:F]\leq u$.

Then, ${\rm Pr}[c_{1};c_{2}:F\wedge i\leq q]\leq q\cdot u$.

Finally, EasyCrypt implements a simple mechanism to directly compute bounds for the probability of an event in a program. This mechanism can establish, for instance, that the probability that a value uniformly chosen from a set $X$ equals an expression that does not depend on it is exactly $1/\vert X \vert$, or that the probability that the same uniformly sampled value belongs to a list of $n$ values that does not depend on it is at most $n/\vert X\vert$.
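Both bounds can be confirmed by exhaustive enumeration on a small set. The sketch below is a sanity check in Python, not a description of EasyCrypt's mechanism; the helper names `pr_equals` and `pr_member` and the set $X = \{0, \ldots, 15\}$ are assumptions, and the fixed value is taken to lie in $X$.

```python
from fractions import Fraction

def pr_equals(X, e):
    """Exact probability that a uniform x in X equals a fixed value e."""
    return Fraction(sum(1 for x in X if x == e), len(X))

def pr_member(X, values):
    """Exact probability that a uniform x in X occurs in a fixed list."""
    return Fraction(sum(1 for x in X if x in values), len(X))

X = list(range(16))
p_eq = pr_equals(X, 5)            # the value 5 does not depend on x
p_in = pr_member(X, [3, 3, 7])    # a list of n = 3 values with a repeat
```

The second bound is an inequality rather than an equality because the list may contain repeated values: with the list above, the probability is $2/16$ while the bound is $3/16$.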



Merkle-Damgård is a method for building a variable input-length (VIL) hash function from a fixed input-length (FIL) compression function. In its simplest form, the digest of a message is computed by first padding it to a block boundary and then iterating a compression function $f$ over the resulting blocks starting from an initial chaining value IV. A compression function $f$ maps a pair of bitstrings of length $k$ and $n$ (equivalently, a bitstring of length $k+n$) to a bitstring of length $n$: $$f : \{0,1\}^{k}\times\{0,1\}^{n}\rightarrow\{0,1\}^{n}$$ A padding function pad converts an arbitrary-length message into a list of bitstrings of block size $k$: $${\rm pad} : \{0,1\}^{\ast}\rightarrow(\{0,1\}^{k})^{\ast}$$

Definition 3

(Merkle-Damgård). Let $f$ be a compression function and pad a padding function as above, and let ${\rm IV} \in \{0,1\}^{n}$ be a public value, known as the initialization vector. The hash function MD is defined as follows: $$ \eqalignno{ & {\rm MD} : \{0,1\}^{\ast}\rightarrow\{0,1\}^{n} \cr & {\rm MD} (m) \mathop{=}^{\rm def}f^{\ast} ({\rm pad} (m), {\rm IV})}$$ where $f^{\ast} : (\{0,1\}^{k})^{\ast}\times\{0,1\}^{n} \rightarrow\{0,1\}^{n}$ is recursively defined by the equations $$f^{\ast} ({\rm nil}, y) \mathop{=}^{\rm def}y \qquad f^{\ast}(x::xs, y)\mathop{=}^{\rm def}f^{\ast}(xs, f(x, y))$$
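The definition translates almost literally into code. The following Python sketch is an illustration under stated assumptions: the compression function `f` is a toy stand-in built from SHA-256 (it is not a proposed design), the sizes `K_BYTES` and `N_BYTES` are arbitrary, and `pad` is one hypothetical padding that appends a marker byte, zero fill, and a final block holding the message length.

```python
import hashlib

K_BYTES = 8          # block size k in bytes (illustrative choice)
N_BYTES = 4          # chaining and digest size n in bytes (illustrative choice)
IV = bytes(N_BYTES)  # fixed public initialization vector

def f(x, y):
    """Toy compression function {0,1}^k x {0,1}^n -> {0,1}^n; not secure."""
    assert len(x) == K_BYTES and len(y) == N_BYTES
    return hashlib.sha256(x + y).digest()[:N_BYTES]

def f_star(xs, y):
    """f*(nil, y) = y and f*(x::xs, y) = f*(xs, f(x, y)), written as a loop."""
    for x in xs:
        y = f(x, y)
    return y

def pad(m):
    """Hypothetical padding: 0x80, zero fill, then one block holding len(m)."""
    data = m + b"\x80"
    data += bytes(-len(data) % K_BYTES)       # fill up to a block boundary
    data += len(m).to_bytes(K_BYTES, "big")   # final length block
    return [data[i:i + K_BYTES] for i in range(0, len(data), K_BYTES)]

def md(m):
    """MD(m) = f*(pad(m), IV)."""
    return f_star(pad(m), IV)
```

For example, `md(b"abc")` pads the message to two blocks and applies `f` twice, first to the message block with `IV` and then to the length block with the intermediate chaining value.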

The security properties of the compression function preserved by the Merkle-Damgård construction greatly depend on an adequate choice of padding to thwart certain types of attacks. In the remainder, we consider prefix- and suffix-free padding functions.

Definition 4

(Prefix- and suffix-free padding). A padding function pad is prefix-free (resp. suffix-free) iff for any distinct messages $m, m^{\prime}$, there is no $xs$ such that ${\rm pad} (m^{\prime})= {\rm pad}(m)\ \Vert \ xs$ (resp. ${\rm pad}(m^{\prime})= xs \ \Vert \ {\rm pad}(m)$).
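On a finite set of messages, both properties can be tested by brute force. The sketch below is illustrative only: the two paddings `pad_len_last` and `pad_len_first` are hypothetical, use one-byte blocks represented as integers, and assume messages shorter than 256 bytes. Note that $xs$ may be empty in the definition, so both properties also force pad to be injective.

```python
from itertools import permutations, product

def is_prefix(xs, ys):
    return ys[:len(xs)] == xs

def is_suffix(xs, ys):
    return len(xs) <= len(ys) and ys[len(ys) - len(xs):] == xs

def prefix_free_on(pad, messages):
    """No pad(m) is a prefix of pad(m') for distinct m, m' in messages."""
    return not any(is_prefix(pad(m), pad(m2))
                   for m, m2 in permutations(messages, 2))

def suffix_free_on(pad, messages):
    """No pad(m) is a suffix of pad(m') for distinct m, m' in messages."""
    return not any(is_suffix(pad(m), pad(m2))
                   for m, m2 in permutations(messages, 2))

def pad_len_last(m):
    """Hypothetical padding: message bytes followed by a length block."""
    return list(m) + [len(m)]

def pad_len_first(m):
    """Hypothetical padding: a length block followed by the message bytes."""
    return [len(m)] + list(m)

# All messages over the byte alphabet {0, 1} of length at most 2.
messages = [bytes(t) for n in range(3) for t in product([0, 1], repeat=n)]
```

On this message set, appending the length gives a padding that passes the suffix-free check but fails the prefix-free one, and prepending the length does the opposite. A passing check on a finite set is of course only evidence, not a proof of the property for all messages.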

Security properties of hash functions are stated as claims about the difficulty of an attacker in achieving certain goals. Collision resistance states that it is hard to find distinct $a, b$ such that $H(a)=H(b)$. Preimage resistance states that given a digest $h$, it is hard to find $a$ such that $H(a)=h$. Second preimage resistance states that given $a$, it is hard to find $b\neq a$ such that $H(a) = H(b)$. Finally, resistance to length-extension attacks states that it is hard to compute $H(a \ \Vert \ b)$ from $H(a)$. The precise formulation of these notions and their relationship is addressed in detail in [34].

An established method for proving the security of domain extenders, like MD above, is to show that they are property preserving: for instance, the seminal works of Merkle [31] and Damgård [23] show that if the compression function $f$ is collision resistant, then the hash function MD with some specific padding function is also collision resistant. Property preservation also applies for other notions; a representative panorama of property preservation for collision resistance, preimage and second preimage resistance appears in [5]. In Section IV we use EasyCrypt to reduce the collision resistance of suffix-free MD to the collision resistance of the underlying compression function.

An alternative method for proving the security of domain extenders is to show that they preserve ideal functionalities, i.e. that when applied to ideal functionalities they yield an ideal functionality. The notion of indifferentiability of Maurer et al. [30] provides an appropriate framework.

Definition 5

(Indifferentiability). A procedure ${\cal C}$ with oracle access to an ideal primitive ${\cal G}$ is $(t_{\cal S}, q, \epsilon)$-indifferentiable from ${\cal F}$ if there exists a simulator ${\cal S}$ with oracle access to ${\cal F}$ and executing within time $t_{\cal S}$, such that for any distinguisher ${\cal D}$ that makes at most $q$ oracle queries, the following inequality holds: $$\vert {\rm Pr}[b\leftarrow {\cal D}^{{\cal C},{\cal G}}():b]-{\rm Pr}[b\leftarrow {\cal D}^{{\cal F},{\cal S}}():b]\vert \leq \epsilon $$

Intuitively, the distinguisher is either given access to ${\cal C}^{\cal G}$ and ${\cal G}$, or it is given access to ${\cal F}$ and ${\cal S}^{\cal F}$ (see Figure 2). The probability that it succeeds in distinguishing the two scenarios must be small.

Figure 2. Indifferentiability of ${\cal C}$ from an ideal functionality ${\cal F}$

In the application considered in this paper, ${\cal C}$ represents the Merkle-Damgård construction, ${\cal G}$ represents the compression function and ${\cal F}$ represents an idealized hash function. Thus, the role of ${\cal S}$ is to simulate the behavior of the compression function, i.e. it should behave towards ${\cal F}$ like ${\cal G}$ behaves towards the Merkle-Damgård construction. In Section V, we use EasyCrypt to define a simulator ${\cal S}$ that proves indifferentiability of MD from a VIL random oracle when the compression function ${\cal G}$ is modeled as a FIL random oracle. Random oracles [13] are functions that map values in the input domain into uniformly and independently distributed values in the output domain; see Section II for a precise definition.

We conclude this section with two observations on the two proof methods. First, indifferentiability from random oracles provides weaker guarantees than initially anticipated (see [19] and [33] respectively for discussions on the random oracle model and on the notion of indifferentiability), but nevertheless remains a useful heuristic to gain confidence in the design of hash functions. Second, the two methods are complementary. On the one hand, indifferentiability from a VIL random oracle entails resistance against collision, preimage, second preimage, and length-extension attacks. On the other hand, property preservation is often established under weaker hypotheses and moreover, exact security bounds derived from indifferentiability proofs are sometimes looser than bounds delivered by direct proofs of property preservation.



We show that finding collisions for MD with a suffix-free padding is at least as hard as finding collisions for $f$. A collision for the compression function $f$ is a pair of inputs $xy_{1}, xy_{2}$ satisfying the predicate $${\rm coll} (xy_{1}, xy_{2})\mathop{=}^{\rm def}xy_{1}\neq xy_{2}\wedge f(xy_{1})=f(xy_{2}) $$

Theorem 6

Let MD be a Merkle-Damgård hash function with compression function $f$ and a suffix-free padding pad. For any algorithm ${\cal A}$ finding collisions for MD of length at most $p$, there exists an algorithm ${\cal B}$ that finds collisions for $f$ with the same probability and with an overhead of $O(p\cdot t_{f})$, where $t_{f}$ is a bound on the time needed for one evaluation of $f$.

Consider the experiment ${\rm CR}^{\rm MD}$ below, in which an adversary ${\cal A}$ performs a collision attack against MD:

Algorithm 2

We prove in EasyCrypt that the algorithm ${\cal B}$ shown in Fig. 3 finds collisions for $f$ in the experiment ${\rm CR}^{f}$ with at least the same probability as ${\cal A}$ finds collisions for MD in ${\rm CR}^{\rm MD}$, i.e. $${\rm Pr}\left[{\rm CR}^{{\rm MD}}:{\rm res}\right]\leq {\rm Pr}\left[{\rm CR}^{f}:{\rm res}\right] \eqno{\hbox{(1)}}$$ (Recall that res is a keyword that stands for the value returned by the main procedure of the games.) Algorithm ${\cal B}$ obtains from ${\cal A}$ a pair of messages $m_{1}, m_{2}$, pads them, and iterates the compression function over the first blocks of the longer padded message until the remaining suffix is the same length as the other padded message. It then performs the remaining iterations to compute ${\rm MD}(m_{1})$ and ${\rm MD}(m_{2})$ in parallel. If both messages collide, a collision for $f$ must occur in one of these parallel iterations.

Figure 3. A collision-finder ${\cal B}$ for the compression function $f$

In order to show (1) it suffices to prove the relational judgment: $$ \vdash {\rm CR}^{{\rm MD}}\sim {\rm CR}^{f} : {\rm true} \Longrightarrow {\rm res}\langle1\rangle\Rightarrow {\rm res}\langle2\rangle \eqno{\hbox{(2)}} $$ Proving this judgment involves non-trivial relational reasoning because equivalent computations in the related games are not performed in lockstep. We begin by inlining the call to ${\cal B}$ in ${\rm CR}^{f}$ and showing that the relational post-condition $$ \eqalignno{ & (m_{1}, m_{2})\langle1\rangle=(m_{1}, m_{2})\langle 2\rangle\wedge \cr & (h_{1}=\ {\rm MD}(m_{1})\wedge h_{2}=\ {\rm MD}(m_{2}))\langle 1\rangle} $$ holds after the call to ${\cal A}$ in both programs and the two calls to ${\sf F}$ in ${\rm CR}^{\rm MD}$. To show this, we prove that oracle ${\sf F}$ correctly implements function MD using the one-sided rule for loops; the needed invariant is simply $f^{\ast}(xs, y)={\rm MD}(m)$. At this point, note that if $m_{1}=m_{2}$, judgment (2) holds trivially (we only have to check that ${\cal B}$ terminates). We are left with the case $m_{1}\neq m_{2}$. Assume w.l.o.g. that $\vert {\rm pad} (m_{2})\vert \leq\vert {\rm pad} (m_{1}) \vert$, in which case ${\cal B}$ never enters its second loop and the following invariant holds for the first: $$ \eqalignno{ & f^{\ast}(xs_{1}, y_{1})={\rm MD}(m_{1})\wedge f^{\ast}(xs_{2}, y_{2})= {\rm MD}(m_{2})\wedge \cr & m_{1}\neq m_{2}\wedge\vert xs_{2}\vert \leq\vert xs_{1}\vert \wedge xs_{2}= {\rm pad}(m_{2})\wedge &\hbox{(3)} \cr & \exists xs^{\prime}.\ xs^{\prime}\ \Vert \ xs_{1}= {\rm pad} (m_{1})} $$ We prove that if the messages $m_{1}, m_{2}$ output by ${\cal A}$ collide, the last loop necessarily exits because a collision is found.
This can be shown by means of the following loop invariant: $$ \eqalignno{ & f^{\ast}(xs_{1}, y_{1})={\rm MD} (m_{1})\wedge f^{\ast}(xs_{2}, y_{2})= {\rm MD} (m_{2})\wedge \cr & \vert xs_{2}\vert =\vert xs_{1}\vert \wedge \cr & (xs_{1}=xs_{2}\Rightarrow y_{1}\neq y_{2})}$$ Note that (3) and the negation of the guard of the first loop imply that the above invariant holds initially. In particular, the last implication holds because if $xs_{1}$ and $xs_{2}$ were equal, there would exist a prefix $xs^{\prime}$ such that $xs^{\prime} \ \Vert \ {\rm pad} (m_{2})= {\rm pad} (m_{1})$, contradicting the fact that pad is suffix-free. Finally, observe that the last loop can exit either because a collision for $f$ is found or because $xs_{1} = {\rm nil}$. In the latter case, it must be that $xs_{2} = {\rm nil}$ and therefore $y_{1}={\rm MD}(m_{1})={\rm MD}(m_{2})=y_{2}$. However, from the last implication in the invariant we also have $y_{1}\neq y_{2}$, which leads to a contradiction that renders this case trivial.
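The behaviour of the collision-finder can be illustrated with a Python sketch reconstructed from the prose description of its three loops; it is not the code of Figure 3. The toy compression function `f` (one-byte output, so that collisions exist among few messages), the padding `pad`, and all names are assumptions made for the example.

```python
import hashlib

K = 8          # block size in bytes (illustrative)
IV = b"\x00"   # one-byte chaining value, chosen so collisions are easy to find

def f(x, y):
    """Toy compression function with a one-byte output; not secure."""
    return hashlib.sha256(x + y).digest()[:1]

def pad(m):
    """Hypothetical suffix-free padding: 0x80, zero fill, final length block."""
    data = m + b"\x80"
    data += bytes(-len(data) % K)
    data += len(m).to_bytes(K, "big")
    return [data[i:i + K] for i in range(0, len(data), K)]

def md(m):
    y = IV
    for x in pad(m):
        y = f(x, y)
    return y

def find_collision(m1, m2):
    """Return a pair of distinct inputs to f with equal outputs, or None."""
    xs1, xs2 = pad(m1), pad(m2)
    y1 = y2 = IV
    # First loop: consume leading blocks of pad(m1) while it is the longer one.
    while len(xs2) < len(xs1):
        y1, xs1 = f(xs1[0], y1), xs1[1:]
    # Second loop: symmetric case, pad(m2) is the longer one.
    while len(xs1) < len(xs2):
        y2, xs2 = f(xs2[0], y2), xs2[1:]
    # Last loop: iterate in parallel and stop at the first collision for f.
    while xs1:
        in1, in2 = (xs1[0], y1), (xs2[0], y2)
        if in1 != in2 and f(*in1) == f(*in2):
            return in1, in2
        y1, y2 = f(*in1), f(*in2)
        xs1, xs2 = xs1[1:], xs2[1:]
    return None
```

With a one-byte digest, any 257 distinct messages contain two that collide under `md` by the pigeonhole principle; feeding such a pair to `find_collision` yields a collision for `f`, as the invariant argument above predicts. When both arguments are the same message the function returns `None`, which matches the trivial case $m_{1}=m_{2}$.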



We prove the indifferentiability of the MD construction from a random oracle in Formula$\{0,1\}^{\ast}\rightarrow\{0,1\}^{n}$ when its compression function Formula$f$ is modeled as a random oracle in Formula$\{0,1\}^{k}\times\{0,1\}^{n}\rightarrow\{0,1\}^{n}$ and its padding function is prefix-free. Our proof is based on [22].

Theorem 7

(Indifferentiability of MD). The Merkle-Damgård construction MD with an ideal compression function Formula$f$, prefix-free padding pad, and initialization vector IV is Formula$(t_{\cal S}, q_{\cal D}, \epsilon)$-indifferentiable from a variable input-length random oracle Formula$F:\{0,1\}^{\ast}\rightarrow\{0,1\}^{n}$ where FormulaTeX Source$$\epsilon={3 \ell^{2} \ q_{\cal D}^{2} \over 2^{n}} \qquad t_{\cal S}=O(\ell \ q_{\cal D}^{2})$$ and Formula$\ell$ is an upper bound on the block-length of Formula${\rm pad}(m)$ for any message Formula$m$ appearing in a query of the distinguisher.
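To get a feeling for the size of the bound in Theorem 7, the following sketch evaluates it exactly with rational arithmetic. The function name and the parameter values are assumptions chosen for illustration only; they do not come from the paper.

```python
from fractions import Fraction


def indifferentiability_bound(ell: int, q_d: int, n: int) -> Fraction:
    """Exact value of 3 * ell^2 * q_D^2 / 2^n from Theorem 7."""
    return Fraction(3 * ell**2 * q_d**2, 2**n)


# Illustrative parameters (assumed, not from the paper):
# 256-bit chaining values, padded messages of at most 2^10 blocks,
# and a distinguisher making 2^40 queries.
eps = indifferentiability_bound(ell=2**10, q_d=2**40, n=256)
print(eps == Fraction(3, 2**156))  # True
```

The bound grows quadratically in the product of the block-length bound and the number of queries, so it becomes vacuous once that product approaches the square root of 2^n.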

In what we call the real scenario, a distinguisher Formula$\cal D$ has access to an oracle Formula$F_{q}$ implementing the function MD and to a random oracle Formula$f_{q} : \{0,1\}^{k} \times \{0,1\}^{n} \rightarrow \{0,1\}^{n}$ that models the compression function. In contrast, in the ideal scenario, Formula$\cal D$ has access to a random oracle Formula$F_{q}:\{0,1\}^{\ast}\rightarrow\{0,1\}^{n}$ and Formula$f_{q}$ is simulated. See Fig. 4 for a formulation of these two scenarios as games. To prevent Formula$\cal D$ from making more than Formula$q$ oracle queries, we enforce a bound Formula$q=\ell \ q_{\cal D}$ on the counter Formula${\bf q}_{f}$, which counts the number of evaluations of the compression function in game Formula${\rm G}_{\rm real}$. Note that this is more permissive than the proof of Coron et al. [22], since it allows the distinguisher to trade queries to Formula$F_{q}$ for queries to Formula$f_{q}$. Indeed, if Formula$\cal D$ makes Formula$n_{f}$ queries to Formula$f_{q}$ and Formula$n_{F}$ queries to Formula$F_{q}$, we require FormulaTeX Source$${\bf q}_{f}\leq n_{f}+\ell \ n_{F} \leq \ell(n_{f}+n_{F}) \leq \ell \ q_{\cal D}=q $$

We show that the simulator Formula$f_{q}$ in Formula${\rm G}_{\rm ideal}$ behaves consistently with a random oracle. Whenever the distinguisher makes a query Formula$(x, y)$ to oracle Formula$f_{q}$, the simulator looks among all previous queries for a sequence that could be the chain of inputs to the compression function used to compute the hash of some message Formula$m$, for which Formula$x$ is the last block of Formula${\rm pad}(m)$. We call such a sequence a complete chain, and we define it formally below. When such a sequence is found, the simulator queries Formula$F$ for the hash of Formula$m$ and forwards the answer to the distinguisher. Otherwise, the simulator answers with a uniformly distributed random value. 
Figure 5 shows how this simulator would react to a sequence of queries FormulaTeX Source$$y_{2}\leftarrow f_{q}(x_{1}, {\rm IV});y_{3}\leftarrow f_{q}(x_{2}, y_{2});y_{4}\leftarrow f_{q}(x_{3}, y_{3}) $$ where Formula$x_{1} \ \Vert \ x_{2} \ \Vert \ x_{3}= {\rm pad} (m)$. The first two queries will be answered with random values, while the third completes a chain and is answered by forwarding Formula${\rm pad}^{-1}(x_{1} \ \Vert \ x_{2} \ \Vert \ x_{3})$ to Formula$F$; this maintains the consistency with the real scenario.

Figure 5. An example illustrating how the simulator works

Definition 8

(Complete chain). A complete chain in a map Formula$T : \{0,1\}^{k}\times\{0,1\}^{n}\rightarrow\{0,1\}^{n}$ is a sequence Formula$(x_{1}, y_{1})\ldots(x_{i}, y_{i})$ such that Formula$y_{1}={\rm IV}$ and

  1. Formula$\forall j=1\ldots i-1.\ (x_{j}, y_{j})\in {\rm dom} (T)\wedge T[x_{j}, y_{j}]=y_{j+1}$
  2. Formula$x_{1}\Vert\ldots\Vert x_{i}$ is in the domain of Formula${\rm pad}^{-1}$

The function findseq Formula$((x,y),T^{\prime})$ used by the simulator searches in Formula$T^{\prime}$ for a complete chain of the form Formula$(x_{1}, y_{1})\ldots(x_{i}, y_{i})(x, y)$ and returns Formula$x_{1} \Vert \ldots \Vert x_{i}$, or Formula$\perp$ if no such chain is found.
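The behavior of the simulator and of findseq can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of the games in the figures: the prefix-free padding (a leading length block), the byte sizes, the backward search strategy, and the names pad, unpad, RandomOracle and Simulator are all hypothetical. Here unpad plays the role of the inverse of pad and returns None outside its domain.

```python
import os

K, N = 4, 8          # block and chaining-value lengths in bytes (toy values)
IV = bytes(N)        # assumed all-zero initialization vector


def pad(m: bytes) -> list:
    """Prefix-free padding: a leading length block, then the padded body."""
    body = m + b"\x80"
    body += bytes(-len(body) % K)
    blocks = [body[i:i + K] for i in range(0, len(body), K)]
    return [len(m).to_bytes(K, "big")] + blocks


def unpad(blocks: list):
    """Inverse of pad; None when the blocks are not in its image."""
    if len(blocks) < 2:
        return None
    n = int.from_bytes(blocks[0], "big")
    body = b"".join(blocks[1:])
    if n >= len(body):
        return None
    m = body[:n]
    return m if pad(m) == list(blocks) else None


def findseq(x: bytes, y: bytes, T: dict):
    """Blocks x1..xi of a complete chain (x1,IV)...(xi,yi)(x,y), or None."""
    def back(y_cur, suffix, seen):
        # A chain is complete when it starts at IV and unpads to a message.
        if y_cur == IV and unpad(suffix) is not None:
            return suffix[:-1]
        # Otherwise extend the chain backwards through entries mapping to y_cur.
        for (xp, yp), z in T.items():
            if z == y_cur and (xp, yp) not in seen:
                found = back(yp, [xp] + suffix, seen | {(xp, yp)})
                if found is not None:
                    return found
        return None
    return back(y, [x], frozenset())


class RandomOracle:
    """Lazily sampled random function on messages."""
    def __init__(self):
        self.table = {}

    def __call__(self, m: bytes) -> bytes:
        if m not in self.table:
            self.table[m] = os.urandom(N)
        return self.table[m]


class Simulator:
    """Simulated compression function with access to the random oracle F."""
    def __init__(self, F):
        self.F, self.T = F, {}

    def f(self, x: bytes, y: bytes) -> bytes:
        if (x, y) not in self.T:
            prefix = findseq(x, y, self.T)
            if prefix is not None:
                # The query completes a chain: stay consistent with F.
                self.T[(x, y)] = self.F(unpad(prefix + [x]))
            else:
                self.T[(x, y)] = os.urandom(N)
        return self.T[(x, y)]


F = RandomOracle()
S = Simulator(F)
y = IV
for x in pad(b"hello"):
    y = S.f(x, y)
print(y == F(b"hello"))  # True, except with negligible probability
```

Recomputing the construction block by block through the simulator yields the random oracle's answer because the last query completes a chain; the first two queries are answered with fresh random values.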

To help SMT solvers and automated provers check logical side-conditions arising in our proofs, we needed to derive several auxiliary lemmas: e.g., if a finite map Formula$T$ is injective and does not map any entry to the value IV, every complete chain is determined by its last element; that is, for any given Formula$(x, y)$, the value of findseq Formula$((x,y),T^{\prime})$ is uniquely determined. All of these lemmas have been mechanically verified based solely on the axiomatization and definitions of elementary operations. In many cases, EasyCrypt is able to verify the validity of these lemmas automatically. The more involved lemmas have been manually verified in the Coq proof assistant.

Figure 4. The games Formula${\rm G}_{\rm real}$ and Formula${\rm G}_{\rm ideal}$

The proof proceeds by stepwise transforming the game Formula${\rm G}_{\rm real}$ into the game Formula${\rm G}_{\rm ideal}$, upper-bounding the probability that the outcomes of consecutive games differ. By summing these probabilities, we obtain a concrete bound for the advantage of the distinguisher in telling apart the initial and final games. Specifically, we prove: FormulaTeX Source$$ \vert {\rm Pr}[{\rm G}_{\rm real}: b]-{\rm Pr}[{\rm G}_{\rm ideal}:b]\vert \leq{3q^{2}\over 2^{n}} \eqno{\hbox{(4)}}$$

Figure 6. The game Formula${\rm G}_{{\rm real}^{\prime}}$

We begin by considering the game Formula${\rm G}_{{\rm real}^{\prime}}$ defined in Fig. 6. We introduce events Formula${\bf bad}_{1}$, Formula${\bf bad}_{2}$ and Formula${\bf bad}_{3}$ that will be needed later. First, we introduce a copy of oracle Formula$f$, which we call Formula$f_{\bf bad}$. Both use the same map Formula$T$ to store previously answered queries; the difference is that Formula$f_{\bf bad}$ may trigger events Formula${\bf bad}_{1}$ and Formula${\bf bad}_{2}$. We also introduce the lists Formula${\mbi Y}$ and Formula${\mbi Z}$ that allow us to detect when these events occur. In addition, we modify the simulator Formula$f_{q}$ to maintain a map Formula$T^{\prime}$ of queries known to the distinguisher. Observe that Formula$T^{\prime}\subseteq T$, because queries to Formula$F_{q}$ result in entries being added only to Formula$T$, whereas queries to Formula$f_{q}$ result in the same entries being added to both Formula$T$ and Formula$T^{\prime}$. Additionally, the simulator Formula$f_{q}$ behaves in two different ways depending on whether findseq Formula$((x, y), T^{\prime})\neq\perp$. If this condition holds, there is a complete chain in map Formula$T^{\prime}$ ending in Formula$(x, y)$. In this case, in game Formula${\rm G}_{\rm ideal}$ the simulator should call oracle Formula$F$ to maintain consistency with the random oracle; otherwise the simulator could just sample a fresh random value. In this game, oracle Formula$f_{q}$ returns the same answer in both cases, but sets Formula${\bf bad}_{\{1,2,3\}}$ accordingly. Lastly, we also unroll the last iteration of the loop in Formula$F_{q}$.

Note that instrumenting the game with the additional map Formula$T^{\prime}$ and the failure events Formula${\bf bad}_{\{1,2,3\}}$ does not change the observable behavior. Therefore, FormulaTeX Source$${\rm Pr}[{\rm G}_{\rm real}:b]={\rm Pr}[{\rm G}_{{\rm real}^{\prime}}\ :\ b] $$

In game Formula${\rm G}_{\rm realRO}$, defined in Fig. 7, we introduce a random oracle Formula$RO : \{0,1\}^{\ast}\rightarrow\{0,1\}^{n}$ and replace every call Formula$f_{\bf bad}(x, y)$ in game Formula${\rm G}_{{\rm real}^{\prime}}$ where Formula$(x, y)$ ends a complete chain in Formula$T$ with a call to Formula${\rm RO}(m, y)$, where Formula$m$ is the unpadded message of the chain. That is, in oracle Formula$f_{q}$ we call RO if findseq is successful, and in oracle Formula$F_{q}$ we call RO instead of the last call to Formula$f_{\bf bad}$. We also introduce the map Formula${\mbi I}:\Bbb{N}\rightarrow\{0,1\}^{n}\times {\Bbb B}$, which enumerates all sampled chaining values and includes a tainted flag to keep track of values known to the distinguisher. We introduce an indirection in maps Formula$T$ and Formula$T^{\prime}$ through the use of map Formula${\mbi I}$. This allows us to keep track of the order in which queries were made and to know which answers we could re-sample without introducing inconsistencies in the view of the distinguisher.

The failure events that were introduced in the last step capture certain dependencies on previous queries that the distinguisher may exploit to tell apart games Formula${\rm G}_{{\rm real}^{\prime}}$ and Formula${\rm G}_{\rm realRO}$. We prove that games Formula${\rm G}_{{\rm real}^{\prime}}$ and Formula${\rm G}_{\rm realRO}$ behave the same provided these failure events do not occur.

  1. Formula${\bf bad}_{1}$ is triggered whenever oracle Formula$f_{\bf bad}$ samples a random value that is either IV or has already been sampled for a distinct query before. The role of this event is twofold. On the one hand, if IV is sampled as a random value, then there could exist a complete chain in Formula$T$ that is a suffix of another complete chain in Formula$T$, as illustrated in the first example of Figure 8 (here Formula$T[x_{2}, y_{2}]= {\rm IV}$). The problem is that oracle Formula$F_{q}$ in the game Formula${\rm G}_{\rm real}$ will generate the same values for the two messages corresponding to those two chains, while Formula$F_{q}$ in the game Formula${\rm G}_{\rm ideal}$ most likely will not. On the other hand, if a sampled value has been sampled for another query before, then there could exist two complete chains in Formula$T$ that collide at some point and are identical from that point on, as illustrated in the second example of Figure 8. Again the two corresponding messages would yield the same answer in Formula${\rm G}_{\rm real}$ but most likely not in Formula${\rm G}_{\rm ideal}$ on queries to Formula$F_{q}$. By requiring that event Formula${\bf bad}_{1}$ does not occur, we guarantee that in game Formula${\rm G}_{{\rm real}^{\prime}}$ the map Formula$T$ is injective and does not map any value to IV.
  2. Formula${\bf bad}_{2}$ is triggered whenever oracle Formula$f_{\bf bad}$ samples a random value that has already been used as a chaining value in a previous query. This means that the query may belong to a chain whose later points the distinguisher has already queried, which should not be possible. The event also captures that no fixed-points (i.e. entries of the form Formula${\mbi T}[x, y]=y$) should be sampled.
  3. Formula${\bf bad}_{3}$ is triggered whenever a chaining value Formula$y$ in a query has already been sampled as a random value and is in the range of Formula$T$ for some previous query Formula$(x^{\prime}, y^{\prime})$, but Formula$(x^{\prime}, y^{\prime})$ does not appear in the domain of Formula$T^{\prime}$ and Formula$(x^{\prime}, y^{\prime})$ is not the last element of a complete chain in Formula$T$. Intuitively, this means that Formula$y$ was never returned by Formula$f_{q}$ or Formula$F_{q}$ and hence the distinguisher managed to guess a random value.
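The first two failure events can be pictured as flags set by an instrumented, lazily sampled oracle. The sketch below is a guess at the shape of such instrumentation based only on the descriptions above, not a transcription of the game in Fig. 6: it assumes that the list Z holds IV together with every value sampled so far, that the list Y holds every chaining value that appeared in a query, and that the sampler can be injected for testing. The third event depends on the map of distinguisher-known queries and on complete chains, and is left out.

```python
import os

N = 8                # chaining-value length in bytes (toy value, assumed)
IV = bytes(N)        # assumed all-zero initialization vector


class InstrumentedOracle:
    """Lazily sampled random function with bad1/bad2 flags (illustrative)."""

    def __init__(self, sample=lambda: os.urandom(N)):
        self.sample = sample      # injectable sampler, handy for testing
        self.T = {}               # answered queries: (x, y) -> z
        self.Z = [IV]             # IV plus every value sampled so far
        self.Y = []               # chaining values that appeared in queries
        self.bad1 = self.bad2 = False

    def f_bad(self, x: bytes, y: bytes) -> bytes:
        if (x, y) not in self.T:
            self.Y.append(y)
            z = self.sample()
            # bad1: z is IV or repeats an earlier sample.
            self.bad1 = self.bad1 or z in self.Z
            # bad2: z already occurred as a chaining value; since y was just
            # recorded, this also covers the fixed-point case z == y.
            self.bad2 = self.bad2 or z in self.Y
            self.Z.append(z)
            self.T[(x, y)] = z
        return self.T[(x, y)]


# Deterministic demonstration: the second query samples a repeated value.
answers = iter([b"\x01" * N, b"\x01" * N])
oracle = InstrumentedOracle(sample=lambda: next(answers))
oracle.f_bad(b"blk1", IV)
oracle.f_bad(b"blk2", IV)
print(oracle.bad1, oracle.bad2)  # True False
```

When neither flag is set, the map T in this sketch is injective, never maps to IV, and contains no fixed points, which are the properties the text relies on.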

In order to relate games Formula${\rm G}_{{\rm real}^{\prime}}$ and Formula${\rm G}_{\rm realRO}$ in the case that findseq Formula$((x,y),T^{\prime})$ in Formula$f_{q}$ succeeds in both games, we need to show that the call Formula$f_{\bf bad}(x, y)$ in Formula${\rm G}_{{\rm real}^{\prime}}$ and the call Formula${\rm RO} (m, y)$ in Formula${\rm G}_{\rm realRO}$ behave similarly. For this we show that the following invariant is preserved in both games: for all complete chains Formula$\cal C$ in the map Formula$T$ of game Formula${\rm G}_{{\rm real}^{\prime}}$ with Formula${\rm last}({\cal C}) \in {\rm dom} (T)$, it holds that the message associated with Formula$\cal C$ is in Formula${\rm dom}({\mbi R})$ of game Formula${\rm G}_{\rm realRO}$ and, vice versa, every message in Formula${\rm dom}({\mbi R})$ of game Formula${\rm G}_{\rm realRO}$ has a corresponding complete chain Formula$\cal C$ in the map Formula$T$ of game Formula${\rm G}_{{\rm real}^{\prime}}$ with Formula${\rm last}({\cal C}) \in {\rm dom} (T)$. This invariant allows EasyCrypt to prove this case by inferring that Formula$(x, y) \in {\rm dom} ({\mbi T})$ in game Formula${\rm G}_{{\rm real}^{\prime}}$ if and only if Formula$m \in {\rm dom} ({\mbi R})$ in game Formula${\rm G}_{{\rm realRO}}$.

Proving that the aforementioned invariant is preserved in the games requires several other invariants. Most of them merely relate the representation of maps in both games; we omit these technical details. The essential invariant is that the distinguisher queries Formula$f_{q}$ for points in a chain only if it has already queried the preceding part of the chain. This is important as it implies that each chain will be completed by a query for its last element, in which case findseq will detect this query and the corresponding message will be added to Formula${\mbi R}$. In game Formula${\rm G}_{{\rm real}^{\prime}}$, the predicate Formula${\sf set\_bad3}$ enforces this ordering by triggering event Formula${\bf bad}_{3}$. The probability of this event is negligible, because it means that Formula$y$ was never output by Formula$f_{q}$ or Formula$F_{q}$ and hence is not known to the distinguisher. In game Formula${\rm G}_{\rm realRO}$, we use the map Formula${\mbi I}$ to iterate over all chaining values in order to check for the ordering mentioned above.

In oracle Formula$F_{q}$ of game Formula${\rm G}_{\rm realRO}$, the computation of the Merkle-Damgård construction is split into three stages due to the different usage of the maps Formula$T^{\prime},T^{\prime}_{i}$, and Formula$T$. The first loop computes the construction for values that were already queried by the distinguisher and are therefore in Formula${\rm dom}(T^{\prime})$. The restriction that the distinguisher may only query chains in order implies that such values occur only in the prefix of a chain. The second loop handles values that were already used before by oracle Formula$F_{q}$, and the third loop samples fresh chaining values. Relating the final call to Formula$f_{\bf bad}$ in game Formula${\rm G}_{{\rm real}^{\prime}}$ and the final call to RO in game Formula${\rm G}_{\rm realRO}$ is similar to the corresponding case in oracle Formula$f_{q}$. We prove that the advantage in differentiating between games Formula${\rm G}_{{\rm real}^{\prime}}$ and Formula${\rm G}_{\rm realRO}$ is upper bounded by the probability of any of Formula${\bf bad}_{1},{\bf bad}_{2},{\bf bad}_{3}$ occurring in game Formula${\rm G}_{\rm realRO}$: FormulaTeX Source$$ \eqalignno{ \vert {\rm Pr}[{\rm G}_{{\rm real}^{\prime}}\ :\ b]- & {\rm Pr}[{\rm G}_{{\rm realRO}}:b]\vert \leq \cr & \quad{\rm Pr}[{\rm G}_{{\rm realRO}} : {\bf bad}_{1} \vee {\bf bad}_{2}\vee {\bf bad}_{3}]} $$

Figure 7. The game Formula${\rm G}_{\rm realRO}$

Figure 8. Two examples illustrating the necessity of event Formula${\bf bad}_{1}$

To finish the proof, we have to relate Formula${\rm Pr}[{\rm G}_{\rm realRO}:b]$ with Formula${\rm Pr}[{\rm G}_{\rm ideal}:b]$ and bound the probability of the failure events in game Formula${\rm G}_{\rm realRO}$. We first focus on the probability of Formula${\bf bad}_{1}$ and Formula${\bf bad}_{2}$. Event Formula${\bf bad}_{1}$ (resp. Formula${\bf bad}_{2}$) is set when a freshly sampled value Formula$z$ is in the list Formula${\mbi Z}$ (resp. Formula${\mbi Y}$); since the size of both lists is bounded by Formula$q$, this occurs with probability at most Formula$q \ 2^{-n}$ for each of the possible Formula$q$ queries.

Note that oracles Formula$F_{q}, \ RO$, and Formula$f_{q}$ in game Formula${\rm G}_{\rm realRO}$ use the same code to detect the failure events Formula${\bf bad}_{1}$ and Formula${\bf bad}_{2}$ when sampling a fresh value Formula$z$. We can wrap this code in a new oracle that meets the conditions of Lemma 2: we take Formula$u=q \ 2^{-n}$ and Formula$i=\vert {\mbi Z} \vert$ (resp. Formula$\vert {\mbi Y} \vert$). We get FormulaTeX Source$${\rm Pr}[{\rm G}_{{\rm realRO}}: {\bf bad}_{1}] \leq{q^{2}\over 2^{n}} \quad {\rm Pr}[{\rm G}_{{\rm realRO}}: {\bf bad}_{2}] \leq{q^{2}\over 2^{n}} $$
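The arithmetic behind these two bounds is a plain union bound, which the following sketch spells out with exact rationals. The function names are hypothetical, and the second function additionally assumes that the list being checked holds exactly i entries at the i-th sampling, which is an illustration rather than a property stated in the paper.

```python
from fractions import Fraction


def bad_event_bound(q: int, n: int) -> Fraction:
    """Union bound: q samplings, each hitting a list of size at most q."""
    per_query = Fraction(q, 2**n)
    return q * per_query


def bad_event_bound_growing(q: int, n: int) -> Fraction:
    """Sum of i / 2^n for i = 0..q-1 (list grows by one per sampling)."""
    return sum((Fraction(i, 2**n) for i in range(q)), Fraction(0))


q, n = 2**10, 64     # illustrative values, not from the paper
print(bad_event_bound(q, n) == Fraction(q**2, 2**n))           # True
print(bad_event_bound_growing(q, n) <= bad_event_bound(q, n))  # True
```

Under the growth assumption the sum is q(q-1)/2 over 2^n, so the stated bound of q^2 over 2^n is loose by roughly a factor of two, which is immaterial for the final result.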

We are left to bound the probability of Formula${\bf bad}_{3}$ and relate Formula${\rm Pr}[{\rm G}_{\rm realRO}:b]$ with Formula${\rm Pr}[{\rm G}_{\rm ideal}:b]$. Note that in game Formula${\rm G}_{\rm realRO}$ chaining values are sampled eagerly, i.e. for a query Formula$m$, oracle Formula$F_{q}$ samples chaining values Formula$z$ that are independent of the distinguisher's view (their associated flag is set to true). These values might later on become known to the distinguisher if it recomputes the Merkle-Damgård construction for Formula$m$ using oracle Formula$f_{q}$ (we identify this case by setting found = true). We want to transform the game so that chaining values are sampled lazily (as in game Formula${\rm G}_{\rm ideal}$).

The same kind of argument can be used for Formula${\bf bad}_{3}$. This event is set whenever the distinguisher makes a query Formula$(x, y)$ to Formula$f_{q}$ with Formula$y$ coinciding with a value uniformly and independently distributed w.r.t. its view.

Figure 9. The games Formula${\rm G}_{\rm idealEager}$ and Formula${\rm G}_{\rm idealLazy}$

We modify game Formula${\rm G}_{\rm realRO}$ in order to prepare for the transition from eager to lazily sampled chaining values: the body of game Formula${\rm G}_{\rm idealEager}$ (see Figure 9) contains a loop which re-samples all chaining values that are unknown to the adversary, i.e., the values for which the second component in map Formula${\mbi I}$ is set to true. Furthermore, game Formula${\rm G}_{\rm idealEager}$ drops the failure events Formula${\bf bad}_{\{1,2,3\}}$, but introduces a new failure event Formula${\bf bad}_{4}$. We show that if Formula${\bf bad}_{3}$ is triggered in game Formula${\rm G}_{\rm realRO}$, then in game Formula${\rm G}_{\rm idealEager}$ either Formula${\bf bad}_{4}$ is set to true or there exists an Formula$i$ such that Formula${\mbi I}[i]=(v, {\rm true})$ and Formula$v \ \in \ {\mbi Y}$. We get FormulaTeX Source$$ \eqalignno{ & {\rm Pr}[{\rm G}_{{\rm realRO}}:b] ={\rm Pr}[{\rm G}_{{\rm idealEager}}:b] \cr & {\rm Pr}[{\rm G}_{{\rm realRO}} :{\bf bad}_{3}] \leq {\rm Pr}[{\rm G}_{{\rm idealEager}} : {\bf bad}_{4}\vee {\rm I}_{\exists}]} $$ where Formula${\rm I}_{\exists}=\exists i.\ 0\leq i\leq {\bf q}^{\prime}_{f}\wedge {\rm snd} ({\mbi I}[i])\ \wedge \ {\rm fst} ({\mbi I}[i]) \ \in \ {\mbi Y}$.

In game Formula${\rm G}_{\rm idealLazy}$ (see Figure 9), the loop we introduced in the last game is swapped with the call to the distinguisher and oracle Formula$f_{q}$ samples the chaining values lazily (the branch found re-samples the value of Formula$z$). In order to prove the equivalence with the previous game, we need to show that the loop that resamples the values unknown to the adversary swaps with calls to oracles Formula$F_{q}$ and Formula$f_{q}$ in games Formula${\rm G}_{\rm idealEager}$ and Formula${\rm G}_{\rm idealLazy}$. We obtain FormulaTeX Source$$ \eqalignno{ & {\rm Pr}[{\rm G}_{\rm idealEager}:b] ={\rm Pr}[{\rm G}_{{\rm idealLazy}}:b] \cr & {\rm Pr}[{\rm G}_{{\rm idealEager}}\ :\ {\bf bad}_{4}\vee {\rm I}_{\exists}] ={\rm Pr}[{\rm G}_{{\rm idealLazy}}\ :\ {\bf bad}_{4}\vee {\rm I}_{\exists}]}$$ It is easy to see that games Formula${\rm G}_{\rm idealLazy}$ and Formula${\rm G}_{\rm ideal}$ are equivalent w.r.t. Formula$b$; the global variable Formula${\bf q}_{f}$ and the maps Formula${\mbi R}$ and Formula${\mbi T}^{\prime}$ are equivalent in both games. The other variables in game Formula${\rm G}_{\rm idealLazy}$ and its loops do not influence the behavior of its oracles. We show that FormulaTeX Source$${\rm Pr}[{\rm G}_{\rm idealLazy}:b]={\rm Pr}[{\rm G}_{\rm ideal}:b]. $$

We still have to bound the probability of Formula${\bf bad}_{4} \ \vee \ {\rm I}_{\exists}$ in game Formula${\rm G}_{\rm idealLazy}$. To do this, we simply modify the while loop in the code of the game by replacing the instruction Formula$z\displaystyle\mathop{\leftarrow}^{\$}\{0,1\}^{n}$ with FormulaTeX Source$$z\displaystyle\mathop{\leftarrow}^{\$}\{0,1\}^{n};{\bf bad}_{4} \leftarrow {\bf bad}_{4}\vee z \in {\mbi Y}$$ This leads to a game Formula${\rm G}_{{\rm idealLazy}^{\prime}}$, for which we show FormulaTeX Source$${\rm Pr}[{\rm G}_{\rm idealLazy} : {\bf bad}_{4}\vee {\rm I}_{\exists}]\leq {\rm Pr}[{\rm G}_{{\rm idealLazy}^{\prime}} : {\bf bad}_{4}]$$ We finally use the same technique as for Formula${\bf bad}_{1}$ to bound the probability of Formula${\bf bad}_{4}$ in game Formula${\rm G}_{{\rm idealLazy}^{\prime}}$, and obtain FormulaTeX Source$${\rm Pr}[{\rm G}_{{\rm idealLazy}^{\prime}}:{\bf bad}_{4}]\leq{q^{2}\over 2^{n}}$$ Putting the (in-)equalities proved above together we prove (4), which completes the proof of Theorem 7.
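For reference, the individual steps can be chained as follows. The derivation uses only the (in-)equalities stated in this section, plus a union bound over the three failure events in the third line, which is the one step added here.

```latex
\begin{align*}
\bigl|\Pr[{\rm G}_{\rm real}:b]-\Pr[{\rm G}_{\rm ideal}:b]\bigr|
 &= \bigl|\Pr[{\rm G}_{{\rm real}'}:b]-\Pr[{\rm G}_{\rm realRO}:b]\bigr| \\
 &\le \Pr[{\rm G}_{\rm realRO}:{\bf bad}_1\vee{\bf bad}_2\vee{\bf bad}_3] \\
 &\le \Pr[{\rm G}_{\rm realRO}:{\bf bad}_1]+\Pr[{\rm G}_{\rm realRO}:{\bf bad}_2]
      +\Pr[{\rm G}_{\rm realRO}:{\bf bad}_3] \\
 &\le \frac{q^2}{2^n}+\frac{q^2}{2^n}
      +\Pr[{\rm G}_{\rm idealEager}:{\bf bad}_4\vee {\rm I}_\exists] \\
 &= \frac{2q^2}{2^n}+\Pr[{\rm G}_{\rm idealLazy}:{\bf bad}_4\vee {\rm I}_\exists] \\
 &\le \frac{2q^2}{2^n}+\Pr[{\rm G}_{{\rm idealLazy}'}:{\bf bad}_4] \\
 &\le \frac{3q^2}{2^n}
\end{align*}
```

The first equality combines the equality between the real game and its instrumented version with the chain of equalities relating the outcome b in the realRO, idealEager, idealLazy and ideal games. Substituting the query bound q, which is the product of the block-length bound and the number of distinguisher queries, gives the value of the advantage bound stated in Theorem 7.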



To avoid inheriting structural weaknesses in the original Merkle-Damgård construction, existing hash functions employ instead slight variants of it. One well-known variant is the wide-pipe design, which uses an internal state larger than the final output [22], [28]. Many variants are subsumed by the following Generalized Merkle-Damgård construction.

Definition 9

(Generalized Merkle-Damgård). Let IV Formula$\in \{0,1\}^{n}$ be a public initialization vector and Formula$f, g$ be two compression functions of type FormulaTeX Source$$f, g : \{0,1\}^{k}\times\{0,1\}^{n}\rightarrow\{0,1\}^{n}$$ Consider a function pad Formula$: \ \{0,1\}^{\ast} \ \rightarrow \ (\{0,1\}^{k})^{\ast}\times\{0,1\}^{k}$ that converts an arbitrary length message into a non-empty list of blocks of length Formula$k$, singling out the last block. The hash function GMD is defined as follows: FormulaTeX Source$$ \eqalign{ & {\rm GMD} \qquad: \quad \{0,1\}^{\ast}\rightarrow\{0,1\}^{\ell}\cr & {\rm GMD}(m)\ \mathop{=}^{def} \quad {\bf let}\ (x, y)=\ {\rm pad} (m)\ {\bf in}\ [g(y, f^{\ast}(x, {\rm IV}))]^{\ell}}$$ where Formula$f^{\ast}$ is defined as in Def. 3 and Formula$[x]^{\ell}$ chops off the Formula$n-\ell$ least significant bits from Formula$x$, i.e. discards all but the leading Formula$\ell$ bits.
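Definition 9 translates almost directly into code. The sketch below is only an illustration under assumptions: the two compression functions are toy functions derived from SHA-256 with different domain-separation tags, the padding rule and the sizes are arbitrary choices, and chaining values are represented as n-bit integers so that chopping is a right shift. None of these choices come from the paper.

```python
import hashlib

K_BYTES = 8      # block length k in bytes (toy value, assumed)
N_BITS = 64      # chaining-value length n in bits
ELL_BITS = 32    # digest length after chopping
IV = 0           # assumed initialization vector, as an n-bit integer


def _compress(tag: bytes, x: bytes, y: int) -> int:
    """Toy n-bit compression function built from SHA-256 (an assumption)."""
    data = tag + x + y.to_bytes(N_BITS // 8, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:N_BITS // 8], "big")


def f(x: bytes, y: int) -> int:
    return _compress(b"f", x, y)


def g(x: bytes, y: int) -> int:
    return _compress(b"g", x, y)


def pad(m: bytes):
    """Split m into k-byte blocks and single out the last one."""
    body = m + b"\x80"
    body += bytes(-len(body) % K_BYTES)
    body += len(m).to_bytes(K_BYTES, "big")
    blocks = [body[i:i + K_BYTES] for i in range(0, len(body), K_BYTES)]
    return blocks[:-1], blocks[-1]


def f_star(xs, y: int) -> int:
    """Iterate f over the block list xs starting from chaining value y."""
    for x in xs:
        y = f(x, y)
    return y


def chop(v: int) -> int:
    """Keep the leading ELL_BITS bits of an N_BITS-bit value."""
    return v >> (N_BITS - ELL_BITS)


def gmd(m: bytes) -> int:
    """Generalized Merkle-Damgard: f on all but the last block, then g, then chop."""
    xs, last = pad(m)
    return chop(g(last, f_star(xs, IV)))


print(hex(chop(0x0123456789ABCDEF)))  # 0x1234567
print(gmd(b"abc") < 2**ELL_BITS)      # True
```

Taking g equal to f and the digest length equal to n recovers the plain construction, which is why results about the generalized form cover several variants at once.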

The NIST SHA-3 competition started in November 2007 with the objective of selecting new cryptographic hash functions to augment the set specified by the U.S. Federal Information Processing Standard (FIPS) 180–3, which includes the SHA-1 and SHA-2 algorithms. After receiving 64 entries, NIST selected 51 candidates for the first round, further narrowed down the list to just 14 candidates for the second round, and announced 5 finalists in December 2010: BLAKE [6], Grøstl [26], JH [38], Keccak [14], and Skein [25]. A public comment period started after this announcement and the winner is expected to be selected before the end of 2012.

The security of all SHA-3 finalists, and of many second round candidates, has been thoroughly scrutinized. Two survey articles summarize known results [3], [4]. While the algorithmic descriptions of the finalists and their exact security bounds fit in one page (see [4]), the corresponding security proofs are technically involved and need to be cautiously adapted to account for the specificities of each function. As a consequence, it is difficult to assess the validity of security claims for individual candidates and machine checking their proofs is an appealing perspective. In the remainder of this section we discuss the applicability of the proofs presented in Sections IV and V to SHA-3 finalists.

The five SHA-3 finalists are based on the iterated hash function design that underlies the Merkle-Damgård construction, but incorporate some variations such as round-dependent tweaks, counters, final transformations, and chopping. We observe that, in a more or less contrived way, all the finalists can be considered as variants of the Generalized Merkle-Damgård construction (Definition 9). The compression functions of the finalists are either block-cipher based (BLAKE, Skein) or permutation-based (Grøstl, JH, Keccak). Moreover, all finalists use suffix-free padding rules, while the padding rules of BLAKE and Skein are additionally prefix-free [4].

Our formalization models compression functions as functions of two arguments: a message block and a chaining value. This represents a deviation with respect to the compression functions of BLAKE and Skein. The compression function of BLAKE additionally takes a counter and a random salt value, whereas the compression function of Skein builds on a tweakable block cipher and takes as additional input a round-specific tweak. The additional arguments of the compression functions of BLAKE and Skein could be formalized as an integral part of the padding rule; the padding function can compute the appropriate round-specific values and append them to the message blocks. This alternative description would have the advantage of matching the model that we use in our results about the MD hash function. However, all finalists except BLAKE use chopping or a final transformation, which are formalized neither in our proof of collision resistance nor in our proof of indifferentiability. This rules out a direct application of our results, with the exception of BLAKE, for which Theorem 6 does apply. We leave it for future work to formalize this instantiation in EasyCrypt.

NIST requirements for the SHA-3 competition include collision resistance, preimage resistance and second preimage resistance. All the candidates selected as finalists satisfy these properties and (in most cases) even achieve optimal bounds for them when the underlying block-ciphers or permutations used to build their compression functions are assumed to be ideal [4]. Although the original NIST requirements did not include the property of indifferentiability from a random oracle, this notion has also been considered in the literature and is achieved by all five finalists [1], [2], [12], [15], [16], [20]. These indifferentiability proofs hold in an idealized model for some of the building blocks of the hash function: the ideal-cipher model for block-cipher based hash functions, or the ideal-permutation model for permutation based hash functions. Indifferentiability seems to be an excellent target for security proofs because it ensures that the high-level design of the hash function has no structural weaknesses, but also because it implies bounds for all of the classical properties enumerated above. Unfortunately, the assumption that some underlying primitive is ideal is at best unrealistic and at worst plainly wrong. Proofs of indifferentiability should be taken only as an indication for the security and as a palliative for the lack of security proofs in the standard model.

Compared to our result of Theorem 7, which assumes that the compression function is ideal, the indifferentiability of all the finalists has been proved in an ideal model for lower building blocks. We point out that assuming ideality of a lower building block is weaker than assuming ideality of the entire compression function and thus these results are stronger. Indeed, assuming ideality of the compression function seems to be inappropriate for all the finalists:

  • The compression functions of JH and Keccak are trivially non-random, as collisions and preimages can be found in only one query to the underlying permutation [4], [17];
  • Finding fixed-points for the compression function of Grøstl is trivial [26];
  • The compression function of BLAKE has been recently shown to exhibit non-random behavior [1], [20];
  • Non-randomness has been shown for reduced-round versions of Threefish, the underlying block-cipher of Skein [27].

The only two finalists that use a prefix-free padding rule, and for which our proof of indifferentiability can apply, are BLAKE and Skein. However, our proof of indifferentiability of prefix-free Merkle-Damgård relies on the assumption that the underlying compression function behaves like an ideal primitive. Thus, it cannot be applied to BLAKE, as this assumption has been invalidated. As for Skein, the assumption that its compression function is ideal is seriously weakened by the attacks on Threefish mentioned above.

Although Theorem 7 cannot be directly applied to any of the SHA-3 finalists, it constitutes a non-trivial result about the Merkle-Damgård construction and a good starting point for formalizing more complex proofs. Indeed, indifferentiability proofs based on weaker assumptions and general enough to apply to SHA-3 finalists are not significantly different from the proof we have formalized and use essentially the same techniques. We see no impediment to formalizing them in EasyCrypt.



Despite their widespread use, the formal verification of hash functions has received little attention. To the best of our knowledge, Toma and Borrione [35] were the first to use theorem provers to formally verify properties of SHA-1, but their focus is on functional properties rather than security properties. The first machine-checked proof of security for a hash design appears in [7], where the authors use the CertiCrypt framework to verify that the construction from Brier et al. [18] yields a hash function indifferentiable from a random oracle into ordinary elliptic curves. More recently, Daubignard et al. [24] developed a method to permute dependencies between oracles in a game, and applied it to prove indifferentiability of hash functions from random oracles. Their method is not implemented, although the underlying framework has been machine-checked [21].

The prevailing method for building hash functions is to iterate a compression function on a pre-processed input message. In this paper, we have considered the Merkle-Damgård construction, which pioneered this design, and proved that the resulting hash function preserves collision resistance and is indifferentiable from a random oracle. Our results demonstrate that state-of-the-art verification tools can be used for proving the security of hash designs, and not only for cryptanalysis [32]. We will further this line of research by exploring the formalization of more general security proofs that apply to a wider range of hash functions, including finalists of the SHA-3 competition.


The authors want to thank Martín Abadi and the anonymous CSF reviewers for insightful feedback on the paper.



Michael Backes


Gilles Barthe


Matthias Berg


Benjamin Grégoire


César Kunz


Malte Skoruppa


Santiago Zanella Béguelin
