A FORMAL MODEL FOR EVENTUAL CONSISTENCY SEMANTICS

Anne-Marie Bosneag
[email protected] Department of Computer Science Wayne State University Detroit, MI 48202, USA
Monica Brockmeyer
[email protected] Department of Computer Science Wayne State University Detroit, MI 48202, USA
ABSTRACT
Wide-area replicated systems are characterized by a conflict between performance, availability and consistency. As a consequence, "one-size-fits-all" approaches cannot adequately address every situation. There is therefore a need both for a formalism that can express consistency requirements in a uniform way and for new algorithms and techniques appropriate for the wide-area setting. This paper exploits dependency relations between operations to improve the classic algorithm for eventual consistency and proposes a formal model for reasoning about the correctness of the algorithm. The proposed algorithm reduces the number of undo operations whenever independence between operations can be identified and exploited, and the formal model allows us to describe and reason about replication and consistency semantics in a precise manner.
KEY WORDS
Optimistic replication, formal model, eventual consistency
1. INTRODUCTION
It has become increasingly obvious that a "one-size-fits-all" approach will not be adequate to resolve the inherent tension between performance, availability and consistency in wide-area replication systems. One approach is to provide a range of consistency semantics within a single system [1], but most previous approaches to replication consistency have focused on individual consistency semantics and mechanisms [2, 3, 4, 5]. Further, many approaches are described algorithmically or informally, making it difficult to integrate them into a single system. This work is part of an effort to provide a coherent formal framework to describe and reason about consistency semantics for replicated data.
Eventual consistency semantics mitigate the conflict between consistency, availability, and performance for replicated state in distributed systems. In this model, all replicas converge to the same state if no updates take place for a sufficiently long time. Eventual consistency is characterized by optimistic replication (applying updates tentatively) and conflict resolution (resolving the situations where tentative updates lead to conflict), such that given a sufficiently quiescent system all replicas will in time converge to the same state. However, existing approaches have not precisely stated their notion of eventuality. This paper describes a formal definition of eventual consistency and uses that definition to prove the correctness of the Bayou algorithm for eventual consistency. In Bayou [6], operations are applied optimistically and tentatively, and later redone in a canonical order determined by a master replica. We describe an enhanced algorithm that exploits dependency information between operations to reduce the number of undo and redo operations necessary to achieve convergence to a final state.

We base our model on the general approach of a two-tier architecture [7], in which optimistic and pessimistic replication co-exist as a mechanism to alleviate the conflict between performance, availability, and consistency. Existing approaches do not capture the semantics of undo together with the ordering of operations in a coherent manner. Most current papers describe models algorithmically rather than formally, while existing formal approaches for replication, such as the x-ability theory [8], consider only histories with single operations. Those existing approaches that do consider ordering constraints, such as serializability theory, do not consider replication and, moreover, capture only a single consistency policy [2, 9].

Our system organizes replicas in layers: a layer of replicas that observe strict consistency semantics, and a second layer of replicas that apply updates tentatively, such that they can later be undone and redone in order to force convergence to the canonical order of the strict layer [10]. At any moment in time, a set of updates arrives at each replica, either from clients or from other replicas, in their effort to epidemically [11] spread updates everywhere. Any changes made to the state of a replica at the optimistic layer are not permanent until reconciliation with the canonical order takes place. This paper shows how to model the consistency semantics of each replica in a uniform way, and how to force replicas at the optimistic layer to converge to the canonical order of the strict layer, should no updates take place for a long time. We present a new algorithm that exploits dependencies between operations in order to reduce the effort needed to bring the state of the replicas at the optimistic layer consistent with the state of the replicas at the strict layer. The idea of operational dependencies was inspired by the IceCube approach [12].

The paper is organized as follows. Section 2 first gives formal definitions for the terms needed in our model, and then uses those terms to prove history equivalence. Section 3 describes the Bayou algorithm and proves its correctness, proposes an enhanced algorithm that takes static constraints between operations into account, and discusses, both theoretically and through a real-life example, how the proposed algorithm reduces the number of undo and redo operations necessary for convergence to the canonical order. Section 4 presents conclusions and future work.
2.0. FORMAL MODEL
We propose a formal model that can be applied to express any consistency policy used by applications in replicated systems. This formal model provides a way to express consistency semantics and to reason about the behavior of such systems.

2.1. Definitions:

Definition 1: An object (or item) is the smallest entity that is considered for consistency purposes and is defined as a tuple O = <S_O, Ops>, where S_O is the set of states of the object O and Ops is the set of operations defined for the object O. Thus, the object encapsulates both information and operations. The global state S_Ob of the object is defined as a set {S_R1, S_R2, …, S_Rn}, where each S_Ri is a tuple <val_i, tags_i>, with i = 1..n. The components of the tuple S_Ri describe copy i of the object O: val_i holds application-specific information (the value of the object O at copy i), while tags_i holds replication-specific information needed to maintain consistency. We define V and T as the sets of possible values for val_i and tags_i, respectively. S_O is defined as the set of all possible states for a copy of the object O: S_O = { <val_i, tags_i> | val_i ∈ V and tags_i ∈ T, ∀ i = 1 .. n }.

Definition 2: An operation for the object O is a function f : S_O -> S_O x (V ∪ {ε}). The return value is either a valid value in V, or an exception value ε that is returned if the operation fails. The relationship between the new state of the object and the return value can be expressed as f(<val, tags>) = (<val', tags'>, val'), where <val', tags'> ∈ S_O is the new state of the object and val' ∈ V is the return value. If the operation is a read, it has no effect on the state of the replica and no side effects on the object universe (other replicas in the same class or in other classes). If the operation is an update, the state of the replica changes, and the changes may be propagated to other replicas as well. Ops is the set of all functions defined for the object O: Ops = { f | f : S_O -> S_O x (V ∪ {ε}) }. Operations are defined as functions, since they are assumed to be deterministic.

Definition 3: A replica is a copy of an object. Any object in the system can have one or several replicas. Each replica is described at any moment in time by one state of type <val, tags>.

Each replica has a policy for applying operations. Thus, before an operation can be applied at a replica, it must first be specified when the operation can be applied. The replica operation, which also encapsulates the consistency policy, is defined as the tuple q = <p, f>, where p represents the precondition and f is the operation.

The precondition p is a constraint defined by a boolean function p : S_O -> B, where B = {true, false}. If S ∈ S_O and p(S) = true, then f(S) can be applied.

The preconditions capture both application-specific constraints and consistency-specific constraints, since they are applied to the state of the replica, which contains both the value (application-specific information) and the tags (consistency-specific information). For replicas where the policy accepts tentative application of operations, an undo function must be specified. This function reverses the effect of f on the value of the replica at which f was applied: f^-1 : S_O -> S_O, such that f^-1(f(S)) = S', ∀ S, S' ∈ S_O such that S = <val, tags>, S' = <val', tags'> and val = val'. Now we can formally define a replica as the tuple R = <S, Q>, where S is the state of the object at replica R, and Q is the set of replica operations defined for the object (Q = { <p, f> | p : S_O -> B, f ∈ Ops }). The state S is, as shown above, a tuple of the form <val, tags>.
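To make these definitions concrete, the following is a minimal Python sketch of the model, assuming a simple in-memory representation; the names ReplicaState, ReplicaOperation, apply_op and deposit are illustrative and not part of the formal model.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

# State of one replica copy (Definition 1): an application value plus
# replication-specific tags.
@dataclass(frozen=True)
class ReplicaState:
    val: Any     # application-specific value (an element of V)
    tags: Any    # replication-specific metadata (an element of T)

# A replica operation q = <p, f>: a precondition plus a deterministic state
# transformer returning the new state and a return value (Definition 2).
@dataclass
class ReplicaOperation:
    precondition: Callable[[ReplicaState], bool]                    # p : S_O -> B
    apply: Callable[[ReplicaState], Tuple[ReplicaState, Any]]       # f : S_O -> S_O x (V ∪ {ε})
    undo: Optional[Callable[[ReplicaState], ReplicaState]] = None   # f^-1, for tentative replicas

def apply_op(q: ReplicaOperation, s: ReplicaState) -> Tuple[ReplicaState, Any]:
    """Apply q = <p, f> at state s only if its precondition holds (Definition 6)."""
    if not q.precondition(s):
        raise ValueError("precondition violated: operation not applicable at this state")
    return q.apply(s)

# Example operation: a deposit on a bank-account object (used again in Section 3.2).
def deposit(x: float) -> ReplicaOperation:
    return ReplicaOperation(
        precondition=lambda s: True,
        apply=lambda s: (ReplicaState(s.val + x, s.tags), s.val + x),
        undo=lambda s: ReplicaState(s.val - x, s.tags),
    )

if __name__ == "__main__":
    s0 = ReplicaState(val=0.0, tags={})
    s1, ret = apply_op(deposit(300), s0)
    print(s1.val, ret)   # 300.0 300.0
```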
Definition 4: A history H is defined as a sequence of operations applied to the object:
H = ∅ | f_0 f_1 f_2 … f_{n-1} | h_1 ° h_2 ° h_3 ° … ° h_n,
where ∅ is the empty history, f_i(S_i) = S_{i+1} for all i = 0 .. n-1, ° is the concatenation operator, and the h_i are sub-histories. We sometimes show the intermediate states (or values) of the object for clarity; e.g., f_0 f_1 f_2 … f_{n-1} is written as S_0 f_0 S_1 f_1 S_2 f_2 … S_{n-1} f_{n-1} S_n (or V_0 f_0 V_1 f_1 V_2 f_2 … V_{n-1} f_{n-1} V_n).

Definition 5: For each operation f in a history, we add two new functions: commit (f^c) and abort (f^a). Any operation f is considered tentative until it has either committed or aborted. Any tentative operation can be undone. After a commit operation, the effect of f becomes permanent; after an abort operation, no effect of f remains. Rules for legal histories:
1. In the canonical order, all operations are committed (f^c).
2. For eventual consistency, the allowed operations are f, f^-1, f^c and f^a, with the restrictions that:
   • no occurrence of f or f^-1 is allowed after f^c or f^a;
   • the last reference to f before f^c is f;
   • the last reference to f before f^a is f^-1.

Definition 6: A consistent history is a history in which the sequence of operations satisfies p_i(S_i) = true, ∀ i.

2.2. History reduction:
The operations of an application arrive as a stream into the system. They can, however, be reordered, both for performance optimization and for reaching a global history for the system. Thus, we define a method, called reorder, as a function reord : Q x Q -> {safe, unsafe}.

Definition 7: Formally, the reorder method is defined as follows. Let f_i and f_{i+1} be two consecutive operations in a history, and p_i, p_{i+1} the corresponding preconditions. Then:
reord(f_i, f_{i+1}) = safe, if p_{i+1}(S_i) = true, f_{i+1}(S_i) = S_{i+1}', p_i(S_{i+1}') = true and f_i(S_{i+1}') = S_{i+2};
reord(f_i, f_{i+1}) = unsafe, otherwise.

This method shows whether it is safe to apply the second operation before the first one. For example, if reord(f_i, f_{i+1}) = safe, then it is safe to apply f_{i+1} before f_i, and the final result will be the same as applying f_i first and then f_{i+1}. For any two non-consecutive operations f_i, f_j, the reorder method is defined as:
reord(f_i, f_j) = reord(f_i, f_{i+1}) ° reord(f_{i+1}, f_{i+2}) ° reord(f_{i+2}, f_{i+3}) ° … ° reord(f_{j-1}, f_j),
where reord(f_i, f_j) = safe if all the reorders in the definition above are safe, and unsafe if at least one of them is unsafe.

The definition of the reorder method is based on the set of preconditions, which are application-specific. The preconditions can express both static and dynamic constraints. Static constraints can be verified by taking into account only the operations themselves, while dynamic constraints depend heavily on the particular situation in which they are applied. For example, static constraints are of the sort 'any two reads can be interchanged', while dynamic constraints can only be verified in a specific situation, such as 'the value of the object must be lower than 100'. We now base the definition of dependencies between operations on the reorder method.

Definition 8: The dependency relation between any two operations is defined as:
dependent(f_i, f_j) = false, if reord(f_i, f_j) = safe;
dependent(f_i, f_j) = true, if reord(f_i, f_j) = unsafe.
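As an illustration of Definitions 7 and 8, the sketch below checks reorderability of two concrete operations by executing them in both orders from a given starting value. This is only an approximation of the definitions (it compares values at a single state and ignores tags); the names swap_is_safe and dependent_at are assumptions introduced here, and a real system would typically rely on static constraint tables as discussed above.

```python
from typing import Any, Callable

Op = Callable[[Any], Any]   # simplified: an operation maps a value to a value

def swap_is_safe(f_i: Op, f_next: Op, start_val: Any) -> bool:
    """reord(f_i, f_{i+1}) at a given state: safe iff applying the operations
    in either order yields the same final value (Definition 7, values only)."""
    try:
        in_order = f_next(f_i(start_val))
        swapped = f_i(f_next(start_val))
    except Exception:
        return False          # a failed precondition makes the reorder unsafe
    return in_order == swapped

def dependent_at(f_i: Op, f_j: Op, start_val: Any) -> bool:
    """Definition 8: dependent iff the reorder is unsafe."""
    return not swap_is_safe(f_i, f_j, start_val)

if __name__ == "__main__":
    deposit_200 = lambda v: v + 200
    withdraw_100 = lambda v: v - 100
    interest = lambda v: v * 1.03
    print(dependent_at(deposit_200, withdraw_100, 300))  # False: deposit and withdrawal commute
    print(dependent_at(deposit_200, interest, 300))      # True: interest does not commute with a deposit
```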
In order to show how eventual consistency is enforced, we need to define equivalence relationships between different histories.
Definition 9: The reduction operator → is defined by the following rules:
1. f f^-1 → ∅
2. f g f^-1 → g, if dependent(f, g) = false
3. if h_1 → h_2 and h_2 → h_3, then h_1 → h_3 (transitivity rule)
4. if h_1 → h_2, then h ° h_1 → h ° h_2 (prefix rule)
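A minimal sketch of how rules 1 and 2 might be applied mechanically to a history represented as a list of tokens; the helper name reduce_once and the token representation are illustrative assumptions, and the dependence test is supplied by the caller.

```python
from typing import Callable, List, Tuple

Token = Tuple[str, bool]   # (operation name, is_undo); ("f", True) stands for f^-1

def reduce_once(history: List[Token],
                dependent: Callable[[str, str], bool]) -> List[Token]:
    """Apply a single reduction step: rule 1 (f f^-1 -> empty history) or
    rule 2 (f g f^-1 -> g, when dependent(f, g) is false)."""
    for i, (name, is_undo) in enumerate(history):
        if is_undo:
            continue
        # rule 1: f immediately followed by its undo
        if i + 1 < len(history) and history[i + 1] == (name, True):
            return history[:i] + history[i + 2:]
        # rule 2: f g f^-1, where g is a forward operation independent of f
        if (i + 2 < len(history)
                and history[i + 2] == (name, True)
                and not history[i + 1][1]
                and not dependent(name, history[i + 1][0])):
            return history[:i] + [history[i + 1]] + history[i + 3:]
    return history

if __name__ == "__main__":
    h = [("f", False), ("g", False), ("f", True)]
    print(reduce_once(h, dependent=lambda a, b: False))   # [('g', False)]
```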
Theorem 1: The reduction rules preserve the final value of the object, if the starting value is the same.

Proof:
Case 1: The reduction rule follows from the definition of the undo operation: V_1 f V_2 f^-1 V_1 → V_1, meaning that applying the operation f and then rolling it back leaves the value of the object at that replica unchanged.

Case 2: The reduction rule is V_1 f V_2 g V_3 f^-1 V_4 → V_1 g V_4, where dependent(f, g) = false; that is, applying f, then applying g (which is independent of f), and then rolling back f is equivalent to applying g only. From the definition of dependencies, dependent(f, g) = false ⇒ reord(f, g) = safe. Therefore, the two operations can be commuted and the final value will still be V_3:
(1) V_1 f V_2 g V_3 → V_1 g V' f V_3.
If we undo f: V_1 g V' f V_3 f^-1 V'' (2).
Looking at the suffix of relation (2), V' f V_3 f^-1 V'', and taking into account the definition of undo ⇒ V' = V'' (3).
From the initial order (V_1 f V_2 g V_3 f^-1 V_4), we know that f^-1(V_3) = V_4, i.e., V_3 f^-1 V_4. Since operations are deterministic ⇒ V' = V_4 (4).
From (3), (4) and the transitivity of =, relation (2) becomes V_1 g V_4 f V_3 f^-1 V_4. By the definition of undo ⇒ V_1 g V_4 f V_3 f^-1 V_4 → V_1 g (V_4 f V_3 f^-1 V_4) → V_1 g V_4 (5).
From relations (1) and (5) we conclude that V_1 f V_2 g V_3 f^-1 V_4 → V_1 g V_4, which means that the final value V_4 is preserved by the reduction operator.

Case 3 (transitivity rule): h_1 → h_2 ⇒ applying h_1 results in the final value V, and applying h_2 results in the final value V (6). h_2 → h_3 ⇒ applying h_2 results in the final value V, and applying h_3 results in the final value V (7). From (6) and (7), h_1 and h_3 both result in the final value V for the object, so h_1 → h_3 preserves the final value of the replica.

Case 4 (prefix rule): Suppose the history h takes the object from initial value V_0 to final value V_1. h_1 → h_2 ⇒ starting with the same initial value, the final value will be the same; that is, if we start in V_1, h_1 and h_2 both result in a final value V_2. So h takes the object from V_0 to V_1, then h_1 takes the object from V_1 to V_2, giving the history V_0 h V_1 h_1 V_2; and h takes the object from V_0 to V_1, then h_2 takes the object from V_1 to V_2, giving the history V_0 h V_1 h_2 V_2. Therefore, the final value is preserved by applying h ° h_2 instead of h ° h_1, which means that h ° h_1 → h ° h_2 preserves the final value.
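For instance, a quick numeric check of Case 2, using the bank-account operations introduced later in Section 3.2 (an illustration only, not part of the proof):

```python
# Concrete check of Case 2: f = deposit 200, g = withdraw 100, starting at V_1 = 300.
f = lambda v: v + 200
f_inv = lambda v: v - 200
g = lambda v: v - 100

V1 = 300
V4_full = f_inv(g(f(V1)))      # history f g f^-1 : 300 -> 500 -> 400 -> 200
V4_reduced = g(V1)             # reduced history g : 300 -> 200
assert V4_full == V4_reduced   # the reduction preserves the final value
print(V4_full, V4_reduced)     # 200 200
```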
3.0. EVENTUAL CONSISTENCY For eventual consistency, we need to construct an algorithm that will reduce the history of the eventually consistent replicas to the official history of the master copy. This algorithm needs to be periodically applied, in order to keep the replicas consistent. The policy of applying the algorithm will be dictated by performance reasons. In Bayou, the user chooses the rate at which to apply the algorithm. We envision a situation where the system will adjust the rate automatically, by introspection and self-tuning. We will use the term eventual or tentative history (schedule) to denote the history at the replicas at the optimistic layer and committed or strict history (schedule) to denote the history at the strict layer. In this section, the existing Bayou algorithm for constructing eventual histories is described in terms of histories and analyzed in terms of eventual convergence. Then, a new algorithm is proposed and analyzed. The new algorithm yields better performance in the case where static constraints can be identified.
3.1. Algorithms for constructing eventual histories
The classic approach to constructing histories that converge to the canonical order is to submit the set of tentative operations to the master copy, wait for the official ordering, and then roll back all tentative operations and reapply them in the order in which they appear in the official history. This algorithm is described below.
3.1.1. Algorithm 1 (Bayou): Given a tentative history T (from a replica that observes eventual consistency) and a committed history C (from the master copy), we must construct a committed history F, such that F = T ° T' and F → C. The algorithm starts with the tentative schedule and appends a suffix to it, such that the final schedule becomes reducible to C. The idea is to apply undo and redo functions for operations already applied at the replica, such that the final state of the replica becomes the same as the final state dictated by the committed schedule. We first iterate through the tentative schedule T, undoing all tentative operations in reverse order, and then reapply all operations in the committed schedule C. Appending an operation C_i stands for applying the operation and committing it (C_i is equivalent to f_i f_i^c).

F = T
// undo all tentative operations, in reverse order
for i = n .. 1
    F = F ° { T_i^-1 }
// reapply the committed schedule, in order (each C_i is applied and committed)
for i = 1 .. m
    F = F ° { C_i }
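The following is a minimal executable sketch of this algorithm, assuming operations are represented simply by hashable identifiers and that the constructed history F is returned as a list of (operation, action) pairs; the name bayou_reconcile and the action labels "apply", "undo", "commit" are illustrative, not Bayou's actual API.

```python
from typing import Hashable, List, Tuple

def bayou_reconcile(T: List[Hashable], C: List[Hashable]) -> List[Tuple[Hashable, str]]:
    """Algorithm 1: undo every tentative operation in reverse order, then
    reapply and commit every operation of the committed schedule C."""
    F: List[Tuple[Hashable, str]] = [(op, "apply") for op in T]   # F starts as the tentative history T
    for op in reversed(T):                                        # undo T_n .. T_1
        F.append((op, "undo"))
    for op in C:                                                  # reapply C_1 .. C_m, committed
        F.append((op, "apply"))
        F.append((op, "commit"))
    return F

if __name__ == "__main__":
    T = ["D(300)", "W(100)", "I", "D(200)"]
    C = ["D(300)", "W(100)", "D(200)", "I"]
    for step in bayou_reconcile(T, C):
        print(step)
```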
Proof of correctness: We use rule 1 from the definition of the reduction operator to show that the constructed final history F reduces to C (F → C):

F = T_1 T_2 … T_{n-1} T_n T_n^-1 T_{n-1}^-1 … T_2^-1 T_1^-1 C_1 C_2 … C_m
  → T_1 T_2 … T_{n-1} (T_n T_n^-1) T_{n-1}^-1 … T_2^-1 T_1^-1 C_1 C_2 … C_m
  → T_1 T_2 … T_{n-1} (∅) T_{n-1}^-1 … T_2^-1 T_1^-1 C_1 C_2 … C_m
  → T_1 T_2 … (T_{n-1} T_{n-1}^-1) … T_2^-1 T_1^-1 C_1 C_2 … C_m
  → T_1 T_2 … (∅) … T_2^-1 T_1^-1 C_1 C_2 … C_m
  …
  → (T_1 T_1^-1) C_1 C_2 … C_m
  → (∅) C_1 C_2 … C_m
  → C_1 C_2 … C_m = C

The drawback of the above algorithm is that all tentative work is lost, even if some of the operations were already in the same order as in the official history. We observe that when an operation must be rolled back, all independent operations that follow it can be kept in place (according to rule 2 for reduction). If the independent operations have already been applied in the tentative schedule T in the correct order (the order of C), then they can be committed directly. Thus, the enhanced algorithm saves work by skipping the undoing and redoing of such operations.
3.1.2. Algorithm 2 (improved for static constraints): Given a tentative history T and a committed history C, we must construct a committed history F, such that F = T ° T' and F → C. Intuitively, we start with the tentative history and append undo's and redo's such that the final history becomes reducible to C. We compare each element in T with the elements in C. If they are the same, the operation can be committed and we iterate further. If not, there are two cases: either the operation in C was previously undone in T, in which case it must be redone and committed now, or the operation in T was not previously undone and has not yet appeared in C. In the latter case, the current operation in T and all of its dependents must be rolled back. We use two intermediary sets, D and U, where D is the set of operations dependent on an operation T_i, and U is the set of undone operations.

F = T                          // F is the final history
U = ∅                          // set of undone operations
k = 1                          // index in C (1 .. m)
i = 1                          // index in T (1 .. n)
while (i ≤ n)
    if T_i ∈ U                 // skip undone operations
        i++
    else if T_i = C_k          // T_i can be committed
        F = F ° { T_i^c }
        k++
        i++
    else if C_k ∈ U            // the operation was undone; it must be reapplied here
        F = F ° { C_k C_k^c }
        U = U \ { C_k }
        k++
    else
        // form the set D of dependents of T_i
        D = ∅
        for j = i .. n
            if dependent(T_i, T_j) = true and T_j ∉ U
                D = D ∪ { T_j }
        // undo all dependent operations, in reverse order
        for j = n .. i
            if T_j ∈ D
                F = F ° { T_j^-1 }
                U = U ∪ { T_j }
        // roll back T_i itself (if it has not already been undone above)
        if T_i ∉ U
            F = F ° { T_i^-1 }
            U = U ∪ { T_i }
        i++
// end while

// after processing the entire T, reapply all undone operations that appear in C,
// in the order given by C (each C_l is applied and committed)
for l = k .. m
    F = F ° { C_l }
    U = U \ { C_l }

// abort all operations that have been undone and never redone
// (operations that appear in T but do not appear in C)
while (U ≠ ∅)
    choose f ∈ U
    F = F ° { f^a }
    U = U \ { f }

The proof of correctness, showing that the above algorithm constructs a history that can be reduced to C (F → C), is based on the fact that at each reduction step the resulting history (F at step k) can be divided into two parts: a sub-history that reduces to a prefix of C, and some additional junk. The second part shrinks at each step, while the first part becomes equivalent to a longer prefix of C at each step. In the end, F becomes a history reducible to C. For space reasons, we do not reproduce the entire proof in this paper; the full proof can be found in [13].
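The following is a minimal Python sketch of Algorithm 2, under the same representation as the sketch of Algorithm 1 (operations as hashable identifiers, F as a list of (operation, action) pairs); the name reconcile_with_dependencies and the dependent parameter are illustrative assumptions, not the paper's notation.

```python
from typing import Callable, Hashable, List, Set, Tuple

def reconcile_with_dependencies(
        T: List[Hashable], C: List[Hashable],
        dependent: Callable[[Hashable, Hashable], bool]) -> List[Tuple[Hashable, str]]:
    """Algorithm 2: keep tentative operations that already match the committed
    order; roll back only a mismatching operation and its dependents."""
    F: List[Tuple[Hashable, str]] = [(op, "apply") for op in T]   # F starts as the tentative history T
    U: Set[Hashable] = set()          # undone operations
    k, i = 0, 0                       # indices into C and T (0-based here)
    while i < len(T):
        if T[i] in U:                            # skip operations already undone
            i += 1
        elif k < len(C) and T[i] == C[k]:        # T_i matches the committed order: commit it
            F.append((T[i], "commit"))
            k += 1
            i += 1
        elif k < len(C) and C[k] in U:           # committed operation was undone earlier: redo and commit
            F.append((C[k], "apply"))
            F.append((C[k], "commit"))
            U.discard(C[k])
            k += 1
        else:                                    # mismatch: roll back T_i and all its dependents
            D = {T[j] for j in range(i, len(T))
                 if dependent(T[i], T[j]) and T[j] not in U}
            for j in range(len(T) - 1, i - 1, -1):   # undo dependents in reverse order
                if T[j] in D:
                    F.append((T[j], "undo"))
                    U.add(T[j])
            if T[i] not in U:                        # roll back T_i itself
                F.append((T[i], "undo"))
                U.add(T[i])
            i += 1
    for op in C[k:]:                  # reapply remaining committed operations, in C's order
        F.append((op, "apply"))
        F.append((op, "commit"))
        U.discard(op)
    for op in list(U):                # abort undone operations that never reappear in C
        F.append((op, "abort"))
        U.discard(op)
    return F

if __name__ == "__main__":
    T = ["D(300)", "W(100)", "I", "D(200)"]
    C = ["D(300)", "W(100)", "D(200)", "I"]
    dep = lambda a, b: "I" in (a[0], b[0])   # interest does not commute with deposits/withdrawals
    for step in reconcile_with_dependencies(T, C, dep):
        print(step)
```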
3.1.3. Performance of algorithms
In terms of performance, Algorithm 2 yields better results than Algorithm 1 when static constraints are present. If an application can be described in terms of static constraints, then Algorithm 2 takes them into account when undoing operations and thus avoids unnecessary applications of undo functions. In the best and worst cases, both algorithms require O(1) and O(n) undo and redo operations respectively, where n is the number of operations in T; on average, however, the second algorithm performs better, since it skips undoing and redoing operations that are independent of the undone ones. A simulation will be used to better understand the performance of the new algorithm.

3.2. Examples
Let us consider the example of a bank account. The object considered for consistency is the balance of the account; the state of the object contains real values. The following operations are defined for this object:
• deposit x, denoted D(x): val = val + x
• withdraw x, denoted W(x): val = val - x
• add 3% interest, denoted I: val = val * 1.03
The following static constraints can be defined for these operations:
1. dependent(D, D) = false
2. dependent(W, W) = false
3. dependent(D, W) = false
4. dependent(D, I) = true
5. dependent(W, I) = true
The user creates the following scenario: T = D(300) W(100) I D(200). The bank applies the operations in the order C = D(300) W(100) D(200) I. At reconciliation time, Algorithm 1 constructs the following final history:
F = D(300) W(100) I D(200) D(200)^-1 I^-1 W(100)^-1 D(300)^-1 D(300) W(100) D(200) I
Algorithm 2 takes into account the static dependency between I (add 3% interest) and D(200) (the last deposit), as well as the independence between a deposit and a withdrawal (D(200) and W(100)) and between any two deposits (D(200) and D(300)). It therefore constructs the following final history:
F = D(300) W(100) I D(200) D(200)^-1 I^-1 D(200) I
In this example, the performance gain comes from the fact that the first two operations, D(300) and W(100), are not undone and later redone. The number of operations applied to the object is thus reduced by 4.
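As a quick sanity check of this example, the short script below (an illustration, not part of the paper's algorithms) replays the tentative order T, the committed order C, and the final history produced by Algorithm 2 on a balance starting at 0, and confirms that the reconciled replica converges to the canonical value.

```python
# Replay the bank-account example; undo is modeled as the arithmetic inverse.
ops = {
    "D300": lambda v: v + 300, "D300^-1": lambda v: v - 300,
    "W100": lambda v: v - 100, "W100^-1": lambda v: v + 100,
    "D200": lambda v: v + 200, "D200^-1": lambda v: v - 200,
    "I": lambda v: v * 1.03,   "I^-1": lambda v: v / 1.03,
}

def replay(names, start=0.0):
    v = start
    for name in names:
        v = ops[name](v)
    return v

tentative = ["D300", "W100", "I", "D200"]
committed = ["D300", "W100", "D200", "I"]
# Final history constructed by Algorithm 2 in the example above:
final_alg2 = ["D300", "W100", "I", "D200", "D200^-1", "I^-1", "D200", "I"]

print(replay(tentative))   # 406.0 (tentative value before reconciliation)
print(replay(committed))   # 412.0 (canonical value)
print(replay(final_alg2))  # 412.0 (reconciled replica converges to the canonical value)
```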
4.0. CONCLUSIONS AND FUTURE WORK
This paper has presented two algorithms for eventual consistency and demonstrated their correctness using a general model for replication consistency. The improved algorithm exploits operation independence to reduce the number of operations that must be undone to bring a replica consistent with the master copy. The formal model makes precise the notion of undoing operations, which has heretofore been described only algorithmically.

We are exploring the usefulness of the model for proving and explaining a wide range of consistency semantics for replication. Future work will analyze how to further take advantage of operation independence to develop more efficient algorithms for replication. Moreover, we will take dynamic constraints into consideration, not only static ones, and will explore ways to prove static constraints. A prototype replication middleware is under development; this middleware will support a wide range of consistency semantics that will be proven correct with the formal model.
5.0. REFERENCES
[1] H. Yu and A. Vahdat, Design and evaluation of a continuous consistency model for replicated services. Proceedings of Operating Systems Design and Implementation, San Diego, California, October 2000.
[2] M. Herlihy and J. Wing, Linearizability: a correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, Volume 12, No. 3, July 1990, pg 463-492.
[3] T. E. Anderson, M. D. Dahlin, J. M. Neefe, D. A. Patterson, D. S. Roselli and R. Y. Wang, Serverless network file systems. ACM Transactions on Computer Systems, Volume 14, No. 1, February 1996, pg 41-79.
[4] W. J. Bolosky, J. R. Douceur, D. Ely and M. Theimer, Feasibility of a serverless distributed file system deployed on an existing set of desktop PCs. Proceedings of the International Conference on Measurement and Modeling of Computer Systems, Santa Clara, California, June 2000, pg 34-43.
[5] A. Vahdat, P. Eastham and T. Anderson, WebFS: a global cache coherent file system. UC Berkeley Technical Report, December 1996.
[6] D. B. Terry, M. M. Theimer, K. Petersen, A. J. Demers, M. J. Spreitzer and C. Hauser, Managing update conflicts in Bayou, a weakly connected replicated storage system. Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP-15), Copper Mountain, Colorado, December 1995, pg 172-183.
[7] J. Gray, P. Helland, P. O'Neil and D. Shasha, The dangers of replication and a solution. Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Canada, June 1996, pg 173-182.
[8] S. Frolund and R. Guerraoui, X-ability: a theory of replication. Proceedings of the 19th ACM Symposium on Principles of Distributed Computing, Portland, Oregon, July 2000, pg 229-237.
[9] P. A. Bernstein, V. Hadzilacos and N. Goodman, Concurrency control and recovery in database systems. Reading, Massachusetts: Addison-Wesley, 1987.
[10] A. Demers, K. Petersen, M. Spreitzer, D. Terry, M. Theimer and B. Welch, The Bayou architecture: support for data sharing among mobile users. Proceedings of the IEEE Workshop on Mobile Computing Systems and Applications, Santa Cruz, California, December 1994.
[11] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart and D. Terry, Epidemic algorithms for replicated database maintenance. Proceedings of the ACM Symposium on Principles of Distributed Computing, Vancouver, Canada, August 1987, pg 1-12.
[12] A. Kermarrec, A. Rowstron, M. Shapiro and P. Druschel, The IceCube approach to the reconciliation of divergent replicas. Proceedings of the Twentieth ACM Symposium on Principles of Distributed Computing (PODC), Newport, Rhode Island, August 2001.
[13] A.-M. Bosneag, Global coordination of replicated data in wide-area systems. Technical Report, Computer Science Department, Wayne State University, May 2002.