Optimistic Total Order in Wide Area Networks
Optimistic Total Order in Wide Area Networks∗
António SOUSA , José PEREIRA, Francisco MOURA, Rui OLIVEIRA
Universidade do Minho, Portugal {als,jop,fsm,rco}@di.uminho.pt
Abstract
Total order multicast greatly simplifies the implementa- tion of fault-tolerant services using the replicated state ma- chine approach. The additional latency of total ordering can be masked by taking advantage of spontaneous order- ing observed in LANs: A tentative delivery allows the ap- plication to proceed in parallel with the ordering protocol. The effectiveness of the technique rests on the optimistic as- sumption that a large share of correctly ordered tentative deliveries offsets the cost of undoing the effect of mistakes.
This paper proposes a simple technique which enables the usage of optimistic delivery also in WANs with much larger transmission delays where the optimistic assumption does not normally hold. Our proposal exploits local clocks and the stability of network delays to reduce the mistakes in the ordering of tentative deliveries. An experimental evalu- ation of a modified sequencer-based protocol is presented, illustrating the usefulness of the approach in fault-tolerant database management.
1. Introduction
Total order multicast greatly simplifies the implementa- tion of fault-tolerant services using the replicated state ma- chine approach [25]. By ensuring that deterministic replicas handle the very same sequence of requests from clients, it is ensured that the state is kept consistent and the interaction with clients is serializable [12]. A particularly interesting application is the database state machine [21] which allows high performance replication of transactional databases.
Implementation of total order multicast is however more costly than other forms of multicast due to the unavoidable additional latency. For instance, in a sequencer based pro- tocol [5, 16] all processes (except the sequencer itself) have to wait for the message to reach the sequencer and for the sequence number to travel back before the message can be delivered.
On the other hand, protocols based on causal history [18, 23, 10] can provide latency proportional to the interarrival delay of each sender and thus lower latency than sequencer based protocols. However, when each sender has a large in-
∗Research supported by FCT, ESCADA proj (POSI/33792/CHS/2000).
terarrival time and low latency is desired, this requires the introduction of additional control messages. This is espe- cially unfortunate in large groups and in wide area networks with limited bandwidth links.
In some protocols, such as those based on consensus [7, 4] or on a sequencer [5, 16], the total order decided is the spontaneous ordering of messages as observed by some pro- cess. In addition, in local area networks (LANs) it can be observed that the spontaneous order of messages is often the same in all processes. The latency of total order pro- tocols can therefore be masked (not reduced) by tentatively delivering messages based on spontaneous ordering, thus allowing the application to proceed the computation in par- allel with the ordering protocol [17]. Later, when the total order is established and if it confirms the optimistic order- ing, the application can immediately use the results of the optimistic computation. If not, it must undo the effects of the computation and restart it using the correct ordering.
The effectiveness of the technique rests on the assumption that a large share of correctly ordered tentative deliveries offsets the cost of undoing the effects of mistakes. This is unfortunate as this makes optimistic delivery useful only in LANs where the latency is much less of a problem than in wide area networks (WANs).
This paper proposes a simple protocol which enables op- timistic total order to be used in WANs with much larger transmission delays where the optimistic assumption does not normally hold. Our proposal exploits local clocks and the stability of network delays to reduce the mistakes in the ordering of tentative deliveries by compensating the vari- ability of transmission delays. This allows protocols which are based on spontaneous ordering to fulfill the optimistic assumption and thus mask the latency.
An experimental evaluation of the technique is presented using a sequencer-based protocol, illustrating the useful- ness of the approach in fault-tolerant database management. When applied to the sequencer based protocol, our tech- nique does not introduce additional messages. The only overhead is that of an additional integer piggybacked on data messages. This compares favorably with both plain sequencer based and causal history based protocols.
The paper is structured as follows. The next section re- calls the problems of total order and optimistic total order multicasts, as well as the reasons preventing spontaneous
1
total order in wide are networks. Section 3 introduces the intuition underlying our proposal and presents a protocol providing optimistic delivery of messages based on a fixed- sequencer total order multicast protocol. In Section 4 we evaluate the performance gains of our approach. In Sec- tion 5 we discuss the paper contribution in general settings as well as applied to a specific application. Section 6 con- cludes the paper.
2. Background
2.1. Totally ordered multicast
Informally, totally ordered multicast (or atomic multicast) ensures that no pair of messages is delivered to distinct des- tination processes in different order. Totally ordered multi- cast greatly simplifies the implementation of fault-tolerant services using the replicated state machine (or active repli- cation) approach [25, 12]: By delivering exactly the same messages in the same order to a set of deterministic repli- cas, their internal state is kept consistent.
More formally, we consider an asynchronous message passing system composed of a finite set of sequential pro- cesses communicating over a fully connected reliable point- to-point network [7]. Processes do not have access to shared memory or to a global clock. A process may only fail by crashing and once a process crashes it does not recover. A process that never crashes is said correct. Totally ordered multicast is defined by primitives to-multicast(m) and to- deliver(m), and satisfies the following properties [14]: Validity. If a correct process to-multicasts a message m,
then it eventually to-delivers m. Agreement. If a correct process to-delivers a message m,
then every correct process eventually to-delivers m. Integrity. For every message m, every process to-delivers
m at most once, and only if m was previously to- multicast.
Total Order. If two correct processes to-deliver two mes- sages m and m′, then they do so in the same order.
Total order multicast has been shown to be equivalent to the generic agreement problem of consensus [7]. Therefore we must assume that in our system the consensus problem is solvable [11, 7], requiring that a majority of processes is correct and that failure detection is of class �S [6]. In some protocols, consensus is explicitly invoked to decide the message sequence [7, 4]. In others, consensus is implicit in a group membership service which supports the actual ordering protocol [5, 15, 10].
There is a plethora of total order protocols for asyn- chronous message passing systems which can be classi- fied according to several criteria [9]. Namely, some or- der the message while disseminating it [3, 2, 15]. Others take advantage of an existing unordered multicast proto- col [5, 16, 7] and work in two stages: First, messages are
p1 • seq(m1) %%
seq(m1) &&
//
p3 • DELIV(m1) //
p2 MCAST(m1)
66
<<
77oooo
• DELIV(m1) //
Figure 1. Sequencer based total order protocol.
disseminated using a reliable multicast protocol Then, an ordering protocol is run to decide which is the correct de- livery sequence of buffered messages. This results in addi- tional latency, when compared to reliable multicast.
An example of a protocol often used in group commu- nication toolkits is the sequencer [5, 16], which uses con- sensus implicitly in the view-synchronous reliable multi- cast protocol used to disseminate messages previously to ordering them. As depicted in Figure 1, a data message is disseminated using unordered reliable multicast. Upon re- ception (depicted as a solid dot), the message is buffered until a sequence number for it is obtained. A single pro- cess (p1 in the example) is designated as the sequencer: it increments a counter and multicasts its value along with the original message’s identification to all receivers as a control message. Data messages are then delivered according to the sequence numbers. A group membership protocol is used to ensure that for any given data message there is exactly one active sequencer.
Besides being a very simple protocol, it offers several ad- vantages, especially in networks with limited bandwidth or in large groups with large and variable message interarrival times: it requires at most a single additional control mes- sage for each data message and any message can always be delivered after two successive message transmission de- lays. The basic protocol is also easily modified to cope with higher message throughput by batching sequence numbers for several messages in a single one [5], reducing the num- ber of control messages at the expense of higher latency.
2.2. Optimistic total order
A reliable multicast protocol can deliver a message af- ter a single transmission delay from the originator to the receiver. This contrasts with the latency of totally ordered multicast1 which is either twice as large, when using a se- quencer based protocol, or proportional to message interar- rival delay in protocols using causal history. However:
• Some protocols, such as the sequencer, produce an or- dering which is the spontaneous ordering observed by some process.
• In local area networks, it can be observed that the spon- taneous ordering of message reception of all processes
1Except in the degenerate situation where a single process is multicas- ting and it can assume the sequencer role.
2
is often very similar, therefore, similar to the final or- dering decided by the sequencer.
Nevertheless, delivery incurs always in the additional la- tency. The optimistic atomic broadcast protocol [22] takes this in consideration to improve average delivery latency of a consensus based total order protocol.
Further latency improvements can be obtained if the ap- plication itself can take advantage of a tentatively ordered delivery. This is called optimistic delivery [17, 26] as it rests on the optimistic assumption that reliable multicast sponta- neously orders messages. It also implies that eventually an authoritative total order is determined, leading to a confir- mation or correction of previously used delivery order. To the interval between the optimistic delivery and the author- itative delivery we call optimistic window. It is during this interval that the application can optimistically do some pro- cessing in advance.
To define optimistic total order multicast we use two different delivery primitives an optimistic opt-deliver(m) that delivers messages in a tentative order and a final fnl- deliver(m) that delivers the messages in their final, or au- thoritative, order. Optimistic total order multicast satisfies the following properties [26]:
Validity. If a correct process to-multicasts a message m, then it eventually fnl-delivers m.
Agreement. If a correct process fnl-delivers a message m, then every correct process eventually fnl-delivers m.
Integrity. For every message m, every process opt- delivers m only if m was previously multicast; and every process fnl-delivers m only once, and only if m was previously multicast.
Local Order. No process opt-delivers a message m after having fnl-delivered m.
Total Order. If two processes fnl-deliver two messages m and m′, then they do so in the same order.
An example of such an application is the database state machine [21] which allows high performance replication of transactional databases and works as follows: transactions are executed optimistically by any of the replicas without locking. The resulting read and write sets are then multicast to all replicas which perform a deterministic certification to ensure that the transaction does not conflict with concur- rent transactions already committed. Total order multicast is used to ensure that the result of the certification process is identical in all replicas, thus ensuring consistency. If the or- der of messages is known in advance by optimistic delivery, this can be used to speed up the certification [17].
Notice that if the optimistic ordering turns out to be wrong, the application has to undo the effect of any pro- cessing it might have done. Therefore, the net advantage of optimistic delivery depends on the balance between the cost of a mistake and the ratio of correctly ordered optimistic de- liveries. In the database state machine, being able to undo
the effects of optimistic delivery just means than the trans- action cannot be effectively committed until authoritative delivery. When the optimistic delivery is wrong, there is a performance penalty: The processing resources used have been wasted.
The tradeoff is thus similar to the one involved in the de- sign of cache memories. However, the protocol designer has no possibility to reduce the cost of a mistake, as this depends solely on the application. The only option is thus to try to maximize the amount of messages which are deliv- ered early but correctly ordered.
2.3. Obstacles to spontaneous total order
A high ratio of spontaneously totally ordered messages which results in good performance of optimistic applica- tions is not trivially achieved, especially in wide area net- works. One reason for this is loopback optimization in the operating system’s network stack. Noticing that the out- going packet is also to be delivered locally, the operating system may use loopback at higher layers of the protocol stack and immediately queue the message for delivery. This allows it to be delivered in advance of packets from other senders which have reached the network first.
Another reason for out of order delivery lies in the net- work itself. Although not frequent, there is a possibility that packets are lost by some but not all destinations. A reliable multicast protocol detects the occurrence and issues a re- transmission. However, the delay introduced opens up the possibility of other packets being successfully transmitted while retransmission is being performed.
An additional issue is the complexity of the network topology. Different packets can be routed by different paths, being therefore subject to different queuing delays or even to being dropped by congested routers. This is especially noteworthy when there are multiple senders. Receivers which are nearer, in terms of hops, to one of them will re- ceive its messages first. Receivers which are nearer of an- other will possibly receive messages in the opposite order.
Notice however that bad spontaneous order in wide area networks is not attributable to large delays themselves, but to the fact that the delays to different destinations are likely to be different, often by two orders of magnitude. Consider Figure 2(a). Messages m1 and m2 are multicast to three different processes, including the senders themselves. The time taken to transmit each message varies with the recipi- ent, for instance, transmission to the sender itself (typically hundreds of microseconds by loopback) takes less time than transmission to other processes (typically up to tens of mil- liseconds over a long distance link). The result is that pro- cess p1 spontaneously orders message m1 first while p2 and p3 deliver m2 first.
Figure 2(b) shows a similar example where message transmission delays are longer but where is it more likely