B649 (Spring 1999)
Fault-Tolerance and Security of Distributed Systems
Table of Contents
Lectures: Tue and Thu, 5:30pm - 6:45pm, Ballantine Hall 215.
Office Hours: Tue and Thu, 11am - noon. Lindley Hall 201C.
- Overview of communication and failure models.
- Logical clocks, causality, consistent states, snapshot algorithms.
- Leader election.
- Group communication, including reliable multicast.
- Distributed databases. Replication.
- Security. Cryptography, authentication.
- Electronic commerce. Digital cash.
These topics are representative: we won't have time to discuss all of
them, and we might cover some topics not listed here.
The overlap with Spring 1997's B649 is relatively small (less than 15%).
- Publish-subscribe paradigm.
- Real-time systems. Analyzing worst-case response time, etc.
- Primary-backup approach, checkpointing.
- State-machine approach.
- Derivation and optimization of protocols and distributed algorithms
- Knowledge in distributed systems
- Distributed shared memory; distributed objects
- On-line voting
- Automated analysis and verification.
Several of the readings will be conference or journal articles. Most of
the course will not follow a textbook. The following "recommended book"
contains some relevant and useful background material, so you might want
to buy it (and read it).
Another book that covers similar material is
In addition to homeworks and exams, students will do a course project.
One or more suggestions will be made; students are encouraged to suggest
their own projects as well. Projects may be theoretical or
implementation-oriented (ideally both!).
Grades will be based on homework, two exams, and the project.
The weights are:
Be sure to read and follow the Computer Science
Department Statement on Academic Integrity.
You are free to discuss homework with everyone, but you must write and
submit your own solutions. Homework will be graded on the basis of
correctness and clarity. Homework submitted late is subject to a penalty
of up to 20% per day. Furthermore, no credit will be given for
homework submitted after solutions have been posted. Solutions to
homework will usually be posted sometime during the day after the
assignment is due.
By default, all assignments in this course are due by the end (midnight)
of the specified day. To submit a problem set, either email it to firstname.lastname@example.org or put
hardcopy under the door of LH201C or in the homework drop box near Lindley
215. If you use the homework drop box, be sure the instructor's name
appears on your problem set, so the staff knows to whom to give it. If
you submit electronically, please use ASCII, PostScript, or PDF; Microsoft
Word (.doc) files are less convenient.
Exams are potentially cumulative, e.g., exam2 might contain questions
about material covered on exam1. On the other hand, it is reasonable to
expect that exam2 will focus on material not covered on exam1.
Exams are "open book"; specifically, you may use course handouts, your
notes, your homeworks, and the recommended textbook (CDK). Of course, the
exams will be on material covered in lecture or in course handouts, so it
probably does not matter much whether you have the textbook. You may
not use someone else's notes or homeworks or mechanical
reproductions of all or part thereof. You may not use other
textbooks or mechanical reproductions of all or part thereof. You may use
a dictionary, if you like.
The course newsgroup is ac.csci.b649. If
you have questions, this is a good place to ask them, so that other
students can also benefit. If you know the answer to someone's question,
post it! This will help your classmates and show us that you know what's
|1-2||Intro. Failure models. Communication paradigms.
|3-5||Logical Time. Snapshots.
|8-10||Group Communication in Amoeba.
|22-26||Electronic Payment. Digital Cash.
Virtual Handout: We will discuss the following paper:
A related paper that you might want to look at is:
- S. H. Low, N. F. Maxemchuk and S. Paul. Anonymous Credit Cards. In
Proceedings of the 2nd ACM Conference on Computer and Communictions
Security, November 1994.
- S. H. Low, N. F. Maxemchuk and S. Paul. Anonymous Credit Cards and
Their Collusion Analysis. IEEE/ACM Transactions on Networking
4(6), pp. 809-816, December 1996. Winner of IEEE William R. Bennett Prize
for Best Original Paper in IEEE/ACM Transactions on Networking in 1996.
Course Eval: Please fill out the CS Dept's official On-line
Course Evaluation. Your comments and suggestions would be
Project: For people doing the voting project... It would be
helpful to include (preferably in your final report) brief descriptions of
all the non-trivial testcases you tried and their outcomes; as
substantiation, you might consider including pointers to TESTCASE files
(as in P536) containing transcripts of those testcases, or to scripts that
run those testcases. Also... The grading guidelines said: "Additional
extra credit: demo items 3-6 for n=7, b=2, f=2, V=8." It makes more sense
to take n=9 here.
Exam: Grading statistics for exam2 have been added above.
Exam: Here is a Exam2 Solution.
Project: Final reports are due on Sat night, 24 Apr. Your final
report should explain and justify your design decisions, and describe
significant aspects of your data structures and code (if your project
involves coding). Your report should clearly distinguish between what you
actually achieved and what else (if anything) you would have liked to
achieve if more time had been available. If your code does not work in
certain situations, your report should say so. For people doing the
voting project, you can assume the reader of your report is familiar with
the Phalanx paper. Finally, if you have a teammate, recall (from the
project page): The status report and final report must describe the
contributions of each team member. This information may be integrated
into the body of the report or placed in an appendix.
Homework: Grading statistics for hw8 and hw9 have been added above.
Crypto in Java: The crypto package I mentioned in class yesterday
in case anyone is interested. If you get it to work, let me know.
Demo Schedule: Here is the Demo Schedule.
If you haven't already signed up, please choose a free timeslot and
Practice Problems: If you have questions about (the answer to) a
practice problem, please post your question to the newsgroup. One of your
classmates (perhaps the author of the problem) might be able to answer
your question more promptly than I can.
Practice Problems: Here are the Practice
Problems from Homework 9 with Solutions.
Practice Problem: Here is the Second
"Missing" Practice Problem from Homework 9. I also added it to Practice Problems from Homework 9.
Exam: Exam2 covers material up to and including Topic VIII
(Digital Cash), Section 3 (An e-cash protocol) in the lecture notes.
Practice Problem: Here is a "Missing"
Practice Problem from Homework 9. I also added it to Practice Problems from Homework 9.
Homework: Here is a Homework 8
Practice Problems: Here are the Practice
Problems from Homework 9.
Handout: The following paper was distributed during lecture
Homework: Grading statistics for hw7 have been added above.
Homework: Here is Homework 9.
Project: For people working on the voting project... Here is a
draft of Grading Guidelines for Voting
Reading: Read (at least) sections 3.1-3.3, 3.5-3.7, and 3.11-3.12
of the crypto handout (chapter 3 of O'Mahony et al.).
Project: When you submit your final report (on Apr 24), make all
files related to your report readable by Scott and Ying, and include a
pointer to those files in your report or in an email. People doing the
voting project should use ACLs, instead of making their files
world-readable. People doing other projects are free to use either
Project: By default, demos will be held in Ying's office (LH 310).
Homework: Here is a Homework 7
Handout: Today's handout is chapter 3 of Donal O'Mahony, Michael
Peirce, and Hitesh Tewari, Electronic Payment Systems, Artech
House, Inc., Boston, 1997. If you didn't get a copy, the leftovers are in
the plastic stand outside my office door.
Project: For people working on the voting project... Some teams
seem to have some confusion about the following point. If a client
performing a read or write operation does not receive responses from an
entire quorum, then the operation did not succeed. The client must
retry those servers or try a different quorum.
Homework: Homework 8 is distributed today
Reading: Read Bernstein et al., chapters 7 and 8, except sections
7.5, 8.5-8.7, and 8.10. If you didn't receive the handout containing
chapters 7 and 8 of Bernstein et al., please print one off the web, or
photocopy one from the copy of the book on reserve in Swain Hall Library.
Homework: Homework 7 is distributed today
Schedule Change: We will meet 2pm-3:15pm on Sat, 10 Apr, in our
usual location (BH 215).
Project: Reminder: Your status report is due on Mon, 5 Apr.
Project: Everyone should be working on their projects this week.
Reading: Photocopies of Chapters 7 and 8 of Bernstein et al. are
available from a container outside my office door. Please take a copy and
start reading chapter 7.
ENJOY THE SPRING BREAK!
Exam: I added the following sentence to the solution to problem 2
of exam1: "If this correspondence is not ensured, a message might be
delivered more than once (in scenarios in which a server crashes and
recovers multiple times)." Constructing such a scenario is a good
exercise (if you didn't do it already during the exam).
Exam: Here is an Exam 1 Solution.
Grading statistics for exam1 have been added above. If you want to pick
up your exam booklet tomorrow (Friday), please come between 3pm and 4pm.
If you cannot come then, and you would like me to give your exam booklet
to a friend, email me their name.
Homework: I just added the following note to the solution to hw6,
problem 2: "This solution assumes that locks are granted one at a time."
I believe this assumption holds for typical implementations of Strict 2PL.
If anyone has evidence to the contrary, I'd be interested to see it.
Schedule: I will be at a conference during the week after spring
break, so b649 is cancelled on 23 Mar and 25 Mar. We will try to
re-schedule those classes sometime later in the semester.
Homework: I just updated the solution to hw6, problem 2, as
follows. If FCFS order (for granting of locks) is not acceptable for the
applications at hand (e.g., a priority-based order is required), then
adding Tk-->Ti is better than adding
Tk-->Tj. Thanks to Yu Ma for pointing this out.
Grading: I updated the Grading table to show
the breakdown between homework and project.
Homework: If you lost points on hw5, problem 1, because your
histories were not complete, please put your homework in Ying's mailbox,
and it will be re-graded.
Exam: As mentioned in lecture on 2 Mar, exam1 covers material up to
and including the topics discussed on 2 Mar, i.e., up to and including
Topic VI (Transactions), Section 3.4 (Strict 2PL) in the lecture notes.
Exam: It is a good idea to bring all of the course handouts to the
Homework: Here is a Homework 5
Handout: Photocopies of chapter 6 of Bernstein et al. are available
in a container outside my office door.
Homework: Homework 6 was distributed today
in class. It is due Sunday night, 7 Mar. It must be submitted by
email. Homework 6 Solution will become
readable at 8am on 8 Mar.
Handout: Photocopies of chapter 3 of Bernstein et al. are available
in a container outside my office door.
Voting Project: For teams working on this project, your status
report and final report must justify the design of your system. Even if
you end up implementing Dalia and Mike's design with no modifications,
your report should analyze that design and discuss its strengths and
Turing Award: Jim Gray is the winner of the 1998 A. M. Turing
Award, for his contributions to database and transaction processing. His
contributions include seminal work on 2 phase locking and 2 phase commit,
which we will be studying in the next few weeks.
Homework: The due date for hw5 has been postponed to 4 March.
If you have already submitted hw5, you are welcome to re-submit it later,
if your answers change.
On-line Book: The entire book by Bernstein et al. is available here.
Handout: Chapter 2 of Bernstein et al. should be available from a
box outside my door by about 4pm tomorrow (Friday).
Homework: Homework 5 was distributed
today in class.
Handout: Chapter 1 of the following book was distributed in lecture
on 16 Feb:
Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman.
Concurrency Control and Recovery in Database Systems.
Group Communication: Frans Kaashoek is mailing me a copy of his
Ph.D. thesis. I'll let you know if it provides more insight.
Correction: I said in lecture yesterday that the read and write
operations in a transaction do not include accesses to auxiliary data
structures, such as indices. On second thought, it seems much more
sensible to include those accesses as part of the transaction, so that the
transaction properties (especially serializability) apply to them. Sorry
for the confusion.
Extra Credit: Correct pseudo-code for the Amoeba membership-change
algorithm and an informal argument that the code satisfies the
requirements outlined informally in lecture (and in the paper on Amoeba)
would be extra credit worth approximately as much as 2 typical homework
assignments (this is equivalent to getting 200% extra credit on a typical
homework assignment). You can submit this any time before April 1. You
are free work on this in a team of 2 or 3, in which case you'll each get
2/3 or 1/2 as much extra credit, respectively.
Group Communication: If I had realized sooner how vague the
description of Amoeba's membership change algorithm is, I would have
discussed the Transis group communication system instead: it is somewhat
more complicated (partly because it has more functionality), but at least
the presentation is clear and detailed. If you are interested, here is a
paper about Transis:
Warning: I believe there are some technical problems with the
specification of the membership service (Requisites M.1-M.5) in this
paper. Don't let that discourage you; this paper also contains lots of
good stuff. :-)
Danny Dolev, Dalia Malki and Ray Strong. An Asynchronous
Membership Protocol that Tolerates Partitions. Research Report RJ
9727 (84359), IBM Research Division, IBM Almaden Research Center, March
1994. Also available as Technical Report TR94-6, Institute of Computer
Science, The Hebrew University of Jerusalem, March 1994.
Clarification: My solution to hw4 uses a clique-based spec, instead
of an independent-set-based spec, because it is easy to show that a
clique-based spec satisfies the desiderata mentioned in lecture, in
particular, that the spec is satisfied by algs that (unlike the Invit Alg)
always put disconnected nodes in different groups. An
independent-set-based spec might also have this desirable property, but
it's not obvious (to me). Does anyone have a proof of this?
Homework: Here is a Solution
to Homework 4, Problem 1. For problem 2, here are some typos/errors
in Chow and Johnson's presentation of the Invitation Alg:
- In Alg Listing 10.23, the "wait up to T seconds" statement should
not be indented, i.e., should be outside the scope of the "for" loop.
Similarly for the "wait up to T seconds for Ready_answer messages"
statement in Algorithm Listing 10.26.
- In Algorithm Listing 10.24, both occurrences of "AYC_Answer" should be
changed to "AYT_Answer".
Project: Please notice the due dates at the top of the project
page. Your assignment for the coming week is to choose a project and form
Project: Information about the course project is available via the
Project Page link in the Table of
Exams: Exam1 will be on Thu, 11 Mar. Exam2 will be on Thu, 22 Apr.
Both exams will be "in class". If you have a schedule conflict with these
dates, let me know immediately. Here are Exam
Schedule Change (Reminder): We will meet at 7:45pm-9pm on Wed, 10
Feb, in addition to the usual timeslots on Tue and Thu.
Homework: Homework 4 was distributed
yesterday in class.
Homework: Here is a Homework 3
Solution. If you have a mailbox in LH 215, I probably put a copy
there; please check before printing it.
Clarification: In the clarification of 29 Jan, I referred to the
three wait stmts in Algorithm Listing 10.20. I miscounted; there are
Contest Results: Here are some typos/errors:
The first error is the one I expected people to find. Congratulations to
the two people who found it!
- In Algorithm Listing 10.18, in Coordinator_Timeout, the "if"
statement should have an else branch: "else Election()".
- In Algorithm Listing 10.21, insert "send(sender, NS_answer)" after the
last statement of "case New_State(sender, newdef)".
- In Algorithm Listing 10.18, in Coordinator_Timeout, "timeout=T" should be
deleted from the line starting with "send(Coordinator,AreYouUp)". Also,
the lines from "wait ..." until the end of the procedure should be
left-aligned with the send stmt, not indented WRT it.
- In Algorithm Listing 10.19, in Check, the "wait" statement should
have a timeout clause: "On timeout, skip". (Arguably, this just describes
the default behavior and therefore is not essential. But it is helpful
for the reader.)
Clarification: In the clarification of 29 Jan, I meant that the
three wait stmts and the if stmts following them should not be
Clarification: In Algorithm listing 10.20, I think the three "wait
up to T seconds ..." statements should not be indented (i.e., not be in
the scope of the for loop), because for each of those statements, the
process waits for at most time T (not N*T, where N is the number of
processes). At least, that's my interpretation. (This is not an answer
to the contest.)
Contest: There is (at least) one typo/error in the pseudo-code for
the Bully Algorithm in the handout from Chow and Johnson. Anyone who
reports it (by email to me) by noon on Sunday gets 20% extra credit on
Schedule Change: My office hour on 2 Feb will be moved to 3 Feb,
Homework: Here is a Homework 2
Handout: The following handout was distributed during lecture
yesterday: Homework 3.
Schedule Change: We will not meet on 2 Feb. Instead, we will meet
at 7:45pm-9pm on Wed, 10 Feb (the only proposed alternative timeslot that
no one objected to). There was a typo in my email about rescheduling the
class; I typed "Wed, 9 feb" instead of "Web, 10 feb". I hope that doesn't
affect anyone's response.
Handout: The following handout was distributed during lecture:
Snapshots: Passive Monitor using
Approximate Logical Clocks.
Homework: Yu pointed out some inconsistencies in the use of d in
the proposed solution to hw1, prob1. I just revised Homework 1 Solution (the revised version contains
"(v3)" in the first line).
Homework: Yu pointed out that my solution to 2(b) didn't deal with
the possibility that a RECOVERING msg gets lost, so I just revised Homework 1 Solution (the revised version contains
"(v2)" in the first line).
Homework: Here is a Homework 1 Solution.
Readings: The next two handouts, which are readings for the next
two topics, are available in a tray near the door of LH201C. They are:
(1) R. Chow and T. Johnson, Leader Election. (2) M. Frans Kaashoek and
Andrew S. Tanenbaum, Efficient Reliable Group Communication for
Homework: Homework 2 was distributed during
Clarification: Regarding hw1, problem 2. In part (c), I
accidentally omitted the words "and recovery" from assumption 1. Since
the assignment is due soon, it's OK to answer the question either way
(i.e., assuming recovery is possible, or assuming it isn't). Sorry for
Clarification: Regarding hw1, problem 2. When a host crashes and
recovers, assume that its clock is reset to zero. Do not assume the
existence of an external time service that can tell the recovering host
the "current time". In parts (a) and (b), it is sufficient to consider
only server crashes; for extra credit, you can consider client crashes,
too. In part (c), consider both client and server crashes.
Current Events: Here's an article related to our discussion of
at-most-once delivery: The risks of a first failure.
Clarification: Let r be the rate a clock is advancing relative to
the rate a perfect clock advances. The clock drift is r-1. Thus, a
perfect clock has r=1 and drift rate equal to 0. In problem 1 of hw1, I
meant to say that d is an upper bound on the absolute value of the
clock drift. Sorry for the confusion.
Handout: The following handout was distributed during lecture:
Reading: If you decided to get the recommended textbook (hereafter
called CDK), you might want to read chapters 1-3 for general background on
distributed systems and networks. Also, Section 4.3 of CDK discusses
Reading: Read the System Models handout.
Handout: The following handout was distributed during lecture: