Legal Disclaimer: The views and opinions expressed in this article are mine only and do not necessarily reflect the views of Google.
I spent my first two years at Google working on Google Checkout, and I walked away with two great lessons. One was the lesson I learned from Pat Helland on building scalable systems. The other was admitting a fundamental trade off in system design. Like many things, it seems obvious in retrospect, but you would be surprised at the number of people who keep trying for everything [cf, OceanStore]. When building a computing system, you only get to pick two: (1) Consistency (2) Survivability (3) Speed.
What do I mean by these? Consistency: Someone in Tokyo and someone in Chicago make a query on a piece of data and both observe the same state. Survivability: The system continues to function even in the face of a catastrophic outage such as the loss of power in New York state in 2003. Speed: How long does it take to mutate a single piece of data.
Scenario 1: Consistency and Survivability
If you want consistency and survivability, then you are forced to make each write in your system to backends on different failure domains. Failure domains may be quite large, as shown by the New York State 2003 power outage of 2003. This implies geographically distributed backends, and you can't beat the speed of light. A ping from Mountain View to Atlanta takes 60 ms on Google's network; and the speed of light puts the lower bound at 22 ms.
At Google, systems which choose consistency and survivability include Google Docs and our global Chubby cell. This trade off has the advantage of requiring no operator interaction when one of our clusters goes offline. However, it comes at a cost: writes which could have taken 2 ms take 200 ms. An application which makes 100 writes for each user action might work well enough in the former case but won't in the latter. Therefore, Google Docs and users of Global Chubby must be much more frugal with their writes. Brewer's Law: In a network partition, there is always a losing side.
If two systems are independently modifying a piece of data, it may be difficult (or impossible) to resolve the combined effects of those modifications. What does it mean for an order to be canceled and shipped (so-called "forked state")1 ? To maintain consistency during a network partition, at most one side will win ownership of each piece of data. The other side will at best have a read-only version that may be stale.
Scenario 2: Consistency and Speed
If you're willing to forgo survivability, you can get consistency and speed in a variety of ways. The simplest way is to have one computer and a single copy of the data (a global "master"). To reduce your mean time between failure, you can put a quorum system like RAID or GFS in place so that when individual components fail, your system keeps humming along. However, if your entire site goes offline due to a fiber cut or other catastrophic failure, you have an outage.
The common approximation to survivability chosen by people who choose consistency and speed is to have a replica which lags a certain distance behind the master and is located elsewhere. The replica has a consistent view of the world: it's just a bit out of date. Such replicas are often used to serve read-only versions of the data, and there are approaches to force-promote a replica to be master if the master is lost, but care must be taken regarding the possibly-stale state of existing objects.
Scenario 3: Speed and Survivability
How does Google provision web search for the world? We don't synchronize the index.
Google has S search clusters each of which can serve Q search queries per second. Each search cluster receives a continuous stream of background updates that bring it to a close approximation of the "current" state of the Internet. However, there's no need for each cluster to be strictly in sync with the others. Maybe you have the blog post that was made ten minutes ago in your search results and I don't. In return for such small inconsistencies, Google provides web search which is always on and always fast. Postscript
Even though I've framed the survivability versus speed trade off here as one involving geographically distributed backends, the trade off arises at lower levels as well. GFS trades some survivability for speed: write completion does not wait for disk commit, only memory commit to three chunk servers. If all three chunk servers lose power before their disk caches are flushed, the writes are lost. We've found this trade off acceptable given the speed up: disk seeks take about 10 ms, whereas memory references take only 10 ns.
1 Maybe Amazon knows what it means to have an order be independently shipped and canceled. Amazon's Dynamo Paper indicates that many of their objects are written to have mergable histories. -David Eger, November 2008