There are two common misconceptions in engineering West Coast – East Coast data centers:
- that packets travel at the speed of light
- that database transactions must use 2-phase commit, so two masters cannot be located very far apart because of #1.
Most engineers stop at those rules of thumb, but what happens when we look at the actual numbers with ecommerce/advertising applications in mind?
Figure 1: Cross-USA Internet Packet Latency (One-Way)
|Transmission Method||End-to-End Speed||Time SF-NY||Note|
|Light in vacuum||299,792 km/s||16 ms||Similar speed in air|
|Microwave in air||235,000 km/s||20 ms||Repeaters every 48 km (actually built in the 1950s in both the USA and Canada! Currently, HFT applications use 15+ microwave routes from Chicago to New York.)|
|“Electricity” in wire||235,000 km/s||20 – 30 ms||Depends on how the wire is constructed and what electromagnetic property is used to transmit information. As a rule of thumb, signals propagate in a typical electronics wire at about c/2, while an electromagnetic wave propagated in an air gap travels at up to 0.99c.|
|Light in silica fiber (theoretical)||204,081 km/s||22 ms||Index of refraction is 1.45|
|Oceanic cable for comparison||156,666 km/s||30 ms||Including amplification and switching|
|Google Routing in silica fiber||150,000 km/s||31 ms||Extrapolated from The Dalles to Ashburn (4,350 km) at 29 ms|
|AT&T Routing in silica fiber||150,000 km/s||31 ms|
|Public Internet Packets in silica fiber||137,000 km/s||35 – 36 ms||Public Internet already using MPLS|
From Figure 1 above, we see that light can travel from SF to NY in 16 ms, yet the public Internet averages 35 ms. That’s 2.2x longer than expected if packets are supposed to travel at the speed of light in a vacuum.
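The arithmetic behind the table is just distance divided by speed. Here is a minimal sketch of it. Note the assumptions: the route length `ROUTE_KM` is not given in the post and is back-computed here from the table (about 4,700 km), and the function name `one_way_ms` is mine. The results land close to the rounded table values, though not always exactly on them.

```python
# Assumed SF-NY route length in km. This is back-computed from the table,
# not a figure stated in the post.
ROUTE_KM = 4_700

# End-to-end speeds in km/s, copied from the table.
SPEEDS_KM_PER_S = {
    "Light in vacuum": 299_792,
    "Microwave in air": 235_000,
    "Light in silica fiber (theoretical)": 204_081,
    "Google routing in silica fiber": 150_000,
    "Public Internet packets in silica fiber": 137_000,
}


def one_way_ms(distance_km: float, speed_km_per_s: float) -> float:
    """Return the one-way propagation time in milliseconds."""
    return distance_km / speed_km_per_s * 1000.0


for method, speed in SPEEDS_KM_PER_S.items():
    print(f"{method:42s} {one_way_ms(ROUTE_KM, speed):5.1f} ms")

# Ratio quoted in the text: public Internet (35 ms) vs. light in vacuum (16 ms).
print(f"Public Internet vs. vacuum: {35 / 16:.1f}x")
```

The same function covers the Google row: 4,350 km in 29 ms works out to 150,000 km/s, which is the speed used for the extrapolation.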
Ever watched the expression on somebody’s face when you tell them, “Packets don’t travel at the speed of light”? It’s priceless.
Now that we’ve put numbers on the latency limits, there are some very interesting things to investigate:
- A serious enterprise could construct microwave towers across the USA again for latency-sensitive traffic with 20 ms latency in good weather, with fiber backup. (“It’s better to be fast 99% of the time than slow 99.999% of the time” – mckay-brothers.com 🙂 )
- If SF – NY is too ambitious (after all, SF is earthquake-prone), “pinch” the west-most and east-most locations by using a central location. (See below.)
Figure 2: Instead of SF-NY data center locations (~31 ms today), “pinch” the network topology of the synchronous master database pair to LAS or SLC and ATL or ORD (~20 ms today). (Map of USA Population Centers According to Major Airport Traffic)
Figure 3: Another interesting topology uses near speed-of-light microwave links from the east-most synchronous master database in Chicago to NY (8.5 ms). Instead of spending a few billion dollars on a nationwide microwave chain, one of the 15+ existing microwave providers in Chicago can be leveraged over the 1,300 km route for low-bandwidth transaction traffic.
Wide-Area Multi-Master Database Transactions
So how does that help us with multi-master database latency?
- For 2-phase/sync commit, 31-35 ms for a medium to high volume of OLTP transactions isn’t workable, especially over the public Internet. But 17-20 ms of reliable latency is fundamentally different. (10 ms is the same as public Internet latency from San Jose to Las Vegas!) An optimized ecommerce store application would work with a reliable latency near 20 ms. (Confirmed with Percona Consulting.)
- If that’s not workable, think beyond 2-phase commit. Lamport clocks have been available since 1978 and vector clocks since 1988; vector clocks have been implemented in Voldemort and soon in Redis (soon you can delegate database session handling, etc. to Redis if you need cross-DC availability). Cassandra uses last-write-wins and is DC-aware. Use tightly synchronized clocks (NTP/GPS) and similar optimizations, as Google Spanner does.
- #1 can be modified by “pinching” the location of the database masters. Instead of thinking SF and NY, locate the masters in Las Vegas or SLC and Atlanta or ORD, with read-slaves in SJC and Ashburn as required.
- Google and AT&T have virtually unlimited CONUS fiber, meaning unlimited bandwidth and known reliability around 31 ms. A new algorithm can be built according to those constraints. Think git, but for database transactions.
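The vector-clock idea from the list above fits in a few lines. This is a minimal sketch for illustration only, not how Voldemort or Redis implement it. The function names (`tick`, `merge`, `compare`) and the node names `"SLC"` and `"ATL"` are my own assumptions, borrowed from the "pinched" locations.

```python
from typing import Dict

Clock = Dict[str, int]  # node name -> count of writes seen from that node


def tick(clock: Clock, node: str) -> Clock:
    """Return a copy of `clock` with `node`'s counter advanced (a local write)."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a: Clock, b: Clock) -> Clock:
    """Element-wise maximum: the clock after one side has seen the other's history."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def compare(a: Clock, b: Clock) -> str:
    """Classify version `a` relative to `b`: equal, before, after, or concurrent."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


# Hypothetical masters in SLC and ATL, both starting from the same row version.
base = tick({}, "SLC")    # {"SLC": 1}
west = tick(base, "SLC")  # SLC writes again
east = tick(base, "ATL")  # ATL writes without having seen SLC's second write

print(compare(base, west))  # before
print(compare(west, east))  # concurrent
resolved = tick(merge(west, east), "ATL")  # ATL reconciles and writes the result
print(compare(east, resolved))  # before
```

The point for wide-area masters is that neither side waits on a cross-country round-trip to accept a write. The cost is that "concurrent" versions have to be reconciled by the application (or by a policy such as last-write-wins).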
What Does a Reliable Network Mean?
Reliable for wide-area multi-master database transactions means:
- almost always partition-free – five nines (99.999%) or more during most-active shopping times (Google is emphasizing partition-free in their networks, as it’s far easier than reducing latency and more predictable overall)
- zero packet loss
- maintenance windows known in advance
- good enough for your DBA Team to say “Yes, we can support this.”
At this time, that requires a private network, either your own or a cloud provider’s.
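To put a number on the partition-free target in the list above, here is a quick sketch of what five nines allows. The function name and the 30-day "peak season" window are illustrative assumptions, not figures from any provider’s SLA.

```python
def allowed_downtime_minutes(availability: float, window_hours: float) -> float:
    """Minutes of partition allowed within a window at the given availability (0..1)."""
    return (1.0 - availability) * window_hours * 60.0


HOURS_PER_YEAR = 365 * 24
PEAK_SEASON_HOURS = 30 * 24  # hypothetical 30-day peak shopping window

print(f"{allowed_downtime_minutes(0.99999, HOURS_PER_YEAR):.2f} min per year")
print(f"{allowed_downtime_minutes(0.99999, PEAK_SEASON_HOURS) * 60:.0f} s per peak season")
```

That is roughly five minutes over a whole year, and well under a minute across a 30-day peak window.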
What is the Low-Hanging Fruit?
From lowest-cost to highest-cost for making database transactions WAN-safe:
- wiki exercise – document, for your business applications:
  - how internal and external SLAs are defined
  - how they connect to the database (what options are used, whether connections are persistent, and how many round-trips result)
  - how many database round-trips are needed per page
  - how sessions and session failover work
  - what percentage of operations are writes vs. reads
  - whether the transactions are as thin as possible, using row self-updates and removing read-before-write cases (aka race conditions)
  - how it all should really work.
- data reduction/archiving (just active OLTP rows, please)
- use transaction group commit
- pinching west-most and east-most locations closer together, i.e., put one master in a central location. See Figures 2 and 3 above.
- algorithms like vector clocks, or newer/better
- reducing latency on existing routes (MPLS, direct optic routes)
- building new private CONUS/Gulf of Mexico fiber route.
In my experience, most organizations never even get to step #1 above. 🙂
Fortunately, there is a half-measure: multi-AZ with AWS uses different data centers in the same region with only 1-2 ms inter-DC latencies. James Hamilton from AWS calls using small data centers in the same region “limiting the blast radius.”
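Going back to the wiki checklist, the "row self-update" item is worth a concrete illustration, since it removes both a round-trip and a race condition. The sketch below uses Python’s built-in `sqlite3` purely as a stand-in database. The `inventory` table, its columns, and both function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, qty INTEGER NOT NULL)")
conn.execute("INSERT INTO inventory VALUES ('widget', 1)")
conn.commit()


def reserve_read_then_write(conn: sqlite3.Connection, sku: str) -> bool:
    """Two statements (two round-trips): the SELECT result can be stale
    by the time the UPDATE runs, so two buyers can both see qty = 1."""
    (qty,) = conn.execute(
        "SELECT qty FROM inventory WHERE sku = ?", (sku,)
    ).fetchone()
    if qty < 1:
        return False
    conn.execute("UPDATE inventory SET qty = ? WHERE sku = ?", (qty - 1, sku))
    conn.commit()
    return True


def reserve_self_update(conn: sqlite3.Connection, sku: str) -> bool:
    """One statement (one round-trip): the row checks and decrements itself."""
    cur = conn.execute(
        "UPDATE inventory SET qty = qty - 1 WHERE sku = ? AND qty >= 1", (sku,)
    )
    conn.commit()
    return cur.rowcount == 1  # 1 row changed means the reservation succeeded


print(reserve_self_update(conn, "widget"))  # True: the last unit is reserved
print(reserve_self_update(conn, "widget"))  # False: nothing left, no oversell
```

Over a 20 ms link, each statement removed from a transaction saves a full round-trip, which is why thinning transactions sits near the top of the low-cost list.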
The Speed of Light – Depends on the Medium
The speed of light in a vacuum is 299,792,458 meters per second, or 186,282 miles per second. In any other medium, though, it’s generally a lot slower. In normal optical fibers (silica glass), light travels a full 31% slower.
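The 31% figure follows directly from the index of refraction, since the speed in a medium is c divided by n. A small sketch, with function names of my own choosing:

```python
C_KM_PER_S = 299_792.458  # speed of light in a vacuum, km/s


def speed_in_medium(refractive_index: float) -> float:
    """Speed of light in a medium with the given index of refraction, km/s."""
    return C_KM_PER_S / refractive_index


def percent_slower(refractive_index: float) -> float:
    """How much slower than vacuum light travels in the medium, in percent."""
    return (1.0 - 1.0 / refractive_index) * 100.0


n_silica = 1.45  # index of refraction quoted in the latency table
print(f"{speed_in_medium(n_silica):,.0f} km/s")   # 206,753 km/s
print(f"{percent_slower(n_silica):.0f}% slower")  # 31% slower
```

The 204,081 km/s in the latency table corresponds to an index nearer 1.47, so the two figures differ by about 1%, which is within the spread of real fibers.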
Exercises for the Reader
- Fill in the wiki outline above.
- What regions does my cloud provider support?
- What is the lowest inter-master latency that can be provisioned?
- How many TPS does my database do that is directly ecommerce-related (not DW or logging)?
Please leave a comment!
Please leave a comment (no registration required) if you have any experience implementing similar topologies, or have suggestions or corrections.
How Google Does It
cloudplatform.googleblog.com: With Multi-Region support in Cloud Spanner, have your cake and eat it too
Google Public NTP
Microwave WAN Transmission
The secret world of microwave networks
The Abandoned Microwave Towers That Once Linked the US
Trans Canada Microwave
mckay-brothers.com: Microwave Bandwidth at Extreme Low Latency
109 Microwave Towers Bring the Internet to Remote Alaska Villages
Calculating Optical Fiber Latency
$1.5 billion: The cost of cutting London-Tokyo latency by 60ms
Researchers create fiber network that operates at 99.7% speed of light, smashes speed and latency records (fiber optic waveguide)
Public Internet Latency Measurements
Research and Other News
netflix: Active-Active for Multi-Regional Resiliency
Network latency – how low can you go?
W: Multiprotocol Label Switching (MPLS)
Latency: The New Web Performance Bottleneck
developer.apple: Networking Concepts
hpbn.co: Primer on Latency and Bandwidth
Network performance: Links between latency, throughput and packet loss
Turning the Optical Fiber Network into a Giant Earthquake Sensor
fgiesen.wordpress.com: Network latencies and speed of light
Einstein, Poincaré & Modernity: a Conversation