Latency is very hard to wring out of a system once it creeps in, which is why Internet engineers are obsessed with measuring and minimizing it.
Unfortunately, the Cloud is like a flashback to the 1970s in terms of the amount and variety of latencies it introduces in network and storage systems.
Some of the methods to decrease latency in AWS are:
- “comparison shop” for the best-performing instances and EBS volumes using benchmarking tools, then drop the slow ones
- write data to a queue in memory before writing to disk
- use local (“ephemeral”) storage instead of EBS, or try Provisioned IOPS if you can afford it
- EBS volumes are a barrel of laughs in terms of performance and reliability. The likelihood of a “stuck” volume increases with striping, so stripe across 4 or fewer EBS volumes
- use private IP addresses, not public IP addresses, from EC2 instances to other EC2 nodes to avoid extra routing hops
- note that EIPs can take up to 9-13 seconds to update, which also affects RDS failover
- allocate the entire server by using 32 GB RAM instances to avoid “noisy neighbours”
- use KVM (Northwood-based Intel CPUs) instead of Xen hosts
- use placement groups if possible/affordable
- SSDs have much lower latency than rotating disk (spinning rust) drives
- MySQL InnoDB with SSDs can do atomic (single) writes, instead of using the double-write buffer on 4k block devices
- reduce data writes with compression and archiving of data, and use a large enough buffer pool to hold data in memory
- use Amazon Route 53 latency-based routing to send users to the lowest-latency region
- Cross-region latency between US and non-US regions can be surprisingly high
- VPC might help (all new accounts are on VPC)
- nodes in different AZs in the same region are typically about 1 – 2 ms apart – that’s huge compared to 0.0 ms on a local switch
- configure geo-replication with MySQL replication, or log shipping with mysqlbinlog --read-from-remote-server (possible from EC2 or ClearDB, but not RDS), to decrease failover time in case of another entire region failure
- and obviously, don’t “double-virtualize” your instances like Russian Dolls.
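As an illustration of the "write data to a queue in memory before writing to disk" item above, here is a minimal Python sketch. The class name WriteBehindLog and its batch_size parameter are made up for this example and do not come from any AWS or MySQL library. Callers hand lines to an in-memory queue and return immediately; a background thread batches them and pays the disk (or EBS) latency once per batch.

```python
import os
import queue
import threading


class WriteBehindLog:
    """Hypothetical write-behind log: writes go to memory first, disk later."""

    _STOP = object()  # sentinel that tells the writer thread to finish

    def __init__(self, path, batch_size=100):
        self._path = path
        self._batch_size = batch_size
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def write(self, line):
        # Returns immediately; the caller never waits on disk latency.
        self._queue.put(line)

    def close(self):
        # Ask the writer to flush what is queued, then wait for it to exit.
        self._queue.put(self._STOP)
        self._thread.join()

    def _drain(self):
        with open(self._path, "a", encoding="utf-8") as f:
            stopping = False
            while not stopping:
                batch = [self._queue.get()]  # block until there is work
                # Take whatever else is already queued, up to batch_size.
                while len(batch) < self._batch_size:
                    try:
                        batch.append(self._queue.get_nowait())
                    except queue.Empty:
                        break
                lines = []
                for item in batch:
                    if item is self._STOP:
                        stopping = True
                        break
                    lines.append(item)
                if lines:
                    f.write("".join(text + "\n" for text in lines))
                    f.flush()
                    os.fsync(f.fileno())  # one sync per batch, not per write
```

The trade-off is durability: anything still sitting in the queue is lost if the process or the instance dies, so this only suits data you can afford to lose or can replay from elsewhere.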
To detect latency problems, look at the outliers on 95th and 99th percentile graphs, and also at steal time in Linux top (CPU time given to other DomU instances). CloudPing is informative.
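If you have no graphing tool handy, both checks can be roughed out in a few lines of Python. This is a sketch, not a monitoring system: the function names are my own, the percentile uses the simple nearest-rank method, and the steal calculation assumes the usual Linux /proc/stat layout where steal is the eighth counter on the aggregate "cpu" line.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def parse_cpu_line(line):
    """Return the first 8 counters (user .. steal) from a /proc/stat cpu line."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # guest and guest_nice are already counted inside user and nice, so skip them.
    return [int(value) for value in fields[1:9]]


def steal_percent(before, after):
    """Percentage of CPU time stolen between two /proc/stat snapshots."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    return 100.0 * deltas[7] / total if total else 0.0


def read_cpu_line(path="/proc/stat"):
    """Linux only: read the aggregate cpu line for use with steal_percent."""
    with open(path, encoding="ascii") as f:
        return f.readline()
```

For example, with 98 requests at 10 ms and 2 at 500 ms, the mean is under 20 ms and the 95th percentile is still 10 ms, while the 99th percentile jumps to 500 ms. That is why the outlier graphs matter more than the average. For steal time, call read_cpu_line() twice a few seconds apart and pass both lines to steal_percent().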
Please leave a comment if you can think of any more methods to monitor or reduce latency!
Database Latency is the Achilles Heel of Cloud Computing
AWS: the good, the bad and the ugly
Stuart Cheshire: It’s the Latency, Stupid
Dynamo Sure Works Hard
Amazon AWS team: Choosing the Right EC2 Instance Type for Your Application, Multi-Region Latency Based Routing now Available for AWS – “a regional outage wouldn’t have a direct effect on the routing decisions; the absence of measurements doesn’t contribute towards the averages observed … hours and days is where to set expectations” (i.e. very high latency)
EC2Instances.info: Easy Amazon EC2 Instance Comparison
AWS peers into soul of Load Balancers for DNS failover
dtrace.org: Virtualization Performance: Zones, KVM, Xen