Minimizing System Latency in Amazon AWS

Latency is very hard to wring out of a system once it creeps in, which is why Internet engineers are obsessed with measuring and minimizing it.

Unfortunately, the Cloud is like a flashback to the 1970s in terms of the amount and variety of latencies it introduces in network and storage systems.

Some of the methods to decrease latency in AWS are:

  1. “comparison shop” for the best-performing instances and EBS volumes using benchmarking tools, then drop the slow ones
  2. write data to a queue in memory before writing to disk
  3. use local (“ephemeral”) storage instead of EBS, or try Provisioned IOPS if you can afford it
  4. EBS volumes are a barrel of laughs in terms of performance and reliability. The likelihood of a “stuck” volume increases with striping, so stripe across 4 or fewer EBS volumes
  5. use private IP addresses, not public IP addresses, from EC2 instances to other EC2 nodes to avoid extra routing hops
  6. note that Elastic IPs (EIPs) can take up to 9-13 seconds to update, which also affects RDS failover
  7. allocate the entire server by using 32 GB RAM instances to avoid “noisy neighbours”
  8. use KVM (Northwood-based Intel CPUs) instead of Xen hosts
  9. use placement groups if possible/affordable
  10. SSDs have much lower latency than rotating disk (spinning rust) drives
  11. MySQL InnoDB with SSDs can do atomic (single) writes, instead of using the doublewrite buffer on 4k block devices
  12. reduce data writes with compression and archiving of data, and use a large enough buffer pool to hold data in memory
  13. use Amazon Route 53 Latency Based Routing to send each user to the region with the lowest latency
  14. Cross-region latency between US and non-US regions can be surprisingly high
  15. VPC might help (all new accounts are on VPC)
  16. nodes in different AZs in the same region are typically about 1-2 ms apart; that’s huge compared to the near-zero latency of a local switch
  17. configure geo-replication with MySQL replication, or log-shipping with mysqlbinlog --read-from-remote-server (possible from EC2 or ClearDB, but not RDS), to decrease failover time in case of another entire region failure
  18. and obviously, don’t “double-virtualize” your instances like Russian Dolls.
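
As a rough illustration of method 2 above (queue writes in memory before they hit disk), here is a minimal Python sketch. The class name WriteBehindLog, the max_pending parameter, and the one-record-per-line file format are assumptions made up for this example, not anything AWS-specific. The caller returns as soon as a record is queued, and a background thread batches whatever is pending into a single write and fsync.

```python
import os
import queue
import threading


class WriteBehindLog:
    """Hypothetical write-behind log: buffer records in memory, flush in batches."""

    def __init__(self, path, max_pending=10000):
        self._path = path
        # Bounded queue, so a slow disk cannot make memory use grow forever.
        self._queue = queue.Queue(maxsize=max_pending)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def write(self, record):
        # Returns once the record is queued; blocks only when the queue is full.
        self._queue.put(record)

    def close(self):
        # None is the shutdown sentinel; wait until pending records are on disk.
        self._queue.put(None)
        self._worker.join()

    def _drain(self):
        with open(self._path, "a", encoding="utf-8") as f:
            done = False
            while not done:
                batch = [self._queue.get()]  # block until something arrives
                while True:  # then take whatever else is already waiting
                    try:
                        batch.append(self._queue.get_nowait())
                    except queue.Empty:
                        break
                if None in batch:
                    done = True
                    batch = [r for r in batch if r is not None]
                if batch:
                    # One write and one fsync per batch instead of per record.
                    f.write("".join(r + "\n" for r in batch))
                    f.flush()
                    os.fsync(f.fileno())
```

The trade-off is durability: anything still sitting in the queue is lost if the instance dies, so this suits data you can afford to replay or lose, not the only copy of a transaction.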

To detect latency problems, look at the outliers on 95th and 99th percentile graphs, and also at the steal time reported by Linux top (CPU time the hypervisor gave to other DomU instances). CloudPing is informative.
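
To make the two measurements above concrete, here is a small Python sketch. The function names are hypothetical. It assumes you already have a list of latency samples, and that you are on Linux, where the first line of /proc/stat is the aggregate "cpu" line with fields in the order user, nice, system, idle, iowait, irq, softirq, steal. The percentile uses the simple nearest-rank definition; monitoring tools may interpolate differently.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def read_cpu_line(path="/proc/stat"):
    """Return the aggregate 'cpu' line (Linux only)."""
    with open(path) as f:
        return f.readline().strip()


def steal_fraction(stat_before, stat_after):
    """Fraction of CPU time stolen between two snapshots of the 'cpu' line."""
    def fields(line):
        parts = line.split()
        if parts[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line")
        return [int(x) for x in parts[1:]]

    before, after = fields(stat_before), fields(stat_after)
    deltas = [a - b for a, b in zip(after, before)]
    # Sum user..steal only; guest time is already counted inside user/nice.
    total = sum(deltas[:8])
    return deltas[7] / total if total else 0.0
```

In practice you would call read_cpu_line() twice a few seconds apart and pass both lines to steal_fraction(). A mean can look healthy while percentile(samples, 99) is many times worse, which is why the tail graphs are the ones to watch.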

Please leave a comment if you can think of any more methods to monitor or reduce latency!

Resources

Database Latency is the Achilles Heel of Cloud Computing
AWS: the good, the bad and the ugly
Stuart Cheshire: It’s the Latency, Stupid
Dynamo Sure Works Hard
Amazon AWS team: Choosing the Right EC2 Instance Type for Your Application, Multi-Region Latency Based Routing now Available for AWS – “a regional outage wouldn’t have a direct effect on the routing decisions; the absence of measurements doesn’t contribute towards the averages observed … hours and days is where to set expectations” (i.e. very high latency)
EC2Instances.info: Easy Amazon EC2 Instance Comparison
AWS peers into soul of Load Balancers for DNS failover
dtrace.org: Virtualization Performance: Zones, KVM, Xen
