Cassandra Operations Checklist

Most of the Cassandra rollouts I’ve heard about at conferences have been “Devopsed” – written by Dev and productionized by Dev, with hand-off to Operations long afterwards.

That’s the opposite of how RDBMS projects are usually deployed in large companies.

As Cassandra matures, this hand-off will happen sooner after development ends.

Here is a checklist for handing off a Cassandra database to Operations (I only consider non-trivial rings of 3 or more nodes in production with a full data set):

| Item | Comments | Node impact (Performance / Space / Time, IOPs, BW) |
|---|---|---|
| Cassandra server version | Should be exactly the same minor version across the cluster, except briefly during server updates | |
| Token or vnodes? | Needs to be configured before the first start of the server | |
| Cassandra client/connector version | Thrift or CQL? | |
| Snitch name? Why? | Several choices | |
| Replication Factor (RF)? Why? | Usually RF=3 for SoT* data; defined at keyspace level | |
| Compaction method? Why? | Size-tiered or leveled; defined at column family (CF) level | |
| Read Consistency Level? Why? | Netflix recommends CL=ONE. ALL seldom makes sense. | |
| Write Consistency Level? Why? | ALL seldom makes sense. | |
| TTL? Why? | Defined at row level | |
| Expected average query latency | 10 ms is reasonable, 1 ms is tough | |
| nodetool repair/scrub | Needed weekly | yes / more space / more |
| Bootstrapping a new node | | yes, yes |
| Java GC pause | Stop the world | yes, yes |
| Are there any wide columns? Do they get wider over time? | Pathological case for Cassandra | yes / more space / more |
| Backup | In case of an application bug or a disaster. OpsCenter, Priam, custom. | yes / slightly more for incremental backups, double for local cold copy / more |
| Restore | Requires Cassandra node shutdown | yes |
| If a storage volume fills, how do you fix it? | Especially a problem with multiple JBOD volumes, which fill unevenly | yes / less space / less |
| If a storage volume fails, how do you fix it? | | yes / less space / less |
| What is the total data size now? Projected in 12 months? | Affects most operations | yes / yes / yes |
| What is the acceptable query latency? | Affects network and hardware choices | |
| What is the best maintenance window time each week? | | |
| What are the business and practical SLAs? | | |
| What training is needed for your Operations team? | DataStax Admin and Data Modelling classes (recommend most recent Cassandra version) | |
| What partitioner is used? | OpsCenter only supports RandomPartitioner or Murmur3Partitioner for rebalancing | |
| What procedures need to be written for your Operations team? | | |
| What monitoring tools? | 1. DSE or DCE/OpsCenter; 2. nodetool; 3. JConsole/jmxterm; 4. Boundary; 5. Nagios/Zabbix | |
| What bugs have been encountered? Which ones still apply? | | |
| What lessons can Devops share with the Operations team? | | |
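
For the "Token or vnodes?" and partitioner items, it can help to see what the single-token choice involves. The following is a minimal sketch, assuming one token per node and an evenly balanced ring; the function names are hypothetical, and only the two partitioner token ranges come from Cassandra itself. With vnodes, the server picks tokens itself based on num_tokens, so none of this would be needed.

```python
def murmur3_tokens(node_count):
    """Evenly spaced tokens over the Murmur3Partitioner range [-2**63, 2**63 - 1]."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    step = 2**64 // node_count
    return [-(2**63) + i * step for i in range(node_count)]


def random_partitioner_tokens(node_count):
    """Evenly spaced tokens over the RandomPartitioner range [0, 2**127 - 1]."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    step = 2**127 // node_count
    return [i * step for i in range(node_count)]


if __name__ == "__main__":
    # Each value would go into initial_token in that node's cassandra.yaml
    # before the node is started for the first time.
    for node, token in enumerate(murmur3_tokens(3)):
        print(f"node {node}: initial_token: {token}")
```

Because each token is tied to the node count, adding a node to a single-token ring generally means recalculating and moving tokens, which is one reason to settle this item before the first start.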

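The replication factor and consistency level items interact, and the arithmetic is short enough to sketch. This is an illustration only, assuming a single data center and ignoring LOCAL_QUORUM, EACH_QUORUM, and hinted handoff; the function names are hypothetical. It shows why ALL seldom makes sense: it tolerates no replica being down.

```python
def replicas_required(level, rf):
    """Number of replicas that must respond for a given consistency level."""
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        needed = fixed[level]
    elif level == "QUORUM":
        needed = rf // 2 + 1
    elif level == "ALL":
        needed = rf
    else:
        raise ValueError(f"unsupported consistency level: {level}")
    if needed > rf:
        raise ValueError(f"{level} needs {needed} replicas but RF is {rf}")
    return needed


def reads_overlap_writes(read_cl, write_cl, rf):
    """True when every read set must intersect every write set (R + W > RF)."""
    return replicas_required(read_cl, rf) + replicas_required(write_cl, rf) > rf


def tolerated_replica_failures(level, rf):
    """How many replicas of a row can be down while the level still succeeds."""
    return rf - replicas_required(level, rf)


if __name__ == "__main__":
    rf = 3
    for cl in ("ONE", "QUORUM", "ALL"):
        print(cl, "tolerates", tolerated_replica_failures(cl, rf), "replica(s) down")
    print("ONE/ONE overlap:", reads_overlap_writes("ONE", "ONE", rf))
    print("QUORUM/QUORUM overlap:", reads_overlap_writes("QUORUM", "QUORUM", rf))
```

With RF=3, reading and writing at ONE gives the best availability but no overlap guarantee, while QUORUM for both gives overlap and still survives one replica being down.
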
SoT = Source of Truth
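
The "total data size now, projected in 12 months" question can be turned into a rough per-node disk estimate. This is a back-of-the-envelope sketch, not a sizing tool: the function names are hypothetical, the compound monthly growth model is an assumption, and the 50% free-space headroom is a commonly quoted rule of thumb for size-tiered compaction, not a figure from this checklist. It also assumes data is spread evenly across nodes.

```python
def projected_size_gb(current_gb, monthly_growth_rate, months=12):
    """Compound growth projection (assumed model): size * (1 + rate) ** months."""
    return current_gb * (1 + monthly_growth_rate) ** months


def per_node_disk_gb(raw_data_gb, replication_factor, node_count, headroom=0.5):
    """Disk to provision per node so that `headroom` of it stays free.

    raw_data_gb is the unreplicated data size; headroom is the fraction of
    each disk kept free for compaction, repair, and snapshots (assumption).
    """
    if node_count < replication_factor:
        raise ValueError("node_count must be at least the replication factor")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    on_disk_per_node = raw_data_gb * replication_factor / node_count
    return on_disk_per_node / (1 - headroom)


if __name__ == "__main__":
    now_gb = 1000  # hypothetical current unreplicated data size
    later_gb = projected_size_gb(now_gb, monthly_growth_rate=0.05)
    print(f"projected in 12 months: {later_gb:.0f} GB")
    print(f"disk per node (RF=3, 6 nodes): {per_node_disk_gb(later_gb, 3, 6):.0f} GB")
```

Running the numbers for both today's size and the 12-month projection shows whether the ring needs more nodes or larger volumes before the hand-off rather than after it.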

About Data Consistency in Cassandra
ConstantContact techblog: Cassandra and Backups
stackoverflow.com: Do I absolutely need a minimum of 3 nodes/servers for a Cassandra cluster or will 2 suffice?
Cassandra Parameters Calculator
Adding vnodes to an existing cluster
