Most of the Cassandra rollouts I’ve heard about at conferences have been “Devopsed” – written by Dev and productionized by Dev, with hand-off to Operations long afterwards.
That’s the opposite of how RDBMS projects are usually deployed in large companies.
As Cassandra matures, this hand-off will happen sooner after development ends.
Here is a checklist for handing off a Cassandra database to Operations (I only consider non-trivial rings of 3 or more nodes in production with a full data set):
Item | Comments | Node Impact: Performance | Node Impact: Space | Node Impact: Time/IOPs/BW | |
---|---|---|---|---|---|
Cassandra Server Version | Should be exactly the same minor version across the cluster, except briefly during server updates | ||||
Token or vnodes? | needs to be configured before first start of server | ||||
Cassandra Client/Connector Version | Thrift or CQL? | ||||
Snitch name? Why? | several choices | ||||
Replication Factor (RF)? Why? | usually RF=3 for SoT* data, defined at keyspace level | ||||
Compaction method? Why? | Size-tiered or leveled, defined at CF (column family) level | ||||
Read Consistency Level? Why? | Netflix recommends CL=ONE. ALL seldom makes sense. | ||||
Write Consistency Level? Why? | ALL seldom makes sense. | ||||
TTL? Why? | Set per column at write time; a CQL table can also define a default TTL. | ||||
Expected Average Query Latency | 10 ms is reasonable, 1 ms is tough. | ||||
nodetool repair/scrub | needed weekly | yes | more space | more | |
Bootstrapping a new node | yes | yes | |||
Java GC pause | stop the world | yes | yes | ||
Are there any wide columns? Do they get wider over time? | pathological case for Cassandra | yes | more space | more | |
Backup | in case of an application bug or a disaster. OpsCenter, Priam, custom. | yes | slightly more for incremental backups, double for local cold copy | more | |
Restore | requires Cassandra node shutdown | yes | |||
If a storage volume fills, how do you fix it? | Especially a problem with multiple JBOD volumes, which fill unevenly. | yes | less space | less | |
If a storage volume fails, how do you fix it? | | yes | less space | less | |
What is the total data size now? Projected in 12 months? | affects most operations | yes | yes | yes | |
What is the acceptable query latency? | affects network and hardware choices | ||||
What is the best maintenance window time each week? | |||||
What are the business and practical SLAs? | |||||
What training is needed for your Operations team? | DataStax Admin and Data Modelling classes (recommend the most recent Cassandra version) | ||||
What partitioner is used? | OpsCenter only supports RandomPartitioner or Murmur3Partitioner for rebalancing | ||||
What procedures need to be written for your Operations team? | |||||
What monitoring tools? | |||||
What bugs have been encountered? Which ones still apply? | |||||
What lessons can Devops share with the Operations team? |
SoT = Source of Truth
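The "Why?" behind the RF and consistency-level rows usually comes down to simple arithmetic, so here is a minimal sketch of it. It assumes a single datacenter and only the levels ONE, TWO, THREE, QUORUM and ALL; the function names are my own, not part of any Cassandra client. A read is guaranteed to overlap the latest acknowledged write only when the read replicas plus the write replicas exceed RF.

```python
def quorum(rf: int) -> int:
    """Replicas that make up a QUORUM for replication factor rf."""
    return rf // 2 + 1


def replicas_required(level: str, rf: int) -> int:
    """Replicas that must respond for a given consistency level (single DC)."""
    table = {"ONE": 1, "TWO": 2, "THREE": 3, "QUORUM": quorum(rf), "ALL": rf}
    needed = table[level.upper()]
    if needed > rf:
        raise ValueError(f"{level} needs {needed} replicas but RF is only {rf}")
    return needed


def reads_see_latest_write(read_cl: str, write_cl: str, rf: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return replicas_required(read_cl, rf) + replicas_required(write_cl, rf) > rf


def tolerated_replica_failures(level: str, rf: int) -> int:
    """Replicas of a row that can be down while the level can still be met."""
    return rf - replicas_required(level, rf)


if __name__ == "__main__":
    for read_cl, write_cl in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
        print(read_cl, write_cl, reads_see_latest_write(read_cl, write_cl, 3))
```

With RF=3, reading and writing at ONE survives two replicas being down but does not guarantee that a read sees the latest write; QUORUM for both does guarantee it and still survives one replica being down. ALL survives none, which is why it seldom makes sense.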
About Data Consistency in Cassandra
ConstantContact techblog: Cassandra and Backups
stackoverflow.com: Do I absolutely need a minimum of 3 nodes/servers for a Cassandra cluster or will 2 suffice?
Cassandra Parameters Calculator
Adding vnodes to an existing cluster
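For the checklist's data-size question, a rough per-node estimate is often enough to start the conversation with Operations. The sketch below is only that: it assumes data is spread evenly across nodes, ignores compression, snapshots and commit logs, and takes the fraction of disk to keep free as an input. The 0.5 used in the example is an assumed rule of thumb for size-tiered compaction, not a figure from this post, and the function names are hypothetical.

```python
def per_node_load_gb(raw_data_gb: float, rf: int, nodes: int) -> float:
    """Data each node stores when raw data is replicated rf times over the ring."""
    if nodes < rf:
        raise ValueError("the ring needs at least as many nodes as the RF")
    return raw_data_gb * rf / nodes


def disk_needed_gb(load_gb: float, free_fraction: float) -> float:
    """Disk size per node so that free_fraction of it stays empty."""
    if not 0 <= free_fraction < 1:
        raise ValueError("free_fraction must be in [0, 1)")
    return load_gb / (1 - free_fraction)


if __name__ == "__main__":
    # Example inputs are made up: 600 GB of raw data, RF=3, 6 nodes.
    load = per_node_load_gb(600, rf=3, nodes=6)
    print(f"per-node load: {load:.0f} GB")
    # Assumed headroom: keep half the disk free for compaction.
    print(f"disk per node: {disk_needed_gb(load, 0.5):.0f} GB")
```

Running the same numbers with the 12-month projection shows whether the ring needs more nodes or bigger volumes before the hand-off.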