Notes on “Getting Started with Time Series Data Modeling” in Cassandra

Patrick McFadin’s tutorial on Getting Started with Time Series Data Modeling aka “The Weather Model” is a very good introduction to a few important concepts in Cassandra:

  1. it’s the canonical time series use case in Cassandra. (Try to DESCRIBE the OpsCenter keyspace for how DataStax does time series rollups.)
  2. basic intro to CQL syntax.

Here’s some notes on getting more out of Patrick’s tutorial:

  • INSERT and UPDATE are the same in CQL and are known as UPSERT. That implies that if event_time was not part of the key, then the time and temp would just keep updating the same row each hour. It also implies that if we do a double-insert with the same timestamp, we will only see the last one to make it even if event_time is part of the key.
  • “PRIMARY KEY ((weatherstation_id,date),event_time)” and “PRIMARY KEY (weatherstation_id,event_time),) WITH CLUSTERING ORDER BY (event_time DESC)” both use event_time as a clustering (sorted) key, but only the latter syntax allows the descending ordering
  • try to do some SELECT statements without specifying the weather station id. What happens? Why?
  • the reason that Patrick is so excited with TTL is that with RDBMS, either you have to do DELETE FROM table where event_time < now() or setup partitions and DROP old ones and CREATE new ones. The first one causes massive IO and the second one increases administration overhead. Also, Cassandra's row TTL is much more granular than table-level partitions
  • there is no GROUP BY syntax, but denormalizing into PRIMARY KEY ((station_id), temperature) does something similar
  • although Patrick only inserted 4 test records, that’s still more than most database developers bother to test new features with, so do your own acceptance tests 🙂

Some more general CSQL tips are:

  • Datastax is promoting CSQL to users more so than Thrift
  • queries on the same node in the same partition are relatively fast, everything else being slower.
  • column names are stored as strings in each column, so keep them short if you want to save space. Remember that wasting space inflates memtables, SStables, snapshots, backups, AWS data transfer, etc.
  • SECONDARY INDEX is generally not recommended as it will usually do a scatter-gather across multiple nodes, thus either taking a long time or even timing out
  • prior to CQL3, you could use the cassandra-cli program to list the column families in more low-level detail to see how they were stored
  • to change the row key or clustering you need to drop the table (column family) and create it again

For those new to csqlsh, here’s the full list of commands to get the sample code working:

$ cqlsh

CREATE KEYSPACE "weather_model" WITH REPLICATION = {
   'class' : 'SimpleStrategy', 'replication_factor' : 3 };

SELECT * FROM system.schema_keyspaces;

USE weather_model;

CREATE TABLE temperature (
   weatherstation_id text,
   event_time timestamp,
   temperature text,
   PRIMARY KEY (weatherstation_id,event_time)
);

INSERT INTO temperature (weatherstation_id,event_time,temperature)
VALUES ('1234ABCD','2013-04-03 07:01:00','72F');

INSERT INTO temperature (weatherstation_id,event_time,temperature)
VALUES ('1234ABCD','2013-04-03 07:02:00','73F');

INSERT INTO temperature (weatherstation_id,event_time,temperature)
VALUES ('1234ABCD','2013-04-03 07:03:00','73F');

INSERT INTO temperature (weatherstation_id,event_time,temperature)
VALUES ('1234ABCD','2013-04-03 07:04:00','74F');
 
SELECT event_time,temperature FROM temperature
WHERE weatherstation_id='1234ABCD';

Display the read and write stats. (Use -h host if you get zero reads or writes.)

$ nodetool cfstats weather_model

Keyspace: weather_model
	Read Count: 2
	Read Latency: 0.1925 ms.
	Write Count: 4
	Write Latency: 0.141 ms.
[...]

Then clean up:

DROP TABLE temperature;

DROP KEYSPACE weather_model;

Links

Cassandra Query Language (CQL) v3
ebaytechblog.com: Cassandra Data Modeling Best Practices

This entry was posted in API Programming, Cassandra, Open Source, Tech. Bookmark the permalink.

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.