Bootstrapping and consistent range movement in Cassandra

If you’re adding multiple nodes to a cluster, DataStax docs tell you to “Make sure you start each node with consistent.rangemovement property turned off”.   What is “consistent range movement”, why does Cassandra have it, and why should it be turned off?
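
For reference, the JVM property behind this is cassandra.consistent.rangemovement.  As a rough sketch (assuming a tarball install started via bin/cassandra), turning it off for a bootstrap looks something like this:

  # Start the joining node with consistent range movement disabled
  # (use with care; see the discussion below).
  bin/cassandra -Dcassandra.consistent.rangemovement=false

  # Or add it to the JVM options in conf/cassandra-env.sh instead:
  JVM_OPTS="$JVM_OPTS -Dcassandra.consistent.rangemovement=false"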

What consistent range movement means is that, when a node bootstraps, it gets its data from the replica it’s taking the range over from, and not from just any other replica that happens to hold it.  Why would we normally want this?  Without it, consistency guarantees can be broken.

Take a worst-case scenario: nodes (a,b,c) have data for a token, and you add 3 new nodes that take that token range over. If you wrote at quorum, and nodes (b,c) have the data, but not (a), the 3 new nodes could theoretically stream from (a), and you’ve now lost the data. This is, of course, a worst-case scenario and fairly unlikely, but if you write at CL ONE or ANY, for example, similar things could happen more easily.  Or, another less extreme version would be that 2 out of 3 nodes originally had the data – say (a,b), so reads at QUORUM would see the update, but during range movement, 2 of the new nodes stream from (c), so now only one node has the update.  QUORUM reads could then fail to read the update.  (In this case, a repair will fix the issue.)

So, consistent range movement says to stream the data for a token range from the node you’re taking it over from.  In that case, if we added 3 new nodes and they replaced (a,b,c), they’d have to stream from (a,b,c) respectively, preserving the existing replication of the data.

Why disable it if you add multiple nodes at once, then?  This is to avoid failed bootstraps due to timeouts.  If you add three nodes, and one of them takes a token from (a), but then the next one gets assigned to take it over from that new node, it will try to stream from the (still bootstrapping) new node, find it unavailable, and may error out and fail to bootstrap.

In summary, disabling consistent range movement allows bootstrapping nodes to get their data from any available node, so they won’t fail if they are trying to take over data from an unavailable (possibly bootstrapping) node.

The safest way to add new nodes is to bootstrap one (with consistent range movement enabled – the default), allow the bootstrap to fully complete, including all streaming, then bootstrap the next, and so on.  This takes longer and requires a bit more attention, but it is safer.  In practice, you might want to save the time: bootstrap with the two-minute pause between nodes, then repair, possibly including a full (cross-DC) repair, to resolve any data issues that may have arisen.  (And by the way, why the two-minute pause between adding nodes?  To allow each node to start up, join gossip, and negotiate and announce which tokens it’s now responsible for, avoiding race conditions.)
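
As a rough sketch of the one-node-at-a-time approach (the paths and the keyspace name are only examples):

  # On the new node: start it and let it bootstrap with the default
  # behavior (consistent range movement enabled).
  bin/cassandra

  # From an existing node: wait until the new node shows as UN (Up/Normal),
  # not UJ (Up/Joining), before starting the next one.
  bin/nodetool status

  # After all nodes have joined, optionally repair to clean up any
  # inconsistencies (my_ks is a placeholder keyspace name).
  bin/nodetool repair my_ks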

Blocking and Non-Blocking Read Repair in Apache Cassandra

When are read repairs blocking or non-blocking in Cassandra?  There’s a lot of confusion and misinformation about this, even in the best sources.  It’s actually pretty simple.

Read repairs due to read_repair_chance are non-blocking.  They are done in the background after results are returned to the client.
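
For context, read_repair_chance (and dclocal_read_repair_chance) is a per-table option in versions that still have it; it was removed in Cassandra 4.0.  Setting it from the shell might look like this (the keyspace and table names are made up):

  # Enable background (non-blocking) read repair on 10% of reads for a
  # hypothetical table my_ks.my_table (pre-4.0 only).
  cqlsh -e "ALTER TABLE my_ks.my_table WITH read_repair_chance = 0.1;"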

On any consistency level that involves more than one node (i.e., all except ANY and ONE), if the read digests don’t match up, read repair is done in a blocking fashion before returning results.  (This means that if the repair doesn’t complete in time, the read request can fail.)

Read repairs caused by digest mismatches are blocking in order to ensure monotonic quorum reads.  (See https://issues.apache.org/jira/browse/CASSANDRA-10726 .)  However, this blocking behavior does not apply only when CL=QUORUM or ALL, as some believe.  It applies to all consistency levels (except ANY and ONE, which don’t compare any digests).

One other detail is that background read repair operates on all replicas of the data.  Blocking read repairs only act on the nodes touched to satisfy the consistency level (i.e., where the digest mismatches were found), and not on nodes that weren’t queried.
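
If you want to see the blocking behavior for yourself, query tracing in cqlsh is one way (my_ks.my_table and the id column are hypothetical):

  cqlsh> CONSISTENCY QUORUM;
  cqlsh> TRACING ON;
  cqlsh> SELECT * FROM my_ks.my_table WHERE id = 42;

If the contacted replicas return different digests, the trace should show the mismatch and the repair that runs before the rows are returned.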

The relevant code is in /src/java/org/apache/cassandra/service/StorageProxy.java .

Multiple versions of Cassandra on one host

If you want to test out DSE or C* on your laptop, or a spare machine, it can be done.  You can simulate multiple nodes on one host.  A very nice tool for doing testing in this way is “ccm”, Cassandra Cluster Manager ( https://github.com/pcmanus/ccm ).  ccm is great for testing, and very convenient, but not for production.  (E.g., it only uses loopback addresses, with no real networking.)
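
For example, something along these lines spins up a three-node local cluster (the Cassandra version number is just an example):

  # Install ccm, then create and start a 3-node cluster on loopback addresses.
  pip install ccm
  ccm create test_cluster -v 3.11.4 -n 3 -s

  # Check the cluster, and run nodetool against an individual node.
  ccm status
  ccm node1 nodetool status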

This is not the “normal” use-case, but if you somehow only have high-end machines, and want to run DataStax Enterprise (“DSE”) with multiple Apache Cassandra (“C*”) instances on each machine, in production, that can be done, too.  If you use DSE packages, there is something called “DSE multi-instance” that can get multiple nodes up on one server.  There are certain options for the bin/dse command that support multi-instance environments, such as ‘dse list-nodes’, ‘dse add-node’, and ‘dse remove-node’.  ( https://docs.datastax.com/en/latest-dse/datastax_enterprise/multiInstance/multiInstanceArchitecture.html .)

What if you want to use multiple *versions* of DSE on the same server (without using VMs)?  Using the packages would be problematic: they install some resources into certain default directories and would stomp on each other, dependencies would get confusing, and so on.

In this case, you can use binary tarballs extracted to separate locations, and configure them to use non-default directories for configuration files, logs, etc.  (The DSE multi-instance command options are not included in the binary tarball version.)

When installing, make sure to set up non-default locations for data, logs, etc. (See https://docs.datastax.com/en/latest-dse/datastax_enterprise/install/installTARdse.html .) In particular, for a basic Cassandra cluster, you need to do the following (a combined sketch follows the list):

  1. extract the tarball to locations you want to use for each node. Each node will get a copy.
  2. create a directory for each node’s logs, and add the following to each node’s cassandra-env.sh: export CASSANDRA_LOG_DIR=/path/to/your/log/dir
  3. modify JMX_PORT in cassandra-env.sh to give each node a unique, unused port. (E.g., 7201, 7202, 7203, … if those are unused.)
  4. create directories for each node’s data_file_directories, commitlog_directory, saved_caches_directory, and hints_directory settings, and edit each node’s cassandra.yaml to point to these directories. (Note that you don’t want nodes to share any physical disks, for performance reasons.)
  5. modify each node’s listen_address and rpc_address in the cassandra.yaml appropriately. (For testing, e.g., you could use 127.0.0.1, 127.0.0.2, 127.0.0.3, etc.  For any production use, you would need to bind separate IP addresses for each node, ideally each with its own NIC.)
  6. update the seeds setting (if you don’t want to use 127.0.0.1, which is only suitable for testing), and any other settings you want to modify.
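
As mentioned above, here is a rough sketch of what that per-node setup might look like for one node (all paths, ports, and addresses are examples only):

  # Per-node directories (node1 shown; repeat for node2, node3, ...).
  mkdir -p /opt/node1/logs /opt/node1/data /opt/node1/commitlog \
           /opt/node1/saved_caches /opt/node1/hints

  # In node1's cassandra-env.sh: log directory and a unique JMX port.
  export CASSANDRA_LOG_DIR=/opt/node1/logs
  JMX_PORT="7201"

  # In node1's cassandra.yaml (yaml fragments shown here as comments):
  #   data_file_directories:
  #       - /opt/node1/data
  #   commitlog_directory: /opt/node1/commitlog
  #   saved_caches_directory: /opt/node1/saved_caches
  #   hints_directory: /opt/node1/hints
  #   listen_address: 127.0.0.1
  #   rpc_address: 127.0.0.1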

Once you’ve done this, you can cd to each node’s installation directory and use ./bin/dse cassandra to start up each node. (You could even run different installations as different users, if you wanted to.) 
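
A start-up and sanity-check sequence might look roughly like this (again, the installation paths and JMX ports are examples):

  # Start each node from its own installation directory.
  (cd /opt/node1/dse && ./bin/dse cassandra)
  (cd /opt/node2/dse && ./bin/dse cassandra)

  # Check each node via its own JMX port.
  /opt/node1/dse/bin/nodetool -p 7201 status
  /opt/node2/dse/bin/nodetool -p 7202 status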

To install other components, you would have to configure them separately as well, via their config files, and start ‘dse’ with the appropriate options.

Clearly, this is a bit of work, and doesn’t allow you to use services to start and stop, etc. However, it does allow you to run different versions of DSE/Cassandra on one host.

I’ve a feeling we’re not in Kansas anymore – backups in Cassandra

In the spirit of my other blog, Oracle2MySQL, I was trying to think of things that caught me by surprise when I was exploring Cassandra.  One thing that really got me was the mechanism for backups.

I read about taking backups using snapshots.  Sounded good so far; I’ve used LVM snapshots to back up databases, although they can be a drag on performance while you copy them off.

But, wait, I go to the next page, and it says the snapshot is taken by creating hard links to the data files.  This is where I experienced an impedance mismatch.  I figured something was wrong with the documentation; it made no sense.  Hard links don’t do anything but link to the existing file.  How is that a point-in-time, consistent snapshot of the database?

It took me a bit to shake myself out of it and remember the important fact: sstables are immutable.  All that is needed is a way to record which files exist at that point in time, and to prevent their deletion.  That was when I said to myself, “Now I… I know we’re not in Kansas.”
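
If you want to see this for yourself, take a snapshot and look at the link counts (the keyspace, table, and data path below are just examples):

  # Take a named snapshot of one keyspace.
  nodetool snapshot -t my_snapshot my_ks

  # The snapshot directory contains hard links to the live sstables;
  # the link count in the ls output will be 2 or more for linked files.
  ls -l /var/lib/cassandra/data/my_ks/my_table-*/snapshots/my_snapshot/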