Bootstrapping and consistent range movement in Cassandra

If you’re adding multiple nodes to a cluster, the DataStax docs tell you to “Make sure you start each node with consistent.rangemovement property turned off” (in practice, the JVM option -Dcassandra.consistent.rangemovement=false). What is “consistent range movement”, why does Cassandra have it, and why should it be turned off in this case?

Consistent range movement means that a bootstrapping node streams each token range from the replica it is taking that range over from, and not from any of the other replicas. Why would we normally want this? Without it, consistency guarantees can be broken.

Take a worst-case scenario: nodes (a,b,c) hold the replicas for a token range, and you add 3 new nodes that take that range over. If you wrote at QUORUM and the write reached (b,c) but not (a), all 3 new nodes could theoretically stream from (a), and you’ve now lost the data. This is, of course, a worst-case scenario and fairly unlikely, but if you write at CL ONE or ANY, for example, similar things could happen more easily.

A less extreme version: 2 out of 3 nodes originally had the data, say (a,b), so reads at QUORUM would see the update. During range movement, 2 of the new nodes stream from (c), so now only one node has the update, and QUORUM reads could then fail to read it. (In this case, a repair will fix the issue.)

So, consistent range movement says to stream a token range from the node you’re taking it over from. In that case, if we added 3 new nodes and they replaced (a,b,c), they’d have to stream from (a), (b) and (c) respectively, which preserves the number of replicas holding each write.
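To make the difference concrete, here is a toy model of the two streaming policies. It is only a sketch, not Cassandra code: replicas are plain dicts, and the names (bootstrap, quorum_read_can_miss, the new nodes x, y, z) are made up for illustration.

```python
# Toy model, not Cassandra code: each replica is a dict of key -> value.
old = {
    "a": {},            # missed the QUORUM write
    "b": {"k": "v1"},
    "c": {"k": "v1"},
}

def bootstrap(sources):
    """Build the new replicas; sources maps new node -> old node it streams from."""
    return {new: dict(old[src]) for new, src in sources.items()}

def quorum_read_can_miss(replicas, key):
    """True if some majority of replicas holds no copy of key."""
    holders = sum(1 for data in replicas.values() if key in data)
    quorum = len(replicas) // 2 + 1
    return len(replicas) - holders >= quorum

# Consistent range movement: each new node streams from the node it replaces.
strict = bootstrap({"x": "a", "y": "b", "z": "c"})

# Disabled, worst case: all three new nodes happen to stream from the stale node.
worst = bootstrap({"x": "a", "y": "a", "z": "a"})

# Disabled, less extreme: two new nodes stream from the stale node
# (the same situation as the (a,b)/(c) example, with the stale node called a).
partial = bootstrap({"x": "a", "y": "a", "z": "b"})

print(quorum_read_can_miss(strict, "k"))   # False
print(quorum_read_can_miss(worst, "k"))    # True
print(quorum_read_can_miss(partial, "k"))  # True
```

In the strict case two of the three new replicas still hold the write, so every quorum overlaps with at least one of them. In the other two cases a quorum of new replicas can be assembled without it.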

Why disable it if you add multiple nodes at once, then? This is to avoid failed bootstraps due to timeouts. If you add three nodes and one of them takes a range over from (a), but the next one is then assigned to take that range over from the first new node, it will try to stream from the (still bootstrapping) new node, find it unavailable, and may error out and fail to bootstrap.
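The failure mode can be sketched the same way. This is a simplified model under my own assumptions (the function pick_source, the SourceUnavailable exception and the NORMAL/JOINING labels are hypothetical); Cassandra’s real source selection is more involved.

```python
class SourceUnavailable(Exception):
    """Raised when no acceptable streaming source exists (hypothetical)."""

def pick_source(owner, replicas, state, consistent=True):
    """Choose the node a bootstrapping node streams a range from.

    owner:    node the range is being taken over from
    replicas: every node that holds a copy of the range
    state:    node -> "NORMAL" or "JOINING"
    """
    if consistent:
        # Consistent range movement: only the current owner is acceptable.
        if state[owner] != "NORMAL":
            raise SourceUnavailable(f"{owner} is {state[owner]}")
        return owner
    # Disabled: fall back to any replica that is up and serving.
    for node in replicas:
        if state[node] == "NORMAL":
            return node
    raise SourceUnavailable("no replica available")

# x is the first new node, still bootstrapping; the second new node
# has been assigned to take the range over from x.
state = {"a": "NORMAL", "b": "NORMAL", "c": "NORMAL", "x": "JOINING"}

try:
    pick_source("x", ["x", "b", "c"], state, consistent=True)
except SourceUnavailable as err:
    print("bootstrap fails:", err)

print(pick_source("x", ["x", "b", "c"], state, consistent=False))  # b
```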

In summary, disabling consistent range movement allows bootstrapping nodes to get their data from any available node, so they won’t fail if they are trying to take over data from an unavailable (possibly bootstrapping) node.

The safest way to add new nodes is to bootstrap one (with consistent range movement enabled, which is the default), allow the bootstrap to fully complete, including all streaming, then bootstrap the next, and so on. This takes longer and requires a bit more attention, but is safer. In practice, you might want to save the time: bootstrap the nodes with consistent range movement disabled and the 2-minute pause between them, then repair, and possibly do a full (cross-DC) repair to resolve any data issues that have arisen. (And by the way, why the rule about a two-minute pause between adding nodes? To allow each node to start up, feed gossip, and negotiate and announce which tokens it’s now responsible for, which avoids race conditions.)
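If you go the one-at-a-time route, the “wait until the bootstrap has fully completed” step can be checked from the output of nodetool status, where each node line starts with a two-letter code: U or D for up/down, then N, L, J or M for normal, leaving, joining or moving. Below is a minimal sketch of such a check. The helper name nodes_not_normal is my own, and the sample output (addresses, loads, host IDs) is invented for illustration; in real use you would feed it the text captured from nodetool status.

```python
def nodes_not_normal(status_text):
    """Return addresses of nodes whose nodetool status code is not UN."""
    pending = []
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        code = fields[0]
        # Node lines start with a two-letter code such as UN, UJ or DN;
        # header and separator lines do not match this shape.
        if len(code) == 2 and code[0] in "UD" and code[1] in "NLJM":
            if code != "UN":
                pending.append(fields[1])
    return pending

# Illustrative sample only; all values are made up.
SAMPLE = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load     Tokens  Owns  Host ID                               Rack
UN  10.0.0.1  1.2 GiB  256     ?     11111111-1111-1111-1111-111111111111  rack1
UN  10.0.0.2  1.1 GiB  256     ?     22222222-2222-2222-2222-222222222222  rack1
UJ  10.0.0.4  300 MiB  256     ?     44444444-4444-4444-4444-444444444444  rack1
"""

print(nodes_not_normal(SAMPLE))  # ['10.0.0.4']
```

An empty list would be the signal that it is reasonable to start the next node; anything else means a node is still joining (or down) and the next bootstrap should wait.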
