What does ‘nodetool garbagecollect’ do in Cassandra?

‘nodetool garbagecollect’ clears deleted data from sstables.  (“Deleted” data is also called “shadowed” data, or data that’s shadowed by a tombstone.) It does not currently drop the tombstones (even after gc_grace_seconds).  garbagecollect can be used to reclaim space, or to make it more likely that tombstones will be dropped in subsequent compactions.  (Compactions can’t drop tombstones until after gc_grace_seconds, and after the shadowed data is cleared from any older sstables – garbagecollect does the clearing.)

garbagecollect iterates through sstables, removing any data that’s no longer needed because it’s shadowed by a tombstone.  By default, it will clean deleted partitions and rows.  To also clean deleted cell values, use the option ‘-g CELL’.

If entire partitions have been deleted, then a single-table compaction after the fact, or a second run of garbagecollect, can clean out the tombstone. 

I say “entire partitions”, because a manual single-table compaction, or garbagecollect, will check the bloom filters of older sstables for the partition, and if it doesn’t find it, drop the tombstone.  This works when a partition has been deleted.  If rows within a partition, or cells, have been deleted, the partition will likely still exist in older sstables, so the bloom-filter check would fail.


cleaning up tombstones in DataStax DSE and Apache Cassandra

In Cassandra, data isn’t deleted in the same way it is in RDBMSs. Cassandra is designed for high write throughput, and avoids reads-before-writes. It uses sstables, which are immutable once written. So, a delete is actually an update, and updates are actually inserts (into new sstables). A “tombstone” marker is written to indicate that the data is now (logically) deleted. (See here for a good writeup about tombstones from The Last Pickle.)

Heavy deletes can lead to not only extra disk space usage, but also decreasing performance on reads. (Since we might have to read through all those tombstones to find any “live” data.)  Over time, tombstones get cleaned up during anti-entropy compactions (the process that cleans up out-of-date records). However, how quickly this happens depends on many things – compaction strategies, sstable sizes, gc_grace_seconds, whether relevant rows are in one sstable or spread out over multiple sstables, etc.

People often ask how they can clean up unwanted tombstones, especially if they’re getting tombstone warnings or errors in their logs. We can tune the compaction to be more aggressive (using compaction subproperties), then sit back and wait again.  Or, in some cases customers use forced major compactions, but this can lead to issues down the road (having one big sstable that won’t get compacted for some time), so might also require using sstablesplit to split up the large resulting sstable. Or, a hack for smaller tables is to alter the table to use a different compaction strategy, then alter it back. (Forcing all sstables to be rewritten.)

Recently, though, it was pointed out to me that DSE 5.1 (and Apache Cassandra 3.10) offer a new nodetool command, ‘nodetool garbagecollect’, introduced in jira CASSANDRA-7019 .  This was added for just this situation, and is designed to clean up a table’s droppable tombstones.  (Edit: this is incorrect – see my next post.)

This looks to be a very useful update. I haven’t played with it much yet – if you have any good or bad experiences with it, I’d love to hear them!

Benchmarketing – Neo4j and MySQL

CTEs are out in MySQL 8, which makes it easier to write graph queries.  I wanted to revisit some of the performance comparisons I’ve seen online touting Neo4j’s superiority.

For my tests, I’ve only done the MySQL side.  I’m comparing to numbers given for Neo4j in a couple of online posts.  They may be outdated.  TLDR; MySQL did great, generally beating the (perhaps outdated) Neo4j numbers in the posts.

I spun up MySQL 8 on my laptop, with a 3G innodb_buffer_pool_size and all else default.  No fine-tuning.  I loaded data from https://snap.stanford.edu/data/soc-pokec.html, with ~1.6M nodes and ~30M relationships.

To cut to the quick, as data was cached, queries performed on a par with the Neo4j numbers in the posts I had read.  If the dataset can fit into memory, as I expect it probably did for Neo4j, MySQL does very well.  If it isn’t cached, it does fairly well, too.  As an overview, finding friends to a depth of 4 recursions took about 1/2 a second.  To a depth of 5, it took 10-15 seconds.

Following are more details of my quick tests.

I installed MySQL 8 on my macbook, and gave it the following ~/.my.cnf file:






loose-local-infile = 1

I created two tables, using a subset of the columns in the sample data, and loaded the datasets:

create table soc_pokec_profiles (

    user_id int primary key,

    public boolean,

    completion_percentage int,

    gender boolean,

    region varchar(256),

    last_login datetime,

    registration datetime,

    AGE int,

    body varchar(256),

    working_in_field varchar(256),

    spoken_languages varchar(256),

    hobbies varchar(256),

    enjoy_good_food varchar(256),

    pets varchar(256),

    body_type varchar(256),

    my_eyesight varchar(256),

    eye_color varchar(64),

    hair_color varchar(64),

    hair_type varchar(64),

    completed_level_of_education varchar(64),

    favourite_color varchar(256),

    relation_to_smoking varchar(64),

    relation_to_alcohol varchar(128),

    sign_in_zodiac varchar(64)


load data local infile ‘/Users/benkrug/Downloads/soc-pokec/soc-pokec-profiles.txt’

into table soc_pokec_profiles


create table soc_pokec_relationships (

user_id int, friend_id int, primary key (user_id,friend_id));

load data infile ‘/Users/benkrug/Downloads/soc-pokec/soc-pokec-relationships.txt’


into table soc_pokec_relationships


I warmed up the buffer pool using

SELECT COUNT(*) FROM soc_pokec_profiles WHERE sign_in_zodiac = ‘load_cache’;

SELECT COUNT(*) FROM soc_pokec_relationships WHERE friend_id = 33;

THen I ran queries like the following which searches to 5 levels of recursions (“depth” 1 is level 0):



  SELECT 1 AS depth, 990000 AS user_id

  FROM soc_pokec_profiles 


  SELECT depth+1, friend_id FROM soc_pokec_relationships JOIN links 

    WHERE links.user_id=soc_pokec_relationships.user_id

    AND depth < 6


SELECT * FROM links;

Varying the initial user_id, 4 levels of recursion generally took .6 seconds.  5 levels took about 10-15 seconds.

For 4 levels, that’s faster than the Neo4j times given at https://neo4j.com/news/how-much-faster-is-a-graph-database-really/ .  For 5 levels, it’s slower, but nothing like the “not finished in an hour” listed.  In fact, nothing in the quote “For the simple friends of friends query, Neo4j is 60% faster than MySQL. For friends of friends of friends, Neo is 180 times faster. And for the depth four query, Neo4j is 1,135 times faster. And MySQL just chokes on the depth 5 query” is remotely correct in my results.  My times were all better than the Neo4j times given, until 5 levels, then still were very good.

Of course, I don’t have the details of the tests they used.  I will note that they mention 1M users, and don’t say how many relationships.  I used 1.6M users and 30M relationships.

A somewhat snarky post at https://maxdemarzi.com/2017/02/06/neo4j-is-faster-than-mysql-in-performing-recursive-query/ is very proud of 2.7 seconds for 4 levels, whereas I got .6 seconds practically out of the box.  (Also, the post mentions 100k nodes, then 1M nodes, it’s a bit confusing, and 10M relationships, whereas I had 30M.)

In all, don’t believe the hype.  Use what’s best based on all the relevant considerations (manageability, ease-of-use, maintainability, performance, etc).  And do your own testing.

P.S. For a good introduction to CTEs, I thought this book was good (see my review).  I also consulted the MySQL reference here .

DataStax DSE, spark, and HA

DataStax DSE spark is integrated in such a way to be highly available, meaning that (for example) if the master goes down, jobs can continue. However, you need to use the right options in order to make use of this.

DSE in an analytics DC keeps information in memory about the current spark master. If you’re running spark-submit on a node in the analytics DC, it will connect and determine the current master. If you want to submit from a remote host, you need to export the cluster configuration and import it to the remote host. Details are at https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/spark/sparkRemoteCommands.html .

For HA, don’t specify –master. DSE will determine that for you when you submit. (That way, if that master is down, it will find the current one. If you specify it, that won’t happen.)

And, in case the master goes down while the job is running, specify –supervise. This tells the spark job to supervise, and handle any failures in-flight.

There is no cross-DC HA. Normally, there is a dedicated Analytics (spark) DC on which jobs are run. You could have two of these, but jobs would have to pointed at the second DC if the first failed. We don’t have a way to do this automatically. From the spark perspective, an analytics DC is a separate cluster, and the master is not DC-aware.

Also, as an aside while we’re at it, you probably want to specify ‘–deploy-mode cluster’. This tells the job to deploy on the cluster, and not on the local host (the default).

Docs are a bit limited on this, from what I can see, so hopefully this is helpful!

bootstrapping and consistent range movement in Cassandra

If you’re adding multiple nodes to a cluster, DataStax docs tell you to “Make sure you start each node with consistent.rangemovement property turned off”.   What is “consistent range movement”, why does Cassandra have it, and why should it be turned off?

What consistent range movement means is that, when you’re bootstrapping, get your data from the replica you’re taking it over from, and not from any secondary replicas.  Why would we normally want this?  Without it, consistency guarantees can be broken.

Take a worst-case scenario: nodes (a,b,c) have data for a token, and you add 3 new nodes that take that token range over. If you wrote at quorum, and nodes (b,c) have the data, but not (a), the 3 new nodes could theoretically stream from (a), and you’ve now lost the data. This is, of course, a worst-case scenario and fairly unlikely, but if you write at CL ONE or ANY, for example, similar things could happen more easily.  Or, another less extreme version would be that 2 out of 3 nodes originally had the data – say (a,b), so reads at QUORUM would see the update, but during range movement, 2 of the new nodes stream from (c), so now only one node has the update.  QUORUM reads could then fail to read the update.  (In this case, a repair will fix the issue.)

So, consistent range movement says to get the token from the node you’re taking it over from.  In that case, if we added 3 new nodes and they replaced (a,b,c), they’d have to stream from (a,b,c) respectively, and preserve the data replication.

Why disable it if you add multiple nodes at once, then?  Basically, this is a way of saying “I know this might cause inconsistencies or problems, but I don’t care – do it anyway”.  (See CASSANDRA-7069.)

In summary, disabling consistent range movement allows bootstrapping nodes to get their data from any available node, so they won’t fail if they are trying to take over data from an unavailable (possibly bootstrapping) node.

The safest way to add new nodes is to bootstrap one (with consistent range movement enabled – the default), allow the bootstrap to fully complete, including all streaming, then bootstrap the next, etc. This takes longer and requires a bit more attention, but is safer. In practice, you might want to save the time, bootstrap with the 2-minute pause between nodes, then repair, and possibly do a full (cross-DC) repair to resolve any data issues, if they’ve arisen.  (And by the way, why the two-minute pause between adding nodes rule?  To allow each node to start up, feed gossip, negotiate and announce which tokens it’s now responsible for, to avoid race conditions.)

Blocking and Non-Blocking Read Repair in Apache Cassandra

When are read repairs blocking or non-blocking in Cassandra?  There’s a lot of confusion and misinformation about this, even in the best sources.  It’s actually pretty simple.

Read repairs due to read_repair_chance are non-blocking.  They are done in the background after results are returned to the client.

On any consistency level that involves more than one node (i.e., all except ANY and ONE), if the read digests don’t match up, read repair is done in a blocking fashion before returning results.  (This means that if the repair doesn’t complete in time, the read request can fail.)

Read repairs caused by digest mismatches are blocking in order to ensure monotonic quorum reads.  (See https://issues.apache.org/jira/browse/CASSANDRA-10726 .)  However, that they are blocking does not only apply when CL=QUORUM or ALL, as some believe.  They are blocking for all consistency levels (except ANY and ONE, which don’t compare any digests).

One other detail is that background read repair operates on all replicas of the data.  Blocking read repairs only act on the nodes touched to satisfy the consistency level.  (Ie, where the digest mismatches were found, and not on nodes that weren’t queried.)

The relevant code is in /src/java/org/apache/cassandra/service/StorageProxy.java .

Multiple versions of Cassandra on one host

If you want to test out DSE or C* on your laptop, or a spare machine, it can be done.  You can simulate multiple nodes on the one host.  A very nice tool for doing testing in this way is “ccm”, Cassandra Cluster Manager ( https://github.com/pcmanus/ccm ).  ccm is great for testing, very convenient, but not for production.  (Eg, it only uses localhost, no networking.)

This is not the “normal” use-case, but if you somehow only have high-end machines, and want to run DataStax Enterprise (“DSE”) with multiple Apache Cassandra (“C*”) instances on each machine, in production, that can be done, too.  If you use DSE packages, there is something called “DSE multi-instance” that can get multiple nodes up on one server.  There are certain options for the bin/dse command that support multi-instance environments, such as ‘dse list-nodes’, ‘dse add-node’, and ‘dse remove-node’.  ( https://docs.datastax.com/en/latest-dse/datastax_enterprise/multiInstance/multiInstanceArchitecture.html .)

What if you want to use multiple *versions* of DSE on the same server (without using VMs)?  Using the packages would be problematic, since they install some resources into certain default directories, and would stomp on each other, dependencies would get confusing, etc.

In this case, you can use binary tarballs extracted to separate locations, and configure them to use non-default directories for configuration files, logs, etc.  (The DSE multi-instance command options are not included in the binary tarball version.)

When installing, you will need to make sure to set up non-default locations for data, logs, etc. (See https://docs.datastax.com/en/latest-dse/datastax_enterprise/install/installTARdse.html .) In particular, for a basic Cassandra cluster, you need to do the following:

  1. extract the tarball to locations you want to use for each node. Each node will get a copy.
  2. create a directory for each node’s logs, and add the following to each node’s cassandra-env.sh: export CASSANDRA_LOG_DIR=/path/to/your/log/dir
  3. modify JMX_PORT in cassandra-env.sh to give each node a unique, unused port. (Eg, 7201, 7202, 7203,… if those are unused.)
  4. create directories for each node’s data_file, commitlog, saved_caches and hints settings, and edit each node’s cassandra.yaml to point to these directories. (Note that you don’t want nodes to share any physical disks, for performance reasons.)
  5. modify each node’s listen_address rpc_address in the cassandra.yaml appropriately. (For testing, eg, you could use,,, etc.  For any production use, you would need to bind separate IP addresses for each node, ideally each with its own NIC.)
  6. Update the seeds setting, if you don’t want to use (for testing!), and any other settings you want to modify.

Once you’ve done this, you can cd to each node’s installation directory and use ./bin/dse cassandra to start up each node. (You could even run different installations as different users, if you wanted to.) 

To install other components, you would have to configure them separately as well, via their config files, and start ‘dse’ with the appropriate options.

Clearly, this is a bit of work, and doesn’t allow you to use services to start and stop, etc. However, it does allow you to run different versions of DSE/Cassandra on one host.