DataStax DSE, Spark, and HA

DataStax DSE integrates Spark in a way that is highly available, meaning (for example) that if the Spark master goes down, jobs can continue. However, you need to use the right options in order to take advantage of this.

DSE in an analytics DC keeps information in memory about the current Spark master. If you're running dse spark-submit on a node in the analytics DC, it will connect and determine the current master. If you want to submit from a remote host, you need to export the cluster configuration and import it on the remote host. Details are at https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/spark/sparkRemoteCommands.html .
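As a rough sketch of what that looks like (the file name here is just a placeholder, and the exact client-tool syntax may vary by DSE version, so check the linked docs):

    # On a node in the analytics DC: export the cluster configuration
    dse client-tool configuration export dse-spark-config.jar

    # Copy the file to the remote host, then import it there
    dse client-tool configuration import dse-spark-config.jar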

For HA, don't specify --master. DSE will determine the master for you when you submit. (That way, if the previous master is down, it will find the current one; if you hard-code a master URL, that won't happen.)
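So a submit from a node in the analytics DC is simply (class and jar names are made up for illustration):

    dse spark-submit --class com.example.MyJob my-job.jar

Note there's no --master flag; DSE figures out the current master itself.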

And, in case something fails while the job is running, specify --supervise. This tells Spark to supervise the driver and restart it if it dies, so in-flight failures can be recovered (see the full example further down).

There is no cross-DC HA. Normally, there is a dedicated analytics (Spark) DC on which jobs are run. You could have two of these, but jobs would have to be pointed at the second DC if the first failed; we don't have a way to do this automatically. From the Spark perspective, an analytics DC is a separate cluster, and the master is not DC-aware.

Also, as an aside while we're at it, you probably want to specify --deploy-mode cluster. This runs the driver on the cluster, rather than on the local host (client mode, the default).
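Putting these together, a submit command following the above would look roughly like this (again, class and jar names are placeholders):

    dse spark-submit \
      --deploy-mode cluster \
      --supervise \
      --class com.example.MyJob \
      my-job.jar

As far as I can tell, --supervise only takes effect when the driver runs in cluster deploy mode, which is another reason to prefer --deploy-mode cluster here.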

Docs are a bit limited on this, from what I can see, so hopefully this is helpful!
