Replication with Gossip protocol

Cassandra performs replication to store multiple copies of data on multiple nodes for reliability and fault tolerance. To configure replication, you choose a data partitioner and a replica placement strategy. The data partitioner determines how data is distributed across the nodes in the cluster. Nodes exchange replication and other cluster state with each other using the gossip protocol, so be sure that gossip is configured correctly.
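
Gossip itself is driven by a handful of settings in cassandra.yaml. A minimal sketch, using placeholder addresses (not values taken from this article):

# cassandra.yaml: settings that gossip depends on (illustrative values)
cluster_name: 'MyCluster'              # must be identical on every node in the cluster
listen_address: 10.0.0.1               # address this node gossips from
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.0.0.1,10.0.0.2" # a few stable nodes used to bootstrap gossip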

Virtual nodes

DataStax Enterprise 3.1 and above can use virtual nodes. Virtual nodes simplify many tasks in Cassandra, such as eliminating the need to determine the partition range (calculate and assign tokens), rebalancing the cluster when adding or removing nodes, and replacing dead nodes. For a complete description of virtual nodes and how they work, see the Apache Cassandra documentation.

Using virtual nodes

In the cassandra.yaml file, uncomment num_tokens and leave the initial_token parameter unset (a minimal example follows this list). Guidelines for using virtual nodes include:

  • Determining the num_tokens value: The initial recommended value for num_tokens is 256.
  • Mixed architecture: Cassandra supports using virtual node-enabled and non-virtual node data centers. For example, you can use virtual nodes on your Cassandra and Solr nodes and not on your Hadoop nodes as long as they are in separate data centers.
  • Migrating: To upgrade an existing cluster to virtual nodes, use the cassandra-shuffle utility.
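
The change itself is a single line in cassandra.yaml. A minimal sketch, using the recommended starting value:

# cassandra.yaml
num_tokens: 256
# initial_token:     (leave unset when using virtual nodes)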

Note

Currently, DataStax recommends using virtual nodes only on data centers running purely Cassandra workloads. You should disable virtual nodes on data centers running either Hadoop or Solr workloads by setting num_tokens to 1.

Using the single-token-per-node architecture in DSE 3.1 and above

If you don’t use virtual nodes, you must make sure that each node is responsible for roughly an equal amount of data. To do this, assign each node an initial_token value and calculate the tokens for each data center as described in Generating tokens in the DataStax Enterprise 3.0 documentation. You can also use the Murmur3Partitioner and calculate the tokens as described in Cassandra 1.2 Generating tokens. A quick way to generate evenly spaced Murmur3 tokens is sketched below.
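
For the Murmur3Partitioner, evenly spaced tokens split the range -2^63 to 2^63 - 1 into equal slices: token(i) = i * (2^64 / number_of_nodes) - 2^63. A hedged one-line sketch of that calculation (the node count of 6 is only an example):

python -c "num=6; print('\n'.join(str(i * (2**64 // num) - 2**63) for i in range(num)))"

Assign each resulting value to one node's initial_token.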

Partitioner settings

You can use either the Murmur3Partitioner or RandomPartitioner with virtual nodes.

The Murmur3Partitioner (org.apache.cassandra.dht.Murmur3Partitioner) is the default partitioning strategy for new Cassandra clusters (1.2 and above) and the right choice for new clusters in almost all cases. You can only use Murmur3Partitioner for new clusters; you cannot change the partitioner in existing clusters. If you are switching to the 1.2 cassandra.yaml, be sure to change the partitioner setting to match the previous partitioner.

The RandomPartitioner (org.apache.cassandra.dht.RandomPartitioner) was the default partitioner prior to Cassandra 1.2. You can continue to use this partitioner when migrating to virtual nodes.
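
Whichever you choose, the partitioner is set in cassandra.yaml; for the default Murmur3Partitioner the setting looks like this:

partitioner: org.apache.cassandra.dht.Murmur3Partitioner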

Snitch settings

A snitch determines which data centers and racks are written to and read from. It informs Cassandra about the network topology so that requests are routed efficiently and allows Cassandra to distribute replicas by grouping machines into data centers and racks. All nodes must have exactly the same snitch configuration.

The following sections describe three commonly used snitches. All available snitches are described in the Apache Cassandra documentation. The default endpoint_snitch is the DseDelegateSnitch. The default snitch delegated by this snitch is the DseSimpleSnitch (org.apache.cassandra.locator.DseSimpleSnitch). You set the snitch used by the DseDelegateSnitch in the dse.yaml file (a sample setting follows the list of file locations):

  • Packaged installations: /etc/dse/dse.yaml
  • Tarball installations: <install_location>/resources/dse/conf/dse.yaml
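
In DSE 3.x the delegated snitch is selected with the delegated_snitch property in dse.yaml (assuming the property name used in that release); keeping the default looks like this:

delegated_snitch: org.apache.cassandra.locator.DseSimpleSnitch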

DseSimpleSnitch

DseSimpleSnitch is used only for DataStax Enterprise (DSE) deployments. This snitch logically configures each type of node in separate data centers to segregate the analytics, real-time, and search workloads. You can use the DseSimpleSnitch for mixed-workload DSE clusters located in one physical data center or for multiple physical data centers. When using multiple data centers, place each type of node (Cassandra, Hadoop, and Solr) in a separate physical data center.

When defining your keyspace strategy_options, use Analytics, Cassandra, or Search for your data center names.
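
For example, a mixed-workload keyspace using the DseSimpleSnitch data center names might be created as follows (the keyspace name and replica counts are illustrative):

CREATE KEYSPACE mykeyspace
  WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'Cassandra' : 3, 'Analytics' : 2, 'Search' : 1};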

SimpleSnitch

For a single data center (or single node) cluster, the SimpleSnitch is usually sufficient. However, if you plan to expand your cluster at a later time to multiple racks and data centers, it is easier if you use a rack and data center aware snitch from the start, such as the RackInferringSnitch. All snitches are compatible with all replication strategies.

PropertyFileSnitch

The PropertyFileSnitch allows you to define your data center and rack names to be whatever you want. Using this snitch requires that you define network details for each node in the cluster in the cassandra-topology.properties configuration file.

  • Packaged installations: /etc/dse/cassandra/cassandra-topology.properties
  • Tarball installations: <install_location>/resources/cassandra/conf/cassandra-topology.properties

Every node in the cluster should be described in this file, and the file must be exactly the same on every node in the cluster.

For example, suppose you have non-uniform IPs, two physical data centers with two racks in each, and a third logical data center for replicating analytics data. You would specify them as follows:

# Data Center One

175.56.12.105=DC1:RAC1
175.50.13.200=DC1:RAC1
175.54.35.197=DC1:RAC1

120.53.24.101=DC1:RAC2
120.55.16.200=DC1:RAC2
120.57.102.103=DC1:RAC2

# Data Center Two

110.56.12.120=DC2:RAC1
110.50.13.201=DC2:RAC1
110.54.35.184=DC2:RAC1

50.33.23.120=DC2:RAC2
50.45.14.220=DC2:RAC2
50.17.10.203=DC2:RAC2

# Analytics Replication Group

172.106.12.120=DC3:RAC1
172.106.12.121=DC3:RAC1
172.106.12.122=DC3:RAC1

# default for unknown nodes
default=DC3:RAC1

Make sure the data center names defined in the /etc/dse/cassandra/cassandra-topology.properties file match the names of the data centers in your keyspace strategy_options.

Choosing keyspace replication options

When you create a keyspace, you must define the replica placement strategy and the number of replicas you want. DataStax recommends choosing NetworkTopologyStrategy for single and multiple data center clusters. This strategy is as easy to use as the SimpleStrategy and allows for expansion to multiple data centers in the future. It is much easier to configure the most flexible replication strategy up front than to reconfigure replication after you have already loaded data into your cluster.

NetworkTopologyStrategy takes as options the number of replicas you want per data center. Even for single data center (or single node) clusters, you can use this replica placement strategy and just define the number of replicas for one data center. For example:

CREATE KEYSPACE test
  WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'us-east' : 6};

Or for a multiple data center cluster:

CREATE KEYSPACE test2
  WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 3};

When declaring the keyspace strategy_options, what you name your data centers depends on the snitch you have chosen for your cluster. The data center names must match the names the snitch uses in order for replicas to be placed in the correct location.

As a general rule, the number of replicas should not exceed the number of nodes in a replication group. However, it is possible to increase the number of replicas, and then add the desired number of nodes afterwards. When the replication factor exceeds the number of nodes, writes will be rejected, but reads will still be served as long as the desired consistency level can be met.

In DataStax Enterprise 3.0.1 and later, the default consistency level has changed from ONE to QUORUM for reads and writes to resolve a problem finding a CassandraFS block when using consistency level ONE on a Hadoop node.

Changing replication settings

The default replication of 1 for keyspaces is suitable only for development and testing on a single node. For production environments, it is important to change the replication of keyspaces from 1 to a higher number. To avoid operational problems, changing the replication of these system keyspaces is especially important:

  • HiveMetaStore, cfs, and cfs_archive keyspaces: If the node is an Analytics node that uses Hive, increase the HiveMetaStore and cfs keyspace replication factors to 2 or higher to be resilient to single-node failures. If you use cfs_archive, increase it accordingly.
  • dse_system keyspace: On an Analytics/Hadoop node, this keyspace contains information about the location of the job tracker. If only a single node contains the job tracker replica, other nodes cannot find the job tracker when that node is unavailable for some reason.

To change the replication of these keyspaces

  1. Check the name of the data center of the node.
    • Packaged installs: nodetool status
    • Binary installs: <install_location>/bin/nodetool status

    The output tells you the name of the data center for the node, for example, datacenter1.

  2. Change the replication of the cfs and cfs_archive keyspaces from 1 to 3, for example:
    ALTER KEYSPACE cfs
      WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3};
    
    ALTER KEYSPACE cfs_archive
      WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3};

    How high you increase the replication depends on the number of nodes in the cluster.

  3. If you use Hive, increase the replication of the HiveMetaStore keyspace from 1 to 3, for example.
  4. Increase the replication of the dse_system keyspace from 1 to 3, for example. (Both statements are sketched after this list.)
  5. If the keyspaces you changed contain any data, run nodetool repair to avoid having missing data problems or data unavailable exceptions.
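
A hedged CQL sketch of steps 3 and 4, assuming the same data center name (dc1) used in the earlier example; adjust the replication numbers for your cluster:

ALTER KEYSPACE "HiveMetaStore"
  WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3};

ALTER KEYSPACE dse_system
  WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3};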

Replication is the process of storing copies of data on multiple nodes to ensure reliability and fault tolerance.

Cassandra stores copies, called replicas, of each row based on the row key. You set the number of replicas when you create a keyspace using the replica placement strategy. In addition to setting the number of replicas, this strategy sets the distribution of the replicas across the nodes in the cluster depending on the cluster’s topology.

The total number of replicas across the cluster is referred to as the replication factor. A replication factor of 1 means that there is only one copy of each row on one node. A replication factor of 2 means two copies of each row, where each copy is on a different node. All replicas are equally important; there is no primary or master replica. As a general rule, the replication factor should not exceed the number of nodes in the cluster. However, you can increase the replication factor and then add the desired number of nodes afterwards. When replication factor exceeds the number of nodes, writes are rejected, but reads are served as long as the desired consistency level can be met.

To determine the physical location of nodes and their proximity to each other, the replication strategy also relies on the cluster-configured snitch, which is described below.
Replication Strategy

The available strategies are:

  • SimpleStrategy
  • NetworkTopologyStrategy

SimpleStrategy

Use SimpleStrategy for simple single data center clusters. This strategy is the default replica placement strategy when creating a keyspace using the Cassandra CLI. See Creating a Keyspace. When using the Cassandra Query Language interface, you must explicitly specify a strategy. See CREATE KEYSPACE.

SimpleStrategy places the first replica on a node determined by the partitioner. Additional replicas are placed on the next nodes clockwise in the ring without considering rack or data center location.
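
For example, a single data center keyspace using SimpleStrategy (the keyspace name and replica count are illustrative):

CREATE KEYSPACE test3
  WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor' : 3};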

NetworkTopologyStrategy

Use NetworkTopologyStrategy when you have (or plan to have) your cluster deployed across multiple data centers. This strategy lets you specify how many replicas you want in each data center.

When deciding how many replicas to configure in each data center, the two primary considerations are (1) being able to satisfy reads locally, without incurring cross-datacenter latency, and (2) failure scenarios. The two most common ways to configure multiple data center clusters are:

  • Two replicas in each data center. This configuration tolerates the failure of a single node per replication group and still allows local reads at a consistency level of ONE.
  • Three replicas in each data center. This configuration tolerates the failure of a single node per replication group at a strong consistency level of LOCAL_QUORUM, or tolerates multiple node failures per data center using consistency level ONE.

Asymmetrical replication groupings are also possible. For example, you can have three replicas per data center to serve real-time application requests and use a single replica for running analytics.
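
A hedged CQL sketch of that asymmetrical layout, using illustrative keyspace and data center names:

CREATE KEYSPACE test4
  WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'analytics' : 1};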

The NetworkTopologyStrategy determines replica placement independently within each data center as follows:

  • The first replica is placed according to the partitioner (same as with SimpleStrategy).
  • Additional replicas are placed by walking the ring clockwise until a node in a different rack is found. If no such node exists, additional replicas are placed on different nodes in the same rack.

NetworkTopologyStrategy attempts to place replicas on distinct racks because nodes in the same rack (or similar physical grouping) can fail at the same time due to power, cooling, or network issues.

NetworkTopologyStrategy relies on a properly configured snitch to place replicas correctly across data centers and racks. It is important to configure your cluster to use the type of snitch that correctly determines the locations of nodes in your network.

Note

Be sure to use NetworkTopologyStrategy instead of the OldNetworkTopologyStrategy, which supported only a limited configuration of 3 replicas across 2 data centers, without control over which data center got the two replicas for any given row key. With that strategy, some rows had two replicas in the first data center and one replica in the second, while other rows had two in the second and one in the first.
About Snitches

A snitch maps IPs to racks and data centers. It defines how the nodes are grouped together within the overall network topology. Cassandra uses this information to route inter-node requests as efficiently as possible. The snitch does not affect requests between the client application and Cassandra and it does not control which node a client connects to.

You configure snitches in the cassandra.yaml configuration file. All nodes in a cluster must use the same snitch configuration.

The following snitches are available:
SimpleSnitch

The SimpleSnitch (the default) does not recognize data center or rack information. Use it for single-data center deployments (or single-zone in public clouds).

When defining your keyspace strategy_options, use replication_factor=<#>.
DseSimpleSnitch

For information about this snitch, see DataStax Enterprise 2.1 documentation.
RackInferringSnitch

The RackInferringSnitch infers (assumes) the topology of the network from the octets of each node’s IP address: the second octet identifies the data center and the third octet identifies the rack.

When defining your keyspace strategy_options, use the second octet of your node IPs as your data center name. For example, for nodes with IP addresses of the form 10.100.x.x, you would use 100 for the data center name.
PropertyFileSnitch

The PropertyFileSnitch determines the location of nodes by rack and data center. This snitch uses a user-defined description of the network details located in the property file cassandra-topology.properties. Use this snitch when your node IPs are not uniform or if you have complex replication grouping requirements. For more information, see Configuring the PropertyFileSnitch.

When using this snitch, you can define your data center names to be whatever you want. Make sure that the data center names defined in the cassandra-topology.properties file match the names of your data centers in your keyspace strategy_options.
EC2Snitch

Use the EC2Snitch for simple cluster deployments on Amazon EC2 where all nodes in the cluster are within a single region. The region is treated as the data center and the availability zones are treated as racks within the data center. For example, if a node is in us-east-1a, us-east is the data center name and 1a is the rack location. Because private IPs are used, this snitch does not work across multiple regions.

When defining your keyspace strategy_options, use the EC2 region name (for example, "us-east") as your data center name.
EC2MultiRegionSnitch

Use the EC2MultiRegionSnitch for deployments on Amazon EC2 where the cluster spans multiple regions. As with the EC2Snitch, regions are treated as data centers and availability zones are treated as racks within a data center. For example, if a node is in us-east-1a, us-east is the data center name and 1a is the rack location.

This snitch uses public IPs as broadcast_address to allow cross-region connectivity. This means that you must configure each Cassandra node so that the listen_address is set to the private IP address of the node, and the broadcast_address is set to the public IP address of the node. This allows Cassandra nodes in one EC2 region to bind to nodes in another region, thus enabling multiple data center support. (For intra-region traffic, Cassandra switches to the private IP after establishing a connection.)
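
A minimal sketch of the per-node settings in cassandra.yaml, with placeholder addresses:

listen_address: 172.31.5.10        # private IP of this node
broadcast_address: 203.0.113.10    # public IP of this node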

Additionally, you must set the addresses of the seed nodes in the cassandra.yaml file to that of the public IPs because private IPs are not routable between networks. For example:

seeds: 50.34.16.33, 60.247.70.52

To find the public IP address, run this command from each of the seed nodes in EC2:

curl http://instance-data/latest/meta-data/public-ipv4

Finally, open storage_port or ssl_storage_port on the public IP firewall.

When defining your keyspace strategy_options, use the EC2 region name, such as "us-east", as your data center names.
