Apache Cassandra

Cassandra is a NoSQL database management system designed to handle high volumes of structured data. If you are preparing for a role involving Apache Cassandra, you will find these interview questions helpful.



Q.1 Explain the types of Data models.

There are three types of Data Model:

  • Conceptual Data Model
  • Logical Data Model
  • Physical Data Model
Q.2 What is the role of durable writes?
The durable_writes option tells Cassandra whether to use the commit log for updates to the current keyspace. The option is not mandatory, and its default value is true.
Q.3 Define replication factor.
Cassandra stores copies, known as replicas, of each row based on the row key. The replication factor is the number of nodes that hold a replica of each row of data.
Q.4 Define replication Strategy.

The replica placement strategy defines how replicas are placed in the ring. Cassandra ships with different strategies for determining which nodes get copies of which keys. These include:

  • Simple Strategy
  • Network Topology Strategy
Q.5 What is a Simple Strategy?
Simple Strategy (SimpleStrategy) is used for simple single-datacenter clusters. It places the first replica on a node determined by the partitioner. Additional replicas are placed on the next nodes clockwise around the ring, without considering rack or datacenter location.
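The clockwise placement described above can be sketched in a few lines of Python. This is only an illustration, not Cassandra's implementation: the ring is assumed to be a sorted list of (token, node) pairs, and the node names and token values are made up.

```python
import bisect


def simple_strategy_replicas(ring, key_token, replication_factor):
    """Return the nodes holding replicas for key_token.

    ring is a list of (token, node_name) pairs sorted by token.
    """
    tokens = [token for token, _ in ring]
    # First replica: the first node whose token is >= the key's token,
    # wrapping around to the start of the ring if necessary.
    start = bisect.bisect_left(tokens, key_token) % len(ring)
    count = min(replication_factor, len(ring))
    # Additional replicas: the next nodes clockwise, ignoring rack and datacenter.
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


# Hypothetical four-node ring with made-up tokens.
ring = [(0, "node-a"), (25, "node-b"), (50, "node-c"), (75, "node-d")]
print(simple_strategy_replicas(ring, key_token=30, replication_factor=3))
# ['node-c', 'node-d', 'node-a']
```

With a replication factor of 3, a key whose token is 30 lands on node-c first and is then copied to the next two nodes around the ring.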
Q.6 Define Network Topology Strategy.
Network Topology Strategy (NetworkTopologyStrategy) is used when a cluster is deployed across multiple datacenters. It lets you specify how many replicas to place in each datacenter. The two primary considerations when choosing these numbers are satisfying reads locally, without incurring cross-datacenter latency, and handling failure scenarios.
Q.7 What do you understand about a Row in Cassandra? Name its elements.
A row is a collection of sorted columns. It is the smallest unit that stores related data in Cassandra, and any component of a row can store data or metadata. The elements of a row are:

  • Row key
  • Column keys
  • Column values
Q.8 What is data replication?
Data replication is the process of copying data from one node to other nodes in the cluster. It provides redundancy and fault tolerance in the database. The replication factor decides the number of copies, and the replication strategy decides which nodes the data is copied to.
Q.9 What is a commit log?
The commit log is a mechanism for recovering data if the database crashes. Every write operation is recorded in the commit log, so that it can be replayed after a crash.
Q.10 Define tunable consistency in Cassandra.
Tunable consistency is one of the characteristics that makes Cassandra a popular database choice among developers, analysts, and big data architects. Consistency here means that data rows are up to date and synchronized on all their replicas. Cassandra’s tunable consistency lets users choose the consistency level best suited to their use case. It supports two kinds of consistency: eventual consistency and strong consistency. For strong consistency, the following condition must hold:

R + W > N

where:

  • N is the number of replicas
  • W is the number of nodes that need to agree for a successful write
  • R is the number of nodes that need to agree for a successful read
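The R + W > N condition is easy to check by hand, but a small Python sketch makes the arithmetic explicit. The function names here are hypothetical helpers for illustration, not part of any Cassandra API.

```python
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """Return True when the read and write replica sets must overlap (R + W > N)."""
    return r + w > n


def quorum(n: int) -> int:
    """Size of a majority of n replicas."""
    return n // 2 + 1


n = 3
print(is_strongly_consistent(n, w=quorum(n), r=quorum(n)))  # True: 2 + 2 > 3
print(is_strongly_consistent(n, w=1, r=1))                  # False: 1 + 1 <= 3
```

With three replicas, quorum reads and quorum writes always share at least one replica, while reading and writing a single replica each does not guarantee that a read sees the latest write.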
Q.11 What is the process of Cassandra’s write function?
Cassandra performs a write in two steps: first it appends the write to a commit log on disk, and then it writes to an in-memory structure known as a memtable. Once both steps succeed, the write is considered successful. When a memtable fills up, its contents are flushed to disk as SSTables (sorted string tables). Because a write only requires a sequential append and an in-memory update, Cassandra offers fast write performance.
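The order of operations on the write path can be modelled with a toy Python class. This is a simplified sketch under stated assumptions: the commit log is a plain list rather than a file, the flush threshold (memtable_limit) is a made-up parameter, and details such as discarding commit log segments after a flush are left out.

```python
class WritePathSketch:
    """Toy model of the write path: commit log, then memtable, then flush."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # stands in for the append-only log on disk
        self.memtable = {}     # in-memory structure
        self.sstables = []     # each flush produces one immutable, sorted table
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # step 1: append to the commit log
        self.memtable[key] = value            # step 2: update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()
        return True  # both steps succeeded

    def flush(self):
        # Write memtable contents out sorted by key, then start a fresh memtable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}


store = WritePathSketch(memtable_limit=2)
store.write("k2", "b")
store.write("k1", "a")
print(store.sstables)  # [(('k1', 'a'), ('k2', 'b'))]
print(store.memtable)  # {}
```

The second write fills the memtable, so its contents are flushed as one sorted table and the memtable starts empty again.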
Q.12 What is memtable?
A memtable is an in-memory, write-back cache that holds content in key and column format. The data in a memtable is sorted by key, and each column family has its own memtable, from which column data is retrieved by key. A memtable stores writes until it is full and is then flushed to disk.
Q.13 Define Bloom Filter.
Each Bloom filter is associated with an SSTable. It is an off-heap data structure (held in native memory rather than on the Java heap) used to check whether an SSTable might contain the requested data before performing any disk I/O. A Bloom filter can report false positives but never false negatives.
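To show the idea behind the check, here is a minimal Bloom filter in Python. It is a sketch for illustration only: the bit-array size, the number of hashes, and the use of salted SHA-256 are assumptions, and Cassandra's own filter is implemented differently.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter()
bf.add("row-1")
print(bf.might_contain("row-1"))  # True
```

A False answer lets the read skip the SSTable entirely; a True answer only means the SSTable has to be consulted to find out.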
Q.14 What is CAP Theorem?
The CAP theorem is central to reasoning about how distributed systems behave as they scale. It states that a distributed system such as Cassandra can guarantee only two of three properties at the same time: consistency, availability, and partition tolerance (CAP). Because network partitions cannot be ruled out in a distributed system, the two practical options are AP and CP. Cassandra is generally classified as an AP system, with consistency tunable per operation.
Q.15 Differentiate between a node, a cluster, and a data center in Cassandra.
A node is a single machine running Cassandra. A cluster is a collection of nodes that together store the data. A data center is a grouping of nodes within a cluster, which is useful when serving customers in different geographical areas; the nodes of one cluster can be grouped into several data centers.
Q.16 What is compaction in Cassandra?
Compaction is a maintenance process in which Cassandra reorganizes SSTables to optimize the data structures on disk. It is needed because every memtable flush creates a new SSTable, so the number of SSTables grows over time. There are two types of compaction in Cassandra:
1. Minor compaction: This begins automatically when a new SSTable is created. Cassandra condenses equally sized SSTables into one.
2. Major compaction: This is triggered manually using nodetool. It compacts all SSTables of a column family into one.
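The core of compaction, merging several SSTables and keeping only the newest version of each key, can be sketched in Python. This is a simplified model, not Cassandra's algorithm: each SSTable is assumed to be a dict mapping a key to a (timestamp, value) pair, and tombstone handling and size tiers are ignored.

```python
def compact(sstables):
    """Merge SSTables into one, keeping the newest value for each key.

    Each SSTable is a dict mapping key -> (timestamp, value).
    """
    merged = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    # Return the result sorted by key, as an SSTable would be.
    return dict(sorted(merged.items()))


old = {"a": (1, "x"), "b": (1, "y")}
new = {"a": (2, "z"), "c": (2, "w")}
print(compact([old, new]))
# {'a': (2, 'z'), 'b': (1, 'y'), 'c': (2, 'w')}
```

The older value for key "a" is dropped because a newer write exists, which is how compaction reclaims space.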
Q.17 Define Super Column in Cassandra.
A super column is a special element that groups similar collections of data. Super columns are key-value pairs whose values are columns. A super column is a sorted array of columns, and it sits in the following hierarchy: keyspace > column family > super column > column (a nesting often written out in JSON). A super column holds no independent value of its own; it is used only to collect other columns.
Q.18 Explain what Cassandra is.
Cassandra is an open source data storage system, originally developed at Facebook for inbox search, designed for storing and managing large amounts of data across commodity servers. It can serve both as a real-time data store for online applications and as a read-intensive database for business intelligence systems.
Q.19 State the use of Cassandra and why to use Cassandra?

Cassandra was designed to handle big data workloads across multiple nodes without any single point of failure. The main reasons for using Cassandra are:

  • It is fault tolerant and consistent
  • It scales from gigabytes to petabytes
  • It is a column-oriented database
  • No single point of failure
  • No need for separate caching layer
  • Flexible schema design
  • It has easy data distribution, flexible data storage, and fast writes
  • It provides atomic, isolated, and durable writes at the row level, though not full ACID (Atomicity, Consistency, Isolation, and Durability) transactions
  • Multi-data center and cloud capable
  • Data compression.
Q.20 Explain what a composite type is in Cassandra.
Cassandra's built-in composite types come in two forms:

  • Static composite type: Data types for each part of a composite column are predefined per column family.  All the column names/keys within a column family must be of that composite type.
  • Dynamic composite type: This type allows mixing column names with different composite types in a column family or even in one row.
Q.21 How does Cassandra store data?
  • All data is stored as bytes
  • When you specify a validator, Cassandra ensures that those bytes are encoded as required
  • A comparator then orders the columns according to the ordering specific to that encoding
  • Composites are just byte arrays with a particular encoding: for each component, Cassandra stores a two-byte length, followed by the byte-encoded component, followed by an end-of-component byte.
Q.22 What are the main components of the Cassandra data model?
The main components of the Cassandra data model are:
  • Cluster
  • Keyspace
  • Column
  • Column family
Q.23 Explain what a column family is in Cassandra.
A column family is a collection of rows in Cassandra.
Q.24 Explain what a cluster is in Cassandra.
A cluster is a container for keyspaces. A Cassandra database is distributed over several machines that function together. The cluster is the outermost container; it arranges the nodes in a ring and assigns data to them. Nodes hold replicas of the data, which take over if another node fails.
Q.25 List out the other components of Cassandra?

The other components of Cassandra are

  • Node
  • Data Center
  • Cluster
  • Commit log
  • Mem-table
  • SSTable
  • Bloom Filter
Q.26 Explain what a keyspace is in Cassandra.
In Cassandra, a keyspace is a namespace that determines how data is replicated on nodes. A cluster typically contains one keyspace per application.
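Q.27 Give the syntax to create a keyspace in Cassandra.
The syntax for creating a keyspace in Cassandra is:
CREATE KEYSPACE keyspace_name WITH properties
Here keyspace_name is the name of the new keyspace, and properties includes the required replication setting and, optionally, durable_writes.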
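As a concrete instance of this syntax, the Python sketch below builds a complete CREATE KEYSPACE statement as a string. The helper name, the keyspace name "demo", and the choice of SimpleStrategy with a replication factor of 3 are all assumptions made for illustration; the helper only formats text and does not talk to a cluster.

```python
def create_keyspace_cql(name, replication_factor, durable_writes=True):
    """Build a CREATE KEYSPACE statement using SimpleStrategy (hypothetical helper)."""
    # Rough sanity check only; CQL's identifier rules are stricter than Python's.
    if not name.isidentifier():
        raise ValueError(f"invalid keyspace name: {name!r}")
    return (
        f"CREATE KEYSPACE {name} "
        f"WITH replication = {{'class': 'SimpleStrategy', "
        f"'replication_factor': {int(replication_factor)}}} "
        f"AND durable_writes = {str(bool(durable_writes)).lower()};"
    )


print(create_keyspace_cql("demo", 3))
# CREATE KEYSPACE demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3} AND durable_writes = true;
```

The resulting statement could be pasted into cqlsh; for a multi-datacenter cluster the replication map would name NetworkTopologyStrategy and per-datacenter replica counts instead.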
Q.28 Mention the values that are stored in the Cassandra Column?

A Cassandra column holds three values:

  • Column Name
  • Value
  • Time Stamp
Q.29 When can you use ALTER KEYSPACE?
ALTER KEYSPACE can be used to change the properties of a keyspace, such as the number of replicas and durable_writes.
Q.30 Explain what Cassandra cqlsh is.

cqlsh is a command-line shell that lets users communicate with the Cassandra database using CQL (Cassandra Query Language). Using cqlsh, one can:

  • Define a schema
  • Insert data
  • Execute a query.
Q.31 Explain how Cassandra writes changed data into the commit log.
  • Cassandra appends changed data to the commit log
  • The commit log acts as a crash recovery log for data
  • A write operation is never considered successful until the changed data has been appended to the commit log
  • Data will not be lost once the commit log has been flushed to a file.
Q.32 Explain how Cassandra deletes data.
SSTables are immutable, so a row cannot be removed from them directly. When a row needs to be deleted, Cassandra instead assigns the column value a special marker called a tombstone.
When the data is read, anything marked with a tombstone is treated as deleted. The tombstoned data is physically removed later, during compaction.
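The way a tombstone hides data without removing it can be illustrated with a small Python model. This is a sketch under simplifying assumptions: a single dict stands in for the stored cells, timestamps are plain integers, and the class and key names are invented for the example.

```python
TOMBSTONE = object()  # sentinel standing in for Cassandra's tombstone marker


class TombstoneStore:
    def __init__(self):
        self.cells = {}  # key -> (timestamp, value or TOMBSTONE)

    def write(self, key, value, timestamp):
        current = self.cells.get(key)
        if current is None or timestamp >= current[0]:
            self.cells[key] = (timestamp, value)

    def delete(self, key, timestamp):
        # A delete is just another write whose value is the tombstone.
        self.write(key, TOMBSTONE, timestamp)

    def read(self, key):
        cell = self.cells.get(key)
        if cell is None or cell[1] is TOMBSTONE:
            return None  # tombstoned data is reported as deleted
        return cell[1]


store = TombstoneStore()
store.write("user:1", "alice", timestamp=1)
store.delete("user:1", timestamp=2)
print(store.read("user:1"))     # None
print("user:1" in store.cells)  # True: the tombstone is still stored
```

The read reports the row as deleted even though an entry for the key still exists, which mirrors how tombstones linger until they are cleaned up.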
Q.33 State the usage of the "void close()" method.
In the Cassandra Java driver, the void close() method is used to close the current session instance.
Q.34 Which command is used to start the cqlsh prompt?
The cqlsh command is used to start the cqlsh prompt.
Q.35 Give the usage of the "cqlsh --version" command.
The "cqlsh --version" command shows the version of cqlsh that you are using.
Q.36 Does Cassandra work on Windows?
Yes. Cassandra is compatible with Windows and works well on it, and versions for both Linux and Windows have been available. Note that Windows support was removed in Cassandra 4.0, so this applies to earlier releases.
Q.37 What is Kundera in Cassandra?
Kundera is an object-relational mapping (ORM) implementation for Cassandra that is written using Java annotations.
Q.38 What do you understand by Thrift in Cassandra?
Thrift is the RPC framework that clients used to communicate with the Cassandra server. It is a legacy interface that has since been superseded by CQL.
Q.39 What is Hector in Cassandra?
Hector was one of the early Cassandra clients. It is an open source project, written in Java and released under the MIT license.
Q.40 State some of the features of Apache Cassandra.
Some of the features of Apache Cassandra -
1. High Scalability
2. High fault tolerance
3. Flexible Data storage
4. Easy data distribution
5. Tunable Consistency
6. Efficient Writes
7. Cassandra Query Language
Q.41 How would you define NoSQL Database?
A NoSQL database is a non-relational database. The name is also read as "Not only SQL". A NoSQL database provides a mechanism to store and retrieve different types of data, including images, sounds, and more.
Q.42 What are the primary features of any NoSQL database?
Some of the primary features of any NoSQL database are -
1. Schema Agnostic
2. AutoSharding and Elasticity
3. Highly Distributable
4. Easily Scalable
5. Integrated Caching
Q.43 Which query language is used in Cassandra Database?
Cassandra Query Language (CQL) is used for the Cassandra database. CQL is the interface through which a user accesses the database, and all operations are carried out through it.
Q.44 What is the primary objective of creating Cassandra?
The primary objective of creating Cassandra was to handle large amounts of data, while ensuring fault tolerance and fast data transfer.
Q.45 What do you understand by Document Store DB?
In a document store, each data record is a JSON or XML representation of key-value pairs, and every record can have a different set of fields. Document DBs are similar to key-value stores; the only difference is that the key is associated with a document.
Q.46 What is the purpose of CQLSH?
cqlsh is a command-line shell that enables users to communicate with the Cassandra database using CQL. The purpose of using cqlsh is to -
1. Define a schema
2. Insert data
3. Execute a query
Q.47 How do you define a YAML file in Cassandra?
The cassandra.yaml file is the main configuration file for Cassandra. After changing properties in cassandra.yaml, you must restart the node for the changes to take effect.
Q.48 Define Key-Value Store DB.
In this, all of the data inside the database consists of an indexed key and a value. A key may correspond to one or multiple values (hash table). Moreover, it provides great performance and can be very easily scaled as per business needs.
Q.49 Define Column Store DB.
In a column store DB, data is stored in cells that are grouped in columns rather than in rows. Columns are logically grouped into column families. One row may hold one or multiple data records, which are indexed by a partition key.
Q.50 What do you understand about Graph DB?
A graph DB is a type of NoSQL database that uses a flexible graph representation. Its key purpose is to store the relationships between nodes.