Tuning commit logs and memtables

Cassandra is optimized for write throughput. Writes go first to a commit log (for durability) and then to an in-memory table structure called a memtable. A write succeeds once it is recorded in both the commit log and memory, so there is very little disk I/O at write time. Writes are batched in memory and periodically flushed to disk as a persistent table structure called an SSTable (sorted string table). Memtables and SSTables are maintained per column family. Memtables are organized in sorted order by row key and flushed to SSTables sequentially, with none of the random seeking found in relational databases.

Commit Logs

The biggest performance gain for writes comes from putting the commit log on a separate disk drive:

  • The commit log uses sequential writes, and most hard drives will meet its throughput requirement.
  • If SSTables share a drive with the commit log, however, I/O contention between the two can degrade both commit log writes and SSTable reads.
  • Unfortunately, current cloud computing services do not generally provide a true standalone drive.
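
As a sketch of what this separation looks like in storage-conf.xml (0.6.x), the commit log and data locations are configured independently; the mount points below are hypothetical examples, not defaults:

  <!-- Commit log on its own physical drive (example path) -->
  <CommitLogDirectory>/mnt/disk1/cassandra/commitlog</CommitLogDirectory>

  <!-- SSTable data files on a different drive (example path) -->
  <DataFileDirectories>
      <DataFileDirectory>/mnt/disk2/cassandra/data</DataFileDirectory>
  </DataFileDirectories>

In 0.7+ the same separation is expressed in cassandra.yaml via the commitlog_directory and data_file_directories settings.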

Memtables

When performing write operations, Cassandra stores values in column-family-specific, in-memory data structures called Memtables. These Memtables are flushed to disk whenever one of the configurable thresholds is exceeded. The initial settings (64MB/0.3) are purposefully conservative, and proper tuning of these thresholds is important for making the most of available system memory without bringing the node down for lack of memory.

Configuring Thresholds

Larger Memtables take memory away from caches: since Memtables store actual column values, they consume at least as much memory as the size of the data inserted. However, there is also overhead associated with the structures used to index this data. When the number of columns and rows is high compared to the size of the values, this overhead can become quite significant (possibly greater than the data itself). In other words, which threshold(s) to use, and what to set them to, is a function not just of how much memory you have, but of how many column families you have, how many columns per column family, and the size of the values being stored.
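
For a rough illustration with made-up numbers: inserting ten million columns whose values average 20 bytes puts about 200MB of actual data in the Memtable, but if the per-column bookkeeping (column name, timestamp, and indexing structures) averages, say, 60 bytes, that adds roughly 600MB on top, three times the data itself. With large values, the same fixed per-column overhead becomes negligible.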

Larger Memtables don’t improve write performance: increasing Memtable capacity causes less-frequent flushes but doesn’t improve write performance directly, since writes go straight to memory regardless. (If your commit log and SSTables share a volume they may contend, however, so put them on separate volumes if at all possible, as discussed above.)

Larger Memtables do absorb more overwrites: if your write load sees some rows written more often than others (e.g., upvotes of a front-page story), a larger Memtable will absorb those overwrites, creating more efficient SSTables and thus better read performance. If your write load is batch-oriented, or if you have a massive row set, rows are unlikely to be rewritten for a long time, so this benefit will pay a smaller dividend.

Larger Memtables do lead to more effective compaction: since compaction is tiered, large SSTables are preferable; turning over tons of tiny Memtables is bad. Again, this impacts read performance (by reducing overall I/O contention), but not writes.

Listed below are the thresholds found in storage-conf.xml (or cassandra.yaml in 0.7+), along with a description.

MemtableThroughputInMB
As the name indicates, this sets the maximum size, in megabytes, that the Memtable will store before triggering a threshold violation and being flushed to disk. It corresponds to the size of the values inserted (plus the size of the containing column).

If left unconfigured (missing from the config), this defaults to 128MB.

Note: this setting was called MemtableSizeInMB in versions of Cassandra before 0.6.0. In version 0.7b2+, the value is applied on a per-column-family basis.
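
As a minimal sketch, the directive appears in storage-conf.xml like this, using the conservative initial value mentioned above:

  <!-- Flush the Memtable once ~64MB of column data has accumulated -->
  <MemtableThroughputInMB>64</MemtableThroughputInMB>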

MemtableOperationsInMillions
As the name indicates, this directive sets a threshold on the number of columns stored, expressed in millions.
Left unconfigured (missing from the config), this defaults to 0.1 (or 100,000 objects). The config file’s initial setting of 0.3 (or 300,000 objects) is a conservative starting point.
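
The corresponding sketch for this directive, again using the config file’s initial value:

  <!-- Flush after roughly 300,000 column operations (0.3 million) -->
  <MemtableOperationsInMillions>0.3</MemtableOperationsInMillions>

As noted above, a Memtable is flushed as soon as either threshold is exceeded, whichever comes first.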
