intra-mart Accel Platform / Cassandra Administrator Guide

Version 8 2014-04-01


9. Cassandra Operations

This section describes Cassandra operations.

9.1. Timing and Risk of Status Change

This section describes when the status of Cassandra can change and the risks associated with each case.

9.1.1. Planned Stoppage

This is the case in which Cassandra is stopped at an intended time, either manually or by the service.
It is assumed that there are no outstanding update requests from the ApplicationServer when Cassandra is stopped.

Note

Handling Example

  • Stop the ApplicationServer before Cassandra is stopped.
  • Execute the nodetool drain command so that no further write requests are accepted.
    A flush is executed at the same time as the drain (see the example commands after this list).
  • In case the flush succeeded :
    This is the normally expected status. No data is lost, because the flush has been executed, so no further action is required.
  • In case the flush failed :
    During a planned stoppage a flush is normally executed, which preserves data integrity as of the time of the stoppage. However, the flush may occasionally fail.
    In that case, data that was still in memory is lost, because it was not flushed to disk.
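
The commands below sketch this procedure for one node. They are an illustration only: they assume a Linux host on which Cassandra is installed as a service named cassandra and nodetool is on the PATH; the actual stop method depends on how Cassandra was installed in your environment.

    # Stop the ApplicationServer first (procedure depends on your environment).

    # Refuse further write requests and flush the memtables to disk on this node.
    nodetool drain

    # Check the Cassandra system log to confirm that the drain (and flush)
    # completed without errors before stopping the process.

    # Stop the Cassandra process (assuming a service installation).
    service cassandra stop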

9.1.2. Abnormal Stoppage

This represents the case in which Cassandra is stopped by some unexpected cause.

Note

The following situations are assumed.

  • An exception that makes it impossible for Cassandra to continue the service (such as OutOfMemoryError) occurs
  • The Cassandra process is forcibly terminated
  • The OS stops while Cassandra is running
  • In case a flush had been executed :
    A flush may have been executed shortly before the abnormal stoppage.
    Since the data had already been flushed, there is no data loss and no further action is required.
  • In case a flush had not been executed or failed :
    A flush is normally not executed before an abnormal stoppage, and even if one is attempted it is likely to fail.
    Data that was still in memory is lost, because it was not flushed to disk.

Note

Whether the stoppage is planned or abnormal, there is a risk of data loss if the flush fails.

9.1.3. Disconnection from Network

If a Cassandra node is disconnected from the network, no data loss occurs.
However, the service may become unavailable depending on the configuration.
  • In case it is built with a single node :
    IMBox is not available until Cassandra is restored.

    Note

    In single-node operation, no data loss occurs, because read/write operations cannot be executed while the network is disconnected.
  • In case it is built as a cluster :
    The status depends on how many nodes are disconnected from the network.
    • In case enough nodes remain available to continue the service :
      For example, if 1 node is disconnected while [3 nodes (replication factor is 3)] are in operation, the service continues normally on the remaining 2 nodes.
      When the disconnected node is reconnected to the network, the data it missed is replicated (copied) to it and normal operation resumes.
    • In case too few nodes remain available to continue the service :
      IMBox becomes unavailable until Cassandra is restored.
      For example, if 2 nodes are disconnected while [there are 3 nodes and the replication factor is 3], the service cannot continue.
      Since the service is unavailable and read/write operations cannot be performed, no data loss occurs.

9.2. Risk Mitigation

As described in the previous section, countermeasures are essentially needed for the cases in which a flush fails or is not executed.
Several countermeasures are described below.

9.2.1. Flush Execution before Planned Stoppage

Execute a flush once before the planned stoppage.
By doing this, the data in memory is already written to disk, so the risk posed by a flush failure at the time of the planned stoppage is avoided.
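
As an illustration, a flush can be executed with the nodetool utility on each node before the stoppage. The host address and JMX port below are placeholders; adjust them for your environment.

    # Flush all memtables to disk on the local node.
    nodetool flush

    # When targeting a remote node, specify its host (and JMX port, default 7199).
    nodetool -h 192.168.0.10 -p 7199 flush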

9.2.2. Periodic Flush Execution

By executing a flush periodically (for example with cron) while Cassandra is running, the data as of the last flush is preserved even if data loss occurs because of an abnormal stoppage.
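
As a sketch only, the crontab entry below runs a flush every 15 minutes. The interval, the installation path of nodetool, the executing user, and the log file are assumptions that must be adapted to your operations design.

    # /etc/cron.d/cassandra-flush (example): flush memtables every 15 minutes.
    */15 * * * *  cassandra  /usr/local/cassandra/bin/nodetool flush >> /var/log/cassandra/periodic-flush.log 2>&1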

Note

This countermeasure only reduces the likelihood or amount of data loss.
For details about flush, refer to the relevant section of this guide.

Note

The flush interval should be chosen to match the requirements of the operations design.
The following items should be assessed.
  • The maximum period of time for which data loss can be tolerated
  • The impact on service response time and server load caused by the additional I/O of the periodic flush

9.2.3. Cluster Configuration Build

The possibility of data loss or service stoppage can be significantly reduced by building and operating Cassandra as a cluster.
It is strongly recommended that the cluster consist of 3 or more nodes.
  • In case the number of nodes is 2 :
    With 2 nodes in the cluster, the service (data reads and writes) cannot continue if one of the nodes stops.
    While the service cannot continue, no data loss occurs, because no data can be written.
  • In case the number of nodes is 3 (or more) :
    With 3 nodes in the cluster (and a replication factor of 3), the service can continue even if 1 node stops.
    When the stopped node is recovered, the data it missed is replicated (copied) to it, and normal operation resumes once the replication is complete.
    If 2 of the 3 nodes stop at the same time, the service cannot continue. Therefore, recovery should be started as soon as 1 node stops.
    However, since no data can be written while the service cannot continue, no data loss occurs.
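
To check how many nodes are currently participating in the cluster, the nodetool utility can be used. This is a reference example only; the available subcommands depend on the Cassandra version in use.

    # Show the state of every node in the cluster.
    # In the output, "UN" means Up/Normal and "DN" means Down.
    nodetool status

    # On older Cassandra versions, nodetool ring provides similar information.
    nodetool ring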

9.2.4. Summary

The effective measures for each configuration are summarized below.

9.2.4.1. In case Cassandra is operated with a Single Node

When Cassandra is operated with a single node, it is not possible to completely prevent data loss in the event of an abnormal stoppage.
The following countermeasures are considered effective against an abnormal stoppage.
  • Flush execution before stoppage
  • Periodic flush execution

Note

Reference Information

The Cassandra system used internally by our company is operated with a single node, and a flush is executed at 15-minute intervals.
This system has been in operation for about 2 years, and posted data for the preceding 15 minutes (at most) has been lost 4 times during that period.
[*] Because our company's system is also used as an operations test environment, it runs pre-release (RC) versions, so its error rate is expected to be higher than that of a normal system.

9.2.4.2. In case Cassandra is operated in Cluster Configuration

Operating in a cluster configuration by itself significantly reduces the possibility of data loss.
When the entire cluster needs to be stopped for maintenance or other purposes, the risk of data loss can be further reduced by taking the following measure.
  • Flush Execution before Stoppage

9.3. Data Recovery

If data loss has occurred, the data that can and cannot be restored is as follows :

9.3.1. Recoverable Data

[User] and [Department] information that is written synchronously by IM-Common Master

Note

This data can be restored by executing the dedicated job (job net).
For details, refer to [IMBox Specifications - Job Scheduler].

9.3.2. Unrecoverable Data

[Posting Contents] and [Newly Created Group] data updated on IMBox are saved only on Cassandra.
Therefore, they cannot be restored once data loss has occurred.

Note

Since attachment files are saved on the file system (PublicStorage), they can be restored by the System Administrator.
For the storage location of attachment files, refer to [IMBox Specifications - Action - File].

9.4. Time Adjustment

All data stored in Cassandra carries a timestamp. This timestamp is used to resolve data inconsistencies, mainly in a clustered environment.
Therefore, operations such as changing the clock of the server on which Cassandra runs can cause data mismatches and may damage the data.
Note that the timestamp stored in Cassandra comes from the clock of the application connecting to Cassandra, not from the clock of the server on which Cassandra runs.
Therefore, the clocks of the intra-mart Accel Platform operating environment and of all servers on which Cassandra runs must be set to the same time.
Consider using an NTP server so that all of these clocks stay synchronized.
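
A minimal sketch of NTP-based synchronization is shown below, assuming a Linux host with the ntp package installed. The server name is a placeholder; commands and file locations vary by distribution.

    # /etc/ntp.conf (excerpt): synchronize against your organization's NTP server.
    server ntp.example.local iburst

    # Restart the NTP daemon and verify that it is synchronizing.
    service ntpd restart
    ntpq -p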
