Configuring HA - an example
This supplements the High Availability documentation and provides an example of how to configure a pair of Cloudsoft AMP servers to run in master-standby mode with a shared NFS datastore.
Prerequisites
- Two VMs (or physical machines) have been provisioned
- NFS or another suitable file system has been configured and is available to both VMs*
- An NFS folder has been mounted on both VMs at `/mnt/brooklyn-persistence`, and both machines can write to the folder
* AMP can be configured to use either an object store such as S3, or a shared NFS mount. The recommended option is to use an object store, as described in the Object Store Persistence documentation; for simplicity, a shared NFS folder is assumed in this example.
Launching
To start, download and install the latest Cloudsoft AMP release on both VMs following the instructions in Running Cloudsoft AMP.
On the first VM, which will be the master node, set the following configuration options (in `org.apache.brooklyn.osgilauncher.cfg`):
- `highAvailabilityMode`: MASTER
- `persistMode`: AUTO
- `persistenceDir`: /mnt/brooklyn-persistence
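For illustration, these map to entries like the following in the cfg file (a sketch; the `etc/` location is an assumption based on a standard Karaf layout):

```
# etc/org.apache.brooklyn.osgilauncher.cfg
highAvailabilityMode=MASTER
persistMode=AUTO
persistenceDir=/mnt/brooklyn-persistence
```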
Then launch AMP with:
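The exact command depends on how AMP was installed; a minimal sketch, assuming a tarball install that provides a `bin/start` launcher script (the script name and install location are assumptions):

```bash
cd /opt/amp   # hypothetical install location
bin/start
```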
If you are using RPMs/deb to install, please see the Running Cloudsoft AMP documentation for the appropriate launch commands.
Once AMP has launched, on the second VM set the following configuration options (in `org.apache.brooklyn.osgilauncher.cfg`):
- `highAvailabilityMode`: AUTO
- `persistMode`: AUTO
- `persistenceDir`: /mnt/brooklyn-persistence
Then launch the standby AMP with:
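Again a sketch, assuming the same hypothetical `bin/start` launcher from a tarball install:

```bash
cd /opt/amp   # hypothetical install location
bin/start
```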
Failover
When running as an HA standby node, each standby AMP server (in this case there is only one standby) will check the shared persisted state every second to determine the state of the HA master. If no heartbeat has been recorded for 30 seconds, an election will be performed and one of the standby nodes will be promoted to master. At this point all requests should be directed to the new master node.
If the master is terminated gracefully, the secondary will be promoted to master immediately. Otherwise, the secondary will be promoted after heartbeats have been missed for a given length of time. This defaults to 30 seconds, and is configured in `brooklyn.properties` using `brooklyn.ha.heartbeatTimeout`.
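For example, to shorten the failover window, something like the following could be set (the `10s` value and duration syntax are illustrative):

```properties
# brooklyn.properties: promote a standby after 10 seconds of missed heartbeats
brooklyn.ha.heartbeatTimeout=10s
```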
If tasks - such as the provisioning of a new entity - are running when a failover occurs, the new master will display the current state of the entity, but will not resume its provisioning or re-run any partially completed tasks. In this case it may be necessary to remove the entity and re-provision it. If a failover occurs whilst a task called by an effector is executing, it may be possible to simply call the effector again.
High Availability Management
In addition to the API, High Availability can be controlled explicitly from the AMP UI, which allows a server to change its priority and to request promotion to master. This is done via the *HA Status* table on the *About* page, which displays information about the nodes in the current management plane. The control menu is opened by selecting the *Manage* option on the current server's entry in the table. This menu allows the priority value, as well as the status of the node, to be changed.
- If a node is `MASTER`, it can demote itself by changing to another state. In that case, a new master is selected from the available standby servers based on their priority.
- If a node is `STANDBY` or `HOT_STANDBY`, it can promote itself by changing to the `MASTER` state. It is recommended that this server have the highest priority amongst all available servers.
Additionally, terminated servers can be removed from the persisted state with the *Remove* option (visible on hovering over the terminated node in the *HA Status* table). All terminated servers can be removed at once with the *Remove terminated nodes* option. These operations are only available to the master node.
Client Configuration
It is the responsibility of the client to connect to the master AMP server. This can be accomplished in a variety of ways.
Reverse Proxy
To allow the client application to fail over automatically in the event of the master server becoming unavailable, or the promotion of a new master, a reverse proxy can be configured to route traffic depending on the response returned by `https://<ip-address>:8443/v1/server/ha/state` (see above).
If a server returns `"MASTER"`, then traffic should be routed to that server; otherwise it should not be. The client software should be configured to connect to the reverse proxy server, and no action is required by the client in the event of a failover. It can take up to 30 seconds for the standby to be promoted, so the reverse proxy should retry for at least this period, or the failover time should be reconfigured to be shorter.
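One way to implement this is with HAProxy; a hypothetical backend sketch (backend name and addresses are placeholders) that health-checks each node's HA state and routes only to the node reporting `MASTER`:

```
backend amp_cluster
    mode http
    option httpchk GET /v1/server/ha/state
    http-check expect string MASTER
    # check frequently so a newly promoted master is picked up quickly
    server amp1 10.0.0.1:8443 check inter 2s ssl verify none
    server amp2 10.0.0.2:8443 check inter 2s ssl verify none
```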
Re-allocating an Elastic IP on Failover
If the cloud provider you are using supports Elastic or Floating IPs, then the IP address should be allocated to the HA master, and the client
application configured to connect to the floating IP address. In the event of a failure of the master node, the standby node will automatically
be promoted to master, and the floating IP will need to be manually re-allocated to the new master node. No action is required by the client
in the event of a failover. It is possible to automate the re-allocation of the floating IP if the AMP servers are deployed and managed by AMP, using the entity `org.apache.brooklyn.entity.brooklynnode.AMPCluster`.
Client-based failover
In this scenario, the responsibility for determining the AMP master server falls on the client application. When configuring the client application, a list of all servers in the cluster is passed in at application startup. On first connection, the client application connects to any member of the cluster to retrieve the HA states (see above). The JSON object returned is used to determine the addresses of all members of the cluster, and also which node is the HA master.
In the event of a failure of the master node, the client application should then retrieve the HA states of the cluster from any of the other cluster members. This is the same process as when the application first connects to the cluster. The client should refresh its list of cluster members and determine which node is the HA master.
It is also recommended that the client application periodically checks the status of the cluster and updates its list of addresses. This will ensure that failover is still possible if the standby server(s) have been replaced. It also allows additional standby servers to be added at any time.
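As an illustration of this approach, a hypothetical shell sketch of client-side master discovery (the node URLs and `admin:password` credentials are placeholders; assumes `jq` is installed and the `/v1/server/ha/states` response shape shown later in this page):

```bash
#!/usr/bin/env bash
# Hypothetical sketch: find the current HA master given a list of known nodes.
NODES=("https://10.0.0.1:8443" "https://10.0.0.2:8443")

for node in "${NODES[@]}"; do
  # Ask any reachable node for the HA states of the whole cluster
  states=$(curl -skf -u admin:password "$node/v1/server/ha/states") || continue
  master_id=$(echo "$states" | jq -r '.masterId')
  master_uri=$(echo "$states" | jq -r --arg id "$master_id" '.nodes[$id].nodeUri')
  if [ -n "$master_uri" ] && [ "$master_uri" != "null" ]; then
    echo "HA master is $master_uri"
    exit 0
  fi
done

echo "No master found" >&2
exit 1
```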
Sharing SSH keys
When AMP is installed using the RPM, it will create a new user, “amp”, if one does not already exist. No SSH keys are generated in ~/.ssh/, as these would differ across HA nodes, meaning that after a failover the new master would not have access to any existing entities. To work around this, AMP creates and stores SSH keys in the persistence store whenever a virtual machine is provisioned. An alternative is to install shared SSH keys on each of the AMP servers in the HA cluster. We recommend doing this as follows.
Put the `id_rsa` and `id_rsa.pub` keys onto each of the servers in the AMP cluster, e.g. by using a tool such as `scp`. The following steps will place the keys in the “amp” user account with the correct ownership and permissions.
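A sketch, assuming the key pair has already been copied to the current directory on each server:

```bash
sudo mkdir -p ~amp/.ssh
sudo mv id_rsa id_rsa.pub ~amp/.ssh/
sudo chown -R amp:amp ~amp/.ssh
sudo chmod 700 ~amp/.ssh
sudo chmod 600 ~amp/.ssh/id_rsa
sudo chmod 644 ~amp/.ssh/id_rsa.pub
```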
If AMP is already running then you will need to restart AMP.
Testing
You can confirm that AMP is running in high availability mode on the master by logging into the web console at `https://<ip-address>:8443`.
Similarly you can log into the web console on the standby VM where you will see a warning that the server is not the high availability master.
To test a failover, you can simply terminate the process on the first VM and log into the web console on the second VM. Upon launch, AMP will
output its PID to the file `pid.txt`; you can force an immediate (non-graceful) termination of the process by running the following command
from the same directory from which you launched AMP:
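For example, sending SIGKILL to the PID recorded in `pid.txt`:

```bash
kill -9 $(cat pid.txt)
```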
It is also possible to check the high availability state of a running AMP server using the following curl command:
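A minimal sketch, assuming placeholder `admin:password` credentials and a self-signed certificate (hence `-k`):

```bash
curl -k -u admin:password https://<ip-address>:8443/v1/server/ha/state
```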
This will return one of the following states:
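The exact set of states can vary by AMP version; the values referenced in this document are shown as an illustration:

```
"MASTER"
"STANDBY"
"HOT_STANDBY"
```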
Note: The quotation characters will be included in the reply.
To obtain information about all of the nodes in the cluster, run the following command against any of the nodes in the cluster:
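Assuming the same placeholder credentials, using the plural `ha/states` endpoint:

```bash
curl -k -u admin:password https://<ip-address>:8443/v1/server/ha/states
```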
This will return a JSON document describing the AMP nodes in the cluster. An example of two HA AMP nodes is as follows (whitespace formatting has been added for clarity):
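A representative sketch of the returned document (the node IDs, URIs, and timestamps are illustrative, and the exact fields may vary by version):

```json
{
  "ownId": "XkJeB3VF",
  "masterId": "XkJeB3VF",
  "nodes": {
    "XkJeB3VF": {
      "nodeId": "XkJeB3VF",
      "nodeUri": "https://10.0.0.1:8443/",
      "status": "MASTER",
      "localTimestamp": 1431517378441,
      "remoteTimestamp": 1431517378441
    },
    "yAVz0fzo": {
      "nodeId": "yAVz0fzo",
      "nodeUri": "https://10.0.0.2:8443/",
      "status": "STANDBY",
      "localTimestamp": 1431517378512,
      "remoteTimestamp": 1431517378512
    }
  },
  "links": {}
}
```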
The examples above show how to use `curl` to manually check the status of AMP via its REST API. The same REST API calls can also be used by automated third-party monitoring tools such as Nagios.