Creating a Service and Testing Failover
Services are created and usually configured to run a resource agent that is responsible for starting and stopping processes. Most resource agents are created according to the OCF (Open Cluster Framework) specification, which is defined as an extension for the Linux Standard Base (LSB). Many handy resource agents for commonly used processes are included in the resource-agents packages, including various heartbeat agents that track whether commonly used daemons or services are still running.
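If you want to see which resource agents are available on a system before choosing one, you can list them with pcs. For example, the following command lists the agents in the pacemaker namespace; the filter string is optional and only narrows the output:
sudo pcs resource list ocf:pacemaker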
In the following example, a service is set up that uses a Dummy resource agent created precisely to test Pacemaker. This agent is used because it requires only a basic configuration and doesn't make any assumptions about the environment or the types of services that you intend to run with Pacemaker.

To create a service and test failover:
- Add the service as a resource by using the pcs resource create command:
sudo pcs resource create dummy_service ocf:pacemaker:Dummy op monitor interval=120s
In the previous example, dummy_service is the name that is provided for the service for this resource.

To invoke the Dummy resource agent, the notation ocf:pacemaker:Dummy is used to specify that the agent conforms to the OCF standard, that it runs in the pacemaker namespace, and that the Dummy script is used. If you were configuring a heartbeat monitor service for a clustered file system, you might use the ocf:heartbeat:Filesystem resource agent.

The resource is configured to use the monitor operation in the agent, and an interval is set to check the health of the service. In this example, the interval is set to 120s to give the service sufficient time to fail while you're demonstrating failover. By default, this interval is typically set to 20 seconds, but it can be modified depending on the type of service and the particular environment.
When you create a service, the cluster starts the resource on a node by using the resource agent's start command.
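If you want to review the parameters and operation defaults that an agent supports, or change the monitor interval after the resource exists, pcs provides subcommands for both. The following lines are a sketch; the 20s value is only illustrative:
sudo pcs resource describe ocf:pacemaker:Dummy
sudo pcs resource update dummy_service op monitor interval=20s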
- View the resource start and run status, for example:
sudo pcs status
The following output is displayed:
Cluster name: pacemaker1
Stack: corosync
Current DC: node2 (version 2.1.2-4.0.2.el9-f765c3be2f4) - partition with quorum
Last updated: Mon Jul 18 14:54:28 2022
Last change: Mon Jul 18 14:52:28 2022 by root via cibadmin on node1

2 nodes configured
1 resource configured

Online: [ node1 node2 ]

Full list of resources:

 dummy_service  (ocf::pacemaker:Dummy): Started node1

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
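If you only need the resource portion of this output, you can scope the status query; this is optional, and the full pcs status output works equally well for this procedure:
sudo pcs status resources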
- Run the crm_resource command to simulate service failure by force-stopping the service directly:
sudo crm_resource --resource dummy_service --force-stop
Stopping the service directly by using the crm_resource command ensures that the cluster itself is unaware that the service has been manually stopped. The failure is only registered the next time the resource's monitor operation runs.
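To confirm that the agent really reports the service as stopped, without involving the cluster, you can also run the agent's monitor action directly. This is an optional check, not part of the original procedure:
sudo crm_resource --resource dummy_service --force-check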
- Run the crm_mon command in interactive mode so that you can wait until the failure is detected and view the Failed Resource Actions message, for example:
sudo crm_mon
The following output is displayed:
Stack: corosync
Current DC: node1 (version 2.1.2-4.0.2.el9-f765c3be2f4) - partition with quorum
Last updated: Mon Jul 18 15:00:28 2022
Last change: Mon Jul 18 14:58:14 2022 by root via cibadmin on node1

3 nodes configured
1 resource configured

Online: [ node1 node2 ]

Active resources:

 dummy_service  (ocf::pacemaker:Dummy): Started node2

Failed Resource Actions:
* dummy_service_monitor_120000 on node1 'not running' (7): call=7, status=complete, exitreason='',
    last-rc-change='Mon Jul 18 15:00:17 2022', queued=0ms, exec=0ms
You can see the service restart on the alternate node. Note that the monitor interval for this resource is set to 120 seconds, so you might need to wait up to the full interval before the failure is detected and reported.
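If you prefer a one-shot snapshot instead of the interactive view, crm_mon can print the cluster status once and exit:
sudo crm_mon -1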
Tip: You can use the Ctrl-C key combination to exit out of crm_mon at any point.

- Reboot the node where the service is running to determine whether failover also occurs in the case of node failure.
Note that if you didn't enable the corosync and pacemaker services to start on boot, you might need to manually start the services on the node that you rebooted by running the following command:
sudo pcs cluster start node1
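If you would rather have the cluster services start automatically at boot on every node, so that this manual step isn't needed, pcs can enable them cluster-wide. This is optional and changes boot behavior on all nodes:
sudo pcs cluster enable --all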