Creating a Service and Testing Failover

Services are created and usually configured to run a resource agent that is responsible for starting and stopping processes. Most resource agents are created according to the OCF (Open Cluster Framework) specification, which is defined as an extension of the Linux Standard Base (LSB). Many handy resource agents for commonly used processes are included in the resource-agents packages, including various heartbeat agents that track whether the daemons or services that they manage are still running.
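
You can use pcs to explore the resource agents that are available on a system before you create a resource. The following commands, whose exact output varies with the pcs and resource-agents versions that are installed, list the supported standards, list the agents in the heartbeat provider, and describe the parameters that the Dummy agent accepts:

    sudo pcs resource standards
    sudo pcs resource agents ocf:heartbeat
    sudo pcs resource describe ocf:pacemaker:Dummy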

In the following example, a service is set up that uses a Dummy resource agent created specifically for testing Pacemaker. This agent is used because it requires only a basic configuration and doesn't make any assumptions about the environment or the types of services that you intend to run with Pacemaker.

To create a service and test failover:

  1. Add the service as a resource by using the pcs resource create command:

    sudo pcs resource create dummy_service ocf:pacemaker:Dummy op monitor interval=120s

    In the previous example, dummy_service is the name that is provided for this resource.

    To invoke the Dummy resource agent, the notation ocf:pacemaker:Dummy is used to specify that the agent conforms to the OCF standard, that it runs in the pacemaker namespace, and that the Dummy script is used. If you were configuring a heartbeat monitor service for a clustered file system, you might instead use the ocf:heartbeat:Filesystem resource agent, as illustrated in the example that follows.
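
    For illustration only, a Filesystem resource might be created as shown in the following command. The resource name, device, mount point, and file system type are placeholder values that you would replace to match your own environment:

    sudo pcs resource create cluster_fs ocf:heartbeat:Filesystem \
      device=/dev/sdb1 directory=/mnt/cluster fstype=xfs \
      op monitor interval=20s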

    The resource is configured to use the monitor operation in the agent, and an interval is set to check the health of the service. In this example, the interval is set to 120s to give the service sufficient time to fail while you're demonstrating failover. By default, this interval is typically set to 20 seconds, but it can be modified depending on the type of service and the particular environment. You can also change the interval after the resource is created, as shown at the end of this step.

    When you create a service, the cluster starts the resource on a node by using the resource agent's start command.
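
    If you later decide that a different monitor interval is more appropriate, you can update the operation in place and review the resulting configuration. The 20s value in the following example is only illustrative, and on older pcs releases the second command is pcs resource show rather than pcs resource config:

    sudo pcs resource update dummy_service op monitor interval=20s
    sudo pcs resource config dummy_service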

  2. View the resource start and run status, for example:

    sudo pcs status

    The following output is displayed:

    Cluster name: pacemaker1
    Stack: corosync
    Current DC: node2 (version 2.1.2-4.0.2.el9-f765c3be2f4) - partition with quorum
    Last updated: Mon Jul 18 14:54:28 2022
    Last change: Mon Jul 18 14:52:28 2022 by root via cibadmin on node1
    
    2 nodes configured
    1 resource configured
    
    Online: [ node1 node2 ]
    
    Full list of resources:
    
     dummy_service  (ocf::pacemaker:Dummy): Started node1
    
    Daemon Status:
      corosync: active/disabled
      pacemaker: active/disabled
      pcsd: active/enabled
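
    To view only the state of the resources, without the full cluster summary, you can limit the output:

    sudo pcs status resources
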
  3. Run the crm_resource command to simulate service failure by force-stopping the service directly:

    sudo crm_resource --resource dummy_service --force-stop

    Running the crm_resource command in this way stops the service directly, without making the cluster aware that the service has been stopped. The cluster detects the failure the next time that the resource's monitor operation runs, and then restarts the service.
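
    To confirm that the service really has stopped before the cluster reacts, you can also bypass the cluster and check the resource state directly on the node where the service was running, for example:

    sudo crm_resource --resource dummy_service --force-check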

  4. Run the crm_mon command in interactive mode so that you can wait for the failure to be detected and view the Failed Resource Actions message, for example:

    sudo crm_mon

    The following output is displayed:

    Stack: corosync
    Current DC: node1 (version 2.1.2-4.0.2.el9-f765c3be2f4) - partition with quorum
    Last updated: Mon Jul 18 15:00:28 2022
    Last change: Mon Jul 18 14:58:14 2022 by root via cibadmin on node1
    
    2 nodes configured
    1 resource configured
    
    Online: [ node1 node2 ]
    
    Active resources:
    
    dummy_service   (ocf::pacemaker:Dummy): Started node2
    
    Failed Resource Actions:
    * dummy_service_monitor_120000 on node1 'not running' (7): call=7, status=complete, exitreason='',
        last-rc-change='Mon Jul 18 15:00:17 2022', queued=0ms, exec=0ms

    You can see the service restart on the alternate node. Note that the monitor interval for this resource is set to 120 seconds, so you might need to wait up to the full interval before crm_mon reports that the service has failed and has been restarted on the other node.
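
    After you finish examining the failure, you can clear the Failed Resource Actions entry from the cluster status by cleaning up the resource:

    sudo pcs resource cleanup dummy_service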

    Tip:

    You can press Ctrl-C to exit crm_mon at any point.
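
    If you prefer a single snapshot of the cluster state rather than the interactive display, crm_mon can also print the status once and then exit:

    sudo crm_mon -1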

  5. Reboot the node where the service is running to determine whether failover also occurs in the case of node failure.

    Note that if you didn't enable the corosync and pacemaker services to start on boot, you might need to manually start the services on the node that you rebooted by running the following command:

    sudo pcs cluster start node1
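
    Alternatively, if you want the corosync and pacemaker services to start automatically whenever a node boots, you can enable them on all cluster nodes in one step:

    sudo pcs cluster enable --all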