10 Using the Service Guardian
This chapter includes the following sections:
- Overview of the Service Guardian
  The service guardian is a mechanism that detects and attempts to resolve deadlocks in Coherence threads.
- Configuring the Service Guardian
  The service guardian is enabled out of the box and has two configuration items: the timeout value and the failure policy.
- Issuing Manual Guardian Heartbeats
  The GuardSupport class provides heartbeat methods that applications can use to manually issue heartbeats to the service guardian.
- Setting the Guardian Log Thread Dump Frequency
  If Coherence cache server logs are overwhelmed with too many service guardian thread dumps in a short duration of time, you can increase the interval between service guardian thread dumps.
Overview of the Service Guardian
The service guardian receives periodic heartbeats that are issued by Coherence-owned and created threads. Should a thread fail to issue a heartbeat before the configured timeout, the service guardian takes corrective action. Both the timeout and corrective action (recovery) can be configured as required.
Note:
The term deadlock does not necessarily indicate a true deadlock; a thread that does not issue a timely heartbeat may be executing a long running process or waiting on a slow resource. The service guardian does not have the ability to distinguish a deadlocked thread from a slow one.
Interfaces That Are Executed By Coherence
Implementations of the following interfaces are executed by Coherence-owned threads. Any processing in an implementation that exceeds the configured guardian timeout results in the service guardian attempting to recover the thread. The list is not exhaustive and only provides the most common interfaces that are implemented by end users.
com.tangosol.net.Invocable
com.tangosol.net.cache.CacheStore
com.tangosol.util.Filter
com.tangosol.util.InvocableMap.EntryAggregator
com.tangosol.util.InvocableMap.EntryProcessor
com.tangosol.util.MapListener
com.tangosol.util.MapTrigger
Understanding Recovery
The service guardian's recovery mechanism uses a series of steps to determine if a thread is deadlocked. Corrective action is taken if the service guardian concludes that the thread is deadlocked. The action to take can be configured and custom actions can be created if required. The recovery mechanism is outlined below:
- Soft Timeout – The recovery mechanism first attempts to interrupt the thread just before the configured timeout is reached. The following example log message demonstrates a soft timeout message:
  <Error> (thread=DistributedCache, member=1): Attempting recovery (due to soft timeout) of Daemon{Thread="Thread[WriteBehindThread:CacheStoreWrapper(com.tangosol.examples.rwbm.TimeoutTest),5,WriteBehindThread:CacheStoreWrapper(com.tangosol.examples.rwbm.TimeoutTest)]", State=Running}
  If the thread can be interrupted and it results in a heartbeat, normal processing resumes.
- Hard Timeout – The recovery mechanism attempts to stop a thread after the configured timeout is reached. The following example log message demonstrates a hard timeout message:
  <Error> (thread=DistributedCache, member=1): Terminating guarded execution (due to hard timeout) of Daemon{Thread="Thread[WriteBehindThread:CacheStoreWrapper(com.tangosol.examples.rwbm.TimeoutTest),5,WriteBehindThread:CacheStoreWrapper(com.tangosol.examples.rwbm.TimeoutTest)]", State=Running}
- Lastly, if the thread cannot be stopped, the recovery mechanism performs an action based on the configured failure policy. Actions that can be performed include: shutting down the cluster service, shutting down the JVM, and performing a custom action. The following example log message demonstrates an action taken by the recovery mechanism:
  <Error> (thread=Termination Thread, member=1): Write-behind thread timed out; stopping the cache service
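The time-based part of this escalation can be modeled as a small decision function. The following is an illustrative sketch only, not Coherence's implementation: the class name GuardianStages, the Stage enum, and the 0.9 soft-timeout fraction are assumptions made for the example (the text above says only that the interrupt happens just before the timeout). The final step, applying the failure policy, depends on whether the thread could be stopped and is therefore not modeled here.

```java
/** Illustrative model of the guardian's time-based escalation; not Coherence code. */
final class GuardianStages {
    enum Stage { HEALTHY, SOFT_TIMEOUT, HARD_TIMEOUT }

    // Assumption for this sketch: the soft timeout fires at 90% of the timeout.
    static final double SOFT_FRACTION = 0.9;

    /**
     * Returns the recovery stage for a thread whose last heartbeat was issued
     * elapsedMillis ago, given the configured guardian timeout.
     */
    static Stage stageAt(long elapsedMillis, long timeoutMillis) {
        if (timeoutMillis <= 0) {
            return Stage.HEALTHY; // modeled as "not guarded"
        }
        if (elapsedMillis >= timeoutMillis) {
            return Stage.HARD_TIMEOUT; // guardian attempts to stop the thread
        }
        if (elapsedMillis >= (long) (timeoutMillis * SOFT_FRACTION)) {
            return Stage.SOFT_TIMEOUT; // guardian attempts to interrupt the thread
        }
        return Stage.HEALTHY;
    }

    public static void main(String[] args) {
        long timeout = 120_000L;
        for (long elapsed : new long[] {60_000L, 110_000L, 120_000L}) {
            System.out.println(elapsed + " ms -> " + stageAt(elapsed, timeout));
        }
    }
}
```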
Configuring the Service Guardian
This section includes the following topics:
- Setting the Guardian Timeout
- Using the Timeout Value From the PriorityTask API
- Setting the Guardian Service Failure Policy
Setting the Guardian Timeout
This section includes the following topics:
- Overview of Setting the Guardian Timeout
- Setting the Guardian Timeout for All Threads
- Setting the Guardian Timeout Per Service Type
- Setting the Guardian Timeout Per Service Instance
Overview of Setting the Guardian Timeout
The service guardian timeout can be set in three different ways based on the level of granularity that is required:
- All threads – This option allows a single timeout value to be applied to all Coherence-owned threads on a cluster node. This is the out-of-box configuration and is set at 305000 milliseconds by default.
- Threads per service type – This option allows different timeout values to be set for specific service types. The timeout value is applied to the threads of all service instances. If a timeout is not specified for a particular service type, then the timeout defaults to the timeout that is set for all threads.
- Threads per service instance – This option allows different timeout values to be set for specific service instances. If a timeout is not set for a specific service instance, then the service's timeout value, if specified, is used; otherwise, the timeout that is set for all threads is used.
Setting the timeout value to 0 stops threads from being guarded. In general, the service guardian timeout value should be set equal to or greater than the timeout value for packet delivery.
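The fallback order described above (service instance, then service type, then all threads) can be expressed as a short lookup. This is a sketch of the precedence rule only, not Coherence code; the class and method names are hypothetical, and each level is modeled as an OptionalLong that is empty when nothing is configured at that level.

```java
import java.util.OptionalLong;

/** Illustrative resolution of the effective guardian timeout; not Coherence code. */
final class GuardianTimeoutResolver {
    /** Out-of-box timeout for all threads, in milliseconds. */
    static final long DEFAULT_ALL_THREADS_MILLIS = 305_000L;

    /** Picks the most specific configured value: instance, then type, then all threads. */
    static long effectiveTimeout(OptionalLong perInstance,
                                 OptionalLong perServiceType,
                                 OptionalLong allThreads) {
        if (perInstance.isPresent()) {
            return perInstance.getAsLong();
        }
        if (perServiceType.isPresent()) {
            return perServiceType.getAsLong();
        }
        return allThreads.orElse(DEFAULT_ALL_THREADS_MILLIS);
    }

    /** A timeout of 0 means the threads are not guarded. */
    static boolean isGuarded(long effectiveTimeoutMillis) {
        return effectiveTimeoutMillis != 0L;
    }

    public static void main(String[] args) {
        long timeout = effectiveTimeout(
                OptionalLong.empty(),        // no <guardian-timeout> on the scheme
                OptionalLong.of(120_000L),   // service type override
                OptionalLong.of(200_000L));  // all-threads override
        System.out.println(timeout + " ms, guarded=" + isGuarded(timeout));
    }
}
```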
Note:
The guardian timeout can also be used for cache store implementations that are configured with a read-write-backing-map scheme. In this case, the <cachestore-timeout> element is set to 0, which defaults the timeout to the guardian timeout. See read-write-backing-map-scheme.
Setting the Guardian Timeout for All Threads
To set the guardian timeout for all threads in a cluster node, add a <timeout-milliseconds> element to an operational override file within the <service-guardian> element. The following example sets the timeout value to 120000 milliseconds:
<?xml version='1.0'?>
<coherence xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xmlns="http://xmlns.oracle.com/coherence/coherence-operational-config"
   xsi:schemaLocation="http://xmlns.oracle.com/coherence/coherence-operational-config coherence-operational-config.xsd">
   <cluster-config>
      <service-guardian>
         <timeout-milliseconds>120000</timeout-milliseconds>
      </service-guardian>
   </cluster-config>
</coherence>
The <timeout-milliseconds> value can also be set using the coherence.guard.timeout system property.
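System properties are normally passed on the command line (for example, -Dcoherence.guard.timeout=120000). As a hedged alternative, the property can be set programmatically, on the assumption that this happens before any Coherence class loads the operational configuration. The class name GuardTimeoutProperty is hypothetical and the sketch deliberately does not start Coherence.

```java
/** Sets the guardian timeout system property before Coherence is started. */
final class GuardTimeoutProperty {
    static final String PROPERTY = "coherence.guard.timeout";

    /** Stores the timeout (in milliseconds) as a JVM system property. */
    static void configure(long timeoutMillis) {
        if (timeoutMillis < 0) {
            throw new IllegalArgumentException("timeout must not be negative");
        }
        System.setProperty(PROPERTY, Long.toString(timeoutMillis));
    }

    public static void main(String[] args) {
        configure(120_000L);
        System.out.println(PROPERTY + "=" + System.getProperty(PROPERTY));
        // Assumption: Coherence is started only after this point, so that the
        // operational configuration picks up the property value.
    }
}
```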
Setting the Guardian Timeout Per Service Type
To set the guardian timeout per service type, override the service's guardian-timeout initialization parameter in an operational override file. The following example sets the guardian timeout for the DistributedCache service to 120000 milliseconds:
<?xml version='1.0'?>
<coherence xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xmlns="http://xmlns.oracle.com/coherence/coherence-operational-config"
   xsi:schemaLocation="http://xmlns.oracle.com/coherence/coherence-operational-config coherence-operational-config.xsd">
   <cluster-config>
      <services>
         <service id="3">
            <init-params>
               <init-param id="17">
                  <param-name>guardian-timeout</param-name>
                  <param-value>120000</param-value>
               </init-param>
            </init-params>
         </service>
      </services>
   </cluster-config>
</coherence>
The guardian-timeout initialization parameter can be set for the DistributedCache, ReplicatedCache, OptimisticCache, Invocation, and Proxy services. Refer to the tangosol-coherence.xml file that is located in the coherence.jar file for the correct service ID and initialization parameter ID to use when overriding the guardian-timeout parameter for a service.
Each service also has a system property that sets the guardian timeout, respectively:
coherence.distributed.guard.timeout
coherence.replicated.guard.timeout
coherence.optimistic.guard.timeout
coherence.invocation.guard.timeout
coherence.proxy.guard.timeout
Setting the Guardian Timeout Per Service Instance
To set the guardian timeout per service instance, add a <guardian-timeout> element to a cache scheme definition in the cache configuration file. The following example sets the guardian timeout for a distributed cache scheme to 120000 milliseconds:
<distributed-scheme>
   <scheme-name>example-distributed</scheme-name>
   <service-name>DistributedCache</service-name>
   <guardian-timeout>120000</guardian-timeout>
   <backing-map-scheme>
      <local-scheme>
         <scheme-ref>example-binary-backing-map</scheme-ref>
      </local-scheme>
   </backing-map-scheme>
   <autostart>true</autostart>
</distributed-scheme>
The <guardian-timeout> element can be used in the following schemes: <distributed-scheme>, <replicated-scheme>, <optimistic-scheme>, <transaction-scheme>, <invocation-scheme>, and <proxy-scheme>.
Using the Timeout Value From the PriorityTask API
Custom implementations of the Invocable, EntryProcessor, and EntryAggregator interfaces can implement the PriorityTask interface. In this case, the service guardian attempts recovery after the task has been executing for longer than the value returned by getExecutionTimeoutMillis(). See Managing Thread Execution.
The execution timeout can be set using the <task-timeout> element within an <invocation-scheme> element defined in the cache configuration file. For the Invocation service, the <task-timeout> element specifies the timeout value for Invocable tasks that implement the PriorityTask interface, but do not explicitly specify the execution timeout value; that is, the getExecutionTimeoutMillis() method returns 0. If the <task-timeout> element is set to 0, the default guardian timeout is used.
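The fallback rules above for an Invocable task on the Invocation service can be summarized in a few lines. This sketch models only the rule as stated in the text and does not use the Coherence API; the class and parameter names are hypothetical, and any special timeout constants defined by PriorityTask are not interpreted here.

```java
/** Illustrative resolution of the timeout applied to a PriorityTask; not Coherence code. */
final class TaskTimeoutResolver {
    /**
     * @param executionTimeoutMillis value returned by the task's getExecutionTimeoutMillis()
     * @param taskTimeoutMillis      value of the <task-timeout> element (0 if unset)
     * @param guardianTimeoutMillis  default guardian timeout for the service
     */
    static long effectiveTaskTimeout(long executionTimeoutMillis,
                                     long taskTimeoutMillis,
                                     long guardianTimeoutMillis) {
        if (executionTimeoutMillis != 0L) {
            return executionTimeoutMillis; // the task specified its own timeout
        }
        if (taskTimeoutMillis != 0L) {
            return taskTimeoutMillis;      // fall back to <task-timeout>
        }
        return guardianTimeoutMillis;      // fall back to the guardian timeout
    }

    public static void main(String[] args) {
        System.out.println(effectiveTaskTimeout(0L, 30_000L, 305_000L));
        System.out.println(effectiveTaskTimeout(0L, 0L, 305_000L));
    }
}
```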
Setting the Guardian Service Failure Policy
This section includes the following topics:
- Overview of Setting the Guardian Service Failure Policy
- Setting the Guardian Failure Policy for All Threads
- Setting the Guardian Failure Policy Per Service Type
- Setting the Guardian Failure Policy Per Service Instance
- Enabling a Custom Guardian Failure Policy
Overview of Setting the Guardian Service Failure Policy
The service failure policy determines the corrective action that the service guardian takes after it concludes that a thread is deadlocked. The following policies are available:
- exit-cluster – This policy attempts to recover threads that appear to be unresponsive. If the attempt fails, an attempt is made to stop the associated service. If the associated service cannot be stopped, this policy causes the local node to stop the cluster services. This is the default policy if no policy is specified.
- exit-process – This policy attempts to recover threads that appear to be unresponsive. If the attempt fails, an attempt is made to stop the associated service. If the associated service cannot be stopped, this policy causes the local node to exit the JVM and terminate abruptly.
- logging – This policy logs any detected problems but takes no corrective action.
- custom – The name of a Java class that provides an implementation of the com.tangosol.net.ServiceFailurePolicy interface. See Enabling a Custom Guardian Failure Policy.
The service guardian failure policy can be set in three different ways based on the level of granularity that is required:
- All threads – This option allows a single failure policy to be applied to all Coherence-owned threads on a cluster node. This is the out-of-box configuration.
- Threads per service type – This option allows different failure policies to be set for specific service types. The policy is applied to the threads of all service instances. If a policy is not specified for a particular service type, then the policy defaults to the policy that is set for all threads.
- Threads per service instance – This option allows different failure policies to be set for specific service instances. If a policy is not set for a specific service instance, then the service's policy, if specified, is used; otherwise, the policy that is set for all threads is used.
Setting the Guardian Failure Policy for All Threads
To set a guardian failure policy, add a <service-failure-policy> element to an operational override file within the <service-guardian> element. The following example sets the failure policy to exit-process:
<?xml version='1.0'?>
<coherence xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xmlns="http://xmlns.oracle.com/coherence/coherence-operational-config"
   xsi:schemaLocation="http://xmlns.oracle.com/coherence/coherence-operational-config coherence-operational-config.xsd">
   <cluster-config>
      <service-guardian>
         <service-failure-policy>exit-process</service-failure-policy>
      </service-guardian>
   </cluster-config>
</coherence>
Setting the Guardian Failure Policy Per Service Type
To set the failure policy per service type, override the service's service-failure-policy initialization parameter in an operational override file. The following example sets the failure policy for the DistributedCache service to the logging policy:
<?xml version='1.0'?>
<coherence xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xmlns="http://xmlns.oracle.com/coherence/coherence-operational-config"
   xsi:schemaLocation="http://xmlns.oracle.com/coherence/coherence-operational-config coherence-operational-config.xsd">
   <cluster-config>
      <services>
         <service id="3">
            <init-params>
               <init-param id="18">
                  <param-name>service-failure-policy</param-name>
                  <param-value>logging</param-value>
               </init-param>
            </init-params>
         </service>
      </services>
   </cluster-config>
</coherence>
The service-failure-policy initialization parameter can be set for the DistributedCache, ReplicatedCache, OptimisticCache, Invocation, and Proxy services. Refer to the tangosol-coherence.xml file that is located in the coherence.jar file for the correct service ID and initialization parameter ID to use when overriding the service-failure-policy parameter for a service.
Setting the Guardian Failure Policy Per Service Instance
To set the failure policy per service instance, add a <service-failure-policy> element to a cache scheme definition in the cache configuration file. The following example sets the failure policy to logging for a distributed cache scheme:
<distributed-scheme>
   <scheme-name>example-distributed</scheme-name>
   <service-name>DistributedCache</service-name>
   <guardian-timeout>120000</guardian-timeout>
   <service-failure-policy>logging</service-failure-policy>
   <backing-map-scheme>
      <local-scheme>
         <scheme-ref>example-binary-backing-map</scheme-ref>
      </local-scheme>
   </backing-map-scheme>
   <autostart>true</autostart>
</distributed-scheme>
The <service-failure-policy> element can be used in the following schemes: <distributed-scheme>, <replicated-scheme>, <optimistic-scheme>, <transaction-scheme>, <invocation-scheme>, and <proxy-scheme>.
Enabling a Custom Guardian Failure Policy
To use a custom failure policy, include an <instance> subelement and provide a fully qualified class name that implements the ServiceFailurePolicy interface. See instance. The following example enables a custom failure policy that is implemented in the MyFailurePolicy class. Custom failure policies can be enabled for all threads (as shown below) or can be enabled per service instance within a cache scheme definition.
<?xml version='1.0'?>
<coherence xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xmlns="http://xmlns.oracle.com/coherence/coherence-operational-config"
   xsi:schemaLocation="http://xmlns.oracle.com/coherence/coherence-operational-config coherence-operational-config.xsd">
   <cluster-config>
      <service-guardian>
         <service-failure-policy>
            <instance>
               <class-name>package.MyFailurePolicy</class-name>
            </instance>
         </service-failure-policy>
      </service-guardian>
   </cluster-config>
</coherence>
As an alternative, the <instance> element supports the use of a <class-factory-name> element to use a factory class that is responsible for creating ServiceFailurePolicy instances, and a <method-name> element to specify the static factory method on the factory class that performs object instantiation. The following example gets a custom failure policy instance using the getPolicy method on the MyPolicyFactory class.
<?xml version='1.0'?>
<coherence xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xmlns="http://xmlns.oracle.com/coherence/coherence-operational-config"
   xsi:schemaLocation="http://xmlns.oracle.com/coherence/coherence-operational-config coherence-operational-config.xsd">
   <cluster-config>
      <service-guardian>
         <service-failure-policy>
            <instance>
               <class-factory-name>package.MyPolicyFactory</class-factory-name>
               <method-name>getPolicy</method-name>
            </instance>
         </service-failure-policy>
      </service-guardian>
   </cluster-config>
</coherence>
Any initialization parameters that are required for an implementation can be specified using the <init-params> element. The following example sets the iMaxTime parameter to 2000.
<?xml version='1.0'?>
<coherence xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xmlns="http://xmlns.oracle.com/coherence/coherence-operational-config"
   xsi:schemaLocation="http://xmlns.oracle.com/coherence/coherence-operational-config coherence-operational-config.xsd">
   <cluster-config>
      <service-guardian>
         <service-failure-policy>
            <instance>
               <class-name>package.MyFailurePolicy</class-name>
               <init-params>
                  <init-param>
                     <param-name>iMaxTime</param-name>
                     <param-value>2000</param-value>
                  </init-param>
               </init-params>
            </instance>
         </service-failure-policy>
      </service-guardian>
   </cluster-config>
</coherence>
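The factory-based configuration shown earlier in this section implies a simple contract on the Java side: a class with a static, no-argument method that returns the policy instance. The sketch below illustrates that contract using plain reflection. It is not how Coherence itself resolves the configuration, and the MyFailurePolicy class here is only a stand-in with no behavior; a real policy would implement com.tangosol.net.ServiceFailurePolicy, which is omitted so that the example compiles without the Coherence library. FactoryLookupSketch is a hypothetical helper name.

```java
import java.lang.reflect.Method;

/** Stand-in only: a real policy implements com.tangosol.net.ServiceFailurePolicy. */
final class MyFailurePolicy {
    @Override
    public String toString() {
        return "MyFailurePolicy";
    }
}

/** Factory class of the kind named in <class-factory-name>. */
final class MyPolicyFactory {
    private MyPolicyFactory() {
    }

    /** Static factory method of the kind named in <method-name>. */
    public static MyFailurePolicy getPolicy() {
        return new MyFailurePolicy();
    }
}

/** Hypothetical helper showing how a static factory method can be resolved by name. */
final class FactoryLookupSketch {
    static Object instantiate(String factoryClassName, String methodName)
            throws ReflectiveOperationException {
        Class<?> factory = Class.forName(factoryClassName);
        Method method = factory.getMethod(methodName); // public, no parameters
        return method.invoke(null);                    // null receiver: static method
    }

    public static void main(String[] args) throws Exception {
        Object policy = instantiate(MyPolicyFactory.class.getName(), "getPolicy");
        System.out.println("Created policy: " + policy);
    }
}
```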
Issuing Manual Guardian Heartbeats
The GuardSupport class provides heartbeat methods that applications can use to manually issue heartbeats to the service guardian:
GuardSupport.heartbeat();
For known long-running operations, the heartbeat can be issued with the number of milliseconds that should pass before the operation is considered "stuck":
GuardSupport.heartbeat(long cMillis);
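A common pattern is to issue one heartbeat per unit of work inside a long-running loop. The sketch below takes the heartbeat as a Runnable so that it runs without Coherence on the classpath; in a Coherence application the caller would pass GuardSupport::heartbeat. The class HeartbeatingBatchJob, its methods, and the batch structure are hypothetical names chosen for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Long-running job that signals liveness after every batch. */
final class HeartbeatingBatchJob {
    /**
     * Processes all batches and invokes the heartbeat once per batch.
     * In a Coherence application, pass GuardSupport::heartbeat.
     *
     * @return the number of non-empty items handled
     */
    static int processAll(List<List<String>> batches, Runnable heartbeat) {
        int processed = 0;
        for (List<String> batch : batches) {
            for (String item : batch) {
                processed += handle(item);
            }
            heartbeat.run(); // tell the guardian this thread is still making progress
        }
        return processed;
    }

    /** Placeholder for real work; counts non-empty items. */
    private static int handle(String item) {
        return item.isEmpty() ? 0 : 1;
    }

    public static void main(String[] args) {
        AtomicInteger beats = new AtomicInteger();
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"));
        int processed = processAll(batches, beats::incrementAndGet);
        System.out.println(processed + " items, " + beats.get() + " heartbeats");
    }
}
```

Issuing the heartbeat per batch rather than per item keeps the overhead low; the batch size should be chosen so that one batch completes well within the guardian timeout.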
Setting the Guardian Log Thread Dump Frequency
If Coherence cache server logs are overwhelmed with too many service guardian thread dumps in a short duration of time, you can increase the interval between service guardian thread dumps.
Set the property coherence.guardian.log.threaddump.interval to a time duration. The time duration format is a number followed by a letter: h for hours, m for minutes, or s for seconds. The default time duration is 3 seconds.
You can set the interval up to a maximum of 3 hours. Specifying a value larger than the maximum results in the interval being set to the maximum duration.
Thread dumps are important for identifying thread deadlocks. However, when thread dumps are too frequent or the system's thread dumps are very large, use this setting to find an acceptable balance.
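The duration format and the 3-hour cap described above can be illustrated with a small parser. This is a sketch of the documented rules only and is not the parser that Coherence uses, which may accept other forms; the class name ThreadDumpInterval and its methods are hypothetical.

```java
import java.time.Duration;

/** Illustrative parser for values such as "30s", "10m", or "1h"; not Coherence code. */
final class ThreadDumpInterval {
    static final String PROPERTY = "coherence.guardian.log.threaddump.interval";
    static final Duration MAX = Duration.ofHours(3);

    /** Parses a number followed by h, m, or s and clamps the result to 3 hours. */
    static Duration parse(String value) {
        String text = value.trim();
        if (text.length() < 2) {
            throw new IllegalArgumentException("expected <number><h|m|s>: " + value);
        }
        long amount = Long.parseLong(text.substring(0, text.length() - 1));
        char unit = Character.toLowerCase(text.charAt(text.length() - 1));
        Duration duration;
        switch (unit) {
            case 'h': duration = Duration.ofHours(amount); break;
            case 'm': duration = Duration.ofMinutes(amount); break;
            case 's': duration = Duration.ofSeconds(amount); break;
            default:
                throw new IllegalArgumentException("unknown unit in: " + value);
        }
        // Values above the maximum are reduced to the maximum.
        return duration.compareTo(MAX) > 0 ? MAX : duration;
    }

    public static void main(String[] args) {
        // Typically passed as -Dcoherence.guardian.log.threaddump.interval=10m
        System.setProperty(PROPERTY, "10m");
        System.out.println(parse(System.getProperty(PROPERTY)));
    }
}
```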