B Log Message Glossary
This appendix includes the following sections:
- TCMP Log Messages
Log messages that pertain to TCMP.
- Configuration Log Messages
Log messages that pertain to configuration.
- Partitioned Cache Service Log Messages
Log messages that pertain to the partitioned cache service.
- Service Thread Pool Log Messages
Log messages that pertain to the service thread pool.
- TMB Log Messages
Log messages that pertain to TMB.
- Cluster Service Exceptions
Exceptions thrown by the cluster service.
- Guardian Service Log Messages
Log messages that pertain to the Guardian Service.
- Persistence Log Messages
Log messages that pertain to persistence.
- Transaction Exception Messages
Exceptions thrown by the Transaction Framework API.
TCMP Log Messages
- Experienced a %n1 ms communication delay (probable remote GC) with Member %s
-
%n1 - the latency in milliseconds of the communication delay; %s - the full Member information. Severity: 2-Warning or 5-Debug Level 5 or 6-Debug Level 6 depending on the length of the delay.
- Failed to satisfy the variance: allowed=%n1 actual=%n2
-
%n1 - the maximum allowed latency in milliseconds; %n2 - the actual latency in milliseconds. Severity: 3-Informational or 5-Debug Level 5 depending on the message frequency.
- Created a new cluster "%s1" with Member(%s2)
-
%s1 - the cluster name; %s2 - the full Member information. Severity: 3-Informational.
- This Member(%s1) joined cluster "%s2" with senior Member(%s3)
-
%s1 - the full Member information for this node; %s2 - the cluster name; %s3 - the full Member information for the cluster senior node. Severity: 3-Informational.
- Member(%s) joined Cluster with senior member %n
-
%s - the full Member information for a new node that joined the cluster this node belongs to; %n - the node id of the cluster senior node. Severity: 5-Debug Level 5.
- there appears to be other members of the cluster "%s" already running with an incompatible network configuration, aborting join with "%n"
-
%s - the cluster name to which a join attempt was made; %n - information for the cluster. Severity: 5-Debug Level 5.
- Member(%s) left Cluster with senior member %n
-
%s - the full Member information for a node that left the cluster; %n - the node id of the cluster senior node. Severity: 5-Debug Level 5.
- MemberLeft notification for Member %n received from Member(%s)
-
%n - the node id of the departed node; %s - the full Member information for a node that left the cluster. Severity: 5-Debug Level 5.
- Received cluster heartbeat from the senior %n that does not contain this %s ; stopping cluster service.
-
%n - the senior service member id; %s - a cluster service member's id. Severity: 1-Error.
- Service %s joined the cluster with senior service member %n
-
%s - the service name; %n - the senior service member id. Severity: 5-Debug Level 5.
- This node appears to have partially lost the connectivity: it receives responses from MemberSet(%s1) which communicate with Member(%s2), but is not responding directly to this member; that could mean that either requests are not coming out or responses are not coming in; stopping cluster service.
-
%s1 - set of members that can communicate with the member indicated in %s2; %s2 - member that can communicate with set of members indicated in %s1. Severity: 1-Error.
- validatePolls: This senior encountered an overdue poll, indicating a dead member, a significant network issue or an Operating System threading library bug (e.g. Linux NPTL): Poll
-
Severity: 2-Warning
- Received panic from junior member %s1 caused by %s2
-
%s1 - the cluster member that sent the panic; %s2 - a member claiming to be the senior member. Severity: 2-Warning.
- Received panic from senior Member(%s1) caused by Member(%s2)
-
%s1 - the cluster senior member as known by this node; %s2 - a member claiming to be the senior member. Severity: 1-Error.
- Member %n1 joined Service %s with senior member %n2
-
%n1 - an id of the Coherence node that joins the service; %s - the service name; %n2 - the senior node for the service. Severity: 5-Debug Level 5.
- Member %n1 left Service %s with senior member %n2
-
%n1 - an id of the Coherence node that left the service; %s - the service name; %n2 - the senior node for the service. Severity: 5-Debug Level 5.
- Service %s: received ServiceConfigSync containing %n entries
-
%s - the service name; %n - the number of entries in the service configuration map. Severity: 5-Debug Level 5.
- TcpRing: connecting to member %n using TcpSocket{%s}
-
%s - the full information for the TcpSocket that serves as a TcpRing connector to another node; %n - the node id to which this node has connected. Severity: 5-Debug Level 5.
- Rejecting connection to member %n using TcpSocket{%s}
-
%n - the node id that tries to connect to this node; %s - the full information for the TcpSocket that serves as a TcpRing connector to another node. Severity: 4-Debug Level 4.
- Timeout while delivering a packet; requesting the departure confirmation for Member(%s1) by MemberSet(%s2)
-
%s1 - the full Member information for a node that this node failed to communicate with; %s2 - the full information about the "witness" nodes that are asked to confirm the suspected member departure. Severity: 2-Warning.
- This node appears to have become disconnected from the rest of the cluster containing %n nodes. All departure confirmation requests went unanswered. Stopping cluster service.
-
%n - the number of other nodes in the cluster this node was a member of. Severity: 1-Error.
- A potential communication problem has been detected. A packet has failed to be delivered (or acknowledged) after %n1 seconds, although other packets were acknowledged by the same cluster member (Member(%s1)) to this member (Member(%s2)) as recently as %n2 seconds ago. Possible causes include network failure, poor thread scheduling (see FAQ if running on Windows), an extremely overloaded server, a server that is attempting to run its processes using swap space, and unreasonably lengthy GC times.
-
%n1 - The number of seconds a packet has failed to be delivered or acknowledged; %s1 - the recipient of the packets indicated in the message; %s2 - the sender of the packets indicated in the message; %n2 - the number of seconds since a packet was delivered successfully between the two members indicated above. Severity: 2-Warning.
- Node %s1 is not allowed to create a new cluster; WKA list: [%s2]
-
%s1 - Address of node attempting to join cluster; %s2 - List of WKA addresses. Severity: 1-Error.
- This member is configured with a compatible but different WKA list than the senior Member(%s). It is strongly recommended to use the same WKA list for all cluster members.
-
%s - the senior node of the cluster. Severity: 2-Warning. A configuration sketch appears at the end of this section.
- <socket implementation> failed to set receive buffer size to %n1 packets (%n2 bytes); actual size is %n3 packets (%n4 bytes). Consult your OS documentation regarding increasing the maximum socket buffer size. Proceeding with the actual value may cause sub-optimal performance.
-
%n1 - the number of packets that fits in the buffer that Coherence attempted to allocate; %n2 - the size of the buffer Coherence attempted to allocate; %n3 - the number of packets that fits in the actual allocated buffer size; %n4 - the actual size of the allocated buffer. Severity: 2-Warning.
- The timeout value configured for IpMonitor pings is shorter than the value of 5 seconds. Short ping timeouts may cause an IP address to be wrongly reported as unreachable on some platforms.
-
Severity: 2-Warning
- Network failure encountered during InetAddress.isReachable(): %s
-
%s - a stack trace. Severity: 5-Debug Level 5.
- TcpRing has been explicitly disabled, this is not a recommended practice and will result in a minimum death detection time of %n seconds for failed processes.
-
%n - the number of seconds that is specified by the packet publisher's resend timeout which is 5 minutes by default. Severity: 2-Warning.
- IpMonitor has been explicitly disabled, this is not a recommended practice and will result in a minimum death detection time of %n seconds for failed machines or networks.
-
%n - the number of seconds that is specified by the packet publisher's resend timeout which is 5 minutes by default. Severity: 2-Warning.
- TcpRing connecting to %s
-
%s - the cluster member to which this member has joined to form a TCP-Ring. Severity: 6-Debug Level 6.
- TcpRing disconnected from %s to maintain ring
-
%s - the cluster member from which this member has disconnected. Severity: 6-Debug Level 6.
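The WKA-related messages above are usually addressed through configuration. As a minimal sketch only, assuming the documented coherence.cluster and coherence.wka system properties (the values shown are hypothetical), each member can be pointed at the same well-known-address list before joining:

    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.Cluster;

    public class WkaJoinExample {
        public static void main(String[] args) {
            // Hypothetical values; every member should use the same WKA list.
            System.setProperty("coherence.cluster", "SampleCluster");
            System.setProperty("coherence.wka", "192.168.0.10");

            // Join (or create) the cluster; a compatible but different WKA list
            // on one member triggers the warning described above.
            Cluster cluster = CacheFactory.ensureCluster();
            System.out.println("Joined cluster: " + cluster.getClusterName());
        }
    }

In practice these properties are more commonly supplied as -D command-line flags or in an operational override file rather than set programmatically.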
Configuration Log Messages
- Loaded operational configuration from resource "%s"
-
%s - the full resource path (URI) of the operational configuration descriptor. Severity: 3-Informational.
- Loaded operational overrides from "%s"
-
%s - the URI (file or resource) of the operational configuration descriptor override. Severity: 3-Informational.
- Optional configuration override "%s" is not specified
-
%s - the URI of the operational configuration descriptor override. Severity: 3-Informational.
- java.io.IOException: Document "%s1" is cyclically referenced by the 'xml-override' attribute of element %s2
-
%s1 - the URI of the operational configuration descriptor or override; %s2 - the name of the XML element that contains an incorrect reference URI. Severity: 1-Error.
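The messages above are typically the first confirmation that an operational override was picked up. A minimal sketch, assuming the documented coherence.override system property and a hypothetical override file name:

    import com.tangosol.net.CacheFactory;

    public class OverrideExample {
        public static void main(String[] args) {
            // Hypothetical override file; equivalent to -Dcoherence.override=...
            System.setProperty("coherence.override", "tangosol-coherence-override.xml");

            // On startup the log reports either "Loaded operational overrides from ..."
            // or "Optional configuration override ... is not specified".
            CacheFactory.ensureCluster();
        }
    }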
Partitioned Cache Service Log Messages
- Application code running on "%s1" service thread(s) should not call ensureCache as this may result in a deadlock
-
The most common case is a CacheFactory call from a custom CacheStore implementation. %s1 - the service which is executing the application code.
Cause: This message indicates that an ensureCache operation has been called for a cache on the same service that is executing the application code. This code can be from any user application code such as an interceptor, entry processor, or cache store. This is to protect against potential deadlock through a callback into the same service. A sketch of this re-entrant pattern appears at the end of this section.
- Asking member %n1 for %n2 primary partitions
-
%n1 - the node id this node asks to transfer partitions from; %n2 - the number of partitions this node is willing to take. Severity: 4-Debug Level 4.
- Transferring %n1 out of %n2 primary partitions to member %n3 requesting %n4
-
%n1 - the number of primary partitions this node is transferring to a requesting node; %n2 - the total number of primary partitions this node currently owns; %n3 - the node id that this transfer is for; %n4 - the number of partitions that the requesting node asked for. Severity: 4-Debug Level 4.
- Transferring %n1 out of %n2 partitions to a machine-safe backup 1 at member %n3 (under %n4)
-
%n1 - the number of backup partitions this node is transferring to a different node; %n2 - the total number of partitions this node currently owns that are "endangered" (do not have a backup); %n3 - the node id that this transfer is for; %n4 - the number of partitions that the transferee can take before reaching the "fair share" amount. Severity: 4-Debug Level 4.
- Transferring backup%n1 for partition %n2 from member %n3 to member %n4
-
%n1 - the index of the backup partition that this node is transferring to a different node; %n2 - the partition number that is being transferred; %n3 - the node id of the previous owner of this backup partition; %n4 - the node id that the backup partition is being transferred to. Severity: 5-Debug Level 5.
- Failed backup transfer for partition %n1 to member %n2; restoring owner from: %n2 to: %n3
-
%n1 - the partition number for which a backup transfer was in-progress; %n2 - the node id that the backup partition was being transferred to; %n3 - the node id of the previous backup owner of the partition. Severity: 4-Debug Level 4.
- Deferring the distribution due to %n1 pending configuration updates
-
%n1 - the number of configuration updates. Severity: 5-Debug Level 5.
- Limiting primary transfer to %n1 KB (%n2 partitions)
-
%n1 - the size in KB of the transfer that was limited; %n2 - the number of partitions that were transferred. Severity: 4-Debug Level 4.
- DistributionRequest was rejected because the receiver was busy. Next retry in %n1 ms
-
%n1 - the time in milliseconds before the next distribution check is scheduled. Severity: 6-Debug Level 6.
- Restored from backup %n1 partitions
-
%n1 - the number of partitions being restored. Severity: 3-Informational.
- Re-publishing the ownership for partition %n1 (%n2)
-
%n1 - the partition number whose ownership is being re-published; %n2 - the node id of the primary partition owner, or 0 if the partition is orphaned. Severity: 4-Debug Level 4.
- %n1> Ownership conflict for partition %n2 with member %n3 (%n4!=%n5)
-
%n1 - the number of attempts made to resolve the ownership conflict; %n2 - the partition whose ownership is in dispute; %n3 - the node id of the service member in disagreement about the partition ownership; %n4 - the node id of the partition's primary owner in this node's ownership map; %n5 - the node id of the partition's primary owner in the other node's ownership map. Severity: 4-Debug Level 4.
- Assigned %n1 orphaned primary partitions
-
%n1 - the number of orphaned primary partitions that were re-assigned. Severity: 2-Warning.
- validatePolls: This service timed-out due to unanswered handshake request. Manual intervention is required to stop the members that have not responded to this Poll
-
Severity: 1-Error.
- com.tangosol.net.RequestPolicyException: No storage-enabled nodes exist for service service_name
-
Severity: 1-Error.
- An entry was inserted into the backing map for the partitioned cache "%s" that is not owned by this member; the entry will be removed."
-
%s - the name of the cache into which insert was attempted. Severity: 1-Error.
- Exception occurred during filter evaluation: %s; removing the filter...
-
%s - the description of the filter that failed during evaluation. Severity: 1-Error.
- Exception occurred during event transformation: %s; removing the filter...
-
%s - the description of the filter that failed during event transformation. Severity: 1-Error.
- Exception occurred during index rebuild: %s
-
%s - the stack trace for the exception that occurred during index rebuild. Severity: 1-Error.
- Exception occurred during index update: %s
-
%s - the stack trace for the exception that occurred during index update. Severity: 1-Error.
- Exception occurred during query processing: %s
-
%s - the stack trace for the exception that occurred while processing a query. Severity: 1-Error.
- BackingMapManager %s1: returned "null" for a cache: %s2
-
%s1 - the classname of the BackingMapManager implementation that returned a null backing-map; %s2 - the name of the cache for which the BackingMapManager returned null. Severity: 1-Error.
- BackingMapManager %s1: failed to instantiate a cache: %s2
-
%s1 - the classname of the BackingMapManager implementation that failed to create a backing-map; %s2 - the name of the cache for which the BackingMapManager failed. Severity: 1-Error.
- BackingMapManager %s1: failed to release a cache: %s2
-
%s1 - the classname of the BackingMapManager implementation that failed to release a backing-map; %s2 - the name of the cache for which the BackingMapManager failed. Severity: 1-Error.
- Unexpected event during backing map operation: key=%s1; expected=%s2; actual=%s3
-
%s1 - the key being modified by the cache; %s2 - the expected backing-map event from the cache operation in progress; %s3 - the actual MapEvent received. Severity: 6-Debug Level 6.
- Application code running on "%s1" service thread(s) should not call %s2 as this may result in deadlock. The most common case is a CacheFactory call from a custom CacheStore implementation.
-
%s1 - the name of the service which has made a re-entrant call; %s2 - the name of the method on which a re-entrant call was made. Severity: 2-Warning.
- Repeating %s1 for %n1 out of %n2 items due to re-distribution of %s2
-
%s1 - the description of the request that must be repeated; %n1 - the number of items that are outstanding due to re-distribution; %n2 - the total number of items requested; %s2 - the list of partitions that are in the process of re-distribution and for which the request must be repeated. Severity: 5-Debug Level 5.
- Error while starting cluster: com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(%s)
-
%s - information on the service that could not be started. Severity: 1-Error.
- Failed to restart services: com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(%s)
-
%s - information on the service that could not be started. Severity: 1-Error.
- Failed to recover partition 0 from SafeBerkeleyDBStore(...); partition-count mismatch 501(persisted) != 277(service); reinstate persistent store from trash once validation errors have been resolved
-
Cause: The partition-count is changed while active persistence is enabled. The current active data is copied to the trash directory.
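For illustration of the re-entrancy warnings above ("should not call ensureCache" and "should not call %s2"), the following is a minimal, hypothetical sketch of the anti-pattern: an entry processor that calls back into a cache while already running on the partitioned service thread. The class and cache names are assumptions, not from the original text:

    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.NamedCache;
    import com.tangosol.util.InvocableMap;
    import com.tangosol.util.processor.AbstractProcessor;

    // Hypothetical processor that demonstrates the re-entrant call the warning is about.
    public class ReentrantProcessor extends AbstractProcessor {
        @Override
        public Object process(InvocableMap.Entry entry) {
            // Anti-pattern: this runs on the partitioned service thread, and
            // CacheFactory.getCache() here is a re-entrant call into the same
            // service if "other-cache" is owned by it, which may deadlock.
            NamedCache other = CacheFactory.getCache("other-cache");
            return other.get(entry.getKey());
        }
    }

A common remedy is to pass the required data into the processor, or to perform the cross-cache access from a client thread or a different service.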
Service Thread Pool Log Messages
Log messages that pertain to the service thread pool.
- DaemonPool "%s" increasing the pool size from %n1 to %n2 thread(s) due to the pool being shaken
-
%s - the service name; %n1 - the current thread pool count; %n2 - the new thread pool count. Severity: 3 - Informational.
Cause: The thread pool count will be increased intermittently and the thread pool throughput will be measured to determine if the increase is effective. The thread count will be increased only if dynamic thread pooling is enabled and the new thread count does not exceed the maximum configured value.
Action: None. This is part of the process for determining the most effective thread pool count.
- DaemonPool "%s" increasing the pool size from %n1 to %n2 thread(s) due to a decrease in throughput of %n3op/sec
-
%s - the service name; %n1 - the current thread pool count; %n2 - the new thread pool count; %n3 - the change in operations per second. Severity: 3 - Informational.
Cause: The thread pool task throughput was reduced with a lower thread count. The thread count is being increased to improve throughput. The thread count will be increased only if dynamic thread pooling is enabled and the new thread count does not exceed the maximum configured value.
Action: None. This is part of the process for determining the most effective thread pool count.
- DaemonPool "%s" decreasing the pool size from %n1 to %n2 thread(s) due to a decrease in throughput of %n3op/sec
-
%s - the service name; %n1 - the current thread pool count; %n2 - the new thread pool count; %n3 - the change in operations per second. Severity: 3 - Informational.
Cause: The thread pool task throughput was reduced with a higher thread count. The thread count is being decreased to improve throughput. The thread count will be decreased only if dynamic thread pooling is enabled and the new thread count does not drop below the configured minimum value.
Action: None. This is part of the process for determining the most effective thread pool count.
TMB Log Messages
Log messages that pertain to TMB.
- %s1 rejecting connection from %s2 using incompatible protocol id %s3, required %s4
-
%s1 - the local endpoint; %s2 - the socket address; %s3 - the connection protocol id; %s4 - the required protocol Id. Severity: 2-Warning.
Cause: A Coherence node with incompatible protocol identifier has attempted to establish a connection with this node. This should not happen unless the request is from a malicious connect attempt or the message header is corrupted.
Action: Restart the remote node. If the problem persists, send all related information to Oracle Support for investigation.
- %s1 rejecting connection from %s2, bus is closing
-
%s1 – the local endpoint; %s2 – the socket address. Severity: 5-Debug.
Cause: The local message bus connection received a connection request while it is being closed. This is likely to occur during the local node shutdown.
Action: None.
- %s1 deferring reconnect attempt from %s2 on %s3, pending release
-
%s1 – the local endpoint; %s2 – the peer’s endpoint; %s3 – the associated channel socket. Severity: 5-Debug.
Cause: The local message bus connection received a reconnect request from the same remote endpoint while the current connection is waiting for an application to fully release.
Action: None.
- %s1 replacing deferred reconnect attempt from %s2 on %s3, pending release
-
%s1 – the local endpoint; %s2 – the peer’s endpoint; %s3 – the associated channel socket. Severity: 5-Debug.
Cause: The local TCP socket received a subsequent reconnect request, thus replacing a previous reconnect attempt, while waiting for the application to fully release. This is expected due to the possibility of concurrent connect initiation from both the remote and local endpoint, which is part of the connection handshake protocol.
Action: None.
- %s1 initiating connection migration with %s2 after %n ack timeout %s3
-
%s1 – the local endpoint; %s2 - the peer’s endpoint; %n – the ack timeout value; %s3 – debug info. Severity: 2-Warning.
Cause: A message was sent, but a logical ack for that message was not received for more than the configured ack timeout. The default value of this timeout is 15s which, from a Coherence perspective, is an eternity to not deliver a message. This is a means to detect a stalled connection and initiate the remedial actions to resolve the situation.
Action: If a stalled connection was correctly inferred, then the remedial action of migrating the connection to a new TCP connection should resolve the issue. To resolve the stalled connections, ensure that you have the latest version of the OS, as stalls have been observed in certain Linux kernel versions. The stall may not be in the connection itself and may instead be due to process unresponsiveness. Therefore, also ensure that network connectivity to the machine mentioned in %s2 looks reasonable and that the process seems responsive (not in a GC loop). Frequent migration severely impacts performance and availability. If the message repeats perpetually, collect a heap dump from both the local and peer nodes, any available network reports, and all Coherence logs. Send the information to Oracle Support for investigation.
- %s1 accepting connection migration with %s2, replacing %s3 with %s4:%s5
-
%s1 – the local endpoint; %s2 – the peer endpoint; %s3 – the old SocketChannel; %s4 – the new SocketChannel; %s5 – the old message bus connection. Severity: 2-Warning.
Cause: The peer initiated a connection migration while local was not aware of the connection issue. The local message bus accepted the request and replaced the old socket channel with the new one. The migration can be caused by TCP connection stalls, GC, or a network issue.
Action: If the problem persists, collect heap dumps from the local and remote servers, any available network reports, and all Coherence logs. Send the information to Oracle Support for investigation. Also, enabling TCP captures provides significant insight into whether the messages are being received by the peer and the sender.
- %s1 migrating connection with %s2 off %s3 on %s4
-
%s1 – the local endpoint; %s2 – the peer endpoint; %s3 – the socket channel; %s4 – the string representation of the message bus connection. Severity: 6-Debug.
Cause: The local message bus initiated a connection migration due to ack timeout or another error. If the message is seen frequently while the application is still functioning, it indicates process unresponsiveness (often due to GC) or a network issue, which is likely to impact cluster performance.
Action: Investigate the remote GC or network issue. If the problem persists, send heap dumps of both the local and peer as well as all Coherence logs to Oracle Support for investigation. Also, enabling TCP captures provides significant insight into whether the messages are being received by the peer and by the sender.
- %s1 synchronizing migrated connection with %s2 will result in %n1 skips and %n2 re-deliveries: %s3
-
%s1 – the local endpoint; %s2 – the peer’s endpoint; %n1 – number of messages to skip; %n2 – number of messages to re-deliver; %s3 – the string representation of the local bus connection. Severity: 5-Debug.
Cause: This is informational only. The migrated connection needs to skip or redeliver the queued messages, depending on whether acks for the messages are received.
Action: None.
- %s1 rejecting connection migration from %s2 on %s3, no existing connection %s4/%s5
-
%s1 – the local endpoint; %s2 – the peer endpoint; %s3 – the local socket address; %s4 – the current connection identifier; %s5 – the old connection identifier or 0 if old connection does not exist. Severity: 5-Debug.
Cause: The local message bus received a migration request for a connection that does not exist, so the request is rejected. Most probably, the connection has already been released.
Action: None.
- %s1, %s2 accepted connection migration with %s3:%s4
-
%s1 – the local endpoint; %s2 – the peer endpoint; %s3 – the socket channel; %s4 – the string representation of message bus connection. Severity: 2-Warning.
Cause: This is an informational message. The connection migration has successfully completed the handshake protocol.
Action: None.
- %s1 resuming migrated connection with %s2
-
%s1 – the local endpoint; %s2 – the string representation of bus connection. Severity: 5-Debug.
Cause: This is an informational message. The connection was successfully migrated; normal processing now resumes on the new migrated socket channel.
Action: None.
- %s1 ServerSocket failure; no new connection will be accepted.
-
%s1 – the local endpoint. Severity: 1-Error.
Cause: This message indicates that the server socket channel, on which the message bus accepts new connections, has failed to register with a selection service.
Action: This is an unexpected state and may require a node restart if the process continues to appear unhealthy.
- %s1 disconnected connection with %s2
-
%s1 – local endpoint; %s2 – remote endpoint. Severity: 3-Info.
Cause: The connection with the mentioned remote endpoint was disconnected.
Action: None.
- %s1 close due to exception during handshake phase %s2 on %s3
-
%s1 – the local endpoint; %s2 – the phase of handshake; %s3 – socket associated with the connection channel. Severity: 2-Warning.
Cause: The connection request was rejected during the mentioned handshake phase due to SSLException; the associated socket channel was closed.
Action: The error message should indicate why the handshake failed and should provide sufficient information to resolve the problem (for example, an expired certificate). If the issue persists, contact Oracle Support.
- %s1 dropping connection with %s2 after %s3 fatal ack timeout %s4
-
%s1 – the local endpoint; %s2 – the remote endpoint; %s3 – the fatal ack timeout value in ms; %s4 – info for debugging purpose. Severity: 2-Warning.
Cause: The local bus connection has failed to hear from the peer for the configured fatal ack timeout; the connection will be dropped as it is unrecoverable. This is likely caused by extended process unresponsiveness (potential GC issues) or a network issue.
Action: Investigate remote GC logs or network logs (TCP captures / network monitoring). If the problem persists, send heap dumps from both the local and peer, as well as all Coherence logs to Oracle Support for investigation.
- %s unexpected exception during Bus accept, ignoring
-
%s – the local endpoint. Severity: 3-Info.
Cause: An exception occurred while the server socket was accepting a connection request. It is safe to ignore the exception and continue accepting requests because the server socket channel is still open.
Action: None.
- %s ServerSocket failure; no new connection will be accepted
-
%s – the local endpoint. Severity: 1-Error.
Cause: An exception occurred while the server socket was accepting a connection request, and the server socket was closed unexpectedly.
Action: Restart the node. If the problem persists, contact Oracle Support.
- Unhandled exception in %s, attempting to continue
-
%s – the selection service. Severity: 1-Error.
Cause: An unexpected error occurred while running the selector thread. The thread will continue to select and process messages.
Action: None.
- %s1 disconnected connection with %s2
-
%s1 – the local endpoint; %s2 – the string representation of bus connection. Severity: 2-Warning if the reason for disconnect is SSLException, otherwise 6-Debug.
Cause: The mentioned connection was closed. This could be due to various reasons, including an encountered exception/error or an expected release.
Action: None.
Cluster Service Exceptions
Exceptions thrown by the cluster service.
- IllegalStateException: The cluster has been halted and is not restartable. This cluster member's JVM process must be restarted.
-
Cause: When a service guardian fails to terminate its non-responsive service thread and the guardian service failure policy is exit-cluster, the cluster is halted. Any attempt to use a Coherence cluster operation after the cluster has been halted results in this exception.
Action: Restart the server process when the cluster is halted. To avoid the need to restart the server process in the future, review if another guardian service failure policy option is more appropriate for your environment. See Setting the Guardian Service Failure Policy in Developing Applications with Oracle Coherence. The default guardian service failure policy is exit-cluster. To identify what time this failure occurred, search the server logs for the guardian service log message "Oracle Coherence <Error>: Halted the cluster: Cluster is not running: State=5". A sketch of handling this exception appears at the end of this section.
- Stopping cluster due to unhandled exception: <exception and its descriptive message>
-
Cause: After the cluster is halted, the above message can occur multiple times while attempting to stop the cluster completely.
Action: This message and all other error or warning messages that occur after the cluster is halted can be ignored.
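Because a halted cluster cannot be restarted in-process, one hedged way to react to the IllegalStateException described above is to fail fast so an external supervisor restarts the JVM. This is only a sketch; the helper method, exit code, and supervisor choice are assumptions:

    import com.tangosol.net.NamedCache;

    public class HaltedClusterHandler {
        public static Object safeGet(NamedCache cache, Object key) {
            try {
                return cache.get(key);
            } catch (IllegalStateException e) {
                // After the cluster is halted, cluster operations fail with this
                // exception and the JVM must be restarted. Exiting lets an external
                // supervisor (systemd, Kubernetes, and so on) restart the member.
                System.err.println("Cluster halted; restarting JVM: " + e.getMessage());
                System.exit(1);
                return null; // unreachable
            }
        }
    }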
Guardian Service Log Messages
Log messages that pertain to the Guardian Service.
- Detected hard timeout after %s1 of {WrapperGuardable Guard{Daemon=%s2 } Service=%s3{Name=%s4, State=(SERVICE_STARTED), …}}
-
%s1 – the time duration; %s2 – the cache scheme:service name; %s3 – the service kind; %s4 – the service name.
Cause: The guardian service did not receive a heartbeat from the service for the duration %s1. The guardian will attempt to terminate the non-responding service thread.
Action: Informational. Typically, the guardian is able to recover the service thread and proceed. A heartbeat sketch appears at the end of this section.
- onServiceFailed: Failed to stop service %s with state=%n, isAlive=%b1, stop service thread isAlive=%b2
-
%s – the service name; %n – the service state; %b1 – true if the service is still alive; %b2 – true if the terminating service thread is still alive.
Cause: The guardian service failed to interrupt the hung service thread %s.
Action: Analyze the hung service's stack trace (see the next message), and find the next Full Thread Dump and Outstanding Polls entries in the server log to determine what the service thread was waiting on when it became stuck.
- onServiceFailed: Service thread: Stack trace for thread Thread[%s1:%s2,%n,Cluster]: <%stackTrace>
-
%s1 – the cache kind; %s2 – the service name; %n – the thread identifier; <%stackTrace> – the multi-line stack trace of the uninterruptible thread.
Cause: The guardian service failed to interrupt service %s2.
Action: Analyze the stack trace of the hung service and search for Full Thread Dump to see if there was a deadlock between the stuck thread and another running thread.
- Oracle Coherence <Error>: Halted the cluster: Cluster is not running: State=5
-
Cause: When the guardian service fails to recover a stuck thread and the guardian service failure policy is exit-cluster, the cluster is halted.
Action: Restart the server process when the cluster is halted. To avoid the need to restart the server process in the future, review if another guardian service failure policy option is more appropriate for your environment. See Setting the Guardian Service Failure Policy in Developing Applications with Oracle Coherence. The default guardian service failure policy is exit-cluster. To understand the failure that resulted in the cluster being halted, look for the log messages listed in this section in the server log. The timestamp on this message provides the exact time the cluster was halted.
- (thread=Recovery Thread, member=%n): Full Thread Dump: (excluding deadlock analysis)
-
%n – the short Coherence member id.
Cause: The guardian service recovery thread prints this diagnostic when recovery of the stuck service thread fails.
Action: Analyze whether the stuck service thread that could not be stopped was blocked waiting for a lock held by another active thread at the time of the failure.
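When the "Detected hard timeout" message is triggered by legitimately long-running application work rather than a hung thread, the documented GuardSupport.heartbeat() call lets guarded code tell the guardian it is still making progress. A minimal sketch with hypothetical class and method names:

    import com.tangosol.net.GuardSupport;
    import com.tangosol.util.InvocableMap;
    import com.tangosol.util.processor.AbstractProcessor;

    // Hypothetical long-running processor; the work and names are illustrative only.
    public class LongTaskProcessor extends AbstractProcessor {
        @Override
        public Object process(InvocableMap.Entry entry) {
            for (int i = 0; i < 100; i++) {
                doExpensiveStep(entry, i); // placeholder for real work
                // Signal the guardian that this thread is still making progress so a
                // legitimately long task is not reported as a hard timeout.
                GuardSupport.heartbeat();
            }
            return null;
        }

        private void doExpensiveStep(InvocableMap.Entry entry, int step) {
            // application-specific work
        }
    }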
Persistence Log Messages
Log messages that pertain to persistence.
- "/xx/xx/CoherenceSnapshot/xxCoherenceCluster/DistributedCache/CoherenceSnapshotnnnn" appears to reference a remote file-system and as such Coherence persistence is enabling "coherence.distributed.persistence.bdb.je.log.useODSYNC" in order ensure the integrity of remote commits. As this may impact write performance, you may explicitly set the system property to "false" to override this decision; though this is only recommended if the location is actually a local file-system.
-
Cause: Coherence detected that the persistence environment points to a remote file system (for example, NFS) as opposed to a local file system. As such, the property coherence.distributed.persistence.bdb.je.log.useODSYNC is automatically enabled to ensure the integrity of remote commits for the cache data that is persisted.
Action: Oracle strongly recommends enabling ODSYNC for remote file systems. The coherence.distributed.persistence.bdb.je.log.useODSYNC property can also be explicitly disabled by setting it to false for faster write performance (for example, when the file system has been detected as remote but is effectively local). A sketch of this override appears at the end of this section.
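A minimal sketch of the explicit override mentioned above, using the property name from the message itself; setting it programmatically is shown only for illustration, and the -D command-line form is equivalent:

    import com.tangosol.net.CacheFactory;

    public class OdsyncOverrideExample {
        public static void main(String[] args) {
            // Equivalent to -Dcoherence.distributed.persistence.bdb.je.log.useODSYNC=false.
            // Only appropriate when the "remote" location is effectively a local file system.
            System.setProperty(
                    "coherence.distributed.persistence.bdb.je.log.useODSYNC", "false");
            CacheFactory.ensureCluster();
        }
    }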
Transaction Exception Messages
Exceptions thrown by the Transaction Framework API.
- com.tangosol.coherence.transaction.exception.UnableToAcquireLockException: Unable to acquire write lock for key:%s1 in table:%s2 for service: %s3 in thread: %s4 Unable to acquire write lock for TxnId: %d5; lock is already owned by TxnId: %d6
-
%s1 – the toString of the key; %s2 – the table name; %s3 – the transaction service name; %s4 – the name of the thread unable to acquire the lock; %d5 – the short JTA transaction id (unique within the service) of the current thread; %d6 – the short JTA transaction id (unique within the service) of the JTA transaction that is holding the write lock.
Cause: The thread described by %s4 and %s3, with the active short JTA transaction id %d5, is unable to acquire a write lock on the key described by %s1 in table %s2 because another concurrent JTA transaction, %d6, holds the write lock.
Action: The only action needed is for the Transaction Framework API to roll back the JTA transaction that could not be completed due to this exception. All locks held by a JTA transaction are released when the JTA transaction is committed or rolled back. See Performing Cache Operations Within a Transaction in Performing Transactions (Part V, Performing Data Grid Operations) for how to perform cache operations within a transaction implicitly (Example 35-4, Performing an Auto-Commit Transaction) or explicitly (Example 35-5, Performing a Non Auto-Commit Transaction). A sketch of an explicit transaction appears at the end of this section.
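For context, a hedged sketch of an explicit (non auto-commit) Transaction Framework transaction, along the lines of the referenced Example 35-5. It assumes a cache mapped to a transactional scheme and a transactional service named "TransactionalCache"; the cache name and values are hypothetical. Committing or rolling back releases the write locks that this exception refers to:

    import com.tangosol.coherence.transaction.Connection;
    import com.tangosol.coherence.transaction.DefaultConnectionFactory;
    import com.tangosol.coherence.transaction.OptimisticNamedCache;

    public class ExplicitTransactionExample {
        public static void main(String[] args) {
            Connection con = new DefaultConnectionFactory().createConnection("TransactionalCache");
            con.setAutoCommit(false);
            try {
                OptimisticNamedCache cache = con.getNamedCache("MyTxCache");
                cache.put("key", "value"); // acquires the write lock for "key"
                con.commit();              // releases all locks held by the transaction
            } catch (RuntimeException e) {
                con.rollback();            // rolling back also releases the locks
                throw e;
            } finally {
                con.close();
            }
        }
    }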