SNMP Poller

The SNMP Poller microservice discovers SNMP devices for the Discovery Service microservice and periodically polls the discovered devices for availability.

This microservice is part of the Discovery microservice pipeline. It uses a worker-coordinator design to balance workloads and allow for scaling. You deploy instances of this microservice and others in the same pipeline to separate clusters for each device zone. See Understanding the Discovery Pipeline in Unified Assurance Concepts for conceptual information.

You can enable redundancy for this microservice when you deploy it. See Configuring Microservice Redundancy for general information.

Autoscaling is supported but disabled by default for this microservice. You can optionally enable autoscaling when you deploy the microservice. See Configuring Autoscaling and SNMP Poller Autoscaling Configuration.

This microservice provides additional Prometheus monitoring metrics. See SNMP Poller Self-Monitoring Metrics.

SNMP Poller Prerequisites

Before deploying the microservice, confirm that the following prerequisites are met:

  1. A microservice cluster is set up. See Microservice Cluster Setup.

  2. The following microservices are deployed:

Deploying SNMP Poller

To deploy the microservice, run the following commands:

su - assure1
export NAMESPACE=<namespace>
export WEBFQDN=<WebFQDN> 
a1helm install <microservice-release-name> assure1/snmp-poller -n $NAMESPACE --set global.imageRegistry=$WEBFQDN

In the commands:

<namespace> is the namespace in which you are deploying the microservice.

<WebFQDN> is the fully qualified domain name of your primary Unified Assurance web server, which the command uses as the image registry.

<microservice-release-name> is the name to give the Helm release for this deployment.

You can also use the Unified Assurance UI to deploy microservices. See Deploying a Microservice by Using the UI for more information.

Changing SNMP Poller Configuration Parameters

When running the install command, you can optionally change default configuration parameter values by including them in the command with additional --set arguments. You can add as many additional --set arguments as you need.

For example:
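To set the global log level to DEBUG during installation (a sketch that assumes the global parameters described in Default Global SNMP Poller Configuration sit directly under configData):

a1helm install <microservice-release-name> assure1/snmp-poller -n $NAMESPACE --set global.imageRegistry=$WEBFQDN --set configData.LOG_LEVEL=DEBUG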

Default SNMP Poller Configuration

Some SNMP Poller configurations apply to workers and coordinators, some apply only to coordinators, and some apply only to workers. The parameters set for the workers or coordinators specifically override the global parameters. For example, if you set global log levels to DEBUG, but the log level for coordinators to INFO, then the coordinator logs will use INFO and worker logs will use DEBUG.
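As an illustrative sketch of that precedence, the following --set arguments set the global log level to DEBUG and the coordinator log level to INFO. The coordinator key path shown here (configData.coordinator.LOG_LEVEL) is hypothetical; check the chart's values.yaml for the actual structure:

a1helm install <microservice-release-name> assure1/snmp-poller -n $NAMESPACE --set global.imageRegistry=$WEBFQDN --set configData.LOG_LEVEL=DEBUG --set configData.coordinator.LOG_LEVEL=INFO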

Default Global SNMP Poller Configuration

The following table describes the default global configuration parameters found in the Helm chart under configData for the microservice. These apply to both workers and coordinators.

Name Default Value Possible Values Notes
LOG_LEVEL INFO FATAL, ERROR, WARN, INFO, DEBUG Global logging level between coordinator and workers. Any setting at the worker or coordinator level overrides this.
GRPC_CONN_DOWN_DEADLINE 5s Integer + Text (ns, us, µs, ms, s, m, h) The time period to wait for a gRPC connection before it is considered failed.
GRPC_CLIENT_KEEPALIVE false Text (true or false) Whether to use client-side keepalive checks, sent from the workers, to validate communication with the coordinator. See About Keep-Alive Configurations for information about the keepalive parameters.
GRPC_CLIENT_KEEPALIVE_TIME 30s Integer + Text (ns, us, µs, ms, s, m, h) The time period, after no communication, to ping the server (coordinator).
GRPC_CLIENT_KEEPALIVE_TIMEOUT 5s Integer + Text (ns, us, µs, ms, s, m, h) The time period to wait for a response to the ping before the server connection is considered down.
GRPC_SERVER_KEEPALIVE false Text (true or false) Whether to use server-side keepalive checks, sent from the coordinator, to validate communication with the workers.
GRPC_SERVER_KEEPALIVE_TIME 30s Integer + Text (ns, us, µs, ms, s, m, h) The time period, after no communication, to ping the clients (workers).
GRPC_SERVER_KEEPALIVE_TIMEOUT 5s Integer + Text (ns, us, µs, ms, s, m, h) The time period to wait for a response to the ping before the client connection is considered down.

About Keep-Alive Configurations

By default, the coordinator and individual workers periodically send heartbeat messages to each other, with no validation, to check that the connection is not idle. To validate the connection, you can optionally enable ping-based gRPC keepalive checks, which expect a response within a configurable timeframe. If no response is received, the connection is considered down and the workers attempt to reestablish communication.

In the SNMP Poller microservice, the coordinator acts as the gRPC server and the workers act as clients. You enable keepalive checks from the coordinator to the workers with the GRPC_SERVER_KEEPALIVE parameter and from the workers to the coordinator with the GRPC_CLIENT_KEEPALIVE parameter. You set the interval at which the checks are made in the GRPC_SERVER_KEEPALIVE_TIME and GRPC_CLIENT_KEEPALIVE_TIME parameters, and the time within which a response is expected in the GRPC_SERVER_KEEPALIVE_TIMEOUT and GRPC_CLIENT_KEEPALIVE_TIMEOUT parameters.

Client-side keepalive checks are subject to a mandatory enforcement policy. If the client pings too frequently, the connection is dropped with an ENHANCE_YOUR_CALM(too_many_pings) error. When you enable client-side keepalive checks, the SNMP Poller automatically sets the enforcement policy's minimum ping interval to the value of GRPC_CLIENT_KEEPALIVE_TIME minus the value of GRPC_CLIENT_KEEPALIVE_TIMEOUT.
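For example, the following --set arguments enable client-side keepalive checks with the documented default timings, assuming these global parameters sit directly under configData. With these values, the enforcement policy allows pings no more often than every 25 seconds (30s minus 5s):

--set configData.GRPC_CLIENT_KEEPALIVE=true --set configData.GRPC_CLIENT_KEEPALIVE_TIME=30s --set configData.GRPC_CLIENT_KEEPALIVE_TIMEOUT=5s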

Default SNMP Poller Coordinator Configuration

The following table describes the default configuration parameters for coordinators found in the Helm chart under configData for the microservice.

Name Default Value Possible Values Notes
LOG_LEVEL INFO FATAL, ERROR, WARN, INFO, DEBUG Coordinator logging level. This overrides the global configuration.
POLLER_RESYNC_PERIOD 15m Integer + Text (ns, us, µs, ms, s, m, h) The time to wait before the coordinator re-synchronizes with the Unified Assurance database.
DISCOVERY_WORKERS_PERCENTAGE 25 Integer, 0 up to 100 The percentage of workers allocated to perform discovery workloads exclusively.
WORKER_CONCURRENCY 2000 Integer, greater than 0 The number of concurrent SNMP workloads that a single worker instance can perform.
WORKER_STREAM_FAILURE_THRESHOLD 5 Integer, greater than 0 The number of stream reconnections within the timeframe specified in WORKER_STREAM_FAILURE_WINDOW before the worker is forced to restart.
WORKER_STREAM_FAILURE_WINDOW 30m Integer + Text (ns, us, µs, ms, s, m, h) The timeframe during which stream reconnections are counted before forcing the worker to restart.
PULSAR_SNMP_DISCOVERY_TOPIC_OVERRIDE "" Text Override for the topic from which the coordinator listens for discovery workload requests.
REDUNDANCY_INIT_DELAY 20s Integer + Text (ns, us, µs, ms, s, m, h) At startup, the amount of time to wait for the primary microservice to come up before initiating redundancy.
REDUNDANCY_POLL_PERIOD 5s Integer + Text (ns, us, µs, ms, s, m, h) The amount of time between status checks from the secondary microservice to the primary microservice.
REDUNDANCY_FAILOVER_THRESHOLD 4 Integer, greater than 0 The number of times the primary microservice must fail checks before the secondary microservice becomes active.
REDUNDANCY_FALLBACK_THRESHOLD 1 Integer, greater than 0 The number of times the primary microservice must succeed checks before the secondary microservice becomes inactive.
PROBE_V2_SUPPORT_ENABLED "" Bool Whether to enable SNMP probe v2c (true) or v1 (false) for v2c-enabled devices during device discovery. If no value is provided, the default is false.

Default SNMP Poller Worker Configuration

The following table describes the default configuration parameters for workers found in the Helm chart under configData for the microservice.

Name Default Value Possible Values Notes
LOG_LEVEL INFO FATAL, ERROR, WARN, INFO, DEBUG Worker logging level. This overrides the global configuration.
GRPC_GRACEFUL_CONN_TIME 60s Integer + Text (ns, us, µs, ms, s, m, h) The amount of time the workers should try to connect with the coordinator before failing.
STREAM_OUTPUT_METRIC "" Text Override for the topic where performance polling workload results are published.
STREAM_OUTPUT_AVAILABILITY "" Text Override for the topic where availability polling workload results are published.
PULSAR_DISCOVERY_CALLBACK_OVERRIDE "" Text Override for the topic where discovery workload results are published.

SNMP Poller Autoscaling Configuration

Autoscaling is supported for the SNMP Poller microservice. See Configuring Autoscaling for general information and details about the standard autoscaling configurations.

For SNMP Poller, KEDA also uses the snmp_coordinator_metric_required_total_workers Prometheus metric to make scaling decisions. This metric is set dynamically: the SNMP Poller microservice coordinator assigns polling and discovery workers during resynchronization with the Unified Assurance database, and the required numbers are based on polling throughput and your configuration settings as follows:

When you deploy the SNMP Poller microservice with autoscaling enabled, you must also calculate the total number of workers required, based on the expected number of devices to be polled in the device zone, to determine the value to use for the maxReplicaCount autoscaling configuration setting.

For example:
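As an illustrative calculation only, assume one concurrent polling workload per device. With the default WORKER_CONCURRENCY of 2000, polling 150,000 devices requires 75 polling workers. With the default DISCOVERY_WORKERS_PERCENTAGE of 25, polling workers account for 75 percent of the total, so the total is 75 / 0.75 = 100 workers, suggesting a maxReplicaCount of at least 100. Confirm the actual workload-per-device ratio for your environment before sizing.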

Modifying Scaling Triggers

By default, only the snmp_coordinator_metric_required_total_workers metric is configured as an autoscaling trigger. You can define additional triggers in the Helm chart by adding them under the triggers section, as shown in the sketch after the default configuration below.

For example, the default trigger configuration is:

autoscaling:
  ...
  triggers:
    - type: prometheus
      metadata:
        metricName: required_total_workers
        serverAddress: http://prometheus-kube-prometheus-prometheus.a1-monitoring.svc.cluster.local:9090
        query: snmp_coordinator_metric_required_total_workers
        threshold: '1'
        metricType: Value
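As a sketch of adding a second trigger, the following configuration adds a standard KEDA CPU trigger alongside the default Prometheus trigger. The 80 percent utilization threshold is only an illustration; choose a value appropriate for your environment:

autoscaling:
  ...
  triggers:
    - type: prometheus
      metadata:
        metricName: required_total_workers
        serverAddress: http://prometheus-kube-prometheus-prometheus.a1-monitoring.svc.cluster.local:9090
        query: snmp_coordinator_metric_required_total_workers
        threshold: '1'
        metricType: Value
    - type: cpu
      metricType: Utilization
      metadata:
        value: "80"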

SNMP Poller Self-Monitoring Metrics

The SNMP Poller microservice exposes the self-monitoring metrics for coordinators described in the following table to Prometheus.

Each of the metrics in the table is prefixed with snmp_coordinator when exposed to Prometheus. For example, the full name of the first metric is snmp_coordinator_metric_worker_count.

Metric Name Type Labels Description
metric_worker_count Gauge N/A The number of workers currently enrolled with the coordinator.
metric_workforce_count Gauge N/A The number of workers multiplied by worker concurrency.
metric_discovery_worker_count Gauge N/A The number of discovery workers currently enrolled with the coordinator.
metric_polling_worker_count Gauge N/A The number of polling workers currently enrolled with the coordinator.
metric_required_discovery_workers Gauge N/A The number of workers required for discovery when using autoscaling. Only available when autoscaling is enabled.
metric_required_polling_workers Gauge N/A The number of workers required for polling when using autoscaling. Only available when autoscaling is enabled.
metric_required_total_workers Gauge N/A The number of workers required for polling and discovery when using autoscaling. Only available when autoscaling is enabled.
metric_discovery_requests_queued Gauge N/A The number of discovery requests currently queued (real-time value).
metric_discovery_requests_processing Gauge N/A The number of discovery requests currently being processed (real-time value).
metric_polling_requests_queued Gauge N/A The number of polling requests currently queued (real-time value).
metric_polling_requests_processing Gauge N/A The number of polling requests currently being processed (real-time value).
metric_polled_devices_count GaugeVec domain, cycle The number of polled devices per domain and cycle.
metric_polled_objects_count GaugeVec domain, cycle The number of polled objects per domain and cycle.
metric_polling_duration GaugeVec domain, cycle The total polling duration, in seconds, for the last cycle, per domain and cycle.
metric_polling_average GaugeVec domain, cycle The average polling duration, in seconds, for the last cycle, per domain and cycle.
metric_polling_average95 GaugeVec domain, cycle The 95th percentile average polling duration, in seconds, for the last cycle, per domain and cycle.
metric_polling_utilisation GaugeVec domain, cycle The polling utilisation, as a percentage, for the last cycle, per domain and cycle.
metric_polling_utilisation95 GaugeVec domain, cycle The 95th percentile polling utilisation, as a percentage, for the last cycle, per domain and cycle.
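For example, assuming the snmp_coordinator prefix described above, the following Prometheus query returns the number of polled devices per domain, summed across cycles:

sum by (domain) (snmp_coordinator_metric_polled_devices_count)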

Note:

Metric names in the database include a prefix that indicates the service that inserted them. The prefix is prom_ for metrics inserted by Prometheus. For example, metric_worker_count is stored as prom_metric_worker_count in the database.