SNMP Poller

The SNMP Poller microservice discovers SNMP devices for the Discovery Service microservice and periodically polls the discovered devices for availability.

This microservice is part of the Discovery microservice pipeline. It uses a worker-coordinator design to balance workloads and allow for scaling. You deploy instances of this microservice and others in the same pipeline to separate clusters for each device zone. See Understanding the Discovery Pipeline in Unified Assurance Concepts for conceptual information.

You can enable redundancy for this microservice when you deploy it. See Configuring Microservice Redundancy for general information.

Autoscaling is supported but disabled by default for this microservice. You can optionally enable autoscaling when you deploy the microservice. See Configuring Autoscaling and SNMP Poller Autoscaling Configuration.

This microservice provides additional Prometheus monitoring metrics. See SNMP Poller Self-Monitoring Metrics.

SNMP Poller Prerequisites

Before deploying the microservice, confirm that the following prerequisites are met:

  1. A microservice cluster is set up. See Microservice Cluster Setup.

  2. The following microservices are deployed:

Deploying SNMP Poller

To deploy the microservice, run the following commands:

su - assure1
export NAMESPACE=<namespace>
export WEBFQDN=<WebFQDN> 
a1helm install <microservice-release-name> assure1/snmp-poller -n $NAMESPACE --set global.imageRegistry=$WEBFQDN

In the commands:

<namespace> is the namespace in which you are deploying the microservice.

<WebFQDN> is the fully qualified domain name of your primary Unified Assurance web server, which the command uses as the image registry.

<microservice-release-name> is the name to give the Helm release for this deployment.

You can also use the Unified Assurance UI to deploy microservices. See Deploying a Microservice by Using the UI for more information.

Changing SNMP Poller Configuration Parameters

When running the install command, you can optionally change default configuration parameter values by including them in the command with additional --set arguments. You can add as many additional --set arguments as you need.

For example:
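To set the global log level to DEBUG during installation (a sketch that assumes the global parameters described in Default Global SNMP Poller Configuration sit directly under configData):

a1helm install <microservice-release-name> assure1/snmp-poller -n $NAMESPACE --set global.imageRegistry=$WEBFQDN --set configData.LOG_LEVEL=DEBUG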

Default SNMP Poller Configuration

Some SNMP Poller configurations apply to workers and coordinators, some apply only to coordinators, and some apply only to workers. The parameters set for the workers or coordinators specifically override the global parameters. For example, if you set global log levels to DEBUG, but the log level for coordinators to INFO, then the coordinator logs will use INFO and worker logs will use DEBUG.
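As an illustrative sketch of that precedence, the following --set arguments set the global log level to DEBUG and the coordinator log level to INFO. The coordinator key path shown here (configData.coordinator.LOG_LEVEL) is hypothetical; check the chart's values.yaml for the actual structure:

a1helm install <microservice-release-name> assure1/snmp-poller -n $NAMESPACE --set global.imageRegistry=$WEBFQDN --set configData.LOG_LEVEL=DEBUG --set configData.coordinator.LOG_LEVEL=INFO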

Default Global SNMP Poller Configuration

The following table describes the default global configuration parameters found in the Helm chart under configData for the microservice. These apply to both workers and coordinators.

Name Default Value Possible Values Notes
LOG_LEVEL INFO FATAL, ERROR, WARN, INFO, DEBUG Global logging level between coordinator and workers. Any setting at the worker or coordinator level overrides this.
GRPC_CONN_DOWN_DEADLINE 5s Integer + Text (ns, us, µs, ms, s, m, h) The time period to wait for a gRPC connection before it is considered failed.
GRPC_CLIENT_KEEPALIVE false Text (true or false) Whether to use client-side keepalive checks, sent from the workers, to validate communication with the coordinator. See About Keep-Alive Configurations for information about the keepalive parameters.
GRPC_CLIENT_KEEPALIVE_TIME 30s Integer + Text (ns, us, µs, ms, s, m, h) The time period, after no communication, to ping the server (coordinator).
GRPC_CLIENT_KEEPALIVE_TIMEOUT 5s Integer + Text (ns, us, µs, ms, s, m, h) The time period to wait for a response to the ping before the server connection is considered down.
GRPC_SERVER_KEEPALIVE false Text (true or false) Whether to use server-side keepalive checks, sent from the coordinator, to validate communication with the workers.
GRPC_SERVER_KEEPALIVE_TIME 30s Integer + Text (ns, us, µs, ms, s, m, h) The time period, after no communication, to ping the clients (workers).
GRPC_SERVER_KEEPALIVE_TIMEOUT 5s Integer + Text (ns, us, µs, ms, s, m, h) The time period to wait for a response to the ping before the client connection is considered down.

About Keep-Alive Configurations

By default, the coordinator and individual workers periodically send heartbeat messages to each other, with no validation, to check that the connection is not idle. To validate the connection, you can optionally enable ping-based gRPC keepalive checks, which expect a response within a configurable timeframe. If no response is received, the connection is considered down and the workers attempt to reestablish communication.

In the SNMP Poller microservice, the coordinator acts as the gRPC server and the workers act as clients. You enable keepalive checks from the coordinator to the workers with the GRPC_SERVER_KEEPALIVE parameter and from the workers to the coordinator with the GRPC_CLIENT_KEEPALIVE parameter. You set the interval at which the checks are made in the GRPC_SERVER_KEEPALIVE_TIME and GRPC_CLIENT_KEEPALIVE_TIME parameters, and the time within which a response is expected in the GRPC_SERVER_KEEPALIVE_TIMEOUT and GRPC_CLIENT_KEEPALIVE_TIMEOUT parameters.

Client-side keepalive checks are subject to a mandatory enforcement policy. If the client pings too frequently, the connection is dropped with an ENHANCE_YOUR_CALM(too_many_pings) error. When you enable client-side keepalive checks, the SNMP Poller automatically sets the enforcement policy's minimum ping interval to the value of GRPC_CLIENT_KEEPALIVE_TIME minus the value of GRPC_CLIENT_KEEPALIVE_TIMEOUT.
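For example, the following --set arguments enable client-side keepalive checks with the documented default timings, assuming these global parameters sit directly under configData. With these values, the enforcement policy allows pings no more often than every 25 seconds (30s minus 5s):

--set configData.GRPC_CLIENT_KEEPALIVE=true --set configData.GRPC_CLIENT_KEEPALIVE_TIME=30s --set configData.GRPC_CLIENT_KEEPALIVE_TIMEOUT=5s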

Default SNMP Poller Coordinator Configuration

The following table describes the default configuration parameters for coordinators found in the Helm chart under configData for the microservice.

Name Default Value Possible Values Notes
LOG_LEVEL INFO FATAL, ERROR, WARN, INFO, DEBUG Coordinator logging level. This overrides the global configuration.
POLLER_RESYNC_PERIOD 15m Integer + Text (ns, us, µs, ms, s, m, h) The time to wait before the coordinator re-synchronizes with the Unified Assurance database.
DISCOVERY_WORKERS_PERCENTAGE 25 Integer, 0 up to 100 The percentage of workers allocated to perform discovery workloads exclusively.
WORKER_CONCURRENCY 2000 Integer, greater than 0 The number of concurrent SNMP workloads that a single worker instance can perform.
WORKER_STREAM_FAILURE_THRESHOLD 5 Integer, greater than 0 The number of stream reconnections within the timeframe specified in WORKER_STREAM_FAILURE_WINDOW before the worker is forced to restart.
WORKER_STREAM_FAILURE_WINDOW 30m Integer + Text (ns, us, µs, ms, s, m, h) The timeframe during which stream reconnections are counted before forcing the worker to restart.
PULSAR_SNMP_DISCOVERY_TOPIC_OVERRIDE "" Text Override for the topic from which the coordinator listens for discovery workload requests.
REDUNDANCY_INIT_DELAY 20s Integer + Text (ns, us, µs, ms, s, m, h) At startup, the amount of time to wait for the primary microservice to come up before initiating redundancy.
REDUNDANCY_POLL_PERIOD 5s Integer + Text (ns, us, µs, ms, s, m, h) The amount of time between status checks from the secondary microservice to the primary microservice.
REDUNDANCY_FAILOVER_THRESHOLD 4 Integer, greater than 0 The number of times the primary microservice must fail checks before the secondary microservice becomes active.
REDUNDANCY_FALLBACK_THRESHOLD 1 Integer, greater than 0 The number of times the primary microservice must succeed checks before the secondary microservice becomes inactive.
PROBE_V2_SUPPORT_ENABLED "" Bool Whether to enable SNMP probe v2c (true) or v1 (false) for v2c-enabled devices during device discovery. If no value is provided, the default is false.

Default SNMP Poller Worker Configuration

The following table describes the default configuration parameters for workers found in the Helm chart under configData for the microservice.

Name Default Value Possible Values Notes
LOG_LEVEL INFO FATAL, ERROR, WARN, INFO, DEBUG Worker logging level. This overrides the global configuration.
GRPC_GRACEFUL_CONN_TIME 60s Integer + Text (ns, us, µs, ms, s, m, h) The amount of time the workers should try to connect with the coordinator before failing.
STREAM_OUTPUT_METRIC "" Text Override for the topic where performance polling workload results are published.
STREAM_OUTPUT_AVAILABILITY "" Text Override for the topic where availability polling workload results are published.
PULSAR_DISCOVERY_CALLBACK_OVERRIDE "" Text Override for the topic where discovery workload results are published.

SNMP Poller Autoscaling Configuration

Autoscaling is supported for the SNMP Poller microservice. See Configuring Autoscaling for general information and details about the standard autoscaling configurations.

For SNMP Poller, KEDA also uses the snmp_coordinator_metric_required_total_workers Prometheus metric to make scaling decisions. This metric is set dynamically: the SNMP Poller microservice coordinator assigns polling and discovery workers during resynchronization with the Unified Assurance database, and the required numbers are based on polling throughput and your configuration settings as follows:

When you deploy the SNMP Poller microservice with autoscaling enabled, you must also calculate the total number of workers required, based on the expected number of devices to be polled in the device zone, to determine the value to use for the maxReplicaCount autoscaling configuration setting.

For example:
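As an illustrative calculation only, assume one concurrent polling workload per device. With the default WORKER_CONCURRENCY of 2000, polling 150,000 devices requires 75 polling workers. With the default DISCOVERY_WORKERS_PERCENTAGE of 25, polling workers account for 75 percent of the total, so the total is 75 / 0.75 = 100 workers, suggesting a maxReplicaCount of at least 100. Confirm the actual workload-per-device ratio for your environment before sizing.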

Modifying Scaling Triggers

By default, only the snmp_coordinator_metric_required_total_workers metric is configured as an autoscaling trigger. You can define additional triggers in the Helm chart by adding them under the triggers section, as shown in the sketch after the default configuration below.

For example, the default trigger configuration is:

autoscaling:
  ...
  triggers:
    - type: prometheus
      metadata:
        metricName: required_total_workers
        serverAddress: http://prometheus-kube-prometheus-prometheus.a1-monitoring.svc.cluster.local:9090
        query: snmp_coordinator_metric_required_total_workers
        threshold: '1'
        metricType: Value
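As a sketch of adding a second trigger, the following configuration adds a standard KEDA CPU trigger alongside the default Prometheus trigger. The 80 percent utilization threshold is only an illustration; choose a value appropriate for your environment:

autoscaling:
  ...
  triggers:
    - type: prometheus
      metadata:
        metricName: required_total_workers
        serverAddress: http://prometheus-kube-prometheus-prometheus.a1-monitoring.svc.cluster.local:9090
        query: snmp_coordinator_metric_required_total_workers
        threshold: '1'
        metricType: Value
    - type: cpu
      metricType: Utilization
      metadata:
        value: "80"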

SNMP Poller Self-Monitoring Metrics

The SNMP Poller microservice exposes the self-monitoring metrics for coordinators described in the following table to Prometheus.

Each of the metrics in the table is prefixed with snmp_coordinator when exposed to Prometheus. For example, the full name of the first metric is snmp_coordinator_metric_worker_count.

Metric Name Type Labels Description
metric_worker_count Gauge N/A The number of workers currently enrolled with the coordinator.
metric_workforce_count Gauge N/A The number of workers multiplied by worker concurrency.
metric_discovery_worker_count Gauge N/A The number of discovery workers currently enrolled with the coordinator.
metric_polling_worker_count Gauge N/A The number of polling workers currently enrolled with the coordinator.
metric_required_discovery_workers Gauge N/A The number of workers required for discovery when using autoscaling. Only available when autoscaling is enabled.
metric_required_polling_workers Gauge N/A The number of workers required for polling when using autoscaling. Only available when autoscaling is enabled.
metric_required_total_workers Gauge N/A The number of workers required for polling and discovery when using autoscaling. Only available when autoscaling is enabled.
metric_discovery_requests_queued Gauge N/A The number of discovery requests currently queued (real-time value).
metric_discovery_requests_processing Gauge N/A The number of discovery requests currently being processed (real-time value).
metric_polling_requests_queued Gauge N/A The number of polling requests currently queued (real-time value).
metric_polling_requests_processing Gauge N/A The number of polling requests currently being processed (real-time value).
metric_polled_devices_count GaugeVec domain, cycle The number of polled devices per domain and cycle.
metric_polled_objects_count GaugeVec domain, cycle The number of polled objects per domain and cycle.
metric_polling_duration GaugeVec domain, cycle The total polling duration, in seconds, for the last cycle, per domain and cycle.
metric_polling_average GaugeVec domain, cycle The average polling duration, in seconds, for the last cycle, per domain and cycle.
metric_polling_average95 GaugeVec domain, cycle The 95th percentile average polling duration, in seconds, for the last cycle, per domain and cycle.
metric_polling_utilisation GaugeVec domain, cycle The polling utilisation, as a percentage, for the last cycle, per domain and cycle.
metric_polling_utilisation95 GaugeVec domain, cycle The 95th percentile polling utilisation, as a percentage, for the last cycle, per domain and cycle.
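For example, assuming the snmp_coordinator prefix described above, the following Prometheus query returns the number of polled devices per domain, summed across cycles:

sum by (domain) (snmp_coordinator_metric_polled_devices_count)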

Note:

Metric names in the database include a prefix that indicates the service that inserted them. The prefix is prom_ for metrics inserted by Prometheus. For example, metric_worker_count is stored as prom_metric_worker_count in the database.