SNMP Poller
The SNMP Poller microservice discovers SNMP devices for the Discovery Service microservice and periodically polls the discovered devices for availability.
This microservice is part of the Discovery microservice pipeline. It uses a worker-coordinator design to balance workloads and allow for scaling. You deploy instances of this microservice and others in the same pipeline to separate clusters for each device zone. See Understanding the Discovery Pipeline in Unified Assurance Concepts for conceptual information.
You can enable redundancy for this microservice when you deploy it. See Configuring Microservice Redundancy for general information.
Autoscaling is supported but disabled by default for this microservice. You can optionally enable autoscaling when you deploy the microservice. See Configuring Autoscaling and SNMP Poller Autoscaling Configuration.
This microservice provides additional Prometheus monitoring metrics. See SNMP Poller Self-Monitoring Metrics.
SNMP Poller Prerequisites
Before deploying the microservice, confirm that the following prerequisites are met:
- A microservice cluster is set up. See Microservice Cluster Setup.
- The following microservices are deployed:
Deploying SNMP Poller
To deploy the microservice, run the following commands:
su - assure1
export NAMESPACE=<namespace>
export WEBFQDN=<WebFQDN>
a1helm install <microservice-release-name> assure1/snmp-poller -n $NAMESPACE --set global.imageRegistry=$WEBFQDN
In the commands:
- <namespace> is the namespace where you are deploying the microservice. The default namespace is a1-zone1-pri, but you can change the zone number and, when deploying to a redundant cluster, change pri to sec.
- <WebFQDN> is the fully-qualified domain name of the primary presentation server for the cluster.
- <microservice-release-name> is the name to use for the microservice instance. Oracle recommends using the microservice name (snmp-poller) unless you are deploying multiple instances of the microservice to the same cluster.
You can also use the Unified Assurance UI to deploy microservices. See Deploying a Microservice by Using the UI for more information.
Changing SNMP Poller Configuration Parameters
When running the install command, you can optionally change default configuration parameter values by including them in the command with additional --set arguments. You can add as many additional --set arguments as you need.
For example:
- Set a global parameter described in Default Global SNMP Poller Configuration by adding --set configData.<parameter_name>=<parameter_value>. For example, --set configData.LOG_LEVEL=DEBUG.
- Set a coordinator-specific or worker-specific parameter by prefixing configData in the argument with coordinator or worker. For example, --set coordinator.configData.LOG_LEVEL=DEBUG. This overrides the global parameter.
- Enable redundancy for the microservice by adding --set redundancy=enabled.
- Enable client-side (worker) or server-side (coordinator) keep-alive checks by adding --set configData.GRPC_CLIENT_KEEPALIVE=true or --set configData.GRPC_SERVER_KEEPALIVE=true. See About Keep-Alive Configurations.
- Enable autoscaling for the microservice and set the maximum replica count to an appropriate value for your environment by adding --set autoscaling.enabled=true --set autoscaling.maxReplicaCount=<N>. See SNMP Poller Autoscaling Configuration for information about choosing an appropriate value for <N>.
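For example, the following single install command (an illustrative sketch, assuming the release name snmp-poller and the environment variables from Deploying SNMP Poller) enables redundancy, both keep-alive checks, and autoscaling with a maximum of 63 replicas:
a1helm install snmp-poller assure1/snmp-poller -n $NAMESPACE --set global.imageRegistry=$WEBFQDN --set redundancy=enabled --set configData.GRPC_CLIENT_KEEPALIVE=true --set configData.GRPC_SERVER_KEEPALIVE=true --set autoscaling.enabled=true --set autoscaling.maxReplicaCount=63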
Default SNMP Poller Configuration
Some SNMP Poller configurations apply to both workers and coordinators, some apply only to coordinators, and some apply only to workers. Parameters set specifically for workers or coordinators override the global parameters. For example, if you set the global log level to DEBUG but the coordinator log level to INFO, the coordinator logs use INFO and the worker logs use DEBUG.
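For example, the following arguments (illustrative only) set the global log level to DEBUG and override it to INFO for the coordinator, so that workers log at DEBUG and the coordinator logs at INFO:
--set configData.LOG_LEVEL=DEBUG --set coordinator.configData.LOG_LEVEL=INFO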
Default Global SNMP Poller Configuration
The following table describes the default global configuration parameters found in the Helm chart under configData for the microservice. These apply to both workers and coordinators.
Name | Default Value | Possible Values | Notes |
---|---|---|---|
LOG_LEVEL | INFO | FATAL, ERROR, WARN, INFO, DEBUG | Global logging level for the coordinator and workers. Any setting at the worker or coordinator level overrides this. |
GRPC_CONN_DOWN_DEADLINE | 5s | Integer + Text (ns, us, µs, ms, s, m, h) | The time period to wait for a gRPC connection before it is considered failed. |
GRPC_CLIENT_KEEPALIVE | false | Text (true or false) | Whether to use client-side keepalive checks, sent from the workers, to validate communication with the coordinator. See About Keep-Alive Configurations for information about the keepalive parameters. |
GRPC_CLIENT_KEEPALIVE_TIME | 30s | Integer + Text (ns, us, µs, ms, s, m, h) | The time period, after no communication, to ping the server (coordinator). |
GRPC_CLIENT_KEEPALIVE_TIMEOUT | 5s | Integer + Text (ns, us, µs, ms, s, m, h) | The time period to wait for a response to the ping before the server connection is considered down. |
GRPC_SERVER_KEEPALIVE | false | Text (true or false) | Whether to use server-side keepalive checks, sent from the coordinator, to validate communication with the workers. |
GRPC_SERVER_KEEPALIVE_TIME | 30s | Integer + Text (ns, us, µs, ms, s, m, h) | The time period, after no communication, to ping the clients (workers). |
GRPC_SERVER_KEEPALIVE_TIMEOUT | 5s | Integer + Text (ns, us, µs, ms, s, m, h) | The time period to wait for a response to the ping before the client connection is considered down. |
About Keep-Alive Configurations
By default, the coordinator and individual workers periodically send heartbeat messages to each other, with no validation, to check that the connection is not idle. To validate the connection, you can optionally enable ping-based gRPC keep-alive checks, which expect a response within a configurable timeframe. If no response is received, the connection is considered down and the workers attempt to reestablish communication.
In the SNMP Poller microservice, the coordinator acts as the gRPC server and the workers act as clients. You enable keep-alive checks from the coordinator to the workers with the GRPC_SERVER_KEEPALIVE parameter and from the workers to the coordinator with the GRPC_CLIENT_KEEPALIVE parameter. You set the interval at which the checks are made in the GRPC_SERVER_KEEPALIVE_TIME and GRPC_CLIENT_KEEPALIVE_TIME parameters, and the time within which a response is expected in the GRPC_SERVER_KEEPALIVE_TIMEOUT and GRPC_CLIENT_KEEPALIVE_TIMEOUT parameters.
Client-side keep-alive checks have mandatory enforcement policies. If the client pings too frequently, the connection is dropped with an ENHANCE_YOUR_CALM(too_many_pings) error. When you enable client-side keep-alive checks, the SNMP Poller automatically sets the enforcement policy so that pings are permitted no more frequently than once per the value of GRPC_CLIENT_KEEPALIVE_TIME minus the value of GRPC_CLIENT_KEEPALIVE_TIMEOUT.
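For example, with the default values shown in Default Global SNMP Poller Configuration, the following arguments (an illustrative sketch) enable client-side checks that ping the coordinator after 30 seconds without communication and expect a response within 5 seconds, so the enforcement policy permits pings no more often than every 25 seconds:
--set configData.GRPC_CLIENT_KEEPALIVE=true --set configData.GRPC_CLIENT_KEEPALIVE_TIME=30s --set configData.GRPC_CLIENT_KEEPALIVE_TIMEOUT=5s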
Default SNMP Poller Coordinator Configuration
The following table describes the default configuration parameters for coordinators found in the Helm chart under configData for the microservice.
Name | Default Value | Possible Values | Notes |
---|---|---|---|
LOG_LEVEL | INFO | FATAL, ERROR, WARN, INFO, DEBUG | Coordinator logging level. This overrides the global configuration. |
POLLER_RESYNC_PERIOD | 15m | Integer + Text (ns, us, µs, ms, s, m, h) | The time to wait before the coordinator re-synchronizes with the Unified Assurance database. |
DISCOVERY_WORKERS_PERCENTAGE | 25 | Integer, 0 up to 100 | The percentage of workers allocated to perform discovery workloads exclusively. |
WORKER_CONCURRENCY | 2000 | Integer, greater than 0 | The number of concurrent SNMP workloads that a single worker instance can perform. |
WORKER_STREAM_FAILURE_THRESHOLD | 5 | Integer, greater than 0 | The number of reconnections within the timeframe specified in WORKER_STREAM_FAILURE_WINDOW before the worker is forced to restart. |
WORKER_STREAM_FAILURE_WINDOW | 30m | Integer + Text (ns, us, µs, ms, s, m, h) | The timeframe during which reconnections are counted before the worker is forced to restart. |
PULSAR_SNMP_DISCOVERY_TOPIC_OVERRIDE | "" | Text | Override for the topic from which the coordinator listens for discovery workload requests. |
REDUNDANCY_INIT_DELAY | 20s | Integer + Text (ns, us, µs, ms, s, m, h) | At startup, the amount of time to wait for the primary microservice to come up before initiating redundancy. |
REDUNDANCY_POLL_PERIOD | 5s | Integer + Text (ns, us, µs, ms, s, m, h) | The amount of time between status checks from the secondary microservice to the primary microservice. |
REDUNDANCY_FAILOVER_THRESHOLD | 4 | Integer, greater than 0 | The number of times the primary microservice must fail checks before the secondary microservice becomes active. |
REDUNDANCY_FALLBACK_THRESHOLD | 1 | Integer, greater than 0 | The number of times the primary microservice must succeed checks before the secondary microservice becomes inactive. |
PROBE_V2_SUPPORT_ENABLED | "" | Bool | Whether to enable SNMP probe v2c (true) or v1 (false) for v2c-enabled devices during device discovery. If no value is provided, the default is false. |
Default SNMP Poller Worker Configuration
The following table describes the default configuration parameters for workers found in the Helm chart under configData for the microservice.
Name | Default Value | Possible Values | Notes |
---|---|---|---|
LOG_LEVEL | INFO | FATAL, ERROR, WARN, INFO, DEBUG | Worker logging level. This overrides the global configuration. |
GRPC_GRACEFUL_CONN_TIME | 60s | Integer + Text (ns, us, µs, ms, s, m, h) | The amount of time the workers should try to connect with the coordinator before failing. |
STREAM_OUTPUT_METRIC | "" | Text | Override for the topic where performance polling workload results are published. |
STREAM_OUTPUT_AVAILABILITY | "" | Text | Override for the topic where availability polling workload results are published. |
PULSAR_DISCOVERY_CALLBACK_OVERRIDE | "" | Text | Override for the topic where discovery workload results are published. |
SNMP Poller Autoscaling Configuration
Autoscaling is supported for the SNMP Poller microservice. See Configuring Autoscaling for general information and details about the standard autoscaling configurations.
For SNMP Poller, KEDA also uses the snmp_coordinator_metric_required_total_workers Prometheus metric to make scaling decisions. This metric is set dynamically: the SNMP Poller microservice coordinator assigns polling and discovery workers during resynchronization with the Unified Assurance database. The numbers are based on polling throughput and your configuration settings, as follows:
- Number of polling workers: The number of unique devices being polled divided by the WORKER_CONCURRENCY value. The result is rounded up to the nearest whole number.
- Number of discovery workers: The number of polling workers, multiplied by the DISCOVERY_WORKERS_PERCENTAGE value, divided by 100. The result is rounded up to the nearest whole number.
- Total workers: The number of polling workers plus the number of discovery workers. This is exposed as the snmp_coordinator_metric_required_total_workers Prometheus metric.
When you deploy the SNMP Poller microservice with autoscaling enabled, you must also calculate the total number of workers required, based on the expected number of devices to be polled in the device zone, to determine the value to use for the maxReplicaCount autoscaling configuration setting.
For example:
- For 100,000 polled devices, when WORKER_CONCURRENCY is set to 2000 and DISCOVERY_WORKERS_PERCENTAGE is set to 25:
  - Total required workers and maxReplicaCount value: 63
  - Polling workers: 50
  - Discovery workers: 13
- For 250,000 polled devices, when WORKER_CONCURRENCY is set to 3000 and DISCOVERY_WORKERS_PERCENTAGE is set to 33:
  - Total required workers and maxReplicaCount value: 112
  - Polling workers: 84
  - Discovery workers: 28
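The following shell sketch (illustrative only; the variable names are hypothetical) reproduces the first example above using ceiling division:
DEVICES=100000
WORKER_CONCURRENCY=2000
DISCOVERY_WORKERS_PERCENTAGE=25
# Polling workers: devices divided by concurrency, rounded up
POLLING_WORKERS=$(( (DEVICES + WORKER_CONCURRENCY - 1) / WORKER_CONCURRENCY ))
# Discovery workers: polling workers multiplied by the percentage, divided by 100, rounded up
DISCOVERY_WORKERS=$(( (POLLING_WORKERS * DISCOVERY_WORKERS_PERCENTAGE + 99) / 100 ))
# Total workers, which is the value to use for maxReplicaCount
TOTAL_WORKERS=$(( POLLING_WORKERS + DISCOVERY_WORKERS ))
echo "maxReplicaCount: $TOTAL_WORKERS"   # Prints 63 for this example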
Modifying Scaling Triggers
By default, only the snmp_coordinator_metric_required_total_workers metric is configured as an autoscaling trigger. You can define additional triggers in the Helm chart by adding them under the triggers section.
For example, the default trigger configuration is:
autoscaling:
  ...
  triggers:
    - type: prometheus
      metadata:
        metricName: required_total_workers
        serverAddress: http://prometheus-kube-prometheus-prometheus.a1-monitoring.svc.cluster.local:9090
        query: snmp_coordinator_metric_required_total_workers
        threshold: '1'
      metricType: Value
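For example, the following sketch creates a values file that keeps the default Prometheus trigger and adds a CPU-utilization trigger, then passes it at install time. The cpu trigger type and the -f values-file option are assumptions, and because Helm replaces list values rather than merging them, the default trigger is repeated in the file:
cat > snmp-poller-triggers.yaml <<'EOF'
autoscaling:
  enabled: true
  triggers:
    - type: prometheus
      metadata:
        metricName: required_total_workers
        serverAddress: http://prometheus-kube-prometheus-prometheus.a1-monitoring.svc.cluster.local:9090
        query: snmp_coordinator_metric_required_total_workers
        threshold: '1'
      metricType: Value
    - type: cpu
      metricType: Utilization
      metadata:
        value: "80"
EOF
a1helm install snmp-poller assure1/snmp-poller -n $NAMESPACE --set global.imageRegistry=$WEBFQDN -f snmp-poller-triggers.yaml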
SNMP Poller Self-Monitoring Metrics
The SNMP Poller microservice exposes the self-monitoring metrics for coordinators described in the following table to Prometheus.
Each of the metrics in the table is prefixed with snmp_coordinator in the database. For example, the full metric name in the database for the first metric is snmp_coordinator_metric_worker_count.
Metric Name | Type | Labels | Description |
---|---|---|---|
metric_worker_count | Gauge | N/A | The number of workers currently enrolled with the coordinator. |
metric_workforce_count | Gauge | N/A | The number of workers multiplied by worker concurrency. |
metric_discovery_worker_count | Gauge | N/A | The number of discovery workers currently enrolled with the coordinator. |
metric_polling_worker_count | Gauge | N/A | The number of polling workers currently enrolled with the coordinator. |
metric_required_discovery_workers | Gauge | N/A | The number of workers required for discovery when using autoscaling. Only available when autoscaling is enabled. |
metric_required_polling_workers | Gauge | N/A | The number of workers required for polling when using autoscaling. Only available when autoscaling is enabled. |
metric_required_total_workers | Gauge | N/A | The number of workers required for polling and discovery when using autoscaling. Only available when autoscaling is enabled. |
metric_discovery_requests_queued | Gauge | N/A | The real-time number of discovery requests currently queued. |
metric_discovery_requests_processing | Gauge | N/A | The real-time number of discovery requests currently being processed. |
metric_polling_requests_queued | Gauge | N/A | The real-time number of polling requests currently queued. |
metric_polling_requests_processing | Gauge | N/A | The real-time number of polling requests currently being processed. |
metric_polled_devices_count | GaugeVec | domain, cycle | The number of polled devices per domain and cycle. |
metric_polled_objects_count | GaugeVec | domain, cycle | The number of polled objects per domain and per cycle. |
metric_polling_duration | GaugeVec | domain, cycle | The total polling duration in seconds for the last cycle, per domain and per cycle. |
metric_polling_average | GaugeVec | domain, cycle | The average polling duration in seconds for the last cycle, per domain and per cycle. |
metric_polling_average95 | GaugeVec | domain, cycle | The 95th percentile average polling duration in seconds for the last cycle, per domain and per cycle. |
metric_polling_utilisation | GaugeVec | domain, cycle | The polling utilisation in percent for the last cycle, per domain and per cycle. |
metric_polling_utilisation95 | GaugeVec | domain, cycle | The 95th percentile polling utilisation in percent for the last cycle, per domain and per cycle. |
Note:
Metric names in the database include a prefix that indicates the service that inserted them. The prefix is prom_ for metrics inserted by Prometheus. For example, metric_worker_count is stored as prom_metric_worker_count in the database.
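For example, you can query one of these metrics directly from Prometheus (an illustrative sketch, assuming the Prometheus server address shown in the default trigger configuration above and that the Prometheus HTTP API is reachable from where you run the command):
curl -s http://prometheus-kube-prometheus-prometheus.a1-monitoring.svc.cluster.local:9090/api/v1/query --data-urlencode 'query=snmp_coordinator_metric_worker_count'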