2 Planning the Installation
This chapter provides information about planning your installation.
Oracle Linux Automation Manager Node Architecture
- Control Nodes: These nodes provide management functions such as launching system jobs, inventory updates, and project synchronizations. Control nodes use ansible-runner, which in turn uses Podman to run jobs within the Control Plane Execution Environment. The Control Plane Execution Environment references the olam-ee container image found on the Oracle Linux Container Registry. Control nodes do not run standard jobs.
- Execution Nodes: These nodes run standard jobs using ansible-runner, which in turn uses Podman to run playbooks within the OLAM EE execution environment. The OLAM EE execution environment references the olam-ee container image found on the Oracle Linux Container Registry. Execution nodes do not run management jobs. Execution nodes can also run custom execution environments that you can create using the Builder utility. For more information about custom execution environments, see Oracle Linux Automation Manager 2.1: Private Automation Hub Installation Guide. For more information about using Podman and the Oracle Linux Container Registry, see Oracle Linux: Podman User's Guide.
- Hybrid Nodes: Hybrid nodes combine the functions of both control nodes and execution nodes into one node. Hybrid nodes are supported in single host Oracle Linux Automation Manager deployments, but not in clustered multi-host deployments.
- Hop Nodes: You can use hop nodes to connect control nodes to execution nodes within a cluster. Hop nodes cannot run playbooks and do not appear in instance groups. However, they do appear as part of the service mesh.
- Directly Managed Instances: Directly managed instances (nodes) are any virtual, physical, or software instances that Oracle Linux Automation Manager manages over an ssh connection.
- Indirectly Managed Instances: Indirectly managed instances (nodes) include any identifiable instance not directly connected to Oracle Linux Automation Manager, but managed by a device that is directly connected to Oracle Linux Automation Manager.
For example, the following image illustrates all node and instance types and some of the ways that they can be related to one another.

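For example, you can pull the olam-ee image with Podman before configuring nodes. The following command is a sketch only; the repository path and tag are assumptions based on the Oracle Linux Container Registry and should be confirmed on that registry before use.

podman pull container-registry.oracle.com/oracle_linux_automation_manager/olam-ee:latest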
Installation Options
- Standalone Installation: All components are on the same host, including the database.
Figure 2-1 Standalone Installation with Local Database
- Standalone Installation with Remote Database: All components are on the same host, with the exception of the database, which is on a remote host.
Figure 2-2 Standalone Installation with Remote Database
- Clustered Installation with Remote Database: A clustered installation can contain up to 20 nodes, with one or more control nodes, one or more execution nodes, and one or more hop nodes, all connected to one database. For example, the following figure shows a cluster with two control plane nodes and two execution plane nodes, each on separate hosts, and all of them connected to a remote database.
Figure 2-3 Clustered Installation with Remote Database
Service Mesh Topology Examples
There are a variety of ways that you can configure the Oracle Linux Automation Manager Service Mesh topology.
Example 1: Design the Service Mesh so that you have at least one backup control plane node and one backup execution plane node, for example, two control nodes and two execution nodes. Each execution plane node communicates with both control plane nodes in case one of the control plane nodes fails. If the first execution plane node fails, the control plane nodes switch to the second execution plane node.
Figure 2-4 Clustered Installation with Remote Database

- Configure the /etc/receptor/receptor.conf file with the Node ID, tcp-listener, and tcp-peer addresses as required for each node (a configuration sketch appears after this procedure). For more information about this task, see Configuring and Starting the Control Plane Service Mesh and Configuring and Starting the Execution Plane Service Mesh.
- From a control plane node, log in as the awx user and run the awx-manage command to do the following:
- Provision each host's IP address or host name, and designate it as a control plane or execution plane node type. For example, the following commands provision the two control plane and two execution plane nodes illustrated in the figure above:
awx-manage provision_instance --hostname=192.0.121.28 --node_type=control
awx-manage provision_instance --hostname=192.0.126.72 --node_type=control
awx-manage provision_instance --hostname=192.0.113.178 --node_type=execution
awx-manage provision_instance --hostname=192.0.127.70 --node_type=execution
- Register each node to either the controlplane or the execution instance group, based on the type of node you designated for each node. The awx-manage command refers to instance groups as queue names. For example, the following commands create the controlplane and execution instance groups and associate the two control plane and two execution plane nodes with each instance group as illustrated in the figure above:
awx-manage register_queue --queuename=controlplane --hostnames=192.0.121.28
awx-manage register_queue --queuename=controlplane --hostnames=192.0.126.72
awx-manage register_queue --queuename=execution --hostnames=192.0.113.178
awx-manage register_queue --queuename=execution --hostnames=192.0.127.70
- Register the peer relationship between each node. Note that when you register a peer relationship from a source IP address to a target IP address, the peer relationship establishes bidirectional communication. For example, the following commands register the host IP addresses of the execution nodes as the sources and each tcp-peer connection, which are the control plane nodes, as the targets:
awx-manage register_peers 192.0.113.178 --peers 192.0.121.28
awx-manage register_peers 192.0.113.178 --peers 192.0.126.72
awx-manage register_peers 192.0.127.70 --peers 192.0.121.28
awx-manage register_peers 192.0.127.70 --peers 192.0.126.72
The command must be run for each link you want to establish between nodes.
- Register each instance group as the default queue name for either the control plane or the execution plane. This ensures that only control type jobs go to the control plane instance group and only Oracle Linux Automation Engine jobs go to the execution plane instance group. To do this, you must edit the /etc/tower/settings.py file and set the DEFAULT_EXECUTION_QUEUE_NAME and DEFAULT_CONTROL_PLANE_QUEUE_NAME parameters (a sketch of these settings follows this procedure).
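For illustration, the default queue name settings in /etc/tower/settings.py might look like the following, assuming the controlplane and execution instance group names used in this example; this is a sketch only, and the rest of the file remains unchanged:

DEFAULT_CONTROL_PLANE_QUEUE_NAME = 'controlplane'
DEFAULT_EXECUTION_QUEUE_NAME = 'execution'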
For more information about these steps, see Configuring the Control, Execution, and Hop Nodes.
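As an illustration of the first step in the preceding procedure, the following is a minimal sketch of an /etc/receptor/receptor.conf file for the execution plane node 192.0.113.178, assuming the YAML layout used by the Receptor service and port 27199 for the service mesh. The complete file contains additional entries that are described in the referenced configuration tasks.

---
- node:
    id: 192.0.113.178

- tcp-listener:
    port: 27199

- tcp-peer:
    address: 192.0.121.28:27199

- tcp-peer:
    address: 192.0.126.72:27199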
- In some cases, you might have an execution node that cannot be directly connected to a control plane node. In such cases, you can connect the execution node to another execution node that is connected to the control node. This introduces a risk: if the intermediate execution node fails, the connected execution node becomes inaccessible to the control node.
- In some cases, you might have an execution node that cannot be directly connected to a control plane node. In such cases, you can connect the execution node to a hop node that is connected to the control node (a sketch of this configuration follows this list). As with the previous case, if the intermediate hop node fails, the connected execution node becomes inaccessible to the control node.
- Consider establishing a peer relationship between control plane nodes. This ensures that control plane nodes are always directly accessible to one another. If no such relationship is established, control plane nodes are aware of each other only through connected execution plane nodes. For example, control node A connects to control node B through execution node A, which is connected to both.
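For example, the following is a hedged sketch of how you might provision a hop node and route an execution node through it, using the same awx-manage commands as in the preceding procedure. The hop node address 192.0.114.50 is a hypothetical placeholder, and the sketch assumes that the hop node type is accepted by the provision_instance command in your release; hop nodes are not registered to an instance group.

awx-manage provision_instance --hostname=192.0.114.50 --node_type=hop
awx-manage register_peers 192.0.114.50 --peers 192.0.121.28
awx-manage register_peers 192.0.113.178 --peers 192.0.114.50

Here, 192.0.121.28 is one of the control plane nodes from the example and 192.0.113.178 is the execution node that is reachable only through the hop node.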
Tuning Instances for Playbook Duration
Oracle Linux Automation Manager monitors jobs for status changes. For example, some job statuses are Running, Successful, Failed, Waiting, and so on. Normally the playbook being run triggers status changes as it makes progress in various ways. However, in some cases, the playbook will get stuck in the Running or Waiting state. When this happens, a reaper process automatically changes the state of the task from Running or Waiting to Failed. The default timer for when the reaper changes the status of a stuck job to the Failed state is 60 seconds.
If you have jobs that are designed to run longer than 60 seconds, modify the REAPER_TIMEOUT_SEC parameter in the /etc/tower/settings.py file. Specify a time, in seconds, that is longer than the expected duration of your longest running playbook. This avoids scenarios where the reaper mistakenly sets a long running playbook to the Failed state because the REAPER_TIMEOUT_SEC value has been exceeded.
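For example, the following line in /etc/tower/settings.py raises the reaper timeout to 10 minutes. The value shown is an illustrative assumption, not a recommendation; choose a value based on your own longest running playbooks.

REAPER_TIMEOUT_SEC = 600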
A possible scenario could occur if you run many short and long duration playbooks together with a reaper that has a long timeout value. If one or more of the short duration playbooks run for longer than expected (for example, because a network outage makes it impossible for these playbooks to complete), the reaper continues to track the status of the stuck short duration playbooks until they either get unstuck and transition to the Successful state or the reaper timeout value is reached. This scenario should cause no performance difficulties if only a few such failures occur. However, if hundreds of such failures occur at the same time, Oracle Linux Automation Manager wastes resources tracking the stuck jobs, which can degrade the performance of the host processing them.
For more information about setting the REAPER_TIMEOUT_SEC parameter, see Setting up Hosts.