The environment was working as expected and then we restarted the server and the agents did not automatically reconnect. Doing an un-manage/manage caused them to start working again. What causes this?
To answer this, it is important to understand what should happen. The same sequence applies to both the agent and the connection manager, so for simplicity we will refer to either as the "agent".
The first thing the agent does is a registration request. The agent sends a notification (UDP) to the workgroup server (WGS) to say it is running. When received, the WGS sends an acknowledgement back to the agent.
The WGS then connects to the agent (TCP), starts a discovery, and collects the list of objects to manage. For connection manager, the list of managers comes from the WGS configuration and is sent to the agent. In agent mode, the list is sent from the agent to the WGS. A full discovery is then initiated for all objects on each of those managers. An un-manage followed by a manage causes the same process to occur.
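The UDP part of this sequence can be sketched in Python as a minimal illustration. This is not the product's wire protocol: the payloads `REGISTER demo-agent` and `ACK`, and the helper names `run_stub_wgs` and `register`, are hypothetical stand-ins. The sketch only shows the shape of the exchange, a datagram sent to the WGS and an acknowledgement coming back, and what the agent sees when nothing answers.

```python
import socket
import threading


def run_stub_wgs(sock: socket.socket) -> None:
    """Stand-in for the WGS: wait for one datagram and acknowledge it."""
    data, addr = sock.recvfrom(1024)
    # Hypothetical acknowledgement format, for illustration only.
    sock.sendto(b"ACK " + data, addr)


def register(host: str, port: int, timeout: float = 1.0) -> bool:
    """Send one registration datagram; return True if an acknowledgement arrives."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as agent:
        agent.settimeout(timeout)
        # Hypothetical registration payload, not the real message.
        agent.sendto(b"REGISTER demo-agent", (host, port))
        try:
            reply, _ = agent.recvfrom(1024)
        except (socket.timeout, ConnectionError):
            # UDP gives no delivery guarantee: a blocked or closed port
            # usually shows up only as silence (or a reset on some platforms).
            return False
        return reply.startswith(b"ACK")


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as wgs:
        wgs.bind(("127.0.0.1", 0))  # any free local port for the demo
        port = wgs.getsockname()[1]
        threading.Thread(target=run_stub_wgs, args=(wgs,), daemon=True).start()
        print("acknowledged:", register("127.0.0.1", port))
```

Because the sender learns nothing unless a reply arrives, a dropped registration datagram looks the same to the agent as a WGS that never answered.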
When this process doesn't work, it indicates a communication issue. If un-manage and manage makes it work again, this eliminates connectivity issues on the TCP connection between the WGS and the agent. This means that the most likely cause is an issue with the registration request.
There are a few possible reasons:
- The address of the WGS configured for the agent is wrong: The config/groups/mqgroup.ini file contains the names and locations of the WGS it is to connect to. When starting, it will register with any that it is associated with.
- Running at the console, you will see the following message for each registration request:
IMI0014 Registration sent to WORKGROUP("MQM") at localhost(4010)
- If these do not match the expected WGS, host, and port, change them as required.
- The registration request was blocked: Registration is done using UDP. Many networks restrict this via firewall settings. As noted above, the agent must be able to reach the WGS on UDP port 4010.
- Testing with telnet or ping between the hosts is not sufficient since these do not use UDP
- As above, a successful acknowledgement produces this message at the console:
IMI0015 Confirmation received from WORKGROUP("MQM") 127.0.0.1(4010)
- If this is not present, the registration was likely not successful.
- Registration request was not authorized: A request to register must pass authorization checks within the WGS. The specific security right is "Node Registration".
- By default, this is assigned to the *agents group and the user registering needs to be part of this group
- The agent will connect using the login id or service account it is running under in order to register.
- In most cases, this is the account used to run the command
- Windows services can run as SYSTEM
- WGS limits exceeded: For versions prior to 10.4, there is a limit on the total number of nodes (agents) that a WGS can support. When this limit is exceeded, new nodes are not allowed to connect. Increase the Max. number of nodes on the Workgroup Limits tab of the Workgroup properties, from the Workspace dashboard.
- Licensing limits are being reached: This is an unlikely cause, since a licensing limit would not typically be addressed by un-manage/manage. It could appear to be resolved that way, however, if other systems are now down.
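To check the first two causes quickly, the console output can be scanned for registrations that were sent but never confirmed. The following Python sketch assumes the console output has been captured as text lines and that the IMI0014 and IMI0015 messages appear exactly in the form quoted above. The function name `unconfirmed_registrations` is hypothetical, and matching by workgroup name alone is a simplification.

```python
import re

# Patterns based on the two console messages quoted in this article.
SENT = re.compile(
    r'IMI0014 Registration sent to WORKGROUP\("([^"]+)"\) at (\S+?)\((\d+)\)'
)
CONFIRMED = re.compile(
    r'IMI0015 Confirmation received from WORKGROUP\("([^"]+)"\)'
)


def unconfirmed_registrations(lines):
    """Return (workgroup, host, port) for each IMI0014 with no IMI0015."""
    sent = {}
    confirmed = set()
    for line in lines:
        match = SENT.search(line)
        if match:
            # Remember where the registration for this workgroup was sent.
            sent[match.group(1)] = (match.group(2), int(match.group(3)))
            continue
        match = CONFIRMED.search(line)
        if match:
            confirmed.add(match.group(1))
    return [
        (workgroup, host, port)
        for workgroup, (host, port) in sent.items()
        if workgroup not in confirmed
    ]


if __name__ == "__main__":
    console = [
        'IMI0014 Registration sent to WORKGROUP("MQM") at localhost(4010)',
    ]
    for workgroup, host, port in unconfirmed_registrations(console):
        print(f"No confirmation from {workgroup}; check UDP access to {host}:{port}")
```

Any entry returned points at either a wrong WGS address in the agent configuration or a registration that was blocked or rejected before the acknowledgement came back.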
If none of these address the issue, additional investigation may be required.