Agent Auto-Update
Definition

"Agent Auto-Update" refers to the ability of a running agent to automatically update itself, including updates to jars, configuration files, scripts and preference values. An agent should be able to update and restart itself without any manual intervention. Specifically, there should be no need for an administrator to log into the agent machine and perform tasks necessary to complete the update.

Requirements

The following are requirements that must be supported by our agent auto-update feature:
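The Prime Directive

There is one important design decision we are making in our first version of agent auto-update. We call it the Prime Directive and it is this:

All agents talking to a server cloud will be of the same version and an individual server will support talking to only one specific version of an agent.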
Keeping the above fact in mind, this means:
In the future, we may decide to allow violations of this Prime Directive. For example, we might be able to support different "minor" versions of agents. However, in our first implementation, we will not support it. In fact, we can explicitly disallow a mis-versioned agent from talking to a server. The easiest way to do this without requiring additional out-of-band data to flow over the comm connection is to have the agent kill itself if it detects that it is one version but the server is another. Unfortunately, this will not prevent really old agents (those prior to this agent-update feature) from trying to talk to the server. To prevent that, we could add an out-of-band version string to the agent's outgoing commands' configuration; if the server gets a bad version, it will disallow the messages.

NOTE: we need a way to not make this too difficult for agent developers. A rebuild of an agent in a development environment should be able to stay at a particular version so it can be started without being denied access because the rebuilt agent is not the version the server thinks is the latest.

Agent Update Binaries

Today we ship agent distributions as a .zip. You unzip them and you have an agent installation. We are now going to distribute "agent update binaries" that will be packaged in our server distribution. The agent update binary will be a self-executing .jar that contains not only the full agent distribution .zip but also the update code itself. The name of the jar will include version information so it is easily discernible, such as "rhq-enterprise-agentupdate-2.2.0.jar".
The agent distribution zip files will still be called "rhq-enterprise-agent-2.2.0.zip".

Agent Update Binaries can be used either within the context of an existing agent performing an update, or completely standalone (in the case of doing a fresh provisioning of an agent where no agent existed before). Example:

    java -jar rhq-enterprise-agent-2.2.0.jar --install[=<new-agent-dir>]
This tells the Main-Class (as defined in the jar's manifest) to extract the agent into the current directory without running any update-specific code. It is as if the user simply extracted the agent .zip distro from the jar and unzipped it. If you specify a <new-agent-dir>, the agent will be installed in that directory (because it simply unzips the enclosed agent .zip, the agent will really be installed in "<new-agent-dir>/rhq-agent" since the zip is rooted at "rhq-agent").

    java -jar rhq-enterprise-agent-2.2.0.jar --update[=<old-agent-dir>]

This is the command that an existing agent will issue after it downloads the jar file. It tells the Main-Class that we want to update an existing agent, where that existing agent is installed in the given <old-agent-dir> directory. The default is the current directory (to allow a user to copy this jar file into an agent home directory and just run "java -jar rhq-enterprise-agent-2.2.0.jar --update" and have it work).
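The two options above could be interpreted roughly as follows. This is only a minimal sketch: the class name UpdateOptions and its methods are hypothetical, and it parses the argument by hand rather than with gnu.getopt, which the real update code may use instead.

```java
import java.io.File;

/** Hypothetical sketch of how the update jar's Main-Class might read its option. */
final class UpdateOptions {
    enum Mode { INSTALL, UPDATE }

    final Mode mode;
    final File directory;

    private UpdateOptions(Mode mode, File directory) {
        this.mode = mode;
        this.directory = directory;
    }

    /** Parses a single "--install[=dir]" or "--update[=dir]" argument. */
    static UpdateOptions parse(String arg) {
        String name = arg;
        String value = "."; // both options default to the current directory
        int eq = arg.indexOf('=');
        if (eq >= 0) {
            name = arg.substring(0, eq);
            String given = arg.substring(eq + 1);
            if (!given.isEmpty()) {
                value = given;
            }
        }
        switch (name) {
            case "--install":
                return new UpdateOptions(Mode.INSTALL, new File(value));
            case "--update":
                return new UpdateOptions(Mode.UPDATE, new File(value));
            default:
                throw new IllegalArgumentException("Unknown option: " + arg);
        }
    }

    /**
     * For --install the zip is rooted at "rhq-agent", so the agent ends up in a
     * subdirectory. For --update the given directory is the existing agent home.
     */
    File agentHome() {
        return mode == Mode.INSTALL ? new File(directory, "rhq-agent") : directory;
    }

    public static void main(String[] args) {
        UpdateOptions opts = parse(args.length > 0 ? args[0] : "--update");
        System.out.println(opts.mode + " -> " + opts.agentHome().getPath());
    }
}
```

Keeping the "rhq-agent" suffix in one place (agentHome) makes the difference between the two modes explicit: --install names a parent directory, while --update names the agent home itself.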
Jar File Layout

How the files are packaged in the agent update binary jar file is important. We want this to be a self-executing jar file, so we must have a manifest with the appropriate Main-Class property and have all the update code packaged in the jar such that the classloader can find it. In addition, we must package the agent distribution in here as well. Additional files will probably also be needed, such as a properties file with version information in it. Here is a first stab at how the jar file will be laid out:

    /rhq-agent-update-version.properties  (information about this agent)
    /org/rhq/enterprise/agent/update/...  (all the RHQ update code here)
    /abc-corp/...                         (third party libraries needed by the update code - gnu.getopt, etc...)
    /rhq-enterprise-agent-#.#.#.zip       (the actual agent distro .zip file)
    /README.txt                           (some helpful information; will be displayed with --help)
    /LICENSE                              (the RHQ license file)

The "rhq-agent-update-version.properties" file will look like this:

    rhq-agent.latest.version=1.2.0.GA
    rhq-agent.latest.build-number=12345

These values indicate the version and build number of the agent after the update is applied.

The RHQ Server's Agent Update Servlet

Deployed in the RHQ Server's portal-war will be a standalone Agent Update Servlet. It will be accessed via simple HTTP GET, which means it is accessible using a browser, wget or any other web client. However, its main purpose is to be accessed by the RHQ Agent: the agent will be able to ask the servlet for information about the Agent Update Binary as well as download it (see the agent's "update" prompt command). The servlet will be mapped to more than one URI so it can support different types of requests. One request type downloads an agent update binary. Another asks what agent version the server supports (i.e. the version of the agent that can be downloaded from the servlet).
Example:

    <servlet>
        <servlet-name>AgentUpdateServlet</servlet-name>
        <servlet-class>org.rhq.enterprise.gui.agentupdate.AgentUpdateServlet</servlet-class>
    </servlet>
    <servlet-mapping>
        <servlet-name>AgentUpdateServlet</servlet-name>
        <url-pattern>/agentupdate/version</url-pattern>
    </servlet-mapping>
    <servlet-mapping>
        <servlet-name>AgentUpdateServlet</servlet-name>
        <url-pattern>/agentupdate/download</url-pattern>
    </servlet-mapping>

Note that because we will deploy the servlet inside our portal-war, the servlet will be able to make MBean and SLSB calls into the server in case it needs to do things like get the version of the server or check whether the server has been configured to disable the serving of agent updates.

/agentupdate/download

The Agent Update Binary will be placed in the server, in rhq.ear/rhq-downloads/rhq-agent. The servlet will assume any file with a .jar extension is the Agent Update Binary (there can be only one). When the servlet is accessed via the URL "/agentupdate/download", this jar file will be streamed to the client issuing the HTTP GET request.

/agentupdate/version

Inside the jar file, the servlet will look for a file called "rhq-agent-update-version.properties". This .properties file should include information about the agent, such as its version and build number. This agent version information, along with the server information, will be returned in an HTTP GET response when the servlet is accessed via the URL "/agentupdate/version".

Limiting and Disabling Agent Downloads

In the RHQ Server, there are several settings within rhq-server.properties that allow you to limit the number of concurrent messages coming into the RHQ Server from RHQ Agents. These are specific settings to limit comm-layer messages. We introduced a similar kind of concurrency limit for agent downloads:

    rhq.server.agent-downloads-limit=45

This is set in rhq-server.properties.
In the future, you may be able to set this in the RHQ Server resource's Config tab, under the Concurrency Limit config group (see RHQ-1111). This setting does not affect the comm-layer messaging, but serves a similar purpose. The agent update servlet will disallow any more than this number of concurrent downloads occurring at any one time within a single RHQ Server. This value is configurable on a per-RHQ Server basis to allow larger/higher-throughput machines to support more concurrent agent update binary downloads than smaller/lower-throughput machines. Note that this value has an upper limit equal to the "rhq.server.startup.web.max-connections" limit (since this max-connections setting sets the upper bound on the total number of concurrent web requests allowed to come into the RHQ Server's Tomcat layer).

If the maximum number of concurrent downloads is currently in progress and another download request comes in, the servlet should reply to that request with an HTTP error code of 503 "Service Unavailable" to indicate that the agent needs to wait for a bit and resubmit its request. The 503 response should include the header "Retry-After" with a value being the number of seconds the agent should wait before attempting to ask again.

If the "rhq.server.agent-downloads-limit" setting is set to 0, then the agent update servlet will reject any download request for the agent update binary. This means if any web client or agent sends an HTTP GET request to "agentupdate/download", the servlet will immediately reply with an HTTP error code of 403 "Forbidden".

We have a global server-cloud configuration setting in the database (in RHQ_SYSTEM_CONFIG) to disable agent downloads across the entire server cloud. If, for some reason, you never want agents to download updates, you could go to the Administration > Server Configuration UI page and say "No" to the option "Enable Agent Auto-Updates".
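The admission rules described in this section (403 when downloads are disabled or the limit is 0, 503 when all download slots are busy) could be sketched as below. This is an illustration under stated assumptions, not the servlet's actual code: the DownloadGate class, its constructor parameters and the treatment of a negative limit as 0 are all hypothetical.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical sketch of the download admission rules; not the real servlet code. */
final class DownloadGate {
    static final int OK = 200;
    static final int FORBIDDEN = 403;
    static final int UNAVAILABLE = 503;

    private final boolean updatesEnabled; // the global "Enable Agent Auto-Updates" option
    private final int limit;              // effective rhq.server.agent-downloads-limit
    private final Semaphore slots;

    DownloadGate(boolean updatesEnabled, int configuredLimit, int maxWebConnections) {
        this.updatesEnabled = updatesEnabled;
        // The limit is capped by the web layer's max-connections setting.
        // Assumption: a negative configured value is treated like 0.
        this.limit = Math.max(0, Math.min(configuredLimit, maxWebConnections));
        this.slots = new Semaphore(this.limit);
    }

    /**
     * Returns the HTTP status to send for a download request. When this returns OK,
     * the caller owns one download slot and must call release() after streaming.
     */
    int tryAcquire() {
        if (!updatesEnabled || limit == 0) {
            return FORBIDDEN;
        }
        return slots.tryAcquire() ? OK : UNAVAILABLE;
    }

    /** Frees a slot previously granted by tryAcquire(). */
    void release() {
        slots.release();
    }
}
```

On an UNAVAILABLE result the servlet would also set the "Retry-After" header before replying; the number of seconds to advertise is left open here because the text does not fix a value.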
The "Enable Agent Auto-Updates" setting is stored in RHQ_SYSTEM_CONFIG, and the servlet reads it via the System Configuration SLSB. If the option is disabled, the servlet should immediately reply with an HTTP error code of 403 "Forbidden".

Performing Version Check To Determine If Update Is Needed

The agent has an "update" prompt command that is able to ask the server what version of the agent it has by doing an HTTP GET on http://<rhq-server>:7080/agentupdate/version (or the custom version URL if configured). This will return a simple set of name=value properties. We will make this generic so we can extend it in the future. For the first implementation, I suspect the only thing this will return in the response is:

    rhq-server.version=1.2.0.GA
    rhq-server.build-number=1852
    rhq-agent.latest.md5=<the MD5 hashcode of the binary .jar>
    rhq-agent.latest.version=1.2.0.GA
    rhq-agent.latest.build-number=12345
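A response in this name=value form can be read with java.util.Properties. The sketch below shows one way a client might decide whether an update is needed. The VersionCheck class and its method names are hypothetical, and treating any version or build-number mismatch as "update needed" is an assumption based on the Prime Directive, not a confirmed rule.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical sketch of consuming the /agentupdate/version response. */
final class VersionCheck {

    /** Parses the name=value response body into a Properties object. */
    static Properties parse(String body) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(body));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /**
     * Assumption: the agent must match the advertised version exactly, and the
     * build number too when the server reports one.
     */
    static boolean needsUpdate(Properties response, String myVersion, String myBuild) {
        String latestVersion = response.getProperty("rhq-agent.latest.version");
        String latestBuild = response.getProperty("rhq-agent.latest.build-number");
        if (latestVersion == null) {
            return false; // the server did not advertise an agent version
        }
        boolean sameVersion = latestVersion.equals(myVersion);
        boolean sameBuild = latestBuild == null || latestBuild.equals(myBuild);
        return !(sameVersion && sameBuild);
    }

    public static void main(String[] args) {
        Properties response = parse(
            "rhq-agent.latest.version=1.2.0.GA\n"
            + "rhq-agent.latest.build-number=12345\n");
        System.out.println(needsUpdate(response, "1.1.0.GA", "100"));
    }
}
```

Because the response is meant to be extensible, the sketch looks up only the keys it needs and ignores any others.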
The version string that the servlet returns for "rhq-agent.latest.version" (and "rhq-agent.latest.build-number", if we think we need it) will be determined at servlet runtime by examining the agent update binary jar itself (in the "rhq-agent-update-version.properties" file). This version information will be cached in a file internal to the servlet, stored in the rhq-downloads/rhq-agent directory ("rhq-server-agent-versions.properties" file).

The agent does not need to do any of the above to check automatically whether it needs to update. The server checks the agent's version automatically during the agent's startup registration and thereafter during its connectAgent calls (in case the agent fails over to a server different from the one it registered on). If the server's version check fails, it means the agent is attempting to violate the Prime Directive, and the server throws an AgentNotSupportedException to the agent.
Performing the Update

After an agent determines it is out of date and needs to update, the agent must stop everything it is doing and start the update processing. The agent will spawn a separate non-daemon thread called the "RHQ Agent Update Thread". This thread will first need to shut down all internal agent components (e.g. the comm layer and the plugin container). After the agent is shut down, with only the update thread running, the agent will do the following:

1) HTTP GET "server-transport://server-endpoint:server-port/agentupdate/download" (or whatever the configured download URL is)

At this point, the new agent update binary jar has been executed. It should sleep for a bit to wait for the original agent to die (can/should we perform some checks to confirm it is dead?). It should then perform its update tasks:

1) unpack the rhq-agent directory found in the jar file and place it in some tmp location (<rhq-home>/update?).
8) exit

If any of this fails, the agent will probably be dead in the water because presumably it needed to update because the server has been updated and its comm layer is different. However, this isn't always the case (most times it probably won't be) so the Agent Update Thread tries to start the original agent (via AgentMain.start()) to bring it back to where it was.

Distributions

We may decide to stop shipping agent-only distributions (our maven builds can still produce them, we just will not publish them). Because all servers must be paired with a particular version of agent, we may ship one distribution that includes both the server and agent, thus ensuring we keep the proper versions of servers and agents paired together. If we need to update an agent (say, we need to patch a bug in the agent), we will ship that new agent in a new server such that you will update the server which will then auto-update all agents with the new code. This will be true even if the server code itself did not change. What this does is:
    wget --content-disposition http://<rhq-server>:7080/agentupdate/download

will pull down an agent that can be provisioned manually if an agent does not yet exist.

Updating Agent Running As A Windows Service / UNIX Boot Time Process

The auto-update feature needs to determine how it will be affected if the agent is running as a Windows Service or UNIX boot time process.

Agent Launch Scripts

First, we need to document all the different, but valid, ways the agent can be run on production machines using the various launch scripts (i.e. those cases where agent auto-update will be an important feature to help manage that running instance). Note that there are other ways to start the agent, and those may not allow the agent to be auto-updated. Since they are not considered "normal" ways to start the agent in production anyway, we won't worry about supporting them for auto-update. These "other" ways are typically only used by developers during development/testing. Examples of these "non-production/developer" ways to start the agent are:
The different ways the agent can be started on production machines are:
The following describes the behavior of the launch scripts for each of the above ways the agent can be started:
Restarting the Agent

While it is surely not impossible, the variations are so great that it would be very difficult to restart the agent in the exact manner in which it was originally started. The above scenarios show the different ways people might want to start the agent. The agent would have to know exactly how it was started (perhaps by squirreling away its command line arguments and the initial script that was used) and then re-execute that command. However, this may not always be desired. For example, if the agent was started with --cleanconfig, we may not want to restart the agent with that option again. Therefore, we will assume that the person initially provisioning the agent into a production environment will:
In order for the agent auto-update feature to work, it requires that the above is true. In other words, the agent's runtime environment must be started using the rhq-agent-wrapper launcher script, because that is the way the agent will be restarted after it has been auto-updated.

If you start the agent using the rhq-agent.[sh,bat] scripts, the agent can still be auto-updated, but the agent is not guaranteed to be restarted automatically, because it's possible the rhq-agent-env script doesn't configure the agent fully and in the same manner as when the agent was first started. This can happen, for example, when the user who started the agent had set some RHQ_AGENT_xxx environment variables directly in the shell rather than in the rhq-agent-env configuration script, or when the user passed in some custom arguments to the agent (like --pref or --input) that would not otherwise be passed in. In addition, unless the user passed the -daemon command line argument to the agent, the agent is probably running in the foreground with keyboard input. Restarting the agent after an agent auto-update will put it into the background with no keyboard input.

So, while it is possible to run the agent using other means, you are not guaranteed that the agent restarted by auto-update will run in the exact same way, with the same configuration, as the original. Therefore, to be safe, you should always run the agent using the rhq-agent-wrapper scripts when starting the agent in a production environment.
Testing

Testing this stuff is a bit difficult because you have to get an agent to think it's an older version than what the server is expecting. In order to artificially violate the Prime Directive, you can use a small ANT script to modify an existing agent distribution or agent update binary so that the agent thinks it's a particular version. Read the comments at the top of that ANT script to learn how to use it.
Stamping An Agent Distribution

If you already have an agent installed, you can stamp that agent with another version. In short, you need to run this:

    ant -Dagent.home.dir=<your-agent-install-directory>
This will inject some bogus version strings into the appropriate places so the agent will think it's that version, not the one it really is. You can then run the agent normally and it should immediately begin its agent-update process.

Stamping A Server

If you want several agents to auto-upgrade themselves, it may be easier to stamp the agent update binary itself (as found in the server) as opposed to stamping each of the individual agents. If you already have a server installed, you can stamp that server's agent update binary with another version. In short, you need to run this:

    ant -Dserver.home.dir=<your-server-install-directory>

Things That Have Been Tested

Here are some things that were explicitly tested.
Future Enhancements

Automatic Pre-Configuration of Agent Update Binary

Allow the server to manipulate the out-of-box agent update binary to set settings in agent-configuration.xml. The main use-case is to allow the server to set rhq.agent.server.bind-address so that all agents that download the agent update binary from this server will connect to this server when the new agent is started.

Agent Platform-detection and Agent Update Binary Generation

Right now, all agent distributions are cross-platform. They include both .bat and .sh scripts, Java Service Wrapper binaries and configuration (which are only valid for Windows), and all Sigar libraries for all platforms. It would be nice for agents to tell the server which platform the agent is running on, and have the server send down the agent update binary appropriate for that platform. We could have several platform-specific agent update binaries, e.g. the Windows binary would have the wrapper binaries but none of the UNIX Sigar binaries.

Older Notes That May Or May Not Be Relevant Now
Here are some issues that we need to solve before we can implement this feature:
Here are some ideas to think about when deciding how to implement:
Comments (1)
Oct 27, 2008
John Mazzitelli says:
Chris Morgan wrote:
> You have the following prime directive:
> "All agents talking to a server cloud will be of the same version
> and an individual server will support talking to only one specific
> version of an agent." Maybe I'm missing the meaning here, so I hope
> you can help. Are we saying that agent management will not be
> backward compatible? In other words, if the customer upgrades the
> management server to 2.2, then only the agents that successfully
> upgrade to 2.2 will be manageable? Any 2.1.2 agents that fail to
> upgrade will not work? it seems like there should at least be some
> previous version compatibility – maybe not for full management,
> but some sort of ability.
> Or as stated, I'm probably missing something...
Chris,
Charles also brought up the thought that we should at least have some ability to support older agent versions. So the design does have built into it the future ability to allow heterogeneous agents to run.
However, you have to remember if we release a new agent, in all probability it will contain code that is required to talk to the new server. For example, 2.0 agents simply will not work in a 2.1 environment due to the fact that agents now have to explicitly issue a connectAgent call to the server. Without that, alerts will get lost. So old agents, even if the communications protocol didn't change, will break the system.
But for "minor" revisions of agents (say, a small patch to an agent jar that does not affect server<->agent comm), it is quite possible that an old agent that doesn't have that patch is able to successfully interact with the server. It is in those cases that the agent should be allowed to run.