============================
LBNL Node Health Check (NHC)
============================

[![Join the chat at https://gitter.im/mej/nhc](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/mej/nhc?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

TORQUE, SLURM, and other schedulers/resource managers provide for a periodic "node health check" to be performed on each compute node to verify that the node is working properly. Nodes which are determined to be "unhealthy" can be marked as down or offline so as to prevent jobs from being scheduled or run on them. This helps increase the reliability and throughput of a cluster by reducing preventable job failures due to misconfiguration, hardware failure, etc.

Though many sites have created their own scripts to serve this function, the vast majority are one-off efforts with little attention paid to extensibility, flexibility, reliability, speed, or reuse. Developers at [Lawrence Berkeley National Laboratory](http://www.lbl.gov/) created this project in an effort to change that. LBNL Node Health Check (NHC) has several design features that set it apart from most home-grown solutions:

* Reliable - To prevent single-threaded script execution from causing hangs, execution of subcommands is kept to an absolute minimum, and a watchdog timer is used to terminate the check if it runs for too long.
* Fast - Implemented almost entirely in native `bash` (2.x or greater). Reducing pipes and subcommands also cuts down on execution delays and related overhead.
* Flexible - Anything which can be described in a shell function can be a check. Modules can also populate cache data and reuse it for multiple checks.
* Extensible - Its modular functional interface makes writing new checks easy. Just drop modules into the scripts directory, then add your checks to the config file!
* Reusable - Written to be ultra-portable and can be used directly from a resource manager or scheduler, run via cron, or even spawned centrally (e.g., via `pdsh`). The configuration file syntax allows for all compute nodes to share a single configuration.

In a typical scenario, the NHC driver script is run periodically on each compute node by the resource manager client daemon (e.g., `pbs_mom`). It loads its configuration file to determine which checks are to be run on the current node (based on its hostname). Each matching check is run, and if a failure is encountered, NHC will exit with an error message describing the problem. It can also be configured to mark nodes offline so that the scheduler will not assign jobs to bad nodes, reducing the risk of system-induced job failures. NHC can also log errors to the syslog (which is often forwarded to the master node). Some resource managers are even able to use NHC as a pre-job validation tool, keeping scheduled jobs from running on a newly-failed node, and/or a post-job cleanup/checkup utility to remove nodes from the scheduler which may have been adversely affected by the just-completed job.

***************
Getting Started
***************

The following instructions will walk you through downloading and installing LBNL NHC, configuring it for your system, testing the configuration, and implementing it for use with the TORQUE resource manager.

Installation
============

Pre-built RPM packages for Red Hat Enterprise Linux versions 4, 5, 6, and 7 are made available with each release along with the source tarballs.
The latest release, as well as prior releases, can be found [on GitHub](https://github.com/mej/nhc/releases/). Simply download the appropriate RPM for your compute nodes (e.g., [lbnl-nhc-1.4.2-1.el7.noarch.rpm](https://github.com/mej/nhc/releases/download/1.4.2/lbnl-nhc-1.4.2-1.el7.noarch.rpm)) and install it into your compute node VNFS. The NHC Yum repository is currently unavailable, but we hope to provide one in the very near future!

The [source tarball for the latest release](https://github.com/mej/nhc/archive/1.4.2.tar.gz) is also available via the [NHC Project on GitHub](https://github.com/mej/nhc/). If you prefer to install from source, or aren't using one of the distributions shown above, use the commands shown here:

.. code-block:: bash

    ./configure --prefix=/usr --sysconfdir=/etc --libexecdir=/usr/libexec
    make test
    make install

.. note::

    The `make test` step is optional but recommended. This will run NHC's built-in unit test suite to make sure everything is functioning properly!

.. note::

    You can also fork and/or clone the whole NHC project on GitHub; this is recommended if you plan to contribute to NHC development, as it makes it very easy to submit your changes upstream using GitHub Pull Requests! Visit the [NHC Project Page](https://github.com/mej/nhc/) to Watch, Star, or Fork the project!

Whether you use RPMs or install from source, the script will be installed as `/usr/sbin/nhc`, the configuration file and check scripts in `/etc/nhc`, and the helper scripts in `/usr/libexec/nhc`. Once you've completed one of the 3 installation methods above on your compute nodes' root filesystem image, you can proceed with the configuration.

Sample Configuration
====================

The default configuration supplied with LBNL NHC is intended to be more of an overview of available checks than a working configuration. It's essentially impossible to create a default configuration that will work out-of-the-box for any host and still do something useful. But there are some basic checks which are likely to apply, with some modifications of boundary values, to most systems. Here's an example `nhc.conf` which shouldn't require too many tweaks to be a solid starting point:

.. code-block:: bash

    # Check that / is mounted read-write.
    * || check_fs_mount_rw /

    # Check that sshd is running and is owned by root.
    * || check_ps_service -u root -S sshd

    # Check that there are 2 physical CPUs, 8 actual cores, and 8 virtual cores (i.e., threads)
    * || check_hw_cpuinfo 2 8 8

    # Check that we have between 1kB and 1TB of physical RAM
    * || check_hw_physmem 1k 1TB

    # Check that we have between 1B and 1TB of swap
    * || check_hw_swap 1b 1TB

    # Check that we have at least some swap free
    * || check_hw_swap_free 1

    # Check that eth0 is available
    * || check_hw_eth eth0

Obviously you'll need to adjust the CPU and memory numbers, but this should get you started.

Config File Auto-Generation
---------------------------

Instead of starting with a basic sample configuration and building on it, as of version 1.4.1 you can use the `nhc-genconf` utility supplied with NHC. It uses the same shell code as NHC itself to query various attributes of your system (CPU socket/core/thread counts, RAM size, swap size, etc.) and automatically generates an initial configuration file based on its scan. Simply invoke `nhc-genconf` on each system where NHC will be running. By default, this will create the file `/etc/nhc/nhc.conf.auto` which can then be renamed (or used directly via NHC's `-c` option), tweaked, and deployed on your system!
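As a rough sketch of that workflow, using only the documented defaults above (`/etc/nhc/nhc.conf.auto` and the `-c` option); review the generated rules before deploying them:

.. code-block:: bash

    # Scan this node and write the auto-generated config to the default location.
    nhc-genconf

    # Trial-run NHC directly against the generated file before deploying it.
    nhc -c /etc/nhc/nhc.conf.auto

    # If the trial run passes, put the tweaked file in place as the real config.
    mv /etc/nhc/nhc.conf.auto /etc/nhc/nhc.conf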
Normally the config file which `nhc-genconf` creates will use the hostname of the node on which it was run at the beginning of each line. This is to allow multiple files to be merged and sorted into a single config that will work across your system. However, you may wish to provide a custom match expression to prefix each line; this may be done via the `-H` option (e.g., `-H host1` or `-H '*'`).

The scan also includes BIOS information obtained via the `dmidecode` command. The default behavior only includes lines from the output which match the regular expression `/([Ss]peed|[Vv]ersion)/`, but this behavior may be altered by supplying an alternative match string via the `-b` option (e.g., `-b '*release*'`).

Gathering up all the different types of hardware that exist in your system and writing the appropriate NHC config file rules, match expressions, etc., can be incredibly tedious, especially for large, well-established, heterogeneous, or multi-generational clusters. The following commands might come in handy for aggregating the results of `nhc-genconf` across a large group of nodes:

.. code-block:: bash

    wwsh ssh 'n*' "/usr/sbin/nhc-genconf -H '*' -c -" | dshbak -c
    # OR
    pdsh -a "/usr/sbin/nhc-genconf -H '*' -c -" | dshbak -c

Testing
=======

As of version 1.2, NHC comes with a built-in set of fairly extensive unit tests. Each of the check functions is tested for proper functionality; even the driver script (`/usr/sbin/nhc` itself) is tested! To run the unit tests, use the `make test` command at the top of the source tree. You should see something like this:

.. code-block:: bash

    # make test
    make -C test test
    make[1]: Entering directory `/home/mej/svn/lbnl/nhc/test'
    Running unit tests for NHC:
    nhcmain_init_env...ok 6/6
    nhcmain_finalize_env...ok 14/14
    nhcmain_check_conffile...ok 1/1
    nhcmain_load_scripts...ok 6/6
    nhcmain_set_watchdog...ok 1/1
    nhcmain_run_checks...ok 2/2
    common.nhc...ok 18/18
    ww_fs.nhc...ok 61/61
    ww_hw.nhc...ok 65/65
    ww_job.nhc...ok 2/2
    ww_nv.nhc...ok 4/4
    ww_ps.nhc...ok 32/32
    All 212 tests passed.
    make[1]: Leaving directory `/home/mej/svn/lbnl/nhc/test'

If everything works properly, all the unit tests should pass. Any failures represent a problem that should be reported to the [NHC Users' Mailing List](mailto:nhc@lbl.gov)!

Before adding the node health check to your resource manager (RM) configuration, it's usually prudent to do a test run to make sure it's installed/configured/running properly first. To do this, simply run `/usr/sbin/nhc` with no parameters. Successful execution will result in no output and an exit code of 0. If this is what you get, you're done testing! Skip to the next section.

If you receive an error, it will look similar to the following:

.. code-block:: bash

    ERROR Health check failed: Actual CPU core count (2) does not match expected (8).

Depending on which check failed, the message will vary. Hopefully it will be clear what the discrepancy is based on the content of the message. Adjust your configuration file to match your system and try again. If you need help, feel free to post to the [NHC Users' Mailing List](mailto:nhc@lbl.gov).

Additional information may be found in `/var/log/nhc.log`, the runtime logfile for NHC. A successful run based on the configuration above will look something like this:

.. code-block:: bash

    Node Health Check starting.
Running check: "check_fs_mount_rw /" Running check: "check_ps_daemon sshd root" Running check: "check_hw_cpuinfo 2 8 8" Running check: "check_hw_physmem 1024 1073741824" Running check: "check_hw_swap 1 1073741824" Running check: "check_hw_swap_free 1" Running check: "check_hw_eth eth0" Node Health Check completed successfully (1s). A failure will look like this: .. code-block:: bash Node Health Check starting. Running check: "check_fs_mount_rw /" Running check: "check_ps_daemon sshd root" Running check: "check_hw_cpuinfo 2 8 8" Health check failed: Actual CPU core count (2) does not match expected (8). We can see from the excerpt here that the `check_hw_cpuinfo` check failed and that the machine we ran on appears to be a dual-socket single-core system (2 cores total). Since our configuration expected a dual-socket quad-core system (8 cores total), this was flagged as a failure. Since we're testing our configuration, this is most likely a mismatch between what we told NHC to expect and what the system actually has, so we need to fix the configuration file. Once we have a working configuration and have gone into production, a failure like this would likely represent a hardware issue. Once the configuration has been modified, try running `/usr/sbin/nhc` again. Continue fixing the discrepancies and re-running the script until it succeeds; then, proceed with the next section. Implementation ============== Instructions for putting NHC into production depend entirely on your use case. We can't possibly hope to delineate them all, but we'll cover some of the most common. TORQUE Integration ------------------ NHC can be executed by the `pbs_mom` process at job start, job end, and/or regular intervals (irrespective of whether or not the node is running job(s)). More detailed information on how to configure the `pbs_mom` health check can be found in the [TORQUE Documentation](http://docs.adaptivecomputing.com/torque/help.htm#topics/11-troubleshooting/computeNodeHealthCheck.htm). The configuration used here at LBNL is as follows: .. code-block:: bash $node_check_script /usr/sbin/nhc $node_check_interval 5,jobstart,jobend $down_on_error 1 This causes `pbs_mom` to launch `/usr/sbin/nhc` every 5 "MOM intervals" (45 seconds by default), when starting a job, and when a job completes (or is terminated). Failures will cause the node to be marked as "down." > **NOTE:** Some concern has been expressed over the possibility for "OS jitter" caused by NHC. NHC was designed to keep jitter to an absolute minimum, and the implementation goes to extreme lengths to reduce and eliminate as many potential causes of jitter as possible. No significant jitter has been experienced so far (and similar checks at similar intervals are used on _extremely_ jitter-sensitive systems); however, increase the interval to `80` instead of `5` for once-hourly checks if you suspect NHC-generated jitter to be an issue for your system. Alternatively, some sites have configured NHC to detect running jobs and simply exit (or run fewer checks); that works too! In addition, NHC will by default mark the node "offline" (i.e., `pbsnodes -o`) and add a note (viewable with `pbsnodes -ln`) specifying the failure. Once the failure has been corrected and NHC completes successfully, it will remove the note it set and clear the "offline" status from the node. In order for this to work, however, each node must have "operator" access to the TORQUE daemon. 
Unfortunately, the support for wildcards in `pbs_server` attributes is limited to replacing the host, subdomain, and/or domain portions with asterisks, so for most setups this will likely require omitting the entire hostname section. The following has been tested and is known to work:

.. code-block:: bash

    qmgr -c "set server operators += root@*"

This functionality is not strictly required, but it makes determining the reason nodes are marked down significantly easier!

Another possible caveat to this functionality is that it only works if the canonical hostname (as returned by the `hostname` command or the file `/proc/sys/kernel/hostname`) of each node matches its identity within TORQUE. If your site uses FQDNs on compute nodes but has them listed in TORQUE using the short versions, you will need to add something like this to the top of your NHC configuration file:

.. code-block:: bash

    * || HOSTNAME="$HOSTNAME_S"

This will cause the offline/online helpers to use the shorter hostname when invoking `pbsnodes`. This will NOT, however, change how the hostnames are matched in the NHC configuration, so you'll still need to use FQDN matching there.

It's also important to note here that NHC will only set a note on nodes that don't already have one (and aren't yet offline) or that have one set by NHC itself; likewise, it will only online nodes and clear notes if it sees a note that was set by NHC. It looks for the string "NHC:" in the note to distinguish between notes set by NHC and notes set by operators. If you use this feature, and you need to mark nodes offline manually (e.g., for testing), setting a note when doing so is strongly encouraged. (You can do this via the `-N` option, like this: `pbsnodes -o -N 'Testing stuff' n0000 n0001 n0002`) There was a bug in versions prior to 1.2.1 which would cause NHC to treat nodes with no notes the same way it treats nodes with NHC-assigned notes. This _should_ be fixed in 1.2.1 and higher, but you never know....

SLURM Integration
-----------------

Add the following to `/etc/slurm.conf` (or `/etc/slurm/slurm.conf`, depending on version) on your master node **AND** your compute nodes (because, even though the `HealthCheckProgram` only runs on the nodes, your `slurm.conf` file must be the same across your entire system):

.. code-block:: bash

    HealthCheckProgram=/usr/sbin/nhc
    HealthCheckInterval=300

This will execute NHC every 5 minutes. For optimal support of SLURM, NHC version 1.3 or higher is recommended. Prior versions will require manual intervention.

Periodic Execution
------------------

The original method for running NHC periodically was to employ a simple `crontab` entry, like this one:

.. code-block:: bash

    MAILTO=operators@your.com
    */5 * * * * /usr/sbin/nhc

Annoyingly, this would result in an e-mail being sent every 5 minutes if one of the health checks failed. It was for this very reason that the contributed `nhc.cron` script was originally written. However, even though it avoided the former technique's flood of e-mail when a problem arose, it still had no clean way of dealing with multiple contexts and could not be set up to do periodic reminders of issues. Additionally, it would fail to notify if a new problem was detected before or at the same time the old problem was resolved.

Version 1.4.1 introduces a vastly superior option: `nhc-wrapper`. This tool will execute `nhc`[1](#footnotes) and record the results. It then compares the results to the output of the previous run, if present, and will ignore results that are identical to those previously obtained.
Old results can be set to expire after a given length of time (and thus re-reported). Results may be echoed to stdout or sent via e-mail. Once an unrecognized command line option or non-option argument is encountered, it and the rest of the command line arguments are passed to the wrapped program intact.

This tool will typically be run via `cron(8)`. It can be used to wrap distinct contexts of NHC in a manner identical to NHC itself (i.e., specified via executable name or command line arg); also, unlike the old `nhc.cron` script, this one does a comparison of the results rather than only distinguishing between the presence/absence of output, and those results can have a finite lifespan.

`nhc-wrapper` also offers another option for periodic execution: looping (`-L`). When launched from a terminal or `inittab`/`init.d` entry in looping mode, `nhc-wrapper` will execute a loop which runs the wrapped program (e.g., `nhc`) at a time interval you supply. It attempts to be smart about interpreting your intent as well, calculating sleep times after subprogram execution (i.e., the interval is from start time to start time, not end time to start time) and using nice, round execution times when applicable (i.e., based on 00:00 local time instead of whatever random time the wrapper loop happened to be entered). For example, if you ask it to run every 5 minutes, it'll run at :00, :05, :10, :15, etc. If you ask for every 4 hours, it'll run at 00:00, 04:00, 08:00, 12:00, 16:00, and 20:00 exactly--regardless of what time it was when you originally launched `nhc-wrapper`! This allows the user to run `nhc-wrapper` in a terminal to keep tabs on it while still running checks at predictable times (just like `crond` would). It also has some flags to provide timestamps (`-L t`) and/or ASCII horizontal rulers (`-L r`) between executions; clearing the screen (`-L c`) before each execution (`watch`-style) is also available.

Examples:

To run `nhc` and notify `root` when errors appear, are cleared, or every 12 hours while they persist:

.. code-block:: bash

    /usr/sbin/nhc-wrapper -M root -X 12h

Same as above, but run the "nhc-cron" context instead (`nhc -n nhc-cron`):

.. code-block:: bash

    /usr/sbin/nhc-wrapper -M root -X 12h -n nhc-cron
    # OR
    /usr/sbin/nhc-wrapper -M root -X 12h -A '-n nhc-cron'

Same as above, but run `nhc-cron` (symlink to `nhc`) instead:

.. code-block:: bash

    /usr/sbin/nhc-wrapper -M root -X 12h -P nhc-cron
    # OR
    ln -s nhc-wrapper /usr/sbin/nhc-cron-wrapper
    /usr/sbin/nhc-cron-wrapper -M root -X 12h

Expire results after 1 week, 1 day, 1 hour, 1 minute, and 1 second:

.. code-block:: bash

    /usr/sbin/nhc-wrapper -M root -X 1w1d1h1m1s

Run verbosely, looping every minute with ruler and timestamp:

.. code-block:: bash

    /usr/sbin/nhc-wrapper -L tr1m -V

Or for something quieter and more `cron`-like:

.. code-block:: bash

    /usr/sbin/nhc-wrapper -L 1h -M root -X 12h

*************
Configuration
*************

Now that you have a basic working configuration, we'll go more in-depth into how NHC is configured, including command-line invocation, configuration file syntax, modes of operation, how individual checks are matched against a node's hostname, and what checks are already available in the NHC distribution for your immediate use.
Configuration of NHC is generally done in one of 3 ways: passing option flags and/or configuration (i.e., environment) variables on the command line, setting variables and specifying checks in the configuration file (`/etc/nhc/nhc.conf` by default), and/or setting variables in the sysconfig initialization file (`/etc/sysconfig/nhc` by default). The latter works essentially the same as any other sysconfig file (it is directly sourced into NHC's `bash` session using the `.` operator), so this document does not go into great detail about using it. The following sections discuss the other two mechanisms.

Command-Line Invocation
=======================

From version 1.3 onward, NHC supports a subset of command-line options and arguments in addition to the configuration and sysconfig files. A few specific settings have CLI options associated with them as shown in the table below; additionally, any configuration variable which is valid in the configuration or sysconfig file may also be passed on the command line instead.

Options
-------

| **Command-Line Option** | **Equivalent Configuration Variable** | **Purpose** |
| ----------------------- | ------------------------------------- | ----------- |
| `-D` `confdir` | `CONFDIR=confdir` | Use config directory `confdir` (default: `/etc/name`) |
| `-a` | `NHC_CHECK_ALL=1` | Run ALL checks; don't exit on first failure (useful for `cron`-based monitoring) |
| `-c` `conffile` | `CONFFILE=conffile` | Load config from `conffile` (default: `confdir/name.conf`) |
| `-d` | `DEBUG=1` | Activate debugging output |
| `-f` | `NHC_CHECK_FORKED=1` | Run each check in a separate background process (*EXPERIMENTAL*) |
| `-h` | N/A | Show command line help |
| `-l` `logspec` | `LOGFILE=logspec` | File name/path or BASH-syntax directive for logging output (`-` for `STDOUT`) |
| `-n` `name` | `NAME=name` | Set program name to `name` (default: `nhc`); see `-D` & `-c` |
| `-q` | `SILENT=1` | Run quietly |
| `-t` `timeout` | `TIMEOUT=timeout` | Use timeout of `timeout` seconds (default: 30) |
| `-v` | `VERBOSE=1` | Run verbosely (i.e., show check progress) |

.. note::

    Due to the use of the `getopts` bash built-in, and the limitations thereof, POSIX-style bundling of options (e.g., `-da`) is NOT supported, and all command-line options MUST PRECEDE any additional variable/value-type arguments!

Variable/Value Arguments
------------------------

Instead of, or possibly in addition to, the use of command-line options, NHC accepts configuration via variables specified on the command line. Simply pass any number of _`VARIABLE=value`_ arguments on the command line, and each variable will be set to its respective value immediately upon NHC startup. This happens before the sysconfig file is loaded, so it can be used to alter such values as `$SYSCONFIGDIR` (`/etc/sysconfig` by default) which would normally be unmodifiable.

It's important to note that while command-line configuration directives will override NHC's built-in defaults for various variables, variables set in the configuration file (see below) will NOT be overridden. The config file takes precedence over the command line, in contrast to most other CLI tools out there (and possibly contrary to user expectation), due to the way `bash` deals with variables and initialization. If you want the command line to take precedence, you'll need to test the value of the variable in the config file and only alter it if the current value matches NHC's built-in default.
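One way to express that guard in `nhc.conf`, sketched here with `MAX_SYS_UID` (whose built-in default is `99`, per the variable table later in this document) purely as an illustration; adapt the variable and values to whatever you actually need to override:

.. code-block:: bash

    # Hypothetical example: only raise MAX_SYS_UID in the config file if it is
    # still at NHC's built-in default (99), so that a MAX_SYS_UID=... passed on
    # the command line is not clobbered.
    * || [[ "$MAX_SYS_UID" != "99" ]] || MAX_SYS_UID=499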
Example Invocations
-------------------

Most sites just run `nhc` by itself with no options when launching from a resource manager daemon. However, when running from cron or manually at the command line, numerous other possible scenarios exist for invoking NHC in various ways. Here are some real-world examples.

To run in debug mode (the two commands below are equivalent):

.. code-block:: bash

    nhc -d
    nhc DEBUG=1

To run for testing purposes in debug mode with no timeout and with node online/offline disabled:

.. code-block:: bash

    nhc -d -t 0 MARK_OFFLINE=0

To force use of SLURM as the resource manager and use a sysconfig path in `/opt`:

.. code-block:: bash

    nhc NHC_RM=slurm SYSCONFIGDIR=/opt/etc/sysconfig

To run NHC out-of-band (e.g., from cron) with the name `nhc-oob` (which will load its config from `/etc/sysconfig/nhc-oob` and `/etc/nhc/nhc-oob.conf`):

.. code-block:: bash

    nhc -n nhc-oob

.. note::

    As an alternative, you may symlink `/usr/sbin/nhc-oob` to `nhc` and run `nhc-oob` instead. This will accomplish the same thing.

Configuration File Syntax
=========================

The configuration file is fairly straightforward. Stored by default in `/etc/nhc/nhc.conf`, the file is plain text and recognizes the traditional `#` introducer for comments. Any line that starts with a `#` (with or without leading whitespace) is ignored. Blank lines are also ignored. Examples:

.. code-block:: bash

    # This is a comment.
        # This is also a comment.
    # This line and the next one will both be ignored.

Configuration lines contain a **target** specifier, the separator string `||`, and the **check** command. The target specifies which hosts should execute the check; only nodes whose hostname matches the given target will execute the check on that line. All other nodes will ignore it and proceed to the next check.

A check is simply a shell command. All NHC checks are bash functions defined in the various included files in `/etc/nhc/scripts/*.nhc`, but in actuality any valid shell command that properly returns success or failure will work. This documentation and all examples will only reference bash function checks. Each check can take zero or more arguments and is executed exactly as seen in the configuration.

As of version 1.2, configuration variables may also be set in the config file with the same syntax. This makes it easy to alter specific settings, commands, etc. globally or for individual hosts/hostgroups! Example:

.. code-block:: bash

    * || SOMEVAR="value"
    * || check_something
    *.foo || another_check 1 2 3

Match Strings
=============

As noted in the last section, the first item on each line of the NHC configuration file specifies the **target** for the check which will follow. When NHC runs on a particular host, it reads and parses each line of the configuration file, comparing the hostname of the host (taken from the `$HOSTNAME` variable) with the specified target expression; if the target matches, the check will be saved for later execution. Lines whose targets don't match the current host are ignored completely.

The target is expressed in the form of a **match string** -- an NHC expression that allows for exact string matches or a variety of dynamic comparison methods. Match strings are a very important concept and are used throughout NHC, not just for check targets, but as parameters to individual checks as well, so it's important that users fully understand how they work.

There are multiple forms of **match string** supported by NHC. The default style is a **glob**, also known as a **wildcard**.
bash will determine if the hostname of the node (specifically, the contents of `/proc/sys/kernel/hostname`) matches the supplied glob expression (e.g., `n*.viz`) and execute only those checks which have matching target expressions. If the hostname does not match the glob, the corresponding check is ignored.

The second method for specifying host matches is via **regular expression**. Regex targets must be surrounded by slashes to identify them as regular expressions. The internal regex matching engine of bash is used to compare the hostname to the given regular expression. For example, given a target of `/^n00[0-5][0-9]\.cc2$/`, the corresponding check would execute on `n0017.cc2` but not on `n0017.cc1` or `n0083.cc2`.

The third form of match string (supported in NHC versions 1.2.2 and later) is **node range expressions** similar to those used by `pdsh`, Warewulf, and other open source HPC tools. (_Please note that not all expressions supported by other tools will work in NHC due to limitations in `bash`._) The match expression is placed in curly braces and specifies one or more comma-separated node name ranges, and the corresponding check will only execute on nodes which fall into at least one of the specified ranges. Note that only one bracketed range expression is supported per node name range, and commas within the brackets are not supported. So, for example, the target `{n00[00-99].phys,n000[0-4].bio}` would cause its check to execute on `n0030.phys`, `n0099.phys`, and `n0001.bio`, but not on `n0100.phys` nor `n0005.bio`. Expressions such as `{n[0-3]0[00-49].r[00-29]}` and `{n00[00-29,54,87].sci}` are not supported (though the latter may be written instead as `{n00[00-29].sci,n0054.sci,n0087.sci}`).

Match strings of any form (glob/wildcard, regular expression, node range, or external) can be negated. This simply means that a match string which would otherwise have matched will instead fail to match, and vice versa (i.e., the boolean result of the match is inverted). To negate any match string, simply prefix it (before the initial type character, if any) with an exclamation mark (`!`). For example, to run a check on all but the I/O nodes, you could use the expression `!io*`.

Examples:

.. code-block:: bash

    * || valid_check1
    !ln* || valid_check2
    /n000[0-9]/ || valid_check3
    !/\.(gpu|htc)/ || valid_check4
    {n00[20-39]} || valid_check5
    !{n03,n05,n0[7-9]} || valid_check6
    {n00[10-21,23]} || this_target_is_invalid

Throughout the rest of the documentation, we will refer to this concept as a **match string** (or abbreviated **mstr**). Anywhere a match string is expected, either a glob, a regular expression surrounded by slashes, or a node range expression in braces, possibly with a leading `!` to negate it, may be specified.

Supported Variables
===================

As mentioned above, version 1.2 and higher support setting/changing shell variables within the configuration file. Many aspects of NHC's behavior can be modified through the use of shell variables, including a number of the commands in the various checks and helper scripts NHC employs. There are, however, some variables which can only be specified in `/etc/sysconfig/nhc`, the global initial settings file for NHC. This is typically for obvious reasons (e.g., you can't change the path to the config file from within the config file!).
The table below provides a list of the configuration variables which may be used to modify NHC's behavior; those which won't work in a config file (only sysconfig or command line) are marked with an asterisk ("*"):

| **Variable Name** | **Default Value** | **Purpose** |
| ----------------- | ----------------- | ----------- |
| * CONFDIR | `/etc/nhc` | Directory for NHC configuration data |
| * CONFFILE | `$CONFDIR/$NAME.conf` | Path to NHC config file |
| DEBUG | `0` | Set to `1` to activate debugging output |
| * DETACHED_MODE | `0` | Set to `1` to activate [Detached Mode](#detached-mode) |
| * DETACHED_MODE_FAIL_NODATA | `0` | Set to `1` to cause [Detached Mode](#detached-mode) to fail if no prior check result exists |
| DF_CMD | `df` | Command used by `check_fs_free`, `check_fs_size`, and `check_fs_used` |
| DF_FLAGS | `-Tka` | Flags to pass to `$DF_CMD` for space checks. **_NOTE:_ Adding the `-l` flag is _strongly_ recommended if only checking local filesystems.** |
| DFI_CMD | `df` | Command used by `check_fs_inodes`, `check_fs_ifree`, and `check_fs_iused` |
| DFI_FLAGS | `-Tia` | Flags to pass to `$DFI_CMD`. **_NOTE:_ Adding the `-l` flag is _strongly_ recommended if only checking local filesystems.** |
| * FORCE_SETSID | `1` | Re-execute NHC as a session leader if it isn't already one at startup |
| * HELPERDIR | `/usr/libexec/nhc` | Directory for NHC helper scripts |
| * HOSTNAME | Set from `/proc/sys/kernel/hostname` | Canonical name of current node |
| * HOSTNAME_S | `$HOSTNAME` truncated at first `.` | Short name (no domain or subdomain) of current node |
| IGNORE_EMPTY_NOTE | `0` | Set to `1` to treat empty notes like NHC-assigned notes (<1.2.1 behavior) |
| * INCDIR | `$CONFDIR/scripts` | Directory for NHC check scripts |
| JOBFILE_PATH | TORQUE/PBS: `$PBS_SERVER_HOME/mom_priv/jobs`<br />SLURM: `$SLURM_SERVER_HOME` | Directory on compute nodes where job records are kept |
| * LOGFILE | `>>/var/log/nhc.log` | File name/path or BASH-syntax directive for logging output (`-` for `STDOUT`) |
| LSF_BADMIN | `badmin` | Command to use for LSF's `badmin` (may include path) |
| LSF_BHOSTS | `bhosts` | Command to use for LSF's `bhosts` (may include path) |
| LSF_OFFLINE_ARGS | `hclose -C` | Arguments to LSF's `badmin` to offline node |
| LSF_ONLINE_ARGS | `hopen` | Arguments to LSF's `badmin` to online node |
| MARK_OFFLINE | `1` | Set to `0` to disable marking nodes offline on check failure |
| MAX_SYS_UID | `99` | UIDs <= this number are exempt from rogue process checks |
| MCELOG | `mcelog` | Command to use to check for MCE log errors |
| MCELOG_ARGS | `--client` | Parameters passed to `$MCELOG` command |
| MCELOG_MAX_CORRECTED_RATE | `9` | Maximum number of **corrected** MCEs allowed before `check_hw_mcelog()` returns failure |
| MCELOG_MAX_UNCORRECTED_RATE | `0` | Maximum number of **uncorrected** MCEs allowed before `check_hw_mcelog()` returns failure |
| MDIAG_CMD | `mdiag` | Command to use to invoke Moab's `mdiag` command (may include path) |
| * NAME | `nhc` | Used to populate default paths/filenames for configuration |
| NHC_AUTH_USERS | `root nobody` | Users authorized to have arbitrary processes running on compute nodes |
| NHC_CHECK_ALL | `0` | Forces all checks to be non-fatal. Displays each failure message, reports total number of failed checks, and returns that number. |
| NHC_CHECK_FORKED | `0` | Forces each check to be executed in a separate forked subprocess. NHC attempts to detect directives which set environment variables to avoid forking those. Enhances resiliency if checks hang. |
| NHC_RM | Auto-detected | Resource manager with which to interact (`pbs`, `slurm`, `sge`, or `lsf`) |
| NVIDIA_HEALTHMON | `nvidia-healthmon` | Command used by `check_nv_healthmon` to check nVidia GPU status |
| NVIDIA_HEALTHMON_ARGS | `-e -v` | Arguments to `$NVIDIA_HEALTHMON` command |
| OFFLINE_NODE | `$HELPERDIR/node-mark-offline` | Helper script used to mark nodes offline |
| ONLINE_NODE | `$HELPERDIR/node-mark-online` | Helper script used to mark nodes online |
| PASSWD_DATA_SRC | `/etc/passwd` | Colon-delimited file in standard passwd format from which to load user account data |
| PATH | `/sbin:/usr/sbin:/bin:/usr/bin` | If a path is not specified for a particular command, this variable defines the directory search order |
| PBSNODES | `pbsnodes` | Command used by the above helper scripts to mark nodes online/offline |
| PBSNODES_LIST_ARGS | `-n -l all` | Arguments to `$PBSNODES` to list nodes and their status notes |
| PBSNODES_OFFLINE_ARGS | `-o -N` | Arguments to `$PBSNODES` to mark node offline with note |
| PBSNODES_ONLINE_ARGS | `-c -N` | Arguments to `$PBSNODES` to mark node online with note |
| PBS_SERVER_HOME | `/var/spool/torque` | Directory for TORQUE files |
| RESULTFILE | `/var/run/nhc/$NAME.status` | Used in [Detached Mode](#detached-mode) to store result of checks for subsequent handling |
| RM_DAEMON_MATCH | TORQUE/PBS: `/\bpbs_mom\b/`<br />SLURM: `/\bslurmd\b/`<br />SGE/UGE: `/\bsge_execd\b/` | [Match string](#match-strings) used by `check_ps_userproc_lineage` to make sure all user processes were spawned by the RM daemon |
| SILENT | `0` | Set to `1` to disable logging via `$LOGFILE` |
| SLURM_SCONTROL | `scontrol` | Command to use for SLURM's `scontrol` (may include path) |
| SLURM_SC_OFFLINE_ARGS | `update State=DRAIN` | Arguments to pass to SLURM's `scontrol` to offline a node |
| SLURM_SC_ONLINE_ARGS | `update State=IDLE` | Arguments to pass to SLURM's `scontrol` to online a node |
| SLURM_SERVER_HOME | `/var/spool/slurmd` | Location of SLURM data files (see also: `$JOBFILE_PATH`) |
| SLURM_SINFO | `sinfo` | Command to use for SLURM's `sinfo` (may include path) |
| STAT_CMD | `/usr/bin/stat` | Command to use to `stat()` files |
| STAT_FMT_ARGS | `-c` | Parameter to introduce format string to `stat` command |
| * TIMEOUT | `30` | Watchdog timer (in seconds) |
| VERBOSE | `0` | Set to `1` to display each check line before it's executed |

Example usage:
--------------

.. code-block:: bash

    * || export PATH="$PATH:/opt/torque/bin:/opt/torque/sbin"
    n*.rh6 || MAX_SYS_UID=499
    n*.deb || MAX_SYS_UID=999
    *.test || DEBUG=1
    * || export MARK_OFFLINE=0
    * || NVIDIA_HEALTHMON="/global/software/rhel-6.x86_64/modules/nvidia/tdk/3.304.3/nvidia-healthmon/nvidia-healthmon"

Detached Mode
=============

Version 1.2 and higher support a feature called "detached mode." When this feature is activated on the command line or in `/etc/sysconfig/nhc` (by setting `DETACHED_MODE=1`), the `nhc` process will immediately fork itself. The foreground (parent) process will immediately return success. The child process will run all the checks and record the results in `$RESULTFILE` (default: `/var/run/nhc.status`). The next time `nhc` is executed, just before forking off the child process (which will again run the checks in the background), it will load the results from `$RESULTFILE` from the last execution. Once the child process has been spawned, the foreground process will then return the previous run's results to its caller.

The advantage of detached mode is that any hangs or long-running commands which occur in the checks will not cause the resource manager daemon (e.g., `pbs_mom`) to block. Sites that use home-grown health check scripts often use a similar technique for this very reason -- it's non-blocking.

However, a word of caution: if a detached-mode `nhc` encounters a failure, it won't get acted upon until the **next execution**. So let's say you have NHC configured to run only on job start and job end. Let's further suppose that the `/tmp` filesystem encounters an error and gets remounted read-only at some point after the completion of the last job, and that you have `check_fs_mount_rw /tmp` in your `nhc.conf`. In normal mode, when a new job tries to start, `nhc` will detect the read-only mount on job start and will take the node out of service before the job is allowed to begin executing on the node. In detached mode, however, since `nhc` has not been run in the meantime, and the previous run was successful, `nhc` will return success and allow the job to start _before_ the error condition is noticed! For this reason, when using detached mode, periodic checks are HIGHLY recommended. This will not completely prevent the above scenario, but it will drastically reduce the odds of it occurring. Users of detached mode, as with any similar method of delayed reporting, must be aware of and accept this caveat in exchange for the benefits of the more-fully-non-blocking behavior.
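As a concrete illustration, a minimal `/etc/sysconfig/nhc` enabling detached mode might look like the sketch below. The variable names are the documented ones from the table above; the specific values are only illustrative and should be adapted to your site.

.. code-block:: bash

    # /etc/sysconfig/nhc -- sourced by NHC at startup (illustrative values)
    DETACHED_MODE=1              # fork; return the previous run's result immediately
    DETACHED_MODE_FAIL_NODATA=0  # don't fail a run that has no prior result to report
    TIMEOUT=60                   # illustrative: allow background checks extra headroom

Pair this with one of the periodic execution mechanisms described earlier (e.g., the resource manager's check interval or `nhc-wrapper`) so that a failure recorded by one run is acted upon promptly by the next.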
Built-in Checks
===============

In the documentation below, parameters surrounded by square brackets ([like this]) are **optional**. All others are **required**. The LBNL Node Health Check distribution supplies the following checks:

check_cmd_output
----------------

`check_cmd_output [-t timeout] [-r retval] [-m match [...]] { -e 'command [arg1 [...]]' | command [arg1 [...]] }`

`check_cmd_output` executes a `command` and compares each line of its output against any `mstr`s ([match strings](#match-strings)) passed in. If any positive match **is not** found in the command output, or if any negative match **is** found, the check fails. The check also fails if the exit status of `command` does not match `retval` (if supplied) or if the `command` fails to complete within `timeout` seconds (default 5). Options to this check are as follows:

| **Check Option** | **Purpose** |
| ---------------- | ----------- |
| `-e` `command` | Execute `command` and gather its output. The `command` is split on word boundaries, much like `/bin/sh -c '...'` does. |
| `-m` `mstr` | If negated, no line of the output may match the specified `mstr` expression. Otherwise, at least one line must match. This option may be used multiple times as needed. |
| `-r` `retval` | Exit status (a.k.a. return code or return value) of `command` must equal `retval` or the check will fail. |
| `-t` `secs` | Command will timeout if not completed within `secs` seconds (default is 5). |

.. note::

    If the `command` is passed using `-e`, the `command` string is split on word boundaries to create the `argv[]` array for the command. If passed on the end of the check line, DO NOT quote the command. Each parameter must be distinct. Only use quotes to group multiple words into a single argument. For example, passing `command` as `"service bind restart"` will work if used with `-e` but will fail if passed at the end of the check line (use without quotes instead)!

_**Example** (Verify that the `rpcbind` service is alive)_: `check_cmd_output -t 1 -r 0 -m '/is running/' /sbin/service rpcbind status`

check_cmd_status
----------------

`check_cmd_status [-t timeout] -r retval command [arg1 [...]]`

`check_cmd_status` executes a `command` and redirects its output to `/dev/null`. The check fails if the exit status of `command` does not match `retval` or if the `command` fails to complete within `timeout` seconds (default 5). Options to this check are as follows:

| **Check Option** | **Purpose** |
| ---------------- | ----------- |
| `-r` `retval` | Exit status (a.k.a. return code or return value) of `command` must equal `retval` or the check will fail. |
| `-t` `secs` | Command will timeout if not completed within `secs` seconds (default is 5). |

_**Example** (Make sure SELinux is disabled)_: `check_cmd_status -t 1 -r 1 selinuxenabled`

check_dmi_data_match
--------------------

`check_dmi_data_match [-h handle] [-t type] [-n | '!'] string`

`check_dmi_data_match` uses parsed, structured data taken from the output of the `dmidecode` command to allow the administrator to make very specific assertions regarding the contents of the DMI (a.k.a. SMBIOS) data. Matches can be made against any output or against specific types (classifications of data) or even handles (identifiers of data blocks, typically sequential). Output is restructured such that sections which are indented underneath a section header have the text of the section header prepended to the output line along with a colon and intervening space.
So, for example, the string "ISA is supported" which appears underneath the "Characteristics:" header, which in turn is underneath the "BIOS Information" header/type, would be parsed by `check_dmi_data_match` as "BIOS Information: Characteristics: ISA is supported". See the `dmidecode` man page for more details.

.. warning::

    Although `string` is technically a [match string](#match-strings), and supports negation in its own right, you probably don't want to use negated [match strings](#match-strings) here. Passing the `-n` or `!` parameters to the check means, "check all relevant DMI data and pass the check only if no matching line is found." Using a negated [match string](#match-strings) here would mean, "The check passes as soon as _ANY_ non-matching line is found" -- almost certainly not the desired behavior! A subtle but important distinction!

_**Example** (check for BIOS version)_: `check_dmi_data_match "BIOS Information: Version: 1.0.37"`

check_dmi_raw_data_match
------------------------

`check_dmi_raw_data_match match_string [...]`

`check_dmi_raw_data_match` is basically like a `grep` on the raw output of the `dmidecode` command. If you don't need to match specific strings in specific sections but just want to match a particular string anywhere in the raw output, you can use this check instead of `check_dmi_data_match` (above) to avoid the additional overhead of parsing the output into handles, types, and expanded strings.

_**Example** (check for firmware version in raw output; could really match any version)_: `check_dmi_raw_data_match "Version: 1.24.4175.33"`

check_file_contents
-------------------

`check_file_contents file mstr [...]`

`check_file_contents` looks at the specified file and allows one or more (possibly negated) `mstr` [match strings](#match-strings) (glob, regexp, etc.) to be applied to the contents of the file. The check fails unless ALL specified expressions successfully match the file content, but the order in which they appear in the file need not match the order specified on the check line. No post-processing is done on the file, but take care to quote any shell metacharacters in your match expressions properly. Also remember that matching against the contents of large files will slow down NHC and potentially cause a timeout. Reading of the file stops when all match expressions have been successfully found in the file. The file is only read once per invocation of `check_file_contents`, so if you need to match several expressions in the same file, passing them all to the same check is advisable.

.. note::

    This check handles negated [match strings](#match-strings) internally so that they "do the right thing:" ensure that no matching lines exist in the entire file.

_**Example** (verify setting of $pbsserver in pbs_mom config)_: `check_file_contents /var/spool/torque/mom_priv/config '/^\$pbsserver master$/'`

check_file_stat
---------------

`check_file_stat [-D num] [-G name] [-M mode] [-N secs] [-O secs] [-T num] [-U name] [-d num] [-g gid] [-m mode] [-n secs] [-o secs] [-t num] [-u uid] filename(s)`

`check_file_stat` allows the user to assert specific properties on one or more files, directories, and/or other filesystem objects based on metadata returned by the Linux/Unix `stat` command. Each option specifies a test which is applied to each of the _filename(s)_ in order. The check fails if any of the comparisons does not match.
Options to this check are as follows:

| **Check Option** | **Purpose** |
| ---------------- | ----------- |
| `-D` `num` | Specifies that the device ID for _filename(s)_ should be `num` (decimal or hex) |
| `-G` `name` | Specifies that _filename(s)_ should be owned by group `name` |
| `-M` `mode` | Specifies that the permissions for _filename(s)_ should include at LEAST the bits set in `mode` |
| `-N` `secs` | Specifies that the `ctime` (i.e., inode change time) of _filename(s)_ should be newer than `secs` seconds ago |
| `-O` `secs` | Specifies that the `ctime` (i.e., inode change time) of _filename(s)_ should be older than `secs` seconds ago |
| `-T` `num` | Specifies that the minor device number for _filename(s)_ be `num` |
| `-U` `name` | Specifies that _filename(s)_ should be owned by user `name` |
| `-d` `num` | Specifies that the device ID for _filename(s)_ should be `num` (decimal or hex) |
| `-g` `gid` | Specifies that _filename(s)_ should be owned by group id `gid` |
| `-m` `mode` | Specifies that the permissions for _filename(s)_ should include at LEAST the bits set in `mode` |
| `-n` `secs` | Specifies that the `mtime` (i.e., modification time) of _filename(s)_ should be newer than `secs` seconds ago |
| `-o` `secs` | Specifies that the `mtime` (i.e., modification time) of _filename(s)_ should be older than `secs` seconds ago |
| `-t` `num` | Specifies that the major device number for _filename(s)_ be `num` |
| `-u` `uid` | Specifies that _filename(s)_ should be owned by uid `uid` |

_**Example** (Assert correct uid, gid, owner, group, & major/minor device numbers for `/dev/null`)_: `check_file_stat -u 0 -g 0 -U root -G root -t 1 -T 3 /dev/null`

check_file_test
---------------

`check_file_test [-a] [-b] [-c] [-d] [-e] [-f] [-g] [-h] [-k] [-p] [-r] [-s] [-t] [-u] [-w] [-x] [-O] [-G] [-L] [-S] [-N] filename(s)`

`check_file_test` allows the user to assert very simple attributes on one or more files, directories, and/or other filesystem objects based on tests which can be performed via the shell's built-in `test` command. Each option specifies a test which is applied to each of the _filename(s)_ in order. NHC internally evaluates the shell expression `test