Monitoring GT.M Messages

This section covers the monitoring of GT.M messages, of which there are several types.

A system management tool can help you automate the monitoring of these messages.

GT.M sends messages to the system log at the LOG_INFO level of the LOG_USER facility. GT.M messages are identified by a signature of the form GTM-s-abcdef, where -s- is a severity indicator and abcdef is an identifier. The severity indicators are: -I- for informational messages, -W- for warnings, -E- for errors, and -F- for events that cause a GT.M process to terminate abnormally. Your monitoring should recognize error and fatal events in real time, and warning events within an appropriate time.

All messages have diagnostic value. It is important to create a baseline pattern of messages as a signature of the normal operation of your system, so that a deviation from this baseline - the presence of unexpected messages, an unusual number of expected messages (such as file extensions), or the absence of expected messages - allows you to recognize abnormal behavior when it happens. In addition to responding to important events in real time, you should regularly review informational and warning messages and ensure that deviations from the baseline can be explained.
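
For example, a periodic scan of the system log for error and fatal messages can feed an alerting mechanism. The following is a minimal sketch; the syslog path /var/log/syslog (it varies by distribution), the alert address, and the use of mail(1) are assumptions to adapt to your environment.

    #!/bin/sh
    # Sketch: scan the system log for GT.M error (-E-) and fatal (-F-) messages
    # and mail them to an operator. Informational and warning messages are
    # reviewed against the baseline separately.
    SYSLOG=/var/log/syslog              # assumption: adjust for your distribution
    ALERT_ADDR=ops@example.com          # hypothetical alert address
    TMPFILE=$(mktemp)
    if grep -E 'GTM-[EF]-' "$SYSLOG" > "$TMPFILE"
    then
        mail -s "GT.M error/fatal messages detected" "$ALERT_ADDR" < "$TMPFILE"
    fi
    rm -f "$TMPFILE"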

Some message identifiers are described in the following table:

Component         Instance File or            Receiver Pool   Identifier
                  Replication Journal Pool
----------------  --------------------------  --------------  ----------
Source Server     Y                           N/A             SRCSRVR
                  N                           N/A             MUPIP
Receiver Server   Y                           N/A             RCVSRVR
                  N                           N/A             MUPIP
Update Process    Y                           N/A             UPD
                  N                           N/A             MUPIP
Reader Helper     N/A                         Y               UPDREAD
                  N/A                         N               UPDHELP
Writer Helper     N/A                         Y               UPDWRITE
                  N/A                         N               UPDHELP

In addition to messages in the system log, and apart from database files and files created by application programs, GT.M creates several types of files: journal files, replication log files, gtmsecshr log files, inter-process communication socket files, files from recovery / rollback, and output and error files from JOB'd processes. You should develop a review and retention policy for them. Journal files and files from recovery / rollback are likely to contain sensitive information that may require special handling to meet business or legal requirements. Monitor all of these files for growth in file count or size that is materially different from the expectations set by the baseline. In particular, monitoring file sizes is computationally inexpensive, and regular monitoring - once an hour, for example - is easily accomplished with the system crontab.
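
The following is a minimal sketch of such an hourly check; the directory names, the crontab schedule, and the use of logger(1) to record the results are assumptions to adapt to your layout.

    # Hypothetical crontab entry:
    # 0 * * * * /usr/local/bin/gtm_file_watch.sh

    #!/bin/sh
    # gtm_file_watch.sh - record journal / log file counts and sizes so that
    # growth can be compared against the baseline.
    for DIR in /path/to/journal/dir /path/to/log/dir    # adjust to your layout
    do
        COUNT=$(find "$DIR" -type f | wc -l)
        SIZE_KB=$(du -sk "$DIR" | awk '{print $1}')
        logger -p user.info "gtm_file_watch: $DIR files=$COUNT size_kb=$SIZE_KB"
    done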

While journal files automatically switch to new files when they reach their size limit, log files can grow unchecked. You should periodically check the sizes of log files and switch them when they get large, or simply switch them on a regular schedule; a sketch of such a check follows the list below.

  1. gtmsecshr log file - gtm_secshr_log in the directory $gtm_log (send a SIGHUP to the gtmsecshr process to create a new log file).

    [Important] Important

    Starting with V6.0-000, GT.M logs gtmsecshr messages in the system log and ignores the environment variable gtm_log.

  2. Source Server, Receiver Server, and Update Process log files. For more information, refer to “Changing the Log File” in the Database Replication chapter.
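
The following is a minimal sketch of a scheduled log switch; the log file name, the 100 MB threshold, and the use of the -CHANGELOG qualifier are assumptions - refer to “Changing the Log File” in the Database Replication chapter for the exact syntax for your version.

    #!/bin/sh
    # Sketch: switch the Source Server log when it exceeds ~100 MB.
    LOG=/path/to/source_server.log                 # assumption: your log location
    if [ "$(du -k "$LOG" | awk '{print $1}')" -gt 102400 ]
    then
        $gtm_dist/mupip replicate -source -changelog -log="${LOG}.$(date +%Y%m%d%H%M%S)"
    fi

    # Pre-V6.0-000 only: make gtmsecshr open a new $gtm_log/gtm_secshr_log file.
    kill -HUP "$(pgrep gtmsecshr)"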

Since database health is critical, database growth warrants special attention. Ensure every file system holding a database file has sufficient space to handle the anticipated growth of the files it holds. Remember that with the lazy allocation used by UNIX file systems, all files in a file system compete for its space. GT.M issues an informational message each time it extends a database file. When extending a file, it also issues a warning if the remaining space is less than three times the extension size. You can use the $VIEW() function to find out the total number of blocks in a database as well as the number of free blocks.
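
The following is a minimal sketch of such a check, run from the shell with the %XCMD utility; the region name DEFAULT, the 10% threshold, and the use of logger(1) are assumptions, and the GT.M environment (gtm_dist, gtmgbldir, gtmroutines) must already be set.

    #!/bin/sh
    # Sketch: warn when the free blocks in a region fall below 10% of the total.
    FREE=$($gtm_dist/mumps -run %XCMD 'write $VIEW("FREEBLOCKS","DEFAULT")')
    TOTAL=$($gtm_dist/mumps -run %XCMD 'write $VIEW("TOTALBLOCKS","DEFAULT")')
    if [ $(( FREE * 100 / TOTAL )) -lt 10 ]
    then
        logger -p user.warning "Region DEFAULT has only $FREE of $TOTAL blocks free"
    fi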

As journal files grow with every update, they use up disk space faster than database files do. GT.M issues messages when a journal file comes within three, two, and one extension sizes (in blocks) of the automatic journal file switch limit. GT.M also issues messages when a journal file reaches its specified maximum size, at which time GT.M closes it, renames it, and creates a new journal file. Journal files covering time periods prior to the last database backup (or prior to the backup of replicating secondary instances) are not needed for continuity of business, and can be deleted or archived, depending on your retention policy. Check the amount of free space in file systems at least hourly, and perhaps more often, especially file systems used for journaling, and take action if it falls below a threshold.
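
The following is a minimal sketch of such a free-space check, suitable for running from cron; the mount point and the 10 GB threshold are assumptions.

    #!/bin/sh
    # Sketch: warn when the journal file system drops below a free-space threshold.
    JNLFS=/path/to/journal/filesystem        # assumption: your journal mount point
    THRESHOLD_KB=10485760                    # 10 GB, expressed in KB
    AVAIL_KB=$(df -Pk "$JNLFS" | awk 'NR==2 {print $4}')
    if [ "$AVAIL_KB" -lt "$THRESHOLD_KB" ]
    then
        logger -p user.warning "Journal file system $JNLFS low on space: ${AVAIL_KB}KB free"
    fi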

GT.M uses monotonically increasing relative time stamps called transaction numbers. You can monitor growth in the database transaction number with DSE DUMP -FILEHEADER. Investigate and obtain satisfactory explanations for deviations from the baseline rate of growth.
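
The following is a minimal sketch that records the current transaction number hourly so that its growth can be compared with the baseline; it assumes the GT.M environment (gtm_dist, gtmgbldir) is set, that DSE reads its command from standard input, and that the relevant file header field is labeled "Current transaction" (the value is typically displayed in hexadecimal).

    #!/bin/sh
    # Sketch: append a timestamped transaction number sample to a history file.
    TN=$(echo "dump -fileheader" | $gtm_dist/dse 2>/dev/null | \
         grep -i "current transaction" | awk '{print $NF}')
    echo "$(date '+%Y-%m-%d %H:%M') $TN" >> /var/log/gtm_tn_history.log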

After a MUPIP JOURNAL -RECOVER (non-replicated application configuration) or MUPIP JOURNAL -ROLLBACK (replicated application configuration), you should review and process or reconcile updates in the broken and unreplicated (lost) transaction files.

In a replicated environment, check the state of replication and the backlog frequently (at least hourly; more often is reasonable, since the check takes virtually no system resources) with MUPIP REPLICATE -CHECKHEALTH and -SHOWBACKLOG. Establish a baseline for the backlog, and take action if the backlog exceeds a threshold.
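
The following is a minimal sketch of such a check on an originating instance; the backlog threshold and the use of logger(1) are assumptions, as are the expectations that -CHECKHEALTH returns a non-zero exit status on problems and that -SHOWBACKLOG reports the backlog count on a line containing the word "backlog" - verify both against your version's output.

    #!/bin/sh
    # Sketch: verify replication health and alert when the backlog grows too large.
    BACKLOG_LIMIT=10000                      # assumption: your acceptable backlog
    $gtm_dist/mupip replicate -source -checkhealth || \
        logger -p user.err "mupip replicate -source -checkhealth reported a problem"
    BACKLOG=$($gtm_dist/mupip replicate -source -showbacklog 2>&1 | \
              awk '/backlog/ {print $1; exit}')
    if [ -n "$BACKLOG" ] && [ "$BACKLOG" -gt "$BACKLOG_LIMIT" ]
    then
        logger -p user.warning "Replication backlog $BACKLOG exceeds $BACKLOG_LIMIT"
    fi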

When a GT.M process terminates abnormally, it attempts to create a GTM_FATAL_ERROR.ZSHOW_DMP_* file containing a dump of the M execution context and a core file containing a dump of the native process execution context. The M execution context dump is created in the current working directory of the process. Your operating system may offer a means to control the naming and placement of core files; by default they are created in the current working directory of the process with a name of the form core*. The process context information may be useful to you in understanding the circumstances under which the problem occurred and/or how to deal with the consequences of the failure on the application state. The core files are likely to be useful primarily to your GT.M support channel. If you experience process failures but do not find the expected files, check file permissions and quotas. You can simulate an abnormal process termination by sending the process a SIGILL (with kill -ILL or kill -4 on most UNIX/Linux systems).

[Caution] Caution

Dumps of process state files are likely to contain confidential information, including database encryption keys. Please ensure that you have appropriate confidentiality procedures as mandated by applicable law and corporate policy.
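
The following is a minimal sketch of such a drill on a disposable test process; the process id is an assumption, the working-directory lookup via /proc is Linux-specific, and this should never be aimed at a production process.

    #!/bin/sh
    # Sketch: force an abnormal termination of a disposable GT.M process and
    # verify that the expected dump files appear in its working directory.
    TESTPID=12345                              # assumption: pid of a disposable process
    WORKDIR=$(readlink /proc/$TESTPID/cwd)     # Linux-specific working-directory lookup
    kill -ILL "$TESTPID"
    sleep 5
    ls -l "$WORKDIR"/GTM_FATAL_ERROR.ZSHOW_DMP_* "$WORKDIR"/core* 2>/dev/null \
        || echo "expected dump files not found - check file permissions and quotas"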

GT.M processes started with the JOB command create .mje and .mjo files for their STDERR and STDOUT respectively. Analyze non-empty .mje files. Design your application and/or operational processes to remove or archive .mjo files once they are no longer needed.
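
The following is a minimal sketch of that housekeeping; the output directory and the seven-day retention period are assumptions.

    #!/bin/sh
    # Sketch: report non-empty .mje (STDERR) files and age out old .mjo (STDOUT) files.
    JOBDIR=/path/to/job/output               # assumption: where JOB'd processes write
    find "$JOBDIR" -name '*.mje' -type f -size +0c -exec ls -l {} \;
    find "$JOBDIR" -name '*.mjo' -type f -mtime +7 -exec rm -f {} \;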

Use the environment variable gtm_procstuckexec to trigger monitoring for processes holding a resource for an unexpectedly long time. $gtm_procstuckexec specifies a shell command or a script for GT.M to execute when it detects such a condition.

The shell script or command pointed to by gtm_procstuckexec can send an alert, take corrective actions, and log information.
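
The following is a minimal sketch of such a handler; the script path, the log file location, and the process snapshot it records are assumptions, and the script simply logs whatever arguments it receives ($*).

    #!/bin/sh
    # /usr/local/bin/gtm_procstuck.sh (hypothetical path)
    # Sketch: log the arguments supplied to the handler plus a snapshot of
    # GT.M-related processes, for later analysis.
    {
        date '+%Y-%m-%d %H:%M:%S'
        echo "gtm_procstuckexec invoked with: $*"
        ps -ef | grep -i '[g]tm'             # snapshot of GT.M-related processes
    } >> /var/log/gtm_procstuck.log 2>&1

    # Activate the handler for processes started from this shell:
    # export gtm_procstuckexec=/usr/local/bin/gtm_procstuck.sh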

[Note] Note

Make sure user processes have sufficient space and permissions to run the shell command / script.