Recovery

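When an instance goes down, its recovery consists of (at least) two steps. The first is recovery of the instance itself (hardware, OS, file systems, and so on), which takes a time of, say, t_sys; t_sys is almost completely^[6] independent of the type of GT.M journaling. The second is recovery of the database, whose duration depends on the type of journaling.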

For database recovery:

With BEFORE_IMAGE journaling, the time is simply that needed to execute a mupip journal -recover -backward "*" command or, when using replication, mupip journal -recover -rollback. This uses before image records in the journal files to roll the database files back to their last epochs, and then forward to the most current updates. If this takes t_bck, the total recovery time is t_sys+t_bck.
With NOBEFORE_IMAGE journaling, the time is that required to restore the last backup, say t_rest, plus the time to perform a mupip journal -recover -forward "*" command, say t_fwd, for a total recovery time of t_sys+t_rest+t_fwd. If the last backup is available online, so that "restoring the backup" is nothing more than setting the value of an environment variable, t_rest=0 and the recovery time is t_sys+t_fwd.

Because t_bck is less than t_fwd, t_sys+t_bck is less than t_sys+t_fwd. In very round numbers, t_sys may be minutes to tens of minutes, t_fwd may be tens of minutes, and t_bck may be tens of seconds to minutes. So, recovering instance A might (to a crude first approximation) be half an order of magnitude faster with BEFORE_IMAGE journaling than with NOBEFORE_IMAGE journaling. Consider two deployment configurations.
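The comparison above can be worked through with concrete figures. The following is a minimal sketch only: the function names and the sample timings (in minutes) are assumptions chosen to fall within the "very round numbers" quoted in the text, not measurements from any real system.

```python
# Hypothetical timings in minutes; illustrative assumptions, not measurements.
T_SYS = 9.0    # recover hardware, OS, file systems (minutes to tens of minutes)
T_BCK = 1.0    # backward recovery (tens of seconds to minutes)
T_FWD = 21.0   # forward recovery (tens of minutes)
T_REST = 0.0   # last backup assumed to be online, so nothing to restore


def before_image_recovery(t_sys, t_bck):
    """Total recovery time with BEFORE_IMAGE journaling: t_sys + t_bck."""
    return t_sys + t_bck


def nobefore_image_recovery(t_sys, t_fwd, t_rest=0.0):
    """Total recovery time with NOBEFORE_IMAGE journaling: t_sys + t_rest + t_fwd."""
    return t_sys + t_rest + t_fwd


before = before_image_recovery(T_SYS, T_BCK)
nobefore = nobefore_image_recovery(T_SYS, T_FWD, T_REST)
ratio = nobefore / before

print(f"BEFORE_IMAGE:   {before:.0f} minutes")
print(f"NOBEFORE_IMAGE: {nobefore:.0f} minutes")
print(f"ratio:          {ratio:.1f}x")
```

With these assumed figures the totals are ten and thirty minutes, a factor of three, which is close to the half order of magnitude mentioned above. Setting T_REST to a nonzero value models a backup that must first be restored from offline storage.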

Where A is the sole production instance of an application, halving or quartering the recovery time of the instance is significant, because when the instance is down, the enterprise is not in business. The difference between a ten minute recovery time and a thirty minute recovery time is important. Thus, when running a sole production instance, or a sole production instance backed up by an underpowered or not easily accessed "disaster recovery site", before image journaling with backward recovery better suits a production deployment. Furthermore, in this situation, there is pressure to bring A back up soon, because the enterprise is not in business, and that pressure increases the probability of human error.
Where two equally functional and accessible instances, A and B, are deployed in an LMS configuration, and A crashes while running as the originating instance replicating to B, B can be switched from a replicating instance to an originating instance within seconds. An appropriately configured network can change the routing of incoming accesses from one instance to the other in seconds to tens of seconds. The enterprise is down only for the time required to ascertain that A is in fact down and to make the decision to switch to B, perhaps a minute or two. Furthermore, B is in a "known good" state; therefore, a strategy of "if in doubt, switchover" is entirely appropriate. This time, t_swch, is independent of whether A and B are running BEFORE_IMAGE journaling or NOBEFORE_IMAGE journaling. The difference between BEFORE_IMAGE journaling and NOBEFORE_IMAGE journaling is the difference in the time subsequently taken to recover A, so that it can be brought up as a replicating instance to B. If NOBEFORE_IMAGE journaling is used and the last backup is online, there is no need to first perform a forward recovery on A using its journal files. Once A has rebooted:
- Extract the unreplicated transactions from the crashed environment.
- Connect the backup as a replicating instance to B and allow it to catch up.
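The contrast between the two deployment configurations can be sketched numerically. This is an illustration under stated assumptions: the Timings class, the function names, and the sample values (in minutes) are hypothetical and serve only to show that, with a switchover available, the time the enterprise is out of business does not depend on the journaling type.

```python
from dataclasses import dataclass


@dataclass
class Timings:
    """Hypothetical timings in minutes; illustrative assumptions only."""
    t_sys: float   # recover hardware, OS, file systems
    t_db: float    # t_bck for BEFORE_IMAGE, t_rest + t_fwd for NOBEFORE_IMAGE
    t_swch: float  # ascertain that A is down and switch over to B


def downtime_sole_instance(t):
    """Sole production instance: the enterprise is down until A is recovered."""
    return t.t_sys + t.t_db


def downtime_lms(t):
    """LMS configuration: the enterprise is down only until B takes over."""
    return t.t_swch


before_image = Timings(t_sys=9.0, t_db=1.0, t_swch=2.0)
nobefore_image = Timings(t_sys=9.0, t_db=21.0, t_swch=2.0)

for name, t in (("BEFORE_IMAGE", before_image), ("NOBEFORE_IMAGE", nobefore_image)):
    print(f"{name}: sole instance {downtime_sole_instance(t):.0f} min, "
          f"LMS {downtime_lms(t):.0f} min")
```

In this sketch the journaling choice changes the sole-instance downtime from ten to thirty minutes, while the LMS downtime stays at the assumed two-minute switchover time in both cases; the recovery of A then happens in the background while B serves the application.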


	Applications that can take advantage of the forthcoming LMX capability will essentially make t_swch zero when used with a suitable front-end network.

^[6]The reason for the "almost completely" qualification is that the time to recover some older file systems can depend on the amount of space used.