A watchdog timer (WDT) is a bit of hardware that monitors the execution of code to reset the processor if the software crashes. For many years there has been a raging debate in the embedded world about their importance. More than a few engineers feel WDTs are unnecessary; a better solution, they claim, is to write firmware that does not crash. That is a noble sentiment, for perfection is a lofty and admirable goal.
However, few products ever reach that level of quality. As the software swells in size, even an uncompromising focus on quality will hardly yield perfection. A million-line program, if the code is 99.99 percent correct (a number far higher than achieved by the vast majority of organizations), has 100 lurking bugs. Any of those may crash the system or, worse, put it into a dangerous operating mode. (Alas, the average embedded system ships with only 95 percent of all of the bugs removed. [1])
Bugs are not the only problem. Perfectly designed and built hardware on which perfect code executes can still fail.
Increasingly, cosmic rays are causing problems in digital systems. Consisting mostly of high-energy protons from space, they can interact with transistors on ICs and flip bits. In the early days of microprocessors, these were much less of a threat than today because the process geometry was large – lots of energy was needed to effect a bit-flip. Today the problem has grown significantly worse since a 45 nm geometry is routine and 28 nm not uncommon, with smaller nodes appearing yearly.
In the 1990s, IBM found that a typical computer experiences about one error due to cosmic rays every month per 256 MB of RAM. [2] Geometries have shrunk a lot since then, so presumably the problem has gotten worse.
Intel believes that cosmic rays are likely to be an increasing source of computer errors in the future. Their patent 7,309,866 uses a MEMS sensor to detect incoming cosmic rays and then signals a circuit to take corrective action. [3]
H. Kobayashi et al. found that errors from cosmic rays and other particles more than doubled in devices built from 180 nm geometry compared to those of 250 nm. [4]
A 2004 paper by Tezzaron Semiconductor shows that SRAMs and logic are the primary sufferers of cosmic ray upsets. [5] The authors claim a system with one GB of SRAM can expect a soft error every two weeks, and the problems are ten times worse in Denver than at sea level.
Amazingly, a particle with as little as a 10 femtocoulomb charge has enough energy to flip an SRAM bit. [6] A decade ago, the larger cells needed five times more energy.
The bottom line: even perfectly written code can crash. Only a watchdog timer can help a crashed system recover.
A great WDT
Since the WDT is the very last line of defense, its design must anticipate any failure mode. One may ask, “What are the characteristics of a great watchdog?”
First, the WDT must be independent of the CPU. No matter what odd mode the processor finds itself in, the watchdog timer has to be functional. Further, once set up at initialization, nothing the processor does should be able to disable or reprogram the watchdog. Otherwise a rogue program could accidentally disable this protective mechanism, rendering it useless.
The WDT must always, under any condition barring perhaps a hardware failure, bring the system back to life. This means issuing a hard reset to the CPU. No other option is guaranteed to bring a crashed processor back to life.
Some WDTs issue a non-maskable interrupt instead of a reset. The idea is that the NMI’s service routine can snapshot the stack and log debugging information. Alas, there is no reason to believe that a CPU which is in an arbitrary dysfunctional mode will respond to any interrupt; there is quite a lot of processing required before the service routine will be invoked. On many processors an interrupt service routine will not start if, for example, the stack pointer holds an odd or otherwise unaligned address; indeed, the processor may go into a double-bus fault mode, wherein the CPU shuts down and only a hard reset will restore operation.
The NMI approach is interesting, however. One alternative sometimes used is to issue an NMI and start a timer. After a few milliseconds the timer resets the CPU. The NMI service routine, if it works, logs debugging information, and the inevitable hard reset ensures the device comes back to life.
It is critical that the watchdog, independently of a perhaps crippled CPU, puts the system into a safe state when the system controls dangerous hardware. Moving machinery, hazardous radiation, etc. must be disabled, parked or otherwise disengaged, since the reset may not work if a hardware failure has crashed the processor.
Today’s embedded systems often have very sophisticated peripherals; in some cases the I/O may be much more complex than the microprocessor. The WDT reset sequence must ensure that these devices are brought back to a known state. When code crashes, it may issue bizarre streams of data to the peripherals. If the design of the peripherals is such that the CPU is not always able to put the devices into known correct states, those devices need a hard reset from the WDT.
Finally, it is wise to leave debugging breadcrumbs behind, if possible. The previously mentioned NMI/deferred-reset scheme is an example. Save the stack and other critical parameters into a region of non-volatile memory the developers can access. Unfortunately, a reset destroys all processor state information, but there is often application-specific data that can help diagnose problems, like pointers into state machine tables. Before initializing those after a reset, save them. If there is a real-time clock, also save the time of the reset.
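As an illustration of the breadcrumb idea, here is a minimal sketch in C. Everything in it is an assumption made for the example: the names `wdt_crumbs_t`, `crumbs_save()` and `CRUMB_MAGIC` are hypothetical, and the record is held in an ordinary variable so the code can run on a desktop. On a real target the record would have to live in memory that survives a reset (battery-backed RAM, EEPROM, or a no-init linker section).

```c
#include <stdint.h>

/* Arbitrary marker used to tell a valid record from uninitialized memory. */
#define CRUMB_MAGIC 0xB4EADC4Bu

/* Hypothetical breadcrumb record; the fields are illustrative only. */
typedef struct {
    uint32_t magic;        /* CRUMB_MAGIC once the record has been set up      */
    uint32_t reset_count;  /* number of resets logged so far                   */
    uint32_t state_index;  /* stale application data, e.g. a state table index */
    uint32_t rtc_seconds;  /* real-time clock reading taken at reset           */
} wdt_crumbs_t;

/* Assumption: on a target this object is placed in non-volatile or
 * no-init memory. A plain global stands in for that region here. */
wdt_crumbs_t crumbs;

/* Call early in startup, BEFORE the application re-initializes the data
 * being preserved. */
void crumbs_save(uint32_t stale_state_index, uint32_t rtc_seconds)
{
    if (crumbs.magic != CRUMB_MAGIC) {
        /* First use, or the region was corrupted: start a fresh record. */
        crumbs.magic = CRUMB_MAGIC;
        crumbs.reset_count = 0;
    }
    crumbs.reset_count++;
    crumbs.state_index = stale_state_index;
    crumbs.rtc_seconds = rtc_seconds;
}
```

The startup code would call `crumbs_save()` with whatever stale values are still sitting in RAM, then carry on with normal initialization. A fuller version would probably add a checksum over the record.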
Watchdogs can be divided into two general categories: those that are on board the processor chip, and external devices added by the hardware designer. Most microcontrollers have an internal watchdog, though their efficacy varies widely.
An example is Maxim’s (née Dallas’) DS80C320/DS80C323, an 8031 variant that has been around for quite a while. This part has two really nice features in the watchdog. First, one can program it to generate an interrupt, but 512 cycles later it will reset the CPU, so debugging breadcrumbs are easy to save. Also, access to the WDT registers can be restricted; one has to execute two particular move instructions back-to-back, and then there is just a three-cycle window in which a WDT register access can take place. This hugely reduces the chances that rogue code will disable the protection mechanisms. However, one wonders what happens if an interrupt occurs between these instructions. Presumably the WDT access will not occur, making it impossible to enable that feature. Clearly, the software engineer must disable interrupts when executing this sequence.
Freescale’s MCF520x series are rather different. To tickle the watchdog, one must issue two writes to the watchdog service register, but any number of instructions may occur between these. This could defeat reliable operation if the CPU is crashed and running random code. On the upside, the reset status register does log whether the prior reset was due to an external hardware signal or to a WDT timeout, a useful way to log errors after rebooting. One may program the watchdog to generate either a reset or an interrupt; the latter is a very bad idea. If the stack were to go odd – due to a bug or rogue code – the system will go into a double-bus fault. An interrupt will not restore the CPU to normal operation; only a reset will.
STMicroelectronics' new series of STM32F4 Cortex™-M4 CPUs has two independent watchdogs. One runs from its own internal RC oscillator. That means that all kinds of things can collapse in the CPU and the WDT will still fire. There is also a “window watchdog” (WWDT) which requires the code to tickle it frequently, but not too often. This is a very effective way to ensure that crashed code which randomly writes to the protection mechanism does not cause a WDT tickle. The WWDT can also generate an interrupt shortly before reset is asserted.
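The windowing rule is easy to state in code. The following is a host-side model in C, offered only as a sketch of the behavior: it is not the STM32 register interface, and the type `wwdt_model_t`, the function names and the millisecond timebase are all assumptions made for illustration. A tickle that arrives before the window opens is treated as a fault, as is the absence of any tickle by the timeout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of a window watchdog. */
typedef struct {
    uint32_t window_open_ms;  /* tickles sooner than this are "too early"   */
    uint32_t timeout_ms;      /* no tickle by this time is "too late"       */
    uint32_t last_tickle_ms;  /* time of the last accepted tickle           */
    bool     reset_asserted;  /* latched once the watchdog decides to reset */
} wwdt_model_t;

void wwdt_init(wwdt_model_t *w, uint32_t window_open_ms,
               uint32_t timeout_ms, uint32_t now_ms)
{
    w->window_open_ms = window_open_ms;
    w->timeout_ms     = timeout_ms;
    w->last_tickle_ms = now_ms;
    w->reset_asserted = false;
}

/* A tickle is accepted only inside [window_open_ms, timeout_ms). */
void wwdt_tickle(wwdt_model_t *w, uint32_t now_ms)
{
    uint32_t elapsed = now_ms - w->last_tickle_ms;

    if (elapsed < w->window_open_ms || elapsed >= w->timeout_ms) {
        w->reset_asserted = true;      /* too early or too late */
    } else {
        w->last_tickle_ms = now_ms;    /* valid tickle: restart the period */
    }
}

/* Models the passage of time with no tickle at all. */
void wwdt_poll(wwdt_model_t *w, uint32_t now_ms)
{
    if (now_ms - w->last_tickle_ms >= w->timeout_ms) {
        w->reset_asserted = true;
    }
}
```

With a 20 ms window opening and a 50 ms timeout, a tickle at 30 ms is accepted, while a second tickle 5 ms later latches the reset. That second case is the one a conventional watchdog would miss.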
Intriguingly, some of these parts also include an “analog watchdog” which fires an interrupt if an input to an A/D exceeds a programmed limit. One could monitor the power supply and detect brownouts. In a system that controls dangerous hardware, this early-warning could be used to put the system into a safe state before the power goes out of operating limits.
Many of Microchip’s PIC24F series have WWDTs, as do some of NXP’s parts such as the LPC18xx and LPC43xx series. NXP’s parts can be configured so that, once enabled, it is impossible for the software to turn the WDT off, which offers more protection from code that is running amok.
None of these processors signal the outside world that a timeout took place. The designer may have to assert a parallel I/O bit to reset external hardware if the software cannot guarantee a proper re-initialization.
Few microprocessors (as opposed to microcontrollers) have an internal watchdog timer, and in many cases internal WDTs do not provide the reliability needed for a particular application. In these cases the design should be augmented with external hardware that monitors system operation and issues a reset if needed.
In a system that uses two or more CPUs, it is reasonable to have each processor monitor the other’s operation.
There are a number of WDT chips available. In general, their operation is not controlled by software, so the crashed program cannot disable their functionality. Additionally, they assert reset during power-up, eliminating the need for separate power-on reset components.
One external WWDT is Maxim’s MAX6751. It has a WWDT whose timeouts are controlled by capacitors, as shown in Figure 1.
Figure 1: The MAX6751’s timeouts are set by a pair of capacitors (Courtesy Maxim).
Texas Instruments’ TPS3126 is similar to a WWDT without the window capability. They are available for a variety of supply voltages and delay times, are inexpensive, and come in SOT-23 packages. Figure 2 outlines their configuration.
Figure 2: TI’s TPS3126 monitors power and includes a WDT (Courtesy Texas Instruments).
TI also has a family of parts – including the TPS386000 – which monitor four separate power rails and include a WDT with a fixed delay. One of the voltage monitors can handle negative supplies. If any go out of tolerance, individual “RESET” outputs are activated. Being open drain, they can be wire-ORed together. Alternatively, one could connect these to input pins so the CPU can know which supply is low, and take appropriate action, as shown in Figure 3.
Figure 3: The TPS386000 monitors four power supplies (Courtesy Texas Instruments).
Analog Devices' ADM699 is a simple WDT which also monitors one supply. Figure 4 shows its implementation.
Figure 4: The ADM699 has a very clean and simple design (Courtesy Analog Devices).
Some microprocessors now are very particular about the reset input, tolerating only relatively rapid slew rates and signal levels. An open drain drive can meet those requirements only by using a very low-value pull-up resistor, which increases power consumption. Analog Devices has several components, like the ADM6316, that use a push-pull output to meet these stringent requirements. Figure 5 shows the part’s block diagram.
Figure 5: The ADM6316 has push-pull drivers (Courtesy Analog Devices).
Even the best watchdog circuitry is but a poor safety mechanism if the code is not properly constructed. Alas, in most systems, developers sprinkle watchdog tickles throughout the code without thinking through the design.
The most important consideration is to ensure that all of the code is running correctly, not just part of it. Therefore, never put a WDT tickle in an interrupt service routine, and never devote an RTOS task to this activity. If the main code crashes, the interrupts, and even the RTOS’s scheduler, may continue to run, so the watchdog never times out.
In a single-threaded design, use a state machine-like architecture. Example code is shown in Figure 6.
Figure 6: Code to handle a non-multitasking WDT.
Here the main loop starts by setting variable “state” to 0x5555. It calls wdt_a() which checks to see if the value is correct; if not, it halts and the WDT resets the system. Otherwise the value in “state” is changed by adding an offset. Note that the watchdog has not been tickled.
At the end of the loop (we have executed all of the code), “state” is adjusted once again and wdt_b() gets called. Now, if “state” has not cycled through all of the expected changes, which together show that the entire loop ran, the code halts and the CPU gets reset. Otherwise the watchdog is tickled and “state” is set to zero. Note that if the code crashes and wanders into wdt_b(), “state” will not be correct and a reset will happen.
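A minimal sketch of the scheme just described follows. It is not the original Figure 6 listing. Only the starting value 0x5555 and the names wdt_a() and wdt_b() come from the text; the offsets, `main_loop_once()`, `wdt_tickle()` and `wdt_halt()` are hypothetical. So that the logic can be exercised on a desktop, the halt routine here merely sets a flag; on a target it would disable interrupts and spin until the watchdog fires.

```c
#include <stdbool.h>
#include <stdint.h>

#define STATE_START 0x5555u  /* value named in the text                 */
#define OFFSET_A    0x1111u  /* hypothetical offset added by wdt_a()    */
#define OFFSET_END  0x2222u  /* hypothetical offset added at loop's end */

volatile uint16_t state;     /* progress marker checked by the WDT code  */
unsigned tickle_count;       /* stand-in for the hardware tickle         */
bool halted;                 /* stand-in for "halt and wait for the WDT" */

/* On a target: write the service sequence to the watchdog hardware. */
static void wdt_tickle(void) { tickle_count++; }

/* On a target: disable interrupts and loop forever. */
static void wdt_halt(void) { halted = true; }

/* Called at the top of the loop: verify the start value, then advance it.
 * The watchdog is NOT tickled here. */
void wdt_a(void)
{
    if (state != STATE_START) {
        wdt_halt();
        return;              /* only reached in this host simulation */
    }
    state += OFFSET_A;
}

/* Called at the bottom of the loop: tickle only if every step ran. */
void wdt_b(void)
{
    if (state != (uint16_t)(STATE_START + OFFSET_A + OFFSET_END)) {
        wdt_halt();
        return;              /* only reached in this host simulation */
    }
    wdt_tickle();
    state = 0;
}

/* One pass of the main loop. */
void main_loop_once(void)
{
    state = STATE_START;
    wdt_a();
    /* ... all of the application's work goes here ... */
    state += OFFSET_END;
    wdt_b();
}
```

A stray jump straight into wdt_b() finds “state” at zero (or some other wrong value), so the tickle is skipped and the halt path is taken instead.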
In a multitasking application, each task increments a value associated with the task in a data structure every time each starts. A low-priority task occasionally examines the data structure and checks to make sure the data is reasonable. If one task runs often, the number will be large; conversely, slow tasks will have low numbers. If everything is OK the task tickles the WDT, zeroes the values, and returns. Otherwise it halts and initiates a reset.
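One way to express this in C is sketched below. The structure, the names `task_checkin()` and `wdt_monitor_check()`, the task count and the limits are all assumptions chosen for illustration; “reasonable” is interpreted here as a per-task minimum and maximum run count for each check period. A real RTOS port would also need to make the counter updates atomic or guard them with a critical section.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_TASKS 3u         /* hypothetical number of monitored tasks */

typedef struct {
    uint32_t runs;           /* incremented by the task each time it starts */
    uint32_t min_runs;       /* fewest runs considered healthy per period   */
    uint32_t max_runs;       /* most runs considered healthy per period     */
} task_health_t;

/* Illustrative limits: a fast task, a medium task and a slow task. */
task_health_t task_health[NUM_TASKS] = {
    { 0, 50, 200 },
    { 0,  5,  20 },
    { 0,  1,   2 },
};

/* Each task calls this with its own index every time it starts. */
void task_checkin(unsigned id)
{
    if (id < NUM_TASKS) {
        task_health[id].runs++;
    }
}

/* Called from the low-priority monitor. Returns true, and zeroes the
 * counters, only if every task ran a plausible number of times. The
 * caller tickles the WDT on true, and halts to force a reset on false. */
bool wdt_monitor_check(void)
{
    for (unsigned i = 0; i < NUM_TASKS; i++) {
        uint32_t n = task_health[i].runs;
        if (n < task_health[i].min_runs || n > task_health[i].max_runs) {
            return false;
        }
    }
    for (unsigned i = 0; i < NUM_TASKS; i++) {
        task_health[i].runs = 0;
    }
    return true;
}
```

The upper limit matters as much as the lower one: a task that runs far more often than expected is also a symptom of trouble.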
If there are exceptions that the code cannot handle or recover from (e.g., a divide by zero or a malloc() failure), write a handler that disables interrupts and halts. The watchdog will bring the system back to life.
The watchdog timer is the last line of defense against crashed code, and as such, must be well designed and implemented. Today many microcontrollers include WDTs that are very sophisticated and resilient to spoofing by a program that has run amok. Alternatively, use an external WDT; probably the safest bet is one that supports windowing. Also, structure the code carefully, so that errant software does not fall into the tickle routine and keep the timer from resetting the CPU.
[1] Applied Software Measurement, Capers Jones, 2008.