The Growing Importance of Watchdog Functionality in MCU Applications



With all the complexities of multithreaded, real-time, and multitasking embedded systems, it has become increasingly difficult to know when a micro is misbehaving. With so many service routines operating somewhat independently, it is entirely possible for some to be alive while others are locked, dead, or executing nonsensical code.

Consequently, it is now more difficult for embedded-systems designers to protect the entire system from a failure occurring in some lowly service routine or core that has gone astray.

This article looks at the evolving need for more advanced watchdog functionality and the techniques engineers can use to ensure reliable MCU operation. It examines the shortcomings that need to be addressed, both externally in hardware and internally in software, and looks at sample watchdog parts. All parts, datasheets, tutorials, and development references here can be found online on Hotenda’s website.

Simple protection

The need for watchdog functionality spans from 4-bit to advanced 32-bit machines and beyond. It is not uncommon to see simple processors performing in potentially dangerous situations. For example, a throttle-control loop may only need a simple mixed-signal 8-bit microcontroller locally to provide stable closed-loop control. It can take commands over a car’s CAN bus and can offload all the processing from a remote car computer. However, if that simple processor fails, full throttle can be deadly.

It is safe to say that virtually every modern microcontroller contains some rudimentary watchdog functionality, be it dedicated watchdog hardware blocks or general-purpose timers that can be used to implement software-controlled watchdog functions. These are all synchronized to a system clock in the processor.

As processors get more sophisticated, so too can the clocking structures and clock distribution within the processor itself. Oscillators can be particularly susceptible to ESD hits, for example, and if the clocks go down, synchronous watchdogs do no good.

R/C oscillators and time constants may be old school, but they can provide independent clocking and reset mechanisms as a fallback or backup. Likewise, redundant internal- and external-oscillator sources can provide a heartbeat that keeps monitoring circuitry alive during adverse failures. Selectable integration of clocks is also important, as are the locations of the clocks in the tree (Figure 1).


Figure 1: Combining system clocks with backup R/C oscillators can allow power savings and a reliable independent backup clock for watchdog functions in the event that the system clocks become corrupted. Note how two R/C oscillators are used here.
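
As a rough illustration of the fallback idea, the sketch below models a monitor that runs from the backup R/C clock and periodically samples a counter driven by the main oscillator. If that counter stops advancing for a set number of samples, the watchdog clock source is switched to the R/C oscillator and stays there. All names here (clk_monitor_t, clkmon_sample, the stale limit) are hypothetical and do not describe any particular part's hardware.

```c
#include <stdint.h>

/* Hypothetical clock sources for the watchdog block. */
typedef enum { WDT_CLK_MAIN, WDT_CLK_BACKUP_RC } wdt_clk_source_t;

typedef struct {
    uint32_t last_edge_count;   /* main-clock edge counter seen at last sample */
    uint32_t stale_samples;     /* consecutive samples with no new edges       */
    uint32_t stale_limit;       /* samples tolerated before switching over     */
    wdt_clk_source_t source;    /* clock currently feeding the watchdog        */
} clk_monitor_t;

void clkmon_init(clk_monitor_t *m, uint32_t stale_limit)
{
    m->last_edge_count = 0u;
    m->stale_samples = 0u;
    m->stale_limit = stale_limit;
    m->source = WDT_CLK_MAIN;
}

/* Call at a fixed rate from code clocked by the backup R/C oscillator.
 * main_edge_count is a free-running counter driven by the main oscillator. */
wdt_clk_source_t clkmon_sample(clk_monitor_t *m, uint32_t main_edge_count)
{
    if (m->source == WDT_CLK_BACKUP_RC) {
        return m->source;               /* latched until re-initialized */
    }
    if (main_edge_count != m->last_edge_count) {
        m->last_edge_count = main_edge_count;
        m->stale_samples = 0u;          /* main clock is still toggling */
    } else if (++m->stale_samples >= m->stale_limit) {
        m->source = WDT_CLK_BACKUP_RC;  /* main clock appears dead */
    }
    return m->source;
}
```

Latching the switch-over is a design choice in this sketch: once the main oscillator has been judged unreliable, the watchdog keeps running from the R/C source until the system explicitly re-initializes the monitor.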

The same holds true for low-voltage detection circuitry. While rudimentary precision can be achieved using internal voltage references, comparators, and detectors, external circuitry may provide higher resolution and more precise voltage-level selectivity. For example, if part of your voltage-failure-mode software includes writing to EEPROM, you may want to set the low-voltage detector threshold high enough that it trips early, giving the stored capacitive charge enough time to complete the EEPROM write before the system shuts down in an orderly fashion. Modern voltage detectors can achieve voltage resolutions down to 0.05 V, making precise use of all available energy. This is typically a much better resolution than you will find internally in a micro.
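
The trade-off between trip threshold and available write time can be estimated with the capacitor relation t = C x (Vtrip - Vmin) / I. The following is a minimal sketch of that arithmetic, assuming an ideal hold-up capacitor and a constant load current; the function names and the example values are illustrative assumptions, not figures from any datasheet.

```c
#include <stdint.h>

/* Time in microseconds that a hold-up capacitor can supply a constant load
 * while its voltage falls from the detector trip point (v_trip_mv) to the
 * lowest voltage at which the EEPROM write is still valid (v_min_mv).
 * With C in uF, voltages in mV, and current in mA, C * dV / I is in us. */
uint32_t holdup_time_us(uint32_t cap_uf, uint32_t v_trip_mv,
                        uint32_t v_min_mv, uint32_t load_ma)
{
    if (load_ma == 0u || v_trip_mv <= v_min_mv) {
        return 0u;  /* no load given, or no usable voltage headroom */
    }
    uint64_t charge = (uint64_t)cap_uf * (v_trip_mv - v_min_mv);
    return (uint32_t)(charge / load_ma);
}

/* Returns 1 if the chosen trip point leaves enough time for the write. */
int trip_point_allows_write(uint32_t cap_uf, uint32_t v_trip_mv,
                            uint32_t v_min_mv, uint32_t load_ma,
                            uint32_t write_time_us)
{
    return holdup_time_us(cap_uf, v_trip_mv, v_min_mv, load_ma)
           >= write_time_us;
}
```

With an assumed 100 uF capacitor, a 10 mA load, and a 2.7 V minimum, a 3.00 V trip point gives 3000 us of hold-up time, while moving the trip point up by one 0.05 V step to 3.05 V gives 3500 us. For a hypothetical 3.3 ms write, that single step is the difference between a completed and an interrupted write.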

Another thing to be aware of is that maximum timeouts alone are not always effective. Most watchdog schemes basically implement a re-triggerable monostable multivibrator (re-triggerable 1-shot) function. If software or hardware cycle clocks do not reset the timer within a maximum allowable time frame, the watchdog trips and resets the processor (or initiates a failure-recovery service routine).
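
The re-triggerable 1-shot behavior can be modeled in a few lines of software. The sketch below is only a behavioral model under assumed names (oneshot_wdt_t, wdt_kick, wdt_tick); in a real part the countdown is done by hardware and the trip output drives the reset line.

```c
#include <stdbool.h>
#include <stdint.h>

/* Behavioral model of a re-triggerable 1-shot watchdog. */
typedef struct {
    uint32_t timeout_ticks;  /* maximum allowed ticks between kicks */
    uint32_t remaining;      /* ticks left before the watchdog trips */
    bool tripped;            /* latched once the timeout expires */
} oneshot_wdt_t;

void wdt_init(oneshot_wdt_t *w, uint32_t timeout_ticks)
{
    w->timeout_ticks = timeout_ticks;
    w->remaining = timeout_ticks;
    w->tripped = false;
}

/* Re-trigger: called by the monitored code to restart the interval. */
void wdt_kick(oneshot_wdt_t *w)
{
    if (!w->tripped) {
        w->remaining = w->timeout_ticks;
    }
}

/* Advance one timer tick; returns true once the watchdog has tripped. */
bool wdt_tick(oneshot_wdt_t *w)
{
    if (!w->tripped) {
        if (w->remaining > 0u) {
            w->remaining--;
        }
        if (w->remaining == 0u) {
            w->tripped = true;  /* would reset the CPU or start recovery */
        }
    }
    return w->tripped;
}
```

Note that this model checks only the maximum interval: code that kicks the timer too often, for instance from a runaway loop, never trips it.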

Minimum time requirements are also of interest. If a service routine, for example, is synchronized with the zero crossings of a 60 Hz power line, which occur twice per cycle, then pulses should arrive 8.33 ms apart. If they arrive sooner, noise or a fault condition is present and must be dealt with, often with safety ramifications.
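
A windowed check of this kind might look like the sketch below, which flags pulses that arrive too early as well as too late. The function names and the percentage tolerance are assumptions made for illustration; the right window width depends on the application.

```c
#include <stdint.h>

typedef enum { PULSE_OK, PULSE_TOO_EARLY, PULSE_TOO_LATE } pulse_check_t;

/* Nominal spacing of zero crossings in microseconds: two per line cycle. */
uint32_t zero_cross_interval_us(uint32_t line_hz)
{
    if (line_hz == 0u) {
        return 0u;
    }
    return 1000000u / (2u * line_hz);
}

/* Classify a measured pulse interval against a window of
 * nominal_us +/- tolerance_pct percent. */
pulse_check_t check_pulse_interval(uint32_t measured_us, uint32_t nominal_us,
                                   uint32_t tolerance_pct)
{
    uint32_t margin = (uint32_t)(((uint64_t)nominal_us * tolerance_pct) / 100u);
    if (margin > nominal_us) {
        margin = nominal_us;   /* keep the lower bound from underflowing */
    }
    if (measured_us < nominal_us - margin) {
        return PULSE_TOO_EARLY;  /* likely noise or a fault condition */
    }
    if (measured_us > nominal_us + margin) {
        return PULSE_TOO_LATE;   /* missed pulse: the classic timeout case */
    }
    return PULSE_OK;
}
```

For a 60 Hz line the nominal interval works out to 8333 us in integer arithmetic, so with an assumed 5 percent tolerance a pulse arriving after only 4000 us is rejected as too early, and one arriving after 9000 us as too late.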

Multiprocessor and multicore designs pose special challenges. An individual watchdog should be set up to monitor each processor or core, configured for the unique conditions of the code running on that processor at that time. This means that software developed for a core in a multicore environment should carry with it the specific watchdog conditions that indicate failure for that specific code block.

In addition, watchdog reporting should be hierarchical. Each core should report to a higher-level watchdog that ties together all failure modes reported from all subcores and processes. As an upper-level system function, a watchdog executive works hand-in-hand with the main task executive that assigns code-blocks to specific cores. It should also work closely with external watchdog systems.

A wired-OR type of multi-watchdog block can easily be expanded, using I/O specific to each core for reporting in (Figure 2). It can be an independent logic block inside an FPGA or CPLD and can handle multiple processors and blocks with easy expansion. A register can accumulate the independent status of all reporting blocks so that a failed core can be identified and recovered individually. As you may imagine, recovery routines become more complex at this level, as does the question of how to reboot one core while leaving the rest of the system running.


Figure 2: A top level of the watchdog hierarchy can use expandable wired-OR functionality to allow all micros or cores to report in at their own rates. Each 1-shot should allow the process it is monitoring to program in the duration interval. Each code block carries with it watchdog parameters.
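
The wired-OR arrangement with a status register can be sketched as a software model: each channel has its own programmable interval, the combined fault output is the OR of all channels, and the status bits show which core failed to report in. The names and the eight-channel limit below are assumptions for illustration; in practice this logic would more likely live in an FPGA or CPLD.

```c
#include <stdbool.h>
#include <stdint.h>

#define MWDT_CHANNELS 8u

typedef struct {
    uint32_t interval_ticks[MWDT_CHANNELS]; /* per-channel programmed interval */
    uint32_t remaining[MWDT_CHANNELS];      /* ticks left on each 1-shot       */
    uint8_t enabled;                        /* bit n set: channel n monitored  */
    uint8_t status;                         /* bit n set: channel n timed out  */
} multi_wdt_t;

void mwdt_init(multi_wdt_t *m)
{
    for (uint32_t ch = 0u; ch < MWDT_CHANNELS; ch++) {
        m->interval_ticks[ch] = 0u;
        m->remaining[ch] = 0u;
    }
    m->enabled = 0u;
    m->status = 0u;
}

/* Each core or code block programs its own interval when it is loaded. */
void mwdt_configure(multi_wdt_t *m, uint32_t ch, uint32_t interval_ticks)
{
    if (ch < MWDT_CHANNELS && interval_ticks > 0u) {
        m->interval_ticks[ch] = interval_ticks;
        m->remaining[ch] = interval_ticks;
        m->enabled |= (uint8_t)(1u << ch);
    }
}

/* A core reports in, re-triggering only its own 1-shot. */
void mwdt_report(multi_wdt_t *m, uint32_t ch)
{
    if (ch < MWDT_CHANNELS && !(m->status & (1u << ch))) {
        m->remaining[ch] = m->interval_ticks[ch];
    }
}

/* Advance all channels one tick; returns the wired-OR of all fault bits. */
bool mwdt_tick(multi_wdt_t *m)
{
    for (uint32_t ch = 0u; ch < MWDT_CHANNELS; ch++) {
        uint8_t bit = (uint8_t)(1u << ch);
        if (!(m->enabled & bit) || (m->status & bit)) {
            continue;
        }
        if (m->remaining[ch] > 0u) {
            m->remaining[ch]--;
        }
        if (m->remaining[ch] == 0u) {
            m->status |= bit;   /* record which channel failed */
        }
    }
    return m->status != 0u;
}

/* After recovering one core, clear its fault and restart its 1-shot. */
void mwdt_clear(multi_wdt_t *m, uint32_t ch)
{
    if (ch < MWDT_CHANNELS) {
        m->status &= (uint8_t)~(1u << ch);
        m->remaining[ch] = m->interval_ticks[ch];
    }
}
```

A supervising executive would poll or be interrupted by the OR output, read the status field to find the offending channel, attempt recovery of that core alone, and then call mwdt_clear for it.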

Parts with special watchdog functions

Several micros feature unique or varying functionality when it comes to how they implement their watchdogs. Take, for instance, the 16-bit Maxim MAXQ series, which combines a rich assortment of flexible timers with clever circuitry to enhance usefulness. Parts like the Maxim MAXQ2000-RBX+ have a secondary level of alert. If the MAXQ2000’s WDT is not serviced and overflows, it triggers an interrupt and then counts an additional 512 system-clock cycles. It then resets the device unless the reset has been disabled or overridden.

That interrupt provides a “last chance” to save debugging information, a chance that most designers agree is useful during circuit development and troubleshooting. What’s more, instead of saving debugging information, the interrupt could be used to recover from the error and clear the watchdog. That latter approach, however, can compromise the system’s reliability if a systemic fault is present.
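
The two-stage behavior described here, an interrupt followed by a grace period before reset, can be modeled as a small state machine. The sketch below is a generic behavioral model, not the MAXQ2000's register interface: the 512-cycle grace count comes from the description above, while the type names, the callback, and the snapshot counter are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define WDT_GRACE_CYCLES 512u   /* cycles between the interrupt and reset */

typedef enum { WDT_RUNNING, WDT_WARNING, WDT_RESET } wdt_stage_t;

typedef struct {
    uint32_t timeout_cycles;    /* cycles allowed before the warning stage */
    uint32_t count;             /* cycles elapsed in the current stage     */
    wdt_stage_t stage;
    void (*on_warning)(void);   /* "last chance" interrupt handler         */
} two_stage_wdt_t;

/* Example handler: a real one might copy registers and a stack trace to
 * non-volatile memory. Here it only counts how often it was invoked. */
unsigned g_snapshots = 0u;
void save_debug_snapshot(void) { g_snapshots++; }

void wdt2_init(two_stage_wdt_t *w, uint32_t timeout_cycles,
               void (*on_warning)(void))
{
    w->timeout_cycles = timeout_cycles;
    w->count = 0u;
    w->stage = WDT_RUNNING;
    w->on_warning = on_warning;
}

/* Servicing the watchdog, even during the grace period, restarts it. */
void wdt2_service(two_stage_wdt_t *w)
{
    if (w->stage != WDT_RESET) {
        w->count = 0u;
        w->stage = WDT_RUNNING;
    }
}

/* Advance one system-clock cycle and return the resulting stage. */
wdt_stage_t wdt2_clock(two_stage_wdt_t *w)
{
    if (w->stage == WDT_RESET) {
        return WDT_RESET;
    }
    w->count++;
    if (w->stage == WDT_RUNNING && w->count >= w->timeout_cycles) {
        w->stage = WDT_WARNING;
        w->count = 0u;
        if (w->on_warning != NULL) {
            w->on_warning();
        }
    } else if (w->stage == WDT_WARNING && w->count >= WDT_GRACE_CYCLES) {
        w->stage = WDT_RESET;
    }
    return w->stage;
}
```

Allowing wdt2_service to cancel the pending reset mirrors the recovery option mentioned above; a more conservative design could make the warning stage non-cancellable so that a systemic fault always ends in a reset.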

Like other internal WDTs, the MAXQ2000’s watchdog can be disabled by software. Note, however, that this capability is a double-edged sword: runaway code can disable the watchdog and then continue its rampage.

Some microprocessors connect their WDT to an internal oscillator separate from the system clock. Some use internal or external R/C oscillators, and some can use both. The Maxim MAXQ2000's WDT offers an interesting variation: it derives watchdog timing from the system clock but will switch to a backup R/C oscillator in the event of a failure in the main oscillator.

Another interesting MCU with unique watchdog functionality is the STM32F100 family of micros from STMicroelectronics, which has two watchdog timers. Parts like the STM32F100CBT6B target smart-grid and smart-health applications that need to be reliable. Like most micros, it has multiple timers, in this case six, with another two 16-bit timers dedicated to watchdog functionality.

Each watchdog has a selectable prescaler (from 1 to 64 K) that can be used to clock the watchdog timers, which can also trigger DMA requests and capture-compare channels. Yet another independent watchdog is based on a 12-bit downcounter and 8-bit prescaler clocked from an independent 40 kHz internal R/C oscillator. Note how both these parts rely on R/C components as the ultra-reliable backup technology.
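
For a prescaled down-counter clocked from a nominal 40 kHz R/C oscillator, the timeout follows directly from the divider and the reload value. The sketch below assumes a 12-bit reload value (0 to 4095), a divider of at most 256 (what an 8-bit prescaler can provide), and that the counter takes reload + 1 ticks to expire; the function name is hypothetical, and the actual divider choices and oscillator tolerance should be taken from the part's reference manual.

```c
#include <stdint.h>

#define WDT_RC_CLOCK_HZ 40000u  /* nominal frequency of the independent R/C */

/* Timeout in microseconds for a given prescaler divider and reload value.
 * Returns 0 for arguments outside the assumed 8-bit prescaler / 12-bit
 * counter ranges. */
uint32_t rc_wdt_timeout_us(uint32_t prescaler_div, uint32_t reload)
{
    if (prescaler_div == 0u || prescaler_div > 256u || reload > 0xFFFu) {
        return 0u;
    }
    uint64_t ticks = (uint64_t)prescaler_div * (reload + 1u);
    return (uint32_t)((ticks * 1000000u) / WDT_RC_CLOCK_HZ);
}
```

Under these assumptions, a divider of 4 with a reload of 0 gives 100 us, and the largest setting (divider 256, reload 4095) gives 26,214,400 us, or roughly 26 seconds. Because an R/C oscillator's frequency varies with temperature and process, practical designs leave margin around these nominal figures.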

An interesting feature of the STMicroelectronics part is an analog watchdog function. Precise monitoring of one or more converted voltage levels from the A/Ds can trip the reset if the analog levels are outside the programmed thresholds. This can be useful for medical applications where sensors are connected to a body area network for health monitoring or active medication dispensing (Figure 3). As a design aid, STMicroelectronics offers engineers a Product Training Module for healthcare and wellness designs.


Figure 3: As a medical device integrates more actively with our body-area networks, some of the devices (pacemakers, defibrillators, insulin pumps, etc.) may be of critical importance for sustaining life. Reliable watchdogs need to be engineered into these systems.
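
The analog watchdog idea reduces to a window comparison on each converted sample. The following is a minimal software sketch of that comparison, with assumed names and raw ADC counts as thresholds; on a part with a hardware analog watchdog, the comparison is done by the converter itself and software only programs the thresholds and handles the resulting event.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Programmed guard window, in raw ADC counts. */
typedef struct {
    uint16_t low;   /* lowest acceptable conversion result  */
    uint16_t high;  /* highest acceptable conversion result */
} analog_window_t;

/* True if a single converted sample falls outside the window. */
bool analog_wdt_out_of_range(const analog_window_t *win, uint16_t sample)
{
    return sample < win->low || sample > win->high;
}

/* Scan a buffer of conversions; returns the index of the first sample
 * outside the window, or -1 if all samples are within limits. */
int analog_wdt_scan(const analog_window_t *win,
                    const uint16_t *samples, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (analog_wdt_out_of_range(win, samples[i])) {
            return (int)i;
        }
    }
    return -1;
}
```

In a sensor-monitoring application, a non-negative return value would be the trigger for whatever response the system defines, such as a reset, an alarm, or a switch to a safe output state.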

On the outside looking in

Several good building-block external solutions include simple R/C threshold generators, biased transistors, low-power timers, and dedicated power-on reset and watchdog companion processors. In addition, development environments are available that encourage experimentation and ease of testing. Texas Instruments offers an interesting solution with its TPL5000 nano-power programmable timer, which draws 30 nA and operates over a wide 1.8 V to 5 V supply range. The TPL5000EVM evaluation kit lets you test and optimize this functionality in a nice little self-contained module.

Several discrete watchdogs are available combined with other useful functions, such as real-time clocks and supervisory functions. An interesting combo comes from Lattice Semiconductor, with its ISPPAC-POWR607-01SN32I power supply supervisor, watchdog, and reset generator (Figure 4). Note the 1 percent analog-trip-point step size and the in-system programmable macro cell for state machine and combinatorial customizing.


Figure 4: In-system user-programmable parameters allow dynamic determinations of watchdog functions through the use of combinatorial and state-machine-based user-configurable logic.

Conclusions

Keep in mind that there are times when nothing will help restore a system. Some failure modes are not recoverable. For example, if system memory is corrupt, nothing else can be trusted.

Another case is very high levels of noise. Even if a watchdog resets the processor, the noise can interfere with the processor’s initialization of the watchdog stage. If the watchdog functionality cannot be initialized, it is as if there is no watchdog at all.

No one can depend on 100 percent flawless operation of any MCU all the time. Micros can go awry. However, smart use of internal and external resources can make the difference between a mild failure and catastrophic property damage or even loss of life.

For more information about the parts discussed in this article, use the links provided to access product pages on the Hotenda website.
