When faulty software can lead to fatal accidents, large teams of engineers go to great lengths to validate the code and make sure it is as reliable as they can make it. Professional engineers developing software for avionics, military and space applications have worked closely with their tools vendors for over 20 years, and together they have developed better tools and techniques for eliminating bugs and ensuring that mission-critical systems will always perform reliably and flawlessly in every situation.
This article will explore the software development tools and techniques commonly used by engineers developing high-quality software for avionics, military and space (AMS) applications, including tools such as Keil MDK. The article will also look at how engineers developing software for AMS applications make use of the safety features built into modern Cortex-M4-based microcontrollers such as the Atmel SAM4 series and Freescale's Kinetis family. The article will explore how engineers who develop very large and complex software are able to keep software bugs from entering their applications.
Is it really worth paying for professional ARM® Cortex™-M development tools when there are so many free-of-charge alternatives available? What can these tools offer to make them worth thousands of dollars per developer? The answer depends on how much an undetected bug could end up costing you.
If you develop embedded software for a living, you are no doubt familiar with that uncomfortable feeling of knowing that your code contains a few hidden bugs. Even the most professionally designed product, subject to thousands of hours of rigorous testing, can end up on the market with bugs undetected. Customers have mostly learned to accept this, possibly because the vast majority of embedded system failures are merely annoying. The user experiences a hiccup, tries again, shrugs, and carries on. A little data got lost, and nobody got hurt.
However, in safety-critical applications such as avionics, military or space, a tiny software bug has the potential to cause catastrophic failures. One infamous example is Ariane Flight 501, which exploded in a gigantic fiery ball only 40 seconds into its flight. Several tons of liquid rocket fuel exploded because the on-board guidance system locked up and consequently sent the rocket off course. The computer froze when it failed to convert a 64-bit floating-point number into a 16-bit integer. Engineers had simply copied the faulty software routine from a different rocket guidance system, and failed to test what would happen when an algorithm that was assumed to be good was put to use on a much more powerful rocket. When the rocket picked up speed, the conversion resulted in an overflow, which threw a software exception that stalled the CPU. This tiny, undetected bug resulted in the loss of a space rocket worth more than a billion dollars and made aerospace history.
The high cost of failure has motivated the avionics, military and aerospace suppliers to search for better ways to design reliable software. Engineers go to great lengths in eliminating bugs, especially in mission- and safety-critical components. Computer-assisted testing speeds up manual processes, and cowboy-coding styles are held in check by best practice rules.
Most of the test and validation tools and techniques developed for safety-critical applications are not very hard to use. The tools eliminate much of the boring, manual work involved in testing and validation. There is no point in spending countless hours and thousands of dollars on manual testing when computers can get the same work done in seconds. Where the avionics, military and aerospace industries lead the way, the rest of the embedded industry would be smart to follow. A bug in even the most “harmless” product has the potential to cause a storm of bad press, expensive product recalls, and disappointed customers fleeing to a competing brand.
Avoid shooting yourself in the foot
The sooner a bug is detected, the less expensive it will be to fix. The absolutely lowest cost will result from not introducing any bugs in the first place. In both the automotive and avionics industries, developers routinely use coding standards to avoid shooting themselves in the foot. The most popular and widely used coding standard is MISRA-C. This is a best-practice coding standard, released by the Motor Industry Software Reliability Association. Tools that automatically test for MISRA-C compliance do not directly identify bugs, but merely point out lines of code written in ways that increase the risk of introducing bugs. By following the rules suggested by the MISRA-C coding standard, the resulting embedded software will be safer, more reliable, and easier to port. A MISRA-C checker tool is built into two of the most popular professional IDEs: IAR Embedded Workbench and Atollic TrueSTUDIO. The third large IDE vendor for ARM Cortex-M microcontrollers, Keil, does not supply its own MISRA-C checker, but has made sure its development tools support the third-party tool PC-lint from Gimpel Software.
Another well-proven method for identifying software bugs is to conduct manual code reviews where multiple team members inspect every line of code. Inspectors will make a note every time they find something suspicious. This is an exercise that ensures that more than one pair of eyes has inspected the code before it ships. Some great tools are available for speeding up the review process and making sure that every suspected bug is properly inspected before the code is compiled.
A third method used to increase code quality is through the reduction of code complexity. Tools used to measure code complexity perform a code metric analysis. Overly complex functions are not only more difficult to read and maintain, but also far more likely to contain bugs, and should be rewritten into several simpler functions to avoid overly expensive testing.
On November 1st, 2010, the biggest worldwide technology news story was how the Apple iPhone built-in clock application had failed to adjust recurring alarms as phones switched automatically from Daylight Saving Time back to Standard Time. That morning, iPhone owners all over Europe overslept by one hour. It was pretty obvious that the Apple engineers who worked on improving the application had accidentally introduced an embarrassing bug. Apple never released any details, but experts on test software speculated that the error was so obvious it should have been easy to spot with a combination of automatic code complexity analysis and manual code reviews.
The investigation of the Ariane Flight 501 disaster revealed that the function which caused the CPU to freeze had been subject to manual code reviews and followed every rule of the standard coding style. The bug still managed to slip through the rigorous testing plan because engineers had mistakenly flagged the function as safe, and failed to test that it worked correctly with every combination of input variables. They did so to cut down on time required to test the guidance system, knowing that it would take too long to simulate every possible combination of variables in the complex algorithm.
A smarter way to cut down on test time is to test each function in isolation. This method is called unit testing, and it tells engineers whether a function works as expected across its range of inputs. Tools for unit testing automate the process by automatically generating a test for each function. The best tools will analyze the algorithms found inside the function and identify “suspicious” input values that need some attention. Suspicious values are the ones that cause internal calculations to return a result close to zero, or that bring a calculation close to overflow. By eliminating billions of unnecessary tests, engineers can focus their attention on the critical combinations.
Figure 1: Unit tests with automatic selection of suspicious variables can eliminate billions of unnecessary tests. Source: Atollic AB.
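As a minimal sketch of what a unit test built around suspicious values might look like, the C code below guards a conversion from a 64-bit floating-point number to a 16-bit integer and tests it at the edges of the valid range. The function names to_int16_checked and test_to_int16_checked are hypothetical and chosen for illustration only; this is not the Ariane code, nor the output of any particular test tool.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: convert a double to int16_t only when the value fits.
 * Returns false and leaves *out untouched for NaN or out-of-range input.
 * The negated range check also rejects NaN, because every comparison
 * involving NaN evaluates to false. */
bool to_int16_checked(double value, int16_t *out)
{
    if (!(value >= (double)INT16_MIN && value <= (double)INT16_MAX)) {
        return false;
    }
    *out = (int16_t)value;  /* truncates toward zero; known to be in range */
    return true;
}

/* Unit test that concentrates on boundary ("suspicious") inputs instead of
 * sweeping every possible value. */
void test_to_int16_checked(void)
{
    int16_t result = 0;

    /* Values at and just inside the limits must convert. */
    assert(to_int16_checked(0.0, &result) && result == 0);
    assert(to_int16_checked(32767.0, &result) && result == INT16_MAX);
    assert(to_int16_checked(-32768.0, &result) && result == INT16_MIN);

    /* Values just outside the limits, far outside, and NaN must be rejected. */
    assert(!to_int16_checked(32768.0, &result));
    assert(!to_int16_checked(-32769.0, &result));
    assert(!to_int16_checked(1.0e9, &result));
    assert(!to_int16_checked(NAN, &result));
}
```

A handful of boundary cases like these exercises the overflow path directly, which is the path an exhaustive sweep of "normal" values is least likely to reach.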
To further improve test coverage, it is better to use tools that run the unit tests on the target hardware. The Ariane software flaw would not have been detected if simulated on a PC because an Intel CPU does not throw an exception in the same situation. The bug would only cause a crash in a real guidance system.
Automatic unit testing is a great way to test code thoroughly. Because most code is modular, it will almost always be faster to test each module separately. A system that both collects and analyzes data could have a bug in either part. If there are N possible ways to fetch data, and M ways to use it in an analysis, there will be N × M tests needed. Testing each function separately would only require N + M tests, a far smaller number. With N = 10 and M = 20, for example, that is 200 combined tests against 30 separate ones. Added up across an entire system, this turns into huge savings in test complexity.
Code coverage analysis
After each function has been thoroughly tested, it is time to perform testing of the complete system. This is the domain of dynamic analysis tools that measure code coverage. A better name for this kind of measurement might be test coverage measurement, but we will stick to the industry-standard expressions for consistency.
The basic form of code coverage analysis measures how often we execute each line of code in our software. As the theory goes, if there are parts of our code that we failed to execute, we need to improve our test suite to include the unused code.
An even better measure of test coverage is branch coverage. This keeps track of each conditional instruction (if, for, while, case) and records whether the condition has evaluated to both True and False. To reach 100% branch coverage, our tests must take every branch in each direction.
The most rigorous of these coverage criteria is Modified Condition/Decision Coverage (MC/DC). Here the tool keeps track of every conditional instruction and checks that each sub-condition has been shown to independently affect the outcome of the decision. For example, the branch expression if ((a || b) && c) has three sub-conditions a, b, and c, and a total of eight possible combinations of input values. MC/DC does not require all eight: it is satisfied once, for each sub-condition, there is a pair of tests in which only that sub-condition changes and the outcome of the decision changes with it. For three sub-conditions this can be done with as few as four tests, and the MC/DC coverage analysis tells us which sub-conditions our test suite has demonstrated so far.
Figure 2: Modified Condition/Decision Coverage (MC/DC) tests keep track of every conditional instruction, and ensure that each sub-condition has been met. Source: Atollic AB.
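The MC/DC idea can be illustrated with a short C sketch for the decision (a || b) && c. The helper names decision, has_independence_pair, achieves_mcdc and mcdc_set are hypothetical, and the check shown is a simplified "unique cause" interpretation of MC/DC, not the algorithm used by any particular coverage tool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The decision under test. */
bool decision(bool a, bool b, bool c)
{
    return (a || b) && c;
}

typedef struct { bool a, b, c; } test_vector;

/* Returns true if the set contains two vectors that differ only in the
 * selected condition (0 = a, 1 = b, 2 = c) and give different decision
 * outcomes, i.e. the condition independently affects the result. */
bool has_independence_pair(const test_vector *v, size_t n, int which)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            bool da = v[i].a != v[j].a;
            bool db = v[i].b != v[j].b;
            bool dc = v[i].c != v[j].c;
            bool only_selected = (which == 0) ? (da && !db && !dc)
                               : (which == 1) ? (!da && db && !dc)
                                              : (!da && !db && dc);
            if (only_selected &&
                decision(v[i].a, v[i].b, v[i].c) !=
                decision(v[j].a, v[j].b, v[j].c)) {
                return true;
            }
        }
    }
    return false;
}

/* True when every one of the three conditions has an independence pair. */
bool achieves_mcdc(const test_vector *v, size_t n)
{
    return has_independence_pair(v, n, 0) &&
           has_independence_pair(v, n, 1) &&
           has_independence_pair(v, n, 2);
}

/* One possible four-vector set, half of the eight input combinations. */
const test_vector mcdc_set[4] = {
    { false, false, true  },  /* decision is false                          */
    { true,  false, true  },  /* decision is true; pairs with row 0 for a   */
    { false, true,  true  },  /* decision is true; pairs with row 0 for b   */
    { true,  false, false },  /* decision is false; pairs with row 1 for c  */
};

void check_mcdc_set(void)
{
    assert(achieves_mcdc(mcdc_set, 4));
}
```

Dropping the last row leaves c permanently true, so no pair demonstrates its effect and the check fails, even though both branch directions have been taken.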
You should never forget that code coverage analysis tools do not guarantee the code has zero bugs. These tools are designed to help us keep track of how much of our code we have covered with our test suite, but we still need to design that test suite to detect system failure.
Microcontroller safety features
Modern microcontrollers have some built-in peripherals designed to help you develop more reliable and robust applications. We strongly advise making use of them whenever you can.
A memory protection unit is a module located between the CPU and memory bus. It works by limiting which regions of memory the CPU can access. Unauthorized access will trigger a special interrupt called an exception, which traps the execution of the running function before it has time to corrupt any data. By simply enabling this module, you will catch bugs that might otherwise be extremely hard to pinpoint. Many Cortex-M-based MCUs including members of the Freescale Kinetis, Atmel SAM4, and NXP LPC series will capture the calling function that caused the error so you can pinpoint where and why the error occurred.
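To give a feel for what configuring a protected region involves, here is a minimal sketch that builds the value of the region attribute and size register (RASR) as laid out for the ARMv7-M MPU found on Cortex-M3 and Cortex-M4 devices. The helper mpu_encode_rasr and the macro names are hypothetical; the code runs on a host PC and only computes the register value. It assumes the standard ARM MPU, so parts with a vendor-specific protection unit should be checked against their own reference manual.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Selected fields of the ARMv7-M MPU Region Attribute and Size Register. */
#define RASR_ENABLE      (1u << 0)   /* region enable                      */
#define RASR_SIZE_SHIFT  1u          /* SIZE field, bits 5:1               */
#define RASR_AP_SHIFT    24u         /* access permission field, bits 26:24 */
#define RASR_XN          (1u << 28)  /* execute never                      */

/* A few access permission (AP) encodings. */
#define AP_NO_ACCESS     0u  /* no access at all                          */
#define AP_PRIV_RW       1u  /* privileged read/write only                */
#define AP_FULL_ACCESS   3u  /* read/write for all code                   */
#define AP_READ_ONLY     6u  /* read-only for privileged and unprivileged */

/* Hypothetical helper: encode a RASR value for a region of size_bytes.
 * The size must be a power of two of at least 32 bytes; the SIZE field
 * holds log2(size) - 1. Returns false for an invalid size or AP value. */
bool mpu_encode_rasr(uint32_t size_bytes, uint32_t ap, bool execute_never,
                     uint32_t *rasr)
{
    if (size_bytes < 32u || (size_bytes & (size_bytes - 1u)) != 0u || ap > 7u) {
        return false;
    }

    uint32_t log2_size = 0u;
    while ((1u << log2_size) < size_bytes) {
        log2_size++;
    }

    *rasr = ((log2_size - 1u) << RASR_SIZE_SHIFT)
          | (ap << RASR_AP_SHIFT)
          | (execute_never ? RASR_XN : 0u)
          | RASR_ENABLE;
    return true;
}

/* Example: a 1 KB read-only, non-executable region. */
void example_rasr(void)
{
    uint32_t rasr = 0u;
    assert(mpu_encode_rasr(1024u, AP_READ_ONLY, true, &rasr));
    assert(rasr == 0x16000013u);
}
```

On a real target the value would be written to the MPU's region registers through the vendor's CMSIS headers, with the region base address aligned to the region size, and the memory management fault would need to be enabled for violations to be trapped separately from a hard fault.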
The Cortex-M CPU can also catch a number of internal errors through its built-in detectors. The errors are reported directly in the Configurable Fault Status Register (CFSR). This register tracks if the CPU has attempted to perform an illegal bus access or encountered an illegal operation such as divide by zero.
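A fault handler typically reads the CFSR and decodes it to find out what went wrong. The sketch below decodes a few of the bits defined for Cortex-M3 and Cortex-M4; the type fault_summary and the function decode_cfsr are hypothetical names, and the code takes the register value as a parameter so that it can be exercised on a host PC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* On Cortex-M3/M4 the CFSR is memory-mapped at this address and combines
 * three status fields: MemManage (bits 7:0), BusFault (bits 15:8) and
 * UsageFault (bits 31:16). */
#define CFSR_ADDRESS     0xE000ED28u

#define CFSR_MMFSR_MASK  0x000000FFu
#define CFSR_BFSR_MASK   0x0000FF00u
#define CFSR_UFSR_MASK   0xFFFF0000u

#define CFSR_DACCVIOL    (1u << 1)   /* data access violation (MPU)  */
#define CFSR_PRECISERR   (1u << 9)   /* precise data bus error       */
#define CFSR_UNDEFINSTR  (1u << 16)  /* undefined instruction        */
#define CFSR_DIVBYZERO   (1u << 25)  /* integer divide by zero       */

typedef struct {
    bool mem_manage_fault;
    bool bus_fault;
    bool usage_fault;
    bool data_access_violation;
    bool precise_bus_error;
    bool undefined_instruction;
    bool divide_by_zero;
} fault_summary;

/* Hypothetical helper: turn a raw CFSR value into named flags. */
fault_summary decode_cfsr(uint32_t cfsr)
{
    fault_summary s;
    s.mem_manage_fault      = (cfsr & CFSR_MMFSR_MASK) != 0u;
    s.bus_fault             = (cfsr & CFSR_BFSR_MASK) != 0u;
    s.usage_fault           = (cfsr & CFSR_UFSR_MASK) != 0u;
    s.data_access_violation = (cfsr & CFSR_DACCVIOL) != 0u;
    s.precise_bus_error     = (cfsr & CFSR_PRECISERR) != 0u;
    s.undefined_instruction = (cfsr & CFSR_UNDEFINSTR) != 0u;
    s.divide_by_zero        = (cfsr & CFSR_DIVBYZERO) != 0u;
    return s;
}

/* Example: a register value with only the divide-by-zero bit set. */
void example_decode(void)
{
    fault_summary s = decode_cfsr(CFSR_DIVBYZERO);
    assert(s.usage_fault && s.divide_by_zero && !s.bus_fault);
}
```

On the target, the handler would pass in the value read from CFSR_ADDRESS through a volatile pointer. Note that the core only traps integer division by zero when the DIV_0_TRP bit in its Configuration and Control Register is set; otherwise the division quietly returns zero.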
The tools used to develop and test safety-critical applications may seem expensive, especially compared to all the free-of-charge tools that are available. However, investing in professional tools that help speed up the search for hidden bugs will make sense when compared to the potential cost of shipping a buggy product. In avionics, military and aerospace applications, a tiny bug can lead to a catastrophic crash. Even in consumer products, a tiny bug can cause public embarrassment and disgrace, causing loyal customers to leave in hordes.
- Thomas Ricker, “iPhone DST bug causing alarms to fail across Europe”, Engadget, November 2010.
- European Space Agency, “No 33-1996: Ariane 501 - Presentation of inquiry board report”, July 1996.