Understanding How an MCU’s Internal Bus Structure Drives Application Efficiency



MCUs now integrate a large number of on-chip peripherals that can run simultaneously to off-load low-level functions from the CPU. This can dramatically improve processing efficiency, reduce power, and simplify your design. You may be in for an unfortunate surprise, however, if your peripheral traffic overwhelms the internal bus interface and data transfers slow dramatically. Fortunately, MCU manufacturers have added new and efficient bus interfaces, often with multiple paths between key peripherals and on-chip memory, that can support several data transfers at once. These new buses do have limitations, however, since connecting everything to everything else is much too expensive in terms of die area and power. Understanding common usage models for these new on-chip buses will help you create efficient designs that maximize data-transfer bandwidth.

This article will quickly review some common intelligent on-chip bus features and will illustrate example designs that take advantage of these key features. Some of the topics covered will include: on-chip bus-matrix architectures, use of DMA controllers, dedicated peripheral data-transfer functions, intelligent buffering, bus priority systems, and interrupt control.

Common bus-interface architectures

Several key architectural approaches show up in just about every high-performance bus-interconnect structure. This should not be surprising, since the key strategy for supporting high bandwidth is to establish several parallel connections that can run independently. A bus-matrix architecture, where several bus masters can independently access several bus slaves, is perhaps the most common building block for high-efficiency bus architectures. The Freescale Kinetis K70 MCU is a good example of the type of interconnect architecture required for efficient data processing and movement.

As shown in Figure 1, the Freescale Kinetis K70 MCU uses a multilevel bus matrix that can interconnect eight separate bus masters and eight separate bus slaves. Multiple masters can access different slaves at the same time, so where you place code and data in memory is critical to efficiency. For example, the following operations could all run in parallel without contending for the same slave port:
  • Core - instructions in Flash, with core-only data and stack in SRAM_L
  • USB - data buffers in SRAM_U
  • LCD controller - graphic buffers in DDR
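
The allocation rule behind this list can be checked with a few lines of code: transfers can overlap only when no two masters target the same slave port. The following is a minimal sketch of that check, not a model of the actual K70 crossbar. The master and slave names, the split of the core into separate code and data masters, and the dictionary layout are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical transfer plan: each bus master mapped to the slave port it targets.
# Names loosely follow the example list above; they are not taken from a datasheet.
PLAN = {
    "core_code": "flash",
    "core_data": "sram_l",
    "usb": "sram_u",
    "lcdc": "ddr",
}


def find_contention(plan):
    """Return {slave: [masters]} for every slave targeted by more than one master."""
    by_slave = defaultdict(list)
    for master, slave in plan.items():
        by_slave[slave].append(master)
    return {slave: sorted(masters)
            for slave, masters in by_slave.items() if len(masters) > 1}


# The plan from the list above: every master has its own slave port.
parallel_ok = find_contention(PLAN)

# Same plan, but with the USB buffers moved into SRAM_L next to the core's data.
shared = find_contention({**PLAN, "usb": "sram_l"})
```

In this sketch `parallel_ok` comes back empty, while `shared` reports that the core's data accesses and the USB controller would have to be arbitrated on SRAM_L.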

Figure 1: Freescale Kinetis K70 MCU bus-interconnect architecture. (Courtesy of Freescale) 

Freescale also offers a modular development platform for the K70, part of its Freescale Tower System, that enables rapid prototyping and tool re-use through reconfigurable hardware. The TWR-K70F120M can be used with a broad selection of Tower System peripheral modules, including the new TWR-LCD-RGB which accepts RGB data from the K70 MCU graphics LCD controller.

When two or more masters attempt to access a single slave port at the same time, the interface uses an arbitration algorithm to determine which master gets the port first. Two common arbitration schemes are fixed priority and round robin. In a fixed-priority scheme the master priorities do not change, so a high-priority master always wins over a lower-priority one. If several masters have equal priority, a round-robin scheme can be used instead. In this scheme the masters rotate priority so that, over time, each has equal access to the resource.
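
The difference between the two schemes is easiest to see in a small behavioural model. The sketch below is an illustration only and does not reproduce any particular MCU's arbiter. The function and class names are hypothetical, and real hardware arbitrates per slave port and per bus cycle rather than through function calls.

```python
def fixed_priority(requests, priority):
    """Grant the requesting master that appears earliest in `priority`."""
    for master in priority:
        if master in requests:
            return master
    return None


class RoundRobin:
    """Rotate priority so the most recently granted master becomes lowest priority."""

    def __init__(self, masters):
        self.order = list(masters)

    def grant(self, requests):
        for i, master in enumerate(self.order):
            if master in requests:
                # Masters after the winner move to the front; the winner goes last.
                self.order = self.order[i + 1:] + self.order[:i + 1]
                return master
        return None


MASTERS = ["cpu", "dma", "usb"]      # hypothetical master names, highest priority first
PENDING = {"dma", "usb"}             # both masters request the same slave every cycle

fixed_grants = [fixed_priority(PENDING, MASTERS) for _ in range(4)]

arbiter = RoundRobin(MASTERS)
rr_grants = [arbiter.grant(PENDING) for _ in range(4)]
```

With both masters requesting continuously, the fixed-priority model grants the DMA controller every time and the USB master never gets the port, while the round-robin model alternates between the two.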

Notice the importance of DMA access to the bus matrix. DMA transfers are often the most power-efficient way to move data, so it is critical for the DMA controller to have efficient master access to the bus matrix. Some resources, such as the DRAM controller, have multiple connections to the bus matrix because they are critical to several masters. This improves overall performance by removing the “access blocking” that may occur when multiple masters need the same resource.

Advanced peripheral bus architecture for improved efficiency

In many MCU applications, peripheral operations are just as important as CPU and memory operations. Transfer efficiency can improve if advanced bus interfaces serve key peripheral functions as well as CPU-based functions. The Renesas RX600 MCU has multiple peripheral buses that can be used to spread bandwidth loading more efficiently. As illustrated in Figure 2, the RX600 has not only a bus matrix for CPU-oriented operations (shown at the top of the figure) but also multiple peripheral buses (shown at the bottom of the figure) to better allocate bandwidth between intelligent peripherals. A significant amount of peripheral traffic never needs to reach the CPU bus matrix. This improves data-transfer efficiency without enlarging the CPU bus matrix, which is typically a high-performance subsystem that is costly in die area and power.


Figure 2: Renesas RX600 multiple bus architecture spreads bandwidth loading. (Courtesy of Renesas) 

In Figure 2 there are six parallel data transfer operations occurring at once:
  • The CPU fetches an instruction
  • USB data is transferred to the CPU
  • Ethernet data is moved out of SRAM
  • RGB data is moved out of external SDRAM to the LCD
  • ADC values are loaded into SRAM
  • Timer data is written to the DAC output

The availability of separate peripheral buses can provide a significant efficiency boost when multiple activities occur simultaneously. In systems with fewer simultaneous peripheral requirements, one or two peripheral buses could be sufficient.
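
One way to reason about how many peripheral buses a design needs is to add up the bandwidth each transfer demands on each bus and compare the totals with what each bus can carry. The sketch below shows that bookkeeping. Every number in it is a made-up placeholder, not an RX600 datasheet value, and the bus and transfer names are hypothetical.

```python
# All figures are illustrative placeholders in MB/s, not datasheet values.
BUS_CAPACITY = {"periph_bus_1": 50, "periph_bus_2": 50}


def bus_load(transfers):
    """Sum the requested bandwidth per bus from (name, bus, mb_per_s) tuples."""
    load = {}
    for _name, bus, rate in transfers:
        load[bus] = load.get(bus, 0) + rate
    return load


def overloaded(transfers, capacity):
    """Return the sorted names of buses whose summed demand exceeds their capacity."""
    return sorted(bus for bus, used in bus_load(transfers).items()
                  if used > capacity[bus])


# Everything on one peripheral bus: 20 + 10 + 30 = 60 MB/s against a 50 MB/s budget.
single_bus = [
    ("adc_to_sram", "periph_bus_1", 20),
    ("timer_to_dac", "periph_bus_1", 10),
    ("usb", "periph_bus_1", 30),
]

# Same traffic with USB moved to a second peripheral bus.
split_bus = [
    ("adc_to_sram", "periph_bus_1", 20),
    ("timer_to_dac", "periph_bus_1", 10),
    ("usb", "periph_bus_2", 30),
]

single_result = overloaded(single_bus, BUS_CAPACITY)
split_result = overloaded(split_bus, BUS_CAPACITY)
```

With these placeholder numbers the single-bus plan overloads `periph_bus_1`, while the split plan fits within both budgets. A real budget would use the transfer rates of your application and the bus clock and width from the device documentation.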

Dual-CPU core architectures

MCUs with dual-CPU cores, like the Atmel SAM4C8CA, need high-performance bus interfaces perhaps even more than single-core MCUs do. Each CPU must be able to access key resources in parallel so that overall system performance is not impacted. In many implementations one CPU has higher processing capability than the other. This is useful in designs that require a lower-performance system controller and a higher-performance application processor.

As seen in Figure 3, one CPU core in the Atmel SAM4C8C has floating-point capability while the other is a fixed-point core. The SAM4C8C has 512 KB of Flash memory and 128+16+8 KB of SRAM. Processing tasks are allocated to the appropriate CPU to increase efficiency. Two high-speed AHB multilayer bus-matrix interconnects are used to support a maximum amount of processing overlap. Separate DMA controllers and interrupt controllers support efficient data transfers without CPU intervention. A simple asynchronous AHB-to-AHB bridge is used for processing synchronization and data transfers between CPU addressing spaces, even under DMA control.


Figure 3: Atmel dual-CPU core SAM4C8CA bus-interface architecture. (Courtesy of Atmel) 

Low power and efficient data transfers

You might get the idea that these multiple-bus architectures are targeted only at the highest-performance systems, but even low-power applications can take advantage of efficient busing architectures. The Texas Instruments MSP430F5507IRGZR, of the supplier’s MSP430 MCU family, integrates USB, LCD control, and high-performance analog all on a single chip for small-footprint applications. Peripherals have several methods for operating autonomously, and this can help reduce operating power when the CPU is put into a low-power mode, as illustrated in Figure 4.


Figure 4: TI MSP430 family low-power operation by using autonomous peripherals. (Courtesy of TI) 

By using peripheral buses that stay active even during low-power operation, it is possible to sample data from the ADC, transfer data to memory, output a PWM signal, update the LCD display, and send and receive serial data, all while the CPU is in the low-power standby state. Note that the fast wake time makes it possible to respond quickly to peripheral requests when needed, without burning a significant amount of power while waking up. With such a capability, even short bursts of CPU activity can be efficient.
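
The effect of wake time on power can be estimated with a simple time-weighted average over one wake/sleep period. The sketch below illustrates that arithmetic under stated assumptions: the currents and times are placeholder values rather than MSP430 datasheet figures, the function name is hypothetical, and the wake-up interval is charged at the full active current as a pessimistic simplification.

```python
def average_current_ua(active_ua, standby_ua, active_s, wake_s, period_s):
    """Time-weighted average current over one period, in microamps.

    The wake-up interval is charged at the active current (an assumption).
    """
    awake_s = active_s + wake_s
    if awake_s > period_s:
        raise ValueError("active plus wake time exceeds the period")
    asleep_s = period_s - awake_s
    return (active_ua * awake_s + standby_ua * asleep_s) / period_s


# Placeholder values: 2 mA active, 2 uA standby, 100 us of CPU work every 10 ms.
fast_wake = average_current_ua(2000.0, 2.0, 100e-6, 5e-6, 10e-3)   # 5 us wake-up
slow_wake = average_current_ua(2000.0, 2.0, 100e-6, 1e-3, 10e-3)   # 1 ms wake-up
```

When the CPU only needs to run briefly, a slow wake-up can dominate the time spent awake, so `slow_wake` comes out several times higher than `fast_wake` even though the useful work is identical.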

Summary

Getting the most performance out of a complex MCU requires a significant amount of overlapping bus activity between peripherals and memory and, when needed, to and from the CPU. Often the most efficient implementations will have several transfers operating at once without any CPU activity involved. Understanding the capabilities and limitations of the MCU’s bus interface architecture is critical to achieving a high level of efficiency.
