MCU-based designs face increasingly aggressive system requirements. Whether the target is ultra-low power, high performance, or a difficult mix of the two, it can be a real struggle to deliver designs that satisfy today's demands. Data movement is one of the most common operations in MCU-based designs, and the efficient implementation of data transfer functions can be critical to hitting either low-power or high-performance targets. A detailed understanding of the features in modern MCUs that support fast and efficient data transfers is therefore essential to creating optimal MCU designs.
Data movement can be on-chip, from one functional block to another, or off-chip, over a standard interface. Transfers on- and off-chip typically involve a standard peripheral or memory interface. Many peripherals now include higher-level functions and the intelligence to operate somewhat independently of the CPU. One example is buffering data locally at the peripheral so that transfers occur in convenient packets, minimizing CPU overhead.
Data movement on-chip can be done efficiently using special data transfer resources. For example, data transfers between peripherals and memory can be done using intelligent Direct Memory Access (DMA) capabilities. The CPU need only be involved in setting up the transfer parameters (such as the memory location, transfer length, and active peripherals) and can be informed when the transfer is completed. The CPU can be doing other tasks while data is being transferred or can be put into a low-power mode to save power.
Many modern MCUs include advanced on-chip bus structures that allow multiple data transfers to occur simultaneously. When multiple data streams move at once, the highest transfer bandwidths can be achieved. Additionally, in these cases power consumption is often much lower than when the same transfers are done sequentially.
There are many other features and capabilities in modern MCUs that assist in implementing efficient data transfers. It is not possible to illustrate all of them here, but we will review a few of the most common; once these are understood, it will be clear how to use other features to improve data transfer efficiency in your designs. We will look at four key examples: how a smart USB peripheral can reduce CPU overhead, how intelligent DMA can independently manage complex data movement requirements, how low-power modes can enhance power efficiency during data transfers, and how an advanced on-chip bus matrix can support multiple simultaneous data transfers. We will also see that when these capabilities are used in combination, even higher levels of efficiency are possible.
Smart peripherals: A USB example
USB is a good example peripheral for showing possible improvements in data transfer. Early implementations of USB had a maximum throughput of only 1.5 Mbit/s, and some elements of the standard were based on this slow data rate. As higher-performance versions of the standard emerged and a wider set of applications were targeted, USB peripheral implementations needed to become more creative. In particular, getting close to the theoretical maximum of the 12 Mbit/s full-speed USB standard required some differentiated features. Atmel added several features to the USB peripheral of the XMEGA MCU that support dramatic efficiency improvements. These features also illustrate techniques that can be applied to other types of peripherals to improve data transfer efficiency.
Often a single memory buffer is used for peripheral data transfers. If the data buffer is full, the MCU responds with a NAK (Negative Acknowledge) handshake. Upon receiving a NAK, the host waits and retries the transfer later, continuing to retry until the MCU can successfully receive the data. The Atmel XMEGA MCU uses a ping-pong buffer to eliminate this problem. A ping-pong buffer uses two memory banks for data transfers: when one bank is full, the host can transfer data to the other bank. Alternating between the two banks eliminates retries and improves overall data bandwidth. Additionally, as shown in Figure 1, the ping-pong arrangement gives the MCU more time to process data. Without ping-pong, the CPU can only process data between transfers. With ping-pong, the CPU can process data during a portion of the transfer cycle, reducing the likelihood that a NAK will be needed to "catch up" on data processing.
Figure 1: Ping-pong data buffering improves efficiency (Courtesy of Atmel).
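The bank-swapping logic can be sketched in a few lines of C. This is a minimal host-runnable model of the ping-pong idea, not the XMEGA register interface; the type and function names are illustrative.

```c
#include <assert.h>
#include <string.h>

/* Two-bank ping-pong buffer model: while the "host" fills one bank,
 * the CPU processes the other, so the device never has to NAK. */
#define BANK_SIZE 64

typedef struct {
    unsigned char bank[2][BANK_SIZE];
    int fill_bank;              /* bank currently receiving host data */
} pingpong_t;

/* Host writes a full packet into the active fill bank. */
static void host_fill(pingpong_t *pp, const unsigned char *pkt)
{
    memcpy(pp->bank[pp->fill_bank], pkt, BANK_SIZE);
}

/* CPU claims the just-filled bank for processing; the host is
 * immediately redirected to the other bank, so it never retries. */
static unsigned char *cpu_claim(pingpong_t *pp)
{
    unsigned char *ready = pp->bank[pp->fill_bank];
    pp->fill_bank ^= 1;         /* swap banks */
    return ready;
}
```

The key point is that `cpu_claim` swaps banks before the CPU begins processing, so host transfers and CPU processing overlap in time.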
Another key feature that improves data transfer efficiency in the Atmel XMEGA MCU is the multi-packet transfer mode. This mode is used when the data being transferred over the USB port exceeds the maximum packet size (64 bytes) allowed in the BULK transfer mode for full-speed USB. Without this feature, the data needed to be split at the host and then merged at the receiver in software, increasing CPU load. Multi-packet buffering adds dedicated hardware to the USB peripheral that automatically performs the splits and merges required when the data is larger than the maximum USB packet size. This mode also reduces the number of interrupts required, since the CPU needs to be interrupted only at the end of the entire transfer. The CPU can therefore handle other tasks, or be put into sleep mode, until the complete transfer is ready to be processed.
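The interrupt savings are easy to quantify. The sketch below (hypothetical helper functions, not the XMEGA driver API) counts CPU interrupts for a transfer with and without multi-packet hardware, assuming one interrupt per 64-byte packet in the baseline case.

```c
#include <assert.h>

/* Full-speed USB limits BULK packets to 64 bytes. */
#define USB_FS_BULK_MAX 64

/* Without multi-packet hardware: one CPU interrupt per packet
 * (ceiling division; the zero-length terminating packet required for
 * exact multiples of 64 is ignored in this sketch). */
static int interrupts_without_multipacket(int total_bytes)
{
    return (total_bytes + USB_FS_BULK_MAX - 1) / USB_FS_BULK_MAX;
}

/* With multi-packet hardware: splits and merges happen in the
 * peripheral, and the CPU sees a single completion interrupt. */
static int interrupts_with_multipacket(int total_bytes)
{
    (void)total_bytes;
    return 1;
}
```

For a 1 KB transfer the baseline takes 16 interrupts versus one with multi-packet mode, which is where much of the CPU-load reduction comes from.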
Combining ping-pong buffering with the multi-packet transfer mode improves transfer bandwidth from 5.6 Mbit/s (in the baseline BULK transfer implementation without either feature) to 8.7 Mbit/s. Perhaps more importantly, CPU load is reduced from 46 percent in the baseline to only 9 percent with both features enabled. Used together, these features deliver exceptional improvements in both performance and power, exactly the kinds of benefits intelligent peripherals can bring to your designs. Look for similar features in the key peripherals of your next design. For more information, Atmel offers a Product Training Module on the XMEGA.
Efficient data transfers using intelligent DMA
Perhaps the first data-transfer-oriented special function that most designers become familiar with is Direct Memory Access (DMA). This block can move data from a source to a destination autonomously. First-generation DMA functions were little more than an address register that could be incremented, along with a small state machine to manage the memory read and write signals. Modern DMA controllers include advanced functionality to off-load the CPU as much as possible from even the most complex data transfer operations. The Renesas RX621 MCU family's DMA implementation is a good example of some of the more advanced features available for data transfer functions.
The RX621 DMA controller (Figure 2) can connect to the interrupt controller so that interrupt-driven data transfers are possible. For example, an intelligent peripheral could buffer data until a packet is available for transfer, then issue an interrupt to the DMA controller so the packet is moved to main memory for processing by the CPU. The CPU might not need to process data until a large number of packets is available, so the DMA controller can wait until enough packets have been transferred before issuing an initiating interrupt to the CPU. Note how the peripheral data buffer, the larger main-memory buffer, the DMA controller, and the integrated interrupt system all work together to eliminate CPU data transfer overhead. You can target dramatic efficiency improvements by matching your algorithms to the autonomous data transfer resources in your MCU.
Figure 2: Renesas RX621 DMA controller block diagram (Courtesy of Renesas).
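The packet-threshold scheme described above can be modeled in a few lines. This is an illustrative host-runnable sketch, not the RX621 register interface: the DMA counts peripheral "packet ready" events and only interrupts the CPU after a configurable number of packets have been moved.

```c
#include <assert.h>

/* Model of threshold-based DMA interrupt batching. */
typedef struct {
    int packets_moved;      /* packets DMA has moved to main memory */
    int threshold;          /* packets per CPU interrupt */
    int cpu_interrupts;     /* CPU wake-ups actually issued */
} dma_ctrl_t;

/* Called on each peripheral "packet ready" interrupt: the DMA moves
 * one packet, and wakes the CPU only every 'threshold' packets. */
static void dma_on_packet(dma_ctrl_t *d)
{
    d->packets_moved++;
    if (d->packets_moved % d->threshold == 0)
        d->cpu_interrupts++;
}
```

With a threshold of four, eight peripheral interrupts collapse into two CPU interrupts; the CPU (or a low-power mode) gets the remaining time.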
The RX621 DMA controller has four independent channels and a variety of addressing modes to support even complex data transfers without CPU intervention. For example, offset addition during a DMA transfer skips addresses in the memory sequence, allowing scattered source data elements to be gathered automatically at the destination address. You can even implement complex operations like a matrix flip (where a data element's X and Y locations are swapped) using the offset addition capability. These operations can stage data for very efficient CPU processing, since all the required data ends up stored contiguously. If the CPU had to access data out of sequence, cache misses could reduce processing efficiency dramatically. Look for similar opportunities to gather data for efficient CPU access in the "inner loops" of your data processing algorithms.
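The effect of offset addition can be shown with a small software model. The function below (illustrative, not the RX621 DMAC API) advances the source address by a fixed offset after each element, which is exactly what gathers one column of a row-major matrix into a contiguous destination buffer.

```c
#include <assert.h>

/* Model of DMA "offset addition": after each element, the source
 * address advances by 'offset' elements instead of one, so scattered
 * source data lands contiguously at the destination. */
static void dma_gather(const int *src, int *dst, int count, int offset)
{
    for (int i = 0; i < count; i++)
        dst[i] = src[i * offset];
}
```

Gathering every column this way performs the matrix flip: each column of the source becomes a contiguous row at the destination, ready for sequential, cache-friendly CPU access.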
The RX621 also includes two other features that further improve data transfer efficiency: the Data Transfer Controller (DTC) and the External Memory DMA Controller (EXDMA). The DTC is a DMA-like function, but the data transfer information (starting addresses, transfer length, and more) is stored as activation vectors in main memory. This allows much more complex data transfer operations to be supported. Transfer chaining is implemented by automatically moving to a new activation vector once the current transfer completes. By managing the links between transfer records, complex data transfer operations can be defined and then chained together as required. CPU overhead is minimized, since the CPU needs only to set up and tear down the chaining configuration data before the transfer is initiated.
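Memory-resident transfer descriptors chained by pointers can be sketched as follows. This is a simplified host-runnable model of the idea; the actual DTC activation-vector format is defined in the Renesas hardware manual and differs from these illustrative names.

```c
#include <assert.h>
#include <string.h>

/* Model of a chained transfer descriptor ("activation vector"):
 * source, destination, length, and a link to the next descriptor. */
typedef struct xfer_desc {
    const unsigned char *src;
    unsigned char       *dst;
    int                  len;
    struct xfer_desc    *next;   /* chain; 0 terminates the list */
} xfer_desc_t;

/* The transfer engine walks the chain autonomously; the CPU only
 * builds the descriptor list and kicks off the first transfer. */
static void dtc_run(const xfer_desc_t *d)
{
    for (; d; d = d->next)
        memcpy(d->dst, d->src, (size_t)d->len);
}
```

Because the descriptors live in ordinary memory, arbitrarily complex transfer sequences can be composed simply by relinking the chain before the transfer is started.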
The EXDMA subsystem is designed exclusively for external memory bus transfers. It has two channels and many of the features of the standard DMA controller. An additional transfer mode, Cluster Transfer, can be used to manage complex data-buffer structures efficiently. A two-address mode effectively makes two data transfers at the same time: one transfer moves a cluster of data into the cluster memory area, while a second moves data from the cluster area to a different memory area. This approach can simplify the management of data structures for networking packets and video frames, freeing the CPU from implementing common buffer management functions. Look for more specialized DMA functions like this as MCUs become more application focused.
For more information on RX DMAC see the Renesas Product Training Module.
Using low-power modes to improve data transfer efficiency
The previous examples showed that there are many opportunities to reduce overhead by eliminating low-level CPU involvement in even complex data transfer operations. This raises the question: what should the CPU be doing if it is not needed to implement data transfers? In some cases it is efficient to overlap CPU operations with data transfers. It is common to capture a significant amount of data and then have the CPU perform the processing and control portions of the algorithm. If the next data set can be captured, using a peripheral and a DMA controller, while the CPU computes on the previous data set, the effective processing bandwidth increases significantly. If the CPU is powerful enough that only a fraction of its processing power is required, it may make sense to put it into a low-power mode until it is needed for data processing.
The Texas Instruments MSP430 MCU family has a variety of low-power modes that can be used to cut power when the CPU is not needed for a portion of the algorithm. As shown at the bottom of Figure 3, two clock sources are available in low-power modes: the on-demand high-speed Digitally Controlled Oscillator (DCO)-sourced clock, MCLK, and the always-on low-speed peripheral clock, ACLK. In Active Mode the MCU is in its highest power state: both clocks are active and the MCU draws around 250 μA. In CPU Off Mode the CPU is off, both clocks remain on, and the current drops to 35 μA. In Standby Mode the CPU and the DCO clock are off while ACLK stays on, and the current is 0.8 μA. Finally, in All Off Mode the CPU and both clocks are off and the current is only 0.1 μA. By reducing the time the CPU spends in Active Mode, and provided the standby current is sufficiently low, the total energy consumed, indicated by the area under the curve at the top of Figure 3, can be minimized.
Figure 3: TI MSP430 active power, standby power, and fast clock wake-up time (Courtesy of Texas Instruments).
The time spent in Active Mode depends not only on the amount of processing required, but also on the time it takes to transition into and out of the low-power states. On the TI MSP430 (for more information see the TI Product Training Module) the MCLK can be activated in just over 200 ns, making wake-up virtually instantaneous. This fast wake-up means less power is wasted waiting for the clock to become active, so processing can begin right away. The combination of fast processing, fast wake-up, and low dissipation in the low-power modes is the "sweet spot" for efficient MCU designs. Make sure you consider each of these elements carefully in your MCU design.
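The "area under the curve" argument is worth making concrete. The sketch below computes the average current of a duty-cycled MCU using the Active and Standby currents quoted above (250 μA and 0.8 μA); the function name and duty-cycle figures are illustrative, not from a TI datasheet.

```c
#include <assert.h>

/* Average current for a duty-cycled MCU that alternates between
 * Active Mode (250 uA) and Standby Mode (0.8 uA). */
static double avg_current_ua(double active_ms, double standby_ms)
{
    const double ACTIVE_UA  = 250.0;
    const double STANDBY_UA = 0.8;
    return (ACTIVE_UA * active_ms + STANDBY_UA * standby_ms)
           / (active_ms + standby_ms);
}
```

For example, waking for 1 ms of processing every 100 ms gives an average of about 3.3 μA, nearly two orders of magnitude below the Active Mode figure, which is why fast wake-up (less forced Active time) matters so much.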
Multiple simultaneous data transfers using an on-chip bus matrix
One possible bottleneck you may have noticed in the previous examples, where a significant amount of simultaneous data transfer occurs, is the need for multiple on-chip buses to carry all of this traffic efficiently. One example of an MCU with an efficient multiple-bus implementation is the Freescale Kinetis K70. The K70 uses a crossbar switch, as shown in Figure 4, to connect a variety of bus master and bus slave modules. Eight bus masters on the left side of the diagram can connect to eight slaves on the right. Note that some masters share ports; for example, the DMA controller and the EzPort share master port M2. These masters are never active at the same time, so sharing a master port never causes a conflict.
In some multiple-bus implementations there are limitations on which master can connect to which slave. The Kinetis K70 implements a full crossbar switch, so every master can connect to every slave, supporting a significant amount of overlapped data transfer. One application for which the Kinetis K70 is optimized, and which requires a significant amount of data transfer, is a touchscreen LCD graphical user interface (GUI). Figure 4 shows that the LCD controller background plane and the LCD controller graphic window each have their own master ports. There are also multiple DRAM controller slave ports, so both LCD masters can access their own memory buffers in DRAM. This significantly reduces CPU overhead in managing the GUI display.
Figure 4: Freescale Kinetis K70 crossbar switch supports overlapped data transfers (Courtesy of Freescale).
Other masters on the crossbar switch can also overlap operations for additional efficiency improvements. For example, the Ethernet master can transfer data directly to/from the DRAM via the DRAM controller, while the DMA controller manages transfers between slower speed peripherals and on-chip memory. Algorithms can be easily optimized based on processing requirements without facing artificial bottlenecks that less well-connected bus systems would impose. For more information see the Freescale Kinetis Product Training Module.
MCUs with special data transfer functions and intelligent peripherals can achieve very high levels of data transfer efficiency. Because data movement is one of the most common operations in MCU designs, knowing how to use these capabilities efficiently can be critical to meeting aggressive low-power and high-performance goals. The examples illustrated in this article can be extended to a wide variety of other data transfer functions so you can hit the efficiency levels your design requires.
For more information on the MCUs and the data transfer features discussed here, use the links provided to access product pages and training modules available on the Hotenda website.