Multicore Solutions for Portable Designs


In the beginning there were cell phones, and they were good – well, allowing for the fact that they were the size and weight of bricks and could only do one thing: make calls. Today’s cellular handsets are the microcomputers of the 21st century, able to run a seemingly endless number of “apps”, stream high-definition video and high-quality audio, snap and process 12-megapixel pictures, and still make calls.

Moore’s Law notwithstanding, that is a lot to ask of a single processor, especially one that needs to operate from a small battery for an extended period of time. Handsets have long used separate applications processors to offload work from the main processor. However, with ARM’s recent introduction of its big.LITTLE approach – and NXP’s implementation of it in a low-power dual-core (M4/M0) embedded MCU – the use of asymmetrical multicore processors (AMPs) in other portable devices looks set to move quickly from niche to mainstream.

Texas Instruments’ OMAP and DaVinci

Texas Instruments’ OMAP™ SoCs have long been the dominant applications processors in cellular handsets. Since streaming video and audio are best handled by a DSP in the data path, OMAP SoCs all combine a general-purpose ARM® processor with a TI DSP.

The original 130 nm OMAP 1 family – such as the OMAP5912ZZG – paired a 192 MHz ARM926EJ-S™ with a TMS320C55x™ DSP core. The internal bus structure – one program bus, three data read buses, two data write buses, and additional buses for peripheral and DMA activity – efficiently enabled the DSP to perform up to three data reads and two data writes in a single cycle for relatively high-speed video and image processing.

Moving down to 65 nm, the OMAP 3 family increased speed significantly. The 600 MHz OMAP3530DZCBB, for example, upgraded the ARM926EJ-S to a Cortex™-A8; the C55x™ DSP to a TMS320C64x™; and added a POWERVR™ SGX graphics accelerator and a NEON™ SIMD coprocessor. While most of the OMAP 3 series of SoCs are sold directly to handset OEMs, the OMAP3530 is a catalog item targeting embedded developers. TI provides a series of OMAP3530 training videos on Hotenda’s web site.

Continuing to up the ante, TI’s 45 nm OMAP 4 platform moved to a dual-core ARM Cortex-A9 MPCore™ processor supporting symmetrical multiprocessing (SMP); switched to a programmable multimedia engine based on the C64x™ DSP; added an IVA 3 hardware accelerator; and upgraded to a POWERVR SGX540 3D graphics accelerator (see Figure 1). With high-end applications in mind, the OMAP4460 can deliver 1080p multi-standard video record and playback as well as stereoscopic 3D encode/decode. While TI utilizes just about every power management trick in the book with these chips, you should not expect to play high-speed online 3D games all day long without recharging your cell phone. Developers wishing to evaluate the OMAP4460 should check out SVTronics’ popular PandaBoard ES.



Figure 1: Texas Instruments’ OMAP44x block diagram (Courtesy of Texas Instruments).

While TI’s OMAP 5 family – built around ARM’s dual-core Cortex-A15 and two Cortex-M4s – is clearly aimed more at servers than cell phones, the OMAP-L138 – derived from the DaVinci™ family of video processors – moves back down the power curve with portable devices in mind. The OMAP-L138 combines an ARM926EJ-S RISC MPU with a TMS320C674x fixed/floating-point VLIW DSP running at a modest 375 or 456 MHz. In contrast to the DaVinci chips, the OMAP-L138 supports a wider, less video-specific set of peripherals and includes a floating-point DSP. TI markets an OMAP-L138 experimenter kit for evaluating the device.

If your project is more video oriented, the DM644x series of dual-core DaVinci DSPs may just fill the bill. The TMS320DM6446 includes an ARM926EJ-S core running at up to 405 MHz and a VLIW TMS320C64x+ DSP core running at up to 810 MHz. The chip is available for in-circuit testing on the DM6446 evaluation module.

Analog Devices’ Blackfin

Designed for low-power portable applications, Analog Devices’ Blackfin® is a venerable processor that acts as if it is a dual core – and some of them actually are. Co-developed with Intel, the Blackfin family consists of a wide range of small 16/32-bit RISC processors running anywhere from 300 to 600 MHz. The processor is based on a SIMD architecture and features two 16-bit MACs, two 40-bit ALUs, and a flat address space. Each MAC can perform a 16-bit by 16-bit multiply in each cycle, and special instructions are included to accelerate various signal processing tasks, so a Blackfin can perform control functions and simultaneously act as a DSP. ADI claims that Blackfin displays “best-in-class MHz/mW performance,” though that has become a hotly contested metric that everyone is chasing.
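
To make the dual-MAC idea concrete, here is a minimal portable C sketch of a dot product that issues two 16-bit by 16-bit multiplies per loop pass, one for each of two accumulators clamped to 40 bits. It only models the arithmetic described above; the names sat40 and dot_dual_mac are hypothetical, this is not ADI library code, and a real Blackfin build would rely on the compiler or intrinsics to map the loop onto the hardware MACs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp a wide value into the signed 40-bit range of one accumulator. */
static int64_t sat40(int64_t v)
{
    const int64_t max40 = ((int64_t)1 << 39) - 1;
    const int64_t min40 = -((int64_t)1 << 39);
    if (v > max40) return max40;
    if (v < min40) return min40;
    return v;
}

/* Dot product that models two MAC units: each loop pass performs two
 * 16-bit x 16-bit multiplies, one per (hypothetical) accumulator. */
int64_t dot_dual_mac(const int16_t *a, const int16_t *b, size_t n)
{
    int64_t acc0 = 0, acc1 = 0;
    size_t i = 0;

    assert(a != NULL && b != NULL);

    for (; i + 1 < n; i += 2) {
        acc0 = sat40(acc0 + (int32_t)a[i] * b[i]);         /* MAC 0 */
        acc1 = sat40(acc1 + (int32_t)a[i + 1] * b[i + 1]); /* MAC 1 */
    }
    if (i < n)                                  /* odd-length tail */
        acc0 = sat40(acc0 + (int32_t)a[i] * b[i]);

    return sat40(acc0 + acc1);                  /* combine accumulators */
}
```

Splitting the sum across two accumulators is what lets both multipliers stay busy on every pass; the extra accumulator bits beyond 32 are headroom for long sums.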

The ADSP-BF561SBBZ600 (see Figure 2) is a true dual-core device containing two 600 MHz Blackfin cores, each with two 16-bit MACs, two 40-bit ALUs, four 8-bit video ALUs, a 40-bit shift register, 128 Kbytes of low-latency on-chip L2 SRAM, and an external memory controller. This is a symmetric multiprocessor (SMP) device targeting a variety of multimedia, industrial, and telecommunications applications. ADI provides numerous product training modules on the Hotenda site, including a Blackfin processor core architecture overview, Blackfin system services, and Blackfin optimizations for performance and power consumption.

Figure 2: Analog Devices’ ADSP-BF561 functional block diagram (Courtesy of Analog Devices).

Freescale Semiconductors’ QorIQ

The Freescale QorIQ™ P1022 is an SMP processor built around two Power Architecture™ e500v2 cores that share a 256 Kbyte L2 cache (see Figure 3). With a clear emphasis on connectivity, the P1022 includes virtualized enhanced three-speed Ethernet with TCP/UDP/IP offload, direct FIFO mode for ASIC connectivity, SATA for local storage, support for three PCI Express interface options, plus the usual USB, SPI, multiple GPIOs, etc. The QorIQ P1022NSN2LFB runs at 1055 MHz and features a double-precision floating-point unit. The P1 Platform Overview training module provides an introduction to the processor family, and the P1022 Multicore Development System lets you get some hands-on experience with the chip.

Figure 3: Freescale Semiconductor’s QorIQ P1022 block diagram (Courtesy of Freescale).

NXP’s LPC4350

The latest entry into the low-power multicore market is NXP’s LPC4350, which it bills as “the world’s first dual-core DSC.” Following ARM’s “big.LITTLE” approach – the same one TI took with its OMAP 5 series using Cortex-M4s and Cortex-A15s – NXP combined Cortex-M4 and Cortex-M0 cores in the much lower power LPC4350.

In order to minimize power consumption, the 204 MHz LPC4350 uses the Cortex-M0 core to offload work from the Cortex-M4 whenever possible and the Cortex-M4 to burst data quickly as needed. Clearly targeting the embedded market, the LPC4350’s connectivity options include CAN, EBI/EMI, Ethernet, I²C, Microwire, SD/MMC, SPI, SSI, SSP, UART/USART, and USB OTG; built-in peripherals include brown-out detect/reset, DMA, I²S, LCD, motor control PWM, POR, PWM, and WDT (see Figure 4).

Figure 4: NXP Semiconductors’ LPC4350 block diagram (Courtesy of NXP Semiconductors).
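
One common way to hand work from one core to the other in a dual-core part like this is a single-producer, single-consumer mailbox in shared SRAM. The sketch below is a generic illustration in portable C11, not NXP’s API: mailbox_t, mbox_init, mbox_post, and mbox_take are hypothetical names, and it assumes one core only posts while the other only takes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MBOX_SLOTS 8u   /* hypothetical depth; must be a power of two */

typedef struct {
    _Atomic uint32_t head;        /* advanced only by the producer core */
    _Atomic uint32_t tail;        /* advanced only by the consumer core */
    uint32_t slots[MBOX_SLOTS];   /* job words living in shared memory  */
} mailbox_t;

void mbox_init(mailbox_t *m)
{
    assert(m != NULL);
    atomic_init(&m->head, 0u);
    atomic_init(&m->tail, 0u);
}

/* Producer side (e.g. the Cortex-M4): returns false if the box is full. */
bool mbox_post(mailbox_t *m, uint32_t job)
{
    uint32_t head = atomic_load_explicit(&m->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&m->tail, memory_order_acquire);

    if (head - tail == MBOX_SLOTS)
        return false;
    m->slots[head & (MBOX_SLOTS - 1u)] = job;
    /* Release: the slot write becomes visible before the new head. */
    atomic_store_explicit(&m->head, head + 1u, memory_order_release);
    return true;
}

/* Consumer side (e.g. the Cortex-M0): returns false if the box is empty. */
bool mbox_take(mailbox_t *m, uint32_t *job)
{
    uint32_t tail = atomic_load_explicit(&m->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&m->head, memory_order_acquire);

    if (head == tail)
        return false;
    *job = m->slots[tail & (MBOX_SLOTS - 1u)];
    atomic_store_explicit(&m->tail, tail + 1u, memory_order_release);
    return true;
}
```

Because each index has exactly one writer, no lock is needed. On real hardware the idle core would typically sleep and be woken by an inter-core interrupt rather than poll mbox_take.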

The LPC4350 adds two interesting new features: a State Configurable Timer (SCT) and Quad SPI Flash Interface (SPIFI). The SCT subsystem sits on the AHB bus and consists of two 16-bit counters or one 32-bit counter that can be clocked either by the bus clock or an external input. The SCT enables sequencing across multiple counter cycles and enables events to control inputs, outputs, and other events. The SPIFI interface can carry four lanes of data at up to 40 MB per second to external flash memory, a unique and very useful trick.
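
As a quick sanity check on the SPIFI figure, the helper below works out the serial clock implied by a given throughput and lane count. It assumes 1 MB means 10^6 bytes, one bit per lane per clock, and no command or address overhead; spi_clock_hz is a hypothetical name, not part of any NXP driver.

```c
#include <assert.h>
#include <stdint.h>

/* Clock rate (Hz) needed to move bytes_per_sec over `lanes` data lines,
 * assuming one bit per lane per clock and no protocol overhead. */
uint64_t spi_clock_hz(uint64_t bytes_per_sec, unsigned lanes)
{
    assert(lanes > 0u);
    return bytes_per_sec * 8u / lanes;
}
```

Under those assumptions, spi_clock_hz(40000000, 4) gives 80,000,000, so 40 MB per second over four lanes corresponds to an 80 MHz clock, whereas a single-lane SPI link would need 320 MHz for the same throughput.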

All told, the LPC4350 is an interesting new entry into the low-power multicore market; one that raises the bar that others will have to meet. But then those products are presumably already in the pipeline, which will continue to make this market even more interesting.

ARM’s big.LITTLE architecture

Since all the vendors mentioned in this article are ARM licensees, it is safe to assume that ARM’s roadmap will play out in silicon in the fairly near term. The latest major addition to that roadmap – announced last October – is its ‘big.LITTLE’ approach to multicore processors. Even Freescale, its Power Architecture license notwithstanding, has formally signed on. NXP is already shipping first silicon based on big.LITTLE, though not the version that ARM proposed late last year.

The first big.LITTLE design from ARM pairs the Cortex-A15 with a Cortex-A7. The central tenet of big.LITTLE is that both cores must be essentially architecturally identical so that all instructions will execute consistently across both cores. The Cortex-A15 and Cortex-A7 both share the full ARMv7-A architecture including virtualization and Large Physical Address Extensions, so that above the micro-architecture level, they are fully compatible. So too are the Cortex-M4 and Cortex-M0. As long as this symmetry is maintained, an SoC can contain any number of matched big.LITTLE cores.

Benchmark      Cortex-A15 vs Cortex-A7    Cortex-A7 vs Cortex-A15
               Performance                Energy Efficiency
Dhrystone      1.9x                       3.5x
FDCT           2.3x                       3.8x
IMDCT          3.0x                       3.0x
MemCopy L1     1.9x                       2.3x
MemCopy L2     1.9x                       3.4x

Table 1: Cortex-A15 and Cortex-A7 performance and energy comparison (Courtesy of ARM).

At the microarchitecture level, the Cortex-A15 has a far more complex pipeline than the Cortex-A7, and their performance is quite different (see Table 1). The Cortex-A15 trades power for performance, while the Cortex-A7 does the opposite; these differences will largely determine how applications are partitioned between the two.

The Cortex-A15 and Cortex-A7 share memory and system ports via the CCI-400 (Cache Coherent Interconnect) as shown in Figure 5, though each pair of cores has its own integrated level-2 cache. The Cortex-A15 and Cortex-A7 pairs also share a programmable Generic Interrupt Controller (GIC-400), which distributes up to 480 interrupts among the various cores. Addressing a major multicore hurdle, both the Cortex-A15 and Cortex-A7 offer trace solutions that enable programmers to debug their code using ARM’s CoreSight™ SoC.

Figure 5: Cortex-A15 CCI Cortex-A7 system (Courtesy of ARM).

ARM’s approach with big.LITTLE involves using the Cortex-A7 for as much of the processing as possible and migrating tasks to the Cortex-A15 only when its performance is needed, at which point both the operating system and the application move to the faster core. With processors operating at 1 GHz, ARM says this migration can take place in less than 20 µs. That is possible because the two processors are architecturally identical and there is a 1:1 mapping between the state registers in the inbound and outbound processors.
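
The migration decision itself can be pictured as a simple threshold policy with hysteresis. The sketch below is only an illustration under assumed numbers: the 85% and 30% load thresholds and the names next_cluster and migration_cycles are hypothetical, not ARM’s switching software. It also converts the migration time quoted above into clock cycles.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { CLUSTER_LITTLE, CLUSTER_BIG } cluster_t;

/* Hypothetical thresholds; the gap between them is the hysteresis band
 * that keeps the system from bouncing between clusters. */
#define UP_THRESHOLD_PCT   85u
#define DOWN_THRESHOLD_PCT 30u

/* Pick the cluster to run on next, given the current one and CPU load. */
cluster_t next_cluster(cluster_t current, unsigned load_pct)
{
    assert(load_pct <= 100u);
    if (current == CLUSTER_LITTLE && load_pct >= UP_THRESHOLD_PCT)
        return CLUSTER_BIG;      /* the little core is saturated */
    if (current == CLUSTER_BIG && load_pct <= DOWN_THRESHOLD_PCT)
        return CLUSTER_LITTLE;   /* the big core is mostly idle */
    return current;              /* inside the band: stay put */
}

/* Cost of one migration expressed in clock cycles. */
uint64_t migration_cycles(uint64_t clock_hz, uint64_t migration_us)
{
    return clock_hz * migration_us / 1000000u;
}
```

At 1 GHz, a 20 µs migration works out to 20,000 cycles, so a policy like this only pays off when the load change is expected to last much longer than that.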

All told, ARM’s big.LITTLE approach would seem to make a lot of sense, both logically and architecturally. By using coherent caches and automating interrupt handling and memory access, this architecture could result in a new wave of multicore processors for portable devices hitting the distribution channels later this year. That is not guaranteed, but it is a very good bet.

Summary

As the number of multicore processors for portable designs continues to grow, embedded designers who choose one with an appropriate architecture can avoid the twin pitfalls of overburdening their main MCU and adding an extra processor. That is just one further bad trade-off that developers can soon put behind them.
