Host Controller and System Partitioning Options for 3-D Gesture Recognition

User interfaces used to be simple switches, such as a key on a keyboard or the left and right buttons of a mouse. Joysticks and mice gave us good linear control in the analog realm, and remain very straightforward and intuitive. An analog value, converted to digital, is read and acted upon. Often, several samples in time are used to verify a motion in a specific direction, speed and angle. Averaging of samples can also help tame glitchy motion.
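The sample averaging mentioned above can be sketched as a simple moving-average filter. This is only a minimal illustration: the class name, the function name, and the window size are assumptions made for the example, not taken from any particular controller or driver.

```python
from collections import deque


class MovingAverage:
    """Average the most recent `window` samples of a digitized analog axis."""

    def __init__(self, window=4):
        # deque with maxlen drops the oldest sample automatically
        self.samples = deque(maxlen=window)

    def update(self, value):
        """Add one new sample and return the current average."""
        self.samples.append(value)
        return sum(self.samples) / len(self.samples)


def smooth(readings, window=4):
    """Run a whole list of readings through the filter (hypothetical helper)."""
    avg = MovingAverage(window)
    return [avg.update(r) for r in readings]
```

A single glitchy reading, such as the 50 in `smooth([10, 10, 50, 10], window=2)`, is pulled toward its neighbors rather than passed straight through. A longer window smooths more but adds lag to the pointer.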

Touchscreens followed this trend toward simplified detection and directional decoding, at least until multi-touch, which changed everything. With multi-touch, a new layer of useful, expressive input became possible. In reality, you could perform the same functions on a standard single-touch screen as on a multi-touch screen, but with multi-touch you don’t need specialized graphic control zones (such as scroll bars on the side) to decode your actions.

Like a mouse, a touch screen gives us two dimensions of control, and two (or more) fingers can be detected in motion to convey intent. But why limit ourselves to two dimensions when we can do three (or four)? With wearable computers gaining a lot of attention, a mouse and touch screen can become a bother. Instead, gesture recognition through the use of video cameras may be one of the main ways we interact with our wearable computers and the local peripherals attached to us in a Personal Area Network (PAN).

Indeed, human gesture recognition has become a popular new way to input information in gaming, consumer and mobile devices, including smartphones and tablets. This article looks at system architectures and techniques that can be applied to gesture recognition and explores the MCUs that can be used to interface with and decode video streams that contain gesture-intent commands. It also covers the features to look for that allow the micro to more easily page through the frame memory, find edges, detect movements and directions, and come to conclusions about intent. All parts, datasheets, reference designs, tutorials, and development kits referenced here can be found online at Hotenda’s website.

Better data flow through clever partitioning

Like fingerprint detection, gesture recognition looks for a preprogrammed, recognizable pattern to kick off a sequence of events. Also as in fingerprint detection, contrast differences mark the boundaries used for edge detection (Figures 1A and 1B).

Figure 1: Similar to fingerprint identification, gesture recognition must first identify areas of interest, then use contrast to bring out patterns that software can map and recognize. Bit-plane separation of an image can produce sharp contrasts and enhance edges.

Like facial recognition, gesture recognition maps key points of interest that will be tracked to make a determination. With facial recognition and fingerprint identification, static images can be used to match a set of rather exacting patterns to verify identity. With gesture recognition, the preliminary match causes a sequential series of further examinations of processed frames in time-slices (Figures 2A and 2B). Slope and direction of motion throughout a series of images are used to decide if a valid gesture is occurring, and what gesture it is.

Figure 2: Once a key area of interest is identified and cropped, finding key measurement points and tracking their distances and angles as they change can, for instance, detect fingers closing by measuring the change and rate of change of distance from palm to finger edges. Notice the two different triangles.
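The palm-to-finger measurement described for Figure 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the key points are taken to be already extracted as (x, y) pixel coordinates, one per frame, and the function names and the 50 pixels-per-second threshold are hypothetical choices for the example.

```python
import math


def distance(a, b):
    """Euclidean distance between two (x, y) points, in pixels."""
    return math.hypot(b[0] - a[0], b[1] - a[1])


def angle_deg(a, b):
    """Direction from point a to point b, in degrees."""
    return math.degrees(math.atan2(b[1] - a[1], b[0] - a[0]))


def distance_rate(palm_track, tip_track, frame_dt):
    """Average rate of change of palm-to-fingertip distance (pixels/second).

    palm_track and tip_track hold one (x, y) point per frame;
    frame_dt is the time between frames in seconds.
    """
    if len(palm_track) < 2 or len(palm_track) != len(tip_track):
        raise ValueError("need two or more frames, same length for both tracks")
    dists = [distance(p, t) for p, t in zip(palm_track, tip_track)]
    elapsed = frame_dt * (len(dists) - 1)
    return (dists[-1] - dists[0]) / elapsed


def is_closing(palm_track, tip_track, frame_dt, min_rate=50.0):
    """True when the fingertip approaches the palm faster than min_rate."""
    return distance_rate(palm_track, tip_track, frame_dt) <= -min_rate
```

A real engine would track several fingertips at once and combine the distance and angle trends before declaring a gesture. The single-finger case is enough to show the rate-of-change idea.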

This has ramifications for system designers. It is not only single-image pages that need fast access and processing; many pages of images must be pre-processed, aligned, and then prepped to feed the actual gesture-recognition algorithms. This means that thought and cleverness should be applied to the memory architecture and partitioning schemes used.

For example, pipeline process stages can apply on-the-fly cropping so that only areas of interest are examined. When the size of the cropped area aligns with binary-addressed memory boundaries, memory can be used efficiently without the need for memory-alignment packing and unpacking routines. What’s more, clean memory boundaries mean that the least-significant bits of every page pointer are the same. This saves time calculating indexes and pointers for each successive page.
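The benefit of power-of-two crop sizes can be sketched with a little address arithmetic. This is an illustration only: it assumes pages are packed back to back in one buffer starting at offset zero, and the function names are hypothetical.

```python
def is_power_of_two(n):
    """True when n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


def page_base(page_index, width, height, bytes_per_pixel=1):
    """Byte offset of a cropped page when pages are packed back to back."""
    return page_index * width * height * bytes_per_pixel


def pixel_offset(x, y, width_log2):
    """Offset of pixel (x, y) inside a page whose width is 2**width_log2.

    A shift and an OR replace the multiply and add of y * width + x.
    """
    return (y << width_log2) | x
```

With a 64 x 64 crop at one byte per pixel, every page starts on a 4096-byte boundary, so the low 12 bits of each page base are zero. Stepping to the next page changes only the upper address bits.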

Another architectural choice is whether or not to use external pass-through logic as part of your design. This can take the form of a companion FPGA, CPLD or ASIC. While many algorithms can be tested in software, the ability to migrate that functionality into hardware can drastically improve performance while lowering the cost of the processor.

For example, ambient lighting can affect which bit planes make the best edge detectors. Without ambient light detection, a software-based approach can take a sampled page image and, bit by bit, generate corresponding image pages that record the absence or presence of a single bit in each pixel’s color sample.

A much faster approach is to use masking logic to stream the page into separate memory pages, each corresponding to a single bit of each sample. In the time it takes to transfer a page of memory, it is automatically processed into separate bit planes. Once the ideal plane is selected, only that bit is needed from then on, simplifying life even more.
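For reference, the software version of bit-plane separation can be sketched as below. This is a minimal illustration that assumes an 8-bit grayscale page stored as a list of rows, and the function names are hypothetical. Masking logic in an FPGA or CPLD would produce the same planes while the page streams through, which is the speedup described above.

```python
def split_bit_planes(image, bits=8):
    """Split a page into `bits` binary planes.

    image is a list of rows of integer samples. Plane k holds bit k of
    every sample, where k = 0 is the least-significant bit.
    """
    return [[[(p >> k) & 1 for p in row] for row in image]
            for k in range(bits)]


def extract_plane(image, k):
    """Pull out only plane k, for use once the preferred plane is known."""
    return [[(p >> k) & 1 for p in row] for row in image]
```

Once one plane has been chosen for the current lighting, `extract_plane` does one eighth of the work of the full split on each new frame.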

Just another peripheral

From a peripheral perspective, video-gesturing technology is just an input device like any other. Depending on where the camera is on the wearable computer, the gesture engine may be just a processing block cabled to a variety of devices and sensors on our person. If cost is a factor, it can simply capture the data and format it in a real-time data stream to a host. If cost isn’t paramount, it can monitor and decode the gestures and send intent-command data to the central CPU.

For example, a cell phone/tablet combo that is worn may already have a high-definition camera integrated within it. A separate camera just for gesture recognition may be redundant and can be eliminated if the camera’s video stream is routed independently through an HDMI port to the gesture-recognition block. This allows the gesture engine to share the camera data without stealing CPU time.

Here, parts like the low-power Analog Devices ADV7611BSWZ-P HDMI receiver can be used as a gestures-only peripheral that accepts raw video and decodes it for further real-time processing (Figure 3). With an internal HDMI processor clocked at TMDS rates up to 165 MHz, a color-space conversion block, and support for all 3D TV standards defined in HDMI 1.4a, it decodes resolutions up to UXGA (1600 x 1200) at 60 Hz and passes the result as three separate data buses (24 bits total). It also contains a repeater mode for pure pass-through, which can be very useful. This allows a single camera to feed the gesture recognition system independently of the video display system.

Figure 3: Inline between a video source and wearable CPU, a dedicated HDMI controller and processor can perform intermediary steps like color plane and bus separation. It can also function as a repeater and adds audio functionality as a plus.
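To get a feel for the data volume at the UXGA figures quoted above, the active pixel rate and frame size can be worked out directly. This is a back-of-the-envelope sketch with hypothetical function names. It counts only visible pixels, so the pixel clock on the wire is higher once blanking intervals are included.

```python
def active_pixel_rate(width, height, fps):
    """Visible pixels delivered per second (blanking not included)."""
    return width * height * fps


def frame_bytes(width, height, bits_per_pixel=24):
    """Memory needed to hold one frame, in bytes."""
    return width * height * bits_per_pixel // 8


def active_byte_rate(width, height, fps, bits_per_pixel=24):
    """Bytes per second that downstream memory must absorb."""
    return frame_bytes(width, height, bits_per_pixel) * fps
```

For 1600 x 1200 at 60 Hz and 24 bits per pixel, this gives 115.2 Mpixels/s, 5.76 Mbytes per frame, and 345.6 Mbytes/s of sustained write traffic. That load is one reason cropping and single-bit-plane processing are attractive.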

The Analog Devices part also has a full 24-bit pixel bus. As a front-end decoder bolted onto a dedicated low-power processor, this solution can capture and format data in memory pages and run algorithms for real-time gesture identification without constantly stealing CPU cycles from the wearable. It also captures and formats audio.

Wearable hub processing

A viable approach is to have the CPU in the wearable hub do all the processing. This allows lower-cost wearable peripherals, since they do less. However, it also requires a higher-end central processor like the Texas Instruments TMS320DM8168CCYG2 digital-media processor. This part contains an independent DSP core running up to 1.2 GHz that is powerful and fast enough to perform gesture recognition in 3-D while coordinating video inputs and outputs with overlays and audio.

As part of the DaVinci series, it pairs its ARM Cortex-A8 processor with a fully functional floating-point DSP that is code compatible with the C67x and C64x libraries and resources. Note that this part simultaneously supports up to three programmable high-definition video images and features hardware compression and decompression, as well as codecs for MPEG and H.264.

It also has two HD 165 MHz video-capture channels at full 24-bit as well as two HD 24-bit video-display channels. Note the independent memory buses, which let the processors run fully independently and concurrently, and the dual 32-bit DDR2 and DDR3 interfaces, which use a dynamic-memory manager to support up to 2 Gbytes of directly accessible RAM.

Multicore MCUs can also be used as a central-hub choice. For example, the Freescale i.MX family features one, two, and four 32-bit ARM and Power PC cores running up to 1.4 GHz.

Freescale’s single-core MCIMX535DVV1C delivers 32-bit ARM Cortex-A8 levels of performance running at 1 GHz, with high-speed DDR2 and DDR3 RAM controllers. Its multimedia co-processors have DSP functionality and on-chip graphics acceleration, allowing a processor of this class to handle video pre- and post-processing along with intent decoding as a stand-alone peripheral.

If the peripheral functionality is to migrate to the hub, you might consider the quad-core Freescale MCIMX6Q5EYM10AC, dedicating one of the cores to gesture processing while the other three handle everything else. This allows a single design to be used as a standalone peripheral or as part of an all-encompassing wearable computer. A Product Training Module on the i.MX 6 series is online at Hotenda’s website.

Also interesting is TI’s MSP430F5229, which is available with 1.8 V I/O and enables applications with advanced sensor-fusion capabilities such as gesture recognition, motion tracking, environment sensing, and contextual awareness. The MSP430F522x series are microcontrollers with four 16-bit timers, a 10-bit ADC, two universal-serial-communication interfaces (USCIs), a hardware multiplier, DMA, a comparator, and an RTC module with alarm capabilities. The part comes in wafer-level chip-scale packages as small as 2.0 x 2.2 x 0.3 mm and includes 53 GPIOs.

What to look for

When choosing a processor to be the host controller for a 3-D gesture-recognition engine, the features we want are a good, deep memory architecture, including a flexible MMU, DMA, and cache. In addition, our selected micro should feature well-thought-out high-speed DRAM controllers for external DDR types of dense, cost-effective, and fast memory (single cycle wherever possible).

Vision-specific, data-centric processing requires a processor architecture that can perform single-instruction, multiple-data (SIMD) operations, fast floating-point multiplication and addition, and fast-search algorithms. The type of video source can also affect our decision. Simple raw-video feeds from composite video cameras offer the lowest cost but also the lowest performance. Then again, megapixel images are not needed solely for gesture recognition. If the same camera is going to be used as a picture and video camera, then a higher-resolution digital CCD interface is a feature our micro should have.

Few if any processors provide composite- or component-video inputs and outputs native to the processor itself. If composite video is your source, the video decoders and camera control may have to be external chips.

Properly selected processors with the necessary peripheral features and processing power can absorb discrete functions like overlays and video blending and reformatting. Our wearable computers may be using video goggles or pico projectors and these output devices can blend with our video streams but have different processing requirements.

Most likely, composite or component video systems will use high-performance general-purpose microcontrollers along with external video decoders, encoders and filters, and video switches. Modern MCUs now tend to have VGA, HDMI, DVI, and digital audio/video interfaces which support the higher-definition cameras and displays. One can also find miniature smart sensors that combine a 3-axis accelerometer with an embedded microcontroller in a compact package for advanced custom motion-recognition applications.

Unfortunately, composite video just doesn’t have the high definition that people want. While some may think high definition is overkill, going forward, the market without doubt will demand higher-end imaging. We already see phones with 40 Mpixel cameras. Digital interfaces will yield higher-resolution performance, and compact, high-speed wired interfaces like DVI and HDMI will pass clean signals between processors and peripherals on our wearable computers. While wireless is possible, so many RF signals in shared spaces could clog bandwidth, so wired video may be the only clean and safe video solution.

Support tools

Future personal, wearable computers will use advanced user interfaces, and 3-D video-gesture recognition will be one of the techniques employed. It can be used with a pico projector, a virtual display, or a holographic display to control machines at a higher level.

A nice little tool for test and development comes from STMicroelectronics in the form of its STEVAL-CCH002V2 HDMI eval board. It combines input and output connectors for S-Video, VGA, and DVI interfaces. It also breaks the video down into the component signals Y, Pb, and Pr. These planes can be used to help identify edges in further processing stages. The eval board also demonstrates several components for switching, multiplexing, equalizing, and protecting signal lines. A Reference Design¹ is available for the eval board as are Gerber PCB files.²

While still in the early stages of development and use, gesture-based technology is forging ahead, and discrete functional parts working in conjunction with high-end processors can perform these interface tasks. As this article has demonstrated, the parts, algorithms, and development systems are here today to help engineers develop this next generation of input technology.

For more information on the parts discussed here, use the links provided to access product pages on the Hotenda website.

  1. STEVAL-CCH002V2 reference design