
Designing Multithreaded and Multicore Systems



If your MCU application needs to handle digital audio, consider taking a multithreaded approach. A multithreaded design enables the designer to reuse parts of the design in a straightforward manner.

Multicore and multithreading are efficient methods for designing real-time systems. Using these techniques, a system is designed as a collection of tasks that operate independently and communicate with each other when required. Breaking a design down from large monolithic blocks of code into much more manageable tasks simplifies development and speeds time to market. As a result, the real-time properties of the system as a whole are more easily understood, and the designer only has to worry about the fidelity of each task's implementation, asking questions such as, "Is the network protocol implemented correctly?"

In this article, we discuss how to use multithreaded and multicore design methods to design real-time systems that operate on streams of data, such as digital audio systems. We use several digital audio systems to illustrate the design methods, including asynchronous USB Audio 2, AVB over Ethernet, and digital docks for MP3 players. We briefly discuss the notions of digital audio, multicore, and multithreading before showing how to effectively use multicore and multithreading to design the buffering and clocking schemes required.

Digital audio

Digital audio has taken over from analog audio in many consumer markets for two reasons. First, most audio sources are digital. Whether delivered in lossy compressed form (MP3) or in uncompressed form (CD), digital standards have taken over from traditional analog standards such as cassette tape. Second, digital audio is easier to deal with than analog audio. Data can be transferred without loss over existing standards, such as IP or USB, and the hardware design does not need any "magic" to keep the noise floor down. As far as the digital path is concerned, the noise floor is constant and immune to the TDMA noise that mobile phones may cause.

A digital audio system operates on streams of samples. Each sample represents the amplitude of one or more audio channels at a point in time, with the time between samples governed by the sample rate. CD audio has two channels (left and right) and uses a sample rate of 44.1 kHz. Common audio standards use 2, 6 (5.1), or 8 (7.1) channels, and sample rates of 44.1 kHz, 48 kHz, or a multiple thereof. We use 48 kHz as a running example, but this is by no means the only standard.

Multicore and multithreading

In a multithreaded design approach, a system is expressed as a collection of concurrent tasks. Using concurrent tasks, rather than a single monolithic program, has several advantages:

  • Multiple tasks are a good way to support separation of concerns, which is one of the most important aspects of software engineering. Separation of concerns means that different parts of the design can be individually designed, implemented, tested, and verified. Once the interaction between the tasks has been specified, teams or individuals can each get on with their own tasks.
  • Concurrent tasks provide an easy framework to specify what a system should be doing. For example, a digital audio system will play audio samples that are received over a network interface. In other words, the system should concurrently perform two tasks: receive data from the network interface and play samples on its audio interface. Expressing these two tasks as a single sequential task is confusing.

A system that is expressed as a collection of concurrent tasks can be implemented by a collection of threads on one or more multithreaded cores (see Figure 1). We assume that threads are scheduled at the instruction level, as is the case on an XMOS XCore processor, because that enables concurrent tasks to operate in real time. Note that this is different from multithreading on Linux, for example, where threads are scheduled on a uniprocessor with context switching. This may make those threads appear concurrent to a human being, but not to a collection of real-time devices.

Concurrent tasks logically communicate by message passing, and when two tasks are implemented by two threads, they communicate by sending data and control over channels. Inside a core, channel communication is performed by the core itself; when threads are located on separate cores, channel communication is performed through switches (see Figure 2).
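
To illustrate the programming model (though not the instruction-level scheduling described above), here is a minimal sketch of a single-slot rendezvous channel in C using POSIX threads. The chan_send/chan_recv names and the pthread-based implementation are our own illustration, not the XCore channel API, where channel ends are hardware resources.

```c
#include <pthread.h>
#include <stdio.h>

/* A minimal single-slot rendezvous channel: chan_send blocks until the
 * receiver has taken the value. Illustrative only -- on an XCore, channel
 * communication is handled by the core or the switch in hardware.
 * Compile with -pthread. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             value;
    int             full;  /* 1 while a value is waiting to be received */
} chan_t;

void chan_init(chan_t *c) {
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->cond, NULL);
    c->full = 0;
}

void chan_send(chan_t *c, int v) {
    pthread_mutex_lock(&c->lock);
    while (c->full)                      /* wait for the slot to be free */
        pthread_cond_wait(&c->cond, &c->lock);
    c->value = v;
    c->full = 1;
    pthread_cond_broadcast(&c->cond);
    while (c->full)                      /* rendezvous: wait for the receiver */
        pthread_cond_wait(&c->cond, &c->lock);
    pthread_mutex_unlock(&c->lock);
}

int chan_recv(chan_t *c) {
    pthread_mutex_lock(&c->lock);
    while (!c->full)                     /* wait for a value to arrive */
        pthread_cond_wait(&c->cond, &c->lock);
    int v = c->value;
    c->full = 0;
    pthread_cond_broadcast(&c->cond);    /* release the blocked sender */
    pthread_mutex_unlock(&c->lock);
    return v;
}

/* Producer thread standing in for, say, a network interface task. */
void *producer(void *arg) {
    chan_t *c = arg;
    for (int i = 0; i < 8; i++)
        chan_send(c, i * 100);           /* pretend these are samples */
    chan_send(c, -1);                    /* end-of-stream marker */
    return NULL;
}

int main(void) {
    chan_t c;
    pthread_t t;
    chan_init(&c);
    pthread_create(&t, NULL, producer, &c);
    for (int v; (v = chan_recv(&c)) != -1; )
        printf("received sample %d\n", v);
    pthread_join(t, NULL);
    return 0;
}
```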

Multithreaded design has been used by embedded system designers for decades. To implement an embedded system, a system designer used to employ a multitude of microcontrollers. For example, inside a music player one may find three microcontrollers controlling the flash, the DAC, and an MP3 decoder chip.

Figure 1: Threads, channels, cores, switches, and links. Concurrent threads communicate over channels either inside a core, between cores on a chip, or between cores on different chips.

We argue that modern day multithreaded environments offer a replacement for this design strategy. A single, multithreaded chip can replace a number of MCUs and provide an integrated communication model between tasks. Instead of having to implement bespoke communication between tasks on separate MCUs, the system is implemented as a set of threads which communicate over channels.

Using a multithreaded design approach enables the designer to reuse parts of their design in a straightforward manner. In traditional software engineering, functions and modules are combined to perform complex tasks. However, this method does not necessarily work in a real-time environment, because executing two functions in sequence may break the real-time requirements of either one.

In an ideal multithreaded environment, the composition of real-time tasks is trivial, as it is just a case of adding a thread (or a core) for every new real-time task. In reality, the designer will have constraints on the number of cores (for example, for financial reasons) and hence will have to make a decision about which tasks to compose as concurrent threads, and which tasks to integrate in a single thread as a collection of functions.
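
When tasks must share a thread, a common pattern is to split each task into a short, non-blocking step and call the steps in turn. The sketch below illustrates this under the assumption that the combined worst-case execution time fits within one sample period; both step functions are hypothetical stand-ins.

```c
/* Folding two logical tasks into a single thread: each task is written as
 * a short, non-blocking step function, and the thread calls the steps in
 * turn. This composition only preserves real-time behavior if the
 * worst-case execution time of step_net() plus step_dac() fits within one
 * sample period. */
static void step_net(void) { /* poll the network interface for a sample */ }
static void step_dac(void) { /* deliver the next sample to the DAC */ }

void combined_task(void) {
    for (;;) {
        step_net();
        step_dac();
    }
}
```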

Multithreaded digital audio

A digital audio system is easily split into multiple threads, including a network protocol stack thread, a clock recovery thread, an audio delivery thread and, optionally, threads for DSP, device upgrade, and driver authentication. The network protocol stack may be as complex as an Ethernet/IP stack comprising multiple concurrent tasks, or as simple as an S/PDIF receiver.

Figure 2: Physical incarnation of a three-core system with 24 concurrent threads. The top device has two cores and the bottom device has a single core.

We assume that the threads in the system communicate by sending data samples over channels. Whether the threads execute on a single core or on a multicore system is not important in this design method, since multicore just adds scalability to the design. We assume that the computational requirements for each thread can be established statically and are not data dependent, which is normally the case for uncompressed audio.

We will focus our attention on two parts of the design: buffering between threads (and their impact on performance) and clock recovery. Once these design decisions have been made, implementing the inside of each thread follows normal software engineering principles, and is as hard or easy as one would expect. Buffering and clock recovery are interesting because they both have a qualitative impact on the user experience (facilitating stable low latency audio) and they are easily understood in a multithreaded programming environment.

Buffering

Within a digital solution, data samples are not necessarily transported at the time that they are to be delivered. This requires digital audio to be buffered. As an example, consider a USB 2.0 speaker with a 48 kHz sample rate. The USB layer will transport a burst of six samples in every 125 µs window. There is no guarantee where in a 125 µs window the six samples will be delivered, hence a buffer of at least 12 samples is required in order to guarantee that samples can be streamed out in real-time to the speaker.
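
As a sketch of that worst case, the ring buffer below holds two bursts (12 samples): the USB side writes six samples somewhere in each 125 µs window, while the audio side drains one sample per 48 kHz period. The types and names are our own illustration, and synchronization of the indices between the two threads (for example, via a channel) is omitted for clarity.

```c
#include <assert.h>
#include <stdint.h>

/* Worst case from the analysis above: one 6-sample burst may arrive at the
 * very start of a 125 us window and the next at the very end of the
 * following window, so 12 samples of storage guarantee the 48 kHz output
 * never underruns. */
#define BURST   6
#define BUF_LEN (2 * BURST)   /* 12 samples */

typedef struct {
    int32_t  buf[BUF_LEN];
    unsigned head;            /* next sample to play */
    unsigned tail;            /* next free slot */
    unsigned count;           /* samples currently buffered */
} sample_fifo_t;

/* Called from the USB thread once per 125 us window. */
void fifo_write_burst(sample_fifo_t *f, const int32_t burst[BURST]) {
    assert(f->count + BURST <= BUF_LEN);   /* would otherwise overflow */
    for (unsigned i = 0; i < BURST; i++) {
        f->buf[f->tail] = burst[i];
        f->tail = (f->tail + 1) % BUF_LEN;
    }
    f->count += BURST;
}

/* Called from the audio delivery thread once per 48 kHz sample period. */
int32_t fifo_read_sample(sample_fifo_t *f) {
    assert(f->count > 0);                  /* would otherwise underrun */
    int32_t s = f->buf[f->head];
    f->head = (f->head + 1) % BUF_LEN;
    f->count--;
    return s;
}
```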

The design challenge is to establish the right amount of buffering. In an analog system, buffering is not an issue; the signal is delivered on time. In a digital system designed on top of a non-real-time OS, programmers usually stick to a reasonably large buffer (250 or 1,000 samples) in order to cope with uncertainties in scheduling policies. However, large buffers are costly in terms of memory, in terms of adding latency, and in terms of proving that they are large enough to guarantee click-free delivery.

Multithreaded design provides a good framework to informally and formally reason about buffering and avoids unnecessarily large buffers. As an example, consider the above USB speaker augmented with an ambient noise correction system. This system will comprise the following threads:

  • A thread that receives USB samples over the network.
  • A series of 10 or more threads that filter the stream of samples, each with a different set of coefficients.
  • A thread that delivers a filtered output sample to the stereo codec using I2S.
  • A thread that reads samples from a codec connected to a microphone sampling ambient noise.
  • A thread that subsamples the ambient noise to an 8 kHz sample rate (a sketch of such a decimator follows this list).
  • A thread that establishes the spectral characteristics of the ambient noise.
  • A thread that changes the filter coefficients based on the computed spectral characteristics.
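
As a concrete example of the subsampling thread mentioned above, the sketch below decimates 48 kHz samples to 8 kHz by averaging groups of six. The averaging acts only as a crude low-pass filter, and the names and types are our own illustration; a production design would apply a proper FIR anti-aliasing filter before decimating.

```c
#include <stdint.h>

/* Naive 48 kHz -> 8 kHz decimator: accumulate six input samples and emit
 * their average as one output sample. */
#define DECIM 6

typedef struct {
    int64_t  acc;   /* running sum of the current group of six samples */
    unsigned n;     /* number of samples accumulated so far */
} decimator_t;

/* Feed one 48 kHz sample; returns 1 and writes *out once per six inputs. */
int decimate(decimator_t *d, int32_t in, int32_t *out) {
    d->acc += in;
    if (++d->n < DECIM)
        return 0;
    *out = (int32_t)(d->acc / DECIM);
    d->acc = 0;
    d->n = 0;
    return 1;
}
```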

All threads will operate on some multiple of the 48 kHz base period. For example, each of the filtering threads will filter exactly one sample every 48 kHz period; the delivery thread will deliver a sample every period. Each of the threads also has a defined window over which it operates, and a defined method by which this window is advanced. For example, if our filter thread is implemented using a biquad, it will operate on a window of three samples which is advanced by one sample every period. The spectral thread may operate on a 256-sample window (to perform an FFT (Fast Fourier Transform)) which is advanced by 64 samples every 64 sample periods.
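
To make the per-period work concrete, the sketch below shows a direct form I biquad that consumes exactly one sample per 48 kHz period. The struct and function names are our own; the coefficients would be supplied by the coefficient-update thread, and floating point is used for clarity where a fixed-point implementation would be more typical on an MCU.

```c
/* Direct form I biquad: operates on a window of the current and two
 * previous inputs plus the two previous outputs, advanced by one sample
 * every 48 kHz period. */
typedef struct {
    float b0, b1, b2, a1, a2;   /* filter coefficients */
    float x1, x2;               /* previous two inputs */
    float y1, y2;               /* previous two outputs */
} biquad_t;

float biquad_step(biquad_t *f, float x0) {
    float y0 = f->b0 * x0 + f->b1 * f->x1 + f->b2 * f->x2
             - f->a1 * f->y1 - f->a2 * f->y2;
    f->x2 = f->x1;  f->x1 = x0;   /* advance the window by one sample */
    f->y2 = f->y1;  f->y1 = y0;
    return y0;
}
```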

One can now identify all parts of the system that operate on the same period and connect them together into synchronous parts. No buffers are required inside those synchronous parts, although if threads are to operate in a pipeline, single buffers are required. Between the various synchronous parts, buffers are required. In our example, we end up with three parts:

  1. The part that receives samples from USB, filters, and delivers at 48 kHz.
  2. The part that samples ambient noise at 48 kHz and delivers at 8 kHz.
  3. The part that establishes the spectral characteristics and changes the filter settings at 125 Hz.

These three parts are shown in Figure 3. The first part, which receives samples from USB, needs to buffer 12 stereo samples.

Figure 3: The threads grouped together based on their frequency.

The part that delivers samples to the codec needs to buffer one stereo sample. Operating the 10 filter threads as a pipeline requires 11 buffers. That means that the total delay from receiver to codec comprises 24 sample times, or 500 µs; an extra sample can be added to cope with medium-term jitter in the clock recovery algorithm. This part runs at 48 kHz.

The second part, which samples ambient noise, needs to store one sample on the input side and six samples for subsampling. Hence, there is a seven-sample delay at 48 kHz, or about 145 µs.

The third part that establishes the spectral characteristics needs to store 256 samples, at an 8 kHz sample rate. No other buffers are required. Hence, the delay between ambient noise and filter correction is 256 samples at 8 kHz and 145 µs for the subsampling, or just over 32 ms. Note that these are minimum buffer sizes for the algorithm that we have chosen to use; if this latency is unacceptable, a different algorithm has to be chosen.
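
The buffer sizes and latencies above follow directly from the sample counts and rates in the text; the short program below simply restates that arithmetic as a check.

```c
#include <stdio.h>

/* Restating the buffer arithmetic from the text. */
int main(void) {
    /* Part 1: 12 (USB) + 11 (filter pipeline) + 1 (delivery) samples at 48 kHz */
    double part1_us = (12 + 11 + 1) * 1e6 / 48000.0;        /* 500 us */
    /* Part 2: 1 (input) + 6 (subsampling) samples at 48 kHz */
    double part2_us = (1 + 6) * 1e6 / 48000.0;              /* ~145.8 us */
    /* Part 3: 256 samples at 8 kHz plus the subsampling delay */
    double part3_ms = 256 * 1e3 / 8000.0 + part2_us / 1e3;  /* ~32.1 ms */
    printf("part 1: %.1f us\n", part1_us);
    printf("part 2: %.1f us\n", part2_us);
    printf("part 3: %.2f ms\n", part3_ms);
    return 0;
}
```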

There is often a temptation to design the threads to operate on blocks of data instead of single samples, but this will increase the overall latency experienced, increase the memory requirements, and increase complexity. This should only be considered if there is a clear benefit, such as an increased throughput.

Clocking digital audio

A big difference between digital and analog audio is that analog audio has no underlying sample rate, whereas digital audio requires a clock signal to be distributed to all parts of the system. Although components can all use different sample rates (for example, some parts of the system may use 48 kHz and other parts may use 96 kHz with a sample rate converter in between), all components should agree on the length of a second, and therefore agree on a basis for measuring frequencies.

An interesting property of digital audio is that all threads inside the system are agnostic to the base of this clock frequency, assuming that there is a gold-standard base rate. It does not matter if multiple cores in the system use different crystals, as long as each can keep up with the stream of samples. However, at the edges of the system, the true clock frequency is important, and the delay that a sample has incurred en route becomes important.

In a multithreaded environment, one will set a thread aside to explicitly manage the clock: measure the true clock frequency against the local clock, implement the clock recovery algorithm, and agree with a master clock on the clock offset.

The clock may be measured implicitly using the underlying bit rate of interconnects such as S/PDIF or ADAT. Measuring the number of bits per second on either of those interconnects gives a measure of the master clock. The clock may also be measured explicitly, using protocols designed for this purpose, such as PTP over Ethernet.

In the clock recovery thread, a control loop can be implemented that estimates the clock frequency and adjusts it based on the observed error. In its simplest form, the error is used directly to adjust the frequency, but filters can be added to reduce jitter. This thread implements in software what would traditionally have been performed by a PLL, and hence it can be adapted to the environment cheaply.
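
As an illustration of such a control loop, the sketch below implements a bare proportional controller: once per measurement interval it compares the number of samples actually received against the number expected from the current frequency estimate, and nudges the estimate by a fraction of the error. The gain and names are our assumptions; a real design would add filtering (for example, an integral term) to reduce jitter.

```c
/* Minimal proportional clock recovery: the software analogue of a PLL. */
typedef struct {
    double freq_hz;   /* current estimate of the master sample rate */
    double gain;      /* proportional gain, e.g. 0.05 */
} clk_recovery_t;

/* samples_seen: samples received from the master in the last interval.
 * interval_s:   length of that interval, measured by the local clock. */
void clk_update(clk_recovery_t *c, unsigned samples_seen, double interval_s) {
    double expected = c->freq_hz * interval_s;
    double error    = (double)samples_seen - expected;  /* in samples */
    c->freq_hz     += c->gain * error / interval_s;     /* nudge the estimate */
}
```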

Conclusion

A multithreaded development method enables digital audio systems to be developed using a divide and conquer approach, where a problem is split into a set of concurrent tasks that are each executed in a separate thread on a multithreaded core.

Like many real-time systems, digital audio lends itself to a multithreaded design method because digital audio systems naturally consist of a group of tasks that operate on streams of data and must execute concurrently.
