This article presents a method for measuring Ethernet throughput, providing a good estimate of performance, and illustrating the different factors that affect performance.
Ethernet is the most widely installed Local Area Network (LAN) technology in the world. It has been in use since the early 1980s and is covered by the IEEE Std 802.3, which specifies a number of speed grades. In embedded systems, the most commonly used format runs at both 10 Mbps and 100 Mbps (and is often referred to as 10/100 Ethernet).
There are more than 20 NXP ARM MCUs with built-in Ethernet, covering all three generations of ARM (ARM7, ARM9, and the Cortex-M3). NXP uses essentially the same implementation across three generations, so designers can save time and resources by reusing their Ethernet function when systems move to the next generation of ARM.
This article discusses three different scenarios for measuring Ethernet throughput on the LPC1700 product and details what is really achievable in an optimized system.
Superior implementation
NXP's Ethernet block (see Figure 1) contains a full-featured 10/100 Ethernet MAC (media access controller) which uses DMA hardware acceleration to increase performance. The MAC is fully compliant with IEEE Std 802.3 and interfaces with an off-chip Ethernet PHY (physical layer) using the Media Independent Interface (MII) or Reduced MII (RMII) protocol along with the on-chip MII Management (MIIM) serial bus.
The NXP Ethernet block is distinguished by the following:
- Full Ethernet functionality — The block supports full Ethernet operation, as specified in the 802.3 standard.
- Enhanced architecture — NXP has enhanced the architecture with several additional features including receive filtering, automatic collision back-off and frame retransmission, and power management via clock switching.
- DMA hardware acceleration — The block has two DMA managers, one each for transmit and receive. Automatic frame transmission and reception with Scatter-Gather DMA offloads the CPU even further.
Figure 1: LPC24xx Ethernet block diagram. NXP's Cortex-M3 architecture.
Ethernet throughput on NXP's LPC1700 microcontrollers
In an Ethernet network, two or more stations send and receive data through a shared channel (a medium), using the Ethernet protocol. Ethernet performance can mean different things for each of the network's elements (channel or stations). Bandwidth, throughput, and latency are measures which contribute to overall performance. In the case of the channel, while the bandwidth is a measure of the capacity of the link, the throughput is the rate at which usable data can be sent over the channel. In the case of the stations, Ethernet performance can mean the ability of that equipment to operate at the full bit and frame rate of the Ethernet channel. On the other hand, latency measures the delay in time caused by several factors (such as propagation times, processing times, faults, and retries).
The focus of this article will be on the ability of the NXP LPC1700 to operate at the full bit and frame rate of the Ethernet channel to which it is connected, via the Ethernet interface (provided by the internal EMAC module plus the external PHY chip). In this way, throughput will be defined as a measure of usable data (payload) per second, which the MCU is able to send/receive to/from the communication channel. The same concepts can also be applied to other NXP LPC microcontrollers supporting Ethernet.
Unfortunately, these kinds of tests generally require specific equipment such as network analyzers and/or network traffic generators, in order to get precise measurements. Nevertheless, using simple test setups it is possible to get estimated numbers. In fact, our goal is to understand the different factors that can affect Ethernet throughput, so users can focus on different techniques in order to improve Ethernet performance.
Only the transmitter's throughput is considered here. The receiver's case is more complex, because its throughput depends on the throughput of the transmitter that put the data onto the channel. Once the transmitter's throughput is known, it can be taken as the maximum that the receiver could achieve under ideal conditions, and the receiver's throughput can be expressed relative to that number.
Reference information
Figure 2: Ethernet II frame.
Consider a bit rate of 100 Mbps, where every frame consists of the payload (useful data, minimum 46 bytes and maximum 1,500 bytes), the Ethernet header (14 bytes), the CRC (4 bytes), the preamble (8 bytes), and the inter-packet gap (12 bytes). The maximum possible frame rates and throughput are then:
For minimum-sized frames (46 bytes of data): 148,809 frames/sec -> 6.84 MB/sec
For maximum-sized frames (1,500 bytes of data): 8,127 frames/sec -> 12.19 MB/sec
The above rates are theoretical maximums that cannot be reached in reality; any practical implementation will achieve lower values (see Figure 2).
- Frames/second is calculated by dividing the Ethernet link speed (100 Mbps) by the total frame size in bits (84 * 8 = 672 for minimum-sized frames, and 1,538 * 8 = 12,304 for maximum-sized frames).
- Megabytes/second is calculated by multiplying the frames/second by the number of bytes of useful data in each frame (46 bytes for minimum-sized frames, and 1,500 bytes for maximum-sized frames).
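The arithmetic above can be reproduced with a short script. This is only a sketch: the function name `max_rates` and the constant names are chosen here for illustration, and frame rates are rounded down to whole frames, as in the figures quoted above.

```python
LINK_SPEED_BPS = 100_000_000  # 100 Mbps link

# Per-frame overhead on the wire, in bytes (values taken from the text).
HEADER = 14
CRC = 4
PREAMBLE = 8
INTER_PACKET_GAP = 12
OVERHEAD = HEADER + CRC + PREAMBLE + INTER_PACKET_GAP  # 38 bytes


def max_rates(payload_bytes, link_speed_bps=LINK_SPEED_BPS):
    """Return (frames per second, payload bytes per second) for one payload size."""
    bits_on_wire = (payload_bytes + OVERHEAD) * 8
    frames_per_sec = link_speed_bps // bits_on_wire  # whole frames only
    return frames_per_sec, frames_per_sec * payload_bytes


for payload in (46, 1500):
    fps, bytes_per_sec = max_rates(payload)
    print(f"{payload:>5}-byte payload: {fps:,} frames/sec, {bytes_per_sec:,} bytes/sec")
```

Passing a different `link_speed_bps` (for example 10,000,000 for a 10 Mbps link) gives the corresponding ceiling for that speed grade.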
The tests were run under the following conditions (see Figure 3):
MCU: LPC1768 running at 100 MHz
Board: Keil MCB1700
PHY chip: National DP83848 (RMII interface)
Tool chain: Keil μVision4 v4.1
Code running from RAM
TxDescriptorNumber = 3
Ethernet mode: Full duplex – 100 Mbps
Test description
To obtain the maximum throughput, the test sends 50 frames of 1,514 bytes each (including the 14-byte Ethernet header), which together carry 75,000 bytes (75 KB) of payload (useful data). The CRC (4 bytes) is automatically added by the EMAC controller (Ethernet controller).
Figure 3: The test setup.
In order to measure the time this process takes, a GPIO pin is set (P0.0 in our case) just before starting to send the frames and is cleared as soon as the process finishes. In this way, an oscilloscope can be used to measure the time as the width of the generated pulse at the P0.0 pin. The board is connected to a PC using an Ethernet crossover cable.
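Once the pulse width has been read from the oscilloscope, it can be converted into a throughput figure and compared with the wire-limited maximum. The sketch below assumes the burst described in this article (50 frames of 1,500 payload bytes); the function names are illustrative, and the 8 ms pulse width in the example is a made-up reading, not a measured result.

```python
LINK_SPEED_BPS = 100_000_000
WIRE_OVERHEAD_BYTES = 14 + 4 + 8 + 12  # header, CRC, preamble, inter-packet gap


def min_burst_time(num_frames, payload_bytes):
    """Shortest time (seconds) the link itself needs to carry the burst."""
    return num_frames * (payload_bytes + WIRE_OVERHEAD_BYTES) * 8 / LINK_SPEED_BPS


def measured_throughput(num_frames, payload_bytes, pulse_width_s):
    """Payload bytes per second, given the pulse width seen on the GPIO pin."""
    return num_frames * payload_bytes / pulse_width_s


def percent_of_max(num_frames, payload_bytes, pulse_width_s):
    """Measured throughput as a percentage of the wire-limited maximum."""
    return 100.0 * min_burst_time(num_frames, payload_bytes) / pulse_width_s


# Hypothetical oscilloscope reading (NOT a result from the article).
pulse = 0.008  # seconds
print(f"link-limited minimum: {min_burst_time(50, 1500) * 1e3:.3f} ms")
print(f"throughput: {measured_throughput(50, 1500, pulse):,.0f} bytes/sec")
print(f"relative to maximum: {percent_of_max(50, 1500, pulse):.1f} %")
```

A pulse that approaches the link-limited minimum indicates that the MCU is keeping the channel busy; a much wider pulse points to time spent outside the transmission itself.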
The PC runs a sniffer program (Wireshark in this case, http://www.wireshark.org/) to verify that all 50 frames were sent and that the data is correct. A specific pattern in the payload is used so any errors can be easily recognized. If the 50 frames arrive at the PC with no errors, the test is considered valid (see Figure 4).
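The article does not say which payload pattern the test software uses. As an illustration only, the sketch below assumes an incrementing-byte pattern offset by the frame index, and shows how captured payloads could be checked against it; `make_payload` and `find_errors` are hypothetical helpers, not part of the test project.

```python
def make_payload(frame_index, length=1500):
    """Build a payload whose bytes count up from the frame index (mod 256)."""
    return bytes((frame_index + i) % 256 for i in range(length))


def find_errors(frame_index, received):
    """Return the offsets at which a received payload differs from the pattern."""
    expected = make_payload(frame_index, len(received))
    return [i for i, (e, r) in enumerate(zip(expected, received)) if e != r]


# Simulate a capture of 50 frames with one corrupted byte in frame 7.
captured = [bytearray(make_payload(n)) for n in range(50)]
captured[7][100] ^= 0xFF

bad_frames = {n: errs for n, frame in enumerate(captured)
              if (errs := find_errors(n, bytes(frame)))}
print(bad_frames)
```

A pattern that varies with both the byte offset and the frame index makes dropped, duplicated, or reordered frames visible, not just corrupted bytes.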
Figure 4: Verifying the payload.
The EMAC uses a series of descriptors which provide pointers to memory positions where the data buffers, control, and status information reside. In the case of transmission, the frame data should be placed by the application into these data buffers. The EMAC uses DMA to get the user's data and fill the frame's payload before transmission. Therefore, the method the application uses in order to copy the application data into those data buffers will affect the overall measurement of the throughput. For this reason, three different scenarios are presented:
- An "ideal" scenario, which doesn't consider the application at all,
- A "typical" scenario, where the application copies the application's data into the EMAC's data buffers, using the processor,
- An "optimized" scenario, where the application copies the application's data into the EMAC's data buffers, via DMA.
- "Ideal" scenario: In this case, the software sets up the descriptors' data buffers with the test's pattern, and only the TxProduceIndex is incremented 50 times (once for every packet to send) in order to trigger the frame transmission. In other words, the application is not considered at all. Even though this is not a typical user's case, it will provide the maximum possible throughput in transmission.
- "Typical" scenario: This case represents the typical case in which the application will copy the data into the descriptors' data buffers before sending the frame. Comparing the results of this case with the previous one, it is apparent that the application is affecting the overall performance. This case should not be considered as the actual EMAC throughput. However, it is presented here to illustrate how non-optimized applications may lower overall results giving the impression that the hardware is too slow.
- "Optimized" scenario: This test uses DMA in order to copy the application's data into the descriptors' data buffers. This case considers a real application but using optimized methods which effectively take advantage of the fast LPC1700 hardware.
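The role of the descriptors and of TxProduceIndex in these scenarios can be illustrated with a toy model that runs on a PC. This is a simplified sketch, not the EMAC driver: the class and method names are invented, the consume index is an assumed hardware-side counterpart to TxProduceIndex, and the "hardware" only advances when the software has to wait, whereas the real EMAC runs concurrently with the CPU.

```python
class TxRing:
    """Toy model of a transmit descriptor ring with produce/consume indexes."""

    def __init__(self, num_descriptors=3):
        self.num = num_descriptors
        self.buffers = [b""] * num_descriptors
        self.produce_index = 0  # next slot the software fills
        self.consume_index = 0  # next slot the (modelled) hardware transmits
        self.sent = []

    def is_full(self):
        # One slot is always left unused so that "full" differs from "empty".
        return (self.produce_index + 1) % self.num == self.consume_index

    def is_empty(self):
        return self.produce_index == self.consume_index

    def queue_frame(self, data):
        """Software side: copy data into a buffer, then advance the produce index."""
        if self.is_full():
            return False
        self.buffers[self.produce_index] = bytes(data)  # the copy step
        self.produce_index = (self.produce_index + 1) % self.num
        return True

    def hardware_step(self):
        """Hardware side: transmit one queued frame, if there is one."""
        if self.is_empty():
            return False
        self.sent.append(self.buffers[self.consume_index])
        self.consume_index = (self.consume_index + 1) % self.num
        return True


def send_all(ring, frames):
    """Queue every frame, letting the hardware run whenever the ring is full."""
    stalls = 0
    for frame in frames:
        while not ring.queue_frame(frame):
            stalls += 1
            ring.hardware_step()
    while ring.hardware_step():  # drain what is still queued
        pass
    return stalls


frames = [bytes([n]) * 1500 for n in range(50)]
ring = TxRing(num_descriptors=3)
stalls = send_all(ring, frames)
print(f"sent {len(ring.sent)} frames, software waited {stalls} times")
```

In this model the three scenarios differ only in the copy step inside `queue_frame`: the "ideal" case skips it because the buffers are pre-filled, the "typical" case performs it with the processor, and the "optimized" case hands it to DMA.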
Test software in the form of a Keil MDK project is provided for this article (please check NXP's website for AN11053). The desired scenarios can be selected by using the Configuration Wizard and opening the "config.h" file (see Figure 5). Besides the scenario, both the number of packets to send and the frame size can also be modified through this file.
Test results
After running the tests, the results were tabulated as shown in Table 1:
Table 1: Test results. Columns: Total Data (bytes); % relative to Max. Possible.
Figure 5: Choosing the test scenarios.
Despite the fact that Scenario 1 is not a practical case, it provides the maximum value possible for our hardware as a reference, which is very close to the maximum possible for Ethernet at 100 Mbps. In Scenario 2, the application's effects on the overall performance become apparent. Finally, Scenario 3 shows how an optimized application greatly improves the overall throughput.
Other ways to optimize the application and get better results were found by running the code from flash (instead of from RAM), and in some cases by increasing the number of descriptors.
In summary, Ethernet throughput is mainly affected by how the application transfers data from the application buffer to the descriptors' data buffers. Improving this process will enhance overall Ethernet performance. The LPC1700 and other LPC parts have this optimization built into the system hardware with DMA support, enhanced EMAC hardware, and smart memory bus architecture.