Ten64 Network Performance Benchmark

This is a brief sampling of the Ten64's performance under ideal conditions.

We are being very modest with our performance numbers in advertising as there are some caveats today:

The Linux network stack limits performance somewhat; we see XDP as the solution in the short term and hardware offload (AIOP) in the long term.

For example, 3 Gbit/s single-flow performance is easily achieved, but performance with additional flows may be inconsistent, as it depends on how flow hashing (for example, on the port numbers of each connection) distributes the flows across the cores (via DPIO portals).

For these tests, we booted the system with the "eth0 only" DPL and added the two 10G ports dynamically with the ls-addni command. This ensures they have optimal settings for queues, flow tables and inbound packet hashing.

iperf3 testing

These tests were run between a Linux client and a Windows 10 workstation.

The Linux client has an Intel X520-T2 (Dual SFP+) card, while the Windows machine has an X540-T2 (Dual 10GBase-T). A Mikrotik S+RJ10 10GBase-T SFP module is used to provide a 10GBase-T connection on the Ten64.

Bridge mode

Command: iperf3 -P (num threads) -R -c (server) -t 60
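To repeat the measurement for each thread count, the command above can be wrapped in a small loop. This is only a sketch: the server address is a hypothetical placeholder (a documentation-range IP), and the `sweep` helper and its "prefix" argument are our own illustration, not part of iperf3.

```shell
#!/bin/sh
# Hypothetical iperf3 server address; replace with your own.
SERVER="192.0.2.10"

# Run (or just print) the benchmark command for 1 to 5 parallel streams.
# $1 is an optional prefix: pass "echo" for a dry run, or nothing to
# actually execute iperf3.
sweep() {
    for streams in 1 2 3 4 5; do
        $1 iperf3 -P "$streams" -R -c "$SERVER" -t 60
    done
}

# Dry run: print the five commands without running them.
sweep echo
```

Calling `sweep` with no argument executes the tests for real, which takes about five minutes at 60 seconds per run.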

Number of threads (-P)	Thread 0	Thread 1	Thread 2	Thread 3	Thread 4	Total Gbit/s
1	3.08					3.08
2	3	2.93				5.93
3	3.04	2.88	2.56			8.48
4	1.87	2.69	1.75	2.73		9.04
5	1.91	2.03	1.84	1.68	1.84	9.3

As discussed above, per-flow performance varies more as the number of flows increases. This can happen when several flows are hashed to the same core and have to share it, while the fastest flow gets a core to itself.
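To get a feel for how quickly flows start sharing cores, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that the hash spreads flows uniformly and independently over eight cores; the real DPAA2 hash and its inputs are not modelled, and the function name is our own.

```shell
#!/bin/sh
# Probability that every one of N flows lands on its own core, assuming a
# uniform, independent hash over the given number of cores (default 8).
# This is an illustrative model, not the actual DPAA2 hashing behaviour.
distinct_core_probability() {
    awk -v flows="$1" -v cores="${2:-8}" 'BEGIN {
        p = 1
        for (i = 0; i < flows; i++) p *= (cores - i) / cores
        printf "%.3f\n", p
    }'
}

for n in 1 2 3 4 5; do
    echo "$n flows: $(distinct_core_probability "$n")"
done
```

Under these assumptions, five flows all get separate cores only about 20% of the time (0.205), so some sharing of cores is the expected case rather than the exception.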

Routed mode

Note: this test is done with static routing rules (no NAT)

Number of threads (-P)	Thread 0	Thread 1	Thread 2	Thread 3	Thread 4	Total Gbit/s
1	2.91					2.91
2	2.81	2.6				5.41
3	2.64	2.52	2.15			7.3
4	2.33	2.36	1.97	2.39		8.3
5	2.17	2.07	2.07	1.46	1.37	9.3

DPDK

DPDK is one way to extract higher performance from the LS1088: it bypasses the kernel and does all packet operations in userspace, using a poll-mode driver (PMD) rather than the kernel's traditional interrupt-driven mechanism.

Note that these tests use only a single core.

For this test, we simply launch DPDK's testpmd, which establishes a Layer 2 bridge across a pair of ports.

Number of threads (-P)	Thread 0	Thread 1	Thread 2	Thread 3	Thread 4	Total Gbit/s	Improvement (ratio to Linux bridge total)
1	5.02					5.02	1.63
2	3.99	3.93				7.92	1.34
3	3.09	2.96	3.16			9.21	1.09
4	2.56	2.26	2.44	2.17		9.43	1.04
5	2.62	1.73	1.69	1.76	1.69	9.49	1.02

As we can see, even with just one core doing the work, DPDK improves single-flow performance to 5 Gbit/s, a 1.6x improvement. It still holds an advantage as more flows are added, although the margin shrinks to about 1.02x at five flows.
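The improvement figure is simply the DPDK total divided by the Linux bridge total for the same number of flows. A minimal sketch that recomputes it from the totals in the two tables (the function name is ours; the figures are copied from the tables):

```shell
#!/bin/sh
# Ratio of DPDK throughput to Linux bridge throughput, to two decimals.
improvement() {
    awk -v dpdk="$1" -v bridge="$2" 'BEGIN { printf "%.2f\n", dpdk / bridge }'
}

# Columns: flows, DPDK total, Linux bridge total (both in Gbit/s).
while read -r flows dpdk bridge; do
    echo "$flows flows: $(improvement "$dpdk" "$bridge")x"
done <<'EOF'
1 5.02 3.08
2 7.92 5.93
3 9.21 8.48
4 9.43 9.04
5 9.49 9.3
EOF
```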

The disadvantage of DPDK is that you need to port your entire network stack to DPDK, which runs in userspace. Some options exist, such as DANOS and TNSR, as well as Open vSwitch with DPDK, but deploying and validating these solutions requires a lot more time and effort than a traditional kernel network stack.

Practical Performance Tuning

Disable IOMMU DMA Translation

Linux can operate the IOMMU in two modes:

DMA Translation

In this mode, the IOMMU hardware will 'translate' (remap) direct memory access (DMA) requests between the hardware and device drivers. It is more secure, but incurs a performance overhead.

Passthrough mode

In this mode, the IOMMU hardware will pass DMA addresses directly through to device drivers without translation. This could pose issues in a virtual machine environment, as a guest VM could break out.

Without the overhead of doing DMA translation, performance is improved.

We note that recent enterprise Linux distributions (RHEL 8, SLES 15) operate in passthrough mode by default¹.

NXP also advises using passthrough mode.

If your distribution does not use passthrough mode by default, you can specify iommu.passthrough=1 on the kernel command line. To disable passthrough mode, use iommu.passthrough=0.

The relevant kernel configuration is:

CONFIG_IOMMU_DEFAULT_PASSTHROUGH=y
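To check whether the running system was booted with an explicit setting, you can look for the parameter in /proc/cmdline. The helper below is a small sketch of our own (the function name is hypothetical); it only inspects the command line and does not tell you the kernel's built-in default.

```shell
#!/bin/sh
# Report the iommu.passthrough value found in a kernel command line string.
# Prints the value (for example "1" or "0"), or "unset" if the parameter is
# absent, in which case the kernel falls back to its compiled-in default.
# If the parameter appears more than once, the last occurrence is reported.
iommu_passthrough_setting() {
    setting="unset"
    for arg in $1; do
        case "$arg" in
            iommu.passthrough=*) setting="${arg#iommu.passthrough=}" ;;
        esac
    done
    echo "$setting"
}

# Inspect the running kernel, if /proc/cmdline is available.
if [ -r /proc/cmdline ]; then
    iommu_passthrough_setting "$(cat /proc/cmdline)"
fi
```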

Provisioning only the needed Ethernet ports in the DPL (not needed after v0.8.9)

Note

The issue with all interrupts coming via CPU0 was resolved in firmware v0.8.9, after receiving recommendations from NXP on how to configure 10 ports within the LS1088's resources.

This section remains for historical interest.

As mentioned in the introduction, the default firmware configuration enables all 10 Ethernet ports (8 GbE + 2 SFP+).

This configuration stretches the resources of the LS1088's DPAA2 slightly, which means it can only push packet flows through one 'portal'.

By provisioning only the Ethernet ports you require, the DPAA2 can spread packet flows across eight portals, one per CPU core. (The LS1088 can support up to 9 ports without stretching its resources.)

If you have a WAN connection faster than 1 Gbit/s, we recommend doing this.

You can see the distribution of the packet flows in /proc/interrupts:

Default 10 Ports configuration

cat /proc/interrupts  | grep -E '(CPU[0-9]|dpio)'
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
124:     119703          0          0          0          0          0          0          0  ITS-fMSI 230000 Edge      dpio.8
125:          0          0          0          0          0          0          0          0  ITS-fMSI 230001 Edge      dpio.7
126:          0          0          0          0          0          0          0          0  ITS-fMSI 230002 Edge      dpio.6
127:          0          0          0          0          0          0          0          0  ITS-fMSI 230003 Edge      dpio.5
128:          0          0          0          0          0          0          0          0  ITS-fMSI 230004 Edge      dpio.4
129:          0          0          0          0          0          0          0          0  ITS-fMSI 230005 Edge      dpio.3
130:          0          0          0          0          0          0          0          0  ITS-fMSI 230006 Edge      dpio.2
131:          0          0          0          0          0          0          0          0  ITS-fMSI 230007 Edge      dpio.1

You can see all network interrupts are processed through one portal only - dpio.8.

Reduced port configuration

cat /proc/interrupts  | grep -E '(CPU[0-9]|dpio)'
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
124:     463934          0          0          0          0          0          0          0  ITS-fMSI 230000 Edge      dpio.8
125:          0     961301          0          0          0          0          0          0  ITS-fMSI 230001 Edge      dpio.7
126:          0          0     348047          0          0          0          0          0  ITS-fMSI 230002 Edge      dpio.6
127:          0          0          0     426312          0          0          0          0  ITS-fMSI 230003 Edge      dpio.5
128:          0          0          0          0     464717          0          0          0  ITS-fMSI 230004 Edge      dpio.4
129:          0          0          0          0          0    1344950          0          0  ITS-fMSI 230005 Edge      dpio.3
130:          0          0          0          0          0          0     337744          0  ITS-fMSI 230006 Edge      dpio.2
131:          0          0          0          0          0          0          0     377875  ITS-fMSI 230007 Edge      dpio.1
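To compare the two configurations at a glance, the per-portal counts can be reduced to a percentage share. The following is a sketch only: the `dpio_share` function is our own, and it assumes the usual /proc/interrupts layout (a header row with one field per CPU, and the interrupt name in the last column).

```shell
#!/bin/sh
# Print each dpio portal with its total interrupt count and share of all
# dpio interrupts. Reads the file given as $1, or /proc/interrupts.
dpio_share() {
    awk '
        NR == 1 { ncpu = NF; next }      # header row: one field per CPU
        /dpio/ {
            sum = 0
            for (i = 2; i <= ncpu + 1; i++) sum += $i
            count[$NF] = sum
            total += sum
            order[++n] = $NF             # remember the original row order
        }
        END {
            for (k = 1; k <= n; k++) {
                name = order[k]
                pct = total ? 100 * count[name] / total : 0
                printf "%s %d %.1f%%\n", name, count[name], pct
            }
        }
    ' "${1:-/proc/interrupts}"
}

# Summarise the running system, if /proc/interrupts is available.
if [ -r /proc/interrupts ]; then
    dpio_share
fi
```

With the default 10-port output shown earlier, dpio.8 would account for 100% of the interrupts; with the reduced configuration, the load is spread over all eight portals.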

¹ See "Default IOMMU modes on Linux OSes" in An Introduction to IOMMU Infrastructure in the Linux Kernel.