21.6. AF_XDP

AF_XDP is a high speed capture framework for Linux built on XDP (eXpress Data Path) and introduced in Linux v4.18. AF_XDP improves capture performance by redirecting ingress frames into user-space memory rings, thus bypassing the kernel network stack.

Note that during AF_XDP operation the selected interface cannot be used for regular network traffic.

Further reading:

  • https://www.kernel.org/doc/html/latest/networking/af_xdp.html

21.6.1. Compiling Suricata

21.6.1.1. Linux

libxdp and libbpf are required for this feature. When building from source, the development files for both will also be required.

Example:

dnf -y install libxdp-devel libbpf-devel
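
On Debian/Ubuntu-based systems the equivalent development packages are typically named libxdp-dev and libbpf-dev (package availability may vary by release):

apt-get -y install libxdp-dev libbpf-dev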

This feature is enabled automatically when the libraries above are installed; the user does not need to add any additional command line options.

The command line option --disable-af-xdp can be used to disable this feature.

Example:

./configure --disable-af-xdp
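
To confirm that AF_XDP support was compiled in, the build info output can be checked; the exact label may vary slightly between versions:

suricata --build-info | grep -i xdp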

21.6.2. Starting Suricata

21.6.2.1. IDS

Suricata can be started as follows to use af-xdp:

af-xdp:
  suricata --af-xdp=<interface>
  suricata --af-xdp=igb0

In the above example Suricata will start reading from the igb0 network interface.
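
Standard command line options can be combined with the capture flag as usual, for example to point at a specific configuration file (the path below is illustrative):

suricata --af-xdp=igb0 -c /etc/suricata/suricata.yaml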

21.6.3. AF_XDP Configuration

Each of these settings can be configured under af-xdp within the "Configure common capture settings" section of the suricata.yaml configuration file.

The number of threads created can be configured in the suricata.yaml configuration file. It is recommended to use threads equal to NIC queues/CPU cores.

Another option is to select auto, which allows Suricata to configure the number of threads based on the number of RSS queues available on the NIC. With auto selected, Suricata spawns one receive thread per configured RSS queue on the interface.

af-xdp:
  threads: <number>
  threads: auto
  threads: 8
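
To see how many RSS queues the NIC currently exposes (and therefore how many threads auto will create), ethtool can be used; eth3 below is an example interface name:

# show current and maximum combined queue counts
ethtool -l eth3
# set the number of combined queues to match the desired worker thread count
ethtool -L eth3 combined 8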

21.6.4. Advanced setup

The af-xdp capture source will operate using default configuration settings. However, these settings can be adjusted in the suricata.yaml configuration file.

Available configuration options are:

21.6.4.1. force-xdp-mode

There are two operating modes employed when loading the XDP program:

  • XDP_DRV: native driver mode, chosen when the driver supports AF_XDP

  • XDP_SKB: generic mode, chosen when native driver support is unavailable

XDP_DRV is the preferred mode as it provides the best performance.

af-xdp:
  force-xdp-mode: <value> where: value = <skb|drv|none>
  force-xdp-mode: drv
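
One way to confirm which mode is actually in use is to inspect the interface with ip link while Suricata is running; native/driver mode is typically shown as xdp and generic mode as xdpgeneric in the flags line (eth3 is an example interface name):

# the flags line of the output indicates the XDP attach mode
ip link show dev eth3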

21.6.4.2. force-bind-mode

During binding the kernel will first attempt to use zero-copy (preferred). If zero-copy support is unavailable it will fall back to copy mode, copying all packets out to user space.

af-xdp:
  force-bind-mode: <value> where: value = <copy|zero|none>
  force-bind-mode: zero

For both options, the kernel will attempt the preferred option first and fall back upon failure; therefore the default (none) leaves the choice to the kernel. Configuring either option forces that mode: the bind will only attempt the configured mode and will fail if it is not supported, i.e. there is no fallback.
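
As a sketch of how these options might sit together in the af-xdp section of suricata.yaml (the interface name and values are illustrative; both options default to none):

af-xdp:
  - interface: eth3
    threads: auto
    force-xdp-mode: drv
    force-bind-mode: zero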

21.6.4.3. mem-unaligned

AF_XDP can operate in two memory alignment modes:

  • Aligned chunk mode

  • Unaligned chunk mode

Aligned chunk mode is the default option which ensures alignment of the data within the UMEM.

Unaligned chunk mode uses hugepages for the UMEM. Hugepages start at a size of 2MB but can be as large as 1GB; a lower page (memory chunk) count allows faster lookup of page entries. The hugepages need to be allocated on the NUMA node where the NIC and CPU reside. Otherwise, if the hugepages are allocated only on NUMA node 0 while the NIC is connected to NUMA node 1, the application will fail to start. Therefore, it is recommended to first find out which NUMA node the NIC is connected to, and only then allocate hugepages and set the CPU core affinity to that NUMA node.

The memory assigned per socket/thread is 16MB, so each worker thread requires at least 16MB of free hugepage space. As stated above, hugepages can be of various sizes; consult the OS with cat /proc/meminfo to confirm the configured size.

Example

8 worker threads * 16MB = 128MB
hugepage size = 2048 kB (2MB)
so: pages required = 128MB / 2MB = 64 pages

See https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt for a detailed description.
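
The commands below show one way to check which NUMA node the NIC is attached to and to reserve 2MB hugepages on that node (eth3, node0 and the page count are examples; the sysfs path assumes 2048kB hugepages):

# find the NUMA node the NIC is attached to (-1 means no NUMA affinity)
cat /sys/class/net/eth3/device/numa_node
# reserve 64 x 2MB hugepages on NUMA node 0
echo 64 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
# confirm the allocation
grep -i huge /proc/meminfo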

To enable unaligned chunk mode:

af-xdp:
  mem-unaligned: <yes/no>
  mem-unaligned: yes

Linux v5.11 introduced the SO_PREFER_BUSY_POLL socket option for AF_XDP, which allows true polling of the socket queues. This feature was introduced to reduce context switching and improve CPU reaction time during traffic reception.

This feature is enabled by default. Unless disabled (see enable-busy-poll below), the following options are used to configure it.

21.6.4.4. enable-busy-poll

Enables or disables busy polling.

af-xdp:
  enable-busy-poll: <yes/no>
  enable-busy-poll: yes

21.6.4.5. busy-poll-time

Sets the approximate time in microseconds to busy poll on a blocking receive when there is no data.

af-xdp:
  busy-poll-time: <time>
  busy-poll-time: 20

21.6.4.6. busy-poll-budget

Budget allowed for batching of ingress frames. Larger values mean more frames can be stored/read. It is recommended to test this for performance.

af-xdp:
  busy-poll-budget: <budget>
  busy-poll-budget: 64

21.6.4.7. Linux tunables

The SO_PREFER_BUSY_POLL option works in concert with the following two Linux knobs to ensure best capture performance. These are not socket options:

  • gro-flush-timeout

  • napi-defer-hard-irq

The purpose of these two knobs is to defer interrupts and to allow the NAPI context to be scheduled from a watchdog timer instead.

The gro-flush-timeout indicates the timeout period for the watchdog timer. When no traffic is received for gro-flush-timeout the timer will exit and softirq handling will resume.

The napi-defer-hard-irq indicates the number of queue scan attempts before exiting to interrupt context. When enabled, the softirq NAPI context will exit early, allowing busy polling.

af-xdp:
  gro-flush-timeout: 2000000
  napi-defer-hard-irq: 2
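
The values set by Suricata correspond to per-device knobs exposed under sysfs, which can be inspected to confirm what is currently applied (eth3 is an example interface; availability of the entries depends on the kernel version):

cat /sys/class/net/eth3/gro_flush_timeout
cat /sys/class/net/eth3/napi_defer_hard_irqs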

21.6.5. Hardware setup

21.6.5.1. Intel NIC setup

Intel network cards don't support symmetric hashing but it is possible to emulate it by using a specific hashing function.

Follow these instructions closely for the desired result.

Enable symmetric hashing:

ifconfig eth3 down
ethtool -L eth3 combined 16 # if you have at least 16 cores
ethtool -K eth3 rxhash on
ethtool -K eth3 ntuple on
ifconfig eth3 up
./set_irq_affinity 0-15 eth3
ethtool -X eth3 hkey 6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A:6D:5A equal 16
ethtool -x eth3
ethtool -n eth3

In the above setup you are free to use any recent set_irq_affinity script. It is available in the Intel x520/x710 NIC driver source download.

NOTE: A special low-entropy key is used for the symmetric hashing; more information on the research behind this symmetric hashing setup is available online.

21.6.5.2. Disable any NIC offloading

Suricata disables NIC offloading based on the configuration parameter disable-offloading, which is enabled by default. See the capture section of the suricata.yaml file.

capture:
  # disable NIC offloading. It's restored when Suricata exits.
  # Enabled by default.
  #disable-offloading: false
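
Offload state can also be inspected or adjusted manually with ethtool if required (eth3 is an example interface name):

# show current offload settings
ethtool -k eth3
# disable the offloads most relevant to capture
ethtool -K eth3 gro off lro off tso off gso off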

21.6.5.3. Balance as much as you can

Try to use the network card's flow balancing as much as possible:

for proto in tcp4 udp4 ah4 esp4 sctp4 tcp6 udp6 ah6 esp6 sctp6; do
   /sbin/ethtool -N eth3 rx-flow-hash $proto sd
done

This command triggers load balancing using only the source and destination IPs. This may not be optimal in terms of load balancing fairness, but it ensures all packets of a flow reach the same thread, even in the case of IP fragmentation (where source and destination ports will not be available for some fragmented packets).
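
The configured hash fields can be verified per protocol afterwards (eth3 is an example interface name):

ethtool -n eth3 rx-flow-hash tcp4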