Technical Support




Hitting a performance cliff

Our payload processing application is experiencing a throughput performance cliff when running in full thread mode on an NFP-6000 card. Performance takes a hit when the payload is 835 bytes or larger -- each packet has a 54-byte header. Payload sizes of 64-834 bytes have similar throughput to each other, and payload sizes of 835-1400 bytes have similar throughput to each other. The size of the drop depends on which memory we place our data structures in; we see a 29% drop with CLS and a 23% drop with EMEM.

The cycles per byte that our application spends processing a payload stay roughly the same regardless of payload size. So we think the cause lies before or after our code.

We are not familiar with queue management inside the card's hardware fabric. Could this be caused by congestion on a resource hidden to the programmer?


I wanted to add that our traffic generator transmits at 40 Gbps and we see:

-- (in full thread mode) approx 21 Gbps throughput for 834 and lower byte payloads and approx 15 Gbps for 835+ byte payloads.

-- (in reduced thread mode) a steady decrease in throughput, from 13 to 12 Gbps, as payload size varies from 64-1400 bytes.
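As a quick sanity check on the full-thread-mode figures, the drop from roughly 21 Gbps to roughly 15 Gbps can be converted into a relative drop and an approximate packet rate. This is only a rough Python sketch: the function names are made up for illustration, and it assumes the reported throughput counts header plus payload bytes and excludes Ethernet framing overhead, which the post does not state.

```python
# Rough sanity check on the reported throughput numbers.
# Assumption: throughput counts header + payload bytes only
# (no preamble, inter-frame gap, or FCS). The post does not say.

HEADER_BYTES = 54  # per-packet header size given in the post


def relative_drop(before_gbps, after_gbps):
    """Fractional throughput loss going from before_gbps to after_gbps."""
    return (before_gbps - after_gbps) / before_gbps


def packets_per_second(throughput_gbps, packet_bytes):
    """Approximate packet rate for a given throughput and packet size."""
    return throughput_gbps * 1e9 / (packet_bytes * 8)


if __name__ == "__main__":
    drop = relative_drop(21, 15)
    print(f"relative drop: {drop:.1%}")

    # Packet sizes on either side of the cliff (payload + header).
    below = packets_per_second(21, 834 + HEADER_BYTES)
    above = packets_per_second(15, 835 + HEADER_BYTES)
    print(f"approx. packet rate just below the cliff: {below:,.0f} pps")
    print(f"approx. packet rate just above the cliff: {above:,.0f} pps")
```

Under these assumptions the relative drop works out to about 29%, which is in line with the figure reported for CLS.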

 

Since having more threads active increases congestion on the interconnect, I assume that there is some hidden buffer (queue) on the interconnect to the crossbar?

Hi Joel


First, which tool version are you using?


At around 888B (1024 - headroom) your packet will move from being entirely in CTM memory to being split between CTM and external memory. This size can change depending on whether you are adding/removing headers etc.


This wouldn't make _that_ big a difference, as the MU portion is normally untouched during packet processing. Only a few operations affect the whole packet; off the top of my head, just checksums and deparsing. Deparsing will only use the external memory if you add and remove LOTS of headers, and in that case it will tend to be rather slow.
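The 888 B figure lines up closely with the reported cliff: an 834-byte payload plus the 54-byte header is 888 bytes, and an 835-byte payload plus the header is 889 bytes. A minimal Python sketch of that arithmetic follows. The headroom of 136 bytes is inferred from "888B (1024 - headroom)" rather than taken from any documentation, treating 888 bytes as the largest packet that still fits is an assumption (the figure was given as "around" 888 B), and the helper names are hypothetical.

```python
# Sketch of the CTM split arithmetic discussed above.
# Assumptions, inferred from the thread and not from documentation:
#   - headroom = 1024 - 888 = 136 bytes
#   - a packet of exactly 888 bytes still fits entirely in CTM

CTM_SPLIT_BYTES = 1024                    # split point quoted in the reply
HEADROOM_BYTES = CTM_SPLIT_BYTES - 888    # 136, inferred
HEADER_BYTES = 54                         # header size from the original post


def ctm_capacity(split_bytes=CTM_SPLIT_BYTES, headroom=HEADROOM_BYTES):
    """Packet bytes that fit in CTM before spilling to external memory."""
    return split_bytes - headroom


def fits_in_ctm(payload_bytes, header_bytes=HEADER_BYTES,
                split_bytes=CTM_SPLIT_BYTES, headroom=HEADROOM_BYTES):
    """True if header + payload fits entirely in the CTM buffer."""
    return payload_bytes + header_bytes <= ctm_capacity(split_bytes, headroom)


def largest_payload_in_ctm(header_bytes=HEADER_BYTES,
                           split_bytes=CTM_SPLIT_BYTES,
                           headroom=HEADROOM_BYTES):
    """Largest payload that keeps the whole packet in CTM."""
    return ctm_capacity(split_bytes, headroom) - header_bytes


if __name__ == "__main__":
    for payload in (834, 835, 1400):
        print(payload, fits_in_ctm(payload))
    print("largest payload in CTM:", largest_payload_in_ctm())
    # Same check with a 2k split point and the same assumed headroom.
    print("1400 B payload, 2k split:", fits_in_ctm(1400, split_bytes=2048))
```

With these assumed values, the largest payload that stays entirely in CTM is 834 bytes, and a 2k split point would keep every tested payload size (up to 1400 bytes) in CTM, provided the headroom stays the same.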


Without seeing your application it is hard to know exactly what is happening.


You can also mitigate this a little by making the split point 2k instead of 1024 (though this will need slightly different NBI configs).


Also, it is best to use the open-nfp forums for these sorts of discussions.


David George



Hi David,


Feel free to repost this thread to the other forum ... or I can do it ... please let me know ...


We are using version 6.0.3.1 build 3241.


I already have the CTM split length set to 2k.


What is the best way to send you my application and support files? They include a Python script that sets a few variables from the CPU.


Joel

This is posted on the open-nfp forum now.
