NetFPGA Padded Switch

From Bobs Projects

NetFPGA Padded Switch is a modification of the NetFPGA Reference Switch that adds a 2-byte pre-frame pad on all input and output queues. This was the subject of my NetFPGA SummerCamp2012 Presentation.

Rationale

The User Data Path (UDP?!?) in the NetFPGA Reference framework is currently 64 bits wide. As almost all frames start with a 14 + 4n byte Ethernet header (where n is the number of VLAN headers, typically 0, 1, 2 or 3), most of the rest of the packet is poorly aligned with the 64-bit data path.

One simple way to "fix" this is to add a 2-byte pad to the start of each packet, pushing the Ethernet header to end on a 4-byte boundary, which generally makes the IP and subsequent headers align well with the 64-bit data path.
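The arithmetic behind this can be sketched in Python (a behavioral illustration only; the project itself is Verilog, and the constant and function names below are mine, not from the NetFPGA tree):

```python
# Why a 2-byte pre-pad aligns the payload: illustrative model, not RTL.
ETH_HEADER = 14   # bytes in an untagged Ethernet header
VLAN_TAG = 4      # bytes added per 802.1Q tag

def ip_header_offset(n_vlans, pre_pad=0):
    """Byte offset of the IP header from the start of the packet buffer."""
    return pre_pad + ETH_HEADER + n_vlans * VLAN_TAG

# Without padding the IP header starts at byte 14, 18, 22, ... -- never
# on a 4-byte boundary.  A 2-byte pre-pad moves it to 16, 20, 24, ...
for n in range(4):
    assert ip_header_offset(n) % 4 == 2               # misaligned
    assert ip_header_offset(n, pre_pad=2) % 4 == 0    # 32-bit aligned
```

With no VLAN tags the padded header even ends on an 8-byte boundary (offset 16), matching the 64-bit data path exactly.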

Development

The Padded Switch project was copied from the Reference Switch.

Padded Hardwire

Padded Switch (the "padded_switch" project) was subsequently copied to "padded_hardwire" to test the queues separately from the switch functionality (at Glen's advice and with help from Adam). "padded_hardwire" uses the previously unreleased "hardwire_lookup" module in the output port lookup functional stage.

This allows packets to pass straight through the NetFPGA unmodified from any input queue to any output queue, under the control of a PCI register.

From here, we can modify the Phy input and output queues to inject and remove the additional two padding bytes and test that, before modifying the learning switch module(s) to handle the new data alignments.

Verilog Coding

Instead of modifying the Phy/MAC rx_queue module (at lib/verilog/core/io_queues/ethernet_queue/src/rx_queue.v) in place, I used the NetFPGA build infrastructure's override mechanism: a similarly named module in the project src directory overrides the one in the lib tree.

So I started by copying rx_queue.v to the project/padded_hardwire/src directory and set about modifying it to insert two 8-bit bytes (octets) into the receive FIFO during MAC idle time (pre-loading them before the next packet arrives). The reason for injecting the bytes here is that the FIFO itself is responsible for converting from the 8-bit data path to the 64-bit User Data Path. The FIFO is implemented with Xilinx proprietary gateware.

A fragment of the new code, implementing a new state in the FSM feeding the FIFO:

        RX_IDLE_PAD: begin
           /* write out the two pre-frame padding bytes */
           if (num_bytes_written[PKT_BYTE_CNT_WIDTH-1:1] == 0) begin
              rx_fifo_wr_en   = 1;
              /* should also check if we already have data waiting - that would be bad... */
           end
           else
           if(dvld_d1) begin
              rx_fifo_wr_en   = 1;
              rx_state_nxt    = RX_RCV_PKT;
           end
        end // case: RX_IDLE_PAD

I tested with the excellent NetFPGA framework Python testing harnesses. The module failed as expected, and I could see in the dumps that the packets were indeed getting the extra two bytes on the front and that the IP header etc. were now all word-aligned.

The next step was to perform the opposite transform in the tx_queue.v module (copied from lib/verilog/core/io_queues/ethernet_queue/src/tx_queue.v to project/padded_hardwire/src). Here we remove the first two 8-bit bytes from the FIFO heading to the Phy/MAC and throw them away.

Testing now succeeded for packets passing in and out through the Phy/MACs.
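Behaviorally, the rx/tx pair performs an identity transform as seen from the wire. A minimal Python model (function names are illustrative, not from the Verilog):

```python
# Behavioral model of the padded rx_queue / tx_queue pair (not RTL).
PRE_PAD = 2  # pre-frame padding bytes

def rx_pad(frame: bytes) -> bytes:
    """rx_queue side: push two pad bytes into the 8-bit receive FIFO
    ahead of the frame, so the 8-to-64-bit width conversion emits the
    payload shifted by two bytes."""
    return b"\x00" * PRE_PAD + frame

def tx_unpad(padded: bytes) -> bytes:
    """tx_queue side: discard the first two bytes read from the FIFO
    before handing the frame to the Phy/MAC."""
    return padded[PRE_PAD:]

# The wire never sees the padding: the round trip is the identity.
frame = b"\xff" * 6 + b"\x02\x00\x00\x00\x00\x01" + b"\x08\x00" + bytes(46)
assert tx_unpad(rx_pad(frame)) == frame
```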

At this stage, Adam pointed out that the CPU/PCI/DMA queues would also need to be modified to perform identical padding. This was complicated by the fact that the DMA data path is 32 bits wide, so the padding involves shifting and saving half-words between FIFO reads and writes. Also, the PCI bus is little-endian while the User Data Path is big-endian, so selecting the correct bytes to hold between FIFO reads and writes took extra care.
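The half-word carry can be illustrated with a small Python model. It works on 32-bit words in User Data Path (big-endian) byte order and leaves out the PCI little-endian swap, which the real RTL also has to handle:

```python
# Behavioral model of stripping the 2 pad bytes on a 32-bit data path
# (not RTL).  Each output word combines the half-word held from the
# previous FIFO read with the top half of the current one.
def strip_pad_32(words):
    out, carry = [], None
    for w in words:
        hi, lo = (w >> 16) & 0xFFFF, w & 0xFFFF
        if carry is None:
            carry = lo        # first read: drop the 2 pad bytes, hold the rest
        else:
            out.append((carry << 16) | hi)
            carry = lo        # hold the low half for the next word
    if carry is not None:
        out.append(carry << 16)   # flush the final held half-word
    return out

# Pad bytes 0x50 0x50 followed by packet bytes AA BB CC DD EE FF:
assert strip_pad_32([0x5050AABB, 0xCCDDEEFF]) == [0xAABBCCDD, 0xEEFF0000]
```

The tail flush is where packet lengths that are not a multiple of 4 make the boundary conditions fiddly.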

As with the Phy queues, I started with the cpu_dma_rx_queue (from lib/verilog/core/io_queues/cpu_dma_queue/src/cpu_dma_rx_queue.v, copied to project/padded_hardwire/src). This module turned out to be the hardest to modify and ended up converting a lot more of the existing code into a single finite state machine handling the input to the FIFO. Again, the FIFO is responsible for converting the 32-bit data in to 64-bit data out.

The easiest conversion, yet the hardest to debug and get right, was the cpu_dma_tx_queue, where the 2 padding bytes must be removed from the packet data coming out of the FIFOs on its way to the DMA system, the PCI bus and on to the CPU. After several hours of debugging with ModelSim etc. it became clear what was going wrong: again, the data paths over PCI etc. are 32-bit, so the words need to be split and a half-word carried over from one transfer to the next.

Design Decision

To what parts of the NetFPGA system should the presence of the padding bytes be apparent?

Because they only exist between the input and output queue FIFOs, their presence should never be revealed outside of that relatively clearly defined boundary. In particular, they should not be included in any count of bytes/words that is passed back to the host CPU through the registers.

On the other hand, all modules operating within the User Data Path do need to be aware of their presence, and, where necessary, offsets etc. need to be adjusted (simplified?) to account for them.
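As a concrete example of the first rule, the byte count a queue reports through the registers must describe the unpadded frame. A Python sketch (the function name is illustrative):

```python
# Sketch of the counting rule: byte counts reported to the host must
# describe the original frame, never the padded one.
PRE_PAD = 2  # pre-frame padding bytes, invisible outside the queues

def reported_byte_count(padded_frame: bytes) -> int:
    """Length a padded queue should report via the register system:
    the pad bytes are excluded."""
    return len(padded_frame) - PRE_PAD

padded = b"\x00\x00" + b"\xab" * 60   # 60-byte frame plus 2-byte pad
assert reported_byte_count(padded) == 60
```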

Padded NIC

Creating Padded NIC from the Reference NIC required no particular changes other than copying the new 2-byte padding queue modules into src.

Then the tests for Reference NIC were tried.

The first test (both_loopback_minsize) revealed that some queue byte counters were not being reported correctly: the padding bytes were being counted in the results returned to the CPU through the register system. This required fixes in rx_queue.v and tx_queue.v.

The second test (both_loopback_maxsize) revealed a more serious boundary-condition problem with packets whose size is not a multiple of 4. This was tracked down to another FIFO reading error in the cpu_dma_tx_queue module, now fixed.

All tests (including a couple of new ones) pass.

Learning CAM Switch

Once all the I/O queues are fixed to deal with adding and removing the 2-byte padding, the next step is to test with the Learning CAM Switch output port lookup facility.

I also wanted to start coding in such a way that the 2-byte pre-padding can be configurably added at simulation/synthesis time. This is handled in the Learning CAM Switch code by the ethernet_parser module, which selects between the ethernet_parser_32bit (for a 32-bit User Data Path) and ethernet_parser_64bit (for the current standard 64-bit User Data Path) modules.

I simply added a new module, ethernet_parser_64bit_2pre (copied from ethernet_parser_64bit), and used a parameter in ethernet_parser, PRE_PAD = 2, to signal that the 2-byte pre-padding should be used.

The actual changes from ethernet_parser_64bit to ethernet_parser_64bit_2pre were surprisingly few, and somewhat call into question the whole need to add/remove the 2 bytes of padding.

Here are the changes:

            else if(in_wr && in_ctrl==0) begin
-              dst_mac_next          = in_data[63:16] ;
-              src_mac_next[47:32]   = in_data[15:0];
+              dst_mac_next          = in_data[47:0];
               state_next            = READ_WORD_2;
            end

and

            if(in_wr) begin
-              src_mac_next [31:0]   = in_data[63:32];
-              ethertype_next        = in_data[31:16];
+              src_mac_next          = in_data[63:16];
+              ethertype_next        = in_data[15:0];
               state_next            = WAIT_EOP;
               eth_done_next         = 1;
            end

(There are some other minor administrative changes, such as the module name.)
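The effect of the diff can be checked with a small Python model of the padded parser's field extraction (the function name and word values are made-up examples, not from the Verilog):

```python
# Behavioral model of the ethernet_parser_64bit_2pre field extraction.
# With the 2-byte pre-pad, no Ethernet header field straddles a 64-bit
# word boundary any more: word 0 is pad | dst MAC, word 1 is
# src MAC | ethertype.
def parse_padded(word0: int, word1: int):
    dst_mac = word0 & ((1 << 48) - 1)   # in_data[47:0]
    src_mac = word1 >> 16               # in_data[63:16]
    ethertype = word1 & 0xFFFF          # in_data[15:0]
    return dst_mac, src_mac, ethertype

w0 = 0x0000_FFFF_FFFF_FFFF            # 2-byte pad | broadcast dst MAC
w1 = 0x0242_AC11_0002_0800            # src MAC 02:42:ac:11:00:02 | IPv4
assert parse_padded(w0, w1) == (0xFFFFFFFFFFFF, 0x0242AC110002, 0x0800)
```

Each field now falls entirely within one word, which is why the diff deletes more lines than it adds.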

Padded Switch was tested with the set of test vectors for the Learning CAM (Reference) Switch and passed all tests.

To Do

The next step is to make the 2-byte padding setup work with the Reference Router as well. Here, I expect the data alignment "fix" to make a greater difference to code simplicity.

Then, as time permits, make 2-byte padding versions of all other Reference projects and submit the entire thing back to the NetFPGA community in case anyone else may find it useful one day...