> > You need to remove the IPv4 bits. That copy of the MAC has to happen
> > at a different layer; it does not belong in IPv4. At the very least,
> > not everyone should have to eat that header copy.
>
> What if I make the memcpy conditional on "if (skb->physindev != NULL)"?
>
> First explain to me what the copy is needed for.

This is just to elaborate on what Bart said earlier.

In the "L2 switched frame" case, we have a bit of a nasty problem
with IP fragmentation. And in the "L3 'switched' frame" case
(brouted frame), we have an ordering problem with IP fragmentation
and neighbor resolution.

This is what the call stack looks like when we have a purely
bridged frame (that needs to be netfiltered):

  net_rx_action
  -> br_handle_frame
  -> PF_BRIDGE/PRE_ROUTING
  -> br_handle_frame_finish
  -> br_forward
  -> PF_BRIDGE/FORWARD
  -> __br_forward_finish
  -> PF_BRIDGE/POST_ROUTING
  -> dev_queue_xmit

This case is easy to see. With ip_conntrack enabled, fragments are
reassembled into one packet in PRE_ROUTING and refragmented in
POST_ROUTING. Refragmentation does not preserve the original
hardware header, so the fragments leave the box with incorrect
HW headers.
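
A toy userspace sketch of the purely bridged case, under loud assumptions:
struct fake_skb, data_of(), make_frame(), refragment() and
hw_headers_intact() are hypothetical stand-ins, not kernel APIs. The model
assumes a fixed 32-byte headroom and ignores the per-fragment IP header
that real fragmentation builds. It only shows that fragments built into
fresh buffers do not carry the MAC header that sat in front of the
original packet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define ETH_HLEN 14   /* Ethernet header length */
#define HEADROOM 32   /* assumed headroom in front of the IP header */
#define MAX_DATA 256  /* assumed maximum L3 length in this toy model */

/* Hypothetical, heavily simplified stand-in for struct sk_buff. */
struct fake_skb {
    unsigned char buf[HEADROOM + MAX_DATA];
    size_t len;       /* bytes of L3 data starting at data_of() */
};

static unsigned char *data_of(struct fake_skb *s) { return s->buf + HEADROOM; }

/* Build a "received" frame: MAC header just before data, IP bytes at data. */
static void make_frame(struct fake_skb *s, const unsigned char mac[ETH_HLEN],
                       const unsigned char *ip, size_t len)
{
    assert(len <= MAX_DATA);
    memset(s->buf, 0, sizeof(s->buf));
    memcpy(data_of(s) - ETH_HLEN, mac, ETH_HLEN);
    memcpy(data_of(s), ip, len);
    s->len = len;
}

/* Naive refragmentation: each fragment is a fresh (zeroed) buffer that gets
 * a slice of the L3 data; nothing puts a hardware header in front of it. */
static size_t refragment(struct fake_skb *big, struct fake_skb *frags,
                         size_t max_frags, size_t mtu)
{
    size_t n = 0;
    assert(mtu > 0);
    for (size_t off = 0; off < big->len && n < max_frags; off += mtu, n++) {
        size_t chunk = big->len - off < mtu ? big->len - off : mtu;
        memset(frags[n].buf, 0, sizeof(frags[n].buf));
        memcpy(data_of(&frags[n]), data_of(big) + off, chunk);
        frags[n].len = chunk;
    }
    return n;
}

/* True only if every fragment carries the same MAC header as the big packet. */
static bool hw_headers_intact(struct fake_skb *big, struct fake_skb *frags, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (memcmp(data_of(&frags[i]) - ETH_HLEN, data_of(big) - ETH_HLEN,
                   ETH_HLEN) != 0)
            return false;
    return true;
}
```

With a 100-byte packet and a 40-byte "MTU", refragment() yields three
fragments, and hw_headers_intact() reports that none of them kept the
original MAC header.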

The broute case is a bit harder to see. If L3 (routed) packets
are sent out through a bridge device, we don't know which subdevice
(slave port) they will go to until the bridge layer's br_dev_xmit
has had its way. However, we would like to be able to use the real
outgoing interface (physoutdev) in FORWARD and POST_ROUTING.
To do this, we postpone calling IPv4/FORWARD and IPv4/POST_ROUTING
until after PF_BRIDGE/POST_ROUTING has happened, because at that
point we know physoutdev and can pass it to those IPv4 hooks.
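
The ordering difference can be sketched as a trace of hook calls. This
is only an illustration under assumptions: struct trace, run_hook(),
xmit_without_deferral() and xmit_with_deferral() are hypothetical, and
port stands for the slave port that br_dev_xmit picks. None of this is
the actual bridge-netfilter code.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_EVENTS 8

/* Hypothetical trace of hook invocations, for illustration only. */
struct trace {
    char event[MAX_EVENTS][48];
    int n;
};

/* Record that a hook ran and which physoutdev it could see. */
static void run_hook(struct trace *t, const char *hook, const char *physoutdev)
{
    assert(t->n < MAX_EVENTS);
    snprintf(t->event[t->n++], sizeof t->event[0], "%s(physoutdev=%s)",
             hook, physoutdev ? physoutdev : "unknown");
}

/* Without deferral: the IPv4 hooks run while the output device is still
 * the bridge itself, before a slave port has been chosen. */
static void xmit_without_deferral(struct trace *t, const char *port)
{
    run_hook(t, "IPv4/FORWARD", NULL);
    run_hook(t, "IPv4/POST_ROUTING", NULL);
    run_hook(t, "PF_BRIDGE/POST_ROUTING", port);
}

/* With deferral: the IPv4 hooks run only after the bridge has picked the
 * slave port, so they can be handed physoutdev. */
static void xmit_with_deferral(struct trace *t, const char *port)
{
    run_hook(t, "PF_BRIDGE/POST_ROUTING", port);
    run_hook(t, "IPv4/FORWARD", port);
    run_hook(t, "IPv4/POST_ROUTING", port);
}
```

In the deferred trace both IPv4 entries read physoutdev=eth1 when
port is "eth1"; in the non-deferred trace they can only record
"unknown".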

But packet refragmentation normally happens in IPv4/POST_ROUTING.
We don't want it to happen before the postponed hooks run, though,
because then the eventual calls to IPv4/FORWARD and
IPv4/POST_ROUTING would see all the fragments instead of one
packet (which goes against the idea of conntrack).
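
So if we postpone FORWARD and POST_ROUTING until after br_dev_xmit,
we effectively reverse the order of refragmentation and neighbor
resolution: the hardware header is built for the big packet first,
and the packet is only refragmented afterwards. But refragmentation
messes up the hardware header.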
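
The reversal can be modelled with one boolean per packet. In this
hypothetical sketch, neigh_output() stands in for neighbor resolution
building the hardware header and fragment() for refragmentation; real
MTU handling and per-fragment IP header construction are deliberately
left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, minimal packet model: only what matters for the ordering. */
struct pkt {
    size_t len;
    bool has_hw_header;
};

/* Stand-in for neighbor output: puts a hardware header on one packet. */
static void neigh_output(struct pkt *p) { p->has_hw_header = true; }

/* Stand-in for refragmentation: new fragments start without a hw header. */
static size_t fragment(const struct pkt *p, size_t mtu, struct pkt *out, size_t max)
{
    size_t n = 0;
    assert(mtu > 0);
    for (size_t off = 0; off < p->len && n < max; off += mtu, n++) {
        out[n].len = p->len - off < mtu ? p->len - off : mtu;
        out[n].has_hw_header = false;
    }
    return n;
}

/* Normal IPv4 order: fragment first, then build a header per fragment. */
static size_t normal_order(struct pkt p, size_t mtu, struct pkt *out, size_t max)
{
    size_t n = fragment(&p, mtu, out, max);
    for (size_t i = 0; i < n; i++)
        neigh_output(&out[i]);
    return n;
}

/* Reversed order (postponed hooks): header built on the big packet,
 * refragmentation afterwards. */
static size_t reversed_order(struct pkt p, size_t mtu, struct pkt *out, size_t max)
{
    neigh_output(&p);
    return fragment(&p, mtu, out, max);
}

/* Count how many fragments ended up with a hardware header. */
static size_t count_with_header(const struct pkt *f, size_t n)
{
    size_t c = 0;
    for (size_t i = 0; i < n; i++)
        c += f[i].has_hw_header;
    return c;
}
```

With a 3000-byte packet and an MTU of 1480, both orders produce three
fragments, but only normal_order() leaves all three with a hardware
header; after reversed_order() none of them has one.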

The 16-byte hardware header copy fixes this by copying to each
fragment the hardware header that was tacked onto, or was already
present on, the bigger packet. It's ugly, I admit, but there's
currently no better way.
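
A minimal sketch of the per-fragment copy, combined with the
skb->physindev != NULL condition proposed earlier in the thread.
struct fake_skb and copy_hw_header() are hypothetical stand-ins, not
the kernel's sk_buff code, and the sketch assumes at least 16 bytes of
headroom in both buffers, which it asserts rather than handles.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HH_COPY 16   /* bytes copied in front of skb->data, as in the mail */

/* Hypothetical, simplified stand-in for struct sk_buff. */
struct fake_skb {
    unsigned char *head;      /* start of the buffer */
    unsigned char *data;      /* start of the IP header */
    const char *physindev;    /* non-NULL only for bridged packets */
};

static size_t headroom(const struct fake_skb *s) { return (size_t)(s->data - s->head); }

/* Copy the 16 bytes in front of the original packet's data in front of one
 * fragment's data. Returns 1 if copied, 0 if the packet was not bridged. */
static int copy_hw_header(const struct fake_skb *orig, struct fake_skb *frag)
{
    if (orig->physindev == NULL)   /* non-bridged traffic skips the copy */
        return 0;
    assert(headroom(orig) >= HH_COPY && headroom(frag) >= HH_COPY);
    memcpy(frag->data - HH_COPY, orig->data - HH_COPY, HH_COPY);
    return 1;
}
```

Calling copy_hw_header() once per fragment gives every fragment the
header area of the big packet, while packets that did not come in
through a bridge port pay nothing.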

(And Bart, I chose 16 because 16-byte-aligned 16-byte copies
should be cheaper than 2-byte-aligned 14-byte copies, and there
should be at least 16 bytes before skb->data at this point
anyway. That is, if I understood the code correctly.)
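
The alignment argument can be checked with a little offset arithmetic.
The sketch assumes the buffer start is 16-byte aligned and that the
driver reserved 2 bytes of padding (ASSUMED_IP_ALIGN, a hypothetical
name) before the 14-byte Ethernet header, so the IP header begins at
offset 16; real drivers and architectures may differ.

```c
#include <stddef.h>

#define ETH_HLEN 14           /* Ethernet header length */
#define ASSUMED_IP_ALIGN 2    /* assumed padding reserved before the MAC header */
#define HH_COPY 16            /* size of the copy discussed in the mail */

/* Offset of skb->data (the IP header) from a 16-byte-aligned buffer start,
 * under the assumptions above: 2 + 14 = 16. */
static size_t ip_header_offset(void) { return ASSUMED_IP_ALIGN + ETH_HLEN; }

/* Natural alignment (largest power of two, capped at 16) of an address
 * lying off bytes past a 16-byte-aligned buffer start. */
static size_t alignment_of(size_t off)
{
    size_t a = 16;
    while (a > 1 && off % a != 0)
        a /= 2;
    return a;
}

/* Alignment of a copy of len bytes that ends exactly at skb->data. */
static size_t copy_alignment(size_t len) { return alignment_of(ip_header_offset() - len); }
```

Under those assumptions copy_alignment(HH_COPY) is 16, because the copy
starts at the aligned buffer start, while copy_alignment(ETH_HLEN) is
only 2.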
cheers,
Lennert