I should also point out that the cost of reprogramming the interrupt
controllers isn't taken into account by the kernel irq balancer. In
the userspace implementation the reprogramming is done infrequently
enough to make even significant cost negligible; in-kernel the cost
is entirely uncontrolled and the rate of reprogramming unlimited.
Also, Linux' i386 IO-APIC programming model is quite fragile and does
not properly distinguish between physical and logical destinations or
SAPIC vs. xAPIC (which differ in the physical destination format) to
keep it coherent with i386 IO-APIC's DESTMOD. I would very much like to
see that confusion corrected before any significant amount of online
i386 IO-APIC RTE reprogramming is considered "stable". For instance, I
know of one subarch that claims to use logical DESTMOD with clustered
hierarchical DFR, but is using what appears to be SAPIC physical
broadcast for the RTE's, and a couple of other confusions where the
types of APIC ID's are ambiguous depending on subarch and broken by
dynamic reprogramming. It furthermore assumes flat logical DFR by
virtue of attempting to form APIC destinations representing arbitrary
sets of cpus in addition to assuming at least logical with
cpumask_to_logical_apicid() and is one of the major reasons irqbalance
is either disabled or unusable in various subarches.
The story of APIC code tripping over itself is an even unfunnier comedy
of errors, as the lack of TPR adjustment means that within any APIC
destination at which IO-APIC RTE's are targeted on Pentium IV systems
there will always be just a single cpu at which all interrupts are
concentrated. In order to work around this, all of the buggy code
choking on the fact arbitrary sets of cpus aren't representable as APIC
destinations is actually unused except as a buggy translation layer
from cpu ID's to APIC destinations, and the irqbalancing code works
around this by forming singleton cpumasks, which have historically been
frequently confused with APIC destinations of all 4 different formats.
Basically, the kernel has yet to handle IO-APIC RTE programming
properly, and until there is a remote semblance of action moving it
toward the correct formation of IO-APIC RTE's, in-kernel irqbalancing
is a house of cards built on rapidly shifting sands. There is no point
in anything but a userspace driver where the complexity the kernel has
failed to handle thus far can be punted, or reliance on hardware
mechanisms like the TPR that insulate the kernel from its prior and
current embarrassments in handling this complexity, until something is
done to correct IO-APIC RTE formation.
-- wli
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/