Yeah. Exactly. Especially Intel's own. Itanium's enormous, and I mentioned
how Pentium 4 had pushed pipelining slightly beyond where you really get all
that much benefit from it.
Cores have grown to take advantage of larger transistor budgets, but that's
starting to hit limits on how big you can usefully make them before you get
serious diminishing returns and problems with clock skew and signal
propagation delays and other such fun. (Hence the experiments with clockless
processors, etc.) The longer the wire, the longer it takes a signal to go
down it. You want contained modules so you can clock them fast (hence
pipeline stages).
Athlon hasn't gone pipeline happy the way P4 has, so it doesn't suffer as
badly from pipeline stalls, but it already has three execution cores (and the
requisite monstrous front-end decoding and scheduling stuff to feed those
cores), with no plans to add a fourth core because the third isn't busy
enough that AMD thinks it would help.
Yet manufacturing is going to continue to shrink feature sizes, giving you
effectively more area to work with and a higher transistor budget, which is
about 80% of Moore's Law. Despite gloom and doom predictions since at least
the early 1980's, it has got four or five more doublings to go before we even
have to worry
about increasing the number of layers to get extra space. So where's that
extra transistor budget going to go? Bigger L1 cache? Fun, but the main
benefit of a really huge cache isn't on a UP system, but on SMP. The benefit
of an L1 cache is one of them "integral of somevalue/x" functions, the
increase in which falls off pretty rapidly the bigger it gets: the more bytes
of cache the greater percentage chance your next piece of data will be in
cache, but also the more transistors are guaranteed to be sitting idle
because they do NOT contain data you need this cycle. A point of diminishing
returns definitely exists here.
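
A rough sketch of that curve, assuming the hit rate really does grow with
the log of the cache size (which is what integrating somevalue/x gives you).
The function names and every constant here (80% hit rate at 8 KB, 3% per
doubling) are made up for illustration, not measurements of any real chip:

```python
import math

def hit_rate(size_kb, base_kb=8, base_rate=0.80, gain_per_doubling=0.03):
    """Hypothetical model: every doubling of the cache adds a fixed bump."""
    rate = base_rate + gain_per_doubling * math.log2(size_kb / base_kb)
    return min(rate, 1.0)  # a hit rate can't exceed 100%

def gain_per_kb(size_kb):
    """Hit rate bought per KB added when doubling the cache from size_kb."""
    return (hit_rate(2 * size_kb) - hit_rate(size_kb)) / size_kb

for size in (8, 16, 32, 64, 128, 256):
    print(f"{size:4d} KB  hit rate {hit_rate(size):.2f}  "
          f"gain per added KB {gain_per_kb(size):.5f}")
```

Under this model each doubling buys the same bump in hit rate but costs
twice as many transistors as the last one, so the payoff per transistor
halves every time.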
You also can't make the chips infinitely small because it's not worth the
money (manufacturing expense). Beyond a certain point, more chips per wafer
aren't that much cheaper because the taxidermy to test, cut, connect, mount,
package, and ship them becomes a bigger percentage of the cost. And you
still want to amortize your factory, so reducing per-chip manufacturing
expense won't reduce cost noticeably anyway as long as new manufacturing
processes require billions to get the new line up and running.
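
To put toy numbers on that, here's a sketch assuming per-chip cost is the
chip's share of the wafer plus a fixed handling cost plus its share of the
factory. The function name and all the figures (wafer cost, handling cost,
fab cost, lifetime volume) are invented for illustration only:

```python
def cost_per_chip(chips_per_wafer, wafer_cost=3000.0, handling_per_chip=10.0,
                  fab_cost=2e9, lifetime_chips=50e6):
    """Hypothetical cost model; none of these defaults are real figures."""
    silicon = wafer_cost / chips_per_wafer    # shrinks as chips get smaller
    amortized_fab = fab_cost / lifetime_chips # paying off the new line
    # Test, cut, connect, mount, package, ship: fixed per chip.
    return silicon + handling_per_chip + amortized_fab

for n in (100, 200, 400, 800):
    print(f"{n:4d} chips/wafer -> ${cost_per_chip(n):.2f} per chip")
```

With these made-up numbers, every doubling of chips per wafer saves half as
much as the previous one, and the cost never drops below the handling plus
factory share no matter how small the die gets.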
Intel is trying to transition to VLIW as a way to use more linked execution
cores to do SOMETHING useful with the extra transistor budget, and
Transmeta's even trying to get VLIW to do something USEFUL, but it's a
different programming model that'll take a while to transition to, and it
requires intelligent compilers finding parallelism in the code, which isn't
easy.
Or, you could use the execution units to run independent threads, which Intel
is ALREADY experimenting with (SMT instead of SMP), but that's really just a
way of backfitting SMP-on-a-die onto the existing linked-core design without
having to redo your program counter and cache circuitry. And again this
requires compilers to catch up, which won't happen for a while, and even then
a programmer could really do a better job of telling the computer what to do.
So the logical thing to do is SMP on a single die (which IBM at least has
been attempting). Not only does it convert transistors into execution speed
efficiently, it allows you to have a flaming big L1 and L2 cache in a way
that's more likely to accomplish something useful.
And THAT'S what makes SMP interesting in the future. To me anyway. I could
be wrong...
Rob