The evolution of the microchip over the past 25 years has been amazing. It has been fueled by what is known as Moore's Law - the number of transistors available on a chip doubles every 18 months because of improvements in chip fabrication techniques. The following table tracks some of Intel's major microprocessors and records some of their attributes. The metric "Instructions per cycle" has improved tenfold over the recorded period. It is often quoted as the most relevant figure, but perhaps more important is overall efficiency - how much throughput do we get per million transistors per cycle? This is shown in the last column of the table. It shows a steady decline from 3.61 for the 8086 to 0.10 for the Pentium III - a factor of 36 reduction in efficiency.
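To make the efficiency metric concrete, here is a minimal sketch that computes instructions per cycle per million transistors. It uses only the figures quoted later in this article for the 386 (0.16 instructions per cycle, about 275,000 transistors) and the Pentium II (1.00 instructions per cycle, 7.5 million transistors). The function name is my own, and the table may have used slightly different inputs, so treat the results as illustrative rather than as rows of the table.

```python
def efficiency(instr_per_cycle, transistors):
    """Instructions per cycle per million transistors (hypothetical helper)."""
    return instr_per_cycle / (transistors / 1_000_000)

# Figures quoted elsewhere in this article.
eff_386 = efficiency(0.16, 275_000)
eff_p2 = efficiency(1.00, 7_500_000)

print(f"386:        {eff_386:.2f}")
print(f"Pentium II: {eff_p2:.2f}")
print(f"ratio:      {eff_386 / eff_p2:.1f}")
```

On these inputs the 386 comes out several times more efficient per transistor than the Pentium II, which is the trend the last column of the table is meant to show.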
An important point to note is that I am quoting system throughput in terms of BogoMips, a confection devised by Linus Torvalds to measure this quality. BogoMips does not include floating point operations, and hence does not reveal the huge investment made in Pentium chip space to accelerate these operations. Including floating point, Pentiums can process two instructions per machine cycle, whereas their BogoMips rating peaks at one. But the BogoMips measure is more relevant for most Linux servers, since they rarely execute floating point instructions.
There are many factors which contribute to this decline. Each chip added more functions. The 286 added protected mode to the 16 bit architecture. The 386 introduced a 32 bit processor and virtual memory management. The 486 included a math coprocessor and an on-chip cache. Successive Pentium models added more parallel processing to speed throughput, and multimedia processing facilities.
Some of these changes were inevitable consequences of scaling up the technology, but many were not. Most significantly, Intel's processors are optimized for the desktop market, which is by far their largest. Desktop users tend to do one thing at a time. Chips optimized for this requirement tend to become muscle-bound monsters, adept at nothing but the short sprint.
Servers on the other hand support hundreds or thousands of concurrently active users. It's easy to spread this workload over multiple slower processors. Aggregate throughput is what counts, not star performances from individual units.
Intel's microprocessors have also been molded to compete for a number of markets such as the power users, who traditionally used UNIX workstations for compute-intensive workloads, and the multimedia junkies who want their PCs to emulate entertainment centers. In the process, the microprocessor has taken on a huge amount of baggage.
All of this may be fine for the "average" desktop user, and that's where Intel's major market lies, so we cannot fault their tactics. But it's bad news for servers, the area where Linux has emerged as a major contender.
Several very large scale parallel processor designs have been based on implementing many simple processors on a single chip and running them as a symmetric multiprocessor (SMP). As of version 2.4, Linux is well able to exploit SMPs with a large number of processors.
Suppose we take a giant leap backwards and implement 386 logic with a more modern technology - say that used in the 400MHz Pentium II. This was introduced in 1998 and has 7.5 million transistors. Of course this is far more than we need for a 386, but we can put many independent 386s on one chip. Let's call it a Poly386. The classic 386 had no cache. Modern chips are so much faster than the rest of the system that cache is mandatory. Let's allocate 20% of the transistors on the chip to a shared level 2 on-chip cache and EPROM.
The classic 386 had about 275,000 transistors. We need to increase that somewhat:
While the 386s are 32-bit processors, their L1 cache lines and the data path linking these to the L2 on-chip cache can be 128 bits wide. To keep things simple, we embed simple direct-mapped (non-associative) caches in each 386. If we allocate 300,000 transistors to each 386 processor then we have enough transistors for twenty. Let's make four of them 387 math coprocessors. Most servers do very little floating point math - embedding a high-power FPU in each 386 is a waste. The following figure shows sixteen 386 processors and four 387 FPUs clustered around a shared cache and EPROM. The 386 processors will share the 387s. In order to do this, each 386 will have to supply the eight floating point registers that normally reside within the 387.
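The transistor budget above can be checked with a few lines of arithmetic. This is only a sketch: the constant and function names are mine, and it assumes, as the text does, that a 387 costs the same 300,000 transistors as a 386.

```python
CHIP_TRANSISTORS = 7_500_000   # Pentium II budget quoted in the article
SHARED_PERCENT = 20            # share reserved for the L2 cache and EPROM
UNIT_TRANSISTORS = 300_000     # allocation per 386 (assumed the same per 387)
FPU_UNITS = 4                  # units turned into 387 coprocessors

def unit_count(chip, shared_percent, per_unit):
    """Number of processor-sized units that fit once the shared area is set aside."""
    available = chip * (100 - shared_percent) // 100
    return available // per_unit

units = unit_count(CHIP_TRANSISTORS, SHARED_PERCENT, UNIT_TRANSISTORS)
cpus = units - FPU_UNITS

print(units, "units:", cpus, "x 386 and", FPU_UNITS, "x 387")
```

Integer arithmetic is used throughout so that the division does not pick up floating point rounding.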
Let's keep the major clock cycle of the chip at 400MHz. In a Pentium II signals spend more time traveling between transistors than they do inside transistors. But each 386 and 387 processor is only 20% of the linear chip size. Signals will take much less time to propagate within these subunits, so each can run at a faster minor clock rate than the overall chip. Let's say they run twice as fast, at 800MHz. We have given each 386 a tiny local cache. When it needs something from the shared cache, it may have to wait two of its minor cycles. Because of this and other inter-processor contention, let's assume that the sixteen 386 processors only get through the work of twelve. And to round it off we observe that a Pentium II completes 1.00 instructions every machine cycle, whereas the 386 completes 0.16. Then the overall throughput of this chip will be 2*12*0.16/1.00 = 3.84, or roughly 4 times that of the Pentium II.
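The throughput estimate is easy to replay with different guesses. The sketch below just multiplies out the same factors; the function and parameter names are my own, and every input is one of the assumptions stated above rather than a measurement.

```python
def relative_throughput(clock_ratio, effective_cpus, ipc_small, ipc_reference):
    """Throughput of the multi-core chip relative to the reference uniprocessor."""
    return clock_ratio * effective_cpus * ipc_small / ipc_reference

# 800MHz vs 400MHz, sixteen 386s doing the work of twelve,
# 0.16 instructions per cycle for the 386 vs 1.00 for the Pentium II.
speedup = relative_throughput(2, 12, 0.16, 1.00)
print(f"estimated speedup: {speedup:.2f}")

# The same formula with no clock advantage, to show how much rides on that guess.
cautious = relative_throughput(1, 12, 0.16, 1.00)
print(f"without the faster minor clock: {cautious:.2f}")
```

The second call shows that the result is directly proportional to the assumed minor clock rate, so the 800MHz guess deserves the most scrutiny.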
A 16-way SMP based on the 386 could clog up handling cache synchronization. With simple cache algorithms, each time a 386 stores data into its local cache (about 1 in every 4 instructions) it would have to broadcast the change to all the other 386s in case they had a copy of the same storage location in their cache. We can reduce the impact of this with "lazy" cache synchronization.
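To get a feel for the size of the problem, here is a back-of-envelope sketch of how many stores the whole chip would generate per 400MHz major cycle. It reuses the clock rates and the 0.16 instructions per cycle assumed earlier, and takes "about 1 in every 4 instructions" literally as 0.25. The function name is hypothetical, and the model ignores everything except raw store counts.

```python
def stores_per_major_cycle(cpus, minor_hz, ipc, store_fraction, major_hz):
    """Chip-wide stores issued per major clock cycle under a naive model."""
    stores_per_second = cpus * minor_hz * ipc * store_fraction
    return stores_per_second / major_hz

# All sixteen 386s running flat out.
peak = stores_per_major_cycle(16, 800e6, 0.16, 0.25, 400e6)

# The same chip with contention reducing it to the work of twelve.
derated = stores_per_major_cycle(12, 800e6, 0.16, 0.25, 400e6)

print(f"peak:    {peak:.2f} stores per major cycle")
print(f"derated: {derated:.2f} stores per major cycle")
```

Under these assumptions the chip produces on the order of one store per major cycle, so broadcasting every store individually would leave little or no headroom.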
Most processors used in SMPs have many registers so that intermediate results do not have to get stored into memory, triggering needless cache synchronization. The 386 has few registers - it uses a stack for intermediate results. What happens in the stack is private to the process and never needs to be shared with others, but the 386 maps its stack straight into memory where cache synchronization would kick in. We can define a small buffer - just a few cache lines in size - to carry the top of the stack. When the stack bubbles up into a new cache line we will have to write the prior contents of this line to L1 cache, but otherwise changes to the stack need not be reflected into L1 cache until the processor switches to another process and changes the stack pointers. This would greatly reduce cache traffic.
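The effect of such a top-of-stack buffer can be illustrated with a toy simulation. This is a sketch under loud assumptions, not a hardware design: the buffer size of four lines, the eviction rule (drop the line farthest from the new top of stack) and the workload are all invented for illustration. Only the 16-byte (128-bit) line size comes from the text.

```python
LINE_BYTES = 16     # 128-bit cache lines, as proposed earlier
BUFFER_LINES = 4    # "a few cache lines": the exact number is an assumption

class StackBuffer:
    """Toy model of a private buffer holding the top of the stack."""

    def __init__(self, capacity=BUFFER_LINES):
        self.capacity = capacity
        self.lines = set()    # dirty cache-line numbers currently buffered
        self.l1_writes = 0    # lines written back to the L1 cache

    def write(self, address):
        line = address // LINE_BYTES
        if line in self.lines:
            return            # absorbed by the buffer, no L1 traffic
        if len(self.lines) >= self.capacity:
            # Assumed policy: evict the line farthest from the new top of stack.
            victim = max(self.lines, key=lambda held: abs(held - line))
            self.lines.remove(victim)
            self.l1_writes += 1
        self.lines.add(line)

    def context_switch(self):
        # On a process switch every buffered line goes back to L1.
        self.l1_writes += len(self.lines)
        self.lines.clear()

def run_demo():
    buf = StackBuffer()
    sp = 0x1000
    stack_writes = 0
    for _ in range(100):          # many shallow calls: push 8 words, pop 8 words
        for _ in range(8):
            sp -= 4
            buf.write(sp)
            stack_writes += 1
        sp += 8 * 4
    for _ in range(40):           # one deep call that outgrows the buffer
        sp -= 4
        buf.write(sp)
        stack_writes += 1
    sp += 40 * 4
    buf.context_switch()
    return stack_writes, buf.l1_writes

writes, l1_writes = run_demo()
print(writes, "stack writes caused", l1_writes, "L1 line writes")
```

In this made-up workload the shallow calls never leave the buffer, and only the deep call and the final process switch generate L1 traffic. Real code will behave differently, but the shape of the saving is the point.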
When programs running under a multithreaded OS like Linux share storage they can't assume that they have exclusive access to it. They have to use kernel calls to regulate their access. And when multiple processors in an SMP share a common memory even the kernels can't assume that they will have exclusive access to key data. They must use a specialized SMP instruction to seize the lock that manages a data area before touching it, and another to release the lock afterwards. We can use these facts to reduce cache synchronization overheads. A dedicated bus is provided to handle one cache synch request each major cycle. Each L1 cache will keep a list of its cache lines that need to be broadcast, and bid for the bus when it has a change to broadcast, but it will allow its 386 to run ahead within limits. If the L1 change list gets full, the 386 will have to wait until an entry gets freed up. If the 386 program generates an interrupt to invoke any kernel service, or issues the SMP instruction to seize or release a data area lock, it will have to wait until its L1 cache has flushed the backlog of cache updates. Allowing the 386s to run ahead will allow them to apply multiple updates to each cache line before it gets broadcast. This is much more efficient than the classic approach of forcing a broadcast each time any part of a cache line changes.
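The run-ahead scheme can be sketched as a small model of one L1 cache and its change list. The class, the list size of four and the sample store sequence are assumptions made for illustration. The model only counts broadcasts and stalls, and does not attempt to model bus arbitration between the sixteen caches.

```python
class LazyL1:
    """Toy model of an L1 cache that defers and coalesces sync broadcasts."""

    def __init__(self, list_size=4):
        self.list_size = list_size   # capacity of the change list (assumed)
        self.pending = []            # distinct dirty lines awaiting broadcast
        self.broadcasts = 0
        self.stalls = 0

    def store(self, line):
        if line in self.pending:
            return                   # coalesced with an update already queued
        if len(self.pending) == self.list_size:
            self.stalls += 1         # list full: the 386 waits for a free entry
            self.bus_grant()
        self.pending.append(line)

    def bus_grant(self):
        # The sync bus carries one request: broadcast the oldest pending line.
        if self.pending:
            self.pending.pop(0)
            self.broadcasts += 1

    def sync_point(self):
        # Kernel interrupt or lock seize/release: flush the whole backlog.
        while self.pending:
            self.bus_grant()

def run_demo():
    l1 = LazyL1()
    stores = [1, 1, 2, 1, 2, 2, 3, 3]   # cache-line numbers touched by stores
    for line in stores:
        l1.store(line)
    l1.sync_point()                      # e.g. the program releases a lock
    return len(stores), l1.broadcasts, l1.stalls

stores, broadcasts, stalls = run_demo()
print(stores, "stores ->", broadcasts, "broadcasts,", stalls, "stalls")
```

Eight stores to three distinct lines cost three broadcasts here, where broadcasting on every store would cost eight. The saving depends entirely on how often a program rewrites the same lines between synchronization points.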
A four-fold increase in throughput is great, but the other benefits of the Poly386 are more attractive. A large, complex monochip like a Pentium requires a huge investment of time, effort and money to design. This cost locks out all but the most determined and well-heeled competitors, reducing innovation and progress. The Poly386 is based on cookie-cutter design. Starting with the well-known 386 and 387 chips as a base, the design of the new components would be fairly simple. And once the few basic engines have been designed and built, the rest of the chip is filled with replicas.
A Poly386 implemented using Pentium II technology would not compete with a Xeon for floating point throughput, but then it doesn't have to. The great majority of Linux servers are used for routine tasks such as file and print servers, email servers, and http servers. None of these applications has significant floating point requirements. New areas that Linux is breaking into include web application servers and database servers. Linux recently topped the TPC-H charts for best database transaction throughput. But even in these environments, floating point instructions are a small minority. X servers burn floating point, but they have no business running on Linux servers - they should be on desktops.
Floating-point intensive workloads can be run on Beowulf clusters using SPARC or PowerPC chips, whose designers have invested much effort in accelerating floating point.
If these arguments are valid for the Pentium II with its 7.5M transistors, they are even more so for more modern chips such as the Pentium 4 with its 42M transistors. Exploiting all of those in a uniprocessor design is a nightmare, whereas scaling up Poly386 to about 100 processors on a chip does not require much imagination. Granted, a 100-way SMP won't work, but other arrangements such as NUMA and Beowulf could be exploited. As Moore's law continues to unfold, the advantages of the Poly386 over uniprocessor chip design will continue to grow.
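The "about 100 processors" figure follows if the same proportions are carried over to the larger chip. The sketch below assumes, without any support beyond the earlier Pentium II example, that 20% of the transistors still go to shared cache and that each processor still takes 300,000 transistors. A real design would probably shift both numbers.

```python
P4_TRANSISTORS = 42_000_000   # Pentium 4 budget quoted in the article
SHARED_PERCENT = 20           # carried over from the Pentium II example (assumption)
UNIT_TRANSISTORS = 300_000    # carried over from the Pentium II example (assumption)

available = P4_TRANSISTORS * (100 - SHARED_PERCENT) // 100
p4_units = available // UNIT_TRANSISTORS

print(p4_units, "processor-sized units")
```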
It's time some chip manufacturers started addressing the needs of the Linux server market directly instead of following the fashion lead set by Intel for desktops. The Linux server market is now big enough to merit serious investment, and beating Pentiums in this arena would be easy.