A Proposed High-Speed Computer Design

Trevor Turton
trevor@turton.co.za

First published in the ACM's Computer Architecture News, a bi-monthly publication of the Special Interest Group on Computer Architecture (SIGARCH) Volume 7 Number 10 on 15th October 1979.
Reprints are available from University of Washington Libraries http://rss.lib.washington.edu/rss.
This version was published at http://www.turton.co.za/pubs/virtualMultiprocessor.html on 2004-07-30. It has been retyped and the diagrams redrawn, but is otherwise unchanged.
Many of the ideas described in this paper have been realised in Intel's Hyper-Threading hardware.

Abstract

The designs of several high performance general purpose computers built during the last two decades are examined.  The performance limitations that they encountered and their approaches to overcoming these problems are discussed.  An alternate design approach is described which avoids these problems.  This is to have the computer support the simultaneous execution of several independent programs by multiprogramming its various components at the sub-instruction level.  Estimates of its performance are made, and the implications for the Operating System are examined.

The principles discussed are generally applicable, but to make the proposals more concrete the hardware implications are worked out in some detail for the IBM System/370 architecture.

1.0  Limitations on Computer Performance

The modern high-performance computer consists of several components.  One can identify registers, storage, instruction decode units, instruction execute units (including adders, multipliers, shifters, etc) and data channels for communication with I/O devices.  These components may not be distinct, but their functions are, and control circuitry must be designed for each function supported.  In faster machines the different functions have separate circuits.

Technology improvements have speeded up these circuits enormously, but a problem remains in that each instruction must go through a series of these functions one at a time.  For instance to perform an add, the computer will have to fetch the instruction from storage, decode it, generate the operand address, fetch the operand, and route the operation via the adder to its destination register.  If these operations are performed strictly in sequence, each individual function is utilized only a small fraction of the time and overall systems throughput is poor.

The following table illustrates the problem.  Some common CPUs are shown with their approximate instruction processing rate in millions of instructions per second (MIPS), and major machine cycle times.  The average number of cycles per instruction completed is also presented.  Each of these machines is capable of decoding and issuing an instruction every cycle, but their execution rates generally fall far short of this.
Machine                     Cycle time (ns)   MIPS    Cycles/Inst
CDC Cyber 175                     25           5.3        7.5
CDC Cyber 176                     27.5         8.1        4.5
Univac 1100/81                   200           1.25       4.0
IBM 3031                         115           1.1        7.9
IBM 3032                          80           2.5        5.0
IBM 370/195 - scientific          54          12          1.5
IBM 370/195 - online              54           3          6.2
Cray-1 - scalars                  12.5        20          4.0
Cray-1 - vectors                  12.5       140          0.6
Note that the MIPS rates quoted are unauthenticated numbers drawn from the literature.  MIPS rates cannot be equated directly to performance, since the amount of work done by each instruction varies widely from one machine architecture to another.
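The Cycles/Inst column follows arithmetically from the other two: cycles per instruction = 1 / (cycle time × instruction rate).  A quick check of the table's figures (a modern illustrative sketch; the machine list is abbreviated):

```python
def cycles_per_instruction(cycle_time_ns, mips):
    """CPI = 1 / (cycle time in seconds * instructions per second)."""
    return 1.0 / (cycle_time_ns * 1e-9 * mips * 1e6)

# A few rows from the table above: (cycle time in ns, MIPS)
table = {
    "CDC Cyber 175": (25.0, 5.3),
    "Univac 1100/81": (200.0, 1.25),
    "IBM 370/195 (scientific)": (54.0, 12.0),
    "Cray-1 (vectors)": (12.5, 140.0),
}

for machine, (cycle_ns, mips) in table.items():
    print(f"{machine}: {cycles_per_instruction(cycle_ns, mips):.1f} cycles/instruction")
```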

2.0  Common Solutions to Performance Problems

The problem of low utilization of CPU components is very similar to the situation which arose with early computers.  Every time the CPU required I/O, it would initiate it and then wait for the operation to complete before proceeding, this despite the fact that the I/O unit involved would wait idle while the CPU handled previous input.  This problem was largely overcome by overlapping CPU and I/O operations.  I/O buffers were provided in CPU storage to allow input devices to read in data ahead of the program's needs, and to allow program output to stack up and await the services of the output devices.  This allowed the CPU, input and output devices to operate simultaneously instead of serially, and substantial performance improvements resulted.

2.1  Instruction Fetch and Execute Overlap

This same technique of buffering has been used with considerable success within CPUs.  On most medium to high speed machines, instruction fetch and execute are separate and concurrent functions.  Instructions are fetched into instruction buffers ahead of being needed, then executed when the execution unit becomes available.  Since main storage is considerably slower than the CPU execution units, this approach reduces the amount of time that the CPU has to wait for storage.  Execution units have speeded up over the years, and to keep them reasonably busy now demands that several instructions be prefetched, and also any operands that they might need to access in storage.  Unfortunately this technique cannot be extended indefinitely to provide more and more performance.  Instruction streams contain conditional branches, and only once they are executed does the CPU know which instruction stream to prefetch.  IBM produced a high-speed computer in the early 1960's called the 7030 or Stretch, which attempted to achieve high performance through extended instruction prefetch.  Many jobs were found to have a high conditional branch content, and these prevented the machine from achieving its target throughput.

The IBM 3033 processor design tackles this problem by an improved ability to handle conditional branches.  It predicts more accurately the outcome of each branch, and can prefetch instructions for up to two alternate branch paths, by having three sets of instruction buffers.  A major hardware investment is made to try to ensure that there is always a queue of instructions waiting for the execution unit to work on, with their operands prefetched.  Despite this, and the fact that the execution unit completes most instructions in one or two cycles, the 3033 completes on average only one instruction every four cycles.

2.2  Core Storage Buffering

On larger CPUs, the relative slowness of main storage is a major limiting factor on performance.  A sophisticated system of storage buffering was developed for the System/360 Model 85 to alleviate this problem.

High speed buffers are inserted between the CPU and main storage so that the instruction execution units can operate directly to and from these buffers, providing they contain the instructions and operands that the CPU requires.  The CPU cannot know which areas of main storage to buffer in advance.  It relies on fetching on demand.  This results in occasional delays, but almost all programs iterate in confined storage areas, so subsequent iterations have their data immediately available in the buffers.  In practice more than 90% of storage references are satisfied by the high speed buffers.  This approach has been widely used on subsequent machines.
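The value of the buffers can be estimated with the usual effective-access-time calculation.  The sketch below is illustrative: only the 90% hit ratio comes from the text, while the buffer and main store timings are assumed figures, not Model 85 measurements:

```python
def effective_access_time(hit_ratio, buffer_time_ns, main_store_time_ns):
    """Average storage access time given a buffer hit ratio."""
    return hit_ratio * buffer_time_ns + (1.0 - hit_ratio) * main_store_time_ns

# Assumed illustrative timings: 80 ns buffer, 960 ns main store, 90% hit ratio.
avg = effective_access_time(0.9, 80.0, 960.0)
print(f"average access time: {avg:.0f} ns vs {960.0:.0f} ns unbuffered")
```

Even a modest hit ratio cuts the average access time to a fraction of the raw main store time, which is why the technique spread to subsequent machines.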

2.3  Multiple Instruction Overlap

Other attempts at improving processor throughput were made by the simultaneous execution of several successive instructions.
2.3.1  The CDC 6600
In the mid 1960s CDC developed a new approach to the problem of CPU throughput for their scientific processor, the 6600.  This machine incorporated an instruction preparation unit and nine autonomous execution units, each specialized to handle only a subset of instruction types.  The instruction preparation unit was able to decode and issue one instruction every machine cycle.  Each instruction was routed to the execution unit appropriate to its needs.  Execution of these instructions then proceeded independently in each execution unit.  A high degree of execution overlap was possible.  This same design approach has been continued in the CDC Cyber 170 series.  Most of the execution units are pipelined so that they can accept a new instruction every machine cycle.  The execution units have the combined capacity to execute about 7.5 operations per machine cycle, even though the supply of operations can never exceed one per cycle.

On selected hand-coded programs the CDC 6600 could approach one instruction completed per machine cycle, but on general jobstreams including online work the throughput dropped to one instruction every four to six cycles.  This implied that the instruction preparation unit was only 20% utilized, and the combined execution unit capacity only about 3% utilized.  The reason why this architecture could not achieve a consistent one instruction per machine cycle rate was that in practice successive instructions in a program are not independent.  Instructions require operands which are produced by preceding instructions.  Unrelated operations have to make use of the same limited number of machine registers, which introduces apparent dependencies.  Conditional branches cause potential forks in the instruction stream, which can only be resolved when the branch has completed.  The 6600 computer incorporated a "reservation control" station, which arrested the decode and issue of instructions whenever a conditional branch or operand interdependence was detected, until the dependence condition was resolved.

Another seminal idea was introduced with the 6600.  It had ten peripheral processors (PP) to manage all its I/O operations.  To keep their costs down these machines had very simple logic, and they shared a common arithmetic unit (AU).  Each PP had access to the AU once every ten cycles, when it could input an arithmetic operation with its operands.  The result would be made available to the requestor ten cycles later.  The AU was pipelined so that each PP could have an operation proceeding simultaneously.
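This PP arrangement is an early form of barrel processing, and its fixed rotation is easy to sketch.  The toy model below uses made-up operands and a simple add; only the ten-PP, ten-cycle figures come from the text:

```python
from collections import deque

NUM_PPS = 10     # ten peripheral processors share one arithmetic unit
PIPE_DEPTH = 10  # a result emerges ten cycles after its operands enter

pipeline = deque([None] * PIPE_DEPTH)  # the AU pipeline, one stage per cycle
results = {}

# Each PP submits one add (illustrative operands) when its slot comes around.
requests = {pp: (pp, pp * 2) for pp in range(NUM_PPS)}

for cycle in range(NUM_PPS + PIPE_DEPTH):
    pp = cycle % NUM_PPS               # whose AU slot is this cycle?
    finished = pipeline.popleft()      # the operation completing this cycle
    if finished is not None:
        owner, value = finished
        results[owner] = value         # result returned to the requesting PP
    if cycle < NUM_PPS:
        a, b = requests[pp]
        pipeline.append((pp, a + b))   # operation enters in this PP's slot
    else:
        pipeline.append(None)

print(results)  # each PP's sum, available ten cycles after issue
```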
2.3.2  The IBM 360/91 and 370/195
During the late 1960s IBM developed a new model in the 360 range to cater for high performance scientific applications.  This machine, the model 91, extended the principles developed for the CDC 6600 (Reference 1).  Instruction decode, fixed point arithmetic, floating point add/subtract, and floating point multiply/divide were all separate autonomous units.  Also separate were main storage read/write control and instruction prefetch.  If no overlap was realized in this machine, overall performance would have been poor.  However, on the Model 91 several instructions could be executed simultaneously.  The machine did not wait for one instruction to complete before decoding and scheduling the next.  Sixteen words of instructions could be prefetched and executed at the same time.  Although each individual instruction did not process very quickly, the high degree of overlap allowed fair utilization of all the instruction units in the Model 91 and total systems performance was good with suitable applications.

Performance limitations arose with the Model 91 for the same reason that they did with the CDC 6600.  In general, instructions in System/360 interdepend upon one another.  The operands of one instruction are frequently derived from a previous instruction.  This forced the Model 91 to observe some ordering of instruction execution.  Individual instructions would be delayed until the operands that they needed became available, even though the instructions following them might be executing.  Extensive and sophisticated circuitry was required to detect these interdependencies and keep track of which operands became available when, and where they were required.  Independent code sequences frequently made use of the same machine registers one after the other, since there are a limited number of these.  An ingenious algorithm was derived by Tomasulo to rationalize these dependencies and eliminate false dependencies in the floating point execution units.  Even so, the performance of the Model 91 dropped from one instruction per 1.5 cycles on scientific work to one instruction per 6 cycles when it had to handle online workloads with a high interrupt and conditional branching content.

The problem of handling conditional branches was just as severe for the Model 91 as for the 6600.  When a branch was met in the instruction stream, none of the instructions following it could be completed until the CPU knew whether the branch would be taken or not.  Exactly the same situation arose when the Model 91 had to take machine interrupts.  These are equivalent to conditional branches occurring at unpredictable moments.  The Model 91 attempted to reduce this disruption by hedging its bets.  It prefetched instructions past the branch, but also prefetched four words of instructions from the branch target to use in case the branch was taken (a two-way Stretch action).  Even so, if the branch were taken the new instruction stream had to be decoded and issued to the execution units, which idled in the interim.

The System/360 Model 195 was essentially a combination of the Models 85 and 91, having both memory buffering and concurrent instruction execution.

The concurrent instruction execution in the Models 91 and 195 was achieved at the cost of extensive control and checking circuitry.  An estimated 60% of the Model 91 circuitry existed only to detect and resolve interdependence problems.  In a normal operating environment these CPUs lost a lot of performance to instruction interdependence holdups, conditional branches, and interrupts.  The Model 195 execution units had enough power to execute 70 million instructions per second (MIPS), but the instruction preparation unit could issue only 18 MIPS.  On selected scientific applications the machine achieved 14 MIPS, but on general jobstreams this dropped substantially.  When running the Airline Control Program (ACP) which supports online transaction processing, its throughput dropped to about 3 MIPS (Reference 3).

The problems encountered with conditional branching and interrupt handling described above have become more severe with the passing of time.  Computer workloads have become more and more online and demand driven.  This has resulted in an increasing percentage of conditional branch execution, and the attempts of the Models 91 and 195 to achieve concurrent execution of many instructions have become increasingly ineffective.

2.4  Parallel and Pipeline (Array) Processing

Other high-speed computers are designed to perform instructions over complete arrays at a time.  Examples are ILLIAC IV, CDC STAR and CRAY-1.  An Attached Vector Processor is available on IBM machines.  These devices are well suited to applications involving array operations, but are not cost-effective when normal scalar oriented workloads are involved.  For example, the CRAY-1 processing a scalar workload completes about one instruction every four cycles, while it has enough execution power to complete 12 operations every machine cycle.  However given a suitable program of vector arithmetic it can get multiple execution units busy concurrently, and complete 1.75 operations per cycle (Reference 5).

2.5  Multiprocessing

The conventional approach to multiprocessing is to connect more than one CPU to a common memory, for instance the Burroughs D825, Univac 1100/82 or IBM 3033MP.  Since both cost and performance go up together, no net improvement in the performance/cost ratio is achieved.  Indeed the overall throughput of such systems is less than the sum of their parts, since some contention and mutual interference must occur.  Multiprocessing in this way is usually employed to ensure systems availability by redundancy in all critical systems components.

3.0  Characteristics Required for Maximum Throughput

We have discussed the problems encountered in achieving high performance on a uniprocessor CPU.  These CPUs are able to decode and issue one instruction per major machine cycle, and have enough fire-power in the supporting execution hardware to process about four to ten times as many instructions as they can issue.  However, when confronted with an online workload, they complete about one instruction every four major machine cycles.  This is far short of their one instruction per cycle issue rate, and the net utilization of their execution units is plainly poor - these have the capacity to process 25 or more times the number of operations that they actually perform.  This applies to the System/360 Models 91 and 195, and the CDC Cyber 175 and 176.  Clearly it is a general problem.

What is needed is an architecture where the CPU can process various types of workload including online work, and still get close to the target of completing one instruction per major machine cycle, without requiring engineering overkill in the execution units.  Their aggregate fire-power should more nearly approximate to the effective instruction execution rate.  Likewise the extensive instruction prefetch of the IBM 3033, with its complex logic when conditional branches and interrupts are encountered, entails hardware complexity and cost which would be better avoided.

An architecture that approaches this ideal is described in the following section.

4.0  Proposed Approach - the Virtual Multiprocessor (VMP)

The preceding sections have described the designs of several high performance computers.  Since individual instructions take a relatively long time to execute, these machines work on several at once.  To do this they contain many units that operate in parallel.  Even on the workloads for which they are optimized, these units are poorly utilized.  The general trend in computer workloads is towards more online transaction driven applications, with a higher conditional branch and interrupt content.  This frustrates the high performance computers' attempts to organize their workload efficiently.

A better approach is to accept the fact that each individual instruction takes a relatively long time to complete, and achieve the required throughput by executing instructions from many independent programs concurrently.  To do this, the computing system has to have enough resources to sustain several concurrent programs.  It has to emulate multiple independent conventional CPUs, one per active program.

The benefit of this approach is that all the instructions being worked on at any moment will belong to separate independent programs.  No effort need be invested in controlling the sequence in which they execute.  If one particular program is held up by, for example, a buffer fault, the execution units can proceed with the processing of other programs, and so keep themselves well utilized, and overall systems throughput high.  The proposed system need have less execution unit capacity than current high-performance uniprocessors, but these units can achieve far better utilization, and greater throughput should result.
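The throughput argument can be made concrete with a toy model: a single program with an n-cycle instruction latency completes one instruction every n cycles, while enough independent programs can fill every issue slot.  The five-cycle latency below is an illustrative assumption:

```python
def throughput(num_programs, instruction_latency_cycles, total_cycles):
    """Instructions completed when issue slots rotate among independent programs.

    Each program has at most one instruction in flight, so a single program
    completes one instruction per `instruction_latency_cycles`; with enough
    independent programs, one instruction completes every cycle.
    """
    issues_per_cycle = min(num_programs / instruction_latency_cycles, 1.0)
    return int(total_cycles * issues_per_cycle)

LATENCY = 5  # assumed cycles from issue to completion
print(throughput(1, LATENCY, 1000))  # conventional uniprocessor: 1 in 5 cycles
print(throughput(5, LATENCY, 1000))  # five virtual CPUs keep every cycle busy
```

The execution units need no interdependence checking in this model, because the instructions in flight always belong to different programs.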

This approach can be likened to the IBM VM/370 operating system which simulates many independent computers on one uniprocessor, so that each user appears to have his own system.  The proposed CPU is multiprogrammed at the sub-instruction level, to emulate multiple virtual CPUs.

Another benefit can be gained by executing several programs concurrently.  Individual programs tend to concentrate on only a subset of instruction types, such as floating point arithmetic, or character manipulation.  This results in a subset of the execution logic being heavily used while the rest is underutilized.  This problem can be solved by running a mix of program types on VMP.

4.1  General Operation of the VMP

VMP consists of a number of interconnected components, each performing specialized functions.  This system emulates multiple CPUs, each of which executes a separate program one instruction at a time.  The instructions undergoing execution move from one component of VMP to another, obtaining the services that they need to complete execution.  As each instruction completes, it triggers the fetch, decode and issue of the next instruction of its program.  No attempt is made to prefetch instructions or operands for any of the virtual CPUs.

The functional components that VMP requires are an Instruction Fetch and Decode unit, an Integer and Logic unit, a Floating Point unit, a Storage-to-Storage unit, and a Main Store Control unit.

When an instruction is decoded, its opcode is converted to a string of routing and control bits, and an instruction parcel is issued by the Instruction Fetch and Decode unit.  This parcel contains the identity of the issuing virtual CPU, the routing and control bits, and the instruction's operands or operand addresses as they become available.
Every instruction parcel issued progresses independently through the network of components in the sequence determined by its routing bits.  The components operate on the parcels to perform the services selected by their control bits.  The parcels are modified by each component before emerging.  Routing and control bits are stripped off as they are acted upon, to expose the subsequent routing and control bits.  Once an instruction has received all the services it needs to complete, its routing bits will take it back to the Instruction Fetch and Decode unit, so the next instruction of that program can be issued.

A component may modify the routing and control bits of an instruction parcel to alter its pre-planned routing.  For example, a Floating Add instruction obtaining its operand from Main Store Control may cause a storage fetch violation.  In this case Main Store Control would reroute the parcel back to Instruction Fetch and Decode, with control bits set to direct a program check PSW load, whereas the original routing would have taken the parcel to the Floating Point unit.
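The routing mechanism can be sketched as a parcel carrying an ordered list of component codes, each stripped off as the corresponding component acts on it.  The sketch below follows the AE (add, storage to register) route of section 4.5 using the Figure 1 lettering; the field names and component actions are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Parcel:
    virtual_cpu: int        # which virtual CPU issued this instruction
    route: list             # remaining components to visit, front first
    data: dict = field(default_factory=dict)

def step(parcel, components):
    """Deliver the parcel to the next component on its route."""
    unit = parcel.route.pop(0)  # strip the routing entry as it is acted upon
    components[unit](parcel)
    return parcel

# Hypothetical component actions for an AE instruction:
def address_arith(p): p.data["address"] = p.data["base"] + p.data["disp"]
def main_store(p):    p.data["operand"] = 42           # pretend storage fetch
def float_add(p):     p.data["register"] += p.data["operand"]
def fetch_decode(p):  p.data["iar"] += 4               # issue next instruction

components = {"C": address_arith, "E": main_store,
              "D": float_add, "A": fetch_decode}

p = Parcel(virtual_cpu=3, route=["C", "E", "D", "A"],
           data={"base": 100, "disp": 8, "register": 1.0, "iar": 0x200})
while p.route:
    step(p, components)
print(p.data)
```

A reroute, as in the storage fetch violation example, would amount to a component replacing the remainder of `route` with a different component list.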

The idea of multiprogramming a CPU at the sub-instruction level is very similar to the principle of multiprogramming a byte multiplexor I/O channel, as is common on many computers.  The channel executes many channel programs concurrently, one for each active I/O device attached to it.  The channel multiplexes itself from one channel program to the next to meet the demands of the devices, resuming channel commands (instructions) part-way through their execution.  This approach has proven very viable for channels, and the same approach can be applied to program instruction execution.

4.2  Registers to Support Virtual CPUs

To support each virtual CPU emulated the base machine would need to provide distinct sets of addressable registers, instruction address registers, and processor state registers and flags.  The details depend on the machine architecture selected.  By way of example, the resources needed to implement the System/370 architecture are detailed in this section.  For each virtual CPU the base machine must hold 16 general purpose registers, 4 double-word floating point registers, 16 control registers, and a Program Status Word including the instruction address.
Layout of major computer components
Figure 1.  Organization of Major VMP Components for System/370 Architecture

Layout of storage management components
Figure 2.  VMP Main Store Control Unit for System/370 Architecture

4.3  Organization of the Major Components of VMP

A possible layout of the major components of VMP and their connections is shown in Figures 1 and 2.  Each major component includes the logic to provide some instruction service, and the registers associated with this service.
These major components are interconnected by data paths which allow the movement of instructions from component to component so that each instruction can obtain the services that it needs.

4.4  Execution of Storage-to-Storage Instructions

System/370 architecture includes many instructions which process multi-word operands, such as multiple-register loads and stores, and character vector moves and compares.  These are handled in VMP by the Storage-to-Storage unit (SSU).  This unit uses the Integer and Logic unit to calculate the operand starting addresses, then uses the same unit iteratively to process the multi-word operands one word at a time.

The SSU reconstitutes the routing and control bits of the instruction parcel on each iteration to ensure that the parcel keeps recirculating through the functional units until its operands have been completely processed.  It contains two operand address registers and a length register for each virtual CPU, and an adder to update them as the operands are processed.  The Integer and Logic unit contains an operand buffer for each virtual CPU to support the processing of multi-word operands.
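The recirculation can be pictured as a loop that moves one word and counts down the SSU's length register on each pass.  A toy sketch, with memory modelled as a word-addressed dictionary (the function name and layout are illustrative):

```python
def storage_to_storage_move(memory, src, dst, length_words):
    """Move a multi-word operand one word per pass, as the SSU would
    recirculate the parcel through the functional units."""
    passes = 0
    while length_words > 0:
        memory[dst] = memory[src]  # one word fetched and stored per pass
        src += 1                   # the SSU's operand address registers
        dst += 1                   #   are updated by its adder
        length_words -= 1          # the length register counts down
        passes += 1
    return passes

memory = {0: 11, 1: 22, 2: 33, 10: 0, 11: 0, 12: 0}
passes = storage_to_storage_move(memory, src=0, dst=10, length_words=3)
print(memory, passes)
```

Between passes the parcel could be suspended (for MVCL and CLCL, to take a pending interrupt) and resumed, since the address and length registers capture all the state needed.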

During the execution of the interruptible instructions MVCL and CLCL, the SSU will detect a hardware interrupt pending condition should one arise, and will route these instruction parcels back to instruction fetch and decode to allow interrupts to be taken.

4.5  Examples of Instruction Routes through VMP

To illustrate how VMP would function, some examples are given to show the routes which various instructions would take through the network of major components.  Each example starts with issue from Instruction Fetch and Decode.

The lettering in parentheses in the following examples indicates the major component involved in each operation, using the lettering of Figure 1.  IAR denotes Instruction Address Register.
  1. AER - Floating Point add to register
    (A)  Decode Instruction, issue parcel
    (D)  Floating Point add
    (A)  add 2 to IAR

  2. AE - Floating Point add, storage to register
    (A)  Decode Instruction, issue parcel
    (C)  calculate operand address
    (E)  fetch operand
    (D)  Floating Point add
    (A)  add 4 to IAR

  3. STE - Floating Point store
    (A)  Decode Instruction, issue parcel
    (C)  calculate store address
    (D)  get register contents, issue with address
    (E)  store register contents in Main Store
    (A)  add 4 to IAR

  4. Branch
    (A)  Fetch and decode Instruction.
    (C)  calculate branch target address
    (A)  place branch target in IAR
           Else branch is not taken, route to:
    (A)  add 4 to IAR

  5. SVC - Supervisor Call
    (A)  Fetch and decode Instruction,
           issue SVC old PSW address
    (E)  store old PSW
    (A)  issue SVC new PSW address
    (E)  load new PSW
    (A)  replace IAR

  6. MVC - Move two words storage to storage
    (A)  Decode Instruction, issue parcel
    (B)  save operand length(s)
    (C)  calculate target operand address
    (B)  save target operand address
    (A)  get source operand address
    (C)  calculate source operand address
    (B)  save source operand address
    (E)  fetch first word of source operand
    (B)  output source operand, and target address
    (E)  store first word at target address
    (B)  increment and output source operand address
    (E)  fetch second word of source operand
    (B)  output source operand, and new target address
    (E)  store second word at target address
    (A)  add 6 to IAR

  7. BAL - Branch and Link
            (save return address and branch)
    (A)  Decode Instruction, issue IAR in parcel
    (C)  save IAR in nominated register
    (A)  re-issue parcel with branch address
    (C)  calculate branch target address
    (A)  place branch target in IAR

5.0  Performance Constraints and Proposed Solutions for VMP

We have seen in Section 1 that current high performance computers execute on average only one instruction every three to six machine cycles, despite large investments in logic circuits to try to speed them up.  The performance goal for VMP is to approach one instruction completion per cycle with minimal investment in circuitry to detect and resolve instruction interdependencies.  This target imposes certain design constraints which are described below.

5.1  Implementation of Register Storage

The most straightforward way of implementing registers in VMP would be as a single high speed storage array.  This would however significantly limit the performance of VMP.  Every instruction execution requires access to its instruction counter for readout and update, access to registers for address arithmetic, and access to registers for operands.  A single store array would only be able to service one such request per cycle, and so would limit throughput to one instruction every three to five cycles.

The approach proposed for VMP is to segregate the registers by type, and incorporate each type within the functional unit which references it most frequently.  Several register accesses could then proceed in parallel, one in each functional unit, removing the single-array bottleneck described above, and each register set would sit close to the logic which uses it.

5.2  Interconnection of VMP Components

The components of VMP cannot be interconnected via a common bus.  Although a common bus leads to a very simple design, the average instruction in System/370 architecture would need to be transported three to four times.  Each transport would take a full cycle of bus time, so overall VMP throughput would be limited to one instruction per three or four cycles, which is far below the performance target.  For this reason, separate bussing is needed between the components which communicate with each other.

The concept of VMP is that no major component can accept more than one parcel per cycle, and that those components which are heavily used such as main store control, and instruction fetch and decode, should be able to accept a new parcel every machine cycle.  Should they require longer than one cycle to service a request, pipelining and storage interleaving is assumed.  Since each component has multiple possible predecessors, it may be presented with more than one parcel per cycle.  It will only be able to accept one, so the other parcels will have to wait.  This implies that each major component may have to buffer parcels it has completed, before being able to issue them.  The components will be able to continue accepting and processing new parcels as long as their output buffers are not all full (or committed to work in progress in their pipelines).

For some operations the amount of data in a parcel may exceed the width of the data paths of VMP, for example a store operation with an operand and its address.  When this occurs, transmission of the instruction parcel takes place over two or more consecutive cycles.

5.3  Performance Prediction for VMP

Given the above design approach, the maximum throughput rate which VMP could sustain would be one instruction every major machine cycle.  However with normally encountered workloads the design shown in Figure 1 will not sustain this rate.  Given assumptions on the instruction mix to be executed by VMP, one can model the system as a Markov process and use the state transition probabilities to deduce how busy each component would be.  With the arrangement shown in Figure 1 the storage unit would probably be a bottleneck, as illustrated by the following assumptions taken from (Reference 1): instruction fetch requires about 0.45 Main Store references per instruction, and 80% of instructions reference a storage operand.
Then for each instruction executed there will be 0.45 Main Store references for instruction fetch, and 0.8 for operand reference, totalling 1.25 per instruction executed.  But of the 80% of instructions which reference storage operands, some will be branches - say 25%.  Their storage references will actually be to their branch targets.  We can account for this by decreasing the reference rate to 0.6 per instruction, and increasing the instruction fetch to about 0.56.  This leads to the conclusion that VMP will average 1.16 storage references per instruction completed.

If the SCU is the performance bottleneck then with the 54 nanosecond (ns) cycle of the 195, VMP would process 16 million instructions per second (MIPS); with the 25 ns cycle of the Cyber 175, 35 MIPS; with the 12.5 ns cycle of the CRAY-1, 70 MIPS.  Another potential limitation arises from the assumption that 80% of the instructions reference storage, and hence require the services of the integer and logic unit for address arithmetic.  If many of the instructions (re-)use the integer and logic unit to access a general register, then this unit could become a bottleneck.
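The MIPS figures above follow from treating the SCU as the bottleneck: one storage reference per cycle, divided by 1.16 references per instruction.  A quick check (exact values round slightly differently from the generously rounded figures in the text):

```python
def mips_when_storage_bound(cycle_time_ns, refs_per_instruction):
    """If the storage unit accepts one reference per cycle and is the
    bottleneck, MIPS = 1 / (cycle time * references per instruction)."""
    refs_per_second = 1.0 / (cycle_time_ns * 1e-9)
    return refs_per_second / refs_per_instruction / 1e6

REFS = 1.16  # storage references per instruction completed, from the text
for name, cycle_ns in [("370/195", 54.0), ("Cyber 175", 25.0), ("CRAY-1", 12.5)]:
    print(f"{name} cycle: {mips_when_storage_bound(cycle_ns, REFS):.1f} MIPS")
```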

In order to make a proper evaluation an accurate model would have to be set up for each machine architecture of interest.

5.4  Number of Virtual CPUs in VMP

The number of CPUs that VMP emulates determines the depth of multiprogramming its components support.  This number should somewhat exceed the number of major machine cycles that the average instruction takes to complete in VMP.  For a machine using components similar to those in a System/360 Model 195, about ten virtual CPUs would be adequate.
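This sizing rule is Little's law applied to instructions in flight: to complete one instruction per cycle when the average instruction takes L cycles, about L independent programs must be in flight.  A sketch, where the latency and the headroom factor covering uneven progress among the virtual CPUs are illustrative assumptions:

```python
import math

def virtual_cpus_needed(avg_instruction_cycles, target_ipc=1.0, headroom=1.25):
    """Little's law: instructions in flight = throughput * latency.
    `headroom` allows for uneven progress among the virtual CPUs
    (an assumed figure, not from the paper)."""
    return math.ceil(target_ipc * avg_instruction_cycles * headroom)

# An assumed eight-cycle average latency for Model 195-class components
# gives roughly the "about ten virtual CPUs" cited in the text.
print(virtual_cpus_needed(8))
```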

5.5  VMP Storage Control Unit

The VMP Storage Control Unit (SCU) performance is just as crucial to VMP's throughput as it is in a conventional uniprocessor design.  The SCU supporting VMP must be able to accept a new storage access request every machine cycle.  However the SCU need not respond after only one cycle, as is the case with the IBM 3033.  Address translate, buffer directory lookup, and buffer access may take place on consecutive cycles providing these functions are pipelined to support one request per cycle.  This would result in individual requests taking longer, but VMP could still achieve its target throughput if its multiprogramming depth is adequate.  This process of reducing the amount of work done in each machine cycle allows the cycle time to be reduced, thus achieving greater throughput.

The SCU in Figure 2 follows the conventional IBM 303x design, except a buffer miss condition does not tie up the buffer during main store access - the failing request is sent to the appropriate (interleaved) storage module for service, and the buffer continues to service other requests.

5.5.1  Management of Paging by the Storage Control Unit

Computer workloads are tending more and more towards online processing.  The demand for main storage to contain these applications is growing rapidly.  Common solutions to this problem are program overlays, swapping programs to disk, and virtual storage systems.  Control Data have long used Extended Core Storage (ECS) as a swapping device.  Programs resident in ECS cannot execute; they are swapped into main store by a block data move instruction when needed.

Virtual storage systems handle paging on normal I/O channels.  With current operating systems a substantial overhead is incurred by each page fault serviced.  Scheduling the I/O, taking the I/O interrupt and task switching cost time.  Now that charge-coupled devices (CCD) are commercially available, demand paging could be brought under the control of the SCU.  The CCD could be directly accessible to the SCU as ECS is on Cyber machines.  VMP would detect page faults, and suspend the virtual CPU involved until it had swapped in the needed page(s).  During this time the other virtual CPUs would proceed with their programs, so the overall throughput could still be high.

The operating system controlling VMP would not be aware of the paging activity, so the overheads currently incurred by managing page exceptions and page I/O would be avoided.  This process would be very similar to the hardware-managed main storage buffering described in section 2.2.

Hardware managed paging also has implications for the operation of the I/O channels.  They generally operate out of real storage addresses provided to them by the operating system.  If the SCU managed address translation then the channels would have to operate with virtual storage addresses.  Each ongoing I/O operation would need a pointer to the address translate tables current when it was started.

Given the above VMP-managed paging system, there would be no need for a two-level storage hierarchy between the CPU and the paging device.  One level of storage, of intermediate size, speed and page size would suffice.  For example, a machine with a 50 nanosecond cycle time, executing 17 MIPS, should cope with 2MB of buffer storage.  This storage should be capable of accepting one storage request per cycle, but need only respond after four to five cycles.  A 256 byte page would be suitable.  This setup should result in about 20000 page swaps per second.  Four interleaved paging CCD devices with a transfer rate of 2MB per second would be able to support this paging rate.  The total amount of CCD storage would have to be sufficient to contain the swapped-in programs.  For a 17 MIP machine, 30MB should suffice.
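
The bandwidth arithmetic behind these figures can be checked directly (all quantities are the estimates given above):

```python
# Paging traffic demanded by the 17 MIPS configuration described above.
PAGE_BYTES = 256
SWAPS_PER_SEC = 20_000
demand = PAGE_BYTES * SWAPS_PER_SEC      # 5.12 MB/s of page traffic

# Aggregate bandwidth of the four interleaved paging CCD devices.
CCD_DEVICES = 4
CCD_RATE = 2_000_000                     # 2 MB/s per device
supply = CCD_DEVICES * CCD_RATE          # 8 MB/s in total

# supply comfortably exceeds demand, so the CCDs can sustain the paging rate.
```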

6.0  VMP Compared to Other Virtual Multiprocessor Proposals

Other publications have suggested approaches to multiprocessing which have similarities to the VMP architecture described in this paper.

Reference 2 proposes the time-division multiprogramming of the circuits of a minicomputer.  Each instruction follows a fixed routing through the instruction decode and execute steps.  This approach would not meet the needs of a large general purpose computer architecture, since the variety of instruction types and the execution resources that they require could not be satisfied by a fixed sequence of services.

Reference 3 proposes an extension of the principles which the CDC 6600 Peripheral Processors use, as described in Section 2.3.1.  A collection of simple autonomous processors share a set of central execution resources for the more complex instructions that they have to execute.  The simple functions of instruction fetch and decode, branching, and operand fetch and store are handled by the autonomous processors.  The central facilities are shared on a fairly rigid time-division basis.  In practice, a program will often not want the central resource when its "turn" comes around. When it does take its "turn", the program will have to transmit its instructions plus one or two operands and some process state data (eg program interrupt masks) to the central execution resource, in one machine cycle.  This implies a very wide data path, which would necessitate a relatively long machine cycle.

The major virtues of the proposed VMP approach compared with the above are:

  1. Instructions are routed only through the execution resources they actually need, so the varied instruction mix of a large general purpose architecture can be supported.
  2. Resources are granted on demand rather than on a rigid time-division basis, so they do not stand idle when a program's "turn" comes around unused.
  3. No very wide data path is needed to transmit instructions, operands and process state to a central resource in one machine cycle, so the machine cycle can be kept short.

7.0  VMP Interrupt Handling Philosophy

This section details how VMP could handle hardware interrupts in a way compatible with IBM System/370 architecture.  Much of the discussion will be applicable to other interrupt driven machine architectures.

7.1  Response to Maskable Interrupts

Supervisor code to handle interrupts usually runs disabled to the type of interrupt it is handling until it has fully recorded the details of the interrupt.  To ensure the correct operation of such code, VMP will have to ensure that only one virtual CPU at a time can disable itself to any particular interrupt type.  For all maskable System interrupts (e.g. I/O interrupts, External interrupts, and Machine Check interrupts) VMP will maintain an overall system-wide interrupt mask which is the logical AND of the interrupt masks in each virtual CPU's state registers.  Thus should an External Interrupt condition become pending while any one of the virtual CPUs is disabled to External Interrupts, VMP will not honour the interrupt until the virtual CPU re-enables itself to External Interrupts by changing its interrupt mask.
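
A minimal sketch of this masking rule, assuming a hypothetical bit-per-interrupt-type encoding in which a set bit means the virtual CPU is enabled to that type:

```python
# Hypothetical bit assignments for the three maskable interrupt classes.
EXTERNAL, IO, MCHK = 0b001, 0b010, 0b100

def system_mask(per_cpu_masks):
    """Logical AND of every virtual CPU's mask: an interrupt type is
    honoured system-wide only while no virtual CPU is disabled to it."""
    mask = EXTERNAL | IO | MCHK
    for m in per_cpu_masks:
        mask &= m
    return mask

# The middle virtual CPU has disabled itself to External interrupts,
# so External interrupts are not honoured until it re-enables them.
masks = [EXTERNAL | IO | MCHK, IO | MCHK, EXTERNAL | IO | MCHK]
```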

If a virtual CPU attempts to disable itself to an interrupt type already seized by another, VMP will suspend execution of this virtual CPU until the one in possession of the interrupt type re-enables itself to that interrupt type again.  Hence code executing disabled to any interrupt type will automatically be serialized by VMP.  Virtual CPUs awaiting disabled state would use no resources and cause no interference.  The remaining enabled virtual CPUs will continue to process instructions at the normal rate.

7.2  Supervisor Call and Program Check Interrupts

Supervisor Call and Program Check Interrupts are usually private to the virtual CPU causing them, and result in a branch.  No other virtual CPU is affected.  Once the interrupt has been handled by the Supervisor, control is usually returned to the invoking program.  However, the Interrupt Handling code may update Supervisor information whose integrity must be ensured.  See Section 8, "Software Support of VMP".

7.3  I/O and External Interrupts

When an I/O or External interrupt occurs, the program expecting the interrupt may or may not be executing.  In response to such interrupts, VMP suspends any one of its active virtual CPUs, stores its Program Status Word (PSW) in the appropriate storage location, and loads the appropriate new PSW, much as a conventional System/370 now does.  All the other active virtual CPUs continue with their own programs.  There is no need for a total cycle-down of work as with the System/360 model 91, nor any start-up delay in fetching a new instruction stream.  However, the integrity of System Software control blocks and data must be ensured, see Section 8, "Software Support of VMP".

Once the interrupt has been processed, a Task Switch may be necessary.  One of the virtual CPUs will suspend its program and restart the program needing attention.  The other virtual CPUs will not be affected.

7.4 Machine Check Interrupts

This is obviously a very complex area.  When such a check occurs, VMP must determine whether the fault is transient and can be recovered from, or, if not, then whether it can be localised to a particular virtual CPU.  If it can, then that CPU takes the machine check interrupt.  Depending on the severity of the error, VMP may suspend all other processing activity.  However, if the fault is unlikely to affect other virtual CPUs (e.g. a parity error in main storage) then VMP may allow them to continue while the afflicted virtual CPU attempts hardware or software recovery.

8.0 Software Support of VMP

The techniques of software support for VMP are highly dependent upon the machine architecture and operating system chosen, and a detailed treatment of the topic is beyond the scope of this paper.  Most major computer lines and their operating systems support multiprocessing, and so could support the VMP approach without much change.  Some general observations are made below.

8.1  Uniprocessor Compatible Mode

A Machine Reset should put all of the VMP virtual CPUs into a stopped state.  IPL (initial program load by hardware) would activate only one of the virtual CPUs, which could complete the IPL and initialize the system.  Unless specifically activated, no other virtual CPU would run - the VMP would run in uniprocessor mode.  Hence, all existing programs and software systems which run on a uniprocessor would run on VMP, with the usual timing dependencies.

8.2  Initiating Multiprocessing

A new privileged instruction, Exchange Program Status Word (EPSW), must be introduced with VMP.  This instruction's operand would designate a virtual CPU number.  The effect of the instruction would be to interrupt the designated CPU and cause it to enter a Supervisory routine to save the status of the program it was executing (if any), and start executing another.

If a virtual CPU finds no task ready to run, it will load a Wait State PSW, which will render it inactive.  The Supervisor will maintain a table in storage indicating for each virtual CPU either that it is inactive, or else the address of the Task which it is currently executing.
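
The Supervisor's table and the effect of EPSW might be sketched as follows (the table layout and function name are hypothetical illustrations, not part of the proposal's specification):

```python
# Per-virtual-CPU dispatch table: each entry is either None (inactive,
# i.e. in the wait state) or the address/identifier of the running task.
N_VCPUS = 10
cpu_task = [None] * N_VCPUS

def epsw(cpu, new_task):
    """Mimic the proposed EPSW instruction: interrupt the designated
    virtual CPU, return the task it was running (so the Supervisory
    routine can save its status), and start the new task."""
    old = cpu_task[cpu]
    cpu_task[cpu] = new_task
    return old
```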

8.3  Ensuring the Integrity of Supervisor Data

The integrity of much supervisor data would be lost if multiple CPUs were allowed simultaneous access.  Such data is usually protected by two mechanisms.  When running on a uniprocessor, serial access to interrupt handling code and tables is ensured by running disabled to that interrupt type.  On a multiprocessor, integrity is ensured by use of a set of exclusive locks.

From the description of VMP given, it is clear that the technique of running disabled to an interrupt type while handling it will work as well on VMP as it does on a conventional uniprocessor.  In addition exclusive locks in storage can be used to ensure system integrity.  With the current multi-CPU approach to multiprocessing, contention for locks often results in CPUs idling while they wait for control of the lock that they need.  To reduce this contention for locks, VMP could offer (and even standard multiprocessors should offer) a small associative memory to contain the most recently contended-for locks.  The CPUs waiting for these locks would be suspended until the Storage Control Unit detected that the lock had been modified.
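
The suspend-until-modified mechanism could be sketched like this (all names are hypothetical, and a real implementation would live in the SCU hardware rather than software):

```python
# The SCU's small associative memory of contended lock words:
# lock address -> set of virtual CPU ids suspended on that word.
waiters = {}

def suspend_on_lock(addr, cpu):
    """Record that a virtual CPU is suspended awaiting a store to addr."""
    waiters.setdefault(addr, set()).add(cpu)

def store(addr, memory, value):
    """A store passing through the SCU: update storage, and resume any
    virtual CPUs watching that address so they can retry the lock."""
    memory[addr] = value
    return waiters.pop(addr, set())
```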

8.4  Determining the Current Program

At any moment many different programs may be executing on VMP.  Each virtual CPU must be able to identify which program it is executing.  To do this requires an instruction which returns to the virtual CPU its processor number.  This number can then be used to index an array of program identifiers.
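
A minimal sketch of this lookup, assuming a hypothetical instruction that returns the processor number (the array contents here are purely illustrative):

```python
# Supervisor-maintained array of program identifiers, one slot per
# virtual CPU; None marks a virtual CPU in the wait state.
program_ids = ["PAYROLL", "EDITOR", "COMPILE", None]

def current_program(cpu_number):
    """Index the array with the processor number the instruction returns."""
    return program_ids[cpu_number]
```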

8.5  Accounting for Machine Time

A very strong requirement for a computer is that it must account for the time that it spends executing problem programs.  Since VMP executes many problem programs simultaneously, and they compete with one another for resources, the amount of CPU time that they spend to execute a fixed amount of work will vary each time that they run.  To provide a measure of CPU work done that does not vary from one execution to another, VMP could provide the path length executed by each program as it runs.  This requires less work than counting instructions, since VMP would only have to do arithmetic when a program branched or was interrupted - subtract the branch/interrupt target, add the old Instruction Address Register content to that program's running total.
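
The branch-time arithmetic can be sketched as follows (the class and method names are illustrative; addresses and the running total are in bytes).  Subtracting each branch/interrupt target and adding each old Instruction Address Register (IAR) value means the total accumulates exactly the straight-line code distance executed:

```python
class PathLength:
    """Running path-length total for one virtual CPU, maintained only
    at branches and interrupts, as the text describes."""

    def __init__(self, entry_point):
        # Pre-subtract the entry point, the "target" of the initial dispatch.
        self.total = -entry_point

    def branch(self, old_iar, target):
        self.total += old_iar   # close off the straight-line run just ended
        self.total -= target    # open the run starting at the branch target

    def at(self, current_iar):
        """Path length in bytes executed up to the current IAR."""
        return self.total + current_iar

# A program entered at 0x1000 runs to 0x1010, branches to 0x2000, runs
# to 0x2008, branches back to 0x1000, and runs to 0x1004: three runs of
# 0x10, 0x8 and 0x4 bytes respectively.
p = PathLength(0x1000)
p.branch(0x1010, 0x2000)
p.branch(0x2008, 0x1000)
```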

A more accurate but expensive solution would be for VMP to maintain a counter for each virtual CPU, and increment it for each instruction executed by that virtual CPU.  The increment would depend on the instruction opcode, so long instructions would be more heavily weighted.  For storage-to-storage instructions, the operand length could be taken as the instruction weighting.

9.0  Summary

This paper describes a way of achieving a high instruction throughput rate on a relatively modest amount of hardware, by multiprogramming the CPU at a sub-instruction level.  The design will cope well with current computer workloads in online environments, where conditional branches and interrupts are frequent.

This paper has shown that when current high performance computers execute jobstreams that include online work, they complete only one instruction every four machine cycles, while VMP should achieve close to one instruction per cycle.  This level of performance can be realized without the large investments in complexity which current high performance computers make in order to reduce the number of cycles per instruction.

The paper has also shown how the new rapid access CCD and bubble memory devices can be incorporated into a hardware managed storage hierarchy.  The software overheads currently associated with handling I/O would otherwise preclude the full exploitation of these devices for paging in virtual storage systems.

References

  1. "The IBM System/360 Model 91: Machine Philosophy", IBM Journal of R&D, Vol 11, No 1, January 1967
  2. LE Shar and ES Davidson, "A Multiminiprocessor System Implemented through Pipelining", Computer, February 1974
  3. MJ Flynn et al, "A Multiple Instruction Stream Processor with Shared Resources", Parallel Processor Systems, Technologies and Applications
  4. MJ Flynn, "Towards More Efficient Computer Organizations", Spring Joint Computer Conference, 1972
  5. RM Russell, "The CRAY-1 Computer System", Communications of the ACM, Vol 21, No 1, January 1978