The previous two chapters gave, in some detail, a survey of parallel computing based around the transputer, however, transputer parallel computers form only a fraction of the parallel computer market. This chapter surveys the architectures of several of the more important non-transputer computers.
The purpose of this chapter is to extend the appreciation of parallel computers beyond the possibilities offered by the transputer. Chapter 7 will cover parallel computer taxonomy. The detailed descriptions in this and the previous chapter will help identify the important issues about the structure of parallel computers. It is beyond the scope of the thesis to give a comprehensive examination or even a general overview of non-transputer parallel computers.
The DAP is an early fine-grained array processor that is still in commercial production. The ILLIAC-IV is also an array processor, however, it is large grained. The ILLIAC-IV greatly influenced the design of many subsequent parallel computers. The CRAY-1 is the most prominent vector processor to date.
The machines described here where chosen on the combined basis of historical and commercial importance. Due to space considerations many important machines have had to be left out.
ICL started working on the Distributed Array Processor (DAP) in 1970. A prototype DAP was working in 1974, and the first production machine was installed at Queen Mary College, London in 1980.
The original machine was fabricated with SSI chips. ICL continued development of the DAP for fifteen years and in 1985 produced a second generation machine fabricated with LSI chips. In all, only six SSI and two LSI DAPs were ever made, however, the architecture is an important example of a bit-slice array processor (see section 2.3.3). Furthermore, a company called Active Memory Technology (AMT) was formed in October 1986 to continue development of the DAP. The DAP architecture has reached its third generation with AMT currently selling several different models of DAP fabricated with VLSI chips.
Different versions of the DAP vary not only in fabrication technology but also there are slight differences in the architecture. Consequently, descriptions of the DAP in the literature may appear to be inconsistent especially when the text does not does not make it clear to which version of the DAP is being referred to. The following description will refer to the SSI DAP unless otherwise stated.
The machine comprises a single master control unit (MCU) and 4,096 very simple processing elements each with 1/2KByte of memory, see figure 6.1.1-1. Much like an associative processor, the DAP's processing power is intimately distributed throughout the computer's memory, see figure 6.1.1-2. This is where the name Distributed Array Processor comes from. The MCU fetches and decodes instructions transmitting the derived control signals to the DAP array for execution. The DAP array has its processors arranged in a 64 x 64 grid with all of the processors lockstepped together. The edge connections can be either cyclic or planar depending in the particular instruction being executed. Cyclic instructions have the output of the grid edge fed back into the beginning of the same row or column. Planar instructions have the results collected from an outputting edge and zeros fed to an inputting edge. The geometry of rows can be set independently of columns and vice-versa.
Figure 6.1.1-1 DAP Array
Figure 6.1.1-2 DAP Processors and Memory
Design of the DAP was based around the ICL 2900 series mainframe which acts as the host. To the 2900 the DAP looks like a 2MBytes bank of main memory, and when the DAP is not being used for processing it can act as normal main memory. An advantage of having the DAP look like a bank of memory is that associative processing power can be increased simply by replacing several of a 2900's main memory banks with DAPs.
The individual processing elements of the DAP are very simple. Figure 22.214.171.124 shows a DAP processor which comprises two multiplexers, three one-bit registers, a one-bit full adder, and 4,096 one-bit memory words. The processing elements can not fetch or decode instructions, this is done for them by the Master Control Unit. Working with only one-bit words and registers means the instructions can do little more than fundamental Boolean operations. The processing element's register set is shown in table 126.96.36.199-1.
Figure 188.8.131.52 DAP Processing Element
|A||Activity or masking register|
The A register is called the activity flag of the processing element. Once the register has been set the processing element will remain dormant during certain instructions. A better name for this register is the 'hibernation register'.
In general, when the hibernation flag is set, logical and arithmetic instructions will be ignored. However, instructions that could reset the hibernation flag will continue to be executed as normal. An example of the use of this register comes up when the square root of a number is required. If the number is negative, attempting to evaluate the square root would cause an error. Therefore, before the square root routine is performed the 'set A register if negative number' instruction is executed. This effectively excludes all the processing elements holding negative numbers from performing the square root.
The Q register is an accumulator and the C register acts as an overflow flag. It is a happy accident that the overflow flag can also be regarded as an overflow store - this is only possible because the word length of the processors is one bit. The ADD instruction takes the Q register, the C register, and the specified memory location, adds them all together placing the result in the Q register and any overflow in the C register.
Arithmetic is performed by the adder which can only work on one-bit data. Multi-bit arithmetic is done by a small process which repeats one-bit arithmetic in a loop. Such processors are said to be of bit-serial design. (All very early computers were also of bit-serial design.) Arithmetic precision can be increased simply by increasing the loop count, consequently, low-precision arithmetic is much faster than high-precision. The DAP is said to have 'bit complexity', ie execution time is proportional to the bit length of the data. Furthermore, because uncommon length numbers (eg 3- or 29-bit) are just as easy to work on as 8- or 32-bit numbers, data need not waste memory space in artificially large words.
|Operation||Formula||Relative Execution time|
|1-bit Boolean operation||L1 AND L2||1|
|16-bit fixed point addition||I = J + K||10|
|16-bit data movement||I = J||6|
|32-bit fixed point addition||I2 = J2 + K2||20|
|32-bit floating point addition||X = Y + Z||180|
|32-bit floating point multiplication||X = Y * Z||280|
|32-bit floating point square root||X = SQRT(Y)||250|
|64-bit floating point multiplication||X2 = Y2 * X2||1000|
|Operation||Time (microseconds)||Performance (MFLOP)|
|X, Y, and Z are 4096-element vectors of 32-bit real numbers;|
I is a 4096-element vector containing 32-bit integers.
|Z = X||17||241|
|Z = X * X||125||33|
|Z = X + Y||150||27|
|I = MOD(I)||1||4096|
|64 x 64 matrix multiplication||30||17|
When designing numerical algorithms it is usual to avoid the SQRT() function where possible because, being performed by Newton-Raphson iterations, it is expensive in computing power. Many established numerical algorithms have developed around this and similar premises. Because of its many bit-serial processors, the relative cost of the various instructions on the DAP is markedly different from conventional processors. These differences are highlighted in tables 184.108.40.206-2 and 220.127.116.11-3 which show the execution times for various instructions.
For a bit-serial processor the SQRT() function is only slightly more expensive than a multiplication. The relative cost of performing some of these instructions is so different from conventional machines that many algorithms need to be redesigned. From tables 18.104.22.168-2 and 22.214.171.124-3 the following general rules for the behaviour of various instructions can be inferred.
Many important problems are bit orientated, these include most graphics, pattern matching and cryptography algorithms. The DAP is particularly well suited to these types of applications and comfortably outperforms machines with much higher FLOP rates.
The clock speed of the DAP's processing elements is 5MHz which is slow compared with many commercial processors. The DAP's power comes not from the processing elements being fast, but from there being so many of them 4,096. The slow clock is mainly due to long memory 'access times'. As the memory transfers are only 1-bit, the rate determining step is the RAM cycle time rather than the 'access time'. The cycle time is the minimum elapsed time between successive memory accesses. A RAM with a 50ns access time may well have a 200ns cycle time. The difference is called the 'RAS precharge time', and is the time needed for the chip to recover from the previous access and 'ramp up' for the next.
The DAP processor array can not do any useful work on its own. It needs to be externally coordinated and controlled. This is done by several units that are collectively known as the DAP controller, see figure 126.96.36.199. The most important of these units is the Master Control Unit (MCU).
Figure 188.8.131.52 Structure of DAP Unit
Instructions are stored throughout the DAP processor array, being scattered between the processing elements amongst the data. The instructions can not be stored in a continuous block with the data at the end, as in conventional machines. This arrangement would lead to some processing element's memories holding instructions and no data.
The Master Control Unit (MCU) is responsible for fetching and decoding instructions as well as issuing the relevant decoded signals to the processor array necessary for executing those instructions. Every processor in the DAP array receives the same control signal at the same time and executes the same instructions on its local data.
There are eight 64-bit registers in the MCU which are connected to either the row or column highways. Every 'bit' of a register is connected to a separate line of the highway, see figure 184.108.40.206.1. Several of the MCU's registers can be connected to either highway. Registers are used to slice up the processor array by selecting which rows or columns of processing elements are to be interrogated or sent data. Data can be transmitted to all processing elements simultaneously by selecting all the rows and columns.
Figure 220.127.116.11.1 DAP Processing Element Array
Two of the DAP's 32-bit instructions are stored per row of processing elements. Instruction fetches pull in a complete memory row at a time, therefore, two instructions are fetched every clock cycle.
Whenever a calculation requires multi-bit arithmetic it has to be done by a small loop, which is a slow way to do arithmetic. Bit-serial processor designs are used because they are simple to implement. In order to speed up arithmetic the DAP controller is provided with a high-speed instruction buffer. The buffer can store up to 60 32-bit instructions, which is enough to hold most numerical loops. This arrangement allows the DAP controller to bypass many expensive main memory accesses and issue the processing elements with control signals at a much higher rate.
The MCU broadcasts and collects data from slices of the DAP via row and column highways. The sole use of the row highway is to transfer data between columns and the MCU registers. The column highway, however, is multifunctional. Firstly, it can be used by the MCU for fetching instructions which are stored two per row of processing elements. Furthermore, it can be used for transferring data between rows and the MCU's registers in the same way that the row highway does. Finally, it can be used for access to the host ICL 2900 via the DAP Access Controller (DAC).
The DAP access controller (DAC) is responsible for interfacing the DAP to its host. The column highway is used to transfer data between the two. The host uses the DAP as main memory and the DAP uses the host for backing store etc, so the interface is two way.
The original DAP was built exclusively around the ICL 2900 series of mainframes, see figure 18.104.22.168-1, and no other host was supported. The DAP was designed to look like a 2MByte bank of main memory to the 2900, (see figure 22.214.171.124-2,) and is arranged as a 64 x 64 x 4,096-bit array. The 2900's word length is 64-bits which correspond to memory rows in the DAP. Incrementing the 2900's memory address first moves down the columns of the processor array and then through the 4Kbit DAP address space, see figure 126.96.36.199-3. The 2900 loads the DAP with programmes and data simply by writing them to what it regards as main memory. This is much easier than most other parallel computers which have to be loaded through serial I/O ports.
Figure 188.8.131.52-1 ICL 2900 Mainframe
Figure 184.108.40.206-2 DAP ICL 2900 Mainframe with DAP Installed
Figure 220.127.116.11-3 DAP Memory structure
As well as holding the backing store and performing I/O the 2900 provides both the computer environment and the software development environment for the DAP. The DAP and the 2900 communicate through a common memory block. At a high-level the DAP is accessed through special call instructions from within 2900 FORTRAN programs. At a low-level it can be programmed in APAL the DAP's assembler language.
ICL call the normal serial processor in the 2900 the 'Order Code Processor'. It instructs the DAP to execute its own code rather than behave as normal main memory. When the DAP is being used the Order Code Processor can perform some pre- and post-processing for it using the normal memory banks.
The Store Access Controller provides access to peripherals and a transfer facility between memory units. If the DAP is being used then the 2900's other banks of memory can be used as a fast backing store. This unit can also perform block transfers between the memory banks, however, the usual method for communication between the 2900 and the DAP is with common blocks. This is achieved by the Store Access Controller allowing the 2900 to read and write into the DAP's memory.
The 2900 can access the DAP's memories even while the DAP is working by using a technique called 'cycle stealing'. This involves the 2900 halting the DAP for one clock cycle and taking over the memory bus. During this time the DAP's processors are 'frozen' and do nothing while the 2900 accesses the memory. After one memory access the DAP's processors are allowed to continue as though nothing had happened. If the message requires more than one access the DAP is allowed to work for a few more cycles and then another cycle is 'stolen' by the 2900 and used to transfer more data. This process is repeated until the transfer is complete. The DAP does not have to run any special software to communicate in this way, the only effect is that it is slowed down a bit. In order to maintain data integrity the DAP can protect certain segments of its memory.
In October 1986 ICL sold the patents for the DAP to Active Memory Technology (AMT), in return for a 25% stake in the company. AMT reworked the DAP design and delivered the DAP 510 in the last quarter of 1987. The AMT DAP 510 is a 32x32 array of processors with 32-64Kbits of memory per processor and a Sun workstation or VAX as its host. With modern VLSI technology 64 processors have been integrated on a single chip. The DAP 510 can perform 10,000MIPS on 1-bit data which drops to between 6-60MFLOPS on 32-bit numbers.
The university of Illinois has built a range of experimental computers over the years: in 1952 the ILLIAC-I was built from vacuum tubes, in 1963 the ILLIAC-II from transistors, and the ILLIAC-III - a machine for scanning large quantities of visual data - was built in 1966. In 1966 the US Department of Defence commissioned the University of Illinois to build a SOLOMON-type computer. SOLOMON is a parallel computer first described in a 1962 paper but never built. By 1967 the Illinois group, led by D Slotnick, published a paper outlining a new machine which came to be known as the ILLIAC-IV     
The final design was completed in collaboration with Burroughs who were engaged as system contractors. Unfortunately, the machine underwent so much refinement that the final design was not available until 1971. The complete design - with four identical sections - was never built. One quadrant, however, was built and delivered to NASA's Ames Research Centre in 1972. The ILLIAC-IV's single quadrant is a functional parallel computer in its own right. The implementation was plagued with problems and a reliable service could not be offered until 1975, even so, it is still regarded as the first operational parallel computer. The machine was connected to a network and so available to many users which prompted much of the early work on parallel algorithms and languages.
The original ILLIAC-IV was to comprise of four quadrants, each with an array of 64 processing elements connected in an 8 x 8 grid. Each of the quadrants was to have a single control unit which would interpret instructions for all its processing elements. All four quadrants were to be fed the same instruction stream and all the processing elements within a quadrant were to be lockstepped together. The individual processing elements were based on a mainframe and are much more sophisticated than the processing elements of many parallel computers built since.
The design called for the four quadrants to be connected by a highly parallel I/O bus, and backed up by a large disk as secondary memory. Each of the 256 processing elements was envisaged to produce a floating-point operation in 240ns giving a maximum performance of 1GFLOPS for the whole machine.
The designers of the ILLIAC-IV endeavoured to include the latest technologies in the ILLIAC-IV machine. As new semiconductor materials became available in the early 70's the machine underwent repeated redesign which led to schedule delays, and considerable cost escalations. The fabricated design was reduced to a single quadrant with 64 processing elements.
The ILLIAC-IV is an array of 64 processing elements regulated by a single control unit. The processing elements are arranged in an 8 x 8 grid with nearest neighbour connections, see Figure 18.104.22.168-1. Taking into account edge connections, the topology could be described as a staggered toroidal mesh. By having the toroid staggered, all the processing elements are effectively arranged in a ring with extra 'short-cut' connections between every eighth element. This topology is called a chordal ring and is illustrated in figure 22.214.171.124-2.
Figure 126.96.36.199-1 Illiac-IV Processor Array
Figure 188.8.131.52-2 Illiac-IV Chordal Ring Topology
All processing elements sit on a bus called the 'instruction control path'. This bus transmits the same control information simultaneously to all 64 processing elements, see figure 184.108.40.206-3. Each processing unit executes the instruction on data held in its own 2Kword memory. With 64 processors the ILLIAC-IV has a total main memory of 128Kwords, and as the words are 64-bit this corresponds to 1MByte. Each processing element's memory can be accessed by the control unit, the I/O system and by the processing element itself.
Figure 220.127.116.11-3 Connections between Illiac-IV Control Unit and Processing Elements
The ILLIAC-IV is capable of about 200MIPS or 80-100MFLOPS, which is far short of the original 1GFLOP target. Even so, this is enough to make it the most powerful computer of its time. The ILLIAC-IV was the first machine to have an all semiconductor main memory, however, considering the machine's power it has a very small amount of main memory. This is partly due to the high expense of designing new chips and working with a new memory technology. The design was further complicated by the need to allow three units to access the memory system.
Theoretically, a floating-point operation could be performed on every memory word in 1/600ths of a second! To help keep the processing elements busy, the main memory is augmented with a 16Mword magnetic drum secondary memory. Applications whose data sets are too large for the main memories can read jobs from the drum and write their results back there. The performance of this drum is enhanced by having 128 heads - one per track. All 128 heads can transfer data at the same time along a 1024 line bus giving rise to disk transfer rates of 500Mbits per second. Unfortunately, this very high transfer rate is not matched by the drum's access time of 20ms. The placement of information on the secondary storage drum has to be carefully managed to keep the ILLIAC-IV busy. It was largely the unreliability of the drum secondary memory that stopped the ILLIAC-IV providing a reliable service for its first few years.
The ILLIAC-IV was supported by 20 or so PDP 10s and 11s that can collectively be called the systems host. They provide backing store, I/O, networking, and run a rudimentary operating system.
The ILLIAC-IV's instructions are distributed throughout the memories of the processing elements. These memories are all directly accessible by the control unit via the control unit bus, see figure 18.104.22.168-3. In addition, the control unit has two buffers, one that can hold 64 64-bit data words, and another that can hold 128 32-bit instructions. The fast instruction buffer is large enough to hold the inner loops for many programs hence reducing the number of accesses to main memory.
The control unit is a scalar computer in its own right, with a 64-bit arithmetic and logic unit. When an instruction is fetched the control unit must determine where it is to be executed. Instructions that 'transfer control' (eg jumps and calls), along with scalar instructions are executed by the control unit itself. Only instruction that work on vectors are executed in parallel on the ILLIAC-IV processor array.
Should a vector need to be added to another vector, an element from both vectors would be stored in each processing element. The control unit would issue control signals to all the processors to load their vector elements, add them and store the result. If a vector is to be multiplied by a constant the task could be executed in the same way as described above. Unfortunately, this would lead to the same constant being stored in every processing element's memory. This is particularly wasteful on the ILLIAC-IV because main memory is so precious. To avoid this waste a constant can be transferred directly from the control unit along the 'common data bus' and into the arithmetic unit of every processing element. Similarly any piece of data or memory address that is common to all calculations can be stored just the once and broadcast to all the processing elements along the common data bus.
Movement of data between processing elements is regulated by the control unit. The transfer takes place between the R register's of neighbouring processing elements. Every processing unit must transfer data at the same time, in the same direction, and for the same distance. The control unit works out the shortest path and issues communications instructions accordingly.
In addition to running applications, the control unit runs several small routines that enable it to communicate with the operating system on the host.
Each of the ILLIAC-IV's 64 processing elements is based on a mid-range Burroughs 64-bit mainframe and is capable of directly executing floating-point arithmetic. The power and sophistication of the individual processing elements is the most striking difference between the ILLIAC-IV and most other array processors.
Each processing element is essentially an 'Arithmetic and Logic Unit' (ALU), some registers, and four 64-bit communications connections to its neighbours. The control unit sends every processing element the same control signals thus enabling the ALUs to execute instructions. Most instructions use two operands both of which come from the processing elements local memories, or alternatively, one operand can be collected from the global control unit bus.
The D register is used to store the results of comparisons or tests. One of the bits in the D register is used as an enable/disable flag. If this flag is set then that particular processing element does not take part in the calculations that follow.
Most applications require much more memory than the 2Kword offered by the local processing element memory. To get around this, the local processing unit memory is used as a working store, (ie for intermediate results, pointers, etc) and the bulk of the data is stored on the disk. This is analogous to the transputer's small on-chip memory, and the larger but slower off chip memory.
A particular feature of the ILLIAC-IV's processing elements is that they are able to execute complex instructions. In addition to 64-bit arithmetic, they can perform local indirect addressing. The X register in each processing unit can be used to modify the address sent out by the control unit. This register can hold a different value in each processing unit and so, given the same operation, each processing unit can work on data in different local memory locations.
Although the word size of the processing elements is 64-bit, they are also able to work on other data types, see table 22.214.171.124.1. This enables each of the local processing unit's memories to hold 64, 128, or 512 element vectors.
The ILLIAC-IV used a collection of smaller computers arranged around a shared memory as its host, see figure 126.96.36.199. All the ILLIAC-IV's I/O goes through the I/O subsystem to the host which is also called the 'Central System'. Several PDP 10s provide interactive time-sharing facilities and over 20 PDP 11s are used for control and communication.
Figure 188.8.131.52 Connections between Illiac-IV and the Central System
Collectively these machines provide interactive job preparation, file management, data movement, edit and debug utilities, backing store, terminals, network connections and a rudimentary operating system. A separate B6700 computer compiles programs and provides programming support which includes ASK the ILLIAC-IV assembler, a compiler for the high-level parallel language GLYPNIR, and an ILLIAC-IV simulator - SSK.
The host primes the ILLIAC-IV array for calculation by directly reading and writing to the disk secondary memory. A hierarchical backing store for the ILLIAC-IV is provided by drums, disks and tapes as well as a 700Gbit laser based subsystem.
An important feature for the development of the ILLIAC-IV's software is the host's network connection. The host was connected to the US Department of Defence Advanced Research Project Agencies network, or 'ARPA net' for short. This allowed anyone with access to the network to use the machine. The general availability of the machine to many researchers in many disciplines led to a large base of software.
In many ways it is possible to regard the ILLIAC-IV project as a failure. Only one quadrant was ever built, it cost almost four times the original budget, it was delivered late, it only just came to within a factor of ten of its original proposed performance, the proposed clock speed had to drop from 25MHz to 16MHz to 12.5MHz, and it was unreliable. One reason for its failure is that it was too ambitious for the technology of the time. Another is that it tried to incorporate many immature technologies that were appearing at that time.
Much of the performance deficit came from problems associated with the implementation and not the architecture itself. The small size of the local main memory was particularly debilitating. If all four quadrants had been built and the clock could be doubled back to its original value the machines would have been 8 times faster which is most of the way back to the original specification.
The ILLIAC-IV project was a valuable learning experience and had considerable influence on the development of many other parallel computers. The network connection allowed many researchers to use it which led to the development of many parallel algorithms and several parallel languages. For example, ACTUS, CFD, GLYPNIR, TRANQUIL were all developed originally for the ILLIAC-IV.
NASA shut down the ILLIAC-IV facility in August 1981.
In 1972 Seymore Cray, the senior computer designer at Control Data Corp' (CDC), left to start up Cray Research Inc. The first machine to be manufactured by Cray Research was based on the CDC 7600 and called the CRAY-1. Since delivering the first machine in 1976 over 60 CRAY-1's have been bought, making it the most popular supercomputer to date.
The CRAY-1      is a single processor pipelined vector computer with 12 independent pipelines that provide a range of arithmetic, logic and vector operations. Each of the 12 pipelines is dedicated to a small group of functions making it quick to configure. The Cray-1 achieves most of its performance through two principal forms of parallelism - pipelineing and multiple independent functional units that can work at the same time, see figure 6.3-1.
Figure 6.3-1 Cray-1's Functional Units
The CRAY-1 was designed as an attached processor, with Data General Eclipses, IBM 370/168s or Cray Research 'A' processors acting as hosts. The host can be connected to any one of the CRAY-1's 12 I/O channels with the other I/O channels being connected to peripherals of various sorts, see figure 6.3-2. The maximum I/O bandwidth is 80Mwords per second.
Figure 6.3-2 Cray-1 I/O Connections
The CRAY-1 is usually programmed in FORTRAN and the programmer need not know any more about the internal architecture of the processor than what is revealed by FORTRAN. The compiler rearranges the program in a way that the original meaning is not lost but so that it can take advantage of the CRAY-1 architecture's parallel features.
The most important feature of the CRAY-1 is its ability to execute vector instructions. Vector instructions call for the same operation to be performed on a list of numbers - called a vector - rather than on a single number - called a scalar. A 'load vector' instruction takes 64 numbers from main memory and places them in a special group of registers called a vector register. The CRAY-1 has eight of these vector registers. The vector operation V0(i) = V1(i) + V2(i) would add V1(1) to V2(1) and place the result into V0(1) and so on down the list until V0(64) = V1(64) + V2(64). Special 'hidden' registers keep track of how far through the vector the calculation has progressed. Save vector instructions copy a complete vector register back to main memory.
Six of the twelve functional units work on vectors. Once a vector operation has started the pipeline remains dedicated to that operation and need not be reconfigured until every element has passed through. The instruction only need be decoded once - at the beginning of the vector operation - and not for every data item as with scalar processors. This in turn allows the pipeline to be clocked at a higher speed. The clock speed of the CRAY-1 is 80MHz.
Most processors incorporate the control unit into the first few stages of a pipelines which must accept both instructions and data. Unfortunately, when these pipelines come across 'jump' instructions the pipeline has to be flushed and restarted with new instructions and data. The CRAY-1's pipelined functional units are only for operands, consequently they do not suffer from some of the drawbacks of typical pipelines.
The CRAY-1 makes extensive use of registers and can be described as a register-register pipelined vector processor. Register-register architectures have functions units that can not access memory directly, but take operands from, and return their results to, registers. This contrasts with the Cyber-205 which is a memory-memory pipelined vector processor and takes its operands and returns its results to the main memory. There are over 800 registers in the CRAY-1's processor which corresponds to over 4KBytes of storage.
Once a register load instruction has been issued the multi-cycle load operation is taken over by the memory interface unit leaving the processor free to do other useful work. To reduce delays a register should be loaded some time in advance of it being used. Registers can be transferred to functional units in 6ns as opposed to the 50ns main memory cycle time. Another advantage of transferring operands into registers is that a functional unit can load two operands simultaneously from two different registers. The memory interface unit has only one port so multiple main memory loads have to be done sequentially.
The three principal groups of operand registers are address, scalar and vector. There are eight 24-bit address or A registers, designated A0 through to A7. They are primarily used to store memory addresses and offset or index values, but can also be used for shift counts, loop indices, and for controlling I/O operations. There are two functional units dedicated to operations on these registers.
Eight 64-bit scalar or S registers hold general purpose operands that are used by the scalar functional units. Some vector operations also take constants from S registers. Scalar registers are designated as S0 through to S7. The scalar registers are the principal operand registers for scalar operations.
The largest group of registers are the vector or V registers. Each of the eight vector registers can hold 64 64-bit words. Three functional units are dedicated to vector operations. Vector registers are designated as V0 through V7. Complete vectors are loaded from main memory by specifying the start address and increment, which allows individual matrix rows or columns to be specified. The memory interface unit can load or store one V register at the same time as another V register is being used by the functional units. Individual elements are not specified as all operations are performed on complete vectors. One way of thinking about the V registers is as first-in-first-out (FIFO) data buffers.
|Address||A||8 x 24-bit||24|
|Scalar||S||8 x 64-bit||64|
|Vector||V||8 x 4096-bit (8 x 64 x 64-bit)||4096|
There 12 different pipelined functional units in the CRAY-1. These can be split into four groups: scalar arithmetic, vector arithmetic, floating-point, and address, see figure 6.3.2. Parallelism occurs at two distinct levels, firstly there is pipelining within the functional units and then there is parallelism between functional units because they are independent and several can be working at the same time.
Figure 6.3.2 Cray-1 With Principle Registers in Place
The functional units cannot get operands directly from main memory but must take their operands from the A, S or V registers. Once the calculation has completed, the results also have to go to these registers. Each functional unit can accept operands and produce results every clock period. The time taken to produce the first results depends on the depth of the particular pipeline.
All four scalar arithmetic units take their operands from, and return results to, the 64-bit S registers. The four units are: scalar add, scalar shift, scalar logical and the bitwise unit.
The scalar add unit is pipelined into three stages and performs 64-bit two's complement integer addition and subtraction. The scalar shift unit performs standard shifts in two clock periods or double length shifts on a pair of operands in three clock periods. The logical unit performs Boolean bitwise AND, OR and XOR operations.
The final scalar units performs the bitwise operations 'population count' and 'leading zero count'. Population count takes four clock periods to complete and involves counting up the number of bits set to one in an operand. The leading zero count operation totals up the number of zeros before encountering the most significant bit set to one, and takes three clock periods to complete. Unlike the other three scalar units, the bitwise unit produces a 7-bit result which is sent to an A register, rather than an S register. These last two bitwise operations are important for error checking, cryptography, graphics and pattern matching algorithms.
Scalar floating-point operations are performed by separate floating-point units.
|Functional Unit||Number of Pipeline Stages|
The CRAY-1 gets around the problem of issuing more than one instruction per clock cycle by using vector instructions. Three of the CRAY-1's functional units are dedicated to operations involving vectors and take their principal operands from the V registers. There is no need to issue new instructions for every element in the vector. Once a vector instruction has been issued the same operation is automatically performed on every element of that vector. It would take many scalar instructions to do the work of a single vector instruction and so issuing a vector instruction in one clock cycle can be thought of as issuing many scalar instructions in one clock cycle.
The three vector units - add, shift and logical - all perform essentially the same operations as the corresponding scalar units. The differences lie in the length of the pipelines see table 184.108.40.206 and 220.127.116.11, and the source of the operands which come as a stream from the V registers rather than alone or in pairs from the S registers.
Double length shifts are done on consecutive pairs of V register elements. Vector floating-point operations are performed by separate floating-point units.
|Functional Unit||Number of Pipeline Stages|
Additionally, the CRAY-1's vector functional units can take part in chaining, where the results from one pipeline can be fed directly into another pipeline without needing to be returned to main memory, see section 6.3.6.
There are three pipelined functional units dedicated to arithmetic on numbers stored in the CRAY-1's floating-point format, see figure 18.104.22.168. These units can take either scalar operands from the S registers or vector operands from the V registers. The three units are add, multiply and reciprocal, execute instructions of the form:
Figure 22.214.171.124 Cray-1 Floating point Number Format
|V0 <-- V1 + V2|
|V3 <-- 1/V4|
|S0 <-- S1 + S6|
|V5 <-- V4 * S5|
The floating-point add unit does addition and subtraction in a six stage pipeline, producing normalised floating-point results every clock period. There are seven stages in the pipeline that does floating-point multiplication and cooperates in floating-point division.
The way most processors do division is not amenable to pipelining. On the CRAY-1 division is done by using Newton-Raphson iterations which can be pipelined and give an approximate value for 1/x. The reciprocal unit uses a 14 stage pipeline to produce a result accurate to 30 bits. Unfortunately this is well short of the 48-bit capacity of the floating-point mantissa. The result is fed directly from the reciprocal unit to the multiply unit which performs an extra Newton-Raphson iteration bringing the accuracy up to 47-bits. After going through the reciprocal and multiply unit's pipelines the 47-bit result is deemed sufficiently accurate to be multiplied by the original 'dividend' to produce the desired quotient. The floating-point units can also take part in chaining, see section 6.3.6.
|Functional Unit||Number of Pipeline Stages|
When working on matrices it is often necessary to multiply or add constants to memory address pointers held in registers. This could easily be done in parallel while a vector is being processed. Unfortunately, the arithmetic units are often 'tied up' during vector operations and so separate address addition and multiplication units have been added to the CRAY-1.
Memory addresses are held in the 24-bit A registers. The address functional units perform 24-bit 'two's complement' integer arithmetic. The multiplication unit takes two numbers from the A register and returns the least significant 24-bits of the multiplication back to the A register. The addition unit performs both addition and subtraction.
|Functional Unit||Number of Pipeline Stages|
The CRAY-1 works on 64-bit words and can have 0.5, 1, 2 or 4Mwords of main memory with a 50ns cycle time. The memory is arranged into 16 interleaved banks all of which can work in parallel, see figure 6.3.3. Each bank has 72 modules all of which contribute 1 bit towards a 64-bit word - the extra 8 bits are used for error correction.
Figure 6.3.3 Cray-1 With Memory Banks in Place
The processor transfers data and instructions from main memory into its many registers along a 64-bit bus. The memory interface is pipelined and takes seven clock periods to load a register from main memory and six clock periods to save a register back to main memory. However, in the clock period after the first memory request has been sent, the processor can issue another memory request to one of the other memory banks without interfering with the first request. By staggering memory requests in this way a data word can be drawn from main memory every processor clock cycle, resulting in a memory bandwidth of 80Mwords/sec.
To make memory management easier, successive memory addresses are in adjacent, rather than the same, memory banks. Delays can still arise if accesses step through memory in 8 or 16 word increments. This would result in all the words coming from the same memory bank which requires four clock periods between each word transferred and brings the bandwidth down by a factor of four to 20Mwords/sec.
The auxiliary registers comprise buffer registers and support registers. The registers associated with the instruction buffer will be described in section 6.3.6 on instructions. Two sets of buffer registers sit between main memory and the A and S registers, see figure 6.3.4.
Figure 6.3.4 Cray-1 With Buffer & Support Registers in Place
There are 64 24-bit B registers that act as buffers for the A registers and 64 64-bit T registers that act as buffers for the S registers. These are designated B0 to B77 and T0 to T77 respectively. (Note Cray Research use the octal number base for these subscripts)
Functional units can only access the A, S and V registers but the A and S registers can get their values either from main memory or the B and T registers respectively. B and T registers can transfer data into the A and S register at the rate of one word per clock period.
The memory interface unit can transfer blocks of data in or out of the buffer registers at the rate of one word per clock period. The B and T registers can be used to hold addresses and intermediate operands for complicated calculations. These registers greatly assist the CRAY-1's scalar performance by decoupling the main memory speed from the processor speed.
|Address Buffer||B||64 x 24-bit||192|
|Scalar Buffer||T||64 x 64-bit||512|
Table 6.3.4-2 lists the auxiliary registers that help co-ordinate the actions of the processor.
|Real Time Clock||RT||64-bit||8|
Operations on vectors are supported by the Vector Mask (VM) and Vector Length (VL) registers. The 7-bit vector length register holds the number of elements in the a vector register that are to take part in subsequent operations, ie 1-64. For example, an operation on a 100 element vector would be split up into two successive operations, one on a 64 element vector and another on a 36 element vector, for which VL will be set to 64 and 36 respectively.
The 64-bit vector mask register takes part in vector merge and vector test operations. Each bit in the vector mask register corresponds to a separate element in a vector register. Boolean vector instructions test all the elements of a vector register and set the corresponding bits of the VM register to false (0) or true (1). The VM register can also act as a mask or template for merge operations, where only the elements corresponding to the bits of VM set to one take part in the subsequent operation.
A program can be confined to a particular area of main memory by setting the base address BA and limit address LA registers. Both of these 18-bit registers store the upper 18 bits of a 22-bit memory address - the lower four bits always being zeros. Every time the main memory is referenced, the address specified is automatically added to the BA register giving the absolute physical address. This requires all programmes to use relative memory addressing. This process ensures that memory references are always above the base address.
Should a program attempt to access main memory above the value in the limit address register an interrupt flag is set and the processor is interrupted. The LA register is also checked every time an instruction is issued.
The program counter or P register contains a 22-bit memory address that points to a 2-byte instructions. As each instruction is executed the P register is automatically advanced by one. However, when jump instructions are executed the P register is replaced by the destination of the jump.
When an error or an interrupt occurs the processor switches its attention to the cause of the error and what to do about it. Vector instructions in progress are allowed to complete. The CRAY-1 Operating System (COS) takes over from the application and executes 'interrupt service routines'. The current state of the processor needs to be saved into main memory in case the error is not fatal and the original application can be restarted. The operating system arranges for the B and V registers to be saved and the processor itself automatically saves the A and S registers.
An 8-bit exchange address or XA register is set before a program starts and tells the processor where the state of the processors is to be saved when an interrupt occurs. The state of the processor is defined by a 16-word block of data, called the exchange package, which is derived from the registers. The 8 bits of XA form the upper 8 bits of a 12-bit address. Changing the value of XA by one changes the value of the 12-bit address by 16, furthermore, 8 bits make provision for 256 exchange packages to be saved.
Execution errors interrupt the processor and trigger the error recovery routines. The error recovery routines read the 9-bit F register to see which error occurred. Table 6.3.4-3 shows which interrupt conditions correspond to which F register bits. A few of these interrupts can be disabled with the aid of the 3-bit mode or M register.
|Interrupt Condition||F Register Bit|
|Uncorrectable Memory Error||3|
|Program Range Error||4|
|Operand Range Error||5|
|Real-Time Cock Interrupt||7|
If the first bit of the M register is set to zero any error arising from floating-point calculations does not interrupt the processor. If the second bit has been set to zero then parity failures in the error checked main memory will be ignored. Finally, if the third bit is set to one then all bar the memory failure interrupts are disabled. Figure 6.3.4 shows the Cray-1 with the auxiliary registers in place.
The CRAY-1 has 128 basic instructions that occupy either one or two 16-bit parcels. The computation unit can issue up to a maximum of one instruction per clock period.
The four instruction buffers each hold 64 16-bit instruction packets. Each instruction buffer is connected to the main memory by a 64-bit wide bus. The computation unit loads a new instruction every time the program counter (P register) is updated. If the instruction is not in any of the buffers then the least recently used buffer is reloaded from the main memory. Instructions are loaded from main memory at the rate of four 64-bit words (which corresponds to 16 instruction parcels) per clock period.
Instructions move from the buffers into the 16-bit Next Instruction Parcel register (NIP). From the NIP register instructions flow into the 16-bit Current Instruction Parcel register (CIP). If the instruction requires two instruction parcels the second parcel is sent to the 11-bit Lower Instruction Parcel register (LIP). When the instruction is to be used both the CIP and LIP registers are loaded in parallel, see figure 6.3.5.
Figure 6.3.5 Cray-1 With Instrtuction Buffers in Place
|Instruction Buffer||I||4 x 64 x 16-bit||512|
|Next Instruction Parcel||NIP||16-bit||8|
|Current Instruction Parcel||CIP||16-bit|
|Lower Instruction Parcel||LIP||16-bit||8|
An instruction is only issued if it can complete without interference from any instructions that are already in the process of being executed. For example, only one A register and one S register can be accessed during a single clock period. A clash could conceivably occur if the results of two instructions are returned at the same time to either the A or S register files - one result having been through a longer pipeline than the other. To prevent such a clash the processor was designed to detect this condition in advance and hold back the second instruction until it could proceed without clashing.
A typical vector instruction takes a pair of vector operands, performs an operation on them and return the results to another vector. If the subsequent instruction is also a vector instruction and does not use any of the same vectors involved in the previous instruction then it can be issued in the next clock period.
|V0 <-- V1 + V2|
|V3 <-- V4 * V5|
Both instructions can be proceeding at the same time and both produce one floating-point result every clock period. Two FLOPS every 12.5ns clock period gives a speed of 160MFLOPS.
The functional units and registers can be regarded as resources of the control unit. Most instructions tie up some of these resources for several clock periods. For example, the following instruction takes three clock periods to complete and ties up the scalar add unit, and the registers S3, S5 and S7.
|S3 <-- S5 + S7|
It is wasteful to have all the other functional units idle during this time so the control unit is able to issue instructions every clock period. This is fine as long as the subsequent instructions do not use any of the resources that the first instruction is using. If subsequent instruction do use the same resources then programs would not execute correctly. For example, if the following pair of instructions were to be issued in subsequent clock periods the second instruction would be trying to use S3 two clock periods before it is correctly assigned.
|S3 <-- S5 + S7|
|S0 <-- S3 * S2|
If the second instruction were to be delayed for two clock periods then the add unit would be trying to write to S3 at the same time that the multiply unit was trying to read from it, which is not permissible. The only correct solution is not to issue the second instruction until the fist has completed. The compiler will attempt to avert the clash by looking for three instructions that could be inserted between the two above without interfering with them nor changing the meaning of the program. Unfortunately, this is not always possible and a pair of instructions that would cause a resource contention may end up next to each other. To stop the second instruction being executed before the first has completed the CRAY-1 uses a technique called reservation.
Once an instruction has been issued all the resources used by that instruction are automatically flagged as reserved until the instruction has completed. Any subsequent instruction that involves reserved resources is prevented from being issued until the reservation has been lifted. When vector registers are involved the whole register is reserved not just one element, this is because the circuitry that permits access to vector registers can only access one element in a vector register at a time. Furthermore, operand V registers are reserved as well as result V registers. While the control unit is waiting for a 'reservation' to be lifted all inactive functional units and registers remain idle.
Reservations are not placed on the VL register or on S registers taking part in vector operations. Values from these registers are copied into the functional units when the instruction begins, which immediately leaves the registers free to be used by other operations.
'Chaining' allows much more effective use of the functional units by connecting the output of one vector pipeline to the input of another, thereby, producing a deeper pipeline that performs two operations. As soon as the result comes out of the first pipeline, it is duplicated, with one copy being forwarded to the next pipeline and the other copy being sent to the original result vector register. Using the following two instructions as an example, reservation would normally prevent the second instruction from starting until the final element of V1 has been computed.
|V1 <-- V0 * S1|
|V3 <-- V1 + V2|
To allow chaining, the reservation on V1 is removed as soon as the first element has been computed. This allows the second instruction to begin as soon as its first operand is available. The results from the second functional unit may themselves be sent to another functional unit forming an even deeper pipeline. Up to eight functional units can be chained together. It is not possible to chain together the following pair of instructions because they both use the same pipeline.
|V1 <-- V0 + S1|
|V3 <-- V1 + V2|
The CRAY-1 computer introduced the concept of chaining.
The maximum main memory bandwidth of 80Mwords per second is a serious impediment to high performance. A typical instruction would make four memory references - one to get the instruction, two to get operands and one to return the result. This limits the processing rate to 20MFLOPS. Buffers and registers help compensate for the speed of the main memory being one quarter of that of the functional units. Multiple functional units and chaining not only provide parallelism but also help reduce the number of main memory accesses by allowing several operations to be performed on an operand between it being fetched and being returned to main memory. It is only through the use of buffers registers, vectors and chaining that the potential of the functional units can be realised.
Vector processing is only effective if the application has enough vector operations and they can be identified during compilation. On applications with a high vector content the CRAY-1 operates at about 140 MFLOPS.
However, on a typical application the processing rate drops to 20-50MFLOPS. The CRAY-1 has fast scalar processing, which gives a good balance between vector and scalar performance and helps give the machines good overall throughput.
Many vector machines have memory-memory architectures and consequently can work on much larger vectors. On these machines it is usually 'expensive' time-wise to start a vector operation and so for short vectors it is quicker to do several scalar operations instead. The CRAY-1 has a short vector start-up time and works more efficiently in vector mode even on small vectors. It is only worth switching to scalar operations when vector lengths drop to about 3-4 elements.
The CRAY-1 architecture underwent minor architectural changes during its life time. CRAY-1S and CRAY-1M models increased the I/O bandwidth and amount of main memory, respectively. The CRAY X-MP series of machines are essentially several CRAY-1s running in parallel. Each processor runs at 105MHz which is faster than a standard CRAY-1 which runs at 80MHz. The CRAY X-MP can have two or four processors depending on the model with the processors communicating through a shared memory.