Parallel Computer Taxonomy



"I find digital computers of the present day to be very complicated and rather poorly defined. As a result, it is usually impractical to reason logically about their behaviour. Sometimes, the only way of finding out what they will do is by experiment. Such experiments are certainly not mathematics. Unfortunately, they are not even science, because it is impossible to generalise from their results or to publish them for the benefit of other scientists."

Speech given by Tony Hoare at the Boston
Computer Museum on the occasion of BYTE
[magazine's] 10th Anniversary celebration.
 

7 Taxonomy

In the study of macroscopic computer architecture it is useful to compare one machine against another. It is desirable to do this without having to compare detailed data sheets for each computer. For this purpose computers can be classified based on just a few of their characteristics. The process of classification is called 'taxonomy'.

A successful taxonomy should be able to classify all current computers as well as computers in the foreseeable future. Selecting suitable characteristics is not a formal activity but requires prejudice, insight and luck. Prejudice, because as computers develop their significant characteristics change. What is thought to be important now may become superseded by developments and innovations in the future. Another possibility is that a characteristic becomes so important that every computer in the future will have it. If this happens, some categories of the classification will become redundant. In either case, if a characteristic no longer becomes tenable the taxonomy will fall into disuse, and be superseded by another.

In devising a taxonomy insight is used to pre-judge the possible future significance of a characteristic and as prejudice is used it needs an element of luck to be successful. To some extent, the choice of characteristics can be judged qualitatively by the acceptance and longevity of a taxonomy. For example, the most popular taxonomy in use today is that of Flynn reported in 1966.

The ability to place machines into categories is useful in two different ways. Firstly, one machine can be contrasted with another. Here the taxonomy is being used as a qualitative measure of dissimilarity. Secondly, the taxonomy can measure the resemblance between two machines. This can be useful in that a property discovered about one machine may also be true for other machines in the same category.

A good taxonomy should be able to place a computer in one, and only one, category or 'taxon' (plural 'taxa'). In practice, the versatility of some computers makes this very difficult and a small number of machines will fit several categories or taxa. Some machines have been designed for research into parallelism and have the ability to be reconfigured or run in several distinct modes. Carnegie-Mellon University's Cmm.p computer is an example of such a machine. As long as the number of multi-category machines is small the taxonomy can still be of use. A further difficulty arises in classifying machines whose category changes when they are upgraded.

In practice, if a taxonomy is to be accepted it should be simple, and its nomenclature easy to remember and easy to use. Many taxonomies have been reported, and the most accepted ones include those of; Flynn, Handler, Hockney & Jesshope and Shore.

It should be emphasised that taxonomies are simply cataloguing systems designed to provide a conceptual framework for the discussion of computer architecture.

 

7.1 Flynn's Taxonomy

Currently, the most popular nomenclature for the classification of computer architectures is that proposed by Flynn[1] in 1966. Flynn chose not to examine the explicit structure of the machine, but rather how instructions and data flow through it. Specifically, the taxonomy identifies whether there are single or multiple 'streams' for data and for instructions. The term 'stream' refers to a sequence of either instructions or data operated on by the computer. The four possible categories of Flynn's taxonomy are;

Table 7.1 The Four Categories Identified by Flynn's Taxonomy
SISD Single Instruction Single Data
SIMD Single Instruction Multiple Data
MISD Multiple Instruction Single Data
MIMD Multiple Instruction Multiple Data

"The multiplicity is taken as the maximum possible number of simultaneous operations [instructions] or operands [data] being in the same phase of execution at the most constrained component of the organization"[1] Flynn (1966)

7.1.1 SISD Machine

SISD machines are conventional serial computers that process only one stream of instructions and one stream of data. Figure 7.1.1 illustrates the architecture of a SISD machine, which is also called a von Neumann computer.

SISD Archirecture

Figure 7.1.1 Flynn's SISD Architecture

Instructions are executed sequentially but may be overlapped by pipelining - most SISD systems are now pipelined. SISD computers may have several functional units, examples include extra mathematics coprocessors, vector units, graphics processors and I/O processors. As long as everything can be regarded as a single processing element the architecture remains SISD.

Examples of SISD machines include:

  • CDC 6600 which is unpipelined but has multiple functional units.
  • CDC 7600 which has a pipelined arithmetic unit.
  • Amdhal 470/6 which has pipelined instruction processing.
  • Cray-1 which supports vector processing.

7.1.2 SIMD Machine

This category encompasses computers which have many identical interconnected processors under the supervision of a single control unit, see figure 7.1.2. The control unit transmits the same instruction, simultaneously, to all processors. The topology of the communications network is not addressed by Flynn's taxonomy.

SIMD Archirecture

Figure 7.1.2 Flynn's SIMD Architecture

All the processing elements simultaneously execute the same instruction and are said to be 'lock-stepped' together. Each processor works on data from its own memory and hence on distinct data streams. (Some systems also provide a shared global memory for communications.) Every processor must be allowed to complete its instruction before the next instruction is taken for execution. Thus, the execution of instructions is said to be synchronous.

This category corresponds to the array processors discussed in section 2.3.3 and examples include; ILLIAC-IV, PEPE, BSP, STARAN, MPP, DAP and the Connection Machine (CM-1).

7.1.3 MISD Machine

MIMD Archirecture

Figure 7.1.4 Flynn's MIMD Architecture

MISD machines have many processing elements, all of which execute independent streams of instructions. However, all the processing elements work on the same data stream. There are two ways in which this can be done: firstly the same data item can be fed to many processing elements each executing their own stream of instructions, see figure 7.1.3-1. Alternatively, the first processing element could pass on its results to the second processing element and so on, thus forming a macro-pipeline, see figure 7.1.3-2.

MISD Archirecture

Figure 7.1.3-2 Macropipeline Varient of Flynn's MISD Architecture

The only known example of a computer capable of MISD operation is the C.mmp built by Carnegie-Mellon University. This computer is reconfigurable and can operate in SIMD, MISD and MIMD modes.

7.1.4 MIMD Machine

Most multiprocessor systems and multiple computer systems can be placed in this category. A MIMD computer has many interconnected processing elements, each of which have their own control unit, see figure 7.1.4. The processors work on their own data with their own instructions. Tasks executed by different processors can start or finish at different times. They are not lock-stepped, as in SIMD computers, but run asynchronously.

MIMD Archirecture

Figure 7.1.4 Flynn's MIMD Architecture

Examples include; C.mmp, Burroughs D825, Cray-2, S1, Cray X-MP, HEP, Pluribus, IBM 370/168 MP, Univac 1100/80, Tandem/16, IBM 3081/3084, C.m*, BBN Butterfly, Meiko Computing Surface (CS-1), FPS T/40000, iPSC.

7.1.5 Extensions to Flynn's Taxonomy

As more work was done on computer architecture, Flynn's taxonomy proved too coarse. Various authors have used Flynn's taxonomy as a basis of their own taxonomy - adding extra parameters to subdivide the categories.

The SISD category can be split into machines with one functional unit, and those with many. Computers with one functional unit are also called 'serial scalar computers', these include; IBM 701, IBM 1620, IBM 7090, VAX 11/780. Computers with more than one functional unit include; CDC 6600, CDC Cyber 205, CDC-NASF, CDC Star 100, Cray-1, FPS-164, FPS AP-120B, Fujitsu VP 200, Fujitsu FACOM-230/75, IBM 360/91, IBM 370/168UP, IBM 3838, TI-ASC, which are all vector machines.

The SIMD category can be divided into those machines whose individual processors can only work on a single bit during an instructions cycle, and those that work on whole word, (ie word-sliced and bit-sliced machines).

Word sliced machines include; ILLIAC-IV, PEPE, BSP
Bit-sliced machines include; STARAN, MPP, DAP, CM-1

There are four main methods by which the processing elements can communicate with each other. The first two rely on a shared memory, which can be either in one main block, or distributed among the processing elements. In both cases the memory has a continuous address space. If the memory is in a single main block there will inevitably be contention between processors wishing to access the memory bus. The processors are, therefore, heavily interdependent and are said to be 'tightly coupled'.

In distributed memory systems each processor can access its own memory without penalty. If, however, the access is to a memory cell located on another processing element there is a large overhead. These processors are said to be 'loosely coupled'. Table 7.1.5 gives some abbreviated extensions used to extend Flynn's taxonomy.

Table 7.1.5 Extensions used with the SIMD and MIMD Categories of Flynn's Taxonomy
-SM-R shared memory multiple reads to same location allowed
-SM-W shared memory multiple writes to same location allowed
-SM-RW shared memory, both multiple reads and writes allowed
-SM shared memory, both multiple reads and writes disallowed
-TC tightly coupled
-LC loosely coupled
-UC uncoupled

Computers that allow simultaneous writing to memory (ie -SM-W and -SM-RW) may resolve conflicts in several different ways. In 'equality conflict resolution', simultaneous writing to memory is only permitted if all the processors wish to store the same value. In 'priority conflict resolution', all the processors are uniquely numbered according to some priority. Should a writing conflict occur the processor with the highest priority gets to write to memory, and all the others are blocked. In 'arbitrary conflict resolution', any one arbitrary processor may succeed.

The other two communication methods involve channels. These can be either bus based or network based. In bus based systems all the processors are directly linked together, whilst networked based systems usually have a processor directly linked to just a few others. Messages in networked systems will often need to be relayed via several processors in order to reach their destination. Communicating channel systems have local exclusive memory and do not need a shared memory.

Tightly coupled (-TC) machines include; C.mmp, Burroughs D825, Cray-2, S1, Cray X-MP, HEP, Pluribus. Loosely coupled (-LC) machines include; IBM 370/168 MP, Univac 1100/80, Tandem/16, IBM 3081/3084, C.m*, BBN Butterfly. Uncoupled (-UC) machines include; Meiko Computing Surface, FPS T/40000, iPSC.

(Some authors regard MIMD computers without shared memories as a collection of independent SISD uniprocessors and term them Multiple SISD or MSISD machines. However, these machines appear to fit well enough in the MIMD division and are usually thus categorised.)

Almost all current interest in parallel computers centres about the multiple data stream concept (MD).

 

7.2 Handler's Classification

In 1977 Handler proposed an elaborate notation for expressing the pipelining and parallelism of computers. Handler's taxonomy[2] addresses the computer at three distinct levels: the processor control unit (PCU), the arithmetic logic unit (ALU), and the bit-level circuit (BLC). The PCU corresponds to a processor or CPU, the ALU corresponds to a functional unit or a processing element in an array processor, and the BLC corresponds to a the logic needed to perform one-bit operations in the ALU.

Handler's taxonomy uses three pairs of integers to describe a computer:

		Computer = (k * k', d * d', w * w')

	Where k = number of PCUs
	Where k'= number of PCUs that can be pipelined
	Where d = number of ALUs controlled by each PCU 
	Where d'= number of ALUs that can be pipelined 
	Where w = number of bits in ALU or processing element (PE) word
	Where w'= number of pipeline segments on all ALUs or in a single PE

The following rules and operators are used to show the relationship between various elements of the computer. The '*' operator is used to indicate that the units are pipelined or macro-pipelined with a stream of data running through all the units. The '+' operator is used to indicate that the units are not pipelined but work on independent streams of data. The 'v' operator is used to indicate that the computer hardware can work in one of several modes. The '~' symbol is used to indicate a range of values for any one of the parameters. Peripheral processors are shown before the main processor using another three pairs of integers. If the value of the second element of any pair is 1, it may omitted for brevity. Handler's taxonomy is best explained by showing how the rules and operators are used to classify several machines.

The CDC 6600 has a single main processor supported by 10 I/O processors. One control units coordinates one ALU with a 60-bit word length. The ALU has 10 functional units which can be formed into a pipeline. The 10 peripheral I/O processors may work in parallel with each other and with the CPU. Each I/O processor contains one 12-bit ALU. The description for the 10 I/O processors is:

	CDC 6600I/O = (10, 1, 12) 
The description for the main processor is:
	CDC 6600main = (1, 1 * 10, 60)

The main processor and the I/O processors can be regarded as forming a macro-pipeline so the '*' operator is used to combine the two structures:

   CDC 6600 = (I/O processors) * (central processor)

	    = (10, 1, 12) * (1, 1 * 10, 60)

Texas Instrument's Advanced Scientific Computer (ASC) has one controller coordinating four arithmetic units. Each arithmetic unit is an eight stage pipeline with 64-bit words. Thus we have:

	ASC = (1, 4, 64 * 8)

The Cray-1 is a 64-bit single processor computer whose ALU has twelve functional units, eight of which can be chained together to from a pipeline, see section 6.3. Different functional units have from 1 to 14 segments, which can also be pipelined. Handler's description of the Cray-1 is:

	Cray-1 =  (1, 12 * 8, 64 * (1 ~ 14))

Another sample system is Carnegie-Mellon University's C.mmp multiprocessor. This system was designed to facilitate research into parallel computer architectures and consequently can be extensively reconfigured. The system consists of 16 PDP-11 'minicomputers' (which have a 16-bit word length), interconnected by a crossbar switching network. Normally, the C.mmp operates in MIMD mode for which the description is (16, 1, 16). It can also operate in SIMD mode, where all the processors are coordinated by a single master controller. The SIMD mode description is (1, 16, 16). Finally, the system can be rearranged to operate in MISD mode. Here the processors are arranged in a chain with a single stream of data passing through all of them. The MISD modes description is (1 * 16, 1, 16). The 'v' operator is used to combine descriptions of the same piece hardware operating in differing modes. Thus, Handler's description for the complete C.mmp is:

	C.mmp = (16, 1, 16) v (1, 16, 16) v (1 * 16, 1, 16)

The '*' and '+' operators are used to combine several separate pieces of hardware. The 'v' operator is of a different form to the other two in that it is used to combine the different operating modes of a single piece of hardware.

The ICL DAP is another processor that can be reconfigured, it is a mesh connected bit-sliced array processor with a single control unit, see section 6.1. It can work as 128 ALU's with 32-bit words and 64 Kbits of memory. Reconfiguring gives a computer with 4,096 ALU's with 1-bit words and 4Kbits of memory. Originally this machine was 'front-ended' by an ICL 2900 computer for which the description is (1, 1, 32). The DAP was designed to look like a 2MByte block of memory to the ICL 2900.

	ICL DAP = (1, 1, 32) * [(1, 128, 32) v (1, 4096, 1)]

Handler's taxonomy does not indicate by what topology the processors are connected, (in this case a mesh network).

The ILLIAC-IV array processor was developed at the University of Illinois and fabricated by Burroughs, see section 6.2. It is made up from a mesh connected array of 64 64-bit ALUs, front-ended by a Burroughs B6700 computer. Later designs used two DEC PDP-10's as the front end, however, the ILLIAC-IV can only accept data from one PDP-10 at a time.

	B6700-ILLIAC-IV  = (1, 1, 48) * (1, 64, 64)

	PDP 10-ILLIAC-IV = (2, 1, 36) * (1, 64, 64)

The ILLIAC-IV can also work in a half word mode where there are 128 32-bit processors rather than the normal 64 64-bit processors.

	PDP 10-ILLIAC-IV/2  = (2, 1, 36) * (1, 128, 32)

Combining this with the above we get the following:

	PDP 10-ILLIAC-IV = (2, 1, 36) * [(1, 64, 64) v (1, 128, 32)]

The OMEN-60 by the Sanders Corporation uses a PDP-11 to front end a block of 64 1-bit ALUs working as an associative processor. The PDP-11 interprets the program and forms the control signals for the associative processor. If the program can use the associative processor it is run on the associative processor otherwise it is run on the PDP-11. The two do not act as a pipeline so the '+' rather than the '*' operator is used to combine them.

	OMEN-60 = (1, 1, 16) + (0, 64, 1)

The '0' indicates that the associative unit has no control unit and cannot interpret its own instructions.

 

7.3 Shore's Taxonomy

Shore proposed his taxonomy in 1973. It is based on the structure and number of the functional units in the computer. Shore's taxonomy has six categories each designated by a number.

7.3.1 Type-I

Classical von Neumann serial computers belong to category 'I', which also corresponds to the SISD category in Flynn's taxonomy. Only one control unit is allowed, however, it may coordinate several functional units, see figure 7.3.1. Pipelined vector supercomputers fit this criterion and belong in this category. The processor works on memory words (horizontal memory slices) and is said to be 'word-serial' bit-parallel. Examples of type-I machines include the CDC 7600 and the Cray-1.

Shores Type-I Archirecture

Figure 7.3.1 Shore's Type-I Architecture

7.3.2 Type-II

Computers in this category are similar to those of Type-I except that they work on bit-slices of memory (vertical slices) rather than word-slices (horizontal slices), see figure 7.3.2. The processing unit works in 'word-parallel' bit-serial mode. Examples include the ICL DAP and the Goodyear Aerospace STARAN.

Shores Type-II Archirecture

Figure 7.3.2 Shore's Type-II Architecture

7.3.3 Type-III

Type-III machines are combinations of Type-I and Type-II, see figure 7.3.3. A Type-III machine has two processing units, one that works on data-words and the other that works on data-slices.

Shores Type-III Archirecture

Figure 7.3.3 Shore's Type-III Architecture

The DAP and the STARAN machines can be programmed to work on data-words and data-slices. However, this is done with a single processing unit, rather than two, and so they do not properly fit into the type-III category. The Sanders Associates OMEN-60 series of machines do fit this category.

7.3.4 Type-IV

Type-IV machines are similar to Type-I machines, except that the processing unit and the data memory are replicated, see figure 7.3.4. Communication between the various processing elements can only occur though the control unit. The machine is easy to expand by adding more processing elements, however, the communications bandwidth is limited by having all messages go though the central control unit. Machines in this category could be called unconnected arrays, eg PEPE.

Shores Type-IV Archirecture

Figure 7.3.4 Shore's Type-IV Architecture

7.3.5 Type-V

This class of machine is similar to that of Type-IV, except that the processors are arranged in a linear chain and have nearest-neighbour connections, see figure 7.3.5. Every processor is able to address the memory of its nearest neighbours. An example of this machine is the ILLIAC-IV. (In addition to the nearest-neighbour connections the ILLIAC-IV also provides direct short-cut communications between every 9th processing element.) These machines can also be called connected arrays.

Shores Type-V Archirecture

Figure 7.3.5 Shore's Type-V Architecture

7.3.6 Type-VI

The first five types of machines have distinct processing units and memories interconnected by a bus or a switching network. The final class of machines identified by Shore have their processing power intimately intermeshed with the memory, see figure 7.3.6. This is sometimes called 'Logic In Memory Array' or LIMA.

Shores Type-VI Archirecture

Figure 7.3.6 Shore's Type-VI Architecture

Simple associative memories and complex associative processors, are type-VI machines, see section 2.5.

 

7.4 Hockney and Jesshope's Structural Taxonomy

Hockney and Jesshope developed an elaborate notation called Algebraic-style Structural Notation, (or ASN,) to describe parallel computers. This notation is the basis of their structural taxonomy. Unfortunately, it is too detailed to be described here, and herein lies its main limitation - it is complicated. The taxonomy is arranged as several trees, with a machine being described by all the labels in all the parent nodes all the way back to the root node.

Some of the nodes contain the algebraic-style structural notation, which can include many terms with several superscripts and subscripts. For example, the following expression describes the Cray-1 computer:

C(Cray-1) = Iv12[12Ep12 - 16M50]r; 12Ep = {3Fp64,9B}

With the notation and the taxonomy it is possible to sit down with a pencil and paper and write a detailed description of the machines architecture. The expression for the Cray-1 given above translates into:- An unpipelined instruction unit with vector instructions and a clock period of 12ns. A twelve stage pipelined execution unit with a clock period of 12ns connected to sixteen memory banks with an access time of 50ns. Instructions are issued when the execution unit and registers are free. The twelve execution units comprise of three 64-bit floating-point arithmetic pipelines that can work in parallel with nine unpipelined integer units.

This taxonomy addresses more characteristics than any other, however, the architecture is not apparent at the first glance of an expression - or the second. [Hockney & Jesshope's taxonomy is described in detail in their book 'Parallel Computers 2'[4], p60-81.]

 

7.5 Towards a New Taxonomy

Although many computer taxonomies have been devised over the years, none have enjoyed the same level of popularity as Flynn's taxonomy, which dates back to 1966. Unfortunately, Flynn's taxonomy is limited in the number of categories it has, worse still, most machines of current interest fit into just one category - MIMD.

Flynn's taxonomy has four categories, SISD, SIMD, MISD, MIMD. The MISD category has only one known member - the Carnegie-Mellon University's C.mmp multiprocessor - and while this can be configured to work in MISD mode it normally works in MIMD mode. The SISD category is for serial computers, so all bar one parallel computers fit into just two categories. The variety of modern parallel computers warrants more categories. While there is interest in SIMD machines, the thrust of current work is in MIMD machines. Flynn's taxonomy tells us little more than whether a machine is parallel or not.

A problem arises when Flynn's taxonomy is applied to vector supercomputers like the Cray-1. In Hockney and Jesshope's book[4] the Cray-1 is categorised as SIMD machine because it has vector units. However, Hwang and Briggs in their book[5] categorise the Cray-1 as a SISD because there are no multiple processing elements. Which of these classifications is correct comes down to whether the vector units are regarded as processing a single or multiple data stream. This is open to interpretation, and leads to problems.

Handler's taxonomy is a more useful - if a little more complicated. While Flynn's taxonomy is easy to use in speech, Handler's is cumbersome. The direct use of numbers in a taxonomy's nomenclature makes it much more abstract and hence difficult. Handler's taxonomy is highly geared towards the description of pipelines and chains. And while it is well able to describe the parallelism in a single processor, the variety of parallelism in multiprocessor computers is less well addressed. Data flow computers and neural networks are particularly awkward in Handler's taxonomy.

In Shore's taxonomy, Type-I machines match Flynn's SISD category, and machines of Type-II through to Type-V are divisions of Flynn's SIMD category. Most modern parallel machines fit into Flynn's MIMD category which is not addressed by Shore. Labelling the categories with numbers, which do not in any way relate directly to the architecture, makes the taxonomy difficult to remember. Furthermore, it is not possible to derive the architecture from the label, it is necessary to know what every category represents.

Hockney and Jesshope's structural notation is very detailed but far too complicated to be communicated verbally. A variety of computer characteristics are now examined to see which are suitable to form the basis of a new taxonomy.

7.5.1 Performance

Possibly the most important parameter of a computer is its computational power or performance. This is often measured by running small programs called benchmarks which require a known number of operations or calculations for their completion. By observing how quickly they complete, an idea of what the machine is capable of is gained. The results are usually quoted in MIPS, MFLOPS, Dhrystones, Whetstone, SPECmarks and LIPS, although other units also exist.

At first sight benchmarks seem a fine, objective and quantitative way of evaluating a machine. Unfortunately, it has been found that for a single machine the benchmark figures vary depending on the particular programming language used and the operating system's level of activity. The sophistication of the compiler and how well it 'knows' the machine also plays an important part. Consequently, many versions of the same algorithm, solving the same problem, on the same machine give wildly differing results. Optimising compilers and the careful choice of benchmarks can give a machine a rating that is way beyond what the machine will achieve in general use. This removes the notion that benchmarks are either representative or objective. A cynical view of manufacturer's benchmark ratings is to regard them as a level of performance that the machine is incapable of exceeding.

Benchmarks give a figure for a specific type of machine performance, ie specifically on the benchmark program and similar types of calculation. General performance is much more difficult to measure. No way has yet been devised to give a single figure which represents the performance of a machine on all problems. Different applications exercise different parts of a computer. Scientific and engineering applications tend to work the arithmetic unit or vector units. Database systems, artificial intelligence, and expert systems tend to exercise the logic circuitry. Commercial data processing on the other hand involves a great deal of memory accessing and peripheral I/O, whilst hardly using the arithmetic unit at all. Machines marketed for these different uses have had the corresponding parts of the processors optimised. Conversely, if a machine is to be used for one type of application it makes financial sense not to optimise those parts of the processor that will not be too heavily used.

If it were possible to give a figure which represents the performance of the computer, all this figure would allow us to do is compare and rank on the basis of computational power. It tells us nothing about the architecture of the machine. This, along with the tenuous link between capability and benchmark, and no generally agreed upon units make performance an unsuitable parameter for a taxonomy.

7.5.1.1 Clock Speed

The clock speed of the processor indicates how quickly a processor can get through instructions. It can be considered as a primitive measure of performance. Unfortunately, it does not tell us how much work the machine can do in a single clock cycle. This will be much higher than the clock speed suggests if there is: pipelining, one or more vector units, or there are many processors. Furthermore, many instructions in a CISC like processor will require several clock cycles for their completion, in which case the clock speed will be an overestimate of performance. Clock speed does not give any information about parallelism at any level.

There is research under way into producing 'free running' processors, ie they do not have a clock, but allow the instructions to flow through them as quickly as they can. An instruction is started off by the previous instruction clearing the first stage of the processor. Clock speed would have no meaning to a free running processor and the taxonomy would fail.

7.5.1.2 Word Size

The word size of a machine is the number of bits a processing unit treats as elemental. Early processors were bit-serial, ie they could only work on 1-bit at a time. Work on larger data items involved a string of 1-bit data items processed one after another. The simple processors found in highly parallel machines like the ICL DAP are still of this type, see section 6.1. To increase throughput, all the bits of the larger data items can be processed at the same time - a process called bit-parallel.

Early microprocessors were 4- and 8-bit, modern microprocessors are 32- and in some cases 64-bit. The data path between the processor and main memory, the registers, and all the functional units work in parallel on this number of bits.

Word size reflects the sophistication of a processor - a 64-bit processor will be more sophisticated than a 4-bit processor. Unfortunately, a given word size will cover a large range of capabilities and significantly overlap the capabilities of processors of similar word sizes. Word size does not convey enough information to be used in a taxonomy.

7.5.2 Number of Processing Elements

A parallel computer, by definition has several different functional units working at the same time. The performance of individual functional units or processors vary greatly between various parallel computers. The processing elements in neural networks are very simple, and many such processors fit on a single chip. At the other extreme, all modern supercomputers have several processors, each of these processors can be supercomputers in their own right.

Computers with large numbers of processors tend to have, small processors, and a small grain size. Conversely, many computers with a small number of parallel processors have large processors, a large memory and work with a large grain size. The behaviour of a computer with a large number of processing elements is unlikely to be too different from a similar computer with a few processors less. However, if the number of processing elements is radically different then the behaviour of the two computers is also likely to be different[3]. The precise number of processors gives an unnecessary amount of detail, however, a coarse indication of the number of processors can infer a lot of useful information about the architecture of the machine.

7.5.3 Topology

Different topologies have proved to be optimal for various algorithms and applications. However, there is a great deal more to learn about the affects of communication and topology on the process of computation. As more is discovered about this relationship the knowledge can be applied to other computers with the same topology. Topology, more than anything else, indicates how or if a machine will scale up. Topology therefore, is a good candidate for inclusion in a taxonomy. If the processing elements are connected by a switching network rather than a fixed topology then the type of network can be indicated.

7.5.4 Grain Size

Grain size is the length of a typical thread of code that is executed by a processing element before it is interrupted or halted. The interruption can be triggered by events that are either external or internal to the process. An external event might be the operating system descheduling the current process so that another process may have its turn at execution. An internal event would result from the process needing to communicate with another process or processor. Many systems only allow internal events to trigger an interrupt, in which case the grain size can be though of as the typical ratio of computation to communications.

Grain size is expressed as; fine/small, medium or coarse/large. A small or fine grained computer will tend to execute only a few machine code instructions before another program thread takes over. On the other hand a large or coarse grained machine could execute complete high-level procedures or programs before relinquishing the processor. Medium grained machines lie between these two extremes.

Grain size also indicates how parallelism is expressed in software. Coarse grained machines usually have communication, synchronisation, and scheduling explicitly stated in the programs. As the grain size decreases, its management progressively moves from the compiler to the operating system to the firmware.

Coarse grained machines tend to have powerful processors with large memories whilst fine grained machines have small processors and small memories. Grain size reveals enough about the architecture of a machine to warrant inclusion in a taxonomy.

7.5.5 Memory Size

Memory size can be stated in two ways; the amount of main memory in the system as a whole, or the amount of memory attached to an individual processor. The amount of memory gives some indication as to the size of the programs the machine is capable of running. A large memory size implies a large grain-size and fairly powerful individual processors. Most systems with powerful processing elements only have a small number of them in the complete parallel computer.

It is not necessary for all the processors in a parallel computer to have the same amount of memory. Processors topologically at the edge of the system may well serve as backing-store or graphics interfaces, and will often have much more memory than 'compute only' processors. Rather than give a list of the differing processors and their memory content it is possible to give the average amount of memory per processor. Unfortunately, this may be a complicated number (ie not a power of two), and so difficult to remember. Changing the amount of memory in one processor will change the taxonomy, without significantly changing the architecture.

A further complication arises in systems where the processors can not only access their own memory but also the memories of some or all of their neighbours. The inclusion of cache makes a significant difference to the behaviour of shared memory systems, and should somehow be indicated along with main memory size. Like clock speed, memory size looks like a simple parameter to state, but in the context of parallel computers there are too many complications. Grain size and number of processors can be inferred from memory size, conversely, a rough idea of memory size can be inferred from the number of processors and the grain size.

7.5.6 Control Multiplicity

This is the ratio of control units to processing elements and will normally only apply to the computers found in Flynn's SIMD category. In Flynn's other categories this ratio will be one-to-one. Array processors tend to have 10's or 100's of processing units coordinated by a single control unit. With associative processors this number rises to 1,000's or 100,000's of rudimentary processing units to one control unit. 'Control multiplicity' tells us about the nature of the data and control streams in much the same way Flynn's taxonomy does.

7.5.7 A New Taxonomy

The nomenclature for the new taxonomy has four fields which are stated in the following order: processor multiplicity, grain size, topology, and control multiplicity. Rather than stating processor multiplicity, grain size and control multiplicity as a number, they are to be stated as rough orders of magnitude. The most familiar scale of 'approximate multiplicity' is used to indicate the integration levels for silicon chips, see table 7.5.7-1 This is the basis of the scale in the new taxonomy.

Table 7.5.7-1 Integration Levels for Silicon Chips
Number of GatesNumber of TransistorsMultiplicity
01ZSI - Zero Scale Integration
102-99SSI - Small Scale Integration
100100-1000MSI - Medium Scale Integration
10001000-10000LSI - Large Scale Integration
1000010000- +VLSI - Very Large Scale Integration

To allow for future expansion it is desirable to extend this scale up, however, silicon chip integration levels do not seem to be well differentiated above VLSI. The way physicists indicate radio frequency bands roughly corresponds to the scale of silicon chip integration levels, see table 7.5.7-2.

Table 7.5.7-2 Radio Frequency Bands
FrequencyMultiplicity
3-30kHzVLF - Very Low Frequency
30-300kHzLF - Low Frequency
0.3-3MHzM - Medium Frequency
3-3MHzHF - High Frequency
30-300MHzVHF - Very High Frequency
0.3-3GHzUHF - Ultra High Frequency
3-30GHzSHF - Super High Frequency
30-300GHzEHF - Extra High Frequency

The nomenclature for the integration levels for silicon chips along with the top end of the radio frequency table is used to indicate multiplicity in the new taxonomy, see table 7.5.7-3.

Table 7.5.7-3 Multiplicity Bands for New Taxonomy
Number PresentMultiplicity
*In order to allow the new taxonomy to be communicated verbally only one letter is used to indicate a band. Unfortunately, if only the first letter is used there is a clash between 'Small' and 'Super-large'. To avoid this conflict 'P' - the first letter of the second syllable - is used to indicate the suPer-large band.
1Z - Zero
2 - 20S - Small
20 - 100M - Medium
100 - 1000L - Large
1000 - 10000V - Very-large
10000 - 100000U - Ultra-large
100000 - 1000000P - suPer-large*
1000000 - 10000000E - Extra-large

The first field of the new taxonomy is processor multiplicity - the number of processors in the computer is turned into a single letter using table 7.5.7-3. The second field is grain size. As described in section 7.5.4, grain-sizes can either be: small, medium or large - which, by a happy coincidence, can also be condensed into a single letter using table 7.5.7-3.

With the rules given so far, supercomputers would share the same category as personal computers. This is deemed unsatisfactory and has been resolved as follows:- a forth state has been added to the 'grain size' parameter. The fourth term, 'Very-large', comes from the observation that supercomputers tend to run huge programs, essentially without interruption. A very-large grain size is to be used to indicate a supercomputer.

The third parameter in the new taxonomy is to be topology. It is left to the user's discretion to state this as concisely as practical. Where possible just one word should be used.

The final field in the new taxonomy is control multiplicity. Control multiplicity is not simply the number of control units present, but can be defined as: one less than the number of processing units driven by a single control unit. As before the control multiplicity is turned into a single letter by using table 7.5.7-3.

Many parallel computers will have one control unit for every processing unit, (ie control multiplicity is zero,) in which case the control multiplicity need not be stated. If the computer has only one processor (ie processor multiplicity is zero), the matter of topology does not arise and also need not be stated. With only one processor the control multiplicity must also be zero and so is not stated.

Control multiplicity and topology are the only terms that are sometimes not given, the other two terms should always be stated. The possible states that the four parameters can take are given in table 7.5.7-4.

Table 7.5.7-4 Possible States of the Four Parameters in the New Taxonomy
ParameterPossible States
Processor Multiplicity:Zero, Small, Medium, Large, Very-large, Ultra-large, Super-large, Extra-large
Grain size:Small, Medium, Large, Very-large
Topology:Tree, Ring, Hypercube, Pipe, Mesh, etc
Omega, Bens, Shuffle-Exchange, Butterfly, etc
Control Multiplicity:Zero, Small, Medium, Large, Very-large, Ultra-large, Super-large, Extra-large

7.5.7.1 Examples

The application of the new taxonomy can be illustrated with the examples given below.

The ICL DAP has 4,096 simple processing units interconnected by a mesh and managed by a single control unit, see section 6.1.

	Processors	Grain  Topology  Control Multiplicity
            V		  S      Mesh              V

The ICL DAP is a 'VS Mesh V' machine.

A Tandem T.node has 64 transputers that can work independently. The transputers are interconnected by a crossbar switching network, see section 5.3.

	Processors	Grain  Topology  Control Multiplicity
	    M		  M    Crossbar            Z

The Supernode T.node is an 'MM Crossbar' computer. Control multiplicity is zero and so need not be stated. All the processors are linked to a supervisor bus. This is used for low level monitoring and debugging rather than transferring an applications data and so need not be added to the topology.

An IBM PC is a single processor is responsible for all the application computing. As there is only one processor so there is no interconnection topology.

	Processors	Grain  Topology  Control Multiplicity
	    Z		  L      none	           Z

The IBM PC is a 'ZL' computer.

The Connection Machine CM-1 has 65,536 simple 1-bit processors connected into a hypercube and each having 4Kbits of memory. Every processor is connected to a central unit called the 'microcontroller' which issues identical 'nanoinstructions' to all of them. This unit can be regarded as a control unit.

	Processors	Grain  Topology  Control Multiplicity
	    V		  S    Hypercube           V

The Connection Machine CM-1 is a 'VS hypercube V' computer.

The Cray-1 is a pipelined vector supercomputer. Even though it has several scalar and vector functional units, it can best be regarded as having a single processing unit. Because it is a vector supercomputer it is regarded as having a very-large grain size.

	Processors	Grain  Topology  Control Multiplicity
	    Z		  V      none	           Z

The Cray-1 is a 'ZV' computer. Further examples of the new taxonomy can be found in table 7.5.7.1.

Table 7.5.7.1 Characterisation of Various Computers by Flynn's Taxonomy and the New Taxonomy
ManufacturerModelNew TaxonomyFlynn's Taxonomy
CDC6600ZLSISD
IBMPCZLSISD
CrayCray-1ZVSISD
AlliantFX/8SL BusMIMD
CrayX-MP 22SV MemoryMIMD
Illinois U/BurroughsILLIAC-IVMM Chordal Ring MSIMD
SequentBalance 2100MM BusMIMD
PARSYS/TelmartT.nodeMM CrossbarMIMD
Carnegie Mellon UC.mmpML CrossbarMIMD
LoralLDF 100ML BusMIMD
BBNButterflyLL ButterflyMIMD
FPST.SeriesLL HypercubeMIMD
Goodyear AerospaceMPPVS GridSIMD
ICLDAPVS Grid VSIMD
Thinking MachinesCM-1VS Hypercube VSIMD
Typical Data Flow MachineMS CrossbarMIMD
Typical Systolic ArrayMS Hexagonal Grid MSIMD
Typical Neural NetworksLS Randomly ConnectedLayers????

Neural networks tend to have four or five layers with an equal number of processors in each layer. Initially every processor in one layer is connected to every processor in the next layer. After training the neural network many of the connections will become redundant, and will be 'disconnected' when the network is in use.

It is difficult to express this topology fully in the new taxonomy so it is proposed that rough designations like 'Fully Connected Layers', 'Partially Connected Layers', or 'Randomly Connected Layers' be used instead. A neural network would come out something like 'VS Fully Connected Layers'.

7.5.7.2 Number of Categories

The number of categories in the new taxonomy is given by multiplying the number of possible states in each category by each other. Leaving topology out of the calculation for the moment we get:

      No(Categories) = No(Processor Multiplicity)  x 
			    No(Grain size)  x  No(Control Multiplicity)
		     = 8 x 4 x 8
		     = 256

In the case where there is only one processing unit, there can only be one control unit, ie the control multiplicity can only take the value of zero, the other seven states are not possible. Removing seven from the total we get 249. In general, the control multiplicity must be less than or equal to the processor multiplicity. Applying this restriction brings the number of categories down to 144. Each category, (except those with just one processing element), can be further subdivided by the inclusion of topology. Unfortunately for this calculation, the number of possible topologies is not known, but even if only seven are ever used that multiplies the number of categories into the thousands.

In practical terms, however, some categories will remain empty, and only a handful of topologies will be popular at any one time. This will lead to a few popular categories becoming familiar, but leave the others in place for future developments.

With these simple rules it should be it possible to 'work out' a rough architecture for all the categories that are not already familiar. The chosen parameters carry a large amount of inferred information, and so it should be possible to add some detail to the explicit information given by a particular category.

7.5.7.3 Problems with the New Taxonomy

A machine close to the borderline between two categories can be pushed over this borderline by the addition of a few processors. As neither the architecture nor the operation of the machine is likely to have changed significantly by this addition, the change of category can be regarded as a problem with the taxonomy.

By using rough designations for the number of processors the addition of a few processors to a machine near the beginning or middle of a category will not result in a change of category. This problem only occurs if the machine is already borderline. It is an unavoidable consequence of including highly variable parameters in a taxonomy that small changes will cause machines to jump categories in a few borderline case. The quality of information conveyed by these parameters so much outweighs this problem as to justify their inclusion in the taxonomy.

A similar problem occurs with parallel computers that can be extended through a range of categories. The transputer based Supernode (see section 5.3) can be extended from 16 to 1024 processors pushing the machine from SM Crossbar through MM Crossbar and LM Partial Crossbar to VL Partial Crossbar. It was initially stated that a machine should only appear in one and only one category if possible. Manufacturers usually designate upgraded machines as different models. Including the model in the title of the machine should substantially reduce the number of machines that appear in several categories.

General information for this chapter also came from references [4], [5] and [6].

 

References

[1] Flynn MJ, Very High-Speed Computing Systems, Proceeding of the IEEE, 54(12), December 1966, p1901-1909
[2] Handler W, Innovative computer architecture - how to increase parallelism but not complexity, p1-41, in 'Parallel Processing Systems, An Advanced course', Evans DJ ed, Cambridge University Press, Cambridge, 1982, 0-521-24366-1
[3] Hennessy I & Patterson D, Computer Architecture a Quantitative Approach, Morgan Kaufmann, Palo Alto, 1990, 1-55880-069-8
[4] Hockney RW & Jesshope CR, Parallel Computers 2, Adam Hilger/IOP Publishing, Bristol, 1988, 0-85274-812-4
[5] Hwang K & Briggs FA, Computer Architecture and Parallel Processing, McGraw Hill, London, 1984, 0-07-031556-6, p32-40
[6] Quinn MJ, Designing Efficient Algorithms for Parallel Computers, McGraw Hill, London, 1987, 0-07-051071-7