Parallel Computer Taxonomy - Conclusion


 

8.1 Introduction

In this final chapter I will address various ideas about the future of parallel computers. Although the idea of parallel processing has been around for a long time, there has been strong resistance to computing in parallel. Much of this resistance is based on several philosophical arguments, devised between the 1940s and 1960s, which suggested that parallel processing was intrinsically inefficient.

 

8.2 Arguments Against Parallel Computers

8.2.1 Barsis vs Amdahl

The Speedup (S) of an application program, executed in parallel, is defined as the time taken to complete the calculation on a serial computer (Ts) divided by the time taken to complete the same calculation on a parallel computer (Tp).

Let the total time taken to complete an application program on a serial computer be set to 1 from the outset. It is usually possible to break some parts of a calculation down into independent blocks that can be executed in parallel. However, there will also be a part of the problem that will not break down in this way and will have to be executed sequentially. Let f represent the fraction of time taken to execute this sequential component. Because the total time was set to 1 at the beginning, f must lie between 0 and 1. On a sequential computer the sequential component executes in time f, and the parallelisable component executes in the remaining time, 1-f.

Moving the same problem onto a parallel computer would still leave the sequential component requiring time f. However, under optimal conditions the parallelisable component would now only require time (1-f)/n, where n is the number of processors available. Therefore, the total time taken to complete the calculation on a parallel computer is f + (1-f)/n. Feeding this into the definition of speedup we get equation 8.2.1-1.

S < 1 / (f + (1-f)/n)        Equation 8.2.1-1

The inequality arises because communication, should it exist, might slow the parallel component down by some extent. Equation 8.2.1-1 is an expression of Amdahl's Law[2]. As the number of processors goes to infinity the right hand term of the denominator goes to zero (i.e. the parallel component takes an infinitesimal time to complete), leaving equation 8.2.1-2.

S ≤ 1 / f        Equation 8.2.1-2

Equation 8.2.1-2 is a function of a single term - the time taken by the sequential code - the number of processors does not come into it. If 5% of the code has to be executed sequentially, equation 8.2.1-2 tells us the speedup is limited to 20, irrespective of the number of processors. Figure 8.2.1-1 shows how devastating the serial fraction is to speedup. A corollary of Amdahl's Law is that a small number of sequential operations can effectively limit the speedup of a parallel algorithm.
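The effect can be checked numerically. The following is a minimal sketch in Python (not a language used elsewhere in this thesis); the 5% serial fraction and the processor counts are chosen purely for illustration:

# Minimal numerical check of Amdahl's Law (equations 8.2.1-1 and 8.2.1-2).
# The serial fraction f and the processor counts are illustrative values only.

def amdahl_speedup(f, n):
    """Upper bound on speedup with serial fraction f on n processors."""
    return 1.0 / (f + (1.0 - f) / n)

for n in (10, 100, 1000, 10_000):
    print(f"f = 0.05, n = {n:6d}: speedup <= {amdahl_speedup(0.05, n):6.2f}")
# However many processors are added, the speedup never exceeds 1/f = 20.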

Amdahl's law is a strong argument against parallel computers and for many years it turned the attention of many computer designers away from multiprocessors towards vector processors. However, many important applications exist that have almost no serial component. Furthermore, a knowledge of Amdahl's law can be used to aid parallelism, by steering parallel programmers away from including large serial fractions in parallel algorithms.

The strongest argument against Amdahl's Law originated from observations at the Sandia National Laboratories in the US of an almost linear speedup for many applications with serial fractions of 0.4-0.8%. This level of speedup should not be possible under Amdahl's Law.

Amdahl's Law, as stated, assumes that exactly the same application is run on several machines that vary in the number of processors they have. Barsis[3] observed that for 'real' applications this was virtually never the case. Usually, as the number of processors increased, so did the size of the application; the only time an application would remain the same size is in academic exercises. In mathematical modelling, for example, more processors make it practical to extend the grid, increase the grid resolution, increase the number of degrees of freedom, or increase the number of iterations in the application.

The formula for speedup is now reevaluated for applications that grow as the number of processors is increased. Let the time taken to solve a problem on a parallel computer be 1. As before the time can be split into serial and parallel components. The total time taken to complete the problem on a parallel computer with n processors is equal to the serial fraction (f) plus the parallel fraction (p).

Tp = f + p = 1        Equation 8.2.1-3

If the problem is moved onto a serial computer the parallel components can no longer be executed in parallel but must be done sequentially. The time taken to execute the problem on a serial computer is equal to the serial component, plus n times the parallel component.

Ts = f + p × n        Equation 8.2.1-4

Speedup now becomes:

S = Ts/Tp = (f + p×n) / (f + p)        Equation 8.2.1-5

Substituting equation 8.2.1-3 for the denominator we get:

S = f + p×n        Equation 8.2.1-6

Rearranging equation 8.2.1-3 we get:

p = 1 - f        Equation 8.2.1-7

Substituting this into equation 8.2.1-6 we get:

S = f + (1-f)×n
  = f + n - nf
  = n + f - nf
  = n + (1-n)f        Equation 8.2.1-8

Figure 8.2.1-1 shows how the Barsis formula for speedup compares with Amdahl's. Equation 8.2.1-8 represents a straight line with slope 1-n. It can be seen that for applications that are scaled up with the number of processors a linear speedup is achievable[6], [8].


Figure 8.2.1-1 Comparison between Amdahl and Barsis Models for Speedup on 1000 Processors

Observed speedups may be lower than those predicted because the communication time has been included in the parallel fraction used to calculate the total serial time Ts. On a serial processor this communication would not be needed, so Ts would be slightly lower.
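The contrast between the two models is easy to tabulate. The following minimal Python sketch evaluates both formulas for 1000 processors; the serial fractions are chosen purely for illustration:

# Sketch comparing the Amdahl (fixed-size) and Barsis (scaled) speedup models
# for n = 1000 processors. The serial fractions are illustrative only.

def amdahl(f, n):
    return 1.0 / (f + (1.0 - f) / n)      # equation 8.2.1-1 (upper bound)

def barsis(f, n):
    return n + (1.0 - n) * f              # equation 8.2.1-8

n = 1000
for f in (0.01, 0.05, 0.1, 0.5):
    print(f"f = {f:4.2f}:  Amdahl <= {amdahl(f, n):7.1f}   Barsis = {barsis(f, n):7.1f}")
# For scaled problems the Barsis model stays close to n even for sizeable
# serial fractions, whereas the Amdahl bound collapses towards 1/f.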

8.2.2 Grosch's Law

Grosch's Law, formulated during the 1940s but not published until 1953, relates the price of a computer to its power. Grosch asserted: "I believe that there is a fundamental rule ... giving added economy only as the square root of the increase in speed - that is, to do a calculation ten times as cheaply you must do it one hundred times as fast".

c = f(p½)        Equation 8.2.2-1

Where p is performance and c is cost.


Figure 8.2.2-1 Cost Performance Relationship Suggested by Grosch

Another way of looking at this is that computer power is a function of the cost squared, see figure 8.2.2-1.

p = f(c²)        Equation 8.2.2-2

To see how this is an argument against parallel computers, consider what Grosch's Law predicts about buying a computer of power 2y. We could either buy a single processor of power 2y, or buy two processors of power y and connect them together.

Removing all the constant terms from equation 8.2.2-1 we get:

c = p½        Equation 8.2.2-3

Costing the parallel machine by equation 8.2.2-3 we get:

c = (1st processor power)½ + (2nd processor power)½
= y½ + y½
= 2y½

Costing the serial machine gives us:

c = (2y)½
  = 2½ × y½
  ≈ 1.414y½

which is cheaper than the equivalent parallel machine. Similarly, a serial machine of power 100y would cost only 10% of a parallel machine built from 100 processors of power y. Grosch's Law is an example of an 'economy of scale', where the overheads associated with buying one large item are less than those of buying the same capacity in several smaller units.
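The same arithmetic can be set out as a short sketch (Python, with the processor powers expressed in arbitrary units):

# Worked example of the Grosch's Law argument (equation 8.2.2-3, constants dropped).
# Costs are in arbitrary units; the processor powers are illustrative.

def grosch_cost(power):
    return power ** 0.5

y = 1.0                                    # power of one small processor
serial_2y   = grosch_cost(2 * y)           # one processor of power 2y
parallel_2y = 2 * grosch_cost(y)           # two processors of power y
print(f"power 2y:   serial {serial_2y:.3f}  vs  parallel {parallel_2y:.3f}")

serial_100y   = grosch_cost(100 * y)       # one processor of power 100y
parallel_100y = 100 * grosch_cost(y)       # one hundred processors of power y
print(f"power 100y: serial {serial_100y:.3f}  vs  parallel {parallel_100y:.3f}")
# Under Grosch's Law the single large machine is always the cheaper option.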

At the time an analysis of the prices and power of a variety of machines[10] seemed to substantiate Grosch's Law.

Let us now perform a 'thought experiment' and push Grosch's Law to its limits. Assume that, given the laws of nature, the most powerful single processor possible has been built. (Details such as architecture, materials and switching speed are irrelevant, as this is only a thought experiment.) Given the premise that we have the most powerful possible single processor, it is impossible to build a processor twice as fast - at any price. Here, therefore, Grosch's Law must break down.

Now consider the most powerful single processor that has actually been built. This time it may be possible to build a processor that is twice as powerful. However, producing it would require a new architecture, new materials, a great deal of research, prototyping, and probably a delivery time of several years - and consequently much more than the 1.414 times the price of the current fastest processor predicted by Grosch's Law.

Something seems wrong: there is evidence for Grosch's Law, but common sense tells us it cannot be right. Taking into account the exaggerated costs of top-of-the-range machines, Handler[4] has suggested that the cost of manufacturing a range of machines follows something more like the 'S' curve in figure 8.2.2-2.


Figure 8.2.2-2 More Realistic Price Performance Relationship

There is evidence to suggest that manufacturers price their computers according to Grosch's Law (figure 8.2.2-3), rather than by a fixed multiple of the manufacturing cost (figure 8.2.2-4). If computers were priced as in figure 8.2.2-4, the machines at the top of the range would appear more expensive per MFLOP than the small machines. This would make it more difficult for sales personnel to persuade customers to upgrade their current machine rather than buy from a different manufacturer. Grosch's Law appears to be a result of marketing policy more than of manufacturing expense.


Figure 8.2.2-3 IBM 360 Pricing Policy

Figure 8.2.2-4 More Representative Pricing Policy

Ein-Dor[1], [5] split computers into five categories based on price, roughly corresponding to supercomputers, large mainframes, small mainframes, minicomputers and microcomputers. His analysis showed that machines compared within the same category follow Grosch's Law. However, Grosch's Law is not followed when machines in different categories are compared.

Mendelson[7] has re-examined Ein-Dor's data and found that if the categories are removed, and the data analysed as a whole, a linear cost-versus-power relationship cannot be ruled out. He suggested that splitting computers into categories introduced a selection bias which led to the appearance of Grosch's Law.

8.2.3 Minsky's Conjecture

Minsky's conjecture[4], [2] states that due to the need for parallel processors to communicate with each other, speedup increases as the logarithm of the number of processing elements, see figure 8.2.3-1. This would make large-scale parallelism unproductive.


Figure 8.2.3-1 Minsky's Conjecture

Minsky's evidence for this came from parallel computers of the late 1960s and early 1970s, which had 2-4 processors. At that time communication was not well catered for and Minsky's conjecture held. Two innovations have changed this state of affairs, and today's parallel computers often achieve almost linear speedup. The first is clever multitasking, which allows processors to interleave processing with relatively low speed communications and can dramatically reduce the amount of time a processor wastes waiting during communication.

The other development is the use of 'Direct Memory Access' (DMA) style units to carry out communication on behalf of the processor; the transputer does this with its link engines, see section 4.4. The DMA unit is told what to transmit, and where to transmit it, while the processor gets on with other tasks. By cycle stealing, the DMA unit further reduces the extent to which communication impinges on computation. In this way the processor has effectively been decoupled from communication.
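The benefit of this decoupling can be seen with a toy timing model; the per-step times below are invented for illustration and are not measurements of any particular machine:

# Toy timing model of decoupling communication from computation.
# Without a DMA-style unit the processor alternates between computing and
# communicating; with one, the two activities largely overlap.

t_compute = 8.0     # time spent computing per step (arbitrary units)
t_comms   = 3.0     # time spent communicating per step

t_without_dma = t_compute + t_comms          # processor does both in turn
t_with_dma    = max(t_compute, t_comms)      # link engine runs alongside the CPU

print(f"per step without DMA: {t_without_dma:.1f}")
print(f"per step with DMA:    {t_with_dma:.1f}")
# As long as communication fits inside the computation time, its cost is
# largely hidden and the processor is effectively decoupled from it.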

It has been further argued that as the number of processors grows so does the amount of communication, and at some point, even with high speed DMA communications, speedup will be limited, see figure 8.2.3-2. On a single processor, one process makes data available to a second process by saving it in memory, from where the second process can read it. This is communication between processes, even though both processes reside on the same processor. If that piece of data needs to be made available to a process on another processor then it can just as easily be sent to an I/O unit as to a memory unit - the cost is comparable.


Figure 8.2.3-2 Crippling Communications Overhead

The main advantage of main memory as a communications medium is that all the data is available to all the processes. On a message passing parallel computer, if it is not known which data will be needed by a particular processor then problems can occur: either all data has to be transmitted to all processors, or the processor wishing to use the data must find out where it is and request it. Both of these solutions to the problem of making data available as it is needed will make the application communication bound, as shown in figure 8.2.3-2.

Fortunately, there are many applications where an almost linear data flow graph can be drawn between processes. In these cases data use is completely predictable and the algorithm can ensure that data is sent to where it is needed. In essence, the point at which an application becomes communications bound is a product of the number of processors, the topology and the algorithm. For many applications the algorithm can be adjusted to maintain almost linear speedup. Sometimes Minsky's Conjecture is observed; however, this can be regarded as a worst case scenario, and usually changing the algorithm gives much better results.
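This trade-off can be illustrated with a toy model comparing an algorithm that broadcasts every datum to every processor with one whose data flow is predictable. The constants below are invented for illustration; no particular machine or application is modelled:

# Toy model of where an application becomes communications bound.
# Each processor does work/n units of computation per step. In the
# 'broadcast' case every datum is sent to every processor, so communication
# per step grows with n; in the 'predictable' case data is sent only where
# it is needed and the cost per step stays roughly constant.

work    = 1000.0   # total computation per step (arbitrary units)
t_bcast = 0.5      # cost of sending one item to one processor
t_local = 0.1      # fixed nearest-neighbour exchange cost

for n in (10, 100, 1000):
    t_broadcast   = work / n + t_bcast * n      # communication grows with n
    t_predictable = work / n + t_local          # communication roughly constant
    print(f"n = {n:5d}: speedup (broadcast) = {work / t_broadcast:7.1f}   "
          f"speedup (predictable) = {work / t_predictable:7.1f}")
# The broadcast version peaks and then falls away as in figure 8.2.3-2,
# while the predictable version stays close to linear.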

8.2.4 Little Problems

A large parallel computer with n processors may be required to run a small application requiring only k processors, where k < n. This situation leaves n - k processors idle, which is a waste of resources. (This problem was touched upon in section 3.2.2.2 on the 3-D grid topology.) It could be argued that this waste is not really a problem because many computers are often under-utilised. (Many of an institution's personal computers stay idle most of the time, and minicomputers tend to be idle 16 out of 24 hours - i.e. outside normal working hours.) The extra capacity is there so that the computer can produce results fairly quickly on demand.

Ignoring poor efficiency is not, however, a satisfactory defence of parallel computers. For parallel computers with a medium to large grain size it can be argued more strongly that small applications are not wasteful: any processors left over once the main application is running can be assigned to smaller batch jobs by the operating system. This could also be done on small grain sized parallel computers, but it is much more difficult.

8.2.5 The Myth of the General Purpose Computer

Another objection to the likely success of parallel computers is that treating processing power as a resource is so complicated that it will not be possible to build a general purpose parallel computer in the near future. This argument assumes that a general purpose computer is something the market particularly needs, and that it is what we already have with uniprocessors.

I believe neither of these is so. Microcontrollers can be thought of as specialised computing chips. They have a CPU core, often taken directly from the manufacturer's main range of microprocessors. Usually the memory capacity and sophistication are reduced to make room for a multichannel I/O interface, sophisticated interrupt logic and extra timers. Microcontrollers are built into products that require some 'intelligent' behaviour; these are called embedded systems because the processor is not accessible by the user. Integrated circuit sales figures show that after memory chips the best selling device is the microcontroller, not the microprocessor. Therefore, the largest market for computing chips is not for building general purpose computers.

Recently the computational performance of microcontrollers has increased rapidly. The microcontroller market is forecast to grow, as is the demand for increased performance. Microcontrollers are being built with ever more computing power; however, these are becoming expensive. In some cases it is now cheaper to use several average microcontrollers rather than one very high performance one (another counter-example to Grosch's Law). This is particularly true of Digital Signal Processing (DSP) chips.

8.2.5.1 Tweaking

A surprising stage in the design of general purpose computers can best be called 'tweaking'. This happens when most of the software has been written and the machine is just about to go into full production. Various people responsible for the design of the machine, along with those selling applications, run sample programs on the new computer. At this stage they are not interested in the results produced by a program, but in its effect on the machine: how it uses the network, disks, operating system, buffer spaces, cache, etc.

Advanced tools, both software and hardware, have been prepared in advance just to aid this stage of design. These include tools to indicate how much CPU time is used by the operating system and how much by the various applications, how much buffer space is optimal for a particular application, and how the cache memory performs under various applications and system loadings.

At the end of this stage, the machine can go into full production. The buffer spaces are optimal, the cache is optimal, the network latency is optimal. The question arises: optimal for what? The machine is usually targeted at a fairly specific market: corporate local network, scientific small mini, scientific minisupercomputer, educational local network, stand alone workstation, DP single user mini, etc. Tweaking allows the machine to perform well on the specific benchmarks recognised by the targeted niche. Unfortunately, optimising one feature will often degrade others, so tweaking lowers a machine's performance outside its niche. The idea that a machine has a niche is counter to the argument that it is a totally general purpose machine.

Little of this is overtly stated in a manufacturer's brochures. It will appear that only a few machines from a manufacturer's range fit your needs; some manufacturers don't seem to have anything suitable, despite an extensive range of products. Often, therefore, when it comes down to selecting a computer the choice is small, because machines are designed for highly specific markets.

[To some extent a machine can be reconfigured on site by a systems manager. However, many features, such as cache memory size, disk performance and memory hierarchy, are fixed during manufacture.]

It is unlikely that a computer built for one niche will be much used in another. On the occasions when this does occur the users seem to have become accustomed to poor performance. If, for example, someone wants to use a scientific minisupercomputer for wordprocessing, they most likely will not question the limited choice of software, the inappropriate display, or the lack of a laser printer: "Oh, that's the mini - it's not really intended for wordprocessing."

8.2.6 Software Investment

Users have a huge investment in both computer hardware and computer software. When it comes to changing the hardware it is not always necessary to change all the software. Provided the change in computer architecture is not too drastic, it may be possible to preserve the investment in software simply by recompiling the old applications on the new machine.

It is argued that parallel computers represent such a large change in computer architecture that the old programs could not be reused and new programs would have to be written. This not only represents a loss of the investment in the old programs but may well involve considerable expense in obtaining new ones.

As an argument against parallel computers this is unduly pessimistic. It assumes that the applications that exist today will be the only applications to exist in the future. Parallel computers will find new applications that are not feasible on current machines because of cost or performance. Some old software has reached the end of its useful life and is in need of rewriting; if this has to be done anyway, it might as well be rewritten for a parallel computer as for a serial one.

Many parallel computer manufacturers recognise the need to maintain the software investment and provide 'intelligent' cross-compilers that convert standard FORTRAN programs into parallel FORTRAN programs. This will not produce programs as efficient as those written for a parallel computer in the first place, but it allows the old programs to continue in use.

Many users run standard applications, e.g. spreadsheets, database managers, accountancy packages, CAD packages, etc. Parallel versions of some standard packages are gradually becoming available, and applications can be moved across with little or no change to the user's files.

Over the past several years more and more parallel computers have been reaching the market place. Initially these were advanced supercomputers going into national science laboratories. Now small parallel computers are undercutting the minicomputer market, whilst others are used to boost the performance of workstations and micros. All supercomputers currently use multiple processors, and an increasing number now have massively parallel architectures. With the introduction of parallel computers into the market place, the argument over whether parallel computers are practical has changed to one over which form of parallel computer is most practical.

 

8.3 Gaps

In this thesis I have tried to give a comprehensive account of the ideas and components used to build parallel computers. Armed with this information I have introduced a new taxonomy of parallel computers. During the research, several lines of investigation had to be abandoned because I was unable to find the relevant details or to evaluate the results.

8.3.1 Interconnection Networks

In chapter three, on topology, the average path length is given for some topologies but not all. Even for the chain, possibly the simplest network, the evaluation of the average path length proved difficult. The other networks (except the tree) have multiple paths between nodes. This functional advantage becomes a disadvantage when it comes to the analysis of the average path length, which becomes horrendous. Even this is not so bad if the only paths considered are those of minimum length. However, the ability to use longer paths when no minimum ones are free is widely covered in the text, and this case needs to be considered in any analysis of average path lengths. It is justifiable to consider all the paths along which a message passes through no node more than once. Analytical results for the average path length may well already exist in graph theory, but I was not able to find them.
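As an indication of the kind of calculation involved, the following sketch computes the average path length of small networks by brute force (Python; only minimum-length paths are counted, and the network sizes are purely illustrative):

# Brute-force average shortest-path length for small fixed topologies.
# Only minimum-length paths are counted; the harder case of averaging over
# all simple paths, discussed above, is not attempted here.
from collections import deque
from itertools import combinations

def distances_from(start, adj):
    """Breadth-first search giving shortest hop counts from one node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_path_length(adj):
    nodes = list(adj)
    pairs = list(combinations(nodes, 2))
    total = sum(distances_from(u, adj)[v] for u, v in pairs)
    return total / len(pairs)

n = 16
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
ring  = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
print(f"chain of {n} nodes: {average_path_length(chain):.3f}")   # (n + 1)/3 ≈ 5.667
print(f"ring of {n} nodes:  {average_path_length(ring):.3f}")    # roughly n/4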

Many modern parallel computers are connected by a switching network rather than a fixed topology (e.g. the crossbar switch in the Supernode, section 5.3.1.1). The properties of some switching networks were investigated; however, after some time it became apparent that this area was so large that it threatened to swamp the whole thesis. Much of the analysis of switching networks is significantly more involved than that of fixed topologies.

8.3.2 Parallel Computer Taxonomy

Several modern taxonomies have come to light since the time of writing. Whilst I do not believe they influence or compete with the new taxonomy presented here, it would have been desirable to include them for the sake of completeness.

It would also have been useful to classify more machines with the new taxonomy, to contrast those machines against their categories in other taxonomies, and to classify several representative computers from each section of chapter two on Models of Parallel Computation.

8.3.3 The Transputer

Inmos have released many preliminary details about the T9000 - the first member of the next generation of transputers. The T9000 represents a significant change in the transputer concept and it would have been interesting to contrast these two generations and work out the structure of the next generation of transputer parallel computers.

Unfortunately, the T9000 is already a year overdue, which suggests that problems with production have been encountered. Inmos may find it necessary to change the specification of the final version of the T9000 to overcome these difficulties. For this reason the work on the T9000 has not been followed up.

8.3.4 Parallel Non-Transputer Computers

In chapter six only three non-transputer parallel computers were described. There is a large diversity of parallel computers and it is regrettable that, on space grounds, more could not be included.
 

8.4 Missing Work

This thesis has presented parallel computers from the perspective of hardware, and based on the structure of these parallel computers a new taxonomy has been developed. The structure of computers is also influenced by software: in parallel computers the need for processes to communicate with each other has a direct influence on the architecture. A survey of parallel computer programming would have given extra insight into how parallel computers are likely to develop, and so into which aspects should be covered in a taxonomy. Unfortunately, to present parallel computer programming in sufficient depth for it to reveal its influence on architecture would take much more space than is available here.

 

8.5 Future of Parallel Computers

It has been relatively easy to arrive at the state where a single processor can accept several users and still appear to each user as though they are the only one using the machine. It has only recently become practical to invert this situation, where a single user is given simultaneous command of several processors. Consequently, the environments for this situation are still evolving.

The next logical stage - which has already started with machines like the Edinburgh Regional Supercomputer - is to build an environment where many users work on a computer with many processors. Although this is already possible, the way in which resources are allocated is still too crude to be considered a mature product. The management of resources such as memory and backing store is still evolving for uniprocessors, though the rate has slowed down. The ability to treat computing power as a dynamic resource is a new and unfamiliar concept and much work still needs to be done.

Another possible way forward for parallel computers is to have processing elements distributed amongst many networked workstations. Each workstation would have 10-100 processing elements to speed up its local performance. When a workstation is not being used a global operating system would enlist all its processing elements to form a pool of 'out workers'. The operating system could then distribute work from particularly busy workstations to all the out workers. This approach has been used with some success at the Inmos design laboratories at Bristol. Furthermore, as local networks rapidly increase the bandwidth of their links, distributed processing will make inroads into multiprocessing.

8.5.1 Software

The single greatest thing holding back parallel computers is the lack of off-the-shelf applications software. This stems from the fact that the architectures of parallel computers are much more diverse than the architectures of serial computers.

Furthermore, there is a need for a standard parallel operating system, in a similar manner to the way Unix is a standard. A standard would allow many people to use the machine via a familiar environment. (Many manufacturers already use a superset of the Unix operating system, but the supersets are not compatible with each other.)

It could be said that the main advantage of a standard operating environment is that it hides the details of the machine from the user; the users are presented with an abstract machine. On a parallel computer, however, the details (e.g. architecture, topology) can drastically affect the machine's performance on various programs. Yet exposing the details of the machine to the user is unpopular, because there is more to learn.

Automatic dynamic load balancing and automatic message routing can only ever be partially effective. A knowledge of a machine's topology and the likely loading, along with some static load balancing, can give rise to improvements of several orders of magnitude.
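As a simple illustration of static load balancing, the sketch below (Python; the task costs and processor count are invented for illustration) assigns tasks with known estimated costs to the least loaded processor, largest first:

# Minimal sketch of static load balancing: tasks with known (estimated) costs
# are assigned in advance to the least-loaded processor, largest tasks first.
import heapq

def static_balance(task_costs, n_procs):
    """Greedy largest-task-first assignment; returns per-processor loads."""
    loads = [(0.0, p, []) for p in range(n_procs)]     # (load, id, tasks)
    heapq.heapify(loads)
    for cost in sorted(task_costs, reverse=True):
        load, p, tasks = heapq.heappop(loads)          # least loaded processor
        heapq.heappush(loads, (load + cost, p, tasks + [cost]))
    return sorted(loads, key=lambda t: t[1])

tasks = [9.0, 7.5, 6.0, 6.0, 4.5, 3.0, 2.0, 2.0, 1.0]
for load, p, assigned in static_balance(tasks, 3):
    print(f"processor {p}: load {load:5.1f}  tasks {assigned}")
# The most heavily loaded processor bounds the run time; the closer the
# loads, the better the speedup.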

Much work still has to be done in devising fundamental parallel algorithms. Many of these will initially be roughly hacked up serial algorithms, but in many cases there is scope for a totally new parallel approach. The relationship between computer topology and parallel algorithm is of fundamental importance in this field and has not yet been adequately addressed.

 

8.6 Conclusion

There are many problems still facing the utilisation of parallel computers. However, none of these are serious enough to stop parallel computers from becoming practical and cost effective.

I hope that the new taxonomy presented here will provide a conceptual tool making it easier to discuss, reason about and teach parallel computer architecture.

In a sense, parallel computers have been working their way down: supercomputers have been largely parallel for many years, and many mid-range minicomputers are now simple multiprocessors. Parallel computers are here to stay, and are set to dominate the computing market.

General information for this chapter also came from reference [9].

 

References

[1] Ein-Dor P., "Grosch's Law Re-revisited: CPU Power and the Cost of Computation", Comms of the ACM, 28(2), February 1985, p142-151
[2] Flynn M.J., "Some Computer Organisations and their Effectiveness", IEEE Transactions on Computers, C-21(9), September 1972, p948-960
[3] Gustafson J.L., "Reevaluating Amdahl's Law", Comms of the ACM, 31(5), May 1988, p532-533
[4] Handler W., "Innovative Computer Architecture - How to Increase Parallelism but not Complexity", p1-41, in 'Parallel Processing Systems, An Advanced Course', Evans D.J. (Ed), Cambridge University Press, Cambridge, 1982, ISBN 0-521-24366-1
[5] Kang Y.M. et al., "Comments on 'Grosch's Law Re-revisited: CPU Power and the Cost of Computation'", Comms of the ACM, 29(8), August 1986, p779-781
[6] Karp A.H. & Flatt H.P., "Measuring Parallel Processor Performance", Comms of the ACM, 33(5), May 1990, p539-543
[7] Mendelson H., "Economies of Scale in Computing: Grosch's Law Revisited", Comms of the ACM, 30(12), December 1987, p1066-1072
[8] Nussbaum D. & Agarwal A., "Scalability of Parallel Machines", Comms of the ACM, 34(3), March 1991, p56-61
[9] Quinn M.J., Designing Efficient Algorithms for Parallel Computers, McGraw-Hill, London, 1987, ISBN 0-07-051071-7
[10] Solomon M.B., "Economies of Scale and the IBM System/360", Comms of the ACM, 9(6), June 1966, p435-440