# A 32-BIT EXECUTION UNIT IN AN ADVANCED NMOS TECHNOLOGY

Michael Pomper, Wolfgang Beifuß, Karlheinrich Horninger, Ulrich Schwabe

Research Laboratories, Siemens AG, Munich, W. Germany

#### Abstract

A 32-bit execution unit has been realized in a scaled NMOS single-layer poly technology using 2 $\mu$m gate length, 7 $\mu$m Al pitch and low-ohmic polysilicide for gates and interconnections. The circuit operates at 5 V with a maximum clock frequency of 8 MHz. To reduce design time, the chip (25000 transistors, 16 mm<sup>2</sup> including pads) has been designed with a high degree of regularity.

## Introduction

In the next few years an increasing number of 32-bit microprocessors will come on the market. The first of these 32-bit processors was introduced at this year's ISSCC /1/. In this paper, the execution unit of a 32-bit wide general-purpose microprocessor is described. The circuit has been realized in a scaled-down NMOS technology with 2 $\mu$m geometrical channel length, a 7 $\mu$m wide aluminium line pitch and low-ohmic polysilicide used both for gate electrodes and interconnection wiring. The circuit operates with a clock frequency of 8 MHz and uses a single 5 V supply voltage. The execution unit can perform logic and arithmetic operations and has an on-chip ROM for instruction decoding.

## Circuit description

The block diagram of the realized circuit is shown in Fig. 1. A two-bus system is used for data flow between the 32-word register (dual-port RAM), the ALU, the shifter and the temporary register. These two busses (A and B) are then combined so that the chip has only a single data bus to external circuits for input and output. Data in the dual-port register is selected via the external A and B addresses. Data can be written into the register only from the A-bus.
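The register access rules described above can be illustrated with a small behavioural model. This is only a sketch: the class name, method names and the 5-bit address masking are assumptions made for illustration and do not come from the actual design.

```python
class DualPortRegisterFile:
    """Hypothetical behavioural model of a 32-word x 32-bit dual-port register."""

    WORDS = 32              # 32 words, as stated in the text
    DATA_MASK = 0xFFFFFFFF  # 32-bit data width
    ADDR_MASK = 0x1F        # assumption: 5-bit A and B addresses

    def __init__(self):
        self.words = [0] * self.WORDS

    def read(self, addr_a, addr_b):
        """Return the two words that would be driven onto the A-bus and B-bus."""
        word_a = self.words[addr_a & self.ADDR_MASK]
        word_b = self.words[addr_b & self.ADDR_MASK]
        return word_a, word_b

    def write_from_a_bus(self, addr_a, a_bus_value):
        """Store a value; only the A-bus has a write path into the register."""
        self.words[addr_a & self.ADDR_MASK] = a_bus_value & self.DATA_MASK


rf = DualPortRegisterFile()
rf.write_from_a_bus(3, 0x123456789)  # truncated to 32 bits
rf.write_from_a_bus(7, 42)
a_bus, b_bus = rf.read(3, 7)         # both operands are read in one access
```

The model deliberately offers no write method for the B-bus, mirroring the restriction that data enters the register only from the A-bus.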

When designing a fast ALU, special attention must be given to the carry path. A ripple-carry path can be laid out very regularly, but it is the slowest method. A full carry look-ahead circuit would be the fastest method but results in a very irregular design and a large area. The ALU realized here uses a compromise between these two methods. The transfer-gate carry path is precharged during one clock phase (phase $\phi_2$) and a carry-bypass circuit is implemented over four stages. From one four-bit block to the next, the carry is rippled. This solution gives reasonable speed and a regular design.
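The carry scheme can be sketched at the logic level as follows. This is a minimal behavioural model, assuming a conventional carry-bypass (carry-skip) arrangement with propagate and generate signals per bit; it does not model the precharged transfer-gate chain or its timing, and the function name and return values are hypothetical.

```python
def carry_bypass_add(a, b, carry_in=0, width=32, block=4):
    """Add two unsigned words; return (sum, carry_out, bypassed_blocks)."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    total = 0
    carry = carry_in & 1
    bypassed = 0
    for low in range(0, width, block):
        block_carry_in = carry
        all_propagate = True
        for i in range(low, min(low + block, width)):
            a_i = (a >> i) & 1
            b_i = (b >> i) & 1
            propagate = a_i ^ b_i
            generate = a_i & b_i
            total |= (propagate ^ carry) << i
            carry = generate | (propagate & carry)  # ripple inside the block
            all_propagate = all_propagate and propagate == 1
        if all_propagate:
            # Bypass: when every bit of the block propagates, the block's
            # carry-out equals its carry-in, so hardware can forward it
            # without waiting for the ripple through the four stages.
            carry = block_carry_in
            bypassed += 1
    return total, carry, bypassed


# 0xFFFFFFFF + 1: the carry generated in the lowest block bypasses the
# remaining seven blocks.
print(carry_bypass_add(0xFFFFFFFF, 1))  # (0, 1, 7)
```

In the worst case the carry ripples through the first block and is then bypassed across the following blocks, which is where the speed advantage over a pure ripple-carry path comes from while the layout stays repetitive.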

To speed up multiplication and division operations, special circuitry has been added to the ALU (Fig. 2, Table I), which performs these operations according to Booth's algorithm /2/. If the operations MUL or DIV are initiated, the ALU is controlled autonomously by the least significant bits  $FBØ_T$  and  $FBØ_{T-1}$  (see Table I). In this way, time-consuming communication between execution unit and controller is avoided. Thus a $32 \times 32$ bit multiplication of signed integers with a 64-bit result is finished in 35 cycles. Unsigned division (32-bit result and 32-bit remainder) needs 37 cycles.
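The principle of letting two adjacent multiplier bits select the ALU operation can be sketched in software. The following is a minimal radix-2 Booth multiplication, assuming the textbook recoding (bit pair 1,0 subtracts and bit pair 0,1 adds the shifted multiplicand); the exact encoding of Table I is not reproduced here, and the function names are hypothetical.

```python
def to_signed(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def booth_multiply(multiplicand, multiplier, width=32):
    """Signed multiply of two `width`-bit operands using radix-2 Booth recoding."""
    m = to_signed(multiplicand, width)
    q = multiplier & ((1 << width) - 1)
    product = 0
    previous_bit = 0  # the implicit bit to the right of the multiplier LSB
    for i in range(width):
        current_bit = (q >> i) & 1
        if (current_bit, previous_bit) == (1, 0):
            product -= m << i   # start of a run of ones: subtract
        elif (current_bit, previous_bit) == (0, 1):
            product += m << i   # end of a run of ones: add
        # (0, 0) and (1, 1): no arithmetic operation for this bit
        previous_bit = current_bit
    return product  # always fits in 2 * width bits


print(booth_multiply(7, -3))   # -21
print(booth_multiply(-5, 6))   # -30
```

The loop runs once per multiplier bit, i.e. 32 times for 32-bit operands. How these steps map onto the 35 cycles quoted above is not specified in the text.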

To decode the 8-bit op-code and control the different operations of the functional blocks, a mask-programmed ROM has been integrated on the chip. This ROM has approximately 7500 transistor sites, of which about 25% are occupied. The control ROM has also been converted into a control PLA. This reduced the number of transistor sites by approximately 25%, at the expense of a slower matrix, since the polysilicide word lines of the PLA are 3.5 times longer than the word lines in the ROM.

To reduce the design time necessary for such a complex logic circuit, emphasis has been placed on making the circuits as regular as possible. The result was a floor plan (Fig. 3) which differs drastically from the initial function block diagram (Fig. 1). A plot of the complete execution unit is shown in Fig. 4. The data busses A and B run from top to bottom through the dual-port RAM, ALU, status, shifter and bus control. The circuits along these data paths all have the same width, so that they can easily be rearranged when new or different functions are implemented. The data path for one bit is drawn only once and then replicated 32 times. The data lines run in aluminium, whereas the control lines, which run perpendicular to the data lines, are realized in polysilicide. The control ROM fits into the height of the data path, resulting in a very compact and area-efficient layout. The whole circuit contains about 25000 transistors; the die size is 16 mm<sup>2</sup> with pads. The circuit operates on two clock inputs. From these clocks, a four-phase clock system is generated on-chip.

## Simulations

The whole circuit has been designed and simulated with a circuit simulation program. In addition, logic simulation has been done with the program CAP /3/. Circuit simulations of the ALU are shown in Fig. 5 (8 MHz cycle). During clock phase $\phi_2$ the operands, which in this case are read from the DP-register, are fed into the ALU latch and the ALU logic circuit is operated, while the carry line of the adder is precharged. The operations ADD and ALU-shift are started with clock phase $\phi_3$ and can be performed in about 55 ns (worst case). The time left within $\phi_4$ can be used to rewrite the operands into the DP-register.

## Conclusion

The circuit and the design of a 32-bit execution unit with an on-chip control ROM have been described. Simulations show that the circuit can operate with a clock frequency of 8 MHz. At this frequency a $32 \times 32$ bit multiply operation is accomplished in less than 4.5 $\mu$s. The circuit is currently being fabricated.
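As a quick check, the multiplication time follows directly from the 8 MHz clock and the cycle counts given in the circuit description (35 cycles for multiplication, 37 cycles for division):

$$
T_{\text{cycle}} = \frac{1}{8\ \text{MHz}} = 125\ \text{ns}, \qquad
t_{\text{MUL}} = 35 \times 125\ \text{ns} = 4.375\ \mu\text{s} < 4.5\ \mu\text{s}, \qquad
t_{\text{DIV}} = 37 \times 125\ \text{ns} = 4.625\ \mu\text{s}.
$$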

#### Acknowledgement

The authors would like to thank Mr. W. Kaschte for making the artwork and Drs. H.-J. Pfleiderer and E. Hörbst for encouraging this work. This work was supported by the Technological Program of the Federal Department of Research and Technology of the Federal Republic of Germany. The authors alone are responsible for the contents.


## References

- /1/ ISSCC 1981, Digest of Techn. Papers, Session IX, pp. 104-117
- /2/ Booth, A.D. et al., Automatic Digital Calculators, Academic Press Inc., New York, 1956
- /3/ F. Rammig, CAP/DSDL (Version O, Kurzbeschreibung), Universität Dortmund








Fig. 3: Floor plan

Fig. 4: Plot of the complete execution unit
