A THREADED MICROPROGRAM MACHINE Brad Rodriguez, McMaster University A. ABSTRACT One of the corollaries of the trend toward migrating function to microcode is the prospect of developing such code in a high level language. In this realm, where resource costs are high, threaded languages may play an important role. This paper presents a processor that was designed to support a Forth model at the microcode level. The resulting 16-bit machine, implemented in 7400 series TTL, is extremely simple, involving approximately 2100 gates. While this machine is understandably less efficient at directly implementing a conventional CPU instruction set, it is comparable to conventional CPUs at implementing Forth primitives. This suggests the removal of one layer of organization between the application and microcode. It also suggests a new model, the "extensible instruction set," for microcoded CPUs. B. BACKGROUND In 1978, I and two fellow students -- contemplating the drive to speed functions by moving them to microcode -- realized that, eventually, a limit would be reached when the entire application was moved to microcode. Therefore, some form of high-level language would be required. We decided that Forth was ideal: as efficient as most HLLs, and very economical of memory -- an important consideration in the microcode realm. However, since we naively believed that a) the word size of the instructions must equal the word size of the memory, and b) Forth was a 16-bit language, we envisioned a machine with a 16-bit microinstruction. (This is not entirely a foolish restriction, when the microcode resides in the main memory.) This CPU should be simple and inexpensive to prototype, and should be built from standard TTL. In 1986 I finally devised a combination of ALU, register file, control logic, and memory that required only 16 control signals. Viewed as a microcoded CPU with a very short micro- instruction, this was originally called the REduced Micro- Instruction Set computer (REMIS). Viewed as a hardwired CPU with a very limited and strangely encoded instruction set, it is known as the Pathetic Instruction Set Computer. PISC-1, the prototype, was completed in 1991. C. HARDWARE DESIGN (FIGURE 1.) Every instruction requires two clock cycles. The first cycle is the fetch phase, during which the microinstruction is read from memory and latched into the Instruction Register (IR). The second cycle is the execute phase. The fetch operation is actually performed by a hardwired microinstruction, which is loaded into the IR during the execute phase. It reads memory addressed by the Program Counter (PC) register, and increments the PC. The key to the narrow microinstruction (and to the small package count) is the three-port register file, constructed from eight 74172s -- an IC which is, sadly, obsolescent. One of eight registers can be written (A), another can be read (B), and a third can be read and/or written (C), all simultaneously. Port C is used for transfers to and from the D bus. During memory reads (in the execute phase), the D bus is written into Register 0. (Memory reads during the fetch phase are latched into the IR.) At all other times, Register 0 is enabled onto the D bus, providing memory write data and the ALU's B operand. All memory reads and writes go through Register 0, the Memory Data Register (MDR). (Because of timing restrictions, the ALU's B operand cannot come directly from memory.) Any of the eight registers can be enabled through port B onto the A bus, providing both the memory address, and the ALU's A operand. The ALU result can be written to any of the eight registers through port A. Register 0 is a valid destination, except during memory read operations, when ports A and C conflict. The register being output on the A bus can be updated by the ALU: e.g., for post-increment of addresses. 74181s are used in a straightforward ALU. The carry output and the A=B output from the ALU can be latched by any instruction. (For obvious reasons, we must be able to execute instructions which don't alter the status.) The carry input to the ALU can come from the latched carry or the latched A=B flag, allowing status tests and multi- precision arithmetic; or it can be forced to "1" or "0", as required for many ALU operations (such as "increment"). D. INSTRUCTION FORMAT (FIGURE 2.) The architecture of this machine is reminescent of the PDP-11: eight registers, any of which can be used as an address register, with post-increment and post-decrement modes. Register R7 is the program counter, and load- immediate is performed by post-increment addressing using R7. There is also a strong flavor of 1802: all memory reads and writes must go through a single register (in this case, R0), all two-operand ALU functions use this register as one operand, and there is no subroutine Call instruction. The 16-bit PISC-1 microinstruction is shown in Figure 2. MRD and MWR are the active high memory read and write strobes; not to be selected simultaneously. A-reg is the register which is enabled onto the A bus. F-reg is the destination register of the ALU; note that R0 must not be selected when MRD is active. STS*, when low, causes the two status bits output from the ALU to be latched. CYin selects the source for the ALU's carry input. ALU FUNCTION is the 5-bit function select for the 74181s. The assembler for this microinstruction set requires only one screen of Forth source code. In the microassembler, the ALU function and the carry input select are usually combined in a single mnemonic. For example, the operation A+1 is provided by setting the ALU's M input to zero, S inputs to 0000, and carry input to a logical "1". The bit pattern shown in Figure 2 is the hardwired micro- instruction executed during the fetch phase. In mnemonic form, it is MRD 7 ADR 7 DST A+1, which means a) output R7 as memory address, and to A input; b) issue read strobe to memory; c) ALU operation is A+1; d) ALU result is written back to R7; and e) ALU status is not latched. This same "fetch" microinstruction, when used in a program (i.e. during the execute phase), will write the memory data to R0 instead of the IR: a load-immediate instruction. When RESET is active, the hardwired microinstruction is modified to: 1 0 1 1 1 1 1 1 1 1 1 0 0 0 1 1 which changes the ALU function to 0. This forces the Program Counter, R7, to zero. There is no Jump instruction. An unconditional jump is performed by first loading an immediate value into R0, and then transferring it to R7. This requires 3 words and 4 clock cycles. Conditional branches are performed by converting a status result to an all-ones or all-zeros mask, ANDing that with a relative offset, and adding the result to R7, requiring 5 words and 8 clock cycles. There is no subroutine Call or Return. All subroutine functions are provided at the level of the Forth model. E. FORTH MODEL In addition to the Program Counter (PC) and the Memory Data Register (MD), four of the registers R1-R6 are designated for the Forth model, and named SP, RP, IP, and W. Figure 3 shows two possible assignments of these registers. (The rationale for these assignments will be discussed later.) The lack of a subroutine Call instruction mandates an indirect-threaded Forth. NEXT requires three micro- instructions (6 clock cycles): MRD IP ADR IP DST A+1, \ memory(IP) -> MD, \ IP+1 -> IP MRD MD ADR W DST A+1, \ MD+1 -> W, \ memory(MD) -> MD MD ADR PC DST A, \ MD -> PC This leaves the W register pointing to the parameter field of the word being executed. NEXT is assembled in-line by an assembler macro. The common code to ENTER a colon definition requires 7 instructions (14 clock cycles): IP ADR MD DST A, \ IP -> MD RP ADR RP DST A-1, \ predecrement RP MWR RP ADR RP DST A, \ MD -> return stack W ADR IP DST A, \ W -> IP NEXT To EXIT a colon definition requires 5 instructions (10 clock cycles): MRD RP ADR RP DST A+1, \ pop return stack -> MD MD ADR IP DST A, \ MD -> IP NEXT F. OBSERVATIONS The PISC-1 CPU requires 22 TTL packages, containing a total of about 2100 gates -- about half the gate count of a Novix. If the PISC were implemented in semi-custom logic, still fewer gates would be needed, since many of the TTL ICs in the PISC-1 are not fully utilized. (For example, the "C" port of the 74172s is permanently wired to address zero.) The prototype was built with standard TTL (except for 74S181s), and has been tested with a 2 MHz clock. Timing analysis indicates that, with 100 nsec memory devices, a 5 MHz clock (200 ns cycle time) can be used. By using Schottky and Advanced Schottky logic, a look-ahead carry generator, and 45 nsec memory devices, a clock speed of 10 MHz should be possible. The limiting factor in the PISC-1 cycle is memory access. Non-memory "execute" cycles require only 100 nsec. By using the CPU's WAIT input to stretch memory reference cycles, a speed increase of up to 25% could be realized. It is instructive to compare the performance of the 5 MHz PISC-1 with the venerable 5 MHz 8086. Using the F83 model -- admittedly, not the fastest 8086 Forth -- various primitives require the following number of clock cycles: PISC-1 8086 PISC-1 8086 NEXT 6 25 ENTER 14 51 EXIT 10 40 0= 14 69/74 DUP 10 55 + 16 58 @ 12 51 0BRANCH 18 48/60 R> 12 50 SWAP 18 65 ! 14 55 For simple operations, the PISC compares favorably with the 8086. This is noteworthy in view of the fact that the PISC-1 uses 1976-vintage TTL, and could easily have been built at the time of the 8086. The major limitation of the PISC is its requirement for a larger and faster memory. (A 5 MHz PISC requires much faster memory devices than a 5 MHz 8086.) G. FURTHER WORK By adding supplementary decoding logic to the micro- instruction, a few bits could be "liberated." With these, a "conditional add" could be implemented (by modifying the ALU function). A "right shift" function would also be useful. A simple interrupt scheme would have the interrupt latch select R6 instead of R7 as the Program Counter (by modifying the hardwired fetch instruction). An otherwise-unused ALU function could be decoded to clear this latch. This requires dedicating R6 to interrupt service. Bringing the high A-reg select bits out to the memory would allow different memory spaces to be selected, depending on which address register is used. Figure 3 shows two examples, and the requisite register assignments. The ROM/RAM (Instruction/Data) model uses two spaces: one for microcode and threads, and another for stacks and data. The "segmented" model uses four memory spaces, allowing (for example) fast PROM for microcode, and slow PROM for threads. The PISC can be easily widened to 32 bits, although at this width a look-ahead carry is recommended. A 32-bit micro- instruction would allow many more functions: explicit addressing of the register file C port; more registers; a wider selection of status inputs to the ALU; short literal operands; and (as mentioned above) conditional add/branch, and a right shift. H. CONCLUSIONS / APPLICATIONS As a conventional microcoded machine, the PISC is inefficient at implementing a processor instruction set. Its lack of conditional branching and subroutines at the microcode level are a major weakness. But when used to implement Forth primitives, the PISC does quite well. So, rather than writing a Forth model on a microcoded CPU, the PISC encourages eliminating the middle layer, and developing the Forth environment directly in microcode. Would other programming languages be as amenable to this approach? Also, consider the execution cycle of a microcoded CPU: it fetches a machine instruction, and uses this instruction to index into the microprogram store. But this is exactly what the PISC does with a high-level Forth thread! We can view the compiled addresses of a Forth thread as the microcoded "instruction set" of a stack processor. But, unlike most CPUs, the programmer can extend this instruction set with new instructions. (This "extensible instruction set" paradigm is not new; the Novix and RTX2000 have been described in similar terms.) The PISC may have application where a minimal CPU is required in custom logic. Since it requires only about 2000 gates, and only a 16-bit-wide instruction store, it might fit "in the corner" of a more complex IC. Probably the greatest value of the PISC is educational. It illustrates the basic principles of microcoded CPUs. Its shortcomings are evident, and highlight many areas in which CPU performance can be improved: wider microinstruction, faster memory, pipelining, or parallel processing. It also introduces the concepts of interrupts and multiple memory spaces. The PISC is simple, obvious, and well within the grasp of an introductory computer architectures student. FIGURE 1. THE PATHETIC INSTRUCTION SET COMPUTER -------------------- ---------------------------- | A BUS | | D BUS | | __v_ _v__ | | \ A \ / B / _____ | | IR[4:0]--->S \ \_/ / | |<--"0" | | _______ __\4x74181/__ | 4:1 |<--"1" | | | 2-bit | <---CY \ ALU / Cn<---| mux |<--CY latch | | | latch | <---A=B \_F_/ | |<--A=B latch | | |_______| | |_____| | | ^ | ^ | | | | | | | IR[7] | IR[6:5] | | | | | | --------------------------| | | | | | _____v_____v______ | | | Ain Cin | | | IR[10:8]--->|A 8x16 bit | | | | register file C|<---"0" | | IR[13:11]--->|B (8x74172) | | | |____Bout__Cout____| | | | | | |<----------------------- ------------------------->| | | /16 ________ /16 | | IR | | | | |<---"fetch" | | | 2:1 | uinst. | | IR[15:0]<---| mux & | | | | 16-bit |<-----------------| | __ __ | latch | | | CLK | |__| |__| |________| ___v____ | _____ ^ | | | F/E | |_____| | | buffer | | fetch exec F/E (fetch/exec) |________| | ^ | IR[15] IR[14] | | | | | v v v v address to memory MRD MWR data to/from memory FIGURE 2. THE MICROINSTRUCTION WORD 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 --------------------------------------------------------------- | 1 | 0 | 1 1 1 | 1 1 1 | 1 | 1 1 | 0 0 0 0 0 | --------------------------------------------------------------- MRD MWR \_________/ \_________/ STS* \____/ M S3 S2 S1 S0 : : A-REG F-REG : CY-IN \_________________/ : : (ALU input) (ALU output) : : ALU FUNCTION active high : : 00000 = A+Cn 0 = latch ALU : 00001 = A+B+Cn status outputs : 00011 = -1+Cn Instruction shown : 11110 = A OR B is the hardwired : etc. "fetch" pseudo- 00=latched CY instruction. 01=latched A=B 10= "0" 11= "1" FIGURE 3. REGISTER USAGE / MEMORY SPACES RAM/ROM (split I/D) model CODE/THREAD (segmented) model R7 = PC* \ R7 = PC* \ CODE R6 = IP \ PROM R6 = / space R5 = / space R4 = / R5 = IP \ THREAD R4 = / space R3 = RP \ R2 = SP \ RAM R3 = RP \ STACK R1 = / space R2 = SP / space R0 = MDR* / R1 = \ DATA R0 = MDR* / space * register assignments for PC and MDR are hard-wired. \ \ eforth primitives for PISC-1 3 Nov 91 \ \ TOS in memory TOS in register \ ------------- --------------- \ : NEXT ( macro) same MRD IP ADR IP DST A+1, MRD MD ADR W DST A+1, MD ADR PC DST A, ; CODE doLIT \ -- w CODE doLIT MRD IP ADR IP DST A+1, SP ADR SP DST A-1, SP ADR SP DST A-1, TS ADR MD DST A, MWR SP ADR SP DST A, MWR SP ADR SP DST A, NEXT MRD IP ADR IP DST A+1, END-CODE MD ADR TS DST A, NEXT END-CODE CODE doCOLON \ a.k.a. doLIST? IP ADR MD DST A, same RP ADR RP DST A-1, MWR RP ADR RP DST A, W ADR IP DST A, NEXT END-CODE CODE EXECUTE \ a -- CODE EXECUTE MRD SP ADR SP DST A+1, TS ADR W DST A, MRD MD ADR W DST A+1, MRD SP ADR SP DST A+1, MD ADR PC DST A, MD ADR TS DST A, END-CODE MRD W ADR W DST A+1, MD ADR PC DST A, END-CODE CODE EXIT MRD RP ADR RP DST A+1, same MD ADR IP DST A, NEXT END-CODE CODE next \ FOR..NEXT, rel. MRD RP ADR RP DST A, same STS MD ADR MD DST A-1, MWR RP ADR RP DST A, XX ADR W DST -1+EQ, MRD IP ADR IP DST A+1, W ADR MD DST A&B, IP ADR IP DST A+B, NEXT END-CODE CODE ?branch CODE ?branch MRD SP ADR SP DST A+1, STS TS ADR TS DST A, STS MD ADR MD DST A, MRD SP ADR SP DST A+1, XX ADR W DST -1+EQ, MD ADR TS DST A, W ADR W DST ~A, XX ADR W DST -1+EQ, MRD IP ADR IP DST A+1, W ADR W DST ~A, W ADR MD DST A&B, MRD IP ADR IP DST A+1, IP ADR IP DST A+B, W ADR MD DST A&B, NEXT IP ADR IP DST A+B, END-CODE NEXT END-CODE CODE branch \ relative MRD IP ADR IP DST A+1, same IP ADR IP DST A+B, NEXT END-CODE CODE ! CODE ! MRD SP ADR SP DST A+1, MRD SP ADR SP DST A+1, MD ADR W DST A, MWR TS ADR TS DST A, MRD SP ADR SP DST A+1, MRD SP ADR SP DST A+1, MWR W ADR W DST A, MD ADR TS DST A, NEXT NEXT END-CODE END-CODE CODE @ CODE @ MRD SP ADR SP DST A, MRD TS ADR TS DST A, MRD MD ADR W DST A, MD ADR TS DST A, MWR SP ADR SP DST A, NEXT NEXT END-CODE END-CODE CODE RP@ CODE RP@ RP ADR MD DST A, TS ADR MD DST A, SP ADR SP DST A-1, SP ADR SP DST A-1, MWR SP ADR SP DST A, MWR SP ADR SP DST A, NEXT RP ADR TS DST A, END-CODE NEXT END-CODE CODE RP! CODE RP! MRD SP ADR SP DST A+1, MRD SP ADR SP DST A+1, MD ADR RP DST A, TS ADR RP DST A, NEXT MD ADR TS DST A, END-CODE NEXT END-CODE CODE R> CODE R> MRD RP ADR RP DST A+1, TS ADR MD DST A+1, SP ADR SP DST A-1, SP ADR SP DST A-1, MWR SP ADR SP DST A, MWR SP ADR SP DST A, NEXT MRD RP ADR RP DST A+1, END-CODE MD ADR TS DST A, NEXT END-CODE CODE R@ CODE R@ MRD RP ADR RP DST A, TS ADR MD DST A+1, SP ADR SP DST A-1, SP ADR SP DST A-1, MWR SP ADR SP DST A, MWR SP ADR SP DST A, NEXT MRD RP ADR RP DST A, END-CODE MD ADR TS DST A, NEXT END-CODE CODE >R CODE >R MRD SP ADR SP DST A+1, TS ADR MD DST A, RP ADR RP DST A-1, RP ADR RP DST A-1, MWR RP ADR RP DST A, MWR RP ADR RP DST A, NEXT MRD SP ADR SP DST A+1, END-CODE MD ADR TS DST A, NEXT END-CODE CODE SP@ CODE SP@ SP ADR MD DST A, TS ADR MD DST A, SP ADR SP DST A-1, SP ADR SP DST A-1, MWR SP ADR SP DST A, MWR SP ADR SP DST A, NEXT SP ADR TS DST A, END-CODE NEXT END-CODE CODE SP! CODE SP! MRD SP ADR SP DST A+1, TS ADR SP DST A, MD ADR SP DST A, MRD SP ADR SP DST A+1, NEXT MD ADR TS DST A, END-CODE NEXT END-CODE CODE DROP CODE DROP SP ADR SP DST A+1, MRD SP ADR SP DST A+1, NEXT MD ADR TS DST A, END-CODE NEXT END-CODE CODE DUP CODE DUP MRD SP ADR SP DST A-1, TS ADR MD DST A, MWR SP ADR SP DST A, SP ADR SP DST A-1, NEXT MWR SP ADR SP DST A, END-CODE NEXT END-CODE CODE SWAP CODE SWAP MRD SP ADR SP DST A+1, MRD SP ADR SP DST A, MD ADR W DST A, MD ADR W DST A, MRD SP ADR SP DST A-1, TS ADR MD DST A, MWR SP ADR SP DST A+1, MWR SP ADR SP DST A, W ADR MD DST A, W ADR TS DST A, MWR SP ADR SP DST A-1, NEXT NEXT END-CODE END-CODE CODE OVER CODE OVER SP ADR SP DST A+1, MRD SP ADR SP DST A-1, MRD SP ADR SP DST A-1, MD ADR W DST A, SP ADR SP DST A-1, TS ADR MD DST A, MWR SP ADR SP DST A, MWR SP ADR SP DST A, NEXT W ADR TS DST A, END-CODE NEXT END-CODE CODE 0< CODE 0< MRD SP ADR SP DST A, STS TS ADR TS DST A+A, STS MD ADR MD DST A+A, XX ADR TS DST -1+CY, XX ADR MD DST -1+CY, TS ADR TS DST A+1, MD ADR MD DST A+1, NEXT MWR SP ADR SP DST A, END-CODE NEXT END-CODE CODE AND \ likewise OR, XOR CODE AND MRD SP ADR SP DST A+1, MRD SP ADR SP DST A+1, SP ADR W DST A, TS ADR TS DST A&B, MRD SP ADR SP DST A, NEXT W ADR MD DST A&B, END-CODE MWR SP ADR SP DST A, NEXT END-CODE CODE UM+ \ u u -- u cy CODE UM+ MRD SP ADR SP DST A+1, MRD SP ADR SP DST A, SP ADR W DST A, STS TS ADR MD DST A+B, MRD SP ADR SP DST A, MWR SP ADR SP DST A, STS W ADR MD DST A+B, XX ADR TS DST -1+CY, MWR SP ADR SP DST A-1, TS ADR TS DST A+1, XX ADR MD DST -1+CY, NEXT MD ADR MD DST A+1, END-CODE MWR SP ADR SP DST A, NEXT END-CODE