The Freedom CPU ISA and Register Organization

Andrew D. Balsa (maintainer/coordinator)

Rev. 0.0.2, 24 September 1998

Describes the F-CPU ISA (Instruction Set Architecture) and Register Organization, and discusses current choices.

______________________________________________________________________

Table of Contents

1. Acknowledgments
2. Introduction
3. Mission statement
4. Terminology
   4.1 Data Size
   4.2 Exceptions
   4.3 Context switch vs. Process switch
   4.4 Jumps and Branches
5. Notation
6. Early architectural choices
   6.1 Hard choices
   6.2 Firm choices
   6.3 Soft choices
7. DLX and the F-CPU
8. Instruction Classes
   8.1 Move instructions
   8.2 ALU instructions
   8.3 Branches and Jumps
   8.4 Special
9. Memory Windows and Virtual Registers
   9.1 Registers - what do we need them for?
   9.2 Virtual registers
10. Addressing modes
   10.1 Data addressing modes
      10.1.1 Immediate
      10.1.2 Register
      10.1.3 Displacement
      10.1.4 Register indirect or register deferred
      10.1.5 Direct or absolute
   10.2 Data addressing effectiveness
   10.3 Program flow addressing modes
      10.3.1 PC-relative
      10.3.2 Register indirect with displacement
      10.3.3 Register indirect
11. Virtual memory: pure paging
12. CPU control registers (CCRs)
13. FP coprocessor TTA architecture
14. Block Move coprocessor architecture

______________________________________________________________________

1. Acknowledgments

The architecture described in this document is the result of the efforts and ideas of many people. Consequently this section is an attempt to list those people (in alphabetical order) and, if possible, the particular area they contributed to.

· Jecel Assumpcão Jr.: software assisted MMU, inverted hash table paging.
· Andrew D. Balsa: coordination.
· Alan Pinkerton: cleaning up the concept of virtual registers.
· Rafael Reilova: Block Move Coprocessor.
· Teemu Suutari: TTA logic.
· Brion Vibber: instruction format discussions.
· ?: dropping 64-bit instructions in favour of 32-bit instructions.
[If your name is not here, please drop me a note and I'll add it - ADB]

2. Introduction

The F-CPU architecture will be unique in many ways by the time it is finalized, but certainly its distinctive characteristic is that it is being designed over the Internet by dozens of volunteers in a free, collaborative effort.

Engineers in general and CPU architects in particular won't easily recognize that an organizational model ultimately influences the resulting design of any large-scale engineering project: there is a widespread (mis)belief that the technical results are independent of human issues, guided only by rational choices based either on technical constraints or economic factors.

Another widespread myth is that good CPU architectures are the work of highly talented individuals working in total isolation, much like Seymour Cray designed the Cray-1 alone in the mountains. A slightly modified version of this myth states that keeping a small team of talented CS and EE engineers in isolation and under pressure will ultimately lead to revolutionary CPU architectures.

The F-CPU architecture is being developed within the framework of the Freedom Project, which is itself based on ideas borrowed from the Free Software movement. The cornerstone of the Free Software movement is the GNU/GPL (General Public License), a document that expresses in legal terms deep-rooted beliefs and ideals related to individual freedom in the Information Age.

We cannot summarize the foundations of the Free Software movement in a few phrases, but we can at least mention one noteworthy aspect of Free Software projects: the source code is available for all to view, examine, understand and improve. When applied to a hardware project, this would translate into making the design and its implementation (in the form of VHDL, Verilog or any other hardware description language), as well as the masks corresponding to any particular IC process, freely available.
For the Freedom Project, not only will the final results of the development process be made freely available over the Internet, but the entire development process itself is free, in the sense that anyone can join and contribute by exchanging ideas, discussing design choices, and implementing them.

The F-CPU architecture and its implementation(s), like a (collective) work of art, will be the expression in silicon of these beliefs and ideals, as they reflect on our organizational, development and communication methodologies.

The present document is also being written taking into account this pervasive freedom to think, learn and create that runs through the Freedom Project: it can be changed, updated, corrected and extended at any time. The information herein contained can and should always be improved. The Freedom Project defines the framework in which these improvements can take place.

One last remark: we are doing our best to separate the implementation-dependent parts of the CPU design from the architecture-dependent ones, when this logical separation makes sense. Sometimes the architecture and its implementation cannot be separated conceptually and/or in practice (well, perhaps they can, but we don't know how to do it - yet).

3. Mission statement

The mission of the Freedom Project is to develop a new high-performance 64-bit CPU architecture, suitable for Personal Workstation use, and to implement this architecture in the available process technology.

The new F-CPU architecture must support gcc and the Linux kernel (customizing these pieces of software is part of the Freedom Project). The implementation(s) must also conform to commodity PC technical requirements (e.g. an F-CPU mainboard has to fit in a standard PC case, with a standard PC keyboard connector, standard PCI bus connectors, etc.).

It is perhaps useful to mention development issues that are not part of the Freedom Project, e.g.:

· Develop a new microcontroller-type 8, 12, 14 or 16-bit microprocessor (e.g.
PIC devices). Reason: outside the scope of the project.
· Develop yet another 32-bit RISC processor for low-power/low cost embedded applications (e.g. ARM). Reason: there is a wide variety of such processors on the market now, with excellent characteristics.
· Develop new IC process technologies. Reasons: not feasible and outside the scope of the project. In fact we want to use the most cost-effective technology available for each implementation of the F-CPU architecture.
· Build a foundry. Reasons: obvious.
· Develop a new compiler or OS kernel. Reasons: gcc and the Linux kernel nowadays constitute a standard. Developing a new compiler and/or kernel would mean throwing away years of development, and would present insurmountable difficulties.
· Develop new PC standards. Reasons: not feasible and not needed.

This mission statement pretty much delimits the performance envelope of the F1 implementation, and to some extent also influences the final CPU architecture.

4. Terminology

4.1. Data Size

Even though the F-CPU word size is 64 bits, we prefer to stick to H&P's and standard RISC notation. Since this is a matter of convention, we believe keeping with tradition is more fitting in this case.

· 8 bits: byte.
· 16 bits: half-word.
· 32 bits: word.
· 64 bits: double.

4.2. Exceptions

The word exception is used to designate traps, interrupts, faults and exceptions. Exceptions can be divided into various classes, as described by H&P.

4.3. Context switch vs. Process switch

Many authors do not make a distinction between the expressions context switch and process switch (e.g. Tanenbaum or H&P). However, in this document we shall use the definitions found in [Stallings, William, Operating Systems, Macmillan 1992, p. 148]. Quoting:

     It is clear, then, that a context switch is a concept distinct from that of a process switch. A context switch may occur without changing the state of the process that is currently in the Running state.
     In that case, the context saving and subsequent restoral involve little overhead. However, if the currently running process is to be moved to another state (Ready, Blocked, etc.), then the operating system [the kernel] must make substantial changes in the environment. ... Thus, the process switch, which involves a state change, requires considerably more effort than a context switch.

4.4. Jumps and Branches

Again, we'll stick to H&P terminology: jump will be used when the PC is changed unconditionally and branch will be used when it is changed conditionally.

5. Notation

H&P's notation conventions will be used throughout to define the F-CPU instruction set.

6. Early architectural choices

6.1. Hard choices

Early in the development of the F-CPU architecture, some hard technical choices were made. It is likely that the F-CPU architecture will have the following features:

1. 64-bit internal data paths (i.e. this is a 64-bit architecture).
2. Standard Von Neumann CPU<->memory architecture.
3. 64-bit address space.
4. Virtual memory support.
5. Kernel and user operating modes.
6. Co-processor support (FP, Graphics, etc.) on a separate (internal) bus.
7. Designed for Y2K-current process technologies (0.25 micron or smaller).

6.2. Firm choices

Some other architectural choices are being dictated by the desire to innovate and improve upon current CPU architecture design practices. These choices are more questionable than the hard choices listed in the previous subsection, and yet once a consensus is reached over them, the F-CPU architecture will be in many ways determined. As we write these lines, these firm choices are:

1. A memory-to-memory (m2m) / RISC mixed architecture (this is explained below).
2. Both big-endian and little-endian modes supported (H&P pages 73 and C-12).
3. Memory windows with 32 user-visible virtual registers. The number of memory windows is implementation dependent, with a minimum of 2 windows.
   These memory windows can be thought of as virtual register banks. They are used to accelerate context switching (they are not used to accelerate procedure calls, as are register windows in the SPARC architecture).
4. Three operand instructions, as per standard RISC practice.
5. A 32-bit fixed length instruction format, also as per standard RISC practice.
6. Separate I/O, coprocessor and memory addressing spaces. The I/O and coprocessor addressing spaces (16 bits each) only allow Moves. The I/O addressing space is protected (only accessible in supervisor mode). The coprocessor addressing space is user-visible.
7. N-stage pipelined operation, where N is implementation dependent.
8. Hardware integer multiply and divide.
9. FP coprocessor implemented as coprocessor mapped Transport Triggered Architecture (TTA) logic (the presence of an FPU is implementation dependent).
10. Block Move coprocessor also implemented as coprocessor mapped TTA logic (the presence of a Block Move coprocessor is implementation dependent).
11. No condition codes, a la MIPS.
12. Immediate operands are limited to 16-bit size (due to the fixed 32-bit instruction length).
13. CPU Control Registers (CCRs) are mapped in the first half (32KB range) of the coprocessor address space.

6.3. Soft choices

Many ideas are still floating around on the various Freedom Project mailing lists.

7. DLX and the F-CPU

The DLX CPU architecture was designed by H&P as a logical outcome of their research into RISC. Although it was never implemented exactly as described in their book Quantitative Approach, it is so similar to the many commercial RISC implementations that at least a parallel can be drawn between the DLX projected performance and the actual performance of MIPS, SPARC, etc.

The similarity between the Freedom CPU architecture and the DLX hypothetical CPU is not an accident.
Basing ourselves on RISC concepts is a natural consequence of the desire to obtain the best performance from a (conceptually) simple design.

In designing the F-CPU architecture, we also feel compelled to make use of GCC DLX machine descriptions, DLX simulation and emulation tools, and the vast amount of information freely available on the Web about DLX. If possible, we'll avoid reinventing the wheel.

However, the F-CPU design is right now so different from DLX that it cannot be called a DLX descendant.

8. Instruction Classes

We'll use the following broad instruction classes defined by H&P for the DLX architecture:

1. Moves (H&P defines this as load/store instructions). Since we specified that our coprocessors would be designed using TTA logic, our FP and block move instructions are included in this instruction class.
2. ALU operations.
3. Branches and Jumps.
4. We'll add a fourth class for special instructions e.g. HALT, so: Special.

In fact H&P had defined a separate instruction class for FP, but we managed to fold that into the Moves instruction class (we are really pushing RISC philosophy, here: simplify, simplify). ;-)

8.1. Move instructions

Move instructions support all data sizes up to and including 64 bits, as expected. Signed and unsigned byte, half-word and word are accounted for, as well as double. (128-bit moves?)

The F-CPU recognizes three distinct address spaces: I/O, coprocessor and memory.

8.2. ALU instructions

Logical and arithmetic operations operate on three registers, or on two registers and a 16-bit immediate operand.

· Logical: AND, OR, XOR, NAND, NOR, NXOR.
· Arithmetic: ADD, SUB, MULT, DIV; signed and unsigned. MULT and DIV store their result in two consecutive virtual registers (unlike MIPS, but very much like SPARC).
· Shifts and Rotates: rotates as well as logical and arithmetic shifts, left and right, 1 to 64 bits.
· Set conditional.

8.3.
Branches and Jumps

For lack of imagination, we adopted almost exactly the same control flow instructions as DLX:

· Branch on zero/not zero.
· Unconditional Jump.
· Jump and Link.
· SYS. System call exception (H&P use TRAP). Switches to MW0, entering supervisor mode. Unlike DLX, this is not a vectored trap: the OS kernel is assumed to have a single entry point.
· RFE. Return from exception. Switches back to the previously active MW, restores user mode.

8.4. Special

· HALT. Stops and powers down the CPU.
· STI. Enable maskable interrupts.
· CLI. Disable maskable interrupts.
· LCX. Locked compare and exchange. Compares and exchanges the contents of a register with the contents of a memory address, pointed to by another register plus a 16-bit displacement.

9. Memory Windows and Virtual Registers

Easily the most controversial and misunderstood feature of the proposed F-CPU architecture is, to date, the choice of a memory-to-memory architecture. Nevertheless, it is simpler to think of the F-CPU in terms of a standard RISC machine - without registers!

"What? No registers? Is this a joke?" or "Well, if it's not a RISC CPU, it will perform like a dog." are the typical remarks. Some people with more experience notice that the TMS9900 microprocessor (one of the first 16-bit microprocessors) was a memory-to-memory machine, but they too observe that this was in the good old days when CPUs were slow and DRAM was fast. Today, we have exactly the opposite: CPUs are very fast and DRAM is slow.

We know all that. Now please turn to page C-0 of H&P's Quantitative Approach. It says: "RISC: any computer announced after 1985.", quoting Steven Przybylski, a designer of the Stanford MIPS.

Why do H&P begin their description of various RISC CPUs with this humorous remark? Because it's true: you can't design a microprocessor architecture nowadays without taking into account the work done by the early RISC pioneers.
The term m2m/RISC was coined to describe the F-CPU architecture, so let's see how it differs from a standard RISC.

9.1. Registers - what do we need them for?

The rationale behind the concept of registers is, as we know, that of a memory hierarchy. This concept is discussed extensively in H&P, section 1.7. Each level of storage has a smaller size, faster access time, greater bandwidth and smaller latency as we go from disk storage to main memory, to L3 cache (if any), to L2 cache (if any), to L1 cache (all modern CPUs have this), to registers (all modern CPUs have these).

The sole memory-to-memory microprocessor architecture known to the authors was the Texas Instruments TMS9900. The idea sounds counter-intuitive given H&P's description of memory hierarchies in modern RISC CPUs. However, the choice of a m2m/RISC architecture stems from a few observations of modern CPU designs:

1. On-chip L1 CPU caches can be dual-ported and still operate at full CPU speed. These can be as large as 64KB using 0.35 micron technology (e.g. the Cyrix 6x86MX, the Centaur C6 or the AMD K6), and probably larger using 0.25 micron technology.
2. The number of registers available is an important architectural parameter, obeying sometimes contradictory requirements/limitations:
   · On the one hand, compiler writers would prefer to have as many registers as possible. In fact, they would even prefer to view registers and memory as the same.
   · Instruction set designers will limit the number of accessible registers to avoid using too many bits for register selection.
   · Operating systems designers will want to save as few registers as possible, to avoid expensive context switches.
   · Hardware designers can implement as many registers as required, but the complexity of the design usually increases in direct proportion to the number of registers.
   · The SPARC architecture inaugurated the efficient use of register windows. However, SPARC, like all RISC CPUs, is a register-to-register machine.
     It has been proved that register windows can increase the efficiency of CISC designs as well as RISC ones.

The design objectives of the Freedom CPU architecture WRT its data path organization were:

· Minimize latencies for memory accesses.
· Maximize bandwidth for memory accesses.
· Reduce "register pressure" in compilers, particularly gcc.
· Minimize context switch latency.
· Present a simple, regular memory/registers addressing instruction encoding.
· Exploit modern cache design.

9.2. Virtual registers

The proposed Freedom architecture has banks of virtual registers, which are used to shadow (earlier on we used the terms to cache or to mirror, but these led to confusion and are now deprecated) 256-byte regions of memory. How does that work and why is it good?

The idea is really very simple: each memory window (virtual register bank) has an associated 64-bit special register which defines the (virtual, as opposed to physical) base address in memory of a block of 32 (64-bit) words. We'll call this register MWBn, where 0 <= n <= (number of windows of the implementation - 1). The MWB registers are control registers, which means their value can only be changed in supervisor mode.

A single MW (memory window) is active at any time, meaning that a user program only ever sees 32 memory addresses that can be directly operated upon (other memory addresses must be accessed indirectly). We'll call the virtual registers in the active memory window VR0 - VR31. VR0 is always set to 0, following standard RISC tradition.

The active memory window is defined by the least significant bits in register AMW, which is one of the control registers. MW0 is reserved for the operating system kernel. Switching between memory windows (we call this a context switch) is achieved by changing the value of AMW. Since AMW is one of the control registers, it is only accessible to the OS kernel.
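
The window selection just described can be illustrated with a small software model. This is only a sketch under stated assumptions, not part of the architecture definition: the class and method names are hypothetical, VR0 is modelled by discarding writes to it, and "the least significant bits of AMW" is approximated by a modulo on the number of windows.

```python
WORD_BYTES = 8          # each virtual register is 64 bits wide
VRS_PER_WINDOW = 32     # 32 x 8 bytes = one 256-byte memory window
MASK64 = (1 << 64) - 1


class MemoryWindows:
    """Toy model of the virtual register banks (all names are hypothetical)."""

    def __init__(self, n_windows=2):
        if n_windows < 2:
            raise ValueError("at least 2 memory windows are required")
        self.mwb = [0] * n_windows   # MWBn: virtual base address of window n
        self.vr = [[0] * VRS_PER_WINDOW for _ in range(n_windows)]
        self.amw = 0                 # active window; MW0 belongs to the kernel
        self.supervisor = True

    def shadowed_address(self, k):
        """Virtual address of the memory word shadowed by VRk (active window)."""
        return self.mwb[self.amw] + WORD_BYTES * k

    def read_vr(self, k):
        # Assumption: VR0 always reads as zero.
        return 0 if k == 0 else self.vr[self.amw][k]

    def write_vr(self, k, value):
        # Assumption: writes to VR0 are silently discarded.
        if k != 0:
            self.vr[self.amw][k] = value & MASK64

    def set_amw(self, n):
        """Context switch: select another window (supervisor mode only)."""
        if not self.supervisor:
            raise PermissionError("AMW is a control register")
        self.amw = n % len(self.vr)


cpu = MemoryWindows(n_windows=4)
cpu.mwb[1] = 0x10000      # window 1 shadows the block starting at 0x10000
cpu.set_amw(1)            # context switch to window 1
cpu.write_vr(5, 42)       # VR5 now shadows address 0x10000 + 5 * 8
cpu.set_amw(0)            # back to the kernel window; window 1 is untouched
```

In this model a context switch touches a single field (amw), which is the point the text makes: no register contents have to be copied.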
Whenever MWBn is loaded with a new value (a new base address), a 256-byte block is brought in from memory into the corresponding 32 internal VRs, and the previous values in the VRs are simultaneously written back to the 256-byte block (memory window) previously pointed to. This we define as a process switch. Note that the OS kernel must also save paging information during a process switch. We'll see how this can be achieved.

After a process switch, the contents of the VRs are _not_ maintained coherent with their corresponding memory addresses anymore.

So, as can be seen, each register set is in fact a "memory window", in the sense that its contents reflect the state of a memory block.

How does the hardware work? Just like any RISC CPU. Are there additional delays to use any of the internal virtual registers? No. Is this slower in any way compared to a RISC? No. Does this add complexity to the data path, compared to a RISC? No.

What is gained, compared to a standard RISC? Mainly lower context switch latencies: assuming the OS wants to switch from one MW to another, all it takes is to load a new value in the corresponding bits in the AMW register. This can be done in a single clock cycle, and obviously has a very low latency, compared to saving/restoring 32 registers.

For the compiler, the F-CPU is very much like many RISC CPUs: 32 general purpose registers (in reality special memory locations).

From the software point of view, the F-CPU is a pure memory-to-memory machine.

From the hardware point of view, the F-CPU is a standard RISC machine with (32 x n) registers, where n is the number of register sets (32 in the F1 CPU implementation), and each register set is made up of 32 registers.

The F-CPU doesn't have a flags register. A mechanism similar to that of DLX/MIPS is used to test for conditions.

The only conventional register is the 64-bit Program Counter register. Interestingly, during a context switch, the value of the PC is saved in VR0.
Context switches can be achieved without any use of a stack to save the CPU state.

10. Addressing modes

We shall refer to the addressing modes as though they used standard registers. On the other hand, one must keep in mind the fact that each virtual register shadows a memory location. In this sense, there is no register addressing!

Like H&P, we establish a distinction between data addressing modes (which operate on data in the virtual registers and/or memory) and program flow (instruction) addressing modes (which always take as their destination operand the Program Counter).

10.1. Data addressing modes

10.1.1. Immediate

The 32-bit instruction format allows for 16-bit immediate operands (according to H&P, 16-bit immediate data covers 75-80% of all uses of immediate data). This is basically similar to the DLX I-type instruction format, but some elements are re-arranged: the opcode occupies the 6 most significant bits; the destination virtual register occupies the following 5 bits; the source VR follows, with another 5 bits; the immediate operand comes last, with 16 bits.

10.1.2. Register

These are really three-memory-operand instructions. The encoding is again similar to the DLX R-type instructions, but as with the I-type instructions, the destination register field comes next to the opcode field.

10.1.3. Displacement

Uses exactly the same format as the immediate addressing mode, but the 16-bit field is used to hold the displacement. According to H&P, 16-bit displacements capture 99% of all displacements.

10.1.4. Register indirect or register deferred

Can be synthesized from the displacement addressing mode by setting the displacement field to zero.

10.1.5. Direct or absolute

Can be synthesized from the displacement addressing mode by using VR0 as the base register. This addressing mode is only useful for the I/O and coprocessor address spaces (for OS kernel use).

10.2.
Data addressing effectiveness

Obviously, other more sophisticated addressing modes can be synthesized from the basic register and displacement modes, with a slight performance penalty. According to H&P, the basic addressing modes provided above cover from 75% to 99% of the addressing modes in standard benchmark programs.

10.3. Program flow addressing modes

10.3.1. PC-relative

PC-relative Jumps and Branches are available with a 26-bit offset, plus 2 bits since our instructions are aligned on 32-bit words. This 28-bit range provides ample program space (256MB) and ease of linking.

10.3.2. Register indirect with displacement

The destination address is taken from one of the registers, and a 16-bit (+2) displacement is added, resulting in the new PC value.

10.3.3. Register indirect

This is synthesized from the previous mode by setting the immediate field to zero.

11. Virtual memory: pure paging

The F-CPU architecture does not implement segmentation: pure paging is used, with 8KB pages. The MMU is particularly simple: a 64-entry TLB is used, as in MIPS.

In parallel with the fixed-size 8KB paging scheme, the F-CPU architecture makes use of an implementation dependent number (minimum of two) of variable-sized pages, ranging (in powers of 2) from 16KB up to the entire address space. These variable-sized pages can be used, for example, to map an entire video frame buffer in a single step, or to assign a fixed mapping to the OS kernel.

As in MIPS, a TLB miss will trigger an exception, which is handled by context switching to the kernel. With a 64-bit address space, it was judged that an inverted hash table would be more effectively implemented in software than in hardware, and would yield adequate performance.

12. CPU control registers (CCRs)

Modern CPUs have a plethora of special registers, used to control various features such as performance monitoring, virtual memory, cache control, timestamp counters/timers, etc.
The F-CPU architecture is no exception, but it innovates in that there is a single, standard mechanism for accessing such registers that is at once fast, secure and simple to implement.

In the F-CPU architecture, all CPU control registers are mapped in the first half (32KB) of the special coprocessor address space. Moving a value between any CCR and any virtual register takes a single instruction which executes in a single clock cycle. The CCRs are only accessible in supervisor mode. Trying to access a CCR in user mode results in an illegal instruction exception.

13. FP coprocessor TTA architecture

It was felt that Transport Triggered Architecture would provide a good conceptual base for the development of a highly parallelizable FPU architecture. Moving FP data from any virtual register to any TTA functional unit in the FP coprocessor address space takes a single instruction and a single clock cycle.

14. Block Move coprocessor architecture

Block moves are frequent operations in modern personal workstations. Most CISC architectures implement block move instructions; however, early RISC architectures didn't. The F-CPU architecture specifies the Block Move Coprocessor (BMC) not as part of the instruction set, but as a general-purpose coprocessor that will handle lengthy block move operations, freeing the main processing unit for other tasks. As with the FP coprocessor, initiating a block move command (by transferring the command word from any of the virtual registers to the BMC) takes a single instruction and a single clock cycle.
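
To make the protection rule of section 12 concrete, the following sketch classifies an address in the 16-bit coprocessor space and applies the supervisor check. It is an illustration only: the function and exception names are hypothetical, and placing the FP and Block Move functional units in the upper half of the space is an assumption (the text above only fixes the CCRs in the first half).

```python
COPROC_SPACE = 1 << 16          # 16-bit coprocessor address space
CCR_LIMIT = COPROC_SPACE // 2   # CCRs occupy the first half (32KB)


class IllegalInstruction(Exception):
    """Stands in for the illegal instruction exception (hypothetical name)."""


def check_coproc_access(addr, supervisor):
    """Return the region a coprocessor-space Move targets, or raise."""
    if not 0 <= addr < COPROC_SPACE:
        raise ValueError("address outside the 16-bit coprocessor space")
    if addr < CCR_LIMIT:
        # CCRs are only accessible in supervisor mode.
        if not supervisor:
            raise IllegalInstruction(hex(addr))
        return "ccr"
    # Assumption: the upper half holds user-visible coprocessor units
    # such as the FP and Block Move TTA functional units.
    return "coprocessor"
```

A kernel-mode Move to address 0x0010 would be classified as a CCR access, the same Move in user mode would raise the exception, and a user-mode Move to 0x8000 would reach a coprocessor unit.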